
text-base penalty: A proposed solution

From: Kean Johnston <jkj_at_sco.com>
Date: 2002-12-16 13:27:30 CET

All,

About 6 weeks ago I started (restarted) the text-base penalty thread.
A lot was said, and I can precis it all if needed, but now that I
have taken a much closer look at svn than I had at the time, and now
that I actually have time to play with it seriously, I'd like to
re-open that particular can of worms. Please bear with me.

My personal goal (and it's important, because it reflects typical large
project usage) is to convert the SCO OpenServer source code base from
its current tools-on-top-of-SCCS system to one managed by svn. This is
an ambitious project, complicated by the sheer volume of existing
history and the size of the code base. Other issues, such as importing
from SCCS, can be addressed later. Before I even begin such a project
with just the current top-of-tree code as a proof of concept, there are
things like the text-base penalty I need to address.

First of all, I would like to state an objection to the notion of the
cached text base. The documentation sings its praises, and posts from
previous threads seem to indicate that people place a great deal of
importance on the ability to edit files on aeroplanes or in environments
where they are not network-attached to their repository, but I submit
that such usage caters to such a tiny percentage of the audience likely
to use svn that making all other users incur the penalty seems
misguided. It is approximately equivalent to the government producing
every document it ever prints in every language on earth, on the off
chance that one or two of its employees are more comfortable reading in
their native language than in English. A noble sentiment, to be sure,
but rather impractical. In the case of the text base, there are other
solutions that are equally functional, yet much smaller and (in my
opinion) simpler.

Second, before anyone is tempted to throw the "disk space is cheap"
argument around, please don't. A software solution should not rely on
the *CURRENT* costs of hardware in *SOME COUNTRIES* as a means of
justifying a solution that can be implemented more cheaply yet just as
functionally. Using the price of hardware is a mental trap. It is
equivalent to a 3D graphics package author saying "Oh, there is no need
to optimize my raytracing algorithm, CPUs are getting faster and cheaper
and I can just wait for good hardware to hide my bad design". Such
thinking thwarts innovation, which is usually a prime motivator for any
software package in the first place. So, in order to have a clear and
rational debate about the text base, I think it would be prudent to list
its benefits, especially as they relate to the overall system
functionality.

o Allows for easy reversion of files without repository access
o Allows for easy detection of changes to the WC w/o repository access
o Allows for minimal client-to-server communication on commit
o Allows for local diffs (handy for developers) w/o repository access

I am sure there are other hidden benefits, but from a systemic
perspective, or most certainly from an end user's perspective, those are
the benefits. These are what I see as the negatives.

o Thwarts tools like cscope that scan for all source files
o Makes `find . -name \*.[cCh]|xargs grep SOMETHING` much less useful
o Uses exactly double the space, which is wasted in read-only
  environments
o Doubles the workload of any directory-traversal tools
o Doubles the inode count (this IS an issue!)

Just as there are hidden benefits, I am equally sure there are hidden
negatives. Time and more experience may lengthen this list. To some
of these problems, there are solutions, I know. Yes, I can use

  find . -name \*.[cCh]|sed -e '/\.svn/d'|xargs grep SOMETHING

Yes, I can trim out .svn files from cscope file lists. The real point
though, is that in both of these cases, the version control system is
getting in the way of practices that are as old as time, and for little
benefit (assuming an equally functional solution can be found). The
point about double the space is a rather important one. I am sure we are
not the only company in the world that has several "nightly build boxes"
that build different levels of the tree, such as the head,
currently-in-system-test, and Keans-own-special-hacking version. Such
build machines are almost always read-only clients of the version
control system. They get the source as it stands at the start of the
build, do the build, and they're done. What possible reason is there to
have cached files for such uses? Another example. In something the size
of OpenServer, most developers have their own areas of expertise, and
tend to make changes to specific portions of the tree. I am sure most
companies work the same way. However, each developer needs a full source
tree in order to do prototype builds, or (in my case) needs to have done
at least one full build before partial builds can take place. But from
my perspective, almost all of the tree is read only. I only care about
the console driver, or the licensing subsystem, or the Apache port or
whatever. But the vast majority of the tree I am never going to edit.
Why should I have duplicates of ALL of that code? The answer is I
shouldn't.

Last, the point about duplicate inodes. Not all systems out there are
modern, not all of them have things some developers take for granted
(like loadable kernel modules, large file support, practically infinite
numbers of inodes). Some of us have to contend with smaller systems,
and there is no reason that the filesystem stress should be double what
it currently is, just because a text base is "easy".
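
To put a number on the file-count overhead for a particular working
copy, something like the following rough Python sketch could be used.
It is not part of svn; the function name count_files is made up for
illustration, and it simply treats every non-directory entry under a
.svn directory as administrative overhead.

```python
import os


def count_files(root):
    """Return (working_files, admin_files) for the tree under root.

    working_files: non-directory entries outside any .svn directory.
    admin_files:   non-directory entries anywhere inside a .svn directory.
    """
    working = 0
    admin = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # Split the path relative to root so a ".svn" component is easy to spot.
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if ".svn" in parts:
            admin += len(filenames)
        else:
            working += len(filenames)
    return working, admin
```

Comparing the two numbers for a freshly checked-out tree gives a direct
measure of how many extra inodes the administrative area costs.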

Bearing all the above in mind, I would like to propose a solution. First
up is a description of the actual problem domain. This is a client-side
issue, and should have no bearing on the server. I see all changes being
completely handled by the client, with nothing (beyond perhaps a default
preference) set in the server (with one caveat, see below). So, my
proposed solution is for the following problem:

  "Design a system that provides all the current functionality of
   the duplicated-contents text-base approach in a way that minimizes
   the actual duplication of data, or, ideally, eliminates it."

First and foremost, there should be a new config file option, nominally
called "text_base_method". This can have (currently) four possible
values: duplicate, compress, copy_on_edit or checksum. The semantic
meanings of these values are:

  duplicate - duplicate the file verbatim, i.e. as things currently stand.
  compress - duplicate the file, but compress it, using any compressor
  copy_on_edit - the meat of my proposed change. See below.
  checksum - the bones of my proposed change. See below.

For the "compress" method, it would be nice to allow the user to choose
the compressor they want to use, as opposed to hard-coding a solution
into the client via something like zlib. To this end, perhaps there
should be two other config options: "compress_pipe" and
"decompress_pipe", whose values are the commands that can be opened as a
pipe to compress or decompress a file, respectively. Designing this is
not the issue at hand, however.
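
Purely to illustrate the pipe idea, here is a small Python sketch. The
helper names are hypothetical, and so that the example runs anywhere
the two "pipe" commands are stand-ins built from the Python interpreter
and zlib; a real user would presumably configure an external compressor
such as "gzip -c" and "gzip -dc" instead.

```python
import shlex
import subprocess
import sys


def run_pipe(command, data):
    """Feed data (bytes) to command's stdin and return its stdout (bytes)."""
    result = subprocess.run(
        shlex.split(command),
        input=data,
        stdout=subprocess.PIPE,
        check=True,  # a failing compressor should not go unnoticed
    )
    return result.stdout


def python_zlib_filter(func_name):
    """Build a stand-in command line that applies zlib.<func_name> to stdin."""
    code = ("import sys, zlib; "
            f"sys.stdout.buffer.write(zlib.{func_name}(sys.stdin.buffer.read()))")
    return f"{shlex.quote(sys.executable)} -c {shlex.quote(code)}"


# Stand-ins for the proposed compress_pipe / decompress_pipe config values.
COMPRESS_PIPE = python_zlib_filter("compress")
DECOMPRESS_PIPE = python_zlib_filter("decompress")


def store_text_base(data):
    """Compress pristine file contents before writing them to the text base."""
    return run_pipe(COMPRESS_PIPE, data)


def load_text_base(blob):
    """Recover pristine file contents from a compressed text-base blob."""
    return run_pipe(DECOMPRESS_PIPE, blob)
```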

The third possible value, "copy_on_edit", should be almost
self-explanatory. It implies a new client-side command, "svn edit". With
this text-base method, when a client retrieves a WC, it simply stores a
checksum and date/size properties for the files in a flat-ASCII file in
.svn. If the user wants to edit the file, they first issue an "svn edit"
command with the name of the file. This command then copies over the
current file contents into .svn/text-base, and marks the entry in the
flat-ASCII file as being edited. This then allows the user to do local
diffs, revert files easily, and do small diffs on commit. It essentially
provides all of the current text-base functionality, simply delayed.
There are other advantages to an "svn edit" command. Without wanting to
distract you and open up a rat-hole discussion (please just CONSIDER
these ideas, let's not debate them in this thread), an "svn edit" would
enable us to:

  o Store at the root a list of all changed files for almost
     instantaneous determination of changed files in a tree (a BIG issue
     for large trees). Think of how useful an "svn editing" command
     would be, that could instantly tell you what files you have changed
     in a tree.
  o Notify the repository of the intent to edit, such that other users
     who do an svn edit of the same file can receive a gentle reminder
     that they MAY be in danger of a conflict
  o Possibly even provide a repository administrator the option of
     enforcing the notion of a lock-modify-unlock approach to
     versioning, while keeping all of Subversion's other features in
     play
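
To make the copy_on_edit flow described above a little more concrete,
here is a rough Python sketch for a single directory. Everything in it
is an assumption made for illustration: the flat file name
(checksums.json), its JSON layout, the use of SHA-1, and the function
names. It also refuses to "svn edit" a file that has already been
changed, since the local contents are then no longer pristine.

```python
import hashlib
import json
import os
import shutil

ADMIN_DIR = ".svn"
ENTRIES_FILE = "checksums.json"  # hypothetical name for the flat ASCII file


def _checksum(path):
    """SHA-1 of a file's contents (the algorithm is an arbitrary choice)."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _load(wc_dir):
    path = os.path.join(wc_dir, ADMIN_DIR, ENTRIES_FILE)
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


def _save(wc_dir, entries):
    os.makedirs(os.path.join(wc_dir, ADMIN_DIR), exist_ok=True)
    with open(os.path.join(wc_dir, ADMIN_DIR, ENTRIES_FILE), "w") as fh:
        json.dump(entries, fh, indent=1, sort_keys=True)


def record_checkout(wc_dir, name):
    """On checkout, store only metadata for the file, not a second copy."""
    path = os.path.join(wc_dir, name)
    st = os.stat(path)
    entries = _load(wc_dir)
    entries[name] = {"checksum": _checksum(path), "size": st.st_size,
                     "mtime": st.st_mtime, "editing": False}
    _save(wc_dir, entries)


def svn_edit(wc_dir, name):
    """Copy the pristine file into text-base and mark its entry as edited."""
    entries = _load(wc_dir)
    entry = entries[name]
    path = os.path.join(wc_dir, name)
    if _checksum(path) != entry["checksum"]:
        raise RuntimeError(f"{name} changed before 'svn edit'; "
                           "the pristine copy must come from the repository")
    base_dir = os.path.join(wc_dir, ADMIN_DIR, "text-base")
    os.makedirs(base_dir, exist_ok=True)
    shutil.copy2(path, os.path.join(base_dir, name))
    entry["editing"] = True
    _save(wc_dir, entries)


def editing(wc_dir):
    """The 'svn editing' idea: list the files currently marked as edited."""
    return sorted(n for n, e in _load(wc_dir).items() if e["editing"])
```

Because the list of edited files lives in the flat file, answering "what
have I changed?" needs no tree walk at all in this sketch.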

The primary objection I see to this method is the obvious "what happens
if a user changes a file without first issuing an svn edit". Well, that
case could be handled by a policy setting in a server config, or client
config, or even just in established practice. If the user makes changes,
they should be allowed to keep them, but they will pay a small penalty
for having forgotten to do the svn edit first. They will not be able to
do a revert or diff locally, but they SHOULD still be able to diff the
file or make a commit, as long as they have access to the repository. It
will slightly increase the client-to-repository traffic, and if that is
inconvenient for the user, then they will soon learn to remember to svn
edit. It will even be possible to revert, again, as long as they have
access to the repository. However, since MOST users are connected (i.e.
very few do this stuff on planes or in the space shuttle), this is
likely to be a very small, barely noticeable problem. This is a great
segue into the fourth text base mechanism.
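
The fallback described above might look roughly like this in Python. It
is only a sketch: the function name revert, the text-base layout, and
the fetch_from_repository callback (standing in for whatever network
call the client would really make) are all hypothetical.

```python
import os
import shutil


def revert(wc_dir, name, fetch_from_repository):
    """Restore name to its pristine contents.

    If 'svn edit' left a copy in .svn/text-base, use it (no network
    needed). Otherwise call fetch_from_repository(name), which must
    return the pristine contents as bytes.

    Returns "local" or "repository" to show which path was taken.
    """
    target = os.path.join(wc_dir, name)
    base = os.path.join(wc_dir, ".svn", "text-base", name)
    if os.path.exists(base):
        shutil.copyfile(base, target)
        return "local"
    # The user forgot 'svn edit': pay the penalty of a round trip.
    data = fetch_from_repository(name)
    with open(target, "wb") as fh:
        fh.write(data)
    return "repository"
```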

The last mechanism is the "checksum". This text base method assumes
constant access to the repository, and never duplicates files. All it
does is maintain the flat ASCII database of each file's checksum, size
and modification time. Any attempt to svn diff, revert or commit
requires access to the repository, and the client will retrieve the
original contents from the server and then resume normal operation as
it currently does. Yes, on commit this implies a double-download
penalty, but for most installations, I bet that's less painful (because
it is rarer) than a fully duplicated source base. And besides, it
is optional. It is also better to do a download-then-diff-then-submit
than to submit the entire contents of the changed file and let the
server do the diffs, because this would involve changing the server,
and for people with direct but slow access to the repository, chances
are they are on an ADSL line that has higher downstream than upstream
speeds.
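
Detecting local changes from that database alone could work along these
lines. This is a hedged sketch, not svn code: the entry layout, the
function names and the choice of SHA-1 are my own assumptions, and
treating "same size and same timestamp" as unchanged is a heuristic
shortcut that trades a little safety for speed.

```python
import hashlib
import os


def file_checksum(path):
    """SHA-1 of a file's contents (the algorithm is an arbitrary choice)."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_entry(path):
    """Metadata recorded at checkout time instead of a duplicate file."""
    st = os.stat(path)
    return {"checksum": file_checksum(path),
            "size": st.st_size,
            "mtime": st.st_mtime}


def is_modified(path, entry):
    """Decide whether path differs from what was checked out."""
    st = os.stat(path)
    if st.st_size != entry["size"]:
        return True   # size differs: certainly changed
    if st.st_mtime == entry["mtime"]:
        return False  # same size and timestamp: assume unchanged
    # Timestamp moved but size did not: only the checksum can tell.
    return file_checksum(path) != entry["checksum"]
```

Only when is_modified reports a change would the client need to contact
the repository for the original contents.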

The one thing I cannot decide on (I can go either way on this) is
whether the options for the text base should be set in the server config
file or really in the client config as I described above. I kinda like
the idea of a repository maintainer having the ability to control this,
but I also like the idea of the client knowing what's best for their own
particular needs. I think the best possible approach would be to allow
the server to set the default, and allow clients to override it,
possibly even adding the ability for the repository maintainer to
enforce a particular method. For example, in the server's config:

  default_text_base_method = copy_on_edit
  allow_client_method_override = true # or false to enforce default method
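
How a client might combine the two server settings with its own
preference can be sketched as follows. The function name
effective_method is hypothetical, and the rule that a forbidden
override is silently ignored (rather than reported as an error) is just
one possible policy.

```python
METHODS = ("duplicate", "compress", "copy_on_edit", "checksum")


def effective_method(server_default, allow_override, client_choice=None):
    """Pick the text-base method the client should actually use.

    server_default: value of default_text_base_method from the server.
    allow_override: value of allow_client_method_override from the server.
    client_choice:  the client's own text_base_method, or None if unset.
    """
    if server_default not in METHODS:
        raise ValueError(f"bad server default: {server_default!r}")
    if client_choice is None or not allow_override:
        # No client preference, or the maintainer enforces the default.
        return server_default
    if client_choice not in METHODS:
        raise ValueError(f"bad client choice: {client_choice!r}")
    return client_choice
```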

Anyway ... that's my idea. Let the flames begin ... I have my asbestos
suit on :)

Kean

Received on Mon Dec 16 13:28:06 2002

This is an archived mail posted to the Subversion Dev mailing list.