
Re: Another working copy library

From: David Anderson <dave_at_natulte.net>
Date: 2007-01-17 12:51:20 CET

On 1/17/07, Ph. Marek <philipp.marek@bmlv.gv.at> wrote:
> > I think that libsvn_wc_sqlite addresses the issues I pointed out at
> > the beginning of this mail: tree crawls are minimized,
> If you can get the indices in a sequential chunk, it may be worth it.
> Else you'll get the hard disk seeking around, too (although better, I admit).

So SQLite is not as efficient as writing our own raw database format
fine-tuned for svn, or, for that matter, our own OS-level filesystem
specially built to handle svn working copies. I can live with that.

> > commandline tools don't find text-base dupes all over
> > the place,
> As long as you store them in a BLOB or similar, why not? grep happily looks
> into binary files, just to give you the filename.

So it'll tell me wc.db matches, once; rather than the current
behavior, which is to give me twice the number of actual results I
want. That's already a big bonus. If we add a zlib svn_stream in and
out of the text-base DB, we also might save a little disk space, and
we eliminate even the single binary file match.
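
To make the zlib-in-and-out idea concrete, here is a minimal standalone sketch in Python (the `text_base` table and helper names are made up for illustration, not a proposed libsvn_wc_sqlite schema):

```python
import sqlite3
import zlib

# Hypothetical single-file working-copy database; pristine texts live
# in one table as zlib-compressed BLOBs instead of loose files.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE text_base (path TEXT PRIMARY KEY, body BLOB)")

def store_text_base(path, content):
    """Compress a pristine text and store it as a single BLOB row."""
    db.execute("INSERT OR REPLACE INTO text_base VALUES (?, ?)",
               (path, zlib.compress(content)))

def load_text_base(path):
    """Fetch and decompress a pristine text."""
    (blob,) = db.execute("SELECT body FROM text_base WHERE path = ?",
                         (path,)).fetchone()
    return zlib.decompress(blob)

store_text_base("trunk/README", b"hello world\n" * 500)
```

Compressed BLOBs give both effects at once: grep sees at most one binary file, and repetitive text-bases take noticeably less disk.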

> And if you use "grep --exclude *.svn*" it doesn't matter if there's one file
> or 1000.

Fair enough. That's still 1000 times less work, and one less
commandline argument I need to pass the many times I want to grep over
an svn checkout.

> > and we have a clear internal API where we can handle the
> > text-base storage problem cleanly. And, hopefully, most operations are
> > reduced to an SQL select statement, which can be blindingly fast if
> > the database is indexed properly.
> "If the database is indexed properly". So it will have some
> storage-space-cost.

Most optimizations do.
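
The "indexed properly" point is easy to demonstrate with SQLite itself. A sketch, with an invented `entries` table (again, not a proposed schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries (path TEXT, kind TEXT, revision INTEGER)")
db.executemany("INSERT INTO entries VALUES (?, ?, ?)",
               [("dir%d/file%d" % (i // 100, i), "file", i)
                for i in range(10000)])

# Without an index, a lookup by path scans all 10000 rows; with one,
# it's a B-tree search. The index is the storage-space cost in question.
db.execute("CREATE INDEX entries_by_path ON entries (path)")
plan = db.execute("EXPLAIN QUERY PLAN SELECT * FROM entries "
                  "WHERE path = 'dir7/file742'").fetchall()
```

`EXPLAIN QUERY PLAN` confirms the lookup uses the index rather than a full table scan, which is where the "blindingly fast" selects come from.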

> I don't want to be seen as outright against that idea - it surely has its
> merits. I just don't know whether it makes sense to store multi-GB in a
> database, when there's a filesystem available. It feels a bit like
> file-in-database-on-nfs-mount-on-loopback-mounted-file-on-samba-share, if you
> get what I mean ;-)

I completely understand that feeling. Note that I mentioned outright
that I want a clean internal API where alternative text-base handling
mechanisms can be added as time goes on and needs are felt. The
database storage solution just seemed obvious, given that we have a
SQLite database hanging around and open anyway.

It has a few other advantages over storing them as regular files:
fewer inodes are used by storing them all together, less OS-level
metadata is needed for these files (which we care about only for their
content), and tools that crawl the working copy will be slightly less
confused.

> I think that, when such a big thing is being done, it may be good to break a
> bit more -- don't store local text bases.
> That saves us 50% of storage space and grep is happy.
> Use partial-MD5-hashes to check for modifications (like fsvs does), and if
> there's a ra call "ra_get_file_ranges" fsvs would be happy, too :-)

Note my mention of a clean internal API where additional text-base
handling mechanisms can be added. One would be the current behavior,
store them all locally. Pros, more operations available without
hitting the network; Cons, big wc == lots of disk space wasted.
Another would be, say, to always grab from the network. Pros, almost
zero space overhead in the WC, and you're fine if you're on a local
network, behind a caching proxy, or only ever updating; Cons, if
you're not, your working copy will be painfully slow as it pounds the
network.
Another might be to put the text-bases into a configurable location on
the local drive, so that multiple working copies of the same project
can share them as they come and go. At this point, I don't have any
solid plans beyond the "current behavior" and "kill my network please"
implementations, but having a clean internal API keeps our options
open.
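
As a rough illustration of what such an internal API might look like, here is a sketch in Python; the interface and class names are entirely hypothetical:

```python
from abc import ABC, abstractmethod

class TextBaseStore(ABC):
    """Hypothetical pluggable API: where pristine texts come from."""
    @abstractmethod
    def put(self, path, revision, content): ...
    @abstractmethod
    def get(self, path, revision): ...

class LocalStore(TextBaseStore):
    """'Current behavior': keep every pristine locally.
    Fast offline operations, at the cost of disk space."""
    def __init__(self):
        self._blobs = {}
    def put(self, path, revision, content):
        self._blobs[(path, revision)] = content
    def get(self, path, revision):
        return self._blobs[(path, revision)]

class RemoteStore(TextBaseStore):
    """'Kill my network please': keep nothing, always re-fetch."""
    def __init__(self, fetch):
        self._fetch = fetch   # e.g. a callback into the RA layer
    def put(self, path, revision, content):
        pass                  # nothing kept locally
    def get(self, path, revision):
        return self._fetch(path, revision)
```

The rest of the working-copy library would talk only to `TextBaseStore`, so swapping strategies (or adding a shared on-disk one later) never touches the callers.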

> Or, to not have such a big change, simply define an alternate storage
> container; in fsvs speak the "Working copy Administrative Area" (WAA), and
> use the MD5 of the files as an index there.
> That would additionally allow to share the text-bases across multiple
> check-outs, which is a nice benefit.
> If the directories are set apart, a grep won't look there.

See above. If the internal API is done right, this would definitely be
a possibility.

> (I don't want to sound like 'look what fsvs does better' -- fsvs is not
> thought for source control -- but I believe that it's got a few things right.
> [If it didn't, I'd change it to :-])

I'm sure it did! However, I think that getting rid of text-bases
altogether is a bit too steep a change. For most projects, I like my
text-bases. The extra meg or so of disk space doesn't bother me, and
it gets me wicked fast local operations. That all starts falling down
when you start getting the gcc trunk, or checking out big hunks of
KDE/Gnome, or when using svn to store media files, as id Software did,
IIRC. For those cases, being able to toss text-bases, or have another
elegant way of handling them, would be great.

- Dave

Received on Wed Jan 17 12:51:30 2007

This is an archived mail posted to the Subversion Dev mailing list.