
Re: svn commit: rev 1380 - trunk/subversion/include trunk/subversion/libsvn_fs

From: Colin Putney <colin_at_whistler.com>
Date: 2002-02-27 02:15:50 CET

On Tuesday, February 26, 2002, at 04:34 PM, cmpilato@collab.net wrote:

>> All that said, looking at switching the 'strings' table to a BTree with
>> dup keys (meaning, it has [consecutive] multiple records) could be a win.
>> Random access will be tougher, but that happens *very* rarely. Usually,
>> we sequence through it. But even if we *did* need random access, it is
>> possible to fetch just the *length* of a record, determine whether you
>> need content from it, or to skip to the next record.
>>
>> Understanding the question above: how often does the delta code call the
>> stream->write function will tell us more information about the
>> buffering. My guess is that it depends heavily upon the incoming delta.
>> If we're talking about a pure file upload, then we'll have a series of
>> large windows and writes. But if we're talking about a serious diff,
>> then we could have a whole bunch of little writes as the file is
>> reconstructed.
>>
>> I'd say that a buffer is good, but we could probably reduce the size to
>> 100k or something, and use duplicate keys to improve the writing perf.
>>
>> Are you up for it Mike? :-)
>
> I have a feeling that the incoming data tends to be no larger than
> something near 100k, the size of the svndiff encoding windows. My
> tests were on imports, so all the data coming into the filesystem was
> svndiff's equivalent of full-text. I'll betcha that those were 100K
> windows with one op: NEW (and 102400 bytes of data). The buffering
> earns us nothing if it drops to a value smaller than the average size
> of a chunk of data written to the filesystem. It needs to float at
> a value that is like The Most Memory I Can Stand For the FS to Use.
>
> As for the BTree thing, I don't see the advantage in this case. Sure,
> it might help our reads of data (or it might hurt, if getting a range
> of text means we have to potentially hit the database more than once
> because that range is spread out over multiple records), but the
> problems I had today are strictly related to the number of times
> record data was written to the database. Perhaps you're forking a new
> thread, though, and I'm missing it?

I suspect that this is actually aimed at optimizing streamy writes.
Consider the case of writing a 400K string into the database using 100K
buffers. The database operations will go something like this:

- create record
- write 100K

- read 100K
- delete record
- create record
- write 200K

- read 200K
- delete record
- create record
- write 300K

- read 300K
- delete record
- create record
- write 400K

It just gets worse and worse as the file size increases (the total record
data written grows quadratically with the number of buffers), and all of it
goes into the logs.
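
To put rough numbers on it (this is just the arithmetic, not anything from
the Subversion source), a throwaway program tallying the bytes each strategy
writes for the 400K-string, 100K-buffer case looks like this:

#include <stdio.h>

int main(void)
{
  const long buf_size = 100 * 1024;    /* 100K buffer */
  const long total_size = 400 * 1024;  /* 400K string */
  long stored = 0, rewrite_bytes = 0, append_bytes = 0;

  while (stored < total_size)
    {
      stored += buf_size;
      rewrite_bytes += stored;     /* delete the old record, write it all back */
      append_bytes += buf_size;    /* just add one more record to the set */
    }

  printf("single record:    %ldK written\n", rewrite_bytes / 1024); /* 1000K */
  printf("multiple records: %ldK written\n", append_bytes / 1024);  /*  400K */
  return 0;
}

For a string made of n buffers, the single-record approach writes roughly
n(n+1)/2 buffers' worth of data (plus the reads); appending writes just n.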

If you break the data into multiple records, you can just create a new
record for each bufferful of data and tack it onto the end.
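
For what it's worth, here's a sketch of what that might look like with the
Berkeley DB C API. This is hypothetical, not real Subversion code: the file
name and key are made up, and the DB->open call below uses the newer
signature that takes a transaction argument.

#include <string.h>
#include <db.h>

/* Open the (hypothetical) strings table so that one key may map to a
   whole set of consecutive records. */
static int
open_strings_table(DB **dbp)
{
  int err;

  if ((err = db_create(dbp, NULL, 0)) != 0)
    return err;

  /* Allow multiple (consecutive) records per key. */
  if ((err = (*dbp)->set_flags(*dbp, DB_DUP)) != 0)
    return err;

  /* Note: older BDB releases omit the DB_TXN * argument here. */
  return (*dbp)->open(*dbp, NULL, "strings.db", NULL,
                      DB_BTREE, DB_CREATE, 0664);
}

/* Append one bufferful: with DB_DUP set, a put() with flags == 0 adds a
   new record after the existing ones for the same key, so nothing that
   was written earlier ever has to be read back or rewritten. */
static int
append_chunk(DB *strings, const char *string_key,
             const void *buf, u_int32_t len)
{
  DBT key, data;

  memset(&key, 0, sizeof(key));
  memset(&data, 0, sizeof(data));
  key.data = (void *) string_key;
  key.size = (u_int32_t) strlen(string_key);
  data.data = (void *) buf;
  data.size = len;

  return strings->put(strings, NULL, &key, &data, 0);
}

Reading the string back streamily would then just be a cursor walking the
duplicates in insertion order (DB_NEXT / DB_NEXT_DUP).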

An analogy would be reading data from a file into memory. You could call
realloc() before every read to make room for the incoming data, but it's
way better to create a linked list of buffers.
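
Just to make the analogy concrete, a toy version in C might look like this
(hypothetical code, error handling mostly elided):

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (100 * 1024)

struct chunk
{
  size_t used;
  struct chunk *next;
  char data[CHUNK_SIZE];
};

/* Read a whole file by chaining fixed-size buffers together instead of
   growing (and copying) one big buffer with realloc(). */
static struct chunk *
read_all(FILE *fp)
{
  struct chunk *head = NULL, **tail = &head;
  size_t n = CHUNK_SIZE;

  while (n == CHUNK_SIZE)
    {
      struct chunk *c = malloc(sizeof(*c));
      if (c == NULL)
        break;                      /* error handling elided */

      n = fread(c->data, 1, CHUNK_SIZE, fp);
      c->used = n;
      c->next = NULL;

      *tail = c;                    /* tack it onto the end */
      tail = &c->next;
    }

  return head;
}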

I don't really know how BDB is implemented, so they might have some
clever way of minimizing the hit for appending data to a record, but
this is probably why your log files ballooned so much.

Colin

Colin Putney www.whistler.com
Information Systems (877) 932-0606
Whistler.com (604) 935-0035 x221

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
