Per the discussion on streamy FS writes, I went and wrote up the code to use
Berkeley's duplicate keys functionality. Basically, that means that we have
multiple (ordered) data records for a single key. In aggregate, those
records represent the data for the specified key.
The intent here is to write new records with additional blocks of data,
rather than appending to an existing record (thus modifying that record,
thus causing a lot of log activity).
This patch passes 'make check' with no problems.
Now, the good info: when I committed an 'add' of a 25 meg file, it produced
only 25 meg of logs (as expected: log the data before committing to the db).
It also did it in pretty short order, and without sucking up a ton of VM.
There was still some observable growth over time during the commit, but I'm
guessing it is some kind of structure, rather than directly related to the
content (the commit for the file seemed to reach about 9 meg of memory,
growing slowly rather than jumping).
The time was also pretty good. The import took about a minute. Given the
amount of I/O, that might be about right. Future perf testing can tell us.
However, I don't have any pre-Greg-patch numbers (using Mike's latest 4 meg
buffering). Nor do I have numbers pre-buffer or pre-streaming. All of that
data would be really good to have, to see where we started and where we're
going. I'd also like to see if this dup key stuff has improved performance,
or just reduced our log file spamming.
The patch isn't quite ready for committing: I need to update the doc for
svn_fs__string_read(). It was already out of date, and with this change,
I've also introduced the "may return less than you asked for" semantic of
most of our other reading functions.
The hard-coding of 500k in tree.c should also go (I was lazy and didn't want
to recompile everything by changing the constant in svn_fs.h :-). Note that
cmpilato and I think that constant should move into tree.c anyways.
For now, I'm just posting the patch so others can run some of the
comparative tests. I need sleep :-)
Here is a log message to aid in understanding the patch:
* libsvn_fs/strings-table.c (svn_fs__open_strings_table): set the flags on
the db to enable duplicate keys.
(locate_key): new function to allocate a cursor, locate the first record
of data for a key, and return its length.
(get_next_length): use the cursor to get the length of the next record of
  data for the key.
(svn_fs__string_read): use locate_key and get_next_length to locate the
data record for the requested offset. return whatever data is available
in that data record, or the requested length (whichever is less). note
that this changes the semantics to "return some amount" rather than
"return all requested"
(get_key_and_bump): new function containing code factored out of
  svn_fs__string_append; it gets the current 'next-key' value and bumps
  the value in the database. it has also been updated to deal with the new
  'put' semantics of databases with dup keys.
(svn_fs__string_append): just shove another record into the database
(svn_fs__string_clear): we have to delete prior contents (all the data
records associated with the key) since we can't just 'put' a zero-length
value over the top of the old.
(svn_fs__string_size): revamped to total all the data records for the key.
(svn_fs__string_copy): revamp. rather than reading and appending to a new
record, we just copy all the records to the new key.
* libsvn_fs/tree.c (window_consume): HACK. quick change to the buffer limit
* tests/libsvn_fs/strings-reps-test.c (verify_expected_record): print more
  information when an expected size is not met. adjust call to
  svn_fs__string_read() to compensate for not necessarily getting all the
  requested data in one call.
Greg Stein, http://www.lyra.org/
Received on Sat Oct 21 14:37:10 2006