Sleepycat Support Folk:
In the course of writing code for the Subversion project, I came
across what appears to be a bug in Berkeley DB 3.2.9. The bug was
found while debugging some code in our test cases (specifically, the
"strings-reps-test" in subversion/tests/libsvn_fs -- see
http://subversion.tigris.org for instructions on snatching our source
code via CVS if you care).
In our code, we are in a while loop, using get() with DB_DBT_PARTIAL
set to read 100 bytes at a time from a record in a BTREE table. We do
this operation twice, once immediately after writing a new record, and
once immediately after appending data to the end of that first
record. Our loop is basically this:
    offset = 0;
    while (1)
      {
        amount_to_read = 100;
        read AMOUNT_TO_READ bytes starting at OFFSET;
        /* amount_to_read should now contain the amount read */
        if (amount_to_read < 100)
          break;
        offset += amount_to_read;
      }
The second set of these reads (after the append of more data) loops
forever because the .size field in the result DBT is always set to
100. Eventually, we get to the point where our offset is off the end
of the record, and *still* .size comes back 100.
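To make the termination condition concrete, here is a minimal C sketch of
the loop, run against an in-memory stand-in for the record. It does not
call Berkeley DB at all: partial_read() and read_in_chunks() are
hypothetical names, and partial_read() only models the behavior we expect
from a partial get(), namely that the returned size drops below the
requested amount once the end of the record is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 100

/* Hypothetical stand-in for a partial get(): copies up to `want` bytes of
 * `rec` starting at `offset` into `buf` and returns how many were copied.
 * A read at or past the end of the record returns 0. */
static size_t
partial_read(const char *rec, size_t rec_len, size_t offset,
    size_t want, char *buf)
{
	size_t avail = offset < rec_len ? rec_len - offset : 0;
	size_t n = want < avail ? want : avail;

	if (n > 0)
		memcpy(buf, rec + offset, n);
	return (n);
}

/* Read the whole record CHUNK bytes at a time. Stores the number of reads
 * issued in *reads and returns the total number of bytes read. */
static size_t
read_in_chunks(const char *rec, size_t rec_len, size_t *reads)
{
	char buf[CHUNK];
	size_t offset = 0, got;

	*reads = 0;
	for (;;) {
		got = partial_read(rec, rec_len, offset, CHUNK, buf);
		assert(got <= CHUNK);	/* never more than was asked for */
		(*reads)++;
		offset += got;
		if (got < CHUNK)	/* short read: end of record */
			break;
	}
	return (offset);
}
```

With this model a 726-byte record takes 8 reads (7 full, then 26 bytes)
and a 1153-byte record takes 12 (11 full, then 53 bytes). If the reported
size were stuck at 100, the break would never be reached, which is the
behavior we are seeing.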
The following is an email sent to a co-worker describing the problem
from our specific code's point of view (and going into more detail
about what I found while tracing into DB code). At the end of the
mail is a patch which fixed the bug for us. I'm sorry that I can't
provide a single little test program to demonstrate the problem -- if
you require one, I'll certainly give it a shot, though.
Thanks!
C. Michael Pilato
cmpilato@collab.net
---------------------------------------------------------------------
Karl, here's the dealio on the Berkeley problem we were seeing today.
BACKGROUND:
Recall the situation here. We put() a record into a BTREE table, and
then read it back 100 bytes at a time to make sure the write
succeeded. Then we put() more data at the end of that record (a basic
append operation), then read the whole record back 100 bytes at a time
to make sure the second write succeeded (and that the record's data is
truly a concatenation of the first and second strings written to the
record). In our test case, our second set of 100-byte get()s is
looping infinitely because results.size (where 'results' is a DBT) is
always set to 100, instead of dropping to some number < 100 when we
read the last portion of the record.
DIAGNOSIS:
The first put() of a record is a string of 726 bytes. The Berkeley
BTREE type for this data is B_KEYDATA (defined in db_page.h). All
goes well for this put() and subsequent set of 100-byte get()s.
The second write is an append of 427 more bytes (for a total of 1153).
Now, in bt_put.c:__bam_iitem(), it is decided that there isn't enough
room in the current page (which only has room for 1007 bytes, if my
interpretation is accurate) for the appended record data, so a new
page of type B_OVERFLOW is created and used for the data in this
put(). This change of BTREE type sends our second set of get()s off
on a slightly different codepath...the one that returns the wrong
values.
Just to make sure, I edited string2 in the test case so that the
combined size of string1 + string2 was 1008 bytes. Running the test
at this point still caused the infinite loop that occurs when BDB
doesn't tell you you've read the whole record.
Another edit to string2, chopping off one character (for a string1 +
string2 result of 1007, the apparent maximum size of the BTREE page)
was all it took to make the second put() NOT turn into a B_OVERFLOW
record, and *poof!* the infinite loop was gone; all was well.
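The arithmetic behind those three runs can be written down as a tiny C
check. The 1007-byte limit is only the value observed in these
experiments, not a constant taken from the Berkeley DB source, and
expect_overflow() and check_observations() are hypothetical names. The
string2 lengths of 282 and 281 are derived from the stated totals of 1008
and 1007 with a 726-byte string1.

```c
#include <assert.h>
#include <stddef.h>

/* Largest combined record size that stayed on the page in these test
 * runs (an observed value, not a Berkeley DB constant). */
#define OBSERVED_ON_PAGE_MAX 1007

/* Returns nonzero if a record of len1 + len2 bytes would be expected to
 * spill into a B_OVERFLOW page, going by the observed limit. */
static int
expect_overflow(size_t len1, size_t len2)
{
	return (len1 + len2 > OBSERVED_ON_PAGE_MAX);
}

/* The three cases described above. */
static void
check_observations(void)
{
	assert(expect_overflow(726, 427));	/* 1153 bytes: loops forever */
	assert(expect_overflow(726, 282));	/* 1008 bytes: loops forever */
	assert(!expect_overflow(726, 281));	/* 1007 bytes: all is well */
}
```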
As determined earlier today, the final for() loop in
db_overflow.c:__db_goff() looked like it never returned an updated
value for .size when one was attempting to read the end of a record
(where the data wasn't big enough to fill the provided buffer).
SOLUTION:
I believe the following patch will do the trick (this is a local edit
to the 3.2.9 version of db_overflow.c), and I recommend that we submit
this patch to the Berkeley DB team:
--- db_overflow.c.3.2.9 Tue Jun 19 15:47:02 2001
+++ db_overflow.c Tue Jun 19 15:53:36 2001
@@ -155,6 +155,7 @@
 		pgno = h->next_pgno;
 		memp_fput(dbp->mpf, h, 0);
 	}
+	dbt->size -= needed;
 	return (0);
 }
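
To show why subtracting the leftover count gives the right answer, here
is a toy model of a read across a chain of overflow pages. This is NOT
the actual __db_goff() code: struct ovfl_page and chain_read() are
hypothetical, and the model only assumes what was observed above, that
the size starts out equal to the requested length and that a counter of
bytes still needed runs down as data is copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified overflow page: a chunk of record data plus a
 * link to the next page. Not Berkeley DB's real page layout. */
struct ovfl_page {
	const char *data;
	size_t len;
	const struct ovfl_page *next;
};

/* Copy up to `dlen` bytes starting at record offset `doff` from the page
 * chain into `buf`, and return the number of bytes actually copied. */
static size_t
chain_read(const struct ovfl_page *pg, size_t doff, size_t dlen, char *buf)
{
	size_t needed = dlen;	/* bytes still to copy */
	size_t curoff = 0;	/* record offset where this page starts */
	size_t size = dlen;	/* optimistic result, as in the buggy case */

	for (; pg != NULL && needed > 0; pg = pg->next) {
		if (curoff + pg->len > doff) {
			/* Skip the part of this page that lies before doff. */
			size_t start = doff > curoff ? doff - curoff : 0;
			size_t n = pg->len - start;

			if (n > needed)
				n = needed;
			memcpy(buf, pg->data + start, n);
			buf += n;
			needed -= n;
		}
		curoff += pg->len;
	}
	assert(needed <= dlen);
	size -= needed;	/* the fix: discount bytes that were never found */
	return (size);
}
```

For an 8-byte record split across two pages, asking for 4 bytes at
offset 6 returns 2, and asking at offset 8 returns 0. Without the final
subtraction both calls would report 4, which matches the stuck .size we
saw.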
---------------------------------------------------------------------
Received on Sat Oct 21 14:36:32 2006