
long records and partial record access [#3812]

From: Keith Bostic <bostic_at_sleepycat.com>
Date: 2001-04-23 17:39:38 CEST

Hi, my name is Keith Bostic and I'm with Sleepycat Software. I'll own
your Support Request for now.

> I have some questions about how Berkeley DB is designed to perform in
> certain circumstances.
> I've been assuming that Berkeley DB manages very large record values
> (say, gigabytes long) roughly as efficiently as the Unix filesystem
> would, and that manipulating values like that using the partial record
> access stuff (http://www.sleepycat.com/docs/ref/am/partial.html) would
> be roughly as efficient as using seek, read, and write on a Unix file.
> Is this actually the case?

For reading and record-creation writing, Berkeley DB is reasonably
efficient. DB stores overflow records as linked lists of pages, so
there is little wasted space (less wasted space for large page sizes
where the page overhead is less significant). There are no database
format limitations on the size of these records, either; you can make
them as large as you like.
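As a rough sketch of what that per-page overhead means: if each overflow
page carries a fixed-size header, the page count for a record works out
as below. The 32-byte header figure is illustrative, not Berkeley DB's
actual page layout.

```c
#include <stddef.h>

/* Rough page-count arithmetic for an overflow record stored as a chain
 * of pages: each page holds (pagesize - overhead) bytes of record data.
 * The `overhead` value is a per-page header size; 32 bytes is an
 * illustrative figure, not Berkeley DB's exact number. */
static size_t overflow_pages(size_t reclen, size_t pagesize, size_t overhead)
{
    size_t payload = pagesize - overhead;
    return (reclen + payload - 1) / payload;   /* ceiling division */
}
```

With 4KB pages a 1MB record needs 259 pages (about 8KB of headers);
with 64KB pages it needs only 17 (about 0.5KB of headers), which is why
the wasted space shrinks as the page size grows.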

For modification, however, Berkeley DB is significantly less efficient
than the UNIX filesystem.

> Berkeley DB's partial record functions actually go beyond what seek,
> read, and write offer, in that you can replace a section of a record
> with text of a different length, thus effectively inserting or
> deleting text from a record, as if it were a text editing buffer. How
> efficient are those operations? If I just insert some bytes in the
> middle of a large record, does it rewrite the entire tail of the
> record?

This is exactly where the inefficiencies are found. Berkeley DB is
optimized for records that are small relative to the page size. Most
operations on overflow records require instantiating the entire record
in memory. For example, to modify a chunk of data in a record, the
record will be instantiated in contiguous memory, the data will be
modified, the old chain of pages will be deleted, and a new chain of
pages will be added. So it's even worse than you supposed: not only
will the entire tail be rewritten, the entire record is likely to be
rewritten as well.

There are additional inefficiencies, as well. If the overflow records
are transactionally protected, the before- and after-images of the
entire record will be written into the transactional log, which wastes
a tremendous amount of log space and CPU/disk performance.
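In buffer-editing terms, a length-changing partial put behaves like the
following sketch on a plain byte array (no Berkeley DB code involved;
`doff` and `dlen` mirror the offset and length fields of the
partial-record interface). Note that it allocates and copies the whole
record, which is exactly the cost described above.

```c
#include <stdlib.h>
#include <string.h>

/* Replace record[doff .. doff+dlen) with src[0 .. size), resizing the
 * record -- the semantics of a length-changing partial put, sketched on
 * a plain in-memory buffer.  Returns a newly allocated record (NUL-
 * terminated for convenience), or NULL on allocation failure. */
static char *partial_put(const char *rec, size_t reclen,
                         size_t doff, size_t dlen,
                         const char *src, size_t size,
                         size_t *newlen)
{
    if (doff > reclen) {                 /* clamp the replaced span */
        doff = reclen;
        dlen = 0;
    }
    if (dlen > reclen - doff)
        dlen = reclen - doff;
    *newlen = reclen - dlen + size;
    char *out = malloc(*newlen + 1);     /* whole record is copied */
    if (out == NULL)
        return NULL;
    memcpy(out, rec, doff);                          /* unchanged head   */
    memcpy(out + doff, src, size);                   /* replacement text */
    memcpy(out + doff + size, rec + doff + dlen,
           reclen - doff - dlen);                    /* shifted tail     */
    out[*newlen] = '\0';
    return out;
}
```

For example, replacing the one-byte span at offset 5 of "hello world"
with ", brave " yields "hello, brave world" -- but only after copying
all eighteen bytes of the result.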

> I'm sure you folks would fix any bugs we might find; I'm asking
> about the performance you'd expect to see from your data structures,
> assuming the implementation is correct.

We've had a couple of changes on the drawing board for quite a while.

First, true BLOB support. Even if we fixed the inefficiencies I just
described, there would still not be an incremental interface for record
return to the application -- it's unreasonable for video to be stored
into the database, and then copied into a single contiguous chunk of
memory for return to the application. Our approach to BLOB support
would almost certainly be exporting an external file descriptor
interface to the application, and storing the BLOBs in files created
outside of the database.

Then, applications could read/write the files directly to/from disk,
and Berkeley DB would use underlying filesystem rename semantics to
protect the transactional nature of the records. Obviously, using
flat-text files in this way would not make data insertion/deletion
within BLOBs much more efficient, but I think it's reasonable to treat
BLOBs as write-once, append-many, read-many types of data structures.
I do not believe BLOBs where data is incrementally modified are common,
except as part of appending new information.

Second, we could certainly introduce a number of efficiencies into the
Berkeley DB code to deal with overflow records; for example, we could
only rewrite the pages in the linked list that were actually modified,
only log the page changes, not the entire record, and so on. Frankly,
we're not sure if this is a good choice on our part or not. If we
introduce true BLOB support, then it's only the records that are too
large for database pages but too small for BLOBs that these changes
would benefit.

I'd be very interested in hearing back from you what kind of support
you'd like to see Berkeley DB offer, and where you expect performance
boundaries to lie.

Finally, all of this work is largely waiting on a customer base to
support it -- so far, few customers have asked for these changes.


Keith Bostic
Sleepycat Software Inc. bostic@sleepycat.com
118 Tower Rd. +1-781-259-3139
Lincoln, MA 01773 http://www.sleepycat.com
Received on Sat Oct 21 14:36:29 2006

This is an archived mail posted to the Subversion Dev mailing list.