The kinds of errors that we're talking about are likely to take down large
chunks of data at a time. Correcting them is a somewhat complex process
(as described by GregH below).
Personally, I'd say our answer for 1.0 is simply "error correction is
performed by using RAID drives and a solid backup methodology." We will
implement error detection so that the admin knows that a restore from backup
is necessary. This is a simple strategy which allows us to deliver the
software when people need it: soon.
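
(To make "error detection" concrete: a per-page checksum along these lines
would be enough to tell the admin that a restore is needed. This is only a
rough sketch; the 4096-byte page, the CRC-32, and the checksum-in-the-last-
four-bytes layout are assumptions for illustration, not Berkeley DB's actual
page format.)

    /* Sketch only: per-page CRC-32 check so corruption is at least detected.
     * PAGE_SIZE and the "checksum in the last 4 bytes" layout are made up
     * for illustration. */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Plain CRC-32 (reflected, polynomial 0xEDB88320), computed bit by bit. */
    uint32_t crc32_buf(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
        return ~crc;
    }

    /* Returns 1 if the page's stored checksum matches its contents, else 0. */
    int page_is_intact(const unsigned char page[PAGE_SIZE])
    {
        uint32_t stored =  (uint32_t)page[PAGE_SIZE - 4]
                        | ((uint32_t)page[PAGE_SIZE - 3] << 8)
                        | ((uint32_t)page[PAGE_SIZE - 2] << 16)
                        | ((uint32_t)page[PAGE_SIZE - 1] << 24);
        return crc32_buf(page, PAGE_SIZE - 4) == stored;
    }

A real implementation would run this check on every page read and surface
the failure to the admin, which is all we are promising for 1.0.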
Cheers,
-g
On Tue, Apr 03, 2001 at 02:40:42PM +0800, Lim Swee Tat wrote:
> Hi,
> I made the suggestion in the hope of finding a fairly inexpensive
> checksum that could perhaps also perform error correction, if the need
> arises.
> Is it possible to make this checksum transparent so that we can then
> have both data integrity and error correction? But I guess that under the
> current implementation of Berkeley DB this is rather difficult, since it
> would involve slipping a layer under Berkeley DB, right?
> I understand the need to keep it simple, but if the extra cost of error
> correction in a system that is already spending time on checksums is
> minor, I'm willing to trade speed for the error correction capability.
>
> Ciao
> ST Lim
>
> At [Tue, Apr 03, 2001 at 01:47:41AM], Greg Hudson <ghudson@MIT.EDU> wrote:
> > To: Jim Blandy <jimb@zwingli.cygnus.com>
> > cc: dev@subversion.tigris.org
> > Subject: Re: Linux Kernel Summit
> > Date: Tue, 03 Apr 2001 01:47:41 -0400
> > From: Greg Hudson <ghudson@MIT.EDU>
> >
> > (Note: I'm not advocating we actually do any of this.)
> >
> > > (Now, I haven't really thought this through, but it seems to me that
> > > any error-correcting data would have to be proportional in length to
> > > the thing it was capable of detecting and correcting errors in.
> > > That would kind of defeat the purpose of using deltas to begin
> > > with.)
> >
> > Not at all.  Consider this "simple" approach (it's not simple or
> > advisable in practice because it would involve slipping a layer
> > underneath the Berkeley DB, but it's still theoretically sound):
> >
> >   * Include in each page of the database a checksum of the page,
> >     such that you know if a page has been corrupted.
> >
> >   * After every block of N pages in the database, keep an extra
> >     "parity page" which contains a bitwise parity calculation of
> >     the block of pages.
> >
> > Now if any single page is corrupted, you blow away that page and
> > recompute it using the parity page and the other pages in that block.
> > Of course, corruption would probably happen across a swath of pages,
> > so instead of having a parity block across a sequential group of
> > blocks you'd keep it across a non-local group of blocks (blocks 1,
> > 1001, 2001, ...), or across blocks which live on different spindles.
> > The space cost is 1/N times the previous size of your database, plus
> > the cost of the page checksums.
> >
> > This is all basic RAID stuff, with checksums thrown in because RAID
> > normally assumes that disk failures are detectable.  There exist more
> > advanced error correction techniques which can recover from more
> > complicated failures.
> ------------------------------------------------------------------
> | Lim Swee Tat        | Office Automation: The use of computers  |
> | 3ui Pte Ltd         | to improve efficiency in the office by   |
> | 10 Anson Road       | removing anyone you would want to talk   |
> | International Plaza | with over coffee.                        |
> | #05-17              |                                          |
> | Singapore 079903    |                                          |
> | +65-220-7529 ext 26 |                                          |
> ------------------------------------------------------------------
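
(For reference, the parity-page recovery GregH describes above comes down to
XOR arithmetic over a group of pages. Again, this is only a sketch; GROUP_N,
PAGE_SIZE, and the layout are made-up constants, not Berkeley DB's on-disk
format.)

    /* Sketch of the parity-page idea: one parity page per group of GROUP_N
     * data pages.  If exactly one page in a group fails its checksum, XORing
     * the parity page with the surviving pages reconstructs it. */
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define GROUP_N   8            /* data pages per parity group (made up) */

    /* parity = pages[0] XOR pages[1] XOR ... XOR pages[GROUP_N - 1] */
    void compute_parity(const unsigned char pages[GROUP_N][PAGE_SIZE],
                        unsigned char parity[PAGE_SIZE])
    {
        memset(parity, 0, PAGE_SIZE);
        for (size_t p = 0; p < GROUP_N; p++)
            for (size_t i = 0; i < PAGE_SIZE; i++)
                parity[i] ^= pages[p][i];
    }

    /* Rebuild the one page whose checksum failed (index 'bad') from the
     * parity page and the other GROUP_N - 1 intact pages in the group. */
    void recover_page(unsigned char pages[GROUP_N][PAGE_SIZE],
                      const unsigned char parity[PAGE_SIZE],
                      size_t bad)
    {
        memcpy(pages[bad], parity, PAGE_SIZE);
        for (size_t p = 0; p < GROUP_N; p++) {
            if (p == bad)
                continue;
            for (size_t i = 0; i < PAGE_SIZE; i++)
                pages[bad][i] ^= pages[p][i];
        }
    }

As GregH notes, this only survives a single bad page per group, which is why
he suggests spreading each group across non-local blocks or separate spindles.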
--
Greg Stein, http://www.lyra.org/