Stefan Küng <firstname.lastname@example.org> wrote:
> Just tried showing the log for the TSVN trunk, and I got an assertion
> while loading the cache file. I've attached the cache file.
> The assertion is in the insert() method: it tries to insert an already
> existing revision.
A real WTF that took me hours to track down!
Should be fixed for new cache files in r9987.
> I don't know how I could get there, I mean how I ended up with two
> (identical?) entries in the cache file, but maybe you can find out?
Well, the good news: the problem affects only the token dictionary used
to represent the log messages. Also, if this assert() is violated,
it means that there are multiple index entries pointing to equivalent
data. So, such data is (hopefully) not "wrong" but just redundant.
And it's an indication that some other algorithm is probably broken ...
The bug at hand is difficult to describe (sorry for the complexity
of the log caching, but I think it is necessary):
* CStringDictionary stores unique strings and associates them with
an ID. The log cache uses these containers to replace string data
with simple integer data, i.e. it can operate on fixed-size numbers
(indices, tokens) instead of variable-length character streams.
* CIndexPairDictionary stores unique pairs of integers and associates
them with an ID. The log cache uses that to recursively construct
paths (parentPathID, stringID). In a different context, it represents
compressed data (i.e. one token to be used in place of two others).
The bug caused multiple entries (i.e. replacement tokens) for the
same pair of original tokens.
* CTokenizedStringContainer is used to represent large text strings
(here: log messages). Every string gets broken up into "words".
Using a CStringDictionary the text will be translated into a
sequence of word tokens.
Since word sequences are highly redundant, they get compressed even
further: Every token pair that is used at least 3 times gets replaced
by a new token until no such pairs can be found anymore. However,
for performance reasons, CPairPacker is invoked only upon Save() and
only if it seems worthwhile.
New texts will always get compressed using the existing pair replacements.
New texts may also create an opportunity for further compression
(additional pairs) and the next invocation of CPairPacker will use it.
It is important, though, that these new pairs are disjoint from the
existing ones. Otherwise, the list of newly introduced pair
replacements will be partly redundant with the existing one:
more than one token would designate the same compressed sequence,
which is wasteful.
To make the lists disjoint, new texts must perform *all* replacements
according to the existing list. The bug was caused by the algorithm
not being able to combine two pair tokens - it would only aggregate
word tokens to an existing word or pair token.
Phew. For the sake of documentation ...
Received on Fri Jul 6 04:27:25 2007