Stefan Küng <firstname.lastname@example.org> wrote:
> Just tried showing the log for the TSVN trunk, and I got an assertion
> while loading the cache file. I've attached the cache file.
> The assertion is in the insert() method: it tries to insert an already
> existing revision.
A real WTF that took me hours to track down!
Should be fixed for new cache files in r9987.
> I don't know how I could get there, I mean how I ended up with two
> (identical?) entries in the cache file, but maybe you can find out?
Well, the good news: the problem affects only the token dictionary used
to represent the log messages. Also, if this assert() is violated,
it means that there are multiple index entries pointing to equivalent
data. So, such data is (hopefully) not "wrong" but just redundant.
And it's an indication that some other algorithm is probably broken ...
The bug at hand is difficult to describe (sorry for the complexity
of the log caching, but I think it is necessary):
* CStringDictionary stores unique strings and associates them with
an ID. The log cache uses these containers to replace string data
with simple integer data, i.e. it can operate on fixed-size numbers
(indices, tokens) instead of variable-length character streams.
* CIndexPairDictionary stores unique pairs of integers and associates
them with an ID. The log cache uses that to recursively construct
paths (parentPathID, stringID). In a different context, it represents
compressed data (i.e. one token to be used in place of two others).
The bug caused multiple entries (i.e. replacement tokens) for the
same pair of original tokens.
* CTokenizedStringContainer is used to represent large text strings
(here: log messages). Every string gets broken up into "words".
Using a CStringDictionary the text will be translated into a
sequence of word tokens.
Since word sequences are highly redundant, they get compressed even
further: Every token pair that is used at least 3 times gets replaced
by a new token until no such pairs can be found anymore. However,
for performance reasons, CPairPacker is invoked only upon Save() and
only if it seems worthwhile.
New texts will always get compressed using the existing pair replacements.
New texts may also create an opportunity for further compression
(additional pairs) and the next invocation of CPairPacker will use it.
It is important, though, that these new pairs are disjoint from the
existing ones. Otherwise, the list of newly introduced pair
replacements will be partly redundant with the existing one:
more than one token would designate the same compressed sequence,
which is wasteful.
To make the lists disjoint, new texts must perform *all* replacements
according to the existing list. The bug was caused by the algorithm
not being able to combine two pair tokens - it would only aggregate
word tokens to an existing word or pair token.
Phew. For the sake of documentation ...
Received on Fri Jul 6 04:27:25 2007