Re: Numbers encoding in FSFS log addressing indexes

From: Stefan Fuhrmann <stefan.fuhrmann_at_wandisco.com>
Date: Wed, 25 Jun 2014 20:33:42 +0200

On Wed, Jun 25, 2014 at 8:03 PM, Daniel Shahaf <d.s_at_daniel.shahaf.name>
wrote:

> Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 17:34:43 +0200:
> > On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <ivan_at_visualsvn.com> wrote:
> >
> > > Subversion 1.8 and before in general uses human readable decimal
> > > format to store numbers in FSFS repositories on disk.
> >
> >
> > True. However, there are exceptions to that general rule.
> > The index data uses the same basic encoding as we
> > already use in txdelta. In both cases, encoding density
> > is critical to I/O performance.
> >
>
> Is "density" the right word? The density ratio between base-2⁷ encoding
> and base-10 encoding is a constant factor, is that constant significant?
>

It is. In source code repositories, the indexes make up something
like 5% of the repository size, with great variation depending
on the average representation delta size. A base-10 encoding
would use about 2.5x as much space (2x or more for the digits
of the number, plus the separating whitespace).
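
To make the size comparison concrete, here is a minimal sketch of
a generic 7b/8b (base-128) variable-length integer encoding next to
a decimal-plus-separator encoding. It is an illustration only: the
function names are made up for this example, and the byte order
(least significant group first) is an assumption, not a statement
about what the FSFS index or svndiff code actually writes.

```python
def encode_7b8b(value: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte.

    Every byte except the last has its high bit set as a
    continuation flag. Least significant group first (assumption).
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits + continuation flag
        value >>= 7
    out.append(value)  # final byte, continuation flag clear
    return bytes(out)


def decode_7b8b(data: bytes) -> int:
    """Decode one integer produced by encode_7b8b."""
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # flag clear: this was the last byte
            return value
    raise ValueError("truncated input")


def decimal_size(value: int) -> int:
    """Bytes needed for ASCII decimal digits plus one separator."""
    return len(str(value)) + 1


def size_ratio(values) -> float:
    """Decimal size divided by 7b/8b size for a list of values."""
    binary = sum(len(encode_7b8b(v)) for v in values)
    text = sum(decimal_size(v) for v in values)
    return text / binary
```

For example, the value 12345 takes 2 bytes in this 7b/8b form but
6 bytes as decimal text with a separator. The overall ratio depends
on the distribution of the numbers stored, so the 2.5x above is an
estimate for typical index contents, not something this sketch
derives.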

The crux is that FSFS access is very random because most
nodes get changed independently, i.e. not only do their HEAD
revs differ greatly but so do the locations of their delta chains.
Format7 manages to eliminate a great portion of that randomness
at the cost of random index access.

To make that trade-off work, the index must be small. So, with
a larger encoding we would not be talking about 5% extra data
(and I/O) but rather +100% addressing I/O overhead. I never
actually measured that but the principle holds. BTW, one of the
advantages of FSX is that it combines multiple items into a
single container, allowing for much smaller indexes.

> Perhaps an ASCII hexadecimal integer would solve whatever the problems
> with ASCII decimals are that a txdelta (base-2⁷) integer solves?
>
> > For instance, if you disable deltification in the ruby repo
> > (but keeping compression active), it explodes to 9.7GB,
> > a factor of 22.8. From that it should be obvious how
> > important space efficient encoding is to Subversion.
> >
>
> What does deltification have to do with choosing between ASCII-encoding
> and svndiff-encoding of 64-bit integers?
>

The operation sequence becomes a significant contribution
to the repository size when less than 4% of your content
is left after deltification (it's actually less than 2%; the remainder
is metadata overhead). Using longer encodings may have
a significant impact on repo size. That's probably why the
current encoding was chosen way back when.

I chose this as a reference for what people deemed a good
rationale for diverging from the "human readable" rule.

> > > Log addressing
> > > implementation on trunk introduces new encoding for storing numbers in
> > > indexes. Quoting log addressing indexes format documentation [1]
> > >
> >
> > I'm not even sure there is documentation for our txdelta
> > on-disk representation. So, FSFS indexes are doing a
> > better job in that department, ATM.
>
> Why is this relevant to the subject at hand? Good job for writing
> documentation, but lack of documentation wasn't Ivan's concern.
>

I was merely sandbagging against future "hard to maintain"
claims etc. Why would someone quote the 7b/8b encoding
scheme docs if not to use it *against* the current code?

-- Stefan^2.
Received on 2014-06-25 20:34:08 CEST

This is an archived mail posted to the Subversion Dev mailing list.
