Numbers encoding in FSFS log addressing indexes

From: Ivan Zhakov <ivan_at_visualsvn.com>
Date: Wed, 25 Jun 2014 19:09:45 +0400

Subversion 1.8 and earlier generally use a human-readable decimal
format to store numbers in FSFS repositories on disk. The log addressing
implementation on trunk introduces a new encoding for storing numbers in
the indexes. Quoting the log addressing index format documentation [1]:
[[[
Encoding
--------

The final index file format is tuned for space and decoding efficiency.
Indexes are stored as a sequence of variable integers. The encoding is
as follows:

* Unsigned integers are stored in little endian order with a variable
  length 7b/8b encoding. If the most significant bit of a byte is set,
  the next byte also belongs to the same value.

  0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
  0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
  0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
  0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)

  Technically, we can represent integers of arbitrary lengths. Currently,
  we only generate and parse up to 64 bits.

* Signed integers are mapped onto the unsigned value space as follows:

  x >= 0 -> 2 * x
  x < 0 -> -2 * x - 1

  Again, we can represent arbitrary length numbers that way but the code
  is currently restricted to 64 bits.

Most data is unsigned by nature but will be stored differentially using
signed integers.
]]]
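
To make the quoted rules concrete, here is a minimal sketch in Python.
It is not the libsvn_fs_fs code; all function names are hypothetical,
and the delta step assumes each value is stored relative to the previous
one, starting from 0, which the quoted text does not spell out.

```python
def encode_uint(value):
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    if value < 0:
        raise ValueError("unsigned value expected")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_uint(data, pos=0):
    """Decode one value starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def signed_to_unsigned(x):
    """Map x >= 0 to 2 * x and x < 0 to -2 * x - 1, as in the quoted text."""
    return 2 * x if x >= 0 else -2 * x - 1


def unsigned_to_signed(u):
    """Inverse of signed_to_unsigned."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def encode_deltas(values):
    """Assumed differential scheme: store each value minus the previous one."""
    out = bytearray()
    previous = 0
    for value in values:
        out += encode_uint(signed_to_unsigned(value - previous))
        previous = value
    return bytes(out)


def decode_deltas(data):
    """Inverse of encode_deltas."""
    values = []
    previous = 0
    pos = 0
    while pos < len(data):
        u, pos = decode_uint(data, pos)
        previous += unsigned_to_signed(u)
        values.append(previous)
    return values
```

Because every value occupies a variable number of bytes, locating the
N-th number in such a stream requires decoding everything before it, and
the raw bytes do not show the stored numbers directly.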

I'm unhappy with the chosen encoding since it is not human readable.
It is also not as good for performance as storing 8 bytes for every
number.

I think the indexes should use one of the following formats:
1. Human-readable decimal numbers with a trailing newline. This would
   be consistent with the original FSFS encoding and would make
   corruptions easier to investigate.

2. 64-bit numbers stored as 8 bytes in some fixed endianness (little
   endian, for example). This would give us maximum performance since
   we get fixed-length index records, and they would still be somewhat
   human readable in a hex editor.
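
For illustration, the two alternatives could look roughly like this in
Python. This is only a sketch: the function names are hypothetical and
no actual index record layout is implied.

```python
import struct


def encode_decimal(value):
    """Option 1: decimal digits followed by a newline."""
    return str(value).encode("ascii") + b"\n"


def decode_decimal_lines(data):
    """Read back all newline-terminated decimal numbers."""
    return [int(line) for line in data.splitlines()]


def encode_fixed64(value):
    """Option 2: unsigned 64-bit value as 8 bytes, little endian."""
    return struct.pack("<Q", value)


def read_fixed64(data, index):
    """Fetch the index-th number directly; every number is 8 bytes wide."""
    return struct.unpack_from("<Q", data, index * 8)[0]
```

With fixed-width numbers, the N-th value sits at byte offset N * 8, so a
reader can seek straight to it; with decimal lines, the file can be
inspected with any text tool.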

The current encoding is unacceptable, because it makes repository
maintenance and recovery nearly impossible.

[1] http://svn.apache.org/repos/asf/subversion/trunk/subversion/libsvn_fs_fs/structure-indexes

-- 
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com
Received on 2014-06-25 17:10:31 CEST

This is an archived mail posted to the Subversion Dev mailing list.
