As I said, it'll still happily take svndiff0 with no problems.
Until we resolve the issue of writing out the header, in fact, the
patches to force it to use svndiff1 only are in my local tree.
That means the only thing that uses svndiff1 in this patch are the delta
tests.
I gathered statistics on space savings using a printf to print out the
difference between the size of the svndiff1 encoded address, and the
svndiff0 encoded address, and wrote a little python to sum the output
(since it's just lines of -1, 1, 2, etc).
For a single random-test run (upped to 300 iterations rather than 30), we
usually save 58k over all the files it creates.
It varies from 48k-75k, averaging about 58k (this is a total of 30000
iterations).
For each file it does deltas between, the savings ranges from -9 to 2732.
The frequency of negative savings is < 1% (Out of all the single file
savings, 5 were negative).
This is consistent with running various imports of repositories and
comparing sizes with svndiff0 versions of the same repos.
Notes:
Version 1:
Version 1 of the svndiff format brings us a bit closer to vcdiff, and
borrows more ideas.
However, as with version 0, we encode things a little differently than
vcdiff avoid a space penalty.
Ideas taken:
Use of near and same tables, as well as current position, to reduce offset
size to save bytes
in encoding.
Use of modes to determine how we encoded the address.
Changes from vcdiff description of caches/address encoding:
Rather than encode mode separately than offset, we steal 3 bits of the
offset and put the mode in there.
Rationale: Saves quite a few bytes.
The standard cache size is 4, 2 not 4, 3 (ie near table size == 4, same
table size = 2)
Rationale: Needed to encode modes in 3 bits rather than 4. This is because
max mode num is samesize+nearsize+1. 4+3+1=8 different modes=4 bits.
4+2+1=7 different modes=3 bits.
We check the near table for the entry before adding it, and don't add it
if it already exists.
Rationale: This increases hit percentage on the near table by 10%. The
cost of checking at most 4 entries on an update is not noticeable.
Side note:
Current position, as used here, in vcdiff, and the code, is a misnomer.
It's actually a good estimate of a solution to the real problem.
The real problem you want to solve for here encoding (which is just
encoding it based off the current position/magic number. Thus, encoding
is based off of "here") is thus:
Given:
3 bins (the parentheses is the range of numbers that you would place in
the bin)
[1 byte (0-127)] [2 byte (128-16383)] [3 byte (16383...2 million)]
Where:
60% of numbers are in bin 2, 20% in bin 1, 20% in bin 3.
Objective:
Find the number (n) that gives you the most points.
How to get points:
For each number (x) in bin 3
if (n-x) would fall in bin 1, you get 2 point.
For each number (x) in bin 2
If (n-x) would fall in bin 1, you get 1 point.
Because of how our encoder works, most numbers fall into the 2
byte range.
source offset + source len should be larger than the numbers we need to
bin, but not too much larger. Thus, it moves a fair number of things from
bin 2 to bin 1.
However, as i've shown, here encoding is really looking for a magic
number, it just happens to be that the current position is an okay magic
number.
I can't think, off the top of my head, of an O(n) algorithm to solve the
above problem. Obviously, I know how to do it using 0-1 integer
programming, which is way too slow, and there is a naieve O(n^2)
algorithm, which is also too slow. If anyone can think of one, let me
know.
Anyway, here's the diff.
I just want to see thoughts, it still needs a bit more work.
I've tried to make sure all the new routines are commented approriately,
etc.
Index: ./include/svn_delta.h
===================================================================
--- ./include/svn_delta.h
+++ ./include/svn_delta.h Thu Feb 14 12:48:03 2002
@@ -242,7 +242,7 @@
void svn_txdelta_to_svndiff (svn_stream_t *output,
apr_pool_t *pool,
svn_txdelta_window_handler_t *handler,
- void **handler_baton);
+ void **handler_baton, unsigned char version);
/* Return a writable generic stream which will parse svndiff-format
data into a text delta, invoking HANDLER with HANDLER_BATON
Index: ./libsvn_fs/reps-strings.c
===================================================================
--- ./libsvn_fs/reps-strings.c
+++ ./libsvn_fs/reps-strings.c Wed Feb 13 23:46:07 2002
@@ -1439,7 +1438,7 @@
/* Setup a stream to convert the textdelta data into svndiff windows. */
svn_txdelta (&txdelta_stream, source_stream, target_stream, trail->pool);
svn_txdelta_to_svndiff (new_target_stream, trail->pool,
- &new_target_handler, &new_target_handler_baton);
+ &new_target_handler, &new_target_handler_baton, 0);
/* subpool for the windows */
wpool = svn_pool_create (trail->pool);
Index: ./mod_dav_svn/repos.c
===================================================================
--- ./mod_dav_svn/repos.c
+++ ./mod_dav_svn/repos.c Wed Feb 13 22:54:04 2002
@@ -1671,7 +1671,7 @@
svn_stream_set_close(o_stream, dav_svn_close_filter);
/* get a handler/baton for writing into the output stream */
- svn_txdelta_to_svndiff(o_stream, resource->pool, &handler, &h_baton);
+ svn_txdelta_to_svndiff(o_stream, resource->pool, &handler, &h_baton, 0);
/* got everything set up. read in delta windows and shove them into
the handler, which pushes data into the output stream, which goes
Index: ./tests/libsvn_delta/svndiff-test.c
===================================================================
--- ./tests/libsvn_delta/svndiff-test.c
+++ ./tests/libsvn_delta/svndiff-test.c Wed Feb 13 23:07:46 2002
@@ -47,7 +47,7 @@
#else
encoder = svn_base64_encode (svn_stream_from_stdio (stdout, NULL), NULL);
#endif
- svn_txdelta_to_svndiff (encoder, NULL, &svndiff_handler, &svndiff_baton);
+ svn_txdelta_to_svndiff (encoder, NULL, &svndiff_handler, &svndiff_baton, 1);
svn_txdelta_send_txstream (txdelta_stream,
svndiff_handler,
svndiff_baton,
Index: ./tests/libsvn_delta/random-test.c
===================================================================
--- ./tests/libsvn_delta/random-test.c
+++ ./tests/libsvn_delta/random-test.c Thu Feb 14 12:24:51 2002
@@ -205,7 +211,7 @@
delta_pool);
/* Make stage 2: encode the text delta in svndiff format. */
- svn_txdelta_to_svndiff (stream, delta_pool, &handler, &handler_baton);
+ svn_txdelta_to_svndiff (stream, delta_pool, &handler, &handler_baton, 1);
/* Make stage 1: create the text delta. */
svn_txdelta (&txdelta_stream,
Index: ./libsvn_delta/svndiff.c
===================================================================
--- ./libsvn_delta/svndiff.c
+++ ./libsvn_delta/svndiff.c Thu Feb 14 18:52:40 2002
@@ -26,19 +26,135 @@
#define NORMAL_BITS 7
#define LENGTH_BITS 5
-
/* ----- Text delta to svndiff ----- */
-
+/* Structure of the svndiff v1 cache */
+struct svd_cache_s
+{
+ /* Near table */
+ int near_size;
+ apr_off_t *near;
+ int near_index;
+ /* Same table */
+ int same_size;
+ apr_off_t *same;
+ /* Pool we allocate cache stuff from */
+ apr_pool_t *pool;
+ /* Decode start position, to avoid adding arguments to 50 functions to
+ pass this around, since it never changes for a given decode. */
+ apr_off_t pos;
+};
+typedef struct svd_cache_s svndiff_cache_t;
/* We make one of these and get it passed back to us in calls to the
window handler. We only use it to record the write function and
baton passed to svn_txdelta_to_svndiff (). */
struct encoder_baton {
svn_stream_t *output;
svn_boolean_t header_done;
- apr_pool_t *pool;
+ unsigned char version;
+ apr_pool_t *pool;
};
+/* Encoding modes for svndiff v1. Can't be an enum since it needs to
+ take into account the near and same modes. */
+#define SVNDIFF_SELF 0
+#define SVNDIFF_HERE 1
+/* Near modes are svndiff_HERE .. near_size + 1.
+ Same modes are near_size + 2 .. near_size + same_size + 1.
+ Currently, we exploit the fact that we use cache sizes of 4 and 2,
+ to encode the mode into 3 bits. */
+
+/* We need 3 bits for the mode right now. */
+#define SVNDIFF_MODE_BITS 3
+#define SVNDIFF_MODE_MASK (1 << 2 | 1 << 1 | 1 << 0)
+
+/* Free the passed in svndiff cache. */
+static void
+svndiff_cache_free (svndiff_cache_t *cache)
+{
+ if (cache)
+ {
+ svn_pool_destroy (cache->pool);
+ }
+}
+
+/* Create a new svndiff cache, with near table size NEARSIZE, and same
+ table size SAMESIZE. */
+static svndiff_cache_t *
+svndiff_cache_new (int nearsize, int samesize)
+{
+ svndiff_cache_t *cache;
+ apr_pool_t *pool;
+
+ /* Create a new pool for this cache. We could just let the user
+ pass in a pool, but we only make one or two allocations (literally
+ one or two) total. In reality, it doesn't even make sense to use
+ a pool here, i'm doing it for consistency. */
+
+ pool = svn_pool_create (NULL);
+ cache = (svndiff_cache_t *)apr_pcalloc (pool, sizeof (svndiff_cache_t));
+ cache->pool = pool;
+ if (nearsize > 0)
+ {
+ cache->near_size = nearsize;
+ cache->near = (apr_off_t *)apr_pcalloc (cache->pool,
+ nearsize * sizeof (apr_off_t));
+ }
+ if (samesize > 0)
+ {
+ cache->same_size = samesize;
+ cache->same = (apr_off_t *)apr_pcalloc (cache->pool,
+ samesize * 256 * sizeof(apr_off_t));
+ }
+ return cache;
+}
+
+/* Clear out the current data in the cache, resetting it to the
+ initial state. */
+static void
+svndiff_cache_clear (svndiff_cache_t *cache)
+{
+ int i;
+ for (i = 0; i < cache->near_size; i++)
+ cache->near[i] = 0;
+ for (i = 0; i < cache->same_size * 256; i++)
+ cache->same[i] = 0;
+ cache->near_index = 0;
+ cache->pos = 0;
+}
+
+/* Update the cache with a new address. */
+static void svndiff_cache_update (svndiff_cache_t *cache, apr_off_t address)
+{
+ /* Put the address into the next near slot. When we run out of near
+ slots, we start back at the first one. */
+ if (cache->near_size > 0)
+ {
+ int i;
+ int foundit = 0;
+ /* To increase hit percentage, don't put duplicate addresses in
+ the cache. */
+ for (i = 0; i < cache->near_index; i++)
+ {
+ if (cache->near[i] == address)
+ {
+ foundit = 1;
+ }
+ }
+ if (!foundit)
+ {
+ cache->near[cache->near_index] = address;
+ cache->near_index = (cache->near_index + 1) % cache->near_size;
+ }
+ }
+ /* Put the address into the same table at the place corresponding
+ to address % table_slots. */
+ if (cache->same_size > 0)
+ {
+ cache->same[address % (cache->same_size * 256)] = address;
+ }
+}
+
/* Encode VAL into the buffer P using the variable-length svndiff
integer format. Return the incremented value of P after the
@@ -86,6 +202,67 @@
return p;
}
+/* Encode an address using the current position, and cache, to find
+ the best encoding we can. Return the mode of the best encoding (IE
+ how we did it), in the mode parameter. Return the new encoded
+ address. */
+static apr_off_t
+svndiff_encode_addr (svndiff_cache_t *cache, apr_off_t address,
+ apr_off_t curpos, int *mode)
+{
+ int i;
+ apr_off_t diff;
+ apr_off_t bestenc; /* Best encoding of address. */
+ int bestmode; /* Mode for best encoding. */
+
+ /* Short circuit the 1 byte cases. Pointless to try to find a
+ better encoding if it already is < 1 byte. */
+ if ((address << SVNDIFF_MODE_BITS) < 1 << 7)
+ {
+ svndiff_cache_update (cache, address);
+ *mode = SVNDIFF_SELF;
+ return address;
+ }
+ /* We start out assuming the best encoding is just the address
+ itself. */
+ bestenc = address;
+ bestmode = SVNDIFF_SELF;
+
+ /* First try encoding based on the offset from the current
+ position. */
+ diff = curpos - address;
+ if (diff >= 0 && diff < bestenc)
+ {
+ bestenc = diff;
+ bestmode = SVNDIFF_HERE;
+ }
+ /* Now try encoding using the near table. */
+ for (i = 0; i < cache->near_size; ++i)
+ {
+ diff = address - cache->near[i];
+ if (diff >= 0 && diff < bestenc)
+ {
+ bestenc = diff;
+ bestmode = i+2;
+ }
+ }
+ /* Lastly, try encoding using the same table. */
+ if (cache->same_size > 0)
+ {
+ diff = address % (cache->same_size * 256);
+ if (cache->same[diff] == address && diff < bestenc)
+ {
+ bestenc = diff % 256;
+ bestmode = cache->near_size + 2 + (diff / 256);
+ }
+ }
+ /* Update the cache, regardless of what we choose. */
+ svndiff_cache_update (cache, address);
+ /* Lastly, return the mode/new encoded address. */
+ *mode = bestmode;
+ return bestenc;
+}
+
/* Append an encoded integer to a string. */
static void
@@ -109,12 +286,18 @@
const svn_txdelta_op_t *op;
svn_error_t *err;
apr_size_t len;
+ svndiff_cache_t *cache = NULL;
+ /* Version 1 needs a cache. */
+ if (eb->version == 1)
+ cache = svndiff_cache_new (4, 2);
/* Make sure we write the header. */
if (eb->header_done == FALSE)
{
+ char svnver[4] = "SVN\0";
len = 4;
- err = svn_stream_write (eb->output, "SVN\0", &len);
+ svnver[3] = eb->version;
+ err = svn_stream_write (eb->output, svnver, &len);
if (err != SVN_NO_ERROR)
return err;
eb->header_done = TRUE;
@@ -140,7 +323,6 @@
return svn_stream_close (output);
}
-
/* Encode the instructions. */
for (op = window->ops; op < window->ops + window->num_ops; op++)
{
@@ -157,7 +339,36 @@
else
ip = encode_int (ip + 1, op->length);
if (op->action_code != svn_txdelta_new)
- ip = encode_int (ip, op->offset);
+ {
+ /* For version 0, it's just the offset. */
+ if (eb->version == 0)
+ ip = encode_int (ip, op->offset);
+ /* Version 1 tries to encode the address better using the
+ cache and the current position. */
+ else if (eb->version == 1)
+ {
+ int mode;
+ apr_off_t offset;
+ offset = svndiff_encode_addr (cache, op->offset, window->sview_offset + window->sview_len, &mode);
+ /* If the new encoding, offset, is greater than bits(apr_off_t) -
+ SVNDIFF_MODE_BITS, we return an error.
+
+ Why not handle it as a special case?
+ Because I don't think it's actually possible, given
+ that we can encode based on current position, which is
+ set to window->sview_offset + window->sview_len.
+
+ This means you'd need a window > ~46 bits long to
+ even approach this error. */
+ if (((offset << SVNDIFF_MODE_BITS) >> SVNDIFF_MODE_BITS) != offset)
+ {
+ return svn_error_create (SVN_ERR_SVNDIFF_INVALID_OPS, 0, NULL,
+ pool,
+ "offset too large, would cause error decoding");
+ }
+ ip = encode_int (ip, offset << SVNDIFF_MODE_BITS | mode);
+ }
+ }
svn_stringbuf_appendbytes (instructions, ibuf, ip - ibuf);
}
@@ -180,8 +391,9 @@
{
len = window->new_data->len;
err = svn_stream_write (eb->output, window->new_data->data, &len);
- }
-
+ }
+ if (cache)
+ svndiff_cache_free (cache);
svn_pool_destroy (pool);
return err;
}
@@ -190,7 +402,7 @@
svn_txdelta_to_svndiff (svn_stream_t *output,
apr_pool_t *pool,
svn_txdelta_window_handler_t *handler,
- void **handler_baton)
+ void **handler_baton, unsigned char version)
{
apr_pool_t *subpool = svn_pool_create (pool);
struct encoder_baton *eb;
@@ -199,7 +411,7 @@
eb->output = output;
eb->header_done = FALSE;
eb->pool = subpool;
-
+ eb->version = version;
*handler = window_handler;
*handler_baton = eb;
}
@@ -240,6 +452,8 @@
not transmit the whole svndiff data stream, you will want this to
be FALSE. */
svn_boolean_t error_on_early_close;
+ /* svndiff version in use by delta. */
+ unsigned char version;
};
@@ -264,6 +478,61 @@
return NULL;
}
+/* Decode the address for svndiff version 1, where it can be encoded
+ in different ways using the cache/current position. Return the
+ buffer positioned just after the address. Set val to the address
+ value. */
+static const unsigned char *
+svndiff_decode_addr (svndiff_cache_t *cache, apr_off_t *val,
+ const unsigned char *buffer,
+ const unsigned char *end)
+{
+ apr_off_t address;
+ int m;
+ apr_off_t currpos = cache->pos;
+ int mode;
+ /* First, get our address + mode. */
+ buffer = decode_int (&address, buffer, end);
+ /* Mask out the mode. */
+ mode = (address & SVNDIFF_MODE_MASK);
+ /* And shift out the address. */
+ address >>= SVNDIFF_MODE_BITS;
+ /* For the self mode, we don't need to do anything further. The
+ address is encoded as itself. */
+ if (mode == SVNDIFF_SELF)
+ {
+ }
+ /* For here mode, subtract the address from the current position. */
+ else if (mode == SVNDIFF_HERE)
+ {
+ address = currpos - address;
+ }
+ /* For near modes, address is the (value of slot mode in the table +
+ address). */
+ else if ((m = mode - (SVNDIFF_HERE + 1)) >= 0 && m < cache->near_size)
+ {
+ address = cache->near[m] + address;
+ }
+ /* For same modes, address is the value of slot (mode * 256 +
+ address) in the table. */
+ else if ((m = mode - (SVNDIFF_HERE + 1 + cache->near_size)) >= 0
+ && m < cache->same_size)
+ {
+ address = cache->same[address + m * 256 ];
+ }
+ /* If it wasn't any of these modes, give up and return NULL, to
+ signal an error. */
+ else
+ {
+ return NULL;
+ }
+
+ /* Update the cache with the address we just decoded. */
+ svndiff_cache_update (cache, address);
+ /* Set the value, and return the newly positioned buffer. */
+ *val = address;
+ return buffer;
+}
/* Decode an instruction into OP, returning a pointer to the text
after the instruction. Note that if the action code is
@@ -272,7 +541,8 @@
static const unsigned char *
decode_instruction (svn_txdelta_op_t *op,
const unsigned char *p,
- const unsigned char *end)
+ const unsigned char *end,
+ svndiff_cache_t *cache)
{
apr_off_t val;
@@ -299,15 +569,31 @@
}
if (op->action_code != svn_txdelta_new)
{
- p = decode_int (&val, p, end);
- if (p == NULL)
- return NULL;
- op->offset = val;
+ /* If we have a cache, we must be version 1+, and need to decode
+ the address. Would be nice if we had the baton or something,
+ so we could be sure what version it was. */
+ if (cache != NULL)
+ {
+ p = svndiff_decode_addr (cache, &val, p, end);
+ if (p == NULL)
+ return NULL;
+ op->offset = val;
+ }
+ /* No cache means we are version 0. */
+ else
+ {
+ p = decode_int (&val, p, end);
+ if (p == NULL)
+ return NULL;
+ op->offset = val;
+ }
}
return p;
}
+
+
/* Count the instructions in the range [P..END-1] and make sure they
are valid for the given window lengths. Return -1 if the
instructions are invalid; otherwise return the number of
@@ -317,7 +603,8 @@
const unsigned char *end,
apr_size_t sview_len,
apr_size_t tview_len,
- apr_size_t new_len)
+ apr_size_t new_len,
+ svndiff_cache_t *cache)
{
int n = 0;
svn_txdelta_op_t op;
@@ -325,7 +612,7 @@
while (p < end)
{
- p = decode_instruction (&op, p, end);
+ p = decode_instruction (&op, p, end, cache);
if (p == NULL || op.length < 0 || op.length > tview_len - tpos)
return -1;
switch (op.action_code)
@@ -370,13 +657,19 @@
/* Chew up four bytes at the beginning for the header. */
if (db->header_bytes < 4)
{
+ char compto[4] = "SVN\0";
apr_size_t nheader = 4 - db->header_bytes;
if (nheader > *len)
nheader = *len;
- if (memcmp (buffer, "SVN\0" + db->header_bytes, nheader) != 0)
+ /* It doesn't matter what the last digit is, but we do need to record it.
+ IOW, this whole thing is pointless if nheader == 1. */
+ compto[3] = buffer[nheader-1];
+ if (memcmp (buffer, compto + db->header_bytes, nheader) != 0)
return svn_error_create (SVN_ERR_SVNDIFF_INVALID_HEADER,
0, NULL, db->pool,
"svndiff has invalid header");
+ db->version = compto[3];
+
*len -= nheader;
buffer += nheader;
db->header_bytes += nheader;
@@ -403,6 +696,7 @@
svn_txdelta_window_t window = { 0 };
svn_string_t new_data;
svn_txdelta_op_t *ops;
+ svndiff_cache_t *cache = NULL;
/* Read the header, if we have enough bytes for that. */
p = (const unsigned char *) db->buffer->data;
@@ -458,8 +752,22 @@
/* Count the instructions and make sure they are all valid. */
end = p + inslen;
+
+ /* Version 1 needs a cache of size 4,2. */
+ if (db->version == 1)
+ {
+ cache = svndiff_cache_new (4, 2);
+ cache->pos = sview_offset + sview_len;
+ }
ninst = count_and_verify_instructions (p, end, sview_len,
- tview_len, newlen);
+ tview_len, newlen,
+ cache);
+
+ /* Reset the cache if we have one, so we can actually do the
+ decoding. */
+ if (cache)
+ svndiff_cache_clear (cache);
+
if (ninst == -1)
return svn_error_create (SVN_ERR_SVNDIFF_INVALID_OPS, 0, NULL,
db->pool,
@@ -472,9 +780,12 @@
ops = apr_palloc (db->subpool, ninst * sizeof (*ops));
npos = 0;
+
+ if (cache)
+ cache->pos = sview_offset + sview_len;
for (op = ops; op < ops + ninst; op++)
{
- p = decode_instruction (op, p, end);
+ p = decode_instruction (op, p, end, cache);
if (op->action_code == svn_txdelta_new)
{
op->offset = npos;
@@ -502,7 +813,8 @@
/* Remember the offset and length of the source view for next time. */
db->last_sview_offset = sview_offset;
db->last_sview_len = sview_len;
-
+ if (cache)
+ svndiff_cache_free (cache);
/* We've copied stuff out of the old pool. Toss that pool and use
our new pool.
### might be nice to avoid the copy and just use svn_pool_clear
@@ -554,6 +866,7 @@
db->last_sview_offset = 0;
db->last_sview_len = 0;
db->header_bytes = 0;
+ db->version = 0;
db->error_on_early_close = error_on_early_close;
stream = svn_stream_create (db, pool);
svn_stream_set_write (stream, write_handler);
Index: ./libsvn_delta/xml_output.c
===================================================================
--- ./libsvn_delta/xml_output.c
+++ ./libsvn_delta/xml_output.c Wed Feb 13 22:45:19 2002
@@ -596,7 +596,7 @@
#else
encoder = svn_base64_encode (output, eb->pool);
#endif
- svn_txdelta_to_svndiff (encoder, eb->pool, handler, handler_baton);
+ svn_txdelta_to_svndiff (encoder, eb->pool, handler, handler_baton, 0);
return err;
}
Index: ./libsvn_ra_dav/commit.c
===================================================================
--- ./libsvn_ra_dav/commit.c
+++ ./libsvn_ra_dav/commit.c Wed Feb 13 22:45:12 2002
@@ -1014,7 +1014,7 @@
svn_stream_set_write(stream, commit_stream_write);
svn_stream_set_close(stream, commit_stream_close);
- svn_txdelta_to_svndiff(stream, subpool, handler, handler_baton);
+ svn_txdelta_to_svndiff(stream, subpool, handler, handler_baton, 0);
return NULL;
}
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:37:08 2006