[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

[PATCH] Replace vdelta with xdelta variant

From: Niels Werensteijn <n.werensteijn_at_student.utwente.nl>
Date: Thu, 17 Jan 2008 23:49:32 +0100

Hi,
This is my first contribution to opensource so please tell me everything
I am doing wrong :)

I have a big binary installer commited in svn. Exporting this binary
costs a lot of time. I traced this down to vdelta. I noticed that it
only gets used for chucks which have no source steam. XDelta is supposed
to be much faster than vdelta. So I thought: why not adjust the xdelta
algorithm to make a diff from itself. I copy/pasted most of the
algorithm, adding some minor stuff.

So far I have tested this patch with 2 repositories.
1. Consists of only the sqlexpres installer (large binary 32 mb,
compressed).
2. Is a software project of mine, consisting mainly of delphi source code.

The export time for 1 dropped from 10 secs to 7 secs. For repository 2
it fropped from 58 secs to 53. The on disk sizes are negligibly smaller
with the patch.

I'd like to hear what others are experiencing, and ofcourse eventually
to include this patch in the trunk :)

Regards,
Niels Werensteijn

[[[
Replace vdelta with an xdelta variant that can create deltas from its
own data.

* subversion/libsvn_delta/delta.h
   (svn_txdelta__xdelta_self): New prototype

* subversion/libsvn_delta/text_delta.c
   (compute_window): replace call to svn_txdelta__vdelta with
   a call to svn_txdelta__xdelta_self

* subversion/libsvn_delta/xdelta.c
   (init_blocks_table): Split this function in two. This part
   only initialized the blocks stuct
   (init_blocks_table_with_data): This has the same functionality
   as the old init_blocks_table
   (compute_delta): Change the call to old init_blocks_table to the
   new init_blocks_table_with_data
   (compute_delta_self): The routine that calculates the delta from
   own data
   (svn_txdelta__xdelta_self): The entry point for compute_delta_self
]]]

Index: subversion/libsvn_delta/delta.h
===================================================================
--- subversion/libsvn_delta/delta.h (revision 28885)
+++ subversion/libsvn_delta/delta.h (working copy)
@@ -84,6 +84,11 @@
                          apr_size_t target_len,
                          apr_pool_t *pool);
 
+/* Create xdelta window data against own data. Allocate temporary data from POOL. */
+void svn_txdelta__xdelta_self(svn_txdelta__ops_baton_t *build_baton,
+ const char *start,
+ apr_size_t target_len,
+ apr_pool_t *pool);
 
 #ifdef __cplusplus
 }
Index: subversion/libsvn_delta/xdelta.c
===================================================================
--- subversion/libsvn_delta/xdelta.c (revision 28885)
+++ subversion/libsvn_delta/xdelta.c (working copy)
@@ -155,17 +155,13 @@
   return blocks->slots[h].pos;
 }
 
-/* Initialize the matches table from DATA of size DATALEN. This goes
- through every block of MATCH_BLOCKSIZE bytes in the source and
- checksums it, inserting the result into the BLOCKS table. */
+/* Initialize the matches table woutout data */
 static void
-init_blocks_table(const char *data,
- apr_size_t datalen,
+init_blocks_table(apr_size_t datalen,
                   struct blocks *blocks,
- apr_pool_t *pool)
+ apr_pool_t *pool)
 {
   apr_size_t i;
- struct adler32 adler;
   apr_size_t nblocks;
   apr_size_t nslots = 1;
 
@@ -184,7 +180,22 @@
       blocks->slots[i].adlersum = 0;
       blocks->slots[i].pos = (apr_size_t)-1;
     }
+}
 
+/* Initialize the matches table from DATA of size DATALEN. This goes
+ through every block of MATCH_BLOCKSIZE bytes in the source and
+ checksums it, inserting the result into the BLOCKS table. */
+static void
+init_blocks_table_with_data(const char *data,
+ apr_size_t datalen,
+ struct blocks *blocks,
+ apr_pool_t *pool)
+{
+ apr_size_t i;
+ struct adler32 adler;
+
+ init_blocks_table(datalen, blocks, pool);
+
   for (i = 0; i < datalen; i += MATCH_BLOCKSIZE)
     {
       /* If this is the last piece, it may be blocksize large */
@@ -315,7 +326,7 @@
     }
 
   /* Initialize the matches table. */
- init_blocks_table(a, asize, &blocks, pool);
+ init_blocks_table_with_data(a, asize, &blocks, pool);
 
   /* Initialize our rolling checksum. */
   init_adler32(&rolling, b, MATCH_BLOCKSIZE);
@@ -383,3 +394,146 @@
                 data + source_len, target_len,
                 pool);
 }
+
+
+/* Compute a delta from own using a variation of xdelta.
+
+ This algorithm works the same as xdelta, with the exception
+ that it gets blocks from its own data.
+
+ 1. Checksum the first MATCH_BLOCKSIZE block of bytes using adler32, and
+ insert the checksum into a match table with the position of the match.
+ 2. Go through the target byte by byte, starting at position MATCH_BLOCKSIZE.
+ See if that byte starts a match that we have in the match table.
+ 2a. If so, try to extend the match as far as possible both
+ forwards and backwards, and then insert a source copy
+ operation into the delta ops builder for the match.
+ Care must be taken here to find match data only from "previous" data,
+ that we know is already in the target window.
+ 2b. If not, insert the byte as new data using an insert delta op.
+
+ Our implementation doesn't immediately insert "insert" operations,
+ it waits until we have another copy, or we are done. The reasoning
+ is twofold:
+
+ 1. Otherwise, we would just be building a ton of 1 byte insert
+ operations
+ 2. So that we can extend a source match backwards into a pending
+ insert operation, and possibly remove the need for the insert
+ entirely. This can happen due to stream alignment.
+*/
+static void
+compute_delta_self(svn_txdelta__ops_baton_t *build_baton,
+ const char *start,
+ apr_uint32_t tsize,
+ apr_pool_t *pool)
+{
+ struct blocks blocks;
+ struct adler32 rolling;
+ apr_uint32_t currentBlockSum;
+ apr_size_t currentBlockPos;
+
+ apr_size_t sz, lo, hi;
+
+ /* If the size of the target is smaller than 2 blocksizes, just
+ insert the entire target. */
+ if (tsize < 2*MATCH_BLOCKSIZE)
+ {
+ svn_txdelta__insert_op(build_baton, svn_txdelta_new,
+ 0, tsize, start, pool);
+ return;
+ }
+
+
+ /* Initialize the matches table and first block. */
+ init_blocks_table(tsize, &blocks, pool);
+ currentBlockSum =
+ adler32_sum(init_adler32(&rolling, start, MATCH_BLOCKSIZE));
+ add_block(&blocks, currentBlockSum, 0);
+
+ /* Save data for the current block */
+ currentBlockPos = MATCH_BLOCKSIZE;
+ currentBlockSum =
+ adler32_sum(init_adler32(&rolling, start + MATCH_BLOCKSIZE, MATCH_BLOCKSIZE));
+
+ /* We start with wanting to insert atleast the first block */
+ apr_size_t pending_insert_start = 0, pending_insert_len = MATCH_BLOCKSIZE;
+
+ for (sz = tsize, lo = currentBlockPos, hi = MATCH_BLOCKSIZE; lo < sz;)
+ {
+ apr_size_t apos = 0;
+ apr_size_t alen = 1;
+ apr_size_t badvance = 1;
+ apr_size_t next;
+ svn_boolean_t match;
+
+ /* A little bit ugly here. We don't want find_match to go back to far.
+ To make sure it finds only data from blocks what will be in the target
+ window we must do two things.
+ 1. limit the "source data" length to the currentBlockPos.
+ 2. We have to limit the amount of bytes find_match goes back. Else it could
+ match bytes that were not inserted yet. This is done by moving the target
+ "start" to currentBlockPos.
+ */
+ match = find_match(&blocks, &rolling, start, currentBlockPos, start + currentBlockPos,
+ tsize-currentBlockPos, lo-currentBlockPos, &apos,
+ &alen, &badvance, &pending_insert_len);
+
+ /* If we didn't find a real match, insert the byte at the target
+ position into the pending insert. */
+ if (! match)
+ ++pending_insert_len;
+ else
+ {
+ if (pending_insert_len > 0)
+ {
+ svn_txdelta__insert_op(build_baton, svn_txdelta_new,
+ 0, pending_insert_len,
+ start + pending_insert_start, pool);
+ pending_insert_len = 0;
+ }
+ /* Reset the pending insert start to immediately after the
+ match. */
+ pending_insert_start = lo + badvance;
+ svn_txdelta__insert_op(build_baton, svn_txdelta_target,
+ apos, alen, NULL, pool);
+ }
+ next = lo;
+ for (; next < lo + badvance; ++next)
+ {
+ adler32_out(&rolling, start[next]);
+ if (next + MATCH_BLOCKSIZE < tsize)
+ {
+ adler32_in(&rolling, start[next + MATCH_BLOCKSIZE]);
+
+ /* Now check if we at a position where a new block starts.
+ if yes, insert current block info into the math table,
+ and save current block info for later insertion */
+ if (next == currentBlockPos + MATCH_BLOCKSIZE)
+ {
+ add_block(&blocks, currentBlockSum, currentBlockPos);
+ currentBlockPos += MATCH_BLOCKSIZE;
+ currentBlockSum = adler32_sum(&rolling);
+ }
+ }
+ }
+ lo = next;
+ hi = lo + MATCH_BLOCKSIZE;
+ }
+
+ /* If we still have an insert pending at the end, throw it in. */
+ if (pending_insert_len > 0)
+ {
+ svn_txdelta__insert_op(build_baton, svn_txdelta_new,
+ 0, pending_insert_len,
+ start + pending_insert_start, pool);
+ }
+}
+
+void svn_txdelta__xdelta_self(svn_txdelta__ops_baton_t *build_baton,
+ const char *start,
+ apr_size_t target_len,
+ apr_pool_t *pool)
+{
+ compute_delta_self(build_baton, start, target_len, pool);
+}
Index: subversion/libsvn_delta/text_delta.c
===================================================================
--- subversion/libsvn_delta/text_delta.c (revision 28885)
+++ subversion/libsvn_delta/text_delta.c (working copy)
@@ -148,7 +148,7 @@
   build_baton.new_data = svn_stringbuf_create("", pool);
 
   if (source_len == 0)
- svn_txdelta__vdelta(&build_baton, data, source_len, target_len, pool);
+ svn_txdelta__xdelta_self(&build_baton, data, target_len, pool);
   else
     svn_txdelta__xdelta(&build_baton, data, source_len, target_len, pool);
 
@@ -160,8 +160,6 @@
   return window;
 }
 
-
-
 svn_txdelta_window_t *
 svn_txdelta_window_dup(const svn_txdelta_window_t *window,
                        apr_pool_t *pool)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe_at_subversion.tigris.org
For additional commands, e-mail: dev-help_at_subversion.tigris.org
Received on 2008-01-17 23:51:05 CET

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.