I have submitted a proposal to Summer of Code 2006 for this task.
My proposal follows:
-------------------------------------
Name: Qi, Fei
Email: fred.qi@gmail.com
IM: fred.qi@gmail.com (gtalk)
Language: Chinese (native);
English (fluent in reading, writing, and speaking).
* PROJECT TITLE
----------------------------------------------------------------------
Compressed or optional text base storage in Subversion
----------------------------------------------------------------------
* SUMMARY
In Subversion, diffs and deltas are generated offline from the
locally cached text bases. The text bases of a working copy are the
unmodified copies of its files at the base revision. This design
roughly doubles the storage space needed on the client side. Two
feasible ways to reduce the storage are: (a) compress the text
bases, and (b) disable caching of text bases for some or all of the
files in the working copy. My proposal is to add a mechanism that
combines the two solutions to manage text bases.
The following features are planned:
- By setting options in the runtime configuration files, users can
(a) switch between original and compressed text bases, and
(b) enable or disable caching of text bases for large binary files.
- By setting a special property on a file, one of three caching
mechanisms can be chosen: original, compressed, or excluded
(caching disabled). Note that text bases can be excluded on the
client side only if the file is binary.
* DETAILS of PROJECT
Compressed or optional text base storage has been discussed for a
long time in Subversion's development community:
- SoC description: http://subversion.tigris.org/project_tasks.html
- issue 525: http://subversion.tigris.org/issues/show_bug.cgi?id=525
- issue 908: http://subversion.tigris.org/issues/show_bug.cgi?id=908
These discussions provide the starting point for implementing this
proposal.
** Implementations of the Two Solutions
In my opinion, the two solutions have similar consequences but are
different in essence. Using compressed text bases does NOT affect
the working model of Subversion. It only adds the runtime cost of
compressing and/or decompressing the text bases, so its
implementation is fairly straightforward. Disabling the caching of
text bases, however, changes the working model of Subversion,
because comparison (diff) and generation of deltas depend directly
on text bases.
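As a rough illustration of why the compressed variant leaves the
working model untouched, here is a minimal sketch in Python (not
Subversion code). The function names are hypothetical, and zlib is
only an assumed choice of compressor. The caller always gets the
original bytes back, whichever storage format is used.

```python
import zlib


def store_text_base(content: bytes, compressed: bool) -> bytes:
    """Return the bytes that would be written to the text-base cache."""
    return zlib.compress(content) if compressed else content


def load_text_base(stored: bytes, compressed: bool) -> bytes:
    """Return the original file content from the cached bytes."""
    return zlib.decompress(stored) if compressed else stored


# Round trip: diff/delta code would see the same bytes either way.
original = b"line of source text\n" * 1000
stored = store_text_base(original, compressed=True)
restored = load_text_base(stored, compressed=True)
```

Only the store and load steps change. Everything that consumes a
text base keeps working on the uncompressed content.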
If a file without a cached text base has been modified and is about
to be committed, there are three (or more) potential working cycles:
1) abort and warn the user
- abort the commit process
- prompt the user to enable caching of the corresponding file
- the user enables caching
- restart the commit process
2) temporarily download the base revision
- request the base revision from the server
- temporarily download the base revision
- generate the deltas and commit the changes
- remove the base file, since caching is disabled
3) make Subversion work without cached text bases
- split large binary files into small blocks, for example, 32KB
- store the short message digests of all blocks locally
- detect changes by comparing digests of corresponding blocks
- send only the changed blocks to the server, or request and
download only the changed blocks to the client
- generate deltas and commit changes (on the server or client side)
All of the above working cycles solve the problem introduced by
disabling the caching of text bases. The first one is easy to
implement, but introduces inconvenient manual steps. The latter two
require modifications on both the client and server sides. The
drawback of the second one is the heavy transmission load during a
commit; since the contents of large files seldom change, it is
still feasible. The third one raises the issue of collisions in
message digest algorithms. There is a report of different contents
giving the same MD5 digest (http://eprint.iacr.org/2004/199.pdf),
while to my knowledge no collisions have been found so far for the
SHA-1 algorithm. Some investigation should be done on how to avoid
collisions. I prefer to implement the third working model.
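The block-digest comparison in the third working cycle can be
sketched roughly as follows, in Python rather than Subversion's C.
The function names are hypothetical, SHA-1 is an assumed choice of
digest, and the 32KB block size is the example value given above.
Only the digests of the base revision are kept locally, and a block
counts as changed when its digest differs or when it exists on only
one side.

```python
import hashlib
from typing import List

BLOCK_SIZE = 32 * 1024  # 32KB, the example block size from the proposal


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and return one digest per block."""
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old_digests: List[str], new_data: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indices of blocks whose digest differs from the cached one."""
    new_digests = block_digests(new_data, block_size)
    changed = []
    for index in range(max(len(old_digests), len(new_digests))):
        old = old_digests[index] if index < len(old_digests) else None
        new = new_digests[index] if index < len(new_digests) else None
        if old != new:  # also true when a block was added or removed
            changed.append(index)
    return changed


# A 100KB file: only the digests are cached, not the file itself.
base = bytes(100 * 1024)
cached = block_digests(base)

# Flip one byte inside the second block (index 1).
modified = bytearray(base)
modified[40000] = 1
changed = changed_blocks(cached, bytes(modified))
```

Fixed-size blocks keep the sketch simple, but an insertion near the
start of a file shifts every later block, so all of them would be
reported as changed.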
Based on these discussions, I suggest adding a section of runtime
configuration options and a special property to manage text bases.
** Runtime Configurations for text-base Management
I suggest adding a new section, 'text-base', to the runtime
configuration options. This section provides options for text base
management on the client side:
- compressed: This is a boolean option (yes/no). It tells the
Subversion client whether to cache compressed or original text
bases. Set it to 'yes' to cache text bases in compressed format.
- exclude-large-bins: This is a boolean switch (yes/no). Set it to
'yes' if the user wants Subversion to automatically disable text
base caching for large binary files. Whether a file is large is
determined by comparing its size with the threshold specified by
the variable 'exclusion-threshold'.
- exclusion-threshold: This option should be a positive number. It
is the size above which a binary file is considered large enough
to turn off caching of its text base. The suggested default value
is 512KB.
- digest-block-size: This variable specifies the size of the blocks
that binary files are split into. It should be a positive number,
and the suggested default value is 32KB.
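To make the proposed options concrete, the sketch below shows what a
'text-base' section might look like and how it could be read. It is
written in Python with the standard configparser module purely for
illustration. The option names come from the list above. The 'KB'
suffix syntax, the helper names, and the strict "larger than"
comparison are assumptions, not part of the proposal.

```python
import configparser

# Hypothetical fragment of a runtime configuration file.
SAMPLE_CONFIG = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 512KB
digest-block-size = 32KB
"""


def parse_size(text: str) -> int:
    """Parse '512KB' or a plain byte count (the suffix is an assumption)."""
    text = text.strip().upper()
    if text.endswith("KB"):
        return int(text[:-2]) * 1024
    return int(text)


def load_text_base_options(config_text: str) -> dict:
    """Read the 'text-base' section, using the suggested defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["text-base"]
    return {
        "compressed": section.getboolean("compressed", fallback=False),
        "exclude_large_bins": section.getboolean("exclude-large-bins",
                                                 fallback=False),
        "exclusion_threshold": parse_size(
            section.get("exclusion-threshold", fallback="512KB")),
        "digest_block_size": parse_size(
            section.get("digest-block-size", fallback="32KB")),
    }


def should_exclude(file_size: int, is_binary: bool, options: dict) -> bool:
    """Exclude only binary files strictly larger than the threshold."""
    return (is_binary and options["exclude_large_bins"]
            and file_size > options["exclusion_threshold"])


options = load_text_base_options(SAMPLE_CONFIG)
```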
** Special Property for text-base Management
I suggest adding a special property, 'svn:text-base'. This property
indicates how Subversion stores the text base of the corresponding
file. Its value can be one of the following:
- original: This causes Subversion to store the corresponding text
base in its original format.
- compressed: This causes Subversion to store the text base in
compressed format.
- excluded: This causes Subversion to work without a cached text
base. This value is applicable only to binary files.
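The rule that 'excluded' applies only to binary files could be
enforced along the following lines. This is a Python sketch with
hypothetical names. Falling back to the default mode for a text file
marked 'excluded' is an assumption, since the proposal does not say
whether that case should be an error or be silently ignored.

```python
VALID_MODES = ("original", "compressed", "excluded")


def effective_storage_mode(property_value, is_binary,
                           default_mode="original"):
    """Resolve how the text base of one file is stored.

    property_value is the value of the proposed 'svn:text-base'
    property, or None when the property is not set on the file.
    """
    if property_value is None:
        return default_mode
    if property_value not in VALID_MODES:
        raise ValueError("unknown svn:text-base value: %r" % property_value)
    if property_value == "excluded" and not is_binary:
        # Assumption: text files always keep a text base, so fall back.
        return default_mode
    return property_value


mode_for_binary = effective_storage_mode("excluded", is_binary=True)
mode_for_text = effective_storage_mode("excluded", is_binary=False)
```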
* SCHEDULE
This summer, my main work is to finish my Ph.D dissertation.
According to my plan, I can work on this project 3~4 hours a day,
4~5 days per week. The following is my detailed schedule ('+'
indicates a milestone):
May 22:
- commence with project.
W01 (May 22 ~ May 28):
- communicate with mentors to confirm the proposal and goals
- read the related code and documentation in Subversion
W02 (May 29 ~ Jun. 4):
- sketch the framework of text-base management
- prepare test cases
- implement the user interface
W03 (Jun. 5 ~ Jun. 11):
- implement the compressed IO based on svn_stream_compressed()
- add logging support
W04 (Jun. 12 ~ Jun. 18):
- implement compressed text bases support in checkout/update
commands
W05 (Jun. 19 ~ Jun. 25):
- implement compressed text bases support in commit/diff commands
+W06 (Jun. 26 ~ Jul. 2): (Mid-program evaluations, Jun. 30)
- finish the compressed text bases management
- commence the working model without cached text bases
W07 (Jul. 3 ~ Jul. 9):
- function(s) for splitting files into blocks
- function(s) for generating message digests of blocks of files
(apr-util provides the MD4 and MD5 algorithms)
W08 (Jul. 10 ~ Jul. 16):
- comparison based on message digests of blocks
- support in checkout/update commands
W09 (Jul. 17 ~ Jul. 23):
- request blocks on client side
- receive blocks on client side
W10 (Jul. 24 ~ Jul. 30):
- send blocks on server side
W11 (Jul. 31 ~ Aug. 6):
- generation of deltas from blocks
- finish the commit command on client side
+WW (Aug. 7 ~ Aug. 21):
- finish the optional caching support
- write a final report
- pencils down
* Experiences with Subversion and Programming
** Experiences with Subversion
I have been a user of Subversion for more than one and a half years.
Subversion is a great version control system that outperforms all
the ones I used before. I am very familiar with the commands and
configuration of Subversion.
I subscribed to the development mailing list and downloaded the
source code of Subversion when I heard of SoC 2006. I have read the
'Hacker's Guide to Subversion' and the documentation in some header
files.
** Experiences with Programming
I have been using C/C++ as my main development language for more
than eight years. Though most of my development work is done under
Windows, I have experience developing communication programs under
Unix/Linux.
I am a good team player. I have participated in several projects;
the three main ones are listed below (more details are available on
my resume web page):
- SportsPartner project: This project aims to track the players and
analyze their actions in sports (soccer) games. I am the team
leader and key algorithm developer.
- NightView project: This project aims to design and implement a
vision-based pedestrian detector to improve the safety of night
driving. I am a consultant on this research and development
project.
- Microarray Image Analysis: This project aims to detect and
quantify the intensities of spots on scanned microarray images. My
task is to design and implement the algorithm that detects and
recognizes the regular grid structures on such images.
* BIOGRAPHY
I received a B. Eng. from Northwestern Polytechnical University,
Xi'an, China, in July 2000. I am now a Ph.D candidate majoring in
control science and engineering at the Department of Automation,
Tsinghua University, Beijing, China. I expect to receive my Ph.D
degree in Jan. 2007.
My resume can be found at the following addresses:
- HTML format: http://fred.qi.googlepages.com/resume.html
- PDF format: http://fred.qi.googlepages.com/cv-qf.pdf
* OTHER PROJECTS in SoC 2006
I plan to apply for one or two other projects mentored by the Boost
organization, but I prefer to work on this project.
-----
Best regards,
Fei Qi
On 5/8/06, Sachin Garg <schngrg@gmail.com> wrote:
>
> I looked at bug ID 908, which wants that the local copy in text-base
> should be stored compressed. I did a little digging around in code and
> felt it shouldnt be very hard to implement this and it will atleast
> make my life easier.
>
> I am not going through the Google summer of code thing (am no longer a
> student either :-) but would like to implement this feature (assuming
> someone hasnt already started working on this).
>
> I am a long time subversion user (on Windows, TortoiseSVN) but new to
> subversion code, so will need some guidance if you guys want me to
> work on this.
>
> Some quick quesitions:
>
> # Is libsvn_wc/ the only place where I will need to edit code, or do I
> need to look in other directories too? Which ones?
>
> # Do we already have a compression library (zlib?) linked in subversion?
>
> # How much additional delay this is expected to result in during
> checkouts and commits? Should I use something lightweight like zlib or
> will it be fine to use bzip2 which can give better compression but
> will be slower?
>
> # Do we want files in text-base to be always compressed, or do we want
> text-base compression to be optional?
>
> Bug no 525 (optional text-base storage) is slightly related, maybe I
> can have a design which will make it easier to implement 525 too. Like
> implementing text-base access as a layer which can have multiple
> implmentations:
>
> 1. Direct file read
> 2. Read compressed file
> 3. Fetch from server
>
>
> Another possible todo item (which runs in opposite direction from the
> above items :-)
>
> Just like SVN stores text-base for local diffs, how about generalizing
> it to store N previous revisions and change log entires. Storing
> additional revisions shouldn't result in too much bloat, as we can
> probably store just the diffs and can make more operations local.
>
> Sachin Garg [India]
> www.sachingarg.com | www.c10n.info
Received on Mon May 8 11:10:06 2006