Re: [Reminder] Subversion a mentor for Google Summer of Code

From: Qi Fred <fred.qi_at_gmail.com>
Date: 2006-05-08 11:09:30 CEST

I have submitted a proposal to Summer of Code 2006 for this task.
My proposal follows:
-------------------------------------

Name: Qi, Fei
Email: fred.qi@gmail.com
IM: fred.qi@gmail.com (gtalk)
Languages: Chinese (native);
English (fluent in reading, writing, and speaking).

* PROJECT TITLE
----------------------------------------------------------------------
Compressed or optional text base storage in Subversion
----------------------------------------------------------------------

* SUMMARY

  In Subversion, diffs and deltas are computed off-line from locally
  cached text bases. The text bases of a working copy are the
  unmodified copies of its files at the base revision. This design,
  however, roughly doubles the storage space needed on the client
  side. Two feasible ways to reduce the storage are: (a) compress the
  text bases, and (b) disable caching of text bases for some or all of
  the files in the working copy. My proposal is to add a mechanism
  that combines the two solutions to manage text bases.

  The following features are planned:

  - By setting options in the runtime configuration files, users can
    (a) switch between using original and compressed text bases, and
    (b) enable or disable caching large binary files.

  - By setting a special property on a file, users can choose one of
    three caching mechanisms: original, compressed, or excluded
    (caching disabled). Note that a text base can be excluded on the
    client side only if the file is binary.

* DETAILS of PROJECT

  Compressed or optional text base storage has been discussed for a
  long time in Subversion's development community:
  - SoC description: http://subversion.tigris.org/project_tasks.html
  - issue 525: http://subversion.tigris.org/issues/show_bug.cgi?id=525
  - issue 908: http://subversion.tigris.org/issues/show_bug.cgi?id=908
  These discussions are the starting point for implementing this
  proposal.

** Implementations of the Two Solutions

  In my opinion, the two solutions have similar consequences but are
  different in essence. Using compressed text bases does NOT affect
  the working model of Subversion. It only adds the runtime cost of
  compressing and/or decompressing the text bases, so its
  implementation is fairly straightforward. Disabling the caching of
  text bases, however, changes the working model of Subversion,
  because diffs and delta generation depend directly on text bases.
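
  To illustrate why compression leaves the working model untouched,
  here is a minimal sketch in Python (not Subversion code). The
  helper names store_text_base() and load_text_base() are
  hypothetical, and zlib is only assumed as the compressor. Callers
  get back the same pristine bytes either way; only the stored form
  differs.

```python
import zlib


def store_text_base(content: bytes, compressed: bool) -> bytes:
    """Return the bytes that would be written to the text-base cache."""
    # Hypothetical helper: compress only when the option is enabled.
    return zlib.compress(content, 6) if compressed else content


def load_text_base(stored: bytes, compressed: bool) -> bytes:
    """Return the pristine content, decompressing if necessary."""
    return zlib.decompress(stored) if compressed else stored


# A repetitive source file stands in for a typical text base.
pristine = b"int main(void) { return 0; }\n" * 1000
stored = store_text_base(pristine, compressed=True)
restored = load_text_base(stored, compressed=True)
```

  Diff and delta generation would then read through
  load_text_base() and would not need to know which format is on
  disk.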

  If a file without a cached text base has been modified and is about
  to be committed, there are three (or more) potential working cycles:

  1) abort and warn the user
     - abort the commit process
     - prompt the user to enable caching of the corresponding file
     - the user enables caching
     - restart the commit process

  2) temporarily download the base revision
     - request the base revision from the server
     - temporarily download the base revision
     - generate the deltas and commit the changes
     - remove the base file since caching is disabled

  3) make Subversion work without cached text bases
     - split large binary files into small blocks, for example, 32KB
     - store locally the very short message digests of all blocks
     - detect changes by comparing digests of corresponding blocks
     - send only the changed blocks to the server, or request and
       download only the changed blocks to the client
     - generate deltas and commit changes (on server or client side)
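
  The change-detection step of the third cycle can be sketched as
  follows. This is a rough Python illustration, not Subversion code:
  the function names are hypothetical, SHA-1 is assumed as the digest,
  and blocks are assumed to sit at fixed offsets.

```python
import hashlib
from typing import List

BLOCK_SIZE = 32 * 1024  # 32KB, the example block size from the proposal


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks and return one digest per block."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(base_digests: List[bytes], working: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indices of blocks that differ from the base revision."""
    new_digests = block_digests(working, block_size)
    # Blocks that are new or whose digest no longer matches the base.
    changed = [i for i, digest in enumerate(new_digests)
               if i >= len(base_digests) or digest != base_digests[i]]
    # Blocks present in the base but missing now (the file was truncated).
    changed.extend(range(len(new_digests), len(base_digests)))
    return changed


# Only the digests of the base revision are kept on the client.
base = bytes(100 * 1024)
base_digests = block_digests(base)

# Change a single byte inside the second block (index 1).
working = bytearray(base)
working[40000] = 1
modified = changed_blocks(base_digests, bytes(working))
```

  Note that with fixed offsets, inserting bytes near the start of a
  file shifts every later block, so all of them would be reported as
  changed. Handling that case would need a different blocking scheme.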

  All of the above working cycles solve the problem introduced by
  disabling the caching of text bases. The first one can be easily
  implemented, but introduces inconvenient manual operations. The
  latter two cycles require modifications on both the client and
  server sides. The problem with the second one is the heavy
  transmission load during a commit; since the contents of large
  files seldom change, the second cycle is still feasible. The third
  one raises the concern of collisions in message digest algorithms.
  There is a report of different contents giving the same MD5 digest
  (http://eprint.iacr.org/2004/199.pdf), whereas no collisions have
  been found so far for the SHA-1 algorithm. Some investigation should
  be done on how to avoid collisions. I prefer to implement the third
  working model.
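
  As a rough check on the cost of choosing a longer digest, the
  following Python sketch computes how much local storage the block
  digests would need. The function name is hypothetical; the digest
  sizes (16 bytes for MD5, 20 bytes for SHA-1) are the standard
  output sizes of those algorithms.

```python
def digest_overhead(file_size: int, block_size: int = 32 * 1024,
                    digest_size: int = 20) -> int:
    """Bytes needed to store one digest per block of a file."""
    # Round up so that a trailing partial block still gets a digest.
    blocks = (file_size + block_size - 1) // block_size
    return blocks * digest_size


file_size = 100 * 1024 * 1024                            # a 100MB binary file
sha1_bytes = digest_overhead(file_size, digest_size=20)  # 64000 bytes
md5_bytes = digest_overhead(file_size, digest_size=16)   # 51200 bytes
```

  Under these assumptions, the SHA-1 digests take about 0.06% of the
  file size, so the extra cost over MD5 is small compared with a full
  text base.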

  Based on this discussion, I suggest adding a section of runtime
  configuration options and a special property to manage text bases.

** Runtime Configurations for text-base Management

  I suggest adding a new section, 'text-base', to the runtime
  configuration options. This section provides options for text base
  management on the client side:

  - compressed: This is a binary option (yes/no). It tells the
    Subversion client whether to cache compressed or original text
    bases. Set this to 'yes' to cache text bases in compressed format.

  - exclude-large-bins: This is a binary switch (yes/no). Set this
    variable to 'yes' if the user wants Subversion to disable caching
    of large binary files automatically. Whether a file is large is
    determined by comparing its size with the threshold specified by
    the variable 'exclusion-threshold'.

  - exclusion-threshold: This option should be a positive number. Its
    value is the size threshold that determines whether a binary file
    is large enough to turn off caching of its text base. The
    suggested default value is 512KB.

  - digest-block-size: This variable specifies the size of the blocks
    that binary files will be split into. This option should be a
    positive number; the suggested default value is 32KB.
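
  As an illustration, the proposed section might look like the sample
  below, here parsed with Python's configparser rather than
  Subversion's own config reader. Several details are assumptions and
  not part of the proposal: the 'KB' suffix syntax, 'no' as the
  default for both switches, and treating a file whose size equals
  the threshold as large.

```python
import configparser

# Hypothetical fragment of the runtime configuration file.
SAMPLE_CONFIG = """
[text-base]
compressed = yes
exclude-large-bins = yes
exclusion-threshold = 512KB
digest-block-size = 32KB
"""


def parse_size(text: str) -> int:
    """Parse sizes such as '512KB' or '32768' into a byte count."""
    text = text.strip().upper()
    for suffix, factor in (("KB", 1024), ("MB", 1024 ** 2)):
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


def load_text_base_options(config_text: str) -> dict:
    """Read the 'text-base' section, falling back to suggested defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["text-base"]
    return {
        "compressed": section.getboolean("compressed", fallback=False),
        "exclude_large_bins": section.getboolean("exclude-large-bins",
                                                 fallback=False),
        "exclusion_threshold": parse_size(
            section.get("exclusion-threshold", fallback="512KB")),
        "digest_block_size": parse_size(
            section.get("digest-block-size", fallback="32KB")),
    }


def should_exclude(options: dict, is_binary: bool, size: int) -> bool:
    """Decide whether a file's text base should not be cached."""
    return bool(options["exclude_large_bins"] and is_binary
                and size >= options["exclusion_threshold"])


options = load_text_base_options(SAMPLE_CONFIG)
```

  With this sample, a 1MB binary file would be excluded from caching,
  while a text file of the same size would still get a (compressed)
  text base.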

** Special Property for text-base Management

  I suggest adding a special property, 'svn:text-base'. This property
  indicates how Subversion stores the text base of the corresponding
  file. Its value can be one of the following:

  - original: This causes Subversion to store the corresponding text
    base in its original format.

  - compressed: This causes Subversion to store the text base in
    compressed format.

  - excluded: This causes Subversion to work without a cached text
    base. This value is applicable only to binary files.
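
  A minimal sketch of how the property value could be validated is
  given below, in Python for illustration only. The function name is
  hypothetical, and two behaviors are assumptions that the proposal
  does not specify: files without the property fall back to a default
  mode, and 'excluded' on a non-binary file is rejected with an error.

```python
VALID_MODES = ("original", "compressed", "excluded")


def resolve_text_base_mode(prop_value, is_binary, default_mode="original"):
    """Map an 'svn:text-base' property value to a storage mode."""
    if prop_value is None:
        # Property not set: assume the client-wide default applies.
        return default_mode
    mode = prop_value.strip()
    if mode not in VALID_MODES:
        raise ValueError("unknown svn:text-base value: %r" % mode)
    if mode == "excluded" and not is_binary:
        # The proposal allows exclusion only for binary files.
        raise ValueError("'excluded' applies only to binary files")
    return mode


mode_for_image = resolve_text_base_mode("excluded", is_binary=True)
mode_for_source = resolve_text_base_mode("compressed\n", is_binary=False)
mode_unset = resolve_text_base_mode(None, is_binary=False)
```

  Rejecting the invalid combination early is one possible design; a
  client could instead ignore the value and fall back to the default
  mode.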

* SCHEDULE

  This summer, my main work is to finish my Ph.D. dissertation.
  According to my plan, I can work on this project for (3~4 hours) *
  (4~5 days) per week. The following is my detailed schedule ('+'
  indicates a milestone):

  May 22:
    - commence the project
  W01 (May 22 ~ May 28):
    - communicate with mentors to confirm the proposal and goals
    - read related code and documentation in Subversion
  W02 (May 29 ~ Jun. 4):
    - sketch the framework of text-base management
    - prepare test cases
    - implement the user interface
  W03 (Jun. 5 ~ Jun. 11):
    - implement the compressed IO based on svn_stream_compressed()
    - add logging support
  W04 (Jun. 12 ~ Jun. 18):
    - implement compressed text bases support in checkout/update
      commands
  W05 (Jun. 19 ~ Jun. 25):
    - implement compressed text bases support in commit/diff commands
+W06 (Jun. 26 ~ Jul. 2): (Mid-program evaluations, Jun. 30)
    - finish the compressed text bases management
    - commence the working model without cached text bases
  W07 (Jul. 3 ~ Jul. 9):
    - function(s) for splitting files into blocks
    - function(s) for generating message digests of blocks of files
      (apr-util provides the MD4 and MD5 algorithms)
  W08 (Jul. 10 ~ Jul. 16):
    - comparison based on message digests of blocks
    - support in checkout/update commands
  W09 (Jul. 17 ~ Jul. 23):
    - request blocks on client side
    - receive blocks on client side
  W10 (Jul. 24 ~ Jul. 30):
    - send blocks on server side
  W11 (Jul. 31 ~ Aug. 6):
    - generation of deltas from blocks
    - finish the commit command on client side
+WW (Aug. 7 ~ Aug. 21):
    - finish the optional caching support
    - write a final report
    - pencils down

* Experiences with Subversion and Programming

** Experiences with Subversion

  I have been a user of Subversion for more than one and a half years.
  Subversion is a great version control system which outperforms all
  the ones I used before I entered the world of Subversion. I am very
  familiar with the commands and configuration of Subversion.

  I subscribed to the development mailing list and downloaded the
  source code of Subversion when I heard of SoC 2006. I have read the
  'Hacker's Guide to Subversion' and the documentation in some header
  files.

** Experiences with Programming

  I have been using C/C++ as my main development language for more
  than eight years. Though most of my development work is done under
  Windows, I have experience developing communication programs under
  Unix/Linux.

  I am a good team player. I have participated in several projects;
  the three main ones are listed below (more details are available on
  my resume web page):

  - SportsPartner project: This project aims to track the players and
    analyze their actions in sports (soccer) games. I am the team
    leader and key algorithm developer.

  - NightView project: This project aims to design and implement a
    vision-based pedestrian detector to improve the safety of night
    driving. I am a consultant on this research and development
    project.

  - Microarray Image Analysis: This project aims to detect and
    quantify the intensities of spots on scanned microarray images. My
    task is to design and implement the algorithm to detect and
    recognize the regular grid structures on such images.

* BIOGRAPHY

  I received a B. Eng. from Northwestern Polytechnical University,
  Xi'an, China, in July 2000. I am now a Ph.D. candidate majoring in
  control science and engineering at the Department of Automation,
  Tsinghua University, Beijing, China. I expect to receive my Ph.D.
  degree in Jan. 2007.

  My resume can be found at the following link addresses:
  - HTML format: http://fred.qi.googlepages.com/resume.html
  - PDF format: http://fred.qi.googlepages.com/cv-qf.pdf

* OTHER PROJECTS in SoC 2006

  I plan to apply for another one or two projects mentored by the
  Boost organization, but I prefer to work on this project.
-----
Best regards,
Fei Qi

On 5/8/06, Sachin Garg <schngrg@gmail.com> wrote:
>
> I looked at bug ID 908, which wants that the local copy in text-base
> should be stored compressed. I did a little digging around in code and
> felt it shouldn't be very hard to implement this and it will at least
> make my life easier.
>
> I am not going through the Google summer of code thing (am no longer a
> student either :-) but would like to implement this feature (assuming
> someone hasn't already started working on this).
>
> I am a long time subversion user (on Windows, TortoiseSVN) but new to
> subversion code, so will need some guidance if you guys want me to
> work on this.
>
> Some quick questions:
>
> # Is libsvn_wc/ the only place where I will need to edit code, or do I
> need to look in other directories too? Which ones?
>
> # Do we already have a compression library (zlib?) linked in subversion?
>
> # How much additional delay this is expected to result in during
> checkouts and commits? Should I use something lightweight like zlib or
> will it be fine to use bzip2 which can give better compression but
> will be slower?
>
> # Do we want files in text-base to be always compressed, or do we want
> text-base compression to be optional?
>
> Bug no 525 (optional text-base storage) is slightly related, maybe I
> can have a design which will make it easier to implement 525 too. Like
> implementing text-base access as a layer which can have multiple
> implementations:
>
> 1. Direct file read
> 2. Read compressed file
> 3. Fetch from server
>
>
> Another possible todo item (which runs in opposite direction from the
> above items :-)
>
> Just like SVN stores text-base for local diffs, how about generalizing
> it to store N previous revisions and change log entries. Storing
> additional revisions shouldn't result in too much bloat, as we can
> probably store just the diffs and can make more operations local.
>
> Sachin Garg [India]
> www.sachingarg.com | www.c10n.info
Received on Mon May 8 11:10:06 2006
