
Re: Evaluating SVN as a Document Management Solution

From: B. Smith-Mannschott <benpsm_at_gmail.com>
Date: Sat, 8 Mar 2008 15:01:12 +0100

On Mar 8, 2008, at 00:50, Tom Blough wrote:
>>> typically in MS Office applications, deliverables are typically
>>> drawings in AutoCAD or Microstation, and database content is typically
>>> financial data from which reports are generated.
>
> For your application, your repository will be huge. All of the file types
> you mention are binary. Therefore, SVN cannot calculate a diff on the file
> and will end up storing a copy of the complete file.

This is incorrect. You're probably thinking of CVS or some similarly
brain-damaged revision control system. SVN uses compressed binary
differences between versions for storage in its repository.

This works well for text of course. It also works well for binary
formats which don't themselves use compression, such as Microsoft
Word's DOC, uncompressed TIFF, ...

> There was a recent thread concerning using XML data formats for newer
> versions of Office in order to save diff content, but that can cause
> problems due to the fact that XML is not order specific. Office can, and
> does, generate different XML for the same document.

Well, yes, that will tend to make your differences larger than they
have to be. The real problem however is that most of these "XML"
formats are not, in fact, XML but rather XML compressed within a ZIP
archive.

Where Subversion's binary differencing and compression fail is on file
formats that are themselves compressed (OpenDocument, OfficeOpenXML,
PNG, GIF, JPG, ...). Because of the compression, even a small change
in the document may cause its representation on disk to change
completely. The difference algorithm can't "see through" this.
Furthermore, Subversion's built-in compression (like any compression
algorithm) won't be able to further compress something that's already
compressed.
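
A rough way to see both effects without a repository is sketched below
in Python. It is only an illustration under stated assumptions: the
word list and the generated text are made up, zlib stands in for
whatever compression a given file format uses, and the "shared prefix
plus suffix" measure is a crude proxy for what a binary delta could
reuse, not Subversion's actual algorithm.

```python
import random
import zlib

# Hypothetical vocabulary, used only to build a deterministic sample text.
WORDS = ("count monte cristo edmond dantes prison island treasure ship "
         "marseille letter revenge father captain sailor night castle "
         "escape fortune abbe").split()


def make_text(n_words, seed=0):
    """Build a reproducible pseudo-document of n_words words."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")


def shared_affix_fraction(a, b):
    """Fraction of the shorter input covered by the common prefix plus
    the common suffix. A crude stand-in for 'how much a delta could reuse'."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    prefix = 0
    while prefix < n and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < n - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return (prefix + suffix) / n


original = make_text(20000)
cut = original.index(b" ", 1000)  # a word boundary near the start
edited = original[:cut] + b" a short inserted paragraph" + original[cut:]

# Small edit, uncompressed: nearly everything is still shared.
raw_shared = shared_affix_fraction(original, edited)

# Same edit, but each version compressed first (as ODT, PNG etc. would be).
packed_shared = shared_affix_fraction(zlib.compress(original),
                                      zlib.compress(edited))

# Compressing already-compressed data gains essentially nothing.
once = len(zlib.compress(original))
twice = len(zlib.compress(zlib.compress(original)))

print(f"shared bytes, raw:        {raw_shared:.1%}")
print(f"shared bytes, compressed: {packed_shared:.1%}")
print(f"compressed once: {once} bytes, compressed twice: {twice} bytes")
```

The uncompressed versions share almost all of their bytes, while the
compressed versions should share very little, which is the situation a
differencing algorithm faces with ODT or PNG files.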

I've done an experiment to verify this. I set up three repositories
each containing a single document in one of three formats. In this
case, I used the text of _The Count of Monte Cristo_ from Project
Gutenberg as ASCII Text (2568 KB), as Microsoft Word DOC (6384 KB) and
as OpenOffice ODT (1060 KB). I created 8 variants of each of these
documents (inserting or removing a paragraph here or there) to
represent minor edits. I then made 80 commits to each of the three
repositories drawing upon the aforementioned 8 variants in round-robin
fashion to simulate a history of 80 minor edits made and committed.
While doing this I kept track of the total size of the repository.
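
For anyone who wants to repeat something like this, here is a minimal
sketch of how such a run could be scripted in Python. It is not the
script used for the numbers in this mail. The function names, the
working-copy layout and the way repository size is measured (sum of
file sizes under the repository directory) are all assumptions; only
the standard `svnadmin create`, `svn checkout`, `svn add` and
`svn commit` commands are relied upon.

```python
import itertools
import shutil
import subprocess
from pathlib import Path


def dir_size_kb(root):
    """Total size of all files under root, in KB (rounded up).
    Assumption: apparent file sizes, not disk blocks as `du` reports."""
    total = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return (total + 1023) // 1024


def round_robin(variants, commits):
    """Return `commits` items, cycling through `variants` in order."""
    return list(itertools.islice(itertools.cycle(variants), commits))


def run_experiment(variants, workdir, commits=80):
    """Commit the variant files round-robin and record the repository
    size after each commit. `variants` is a list of paths to the
    prepared document versions (hypothetical inputs)."""
    workdir = Path(workdir).resolve()
    repo = workdir / "repo"
    wc = workdir / "wc"
    subprocess.run(["svnadmin", "create", str(repo)], check=True)
    subprocess.run(["svn", "checkout", "-q", repo.as_uri(), str(wc)],
                   check=True)

    target = wc / ("document" + Path(variants[0]).suffix)
    sizes = []
    for i, variant in enumerate(round_robin(variants, commits)):
        shutil.copyfile(variant, target)
        if i == 0:
            subprocess.run(["svn", "add", "-q", str(target)], check=True)
        subprocess.run(["svn", "commit", "-q", "-m", f"minor edit {i + 1}",
                        str(wc)], check=True)
        sizes.append(dir_size_kb(repo))
    return sizes
```

Calling `run_experiment` once per format, each time with an empty
working directory and the 8 variants of that format, would give one
size series per repository to plot against the commit number.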

* All three repositories grow linearly in size, but the ODT repository
grows more quickly (steeper slope).

* The ODT repository is smallest for the first few commits but quickly
outgrows the TXT and DOC repositories.

* The DOC repository is larger than the TXT repository and grows
slightly faster in comparison.

* The size difference between the TXT and DOC repositories is not as
large as the relative size of the formats (2568 KB vs 6384 KB) might
suggest. DOC may be about two and a half times as large as TXT but much
of this difference is redundancy which SVN is quite capable of
compressing away.

* Final repository sizes after 80 commits: TXT = 10052 KB; DOC = 16288
KB; ODT = 58260 KB.
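
The figures above can be put in proportion with a little arithmetic,
sketched here in Python. The only inputs are the file and repository
sizes quoted in this mail. Dividing the final size by 80 is a
simplification: it folds the initial commit and any fixed repository
overhead into the average, so it somewhat overstates the cost of a
single minor edit.

```python
# Sizes in KB, as quoted in the mail.
FILE_KB = {"TXT": 2568, "DOC": 6384, "ODT": 1060}
REPO_KB = {"TXT": 10052, "DOC": 16288, "ODT": 58260}
COMMITS = 80


def summarize(fmt):
    """Return (average KB added per commit, that average as a share of
    the size of one copy of the document)."""
    avg = REPO_KB[fmt] / COMMITS
    return avg, avg / FILE_KB[fmt]


for fmt in FILE_KB:
    avg, share = summarize(fmt)
    print(f"{fmt}: {avg:.1f} KB per commit on average, "
          f"{share:.1%} of one copy of the file")

# DOC vs TXT: ratio of the file sizes against ratio of the repositories.
file_ratio = FILE_KB["DOC"] / FILE_KB["TXT"]
repo_ratio = REPO_KB["DOC"] / REPO_KB["TXT"]
print(f"DOC/TXT file size ratio: {file_ratio:.2f}, "
      f"repository ratio: {repo_ratio:.2f}")
```

On these numbers a commit costs on average roughly 126 KB for TXT,
204 KB for DOC and 728 KB for ODT, that is about 5%, 3% and 69% of one
copy of the respective file. The DOC file is about 2.5 times the size
of the TXT file, but its repository is only about 1.6 times as large.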

See also attached PNG.

// Ben Smith-Mannschott


growth_or_repo_through_repeated_minor_edits75.png
Received on 2008-03-08 15:01:37 CET

This is an archived mail posted to the TortoiseSVN Users mailing list.
