
Philosophical Problem (was: Poll: do we really need newline conversion?)

From: Jay Freeman (saurik) <saurik_at_saurik.com>
Date: 2001-12-11 23:40:52 CET

OK, I replied to this specific post from this thread mainly because it came closest to seeing what the actual problem here is (not because I was responding _to_ it). Version control in the most esoteric sense has different requirements than "concurrent version control" (borrowing the CVS term while I construct this document).

Subversion isn't the perfect version control system.

Subversion, in its current form, CAN'T be the perfect version control system, or even anything all that close.

Subversion can _try_ to be the perfect "concurrent" version control system, but then it loses its perfection as a standard version control system when it adds special support for various file formats (such as text files) to do things that you _must have_, such as "merge changes from bob into my file".

(Please note that I may use these terms inconsistently later in this document, even though I have just defined them... I apologize ahead of time... I'm typing this in a semi-rush as I have a million things that _must_ get done today before I fly back to Chicago tomorrow, but I consider this e-mail quite important.)

You want your horror stories? I'll give you your horror stories :). They are going to run the gamut of everything that concurrent version control tends to ignore in the quest to be an elegant implementation of version control, rather than arguing _directly_ for line-ending conversion (I think talking about line endings in particular as a separate problem is a red herring). I'm attacking the philosophy I'm seeing behind Subversion more than this individual point.

Subversion uses diff and patch. This is a limitation. Everyone saying "version control shouldn't get into these kinds of discussions" needs to take a look at the actual implementation and note that as long as diff and patch are used, text files get special treatment. Hell... that was one of the founding features of these forms of version control: text files get special features that parse things into lines so we can do diffs on them, and hence merge changes that are made by two parties.

Diff isn't even a good way of thinking about changes to a "text file": it's just become the de facto way because thinking about changes to a file is a difficult problem and no one has written a better way yet. A lot of the code in projects I have worked with (not mine, I tend to just sigh and give up my project to the version control gods) is terribly, terribly formatted, mainly because the version control system doesn't help with special-case scenarios.

A few examples:

Documentation
-------------

I haven't seen a project that uses CVS that has text documentation that assumes the client machine can word wrap; it's all formatted to 60 or 80 columns. I think this is a pity, but there are some issues going on there, especially with more specialized file formats that CVS still treats as text files so it can merge them. Add too many words to a single line, and either the next line looks stupid or a change is forced to the entire file for want of a "this file should be formatted to 80 columns automatically, and diffs should never be taken on the already formatted version".
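
A minimal sketch of what "never diff the already formatted version" could look like, assuming paragraphs are separated by blank lines. The function names are hypothetical and nothing here is part of CVS or Subversion; it only illustrates comparing the unwrapped text so that re-wrapping alone is not reported as a change.

```python
import difflib

def paragraphs(text):
    """Split text into paragraphs and collapse whitespace runs inside each one."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    return [" ".join(block.split()) for block in blocks if block.strip()]

def changed_paragraphs(old, new):
    """Return (old, new) paragraph groups whose words differ, ignoring wrapping."""
    a, b = paragraphs(old), paragraphs(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

old_doc = "one two\nthree four\n\nsecond para\n"
rewrapped = "one two three\nfour\n\nsecond para\n"
edited = "one two three four five\n\nsecond para\n"

# Re-wrapping produces no changes; adding a word touches one paragraph only.
print(changed_paragraphs(old_doc, rewrapped))
print(changed_paragraphs(old_doc, edited))
```

The trade-off is that the granularity becomes the paragraph rather than the line, which is an assumption about what a "change" to documentation should mean.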

Line Indentation
----------------

If I'm doing version control on C++ files, the abstraction of a "text file" loses information. Any version control system that treats my C++ files as a "text file" will lose this information. A blank line doesn't change the file unless it comes right after a line that ends in '\', nor does adding white space to other bricks of white-space.

OK, you might say "that is a change to the file we should log"... what if the change to the file was only in two locations but it _forced_ me to change the ENTIRE FILE? In nmap I needed to add an 'if' statement around a good chunk of code, which made me want to indent most of the code in between that. Well, whenever I got patches from Fyodor (the original author) I had to MANUALLY merge the entire portion that had its indentation changed. I finally gave in and changed it back so I could maintain compatibility with this project.
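
As a rough illustration of the nmap case, the sketch below diffs two versions after stripping leading and trailing white space from every line, so that wrapping a block in an 'if' shows up as two inserted lines rather than a change to every line in between. It uses Python's standard difflib; the function name and the sample lines are made up for illustration. Stripping white space is only an approximation: it would be wrong for formats where indentation is significant, and it does not handle the '\' continuation case mentioned above.

```python
import difflib

def logical_changes(old_lines, new_lines):
    """Diff two lists of lines while ignoring leading/trailing white space."""
    a = [line.strip() for line in old_lines]
    b = [line.strip() for line in new_lines]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

old = ["scan(host);", "report(host);"]
new = ["if (alive(host)) {", "    scan(host);", "    report(host);", "}"]

# Only the 'if' line and the closing brace are reported; the re-indented
# body is treated as unchanged.
for change in logical_changes(old, new):
    print(change)
```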

Why? This is because, in order to support "merge", there is a special feature that is particular to "text files", but "C++ source code" doesn't have the same restrictions that "text files" do. The perfect version control system doesn't treat C++ files as text files, but it treats them as C++ files. Obviously this is going to get difficult, and equally obviously none of us are going to wait for it to happen, but until it happens you are losing information in my file. The same goes for line endings.

XML/SGML Formats
----------------

Now, imagine this same sort of problem on an XML file. I want to add a tag around a few existing tags, and to maintain the format that I want I perform an indentation on possibly the entire document... why do I get hit with conflicts in merges? Even worse, let's take a look at changes that are made _within_ a line. If I modify a single attribute on an XML tag, that shouldn't conflict with someone else modifying another attribute, should it? This also means I'm going to be formatting my XML documents with the idea that they are going to be VERSIONED before any other consideration, simply because I need to make sure that diff and patch will make sense on my organizational scheme.

Unfortunately, CVS loses the contextual information that an XML format provides while performing these "merge"s when it approximates it as a diff/patchable format. Now, if all one wanted was version control, then it makes entire sense that EXACTLY what the user placed in the repository should be what is stored, and that there should be no fudging going on when checking out. However, as I mentioned, this isn't just versioning, this is also merging and concurrent access, which means that the contextual information is _needed_ to make the experience make sense. The diff/patch kludge is hurting this goal.
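
To make the attribute example concrete, here is a small sketch of a three-way merge that works on one element's attributes instead of on lines. It uses Python's standard xml.etree.ElementTree only for parsing; merge_attributes and the sample element are hypothetical, and a real merge would also have to walk the element tree and handle text content.

```python
import xml.etree.ElementTree as ET

def merge_attributes(base, mine, theirs):
    """Three-way merge of one element's attributes; raises on a true conflict."""
    merged = {}
    for key in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(key), mine.get(key), theirs.get(key)
        if m == t:        # both sides agree (or both deleted the attribute)
            value = m
        elif m == b:      # only "theirs" changed it
            value = t
        elif t == b:      # only "mine" changed it
            value = m
        else:
            raise ValueError(f"conflict on attribute {key!r}")
        if value is not None:
            merged[key] = value
    return merged

base = ET.fromstring('<item id="1" color="red" size="S"/>')
mine = ET.fromstring('<item id="1" color="blue" size="S"/>')
theirs = ET.fromstring('<item id="1" color="red" size="L"/>')

# Both edits touch the same line of text but different attributes,
# so they merge cleanly here.
merged = merge_attributes(base.attrib, mine.attrib, theirs.attrib)
print(merged)
```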

Line Endings
------------

Files _do_ get randomly changed between formats. Sometimes they are files you don't even know are getting randomly changed. Different service packs of Visual Studio do different things to your Visual C++ project files. At some point Fyodor and I got off on our line endings and Fyodor's patches can't merge with mine. Fault? Amazingly (?) I wouldn't blame line-conversion; I blame the lack of metadata.
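
One way to picture the "metadata" point: record the line-ending style as a property of the file and compare contents in a normalized form. The sketch below is an assumption about how that might look, not a description of what Subversion does; all names are hypothetical.

```python
def detect_eol(data: bytes) -> str:
    """Classify line endings as 'crlf', 'lf', 'cr', 'mixed', or 'none'."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # bare LF only
    cr = data.count(b"\r") - crlf   # bare CR only
    styles = [name for name, n in (("crlf", crlf), ("lf", lf), ("cr", cr)) if n]
    if not styles:
        return "none"
    return styles[0] if len(styles) == 1 else "mixed"

def to_lf(data: bytes) -> bytes:
    """Normalize every line ending to LF for comparison purposes."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def same_text(a: bytes, b: bytes) -> bool:
    """True when two files differ at most in their line-ending style."""
    return to_lf(a) == to_lf(b)

windows_copy = b"int main()\r\n{\r\n}\r\n"
unix_copy = b"int main()\n{\n}\n"
print(detect_eol(windows_copy), detect_eol(unix_copy))
print(same_text(windows_copy, unix_copy))
```

With the style kept as metadata, a tool silently flipping the format would show up as one property change instead of a conflict on every line.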

Project Files
-------------

Visual Studio project files are more annoying for other reasons. The main one being that people tend to not understand the file format... this is helped in Visual Studio.NET where it is XML, but that only means you understand the parsing rules, not what the tags mean. Why is this important? I would say because they almost should be thought of as binary files, in the sense that a diff/patch merge shouldn't be done on them... (and until I took on the responsibility of dealing with all of the project conflicts in our 3D engine, we _did_ have them in our repository flagged binary).

Here's the thing, line-by-line merges don't make sense for project files. The file format wasn't designed to be line-by-line friendly (which is what most of this e-mail is about, how line-by-line needs to be kept in mind constantly when working with diff/patch). Example: when you go into the project settings, and change a compile option, there is one single line in the project file that has all of the command line options for the compiler that gets changed. This means that if Jake changes a compile option, and I've changed a compile option, there will be this obscure conflict on a bunch of command line options.
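
The compile-option case can be sketched as a three-way merge over individual options rather than over the whole line. This assumes, purely for illustration, that every option is an independent token with no separate argument; the option strings below are examples, and merge_options is a hypothetical helper. Two people changing the same option to different values would need extra handling that this sketch does not attempt.

```python
import shlex

def merge_options(base, mine, theirs):
    """Three-way merge of a command line, treating each option as one token."""
    b, m, t = (shlex.split(s) for s in (base, mine, theirs))
    # An option is dropped if either side removed it from the base.
    removed = {opt for opt in b if opt not in m or opt not in t}
    merged = [opt for opt in b if opt not in removed]
    # Options added by either side are appended once, in first-seen order.
    for opt in m + t:
        if opt not in b and opt not in merged:
            merged.append(opt)
    return merged

base = "/nologo /W3 /Od"
mine = "/nologo /W4 /Od"        # I raised the warning level
theirs = "/nologo /W3 /Od /Zi"  # Jake turned on debug information

# A line-based merge sees two edits to the same line; here they combine.
print(merge_options(base, mine, theirs))
```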

How about this one: the IDE doesn't guarantee that the files will be listed in the same order when you save your project. That means that you sometimes add a file, and then get a conflict when the added file goes near the top of the list, and one from near the top of the list also gets shoved to the bottom. You could model this as a "move" operation from near the top of the file to near the end, followed by an "addition", but even that starts to get problematic. To the IDE the syntax is identical whether or not the file names have quotes around them (there is no ambiguity, as the file name runs up to the end of the line), so the IDE will choose one or the other depending on what version of the parser it has (which is upgraded by service packs to the editor).
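
A sketch of comparing file lists the way the IDE seems to read them: order does not matter and optional quotes are ignored. The one-name-per-line layout used here is a simplification invented for the example, not the actual project file format, and the function names are hypothetical.

```python
def normalize_entry(line):
    """Strip surrounding white space and optional double quotes from a file name."""
    return line.strip().strip('"')

def file_list_changes(old_lines, new_lines):
    """Return (added, removed) file names, ignoring order and quoting."""
    old = {normalize_entry(line) for line in old_lines if line.strip()}
    new = {normalize_entry(line) for line in new_lines if line.strip()}
    return sorted(new - old), sorted(old - new)

before = ['main.cpp', 'tree.cpp', 'util.cpp']
after = ['"render.cpp"', '"util.cpp"', '"tree.cpp"', '"main.cpp"']

# Reordering and re-quoting disappear; only the real addition remains.
print(file_list_changes(before, after))
```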

Cross-File Modifications
------------------------

The very first thing I got asked when I explained how Subversion finally models directories to my good friend Steve was "does it keep track of sections of code that have been moved between files?", and all I could say was "good point, no". This happens a lot: I want to move content from one function into another function within the same file, or I decide that this function is better off in a separate file, or even part of a different class or DLL. CVS doesn't deal with this as I would want it to, and causes conflicts on code that people are modifying in those functions even when the actual code content that got edited might not have changed. You also lose the history associated with the content when it goes from one file to another.

Sure, I love the ability to move files around, but it seems like a special-case hack to the more general problem of moving content between files: the general case for "move" is a new file was created, and then all of the content from the original file was transplanted to the new file, and the old file was deleted. If that were supported, you would hope to do "lines 50-67 of file main.c were moved to follow line 7 of tree.cpp" and would be all that much closer to solving the actual problem rather than pacifying the people who just need to occasionally rename a file.
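
The "lines 50-67 of file main.c were moved to follow line 7 of tree.cpp" operation could be represented roughly as below. This is a sketch under the assumption that a working copy is just a mapping from file name to a list of lines; MoveLines and apply_move are hypothetical names, and tracking history across the move is not attempted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveLines:
    src: str    # file the lines come from
    first: int  # first moved line, 1-based inclusive
    last: int   # last moved line, 1-based inclusive
    dst: str    # file the lines go to (must differ from src in this sketch)
    after: int  # insert after this line of dst; 0 means at the top

def apply_move(files, op):
    """Return a new {name: lines} mapping with the move applied."""
    if op.src == op.dst:
        raise ValueError("same-file moves are not handled in this sketch")
    src, dst = files[op.src], files[op.dst]
    if not 1 <= op.first <= op.last <= len(src):
        raise ValueError("source range out of bounds")
    if not 0 <= op.after <= len(dst):
        raise ValueError("destination position out of bounds")
    block = src[op.first - 1:op.last]
    result = dict(files)
    result[op.src] = src[:op.first - 1] + src[op.last:]
    result[op.dst] = dst[:op.after] + block + dst[op.after:]
    return result

files = {"main.c": ["a", "b", "c", "d"], "tree.cpp": ["x", "y"]}
moved = apply_move(files, MoveLines("main.c", 2, 3, "tree.cpp", 1))
print(moved)
```

In this model a whole-file rename is the special case where the range covers the entire source file and the destination starts out empty.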

Conclusion
----------

I think Subversion definitely needs to deal with line endings in order to do merges correctly. If you take something that is _truly_ a text file, which in the abstract sense is a string of characters chunked into an ordered list of "lines", and check it out onto a Windows machine, you may approximate the line endings as \r\n, but it does the idea of a "text file" an injustice if you think about it as a single string of characters that is occasionally punctuated by \r\n.

However, I have a more powerful conclusion :). In the be-all and end-all of concurrent version control, the merging engine needs to be extensible. I.e., based on possibly the MIME type, there should be special support for doing merge operations on Visual Studio projects (maybe a Python script that gets sent from the server back to the client if the client has an old version... maybe a set of Parrot byte codes to be more language neutral, but something). This way, the lists of changes can be modeled as "a file was added, the box for debug information was checked, and the output file name was changed" rather than "here is a diff between two files whose content I don't understand".
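
A very small sketch of what an extensible merge engine keyed on MIME type might look like. Everything here is hypothetical: the registry, the made-up MIME type "application/x-example-settings", and the key=value format it merges. It only shows the dispatch idea, with a whole-file fallback when no format-aware handler is registered.

```python
from typing import Callable, Dict

MergeFn = Callable[[str, str, str], str]
HANDLERS: Dict[str, MergeFn] = {}

def register(mime_type: str):
    """Decorator that registers a merge handler for one MIME type."""
    def decorator(fn: MergeFn) -> MergeFn:
        HANDLERS[mime_type] = fn
        return fn
    return decorator

def pick(base, mine, theirs):
    """Generic three-way choice; raises when both sides changed differently."""
    if mine == theirs or theirs == base:
        return mine
    if mine == base:
        return theirs
    raise ValueError("conflict")

def merge(mime_type, base, mine, theirs):
    """Dispatch to a format-aware handler, else fall back to whole-file pick."""
    return HANDLERS.get(mime_type, pick)(base, mine, theirs)

@register("application/x-example-settings")
def merge_settings(base, mine, theirs):
    """Merge 'key=value' lines per key instead of per line of text."""
    def parse(text):
        return dict(line.split("=", 1) for line in text.splitlines() if line.strip())
    b, m, t = parse(base), parse(mine), parse(theirs)
    merged = {}
    for key in sorted(set(b) | set(m) | set(t)):
        value = pick(b.get(key), m.get(key), t.get(key))
        if value is not None:
            merged[key] = value
    return "\n".join(f"{k}={v}" for k, v in merged.items())

base = "debug=0\nout=a.exe"
mine = "debug=1\nout=a.exe"      # the box for debug information was checked
theirs = "debug=0\nout=b.exe"    # the output file name was changed
print(merge("application/x-example-settings", base, mine, theirs))
```

With the handler registered, the two edits combine; without one, the same inputs fall through to the whole-file fallback and are reported as a conflict.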

Subversion probably won't bother with that, but I can still dream :-).

Sincerely,
Jay Freeman (saurik)
saurik@saurik.com

-----Original Message-----
From: Branko Čibej [mailto:brane@xbc.nu]
Sent: Tuesday, December 11, 2001 1:43 PM
To: dev@subversion.tigris.org
Subject: Re: Poll: do we really need newline conversion?

...

As to newline conversion being outside the scope of a version control
system ... strictly speaking, so is support for showing diffs on text
files. You only *need* to be able to get the different versions, you can
diff them yourself ... (ok, ok, never mind.)

...

-- 
Brane Čibej   <brane_at_xbc.nu>   http://www.xbc.nu/brane/
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Sat Oct 21 14:36:52 2006

This is an archived mail posted to the Subversion Dev mailing list.
