
Re: Merge mode (Was Classifying files as binary or text)

From: Branko Cibej <brane_at_xbc.nu>
Date: Thu, 19 Nov 2009 09:34:53 +0100

Julian Foad wrote:
> Here is a vision of a solution that doesn't treat MIME-type as "special"
> but can still take advantage of it, both for 99% "Do The Right Thing"
> without manual setting of another property, and also for backward
> compatibility.
> (1) Use of line-wise merge shall be determined by a set of rules of the
> following kinds:
> * a specified property not existing, or existing and its value
> matching a specified pattern
> * the file name matching a specified pattern
> * a content scan determining whether the content is predominantly
> plain text

I'll point out that we already have all of that. :)

    * "a specified property", currently svn:mime-type;
    * "the file name", currently autoprops;
    * "a content scan", see svn_io_detect_mimetype, for whatever that's

Though there's room for improvement of course.

I do not at all disagree with these principles; but as usual, the devil
is in the details.

First a word about using file names for content type determination. As
others have pointed out, that's shaky ground, based on, e.g., XML
troubles. My own recent experience has to do with video files -- not
really a good example for "source control" -- and has taught me to
*never* take the file name (or extension) into account if you want to
preserve sanity.

This implies that the file name should be the *least* important
determinator of content type; an assumption that we happily break with

A content scan is a much better choice, and of course our current
content scan is "less than perfect." There are a lot of things we can
improve there, including using commonly available algorithms for
determining a file's content type (although for example the current
libmagic, which someone mentioned, is deficient:

    $ file foo.txt
    foo.txt: data
    $ cat foo.txt
    This is UTF-16BE text

One would expect the common UTF encodings to be recognized.)
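
To make the UTF point concrete, here is a minimal sketch of a content scan that recognizes the common UTF encodings. It is not what svn_io_detect_mimetype or libmagic does; the function name guess_text_encoding is made up for illustration, and the BOM-less UTF-16 check is an assumption that only handles ASCII-range text.

```python
import codecs

# BOMs are checked longest-first so that the UTF-32 LE mark (FF FE 00 00)
# is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def guess_text_encoding(data: bytes):
    """Return a likely text encoding for data, or None if it looks binary.

    Hypothetical helper, for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if b"\x00" not in data:
        # No NUL bytes: accept the data only if it is valid UTF-8
        # (which includes plain ASCII and the empty file).
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return None
    # NUL bytes but no BOM: ASCII-range text in UTF-16 has a NUL in every
    # other byte. This is a deliberately narrow heuristic.
    if len(data) % 2 == 0:
        even, odd = data[0::2], data[1::2]
        if even.count(0) == len(even) and 0 not in odd:
            return "utf-16-be"
        if odd.count(0) == len(odd) and 0 not in even:
            return "utf-16-le"
    return None
```

With this sketch, a BOM-less file like the foo.txt above would be classified as UTF-16BE text rather than "data", as long as it sticks to ASCII-range characters.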

Which leaves us with a property that takes precedence over all other
determinators. Which is what the proposal is all about, and what your
proposal transmogrified into. :)

Others have pointed out many problems with using and/or extending
svn:mime-type for this purpose. I'll add a couple more:

    * IIRC IANA's MIME registry has different types registered to the
      same file extensions. Coupled with the ambiguities of content scan
      results -- and XML is a prime example of these -- you can use
      neither file extension nor content scan to unambiguously set the
      correct value, so you'd /have/ to set it manually. If you have to
      manually set a property, it might as well be one that is less
      ambiguous in the diff/merge/blame context than svn:mime-type.
    * The type of a file's contents does not necessarily determine how a
      user wants to merge the file. I can imagine having a file that's
      essentially text, but must never be automatically merged -- some
      kind of package manifest, for example.
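
A rough sketch of how a dedicated property could take precedence over svn:mime-type when deciding on a line-wise merge. The svn:merge-mode property is only a proposal in this thread, and the values "text" and "binary" as well as the function name are assumptions made up for illustration.

```python
def use_linewise_merge(props: dict) -> bool:
    """Decide whether a file should be merged line by line.

    props maps property names to values for one file. The
    "svn:merge-mode" property and its "text"/"binary" values are
    hypothetical; they do not exist in Subversion.
    """
    mode = props.get("svn:merge-mode")
    if mode is not None:
        # An explicit setting always wins, whatever the content type is.
        return mode == "text"
    mime = props.get("svn:mime-type")
    if mime is None:
        # No MIME type set: treat the file as text.
        return True
    # Otherwise fall back to the MIME type: only text/* is merged line-wise.
    return mime.startswith("text/")
```

The package manifest case would then be a file with svn:mime-type set to text/plain and svn:merge-mode set to binary: still text for every other purpose, but never merged automatically.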

> The rules for combining the rules are (something straight-forward, to be
> decided).
> The rules can be written to match svn:mime-type and svn:merge-mode, thus
> achieving the best results that mime-type matching can provide and yet
> also the explicit setting when it is wanted.
> (2) Use of line-wise diff shall be determined by a set of rules of the
> above kinds.
> (3) Use of line-wise blame shall be determined by a set of rules of the
> above kinds.

You imply that one might want diff/merge to behave differently from
blame. I don't think that's a reasonable thing to do. The answer to the
question, "who changed this bit of a file?" is very tightly coupled to:
"what changed between these two versions of a file?" If you use
different algorithms for blame and diff, then blame results become even
fuzzier than they are now.

Take for example the following sequence of changes:

$ diff -u r1.xml r2.xml
--- r1.xml 2009-11-19 09:10:54.299911291 +0100
+++ r2.xml 2009-11-19 09:11:09.235528762 +0100
@@ -1,3 +1,3 @@
-this is text and more text
+this is text<br />and more text
$ diff -u r2.xml r3.xml
--- r2.xml 2009-11-19 09:11:09.235528762 +0100
+++ r3.xml 2009-11-19 09:11:19.223518197 +0100
@@ -1,3 +1,3 @@
-this is text<br />and more text
+this is text<br/>and more text

A line-based blame will flag the whole line as changed between r2 and
r3, but an XML-based diff will tell you that only the line-break was
changed, and also that the change has no semantic effect.
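
The difference can be sketched with Python's standard library (3.8 or later for ET.canonicalize). The changed lines are wrapped in a <p> root element here so that they parse as XML; that wrapper and the function names are assumptions for illustration, and canonical-form comparison stands in for a real XML-aware diff.

```python
import difflib
from xml.etree import ElementTree as ET


def lines_changed(old: str, new: str) -> bool:
    """True if a line-based diff reports any difference."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return any(diff)


def xml_changed(old: str, new: str) -> bool:
    """True if the two documents differ after XML canonicalization."""
    return ET.canonicalize(old) != ET.canonicalize(new)


# The three revisions from the example, wrapped in a root element.
r1 = "<p>this is text and more text</p>"
r2 = "<p>this is text<br />and more text</p>"
r3 = "<p>this is text<br/>and more text</p>"

# r1 -> r2 is a real change under both views.
r1_r2 = (lines_changed(r1, r2), xml_changed(r1, r2))
# r2 -> r3 changes the line, but not the canonical XML.
r2_r3 = (lines_changed(r2, r3), xml_changed(r2, r3))
```

A blame built on the first function would attribute the line to r3; one built on the second would still attribute it to r2.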

> (4) Default rules should be supplied.
> (5) The rules can be customised in the client's config. (Not that
> client-side config is ideal, but that's the kind of config we have and
> it is manageable.)

Have suitcase, will travel.

> (6) The other issue is whether to store the results of that
> determination in versioned properties on the file (svn:merge-mode,
> svn:diff-mode, svn:blame-mode). If we leave the determination till run
> time, then a client that has a better merge (or diff or blame) tool for
> certain file types will be able to use it.

I fail to see how that last bit is relevant. A client that has a better
merge or diff or blame tool will be able to use it regardless of what
properties we set or don't set on a file, or when or if we scan the
file's contents. It's safe to assume that any really smart diff tool
will have its own set of criteria for determining how to deal with a
file, and/or will allow the user to change its mode interactively. So it
can safely ignore whatever Subversion tells it about a file's type.

Unless you're thinking of teaching SVN to launch different diff tools
based on determined file type. Somewhat overengineered, IMHO, and anyone
can achieve the same effect today with a suitable wrapper.
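
Such a wrapper could look roughly like the sketch below. It assumes the wrapper is called with diff-style options followed by the two file paths as the last two arguments; "my-xml-diff" is a hypothetical placeholder for whatever XML-aware tool is installed, and the content sniff is deliberately crude.

```python
import subprocess

XML_DIFF = ["my-xml-diff"]  # hypothetical placeholder, not a real tool
TEXT_DIFF = ["diff"]


def looks_like_xml(head: bytes) -> bool:
    """Crude sniff: an XML declaration after an optional UTF-8 BOM or whitespace."""
    return head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml")


def pick_diff_command(head: bytes, args: list) -> list:
    """Build the command line to run, given the start of the old file."""
    if looks_like_xml(head):
        # Pass only the two paths; the placeholder tool is assumed not to
        # understand diff-style options.
        return XML_DIFF + args[-2:]
    return TEXT_DIFF + args


def main(args: list) -> int:
    """Read the start of the old file and run the chosen tool on both files."""
    with open(args[-2], "rb") as f:
        head = f.read(1024)
    return subprocess.call(pick_diff_command(head, args))
```

To use it as a script, end the file with "import sys" and "sys.exit(main(sys.argv[1:]))" and point the client's external diff command setting at it. Note that it decides by looking at the content, not at the file name.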

-- Brane

Received on 2009-11-19 09:35:13 CET

This is an archived mail posted to the Subversion Dev mailing list.