
Re: Merge mode (Was Classifying files as binary or text)

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Wed, 18 Nov 2009 03:05:28 +0000

I need to spend some time replying, late at night though it is.

Let me try to explain why I think a "how to merge" property should not
be the primary indicator of how subversion should merge each file.

Principle
=========

I have read that, in the realm of data handling, there is a principle
that it is a bad idea to tag data with annotations that say what kind of
actions can or should be performed on it. That kind of coupling is
unscalable. Instead, it is better to tag data with an indication of what
meaning and/or what syntax the data has, and then let tools decide what
to do, based on that information.

We already have one data-type indicator: svn:mime-type. Now, MIME type
is far from a complete data type specifier. It is insufficient for our
needs, in theory. However, in practice, it is nearly sufficient. (See
Problem 2 below for an exception.)

We also have another data-type indicator: the file name. A file name is
also an incomplete source of metadata, and some file names ("README" or
"CHANGES") give no indication at all of the format, but it is useful in
many cases ("*.py", "*.c").

Problem 1 (limited recognition of MIME types)
=========

Subversion mis-categorizes a lot of MIME types as "binary" (and
therefore will not merge or diff or blame them) which really are
line-based text formats.

The list of such MIME types is continually evolving so it is not
possible for Subversion to have a built-in complete list. However, it is
easy for new releases of Subversion to have an updated list.

It is not much harder for Subversion to have a configurable list of
which MIME types (or MIME type patterns) should be considered mergeable.
(The configuration could be extensible: it could say line-wise-mergeable
or not mergeable or XML-mergeable or ...)
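
As a rough illustration of such a configurable list, here is a minimal
Python sketch. The rule table, the mode names and the function
`mode_for_mime_type` are all hypothetical; nothing like this exists in
Subversion, and the patterns shown are only examples of what a user
might configure.

```python
from fnmatch import fnmatchcase

# Hypothetical user-configurable rules: (MIME type pattern, mode).
# The first matching pattern wins, so specific patterns go first.
MIME_RULES = [
    ("image/svg+xml", "xml"),
    ("application/xml", "xml"),
    ("application/*+xml", "xml"),
    ("application/json", "line-wise"),
    ("application/ecmascript", "line-wise"),
    ("image/x-xbitmap", "line-wise"),
    ("image/x-xpixmap", "line-wise"),
    ("text/*", "line-wise"),
]


def mode_for_mime_type(mime_type, rules=MIME_RULES):
    """Return the mode for a MIME type, or "none" if no pattern matches."""
    if mime_type is None:
        # No svn:mime-type at all: treated as text, as Subversion does today.
        return "line-wise"
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    base = mime_type.split(";", 1)[0].strip().lower()
    for pattern, mode in rules:
        if fnmatchcase(base, pattern):
            return mode
    return "none"
```

Because the table is plain data, a new release (or a user) can extend
it without touching the matching logic.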

Problem 2 (mergeable and non-mergeable XML files)
=========

A user has some XML files on which a line-based merge is useful, and
some XML files on which that is not useful, and wishes to label both
kinds with svn:mime-type=text/xml. Let us suppose one format has each
XML tag on a separate line, and the other has them all run together and
line breaks inserted at arbitrary places in the sequence. It may be
possible to find a different MIME type for one of the file types, but
that may well not be possible, e.g. if they are both proprietary or
arbitrary XML formats.

This problem is not limited to XML files. Consider two "plain text"
files, MIME type text/plain, with different kinds of text in them. One
has line-based content, such as a shopping list, and changes usually
leave many lines unchanged. The other contains the text of a newspaper
article with line breaks at roughly every 70 characters in a stream of
words, so two similar versions of it may have very few whole lines in
common.
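
To make the difference concrete, here is a small Python sketch that
measures how many whole lines two versions of a file share. The sample
texts and the helper `shared_line_ratio` are invented for illustration,
and Python's difflib stands in for a line-based diff; it is not the
algorithm Subversion uses.

```python
from difflib import SequenceMatcher
import textwrap


def shared_line_ratio(old_text, new_text):
    """Similarity in [0, 1] computed over whole lines, as a line-based tool sees it."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()


# Line-based content: one item per line, one item added in the new version.
shopping_old = "\n".join(["apples", "bread", "cheese", "eggs", "milk", "tea"])
shopping_new = "\n".join(
    ["apples", "bread", "butter", "cheese", "eggs", "milk", "tea"]
)

# Flowed prose: one word added at the front, then the text is re-wrapped.
words = (
    "the council voted on tuesday to approve the new budget after a long "
    "debate about road repairs and library opening hours across the town"
).split()
article_old = textwrap.fill(" ".join(words), width=30)
article_new = textwrap.fill(" ".join(["yesterday"] + words), width=30)

shopping_ratio = shared_line_ratio(shopping_old, shopping_new)
article_ratio = shared_line_ratio(article_old, article_new)
print(f"shopping list: {shopping_ratio:.2f}")
print(f"article:       {article_ratio:.2f}")
```

Both edits add a single word, yet the shopping list keeps nearly all of
its lines while re-wrapping shifts every line break in the article, so
a line-wise merge has little or nothing to anchor on.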

I believe this is a real but relatively uncommon requirement. It is a
genuine example of the MIME type being insufficient to determine
(line-wise) mergeability. There are many file formats that can be
regarded as being
(line-wise) mergeable or non-mergeable depending on some aspect of their
content that cannot be reflected in the MIME type. It is uncommon in the
sense that most Subversion users' needs can be satisfied by
distinguishing mergeability based on the MIME type, or better the MIME
type and file name taken together, of their files.

To solve this problem when it exists, there does indeed need to be
further metadata about the content type of the file. (Alternatively it
could be metadata that says how to merge the file, but see "Principle"
above.)

Solution 0 (merge-mode)
==========

So we could add a property to each file which says whether the file is
to be considered line-wise mergeable by Subversion, and say that this
property will be the primary source of this information. What are the
pros and cons of this?

Pro: The user can force a line-wise merge on one file and no merge
attempt on another file even when MIME type and file name are
insufficient distinguishers.

Pro: The user can forget about providing MIME type at all, and just set
this property to one of the pre-defined two types of merging (line-based
or none), if that is all the user cares about.

Con: This property associates the file with one simple kind of merging;
but the best merge tool available on the client may not be that simple
kind. If we want to use a better merge tool, say an XML-aware merge
tool, this property actually gets in the way: it tells us to use a
simple line-based merge on this file. It would have been better if the
property had said, "this file contains line-based XML, so you might want
to use an XML-aware merge rather than a simple line-based merge if you
can". In other words, we really want to tell the client what the content
type is, and let the client choose the best merge tool for that content
type.

Con: This property conveys redundant information. In almost all cases,
the MIME type and/or file name are sufficient information. It is wrong
to pretend that MIME type and file name are not good sources of
how-to-merge information, and to leave their currently weak and
deficient interpretation as just a deprecated backward-compatibility
fallback.

Con: Not extensible to diff, blame, etc. An indication that the file is
line-wise-mergeable is not really a good indication of whether the file
can be line-wise diffed or blamed.

Proposal
========

This is the full, long-term proposal. We can choose a subset of this to
do initially.

(1) Make svn merge/diff/blame take into account the file name as well as
the svn:mime-type in deciding whether to operate in a "line-wise" mode
or not operate at all.

(2) Update the built-in MIME type and filename patterns.

  * Update the built-in selection based on svn:mime-type to recognise a
list of MIME types that is reasonably up-to-date right now (even though
it will be out of date by the time the released software is in use).

  * Update the built-in selection based on file names to recognize a
reasonable list of file name patterns.

(3) Provide a client-side config for extending and overriding the rules
that map MIME type and file name to a merge/diff/blame mode. This mode
should be specifiable in the config, not just "line-wise" or "none" but
any other named mode. Provide config options for specifying the merge
tool, diff tool and blame tool per mode. Tools should be specifiable as
none, built-in or external.

(4) Add an optional property for selecting a particular merge mode (and
diff mode and blame mode) for the cases where (1) and (2) are
insufficient or inconvenient.
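
The layering of steps (1) to (4) could look roughly like the Python
sketch below. Everything in it is an assumption made for illustration:
the rule format, the mode names, the example patterns, the external
tool name "my-xml-merge" and the functions `resolve_mode` and
`merge_tool_for`. It is a sketch of the lookup order, not a design for
the actual config syntax.

```python
from fnmatch import fnmatchcase

# Step (2): built-in defaults. The patterns are examples only.
BUILTIN_RULES = [
    {"mime": "text/*", "mode": "line-wise"},
    {"name": "*.py", "mode": "line-wise"},
    {"name": "*.c", "mode": "line-wise"},
]

# Step (3): client-side config, consulted before the built-in rules.
USER_RULES = [
    {"mime": "text/xml", "name": "*.layout.xml", "mode": "none"},
    {"mime": "*/xml", "mode": "xml"},
]

# Step (3): tool per mode: "none", "built-in", or an external command.
MERGE_TOOLS = {"line-wise": "built-in", "xml": "my-xml-merge", "none": "none"}


def rule_matches(rule, mime_type, file_name):
    """A rule matches only if every pattern it specifies matches."""
    if "mime" in rule and not (mime_type and fnmatchcase(mime_type, rule["mime"])):
        return False
    if "name" in rule and not fnmatchcase(file_name, rule["name"]):
        return False
    return True


def resolve_mode(file_name, mime_type=None, mode_property=None):
    """Pick a mode: explicit property, then user rules, then built-in rules."""
    # Step (4): an explicit per-file property overrides everything else.
    if mode_property is not None:
        return mode_property
    # Step (1): both the MIME type and the file name take part in matching.
    for rule in USER_RULES + BUILTIN_RULES:
        if rule_matches(rule, mime_type, file_name):
            return rule["mode"]
    # Nothing matched: keep today's behaviour (no MIME type means text).
    return "line-wise" if mime_type is None else "none"


def merge_tool_for(file_name, mime_type=None, mode_property=None):
    """Map the resolved mode to the configured merge tool."""
    return MERGE_TOOLS.get(resolve_mode(file_name, mime_type, mode_property), "none")
```

With these example rules, `merge_tool_for("doc.xml", "text/xml")` picks
the external XML tool, while a file named `main.layout.xml` with the
same MIME type is left unmerged unless its property says otherwise.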

Regards,
- Julian

Mike Samuel wrote:
> Proposal:
> ========
> (1) Add documentation on the svn:merge-mode property that lists the
> allowed values as ("simple" and "none")
> (2) Add example autoprops rules to the documentation that sets
> svn:merge-mode to "simple" for the following file types
> application/ecmascript
> application/json
> application/xml
> image/svg+xml
> (3) Change the text quoted from the SVN manual under Background to
> read as below.
> (4) Update the implementation to agree.
>
> Subversion treats the following files as [[mergeable]]:
>
> * Files with no svn:mime-type [[and no svn:merge-mode]]
> * Files with a svn:mime-type starting "text/"
> * Files with a svn:mime-type equal to "image/x-xbitmap"
> * Files with a svn:mime-type equal to "image/x-xpixmap"
> * [[Files with a svn:merge-mode that is equal to "simple"]]
>
> All other files are treated as [[unmergeable]], meaning that
> Subversion will:
>
> * Not attempt to automatically merge received changes with
> local changes during svn update or svn merge
> * Not show the differences as part of svn diff
> * Not show line-by-line attribution for svn blame
>
> In all other respects, Subversion treats [[mergeable]] files the
> same as [[unmergeable]] files, e.g. if you set
> the svn:keywords or svn:eol-style properties, Subversion will
> perform keyword substitution
> or newline conversion on [[unmergeable]] files.
>
>
> Goal:
> ====
> To update the scheme used by svn {update,diff,merge,blame} so that it
> allows merging of files
> with svn:mime-type outside the hard-coded list currently used.
>
> This determination should be independent of the platform svn
> is running on, so independent of the set of supported character sets.
>
> This scheme should not complicate future extensions to the merge
> system which might wish to use a different merge policy, e.g. for XML
> than for source code files.
>
> This scheme should work with autoprops, and other mechanisms repository
> administrators use to manage files. Specifically, some kinds of XML can
> be meaningfully merged, and others cannot.
>
> This scheme should work within existing limitations, such as the inability
> to merge UTF-16 and UTF-32.
>
>
> Background:
> ==========
> The current behavior is described at
> http://subversion.tigris.org/faq.html#binary-files
>
> Subversion treats the following files as text:
>
> * Files with no svn:mime-type
> * Files with a svn:mime-type starting "text/"
> * Files with a svn:mime-type equal to "image/x-xbitmap"
> * Files with a svn:mime-type equal to "image/x-xpixmap"
>
> All other files are treated as binary, meaning that Subversion will:
>
> * Not attempt to automatically merge received changes with
> local changes during svn update or svn merge
> * Not show the differences as part of svn diff
> * Not show line-by-line attribution for svn blame
>
> In all other respects, Subversion treats binary files the same as
> text files, e.g. if you set
> the svn:keywords or svn:eol-style properties, Subversion will
> perform keyword substitution
> or newline conversion on binary files.
>
> Common source code mime-types are misclassified, and that problem is
> likely to grow because of current IANA policy.
> Mime-types are handed out by the IANA, which only assigns text/*
> mime-types for file-types that are meant to be human readable. Source
> code is explicitly not considered human readable. This is why many
> source code and data mime-types are in the application/* group or
> other non text/* groups: application/json, application/ecmascript,
> application/xml, image/svg+xml.
> RFC 4288 ( ftp://ftp.rfc-editor.org/in-notes/rfc4288.txt ) says this
> Expected uses for the "application" media type
> include but are not limited to file transfer, spreadsheets,
> presentations, scheduling data, and languages for "active"
> (computational) material.
>
> ------------------------------------------------------
> http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2419155

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2419275
Received on 2009-11-18 04:05:55 CET

This is an archived mail posted to the Subversion Dev mailing list.