Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 14:02:09 -0800

I would like to try to tackle issue 1002, which addresses how
Subversion classifies files as binary or text.
Below I flesh out a proposal I listed in that bug and list a few
variants. Please let me know if this looks good, and which of the
variants, if any, are desired.

The current behavior is described as follows:
    To determine whether a contextual merge is possible, Subversion
    examines the svn:mime-type property. If the file has no svn:mime-type
    property, or has a mime-type that is textual (e.g. text/*), Subversion
    assumes it is text. Otherwise, Subversion assumes the file is binary.
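
As a rough illustration, the rule quoted above can be sketched like this.
It is not Subversion's actual code: the name is_text_current is made up
for this example, and treating only text/* as textual is a simplifying
assumption taken from the quoted description.

```python
from typing import Optional


def is_text_current(mime_type: Optional[str]) -> bool:
    """Sketch of the documented rule (hypothetical helper, not an svn API)."""
    # No svn:mime-type property at all: assume text.
    if mime_type is None:
        return True
    # Drop any parameters (e.g. ";charset=UTF-8") and look at the media type only.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    # Assumption: "textual" means text/* and nothing else.
    return media_type.startswith("text/")
```

Under this sketch, a file whose svn:mime-type is application/json is
treated as binary, with or without a charset parameter.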

http://www.ietf.org/rfc/rfc2046.txt describes the charset mime-type parameter.

Common source code mime-types are misclassified, and that problem is
likely to grow because of current IANA policy.
Mime-types are handed out by the IANA, which only assigns text/*
mime-types for file-types that are meant to be human readable. Source
code is explicitly not considered human readable. This is why many
source code and data mime-types are in the application/* group or
other non text/* groups: application/json, application/ecmascript,
application/xml, image/svg+xml.
RFC 4288 ( ftp://ftp.rfc-editor.org/in-notes/rfc4288.txt ) says this:
    Expected uses for the "application" media type
    include but are not limited to file transfer, spreadsheets,
    presentations, scheduling data, and languages for "active"
    (computational) material.

   The goal is to allow proper binary/text determinations for content,
based on the same svn:mime-type property now used, in a way that
recognizes that text/* mime-types are rarely granted for source code,
and that the goal of a revision control system is to store source code.
   This determination should not be overly sensitive to the version of
subversion used, which makes a frequently updated mime-type list
a poor fit.

    Classify a file as text if any of the following are true:
    (1) the existing classification algorithm classifies it as text
    (2) the svn:mime-type property includes an attribute ";charset=..."
with a non-blank charset name.
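
Criteria (1) and (2) could be sketched as below. This is a minimal
illustration under stated assumptions, not a patch: charset_of and
is_text_proposed are hypothetical names, the existing rule is
approximated as "no property, or text/*", and parameters are parsed
with a plain split on ";" that does not handle quoted values
containing semicolons.

```python
from typing import Optional


def charset_of(mime_type: str) -> Optional[str]:
    """Return the charset parameter value, or None if absent or blank."""
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            value = value.strip().strip('"').strip()
            return value or None
    return None


def is_text_proposed(mime_type: Optional[str]) -> bool:
    """Hypothetical classifier: existing rule, plus the charset criterion."""
    if mime_type is None:
        return True  # (1) existing rule: no property means text
    media_type = mime_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        return True  # (1) existing rule: text/* is text
    # (2) any media type with a non-blank charset is treated as text
    return charset_of(mime_type) is not None
```

With this sketch, "application/json; charset=UTF-8" is classified as
text, while a bare "application/json" or one with an empty "charset="
stays binary.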

Variant 1:
Extend the criteria above with
    (3) Use the charset from svn:charset if there is none from (2)
See http://svn.haxx.se/dev/archive-2008-06/0941.shtml

Variant 2:
If the charset from (2/3) is not recognized, ignore it.

Variant 3:
If the charset from (2/3) is recognized but the file is not correctly
encoded with it, ignore it.
For example, if the charset is "UTF-8" and the byte sequence contains a
sequence not allowed in UTF-8 (e.g. byte 0xFF), then ignore the charset.
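
The two charset checks (ignore an unrecognized charset, and ignore a
charset the content is not validly encoded in) might look roughly like
this. It is only a sketch: usable_charset is a hypothetical name, and
"recognized" here means known to Python's codec registry, which stands
in for whatever charset list Subversion would actually consult.

```python
import codecs
from typing import Optional


def usable_charset(charset: Optional[str], content: bytes) -> Optional[str]:
    """Return charset if it is recognized and content decodes with it."""
    if not charset:
        return None
    try:
        # Assumption: Python's codec registry defines "recognized".
        codecs.lookup(charset)
    except LookupError:
        return None  # unrecognized charset: ignore it
    try:
        content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return None  # content is not valid in this charset: ignore it
    return charset
```

A file containing byte 0xFF with a declared charset of UTF-8 yields
None here, so the charset would be ignored and the file would fall back
to whatever the remaining criteria decide.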


Received on 2009-11-12 23:06:25 CET

This is an archived mail posted to the Subversion Dev mailing list.