I would like to try to tackle issue 1002, which addresses how
Subversion classifies files as binary or text.
Below I flesh out a proposal I listed in that bug and list a few
variants. Please let me know if this looks good, and which of the
variants, if any, are desired.
The current behavior is described at
To determine whether a contextual merge is possible, Subversion
examines the svn:mime-type property. If the file has no svn:mime-type
property, or has a mime-type that is textual (e.g. text/*), Subversion
assumes it is text. Otherwise, Subversion assumes the file is binary.
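As a rough illustration of the rule quoted above, here is a minimal
Python sketch. The function name is hypothetical, and the code models
only the documented behavior (no property, or text/*, means text); the
real implementation in Subversion may special-case other types.

```python
from typing import Optional


def is_text_current(mime_type: Optional[str]) -> bool:
    """Approximate the documented rule for the svn:mime-type property.

    mime_type is None when the file has no svn:mime-type property.
    """
    if mime_type is None:
        # No svn:mime-type property: assumed to be text.
        return True
    # Drop any parameters (";charset=..." etc.) before looking at the type.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

Under this rule, "application/json" is treated as binary even though
its content is plain text, which is the misclassification discussed
in this proposal.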
http://www.ietf.org/rfc/rfc2046.txt describes the charset mime-type parameter.
Common source code mime-types are misclassified, and that problem is
likely to grow because of current IANA policy.
Mime-types are handed out by the IANA, which only assigns text/*
mime-types for file-types that are meant to be human readable. Source
code is explicitly not considered human readable. This is why many
source code and data mime-types are in the application/* group or
other non text/* groups: application/json, application/ecmascript,
RFC 4288 ( ftp://ftp.rfc-editor.org/in-notes/rfc4288.txt ) says this
Expected uses for the "application" media type
include but are not limited to file transfer, spreadsheets,
presentations, scheduling data, and languages for "active"
The goal is to allow proper binary/text determinations for content,
based on the same svn:mime-type property used now, in a way that
recognizes that text/* mime-types are rarely granted for source code
and that the purpose of a revision control system is to store source
code.
This determination should not be overly sensitive to the version of
subversion used, which makes a frequently updated mime-type list
Classify a file as text if any of the following are true:
(1) the existing classification algorithm classifies it as text
(2) the svn:mime-type property includes an attribute ";charset=..."
with a non-blank charset name.
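Criteria (1) and (2) could be sketched in Python roughly as follows.
The function names are hypothetical, the existing algorithm is
approximated as "no property or text/*", and the parameter parsing is
deliberately naive (it does not handle ";" inside quoted values).

```python
from typing import Optional


def charset_param(mime_type: str) -> Optional[str]:
    """Return the non-blank charset parameter of a mime-type, or None."""
    # Naive parsing: split on ";" and look for a "charset=value" part.
    for part in mime_type.split(";")[1:]:
        name, sep, value = part.partition("=")
        if sep and name.strip().lower() == "charset":
            value = value.strip().strip('"').strip()
            return value or None
    return None


def is_text_proposed(mime_type: Optional[str]) -> bool:
    """Classify as text if criterion (1) or criterion (2) holds."""
    if mime_type is None:
        return True  # (1) existing rule: no svn:mime-type property
    media_type = mime_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        return True  # (1) existing rule: textual mime-type
    # (2) a non-blank charset parameter marks the content as text
    return charset_param(mime_type) is not None
```

With this sketch, "application/json; charset=UTF-8" is classified as
text, while "application/json" alone and "application/json; charset="
are still classified as binary.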
Extend the criteria above with:
(3) Use the charset from svn:charset if there is none from (2).
If the charset from (2/3) is not recognized, ignore it.
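One possible reading of this variant, as a Python sketch. The function
name is hypothetical, and "recognized" is approximated here by whether
Python's codec registry knows the name; Subversion would have to use
its own list of known charsets. The sketch also assumes that an
unrecognized charset from (2) does not trigger the fallback to (3).

```python
import codecs
from typing import Optional


def effective_charset(mime_charset: Optional[str],
                      svn_charset: Optional[str]) -> Optional[str]:
    """Pick the charset from (2), else from svn:charset (3).

    Returns None when no charset is given or the chosen one is not
    recognized, in which case the charset is ignored.
    """
    candidate = (mime_charset or "").strip() or (svn_charset or "").strip()
    if not candidate:
        return None
    try:
        # Assumption: "recognized" means known to Python's codec registry.
        codecs.lookup(candidate)
    except LookupError:
        return None
    return candidate
```

A file would then be treated as text under (2/3) only when
effective_charset() returns a name rather than None.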
If the charset from (2/3) is recognized but the file is not correctly
encoded with it, ignore the charset.
For example, if the charset is "UTF-8" and the byte sequence contains
a sequence not allowed in UTF-8 (e.g. byte 0xFF), then ignore it.
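The UTF-8 check can be illustrated with a short Python sketch. The
function name is hypothetical; the check relies on strict decoding,
which rejects any byte sequence that is not valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

For instance, is_valid_utf8(b"\xff") is False, because 0xFF never
appears in well-formed UTF-8, so a declared "UTF-8" charset would be
ignored for such a file.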
Received on 2009-11-12 23:06:25 CET