I would like to try to tackle issue 1002, which addresses how
Subversion classifies files as binary or text.
http://subversion.tigris.org/issues/show_bug.cgi?id=1002
Below I flesh out a proposal I listed in that bug, and list a few
variants. Please let me know if this looks good, and which of the
variants, if any, are desired.
Background:
The current behavior is described at
http://svnbook.red-bean.com/en/1.1/apas08.html
To determine whether a contextual merge is possible, Subversion
examines the svn:mime-type property. If the file has no svn:mime-type
property, or has a mime-type that is textual (e.g. text/*), Subversion
assumes it is text. Otherwise, Subversion assumes the file is binary.
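As a rough illustration of the rule quoted above (a sketch only, not Subversion's actual code; the function name is made up here, and it only models the text/* case the book mentions):

```python
from typing import Optional


def is_text_current(mime_type: Optional[str]) -> bool:
    """Approximate the documented rule: no svn:mime-type, or text/*, means text."""
    if mime_type is None:
        # No svn:mime-type property at all: assumed to be text.
        return True
    # Drop any parameters (";charset=...") and compare the media type only.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")
```

Under this rule application/json and image/svg+xml come out as binary, which is the misclassification discussed below.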
http://www.ietf.org/rfc/rfc2046.txt describes the charset mime-type parameter.
Common source code mime-types are misclassified, and that problem is
likely to grow because of current IANA policy.
Mime-types are handed out by the IANA, which only assigns text/*
mime-types for file-types that are meant to be human readable. Source
code is explicitly not considered human readable. This is why many
source code and data mime-types are in the application/* group or
other non text/* groups: application/json, application/ecmascript,
application/xml, image/svg+xml.
RFC 4288 ( ftp://ftp.rfc-editor.org/in-notes/rfc4288.txt ) says this:
Expected uses for the "application" media type
include but are not limited to file transfer, spreadsheets,
presentations, scheduling data, and languages for "active"
(computational) material.
Goal:
To allow proper binary/text determination based on the same
svn:mime-type property used now, while recognizing that text/*
mime-types are rarely granted for source code and that the goal of a
revision control system is to store source code.
This determination should not be overly sensitive to the version of
Subversion used, which makes a frequently updated mime-type list
problematic.
Proposal:
Classify a file as text if any of the following are true:
(1) the existing classification algorithm classifies it as text
(2) the svn:mime-type property includes an attribute ";charset=..."
with a non-blank charset name.
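A minimal sketch of how criteria (1) and (2) might combine. The function names are hypothetical, the parameter parsing is deliberately simplified (it does not handle every quoting form RFC 2045 allows), and (1) is approximated by the text/* rule from the book:

```python
from typing import Optional


def charset_param(mime_type: str) -> Optional[str]:
    """Return the charset parameter value, or None if absent or blank."""
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Strip whitespace and optional surrounding quotes.
            value = value.strip().strip('"').strip()
            return value or None
    return None


def is_text_proposed(mime_type: Optional[str]) -> bool:
    """Text if the existing rule says so (1) or a non-blank charset is given (2)."""
    if mime_type is None:
        return True  # (1): no svn:mime-type property
    media_type = mime_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        return True  # (1): textual mime-type
    return charset_param(mime_type) is not None  # (2)
```

With this, "application/json; charset=UTF-8" is classified as text while plain "application/json" and "image/png" stay binary, so no list of known textual mime-types is needed.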
Variant 1:
Extend the criteria above with:
(3) Use the charset from svn:charset if there is none from (2)
See http://svn.haxx.se/dev/archive-2008-06/0941.shtml
Variant 2:
If the charset from (2/3) is not recognized, ignore it.
Variant 3:
If the charset from (2/3) is recognized but the file is not correctly
encoded with it, ignore it. For example, if the charset is "UTF-8" and
the byte sequence contains a sequence not allowed in UTF-8 (e.g. byte
0xFF), then ignore it.
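The two variants that ignore an unrecognized or mismatched charset could be sketched like this. The function name is hypothetical, and Python's codec registry stands in for whatever set of charsets Subversion would actually recognize, which is an assumption:

```python
import codecs


def charset_is_usable(charset: str, content: bytes) -> bool:
    """Return False when the charset should be ignored for this content."""
    try:
        codec = codecs.lookup(charset)
    except LookupError:
        # Charset name is not recognized: ignore it.
        return False
    try:
        # Strict decoding rejects byte sequences that are invalid in the charset.
        codec.decode(content, "strict")
    except UnicodeError:
        # Recognized charset, but the content is not correctly encoded with it.
        return False
    return True
```

Note that single-byte charsets such as ISO-8859-1 accept any byte sequence, so the content check mostly helps for charsets like UTF-8. Decoding the whole file also has a cost that a real implementation might want to bound.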
cheers,
mike
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417284
Received on 2009-11-12 23:06:25 CET