Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 14:02:09 -0800

I would like to tackle issue 1002, which addresses how
Subversion classifies files as binary or text.
http://subversion.tigris.org/issues/show_bug.cgi?id=1002
Below I flesh out a proposal I listed in that bug, and list a few
variants. Please let me know if this looks good, and which of the
variants, if any, are desired.

Background:
The current behavior is described at
http://svnbook.red-bean.com/en/1.1/apas08.html
    To determine whether a contextual merge is possible, Subversion
    examines the svn:mime-type property. If the file has no svn:mime-type
    property, or has a mime-type that is textual (e.g. text/*), Subversion
    assumes it is text. Otherwise, Subversion assumes the file is binary.

http://www.ietf.org/rfc/rfc2046.txt describes the charset mime-type parameter.

Common source code mime-types are misclassified, and that problem is
likely to grow because of current IANA policy.
Mime-types are handed out by the IANA, which only assigns text/*
mime-types for file-types that are meant to be human readable. Source
code is explicitly not considered human readable. This is why many
source code and data mime-types are in the application/* group or
other non text/* groups: application/json, application/ecmascript,
application/xml, image/svg+xml.
RFC 4288 ( ftp://ftp.rfc-editor.org/in-notes/rfc4288.txt ) says:
    Expected uses for the "application" media type
    include but are not limited to file transfer, spreadsheets,
    presentations, scheduling data, and languages for "active"
    (computational) material.

Goal:
To allow proper binary/text determinations based on the same
svn:mime-type property used now, while recognizing that text/*
mime-types are rarely granted for source code, and that the goal of a
revision control system is to store source code.
This determination should not be overly sensitive to the version of
Subversion used, which makes a frequently updated mime-type list
problematic.

Proposal:
    Classify a file as text if any of the following are true:
    (1) the existing classification algorithm classifies it as text
    (2) the svn:mime-type property includes a parameter ";charset=..."
        with a non-blank charset name.
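
To make the two criteria concrete, here is a rough sketch in Python.
It is not Subversion code. The function names are made up for
illustration, and the "existing" rule is simplified to what the book
passage quoted above says (no property, or a text/* type, means text).
The parameter parsing is deliberately naive.

```python
from typing import Optional


def is_text_existing(mime_type: Optional[str]) -> bool:
    """Criterion (1), simplified: no property or a text/* type means text."""
    if mime_type is None:
        return True
    media_type = mime_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")


def charset_param(mime_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter value, or None if absent or blank.

    Naive parsing: splits on ';' and does not handle quoted values
    that themselves contain ';'.
    """
    if mime_type is None:
        return None
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip('"').strip() or None
    return None


def is_text_proposed(mime_type: Optional[str]) -> bool:
    """Text if criterion (1) holds, or (2) a non-blank charset is present."""
    return is_text_existing(mime_type) or charset_param(mime_type) is not None
```

Under this sketch "application/json" alone is still binary, while
"application/json; charset=UTF-8" becomes text, and no list of known
textual mime-types is needed.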

Variant 1:
Extend the criteria above with:
(3) Use the charset from svn:charset if there is none from (2)
See http://svn.haxx.se/dev/archive-2008-06/0941.shtml
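
The fallback in (3) could look roughly like the sketch below. It
assumes properties are available as a simple name-to-value mapping,
which is an illustration only, and the helper names are hypothetical.
svn:charset is the property name from the thread linked above.

```python
from typing import Mapping, Optional


def _charset_from_mime_type(mime_type: str) -> Optional[str]:
    """Naively extract a non-blank charset parameter, if any."""
    for param in mime_type.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip('"').strip() or None
    return None


def effective_charset(props: Mapping[str, str]) -> Optional[str]:
    """Prefer the charset from svn:mime-type (2), else svn:charset (3)."""
    from_mime = _charset_from_mime_type(props.get("svn:mime-type", ""))
    if from_mime is not None:
        return from_mime
    fallback = props.get("svn:charset", "").strip()
    return fallback or None
```

With this ordering, a charset given in svn:mime-type always wins and
svn:charset is consulted only when (2) yields nothing.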

Variant 2:
If the charset from (2/3) is not recognized, ignore it.

Variant 3:
If the charset from (2/3) is recognized but the file is not correctly
encoded with it, ignore it.
For example, if the charset is "UTF-8" and the file contains a byte
sequence not allowed in UTF-8 (e.g. byte 0xFF), then ignore the charset.
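
The two charset checks just described (is the charset recognized, and
does the content actually decode with it) might be sketched as follows.
Python's codec registry stands in here for whatever charset table
Subversion would use, which is an assumption, and the function names
are hypothetical.

```python
import codecs


def charset_is_recognized(charset: str) -> bool:
    """True if the charset name maps to a known codec."""
    try:
        codecs.lookup(charset)
    except LookupError:
        return False
    return True


def content_matches_charset(data: bytes, charset: str) -> bool:
    """True if the bytes decode cleanly with the named charset."""
    try:
        data.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset, or a byte sequence the charset does not allow.
        return False
    return True
```

Note that the decode check is only as strong as the charset is strict.
A lone 0xFF byte fails as UTF-8, but a single-byte charset such as
ISO-8859-1 accepts every byte value, so arbitrary binary data would
pass the check under it.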

cheers,
mike

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417284
Received on 2009-11-12 23:06:25 CET
