Re: Classifying files as binary or text

From: Branko Cibej <brane_at_xbc.nu>
Date: Fri, 13 Nov 2009 04:27:30 +0100

Mike Samuel wrote:
> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>
>> Mike Samuel wrote:
>>
>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>
>>>
>>>> What that amounts to is that, for completeness and correctness, you'd
>>>> have to carefully catalogue all the media types and applicable
>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>> the goal would justify the ongoing effort involved.
>>>>
>>>>
>>> Why? Why not just take into account the charset mime-type parameter
>>> which can only be present on texty types?
>>> I'm not suggesting this as an ultimate solution, just an incremental
>>> improvement of an existing feature that is already well documented and
>>> understood in other domains.
>>>
>>>
>> What do you do if the charset attribute isn't present? You either fall
>> back to the current "broken" behaviour, or you interpret the media type.
>>
>
> The current proposal outlined in the first mail of this thread is to
> classify a file as texty if
> (1) the current "broken" behavior says so
> (2) OR if there is a charset attribute present regardless of value.
>
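
For concreteness, the two-part rule quoted above could be sketched
roughly as follows. This is only an illustration, not Subversion code:
the function names are made up, and the stand-in for the current check
(a plain "text/" prefix test) is an assumption about what the existing
behaviour amounts to.

```python
def has_charset_param(mime_type: str) -> bool:
    """Return True if a charset=... parameter is present, whatever its value."""
    # Naive split on ';'. A quoted parameter value containing ';' would
    # confuse this; a real parser would have to handle quoting.
    _, *params = mime_type.split(";")
    for param in params:
        name, sep, _value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return True
    return False


def current_check_says_text(mime_type: str) -> bool:
    """ASSUMPTION: stand-in for the existing check, a simple prefix test."""
    return mime_type.strip().lower().startswith("text/")


def is_texty(mime_type: str) -> bool:
    """Rule (1) OR rule (2) from the quoted proposal."""
    return current_check_says_text(mime_type) or has_charset_param(mime_type)
```

With this sketch, "application/xml; charset=UTF-8" counts as texty while
"application/octet-stream" does not. Note that nothing here knows about
specific media types; it only looks at the prefix and the parameter list.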

Your proposal and Mark's differ only in extending the parsing of
svn:mime-type vs. introducing a new property. Mine adds the option of
having more fine-grained choices about diff algorithms in the future
without actually having to know anything about specific media types, but
that's just future-proofing, not an immediate requirement.

On the surface, the choice appears simple: roll a die, or flip a coin,
or (heh) have a duel. But Hyrum gives a very good example with the working
copy library nightmare. Even picking "strcmp" over "regex.match" can
have far-reaching consequences, but I'm more concerned about the
/potential/ creature feep: "Oh, we do /this/ and /this/ with
svn:mime-type, why not check it for colour and taste, too." Somewhat
overdone, perhaps, but it's /so/ easy to fall into the trap of just
adding a little something to an existing feature.

>> (Oh, we don't properly interpret the Unicode end-of-line code point in
>> UTF-8 files.)
>>
>
> End of line codepoint? Are you talking about U+2028 and U+2029?
>

Those would appear to be the ones, yes. I don't know offhand if the
paragraph separator implies end-of-line.
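
To illustrate the kind of mismatch involved, here is a small sketch in
Python (chosen only for brevity; it says nothing about how Subversion's
own code splits lines). A byte-oriented splitter that knows only CR and
LF does not see U+2028 or U+2029 in UTF-8 data, while Python's
str.splitlines happens to treat both as line boundaries, which is one
library's interpretation rather than a ruling on the question.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

text = f"one{LS}two{PS}three\nfour"
data = text.encode("utf-8")

# Each separator is a three-byte sequence in UTF-8.
ls_bytes = LS.encode("utf-8")  # E2 80 A8
ps_bytes = PS.encode("utf-8")  # E2 80 A9

# bytes.splitlines breaks only at ASCII line boundaries (\n, \r, \r\n).
byte_lines = data.splitlines()

# str.splitlines also breaks at U+2028 and U+2029, among others.
text_lines = text.splitlines()

print(len(byte_lines), len(text_lines))
```

The two counts differ because only the decoded-text splitter recognises
the Unicode separators.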

-- Brane

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417382
Received on 2009-11-13 04:27:44 CET
