
Re: Classifying files as binary or text

From: Branko Cibej <brane_at_xbc.nu>
Date: Fri, 13 Nov 2009 04:27:30 +0100

Mike Samuel wrote:
> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>> Mike Samuel wrote:
>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>> What that amounts to is that, for completeness and correctness, you'd
>>>> have to carefully catalogue all the media types and applicable
>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>> the goal would justify the ongoing effort involved.
>>> Why? Why not just take into account the charset mime-type parameter
>>> which can only be present on texty types?
>>> I'm not suggesting this as an ultimate solution, just an incremental
>>> improvement of an existing feature that is already well documented and
>>> understood in other domains.
>> What do you do if the charset attribute isn't present? You either fall
>> back to the current "broken" behaviour, or you interpret the media type.
> The current proposal outlined in the first mail of this thread is to
> classify a file as texty if
> (1) the current "broken" behavior says so
> (2) OR if there is a charset attribute present regardless of value.
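
The two-part rule quoted above can be sketched roughly as follows. This is only an illustration, not Subversion code: is_text_by_current_rule is a hypothetical stand-in that assumes the existing check accepts nothing but "text/*" types, and has_charset_parameter is a simplified parameter scan that does not handle quoted strings.

```python
def is_text_by_current_rule(mime_type: str) -> bool:
    """Hypothetical stand-in for the existing check (assumption: only text/* is text)."""
    return mime_type.strip().lower().startswith("text/")


def has_charset_parameter(mime_type: str) -> bool:
    """True if any ';'-separated parameter is named 'charset', whatever its value."""
    _media_type, *params = mime_type.split(";")
    for param in params:
        name, sep, _value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return True
    return False


def is_texty(mime_type: str) -> bool:
    """Rule (1) OR rule (2) from the proposal."""
    return is_text_by_current_rule(mime_type) or has_charset_parameter(mime_type)


if __name__ == "__main__":
    for value in ("text/plain",
                  "application/xml; charset=UTF-8",
                  "application/octet-stream"):
        print(value, "->", is_texty(value))
```

Note that the charset value itself is never inspected, which is what "regardless of value" amounts to in this sketch.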

Your proposal and Mark's differ only in extending the parsing of
svn:mime-type vs. introducing a new property. Mine adds the option of
having more fine-grained choices about diff algorithms in the future
without actually having to know anything about specific media types, but
that's just future-proofing, not an immediate requirement.

On the surface, the choice appears simple: roll a die, or flip a coin,
or (heh) have a duel. But Hyrum gives a very good example in the working
copy library nightmare. Even picking "strcmp" over "regex.match" can
have far-reaching consequences, but I'm more concerned about the
/potential/ creature feep: "Oh, we do /this/ and /this/ with
svn:mime-type, why not check it for colour and taste, too." Somewhat
overdone, perhaps, but it's /so/ easy to fall into the trap of just
adding a little something to an existing feature.

>> (Oh, we don't properly interpret the Unicode end-of-line code point in
>> UTF-8 files.)
> End of line codepoint? Are you talking about U+2028 and U+2029?

Those would appear to be the ones, yes. I don't know offhand if the
paragraph separator implies end-of-line.
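
For illustration, here is a small sketch of why byte-oriented line splitting misses these two code points. Both encode to three-byte sequences in UTF-8, so a scan for LF or CR bytes never sees them. The sample string is made up, and the behaviour shown is Python's, not Subversion's: str.splitlines treats both U+2028 and U+2029 as line boundaries, while bytes.splitlines does not.

```python
LINE_SEPARATOR = "\u2028"       # U+2028 LINE SEPARATOR
PARAGRAPH_SEPARATOR = "\u2029"  # U+2029 PARAGRAPH SEPARATOR

# Made-up sample text containing both separators and an ordinary newline.
sample = "one" + LINE_SEPARATOR + "two" + PARAGRAPH_SEPARATOR + "three\nfour"

# Splitting decoded text: Python's str.splitlines honours both code points.
text_lines = sample.splitlines()

# Splitting the UTF-8 bytes: only the ASCII newline is recognised.
byte_lines = sample.encode("utf-8").splitlines()

if __name__ == "__main__":
    print(LINE_SEPARATOR.encode("utf-8"))       # three bytes: e2 80 a8
    print(PARAGRAPH_SEPARATOR.encode("utf-8"))  # three bytes: e2 80 a9
    print(len(text_lines), "lines when split as text")
    print(len(byte_lines), "lines when split as bytes")
```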

-- Brane

Received on 2009-11-13 04:27:44 CET

This is an archived mail posted to the Subversion Dev mailing list.