
Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 19:45:51 -0800

2009/11/12 Branko Čibej <brane_at_xbc.nu>:
> Mike Samuel wrote:
>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>
>>> Mike Samuel wrote:
>>>
>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>
>>>>
>>>>> What that amounts to is that, for completeness and correctness, you'd
>>>>> have to carefully catalogue all the media types and applicable
>>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>>> the goal would justify the ongoing effort involved.
>>>>>
>>>>>
>>>> Why?  Why not just take into account the charset mime-type parameter
>>>> which can only be present on texty types?
>>>> I'm not suggesting this as an ultimate solution, just an incremental
>>>> improvement of an existing feature that is already well documented and
>>>> understood in other domains.
>>>>
>>>>
>>> What do you do if the charset attribute isn't present? You either fall
>>> back to the current "broken" behaviour, or you interpret the media type.
>>>
>>
>> The current proposal outlined in the first mail of this thread is to
>> classify a file as texty if
>> (1) the current "broken" behavior says so
>> (2) OR if there is a charset attribute present regardless of value.
>>
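
For concreteness, the two-rule classification quoted above could be
sketched roughly as follows. This is only an illustration: the function
names are hypothetical, and legacy_is_text() is a stand-in that assumes
the existing check amounts to a "text/" prefix test, which may not match
the real implementation exactly.

```python
def has_charset_param(mime_type: str) -> bool:
    """Return True if the media type carries a charset parameter, whatever its value."""
    # Parameters follow the type/subtype and are separated by ';'.
    _, *params = mime_type.split(";")
    for param in params:
        name, _, _ = param.partition("=")
        if name.strip().lower() == "charset":
            return True
    return False


def legacy_is_text(mime_type: str) -> bool:
    """Stand-in for the existing behaviour (assumed: a 'text/' prefix test)."""
    return mime_type.strip().lower().startswith("text/")


def is_texty(mime_type: str) -> bool:
    """Rule (1) OR rule (2) from the proposal."""
    return legacy_is_text(mime_type) or has_charset_param(mime_type)
```

With this sketch, "application/xml; charset=UTF-8" is classified as texty
while "application/octet-stream" is not, and anything the existing check
already accepts stays accepted.
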
>
> Your proposal and Mark's only differ in extending the parsing of
> svn:mime-type vs. introducing a new property. Mine adds the option of

Agreed.

> having more fine-grained choices about diff algorithms in the future
> without actually having to know anything about specific media types, but
> that's just "future-proofing", not an immediate requirement.

Ok. Sorry to go backwards. Can you state your proposal again?

> On the surface, the choice appears simple: roll a die, or flip a coin,
> or (heh) have a duel. But Hyrum makes a very good example of the working
> copy library nightmare. Even picking "strcmp" over "regex.match" can
> have far-reaching consequences, but I'm more concerned about the
> /potential/ creature feep: "Oh, we do /this/ and /this/ with
> svn:mime-type, why not check it for colour and taste, too." Somewhat
> overdone, perhaps, but it's /so/ easy to fall into the trap of just
> adding a little something to an existing feature.

This is an argument against overriding and I acknowledge that
overriding has all the problems you describe.
No one has yet explained what is wrong with my assertion that my
proposal is not overriding. Please do so before raising the spectre of
overriding again.

>>> (Oh, we don't properly interpret the Unicode end-of-line code point in
>>> UTF-8 files.)
>>>
>>
>> End of line codepoint?  Are you talking about U+2028 and U+2029?
>>
>
> Those would appear to be the ones, yes. I don't know offhand if the
> paragraph separator implies end-of-line.

Newline support is language specific.
Unicode newlines generally include those two, U+0085, and the Latin-1
sequences \r\n, \r, \n.
The Unicode Consortium's current recommendation on Unicode support for
source code is to treat them all as line-breaking.
Many on the ECMAScript committee generally consider it a mistake to
have included those: one of the reasons why JSON can never be a subset
of ECMAScript is that U+2028 can appear unescaped in a JSON string but
not in an ECMAScript string.
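
As a rough illustration of treating all of those as line breaks, here is
a minimal Python sketch. The names LINE_BREAK and split_lines are made up
for this example, and the set of terminators is just the one listed
above, not anything Subversion does today. The last two lines show the
JSON point: Python's json module accepts a raw U+2028 inside a string.

```python
import json
import re

# "\r\n" is listed first so it is consumed as one terminator, not two.
LINE_BREAK = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")


def split_lines(text: str) -> list[str]:
    """Split text on CRLF, CR, LF, U+0085, U+2028 and U+2029."""
    return LINE_BREAK.split(text)


# A JSON document whose string value contains an unescaped U+2028.
raw_json = '"a\u2028b"'
decoded = json.loads(raw_json)
```

Note that Python's built-in str.splitlines() recognises a wider set of
separators (form feed and vertical tab, among others), which is why the
sketch spells out its own pattern.
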

> -- Brane
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417388
Received on 2009-11-13 04:46:08 CET

This is an archived mail posted to the Subversion Dev mailing list.