Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 19:04:36 -0800

2009/11/12 Branko Čibej <brane_at_xbc.nu>:
> Mike Samuel wrote:
>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>
>>> What that amounts to is that, for completeness and correctness, you'd
>>> have to carefully catalogue all the media types and applicable
>>> attributes, and make complex decisions based on that data; *and* keep it
>>> up to date, which apparently is not as easy as one would like (e.g., see
>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>> the goal would justify the ongoing effort involved.
>>>
>>
>> Why? Why not just take into account the charset mime-type parameter
>> which can only be present on texty types?
>> I'm not suggesting this as an ultimate solution, just an incremental
>> improvement of an existing feature that is already well documented and
>> understood in other domains.
>>
>
> What do you do if the charset attribute isn't present? You either fall
> back to the current "broken" behaviour, or you interpret the media type.

The current proposal outlined in the first mail of this thread is to
classify a file as texty if
(1) the current "broken" behavior says so
(2) OR if there is a charset attribute present regardless of value.
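
As a rough illustration of that rule (a sketch only, not Subversion's
actual code), the check could look like the following. The helper names
are hypothetical, and legacy_is_text() is an assumed stand-in for the
existing behavior, modeled here as a plain "text/" prefix test.

```python
def legacy_is_text(mime_type):
    # Assumption: stand-in for the current check, modeled as "the media
    # type starts with text/". The real check may differ in detail.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/")


def has_charset_param(mime_type):
    # Naive parameter scan: splits on ';' and ignores quoted-string
    # values that might themselves contain ';'.
    for param in mime_type.split(";")[1:]:
        name = param.split("=", 1)[0].strip().lower()
        if name == "charset":
            return True
    return False


def is_texty(mime_type):
    # (1) the existing check says text, OR
    # (2) a charset parameter is present, whatever its value.
    return legacy_is_text(mime_type) or has_charset_param(mime_type)
```

With this, "application/xhtml+xml; charset=UTF-8" is classified as texty
while "image/gif" and a bare "application/xml" are not.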

> [...]
>
>> svn diff is part of the core and it is not just a presentation tool so
>> you can't just add a --force flag and rely on users to do what they
>> want. It is used to generate patches that are uploaded by tools like
>> rietveld: codereview.appspot.com.
>>
>
> I don't understand; which part of "svn diff" needs to know about
> charsets? Ignoring the known problem with not being able to interpret

It doesn't. It needs to have a concept of whether a file is text or
binary. That is the problem I am trying to address.
$ svn diff foo.gif
does and should continue to do a binary diff, printing something like
"foo.gif r1 and foo.gif r2 differ"

> UTF-16 and UTF-32 (which can be solved without relying on the value of
> svn:mime-type or any other property), and ignoring EBCDIC, then as far
> as I know, "svn diff" output will remain valid regardless of the
> particular text encoding used in the file. As far as I know, all other
> single-byte and multi-byte encodings use the ASCII control codes to
> indicate line endings, and are constructed so that these cannot be
> misinterpreted. That's all that really matters to "svn diff".
>
> (Oh, we don't properly interpret the Unicode end-of-line code point in
> UTF-8 files.)

End of line codepoint? Are you talking about U+2028 and U+2029?
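
For what it's worth, the byte-level claim in the quoted paragraph is easy
to check with a small sketch. The helper name is made up for
illustration; it asks whether splitting the encoded bytes on 0x0A gives
the same lines as splitting the decoded text on LF.

```python
def survives_bytewise_split(text, encoding):
    # Split the raw bytes on the LF byte (0x0A), decode each piece, and
    # compare with the lines obtained by splitting the text itself.
    pieces = text.encode(encoding).split(b"\n")
    try:
        decoded = [piece.decode(encoding) for piece in pieces]
    except UnicodeDecodeError:
        # A piece was cut in the middle of a code unit.
        return False
    return decoded == text.split("\n")


japanese = "日本語\nテキスト\n"

# ASCII-compatible encodings keep LF as a single, unambiguous byte.
for enc in ("utf-8", "euc_jp", "shift_jis"):
    print(enc, survives_bytewise_split(japanese, enc))

# UTF-16 encodes LF as two bytes, so a byte-wise split misaligns.
print("utf-16-le", survives_bytewise_split("a\nb", "utf-16-le"))

# U+2028 LINE SEPARATOR in UTF-8 contains no 0x0A byte at all, so a
# byte-oriented line splitter never sees it as a line ending.
print(b"\n" in "a\u2028b".encode("utf-8"))
```

The first three encodings round-trip, the UTF-16 case does not, and the
last line shows why U+2028 goes unnoticed in UTF-8 files.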

>>>> This is exactly the kind of
>>>> overloading that I think should be avoided.
>>>>
>>> That is why I proposed that we retain svn:mime-type, since it is
>>> possible to encode all the MIME attributes in that property, and thus
>>> end up using that property to answer questions about the file's contents,
>>> whilst using a new one to answer questions about how to merge it.
>>>
>>
>> Sure. I want to make svn:mime-type incrementally more accurate by
>> taking into account the fact that a charset parameter in a mime-type
>> implies textiness.
>>
>>
>>>> Yes, merging and diffing
>>>> are definitely related concerns, but doing type-sensitive merging is a
>>>> big change
>>>>
>>> Not really. We do type-sensitive merging today, albeit in a very limited
>>> way. And I prefer to _keep_ it as limited as makes sense.
>>>
>>
>> Ah, cool. I was unaware of that.
>>
>
> Caveat lector: "We merge differently based on whether we believe the
> file to be text or binary." That's the "very limited way" I was
> referring to. Nothing fancier.

Gotcha.

>>>> and I think that it would be good not to introduce new
>>>> properties to enable it until there's significant agreement on goals
>>>> and how to stage rollout.
>>>>
>>>>
>>> Sure. Hence this discussion.
>>>
>>
>> Yes, but I want to fix a specific bug. I'd rather not gate fixing
>> issue 1002 on the design of new features that I'm incompetent to
>> debate.
>>
>
> What new features would these be? That issue is about finding a way to
> be more precise in the text vs. binary decision. It happens to
> explicitly mention more precise interpretation of MIME types, but that's
> just one, and not necessarily the best, solution.

Improving the text vs. binary distinction is all I want to do.

> -- Brane
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417367
Received on 2009-11-13 04:04:50 CET
