Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 19:58:45 -0800

2009/11/12 Mark Phippard <markphip_at_gmail.com>:
> On Thu, Nov 12, 2009 at 10:04 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>> Mike Samuel wrote:
>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>
>>>>> What that amounts to is that, for completeness and correctness, you'd
>>>>> have to carefully catalogue all the media types and applicable
>>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>>> the goal would justify the ongoing effort involved.
>>>>>
>>>>
>>>> Why? Why not just take into account the charset mime-type parameter
>>>> which can only be present on texty types?
>>>> I'm not suggesting this as an ultimate solution, just an incremental
>>>> improvement of an existing feature that is already well documented and
>>>> understood in other domains.
>>>>
>>>
>>> What do you do if the charset attribute isn't present? You either fall
>>> back to the current "broken" behaviour, or you interpret the media type.
>>
>> The current proposal outlined in the first mail of this thread is to
>> classify a file as texty if
>> (1) the current "broken" behavior says so
>> (2) OR if there is a charset attribute present regardless of value.
>
> So that I can understand your proposal. Suppose I have an XML file
> that is encoded with UTF-16. I set the svn:mime-type property to have
> a value of application/xml;charset=utf16. You would propose that we
> treat this as text even though we know that SVN contextual merging
> algorithm in its current form cannot handle this properly? So
> basically we would have to tell users "don't do that"?

I agree that UTF-16 and UTF-32 handling needs to be addressed by any
proposal, that it might introduce backwards-compatibility problems if
handled incorrectly, and that an entirely new property would
side-step such backwards-compatibility concerns.

Let me make sure I understand current behavior before I comment.

What happens if I do the following?

python -c 'print u"Hello, World!".encode("UTF-16")' > foo.txt
svn add foo.txt
svn propset svn:mime-type 'text/plain;charset=UTF-16' foo.txt

Am I right in guessing that that gets a mime-type of
application/octet-stream at svn add time because it contains a byte
outside [\x01-\x7e], which then gets overridden in the next step?
And that is unlikely to be problematic now because the user would
notice when they set the mime-type to text/plain;charset=UTF-16 that
the diff breaks?
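
To make the guess above concrete, here is a minimal Python 3 sketch that
encodes the same string and checks for bytes outside [\x01-\x7e]. The
function name is hypothetical, and the byte range is only the guess stated
above, not Subversion's actual detection code.

```python
def has_byte_outside_guessed_text_range(data: bytes) -> bool:
    """Return True if any byte falls outside the guessed range 0x01-0x7e."""
    return any(b < 0x01 or b > 0x7E for b in data)


# Python's "utf-16" codec writes a byte-order mark followed by two bytes per
# character, so the output contains BOM bytes (0xff, 0xfe) and NUL bytes.
utf16_bytes = "Hello, World!".encode("utf-16")
ascii_bytes = "Hello, World!".encode("ascii")

print(has_byte_outside_guessed_text_range(utf16_bytes))
print(has_byte_outside_guessed_text_range(ascii_bytes))
```

Under that guessed rule the UTF-16 file would be flagged as non-text at add
time, while the same string in plain ASCII would not.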

And you're concerned that there might be similar code lurking in many
repositories but with a mime-type like application/xml;charset=UTF-16
which would suddenly break?
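
For reference, the rule proposed earlier in this thread (texty if the
current behavior says so, OR if a charset attribute is present regardless of
value) could be sketched roughly as below. This is an illustration only: the
function names are hypothetical, and legacy_is_text() assumes the current
check amounts to "the media type starts with text/", which is a
simplification.

```python
def parse_mime_type(value: str):
    """Split 'type/subtype;name=value' into (media_type, params dict)."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            name, _, val = part.partition("=")
            params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


def legacy_is_text(media_type: str) -> bool:
    # Assumption: stands in for the current behavior; simplified.
    return media_type.startswith("text/")


def proposed_is_text(value: str) -> bool:
    """Texty if the legacy rule says so OR a charset parameter is present."""
    media_type, params = parse_mime_type(value)
    return legacy_is_text(media_type) or "charset" in params


for mime in ("text/plain",
             "application/xml;charset=UTF-16",
             "application/octet-stream"):
    print(mime, proposed_is_text(mime))
```

Note that the sketch never inspects the charset value, so
application/xml;charset=UTF-16 is classified as texty, which is exactly the
case raised in the question above.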

> If that is not your proposal, then how do you plan to handle valid
> charset values that are identifying encodings that Subversion cannot
> handle? Also, what if any validation of charset values would you put
> in place? Can I enter a made up value for charset?
>
> I do agree in principle that the presence of charset could be a
> reasonable indicator of "textness" but that does not change the fact
> that Subversion is not presently capable of managing all types of text
> files.
>
> --
> Thanks
>
> Mark Phippard
> http://markphip.blogspot.com/
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417393
Received on 2009-11-13 04:59:10 CET
