Re: Classifying files as binary or text

From: Branko Cibej <brane_at_xbc.nu>
Date: Fri, 13 Nov 2009 05:09:12 +0100

Mike Samuel wrote:
> 2009/11/12 Mark Phippard <markphip_at_gmail.com>:
>
>> On Thu, Nov 12, 2009 at 10:04 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>>
>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>
>>>> Mike Samuel wrote:
>>>>
>>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>>
>>>>>
>>>>>> What that amounts to is that, for completeness and correctness, you'd
>>>>>> have to carefully catalogue all the media types and applicable
>>>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>>>> the goal would justify the ongoing effort involved.
>>>>>>
>>>>>>
>>>>> Why? Why not just take into account the charset mime-type parameter
>>>>> which can only be present on texty types?
>>>>> I'm not suggesting this as an ultimate solution, just an incremental
>>>>> improvement of an existing feature that is already well documented and
>>>>> understood in other domains.
>>>>>
>>>>>
>>>> What do you do if the charset attribute isn't present? You either fall
>>>> back to the current "broken" behaviour, or you interpret the media type.
>>>>
>>> The current proposal outlined in the first mail of this thread is to
>>> classify a file as texty if
>>> (1) the current "broken" behavior says so
>>> (2) OR if there is a charset attribute present regardless of value.
>>>
>> So that I can understand your proposal: suppose I have an XML file
>> that is encoded in UTF-16, and I set the svn:mime-type property to
>> application/xml;charset=utf16. You would propose that we treat this
>> as text even though we know that SVN's contextual merging algorithm,
>> in its current form, cannot handle it properly? So basically we
>> would have to tell users "don't do that"?
>>
>
> I agree that UTF-16 and UTF-32 handling needs to be addressed by any
> proposal, that it might introduce backwards-compatibility problems
> if handled incorrectly, and that an entirely new property would
> side-step such backwards-compatibility concerns.
>
> Let me make sure I understand current behavior before I comment.
>
> What happens if I do the following:
>
> python -c 'print u"Hello, World!".encode("UTF-16")' > foo.txt
> svn add foo.txt
> svn propset svn:mime-type 'text/plain;charset=UTF-16' foo.txt
>
> Am I right in guessing that that gets a mime-type of
> application/octet-stream at svn add time

Yes

> because it contains a byte outside [\x01-\x7e],

The algorithm isn't /quite/ as simple as that, but yes.
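
As a rough illustration of the general shape of such a heuristic (a
sketch under stated assumptions, not Subversion's actual code): treat
any NUL byte as a sign of binary data, and otherwise look at what
fraction of the leading block falls outside printable ASCII and common
whitespace. The function name and the 15% threshold below are
placeholders chosen for the example.

```python
def looks_binary(block: bytes, threshold: float = 0.15) -> bool:
    """Guess whether a leading block of file content is binary.

    Assumed rules (illustrative only):
      * an empty block counts as text;
      * any NUL byte means binary;
      * otherwise, binary if the share of bytes outside printable ASCII
        and common whitespace exceeds `threshold` (a placeholder value).
    """
    if not block:
        return False
    if 0 in block:
        return True
    suspicious = sum(
        1 for b in block
        if b < 0x07 or 0x0D < b < 0x20 or b > 0x7F
    )
    return suspicious / len(block) > threshold


ascii_text = b"Hello, World!\n"
latin1_text = b"caf\xe9 au lait\n"            # one byte above 0x7E
utf16_text = "Hello, World!\n".encode("utf-16")  # full of NUL bytes

print(looks_binary(ascii_text))
print(looks_binary(latin1_text))
print(looks_binary(utf16_text))
```

Under these assumptions a single high byte in otherwise ASCII content
does not flip the result, whereas the UTF-16 file from the example is
flagged as binary straight away because of its NUL bytes.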

> which then gets overridden in the next step?
> And that is unlikely to be problematic now because the user would
> notice when they set the mime-type to text/plain;charset=UTF-16 that
> the diff breaks?
>

The diff then contains a mixture of multi-byte and wide-character
strings. Depending on whether your UTF-16 is big- or little-endian, it
may also split lines incorrectly, in the middle of a 16-bit code unit.
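
The endianness point can be shown with a few lines of Python. This is
only a sketch of what a byte-oriented, encoding-unaware line splitter
does to UTF-16 data; split_after_lf is a made-up helper for the
example, not anything in Subversion.

```python
def split_after_lf(data: bytes) -> list[bytes]:
    """Split after every 0x0A byte, keeping the terminator, the way a
    byte-oriented line splitter that knows nothing about UTF-16 would."""
    lines, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x0A:
            lines.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        lines.append(data[start:])
    return lines


text = "ab\ncd"

# Little-endian: LF is the byte pair 0A 00, so the cut lands between the
# two bytes of the LF code unit and both pieces end up with odd lengths.
le_lines = split_after_lf(text.encode("utf-16-le"))
print([len(line) for line in le_lines])   # [5, 5]

# Big-endian: LF is 00 0A, so the cut lands after the whole code unit
# and the pieces stay aligned on 16-bit boundaries.
be_lines = split_after_lf(text.encode("utf-16-be"))
print([len(line) for line in be_lines])   # [6, 4]

# Big-endian is not safe in general either: any code unit whose first
# byte is 0x0A (here U+0A05) is cut in half.
print(split_after_lf("\u0a05".encode("utf-16-be")))
```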

> And you're concerned that there might be similar code lurking in many
> repositories but with a mime-type like application/xml;charset=UTF-16
> which would suddenly break?
>

I believe that's what he meant, yes; existing files lurking, or new
files being created. Mark, correct me if I misunderstood.
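
To make the concern concrete, here is a small sketch of the proposed
rule next to a simplified stand-in for the current one. I'm assuming,
for the sake of the example, that the current check amounts to "the
media type starts with text/"; all three function names are
hypothetical.

```python
def is_text_current(mime_type: str) -> bool:
    # Simplified stand-in for today's behaviour (an assumption made for
    # this example): only text/* media types are treated as text.
    return mime_type.strip().lower().startswith("text/")


def has_charset(mime_type: str) -> bool:
    # True if any parameter after the media type is a charset, whatever
    # its value.
    params = mime_type.split(";")[1:]
    return any(p.strip().lower().startswith("charset=") for p in params)


def is_text_proposed(mime_type: str) -> bool:
    # The proposal from this thread: (1) the current rule says text,
    # OR (2) a charset attribute is present, regardless of value.
    return is_text_current(mime_type) or has_charset(mime_type)


for mt in ("text/plain",
           "application/octet-stream",
           "application/xml;charset=UTF-16"):
    print(mt, is_text_current(mt), is_text_proposed(mt))
```

The last case is the one at issue: it is binary under the simplified
current rule and text under the proposed one, so a file already
carrying that property would change classification without anyone
touching it.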

-- Brane

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417397
Received on 2009-11-13 05:09:27 CET
