Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 20:11:43 -0800

2009/11/12 Branko Čibej <brane_at_xbc.nu>:
> Mike Samuel wrote:
>> 2009/11/12 Mark Phippard <markphip_at_gmail.com>:
>>
>>> On Thu, Nov 12, 2009 at 10:04 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>>>
>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>
>>>>> Mike Samuel wrote:
>>>>>
>>>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>>>
>>>>>>
>>>>>>> What that amounts to is that, for completeness and correctness, you'd
>>>>>>> have to carefully catalogue all the media types and applicable
>>>>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>>>>> the goal would justify the ongoing effort involved.
>>>>>>>
>>>>>>>
>>>>>> Why? Why not just take into account the charset mime-type parameter
>>>>>> which can only be present on texty types?
>>>>>> I'm not suggesting this as an ultimate solution, just an incremental
>>>>>> improvement of an existing feature that is already well documented and
>>>>>> understood in other domains.
>>>>>>
>>>>>>
>>>>> What do you do if the charset attribute isn't present? You either fall
>>>>> back to the current "broken" behaviour, or you interpret the media type.
>>>>>
>>>> The current proposal outlined in the first mail of this thread is to
>>>> classify a file as texty if
>>>> (1) the current "broken" behavior says so
>>>> (2) OR if there is a charset attribute present regardless of value.
>>>>
>>> Let me make sure I understand your proposal. Suppose I have an XML file
>>> that is encoded with UTF-16. I set the svn:mime-type property to have
>>> a value of application/xml;charset=utf16. You would propose that we
>>> treat this as text even though we know that SVN's contextual merging
>>> algorithm in its current form cannot handle this properly? So
>>> basically we would have to tell users "don't do that"?
>>>
>>
>> I agree that UTF-16 and UTF-32 handling needs to be addressed by any
>> proposal, that it might introduce backwards compatibility
>> problems if handled incorrectly, and that an entirely new property would
>> side-step such backwards compatibility concerns.
>>
>> Let me make sure I understand current behavior before I comment.
>>
>> What happens if I do the following
>>
>> python -c 'print u"Hello, World!".encode("UTF-16")' > foo.txt
>> svn add foo.txt
>> svn propset svn:mime-type 'text/plain;charset=UTF-16' foo.txt
>>
>> Am I right in guessing that that gets a mime-type of
>> application/octet-stream at svn add time
>
> Yes
>
>> because it contains a byte outside [\x01-\x7e],
>
> The algorithm isn't /quite/ as simple as that, but yes.
>
>> which then gets overridden in the next step?
>> And that is unlikely to be problematic now because the user would
>> notice when they set the mime-type to text/plain;charset=UTF-16 that
>> the diff breaks?
>>
>
> The diff contains a mixture of multi-byte and wide-character strings.
> Depending on whether your UTF-16 is big- or little-endian, it may
> incorrectly split lines in the middle of a 16-bit code sequence.

I thought BOMs were widely used with UTF-16 for this very reason. Is
that not the case?
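
To make the endianness point concrete, here is a minimal Python 3 sketch of why a BOM alone does not help a byte-oriented line splitter. It assumes the tool ends a line right after every 0x0A byte; that is a simplification for illustration, not Subversion's actual diff code, and the helper names are hypothetical.

```python
import codecs

def lf_split_offsets(data):
    """Offsets just after each 0x0A byte, where a byte-oriented tool ends a line."""
    return [i + 1 for i, b in enumerate(data) if b == 0x0A]

def splits_on_code_unit_boundaries(data):
    """True if every such split falls on an even offset (a 16-bit boundary)."""
    return all(offset % 2 == 0 for offset in lf_split_offsets(data))

text = "Hello,\nWorld!\n"

# Big-endian: "\n" is 00 0A, so the split lands after a whole code unit.
be = text.encode("utf-16-be")

# Little-endian: "\n" is 0A 00, so the split lands in the middle of a code unit.
le = text.encode("utf-16-le")

# A BOM is two bytes long, so it tells a decoder the byte order
# but does not change where the 0x0A bytes sit relative to code units.
le_with_bom = codecs.BOM_UTF16_LE + le

print(splits_on_code_unit_boundaries(be))
print(splits_on_code_unit_boundaries(le))
print(splits_on_code_unit_boundaries(le_with_bom))
```

In other words, the BOM lets a UTF-16-aware reader pick the right byte order, but a splitter that only looks for single 0x0A bytes never consults it.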

>> And you're concerned that there might be similar code lurking in many
>> repositories but with a mime-type like application/xml;charset=UTF-16
>> which would suddenly break?
>>
>
> I believe that's what he meant, yes; existing files lurking, or new
> files being created. Mark, correct me if I misunderstood.

Ok, existing files, or new files created by existing tools.
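
For reference, the two-part rule quoted earlier in this thread can be sketched in Python 3 as below. The function names are hypothetical, and `legacy_is_text` is only an assumed stand-in for the current behaviour (treating `text/*` as text); the real check in Subversion may differ in detail.

```python
def has_charset_param(mime_type):
    """True if the media type carries a charset parameter, whatever its value."""
    params = mime_type.split(";")[1:]
    return any(p.strip().lower().startswith("charset=") for p in params)

def legacy_is_text(mime_type):
    """Assumed stand-in for the current rule: only text/* counts as text."""
    return mime_type.strip().lower().startswith("text/")

def proposed_is_text(mime_type):
    """Rule (1) OR rule (2) from the proposal."""
    return legacy_is_text(mime_type) or has_charset_param(mime_type)

for value in ("text/plain",
              "application/octet-stream",
              "application/xml;charset=UTF-16"):
    print(value, legacy_is_text(value), proposed_is_text(value))
```

Under these assumptions, `application/xml;charset=UTF-16` is the case that changes classification: it is binary under the stand-in for the current rule and texty under the proposal, which is exactly the kind of file the compatibility concern is about.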

> -- Brane
>
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417400
Received on 2009-11-13 05:11:55 CET
