Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 20:11:43 -0800

2009/11/12 Branko Čibej <brane_at_xbc.nu>:
> Mike Samuel wrote:
>> 2009/11/12 Mark Phippard <markphip_at_gmail.com>:
>>> On Thu, Nov 12, 2009 at 10:04 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>> Mike Samuel wrote:
>>>>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>>>>> What that amounts to is that, for completeness and correctness, you'd
>>>>>>> have to carefully catalogue all the media types and applicable
>>>>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>>>>> the goal would justify the ongoing effort involved.
>>>>>> Why?  Why not just take into account the charset mime-type parameter
>>>>>> which can only be present on texty types?
>>>>>> I'm not suggesting this as an ultimate solution, just an incremental
>>>>>> improvement of an existing feature that is already well documented and
>>>>>> understood in other domains.
>>>>> What do you do if the charset attribute isn't present? You either fall
>>>>> back to the current "broken" behaviour, or you interpret the media type.
>>>> The current proposal outlined in the first mail of this thread is to
>>>> classify a file as texty if
>>>> (1) the current "broken" behavior says so
>>>> (2) OR if there is a charset attribute present regardless of value.
>>> So that I can understand your proposal.  Suppose I have an XML file
>>> that is encoded with UTF-16.  I set the svn:mime-type property to have
>>> a value of application/xml;charset=utf16.  You would propose that we
>>> treat this as text even though we know that SVN's contextual merging
>>> algorithm in its current form cannot handle this properly?  So
>>> basically we would have to tell users "don't do that"?
>> I agree UTF-16 and UTF-32 handling need to be addressed by any
>> proposal, and that they might introduce backwards compatibility
>> problems if dealt with wrong, and that an entirely new property would
>> side-step such backwards compatibility concerns.
>> Let me make sure I understand current behavior before I comment.
>> What happens if I do the following
>> python -c 'print u"Hello, World!".encode("UTF-16")' > /tmp/foo.txt
>> svn add foo.txt
>> svn propset svn:mime-type 'text/plain;charset=UTF-16' foo.txt
>> Am I right in guessing that that gets a mime-type of
>> application/octet-stream at svn add time
> Yes
>>  because it contains a byte outside [\x01-\x7e],
> The algorithm isn't /quite/ as simple as that, but yes.
>>  which then gets overridden in the next step?
>> And that is unlikely to be problematic now because the user would
>> notice when they set the mime-type to text/plain;charset=UTF-16 that
>> the diff breaks?
> The diff contains a mixture of multi-byte and wide-character strings.
> Depending on whether your UTF-16 is big- or little-endian, it may
> incorrectly split lines in the middle of a 16-bit code sequence.

I thought BOMs were widely used with UTF-16 for this very reason. Is
that not the case?
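
To make the line-splitting concern concrete, here is a rough Python sketch. It is not Subversion's diff code; detect_utf16_bom and split_lines_bytewise are hypothetical helpers written only for this illustration. It checks for a leading BOM and shows how a splitter that breaks on every 0x0A byte cuts through a 16-bit code unit.

```python
import codecs


def detect_utf16_bom(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM, else None.

    Hypothetical helper. Note that a UTF-32-LE BOM also begins with FF FE,
    so a real detector would have to check for UTF-32 first.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def split_lines_bytewise(data: bytes):
    """Naive splitter: break on every 0x0A byte, ignoring the encoding."""
    return data.split(b"\n")


# U+010A encodes as the bytes 0A 01 in little-endian UTF-16, so the
# character itself contains a 0x0A byte that is not a line feed.
text = "a\u010ab\n"
sample = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

encoding = detect_utf16_bom(sample)
bytewise = split_lines_bytewise(sample)
decoded = sample[2:].decode(encoding).split("\n")

print(encoding)       # which BOM was found
print(len(bytewise))  # pieces produced by the byte-wise splitter
print(len(decoded))   # pieces produced after decoding properly
```

A BOM tells a reader which byte order to decode with, but it only helps code that actually decodes before looking for line breaks. A byte-oriented splitter still breaks in the wrong places whether or not the BOM is there.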

>> And you're concerned that there might be similar code lurking in many
>> repositories but with a mime-type like application/xml;charset=UTF-16
>> which would suddenly break?
> I believe that's what he meant, yes; existing files lurking, or new
> files being created. Mark, correct me if I misunderstood.

Ok, existing files, or new files created by existing tools.
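
For reference, the proposal from earlier in the thread (texty if the current behavior says so, OR if a charset attribute is present) could be sketched roughly as below. This is only an illustration: parse_mime_type, is_text_current and is_text_proposed are made-up names, and the "current" rule here is a simplified stand-in that just checks for a text/ prefix, which is an assumption rather than the exact check Subversion performs.

```python
def parse_mime_type(value: str):
    """Split an svn:mime-type value into (media_type, params).

    Hypothetical helper: a minimal parser, not a full RFC 2045 implementation.
    """
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        name, sep, val = part.partition("=")
        if sep:
            params[name.strip().lower()] = val.strip().strip('"')
    return media_type, params


def is_text_current(value: str) -> bool:
    """Simplified stand-in for the existing rule (assumption: text/* only)."""
    media_type, _ = parse_mime_type(value)
    return media_type.startswith("text/")


def is_text_proposed(value: str) -> bool:
    """Proposed rule: texty if the current rule says so, or if any
    charset parameter is present, regardless of its value."""
    _, params = parse_mime_type(value)
    return is_text_current(value) or "charset" in params


for mime in ("text/plain;charset=UTF-16",
             "application/xml;charset=UTF-16",
             "application/octet-stream"):
    print(mime, is_text_current(mime), is_text_proposed(mime))
```

The middle case is the one at issue: under this sketch application/xml;charset=UTF-16 flips from binary to text, which is exactly where existing files or existing tools could be surprised.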

> -- Brane

Received on 2009-11-13 05:11:55 CET

This is an archived mail posted to the Subversion Dev mailing list.