Re: Classifying files as binary or text

From: Mark Phippard <markphip_at_gmail.com>
Date: Thu, 12 Nov 2009 22:26:40 -0500

On Thu, Nov 12, 2009 at 10:04 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>> Mike Samuel wrote:
>>> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>>>
>>>> What that amounts to is that, for completeness and correctness, you'd
>>>> have to carefully catalogue all the media types and applicable
>>>> attributes, and make complex decisions based on that data; *and* keep it
>>>> up to date, which apparently is not as easy as one would like (e.g., see
>>>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>>>> the goal would justify the ongoing effort involved.
>>>>
>>>
>>> Why?  Why not just take into account the charset mime-type parameter
>>> which can only be present on texty types?
>>> I'm not suggesting this as an ultimate solution, just an incremental
>>> improvement of an existing feature that is already well documented and
>>> understood in other domains.
>>>
>>
>> What do you do if the charset attribute isn't present? You either fall
>> back to the current "broken" behaviour, or you interpret the media type.
>
> The current proposal outlined in the first mail of this thread is to
> classify a file as texty if
> (1) the current "broken" behavior says so
> (2) OR if there is a charset attribute present regardless of value.

Let me make sure I understand your proposal. Suppose I have an XML file
that is encoded in UTF-16, and I set the svn:mime-type property to
application/xml;charset=utf16. You would propose that we treat this as
text even though we know that SVN's contextual merging algorithm, in
its current form, cannot handle it properly? So basically we would
have to tell users "don't do that"?

If that is not your proposal, then how do you plan to handle valid
charset values that identify encodings Subversion cannot handle?
Also, what validation of charset values, if any, would you put in
place? Could I enter a made-up value for charset?

I do agree in principle that the presence of charset could be a
reasonable indicator of "textness" but that does not change the fact
that Subversion is not presently capable of managing all types of text
files.
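
To make the rule under discussion concrete, here is a minimal sketch of
it in Python. It is an illustration only, not Subversion code: the
function names are hypothetical, and the "current" check is simplified
to "the media type starts with text/", which is an assumption standing
in for whatever the existing behaviour actually is.

```python
def parse_mime_type(value):
    """Split 'type/subtype; key=value' into (media_type, params)."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params


def is_text_current(value):
    # Assumption: simplified stand-in for the existing classification.
    media_type, _ = parse_mime_type(value)
    return media_type.startswith("text/")


def is_text_proposed(value):
    # Rule (1) OR rule (2): the existing check says text, or a charset
    # parameter is present, regardless of its value.
    _, params = parse_mime_type(value)
    return is_text_current(value) or "charset" in params
```

Under this sketch, application/xml;charset=utf16 is classified as text,
and so is application/xml;charset=made-up, because rule (2) never looks
at the charset value. Those are exactly the two cases asked about above.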

-- 
Thanks
Mark Phippard
http://markphip.blogspot.com/
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417381
Received on 2009-11-13 04:27:03 CET

This is an archived mail posted to the Subversion Dev mailing list.