Re: Classifying files as binary or text

From: Mark Phippard <markphip_at_gmail.com>
Date: Fri, 13 Nov 2009 08:45:52 -0500

On Thu, Nov 12, 2009 at 11:09 PM, Branko Čibej <brane_at_xbc.nu> wrote:

>> And you're concerned that there might be similar code lurking in many
>> repositories but with a mime-type like application/xml;charset=UTF-16
>> which would suddenly break?
>>
>
> I believe that's what he meant, yes; existing files lurking, or new
> files being created. Mark, correct me if I misunderstood.

I was not considering pre-existing values where suddenly it would
break, but that is also a good point. For me, the fundamental problem
is that we do not care what the charset is, and therefore we should
not build a feature that is based on it. We want to know if it is OK
to treat the file as text when merging. Right now, that is just a
boolean question, and in the future perhaps it might just grow to
include a couple more choices. But we do not care if it is ASCII,
Latin2 or UTF-8. We just want to know if we can treat it as text.

From the point of view of a user, if we have a property that asks for
charset, it would make perfect sense to enter proper values like
UTF-16 if that is in fact the encoding being used in the file. I do
not think we can tell users that they have to leave off charsets that
we cannot treat as text, so that means we have the maintenance burden
of recognizing every charset we cannot support. As I have also said,
I think it will open the door for future bug reports and questions
from people that are expecting us to actually implement behaviors
based on the charset. For example, if a file is listed as ASCII, and
then someone converts it to UTF-8 and never updates the property, we
are not going to complain -- it is still text for us. From the point
of view of the user, they might not understand why we did not warn
them that the declared charset and the actual encoding are different.
I have certainly viewed plenty of web pages that said they were ASCII
but contained UTF-8 that the browser did not render properly.
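
To make the mismatch concrete, here is a minimal sketch in Python of the
kind of check a user might expect from a charset property. It is not
something Subversion does; the function name and the sample content are
made up for illustration.

```python
def matches_declared_charset(data: bytes, declared: str) -> bool:
    """Return True if data decodes cleanly under the declared charset.

    Hypothetical helper: it only illustrates the validation a user might
    expect once a charset is recorded on a file.
    """
    try:
        data.decode(declared)
    except UnicodeDecodeError:
        return False
    return True


# A file declared as ASCII that was later re-saved as UTF-8.
content = "naïve café".encode("utf-8")

declared_ok = matches_declared_charset(content, "ascii")
actual_ok = matches_declared_charset(content, "utf-8")
```

Here declared_ok is False and actual_ok is True, yet the file is still
perfectly mergeable as text, which is the only thing we need to know. An
unrecognized charset name would raise LookupError in this sketch, a small
taste of the maintenance burden described above.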

Since the thing we care about is whether or not we can treat it as
text, I think that is the sort of property we should add. If the
property is not present, I would fall back on our current algorithm to
determine textness. I agree with Branko's proposal that it makes
sense to implement a solution that could be expanded in the future if
there was a need. So I would not add a boolean property.
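
A rough sketch of that lookup order, in Python, might look like the
following. The property name "example:merge-mode" and its values are
invented for illustration, and the mime-type rule is only an approximation
of the current behavior (anything outside text/* is binary, apart from two
text-based image formats).

```python
from enum import Enum


class MergeMode(Enum):
    """Hypothetical property values; an enum leaves room for more later."""
    TEXT = "text"
    BINARY = "binary"


def mime_type_is_binary(mime_type: str) -> bool:
    # Approximation of the existing rule: ignore parameters such as
    # ";charset=...", then treat everything outside text/* as binary,
    # except two image formats that are really text.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("text/"):
        return False
    return media_type not in ("image/x-xbitmap", "image/x-xpixmap")


def treat_as_text(props: dict) -> bool:
    # 1. An explicit override wins. "example:merge-mode" is a made-up name.
    override = props.get("example:merge-mode")
    if override is not None:
        return MergeMode(override) is MergeMode.TEXT
    # 2. Otherwise fall back on the mime-type based decision.
    mime_type = props.get("svn:mime-type")
    if mime_type is None:
        return True  # no mime-type set: treated as text
    return not mime_type_is_binary(mime_type)
```

With this sketch, a file carrying svn:mime-type
application/xml;charset=UTF-16 is binary by default, and becomes mergeable
as text only when the user explicitly sets the override. The charset itself
is never consulted.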

I do not think Subversion needs to be modified to automatically set
this property (other than via the existing auto props mechanism). I
would just leave it as a property that a user could add to files to
force Subversion to treat them as text.

-- 
Thanks
Mark Phippard
http://markphip.blogspot.com/
------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417545

Received on 2009-11-13 14:46:15 CET

In reply to: Branko Cibej: "Re: Classifying files as binary or text"