
Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 17:55:54 -0800

2009/11/12 Mark Phippard <markphip_at_gmail.com>:
> On Thu, Nov 12, 2009 at 7:59 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>> 2009/11/12 Mark Phippard <markphip_at_gmail.com>:
>>> On Thu, Nov 12, 2009 at 6:20 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>>>> Conclusions from the svn:charset thread that Mark pointed out:
>>>> (1) This proposal should not gate on svn:charset since it isn't yet
>>>> recognized as official
>>>> (2) We should avoid the term encoding in documentation of this feature.
>>>> (3) There may be some bad interactions between ";charset=" in
>>>> svn:mime-type and auto-props, but this proposal does not raise new
>>>> issues, and those issues are a result of an error (possibly since
>>>> fixed?) in auto-props.
>>>>
>>>>
>>>> From the svn:charset thread:
>>>>
>>>> Much of the early debate deals with svn:charset being non-standard and
>>>> non-approved.  I tend to agree with Stefan, that this proposal
>>>> shouldn't gate on svn:charset being approved so I suggest tabling
>>>> variant 1.
>>>
>>> Correct me if I am wrong, but the only real goal we have right now is
>>> to improve SVN's ability to tell itself "this is text" and I can do
>>> textual merging?
>>
>> That is correct.
>>
>>
>>> So why not just add an svn:text property that has a
>>> value of '*'.  The presence of the property means "treat this as
>>> text".
>>
>> To make sure I understand your counter-proposal, would a file be
>> treated as text if at least one of (svn:mime-type starts with "text/"
>> or matches the existing whitelist) OR (svn:text exists and is "*")?
>>
>> Or are you advocating dropping the first clause which is there for
>> backwards-compatibility?
>
> We would need to be backwards-compatible.  Any new property would
> exist so that a file with a mime-type of say application/xml could be
> treated as text.  But if there is no mime type or a text/* mime type
> it should also still be treated as text.
>
>>> My problem with charset is that it has implications that SVN does
>>> something based on the charset.  For example, maybe it creates an
>>> expectation that we validate the content of the file with the stated
>>> charset, or that we can convert the content if you change the charset.
>>>  Why use a property whose value has meaning if we do not do anything
>>> with that meaning?  I do not think it makes sense to drag in hook
>>> scripts or what other clients might do either, as there is nothing
>>> stopping people from adding their own charset property.
>>
>> I think a new property is warranted to avoid overloading meaning of an
>> existing one.
>> I don't think this qualifies as overloading though.
>> The svn:mime-type property is already linked to this determination,
>> and for backwards compatibility that should not change.
>> The concept of "is-textual" is linked to the "text/*" mime-type group,
>> which the current implementation takes into account, and to the
>> charset mime-type attribute in RFC 2046, which the current
>> implementation does not take into account; so I view this as an
>> attempt to fix an incomplete interpretation of an existing standard.
>
> I just think it is more complicated than you think.  For example, I
> assume you are aware that SVN cannot perform textual merges
> (currently) for UTF-16 or UTF-32 encoded files.  So if you have an
> svn:mime-type property of: application/xml;charset=utf16 then you have
> to parse that and know to treat it as binary.  So what do you do if you
> just get charset=foo or charset=xyz?  I think this takes us in the
> wrong direction.
>
> Then there is the other class of issues I raised.  For example, a file
> has a value of charset=ascii but the file content is really UTF-8, and
> someone thinks we should validate the content.
>
>>> So why not just make this simpler?  With an svn:text property you just
>>> have to change any routine that determines if a file is text to look
>>> for the presence of that property first, and then continue with the
>>> other checks if it is not found.
>>
>> Assuming you advocate the backwards compatible option in my question
>> above, I think multiplying properties unnecessarily is a move towards
>> greater complexity.
>
> I think we should go for something that is more specific.  Branko's
> ideas are fine too and would lay the groundwork for expanding it in
> the future.

I dislike the svn:type suggestion because it might hem in the
design of a complex new feature, and because I'm skeptical that
designing a whole new classification system for files will be an
improvement over mime-types.
And I would prefer to avoid adding a new property like svn:text just
to work around tricky but ultimately fixable UTF-16/32 issues.
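
To make the charset-based variant concrete, here is a rough Python
sketch of the decision it implies. This is only an illustration, not
Subversion code: the function names are hypothetical, and treating an
unrecognized charset (charset=foo) as text is an assumption made for
the sketch, since that is exactly the open question raised above.

```python
# Hypothetical sketch; none of these names exist in Subversion.

def parse_mime_type(value):
    """Split 'type/subtype; key=value' into (media_type, params)."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"').lower()
    return media_type, params


def is_text_by_charset(mime_type):
    """Return True if a file with this svn:mime-type would be merged as text."""
    if mime_type is None:
        return True  # no svn:mime-type at all: treated as text
    media_type, params = parse_mime_type(mime_type)
    if media_type.startswith("text/"):
        return True  # backwards-compatible rule, unchanged
    charset = params.get("charset")
    if charset is None:
        return False  # non-text type and no charset: binary, as before
    normalized = charset.replace("-", "").replace("_", "")
    # UTF-16 and UTF-32 cannot currently be merged textually.
    if normalized.startswith(("utf16", "utf32")):
        return False
    # Assumption: any other declared charset counts as text.
    return True
```

With this sketch, application/xml;charset=utf-8 is classified as text
while application/xml;charset=utf16 stays binary.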

That said, at the end of the day I just want some way to correctly
specify a mime-type so that that meta-information is available to web
servers, without making it impossible for code-review tools to
properly diff my code.
So if the svn dev group at large blesses one of these three schemes,
I'm happy to do the work to implement it.
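
For reference, the svn:text scheme with the backwards-compatible
checks kept in place could look roughly like the following Python
sketch. The svn:text property is only a proposal in this thread, the
function name is made up, and the whitelist contents are a placeholder
assumption standing in for whatever the existing whitelist contains.

```python
# Hypothetical sketch; svn:text is a proposed property, not an existing one.

# Placeholder for the existing whitelist of non-text/* types treated as text.
TEXT_WHITELIST = {"image/x-xbitmap", "image/x-xpixmap"}


def is_text(props):
    """Classify a file as text from its properties (a dict of name -> value)."""
    # Proposed check: presence of svn:text wins, regardless of its value.
    if "svn:text" in props:
        return True
    mime_type = props.get("svn:mime-type")
    if mime_type is None:
        return True  # no mime type: still treated as text
    # Ignore any parameters such as ";charset=..." for this check.
    media_type = mime_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("text/") or media_type in TEXT_WHITELIST
```

Under this sketch a file with svn:mime-type application/xml and
svn:text set is text, while the same file without svn:text stays
binary.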

What are next steps? Do we put things to a vote, duel, appeal to a
higher power?

> --
> Thanks
>
> Mark Phippard
> http://markphip.blogspot.com/
>

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2417350
Received on 2009-11-13 02:56:04 CET

This is an archived mail posted to the Subversion Dev mailing list.
