Re: Classifying files as binary or text

From: Mike Samuel <mikesamuel_at_gmail.com>
Date: Thu, 12 Nov 2009 18:40:24 -0800

2009/11/12 Branko Čibej <brane_at_xbc.nu>:
> Mike Samuel wrote:
>> 2009/11/12 Branko Cibej <brane_at_xbc.nu>:
>>> Mike Samuel wrote:
>>>> 2009/11/12 Mark Phippard <markphip_at_gmail.com>:
>>>>> On Thu, Nov 12, 2009 at 6:20 PM, Mike Samuel <mikesamuel_at_gmail.com> wrote:
>>>>>> Conclusions from the svn:charset thread that Mark pointed out:
>>>>>> (1) This proposal should not gate on svn:charset since it isn't yet
>>>>>> recognized as official
>>>>>> (2) We should avoid the term encoding in documentation of this feature.
>>>>>> (3) There may be some bad interactions between ";charset=" in
>>>>>> svn:mime-type and auto-props, but this proposal does not raise new
>>>>>> issues, and those issues are a result of an error (possibly since
>>>>>> fixed?) in auto-props.
>>>>>> From the svn:charset thread:
>>>>>> Much of the early debate deals with svn:charset being non-standard and
>>>>>> non-approved.  I tend to agree with Stefan, that this proposal
>>>>>> shouldn't gate on svn:charset being approved so I suggest tabling
>>>>>> variant 1.
>>>>> Correct me if I am wrong, but the only real goal we have right now is
>>>>> to improve SVN's ability to tell itself "this is text" and I can do
>>>>> textual merging?
>>>> That is correct.
>>>>> So why not just add an svn:text property that has a
>>>>> value of '*'.  The presence of the property means "treat this as
>>>>> text".
>>>> To make sure I understand your counter-proposal, would a file be
>>>> treated as text if at least one of (svn:mime-type starts with "text/"
>>>> or matches the existing whitelist) OR (svn:text exists and is "*")?
>>>> Or are you advocating dropping the first clause which is there for
>>>> backwards-compatibility?
>>> I think we all agree that using the MIME-type to decide whether we can
>>> use contextual text-base merge for a file has turned out to be trickier
>>> than we originally expected. It makes sense to find a better solution to
>>> the problem.
>> I'm afraid that I'm unfamiliar with these discussions since I just
>> joined this list and have never submitted a patch to SVN before.
>> Can you explain why my argument that mime-types do specify "textiness"
>> as described in RFC 2046 is flawed, or point me at threads that discuss
>> the trickiness?
> Media types in themselves are not enough to specify textiness, as I
> believe you pointed out yourself; svn:mime-type is properly the file's
> media type, since encoding and other attributes can affect the actual
> content.
> The trickiness stems from the fact that it is not "easy" to answer the
> question whether some media type describes a text-like file. Initially,
> years ago, we somewhat naïvely assumed that if it matches the pattern
> "text/*", then it's text, otherwise it's not. It turns out that a whole
> group of media types in the application/* family are in fact text, and a
> couple in image/* and so on; there's no actual restriction that I'm
> aware of in the MIME rules that would forbid you to describe text in any
> of the major media-type families.
> What that amounts to is that, for completeness and correctness, you'd
> have to carefully catalogue all the media types and applicable
> attributes, and make complex decisions based on that data; *and* keep it
> up to date, which apparently is not as easy as one would like (e.g., see
> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
> the goal would justify the ongoing effort involved.

Why? Why not just take into account the charset mime-type parameter
which can only be present on texty types?
I'm not suggesting this as an ultimate solution, just an incremental
improvement of an existing feature that is already well documented and
understood in other domains.
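
A minimal sketch of that check, written in Python rather than Subversion's C. The function names are hypothetical, and the existing whitelist is left out because its contents aren't listed in this thread:

```python
def parse_media_type(value):
    """Split 'type/subtype; key=value; ...' into (essence, params)."""
    parts = [p.strip() for p in value.split(";")]
    essence = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            # Parameter names are case-insensitive; values may be quoted.
            params[key.strip().lower()] = val.strip().strip('"')
    return essence, params


def looks_like_text(mime_type):
    """Assumed rule: text/* is text, and so is anything with a charset."""
    essence, params = parse_media_type(mime_type)
    return essence.startswith("text/") or "charset" in params
```

With this rule, "application/xml; charset=UTF-8" counts as text while a bare "application/octet-stream" does not, and plain "text/*" values behave as before.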

>>> I would suggest just slightly future-proofing Mark's proposal, and
>>> certainly staying backward-compatible. The rules should go like this:
>>>    * If a file has no content-related property (i.e., no svn:mime-type),
>>>      treat it as text.
>>>    * If there's only an svn:mime-type property, keep its current semantics.
>>>    * Introduce a new property that overrides svn:mime-type, but don't
>>>      call it svn:text (which implies it's a boolean), but, e.g., just
>>>      svn:type or svn:merge-mode or some such.
>>> The value of this property is a keyword. Initially, the allowed values
>>> would be "text" and "binary", but I can imagine adding "xml" in the
>>> future if someone wants to add XML-aware merging (I've come across such
>>> requests several times, and have used a similar feature in ClearCase to
>>> good effect). svn:mime-type needn't go away or be deprecated; it's useful
>>> in other contexts.
>> That sounds like a fine idea, but a (how-to-merge?) property seems
>> only tangentially related to (is-text?).
> Actually, as far as Subversion is concerned, how-to-merge is the only
> relevant question. Presentation of contents is someone else's problem
> (most of the time).

svn diff is part of the core, and it is not just a presentation tool, so
you can't simply add a --force flag and rely on users to do what they
want. It is used to generate patches that are uploaded by tools such as
Rietveld (codereview.appspot.com).

>>   This is exactly the kind of
>> overloading that I think should be avoided.
> That is why I proposed that we retain svn:mime-type, since it is
> possible to encode all the MIME attributes in that property, and thus
> end up using that property to answer questions about the file's contents,
> whilst using a new one to answer questions about how to merge it.

Sure. I want to make svn:mime-type incrementally more accurate by
taking into account the fact that a charset parameter in a mime-type
implies textiness.
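
For comparison, a rough Python sketch of the precedence Brane's three rules describe. The property name "svn:merge-mode" is just one of the names floated above, not an existing property, and the "current semantics" of svn:mime-type are simplified here to the "text/*" prefix test, ignoring the whitelist:

```python
def merge_as_text(props):
    """Decide whether to use contextual text merging for a file.

    props maps property names to values. "svn:merge-mode" is a
    hypothetical property; only "text" and "binary" are assumed.
    """
    mode = props.get("svn:merge-mode")
    if mode is not None:
        # The new property overrides svn:mime-type. Any keyword other
        # than "text" is treated as non-text in this sketch.
        return mode == "text"

    mime = props.get("svn:mime-type")
    if mime is None:
        # No content-related property: treat the file as text.
        return True

    # Only svn:mime-type is set: simplified stand-in for today's rule.
    return mime.strip().lower().startswith("text/")
```

The charset refinement would only change the last line of this function; the override property and the no-property default are independent of it.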

>>   Yes, merging and diffing
>> are definitely related concerns, but doing type-sensitive merging is a
>> big change
> Not really. We do type-sensitive merging today, albeit in a very limited
> way. And I prefer to _keep_ it as limited as makes sense.

Ah, cool. I was unaware of that.

>>  and I think that it would be good not to introduce new
>> properties to enable it until there's significant agreement on goals
>> and how to stage rollout.
> Sure. Hence this discussion.

Yes, but I want to fix a specific bug. I'd rather not gate fixing
issue 1002 on the design of new features that I'm incompetent to

> -- Brane

Received on 2009-11-13 03:40:33 CET

This is an archived mail posted to the Subversion Dev mailing list.