Re: Classifying files as binary or text

From: Branko Cibej <brane_at_xbc.nu>
Date: Fri, 13 Nov 2009 03:57:20 +0100

Mike Samuel wrote:
> 2009/11/12 Branko Čibej <brane_at_xbc.nu>:
>> What that amounts to is that, for completeness and correctness, you'd
>> have to carefully catalogue all the media types and applicable
>> attributes, and make complex decisions based on that data; *and* keep it
>> up to date, which apparently is not as easy as one would like (e.g., see
>> http://roy.gbiv.com/untangled/2009/wrangling-mimetypes). I don't think
>> the goal would justify the ongoing effort involved.
> Why? Why not just take into account the charset mime-type parameter
> which can only be present on texty types?
> I'm not suggesting this as an ultimate solution, just an incremental
> improvement of an existing feature that is already well documented and
> understood in other domains.

What do you do if the charset attribute isn't present? You either fall
back to the current "broken" behaviour, or you interpret the media type.
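
To make the fallback question concrete, here is a minimal sketch in Python. It is not Subversion code: the function names are hypothetical, the parameter parsing is deliberately naive (it does not handle quoted semicolons), and it assumes, as a simplification, that the current behaviour amounts to "anything that does not start with text/ is binary". The charset check from the proposal is layered on top of that rule.

```python
def parse_mime_type(value):
    """Split 'type/subtype; name=value; ...' into (media_type, attrs).

    Hypothetical helper for illustration only.
    """
    media_type, *params = [part.strip() for part in value.split(";")]
    attrs = {}
    for param in params:
        name, sep, val = param.partition("=")
        if sep:
            attrs[name.strip().lower()] = val.strip().strip('"')
    return media_type.lower(), attrs


def looks_like_text(value):
    """Return True if the svn:mime-type value should be treated as text."""
    media_type, attrs = parse_mime_type(value)
    # Proposed increment: a charset parameter implies a texty type.
    if "charset" in attrs:
        return True
    # No charset: fall back to the assumed existing prefix rule.
    return media_type.startswith("text/")
```

With this sketch, "application/xml; charset=UTF-8" is classified as text, while plain "application/xml" still falls through to the prefix rule and is classified as binary, which is exactly the case the question above is about.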


> svn diff is part of the core and it is not just a presentation tool so
> you can't just add a --force flag and rely on users to do what they
> want. It is used to generate patches that are uploaded by tools like
> rietveld: codereview.appspot.com.

I don't understand; which part of "svn diff" needs to know about
charsets? Ignoring the known problem with not being able to interpret
UTF-16 and UTF-32 (which can be solved without relying on the value of
svn:mime-type or any other property), and ignoring EBCDIC, then as far
as I know, "svn diff" output will remain valid regardless of the
particular text encoding used in the file. As far as I know, all other
single-byte and multi-byte encodings use the ASCII control codes to
indicate line endings, and are constructed so that these cannot be
misinterpreted. That's all that really matters to "svn diff".
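
The claim can be checked with a small experiment. The sketch below is an illustration in Python, not how "svn diff" is implemented; the helper name is hypothetical and only LF is tested, not CR. It splits encoded bytes on the ASCII LF byte and checks whether each piece still decodes to the original line.

```python
def survives_bytewise_split(text, encoding):
    """Return True if splitting the encoded bytes on LF (0x0A) yields
    exactly the lines of the original text.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode(encoding)
    try:
        pieces = [chunk.decode(encoding) for chunk in encoded.split(b"\n")]
    except UnicodeDecodeError:
        # A split landed in the middle of a multi-byte unit.
        return False
    return pieces == text.split("\n")


samples = [
    ("héllo\nwörld\n", "utf-8"),
    ("héllo\nwörld\n", "latin-1"),
    ("日本語\nテキスト\n", "shift_jis"),
    ("a\nb\n", "utf-16-le"),
]
for text, encoding in samples:
    print(encoding, survives_bytewise_split(text, encoding))
```

The first three encodings never use the byte 0x0A for anything but a line feed, so the split is safe. In UTF-16 the line feed is the two-byte unit 0A 00, so a bytewise split cuts it in half and the remaining pieces no longer decode.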

(Oh, we don't properly interpret the Unicode end-of-line code point in
UTF-8 files.)
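
As an illustration of that gap, the Python sketch below assumes the code point meant is U+2028 LINE SEPARATOR (the message does not say which one). A bytewise split on LF does not see it, whereas Python's Unicode-aware str.splitlines() does.

```python
# U+2028 is encoded in UTF-8 as the three bytes E2 80 A8, none of which is LF.
data = "one\u2028two\nthree".encode("utf-8")

# Bytewise view: only the ASCII LF counts as a line ending.
byte_lines = data.split(b"\n")

# Unicode-aware view: str.splitlines() also breaks on U+2028.
unicode_lines = data.decode("utf-8").splitlines()

print(len(byte_lines))
print(unicode_lines)
```

The bytewise view finds two lines and the Unicode-aware view finds three.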

>>> This is exactly the kind of
>>> overloading that I think should be avoided.
>> That is why I proposed that we retain svn:mime-type, since it is
>> possible to encode all the MIME attributes in that property, and thus
>> end up using that property to answer questions about the file's contents,
>> whilst using a new one to answer questions about how to merge it.
> Sure. I want to make svn:mime-type incrementally more accurate by
> taking into account the fact that a charset parameter in a mime-type
> implies textiness.
>>> Yes, merging and diffing
>>> are definitely related concerns, but doing type-sensitive merging is a
>>> big change
>> Not really. We do type-sensitive merging today, albeit in a very limited
>> way. And I prefer to _keep_ it as limited as makes sense.
> Ah, cool. I was unaware of that.

Caveat lector: "We merge differently based on whether we believe the
file to be text or binary." That's the "very limited way" I was
referring to. Nothing fancier.

>>> and I think that it would be good not to introduce new
>>> properties to enable it until there's significant agreement on goals
>>> and how to stage rollout.
>> Sure. Hence this discussion.
> Yes, but I want to fix a specific bug. I'd rather not gate fixing
> issue 1002 on the design of new features that I'm incompetent to
> debate.

What new features would these be? That issue is about finding a way to
be more precise in the text vs. binary decision. It happens to
explicitly mention more precise interpretation of MIME types, but that's
just one, and not necessarily the best, solution.

-- Brane

Received on 2009-11-13 03:57:44 CET

This is an archived mail posted to the Subversion Dev mailing list.