Re: text and binary files in SVN

From: Vincent Lefevre <vincent+svn_at_vinc17.org>
Date: 2007-08-15 00:49:17 CEST

On 2007-08-14 16:43:19 +0200, Erik Huelsmann wrote:
> On 8/14/07, Vincent Lefevre <vincent+svn@vinc17.org> wrote:
> > On 2007-08-14 16:25:10 +0200, Erik Huelsmann wrote:
> > > There's an algorithm to estimate whether files are binary or texty:
> > >
> > > Check the first 1024 bytes to be within the 0x20-0x7F and 0x07-0x0D
> > > regions. If more than 85% of the bytes fall in that region (and none
> > > were 0x00), then the file is probably texty.
> >
> > I wonder if non-occidental users would agree with you.
>
> They don't have to. This is what currently defines texty and we've
> had no complaints. It's based on what diff thinks is texty.
      ^^^^^^^^^^^^^

This is not true (see below).

> > And what about UTF-16?
>
> There's no support for wide characters in the built-in diff routine.
> You can use external diff routines, or provide a patch to support
> it...

There have been some complaints concerning UTF-16 (but the threads also
mention the problem of UTF-8 sometimes being recognized as binary), and
there's even an open issue:

  http://subversion.tigris.org/issues/show_bug.cgi?id=2194
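
A rough sketch of the heuristic quoted above shows why both encodings
trip it. This is only a Python illustration of the rule as described
in this thread (reading the lower bound as 0x20), not Subversion's
actual code; the function name, its parameters and the handling of
empty files are my assumptions:

```python
def looks_texty(data, sample_size=1024, threshold=0.85):
    """Hypothetical helper: apply the quoted heuristic to a byte string."""
    sample = data[:sample_size]
    if not sample:
        return True  # assumption: an empty file counts as text
    if 0 in sample:
        return False  # any 0x00 byte rules the file out as text
    # Count bytes in the 0x20-0x7F and 0x07-0x0D ranges.
    texty = sum(1 for b in sample if 0x20 <= b <= 0x7F or 0x07 <= b <= 0x0D)
    return texty / len(sample) > threshold


ascii_text = "plain ASCII text\n".encode("ascii")
utf16_text = "plain ASCII text\n".encode("utf-16")
utf8_greek = "αβγδεζηθικλμνξοπρστυφχψω\n".encode("utf-8")

print(looks_texty(ascii_text))
print(looks_texty(utf16_text))
print(looks_texty(utf8_greek))
```

UTF-16 fails on the NUL bytes alone, and UTF-8 text written mostly in
a non-Latin script falls far below the 85% mark, because every byte of
a multi-byte character lies outside 0x20-0x7F.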

> > One can have compressed XML files with text/xml mime-type. How does
> > Subversion handle that?
>
> As incorrectly as the mime-type. Clearly a compressed XML file isn't
> text. More appropriate seems application/xml. Or even
> application/x-gzip+xml.

No, this is wrong. For instance, see /etc/mime.types distributed in
Debian:

# Note: Compression schemes like "gzip", "bzip", and "compress" are not
# actually "mime-types". They are "encodings" and hence must _not_ have
# entries in this file to map their extensions. The "mime-type" of an
# encoded file refers to the type of data that has been encoded, not the
# type of encoding.

Apache behaves the same way: the compression is declared in a separate
header (Content-Encoding). That's HTTP/1.1 (RFC 2616) after all...
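
The same separation of type and encoding can be seen in Python's
standard mimetypes module, which I use here only as an illustration
(it is unrelated to Subversion and Apache, and the file name is made
up). Its guess_type() function returns the type and the encoding as
two separate values:

```python
import mimetypes

# guess_type() returns a (type, encoding) pair: the type describes the
# data that was encoded, the encoding describes the compression.
mime_type, encoding = mimetypes.guess_type("data.xml.gz")

print(mime_type)  # an XML type; the exact value depends on the local MIME tables
print(encoding)
```

The ".gz" suffix only contributes the encoding; the type is derived
from the ".xml" part of the name.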

> > Also, for instance, is text/rtf more textual than application/x-sh
> > as far as diff is concerned?
>
> Yes, because application/x-sh doesn't have a text/* mime-type.

But that's wrong: doing a textual diff on sh scripts makes more sense
than doing one on RTF files. Again, there have been several complaints.
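
To make the problem concrete, here is a minimal sketch of the rule as
stated in the quoted reply: a file gets a textual diff only if its
mime-type is text/*. The function name is hypothetical, and treating a
file without any mime-type as text is my assumption; this is not taken
from Subversion's source:

```python
def diffed_as_text(mime_type):
    """Hypothetical helper: the text/* rule described in this thread."""
    if mime_type is None:
        return True  # assumption: no mime-type set means text
    return mime_type.startswith("text/")


for mime_type in ("text/rtf", "application/x-sh", "text/xml"):
    print(mime_type, diffed_as_text(mime_type))
```

Under this rule an RTF file is diffed line by line while a shell
script is treated as binary, which is the opposite of what is useful.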

-- 
Vincent Lefèvre <vincent_at_vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
Received on Wed Aug 15 00:47:18 2007

This is an archived mail posted to the Subversion Users mailing list.