
Re: Ascii/binary detection.

From: <kfogel_at_collab.net>
Date: 2001-08-03 19:43:36 CEST

Branko Čibej <brane@xbc.nu> writes:
> Right. Note that when we were discussing this (damn, don't have that
> archive any more ...), someone pointed out that the "text/*" mime types
> actually imply CRLF line endings. But I think we can safely ignore that;
> Subversion is not an MUA.

     "Every program attempts to expand until it can read mail.
      Those programs which cannot so expand are replaced by ones that
      can."
                   -- [Third] Law of Software Envelopment

                  [also apparently known as "Zawinski's Law", but jwz
                  quotes it thus, so don't know if he's the originator]

> >1. Develop a heuristic for determining the binariness of a file, say
> > svn_io_is_binary_file ()
> >
> (Two suggestions: a) don't mark the file as binary just because there's
> a byte with value >= 128 in it; b) if other tests aren't conclusive,
> check for extremely long lines in the file?)

That combination is exactly what we were planning to do, yeah -- some
combination of a) at least a certain percentage of bytes with the high
bit set, and b) long lines.

If there's a library out there which does this already, we should use
it, probably. Anybody know of one?
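
For concreteness, here is a minimal sketch of one way to combine the two tests. It is written in Python for brevity and is not Subversion code: the function name `is_binary` and both thresholds (30% high-bit bytes, 2000-byte lines) are assumptions picked for illustration. A single byte >= 128 does not mark the file as binary.

```python
def is_binary(data: bytes,
              high_bit_ratio: float = 0.30,
              max_line_len: int = 2000) -> bool:
    """Guess whether `data` looks binary.

    Both thresholds are illustrative assumptions, not agreed values.
    """
    if not data:
        return False  # treat an empty file as text

    # (a) fraction of bytes with the high bit set
    high = sum(1 for b in data if b >= 0x80)
    if high / len(data) >= high_bit_ratio:
        return True

    # (b) if (a) is inconclusive, look for an extremely long line
    longest = max(len(line) for line in data.split(b"\n"))
    return longest > max_line_len
```

With these defaults, a short ISO-8859-2 text containing a few accented letters is still reported as text, while a 5000-byte run with no newline is reported as binary.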

> >2. During `svn add', svn_io_is_binary_file () is called (only on
> > files, of course). If it returns TRUE, the property
> > `svn:mime-type' is set on the file with a value of
> > `application/octet-stream'.
> >
> What do you think about following the HTTP convention here? Call the
> property svn:content-type, and encode the character set, too? Not that
> we'll do anything with that info in 1.0.

Seems like a good idea.
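
A rough sketch of what `svn add' might record under that convention, in Python for brevity. The helper name `props_for_added_file` is hypothetical, and using `text/plain` for text files with a known character set is my assumption. Only the `application/octet-stream' value comes from the proposal.

```python
from typing import Dict, Optional


def props_for_added_file(looks_binary: bool,
                         charset: Optional[str] = None) -> Dict[str, str]:
    """Return the properties to set on a newly added file.

    The value follows the HTTP Content-Type convention:
    "type/subtype" optionally followed by "; charset=NAME".
    """
    if looks_binary:
        return {"svn:content-type": "application/octet-stream"}
    if charset is not None:
        # Assumption: text files with a known charset get text/plain.
        return {"svn:content-type": f"text/plain; charset={charset}"}
    return {}  # nothing known, so set no property
```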

> I agree. Since the heuristic can't be 100% accurate, we definitely have
> to say what we guess about the file.

Also +1.

> Why not keyword substitution? Just make it off by default. If the user
> wants keyword substitution in binary files, we can always let him shoot
> himself in the foot. Besides, it can actually make sense in some kinds
> of binary formats.

That makes sense. As long as Subversion doesn't initiate it, it's fine.

> There are (used to be?) systems where lines are delimited from both
> ends. On VMS, a line started with a LF and ended with a CR, IIRC. How
> about a more generic approach: the value of this property is a pair of
> strings, one for the BOL and one for the EOL marker. 'native' would
> still have the same meaning, while 'dos', 'unix' and 'mac' would be
> aliases for ':\r\n', ':\n' and ':\n\r' (or whatever), respectively. A
> VMS guy would make 'native' an alias for '\n:\r'.

Clever, yeah. +1. Why not support everything, if it's easy? :-)
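
A minimal sketch of parsing such a value, in Python for brevity. The names `parse_line_markers` and `render_lines` are hypothetical, and the sketch assumes that neither marker can itself contain a ':'. Classic Mac OS ended lines with a bare CR, so the 'mac' alias here is ':\r' rather than the placeholder quoted above.

```python
from typing import Iterable, Tuple

# Aliases map a friendly name to a "BOL:EOL" pair.
ALIASES = {
    "unix": ":\n",
    "dos": ":\r\n",
    "mac": ":\r",
}


def parse_line_markers(value: str, native: str = ":\n") -> Tuple[str, str]:
    """Split a property value into (BOL, EOL) marker strings.

    `native` is whatever pair the local platform uses; a VMS client
    might pass "\n:\r".
    """
    if value == "native":
        value = native
    value = ALIASES.get(value, value)
    bol, sep, eol = value.partition(":")
    if not sep:
        raise ValueError(f"not a BOL:EOL pair: {value!r}")
    return bol, eol


def render_lines(lines: Iterable[str], bol: str, eol: str) -> str:
    """Wrap each bare line in its BOL and EOL markers."""
    return "".join(bol + line + eol for line in lines)
```

Under this scheme ':' parses to two empty markers, so the lines are written back with nothing added.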

> > Absence of this property means that no line-ending substitution
> > should occur at all.
> >
> Um. I'd rather use 'none' (':', if you accept the idea outlined above),
> and make 'native' the default for text files. Oh, and we have to
> prescribe the repository's native format, so that we can send deltas
> back and forth.

We can assign `none' (or `:') that meaning, if we choose, but we still
have to handle the case where the property is simply absent, and the
appropriate behavior in that case is, obviously, no conversion. So
our code for reporting the newline conversion status to the user (for
example) would still have to special-case the property's absence, at
least if we have any reporting mechanism more fancy than the user
simply doing a proplist/propget.
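
The special case looks roughly like this, sketched in Python for brevity. The function name and message wording are hypothetical; `None` stands for "property not set".

```python
from typing import Optional


def describe_newline_conversion(prop_value: Optional[str]) -> str:
    """Report the newline conversion status for one file."""
    if prop_value is None:
        # Property absent: no conversion, and nobody asked for one.
        return "no conversion (property not set)"
    if prop_value in ("none", ":"):
        # Property present and explicitly turning conversion off.
        return "no conversion (explicitly disabled)"
    return f"convert line endings: {prop_value!r}"
```

Both of the first two branches mean "leave the bytes alone", but the reporting code still has to tell them apart.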

> This looks good, even if you ignore all my comments.

Even this one?

:-)

-K

Received on Sat Oct 21 14:36:35 2006

This is an archived mail posted to the Subversion Dev mailing list.