
Re: Heuristic for detecting 'binary' data vs. 'text' data [was: FW: Generating a dump file using a powershell script]

From: Julian Foad <julian.foad_at_wandisco.com>
Date: Tue, 22 Jun 2010 16:25:25 +0100

(I'm just changing the subject line.)
- Julian
 

On Tue, 2010-06-22 at 16:58 +0200, Bert Huijben wrote:
> > -----Original Message-----
> > From: Geoff Worboys [mailto:geoff_at_telesiscomputing.com.au]
> > Sent: dinsdag 22 juni 2010 16:37
> > To: users_at_subversion.apache.org
> > Subject: Generating a dump file using a powershell script
> >
>
> <snip>
>
> > Q2: When writing the code to try and identify text versus
> > binary files I decided to look at what subversion did ... but
> > now I am confused. In libsvn_subr\io.c function
> > svn_io_detect_mimetype2 a comment says:
> >     going to examine the first block of data, and make sure that 85%
> >     of the bytes are such that their value is in the ranges 0x07-0x0D
> >     or 0x20-0x7F, and that 100% of those bytes is not 0x00.
> > but my reading of this code
> >     if (((binary_count * 1000) / amt_read) > 850)
> >       {
> >         *mimetype = generic_binary;
> >         return SVN_NO_ERROR;
> >       }
> > suggests that it is actually setting the type to binary only
> > if it finds that more than 85% are binary bytes (in earlier code a
> > file is forced to binary if any null byte is found).
> >
> > Can anyone explain this? A bug or am I missing something?
>
> Looking at the code, this looks like a bug to me. But it's not a bug
> that I would like to fix without further review, because the current code might
> work better than the intended behavior for users of different character
> sets.
>
> So it might be safer to just fix the documentation.
>
> Bert
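
To make the discrepancy concrete, here is a minimal sketch in C of the two readings discussed above. It is not the code from libsvn_subr/io.c: the function names (count_non_text, is_binary_as_coded, is_binary_as_documented) are hypothetical, and the byte ranges and the 850 threshold are taken only from the comment and snippet quoted in this thread. The 150 threshold is my assumption of what the comment's "85% text" rule would translate to.

```c
#include <stddef.h>

/* Count bytes outside the "text" ranges 0x07-0x0D and 0x20-0x7F
   (ranges as quoted from the comment).  Sets *has_nul to 1 if any
   0x00 byte is seen.  Hypothetical helper, not from the svn source. */
static size_t
count_non_text(const unsigned char *buf, size_t len, int *has_nul)
{
  size_t i, n = 0;

  *has_nul = 0;
  for (i = 0; i < len; i++)
    {
      unsigned char c = buf[i];

      if (c == 0x00)
        *has_nul = 1;
      if (c < 0x07 || (c > 0x0D && c < 0x20) || c > 0x7F)
        n++;
    }
  return n;
}

/* Reading 1: the quoted code.  Binary if a NUL is present, or if
   more than 85% of the bytes are non-text. */
int
is_binary_as_coded(const unsigned char *buf, size_t len)
{
  int has_nul;
  size_t binary_count;

  if (len == 0)
    return 0;
  binary_count = count_non_text(buf, len, &has_nul);
  return has_nul || ((binary_count * 1000) / len) > 850;
}

/* Reading 2: the quoted comment.  Text needs at least 85% text
   bytes, so binary if a NUL is present or more than 15% of the
   bytes are non-text.  (The 150 threshold is an assumption.) */
int
is_binary_as_documented(const unsigned char *buf, size_t len)
{
  int has_nul;
  size_t binary_count;

  if (len == 0)
    return 0;
  binary_count = count_non_text(buf, len, &has_nul);
  return has_nul || ((binary_count * 1000) / len) > 150;
}
```

With a 10-byte block containing two bytes of 0xE9 (20% non-text, as might occur in Latin-1 text), the first reading classifies the block as text and the second as binary, which illustrates Bert's point that the current behavior may suit users of other character sets better.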
Received on 2010-06-22 17:26:08 CEST

This is an archived mail posted to the Subversion Dev mailing list.
