
Re: Subversion binary file detection is look like broken

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Fri, 3 Oct 2014 13:15:19 +0100

Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>    The Subversion console client tries to detect binary files with this algorithm:
>>
>>     1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why not
>>        check whether the first N bytes are correct UTF-8?);
>>     2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a uniform
>>        distribution of bytes, the chance of no ZERO byte is (1 - 1 /
>>        256) ^ 1024 = ~1.8%);
>>     3. File is BINARY if over 85% of the first 1024 bytes are
>>        not in the ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary"
>>        byte values, ~60%).
>>
>>    This algorithm looks broken.
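For reference, the three rules as described above can be sketched roughly like this. It is an illustrative Python sketch based only on the description in the quoted message, not a transcription of the C code in libsvn_subr/io.c. The names (looks_binary, is_text_byte, SAMPLE_SIZE) are hypothetical, and treating an empty file as text is my assumption. The last lines check the two figures quoted in the message.

```python
# Illustrative sketch of the three rules as described in the quoted message.
# All names are hypothetical; this is not the C code from libsvn_subr/io.c.

UTF8_BOM = b"\xef\xbb\xbf"
SAMPLE_SIZE = 1024  # the message says the first 1024 bytes are examined


def is_text_byte(b: int) -> bool:
    """True for bytes in the two 'text' ranges: 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes, threshold: float = 0.85) -> bool:
    """Apply rules 1-3 to the first SAMPLE_SIZE bytes of data."""
    sample = data[:SAMPLE_SIZE]
    if sample == UTF8_BOM:   # rule 1: a file holding only a BOM is text
        return False
    if not sample:           # assumption: an empty file is treated as text
        return False
    if 0 in sample:          # rule 2: any zero byte means binary
        return True
    non_text = sum(1 for b in sample if not is_text_byte(b))
    return non_text / len(sample) > threshold  # rule 3


# Check the two figures quoted in the message.
p_no_zero = (1 - 1 / 256) ** 1024
binary_values = sum(1 for b in range(256) if not is_text_byte(b))
print(f"P(no zero byte in 1024 uniform random bytes) = {p_no_zero:.1%}")
print(f"{binary_values} of 256 byte values count as 'binary' "
      f"({binary_values / 256:.0%})")
```

With the default threshold of 0.85, uniformly random data that happens to contain no zero byte sits near 60% "binary" bytes and so passes as text, which is the problem being reported.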

Requirement (3), that >85% non-ASCII* bytes => binary, was a historical accident. The original intention was >15% non-ASCII bytes => binary, or in other words >=85% ASCII bytes => text. Quoting from libsvn_subr/io.c:svn_io_is_binary_data():

     NOTE:  Originally, I intended to target 85% of the bytes being in
     the specified ranges, but I flubbed the condition.  At any rate,
     folks aren't complaining, so I'm not sure that it's worth
     adjusting this retroactively now.

Perhaps now is the time to change that to match the original intent.

* I use the term ASCII loosely to mean "bytes in those two ranges".

> Can you suggest a better algorithm?
>
>> For example:
>>     1. File "text.txt":
>> This file contains a block of text from Wikipedia about Subversion in UTF-8
>> (https://ru.wikipedia.org/wiki/Subversion) and unfortunately contains too
>> many Cyrillic characters (one character is 2 "binary" bytes).
>>     2. File "binary.txt" detected as "text"
>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>> unfortunately does not contain a ZERO byte in the first 1024 bytes.

Changing the 85% condition would fix example 2. However, it would make example 1 occur more often, unless we also make valid UTF-8 be detected as text.
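To make that trade-off concrete, here is a small Python sketch that measures the fraction of "non-ASCII" bytes (in the loose sense above) in two made-up samples and applies both thresholds. The samples are stand-ins I invented for illustration, not the files from the report: a short mixed Latin/Cyrillic sentence, and 1024 pseudo-random bytes generated with a fixed seed and no zero byte. The helper name non_text_fraction is hypothetical.

```python
import random


def non_text_fraction(data: bytes) -> float:
    """Fraction of bytes outside the ranges 0x07-0x0D and 0x20-0x7F."""
    non_text = sum(
        1 for b in data if not (0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F)
    )
    return non_text / len(data)


# Stand-in for example 1: mixed Latin/Cyrillic text encoded as UTF-8.
cyrillic_text = (
    "Subversion (SVN) - свободная система управления версиями".encode("utf-8")
)

# Stand-in for example 2: 1024 pseudo-random bytes that contain no zero byte.
rng = random.Random(0)
random_bytes = bytes(rng.randrange(1, 256) for _ in range(1024))

for name, data in (("cyrillic_text", cyrillic_text),
                   ("random_bytes", random_bytes)):
    frac = non_text_fraction(data)
    current = "binary" if frac > 0.85 else "text"    # condition as it is today
    intended = "binary" if frac > 0.15 else "text"   # originally intended condition
    print(f"{name}: {frac:.0%} non-text -> current: {current}, "
          f"intended: {intended}")
```

Both samples land between the two thresholds, so the current condition calls both of them text and the intended condition calls both of them binary. The second result is the fix for example 2; the first is example 1 becoming more common.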

It does sound like a good idea to make valid UTF-8 be detected as text.
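One way such a check might look, as a rough Python sketch rather than a proposal for the actual C implementation: since only a fixed-size prefix of the file is examined, the prefix may end in the middle of a multi-byte character, so the sketch uses an incremental decoder and tolerates an incomplete trailing sequence. The function name is hypothetical.

```python
import codecs


def is_valid_utf8_prefix(sample: bytes) -> bool:
    """True if sample is valid UTF-8, allowing a truncated final character."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False: an incomplete sequence at the very end is buffered
        # instead of raising, since the sample may be cut mid-character.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


text = "Привет, Subversion".encode("utf-8")
print(is_valid_utf8_prefix(text))             # complete UTF-8 text
print(is_valid_utf8_prefix(text[:1]))         # cut in the middle of a character
print(is_valid_utf8_prefix(b"\xff\xfe\x41"))  # bytes that cannot occur in UTF-8
```

Note that a zero byte is itself valid UTF-8 (it encodes U+0000), so a check like this would presumably have to come after the zero-byte rule rather than replace it.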

- Julian
Received on 2014-10-03 14:15:50 CEST

This is an archived mail posted to the Subversion Dev mailing list.
