
Re: Subversion binary file detection is look like broken

From: Artem V. Navrotskiy <bozaro_at_yandex.ru>
Date: Sat, 04 Oct 2014 10:03:46 +0400

Hello,

03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>> Hello,
>>
>>
>>
>> The Subversion console client tries to detect binary files with this
>> algorithm:
>>
>> 1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>> not check whether the first N bytes are correct UTF-8?);
>> 2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>> uniform distribution of bytes, the chance of no ZERO byte is (1 - 1 /
>> 256) ^ 1024 = ~1.8%);
>> 3. File is BINARY if over 85% of the first 1024 bytes are characters
>> not in range 0x07-0x0D, 0x20-0x7F (in total we have 153 "binary"
>> byte values, ~60%).
>>
>> This algorithm looks broken.
>>
> Can you suggest a better algorithm?
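
The two figures in the quoted text can be checked with a few lines of Python. This is only an arithmetic sketch; the variable names are made up for illustration and nothing here comes from the Subversion sources.

```python
# Chance that 1024 uniformly random bytes contain no zero byte at all.
p_no_zero = (1 - 1 / 256) ** 1024

# Byte values the quoted rule treats as text: 0x07-0x0D and 0x20-0x7F.
text_bytes = set(range(0x07, 0x0E)) | set(range(0x20, 0x80))

# Everything else counts as "binary".
binary_bytes = 256 - len(text_bytes)

print(f"no zero byte in 1024 random bytes: {p_no_zero:.1%}")
print(f"binary byte values: {binary_bytes} ({binary_bytes / 256:.1%})")
```
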
About false positives:

 1. If a text file is detected as binary:
      * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
        client blocks adding this file: svn:eol-style and
        svn:mime-type=application/octet-stream can't be set
        simultaneously;
        There is a workaround:
          o create an empty file;
          o run svn add for the empty file;
          o replace the empty file with the real data;
          o commit.
      * you can't diff or merge this file (Cannot display: file marked
        as a binary type.).
        You can't fix it, because you can't remove the svn:mime-type
        property in the last modified revision.
 2. If a binary file is detected as text:
      * svn diff and merge display unusable output.
        You can fix it in the current revision by setting the
        svn:mime-type property.

I think the false positive where a text file is detected as binary is the
more annoying one.

About file type detection:

 1. The file detection algorithm must be as simple as possible.
 2. If the first N bytes contain a ZERO byte, the file is binary.
 3. If the file is valid UTF-8, the file is text.
 4. If the file contains too many binary characters, the file is binary.
    I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
    0x0E-0x1F, 0x7F (30 characters, ~11.7%).
    These characters are very rarely used in text files. Characters from
    range 0x80-0xFF can be letters in some encodings.
    The comparison threshold should be significantly lower than the
    percentage of these characters in uniformly distributed data.
    For example, if about 2.5% of the first N bytes are "binary"
    characters, the file is binary.
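
The rules above could be sketched in Python roughly as follows. This is an illustration only, not Subversion code: the function name looks_binary, the 1024-byte sample size and the handling of empty files are my assumptions, and the 2.5% threshold is just the example value from rule 4.

```python
import codecs

# "Definitely binary" byte values from rule 4. Bytes 0x80-0xFF are left out
# because they can be letters in some 8-bit encodings.
BINARY_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20)) + [0x7F]
)


def looks_binary(data: bytes, sample_size: int = 1024,
                 threshold: float = 0.025) -> bool:
    """Hypothetical helper: apply rules 2-4 to the first sample_size bytes."""
    sample = data[:sample_size]
    if not sample:
        return False  # assumption: an empty file is treated as text

    # Rule 2: any zero byte means binary.
    if 0 in sample:
        return True

    # Rule 3: valid UTF-8 means text. The incremental decoder tolerates a
    # multi-byte sequence cut off by the sample boundary when final=False.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=len(data) <= sample_size)
        return False
    except UnicodeDecodeError:
        pass

    # Rule 4: too many "definitely binary" bytes means binary.
    binary_count = sum(b in BINARY_BYTES for b in sample)
    return binary_count / len(sample) >= threshold
```

Note that with this ordering rule 4 is only reached for data that is not valid UTF-8, because control characters other than zero are still valid UTF-8 on their own.
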

Overall, the following implementations seem workable to me:

 1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the
    file is binary.
    + As simple as possible;
    + Can't detect text files as binary;
    - Can detect some binary files as text;
 2. Byte range autodetection: if the first N bytes contain a byte from
    range 0x00-0x08 or 0x0E-0x1F, the file is binary.
    + Still simple;
    - Can detect some short binary files as text;
 3. Byte range autodetection: if more than a threshold percentage of the
    first N bytes are 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F, the file
    is binary.
    - Not so simple;
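
To compare the three options side by side, here is a minimal Python sketch. The function names are hypothetical, N is assumed to be 1024 for options 2 and 3, and the 2.5% threshold for option 3 is only an example value, so treat this as an illustration and not as Git or Subversion code.

```python
# Option 2 byte values: 0x00-0x08 and 0x0E-0x1F.
CONTROL_BYTES = frozenset(range(0x00, 0x09)) | frozenset(range(0x0E, 0x20))

# Option 3 adds 0x0B, 0x0C and 0x7F to the set above.
STRICT_BYTES = CONTROL_BYTES | {0x0B, 0x0C, 0x7F}


def has_zero_byte(data: bytes, sample_size: int = 8000) -> bool:
    """Option 1: binary if the first 8000 bytes contain a zero byte."""
    return 0 in data[:sample_size]


def has_control_byte(data: bytes, sample_size: int = 1024) -> bool:
    """Option 2: binary if any sampled byte is in CONTROL_BYTES."""
    return any(b in CONTROL_BYTES for b in data[:sample_size])


def control_ratio_too_high(data: bytes, sample_size: int = 1024,
                           threshold: float = 0.025) -> bool:
    """Option 3: binary if the share of STRICT_BYTES reaches the threshold."""
    sample = data[:sample_size]
    if not sample:
        return False  # assumption: an empty file is treated as text
    return sum(b in STRICT_BYTES for b in sample) / len(sample) >= threshold
```

Option 2 flags a file on a single stray control byte, while option 3 tolerates a few of them in a large sample at the price of one more tunable value.
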

Best regards,
Navrotskiy Artem.
Received on 2014-10-04 08:04:19 CEST

This is an archived mail posted to the Subversion Dev mailing list.
