03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>> The Subversion console client tries to detect binary files with this
>> algorithm:
>> 1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why not
>> check whether the first N bytes are correct UTF-8?);
>> 2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>> uniform distribution of bytes, the chance of no ZERO byte is (1 - 1 /
>> 256) ^ 1024 = ~1.8%);
>> 3. File is BINARY if over 85% of the first 1024 bytes are characters
>> not in the ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary"
>> byte values, ~60%).
>> This algorithm looks broken.
> Can you suggest a better algorithm?
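For reference, here is a minimal Python sketch of the heuristic as it is described in the quoted message. It is not taken from the Subversion sources. The function and constant names are hypothetical, treating an empty file as text is an assumption, and "over 85%" is read as strictly greater than 0.85. The last lines recompute the two figures from the quoted message.

```python
UTF8_BOM = b"\xef\xbb\xbf"
SAMPLE_SIZE = 1024  # number of leading bytes inspected, per the quoted text

# Byte values the quoted description treats as text: 0x07-0x0D and 0x20-0x7F.
TEXT_BYTES = frozenset(range(0x07, 0x0E)) | frozenset(range(0x20, 0x80))


def quoted_heuristic_is_binary(data: bytes) -> bool:
    """Classify data following the three quoted rules (illustrative only)."""
    sample = data[:SAMPLE_SIZE]
    # Rule 1: a file consisting only of the UTF-8 BOM is not binary.
    # Treating an empty file as text is an assumption of this sketch.
    if not sample or sample == UTF8_BOM:
        return False
    # Rule 2: any zero byte in the sample means binary.
    if 0 in sample:
        return True
    # Rule 3: binary if more than 85% of the sample is outside TEXT_BYTES.
    non_text = sum(1 for b in sample if b not in TEXT_BYTES)
    return non_text / len(sample) > 0.85


# Chance that 1024 uniformly random bytes contain no zero byte (about 1.8%).
p_no_zero = (1 - 1 / 256) ** SAMPLE_SIZE

# Share of byte values outside TEXT_BYTES: 153 of 256 (about 60%).
non_text_share = (256 - len(TEXT_BYTES)) / 256

# A short Cyrillic string ("privet mir") encoded as UTF-8: 18 of its 19 bytes
# are >= 0x80, so rule 3 classifies it as binary under this sketch.
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440".encode("utf-8")
print(quoted_heuristic_is_binary(cyrillic))
```

Under these assumptions, uniformly random data sits near 60% non-text bytes, below the 85% threshold, while UTF-8 text in a non-Latin script can exceed it.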
About false positives:
1. If a text file is detected as binary:
   * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
     client blocks adding this file: svn:eol-style and
     svn:mime-type=application/octet-stream can't both be set.
     There is a workaround:
     o create an empty file;
     o run svn add for the empty file;
     o replace the empty file with the real data;
   * you can't diff and merge this file (Cannot display: file marked
     as a binary type.).
     You can't fix it, because you can't remove the svn:mime-type
     property in the last modified revision.
2. If a binary file is detected as text:
   * svn diff and merge display unusable output.
     You can fix it in the current revision by setting the svn:mime-type
     property.
I think the false positive where a text file is detected as binary is the
more annoying one.
About file type detection:
1. The file detection algorithm must be as simple as possible.
2. If the first N bytes contain a ZERO byte, the file is binary.
3. If the file is valid UTF-8, the file is text.
4. If the file contains too many binary characters, the file is binary.
   I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
   0x0E-0x1F, 0x7F (30 characters, ~11.7%).
   These characters are very rarely used in text files. Characters in the
   range 0x80-0xFF can be letters in some encodings.
   The comparison threshold should be significantly lower than the
   percentage of these characters in uniformly distributed data.
   For example, if about 2.5% of the first N bytes are "binary"
   characters, the file is binary.
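The detection steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: N = 1024, "about 2.5%" is read as "at least 2.5%", an empty file counts as text, and all names are hypothetical. An incremental decoder is used so that a multi-byte UTF-8 sequence cut off at the end of the sample is not treated as invalid.

```python
import codecs

SAMPLE_SIZE = 1024   # assumed value for N
THRESHOLD = 0.025    # the 2.5% example from the text

# "Definitely binary" byte values: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F.
SUSPICIOUS = (
    frozenset(range(0x00, 0x09))
    | {0x0B, 0x0C}
    | frozenset(range(0x0E, 0x20))
    | {0x7F}
)


def is_valid_utf8(sample: bytes, truncated: bool) -> bool:
    """Check UTF-8 validity, tolerating a sequence cut off by truncation."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False lets an incomplete trailing sequence pass.
        decoder.decode(sample, final=not truncated)
    except UnicodeDecodeError:
        return False
    return True


def looks_binary(data: bytes) -> bool:
    sample = data[:SAMPLE_SIZE]
    if not sample:
        return False  # assumption: an empty file is text
    # Step 2: a zero byte means binary.
    if 0 in sample:
        return True
    # Step 3: valid UTF-8 means text.
    if is_valid_utf8(sample, truncated=len(data) > SAMPLE_SIZE):
        return False
    # Step 4: too many "definitely binary" bytes means binary.
    hits = sum(1 for b in sample if b in SUSPICIOUS)
    return hits / len(sample) >= THRESHOLD
```

With this ordering, step 4 only matters for data that is not valid UTF-8, for example text in a single-byte encoding such as CP1251, which contains none of the listed bytes and so stays text.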
Overall, the following implementations seem workable to me:
1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the file
   is binary.
   + As simple as possible;
   + Can't detect text files as binary;
   - Can detect some binary files as text;
2. Byte range autodetection: if the first N bytes contain a byte from the
   range 0x00-0x08 or 0x0E-0x1F, the file is binary.
   + Still simple;
   - Can detect some short binary files as text;
3. Byte range autodetection: if a certain percentage (for example, about
   2.5%) of the first N bytes are bytes from 0x00-0x08, 0x0B, 0x0C,
   0x0E-0x1F, 0x7F, the file is binary.
   - Not so simple;
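To make the trade-offs between the three options concrete, here is a small Python sketch of each. The function names, the default of 1024 bytes for N, and the 2.5% threshold are assumptions for illustration. None of this is copied from Git or Subversion.

```python
# Option 2 byte set: 0x00-0x08 and 0x0E-0x1F.
CONTROL_BYTES = frozenset(range(0x00, 0x09)) | frozenset(range(0x0E, 0x20))
# Option 3 byte set: the same plus 0x0B, 0x0C and 0x7F.
RARE_BYTES = CONTROL_BYTES | {0x0B, 0x0C, 0x7F}


def zero_byte_is_binary(data: bytes, limit: int = 8000) -> bool:
    """Option 1: binary if the first `limit` bytes contain a zero byte."""
    return 0 in data[:limit]


def any_control_is_binary(data: bytes, limit: int = 1024) -> bool:
    """Option 2: binary if any byte in the sample is in CONTROL_BYTES."""
    return any(b in CONTROL_BYTES for b in data[:limit])


def control_share_is_binary(
    data: bytes, limit: int = 1024, threshold: float = 0.025
) -> bool:
    """Option 3: binary if the share of RARE_BYTES reaches the threshold."""
    sample = data[:limit]
    if not sample:
        return False  # assumption: an empty file is text
    hits = sum(1 for b in sample if b in RARE_BYTES)
    return hits / len(sample) >= threshold


# A mostly textual sample with a single ESC (0x1B) byte.
sample = b"x" * 200 + b"\x1b"
print(zero_byte_is_binary(sample))      # False: no zero byte
print(any_control_is_binary(sample))    # True: one control byte is enough
print(control_share_is_binary(sample))  # False: 1 of 201 bytes is below 2.5%
```

The sample shows where option 2 is stricter than the others: a single stray control byte flips the result, while option 3 tolerates it at the cost of an extra tunable threshold.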
Received on 2014-10-04 08:04:19 CEST