Re: Issue 520 in tortoisesvn: TortoiseMerge fails to detect utf-16 without BOM

From: Oto BREZINA <brezina_at_printflow.eu>
Date: Tue, 23 Jul 2013 12:15:55 +0200

On 2013-07-20 20:48, Stefan Küng wrote:
> On 20.07.2013 20:32, Oto BREZINA wrote:
>>> Already there:
>>> double click on the status bar at the bottom where the encoding of the
>>> file is shown.
>> In the bottom menu you can change the encoding the file will be stored
>> in, not the encoding used to load the file. E.g. there may be an ASCII
>> file which is detected as UTF-8 even though it contains bytes with
>> values over 127. Then you may need to reload the file with a forced
>> format...
> Oops, right. That's only for saving.
>
>>>> I had some ideas about this, but as work on it has already started I
>>>> won't try to implement them.
>>>> They were based on the odd/even positions of 0 bytes, in addition to
>>>> newlines and spaces.
>>> Curious: why newlines and spaces to detect the encoding?
>> In UTF-16 text in, let's say, Chinese, there are not many 0 bytes, and
>> there may be valid values with a zero upper byte as well as ones whose
>> lower byte is zero. Newlines and spaces are the most probable
>> characters; their counterparts (with swapped bytes) are valid too, but
>> really rare (0x2000 is en quad; 0x0a00 and 0x0d00 seem to be invalid
>> Unicode code points).
> Interesting.
> But I think for now, just counting null chars should be enough. Won't
> work for the situations you just mentioned, but for all others it will
> work. And it's much better than what we have now which is not detecting
> it at all.
If you don't object, I will change the current implementation. It looks
good overall, but there are a few small issues. E.g. when a file contains
only bytes without the MSB set, the first fast scan will scan the whole
(or almost the whole) file, and then cd/50 may never happen, or the zero
counting may be skipped altogether.
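
For illustration, the idea quoted above (look at the odd/even position of
0 bytes, but only next to spaces and newlines) could be sketched roughly
as follows. This is a Python sketch for brevity, not TortoiseMerge code;
the function name, the `min_hits` threshold and the 10:1 ratio are
assumptions made up for the example.

```python
def guess_utf16_without_bom(data: bytes, min_hits: int = 4):
    """Guess the byte order of BOM-less UTF-16 data.

    Returns "utf-16-le", "utf-16-be", or None when the evidence is too weak.
    The thresholds are illustrative assumptions, not tuned values.
    """
    common = (0x20, 0x0A, 0x0D)  # space, LF, CR
    le_hits = be_hits = 0
    # Walk the buffer one 16-bit code unit (two bytes) at a time.
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        if second == 0 and first in common:
            le_hits += 1  # e.g. 20 00 is a space in little-endian order
        elif first == 0 and second in common:
            be_hits += 1  # e.g. 00 20 is a space in big-endian order
    # The byte-swapped counterparts (such as U+2000) are valid but rare,
    # so a clear majority in one direction is taken as the answer.
    if le_hits >= min_hits and le_hits > 10 * be_hits:
        return "utf-16-le"
    if be_hits >= min_hits and be_hits > 10 * le_hits:
        return "utf-16-be"
    return None
```

Unlike plain zero counting, this still gives an answer for text such as
Chinese, where almost the only 0 bytes belong to spaces and line ends.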

BTW, what is the "// continue slow" part for? I guess I implemented it,
but it seems to be a redundant (slightly faster) version of "// check
remaining text for UTF-8 validity" that runs for up to 7 chars, so I will
remove it. And you make it run only if no NonANSI was found, so it is used
only for the final up to 7 bytes of the file.
"// check remaining text for UTF-8 validity" still seems to cover that case.
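
To make the "up to 7 bytes" remark concrete, here is a rough Python sketch
of a fast scan followed by a slow scan. It assumes the fast scan tests
eight bytes at a time for a set MSB; that is only inferred from the
"up to 7" above, and the names `HIGH_BITS` and `first_non_ascii` are
hypothetical, not taken from the TortoiseMerge source.

```python
HIGH_BITS = 0x8080808080808080  # the top bit of each of eight bytes


def first_non_ascii(data: bytes) -> int:
    """Return the index of the first byte >= 0x80, or -1 if all are 7-bit."""
    n = len(data)
    i = 0
    # Fast scan: test eight bytes at once by masking their top bits.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            break  # some byte in this word has its MSB set
        i += 8
    # Slow scan: either the word that tripped the mask, or the final
    # tail of up to 7 bytes that did not fill a whole word.
    while i < n:
        if data[i] & 0x80:
            return i
        i += 1
    return -1
```

For a pure 7-bit file the fast loop consumes everything except the tail,
which is why any later step that depends on where the fast scan stopped
sees at most 7 bytes.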

Another extreme case: for 2-byte files ASCII is returned, while for
longer ASCII files (7-bit chars) UTF8 may be returned, depending on the
registry settings.

>> Just a note: don't forget UTF-32 in the detection, to make it complete in one go.
> That's a job for maybe 1.9 - right now we only detect those with the
> BOM. And to be honest, I've never even had one file that was encoded
> like that, so even if we rely on the BOM there it won't affect many
> people if we don't detect such files that don't have a BOM.
100% true, but if you design a test with more cases in mind, you usually
come up with better tests (even if you don't implement all of them at once).

Except, of course, if the less common cases make it too hard to
implement ...
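
As an example of keeping the rarer cases in mind, BOM-based detection
that already covers UTF-32 could be sketched like this in Python. The BOM
byte sequences are the standard ones; the function name and the return
shape are assumptions for the example, not TortoiseMerge's API.

```python
def detect_bom(data: bytes):
    """Return (encoding name, BOM length) or (None, 0) if no BOM is found."""
    # UTF-32 LE has to be tested before UTF-16 LE, because its BOM
    # FF FE 00 00 also starts with the UTF-16 LE BOM FF FE.
    boms = (
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xef\xbb\xbf", "utf-8"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The ordering is the kind of detail that is easy to miss if UTF-32 is added
later: a UTF-16 LE file whose first character is U+0000 looks the same as
a UTF-32 LE BOM, so that corner case stays ambiguous either way.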
>
> Stefan
Oto

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=3061068

To unsubscribe from this discussion, e-mail: [dev-unsubscribe_at_tortoisesvn.tigris.org].
Received on 2013-07-23 12:16:05 CEST
