
Re: Issue 520 in tortoisesvn: TortoiseMerge fails to detect utf-16 without BOM

From: Oto BREZINA <brezina_at_printflow.eu>
Date: Sat, 20 Jul 2013 20:32:08 +0200

On 20 July 2013 19:55, Stefan Küng wrote:
> On 20.07.2013 19:43, Oto BREZINA wrote:
>> On 20 July 2013 19:03, tortoisesvn_at_googlecode.com wrote:
>>> Status: Started
>>> Owner: tortoisesvn
>>> Labels: Type-Defect Priority-Medium Milestone-1.8.1
>>>
>>> New issue 520 by tortoisesvn: TortoiseMerge fails to detect utf-16 without
>>> BOM
>>> http://code.google.com/p/tortoisesvn/issues/detail?id=520
>>>
>>> Files that are encoded in utf-16 but do not have a BOM are treated as ASCII
>>> or UTF8 in TortoiseMerge.
>>> The encoding detection logic only checks the BOM.
>>>
>>> We have to improve that detection logic so that utf-16 files are detected
>>> correctly even if they don't have a BOM.
>>>
>> Note for 1.9: any detection is fine but will never be 100% reliable. That is
>> especially true for short files, where there is not enough data for a
>> reliable statistical analysis. So there should be a way to load a file in an
>> encoding specified by the user, from the load dialog or another menu.
> Already there:
> double click on the status bar at the bottom where the encoding of the
> file is shown.
In the bottom menu you can change the encoding the file will be stored in, not
the encoding used to load the file. E.g. there may be an ASCII file which is
detected as UTF-8 even though it contains bytes with values over 127. Then you
may need to reload the file with a forced format...
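
To illustrate why a forced reload can be needed, here is a minimal Python sketch (not TortoiseMerge code; the function name `decode_with_override` and its parameters are made up for this example). The same two bytes are valid in both UTF-8 and Windows-1252, so no detector can tell which one was meant:

```python
from typing import Optional


def decode_with_override(data: bytes, detected: str,
                         forced: Optional[str] = None) -> str:
    """Decode with the user-forced encoding if given, else the detected one.

    Hypothetical helper: 'detected' stands for whatever the auto-detection
    returned, 'forced' for an encoding the user picked when reloading.
    """
    return data.decode(forced or detected)


# 0xC3 0xA9 is 'é' in UTF-8, but also the two characters 'Ã©' in cp1252.
raw = b"\xc3\xa9"
as_detected = decode_with_override(raw, detected="utf-8")
as_forced = decode_with_override(raw, detected="utf-8", forced="cp1252")
```

Both results are legitimate readings of the bytes; only the user knows which one is right.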
>
>> I had some ideas about this, but as the issue is already started I won't
>> try to implement them.
>> They were based on the odd/even positioning of 0 bytes, in addition to
>> newlines and spaces.
> Curious: why newlines and spaces to detect the encoding?
In UTF-16 text in, let's say, Chinese, there are not many 0 bytes, and there
may be valid values whose upper byte is zero as well as ones whose lower byte
is zero. Newlines and spaces are the most probable characters. Their
counterparts with swapped bytes are valid code units too, but really rare
(0x2000 is en quad; 0x0a00 and 0x0d00 seem to be unassigned code points).
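
A rough Python sketch of the space/newline part of that idea (not the TortoiseMerge implementation; the function name `guess_utf16_byte_order` and the `min_hits` threshold are assumptions for this example). It counts 16-bit units that look like space, CR or LF in each byte order and picks the order with more hits:

```python
from typing import Optional

COMMON = (0x20, 0x0A, 0x0D)  # space, LF, CR


def guess_utf16_byte_order(data: bytes, min_hits: int = 2) -> Optional[str]:
    """Return 'utf-16-le', 'utf-16-be', or None if the evidence is too thin.

    min_hits is an arbitrary threshold chosen for this sketch.
    """
    usable = len(data) - (len(data) % 2)  # look at whole 16-bit units only
    le_hits = be_hits = 0
    for i in range(0, usable, 2):
        first, second = data[i], data[i + 1]
        if second == 0 and first in COMMON:
            le_hits += 1  # little endian: low byte first, e.g. 20 00
        elif first == 0 and second in COMMON:
            be_hits += 1  # big endian: high byte first, e.g. 00 20
    if max(le_hits, be_hits) < min_hits or le_hits == be_hits:
        return None
    return "utf-16-le" if le_hits > be_hits else "utf-16-be"


sample = "你好 世界\n中文 测试\n"
guess_le = guess_utf16_byte_order(sample.encode("utf-16-le"))
guess_be = guess_utf16_byte_order(sample.encode("utf-16-be"))
```

The Chinese sample contains no zero bytes outside the spaces and newlines, so counting zero bytes alone would find little to work with there. Short files still end up as None, which is where a user-selected encoding would be needed.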

One more note: don't forget UTF-32 in the detection, so that it is complete in one go.
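
For UTF-32 without a BOM, one possible check (again only a sketch; `guess_utf32_byte_order` is a made-up name) relies on the fact that every valid code point is at most 0x10FFFF, so a strict decode in the wrong byte order, or of data in another encoding, almost always fails:

```python
from typing import Optional


def guess_utf32_byte_order(data: bytes) -> Optional[str]:
    """Return 'utf-32-le', 'utf-32-be', or None if neither decodes cleanly."""
    if len(data) < 4 or len(data) % 4 != 0:
        return None  # UTF-32 data is always a multiple of 4 bytes
    for name in ("utf-32-le", "utf-32-be"):
        try:
            text = data.decode(name)  # strict: rejects values above 0x10FFFF
        except UnicodeDecodeError:
            continue
        if "\x00" not in text:  # NUL characters suggest binary data
            return name
    return None


utf32_guess = guess_utf32_byte_order("héllo 世界".encode("utf-32-be"))
utf16_guess = guess_utf32_byte_order("hello!".encode("utf-16-le"))
```

Such a check would probably have to run before any UTF-16 test, because ASCII text stored as UTF-32 has zero bytes in positions that also look like UTF-16 with embedded NULs.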

>
> Stefan

Oto

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=3060927

Received on 2013-07-20 20:32:18 CEST

This is an archived mail posted to the TortoiseSVN Dev mailing list.
