Re: Bug with UTF-8 files

From: Felix Saphir <felix.saphir_at_kantarmedia.com>
Date: Thu, 28 Jul 2011 10:02:53 +0200

On 28.07.2011 09:35, Ulrich Eckhardt wrote:
> On Wednesday 27 July 2011, you wrote:
>> Hi. I'm using TortoiseSVN 1.6.16, Build 21511, and have found the
>> following bug:
>>
>> A patch with newly created file(s) in the UTF-8 codepage is applied
>> wrongly. Here is the explanation:
>>
>> 1. Create a new file in UTF-8 (without BOM).
>> 2. Add to it some lines with text in a few languages that have different
>> ANSI codepages (e.g. Russian (ANSI 1251), Polish (ANSI 1250),
>> English (ANSI 1252), etc.).
>> 3. Create a patch using TortoiseSVN. At this stage all looks fine: when
>> you open the patch, the codepage is treated as UTF-8 and all chars are
>> OK.
>> 4. Revert the changes to the tree (or use another tree) and apply the
>> patch. TortoiseSVN will create the needed file, but it will be not in
>> utf-8 but in ansi with broken non1252-chars.
>
> Just to confirm, did you verify with a hex editor or similar tool that the
> file did contain valid UTF-8 after editing (step 2) and that it didn't contain
> valid UTF-8 after applying the patch (step 4)? The point is that without the
> BOM some tools will apply heuristics which can and do fail.

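There is an exact test for UTF-8: a byte sequence either is well-formed
UTF-8 or it is not, so no heuristics are needed for that part.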
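
As a rough sketch of such a test (Python is just my choice here, and
is_valid_utf8 is an illustrative name, not something from TortoiseSVN), a
strict decode of the raw bytes either succeeds or fails:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (strict decode, no guessing)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Text from several scripts, encoded as UTF-8, passes the check.
good = "Привет, świat, αβ".encode("utf-8")
# The same Russian word in the single-byte codepage 1251 does not.
bad = "Привет".encode("cp1251")

print(is_valid_utf8(good), is_valid_utf8(bad))
```

Note that plain ASCII also passes, so a True result only says the bytes are
well-formed, not that the original characters survived.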

> What puzzles me is also your explanation. You say the file is "not in utf-8
> but in ansi with broken non1252-chars", what exactly does that mean? If you
> open a file with text encoded in UTF-8 and interpret its contents differently,
> e.g. as the current single-byte codepage, of course its content is garbled.

I can confirm this: the patch was correct UTF-8, the file created by the
patch was not. All the "funny" characters were replaced by a question
mark, except for the Greek characters: alpha became 'a', beta became 'ß'.
I've checked this in a hex editor.
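
For illustration only: the question marks look like what a lossy conversion
to a single-byte codepage produces. That this is what happens when the patch
is applied is my assumption, not something I have verified in the code. A
minimal Python sketch (lossy_to_cp1252 is a made-up helper name) that shows
the same kind of damage at the byte level:

```python
def lossy_to_cp1252(text: str) -> bytes:
    """Encode to codepage 1252, replacing every unmappable character with '?'."""
    return text.encode("cp1252", errors="replace")


original = "Привет, świat"
utf8_bytes = original.encode("utf-8")
damaged = lossy_to_cp1252(original)

# Hex views, comparable to what a hex editor shows for each file.
print(utf8_bytes.hex(" "))
print(damaged.hex(" "))
```

Python's "replace" handler turns every unmappable character into '?', so
this does not reproduce the alpha and beta substitutions I saw.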

Felix

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=4061&dsMessageId=2805192

Received on 2011-07-28 10:03:11 CEST