
Re: Character Encoding

From: Kevin Grover <kevin_at_kevingrover.net>
Date: Thu, 26 Jun 2008 14:57:01 -0700

Which editor on Windows? (What version of Windows?)
What version of AIX?

As I mentioned: getting encoding correct between editors is
(currently) non-trivial. For most OSes and encodings, there's no
indication (in the file) of which encoding was used. Also remember
that Subversion's encoding handling is for the data svn itself stores
(file names, log messages, etc., NOT the contents of the file).

So, when you edit a file on Windows, you're getting cp1252, _unless_
you used a smart text editor and asked for another encoding.

When you use AIX you're getting another encoding (not sure what).
What does your locale say?
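
If Python happens to be installed on the AIX box (an assumption on my part), a minimal sketch like this reports what the environment implies. The function name describe_locale is just something I made up for the example:

```python
import locale
import os

def describe_locale():
    """Return (value of LANG, preferred encoding) for the current environment."""
    try:
        # Adopt the user's locale settings (LANG, LC_ALL, ...) instead of "C".
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        pass  # unsupported locale name; keep whatever is already active
    return os.environ.get('LANG'), locale.getpreferredencoding()

lang, enc = describe_locale()
print('LANG =', lang)
print('preferred encoding =', enc)
```

Whatever this prints is the encoding that locale-aware programs will tend to assume for files that carry no marker of their own.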

If you used Notepad in Windows, you can use 'Save As' and choose an
encoding. Choices are: 'ANSI', 'Unicode', 'Unicode big endian', or
'UTF-8'. If you use Notepad to 'Save As', choose UTF-8, and overwrite
the original file, svn will probably then see the file as changed
(because it was originally in the native Windows encoding, but is now
in another encoding).
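
To see why the file counts as changed, here is a small sketch that encodes the same text both ways and compares the bytes. The sample word and the hexdump helper are my own choices for illustration, and I'm assuming 'ANSI' means cp1252 on the Windows machine:

```python
def hexdump(data):
    """Format a bytes object as space-separated two-digit hex values."""
    return ' '.join('%02x' % b for b in data)

text = 'caf\u00e9\r\n'                # "café" followed by a Windows line ending

ansi_bytes = text.encode('cp1252')    # assumed meaning of Notepad's 'ANSI'
utf8_bytes = text.encode('utf-8')     # UTF-8 without any marker in front

print(hexdump(ansi_bytes))            # 63 61 66 e9 0d 0a
print(hexdump(utf8_bytes))            # 63 61 66 c3 a9 0d 0a
print(ansi_bytes == utf8_bytes)       # False
```

The accented letter is one byte (e9) in cp1252 and two bytes (c3 a9) in UTF-8, so any tool that compares bytes will see a modification even though the text looks identical in the editor.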

Try saving it as Unicode or UTF-8 and see how vi behaves on AIX. Keep
in mind: UTF-8 files have no indicator in them that they are UTF-8 ---
so if your AIX environment is not natively running in UTF-8, vi will be
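running using another encoding. If you use vim (Vi IMproved) on AIX,
it can try to detect the encoding itself (see its 'fileencodings'
option); as far as I know it does not act on an Emacs-style
-*- coding: utf8 -*- line, even if you added one.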
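
The garbling you get when the viewer assumes the wrong encoding can be reproduced in a few lines. This is only a sketch with a sample word of my own; I don't know which encoding your AIX session actually uses, so both directions are shown:

```python
original = 'fran\u00e7ais'                 # "français"

utf8_bytes = original.encode('utf-8')
cp1252_bytes = original.encode('cp1252')

# UTF-8 bytes shown by a viewer that assumes a single-byte encoding:
seen_as_cp1252 = utf8_bytes.decode('cp1252')
# cp1252 bytes shown by a viewer that assumes UTF-8 (bad bytes replaced):
seen_as_utf8 = cp1252_bytes.decode('utf-8', errors='replace')

# ascii() keeps the output printable on any console.
print(ascii(seen_as_cp1252))   # 'fran\xc3\xa7ais'  (displays as "franÃ§ais")
print(ascii(seen_as_utf8))     # 'fran\ufffdais'    (replacement character)
```

In both cases the bytes in the file are perfectly fine; only the interpretation is wrong, which matches the symptom of characters that look garbled until they are retyped in the local editor.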

If you save the file as Unicode (or Unicode big endian) then it will
be saved with 16-bit (2-byte) characters (a lot of zeros in there).
But any Unicode-aware editor should be able to open and edit the file.
HOWEVER, these types of files are currently treated as binary by svn
(no diffs or merges).
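
I don't know the exact test svn applies, so take the following as an illustration only: a hypothetical looks_binary check that flags any NUL byte, which is the kind of thing that makes 16-bit text look like binary data to byte-oriented tools.

```python
def looks_binary(data):
    """Hypothetical heuristic (not svn's): any NUL byte means 'binary'."""
    return b'\x00' in data

text = 'This is a test\r\n'

print(looks_binary(text.encode('cp1252')))      # False
print(looks_binary(text.encode('utf-8')))       # False
print(looks_binary(text.encode('utf-16-le')))   # True
print(len(text.encode('utf-16-le')))            # 32 (34 with a 2-byte marker)
```

Plain ASCII text in cp1252 or UTF-8 never contains a zero byte, while every ASCII character in the 16-bit form drags one along.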

More details: for grins, I created some files in Windows (using
Notepad) called 'test-ansi.txt', 'test-utf8.txt', 'test-unicode.txt'
and 'test-unicode-be.txt'.

They all contain "This is a test\r\n"

Here are the sizes:

H:\>dir test*
06/26/2008 14:41 16 test-ansi.txt
06/26/2008 14:42 34 test-unicode-be.txt
06/26/2008 14:41 34 test-unicode.txt
06/26/2008 14:41 19 test-utf8.txt

I then ran this python script:

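# Dump each file's raw bytes as hex (Python 3 form of the script).
files = ['test-ansi.txt', 'test-utf8.txt', 'test-unicode.txt',
         'test-unicode-be.txt']
for f in files:
    with open(f, 'rb') as fh:  # binary mode: no decoding, no newline translation
        s = fh.read()
    print(f, ' '.join('%02x' % b for b in s))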

And received this output:

test-ansi.txt 54 68 69 73 20 69 73 20 61 20 74 65 73 74 0d 0a
test-utf8.txt ef bb bf 54 68 69 73 20 69 73 20 61 20 74 65 73 74 0d 0a
test-unicode.txt ff fe 54 00 68 00 69 00 73 00 20 00 69 00 73 00 20 00
61 00 20 00 74 00 65 00 73 00 74 00 0d 00 0a 00
test-unicode-be.txt fe ff 00 54 00 68 00 69 00 73 00 20 00 69 00 73 00
20 00 61 00 20 00 74 00 65 00 73 00 74 00 0d 00 0a

Which is the filename followed by a hex dump of the bytes in the file.
I was wrong: the UTF-8 file (at least as created by Notepad) does have
a magic marker (ef bb bf, the UTF-8 byte order mark) at the front.
None of the text editors I use add anything. When I tried to read the
UTF-8 file in Python using the utf8 encoding (import codecs; s =
codecs.open('test-utf8.txt', 'r', 'utf8').read()), it failed with an
encoding error.
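
For what it's worth, the markers in the dumps above are enough to pick a decoder. Here is a rough sketch using the BOM constants from Python's codecs module; the names sniff_bom and decode_bytes and the cp1252 fallback are my own assumptions, not anything svn or Notepad does:

```python
import codecs

# Marker bytes paired with a codec that consumes the marker while decoding.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),     # ef bb bf
    (codecs.BOM_UTF16_LE, 'utf-16'),    # ff fe
    (codecs.BOM_UTF16_BE, 'utf-16'),    # fe ff
]

def sniff_bom(data, default='cp1252'):
    """Return a codec name based on a leading marker, else the default."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default  # no marker: we can only guess

def decode_bytes(data, default='cp1252'):
    """Decode bytes using the sniffed codec; the marker itself is dropped."""
    return data.decode(sniff_bom(data, default))

sample = b'\xef\xbb\xbfThis is a test\r\n'   # same bytes as test-utf8.txt
print(sniff_bom(sample))                     # utf-8-sig
print(repr(decode_bytes(sample)))            # 'This is a test\r\n'
```

The 'utf-8-sig' codec skips the ef bb bf marker, whereas the plain 'utf8' codec hands it back as an extra first character (U+FEFF). Files without any marker still need a human to say what the encoding is.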

I'm still sorely confused by all the encoding issues. Anyway, I
hope this helps clarify your Windows - AIX issues.

- kevin

On Thu, Jun 26, 2008 at 3:25 AM, <boliver_at_lvlomas.com> wrote:
> Thanks for the info. Let me give a walk-through of this problem; hopefully that will help narrow down the possible issues.
>
> - file is viewed in a Windows based editor and the encoding is fine
> - file is committed to Subversion
> - an svn export is then done on the file to an AIX system
> - when I view the file in a VI editor on the AIX system, the encoding is wonky and the French characters are garbled
> - but if I delete the garbled characters and paste the French characters back into the file, while still using the VI editor, they go in fine.
>
> So I am not sure where the glitch or problem is. It may not be a Subversion problem at all but this is where I am starting.
>
> Thanks.
>
>
> Bryan N Oliver
> Supervisor, Software Development
> L.V. Lomas Limited
> boliver_at_lvlomas.com
> Phone: (905) 458-7111 ext 395
> Cell: (416) 999-2377
> Fax: ( 905 ) 458-3571
>
>
> ----- Original Message -----
> From: "Kevin Grover" [kevin_at_kevingrover.net]
> Sent: 06/25/2008 02:59 PM MST
> To: Bryan Oliver
> Cc: users_at_subversion.tigris.org
> Subject: Re: Character Encoding
>
>
>
> On Wed, Jun 25, 2008 at 1:19 PM, <boliver_at_lvlomas.com> wrote:
>>
>>
>> I have a file that I store in Subversion. The file is plain text but does
>> contain some French character text which needs to be encoded properly with
>> UTF-8. These characters are encoded properly in the Subversion system
>> because when I view the file through the Tortoise client I see the correct
>> characters. But when I update this file on another machine, the character
>> encoding isn't correct and the French text comes out garbled.
>>
>> I am doing the update on an AIX machine into a working directory.
>>
>> Is there a way I can tell the Subversion client on the AIX box about the
>> encoding???
>>
>> Thanks.
>>
>>
>> ----------------------------------------------------------------------------
>>
>>
>>
>> CONFIDENTIALITY: The information in this message is legally privileged and
>> confidential. In the event of a transmission error and if you are not the
>> individual or entity mentioned above, you are hereby advised that any use,
>> copying or reproduction of this document is strictly forbidden. Please
>> advise us of this error and destroy this message.
>>
>>
>> CONFIDENTIALITÉ: L'information apparaissant dans ce message électronique
>> est de nature légalement privilégiée et confidentielle. Si ce message vous
>> est parvenu par erreur et que vous n'êtes pas le destinataire visé, vous
>> êtes par les présentes avisé que tout usage, copie ou distribution de ce
>> message est strictement interdit. Vous êtes donc prié de nous informer
>> immédiatement de cette erreur et de détruire ce message.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe_at_subversion.tigris.org
>> For additional commands, e-mail: users-help_at_subversion.tigris.org
>>
>>
>
> The only changes made to text files are line endings (if you set
> svn:eol-style). No transcoding happens.
>
> Subversion uses encodings for filenames (and properties and messages
> it prints out). The contents of the file are _not_ part of the deal.
> You need to make sure you edit the file correctly. There are many
> editors for Windows that can edit files in UTF8 or UTF16 (notepad can,
> I believe). You need to make sure you use an editor that does proper
> encoding on both machines. Windows uses cp1252 for a default
> encoding. I have no idea what AIX uses. If it's a relatively modern
> version, it probably uses utf8. (You can probably look at the value
> of the LANG env var, or the output of the locale command).
>
> If you use TSVN, how are you looking at the contents of the file? I
> didn't think TSVN had a viewer application?
>
> Some caveats: some encodings (UTF-16 and UTF-8 WITH BOM (Byte Order
> Mark)) embed magic at the beginning of the file so that readers can
> figure out what the encoding is. Most others (plain text files for
> example) have no indication. You (as the user) must know where the
> file was created and where it will be used.
>
> Some editors (Emacs) and languages (Python) look for special markup in
> the file (-*- coding: utf-8 -*-) or (-*- coding: latin-1 -*-) and will
> use the specified encoding.
>
> XML files default to UTF-8 if not specified otherwise --- but most
> text-only editors don't know this and don't do the right thing: when
> you insert high-bit characters (ord(x)>127), they just insert the raw
> character code for the default encoding of the system.
>
> Because of the above, even if you have a properly encoded file, you
> may see garbage when viewing/editing it with a program that is
> unaware of the encoding used by the file (it tries to use its own
> system default encoding).
>
> Hope this helps some. Good luck.
>

Received on 2008-06-26 23:57:30 CEST

This is an archived mail posted to the Subversion Users mailing list.
