
Re: Call For Votes: converting log messages to UTF-8

From: Hontvari Jozsef <hontvari_at_solware.com>
Date: 2002-05-31 21:09:36 CEST

I guess everybody who regularly uses Latin-2 or any other non-Latin-1 charset
will vote for #1. I am pretty sure from past experience that if you do not
declare explicitly that all text in Subversion is UTF-8 encoded, then in
practice Subversion (and its clients) will typically be used as an ASCII-only
application. (That also means that this should not be a client option;
it must be enforced.)

I do not know what the situation is on Unix, but on Windows using a locale
has been straightforward for years, and I cannot really imagine how a client
could miss a conversion. (The only additional feature that would be
useful, at least in theory, is a way to temporarily override the assumed
character encoding when supplying input to a client in a file. I mean, if my
locale is Latin-2 but I saved the log message in UTF-8, for example, then I
would be able to tell the client that this file is in UTF-8 and not
in Latin-2.)
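
To make the override idea concrete, here is a minimal Python sketch. It is
only an illustration, not Subversion code: the function name to_utf8 and its
encoding parameter are hypothetical. It assumes the client decodes the log
message with the locale's encoding unless the user names another one, and
then re-encodes the text as UTF-8.

```python
import locale
from typing import Optional


def to_utf8(raw: bytes, encoding: Optional[str] = None) -> bytes:
    """Convert a log message to UTF-8 (hypothetical helper).

    If no encoding is given, fall back to the locale's preferred
    encoding, which is the "assumed" encoding discussed above.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # Strict decoding: bytes that are invalid in the chosen encoding
    # raise UnicodeDecodeError instead of being silently replaced.
    text = raw.decode(encoding)
    return text.encode("utf-8")


# A Latin-2 log message, converted with an explicit override.
latin2_message = "árvíztűrő".encode("iso-8859-2")
utf8_message = to_utf8(latin2_message, encoding="iso-8859-2")
```

Note that a wrong guess is not always detected. UTF-8 bytes decoded as
Latin-2 raise no error, because every byte value is valid in Latin-2; the
result is simply garbled. That is why an explicit override is useful.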

----- Original Message -----
From: "Karl Fogel" <kfogel@newton.ch.collab.net>
To: <sussman@collab.net>
Cc: <dev@subversion.tigris.org>
Sent: Friday, May 31, 2002 6:09 PM
Subject: Re: Call For Votes: converting log messages to UTF-8

> Ben Collins-Sussman <sussman@collab.net> writes:
> > In my mind, risk #1 is much more dangerous. If the logmsg is
> > accidentally corrupted at input-time, it's gone forever. This is much
> > worse than possibly seeing a garbled display in some GUI textbox --
> > that problem is fixable by heuristics (or project policy).
>
> I'd like to add that I have used such heuristics in real life.
>
> More than once, I've had data in some unknown charset (I knew it was
> Chinese, I just didn't know which encoding). I've put it in a display
> editor and basically flipped through various encodings until suddenly
> it "clicked" and the text made sense.
>
> This heuristic depends on user feedback, but it's 100% reliable (data
> rarely makes sense in two different encodings :-) )... And most
> importantly, it was only possible because I had the *original* data,
> not some tool's mis-reencoding of the original data.
>
> This is why I don't buy the argument that the data is "useless" if you
> don't know the charset. It's simply not true. You may have to do
> some work, even some non-automatable work, but you can almost always
> figure it out with some basic educated guessing. (And automated
> algorithms based on word-frequency tables are easy to imagine, though
> I don't know if anyone's implemented that yet.) But nothing can be
> done if the original data was lost.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org
>
>
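
The "flip through encodings" heuristic described in the quoted message can be
sketched in a few lines of Python. This is an assumption-laden illustration,
not anything from Subversion: the function name candidate_decodings and the
list of encodings are made up for the example. It only works because the
original bytes are still available. A human still has to pick the decoding
that reads sensibly.

```python
from typing import Dict, Iterable


def candidate_decodings(raw: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Decode raw bytes with each candidate encoding (hypothetical helper).

    Encodings that reject the bytes, or that Python does not know,
    are skipped. The caller inspects what is left.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
    return results


# Bytes in an "unknown" Chinese encoding (Big5 here, for the example).
mystery = "中文".encode("big5")
for name, text in candidate_decodings(mystery, ["utf-8", "big5", "gb2312"]).items():
    print(name, repr(text))
```

Several encodings may decode the same bytes without error, so a successful
decode proves nothing by itself; the reader's judgment is the final filter.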

Received on Sat Jun 1 14:11:19 2002

This is an archived mail posted to the Subversion Dev mailing list.