Re: converting unconvertible UTF-8 data

From: Karl Fogel <kfogel_at_newton.ch.collab.net>
Date: 2002-07-22 05:21:39 CEST

Ulrich Drepper <drepper@redhat.com> writes:
> > 3) Have a fuzzy conversion function that tries to convert all the
> > data, but if that fails, converts every character it can and
> > replaces the others with ?\XXX (or some standard sequence) to
> > indicate the Unicode value of the failed character.
>
> Preferable to this is the use of transliteration. You are talking
> about a transformation which can lose information anyway. Some iconv()
> implementations (glibc's and GNU libiconv's) support transliteration.
> Just add //TRANSLIT to the to-charset option string of the iconv_open
> call.
>
> The problem with transliteration is, though, that it is locale
> dependent. So the result may differ depending on the selected locale.

That's not the only problem -- the portability issue in your previous
paragraph is the real showstopper.

What happens if we append "//TRANSLIT" to a charset name under an
iconv implementation that doesn't know anything about transliteration?
Is it guaranteed to ignore unknown suffixes of the form "//FOO", or can
it bomb because it can't find the charset named "ISO-8859-1//TRANSLIT"?

If we at least know that adding "//TRANSLIT" will do no harm, then we
could add it right away (where it's not present already). But if it
could cause a problem, then it doesn't help us.
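
One way to sidestep the question rather than answer it is to try the
"//TRANSLIT" name first and retry with the bare charset name if
iconv_open() refuses it. A minimal sketch, assuming only the POSIX
iconv_open() call; the helper name and the used_translit flag are
made up for illustration. It only covers the case where the unknown
suffix makes iconv_open() fail outright, not an implementation that
accepts the name and then behaves differently.

```c
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper (not an existing Subversion or APR function):
   open a converter to TOCODE from FROMCODE, preferring a
   transliterating one.  Sets *USED_TRANSLIT to 1 if the "//TRANSLIT"
   name was accepted, 0 otherwise.  Returns (iconv_t) -1 if neither
   name could be opened. */
static iconv_t
open_with_optional_translit(const char *tocode, const char *fromcode,
                            int *used_translit)
{
  char spec[128];
  iconv_t cd = (iconv_t) -1;
  int n = snprintf(spec, sizeof spec, "%s//TRANSLIT", tocode);

  *used_translit = 0;

  /* Only try the suffixed name if it fit in the buffer. */
  if (n > 0 && (size_t) n < sizeof spec)
    cd = iconv_open(spec, fromcode);

  if (cd != (iconv_t) -1)
    *used_translit = 1;
  else
    cd = iconv_open(tocode, fromcode);  /* fall back to the bare name */

  return cd;
}
```

A caller would check the result against (iconv_t) -1 as usual and
release it with iconv_close(). Whether a given iconv silently ignores
"//TRANSLIT" instead of rejecting it is still something this sketch
cannot detect.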

Either way, we may still eventually want our own fuzzy function to
supply whatever cannot be depended on from iconv. It's good if
Subversion behaves as close to the same everywhere as possible.

And we can eventually give our "fuzzy" function the option of doing
transliteration. But I think the initial implementation would do
better to output ?\XXX for each unconverted byte, since that's simple
to get right initially. Incremental improvements (perhaps with
additional run-time options) are possible from there.
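
To make the ?\XXX idea concrete, here is a rough sketch of such a
fuzzy function on top of plain iconv(). The function name, its
signature, and the choice of three decimal digits for XXX are
assumptions for illustration, not a settled design. It also assumes
the target charset is ASCII-compatible, and that iconv() reports an
unconvertible character with an error; an implementation that quietly
substitutes its own replacement character would never reach the
escape path.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: convert SRCLEN bytes at SRC from FROMCODE to
   TOCODE, writing a NUL-terminated result into DST (DSTLEN bytes).
   Each input byte that iconv() refuses is written as "?\XXX", where
   XXX is the byte's value in decimal.  Returns 0 on success, -1 if
   no converter is available or DST is too small. */
static int
fuzzy_convert(const char *fromcode, const char *tocode,
              const char *src, size_t srclen,
              char *dst, size_t dstlen)
{
  iconv_t cd;
  char *in = (char *) src;  /* iconv() wants char **, input is not modified */
  size_t inleft = srclen;
  char *out = dst;
  size_t outleft;
  int status = 0;

  if (dstlen == 0)
    return -1;
  cd = iconv_open(tocode, fromcode);
  if (cd == (iconv_t) -1)
    return -1;
  outleft = dstlen - 1;  /* keep one byte back for the terminator */

  while (inleft > 0)
    {
      if (iconv(cd, &in, &inleft, &out, &outleft) != (size_t) -1)
        continue;  /* all remaining input was converted */

      /* Output full, or no room left for a five-byte escape. */
      if (errno == E2BIG || outleft < 5)
        {
          status = -1;
          break;
        }

      /* EILSEQ or EINVAL: escape the offending byte and step past it. */
      snprintf(out, 6, "?\\%03u", (unsigned int) (unsigned char) *in);
      out += 5;
      outleft -= 5;
      in++;
      inleft--;
    }

  *out = '\0';
  iconv_close(cd);
  return status;
}
```

Because it skips one byte at a time, a multi-byte character that
cannot be converted comes out as several escapes in a row, one per
byte, which matches the "for each unconverted byte" behavior above.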

-K

Received on Mon Jul 22 05:34:31 2002