señior ¿tyrtle? wrote:
> On 7/26/05 2:22 AM, "Julian Foad" <email@example.com> sneezed:
>>To encode the value, we would probably choose base-64, but consider the case of
>>a textual field that is mostly ASCII but has some non-UTF-8 bytes too, or is
>>entirely UTF-8 but contains some characters that are not valid in XML. To
>>allow such property values to be readable we might prefer to use an encoding
>>which preserves most of the text in a readable form, and just escapes the
>>disallowed characters or bytes.
I was thinking of things like French text encoded in ISO-8859-2 (Central
European 8-bit encoding), where most of the characters are ASCII but some are
neither ASCII nor UTF-8. In such cases, I was thinking that it would be nice
if the text were mostly readable, with the accented characters represented by
some escaping mechanism. But maybe I'm being irrational. Maybe that wouldn't
generally be readable enough to be really useful, and would require yet another
form of encoding/decoding, so maybe it would be more useful to have the value
encoded instead in a standard form such as base-64.
I'll abandon that idea.
> Didn't the [short lived] discussion on types in properties come to the
> conclusion that just because data -can- be represented in one way does not
> mean any conclusions should be drawn about that data?
Well, yes, but that was about divining the meaning of the data. I'm now
looking for a solution for how best to encode or represent the data without
reference to its meaning.
> Just because a
> property is mostly ASCII doesn't mean it is, or is a particularly useful
> string when rendered as such (the first few bytes of many file formats are
> ASCII, but I could pick one or two that would be actually meaningful to me).
Yes. However, I think it is safe to say that very many property values found
in Subversion will contain text. Of those, many will be plain ASCII or UTF-8.
To encode them all (apart from "svn:*", perhaps) as base-64 when they would
otherwise have been readable plain text would make the XML less efficient
space-wise and unfriendly for humans to deal with.
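
To put a rough number on the space cost: base-64 emits four output characters for every three input bytes (rounded up to a whole group), so an encoded value is about a third larger than the original. A small sketch, where `base64_size` is just an illustrative helper name:

```python
import base64


def base64_size(n: int) -> int:
    """Length of the padded base-64 encoding of n input bytes."""
    return 4 * ((n + 2) // 3)


value = b"A short, readable log message.\n"
encoded = base64.b64encode(value)

# The encoded form is longer than the plain text and no longer readable.
print(len(value), len(encoded), encoded.decode("ascii"))
```
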
My latest suggestion is to use the property value directly if it is an XML-safe
byte sequence, otherwise encode it in base-64.
That means that ASCII and UTF-8 property values will be readable unless they
contain certain unusual characters. Some arbitrary binary values will be
presented as (generally meaningless) UTF-8 text. Some non-UTF-8 strings will
also be presented as UTF-8 text, but only the ASCII parts will be shown
correctly. Everything else will be in base-64.
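
A minimal sketch of that choice, assuming "XML-safe" means "valid UTF-8 in which every character matches the XML 1.0 `Char` production". The function names and the `"string"`/`"base64"` labels are made up for illustration, not taken from any Subversion code:

```python
import base64


def is_xml_safe(value: bytes) -> bool:
    """True if value is valid UTF-8 made only of XML 1.0 characters."""
    try:
        text = value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        allowed = (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
        if not allowed:
            return False
    return True


def encode_prop_value(value: bytes):
    """Return (representation, text), choosing only on what is possible."""
    if is_xml_safe(value):
        return "string", value.decode("utf-8")
    return "base64", base64.b64encode(value).decode("ascii")


print(encode_prop_value(b"plain ASCII"))   # kept as a string
print(encode_prop_value(b"caf\xe9"))       # Latin-1 byte, not UTF-8: base-64
print(encode_prop_value(b"a\x00b"))        # NUL is not allowed in XML: base-64
```

The text would still need the usual `&`, `<` and `>` escaping when written out. One caveat with this definition: a carriage return is a legal XML character, but parsers normalize line endings, so a stricter check might send values containing CR to base-64 as well.
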
This is not divining the meaning of the data, it is just choosing between two
different encodings based purely on whether it is possible to use the more
efficient one. It has the pleasant and intentional effect of making text
readable where easily possible, but not guaranteeing to do so. Any user of
this XML output has to be prepared to base-64-decode any property value and
cannot know in advance whether it will need to do so.
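
On the consuming side, that could look something like the sketch below. The `property` element and the `encoding="base64"` attribute are hypothetical; nothing in this thread has settled on an actual format yet.

```python
import base64
import xml.etree.ElementTree as ET


def read_prop_value(elem) -> bytes:
    """Recover the raw property bytes whichever representation was used."""
    text = elem.text or ""
    if elem.get("encoding") == "base64":   # hypothetical marker attribute
        return base64.b64decode(text)
    return text.encode("utf-8")


doc = ET.fromstring(
    "<properties>"
    '<property name="note">plain text</property>'
    '<property name="blob" encoding="base64">AAEC/w==</property>'
    "</properties>"
)

# The caller handles both cases and always ends up with bytes.
values = {p.get("name"): read_prop_value(p) for p in doc.findall("property")}
print(values)
```
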
I don't think it is a problem that some unreadable values will appear as
strings, so long as the format indicates "This value is represented here as a
string" rather than "This value is semantically a text string".
Does that sound reasonable?
Received on Tue Jul 26 14:06:50 2005