
Re: Problem with UTF-8-files and creating and applying patches

From: Gert Kello <gert.kello_at_gmail.com>
Date: Wed, 17 Feb 2010 16:42:10 +0200

On 17 February 2010 16:32, Felix Saphir <felix.saphir_at_presswatch.de> wrote:

> Gert Kello wrote:
> >> While you might be correct about TortoiseMerge and BOM, UTF-8 has a
> >> defined byte-order, so there is no need for a BOM (see
> >> <http://www.unicode.org/faq/utf_bom.html#bom5>).
> >
> > Well, actually there is. From the same page,
> >
> > Some protocols allow optional BOMs in the case of untagged text. In those
> > cases,
> > - Where a text data stream is known to be plain text, but of unknown
> > encoding, BOM can be used as a signature. If there is no BOM, the
> > encoding could be anything.
> >
> > That is usually the case for plain-text files such as program source
> > code: you do not know which encoding was used.
>
> Correct, but would you really rely on the BOM to detect the encoding?
> What if I used an editor unaware of Unicode (and there are plenty) to
> insert a byte sequence that has no meaning in UTF-8? You (or your
> program) can detect that sequence only by looking at the contents; the
> BOM (whether present or not) does not help you at all.
>

If the file has a BOM and the content does not make sense in UTF-8, there is
no need to search further: the data is corrupted somehow.

If it does not have a BOM, then a trial-and-error approach is needed to find
out the real encoding.

So yes, it is not a reliable marker, but it is still helpful.
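
The two cases above could be sketched roughly like this in Python. This is only an illustration of the reasoning, not what TortoiseMerge or Subversion actually do. The function name guess_encoding and the FALLBACK_ENCODINGS list are made up for the example; the thread does not name any particular candidate encodings.

```python
import codecs

# Hypothetical candidate list for the "no BOM" case (an assumption, not
# taken from the thread). latin-1 accepts any byte sequence, so it acts
# as a last resort and must come last.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw file contents."""
    if data.startswith(codecs.BOM_UTF8):
        # BOM present: the rest must be valid UTF-8. If it is not, the
        # data is corrupted and there is no point in searching further.
        try:
            data[len(codecs.BOM_UTF8):].decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(
                "UTF-8 BOM present but content is not valid UTF-8"
            ) from exc
        return "utf-8-sig"

    # No BOM: trial and error over the candidate encodings.
    for name in FALLBACK_ENCODINGS:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name

    # Unreachable with the list above, kept for shorter candidate lists.
    raise ValueError("no candidate encoding matched")
```

Note that a successful decode in the fallback loop only means the bytes are valid in that encoding, not that the guess is right, which is exactly why the BOM is still a useful hint when it is there.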

Gert

------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=4061&dsMessageId=2448383

Received on 2010-02-17 15:42:15 CET

This is an archived mail posted to the TortoiseSVN Users mailing list.
