On 17 February 2010 16:32, Felix Saphir <felix.saphir_at_presswatch.de> wrote:
> Gert Kello wrote:
> >> While you might be correct about TortoiseMerge and BOM, UTF-8 has a
> >> defined byte-order, so there is no need for a BOM (see
> >> <http://www.unicode.org/faq/utf_bom.html#bom5>).
> >
> > Well, actually there is. From the same page,
> >
> > Some protocols allow optional BOMs in the case of untagged text. In those
> > cases,
> > - Where a text data stream is known to be plain text, but of unknown
> > encoding, BOM can be used as a signature. If there is no BOM, the
> > encoding could be anything.
> >
> > That is usually the case for plain-text files such as program source
> > code: you do not know which encoding should be used.
>
> Correct, but would you really rely on the BOM to detect the encoding?
> What if I used an editor unaware of Unicode (and there are plenty) to
> insert a byte sequence that has no meaning in UTF-8? You (or your
> program) can detect that sequence only by looking at the contents; the
> BOM (whether present or not) does not help you at all.
>
If the file has a BOM and the content is not valid UTF-8, there's no need
to search further: the data is corrupted somehow.
If it does not have a BOM, then a trial-and-error approach is needed to find
out the real encoding.
So yes, not a reliable marker, but still a helpful one.
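
A rough sketch of that logic in Python, just to illustrate the idea. The
function name guess_encoding, the status strings, and the fallback list are
made up for this example; a real tool would likely try more encodings and
also handle the UTF-16/UTF-32 BOMs.

```python
import codecs


def guess_encoding(data: bytes, fallbacks=("utf-8", "latin-1")):
    """Return (encoding, status) for the raw bytes of a text file.

    Hypothetical helper: the status values "ok", "corrupt", "guess"
    and "unknown" are illustrative, not taken from any existing tool.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The BOM claims UTF-8, so no other encoding needs to be tried.
        body = data[len(codecs.BOM_UTF8):]
        try:
            body.decode("utf-8")
            return "utf-8-sig", "ok"
        except UnicodeDecodeError:
            # BOM present but the content is not valid UTF-8: corrupted.
            return "utf-8-sig", "corrupt"

    # No BOM: trial and error over candidate encodings, in order.
    # Note that latin-1 accepts any byte sequence, so it always "fits".
    for enc in fallbacks:
        try:
            data.decode(enc)
            return enc, "guess"
        except UnicodeDecodeError:
            continue
    return None, "unknown"
```

With a BOM the answer is either "ok" or "corrupt"; without one, the result
is only ever a guess, since the same bytes may be valid in several encodings.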
Gert
------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=4061&dsMessageId=2448383
Received on 2010-02-17 15:42:15 CET