Re: Ensuring File Encoding

From: B. Smith-Mannschott <bsmith.occs_at_gmail.com>
Date: Thu, 1 Oct 2009 21:36:21 +0200

2009/10/1 David Weintraub <qazwart_at_gmail.com>:
> We are beginning to have problems with file encoding. We want to ensure all files we commit are in fact encoded in UTF-8. I would like to add this ability to my pre-commit hook and reject any commit that contains files (well, text files) that aren't encoded in UTF-8. But I am not 100% sure how to test a file's encoding.
>
> How can I test to see if a file is encoded in UTF-8?

I just do something like this. It works well enough in practice, since
not all possible byte sequences are valid UTF-8.

def looks_like_utf8(data):
    """Attempt to decode data (a byte string) under the assumption
    that it is UTF-8. Return False if this raises a
    UnicodeDecodeError, otherwise return True."""
    # The parameter is named data so it does not shadow the bytes builtin.
    try:
        data.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    else:
        return True

def looks_like_utf8_file(path):
    # open() replaces the Python 2-only file(); "with" closes the handle.
    with open(path, "rb") as f:
        return looks_like_utf8(f.read())
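
A rough sketch of how the same decode test might be wired into a
pre-commit hook, since that is what the question asks about. All helper
names here (first_invalid_utf8_offset, changed_file_paths, find_non_utf8,
svnlook, main) are hypothetical. It assumes "svnlook changed" prints a
status field in the first four columns followed by the path, with
directories ending in "/". It makes no attempt to tell text files from
binary ones, so a real hook would likely need a filter of some kind (on
svn:mime-type or file extension, for instance).

```python
import subprocess
import sys


def first_invalid_utf8_offset(data):
    """Return the byte offset where UTF-8 decoding fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def changed_file_paths(changed_output):
    """Pick added/modified file paths out of `svnlook changed` text output.

    Assumption: status occupies the first four columns, the path follows.
    """
    paths = []
    for line in changed_output.splitlines():
        status, path = line[:4], line[4:]
        if not path or path.endswith("/"):  # blank line or directory
            continue
        if status.startswith("D"):  # deleted: nothing to check
            continue
        paths.append(path)
    return paths


def find_non_utf8(paths, read_bytes):
    """Return [(path, offset)] for each path whose content is not valid UTF-8.

    read_bytes is any callable mapping a path to its content as bytes.
    """
    bad = []
    for path in paths:
        offset = first_invalid_utf8_offset(read_bytes(path))
        if offset is not None:
            bad.append((path, offset))
    return bad


def svnlook(repos, txn, subcommand, *args):
    """Run `svnlook SUBCOMMAND -t TXN REPOS [ARGS...]` and return its stdout bytes."""
    return subprocess.check_output(
        ["svnlook", subcommand, "-t", txn, repos] + list(args))


def main(argv):
    """Hook body: returns 1 (reject) if any changed file fails to decode."""
    repos, txn = argv[1], argv[2]
    changed = svnlook(repos, txn, "changed").decode("utf-8", "replace")
    bad = find_non_utf8(changed_file_paths(changed),
                        lambda path: svnlook(repos, txn, "cat", path))
    for path, offset in bad:
        sys.stderr.write("%s: not valid UTF-8 (byte offset %d)\n" % (path, offset))
    return 1 if bad else 0
```

A hook script would end with sys.exit(main(sys.argv)). Subversion passes
the repository path and the transaction name as the first two arguments,
and relays the hook's stderr to the client when it exits non-zero.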

------------------------------------------------------
http://subversion.tigris.org/ds/viewMessage.do?dsForumId=1065&dsMessageId=2402661

Received on 2009-10-01 21:37:12 CEST

This is an archived mail posted to the Subversion Users mailing list.
