[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: [PATCH] Don't strip Content-Type in .po files on Windows

From: Julian Foad <julianfoad_at_btopenworld.com>
Date: Wed, 15 Aug 2012 00:06:17 +0100 (BST)

Philip Martin wrote:
> Роман Донченко <dpb_at_corrigendum.ru> writes:
>>  - On Windows, we only support linking with svn-win32-libintl, which
>>  is hacked to disable all encoding conversions.
>
> I assume it provides bind_textdomain_codeset as a function that does
> nothing.
>
>>  - Even if someone links with with his own version of libintl, it's
>>  a safe bet that it will be new enough to support
>>  bind_textdomain_codeset, so we  can just call that.
>
> If somebody managed to use an old version I guess there would be some
> sort of linker failure.  Build time?  Run time?
>
> If somebody managed to use their own version that did encoding
> conversions I guess that would also be OK since it would be a
> utf8-utf8 no-op.
>
> The patch looks OK but I don't know enough about the Windows build to
> know whether the above is correct.

I agree, without fully understanding all the details.

I went digging in the archives to try to understand the issue.  Maybe
this will help others to do so too.  The email threads around the time
when this 'charset stripping' was originally added are "RFC: Solving
gettext & UTF-8 brokenness" [1] and "Stripping 'charset=' from po files
[the sequal]"[2].  Roman's previous submission of this patch resulted
in a short discussion that petered out [3].

From what I can gather, the stripping of the 'charset=' was originally
introduced as a hack to work around a mis-feature of the 'gettext' or
'libintl'/'libiconv' subsystems, that would undesirably change the
encoding away from UTF-8.  We have two other ways to stop that
re-coding: we call bind_textdomain_codeset() if available; and/or we
link to a 'libintl'/'libiconv' implementation that doesn't do
re-coding.  On Windows we do the latter; we don't currently do the
former, but Roman's patch does.

It appears that Roman finds that that those other methods are
sufficient.  Removing this hack makes his life better (eliminates some
warnings from some translation tools, and simplifies the build system)
and still provides the desired UTF-8 result.

The worst that could happen after applying this patch is someone
building Subversion on Windows, and linking an incompatible
'libintl'/'libiconv', might find that Subversion error messages are
improperly displayed or are replaced by an 'Invalid UTF-8 sequence'
message.  That's no big deal -- it's a straightforward issue for the
builder/maintainer to solve and is not subtle or data-corrupting.

The .po file format is supposed to have this 'charset' field present,
so in my opinion we shouldn't strip it if we don't have to: I could
imagine how stripping it could cause extra warnings in translation
tools and/or extra work for translators operating those tools, even
if we hadn't actually had this report and this patch.

So it seems to me that we should apply this patch.

Brane was involved in the previous discussions and he's away today
so I'd like to wait to hear his input.  If he has no particular
objection then I'll be happy to commit this even though I can't
easily test it.

- Julian

[1] Email thread "RFC: Solving gettext & UTF-8 brokenness" on dev@
in May 2004: <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=345172>
or <http://svn.haxx.se/dev/archive-2004-05/0047.shtml>.

[2] Email thread "Stripping 'charset=' from po files [the sequal]" on
dev@ in May 2004; a relevant place to start reading is the message
from Greg Hudson on 2004-05-13: <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=351459>
or <http://svn.haxx.se/dev/archive-2004-05/0593.shtml>.

[3] Email thread "Re: [PATCH] Remove po file charset stripping on
Windows" on dev@ in Aug/Sep 2009; a relevant place to start reading
is Roman's message on 2009-09-13: <http://subversion.tigris.org/ds/viewMessage.do?dsForumId=462&dsMessageId=2394298>
or <http://svn.haxx.se/dev/archive-2009-09/0225.shtml>.

Roman Donchenko (Роман Донченко) wrote on 6 August 2012:
> I submitted this patch a couple of years ago, but it didn't get much
> attention. I still think it's relevant, so I'll try again.
>
> [[[
> On Windows, don't strip the Content-Type field from .po files during
> their compilation.
>
> GNU libintl, by default, converts the l10n strings into the locale
> encoding, while Subversion requires UTF-8. This conversion can be
> suppressed by calling bind_textdomain_codeset, but certain old
> versions of libintl don't have that, so the Unix build system checks
> for the existence of that function, and if it's not present, strips
> the Content-Type header from the .po files (which prevents encoding
> conversion, as well, but makes msgfmt complain).
>
> When building on Windows, this stripping is done unconditionally,
> but is completely unnecessary:
>
> - On Windows, we only support linking with svn-win32-libintl, which
> is hacked to disable all encoding conversions.
> - Even if someone links with with his own version of libintl, it's a
> safe bet that it will be new enough to support bind_textdomain_codeset,
> so we can just call that.
>
> This patch removes Content-Type stripping on Windows, which gets rid
> of msgfmt warnings, as well as simplifies the build system.
>
> * build/generator/build_locale.ezt: Remove the strip-po-charset.py
>   invocation.
>
> * build/generator/gen_win.py:
>   (POFile.__init__): don't store the .spo file name.
>
> * build/strip-po-charset.py: Delete.
>
> * subversion/libsvn_subr/nls.c:
>   (svn_nls_init): Move the bind_textdomain_codeset call out of the
>     #ifdef WIN32 block, so it's executed on Windows, as well.
>
> * subversion/svn_private_config.hw: Indicate that
>   bind_textdomain_codeset is available if NLS is enabled.
> ]]]
Received on 2012-08-15 01:06:52 CEST

This is an archived mail posted to the Subversion Dev mailing list.