
utf8proc on Win32

From: Paul Burba <ptburba_at_gmail.com>
Date: Tue, 25 Mar 2008 12:11:44 -0400

Yesterday in IRC:

<ehu> anybody up for solving the unicode (de)composition problem in 1.6?
<stsp> aka fixing the unicode spec? :)
<ehu> heh. well, I think it'll be easier to adapt Subversion.
<markphip> ehu: It seems like a good first step would be to
definitively state what the impact of integrating ICU would be
<ehu> markphip: that seems to be where we left off last time, yes.
However, I felt that we ended up with that discussion because not
everybody was convinced there's actually a problem.
<ehu> was that your idea too?
<markphip> the problem seems clear, the only question I thought was
whether the problem was bad enough to justify the only seeming
<ehu> ah. ok.
<danderson> isn't ICU all C++?
<danderson> (random interjection, I didn't follow the past discussions)
<ehu> danderson: no idea, but it has C, C++ and Java APIs
<ehu> (at least)
<markphip> http://www.icu-project.org/apiref/icu4c/
<danderson> ah, okay. Then all is well.
<markphip> it is a big project that covers a lot of areas, but it is
designed to be pruned down
<markphip> the question is how big would what we need be
<markphip> plus all the issues that come with adding in a new dep
<ehu> right. I was actually hoping that we could tap into another
project (like libiconv through glibc) on some unices.
<ehu> but it seems libiconv doesn't come close.
<ehu> how about utf8proc?
<ehu> it's targeted at normalizing UTF-8
<markphip> ehu: the license of utf8proc looks OK
<markphip> I do not see how it can be so small though, and still be complete
<ehu> markphip: there's a 1.1MB C file containing only mapping tables.
<markphip> Does it only support Unicode to/from UTF8? Would that be a problem?
<ehu> it only supports de/encode of UTF-8 which is exactly what we
want. No translation whatsoever to other codepages.
<markphip> so we could just add this to our existing routines as a way
to normalize the UTF-8?
<ehu> (we already have the translation to other CPs)
<ehu> exactly
<markphip> that sounds worth looking at
<ehu> yes, because with ICU, you'd have to find out how to strip it down.
<markphip> I figured with ICU we would wind up using it to also
replace other things we do, that it can do better
<ehu> so did I (replacing apr_iconv), but since we have started using
Windows translation functions anyway, we don't really depend on
apr_iconv anyway.
<ehu> the resulting .so is < 500kB
<ehu> (wow!)
<markphip> ehu: I think it is because it is limited in scope
<markphip> which in our case, might be good
<jackr> does utf8proc handle composed/decomposed, that's what I want to know
<markphip> It is not clear (to me) from looking at the doc strings if
you need to know how your Unicode string is currently composed
* ehu was about to say that
<jackr> but I don't think composition is so hard that it would show up
in the size figures
<ehu> jackr: you mean both composed and decomposed in one string?
<ehu> jackr: which size figures?
<ehu> sorry, I'm doing 2 things here. I'm not getting the last part of
the conversation
<ehu> jackr: repository size? memory footprint?
<ehu> time spent in translation in profiling?
<markphip> ehu: I think he meant the size of the code and library
<markphip> I believe most of the size of ICU is in the data tables to
convert from every possible code page into Unicode
<ehu> ok. right. well, in that case: I think that's correct. MT could
have a bigger code effect
<ehu> this might just bring us on the same page.
<ehu> (would it be possible to compile it on Windows?)
<ehu> pburba: you have a build system running on Win32, right?
<pburba> yes
<markphip> it should be, just two .c files and a .h
<ehu> would you have time for a little experiment somewhere in the next 2 weeks?
<markphip> hack the build so it looks like they are part of libsvn_subr
<pburba> ehu: Sure
<ehu> could you try to see if what markphip says would work on Windows?
<markphip> ehu essentially wants to know if utf8proc will build on Win32

I was able to build utf8proc on Windows using the Visual Studio IDE
(only the core, not the Ruby or PostgreSQL parts). To get it to
build, the following minor tweaks were required:

A) Typedef ssize_t as an int in utf8proc.h, since it isn't defined in
Visual Studio's C89-compliant headers.

B) Create "stdbool.h" and "inttypes.h", neither of which ships with
Visual Studio (both are C99 standard headers).
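
A minimal sketch of what tweaks A and B might look like when folded
into a single shim. The _MSC_VER cutoff and the hand-written typedefs
are assumptions for illustration, not the exact changes made for this
experiment; typedef'ing ssize_t as int follows tweak A, although a
pointer-sized type may be the safer choice on 64-bit builds.

```c
/* Hypothetical compatibility shim for compilers without C99 headers. */
#include <stddef.h>

#if defined(_MSC_VER) && _MSC_VER < 1800
/* Assumption: older Visual Studio lacks <stdbool.h> and <inttypes.h>,
   so supply the handful of names utf8proc relies on by hand. */
typedef signed __int8    int8_t;
typedef unsigned __int8  uint8_t;
typedef signed __int16   int16_t;
typedef unsigned __int16 uint16_t;
typedef signed __int32   int32_t;
typedef unsigned __int32 uint32_t;

typedef unsigned char bool;
#define true  1
#define false 0

/* Tweak A: ssize_t is a POSIX type that is not provided here. */
typedef int ssize_t;
#else
/* Everywhere else the standard headers already provide these. */
#include <stdbool.h>
#include <inttypes.h>
#include <sys/types.h>   /* ssize_t */
#endif
```

Keeping the workaround behind one preprocessor check means the
non-Windows build keeps using the system headers untouched.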

I was able to compile Subversion with calls to some utf8proc
functions, but didn't test anything explicitly beyond that.

I also tried to integrate utf8proc into the Windows build system in
the same way we handle other external targets like SASL and BDB. I
didn't have much luck with this; I'll need more time or the help
of someone who understands the Win32 build process better. But I
didn't want to do either until we know we are going to use utf8proc.
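
For anyone who wants a concrete picture of the (de)composition problem
from the IRC log, here is a small self-contained sketch. It does not
call utf8proc; it only shows that the same visible character can be
encoded two ways in UTF-8, so a byte-wise comparison treats the two
spellings as different names. The helper name same_bytes is
hypothetical.

```c
#include <string.h>

/* "e with acute" as one precomposed code point, U+00E9. */
static const char composed[] = "\xC3\xA9";

/* The same character as "e" followed by U+0301 COMBINING ACUTE ACCENT. */
static const char decomposed[] = "e\xCC\x81";

/* Returns nonzero when two NUL-terminated strings are byte-for-byte
   equal, which is all that can be checked without normalizing first. */
static int same_bytes(const char *a, const char *b)
{
  return strcmp(a, b) == 0;
}
```

Passing composed and decomposed to same_bytes reports a mismatch even
though both render as the same character. Normalizing both strings to
one form before comparing is the step a library like utf8proc would
be expected to supply.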


Received on 2008-03-25 17:11:56 CET

This is an archived mail posted to the Subversion Dev mailing list.