On Wed, Oct 18, 2017 at 10:11 PM Daniel Shahaf <d.s_at_daniel.shahaf.name>
wrote:
> Troy Curtis Jr wrote on Wed, Oct 18, 2017 at 03:49:57 +0000:
>
> > > > 3. Is the assumption of utf8 encoding sufficiently reasonable?
> > >
> > > In Subversion, we assume that argv and stdout are encoded in APR's
> > > "system encoding", but basically everything else, including nearly
> > > every single string parameter to every public API, is required to be
> > > in UTF-8. svn_cmdline.h contains some of the exceptions.
> > >
> > > I'm not sure whether that information answers your question, though.
> > > In what concrete cases do the bindings have to convert between str and
> > > bytes? What are the compatibility considerations for user code
> > > (consumers of the swig-py2 bindings that want to upgrade to py3)?
> > >
> > >
> > It answers it well enough. The implicit assumption of string handling in
> > py2 is one of the things py3 set out to address. Since currently py2 are
> > really making assumptions already, going from 2->3 using utf8 should be
> > reasonable IMHO.
>
> I don't follow how you reach the conclusion that using utf8 should be
> reasonable. Could you trace your steps?
>
> Our goal, presumably, is for py3's "import svn" to be as much of a
> drop-in replacement to py2's "import svn" as possible. In what cases
> does py2 code pass py2 str objects to libsvn_swig_py? What happens to
> those objects in py3? The whole question of default encoding only
> matters if what in py2 was a str object is under py3 a bytes object.
> Maybe we should require py3 user code to pass a str object, mooting the
> question entirely?
>
>
str objects are what would be expected to come into and out of the swig API
on both py2 and py3, which is exactly what I am striving to maintain. It
is just that str means slightly different things in the two versions. In
py2, it was perfectly reasonable to expect a str object to come out of a
raw socket read or a file on disk. You could then treat it as "bytes" by
only doing things with it that you would do with bytes, such as passing it
to modules like 'struct' or one of the compression libraries. Or, if the
developer expected those sources to provide printable character data, they
would treat it as a "string" and do things like split(), strip(), and
formatting operations. If it turned out that there were bytes inside that
were not actually printable, you'd get a decode error as soon as one of
those operations implicitly converted the data to unicode.

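
A rough py3 illustration of those two ways of using the same data. The
payload layout is invented purely for the example and has nothing to do
with the bindings themselves:

```python
import struct

# Hypothetical payload: a 2-byte big-endian number followed by some text.
payload = struct.pack(">H", 513) + b"key = value\n"

# Treating the data as "bytes": hand it to struct unchanged.
(number,) = struct.unpack(">H", payload[:2])

# Treating the data as a "string": py3 requires an explicit decode first.
text = payload[2:].decode("utf-8")
key, value = [part.strip() for part in text.split("=")]
```

In py2 both halves would have worked directly on the same str object; in
py3 the second half only works after the decode() call.
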
With py3, by contrast, a str object is a unicode object and thus printable.
The bytes->unicode conversion happens explicitly in a decode() operation.
But since that is now an explicit action, a specific encoding must be
chosen instead of relying on some default, which in py2 would likely just
be ASCII. If we use UTF8, it will work everywhere ASCII would have worked
in py2, and will also work in additional places where a decode error would
have popped up.

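
To make that last claim concrete, here is a small sketch in plain py3 (the
path names are made up for the example):

```python
ascii_bytes = b"trunk/README"
utf8_bytes = "trunk/r\u00e9sum\u00e9.txt".encode("utf-8")

# Pure ASCII input decodes to the same str under both codecs.
same_result = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Non-ASCII input is rejected by the ASCII codec...
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...but decodes cleanly as UTF-8.
decoded = utf8_bytes.decode("utf-8")
```

Bytes that are not valid UTF-8 would of course still raise
UnicodeDecodeError, so this widens the set of inputs that work rather than
making decoding infallible.
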
The places where we'd expect this to actually need to happen are less on
the Python API side (since it should be str in and out) and more in the
bindings' usage of the underlying Subversion API. Every 'char *' will be
treated as 'bytes' by py3 and needs to be decoded to str (unicode) before
it is handed to the Python API, so that the user gets the expected str
objects.

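
Expressed in Python terms, the conversion I have in mind looks roughly like
the sketch below. The helper names to_native_str() and to_c_bytes() are
hypothetical and do not exist in the bindings; the real conversion would
presumably live in the C glue code rather than in Python:

```python
def to_native_str(value, encoding="utf-8"):
    """Hypothetical helper: convert a 'char *' result (bytes in py3)
    into the str object the Python caller expects."""
    if value is None:
        return None  # assumption: a NULL pointer is surfaced as None
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already a str, pass it through untouched


def to_c_bytes(value, encoding="utf-8"):
    """Hypothetical helper: convert a caller's str into the bytes that
    would be handed to a 'char *' parameter."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value  # already bytes (or None), pass it through untouched
```

With something like this at the boundary, user code sees str in and str
out, and the choice of encoding lives in exactly one place.
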
So in general I think utf8 is the way to go. However, there are a few
places I plan on looking at carefully:

1. Anywhere raw data is "streamed in": I'm not sure whether this exists in
the API, but I am assuming there is some way to manually feed data into the
library to be used as the content of a commit.

2. Filesystem paths: In general, Python has separate functions for dealing
with "the filesystem encoding", whichever it happens to be [1]. So perhaps
one concrete question is: are paths provided to the API and elsewhere
within Subversion assumed to be UTF8? If so, then the plan to use utf8
works perfectly. If there are exceptions, then I'll need to be extra
careful in the conversion.

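
For reference, here is how those two cases look from the Python side. This
is only a sketch using the standard library, with made-up sample data; it
does not call into the bindings at all:

```python
import io
import os

# Item 1: raw content should stay bytes end to end, with no decode applied.
stream = io.BytesIO(b"\x89PNG\r\n\x1a\n")
chunk = stream.read()

# Item 2: the same path bytes, decoded in two different ways.
path_bytes = "trunk/r\u00e9sum\u00e9.txt".encode("utf-8")
as_utf8 = path_bytes.decode("utf-8")  # what a fixed utf8 policy would do
as_fs = os.fsdecode(path_bytes)       # what "the filesystem encoding" gives

# os.fsencode() reverses os.fsdecode(), so the original bytes come back.
round_trip = os.fsencode(as_fs)
```

as_utf8 and as_fs only agree when the filesystem encoding happens to be
UTF-8, which is exactly the case I need to check.
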
Troy

[1] https://docs.python.org/3/c-api/unicode.html#file-system-encoding

Received on 2017-10-19 06:06:14 CEST