Troy Curtis Jr wrote on Thu, Oct 19, 2017 at 04:05:57 +0000:
> The places where we'd expect this to actually need to happen are less on
> the Python API side (since it should be str in and out), but more on the
> binding's usage of the underlying Subversion API. All the 'char *' will be
> considered 'bytes' by py3, and need to be decoded to str (unicode), before
> it is given to the Python API so that the user gets the expected str
> objects.
Ah! So you're thinking of data going from libsvn_* to user Python code,
not the other way around. Now it makes sense. (I was thinking of user
data passed in to the bindings.)
In this case, I think the rule is: if it's NUL terminated, then it's in
UTF-8 (except for some isolated exceptions such as svn_cmdline_*); if
it's a counted-length string, then it's bytes. For example, a property
hash is an apr_hash_t* mapping const char* to const svn_string_t*,
corresponding to the data model where property names are UTF-8 strings
and property values are opaque binary blobs.
This is supposed to be explicitly documented, by the way. For example,
svn_path.h states
.
* All incoming and outgoing paths are non-NULL and in UTF-8, unless
* otherwise documented.
.
but, apparently, the newer svn_dirent_uri.h doesn't have such a statement.
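
To make the NUL-terminated versus counted-length rule concrete, here is a
minimal sketch in plain Python. It does not call the real bindings:
props_to_python() and the sample dict are hypothetical, and I am assuming
the raw hash reaches Python as bytes keys and bytes values.

```python
def props_to_python(raw_props):
    """Convert a raw property hash into what a py3 user would expect.

    raw_props is assumed (hypothetically) to map bytes names, taken from
    NUL-terminated const char*, to bytes values, taken from
    counted-length svn_string_t.
    """
    result = {}
    for name, value in raw_props.items():
        # Property names are UTF-8 text: decode them to str.
        # Property values are opaque binary blobs: keep them as bytes.
        result[name.decode("utf-8")] = bytes(value)
    return result


raw = {
    b"svn:mime-type": b"image/png",
    b"caf\xc3\xa9:thumb": b"\x89PNG\r\n\x1a\n",  # value is not valid UTF-8
}
props = props_to_python(raw)
```

The point is only that the decode step applies to the names and never to
the values.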
> So in general I think going with utf8 is the way to go. However, there are a
> few places I plan on looking carefully at:
> 1. Anywhere raw data is "streamed in": I'm not sure if this exists in the
> API or not, I am assuming it does somewhere to manually feed data into the
> library to be used as content of a commit.
Functions that add data, at various layers, are:
- svn_client_add5()
- svn_delta_editor_t
- svn_repos_load_fs6()
- svn_fs_make_file()
Paths in the repository are always in UTF-8. File contents are treated
as opaque binary blobs and are generally presented as streams (either
svn_stream_t or an svndiff/txdelta stream).
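
As a rough illustration of that split, the sketch below encodes a
repository path as UTF-8 and passes file contents through untouched, in
chunks. repos_path_to_bytes() and iter_chunks() are hypothetical helpers,
and io.BytesIO merely stands in for a real stream; none of this is the
bindings' actual API.

```python
import io


def repos_path_to_bytes(path):
    # Repository paths are text, so a str is encoded as UTF-8
    # before it would be handed to the C layer.
    return path.encode("utf-8")


def iter_chunks(stream, chunk_size=4096):
    # File contents are opaque: read bytes and yield them unchanged,
    # with no decoding at any point.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


path_arg = repos_path_to_bytes("/trunk/docs/r\u00e9sum\u00e9.txt")
contents = io.BytesIO(b"\xff\xfe\x00binary, not text")
chunks = list(iter_chunks(contents, chunk_size=8))
```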
> 2. Filesystem paths: The general case is there are separate functions for
> dealing with "the filesystem encoding" whichever it happens to be [1]. So
> perhaps one concrete question is are paths provided to the API and
> elsewhere within Subversion assumed to be UTF8?
Subversion's functions generally take UTF-8, but there are exceptions,
such as *_canonicalize() and *_internal_style(). They're generally
implemented in terms of apr_* functions which expect a different
encoding (see e.g. svn_io_check_file()).
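
On the Python side, one way to bridge the two encodings might look like
the sketch below. local_path_to_svn_utf8() is a hypothetical helper, not
something the bindings provide; it assumes the caller holds a path in the
local filesystem encoding and needs UTF-8 bytes for a function that
expects UTF-8.

```python
import os


def local_path_to_svn_utf8(native_path):
    # os.fsdecode() accepts str or bytes and returns str, decoding bytes
    # with the filesystem encoding.
    text = os.fsdecode(native_path)
    # Re-encode as UTF-8. This raises UnicodeEncodeError if the original
    # bytes could not be decoded cleanly, which seems preferable to
    # passing a mangled path along silently.
    return text.encode("utf-8")


utf8_path = local_path_to_svn_utf8(b"wc/trunk/file.txt")
```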
Cheers,
Daniel
Received on 2017-10-19 15:57:43 CEST