
Re: The Data Sanitization Plan

From: Greg Stein <gstein_at_lyra.org>
Date: 2002-06-25 22:31:09 CEST

On Tue, Jun 25, 2002 at 02:52:31PM -0500, Ben Collins-Sussman wrote:
> 1. all paths passed to libsvn_* are assumed to be
> - using '/' separators
> - using canonicalized case
> - in UTF-8


> 2. all URLs passed to libsvn_* are assumed to be
> - properly URL-escaped
> - in UTF-8


> 3. all log messages passed to libsvn_* are assumed to be
> - in UTF-8


> Our thoughts are to create three utility routines in libsvn_subr,
> something like:
> svn_sanitize_path()
> svn_sanitize_url()
> svn_sanitize_logmsg()

The last one is not needed. We can simply use a function that converts from
the locale charset to UTF-8. I don't think we need to call out a special
function for that.

Note that Marcus has a big-ass patch outstanding. Some/all of that patch
needs to be applied to the codebase. In particular, there are some utility
functions for converting to UTF-8.

Also: it has a patch to APR's xlate functions which should be applied to

> * Karl says apr_file_path_merge() will give us the "canonical" case
> of path components. That sounds good. Except the name of that
> func is really weird. :-)

It comes from the semantic, "given a canonicalized root, merge <this>
fragment into the path."

> * Rumor has it that there exist various apr iconv routines to
> convert data to UTF-8. We've already decided that each

Look at Marcus' patch.

> sanitization func is going to take a locale argument; in the case
> of our cmdline client, this information can be gathered either by
> getting the system locale, or by using a particular locale from
> the commandline (--locale ?)

Yes, take a source character set (string). The client can use
APR_LOCALE_CHARSET from apr_xlate.h if they want to use the system locale.
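To make the "source character set (string)" idea concrete, here is a minimal sketch of such a conversion. It is an assumption-laden illustration, not the code in Marcus' patch: it uses POSIX iconv() instead of apr_xlate so that it builds without APR, allocates with malloc() instead of a pool, and the name to_utf8() is hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an existing svn or APR function): convert the
   NUL-terminated string SRC from the charset named FROM_CHARSET to UTF-8.
   Returns a malloc()ed string the caller must free(), or NULL on error. */
char *to_utf8(const char *src, const char *from_charset)
{
    iconv_t cd;
    size_t in_left, out_size, out_left;
    char *in, *out, *result;

    assert(src != NULL && from_charset != NULL);

    cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;                 /* unknown or unsupported charset */

    in_left = strlen(src);
    /* Generous guess: four output bytes per input byte, plus the NUL.
       If that is ever too small, iconv() fails and we return NULL. */
    out_size = in_left * 4 + 1;
    result = malloc(out_size);
    if (result == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in = (char *)src;                /* iconv() wants char **, input is not modified */
    out = result;
    out_left = out_size - 1;         /* keep room for the terminating NUL */

    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1) {
        free(result);
        iconv_close(cd);
        return NULL;                 /* invalid input sequence or buffer too small */
    }

    *out = '\0';
    iconv_close(cd);
    return result;
}
```

A client would pass whatever charset name it got from the system locale or from the command line, e.g. to_utf8(logmsg, "ISO-8859-1"), before handing the result to libsvn_*.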

> * We can easily write a routine to convert '\' into '/'

Already done (by Branko). See svn_path_internal_style(). It modifies a
stringbuf in place, so we may want to consider changing that to be:

  const char * svn_path_internal_style(const char *path, apr_pool_t *pool);
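A rough sketch of what the non-destructive variant could look like. This is only an illustration of the idea: the function name path_internal_style() is hypothetical, and it uses malloc() in place of the apr_pool_t argument so that it compiles without APR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed signature: return a newly
   allocated copy of PATH with every '\\' replaced by '/', leaving the
   caller's string untouched.  Uses malloc() rather than a pool, so the
   caller must free() the result.  Returns NULL if allocation fails. */
char *path_internal_style(const char *path)
{
    size_t len, i;
    char *copy;

    assert(path != NULL);

    len = strlen(path);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    /* i <= len so that the terminating NUL is copied as well */
    for (i = 0; i <= len; i++)
        copy[i] = (path[i] == '\\') ? '/' : path[i];

    return copy;
}
```

With a pool-based version the free() would go away, since the result would live as long as the pool does.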

> * The only real question is whether (and how) our cmdline client
> should "automatically escape" URLs. Is this too dangerous? Is
> there some reasonable heuristic to use? It would stink if I had
> to type this:
> svn diff -r3:4 http://path/to/my%20file

I'm almost positive that you will have to type it that way. Consider these
two URLs:


The URLs have entirely different meanings. If we escape the first one, then
we change the meaning of the URL. The ambiguity is then, "which of the two
meanings was intended?"

However, if we carefully read through RFC 2396, we may find that we can
reliably escape the URLs. For example, in the above set of URLs, the URL
with the trailing "?" is not a "legal" URL for our purposes. Queries,
parameters, and anchors are not allowed in our URLs for repositories. Thus,
we could "assume" that a "?" present in the URL is intended to be escaped,
rather than forming a "query" within the URL.
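As a sketch of what that "assume it should be escaped" rule might look like: the function below escapes every byte outside a safe set. The safe set is my assumption (RFC 2396 unreserved characters, the extra characters allowed in a path segment, and '/'), and the name escape_repos_url() is hypothetical. Note that '%' is passed through untouched, i.e. it assumes any '%' already starts an escape, which is exactly the kind of guess that could turn out to be wrong.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: escape a repository URL under the assumption that
   queries, parameters and anchors never occur, so '?', ';' and '#' are
   treated as literal characters that need %XX escaping.
   Returns a malloc()ed string the caller must free(), or NULL. */
char *escape_repos_url(const char *url)
{
    /* Assumed safe set: RFC 2396 "mark" characters, the extra characters
       permitted in a path segment, '/' as separator, and '%' (assumed to
       begin an existing escape). */
    static const char safe[] = "-_.!~*'()" ":@&=+$," "/%";
    static const char hex[] = "0123456789ABCDEF";
    char *out, *p;

    assert(url != NULL);

    out = malloc(strlen(url) * 3 + 1);   /* worst case: every byte becomes %XX */
    if (out == NULL)
        return NULL;

    p = out;
    for (; *url != '\0'; url++) {
        unsigned char c = (unsigned char)*url;

        if ((c < 128 && isalnum(c)) || strchr(safe, c) != NULL) {
            *p++ = (char)c;              /* safe: copy as-is */
        } else {
            *p++ = '%';                  /* unsafe: emit %XX */
            *p++ = hex[c >> 4];
            *p++ = hex[c & 0x0F];
        }
    }
    *p = '\0';
    return out;
}
```

Because bytes above 127 are always escaped, UTF-8 input comes out as a run of %XX triples, which is one of the things the review would have to confirm is what we want.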

But this is going to require a review of which characters are in and which
are out. And then to consider this stuff in reference to UTF-8 encoding...

Eek :-) For now, I would recommend /not/ escaping (because it could damage
the URL) unless/until we have a firm document on the viability of escaping
the inputs.


Greg Stein, http://www.lyra.org/
Received on Tue Jun 25 22:28:48 2002

This is an archived mail posted to the Subversion Dev mailing list.