[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

URL encoding/decoding (was: svn ls shows escaped paths)

From: Greg Stein <gstein_at_lyra.org>
Date: 2003-06-03 05:23:10 CEST

On Mon, Jun 02, 2003 at 08:11:33AM -0500, cmpilato@collab.net wrote:
> I think Greg H. knows what you're talking about. What he's saying,
> though, is that ra_svn and ra_local don't pass around URIs internally
> -- they don't need to, because they're not talking to a web server
> that so requires. So for them to implement the __get_dir() interface
> that you proposed means that they have to actually URI-*en*code their
> results before handing them back to the user (who will then
> immediately URI-*de*code them.

Yah, but he talked about how far out to push the encoding, which befoozled
me. Whatever...

> Only ra_dav plays with URIs internally.

Right. But our libraries' interfaces, including the RA interface all speak
in terms of

  1) canonicalized filesystem paths
  2) relative (canonical) paths
  3) URLs

I don't even think the RA interface deals with (1).

> So, what was that about the ra-layer defining the interfaces? :-) I'm
> kidding, of course, because the destructiveness of a URI-decode
> operation is important to note.

It isn't that. Look at all of our interfaces. You just revamped
svn_client_mkdir(). It takes an array of paths or URLs. We *do* use URLs.
Further, nothing above the RA layer knows anything about the URLs except
for the hierarchical /-based modeling of resources (e.g. collections of
resources, and /-separated, and whatnot). Those upper layers don't know if
the RA layer will be talking DAV, C-based APIs, or the custom svn protocol.
But they *do* know that they are passing a URL. And URLs have specific
restrictions imposed upon them by RFC 2396.

IOW, this isn't about ra_dav, but that we are nominally passing around URLs,
but we are sloppy about them. Most of the time, we don't bother with the
notion of URL-encoding or -decoding. That's because we've patched it all up
inside of ra_dav to compensate for the higher levels mucking it up :-)

> That said, where *do* we draw the line on these encodings? For
> example, we drive editors all the time against the RA library, and we
> don't URI-encode the input to those editors -- ra_dav does that
> internally. The ra->check_path() interface takes non-URI-encoded
> strings. The reporter interface -- the same. The MERGE response
> bump_handler() code does URI-decoding before calling a client
> callback, yet other callbacks seem to use relative URLs.

Yup. My first reaction to your checkin was, "woah. that's not right." But
after thinking on it some more, I arrived at the same point as you:

> My points: a) There are no well-defined rules about the URI "lines" in
> the RA interface. b) As a result, there's world of inconsistency going
> on in there, and c) perhaps we should have an intelligent discussion
> about this situation.

For (b) and (c), yes. For (a), I don't entirely agree: we have parameters
named URL (or whatever, but they are intended to pass actual URLs), but we
don't always treat them properly.

> > The change to __get_dir *decodes* a URI and returns that from a library.
> > That doesn't make sense. That decoding can also lose information, which
> > means that the code which is calling RA cannot turn the URI around and use
> > it with the SVN libraries.
> ...and I have all too recent memories of the pain that this kind of
> early decoding can cause...

Heh. No doubt.

[ for those unfamiliar with the background here: Apache decodes the incoming
  URL, then passes that to its various module. unfortunately, sometimes the
  modules need an *encoded* URL, but Apache has already destroyed it. thus,
  the module can only make a best guess. ]


So to start the discussion...

Note that editors are defined to pass paths that are relative to some root.
I don't think we necessarily have to do anything to the paths, but if
somebody joins those paths to a URL, then some action will need to be taken.
Thankfully, it isn't too hard since we don't allow '/' in our path segments.
We can just encode all other must-be-escaped characters (and ignore the
slashes), and slam the encoded value onto the URL base.

The reporter interface is similar. All of the paths are specified as
relative to the base URL.

Note that ra_local and ra_svn have no requirements for passing a composed
URL over the wire, like ra_dav does. They are perfectly free to pass the
base URL and then the relative path in SVN-canonical form. Thus, they may be
able to avoid some URL-encoding on the relative paths.

Much of the RA interface seems to be path-based, relative to the base URL
used to open the RA session.

Hmm. I wonder, then, if we do have a problem with the dir entries stuff. If
we classify those as relative paths in the SVN-canon style, then decoding
*is* just fine.

And the decoding isn't actually destructive in this case. The thing that
really kills you with decoding is when '/', '?', and '&' appears in the URL.
But SVN-canon doesn't allow '/' within a segment, and '?' and '&' have no
special meaning in an SVN path. Thus, we can always safely encode them
without fear of changing the meaning of the URL.

Well... so it would seem the question comes down to "where do we join an SVN
(relative) path to something we know is a URL?" The primary places to look
are the callers of RA->open, to see where they got the repos_URL. Whether it
came straight from the the client input, or whether they joined a base URL
and an unencoded path. I'd be leery of the diff stuff when it goes to open
new RA sessions (does it still do that?). A quick look also shows us using
entry->url a bit, so any time URLs are joined there could be an issue.

Hunh. Mike pointed out that ra_local decodes the URL pretty early. I just
looked at ra_svn, and it ships the encoded URL over to the server, which
then decodes it [to look for the repos].

So I'm not sure that we've got a huge problem here. But I'm not entirely
confident that we've joined things properly everywhere.



Greg Stein, http://www.lyra.org/
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org
Received on Tue Jun 3 05:20:15 2003

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.