Re: [RFC] Canonical Paths

From: Marcus Comstedt <marcus_at_mc.pp.se>
Date: 2002-08-29 20:07:16 CEST

Greg Stein <gstein@lyra.org> writes:

> As a URL, it would be something like:
>
> http://svn.example.com/repos/subdir/gaz%2fonk
>
> i.e use URL escaping to avoid the '/' interpretation.
>
> However, within our libraries... what to do? Beats the crap outta me. We
> would need to use/invent an escaping mechanism. Personally, I would simply
> say that the character is not allowed [on entry to our libs], except as a
> path separator.

Hm, I think I may actually have a solution to this problem. Slightly
hackish, but it should be workable.

We're using the UTF-8 representation of the paths. In UTF-8, ASCII
characters (such as '/') are encoded as themselves, a single octet
with the MSB cleared. There also exist multibyte sequences of length
2-6, in which every octet has the MSB set, making them easy to
distinguish from ASCII characters.

The shortest multibyte sequence (of length 2) has the following
structure:

110xxxxx 10xxxxxx

A header of N (2<=N<=6) ones followed by a zero marks the first octet
of a multibyte sequence of length N, and a header of 10 marks a
continuation octet. The x bits then hold the actual character code.
So a two-octet sequence has 5+6 = 11 payload bits and can represent
Unicode characters in the range [0 .. 2^11-1], or [0 .. 2047].
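
To make the header rule concrete, here is a minimal sketch in C. The
function names utf8_octet_kind() and utf8_decode2() are made up for
this illustration and are not part of any Subversion API. It assumes
the original 1-6 octet UTF-8 scheme described above, where the first
octet of a two-octet sequence carries five payload bits and the
continuation octet carries six.

```c
#include <stddef.h>

/* Classify one octet. Returns 1 for ASCII (0xxxxxxx), 0 for a
   continuation octet (10xxxxxx), N in 2..6 for the first octet of an
   N-octet sequence, and -1 for 0xfe/0xff, which never start one. */
static int utf8_octet_kind(unsigned char c)
{
    int n = 0;

    if ((c & 0x80) == 0)
        return 1;                      /* MSB cleared: plain ASCII */
    while (n < 8 && (c & (0x80 >> n)))
        n++;                           /* count the leading one bits */
    if (n == 1)
        return 0;                      /* header 10: continuation */
    return (n <= 6) ? n : -1;
}

/* Combine the payload bits of a two-octet sequence 110xxxxx 10xxxxxx
   into the character code they hold (no validity checks). */
static unsigned int utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned int)(b0 & 0x1f) << 6) | (unsigned int)(b1 & 0x3f);
}
```

For instance, the octets c3 a9 decode to 0xe9, and the largest
two-octet value, df bf, decodes to 2047.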

Here comes the trick. Notice that this range includes the range
[0 .. 127], the ASCII characters. (In fact all UTF-8 multibyte
escapes have a range which includes ASCII, since they all start at 0.)
That is, although an ASCII character such as '/' is normally encoded
as its ASCII representation (00101111), we could instead encode it as
11000000 10101111, which would then be a kind of _escaped_ '/',
distinguishable from (in fact completely unrelated to if you just look
at single octets) a normal '/' used as path separator. In the same
way, we could encode the problematic NUL character as 11000000
10000000. In fact, this is exactly what Java does to NUL characters
when storing them in UTF-8 strings, so there exists a precedent of
using a scheme like this.
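
As a rough sketch of the escaped form, the following C fragment
produces and recognizes such octet pairs. The names ESC_LEAD,
esc_encode() and esc_decode() are invented for the example. It only
handles characters below 0x40, since those are the ones that fit
behind a 0xc0 lead octet; that covers both '/' (0x2f) and NUL.

```c
#include <stddef.h>

#define ESC_LEAD 0xc0   /* 11000000: two-octet header, payload bits zero */

/* Write the escaped two-octet form of c to out[0] and out[1].
   Assumes c < 0x40 so that it fits in the continuation octet. */
static void esc_encode(unsigned char c, unsigned char out[2])
{
    out[0] = ESC_LEAD;
    out[1] = (unsigned char)(0x80 | (c & 0x3f));
}

/* If p points at an escaped pair, store the original character in *c
   and return 1; otherwise return 0 and leave *c untouched. p[1] is
   only read when p[0] is the lead octet, so this is safe at the last
   octet of a NUL-terminated string. */
static int esc_decode(const unsigned char *p, unsigned char *c)
{
    if (p[0] != ESC_LEAD || (p[1] & 0xc0) != 0x80)
        return 0;
    *c = (unsigned char)(p[1] & 0x3f);
    return 1;
}
```

Encoding '/' this way yields c0 af and encoding NUL yields c0 80.
Neither octet of the pair equals 0x2f or 0x00, so a plain search for
'/' or for the string terminator does not match the escaped form.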

Now, the beauty of it all is that the bulk of the code needs no change
at all. Code that separates paths by looking for the octet 0x2f can
continue to do so. And we're still using normal C strings, so
sprintf:ing etc is no problem. The things that _do_ need to be done to
make it all work are simply the following:

* In the URL escape code, when a 0xc0 octet is encountered, don't
  escape it as %c0. Instead, take the next octet & 0x3f, and escape
  _that_ as %xx. (This would become %2f or %00 in the cases at hand,
  but any character in [0 .. 63] automatically has its escaped
  variant handled by the same code at no extra cost.)

  Note that a "real" two-byte UTF-8 sequence will never begin with
  0xc0, since the character would have code 0x80 or higher, giving the
  first octet the actual range [0xc2 .. 0xdf].

* On a system where '/' is a nonspecial filename character (such as
  MacOS Classic), canonicalization takes place as follows:

  1) UTF-8 encode the path
  2) Replace all '/' with the escaped variant
  3) Replace the path separator (':') with "real" '/'s.

* Decanonicalization has to be the reverse of canonicalization, as
  usual, so on Mac it should do the above steps backwards.
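
The steps in the list above could be sketched in C roughly as follows.
All three function names (mac_canonicalize, mac_decanonicalize,
url_escape) are hypothetical and do not correspond to existing
Subversion functions. The sketch assumes the input is already UTF-8
(step 1), ignores the special meaning of leading colons in Mac paths,
and uses a deliberately small set of unescaped URL characters.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Steps 2 and 3: literal '/' becomes the pair c0 af, ':' becomes '/'.
   Returns a malloc'd string, or NULL if allocation fails. */
static char *mac_canonicalize(const char *mac)
{
    char *out = malloc(2 * strlen(mac) + 1);  /* worst case: all '/' */
    char *q = out;

    if (out == NULL)
        return NULL;
    for (; *mac; mac++) {
        if (*mac == '/') {
            *q++ = (char)0xc0;
            *q++ = (char)0xaf;
        } else {
            *q++ = (*mac == ':') ? '/' : *mac;
        }
    }
    *q = '\0';
    return out;
}

/* The same steps backwards: c0 af becomes '/', '/' becomes ':'. */
static char *mac_decanonicalize(const char *canon)
{
    const unsigned char *p = (const unsigned char *)canon;
    char *out = malloc(strlen(canon) + 1);
    char *q = out;

    if (out == NULL)
        return NULL;
    while (*p) {
        if (p[0] == 0xc0 && p[1] == 0xaf) {
            *q++ = '/';
            p += 2;
        } else {
            *q++ = (*p == '/') ? ':' : (char)*p;
            p++;
        }
    }
    *q = '\0';
    return out;
}

/* URL-escape a canonical path. A c0 xx pair is written as %HH of
   (xx & 0x3f); '/', ASCII letters and digits, and "-._~" pass
   through; every other octet is written as %HH. */
static char *url_escape(const char *canon)
{
    const unsigned char *p = (const unsigned char *)canon;
    char *out = malloc(3 * strlen(canon) + 1);  /* at most 3 per octet */
    char *q = out;

    if (out == NULL)
        return NULL;
    while (*p) {
        if (p[0] == 0xc0 && (p[1] & 0xc0) == 0x80) {
            q += sprintf(q, "%%%02x", (unsigned int)(p[1] & 0x3f));
            p += 2;
        } else if (*p == '/' || (*p < 0x80 && isalnum(*p))
                   || strchr("-._~", *p) != NULL) {
            *q++ = (char)*p++;
        } else {
            q += sprintf(q, "%%%02x", (unsigned int)*p++);
        }
    }
    *q = '\0';
    return out;
}
```

With these definitions, the Mac path "Disk:sub:gaz/onk" canonicalizes
to "Disk/sub/gaz" followed by c0 af and "onk", which url_escape()
turns into "Disk/sub/gaz%2fonk". Passing the canonical form to
mac_decanonicalize() gives back the original Mac path. The caller
frees each returned string.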

And that's it. Presto, instant '/' in filenames support (and NULs as
well, if we make a special svn_canonicalize_path() that takes Pascal
strings...) I must say I have impressed myself with this solution at
least. ;-)

  // Marcus

Received on Thu Aug 29 20:08:19 2002

This is an archived mail posted to the Subversion Dev mailing list.