
Re: New entries file format.

From: <peter_at_pdavis.cx>
Date: 2003-01-28 06:23:39 CET

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

[Sorry that I forgot to make my original post to the list, so everybody,
read the context.]

On Monday 27 January 2003 20:16, you wrote:
 Wow. Let me say up front that I appreciate the input. I have a few
 questions before I'm fully convinced now that my 45 minutes of coding
 were not well-spent. More below:

 Peter Davis <peter@pdavis.cx> writes:
  * Conceptually, the name of an entry is an attribute of the entry.
  That's why I think it should remain an attribute of the entry
  element.

 Fine, but hardly a technical argument.

Well, my sanity has been technically ruined by many a poorly designed XML
format. :-) (Not that this would necessarily be poorly designed.)

  * Making the name CDATA essentially prohibits adding children to the
  entry element in the future. Who knows what those could be, but why
  eliminate that option? (Yes, technically there is no restriction on
  CDATA+children together, but it's ugly as hell, and whitespace from
  pretty-printing becomes a big issue.)

 This is a technical argument, and one that I considered, too. It just
 seemed so unlikely that we'd need to subcategorize what is essentially a
 dirent that I dismissed it. Still, worth noting, and we probably
 should avoid painting ourselves into a corner.

Maybe having nested entries could some day facilitate svn:externals or
having a single .svn dir for an entire wc, or maybe the prop file could be
eliminated by nesting appropriate XML in the entries themselves. Like I
said, who knows? -- nothing to argue here.

  * Would newlines in filenames still need to be encoded? They're legal
  under UNIX, and work fine in XML, although they ruin any
  pretty-printing. But what about carriage returns in filenames? XML's
  newline normalization still requires them to be escaped.

 I've asked the XML parser to preserve whitespace in the CDATA, so I
 get newlines as newlines.

Actually, I was talking about the CR in CRLF (or just CR by itself).
See http://www.w3.org/TR/REC-xml#sec-line-ends.
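
To make the CR point concrete, here is a minimal sketch of the line-end
normalization described in that section of the spec. The helper name
normalize_line_ends is made up for illustration and is not part of
Subversion or of any XML parser; it only models what a conforming parser
does to literal CR and CRLF before the application sees the text.

```c
#include <stddef.h>

/* Hypothetical helper: model XML 1.0 line-end normalization.
   Every CRLF pair and every lone CR in src becomes a single LF.
   dst must have room for at least strlen(src) + 1 bytes.
   Returns the number of bytes written, not counting the NUL. */
size_t normalize_line_ends(const char *src, char *dst)
{
  size_t n = 0;

  while (*src)
    {
      if (*src == '\r')
        {
          dst[n++] = '\n';      /* CR (alone or in CRLF) becomes LF */
          if (src[1] == '\n')
            src++;              /* swallow the LF half of a CRLF pair */
        }
      else
        dst[n++] = *src;
      src++;
    }
  dst[n] = '\0';
  return n;
}
```

A filename containing a literal CR would therefore come back from the
parser with an LF in its place, unless the CR was written as a character
reference.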

  * Code-wise, it is hardly more complicated to encode tabs as &#9; in
  addition to &amp; and &lt;.

 My concern is not about special-casing tabs, but about special-casing
 every character that's not preserved by parsing attribute values
 (versus those not preserved when parsing CDATA). Is it a parser bug
 that an attribute with a tab comes back with spaces, even though an
 attribute with two contiguous spaces is returned as two contiguous
 spaces? I'm not clear on what the XML spec states there (and got a
 little lost reading the docs at W3.org).

No bug, see http://www.w3.org/TR/REC-xml#AVNormalize. According to the
listed algorithm, each whitespace char gets transformed into a single
space. So a tab and a space will become two spaces, two newlines become
two spaces, and so do two real spaces. It's only a bug if the attribute's
type is not CDATA according to the DTD, but since the entries file has no
DTD (and since the filename wouldn't fit into any other type), obviously
that is not the case.
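
As a rough illustration of that algorithm, the sketch below models only
the whitespace step for an attribute with no declared type. The function
name normalize_attr_whitespace is hypothetical. It assumes the value
contains literal characters only: character references (which are exempt
from this step) and the earlier line-end normalization are not modeled.

```c
/* Hypothetical helper: model the whitespace step of XML attribute-value
   normalization for an untyped (CDATA) attribute.  Each literal tab,
   LF, or CR is replaced in place by a single space; a space maps to
   itself, so runs of spaces are kept as they are. */
void normalize_attr_whitespace(char *value)
{
  for (; *value; value++)
    if (*value == '\t' || *value == '\n' || *value == '\r')
      *value = ' ';
}
```

Under these assumptions a name written as "a<TAB>b" reads back as "a b",
which is why the original filename cannot be recovered from an unescaped
attribute.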

Now about this special casing: is this merely because libsvn_subr/xml.c
provides functions to escape CDATA but not attributes? From line 50 of
that file:

  while (q < end && *q != '&' && *q != '<' && *q != '>'
         && *q != '"' && *q != '\'')

If you ask me, there need to be two sets of functions:
(svn_)xml_escape_(*string*), and (svn_)xml_escape_attr_(*string*).

The only difference between CDATA and attribute escaping is the addition of
the four whitespace characters, #x20, #xD, #xA, and #x9, and the possible
addition of the single- and double-quote chars (by the way, why are quotes
being escaped for normal CDATA?):

  while (q < end && *q != '&' && *q != '<' && *q != '>'
         && *q != '"' && *q != '\'' && *q != '\n'
         && *q != '\r' && *q != '\t' && *q != ' ')
  // with appropriate additions to the switch()
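
For illustration only, a standalone version of such an attribute escaper
might look like the sketch below. It is not the libsvn_subr code: the
names attr_replacement and xml_escape_attr, the fixed output buffer, and
the 0/-1 return convention are all assumptions made for this example. It
also leaves plain spaces alone, on the assumption that a space survives
normalization of an untyped attribute unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the replacement text for a character
   that must not appear literally in an attribute value, or NULL if
   the character can be copied through as is. */
static const char *attr_replacement(char c)
{
  switch (c)
    {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '"':  return "&quot;";
    case '\'': return "&apos;";
    case '\t': return "&#9;";   /* character references survive */
    case '\n': return "&#10;";  /* attribute-value normalization */
    case '\r': return "&#13;";
    default:   return NULL;
    }
}

/* Hypothetical helper: escape the NUL-terminated string in into out,
   which holds outsize bytes.  Returns 0 on success, or -1 if out is
   too small (its contents are then unspecified). */
int xml_escape_attr(const char *in, char *out, size_t outsize)
{
  size_t used = 0;

  if (outsize == 0)
    return -1;

  for (; *in; in++)
    {
      const char *rep = attr_replacement(*in);
      size_t len = rep ? strlen(rep) : 1;

      if (used + len + 1 > outsize)   /* keep room for the NUL */
        return -1;
      if (rep)
        memcpy(out + used, rep, len);
      else
        out[used] = *in;
      used += len;
    }
  out[used] = '\0';
  return 0;
}
```

With this sketch, a name such as "a<TAB>b" would be written as "a&#9;b",
and the parser hands the tab back intact because character references
are not subject to whitespace normalization.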

Fixing attribute escaping, which as far as I can tell is currently
completely broken with regard to whitespace, will kill two birds with one
stone.
Filenames are not the only thing that could potentially be affected by the
bug, so it needs to be fixed either way, and fixing it will eliminate the
technical need for this change. Did you decide to implement this change
because you tried to make a file with a tab in the name? I just tried it,
and it is in fact completely broken, unless there is a bug in the XML
parser that doesn't normalize tabs to spaces.

  Oh yeah, while we're on the topic of entry names: would you care to
  clean up the svn:this_dir hack? Perhaps if changed to a name
  element, the lack of such an element could mean the current dir?

 Actually, SVN_WC_ENTRY_THIS_DIR used to be set to "", but programmers
 became lazy and started assuming such, and not using the #define.
 That was the original reason I changed the #define to svn:this_dir.
 Now that Subversion actually has a large community full of watchful,
 code-reviewing eyes, it's probably safe to switch this back.

Cool.

- --
Peter Davis
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+NhPbhDAgUT1yirARAl7CAJ0T1VTJHi3EDNxQvWaL92WfEYuihgCgmkBI
3pSvNnCQ4EkyMX4J7PsYCoo=
=x1x5
-----END PGP SIGNATURE-----

Received on Sat Oct 14 02:21:11 2006

This is an archived mail posted to the Subversion Dev mailing list.