Daniel Rall <dlr@finemaltcoding.com> writes:
> On Mon, 26 Sep 2005, kfogel@collab.net wrote:
> > Greg Hudson <ghudson@MIT.EDU> writes:
> > > We can't possibly be the first Apache httpd module to be putting
> > > pathnames in log messages, can we? What do other modules do?
> >
> > That was exactly the question we started out with. Paul Querna said
> > he thought they were URI-encoded, but as we discussed, we weren't
> > really sure whether that applied to paths so much as URLs. If these
> > things were clearly URLs, then it would have been an easy call, but
> > they're not (to most people).
>
> I believe it applies to any logged piece of data which can be manipulated
> by the client (e.g. the URI, HTTP headers, etc.). As mentioned by Justin,
> the URL encoding of such data offers some degree of security in conjunction
> with poorly written log parsing/analysis programs, which are often hooked
> up to httpd via a pipe.

I feel like the security part of this argument is a no-go. One way or
another, we're going to delimit these things in a perfectly parsable
manner -- whether it's URI encoding, or single quotes with backslash
escaping, parsing it should be a breeze. It would have to be a pretty
poor parser to get thrown by either method :-).

But the larger question isn't resolved yet (more IRC bikeshedding
happened). Summary of the two options:

1. URI-encoding, no delimiter except whitespace.
Justin Erenkrantz and Mike Pilato lean toward this, because this is
httpd and URI-encoding is the standard for URLs in httpd, and these
paths may be thought of as URLs. They point out (correctly -- I
checked) that everything we log in access_log and error_log right
now is URI-encoded. The URI-encoding solution means less new code,
since we already have functions to do this. It also means less
custom code for parsers, since URI-decoding is already widely
implemented.

2. Single-quote as delimiter, and escape single-quotes in the paths.
I lean toward this solution, for several reasons. One, it's the
standard for paths in Subversion (e.g., in error messages), and I
think those who read these logs will be understanding these strings
as paths in a Subversion repository, not as URLs. Two, if you're
reading a log file in an editor or a paging program, it can render
single-quote-delimited UTF-8 strings readably (and with nice
colorization in some cases), whereas I know of no editor that does
this with URI-encoded strings. Being able to read your paths,
without sending them through an external parser, is nice.
Three, look at an actual high-level operation log event:

   [Sun Jun 05 16:45:52 2005] [info] [client 127.0.0.1] \
     svn operation: user 'jrandom', \
     repos '/usr/local/repositories/myproj': blame 'space in name'

Notice how the repository and path are *not* concatenated together
into one URL here; furthermore, the repository path is given in OS
space, not URL space. Compare that to how we treat a URL path in
access_log:

   127.0.0.1 - jrandom [05/Jun/2005:16:45:51 -0500] \
     "PROPFIND /repositories/myproj/space%20in%20name HTTP/1.1" 404 347

Sure, that's URL-encoded there, but that's because it's a URL :-).
There's no system path called "/repositories/myproj". This
is why I really feel our custom high-level operation logging does
not deal in URLs. It deals in two kinds of paths: system (OS)
paths, and Subversion in-repository paths.
If we were to do URI-encoding, would we URI-encode that first part,
the path *to* the repository? I don't think so; at least, I hope
we can all agree that that would be kind of weird :-). But if
we're not going to URI-encode that part, how do we justify
URI-encoding the path-in-repos part? (And, by the way, how *are*
those to-repository paths getting encoded? Yikes. They're not
controlled by the calls to apr_table_set()...)
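
To make the comparison concrete, here is a minimal sketch of what
option 1 would do to the in-repository path from the blame example
above. It is in Python purely for illustration (mod_dav_svn is C, and
none of this is its actual code); the function names are hypothetical,
and only the standard library's urllib.parse.quote/unquote are real.

```python
from urllib.parse import quote, unquote


def uri_encode_path(path: str) -> str:
    """Hypothetical helper: percent-encode an in-repository path.

    "/" is left alone so the path structure stays visible; everything
    else outside the unreserved set (including space) becomes %XX.
    """
    return quote(path, safe="/")


def uri_decode_path(encoded: str) -> str:
    """Hypothetical helper: reverse uri_encode_path()."""
    return unquote(encoded)


if __name__ == "__main__":
    encoded = uri_encode_path("space in name")
    print(encoded)                   # space%20in%20name
    print(uri_decode_path(encoded))  # space in name
```

The round trip is lossless, which is the parser-side argument for
option 1; the readability cost is that every space in the log shows up
as %20.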
The Escaping Problem:
=====================
In URL-encoding, of course, the escaping question is settled. "%" is
the escape character, and we all know how it's used. No problems
there.
In the single-quote solution, the escape character would be
single-quote itself. Not "\", because (as Justin pointed out)
httpd-2.0.x/server/util.c:ap_escape_logitem() processes the whole log
item. It backslash-escapes certain characters, including of course
backslash itself, so there would be no way for us to get an odd number
of backslashes into a path, or indeed anywhere in the log item.
Single-quote is the next best choice, since it's already known from
SQL and maybe other places, and it reads well (i.e., one can often
deduce the quoting just from looking at some strings that use it).
However, it gets worse: ap_escape_logitem() also *omits* double
quotes. It doesn't escape them, it just skips them entirely. So we
have to escape double quotes as well, in a way that doesn't use any
double-quotes in the escaped representation. We would do better to
ditch single-quote as the escape character and use something else,
let's say "`":

   `1   == unescapes to ==>   '
   `2   == unescapes to ==>   "
   ``   == unescapes to ==>   `

This works (httpd leaves ` alone) but it sure is unintuitive.
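
For anyone who wants to poke at it, the mapping above can be sketched
in a few lines. This is Python and purely illustrative, not proposed
code for mod_dav_svn; the names escape_path, unescape_path and
quote_path are made up for the example, and the sketch assumes the
three-entry table above is the whole scheme.

```python
# Backtick is the escape character; the escaped form never contains
# a single or double quote.
ESCAPES = {"`": "``", "'": "`1", '"': "`2"}
UNESCAPES = {"`": "`", "1": "'", "2": '"'}


def escape_path(path: str) -> str:
    """Replace `, ' and " with their backtick escapes."""
    return "".join(ESCAPES.get(ch, ch) for ch in path)


def unescape_path(escaped: str) -> str:
    """Reverse escape_path(); reject malformed escape sequences."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch != "`":
            out.append(ch)
            continue
        code = next(chars, None)  # character following the backtick
        if code not in UNESCAPES:
            raise ValueError("bad escape sequence in %r" % escaped)
        out.append(UNESCAPES[code])
    return "".join(out)


def quote_path(path: str) -> str:
    """Escape a path and wrap it in single-quote delimiters."""
    return "'" + escape_path(path) + "'"


if __name__ == "__main__":
    print(quote_path("space in name"))  # 'space in name'
    print(quote_path("it's"))           # 'it`1s'
```

Because the escaped text contains no quotes of either kind, a parser
can simply read from one single-quote to the next and then unescape.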
So, the escaping situation for any non-URI-encoded solution is uglier
than one would hope. On the other hand, it's not really any less ugly
than URI-encoding, which escapes characters (such as space) that would
otherwise be quite readable in a log, and escapes them using a numeric
mapping system too.
Whew.
I think this lays out all the issues. If I've forgotten any arguments
in favor of the URI-encoding solution, my apologies -- no omissions
were deliberate.
What should we do?
I feel like the "these are paths not URLs" argument is still pretty
compelling. Also, when discussing it in IRC I did not offer the above
examples from actual logs, so I don't know what Justin and Mike will
think after reading this message. Justin? Mike? Others? Bueller?
-Karl
--
www.collab.net <> CollabNet | Distributed Development On Demand
Received on Tue Sep 27 00:07:54 2005