Re: Some issues on svn propget (Re: svn commit: r1881985 - /subversion/trunk/subversion/tests/cmdline/merge_tests.py)

From: Daniel Shahaf <d.s_at_daniel.shahaf.name>
Date: Mon, 05 Oct 2020 17:39:46 +0000

Yasuhito FUTATSUKI wrote on Mon, 05 Oct 2020 13:30 +00:00:
> On 2020/10/05 1:57, Daniel Shahaf wrote:
> > Yasuhito FUTATSUKI wrote on Sun, 04 Oct 2020 21:56 +0900:
>
> >> On 2020/09/26 19:12, Daniel Shahaf wrote:
> >>> 1 % svn propset svn:ignore "予定表.txt" ./
> >>> 2 property 'svn:ignore' set on '.'
> >>> 3 % svn propset foo:ignore "予定表.txt" ./
> >>> 4 property 'foo:ignore' set on '.'
> >>> 5 % LC_ALL=ja_JP.eucjp svn pl -v
> >>> 6 Properties on '.':
> >>> 7 foo:ignore
> >>> 8 予定表.txt
> >>> 9 svn:ignore
> >>> 10 ͽɽ.txt
> >>>
> >>> 11 % LC_ALL=C svn pg --strict svn:ignore
> >>> 12 {U+4E88}{U+5B9A}{U+8868}.txt
> >>>
> >>> 13 % svn propset svn:ignore "{U+4E88}.txt" ./
> >>> 14 property 'svn:ignore' set on '.'
> >>> 15 % sqlite3 .svn/wc.db .dump | me
> >>> 16 (svn:ignore 29 {U+4E88}{U+5B9A}{U+8868}.txt )
> >>> 17 % svn pg --strict svn:ignore
> >>> 18 {U+4E88}{U+5B9A}{U+8868}.txt
> >>> .
> >>> So, I think there are a number of different issues/gotchas here:
> >>>
> >>> - It's not possible to get the raw value of an svn:* property in
> >>> a working copy if the value is not representable in the local encoding.
> >>
> >> I believe that if we want to get property values precisely, we should
> >> use xml output, although --no-newline is enough in most cases except
> >> this one.
> >
> > Hmm, that's an interesting one. On the one hand, «propget --xml»
> > does resolve the ambiguity issue of the ad-hoc escaping; on the other
> > hand:
> >
> > - We shouldn't require CLI users to use an XML parser in order to
> > retrieve values of binary blobs.
>
> Then do we need a new output format for "strict" values?

+1

> > - The XML document declares itself to be in UTF-8. Does that mean XML
> > parsers are allowed to treat the dumped property values as UTF-8 and,
> > for example, convert the byte sequence (that comprises the value) to
> > another byte sequence, that's equivalent when treated as UTF-8 but
> > not equivalent when treated as binary blobs? (For example, convert
> > the UTF-8 to composed or decomposed normal form.)
>
> At least we expect that parsing does not convert the byte sequence
> if the value is considered safe by svn_xml_is_xml_safe(). If that
> is not so, I think the output of --xml is broken.
>

I don't think svn_xml_is_xml_safe() addresses the above concern. By
code inspection, that function will return TRUE on the strings «é»
(U+00E9, bytes: C3 A9) and «é» (U+0065 U+0301, bytes: 65 CC 81), so the
XML/Unicode/renormalization question remains. (The two example strings
here are the composed/decomposed equivalents of each other.)
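
To make the concern concrete, here is a minimal Python sketch (not
Subversion code). The renormalize() helper is a hypothetical stand-in
for an XML consumer that normalizes text after parsing; whether any real
parser does so is exactly the open question. It shows that the two
spellings are different byte sequences and that normalization maps one
onto the other:

```python
import unicodedata

composed = "\u00e9"      # U+00E9, one code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)

composed_bytes = composed.encode("utf-8")      # C3 A9
decomposed_bytes = decomposed.encode("utf-8")  # 65 CC 81


def renormalize(raw: bytes, form: str) -> bytes:
    """Hypothetical consumer: decode as UTF-8, normalize, re-encode."""
    text = raw.decode("utf-8")
    return unicodedata.normalize(form, text).encode("utf-8")


# The two values differ as binary blobs...
print(composed_bytes != decomposed_bytes)
# ...but a normalizing consumer would silently turn one into the other.
print(renormalize(decomposed_bytes, "NFC") == composed_bytes)
print(renormalize(composed_bytes, "NFD") == decomposed_bytes)
```

If a property value held the decomposed bytes, a consumer behaving like
renormalize(..., "NFC") would hand back a value that is no longer
byte-for-byte identical to what was stored.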

I agree that if XML parsers are allowed to do such renormalizations,
then «propget --xml» is broken.

> Moreover, as properties have no metadata about their contents, we can't
> determine whether a property is text or not, even if it contains only
> printable characters, like 'eicar.com'[1].

Yes, property values are binary, so serialization/deserialization ought
to preserve them byte for byte.

However, when dumping a particular value, there's nothing stopping us
from inspecting that value to see whether it, say, consists entirely of
printable ASCII or not, and taking different codepaths depending on the
result of the check.

> So it would not be so strange if we
> used base64 encoding for all properties (but I don't think it is a good
> idea).

Well, using base64 unconditionally would definitely be robust, but I think
it'd be too much: we could, for one, forgo base64 for property values
that are entirely ASCII.
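
A rough sketch of that idea in Python, purely for illustration: the
"plain"/"base64" tags, the serialize()/deserialize() names and the exact
definition of "printable ASCII" (0x20..0x7E plus tab and newline) are
all assumptions of mine, not an existing or proposed svn output format.

```python
import base64
from typing import Tuple

# Assumption: "printable ASCII" means bytes 0x20..0x7E plus tab and newline.
_PLAIN_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A}


def serialize(value: bytes) -> Tuple[str, str]:
    """Return a (tag, text) pair; both tags are hypothetical."""
    if all(b in _PLAIN_BYTES for b in value):
        # Entirely printable ASCII: emit the value as-is.
        return ("plain", value.decode("ascii"))
    # Anything else is treated as an opaque blob.
    return ("base64", base64.b64encode(value).decode("ascii"))


def deserialize(tag: str, text: str) -> bytes:
    """Invert serialize(), recovering the original bytes exactly."""
    if tag == "plain":
        return text.encode("ascii")
    if tag == "base64":
        return base64.b64decode(text)
    raise ValueError(f"unknown tag: {tag!r}")


for raw in (b"*.txt\n", "予定表.txt".encode("utf-8"), b"\x00\xff"):
    tag, text = serialize(raw)
    assert deserialize(tag, text) == raw  # byte-for-byte round trip
    print(tag, text)
```

The check looks only at the bytes of the particular value being dumped,
so it needs no metadata about whether the property is "text"; either
codepath round-trips the value unchanged.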

Cheers,

Daniel
Received on 2020-10-05 19:40:16 CEST

This is an archived mail posted to the Subversion Dev mailing list.