On Tue, May 31, 2011 at 01:41:54AM +0300, Daniel Shahaf wrote:
> How would you handle a repository that contains the following
> nodes/fspaths:
>
> /foo/bår (in UTF-8)
> /foo/bår (in latin1)
>
> ?
>
>
> How would you handle a repository that contains:
> /foo/barÉ (in latin1)
> /foo/barŠ (in latin2)
>
> ?

All the ISO-8859 (latin) encodings are single-byte encodings.
If paths in different ISO-8859 encodings entered the repository, it's
not possible to tell from the bytes alone which encoding was intended.
The same bytes decode to different but valid strings of characters.
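To illustrate the ambiguity, here is a small Python sketch. The path
and the byte 0xA9 are my own example, not taken from any real
repository. It decodes the same bytes as ISO-8859-1 and as ISO-8859-2:

```python
# The same raw path bytes, interpreted under two single-byte encodings.
raw = b"/foo/bar\xa9"

as_latin1 = raw.decode("iso8859-1")
as_latin2 = raw.decode("iso8859-2")

print(as_latin1)  # byte 0xA9 read as U+00A9 (copyright sign)
print(as_latin2)  # byte 0xA9 read as U+0160 (S with caron)

# Both decodes succeed, so nothing in the bytes says which one was meant.
# The bytes are not valid UTF-8, though, and that much can be detected.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False
```

So the only thing we can detect reliably is "not UTF-8", not which
latin variant was used.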

In the first iteration of this feature I would simply assume one
user-specified source encoding and try to convert data that isn't
valid UTF-8 from the source encoding to UTF-8.
If multiple single-byte encodings are present, this means that some
characters will be wrong, but the repository will work again without
manual intervention. If multiple multi-byte encodings other than
UTF-8 are present, this approach can fail and might require manual
fixing (no worse than the current situation).
This could still be improved upon if necessary.
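Roughly, the conversion I have in mind looks like the following Python
sketch. The function name to_utf8 and its parameters are hypothetical,
this is not existing Subversion code, and the real thing would be
written in C against the repository data:

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Return raw unchanged if it is valid UTF-8, otherwise convert it
    from the single user-specified source encoding to UTF-8.

    Raises UnicodeDecodeError if raw is not valid in source_encoding
    either, which can happen with multi-byte source encodings.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        pass
    return raw.decode(source_encoding).encode("utf-8")


# A latin1 path is converted; a UTF-8 path passes through untouched.
fixed = to_utf8(b"/foo/b\xe5r", "iso8859-1")
kept = to_utf8("/foo/b\u00e5r".encode("utf-8"), "iso8859-1")

# A latin2 path converted under the latin1 assumption yields valid
# UTF-8, but with the wrong character (copyright sign, not S with caron).
wrong_char = to_utf8(b"/foo/bar\xa9", "iso8859-1")
```

One caveat with this kind of check: some byte sequences from other
encodings also happen to be valid UTF-8, and those would pass through
unconverted.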
> > We should also make svnadmin verify complain if paths are not in UTF-8.
>
> +1.
>
> The validation that 'load' and 'commit' trigger is path_valid() in
> fs_loader.c.
Thanks for the hint. I'm now running tests on a patch for this.
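For the verify side, the check amounts to something like this Python
sketch. The name find_non_utf8_paths is made up for illustration, and
this is neither the patch itself nor what path_valid() does:

```python
from typing import Iterable, List


def find_non_utf8_paths(paths: Iterable[bytes]) -> List[bytes]:
    """Return the paths whose bytes are not valid UTF-8."""
    bad = []
    for path in paths:
        try:
            path.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad


paths = [
    "/foo/b\u00e5r".encode("utf-8"),      # valid UTF-8
    "/foo/b\u00e5r".encode("iso8859-1"),  # latin1 bytes, not UTF-8
    b"/foo/plain-ascii",                  # ASCII is a subset of UTF-8
]
offenders = find_non_utf8_paths(paths)
```

A verify pass would then complain about every path in offenders.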
Received on 2011-05-31 01:07:42 CEST