Re: SVN Blame Returns Corrupt Data

From: Branko Čibej <brane_at_wandisco.com>
Date: Fri, 11 Oct 2013 21:29:47 +0200

On 11.10.2013 19:25, Stefan Sperling wrote:
> On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
>> On 10/11/13 9:22 AM, Branko Čibej wrote:
>>> You'd have to extend Subversion's file type detection to detect UTF-16.
>>> See svn_io_detect_mimetype2 in line 3333 in this file:
>>>
>>> http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
>>> Subversion currently only looks at the first 1k Bytes of a file. It may
>>> be enough to check that this initial part of the file contains only
>>> valid UTF-16 (BE or LE) codes.
>> Even if all we looked for is the BOM it might be helpful enough. I suspect the
>> development tools producing UTF-16 are including BOMs. Windows seems to be
>> fond of including them; Notepad puts one even on UTF-8 files.
> Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
> processing them for diff/merge/blame, and convert output written to
> the original files back to UTF-16?

That would be less work than supporting whitespace compression, etc. in
UTF-16, but we'd still have to detect U+2028 (LINE SEPARATOR) as an
end-of-line marker in UTF-8 text.
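
A rough illustration of that round trip, in Python rather than Subversion's C, and not how Subversion does anything today. The function names are made up for this sketch, and it assumes the extra end-of-line marker of interest is U+2028 (LINE SEPARATOR), whose UTF-8 encoding is the three bytes E2 80 A8. Bare CR and CRLF handling is left out.

```python
# Hypothetical helpers; none of these names exist in Subversion.
EXTRA_EOL = "\u2028"  # assumption: treat LINE SEPARATOR as an end-of-line marker


def to_internal(data, external_enc):
    """Convert bytes in the external encoding (e.g. 'utf-16-le') to UTF-8."""
    return data.decode(external_enc).encode("utf-8")


def to_external(data, external_enc):
    """Convert UTF-8 bytes back to the external encoding."""
    return data.decode("utf-8").encode(external_enc)


def split_lines_utf8(data, extra_eol=EXTRA_EOL):
    """Split UTF-8 bytes on LF and on extra_eol.

    Searching for the encoded byte sequence is safe because, in valid
    UTF-8, the bytes of one code point never occur inside another.
    """
    sep = extra_eol.encode("utf-8")
    lines = []
    for chunk in data.split(b"\n"):
        lines.extend(chunk.split(sep))
    return lines


if __name__ == "__main__":
    # A leading U+FEFF stands in for the BOM; decoding with an explicit
    # byte order keeps it as a character, so the round trip is byte-exact.
    original = "\ufeffone\u2028two\nthree".encode("utf-16-le")
    internal = to_internal(original, "utf-16-le")
    print(split_lines_utf8(internal))
    print(to_external(internal, "utf-16-le") == original)
```

Line-based operations (diff, merge, blame) would then see only the UTF-8 form, and only the final write goes back through to_external.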

Still, we'd actually have to correctly identify UTF-16 content first,
and handle invalid byte sequences.
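
For the sake of discussion, a sketch of what such a check could look like, again in Python and not taken from svn_io_detect_mimetype2. It assumes a BOM is present (per Ben's suggestion) and looks only at the first 1k bytes mentioned earlier in the thread. The name sniff_utf16 and the constant are hypothetical.

```python
import codecs

SNIFF_BYTES = 1024  # only the leading part of the file is examined


def sniff_utf16(data):
    """Return 'utf-16-le', 'utf-16-be', or None.

    The BOM selects the byte order; the rest of the sniffed prefix must
    then decode strictly, so unpaired surrogates cause a rejection.
    """
    head = data[:SNIFF_BYTES]
    if head.startswith(codecs.BOM_UTF16_LE):
        enc = "utf-16-le"
    elif head.startswith(codecs.BOM_UTF16_BE):
        enc = "utf-16-be"
    else:
        return None

    body = head[2:]
    body = body[: len(body) - len(body) % 2]  # ignore a trailing odd byte

    # final=False lets a surrogate pair that straddles the cut-off pass:
    # the decoder buffers the dangling high surrogate instead of failing.
    decoder = codecs.getincrementaldecoder(enc)(errors="strict")
    try:
        decoder.decode(body, final=False)
    except UnicodeDecodeError:
        return None
    return enc


if __name__ == "__main__":
    good = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
    bad = codecs.BOM_UTF16_LE + b"\x00\xdc" + "a".encode("utf-16-le")
    print(sniff_utf16(good), sniff_utf16(bad), sniff_utf16(b"plain ascii"))
```

One caveat: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM, so a real detector would probably want to rule that out as well.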

> That would require some work because existing streams, strings, and files
> passed around in the code would need to be wrapped so that translation
> between the internal and the external encoding is seamless.
>
> But I don't see why such an approach couldn't be made to work in principle.
> It might even result in some spring cleaning in the code base and pave the
> way for improved handling of file formats such as XML for diff and merge.

Can't see what XML has to do with it. The diff algorithm already uses a
tokenizer; and for XML, that should be good enough most of the time.

> What do you think? Is it worth adding this to our project ideas page?

It's already here: http://subversion.tigris.org/issues/show_bug.cgi?id=2194

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane_at_wandisco.com
Received on 2013-10-11 21:30:31 CEST

This is an archived mail posted to the Subversion Users mailing list.