
RE: SVN Blame Returns Corrupt Data

From: Bob Archer <Bob.Archer_at_amsi.com>
Date: Fri, 11 Oct 2013 16:12:50 +0000

> On 11.10.2013 17:19, Bob Archer wrote:
> >> On 11.10.2013 16:55, Bob Archer wrote:
> >>>> On 11.10.2013 15:58, Bob Archer wrote:
> >>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <Bob.Archer_at_amsi.com> wrote:
> >>>>>> I assume he was asking how to "fix" the blame. Cause, sure, he
> >>>>>> could open the file, convert it back to UTF-8 with CRLF line
> >>>>>> endings... and commit it... of course, now blame is going to show
> >>>>>> him on every line, since he just changed every line.
> >>>>>>
> >>>>>> That's exactly what I meant. You're correct with how the blame
> >>>>>> is handled. I committed the UTF-8 copy to a test branch, diff'd,
> >>>>>> and it showed every line as being changed. Unfortunately it
> >>>>>> looks like this is our best option.
> >>>>> Yep, we have done the same thing. As a matter of fact, I just over
> >>>>> the past few days rescripted all our database scripts to be UTF-8
> >>>>> since merging them just doesn't work correctly when they are
> >>>>> UTF-16 even if you remove the binary mime type.
> >>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <ben_at_reser.org> wrote:
> >>>>>> At current blame is not UTF-16 aware.
> >>>>> It's not just blame that isn't... the diff engine, or whatever
> >>>>> detects file types, always considers UTF-16 files to be binary. If
> >>>>> you "add" a UTF-16 file you see that svn adds the
> >>>>> application/octet-stream mime type. There is an issue in the bug
> >>>>> database about this from when I reported/complained about it...
> >>>>> however it hasn't been addressed. I'm surprised that at this point
> >>>>> svn still can't support UTF-16 text files as text wrt adding,
> >>>>> diffing, blaming, etc.
> >>>> It's quite simple: no-one has written the necessary code. While I
> >>>> can understand it's an interesting feature for Windows users, most
> >>>> Subversion developers have other things to do. This being a
> >>>> volunteer project, and most of us do not use Windows, you can
> >>>> hardly expect anyone to spend several weeks on solving a problem
> >>>> that has a perfectly simple workaround. Since
> >>>> UTF-8 and UTF-16 can be interchanged without data loss, there are
> >>>> other, much more important things to do in Subversion.
> >>> I appreciate all that you said. I didn't expect that UTF-16 was so
> >>> uncommon in non-Windows OSes. A large number of dev tools that I
> >>> work with on Windows, especially the Microsoft tools, default to
> >>> creating UTF-16 files.
> >>> I disagree with your "can be converted without data loss". If you
> >>> need UTF-16 then you need it. Also, if you are working in an
> >>> international team and you have developers with other-language OSes
> >>> which have different code pages, then what you see when you look at
> >>> a UTF-8 file might be different from what I see.
> >>
> >> I don't follow. Both UTF-16 and UTF-8 are complete representations of
> >> the Unicode character set. Exactly the same code sequences can be
> >> represented in both encodings. You can convert from UTF-16 to UTF-8
> >> and back and get exactly the same sequence of bytes.
> >>
> > Ok, I have to backpedal here a bit. You are correct, UTF-8 is a
> > Unicode format and can store all characters. It's not a UTF-8 vs
> > UTF-16 issue (Friday senior moment). What I recall being told by one
> > of the Subversion developers was that Subversion only supported the
> > ASCII character set and, while UTF-8 was compatible with ASCII, it
> > didn't truly support Unicode files.
> >
> > However, this blog entry seems to dispute that:
> >
> > http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
> >
> > Would adding that mime-type to this file fix the blame issues this
> > user is seeing?
>
> I think the user is just very lucky. Subversion does not actually try to interpret
> the svn:mime-type property, other than to determine whether to treat a file
> as text or binary. (The comment is correct in that the proper parameter is
> charset=, not encoding=, but that's not important for this discussion).
>
> Subversion's merge algorithm depends on being able to detect line endings
> in the file, and always scans the file as a sequence of bytes.
> There are several ways to represent line endings in a UTF-16 file (shown here
> as hex byte sequences):
>
> * 00 0A (Unix newline, UTF16-BE)
> * 00 0D 00 0A (Windows newline, UTF16-BE)
> * 0A 00 (Unix newline, UTF16-LE)
> * 0D 00 0A 00 (Windows newline, UTF16-LE)
> * 24 24 (Unicode newline, same in LE and BE)
>
> Subversion, however, expects one of the following newline sequences:
>
> * 0A (Unix newline)
> * 0D 0A (Windows newline)
>
> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
> newline character, are interpreted as the end-of-line markers, and the zero
> bytes are treated as part of the text. In most cases, the result will be close to
> correct, as long as there are no conflicts in the merge -- because Subversion
> will not emit conflict markers in UTF-16.
>
> Of course, if someone used the U+2424 newline code point instead, then in
> the worst case, the whole file would be interpreted as a single line.
>
> -- Brane
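
To make the guess above concrete, here is a minimal Python sketch of what a byte-oriented scanner would see. This is not Subversion's actual code; the helper name split_on_lf is hypothetical, and the only assumption is that a line ends at every single 0A byte, as described above.

```python
def split_on_lf(data: bytes) -> list:
    """Split raw bytes after every 0x0A byte, with no knowledge of the encoding."""
    chunks = data.split(b"\n")
    # Re-attach the 0A terminator to every chunk that was followed by one.
    lines = [chunk + b"\n" for chunk in chunks[:-1]]
    # Keep any trailing bytes that come after the last 0A.
    if chunks[-1]:
        lines.append(chunks[-1])
    return lines


text = "ab\r\ncd\r\n"  # two lines with Windows line endings

# UTF-16BE: the 0A byte is the last byte of each newline code unit,
# so the byte-level split happens to land on real line boundaries.
be_lines = split_on_lf(text.encode("utf-16-be"))

# UTF-16LE: the 0A byte is followed by a 00 byte, which ends up at the
# start of the next "line" instead of staying with the newline.
le_lines = split_on_lf(text.encode("utf-16-le"))

# A file that uses the U+2424 code point contains no 0A byte at all.
one_line = split_on_lf("ab\u2424cd".encode("utf-16-le"))

print(len(be_lines), len(le_lines), len(one_line))
for line in le_lines:
    print(line.hex(" "))
```

In the little-endian case the pieces no longer line up with the real lines, which is consistent with the "close to correct, but only by luck" description above.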

Great information, thanks for that.

Bottom line: use UTF-8 for your text files and svn will be happy and work correctly. How hard would it be to add a warning on an add when a file looks like UTF-16, saying that it should be converted to UTF-8 because otherwise it will be treated as a binary file?
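
As a rough idea of what such a check could look like outside of svn itself (for example in a script run before adding files), here is a Python sketch. The function names looks_like_utf16 and utf16_to_utf8 are made up for illustration, and the detection is only a heuristic that assumes the file starts with a byte order mark; UTF-16 files without a BOM would not be caught.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Heuristic: report True if the data starts with a UTF-16 byte order mark."""
    # The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM, so rule it out first.
    if data.startswith(codecs.BOM_UTF32_LE):
        return False
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert BOM-prefixed UTF-16 bytes to UTF-8 (the BOM is dropped)."""
    # The "utf-16" codec reads the BOM to pick the byte order and strips it.
    return data.decode("utf-16").encode("utf-8")


# Example input: a small script saved as UTF-16LE with a BOM.
sample = codecs.BOM_UTF16_LE + "SELECT 1;\r\n".encode("utf-16-le")

if looks_like_utf16(sample):
    print("warning: file looks like UTF-16; consider converting it to UTF-8")
    converted = utf16_to_utf8(sample)
    # Converting back gives the original bytes (minus the BOM), so nothing is lost.
    assert converted.decode("utf-8").encode("utf-16-le") == sample[2:]
```

The round-trip assertion at the end is the "no data loss" point from earlier in the thread: the same text survives the conversion in both directions.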

BOb
Received on 2013-10-11 18:13:27 CEST

This is an archived mail posted to the Subversion Users mailing list.
