Re: SVN Blame Returns Corrupt Data

From: Branko Čibej <brane_at_wandisco.com>
Date: Fri, 11 Oct 2013 17:43:30 +0200

On 11.10.2013 17:19, Bob Archer wrote:
>> On 11.10.2013 16:55, Bob Archer wrote:
>>>> On 11.10.2013 15:58, Bob Archer wrote:
>>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <Bob.Archer_at_amsi.com> wrote:
>>>>>> I assume he was asking how to "fix" the blame. Cause, sure, he
>>>>>> could open the file, convert it back to UTF-8 with CRLF line
>>>>>> endings... and commit it... of course, now blame is going to show
>>>>>> him on every line, since he just changed every line.
>>>>>>
>>>>>> That's exactly what I meant. You're correct with how the blame is
>>>>>> handled. I committed the UTF-8 copy to a test branch, diff'd, and
>>>>>> it showed every line as being changed. Unfortunately it looks like
>>>>>> this is our best option.
>>>>> Yep, we have done the same thing. As a matter of fact, I just over
>>>>> the past few days rescripted all our database scripts to be UTF-8 since
>>>>> merging them just doesn't work correctly when they are UTF-16 even if
>>>>> you remove the binary mime type.
>>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <ben_at_reser.org> wrote:
>>>>>> At current blame is not UTF-16 aware.
>>>>> It's not just blame that isn't... the diff engine, or whatever
>>>>> detects file types always considers UTF-16 files to be binary. If you
>>>>> "add" a UTF-16 file you see that svn adds the application/octet-stream
>>>>> mime type. There is an issue in the bug database about this from when I
>>>>> reported/complained about it... however it hasn't been addressed. I'm
>>>>> surprised still at this time that svn still can't support UTF-16 text
>>>>> files as text wrt adding, diffing, blaming, etc.
>>>> It's quite simple: no-one has written the necessary code. While I can
>>>> understand it's an interesting feature for Windows users, most
>>>> Subversion developers have other things to do. This being a volunteer
>>>> project, and most of us do not use Windows, you can hardly expect
>>>> anyone to spend several weeks on solving a problem that has a
>>>> perfectly simple workaround. Since
>>>> UTF-8 and UTF-16 can be interchanged without data loss, there are
>>>> other, much more important things to do in Subversion.
>>> I appreciate all that you said. I didn't expect that UTF-16 was so
>>> uncommon in non-Windows OSes. A large number of dev tools that I work
>>> with on Windows, especially the Microsoft tools, default to creating
>>> UTF-16 files.
>>> I disagree with your "can be converted without data loss". If you need
>>> UTF-16 then you need it. Also, if you are working in an international
>>> team and you have developers with other language OSes which have
>>> different code pages then what you see when you look at a UTF-8 file
>>> might be different than what I see.
>>
>> I don't follow. Both UTF-16 and UTF-8 are complete representations of the
>> Unicode character set. Exactly the same code sequences can be represented
>> in both encodings. You can convert from UTF-16 to UTF-8 and back and get
>> exactly the same sequence of bytes.
>>
> Ok, I have to backpedal here a bit. You are correct, UTF-8 is a Unicode
> format and can store all characters. It's not a UTF-8 vs UTF-16 issue
> (Friday senior moment). What I recall being told by one of the Subversion
> developers was that Subversion only supported the ASCII character set, and
> while UTF-8 was compatible with ASCII it didn't truly support Unicode files.
>
> However, this blog entry seems to dispute that:
>
> http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/
>
> Would adding that mime-type to this file fix the blame issues this user is seeing?

I think the user is just very lucky. Subversion does not actually try to
interpret the svn:mime-type property, other than to determine whether to
treat a file as text or binary. (The comment is correct in that the
proper parameter is charset=, not encoding=, but that's not important
for this discussion.)

Subversion's merge algorithm depends on being able to detect line
endings in the file, and always scans the file as a sequence of bytes.
There are several ways to represent line endings in a UTF-16 file (shown
here as hex byte sequences):

  * 00 0A (Unix newline, UTF-16BE)
  * 00 0D 00 0A (Windows newline, UTF-16BE)
  * 0A 00 (Unix newline, UTF-16LE)
  * 0D 00 0A 00 (Windows newline, UTF-16LE)
  * 24 24 (Unicode newline, same in LE and BE)

Subversion, however, expects one of the following newline sequences:

  * 0A (Unix newline)
  * 0D 0A (Windows newline)

My best guess as to what's happening is that the 0A bytes, a.k.a. the
ASCII newline character, are interpreted as the end-of-line markers, and
the zero bytes are treated as part of the text. In most cases, the
result will be close to correct, as long as there are no conflicts in
the merge -- because Subversion will not emit conflict markers in UTF-16.
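
To make that guess concrete, here is a minimal Python sketch. It is not
Subversion's code: split_on_lf_bytes is a hypothetical helper that only
mimics the idea of treating every 0A byte as an end-of-line marker while
ignoring the file's encoding. The sample text is made up for illustration.

```python
def split_on_lf_bytes(data: bytes) -> list[bytes]:
    """Hypothetical helper: cut after every 0x0A byte, ignoring the encoding."""
    chunks = []
    start = 0
    while (end := data.find(b"\n", start)) != -1:
        chunks.append(data[start:end + 1])  # keep the 0x0A with its chunk
        start = end + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last 0x0A
    return chunks


text = "ab\ncd\n"  # two lines with Unix newlines
le_chunks = split_on_lf_bytes(text.encode("utf-16-le"))
be_chunks = split_on_lf_bytes(text.encode("utf-16-be"))

# Print each chunk as hex so the position of the zero bytes is visible.
for name, chunks in (("UTF-16LE", le_chunks), ("UTF-16BE", be_chunks)):
    print(name, [chunk.hex(" ") for chunk in chunks])
```

In the UTF-16BE case the two chunks happen to line up with the real lines.
In the UTF-16LE case each cut falls in the middle of a 16-bit code unit: the
second chunk starts with a stray 00 byte and a lone 00 is left over at the
end. Concatenating the chunks unchanged still reproduces the original file,
but any ASCII text inserted at a chunk boundary would break the UTF-16
pairing.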

Of course, if someone used the U+2424 newline code point instead, then
in the worst case, the whole file would be interpreted as a single line.

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane_at_wandisco.com
Received on 2013-10-11 17:44:26 CEST

This is an archived mail posted to the Subversion Users mailing list.