
Re: "Save to Clipboard" including BOM marker

From: Stefan Küng <tortoisesvn_at_gmail.com>
Date: Fri, 17 Jul 2015 19:00:38 +0200

On 17.07.2015 09:47, Gavin Lambert wrote:
> On 17/07/2015 04:27, quoth Stefan Küng:
>>> This seems incorrect. The clipboard should only contain textual
>>> content; it should not include an initial BOM in any case.
>>> (*Files* contain an initial BOM because there is otherwise no
>>> reliable way to determine if the content is ANSI or Unicode. The
>>> clipboard does not have that issue.)
> BTW it may have been unclear, but in this part I'm referring to a BOM
> at the very start of the clipboard data (ie. what would be the
> initial BOM in the patch file itself), not a BOM from the diff
> content of the original files.
> It's possible that I misinterpreted you, but it sounded like you were
> saying that TortoiseMerge would write a BOM to the clipboard,
> followed by the actual patch content (as if the clipboard were a
> file). That's the part that I was mainly objecting to, not whether
> the patch content contained another BOM or not.
> If I did misinterpret this then I apologise for the noise.
>>> This shouldn't happen either. The BOM should be stripped from
>>> the file content prior to generating the diff.
>> Sorry, but that would be a big bug. The diff must contain the BOM,
>> because it would be broken if you add or remove the BOM from a
>> file, and then do a diff: if a patch file would not contain changes
>> to the BOM, you could not apply such a patch file and get the
>> correct results.
> True, although I was referring to the clipboard copy or UI display
> rather than the file output -- you're already showing format
> differences on the status bar after all.
> It's a tricky one though because if the patch is copied to the
> clipboard and then pasted into a file editor then it doesn't seem
> like there's a good solution either way.

It's actually even worse:
svn creates patches in the local codepage. That is, the context info
(file paths, revision numbers, dates, line numbers) is all written in
the local codepage.
But the file content (i.e., the diffed lines) is written in the
encoding of the diffed files (just a byte-by-byte copy).
So it's possible that the context is in the local codepage while the
diffed lines are in, e.g., utf8. Or even another codepage!
And it gets worse still: I found some projects which had some files
encoded in the local codepage and others in utf8. So if you create a
patch file over those files, you'll end up with a patch file whose diff
lines are in different encodings.
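
To make the mixed-encoding situation concrete, here is a minimal sketch in Python. It is not svn's code: the file name, the header layout, and the choice of cp1252 as the "local codepage" are assumptions picked for illustration. It builds a patch whose header bytes are cp1252 and whose diff lines are UTF-8, then shows that no single encoding decodes the whole thing correctly.

```python
# Hypothetical patch: header in the local codepage (cp1252 is an assumption),
# diff lines copied byte-for-byte from a UTF-8 encoded file.
header = "--- Größe.txt\t(revision 1)\n+++ Größe.txt\t(working copy)\n".encode("cp1252")
hunk = "@@ -1 +1 @@\n".encode("cp1252")
body = "-Straße\n+Strasse über\n".encode("utf-8")
patch = header + hunk + body


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes strictly in the given encoding."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The cp1252 header bytes are not valid UTF-8, so a strict UTF-8 decode fails.
whole_utf8 = decodes_as(patch, "utf-8")

# Decoding everything as cp1252 "succeeds", but the UTF-8 diff lines
# come out as mojibake instead of the original text.
as_cp1252 = patch.decode("cp1252")
print(whole_utf8)
print("Straße" in as_cp1252)
```

The second case is the nasty one: the decode raises no error, the text is simply wrong.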

What I do in TMerge when applying a patch file is try to determine the
code page of each line individually. Only that way are all bases
covered.
The only problem is that encoding detection is not fully accurate,
especially if you only have a few chars to work with.
And since each line has to be detected separately - well, I'm sure you
see the problem :)
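
A rough sketch of what per-line detection can look like, in Python. This is not TortoiseMerge's actual algorithm; the function names, the "try strict UTF-8, else fall back" heuristic, and the cp1252 fallback are all assumptions for illustration. The last lines show why short inputs are a problem: a cp1252 line can happen to be valid UTF-8 and get misdetected.

```python
def guess_line_encoding(line: bytes, fallback: str = "cp1252") -> str:
    """Guess the encoding of one patch line (hypothetical heuristic)."""
    try:
        line.decode("ascii")
        return "ascii"  # pure ASCII: identical in UTF-8 and most codepages
    except UnicodeDecodeError:
        pass
    try:
        line.decode("utf-8")
        return "utf-8"  # valid UTF-8, so assume it is UTF-8
    except UnicodeDecodeError:
        return fallback  # otherwise assume the local codepage


def decode_patch(data: bytes, fallback: str = "cp1252") -> list:
    """Decode a patch line by line, guessing each line's encoding separately."""
    decoded = []
    for raw in data.splitlines(keepends=True):
        encoding = guess_line_encoding(raw, fallback)
        decoded.append(raw.decode(encoding, errors="replace"))
    return decoded


# Header in cp1252, diff line in UTF-8: each line is decoded on its own.
mixed = "--- Größe.txt\n".encode("cp1252") + "+Straße\n".encode("utf-8")
print(decode_patch(mixed))

# Misdetection: the cp1252 text "Ã©" is the byte pair C3 A9,
# which is also valid UTF-8 (for "é"), so the heuristic guesses wrong.
print(guess_line_encoding("Ã©".encode("cp1252")))
```

The fewer non-ASCII bytes a line has, the more often a guess like this is just a coin flip.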

> Ideally there should be some out-of-band way to signal to the patch
> tool to change the file encoding without that affecting the patch
> content. Unfortunately that seems like the sort of thing that should
> have been done a couple of decades ago, and may not be practical
> now.

That wouldn't work either, since a patch file can contain diffs of
multiple files, and each file can have a different encoding.

> That brings up a related question, though -- how would you generate a
> patch that would successfully convert a previously existing file from
> ANSI to UTF-16-without-BOM format? I don't think there's any way to
> represent that, unless you start making assumptions based on the
> format of the patch file itself (which you also lose when passing
> through the clipboard). I guess that's a less common scenario
> because UTF-16 files are always supposed to have a BOM. (Some
> editors will let you do it though.)

svn patch files can't deal with utf-16. Only narrow char strings are
supported (e.g., ANSI/utf8). svn cannot handle wide char strings - at
least not with the patch feature.
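
As a general illustration of why byte-oriented line handling and utf-16 don't mix (this is a Python sketch of the encoding issue, not a description of svn's internals): in utf-16 every character, including the newline, takes at least two bytes, so splitting on the single byte 0x0A cuts lines in the middle of a code unit.

```python
text = "ab\ncd\n"

# UTF-8: the newline is the single byte 0x0A, so byte-wise splitting is safe.
utf8_pieces = text.encode("utf-8").split(b"\n")

# UTF-16 (little endian): the newline is 0x0A 0x00, so splitting on 0x0A
# leaves the trailing 0x00 glued to the start of the next piece.
utf16_pieces = text.encode("utf-16-le").split(b"\n")


def decodes_cleanly(piece: bytes) -> bool:
    """Return True if the piece is well-formed UTF-16LE on its own."""
    try:
        piece.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_pieces)
print(utf16_pieces)
print([decodes_cleanly(p) for p in utf16_pieces])
```

Only the first utf-16 piece survives; everything after the first newline is shifted by one byte.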


   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest interface to (Sub)version control
    /_/   \_\     http://tortoisesvn.net
Received on 2015-07-17 19:00:35 CEST

This is an archived mail posted to the TortoiseSVN Users mailing list.