
Re: CString to CStringA conversion and other questions :)

From: Stefan Küng <tortoisesvn_at_gmail.com>
Date: Mon, 20 Aug 2012 18:44:48 +0200

On 19.08.2012 21:15, Oto BREZINA wrote:
> Thanks for prompt answers ...
>
> On 2012-08-19 20:33, Stefan Küng wrote:
>> On 19.08.2012 16:19, Oto BREZINA wrote:
>>> Finally I have some spare time so I start with some work on T-Merge
>>>
>>> 1.
>>> Do you know if "CStringA sLine = CStringA(sLineT)" is internally using
>>> *WideCharToMultiByte*
>>> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx>?
>>> What is for "CStringA(sLineT)" conversion here?
>> yes, it does, via some c-runtime functions.
>> But this conversion is utf16 to ansi, not utf8.
>>
>> why do you ask?
> I just want to be sure that Load and Save are a matched pair, so that a
> loaded and then saved file has the same content. I'll try some test
> cases to make sure.
> For load is used:
> MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, (LPCSTR)pFileBuf,
> dwReadBytes, pWideBuf, ret)
> And for save:
> CStringA sLine = sLineT;
> They do not look the same. ...
> Which one is better? I guess with some persistent buffer (as is used for
> save in UTF16BE) WideCharToMultiByte should be better/faster ...

We use the CStringA conversion when saving because we save line by line.
We use MultiByteToWideChar when loading because we convert the whole file
content in one go (using CString there would mean first creating a copy
of the content).

>>
>>> 2.
>>> I don't use STL much what is your preferred container for BYTE array,
>>> according some webs candidates are:
>>> CStringA - has count, operator [], copy on write, but may be
>>> misleading to use
>>> vector - have count and []
>>> unique_ptr - have [], but lack count
>> it depends on your use case.
>> for example, if you don't need [] access but only iteration, use a deque
>> instead of a vector - especially for big arrays.
> I need [] or iteration AND .get() or &[0] AND count. I'll check deque.
>>> 3.
>>> Throughout the code I have seen a lot of *ptr++ based algos (sometimes
>>> in a way that is unnatural for me); are those quicker than ones based
>>> on ptr[]?
>>> According to some quite old optimization guide ptr[] could be faster
>>> because of fewer increment instructions and simpler cache management,
>>> however that was in about '97.
>>> Do you have any real performance tests/data - I tried to run my own, but
>>> I was unable to start performance tests as I'm not admin ... will try
>>> again later.
>> ptr incrementations can be faster than index based access, usually when
>> using std containers.
> What about char * and wchar_t * ?
> like in CheckUnicodeType, of Load "fill in the lines into the array"
> part. Data can be quite big.
> Most important part is access instruction *ptr vs ptr[i] and number of
> cache miss. But easiest way is to check that for big data.

I'm wondering here: why do you want to change that part of the code?
Does it not work? Is it too slow?
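
For reference, the two loop styles being compared look roughly like this. It is only a minimal sketch: the function names are hypothetical and nothing here is taken from the TMerge sources. Modern compilers often generate very similar code for both forms, so any difference would have to be measured on real data.

```cpp
#include <cassert>
#include <cstddef>

// Count '\n' bytes by advancing a pointer (the *ptr++ style).
std::size_t CountNewlinesPtr(const char* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::size_t n = 0;
    for (const char* end = p + len; p != end; ++p)
    {
        if (*p == '\n')
            ++n;
    }
    return n;
}

// Count '\n' bytes by indexing from a fixed base pointer (the ptr[i] style).
std::size_t CountNewlinesIndex(const char* p, std::size_t len)
{
    assert(p != nullptr || len == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i)
    {
        if (p[i] == '\n')
            ++n;
    }
    return n;
}
```

Both functions return the same count for the same buffer, so they can be swapped freely in a benchmark.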

>>> 4.
>>> I would like to write filters classes for ASCII, UTF8, UTF16BE, UTF16LE
>>> and add UTF32s reads/writes. Do you have/know any preferred
>>> interface/template for that job?
>> why? such filter classes would be good for streams, but we don't use
>> streams in TMerge but load the files completely in one go.
> If you check my last few commits there are four save encodings, which
> share same pattern/algo, but differ in one "line" encoding itself.
> First I thought about functor (?), but classes seems be more readable.
>
> Most important part here will be buffer (question 2) which will be part
> of object. This will allow reduce allocations.

>>> 5.
>>> Have you any specific reason to not support UTF32, or just too small use
>>> cases.
>> Is there even a tool/app/whatever that writes such files?
>> I've never seen such a file myself.
>> So why implement something that won't be used?
> Quite agree. If this is only reason I would implemented that. I guess
> this format is really rare. And if used then on Linux. But it makes me
> feel that the application is unfinished whenever I read that UTF32 is
> treated as binary ...
> To add support for UTF32 is write load and write filter (x2 BE, LE) ...
> should be easy ...
> Can be good new feature for 1.8.

Not really: the svn diff lib doesn't support even UTF-16, but it doesn't
break on such files either, because it skips over the null bytes.
UTF-32 wouldn't work, though: too many null bytes in normal text.
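
To make the null-byte argument concrete, here is a minimal sketch that encodes ASCII-only text as UTF-16LE or UTF-32LE and counts the zero bytes. The helper names are hypothetical, the encoder is only valid for ASCII input, and it says nothing about how the svn diff library actually handles such bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode ASCII-only text as little-endian code units of unitSize bytes
// (2 for UTF-16LE, 4 for UTF-32LE). Not valid for non-ASCII input.
std::vector<unsigned char> EncodeAsciiLE(const std::string& text, std::size_t unitSize)
{
    assert(unitSize >= 1);
    std::vector<unsigned char> out;
    out.reserve(text.size() * unitSize);
    for (unsigned char ch : text)
    {
        out.push_back(ch);                       // low byte carries the character
        for (std::size_t i = 1; i < unitSize; ++i)
            out.push_back(0);                    // the remaining bytes are zero
    }
    return out;
}

// Count the zero bytes in an encoded buffer.
std::size_t CountNulBytes(const std::vector<unsigned char>& bytes)
{
    std::size_t n = 0;
    for (unsigned char b : bytes)
    {
        if (b == 0)
            ++n;
    }
    return n;
}
```

For ASCII text this gives one zero byte per character in UTF-16 and three per character in UTF-32, so three quarters of a typical UTF-32 file would be null bytes.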

>>> Have you any reason to not support other EOLs? According
>>> http://en.wikipedia.org/wiki/Newline only NEL seems be questionable.
>> actually, yes: the svn diff library doesn't support them, so supporting
>> them in TMerge makes no sense. We could split the lines there, but the
>> diffing engine would treat those as one line and so the diff would be
>> shown wrong.
> Ok, sounds reasonable.
>
> There is as much use for those EOLs as for UTF32, so no big deal, but:
> If I get that correctly, diff is made on temp files in UTF8 format. In
> case other EOLs is used we can convert them to EOL_AUTOLINE, and make diff.

That would work for *showing* the diff. But when saving edited content,
you would save the converted EOLs.
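
A minimal sketch of why such a conversion is lossy: once the unusual terminators are replaced, different inputs become indistinguishable, so saving cannot restore the original bytes. The function name is hypothetical, and it handles only VT, FF and U+2028 in UTF-8 text as examples; it is not what TMerge does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace a few "other" line terminators in UTF-8 text with LF.
// Handled here: VT (0x0B), FF (0x0C) and U+2028 LINE SEPARATOR (E2 80 A8).
std::string NormalizeOtherEols(const std::string& utf8)
{
    static const std::string kLineSeparator = "\xE2\x80\xA8";
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size();)
    {
        if (utf8.compare(i, kLineSeparator.size(), kLineSeparator) == 0)
        {
            out += '\n';                 // three-byte terminator becomes one LF
            i += kLineSeparator.size();
        }
        else if (utf8[i] == '\v' || utf8[i] == '\f')
        {
            out += '\n';                 // single-byte terminator becomes LF
            ++i;
        }
        else
        {
            out += utf8[i];              // everything else is copied unchanged
            ++i;
        }
    }
    assert(out.size() <= utf8.size());   // the output never grows
    return out;
}
```

After this step a line that ended in FF and one that ended in VT look identical, so the original terminator would have to be remembered separately per line if it should survive a save.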

> This lead me to other question:
> 6.
> When saving to enforced UTF8, for all but UTF8BOM the BOM was not saved -
> was this intentional? The BOM can be missing for all UTF8 enforced files,
> correct? This was implemented in r23192.

There's an option in the settings to save files as UTF-8 even if they're
detected as ANSI.
You're now saving those with a BOM, which isn't what we did before. In
that case you should write the file without a BOM (always without a BOM
if possible; we only write a BOM if the file had one when it was loaded).
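
The rule above can be sketched as follows. This is only an illustration under assumptions: the names are hypothetical, the text is assumed to be already converted to UTF-8 with any BOM stripped, and none of it is taken from CFileTextLines.

```cpp
#include <cassert>
#include <string>

// The three bytes of the UTF-8 byte order mark.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if the raw file bytes start with a UTF-8 BOM.
bool HasUtf8Bom(const std::string& bytes)
{
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Build the bytes to write. The BOM is emitted only if the file had one
// when it was loaded; utf8Text must not already start with a BOM.
std::string BuildUtf8Output(const std::string& utf8Text, bool hadBomOnLoad)
{
    assert(!HasUtf8Bom(utf8Text));
    return hadBomOnLoad ? kUtf8Bom + utf8Text : utf8Text;
}
```

A caller would record the result of HasUtf8Bom at load time and pass it back in at save time, so a file detected as ANSI and saved as UTF-8 never gains a BOM.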

>
> 7.
> If I read the code correctly, in CFileTextLines::CheckLineEndings you
> detect EOL_LFCR, but in CFileTextLines::Load this one is decoded as
> EOL_LF and EOL_CR or EOL_CRLF.
> Is this intentional?
> I guess EOL_LFCR is too rare to be really wanted, but why detect it
> in Check then? This makes all EOL_AUTOLINE EOL_LFCR.

I don't understand what you mean here.
In Load(), the line endings are checked by calling CheckLineEndings(),
there's no separate detection.

Stefan

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net
------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999662
Received on 2012-08-20 18:45:29 CEST

This is an archived mail posted to the TortoiseSVN Dev mailing list.
