Thanks for the prompt answers ...
On 2012-08-19 20:33, Stefan Küng wrote:
> On 19.08.2012 16:19, Oto BREZINA wrote:
>> Finally I have some spare time, so I'm starting some work on T-Merge.
>> Do you know if "CStringA sLine = CStringA(sLineT)" does a conversion
>> internally? What conversion is "CStringA(sLineT)" doing here?
> yes, it does, via some c-runtime functions.
> But this conversion is utf16 to ansi, not utf8.
> why do you ask?
I just want to be sure Load and Save are a matched pair, so that a loaded
and re-saved file has the same content. I'll try some test cases to make sure.
For load this is used:
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, (LPCSTR)pFileBuf,
dwReadBytes, pWideBuf, ret)
And for save:
CStringA sLine = sLineT;
These do not look symmetric. ...
Which one is better? I guess WideCharToMultiByte with a persistent buffer
(as is done for saving UTF16BE) should be better/faster ...
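To illustrate the symmetry Load and Save need, here is a minimal, portable round-trip sketch (a hand-rolled UTF-8 encoder/decoder, not the actual MultiByteToWideChar/WideCharToMultiByte calls): whatever the save side writes, the load side must map back to the identical string.

```cpp
#include <cassert>
#include <string>

// Minimal UTF-8 encoder/decoder, sketch only: no validation, assumes
// well-formed input. The point is the invariant, not the implementation:
// decode(encode(s)) == s must hold for the real Load/Save pair too.
std::string utf8_encode(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if (c < 0x80) out += static_cast<char>(c);
        else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

std::u32string utf8_decode(const std::string& in) {
    std::u32string out;
    for (size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t c;
        int len;
        if (b < 0x80)             { c = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; len = 3; }
        else                         { c = b & 0x07; len = 4; }
        for (int k = 1; k < len; ++k)
            c = (c << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out += c;
        i += len;
    }
    return out;
}
```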
>> I don't use STL much; what is your preferred container for a BYTE
>> array? According to some websites, the candidates are:
>> CStringA - has a count, operator[], copy-on-write, but its use may be
>> misleading
>> vector - has a count and operator[]
>> unique_ptr - has operator[], but lacks a count
> it depends on your use case.
> for example, if you don't need [] access but only iteration, use a deque
> instead of a vector - especially for big arrays.
I need [] or iteration, AND .get() or &, AND a count. I'll check deque.
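As a hypothetical illustration, std::vector<unsigned char> seems to cover all three requirements at once: [] access, a raw pointer via .data() for C-style APIs, and a count via .size().

```cpp
#include <cstring>
#include <vector>

// Sketch: build a BYTE buffer from raw input. vector gives indexed access,
// a contiguous raw pointer, and the element count, unlike unique_ptr
// (no count) or CStringA (string semantics that can mislead).
std::vector<unsigned char> make_buffer(const char* src) {
    return std::vector<unsigned char>(src, src + std::strlen(src));
}
```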
>> Throughout the code I have seen a lot of *ptr++ based algorithms
>> (sometimes written in a way that feels unnatural to me); are those
>> quicker than ptr[i] based ones?
>> According to some quite old optimization guides, ptr[i] could be faster
>> because of fewer increment instructions and simpler cache management,
>> but that was around '97.
>> Do you have any real performance tests/data? I tried to run my own, but
>> I was unable to start the performance tests as I'm not admin ... will
>> try again later.
> ptr incrementations can be faster than index based access, usually when
> using std containers.
What about char * and wchar_t *? Like in CheckUnicodeType, or the "fill
in the lines into the array" part of Load. The data can be quite big.
The most important part is the access instruction, *ptr vs ptr[i], and
the number of cache misses. But the easiest way is to check that on big
data.
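A small portable sketch of the two access styles in question; both scan the same buffer and must give the same result, so only profiling on real data can separate them.

```cpp
#include <cstddef>

// Count '\n' bytes two ways: pointer-bump style (as seen in the TMerge
// code) and index style. Modern compilers typically emit near-identical
// code for both; the comparison here is about style, not a benchmark.
size_t count_lf_ptr(const char* p, size_t n) {
    size_t count = 0;
    const char* end = p + n;
    while (p < end)
        if (*p++ == '\n') ++count;   // *ptr++ form
    return count;
}

size_t count_lf_idx(const char* p, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (p[i] == '\n') ++count;   // ptr[i] form
    return count;
}
```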
>> I would like to write filters classes for ASCII, UTF8, UTF16BE, UTF16LE
>> and add UTF32s reads/writes. Do you have/know any preferred
>> interface/template for that job?
> why? such filter classes would be good for streams, but we don't use
> streams in TMerge but load the files completely in one go.
If you check my last few commits, there are four save encodings which
share the same pattern/algorithm but differ only in the encoding of a
single "line".
At first I thought about a functor, but classes seem more readable.
The most important part here will be the buffer (question 2 above), which
will be part of the object. This allows reducing allocations.
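Here is a rough, hypothetical sketch of what I mean (names are illustrative, not the actual TMerge classes): the shared save loop lives in a base class, each encoding overrides only the per-line conversion, and the scratch buffer is a member so it is allocated once per file rather than once per line.

```cpp
#include <string>
#include <vector>

// Illustrative base for the per-encoding save filters; not TMerge code.
class CBaseFilter {
public:
    virtual ~CBaseFilter() {}
    // Append one encoded line to 'out'; m_buf is reused across calls.
    virtual void EncodeLine(const std::wstring& line, std::string& out) = 0;
protected:
    std::vector<char> m_buf;   // persistent conversion buffer
};

class CAsciiFilter : public CBaseFilter {
public:
    void EncodeLine(const std::wstring& line, std::string& out) override {
        m_buf.clear();
        for (wchar_t c : line)   // naive narrowing, sketch only
            m_buf.push_back(c < 0x80 ? static_cast<char>(c) : '?');
        out.append(m_buf.begin(), m_buf.end());
    }
};
```

A UTF8, UTF16LE, or UTF16BE filter would each override EncodeLine the same way, while the save loop stays shared.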
>> Do you have any specific reason not to support UTF32, or is the use
>> case just too small?
> Is there even a tool/app/whatever that writes such files?
> I've never seen such a file myself.
> So why implement something that won't be used?
I quite agree. If that is the only reason, I would still implement it. I
guess this format is really rare, and if it is used at all, then on
Linux. But it makes me feel that the application is unfinished whenever I
read that UTF32 is treated as binary ...
Adding support for UTF32 means writing a load and a write filter (x2 for
BE, LE) ... should be easy ...
It could be a good new feature for 1.8.
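For illustration, the write side of such a UTF32 filter could look roughly like this (a standalone sketch, not TMerge code): one code point becomes four bytes, and BE/LE differ only in byte order.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a UTF-32 encoder; the matching load filter just reverses it.
std::vector<unsigned char> encode_utf32(const std::u32string& in,
                                        bool bigEndian) {
    std::vector<unsigned char> out;
    for (char32_t c : in) {
        uint32_t v = c;
        unsigned char b[4] = {
            static_cast<unsigned char>(v & 0xFF),          // little end
            static_cast<unsigned char>((v >> 8) & 0xFF),
            static_cast<unsigned char>((v >> 16) & 0xFF),
            static_cast<unsigned char>((v >> 24) & 0xFF) }; // big end
        if (bigEndian)
            out.insert(out.end(), {b[3], b[2], b[1], b[0]});
        else
            out.insert(out.end(), b, b + 4);
    }
    return out;
}
```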
>> Do you have any reason not to support other EOLs? According to
>> http://en.wikipedia.org/wiki/Newline only NEL seems questionable.
> actually, yes: the svn diff library doesn't support them, so supporting
> them in TMerge makes no sense. We could split the lines there, but the
> diffing engine would treat those as one line and so the diff would be
> shown wrong.
Ok, sounds reasonable.
There is about as much use for those EOLs as for UTF32, so no big deal, but:
If I understand correctly, the diff is made on temp files in UTF8 format.
In case other EOLs are used, we could convert them to EOL_AUTOLINE and
make the diff.
This leads me to another question:
When saving with enforced UTF8, the BOM was not saved for anything but
UTF8BOM - was this intentional? That means the BOM can be missing for all
UTF8-enforced files, correct? This was implemented in r23192.
If I read the code correctly, CFileTextLines::CheckLineEndings detects
EOL_LFCR, but in CFileTextLines::Load this one is decoded as EOL_LF plus
EOL_CR, or as EOL_CRLF.
Is this intentional?
I guess EOL_LFCR is too rare to really be wanted, but then why detect it
in Check at all? This makes all EOL_AUTOLINE files EOL_LFCR.
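To make the ambiguity concrete, here is a toy scanner (illustrative only, not the TortoiseSVN code) that classifies the first line ending and, unlike Load as described above, reports LF immediately followed by CR as a single EOL_LFCR.

```cpp
#include <string>

// Classify the first line ending in a buffer. The LFCR case is the one
// CheckLineEndings detects but Load splits into two endings.
enum EOL { EOL_LF, EOL_CR, EOL_CRLF, EOL_LFCR, EOL_NONE };

EOL first_eol(const std::string& s) {
    for (size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '\r')
            return (i + 1 < s.size() && s[i + 1] == '\n') ? EOL_CRLF
                                                          : EOL_CR;
        if (s[i] == '\n')
            return (i + 1 < s.size() && s[i + 1] == '\r') ? EOL_LFCR
                                                          : EOL_LF;
    }
    return EOL_NONE;
}
```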
Oto ot(ik) BREZINA - 오토
Received on 2012-08-19 21:16:10 CEST