[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Re: CString to CStringA conversion and other questions :)

From: Stefan Küng <tortoisesvn_at_gmail.com>
Date: Mon, 20 Aug 2012 21:03:43 +0200

On 20.08.2012 20:40, Oto BREZINA wrote:

> I'm NOT about to rewrite CheckUnicodeType, just was wonder when I read
> it if *ptr++ is faster.
> In fact I was little bit about - I wrote simple UTF8 validator some time
> ago, and seeing your implementation with lot of nested ifs, ++ etc, I
> was wondering what are cons and pros to compare with my - more state
> based implementation.
> Everything stopped on Performance Analysis tool.
>
> In CheckUnicodeType you get thru array twice just to verify if it is UTF8.
> And it can be enhanced to UTF16BE/LE detection - statisticaly based, so
> not 100% accurate, but it would cost some processing. For now you need
> BOM for UTF16s.
>
> It started with simple task:
> Last line of file should have no newline on it. In current
> implementation you keep information in attribute m_bReturnAtEnd. This
> makes some editing at file end quite hard. For example you can
> add/remove new line on last line using CTRL enter, but this is usually
> not applied on edited file. Keeping this attribute actual seems to be
> hard task with lot of possible bugs. But even I was able to remove this
> attribute and add last line (when needed) it does not apeared in views...
> It seems that you use line data from diff where is empty last line
> missing. I'll come back to this later.
>
> In that time I found that there is quite lot of code duplication in Load
> and Save and duplices are not same... Making code harder to read and
> maintain. Plus missing UTF32 encoding
>
> Other motivation was to something simple, before starting code editing
> in multiple views.

Looks like a good idea.
If you want to do some profiling:
have a look at the src\utils\profiling.h file. It's a simple time
measuring tool that helps finding out if a change has an effect on
performance or not.

Simply add
PROFILE_BLOCK;
somewhere inside a function, then make a release build, run the app and
that function, exit the app. A file will be written to the same folder
as the exe is where all the time measurements are shown.

to profile a simple line, use
PROFILE_LINE(call_function());

But I prefer the
PROFILE_BLOCK;

in case you need to only profile part of a function, create a 'block'
using brackets:

function(...)
{
   ...
   ...
   {
     PROFILE_BLOCK;
     // code to profile
     ...
   }
}

> So T-Merge do a diff directly on UTF16 files?
> Utf8 temp file is created only if encoding differ like UTF16 and UTF8,
> or never ?
> Can that be simply enforced for UTF32 files?
>
> From CDiffData::Load it seems that UTF16 files are saved as UTF8 in
> temp, for diff purposes, Am I right?

hmm, right. We could just always save those files as utf8 and then run
the svn diff lib functions on those. In that case, even utf32 files
could be supported by TMerge.

>> That would work for *showing* the diff. But when saving edited content,
>> you would save the converted EOLs.
> Of course one, we load(ed). We only need to get know to upper layer e.g
> CDiffData::Load, that diff needs enforce UTF8, while on enforced UTF8
> save all non standard EOLS can be converted to AUTO. We'll lose little
> bit of diff this way through.
>
> Making temp UTF8 needed in UTF16,32 Exotic EOLS, different Encodings
> ASCII and UTF8 ...

not much of a problem I think. Unless of course the files are huge, then
converting them first to utf8 will take a few seconds. But I guess
that's better than not working with such files at all.

>>> This lead me to other question:
>>> 6.
>>> When saving to enforced UTF8 for all but UTF8BOM BOM iwas not saved -
>>> was this intentional? Can be BOM missing for all UTF8 enforced files.
>>> Correct? This was implemented in r23192.
>> There's an option in the settings to save files as utf8 even if they're
>> detected as ANSI.
>> You're now saving those with a BOM, which isn't what we did before. In
>> that case you should write the file without a BOM (always without BOM if
>> possible, only if the file had one when loading, then we write the BOM too).
> r23193 is like: ((!bSaveAsUTF8)&&(m_UnicodeType ==
> CFileTextLines::UTF8BOM))
> Should mean If NOT SaveAsUtf8 and ... then save BOM.

We don't need the BOM for the diff. We only want to save the BOM if it
was there when loading the file.

> bSaveAsUTF8 is only for user requests, or for diff purposes too?

for both.

>
>>> 7.
>>> If I read code correctly On CFileTextLines::CheckLineEndings you detect
>>> EOL_LFCR, but in CFileTextLines::Load this is one is decoded as EOL_LF
>>> and EOL_CR or EOL_CRLF.
>>> Is this intentional?
>>> I guess EOL_LFCR is too rare to be really wanted, but why to detect it
>>> in Check then. This makes all EOL_AUTOLINE EOL_LFCR.
>> I don't understand what you mean here.
>> In Load(), the line endings are checked by calling CheckLineEndings(),
>> there's no separate detection.
> CheckLineEndings can detect EOL_LFCR, Load not. Is this what you want?

Load() calls CheckLineEndings(), which can detect EOL_LFCR.
Or am I missing something here?

Stefan

-- 
        ___
   oo  // \\      "De Chelonian Mobile"
  (_,\/ \_/ \     TortoiseSVN
    \ \_/_\_/>    The coolest Interface to (Sub)Version Control
    /_/   \_\     http://tortoisesvn.net
------------------------------------------------------
http://tortoisesvn.tigris.org/ds/viewMessage.do?dsForumId=757&dsMessageId=2999683
To unsubscribe from this discussion, e-mail: [dev-unsubscribe_at_tortoisesvn.tigris.org].
Received on 2012-08-20 21:03:58 CEST

This is an archived mail posted to the TortoiseSVN Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.