How to walk along UTF-16 codepoints? - c++

I have the following definition of varying ranges which correspond to codepoints and surrogate pairs:
https://en.wikipedia.org/wiki/UTF-16#Description
My code is based on ConvertUTF.c from the Clang implementation.
I'm currently struggling with wrapping my head around how to do this.
The code which is most relevant from LLVM's implementation that I'm trying to understand is:
unsigned short bytesToWrite = 0;
const u32char_t byteMask = 0xBF;
const u32char_t byteMark = 0x80;
u8char_t* target = *targetStart;
utf_result result = kConversionOk;
const u16char_t* source = *sourceStart;
while (source < sourceEnd) {
u32char_t ch;
const u16char_t* oldSource = source; /* In case we have to back up because of target overflow. */
ch = *source++;
/* If we have a surrogate pair, convert to UTF32 first. */
if (ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END) {
/* If the 16 bits following the high surrogate are in the source buffer... */
if (source < sourceEnd) {
u32char_t ch2 = *source;
/* If it's a low surrogate, convert to UTF32. */
if (ch2 >= UNI_SUR_LOW_START && ch2 <= UNI_SUR_LOW_END) {
ch = ((ch - UNI_SUR_HIGH_START) << halfShift)
+ (ch2 - UNI_SUR_LOW_START) + halfBase;
++source;
} else if (flags == kStrictConversion) { /* it's an unpaired high surrogate */
--source; /* return to the illegal value itself */
result = kSourceIllegal;
break;
}
} else { /* We don't have the 16 bits following the high surrogate. */
--source; /* return to the high surrogate */
result = kSourceExhausted;
break;
}
} else if (flags == kStrictConversion) {
/* UTF-16 surrogate values are illegal in UTF-32 */
if (ch >= UNI_SUR_LOW_START && ch <= UNI_SUR_LOW_END) {
--source; /* return to the illegal value itself */
result = kSourceIllegal;
break;
}
}
...
Specifically they say in the comments:
If we have a surrogate pair, convert to UTF32 first.
and then:
If it's a low surrogate, convert to UTF32.
I get lost at the comments "if we have.." and "if it's..": my reaction while reading them is "what do we have?" and "what is it?"
I believe ch and ch2 are the first char16 and the next char16 (if one exists), checking whether the second is part of a surrogate pair, and then walking along each char16 (or do you walk along pairs of chars?) until the end.
I'm also lost on how they are using UNI_SUR_HIGH_START, UNI_SUR_HIGH_END, UNI_SUR_LOW_START, UNI_SUR_LOW_END, and their use of halfShift and halfBase.
Wikipedia also notes:
There was an attempt to rename "high" and "low" surrogates to "leading" and "trailing" due to their numerical values not matching their names. This appears to have been abandoned in recent Unicode standards.
Making note of "leading" and "trailing" in any responses may help clarify things as well.

ch >= UNI_SUR_HIGH_START && ch <= UNI_SUR_HIGH_END checks whether ch is in the range reserved for high (leading) surrogates, that is, [D800-DBFF]. That's it. The same is then done to check whether ch2 is in the range reserved for low (trailing) surrogates, meaning [DC00-DFFF].
halfShift and halfBase are used exactly as prescribed by the UTF-16 decoding algorithm, which turns a pair of surrogates back into the scalar value they represent: halfShift is 10, the number of payload bits each surrogate carries, and halfBase is 0x10000, the first code point that needs a surrogate pair at all. There's nothing special being done here; it's the textbook implementation of that algorithm, without any tricks.

Related

what does variable != 0xFF mean in C++?

I have the following if statement, whose condition checks an element of a data buffer holding the contents of a WAV file:
bool BFoundEnd = FALSE;
if (UCBuffer[ICount] != 0xFF){
BFoundEnd = TRUE;
break;
}
I was just confused about how 0xFF defines the condition inside the if statement.
what does variable != 0xFF mean in C++?
variable is presumably an identifier that names a variable.
!= is the inequality operator. It results in false when left and right hand operands are equal and true otherwise.
0xFF is an integer literal. The 0x prefix means that the literal uses hexadecimal system (base 16). The value is 255 in decimal system (base 10) and 1111'1111 in binary system (base 2). For more information about the base i.e. the radix of numeral systems, see wikipedia: Radix
Reread with comments:
// Remembers if buffer did end with 0xFF (255) or not.
bool BFoundEnd = FALSE;
// ... Later in loop.
// Actual check for above variable
// (where "if ... != ..." means if not equal).
if (UCBuffer[ICount] != 0xFF) {
BFoundEnd = TRUE;
// Cancels looping as reached buffer-end.
break;
}
// ... Outside the loop.
// Handles something based on variable.
In short, someone decided to make 255 a special value which marks the end of the buffer and/or array (instead of providing array length).

Efficiency & Readabilty of a C++ For Loop that Compares two C-style Strings

So I've created my own function to compare two C Strings:
bool list::compareString(const char array1[], const char array2[])
{
unsigned char count;
for (count = 0; array1[count] != '\0' && array2[count] != '\0' && (array1[count] == array2[count] || array1[count + 32] == array2[count] || array1[count] == array2[count+32]); count++);
if (array1[count] == '\0' && array2[count] == '\0')
return true;
else
return false;
}
The condition of my for loop is very long because it advances count to the end of at least one of the strings, and compares each char in each array in such a way that their case won't matter (adding 32 to an uppercase char turns it into its lowercase counterpart).
Now, I'm guessing that this is the most efficient way to go about comparing two C strings, but that for loop is hard to read because of its length. What I've been told is to use a for loop instead of a while loop whenever possible, because a for loop has the initialization, condition, and increment all in its header, but here that seems like it may not apply.
What I'm asking is, how should I format this loop, and is there a more efficient way to do it?
Instead of indexing into the arrays with count, which you don't know the size of, you can instead operate directly on the pointers:
bool list::compareString(const char* array1, const char* array2)
{
while (*array1 != '\0' || *array2 != '\0')
if (*array1++ != *array2++) return false; // not the same character
return true;
}
For case insensitive comparison, replace the if condition with:
if (tolower((unsigned char)*array1++) != tolower((unsigned char)*array2++)) return false;
The cast to unsigned char keeps tolower well-defined even when plain char is signed and the string contains characters outside the ASCII range.
The while loop checks whether the strings are terminated. It continues while at least one of the strings is not yet terminated. If only one string has terminated, the next line (the if statement) will notice that the characters don't match, since only one of them is '\0', and return false.
If the strings differ at any point, the if statement returns false.
The if statement also post-increments the pointers so that it tests the next character in the next iteration of the while loop.
If both strings are equal, and terminate at the same time, at some point, the while condition will become false. In this case, the return true statement will execute.
If you want to write the tolower function yourself, you need to check that the character is a capital letter, and not a different type of character (e.g. a number or symbol).
This would be:
inline char tolower(char ch)
{
return (ch >= 'A' && ch <= 'Z' ? (ch + 'a' - 'A') : ch);
}
I guess you are trying to do a case-insensitive comparison here. If you just need the fastest version, use a library function: strcasecmp or stricmp or strcmpi (name depends on your platform).
If you need to understand how to do it (I mean, is your question for learning purpose?), start with a readable version, something like this:
for (index = 0; ; ++index)
{
if (array1[index] == '\0' && array2[index] == '\0')
return true; // end of string reached
if (tolower(array1[index]) != tolower(array2[index]))
return false; // different characters discovered
}
Then measure its performance. If it's good enough, done. If not, investigate why (by looking at the machine code generated by the compiler). The first step in optimization might be replacing the tolower library function by a hand-crafted piece of code (which disregards non-English characters - is it what you want to do?):
int tolower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    return c; /* pass everything else through unchanged */
}
Note that I am still keeping the code readable. Readable code can be fast, because the compiler is going to optimize it.
array1[count + 32] == array2[count]
can read past the end of the array (undefined behavior; raw arrays in C++ don't throw an out-of-range exception) when fewer than 32 characters remain.
You can use strcmp for comparing two strings
You have a few problems with your code.
What I'd do here is move some of your logic into the body of the for loop. Cramming everything into the for loop expression massively reduces readability without giving you any performance boosts that I can think of. The code just ends up being messy. Keep the conditions of the loop to testing incrementation and put the actual task in the body.
I'd also point out that you're not adding 32 to the character at all. You're adding it to the index into the array, putting you at risk of running out of bounds. You need to adjust the value at the index, not the index itself.
Using an unsigned char to index an array gives you no benefits and only serves to reduce the maximum length of the strings that you can compare. Use an int.
You could restructure the code so that it looks like this:
bool list::compareString(const char array1[], const char array2[])
{
// Iterate over the strings until we find the string termination character
for (int count = 0; array1[count] != '\0' && array2[count] != '\0'; count++) {
// Note 0x20 is hexadecimal 32. We're comparing two letters for
// equality in a case insensitive way.
if ( (array1[count] | 0x20) != (array2[count] | 0x20) ) {
// Return false if the letters aren't equal
return false;
}
}
// We made it to the end of the loop. Strings are equal.
return true;
}
As for efficiency, it looks to me like you were trying to reduce:
The size of the variables that you're using to store data in
memory
The number of individual lines of code in your solution
Neither of these are worth your time. Efficiency is about how many steps (not lines of code, mind you) it will take to perform a task and how those steps scale as the inputs get bigger. For instance, how much slower would it be to compare the content of two novels for equality than two single word strings?
I hope that helps :)

C++ function convertCtoD

I'm new to C++. As part of an assignment we have to write two functions, but I don't know what the teacher means by what he is requesting. Has anyone seen this before, or can you at least point me in the right direction? I don't want you to write the functions; I just don't know what the output should be or what he is asking. I'm actually clueless right now.
Thank you
convertCtoD( )
This function is sent a null terminated character array
where each character represents a Decimal (base 10) digit.
The function returns an integer which is the base 10 representation of the characters.
convertBtoD( )
This function is sent a null terminated character array
where each character represents a Binary (base 2) digit.
The function returns an integer which is the base 10 representation of the character.
This function is sent a null terminated character array where each character represents a Decimal (base 10) digit. The function returns an integer which is the base 10 representation of the characters.
I'll briefly mention that "an integer which is the base 10 representation of the characters" is a misnomer here: the integer holds the value, whereas "base 10 representation" describes how that value is presented.
However, the description given simply means you take in a (C-style) string of digits and put out an integer. So you would start with:
int convertCtoD(char *decimalString) {
int retVal = 0;
// TBD: needs actual implementation.
return retVal;
}
This function is sent a null terminated character array where each character represents a Binary (base 2) digit. The function returns an integer which is the base 10 representation of the character.
This will be very similar:
int convertBtoD(char *binaryString) {
int retVal = 0;
// TBD: needs actual implementation.
return retVal;
}
You'll notice I've left the return type as signed even though there's no need to handle signed values at all. You'll see why in the example implementation I provide below as I'm using it to return an error condition. The reason I'm providing code even though you didn't ask for it is that I think five-odd years is enough of a gap to ensure you can't cheat by passing off my code as your own :-)
Perhaps the simplest example would be:
int convertCToD(char *str) {
// Initialise accumulator to zero.
int retVal = 0;
// Process each character.
while (*str != '\0') {
// Check character for validity, add to accumulator (after
// converting char to int) then go to next character.
if ((*str < '0') || (*str > '9')) return -1;
retVal *= 10;
retVal += *str++ - '0';
}
return retVal;
}
The binary version would basically be identical except that it would use '1' as the upper limit and 2 as the multiplier (as opposed to '9' and 10).
That's the simplest form but there's plenty of scope for improvement to make your code more robust and readable:
Since the two functions are very similar, you could refactor out the common bits so as to reduce duplication.
You may want to consider an empty string as invalid rather than just returning zero as it currently does.
You probably want to detect overflow as an error.
With those in mind, it may be that the following is a more robust solution:
#include <stdbool.h>
#include <limits.h>
int convertBorCtoD(char *str, bool isBinary) {
// Configure stuff that depends on binary/decimal choice.
int maxDigit = isBinary ? '1' : '9';
int minDigit = '0';
int multiplier = maxDigit - minDigit + 1;
// Initialise accumulator to zero.
int retVal = 0;
// Optional check for empty string as error.
if (*str == '\0') return -1;
// Process each character.
while (*str != '\0') {
// Check character for validity.
if ((*str < '0') || (*str > maxDigit)) return -1;
// Add to accumulator, checking for overflow.
if (INT_MAX / multiplier < retVal) return -1;
retVal *= multiplier;
if (INT_MAX - (*str - '0') < retVal) return -1;
retVal += *str++ - '0';
}
return retVal;
}
int convertCtoD(char *str) { return convertBorCtoD(str, false); }
int convertBtoD(char *str) { return convertBorCtoD(str, true); }

Problems parsing a Microsoft compound document

I'm having a bit of a struggle wrestling with the compound document format.
I'm working in C at the moment but am having problems with locating the directory sector.
I can obtain the compound doc header which is trivial and I know the formula for finding a file offset of a sector id (secid + 1 << sec_size), but whenever I use this formula to convert the secid to fileoffset for the directory I get random values.
Can someone help me understand how I resolve secid offsets properly and maybe also how to develop secid chains from the sector allocation table in a compound document?
Here is an example of what I've tried:
comp_doc_header* cdh((comp_doc_header*)buffer);
printf("cdoc header:%d\n", sizeof(cd_dir_entry));
if(cdh->rev_num == 0x003E)printf("rev match\n");
//check magic number
if(cdh->comp_doc_id[0] != (unsigned char)0xD0 ||
cdh->comp_doc_id[1] != (unsigned char)0xCF ||
cdh->comp_doc_id[2] != (unsigned char)0x11 ||
cdh->comp_doc_id[3] != (unsigned char)0xE0 ||
cdh->comp_doc_id[4] != (unsigned char)0xA1 ||
cdh->comp_doc_id[5] != (unsigned char)0xB1 ||
cdh->comp_doc_id[6] != (unsigned char)0x1A ||
cdh->comp_doc_id[7] != (unsigned char)0xE1)
return 0;
buffer += 512;
//here i try and get the first directory entry
cd_dir_entry* cde((cd_dir_entry*)&buffer[(cdh->first_sector_id + 1) << 512]);
EDIT: (secid + 1) * 512 should be (secid + 1) * sec_size
Is this C? I can't parse your first or last posted lines
cd_dir_entry* cde((cd_dir_entry*)&buffer[(cdh->first_sector_id + 1) << 512]);
It appears you're declaring cde as a function that returns a pointer to a cd_dir_entry; but the parameter prototype is all wrong ... so you're calling the function and multiplying the result by cd_dir_entry and promptly ignoring the result of the multiplication.
Edit
My simplification trying to understand the line
cd_dir_entry* cde(<cast>&buffer[(cdh->first_sector_id + 1) << 512]);
cd_dir_entry* cde(<cast>&buffer[<elem>]);
cd_dir_entry* cde(<parameter>);
/* this is either a function prototype */
/* or a multiplication with `cd_dir_entry` and the value returned from cde() */
/* in either case it does nothing (no side-effects present), */
/* unless cde messes with global variables */

Fastest way to do a case-insensitive substring search in C/C++?

Note
The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.
I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:
Should behave like strstr() (i.e. return a pointer to the match point).
Must be case-insensitive (doh).
Must support the current locale.
Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).
Here is the current implementation I am using (taken from the GNU C Library):
/* Return the offset of one string within another.
Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA. */
/*
* My personal strstr() implementation that beats most other algorithms.
* Until someone tells me otherwise, I assume that this is the
* fastest implementation of strstr() in C.
* I deliberately chose not to comment it. You should have at least
* as much fun trying to understand it, as I had to write it :-).
*
* Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */
/*
* Modified to use table lookup instead of tolower(), since tolower() isn't
* worth s*** on Windows.
*
* -- Anders Sandvig (anders@wincue.org)
*/
#if HAVE_CONFIG_H
# include <config.h>
#endif
#include <ctype.h>
#include <string.h>
typedef unsigned chartype;
char char_table[256];
void init_stristr(void)
{
int i;
char string[2];
string[1] = '\0';
for (i = 0; i < 256; i++)
{
string[0] = i;
_strlwr(string);
char_table[i] = string[0];
}
}
#define my_tolower(a) ((chartype) char_table[a])
char *
my_stristr (phaystack, pneedle)
const char *phaystack;
const char *pneedle;
{
register const unsigned char *haystack, *needle;
register chartype b, c;
haystack = (const unsigned char *) phaystack;
needle = (const unsigned char *) pneedle;
b = my_tolower (*needle);
if (b != '\0')
{
haystack--; /* possible ANSI violation */
do
{
c = *++haystack;
if (c == '\0')
goto ret0;
}
while (my_tolower (c) != (int) b);
c = my_tolower (*++needle);
if (c == '\0')
goto foundneedle;
++needle;
goto jin;
for (;;)
{
register chartype a;
register const unsigned char *rhaystack, *rneedle;
do
{
a = *++haystack;
if (a == '\0')
goto ret0;
if (my_tolower (a) == (int) b)
break;
a = *++haystack;
if (a == '\0')
goto ret0;
shloop:
;
}
while (my_tolower (a) != (int) b);
jin:
a = *++haystack;
if (a == '\0')
goto ret0;
if (my_tolower (a) != (int) c)
goto shloop;
rhaystack = haystack-- + 1;
rneedle = needle;
a = my_tolower (*rneedle);
if (my_tolower (*rhaystack) == (int) a)
do
{
if (a == '\0')
goto foundneedle;
++rhaystack;
a = my_tolower (*++needle);
if (my_tolower (*rhaystack) != (int) a)
break;
if (a == '\0')
goto foundneedle;
++rhaystack;
a = my_tolower (*++needle);
}
while (my_tolower (*rhaystack) == (int) a);
needle = rneedle; /* took the register-poor approach */
if (a == '\0')
break;
}
}
foundneedle:
return (char*) haystack;
ret0:
return 0;
}
Can you make this code faster, or do you know of a better implementation?
Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or if it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.
Update
Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.
Also, thanks for the tip about strcasestr() and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in hope for a better version being available, which apparently it is. :)
The code you posted is about half as fast as strcasestr.
$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp
The main function was:
int main(void)
{
char * needle="hello";
char haystack[1024];
int i;
for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
{
haystack[i]='A'+i%57;
}
memcpy(haystack+i,needle, strlen(needle)+1);
/*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
init_stristr();
for (i=0;i<1000000;++i)
{
/*my_stristr(haystack, needle);*/
strcasestr(haystack,needle);
}
return 0;
}
It was suitably modified to test both implementations. I notice as I am typing this up I left in the init_stristr call, but it shouldn't change things too much. bench is just a simple shell script:
#!/bin/bash
function bc_calc()
{
echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
echo -n "run $a... "
t=$($time $prog 2>&1| grep user | awk '{print $2}')
echo "time = $t"
accum=$(bc_calc "$accum+$t")
done
echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")
You can use the StrStrI function, which finds the first occurrence of a substring within a string. The comparison is not case-sensitive.
Don't forget to include its header - Shlwapi.h.
Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx
Use Boost String Algorithms. It is cross-platform and header-only (no library to link in). Not to mention that you should be using Boost anyway.
#include <boost/algorithm/string/find.hpp>
const char* istrstr( const char* haystack, const char* needle )
{
using namespace boost;
iterator_range<const char*> result = ifind_first( haystack, needle );
if( result ) return result.begin();
return NULL;
}
For platform independent use:
const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
if (s1 == NULL || s2 == NULL) return NULL;
const wchar_t *cpws1 = s1, *cpws1_, *cpws2;
wchar_t ch1, ch2; /* must be wchar_t, not char, or wide characters get truncated */
bool bSame;
while (*cpws1 != L'\0')
{
bSame = (*cpws1 == *s2);
if (!bSame)
{
ch1 = towlower(*cpws1);
ch2 = towlower(*s2);
bSame = (ch1 == ch2);
}
if (true == bSame)
{
cpws1_ = cpws1;
cpws2 = s2;
while (*cpws1_ != L'\0')
{
ch1 = towlower(*cpws1_);
ch2 = towlower(*cpws2);
if (ch1 != ch2)
break;
cpws2++;
if (*cpws2 == L'\0')
return cpws1_-(cpws2 - s2 - 0x01);
cpws1_++;
}
}
cpws1++;
}
return NULL;
}
Why do you use _strlwr(string); in init_stristr()? It's not a standard function. Presumably it's for locale support, but as it's not standard, I'd just use:
char_table[i] = tolower(i);
I'd advise you to take one of the common strcasestr implementations that already exist, for example from glib, glibc, OpenBSD, FreeBSD, etc. You can search for more with google.com/codesearch. You can then make some performance measurements and compare the different implementations.
Assuming both input strings are already lowercase.
int StringInStringFindFirst(const char* p_cText, const char* p_cSearchText)
{
int iTextSize = strlen(p_cText);
int iSearchTextSize = strlen(p_cSearchText);
if(iTextSize >= iSearchTextSize)
{
int iCounter = 0;
while((iCounter + iSearchTextSize) <= iTextSize)
{
if(memcmp( (p_cText + iCounter), p_cSearchText, iSearchTextSize) == 0)
return iCounter;
iCounter ++;
}
}
return -1;
}
You could also try using masks. If, for example, most of the strings you are going to compare contain only chars from a to z, it may be worth doing something like this.
long GetStringMask(const char* p_cText)
{
long lMask=0;
while(*p_cText != '\0')
{
if (*p_cText>='a' && *p_cText<='z')
lMask = lMask | (1 << (*p_cText - 'a') );
else if(*p_cText != ' ')
{
lMask = 0;
break;
}
p_cText ++;
}
return lMask;
}
Then...
int main(int argc, char* argv[])
{
char* p_cText = "this is a test";
char* p_cSearchText = "test";
long lTextMask = GetStringMask(p_cText);
long lSearchMask = GetStringMask(p_cSearchText);
int iFoundAt = -1;
// If Both masks are Valid
if(lTextMask != 0 && lSearchMask != 0)
{
if((lTextMask & lSearchMask) == lSearchMask)
{
iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
}
}
else
{
iFoundAt = StringInStringFindFirst(p_cText, p_cSearchText);
}
return 0;
}
This will not consider the locale, but if you can change IS_ALPHA and TO_UPPER you can make it do so.
#define IS_ALPHA(c) (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))
#define TO_UPPER(c) ((c) & 0xDF)
char * __cdecl strstri (const char * str1, const char * str2){
char *cp = (char *) str1;
char *s1, *s2;
if ( !*str2 )
return((char *)str1);
while (*cp){
s1 = cp;
s2 = (char *) str2;
while ( *s1 && *s2 && ((IS_ALPHA(*s1) && IS_ALPHA(*s2)) ? !(TO_UPPER(*s1) - TO_UPPER(*s2)) : !(*s1 - *s2)) ) /* extra parens: without them the ?: swallows the *s1 && *s2 test and the loop runs past both terminators */
++s1, ++s2;
if (!*s2)
return(cp);
++cp;
}
return(NULL);
}
If you want to shed CPU cycles, you might consider this - let's assume that we're dealing with ASCII and not Unicode.
Make a static table with 256 entries. Each entry in the table is 256 bits.
To test whether or not two characters are equal, you do something like this:
if (BitLookup(table[char1], char2)) { /* match */ }
To build the table, you set a bit everywhere in table[char1] where you consider it a match for char2. So in building the table you would set the bits at the index for 'a' and 'A' in the 'a'th entry (and the 'A'th entry).
Now this is going to be slowish because of the bit lookup (a bit lookup is most likely a shift, a mask and a test), so you could instead use a table of bytes, spending 8 bits to represent 1 bit. This will take 64K - so hooray - you've hit a time/space trade-off! We might want to make the table more flexible, so let's say we do this instead - the table will define congruences.
Two characters are considered congruent if and only if there is a function that defines them as equivalent. So 'A' and 'a' are congruent for case insensitivity. 'A', 'À', 'Á' and 'Â' are congruent for diacritical insensitivity.
So you define bitfields that correspond to your congruencies
#define kCongruentCase (1 << 0)
#define kCongruentDiacritical (1 << 1)
#define kCongruentVowel (1 << 2)
#define kCongruentConsonant (1 << 3)
Then your test is something like this:
inline bool CharsAreCongruent(char c1, char c2, unsigned char congruency)
{
return (_congruencyTable[(unsigned char)c1][(unsigned char)c2] & congruency) != 0; /* cast: plain char may be signed */
}
#define CaseInsensitiveCharEqual(c1, c2) CharsAreCongruent(c1, c2, kCongruentCase)
This kind of bit fiddling with ginormous tables is the heart of ctype, by the by.
If you can control the needle string so that it is always lower case, then you can write a modified version of stristr() that skips the lookups for the needle, and thus speed up the code. It isn't as general, but it can be slightly faster. Similar comments apply to the haystack, but you are more likely to be reading the haystack from sources outside your control, so you cannot be certain the data meets the requirement.
Whether the gain in performance is worth it is another question altogether. For 99% of applications, the answer is "No, it is not worth it". Your application might be one of the tiny minority where it matters. More likely, it is not.