In general, string search algorithms (like Boyer-Moore) are optimized for cases where the search string is long-ish. That is, Boyer-Moore is great because by lining up the search string with our text, we can skip N = len(search string) characters if the end of the search string doesn't match the text.
But what if our search string is really short? Like a single byte or character? In this case, Boyer-Moore doesn't help much.
So, what are some alternative algorithms for speeding up searching?
I know that many optimized library search routines (like memchr in C), take the strategy of reading the input string word by word, rather than char by char. So on a 64-bit machine, 8 bytes can be examined at once, rather than a single byte.
I'd like to know how these optimized string/byte searches actually work. How does the actual comparison work then? I know it obviously must involve bit masking - but I don't see how performing all the bit masking is any better than simply searching character by character.
So, suppose our search char is 0xFF. Ignoring alignment issues, let's say we have some input buffer: void* buf. We can read it word by word by saying:
const unsigned char search_char = 0xFF;
unsigned char* bufptr = static_cast<unsigned char*>(buf);
unsigned char* bufend = bufptr + BUF_SIZE;
while (bufptr != bufend)
{
    // Ignore alignment concerns for now, assume BUF_SIZE % sizeof(uintptr_t) == 0
    //
    std::uintptr_t next_word = *reinterpret_cast<std::uintptr_t*>(bufptr);
    // ... but how do we compare next_word with our search char?
    bufptr += sizeof(std::uintptr_t);
}
I also realize that the above code is not strictly portable, because std::uintptr_t isn't guaranteed to actually be word size. But let's assume for the sake of this question that std::uintptr_t is equal to the processor word size. (An actual implementation would likely need platform-specific macros to get the actual word size.)
So, how do we actually check if the byte 0xFF occurs anywhere in the value of next_word?
We can use OR operations of course, but it seems we'd still need to perform a lot of OR'ing and bit shifting to check each byte of next_word, at which point it becomes questionable whether this optimization is actually any better than simply scanning character by character.
You can use this snippet from Bit Twiddling Hacks:
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
#define hasvalue(x,n) \
(haszero((x) ^ (~0UL/255 * (n))))
It effectively XORs each byte with the character to be tested, then determines if any byte is now zero.
At this point you can get the location of the matching byte (or bytes) from the return value of the expression, e.g. the value will be 0x00000080 if the least significant byte matches the value.
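For example, here is a minimal sketch of how those macros can drive a word-at-a-time scan. It repeats the macros so it stands alone, uses uint32_t to match the 32-bit constants, assumes the buffer is suitably aligned and a multiple of 4 bytes long, and the name find_byte is just for illustration:

#include <cstdint>
#include <cstddef>

// Repeated from above (Bit Twiddling Hacks):
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
#define hasvalue(x,n) (haszero((x) ^ (~0UL/255 * (n))))

// Illustrative helper: return a pointer to the first occurrence of c in buf,
// or NULL if it is absent. Assumes buf is 4-byte aligned and len % 4 == 0.
const unsigned char* find_byte(const unsigned char* buf, std::size_t len, unsigned char c)
{
    const std::uint32_t* word = reinterpret_cast<const std::uint32_t*>(buf);
    const std::uint32_t* end = word + len / sizeof(std::uint32_t);
    for (; word != end; ++word)
    {
        if (hasvalue(*word, c)) // some byte of this word matches c
        {
            const unsigned char* b = reinterpret_cast<const unsigned char*>(word);
            for (std::size_t i = 0; i < sizeof(std::uint32_t); ++i)
                if (b[i] == c) // pin down which byte it was
                    return b + i;
        }
    }
    return NULL;
}

The payoff is that the byte-by-byte inner loop only runs for the rare word that actually contains a candidate match.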
I have a 4-byte signed integer, and (i) I want to reverse its byte order, and (ii) I want to store those 4 bytes as the bytes of a string. I am working in C++. To reverse the byte order into big-endian I was using ntohl, but I cannot use that because my numbers can also be negative.
Example:
int32_t a = -123456;
string s;
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a))); // int to byte
Further, when I append this data, it seems that I am appending 8 bytes instead of 4. Why?
I need to use the append (I cannot use memcpy or something else).
Do you have any suggestion?
I was using ntohl, but I cannot use that because my numbers can also be negative.
It's unclear why you think that negative numbers would be a problem. It's fine to convert negative numbers with ntohl.
s.append(reinterpret_cast<char*>(reinterpret_cast<void*>(&a)));
std::string::append(char*) requires that the argument points to a null terminated string. An integer is not null terminated (unless it happens to contain a byte that incidentally represents a null terminator character). As a result of violating this requirement, the behaviour of the program is undefined.
Do you have any suggestion?
To fix this bug, you can use the std::string::append(const char*, size_type) overload instead:
s.append(reinterpret_cast<char*>(&a), sizeof a);
reinterpret_cast<char*>(reinterpret_cast<void*>
The inner cast to void* is redundant. It makes no difference.
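Putting the two points together, here is a minimal sketch that reverses the byte order into big-endian and appends exactly 4 bytes. It assumes a POSIX header for htonl (Windows would use <winsock2.h> instead):

#include <arpa/inet.h>  // htonl; POSIX header
#include <cstdint>
#include <string>

int main()
{
    std::int32_t a = -123456;
    // Negative values are fine: converting to uint32_t just reinterprets the bits.
    std::uint32_t be = htonl(static_cast<std::uint32_t>(a));
    std::string s;
    s.append(reinterpret_cast<const char*>(&be), sizeof be);  // appends exactly 4 bytes
    return 0;
}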
So I read and was taught that subtracting '0' from my given character turns it into an int, but Visual Studio isn't accepting that here; it says a value of type "const char*" cannot be used to initialize an entity of type int.
bigint::bigint(const char* number) : bigint() {
    int number1 = number - '0'; // error code
    for (int i = 0; number1 != 0 ; ++i)
    {
        digits[i] = number1 % 10;
        number1 /= 10;
        digits[i] = number1;
    }
}
The goal of the first half is to simply turn the given number into an int. The second half is outputting that number backwards with no leading zeroes. Please note this function is a part of the class declared in a header file here:
class bigint {
public:
    static const int MAX_DIGITS = 50;
private:
    int digits[MAX_DIGITS];
public:
    // constructors
    bigint();
    bigint(int number);
    bigint(const char * number);
};
Is there any way to convert the char parameter to an int so I can then output an int, without using the standard library or strlen? I know there is a way to use the '0' char, but I can't seem to get it right.
You can turn a single character in the range '0'..'9' into a single digit 0..9 by subtracting '0', but you cannot turn a string of characters into a number by subtracting '0'. You need a parsing function like std::stoi() to do the conversion work character-by-character.
But that's not what you need here. If you convert the string to a number, you then have to take the number apart. The string is already in pieces, so:
bigint::bigint(const char* number) : bigint() {
    int i = 0;
    while (*number) // keep looping until we hit the string's null terminator
    {
        digits[i] = *number - '0'; // store the digit for the current character
        ++i;
        number++; // advance the string to the next character
    }
}
There could be some extra work involved in a more advanced version, such as sizing digits appropriately to fit the number of digits in number. Currently we have no way to know how many slots are actually in use in digits, and this will lead to problems later when the program has to figure out where to stop reading digits.
I don't know what your understanding is, so I will go over everything I see in the code snippet.
First, what you're passing to the function is a pointer to a char, with const keyword making the char immutable or "read only" if you prefer.
A char is actually an 8-bit-sized (1) integer. It can store a numerical value in binary form, which can also be interpreted as a character.
Fundamental types - cppreference.com
The standard also expects char to be a "type for character representation". That representation could be ASCII, but it could also be something else, like EBCDIC. For future reference, just remember that ASCII is not guaranteed, although you are unlikely ever to use a system where it isn't. It's not that char somehow enforces an encoding; rather, the functions you pass those chars and char pointers to interpret their contents as characters in ASCII, while on some obscure or legacy platforms they could interpret them as characters in some less common encoding. The standard does, however, demand that the encoding has this property: the codes for the characters '0' to '9' are consecutive, and thus '9' - '0' means: subtract the code of '0' from the code of '9'. The result is 9, because the code for '9' is 9 positions after the code for '0' in ASCII. The ranges 'a'-'z' and 'A'-'Z' have this quality in ASCII too (though, unlike the digits, the standard does not guarantee it for letters), in case you need that, but it's a little trickier if your input is in a base higher than 10, like the popular base 16 called hexadecimal.
A pointer stores an address, so the most basic functionality for it is to "point" to a variable. But it can be used in various ways, one of which, very frequent in C, is to store address of the beginning of an array of variables of the same type. Those could be chars. We could interpret such an array as a line of text, or a string (a concept, not to be confused with C++ specific string class).
Since a pointer does not contain information on the length or end of such an array, we need to get that information across to the function we pass the pointer to. Sometimes we can just provide the length, sometimes we provide the end pointer. When dealing with "lines of text" or C-style strings, we use (and the C standard library functions expect) what is called a null-terminated string. In such a string, the first char after the last one used for the line is a null, which is, to simplify, basically a 0. A 0, but not a '0'.
So what you're passing to the function, and what you interpret as, say, 416, is actually a pointer to a place in memory where '4' is encoded and stored as a number, followed by '1' and then '6', taking up three bytes. And depending on how you obtained this line of text, '6' is probably followed by a NULL, that is - a zero.
NULL - cppreference.com
Conversion of such a string to a number first requires a data type able to hold it. In the case of 416 it could be anything from short upwards. If you wanted to do that on your own, you would need to iterate over the entire line of text and add the digits multiplied by the proper powers of 10, take care of signedness, and maybe check for edge cases. You could however use a standard function like int atoi (const char * str);
atoi - cplusplus.com
Now, that would be nice of course, but you're trying to work with "bigints". However you define them, it means your class' purpose is to deal with numbers too big to be stored in built-in types. So there is no way you can convert them just like that.
What you're trying to do right now seems to be a constructor that creates a bigint out of a number represented as a C-style string. How shall I put it... you want to store your bigint internally as an array of its digits in base 10 (a good choice for code simplicity, readability and maintainability, as well as for interoperation with base-10 textual representation, though it doesn't make efficient use of memory and processing power), and your input is also an array of digits in base 10, except that internally you're storing numbers as numbers, while your input is encoded characters. You need to:
sanitize the input (you need criteria for what kind of input is acceptable, e.g. whether there can be leading or trailing whitespace, whether the number can be followed by non-numerical characters to be discarded, how signedness is represented, whether + is optional or forbidden for positive numbers, etc.) and throw an exception if the input is invalid;
convert whatever standard you enforce for your input into whatever uniform standard you employ internally, e.g. strip leading whitespace, remove the + sign if it's optional and you don't use it internally, etc.;
once you know which positions in your internal array correspond to which positions in the input string, iterate over it and copy every digit, decoding it from ASCII first (see the sketch after this list).
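A minimal sketch of those steps for the bigint class above, assuming the only accepted input is an optional leading sign followed by decimal digits (the exception types and the handling of the sign are illustrative choices, since the class in the question has no members for them):

#include <stdexcept>

bigint::bigint(const char* number) : bigint() {
    // Step 1: sanitize - handle an optional sign and reject empty input.
    bool negative = false;
    if (*number == '+' || *number == '-') {
        negative = (*number == '-');
        ++number;
    }
    if (*number == '\0')
        throw std::invalid_argument("empty number");

    // Steps 2 and 3: walk the characters and decode each ASCII digit.
    int i = 0;
    for (; *number != '\0'; ++number) {
        if (*number < '0' || *number > '9')
            throw std::invalid_argument("not a decimal digit");
        if (i >= MAX_DIGITS)
            throw std::length_error("too many digits for bigint");
        digits[i++] = *number - '0';
    }

    (void)negative; // a real class would also store the sign and the digit count
}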
A side note - I can't be sure exactly what you expect your input to be, because it is only likely that it's a textual representation - it could just as easily be an array of unencoded chars. Of course it's obviously the former, which I know from your post, but the function prototype (the line with the return type and argument types) does not assure anyone of that. Just another thing to be aware of.
Hope this answer helped you understand what is happening there.
PS. I cannot emphasize strongly enough that the biggest problem with your code is that even if this line worked:
int number1 = number - '0'; // error code
You'd be trying to store a number on the order of 10^50 into a variable capable of holding values on the order of 10^9.
The crucial part of this problem, which I have a vague feeling you may have found on spoj.com, is that you're handling BIGints: integers too big to be stored in a trivial manner.
(1) The standard does not actually require char to be this size directly, but indirectly it requires it to be at least 8 bits, possibly more on obscure platforms. And yes, I think there were some platforms where it was indeed more than 8 bits. The same goes for pointers, which may behave strangely on obscure architectures.
Misaligned pointers are scaring me. When parsing a string, a helpful technique I normally use is treating a group of chars as one unit.
So if I am comparing against a string where <!-- is 4 chars that mean "begin comment", I would do...
if( *(unsigned int*)string == tobe32('<!--') )
// This is the beginning of a comment possibly
As you can see, I handle the endianness problem. But will I still stumble upon the alignment problem? If I am at index 1 of the string, will casting it to an unsigned int pointer give me a 4-byte object on a 1-byte boundary?
Yes, it will almost certainly be a misaligned pointer(a); casting the address does not actually change the address, it just changes how it's treated when you dereference it.
However, that's not necessarily a problem. Some environments may actually raise a hardware exception if you do this (some early ARMs, from memory), some will run a little slower (some x86s), and no doubt some won't care at all. So it would depend on your underlying environment.
However, I'd really question the need for this trick, since the fact that you have to do an endian conversion means it may not be as efficient as you think.
My first inclination would be to just write an inline function that checks the four characters individually, time it, and only worry about optimisation if there's a real problem. That would be something like:
// Check first four characters match. Pre-condition is that both
// legacy-C-strings are at least four characters in length.
inline bool match4(const char *str, const char *match) {
    if (*str++ != *match++) return false;
    if (*str++ != *match++) return false;
    if (*str++ != *match++) return false;
    return *str == *match;
}
This would be my starting position rather than relying on possibly non-portable solutions using casting.
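For instance, a quick usage sketch (the wrapper name starts_comment is just illustrative):

// Illustrative caller: both arguments are at least four characters long,
// so match4's precondition holds.
bool starts_comment(const char* p) {
    return match4(p, "<!--");
}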
(a) If you want to know the alignment requirements of certain types, you can use the C++ alignof expression, such as alignof(int), assuming you have C++11 or better and, really, you should have :-)
The function std::isdigit is:
int isdigit(int ch);
The return value ("Non-zero value if the character is a numeric character, zero otherwise.") smells like the function was inherited from C, but even that does not explain why the parameter type is int rather than char, while at the same time...
The behavior is undefined if the value of ch is not representable as
unsigned char and is not equal to EOF.
Is there any technical reason why isdigit takes an int, not a char?
The reason is to allow EOF as input. And EOF is (from here):
EOF integer constant expression of type int and negative value
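In practice this means you should only ever pass EOF or a value that fits in unsigned char. A minimal wrapper sketch (the function name is just for illustration):

#include <cctype>

bool is_digit_char(char c)
{
    // Convert through unsigned char so negative char values don't trigger
    // the undefined behaviour described in the question.
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}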
The accepted answer is correct, but I believe the question deserves more detail.
A char in C++ is either signed or unsigned depending on your implementation (and, yet, it's a distinct type from signed char and unsigned char).
Where C grew up, char was typically unsigned and assumed to be an n-bit byte that could represent [0..2^n-1]. (Yes, there were some machines that had byte sizes other than 8 bits.) In fact, chars were considered virtually indistinguishable from bytes, which is why functions like memcpy take char * rather than something like uint8_t *, why sizeof(char) is always 1, and why CHAR_BIT isn't named BYTE_BIT.
But the C standard, which was the baseline for C++, only promised that char could hold any value in the execution character set. They might hold additional values, but there was no guarantee. The source character set (basically 7-bit ASCII minus some control characters) required something like 97 values. For a while, the execution character set could be smaller, but in practice it almost never was. Eventually there was an explicit requirement that a char be large enough to hold an 8-bit byte.
But the range was still uncertain. If unsigned, you could rely on [0..255]. Signed chars, however, could--in theory--use a sign+magnitude representation that would give you a range of [-127..127]. Note that's only 255 unique values, not 256 values ([-128..127]) like you'd get from two's complement. If you were language lawyerly enough, you could argue that you cannot store every possible value of an 8-bit byte in a char even though that was a fundamental assumption throughout the design of the language and its run-time library. I think C++ finally closed that apparent loophole in C++17 or C++20 by, in effect, requiring that a signed char use two's complement even if the larger integral types use sign+magnitude.
When it came time to design fundamental input/output functions, they had to think about how to return a value or a signal that you've reached the end of the file. It was decided to use a special value rather than an out-of-band signaling mechanism. But what value to use? The Unix folks generally had [128..255] available and others had [-128..-1].
But that's only if you're working with text. The Unix/C folks thought of textual characters and binary byte values as the same thing. So getc() was also for reading bytes from a binary file. All 256 possible values of a char, regardless of its signedness, were already claimed.
K&R C (before the first ANSI standard) didn't require function prototypes. The compiler made assumptions about parameter and return types. This is why C and C++ have the "default promotions," even though they're less important now than they once were. In effect, you couldn't return anything smaller than an int from a function. If you did, it would just be converted to int anyway.
The natural solution was therefore to have getc() return an int containing either the character value or a special end-of-file value, imaginatively dubbed EOF, a macro for -1.
The default promotions not only mandated a function couldn't return an integral type smaller than an int, they also made it difficult to pass in a small type. So int was also the natural parameter type for functions that expected a character. And thus we ended up with function signatures like int isdigit(int ch).
If you're a Posix fan, this is basically all you need.
For the rest of us, there's a remaining gotcha: If your chars are signed, then -1 might represent a legitimate character in your execution character set. How can you distinguish between them?
The answer is that functions don't really traffic in char values at all. They're really using unsigned char values dressed up as ints.
int x = getc(source_file);
if (x == EOF) { /* reached end of file */ }
else if (0 <= x && x < 128) { /* plain 7-bit character */ }
else if (128 <= x && x < 256) {
    // Here it gets interesting.
    bool b1 = isdigit(x); // OK
    bool b2 = isdigit(static_cast<char>(x)); // NOT PORTABLE
    bool b3 = isdigit(static_cast<unsigned char>(x)); // CORRECT!
}
I am working on translating a system from python to c++. I need to be able to perform actions in c++ that are generally performed by using Python's struct.unpack (interpreting binary strings as numerical values). For integer values, I am able to get this to (sort of) work, using the data types in stdint.h:
struct.unpack("i", str) ==> *(int32_t*) str; //str is a char* containing the data
This works properly for little-endian binary strings, but fails on big-endian binary strings. Basically, I need an equivalent to using the > tag in struct.unpack:
struct.unpack(">i", str) ==> ???
Please note, if there is a better way to do this, I am all ears. However, I cannot use c++11, nor any 3rd party libraries other than Boost. I will also need to be able to interpret floats and doubles, as in struct.unpack(">f", str) and struct.unpack(">d", str), but I'll get to that when I solve this.
NOTE I should point out that the endianness of my machine is irrelevant in this case. I know that the bitstream I receive in my code will ALWAYS be big-endian, and that's why I need a solution that will always cover the big-endian case. The article pointed out by BoBTFish in the comments seems to offer a solution.
For 32 and 16-bit values:
This is exactly the problem you have with network data, which is big-endian. You can use ntohl to turn a 32-bit value into host order (little-endian in your case).
The ntohl() function converts the unsigned integer netlong from network byte order to
host byte order.
int res = ntohl(*((int32_t *) str));
This will also take care of the case where your host is big-endian, in which case it won't do anything.
For 64-bit values
Non-standardly, on Linux/BSD you can take a look at 64 bit ntohl() in C++?, which points to htobe64.
These functions convert the byte encoding of integer values from the byte order that
the current CPU (the "host") uses, to and from little-endian and big-endian byte
order.
For windows try: How do I convert between big-endian and little-endian values in C++?
Which points to _byteswap_uint64 and as well as a 16 and 32-bit solution and a gcc-specific __builtin_bswap(32/64) call.
Other Sizes
Most systems don't have values that aren't 16/32/64 bits long. At that point I might try to store it in a 64-bit value, shift it and then translate. I'd write some good tests. I suspect it is an uncommon situation and more details would help.
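Since the question also mentions struct.unpack(">f") and struct.unpack(">d"), here is a sketch of one common approach: byte-swap into a suitably sized integer and memcpy the bits into the floating-point type. It assumes the sender uses IEEE 754 with a 4-byte float and an 8-byte double, and the function names are just illustrative:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   // ntohl (POSIX)

// Sketch: interpret 4 big-endian bytes as an IEEE 754 float, like struct.unpack(">f").
float unpack_be_float(const char* str)
{
    uint32_t tmp;
    memcpy(&tmp, str, sizeof tmp);   // avoid the aliasing/alignment problems of *(uint32_t*)str
    tmp = ntohl(tmp);                // big-endian wire order -> host order
    float f;
    memcpy(&f, &tmp, sizeof f);      // reinterpret the bits as a float
    return f;
}

// Sketch: the same idea for struct.unpack(">d"), swapping 8 bytes by hand
// so we don't depend on a non-standard 64-bit swap.
double unpack_be_double(const char* str)
{
    unsigned char bytes[8];
    memcpy(bytes, str, sizeof bytes);
    uint64_t tmp = 0;
    for (int i = 0; i < 8; ++i)
        tmp = (tmp << 8) | bytes[i]; // most significant byte first
    double d;
    memcpy(&d, &tmp, sizeof d);
    return d;
}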
Unpack the string one byte at a time.
unsigned char *str;
unsigned int result;
result = (unsigned int)*str++ << 24; // cast first so the shift can't overflow a signed int
result |= *str++ << 16;
result |= *str++ << 8;
result |= *str++;
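Wrapped up as a reusable function, that might look like the sketch below (the name unpack_be32 is just illustrative); casting the result to int32_t gives the signed ">i" behaviour on the usual two's-complement machines:

#include <stdint.h>

// Read 4 big-endian bytes into a host-order unsigned 32-bit value,
// roughly what struct.unpack(">I") does for a 4-byte string.
uint32_t unpack_be32(const unsigned char* str)
{
    uint32_t result;
    result  = (uint32_t)str[0] << 24;
    result |= (uint32_t)str[1] << 16;
    result |= (uint32_t)str[2] << 8;
    result |= (uint32_t)str[3];
    return result;
}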
First, the cast you're doing:
char *str = ...;
int32_t i = *(int32_t*)str;
results in undefined behavior due to the strict aliasing rule (unless str is initialized with something like int32_t x; char *str = (char*)&x;). In practical terms that cast can result in an unaligned read which causes a bus error (a crash) on some platforms and slow performance on others.
Instead you should be doing something like:
int32_t i;
std::memcpy(&i, str, sizeof(i));
There are a number of functions for swapping bytes between the host's native byte ordering and a host-independent ordering: ntoh*() and hton*(), where * is l or s for the different types supported. Since different hosts may have different byte orderings, this may be what you want to use if the data you're reading uses a consistent serialized form on all platforms.
i = ntohl(i);
You can also manually move bytes around in str before copying it into the integer.
std::swap(str[0],str[3]);
std::swap(str[1],str[2]);
std::memcpy(&i,str,sizeof(i));
Or you can manually manipulate the integer's value using shifts and bitwise operators.
std::memcpy(&i,str,sizeof(i));
i = (i&0xFFFF0000)>>16 | (i&0x0000FFFF)<<16;
i = (i&0xFF00FF00)>>8 | (i&0x00FF00FF)<<8;
This falls in the realm of bit twiddling.
for (i=0;i<sizeof(struct foo);i++) dst[i] = src[i ^ mask];
where mask == (sizeof type -1) if the stored and native endianness differ.
With this technique one can describe a whole struct with per-byte masks:
struct foo {
    unsigned char a, b; // mask = 0,0
    short e;            // mask = 1,1
    int g;              // mask = 3,3,3,3
    double i;           // mask = 7,7,7,7,7,7,7,7
} s; // note that every member must be aligned according to its native size
Again, these masks can be encoded with two bits per symbol: (1<<n)-1, meaning that on 64-bit machines one can encode the necessary masks for a 32-byte struct in a single constant (with 1, 2, 4 and 8 byte alignments).
unsigned int mask = 0xffffaa50; // or zero if the endianness matches
for (i = 0; i < 16; i++) {
    dst[i] = src[i ^ ((1 << (mask & 3)) - 1)];
    mask >>= 2;
}
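A self-contained sketch of the same idea for the struct foo above. It assumes the compiler lays the members out with only their natural alignment padding, so sizeof(struct foo) == 16 and the mask constant derived above lines up with the bytes; the function name is just illustrative:

#include <stddef.h>

struct foo {
    unsigned char a, b; // mask = 0,0
    short e;            // mask = 1,1
    int g;              // mask = 3,3,3,3
    double i;           // mask = 7,7,7,7,7,7,7,7
};

// Copy src into dst while byte-swapping each member, driven by the
// 2-bits-per-byte mask constant derived above.
void swap_foo_bytes(struct foo* dst_foo, const struct foo* src_foo)
{
    const unsigned char* src = (const unsigned char*)src_foo;
    unsigned char* dst = (unsigned char*)dst_foo;
    unsigned int mask = 0xffffaa50; // or zero if the endianness already matches
    for (size_t i = 0; i < sizeof(struct foo); i++) {
        dst[i] = src[i ^ ((1u << (mask & 3)) - 1u)];
        mask >>= 2;
    }
}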
If the values you receive are truly strings (char* or std::string) and you know their format, sscanf() and atoi() - well, really the whole ato*() family - will be your friends. They take well-formatted strings and convert them per passed-in formats (kind of a reverse printf).