fputc perfomance and optimization - c++

I have been seeing some strange behavior with a program I wrote that I cannot really explain and I was wondering if anybody could explain to me what is happening here. I have the feeling this is caused by some advanced optimization technique that g++ is using with -O3 but I am not sure.
I am running something similar to this (not a full example):
char* str = "(long AB string)"; // string _only_ consisting of As and Bs
size_t len = strlen(str);
for(unsigned long offset = 0; offset < len; offset++) {
if(offset % 100 == 0) fputc('\n', f);
fputc(str[offset], f);
}
This is fairly slow. However, when I additionally check the character like this, it suddenly becomes very fast:
char* str = "(long AB string)"; // string _only_ consisting of As and Bs
size_t len = strlen(str);
for(unsigned long offset = 0; offset < len; offset++) {
if(offset % 100 == 0) fputc('\n', f);
if(str[offset] != 'A' && str[offset] != 'B') exit(1);
fputc(str[offset], f);
}
This is despite the string only consisting of As and Bs, so the number of characters written does not change and the program always exits normally.
Can anybody explain to me what is happening here? Does the character check allow the optimizer to make some assumptions about str[offset] that it otherwise couldn't make, allowing it to optimize out some part of the fputc call?

The compiler optimizes away pretty everything into a simple
exit(1)
since the compiler is smart enough to recognize the string constant "(long string)" doesn't contain any 'A's or 'B's.
Frankly, I wouldn't have expected gcc to detect that ;)

In C, fputc(3) is mandated to be a function; equivalent to the typically implemented as a macro putc(3), which accesses the FILE buffer directly. Very hard to do it much faster, except perhaps using fwrite(3) to copy a stretch of characters at a time instead of going one-by-one. But such usage is exactly for what putc(3) is supposed to be optimized for, so...
Analyze your loop at the base C level by only preprocessing (gcc -E) and compiling to assembler (gcc -S), that might give some clues.
Are you sure (i.e., have concrete measurements to say so) that this loop is performance critical (or even relevant)? That would really be very strange.

Related

Right shift a binary string efficiently in C++

If I have a string that represents an integer in binary form such as
1101101
and I want to circularly right shift it to obtain
1110110
One way I could think of would be converting the string into an int and use (taken from wikipedia)
// https://stackoverflow.com/a/776550/3770260
template <typename INT>
#if __cplusplus > 201100L // Apply constexpr to C++ 11 to ease optimization
constexpr
#endif // See also https://stackoverflow.com/a/7269693/3770260
INT rol(INT val, size_t len) {
#if __cplusplus > 201100L && _wp_force_unsigned_rotate // Apply unsigned check C++ 11 to make sense
static_assert(std::is_unsigned<INT>::value,
"Rotate Left only makes sense for unsigned types");
#endif
return (val << len) | ((unsigned) val >> (-len & (sizeof(INT) * CHAR_BIT - 1)));
}
However, if the string consists of, say, 10^6 char then this does not work as the integer representation exceeds even the range of __int64.
In that case I could think of a solution by looping over the string
//let str be a char string of length n
char temp = str[n - 1];
for(int i = n - 1; i > 0; i--)
{
str[i] = str[i - 1];
}
str[0] = temp;
This solution runs in O(n) due the loop over the length of the string, n. My question is, is there much more efficient way to implement circular shifting for large binary strings?
EDIT
Both input and output are std::strings
You have to move memory one way or another, so your proposed solution is as fast as it gets.
You might also use standard std::string functions:
str.insert(str.begin(), str[n - 1]);
str.erase(str.end() - 1);
or memmove, or memcpy (I don't actually recommend this, it's for an argument)
char temp = str[n - 1];
memmove(str.data() + 1, str.data(), n - 1);
str[0] = temp;
Note that memmove may look faster, but it's essentially the same thing as your loop. It is moving bytes one by one, it's just encapsulated in a different function. This method might be faster for much larger data blocks, of size 1000 bytes or more, since the CPU is optimized to move large chunks of memory. But you won't be able to measure any difference for 10 or 20 bytes.
Moreover, the compiler will most likely run additional optimizations when it sees your for loop, it realizes that you are moving memory and chooses the best option.
The compiler is also good at dealing with std::string methods. These are common operations and the compiler knows the best way to handle it.

Bottleneck and bad style code

I wrote a program and I have a performance problem.
The bottleneck is this function:
void getlinesplit(const char *file, unsigned int &pos, tline &vline)
{
vline.clear();
unsigned int debut_du_mot = 0;
unsigned int i = 0;
while (file[pos+i] != '\n')
{
if (file[pos+i] == '\t')
{
vline.push_back(std::string(file+pos+debut_du_mot,i - debut_du_mot));
debut_du_mot = i+1;
}
++i;
}
vline.push_back(std::string(file+pos+debut_du_mot,i - debut_du_mot));
pos = pos + i+1;
}
This function is called 11 988 400 times.
vline is the same string vector to avoid creating and destroying a vector.
How can I improve this function?
PS: The line is composed of 1 or 2 words maximum.
Most likely the function is not the bottleneck, but the fact that you are calling it 12 million times :-)
An obvious improvement is having a variable
const char* file_pos = file + pos;
simplifying every single access. You don't say how tline is implemented; if a line never contains more than two words then you can probably make it faster by having two std::string members instead of an array.
std::experimental::string_view from C++17 if accessible should make that a bajillion times faster. If not, something similar (begin/end pointer pairs or start/length).
It is also awkwardly written, but I do not see anything horrible performance wise.
Toss in some alignment/length guarantees and you could SSE optimize it, but that would be tiny next to the replacement of strings with views, and much harder.
Any benefit is likely to be limited, as the situation is probably io-bound. Get a faster disk.

Replacing multiple chars at the same time

So in my code I have a series of chars which I want to replace with random data. Since rand can replace ints, I figured I could save some time by replacing four chars at once instead of one at a time. So basically instead of this:
unsigned char TXT[] = { data1,data2,data3,data4,data4,data5....
for (i = 34; i < flenght; i++) // generating the data to send.
TXT[i] = rand() % 255;
I'd like to do something like:
unsigned char TXT[] = { data1,data2,data3,data4,data4,data5....
for (i = 34; i < flenght; i+4) // generating the data to send.
TXT[i] = rand() % 4294967295;
Something that effect, but I'm not sure how to do the latter part. Any help you can give me is greatly appreciated, thanks!
That won't work. The compiler will take the result from rand() % big_number and chop off the extra data to fit it in an unsigned char.
Speed-wise, your initial approach was fine. The optimization you contemplated is valid, but most likely unneeded. It probably wouldn't make a noticeable difference.
What you wanted to do is possible, of course, but given your mistake, I'd say the effort to understand how right now far outweights the benefits. Keep learning, and the next time you run across code like this, you'll know what to do (and judge if it's necessary), look back on this moment and smile :).
You'll have to access memory directly, and do some transformations on your data. You probably want something like this:
unsigned char TXT[] = { data1,data2,data3,data4,data4,data5....
for (i = 34; i < flenght/sizeof(int); i+=sizeof(int)) // generating the data to send.
{
int *temp = (int*)&TXT[i]; // very ugly
*temp = rand() % 4294967295;
}
It can be problematic though because of alignment issues, so be careful. Alignment issues can cause your program to crash unexpectedly, and are hard to debug. I wouldn't do this if I were you, your initial code is just fine.
TXT[i] = rand() % 4294967295;
Will not work the way you expect it to. Perhaps you are expecting that rand()%4294967295 will generate a 4 byte integer(which you maybe interpreting as 4 different characters). The value that rand()%4294967295, produces will be type cast into a single char and will get assigned to only one of the index of TXT[i].
Though it's not quire clear as to why you need to make 4 assigning at the same time, one approach would be to use bit operators to obtain 4 different significant bytes of the number generated and those can then be assigned to the four different index.
There are valid answers just so much C does not care very much about what type it stores at which address. So you can get away with something like:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
char *arr;
int *iArr;
int main (void){
int i;
arr = malloc(100);
/* Error handling ommitted, yes that's evil */
iArr = (int*) arr;
for (i = 0; i < 25; i++) {
iArr[i] = rand() % INT_MAX;
}
for (i = 0; i < 25; i++) {
printf("iArr[%d] = %d\n", i, iArr[i]);
}
for (i = 0; i < 100; i++) {
printf("arr[%d] = %c\n", i, arr[i]);
}
free(arr);
return 0;
}
In the end an array is just some contiguous block in memory. And you can interpret it as you like (if you want). If you know that sizeof(int) = 4 * sizeof(char) then the above code will work.
I do not say I recommend it. And the others have pointed out whatever happened the first loop through all the chars in TXT will yield the same result. One could think for example of unrolling a loop but really I'd not care about that.
The (int*) just alone is warning enough. It means to the compiler, do not think about what you think the type is just "believe" he programmer that he knows better.
Well this "know better" is probably the root of all evil in C programming....
unsigned char TXT[] = { data1,data2,data3,data4,data4,data5....
for (i = 34; i < flenght; i+4)
// generating the data to send.
TXT[i] = rand() % 4294967295;
This has a few issues:
TXT is not guaranteed to be memory-aligned as needed for the CPU to write int data (whether it works - perhaps relatively slowly - or not - e.g. SIGBUS on Solaris - is hardware specific)
the last 1-3 characters may be missed (even if you change i + 4 to i += 4 ;-P)
rand() returns an int anyway - you don't need to mod it with anything
you need to write your random data via an int* so you're accessing 4 bytes at a time and not simply slicing a byte off the end of the random data and overwriting every fourth single character
for stuff like this where you're dependent on the size of int, you should really write it in terms of sizeof(int) so it'll work even if int isn't 32 bits, or use a (currently sadly) non-Standard but common typedef such as int32_t (or on Windows I think it's __int32, or you can use a boost or other library header to get int32_t, or write your own typedef).
It's actually pretty tricky to align your text data: your code suggests you want int-sized slices from the 35th character... even if the overall character array is aligned properly for ints, the 35th character will not be.
If it really is always the 35th, then you can pad the data with a leading character so you're accessing the 36th (being a multiple of presumably 32-bit int size), then align the text to an 32-bit address (with a compiler-specific #pragma or using a union with int32_t). If the real code varies the character you start overwriting from, such that you can't simply align the data once, then you're stuck with:
your original character-at-a-time overwrites
non-portable unaligned overwrites (if that's possible and better on your system), OR
implementing code that overwrites up to three leading unaligned characters, then switches to 32-bit integer overwrite mode for aligned addresses, then back to character-by-character overwrites for up to three trailing characters.
That does not work because the generated value is converted to type of array element - char in this particular case. But you are free to interpret allocated memory in the manner you like. For example, you could convert it into array int:
unsigned char TXT[] = { data1,data2,data3,data4,data4,data5....
for (i = 34; i < flenght-sizeof(int); i+=sizeof(int)) // generating the data to send.
*(int*)(TXT+i) = rand(); // There is no need in modulo operator
for (; i < flenght; ++i) // generating the data to send.
TXT[i] = rand(); // There is no need in modulo operator either
I just want to complete solution with the remarks about modulo operator and handling of arrays not multiple of sizeof(int).
1) % means "the remainder when divided by", so you want rand() % 256 for a char, or else you will never get chars with a value of 255. Similarly for the int case, although here there is no point in doing a modulus operation anyway, since you want the entire range of output values.
2) rand usually only generates two bytes at a time; check the value of RAND_MAX.
3) 34 isn't divisible by 4 anyway, so you will have to handle the end case specially.
4) You will want to cast the pointer, and it won't work if it isn't already aligned. Once you have the cast, though, there is no need to account for the sizeof(int) in your iteration: pointer arithmetic automatically takes care of the element size.
5) Chances are very good that it won't make a noticeable difference. If scribbling random data into an array is really the bottleneck in your program, then it isn't really doing anything significiant anyway.

Is a std::vector lookup faster than performing a simple operation?

I'm trying to optimize some C++ code for speed, and not concerned about memory usage. If I have some function that, for example, tells me if a character is a letter:
bool letterQ ( char letter ) {
return (lchar>=65 && lchar<=90) ||
(lchar>=97 && lchar<=122);
}
Would it be faster to just create a lookup table, i.e.
int lookupTable[128];
for (i = 0 ; i < 128 ; i++) {
lookupTable[i] = // some int value that tells what it is
}
and then modifying the letterQ function above to be
bool letterQ ( char letter ) {
return lookupTable[letter]==LETTER_VALUE;
}
I'm trying to optimize for speed in this simple region, because these functions are called a lot, so even a small increase in speed would accumulate into long-term gain.
EDIT:
I did some testing, and it seems like a lookup array performs significantly better than a lookup function if the lookup array is cached. I tested this by trying
for (int i = 0 ; i < size ; i++) {
if ( lookupfunction( text[i] ) )
// do something
}
against
bool lookuptable[128];
for (int i = 0 ; i < 128 ; i++) {
lookuptable[i] = lookupfunction( (char)i );
}
for (int i = 0 ; i < size ; i++) {
if (lookuptable[(int)text[i]])
// do something
}
Turns out that the second one is considerably faster - about a 3:1 speedup.
About the only possible answer is "maybe" -- and you can find out by running a profiler or something else to time the code. At one time, it would have been pretty easy to give "yes" as the answer with little or no qualification. Now, given how much faster CPUs have gotten than memory, it's a lot less certain -- you can do a lot of computation in the time it takes to fill one cache line from main memory.
Edit: I should add that in either C or C++, it's probably best to at least start with the functions (or macros) built into the standard library. These are often fairly carefully optimized for the target and (more importantly for most people) support things like switching locales, so you won't be stuck trying to explain to your German users that 'ß' isn't really a letter (and I doubt many will be much amused by "but that's really two letters, not one!)
First, I assume you have profiled the code and verified that this particular function is consuming a noticeable amount of CPU time over the runtime of the program?
I wouldn't create a vector as you're dealing with a very fixed data size. In fact, you could just create a regular C++ array and initialize is at program startup. With a really modern compiler that supports array initializers you even can do something like this:
bool lookUpTable[128] = { false, false, false, ..., true, true, ... };
Admittedly I'd probably write a small script that generates out the code rather then doing it all manually.
For a simple calculation like this, the memory access (caused by a lookup table) is going to be more expensive than just doing the calculation every time.

Is there a way to improve the speed or efficiency of this lookup? (C/C++)

I have a function I've written to convert from a 64-bit integer to a base 62 string. Originally, I achieved this like so:
char* charset = " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int charsetLength = strlen(charset);
std::string integerToKey(unsigned long long input)
{
unsigned long long num = input;
string key = "";
while(num)
{
key += charset[num % charsetLength];
num /= charsetLength;
}
return key;
}
However, this was too slow.
I improved the speed by providing an option to generate a lookup table. The table is about 624 strings in size, and is generated like so:
// Create the integer to key conversion lookup table
int lookupChars;
if(lookupDisabled)
lookupChars = 1;
else
largeLookup ? lookupChars = 4 : lookupChars = 2;
lookupSize = pow(charsetLength, lookupChars);
integerToKeyLookup = new char*[lookupSize];
for(unsigned long i = 0; i < lookupSize; i++)
{
unsigned long num = i;
int j = 0;
integerToKeyLookup[i] = new char[lookupChars];
while(num)
{
integerToKeyLookup[i][j] = charset[num % charsetLength];
num /= charsetLength;
j++;
}
// Null terminate the string
integerToKeyLookup[i][j] = '\0';
}
The actual conversion then looks like this:
std::string integerToKey(unsigned long long input)
{
unsigned long long num = input;
string key = "";
while(num)
{
key += integerToKeyLookup[num % lookupSize];
num /= lookupSize;
}
return key;
}
This improved speed by a large margin, but I still believe it can be improved. Memory usage on a 32-bit system is around 300 MB, and more than 400 MB on a 64-bit system. It seems like I should be able to reduce memory and/or improve speed, but I'm not sure how.
If anyone could help me figure out how this table could be further optimized, I'd greatly appreciate it.
Using some kind of string builder rather than repeated concatenation into 'key' would provide a significant speed boost.
You may want to reserve memory in advance for your string key. This may get you a decent performance gain, as well as a gain in memory utilization. Whenever you call the append operator on std::string, it may double the size of the internal buffer if it has to reallocate. This means each string may be taking up significantly more memory than is necessary to store the characters. You can avoid this by reserving memory for the string in advance.
I agree with Rob Walker - you're concentrating on improving performance in the wrong area. The string is the slowest part.
I timed the code (your original is broken, btw) and your original (when fixed) was 44982140 cycles for 100000 lookups and the following code is about 13113670.
const char* charset = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
#define CHARSET_LENGTH 62
// maximum size = 11 chars
void integerToKey(char result[13], unsigned long long input)
{
char* p = result;
while(input > 0)
{
*p++ = charset[input % CHARSET_LENGTH];
input /= CHARSET_LENGTH;
}
// null termination
*p = '\0';
// need to reverse the output
char* o = result;
while(o + 1 < p)
swap(*++o, *--p);
}
This is almost a textbook case of how not to do this. Concatenating strings in a loop is a bad idea, both because appending isn't particularly fast, and because you're constantly allocating memory.
Note: your question states that you're converting to base-62, but the code seems to have 63 symbols. Which are you trying to do?
Given a 64-bit integer, you can calculate that you won't need any more than 11 digits in the result, so using a static 12 character buffer will certainly help improve your speed. On the other hand, it's likely that your C++ library has a long-long equivalent to ultoa, which will be pretty optimal.
Edit: Here's something I whipped up. It allows you to specify any desired base as well:
std::string ullToString(unsigned long long v, int base = 64) {
assert(base < 65);
assert(base > 1);
static const char digits[]="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/";
const int max_length=65;
static char buffer[max_length];
buffer[max_length-1]=0;
char *d = buffer + max_length-1;
do {
d--;
int remainder = v % base;
v /= base;
*d = digits[remainder];
} while(v>0);
return d;
}
This only creates one std::string object, and doesn't move memory around unnecessarily. It currently doesn't zero-pad the output, but it's trivial to change it to do that to however many digits of output you want.
You don't need to copy input into num, because you pass it by value. You can also compute the length of charset in compiletime, there's no need to compute it in runtime every single time you call the function.
But these are very minor performance issues. I think the the most significant help you can gain is by avoiding the string concatenation in the loop. When you construct the key string pass the string constructor the length of your result string so that there is only one allocation for the string. Then in the loop when you concatenate into the string you will not re-allocate.
You can make things even slightly more efficient if you take the target string as a reference parameter or even as two iterators like the standard algorithms do. But that is arguably a step too far.
By the way, what if the value passed in for input is zero? You won't even enter the loop; shouldn't key then be "0"?
I see the value passed in for input can't be negative, but just so we note: the C remainder operator isn't a modulo operator.
Why not just use a base64 library? Is really important that 63 equals '11' and not a longer string?
size_t base64_encode(char* outbuffer, size_t maxoutbuflen, const char* inbuffer, size_t inbuflen);
std::string integerToKey(unsigned long long input) {
char buffer[14];
size_t len = base64_encode(buffer, sizeof buffer, (const char*)&input, sizeof input);
return std::string(buffer, len);
}
Yes, every string will end with an equal size. If you don't want it to, strip off the equal sign. (Just remember to add it back if you need to decode the number.)
Of course, my real question is why are you turning a fixed width 8byte value and not using it directly as your "key" instead of the variable length string value?
Footnote: I'm well aware of the endian issues with this. He didn't say what the key will be used for and so I assume it isn't being used in network communications between machines of disparate endian-ness.
If you could add two more symbols so that it is converting to base-64, your modulus and division operations would turn into a bit mask and shift. Much faster than a division.
If all you need is a short string key, converting to base-64 numbers would speed up things a lot, since div/mod 64 is very cheap (shift/mask).