I'm starting to get into the mild depths of C, using Arduinos and the like, and I just wanted some advice on how I'm generating random noise using a for loop.
The important bit:
void testdrawnoise() {
  int j = 0;
  for (uint8_t i=0; i<display.width(); i++) {
    if (i == display.width()-1) {
      j++;
      i=0;
    }
    M = random(0, 2);           // Random 0/1
    display.drawPixel(i, j, M); // (Width, Height, Pixel on/off)
    display.refresh();
  }
}
The function draws pixels one by one across the screen, moving to the next line down once i has reached display.width()-1. Whether the pixel appears on (black) or off (white) is determined by M.
The code is working fine, but I feel like it could be done better, or at least neater, and perhaps more efficiently.
Input and critiques greatly appreciated.
First of all, your loop never ends and goes on incrementing j without bounds, so after you have filled the screen once you keep looping outside the screen height; although your library does bounds checking, it's certainly not a productive use of CPU to keep looping without actually doing useful work until j overflows and wraps back to zero.
Also, signed overflow is undefined behavior in C++, so you are technically on shaky grounds (I originally thought that Arduino always compiles with -fwrapv which guarantees wraparound on signed integer overflow, but apparently I was mistaken).
Given that the library you are using keeps the whole framebuffer in memory and sends it all on refresh calls, it doesn't make much sense to re-send it at each pixel - especially since the frame transmission is probably going to be by far the slowest part of this loop. So, you can move it out of the loop.
Putting this together (plus caching width and height and using the simpler overload of random), you can change this to:
void testdrawnoise() {
  int w = display.width(), h = display.height();
  for (int j=0; j<h; ++j) {
    for (int i=0; i<w; ++i) {
      display.drawPixel(i, j, random(2));
    }
  }
  display.refresh();
}
(if your screen dimensions are smaller than 256 on AVR Arduinos you may gain something by changing all those int to byte, but don't take my word for it)
Notice that this will draw the noise just once; you can call it from your loop() function or from an infinite loop to keep generating random patterns.
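For instance, the usual Arduino skeleton (a minimal sketch; the display initialization is whatever your sketch already does):

void setup() {
  display.begin(); // or however your sketch already initializes the display
}

void loop() {
  testdrawnoise(); // draw a fresh random frame, forever
}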
This is what you can do with the provided interface; now, going into undocumented territory we can go faster.
As stated above, the library you are using keeps the whole framebuffer in memory, packed (as expected) at 8 bits per byte, in a single global variable named sharpmem_buffer, initialized with a malloc of the obvious size.
It should also be noted that, when you ask for a random bit in your code, the PRNG generates a full 31-bit random number and takes just the low bit. Why waste all the other perfectly good random bits?
At the same time, when you call drawPixel, the library performs a series of boolean operations on the corresponding byte in memory to set just the bit you asked for without touching the rest of the bits. Quite stupid, given that you are going to overwrite the other ones with random anyway.
So, putting together these two facts, we can do something like:
void testdrawnoise() {
  // access the buffer defined in another .cpp
  extern byte *sharpmem_buffer;
  byte *ptr = sharpmem_buffer; // pointer to current position
  // end position
  byte *end = ptr + display.width()*display.height()/8;
  for (; ptr!=end; ++ptr) {
    // store a full byte of random
    *ptr = random(256);
  }
  display.refresh();
}
which, excluding the refresh() time, should be at the very least 8 times faster than the previous version (I actually expect significantly more, given that not only does the core of the loop execute 1/8th of the iterations, but it's also way simpler - no function calls besides random, no branches, no boolean operations on memory).
On AVR Arduinos the only point that can be optimized further is probably the RNG - we are still using only 8 bits of a 31-bit (if it actually is 31 bits? Arduino documentation as usual sucks badly at providing useful technical information) RNG, so we could probably generate 3 bytes of random out of a single RNG call, or 4 if we switched to a hand-rolled LCG that didn't mess with the sign bit. On ARM Arduinos, in this last case, we could even gain something by performing full 32-bit stores in memory instead of writing single bytes.
However, these further optimizations are (1) tedious to write (if you have to handle screens where the number of pixels is not a multiple of 24/32) and (2) probably not particularly profitable, given that most of the time will be spent in transmission over the SPI anyway. They are worth mentioning anyway, as they may be useful in other cases where there's no transmission bottleneck to slow everything down.
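For illustration only, a rough sketch of the 3-bytes-per-call idea on AVR (assuming, as above, the library's sharpmem_buffer; random(0x1000000L) yields 24 uniform bits because the underlying 31-bit generator is reduced modulo a power of two):

void testdrawnoise() {
  extern byte *sharpmem_buffer;
  byte *ptr = sharpmem_buffer;
  byte *end = ptr + (long)display.width() * display.height() / 8;
  while (ptr + 3 <= end) {
    long r = random(0x1000000L); // 24 uniform random bits per call
    *ptr++ = r;
    *ptr++ = r >> 8;
    *ptr++ = r >> 16;
  }
  while (ptr != end)
    *ptr++ = random(256);        // mop up the last byte or two
  display.refresh();
}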
Given that OP's MCU is actually a Cortex M0 (so, a 32 bit ARM), it's worth trying to make it even faster using a full 32 bit PRNG and 32 bit stores.
As said above, built-in random returns a signed value, and it's not exactly clear what range it provides; for this reason, we'll have to roll our own PRNG that is guaranteed to provide 32 full bits of randomness.
A decent and very fast PRNG that provides 32 random bits with minimal state is xorshift; we'll just use the xorshift32 straight from Wikipedia, as we don't really need the improved "*" or "+" versions (nor do we really care about the bigger period provided by the larger counterparts).
struct XorShift32 {
  uint32_t state = 0x12345678;

  uint32_t next() {
    uint32_t x = state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    state = x;
    return x;
  }
};
XorShift32 xorShift;
Now we can rewrite testdrawnoise():
void testdrawnoise() {
  int size = display.width()*display.height();
  // access the buffer defined in another .cpp
  extern byte *sharpmem_buffer;
  /*
    we can access the framebuffer as if it was an array of 32-bit words;
    this is fine, since it was alloc-ed with malloc, which guarantees memory
    aligned for the most restrictive built-in type, and the library only
    uses it with byte pointers, so there should be no strict aliasing problem
  */
  uint32_t *ptr = (uint32_t *)sharpmem_buffer;
  /*
    notice that the division is an integer division, which truncates; so, we
    are filling the framebuffer up to the last multiple of 4 bytes; with
    "strange" sizes we may be leaving out up to 3 bytes (see later)
  */
  uint32_t *end = ptr + size/32;
  for (; ptr!=end; ++ptr) {
    // store a full 32-bit word of random
    *ptr = xorShift.next();
  }
  // now fill the possibly missing last three bytes,
  // picking up where we left off
  byte *final_ptr = (byte *)end;
  byte *final_end = sharpmem_buffer + size/8;
  // generate 32 random bits; it's ok, we'll need at most 24
  uint32_t r = xorShift.next();
  for (; final_ptr!=final_end; ++final_ptr) {
    // take the lower 8 bits
    *final_ptr = r;
    // throw away the bits we used, bring in the upper ones
    r = r >> 8;
  }
  display.refresh();
}
I'm having trouble reading a binary file into a bitset and processing it.
std::ifstream is("data.txt", std::ifstream::binary);
if (is) {
  // get length of file:
  is.seekg(0, is.end);
  int length = is.tellg();
  is.seekg(0, is.beg);

  char *buffer = new char[length];
  is.read(buffer, length);
  is.close();

  const int k = sizeof(buffer) * 8;
  std::bitset<k> tmp;
  memcpy(&tmp, buffer, sizeof(buffer));
  std::cout << tmp;

  delete[] buffer;
}
int a = 5;
std::bitset<32> bit;
memcpy(&bit, &a, sizeof(a));
std::cout << bit;
I want to get {05 00 00 00} (hex memory view), bitset[0~31]={00000101 00000000 00000000 00000000} but I get bitset[0~31]={10100000 00000000 00000000 00000000}
You need to learn how to crawl before you can crawl on broken glass.
In short, computer memory is an opaque box, and you should stop making assumptions about it.
Hyrum's law is the stupidest thing that has ever existed and if you stopped proliferating this cancer, that would be great.
What I'm about to write is common sense to every single competent C++ programmer out there. As trivial as breathing, and as important as breathing. It should be included in every single copy of C++ book ever, hammered into the heads of new programmers as soon as possible, but for some undefined reason, isn't.
The only thing you can rely on, when it comes to what I'm going to loosely define as "memory", is that the bits of a byte are never out of order. std::byte is such a type, and before it was added to the standard we used unsigned char; they are more or less interchangeable, but you should prefer std::byte whenever you can.
So, what do I mean by this?
std::byte a{0b10101000};
assert(((a >> 3) & std::byte{1}) == std::byte{1}); // always true
That's it, everything else is up to the compiler, your machine architecture and stars in the sky.
Oh, what, you thought you could just write int a = 0b1010100000000010; and expect something good? I'm sorry, but that's just not how things work in these savage lands. If you expect any order here, you will have to split it into bytes yourself; you cannot just cast this into std::byte bytes[2] and expect bytes[0] == 0b10101000. It is NEVER correct to assume anything here. If you do, one day your code will break, and by the time you realize that it's broken it will be too late, because it will be yet another undebuggable 30-million-line legacy codebase, half of which is only available in proprietary shared objects that nobody has had the source code for since 1997. Good luck.
So, what's the correct way? Luckily for us, binary shifts are architecture independent. int is guaranteed to be no smaller than 2 bytes, so that's the only thing this example relies on, but most machines have sizeof (int) == 4. If you need more bytes, or an exact number of bytes, you should be using the appropriate fixed-width integer type from <cstdint>.
#include <climits> // CHAR_BIT
#include <cstddef> // std::byte

int a = 0b1010100000000010;
std::byte bytes[2]; // always correct
// std::byte bytes[4]; // stupid assumption by inexperienced programmers
// std::byte bytes[sizeof (a)]; // flexible solution that needs more work

// we think in terms of 8 bits, don't care about the rest
bytes[0] = std::byte(a & 0xFF);
// we need to skip possibly more than 8 bits to access the next 8 bits, however
bytes[1] = std::byte((a >> CHAR_BIT) & 0xFF);
This is the only absolutely correct way to convert a value of a type with sizeof (T) > 1 into an array of bytes, and if you see anything else then it's without a doubt a subpar implementation that will stop working the moment you change the compiler and/or machine architecture.
The reverse is true too: you need to use binary shifts to convert a byte array back into a type bigger than 1 byte.
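The same idea in the other direction, a minimal sketch (needs the same <cstddef> and <climits> as above):

std::byte bytes[2] = { std::byte{0x02}, std::byte{0xA8} };
int a = std::to_integer<int>(bytes[0])
      | (std::to_integer<int>(bytes[1]) << CHAR_BIT);
// a == 0b1010100000000010, whatever the machine's endianness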
On top of that, this only applies to primitive types: int, long, short... Sometimes you can rely on it working correctly with float or double, as long as you can always assume IEEE 754 and will never need a machine so old or bizarre that it doesn't support IEEE 754. That's it.
If you think really long and hard, you may realize that this is no different from structs.
struct x {
    int a;
    int b;
};
What can we rely on? Well, we know that x will have the address of a. That's it. If we want to set b, we need to access it via x.b; every other assumption is ALWAYS wrong, with no ifs or buts. The only exception is if you wrote your own compiler and you are using your own compiler, but then you're ignoring the standard, and at that point anything is possible; that's fine, but it's not C++ anymore.
So, what can we infer from what we know now? Array of bytes cannot be just memcpy'd into a std::bitset. You don't know its implementation and you cannot know its implementation, it may change tomorrow and if your code breaks because of that then it's wrong and you're a failure of a programmer.
Want to convert an array of bytes to a bitset? Then go ahead and iterate over every single bit in the byte array and set each bit of the bitset however you need it to be, that's the only correct and sane solution. Every other solution is objectively wrong, now, and forever. Until someone decides to say otherwise in C++ standard. Let's just hope that it will never happen.
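Applied to the int example from the question, a minimal sketch of that loop:

#include <bitset>
#include <cstddef>
#include <iostream>

int main() {
    int a = 5;
    std::bitset<32> bits;
    for (std::size_t i = 0; i < bits.size(); ++i)
        bits[i] = (a >> i) & 1; // bit i of the value, no memory layout assumed
    std::cout << bits;          // prints ...00000101, LSB at the right
}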
The hash function is the usual polynomial string hash (the one the code below implements): hash(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], reduced modulo 2^32.
I wrote the code below. I am not sure if I can use another clever way to make this more efficient. I am using the understanding that I do not need to do the mod at all since unsigned int takes care of that through overflow.
int myHash(string s)
{
    unsigned int hash = 0;
    long long int multiplier = 1;
    for (int i = s.size()-1; i > -1; i--)
    {
        hash += (multiplier * s[i]);
        multiplier *= 31;
    }
    return hash;
}
I would avoid using long long for multiplier, at least if you don't know 100% that your processor does 64-bit multiplies in the same amount of time as 32-bit ones. Really modern top-of-the-range processors probably do; older and smaller processors almost certainly take longer to do 64-bit mul operations than 32-bit ones.
Multiplying by 31 can actually be quite fast even on processors that aren't good at multiplying, because x *= 31 can be converted to x = x * 32 - x; or x = (x << 5) - x; - in fact it may be worth trying that [if you haven't compiled the code to assembler and seen that the compiler already does that].
Beyond that, it would be processor or compiler-specific optimisations that comes to mind. Loop unrolling for example. Or using inline assembler or intrinsics to make use of vector instructions (subject to availability for different processor architectures and different generations). Modern compilers like recent versions of gcc or clang will probably vectorize this code, subject to being given the "right" options.
As with all optimisation projects, measure the time using a representative workload, and keep records of what you changed. Look at the generated code and try to figure out if there's a better way to do it. And don't lose track of the fact that it's the OVERALL program's performance that matters. If you spend 80% of the time in this function, by all means optimize the heck out of it. If you spend 20% of the time, optimize it a bit; if you spend 2% of the time in it, unless there are OBVIOUS things you can do to improve it, it's not going to give you much. I've seen the results of people writing code to save a few clock-cycles in some code that takes several million cycles in the loop two lines further on. And of using bit-fiddling tricks to save 2 bytes in something that takes half a megabyte. It just creates mess, not really worth doing.
I guess you could make the argument not have to copy the string for the function call: make it const string &s instead, or use std::string_view if you happen to be using C++17. Otherwise it looks fast to the point where you should leave the rest to the compiler. Try making it optimize with -O2 or your compiler's equivalent.
Let me preface this by saying it's probably not worth doing -- it's unlikely that your hash function is going to be the bottleneck in your program, so making the hash function more elaborate in an attempt to make it more efficient will probably just make it harder to understand and maintain while not making your program measurably faster. So don't do this unless you've actually determined that your program spends a significant percentage of its time computing string hashes, and make sure you have a good benchmark routine that you can run "before" and "after" this change to verify that it actually did speed things up significantly, otherwise you might just be chasing rainbows.
That said, one potential way to hash long strings more quickly would be to process the string a word at a time rather than a character at a time, something like this:
unsigned int aSlightlyFasterHash(const string & s)
{
    const unsigned int numWordsInString      = s.size()/sizeof(unsigned int);
    const unsigned int numExtraBytesInString = s.size()%sizeof(unsigned int);

    // Compute the bulk of the hash by reading the string a word at a time
    unsigned int hash = 0;
    const unsigned int * iptr = reinterpret_cast<const unsigned int *>(s.c_str());
    for (unsigned int i=0; i<numWordsInString; i++)
    {
        hash += *iptr;
        iptr++;
    }

    // Then any "leftover" bytes at the end we will mix in to the hash the old way
    const unsigned char * cptr = reinterpret_cast<const unsigned char *>(iptr);
    unsigned int multiplier = 1;
    for (unsigned int i=0; i<numExtraBytesInString; i++)
    {
        hash += (multiplier * *cptr);
        cptr++;
        multiplier *= 31;
    }
    return hash;
}
Note that the above function will return different hash values than the hash function you provided.
That cuts down the number of loop iterations by a factor of four; of course it's likely that the execution of the function is limited by RAM bandwidth rather than CPU cycles anyway, so don't be too surprised if this doesn't go noticeably faster on a modern CPU. If RAM bandwidth is indeed the bottleneck, then there's not much you can do about it, since you have to read the contents of the string in order to compute a hash code for it; there's no getting around that (except perhaps by precomputing the hash code in advance and storing it somewhere, but that only works if you know all the strings you are going to use in advance).
If, say, a 32-bit integer is overflowing, instead of upgrading int to long, can we make use of some 40-bit type if we need a range only within 2^40, so that we save 24 (64-40) bits for every integer?
If so, how?
I have to deal with billions and space is a bigger constraint.
Yes, but...
It is certainly possible, but it is usually nonsensical (for any program that doesn't use billions of these numbers):
#include <stdint.h> // don't want to rely on something like long long

struct bad_idea
{
    uint64_t var : 40;
};
Here, var will indeed have a width of 40 bits at the expense of much less efficient code generated (it turns out that "much" is very much wrong -- the measured overhead is a mere 1-2%, see timings below), and usually to no avail. Unless you have need for another 24-bit value (or an 8 and 16 bit value) which you wish to pack into the same structure, alignment will forfeit anything that you may gain.
In any case, unless you have billions of these, the effective difference in memory consumption will not be noticeable (but the extra code needed to manage the bit field will be noticeable!).
Note:
The question has in the mean time been updated to reflect that indeed billions of numbers are needed, so this may be a viable thing to do, presumed that you take measures not to lose the gains due to structure alignment and padding, i.e. either by storing something else in the remaining 24 bits or by storing your 40-bit values in structures of 8 each, or multiples thereof.
Saving three bytes a billion times over is worthwhile, as it will require noticeably fewer memory pages and thus cause fewer cache and TLB misses and, above all, fewer page faults (a single page fault weighing in at tens of millions of instructions).
While the above snippet does not make use of the remaining 24 bits (it merely demonstrates the "use 40 bits" part), something akin to the following will be necessary to really make the approach useful in a sense of preserving memory -- presumed that you indeed have other "useful" data to put in the holes:
struct using_gaps
{
    uint64_t var           : 40;
    uint64_t useful_uint16 : 16;
    uint64_t char_or_bool  : 8;
};
Structure size and alignment will be equal to a 64 bit integer, so nothing is wasted if you make e.g. an array of a billion such structures (even without using compiler-specific extensions). If you don't have use for an 8-bit value, you could also use a 48-bit and a 16-bit value (giving a bigger overflow margin).
Alternatively you could, at the expense of usability, put 8 40-bit values into a structure (least common multiple of 40 and 64 being 320 = 8*40). Of course then your code which accesses elements in the array of structures will become much more complicated (though one could probably implement an operator[] that restores the linear array functionality and hides the structure complexity).
Update:
Wrote a quick test suite, just to see what overhead the bitfields (and operator overloading with bitfield refs) would have. Posted code (due to length) at gcc.godbolt.org, test output from my Win7-64 machine is:
Running test for array size = 1048576
what        alloc  seq(w)  seq(r)  rand(w)  rand(r)  free
---------------------------------------------------------
uint32_t        0       2       1       35       35      1
uint64_t        0       3       3       35       35      1
bad40_t         0       5       3       35       35      1
packed40_t      0       7       4       48       49      1

Running test for array size = 16777216
what        alloc  seq(w)  seq(r)  rand(w)  rand(r)  free
---------------------------------------------------------
uint32_t        0      38      14      560      555      8
uint64_t        0      81      22      565      554     17
bad40_t         0      85      25      565      561     16
packed40_t      0     151      75      765      774     16

Running test for array size = 134217728
what        alloc  seq(w)  seq(r)  rand(w)  rand(r)  free
---------------------------------------------------------
uint32_t        0     312     100     4480     4441     65
uint64_t        0     648     172     4482     4490    130
bad40_t         0     682     193     4573     4492    130
packed40_t      0    1164     552     6181     6176    130
What one can see is that the extra overhead of bitfields is negligible, but the operator overloading with bitfield reference as a convenience thing is rather drastic (about a 3x increase) when accessing data linearly in a cache-friendly manner. On the other hand, on random access it barely even matters.
These timings suggest that simply using 64-bit integers would be better since they are still faster overall than bitfields (despite touching more memory), but of course they do not take into account the cost of page faults with much bigger datasets. It might look very different once you run out of physical RAM (I didn't test that).
You can quite effectively pack four 40-bit integers into a 160-bit struct like this:
struct Val4 {
    char hi[4];
    unsigned int low[4];
};

long getLong( const Val4 &pack, int ix ) {
    int hi = pack.hi[ix]; // preserve sign into 32 bit
    return long( (((unsigned long)hi) << 32) + (unsigned long)pack.low[ix] );
}

void setLong( Val4 &pack, int ix, long val ) {
    pack.low[ix] = (unsigned)val;
    pack.hi[ix]  = (char)(val >> 32);
}
These again can be used like this:

Val4 vals[SIZE];

long getLong( int ix ) {
    return getLong( vals[ix >> 2], ix & 0x3 );
}

void setLong( int ix, long val ) {
    setLong( vals[ix >> 2], ix & 0x3, val );
}
You might want to consider Variable-Length Encoding (VLE).
Presumably, you have to store a lot of those numbers somewhere (in RAM, on disk, send them over the network, etc.), and then take them one by one and do some processing.
One approach would be to encode them using VLE.
From Google's protobuf documentation (Creative Commons licence):
Varints are a method of serializing integers using
one or more bytes. Smaller numbers take a smaller number of bytes.
Each byte in a varint, except the last byte, has the most significant
bit (msb) set – this indicates that there are further bytes to come.
The lower 7 bits of each byte are used to store the two's complement
representation of the number in groups of 7 bits, least significant
group first.
So, for example, here is the number 1 – it's a single byte, so the msb
is not set:
0000 0001
And here is 300 – this is a bit more complicated:
1010 1100 0000 0010
How do you figure out that this is 300? First you drop the msb from
each byte, as this is just there to tell us whether we've reached the
end of the number (as you can see, it's set in the first byte as there
is more than one byte in the varint).
Pros
If you have lots of small numbers, you'll probably use less than 40 bits per integer, on average. Possibly much less.
You are able to store bigger numbers (with more than 40 bits) in the future, without having to pay a penalty for the small ones
Cons
You pay an extra bit for each 7 significant bits of your numbers. That means a number with 40 significant bits will need 6 bytes. If most of your numbers have 40 significant bits, you are better off with a bit field approach.
You will lose the ability to easily jump to a number given its index (you have to at least partially parse all previous elements in an array in order to access the current one).
You will need some form of decoding before doing anything useful with the numbers (although that is true for other approaches as well, like bit fields)
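For concreteness, a minimal sketch of the two loops (unsigned values only; the helper names are mine, not protobuf's):

#include <cstdint>
#include <vector>

// append the varint encoding of v to out
void encode_varint(uint64_t v, std::vector<unsigned char> &out) {
    while (v >= 0x80) {
        out.push_back((unsigned char)((v & 0x7F) | 0x80)); // msb set: more bytes follow
        v >>= 7;
    }
    out.push_back((unsigned char)v); // final byte, msb clear
}

// decode one varint starting at p, advancing p past it
uint64_t decode_varint(const unsigned char *&p) {
    uint64_t v = 0;
    for (int shift = 0; ; shift += 7) {
        v |= (uint64_t)(*p & 0x7F) << shift;
        if (!(*p++ & 0x80))
            break;
    }
    return v;
}

Encoding 300 with this produces exactly the 1010 1100 0000 0010 byte pair shown in the quote above.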
(Edit: First of all - what you want is possible, and makes sense in some cases; I have had to do similar things when I tried to do something for the Netflix challenge and only had 1GB of memory; Second - it is probably best to use a char array for the 40-bit storage to avoid any alignment issues and the need to mess with struct packing pragmas; Third - this design assumes that you're OK with 64-bit arithmetic for intermediate results, it is only for large array storage that you would use Int40; Fourth: I don't get all the suggestions that this is a bad idea, just read up on what people go through to pack mesh data structures and this looks like child's play by comparison).
What you want is a struct that is only used for storing data as 40-bit ints but implicitly converts to int64_t for arithmetic. The only trick is doing the sign extension from 40 to 64 bits right. If you're fine with unsigned ints, the code can be even simpler. This should be able to get you started.
#include <cstdint>
#include <iostream>

// Only intended for storage, automatically promotes to 64-bit for evaluation
struct Int40
{
    Int40(int64_t x) { set(static_cast<uint64_t>(x)); } // implicit constructor
    operator int64_t() const { return get(); }          // implicit conversion to 64-bit

private:
    void set(uint64_t x)
    {
        setb<0>(x); setb<1>(x); setb<2>(x); setb<3>(x); setb<4>(x);
    }
    int64_t get() const
    {
        return static_cast<int64_t>(getb<0>() | getb<1>() | getb<2>() | getb<3>() | getb<4>() | signx());
    }
    uint64_t signx() const
    {
        return (data[4] >> 7) * (uint64_t(((1 << 25) - 1)) << 39);
    }
    template <int idx> uint64_t getb() const
    {
        return static_cast<uint64_t>(data[idx]) << (8 * idx);
    }
    template <int idx> void setb(uint64_t x)
    {
        data[idx] = (x >> (8 * idx)) & 0xFF;
    }

    unsigned char data[5];
};

int main()
{
    Int40 a = -1;
    Int40 b = -2;
    Int40 c = 1 << 16;
    std::cout << "sizeof(Int40) = " << sizeof(Int40) << std::endl;
    std::cout << a << "+" << b << "=" << (a+b) << std::endl;
    std::cout << c << "*" << c << "=" << (c*c) << std::endl;
}
Here is the link to try it live: http://rextester.com/QWKQU25252
You can use a bit-field structure, but it's not going to save you any memory:
struct my_struct
{
    unsigned long long a : 40;
    unsigned long long b : 24;
};
You can squeeze eight such 40-bit variables (or any multiple of eight) into one structure:
struct bits_16_16_8
{
    unsigned short x : 16;
    unsigned short y : 16;
    unsigned short z : 8;
};

struct bits_8_16_16
{
    unsigned short x : 8;
    unsigned short y : 16;
    unsigned short z : 16;
};

struct my_struct
{
    struct bits_16_16_8 a1;
    struct bits_8_16_16 a2;
    struct bits_16_16_8 a3;
    struct bits_8_16_16 a4;
    struct bits_16_16_8 a5;
    struct bits_8_16_16 a6;
    struct bits_16_16_8 a7;
    struct bits_8_16_16 a8;
};
This will save you some memory (in comparison with using 8 "standard" 64-bit variables), but you will have to split every operation (and in particular arithmetic ones) on each of these variables into several operations.
So the memory-optimization will be "traded" for runtime-performance.
As the comments suggest, this is quite a task.
Probably an unnecessary hassle unless you want to save a lot of RAM - then it makes much more sense. (The RAM saving would be the sum total of bits saved across millions of long values stored in RAM.)
I would consider using an array of 5 bytes/char (5 * 8 bits = 40 bits). Then you will need to shift bits from your (overflowed int - hence a long) value into the array of bytes to store them.
To use the values, then shift the bits back out into a long and you can use the value.
Then your RAM and file storage of the value will be 40 bits (5 bytes), BUT you must consider data alignment if you plan to use a struct to hold the 5 bytes. Let me know if you need elaboration on this bit shifting and data alignment implications.
Similarly, you could use the 64 bit long, and hide other values (3 chars perhaps) in the residual 24 bits that you do not want to use. Again - using bit shifting to add and remove the 24 bit values.
Another variation that may be helpful would be to use a structure:
typedef struct TRIPLE_40 {
uint32_t low[3];
uint8_t hi[3];
uint8_t padding;
};
Such a structure would take 16 bytes and, if 16-byte aligned, would fit entirely within a single cache line. While identifying which of the parts of the structure to use may be more expensive than it would be if the structure held four elements instead of three, accessing one cache line may be much cheaper than accessing two. If performance is important, one should run some benchmarks, since some machines may perform a divmod-3 operation cheaply and have a high cost per cache-line fetch, while others might have cheaper memory access and more expensive divmod-3.
If you have to deal with billions of integers, I'd try to encapsulate arrays of 40-bit numbers instead of single 40-bit numbers. That way, you can test different array implementations (e.g. an implementation that compresses data on the fly, or maybe one that stores less-used data to disk) without changing the rest of your code.
Here's a sample implementation (http://rextester.com/SVITH57679):
class Int64Array
{
    char* buffer;
public:
    static const int BYTE_PER_ITEM = 5;

    Int64Array(size_t s)
    {
        buffer = (char*)malloc(s * BYTE_PER_ITEM);
    }
    ~Int64Array()
    {
        free(buffer);
    }

    class Item
    {
        char* dataPtr;
    public:
        Item(char* dataPtr) : dataPtr(dataPtr) {}

        inline operator int64_t()
        {
            int64_t value = 0;
            memcpy(&value, dataPtr, BYTE_PER_ITEM); // Assumes little endian byte order!
            return value;
        }

        inline Item& operator = (int64_t value)
        {
            memcpy(dataPtr, &value, BYTE_PER_ITEM); // Assumes little endian byte order!
            return *this;
        }
    };

    inline Item operator[](size_t index)
    {
        return Item(buffer + index * BYTE_PER_ITEM);
    }
};
Note: The memcpy-conversion from 40-bit to 64-bit is basically undefined behavior, as it assumes little-endianness. It should work on x86 platforms, though.
Note 2: Obviously, this is proof-of-concept code, not production-ready code. To use it in real projects, you'd have to add (among other things):
error handling (malloc can fail!)
copy constructor (e.g. by copying the data, adding reference counting, or making the copy constructor private)
move constructor
const overloads
STL-compatible iterators
bounds checks for indices (in debug build)
range checks for values (in debug build)
asserts for the implicit assumptions (little-endianness)
As it is, Item has reference semantics, not value semantics, which is unusual for operator[]; You could probably work around that with some clever C++ type conversion tricks
All of those should be straightforward for a C++ programmer, but they would make the sample code much longer without making it clearer, so I've decided to omit them.
I'll assume that
this is C, and
you need a single, large array of 40 bit numbers, and
you are on a machine that is little-endian, and
your machine is smart enough to handle alignment
you have defined size to be the number of 40-bit numbers you need
unsigned char hugearray[5*size+3];  // +3 avoids overfetch of last element

__int64 get_huge(unsigned index)
{
    __int64 t;
    t = *(__int64 *)(&hugearray[index*5]);
    if (t & 0x0000008000000000LL)
        t |= 0xffffff0000000000LL;
    else
        t &= 0x000000ffffffffffLL;
    return t;
}

void set_huge(unsigned index, __int64 value)
{
    unsigned char *p = &hugearray[index*5];
    *(long *)p = value;
    p[4] = (value >> 32);
}
It may be faster to handle the get with two shifts.
__int64 get_huge(unsigned index)
{
    return (((*(__int64 *)(&hugearray[index*5])) << 24) >> 24);
}
For the case of storing some billions of 40-bit signed integers, and assuming 8-bit bytes, you can pack 8 40-bit signed integers in a struct (in the code below using an array of bytes to do that), and, since this struct is ordinarily aligned, you can then create a logical array of such packed groups, and provide ordinary sequential indexing of that:
#include <limits.h>   // CHAR_BIT
#include <stdint.h>   // int64_t
#include <stdlib.h>   // div, div_t, ptrdiff_t
#include <vector>     // std::vector

#define STATIC_ASSERT( e ) static_assert( e, #e )

namespace cppx {
    using Byte = unsigned char;
    using Index = ptrdiff_t;
    using Size = Index;

    // For non-negative values:
    auto roundup_div( const int64_t a, const int64_t b )
        -> int64_t
    { return (a + b - 1)/b; }
}  // namespace cppx

namespace int40 {
    using cppx::Byte;
    using cppx::Index;
    using cppx::Size;
    using cppx::roundup_div;
    using std::vector;

    STATIC_ASSERT( CHAR_BIT == 8 );
    STATIC_ASSERT( sizeof( int64_t ) == 8 );

    const int bits_per_value = 40;
    const int bytes_per_value = bits_per_value/8;

    struct Packed_values
    {
        enum{ n = sizeof( int64_t ) };
        Byte bytes[n*bytes_per_value];

        auto value( const int i ) const
            -> int64_t
        {
            int64_t result = 0;
            for( int j = bytes_per_value - 1; j >= 0; --j )
            {
                result = (result << 8) | bytes[i*bytes_per_value + j];
            }
            const int64_t first_negative = int64_t( 1 ) << (bits_per_value - 1);
            if( result >= first_negative )
            {
                result = (int64_t( -1 ) << bits_per_value) | result;
            }
            return result;
        }

        void set_value( const int i, int64_t value )
        {
            for( int j = 0; j < bytes_per_value; ++j )
            {
                bytes[i*bytes_per_value + j] = value & 0xFF;
                value >>= 8;
            }
        }
    };
    STATIC_ASSERT( sizeof( Packed_values ) == bytes_per_value*Packed_values::n );

    class Packed_vector
    {
    private:
        Size size_;
        vector<Packed_values> data_;

    public:
        auto size() const -> Size { return size_; }

        auto value( const Index i ) const
            -> int64_t
        {
            const auto where = div( i, Packed_values::n );
            return data_[where.quot].value( where.rem );
        }

        void set_value( const Index i, const int64_t value )
        {
            const auto where = div( i, Packed_values::n );
            data_[where.quot].set_value( where.rem, value );
        }

        Packed_vector( const Size size )
            : size_( size )
            , data_( roundup_div( size, Packed_values::n ) )
        {}
    };
}  // namespace int40

#include <iostream>
auto main() -> int
{
    using namespace std;
    cout << "Size of struct is " << sizeof( int40::Packed_values ) << endl;
    int40::Packed_vector values( 25 );
    for( int i = 0; i < values.size(); ++i )
    {
        values.set_value( i, i - 10 );
    }
    for( int i = 0; i < values.size(); ++i )
    {
        cout << values.value( i ) << " ";
    }
    cout << endl;
}
Yes, you can do that, and it will save some space for large quantities of numbers.
You need a class that contains a std::vector of an unsigned integer type.
You will need member functions to store and to retrieve an integer. For example, if you want to store 64 integers of 40 bits each, use a vector of 40 integers of 64 bits each. Then you need a method that stores an integer with an index in [0,64) and a method to retrieve such an integer.
These methods will execute some shift operations, and also some binary | and & .
I am not adding any more details here yet because your question is not very specific. Do you know how many integers you want to store? Do you know it during compile time? Do you know it when the program starts? How should the integers be organized? Like an array? Like a map? You should know all this before trying to squeeze the integers into less storage.
There are quite a few answers here covering implementation, so I'd like to talk about architecture.
We usually expand 32-bit values to 64-bit values to avoid overflowing because our architectures are designed to handle 64-bit values.
Most architectures are designed to work with integers whose size is a power of 2 because this makes the hardware vastly simpler. Tasks such as caching are much simpler this way: there are a large number of divisions and modulus operations which can be replaced with bit masking and shifts if you stick to powers of 2.
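For instance, the remainder and quotient by 64 reduce to single bit operations (for non-negative x):

x % 64 == (x & 63)  // remainder via mask
x / 64 == (x >> 6)  // quotient via shift

There is no such shortcut for a stride of 40, so every index computation into an array of 40-bit words needs a genuine multiply (and a divide to go back).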
As an example of just how much this matters, the C++11 specification defines multithreading data races in terms of "memory locations." A memory location is defined in 1.7.3:
A memory location is either an object of scalar type or a maximal
sequence of adjacent bit-fields all having non-zero width.
In other words, if you use C++'s bitfields, you have to do all of your multithreading carefully. Two adjacent bitfields must be treated as the same memory location, even if you wish computations across them could be spread across multiple threads. This is very unusual for C++, so likely to cause developer frustration if you have to worry about it.
Most processors have a memory architecture which fetches 32-bit or 64-bit blocks of memory at a time. Thus use of 40-bit values will have a surprising number of extra memory accesses, dramatically affecting run-time. Consider the alignment issues:
40-bit word to access:   32-bit accesses   64-bit accesses
word 0: [0,40)                  2                 1
word 1: [40,80)                 2                 2
word 2: [80,120)                2                 2
word 3: [120,160)               2                 2
word 4: [160,200)               2                 2
word 5: [200,240)               2                 2
word 6: [240,280)               2                 2
word 7: [280,320)               2                 1
On a 64 bit architecture, one out of every 4 words will be "normal speed." The rest will require fetching twice as much data. If you get a lot of cache misses, this could destroy performance. Even if you get cache hits, you are going to have to unpack the data and repack it into a 64-bit register to use it (which might even involve a difficult to predict branch).
It is entirely possible this is worth the cost
There are situations where these penalties are acceptable. If you have a large amount of memory-resident data which is well indexed, you may find the memory savings worth the performance penalty. If you do a large amount of computation on each value, you may find the costs are minimal. If so, feel free to implement one of the above solutions. However, here are a few recommendations.
Do not use bitfields unless you are ready to pay their cost. For example, if you have an array of bitfields, and wish to divide it up for processing across multiple threads, you're stuck. By the rules of C++11, the bitfields all form one memory location, so may only be accessed by one thread at a time (this is because the method of packing the bitfields is implementation defined, so C++11 can't help you distribute them in a non-implementation defined manner)
Do not use a structure containing a 32-bit integer and a char to make 40 bits. Most processors will enforce alignment and you won't save a single byte.
Do use homogeneous data structures, such as an array of chars or an array of 64-bit integers. It is far easier to get the alignment correct. (And you also retain control of the packing, which means you can divide an array up amongst several threads for computation if you are careful)
Do design separate solutions for 32-bit and 64-bit processors, if you have to support both platforms. Because you are doing something very low level and very ill-supported, you'll need to custom tailor each algorithm to its memory architecture.
Do remember that multiplication of 40-bit numbers is different from multiplication of 64-bit expansions of 40-bit numbers reduced back to 40-bits. Just like when dealing with the x87 FPU, you have to remember that marshalling your data between bit-sizes changes your result.
This begs for streaming in-memory lossless compression. If this is for a Big Data application, dense packing tricks are tactical solutions at best for what seems to require fairly decent middleware or system-level support. They'd need thorough testing to make sure one is able to recover all the bits unharmed. And the performance implications are highly non-trivial and very hardware-dependent because of interference with the CPU caching architecture (e.g. cache lines vs packing structure). Someone mentioned complex meshing structures: these are often fine-tuned to cooperate with particular caching architectures.
It's not clear from the requirements whether the OP needs random access. Given the size of the data it's more likely one would only need local random access on relatively small chunks, organised hierarchically for retrieval. Even the hardware does this at large memory sizes (NUMA). Like lossless movie formats show, it should be possible to get random access in chunks ('frames') without having to load the whole dataset into hot memory (from the compressed in-memory backing store).
I know of one fast database system (kdb from KX Systems, to name one, but I know there are others) that can handle extremely large datasets by seamlessly memory-mapping large datasets from backing store. It has the option to transparently compress and expand the data on the fly.
If what you really want is an array of 40 bit integers (which obviously you can't have), I'd just combine one array of 32 bit and one array of 8 bit integers.
To read a value x at index i:

uint64_t x = (((uint64_t)array8[i]) << 32) + array32[i];

To write a value x to index i:

array8[i] = x >> 32;
array32[i] = x;
Obviously nicely encapsulated into a class using inline functions for maximum speed.
There is one situation where this is suboptimal, and that is when you do truly random access to many items, so that each access to an int array would be a cache miss - here you would get two cache misses every time. To avoid this, define a 32-byte struct containing an array of six uint32_t, an array of six uint8_t, and two unused bytes (42 2/3 bits per number); the code to access an item is slightly more complicated, but both components of the item are in the same cache line.
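A minimal sketch of that struct (the names are illustrative, not from any library):

#include <cstdint>

// 32 bytes per bucket: the low and high parts of any of the six values
// live in the same struct, hence the same cache line (given alignment)
struct Bucket40 {
    uint32_t low[6];
    uint8_t  hi[6];
    uint8_t  unused[2];
};
static_assert(sizeof(Bucket40) == 32, "bucket should be 32 bytes");

// i in [0,6)
uint64_t get40(const Bucket40 &b, int i) {
    return ((uint64_t)b.hi[i] << 32) | b.low[i];
}

void set40(Bucket40 &b, int i, uint64_t v) {
    b.low[i] = (uint32_t)v;
    b.hi[i]  = (uint8_t)(v >> 32);
}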
I'm new to programming.
I want to know exactly what rand() does.
Searching only yields examples on its usage. But none explain each step of how the function generates a random number. They treat rand() as a blackbox.
I want to know what rand() is doing; each step.
Is there a resource that will allow me to see exactly what rand() does?
This is all open source stuff, isn't it? I'll settle for the disassembly, if there's no source.
I know it returns a random number, but how does it generate that number? I want to see each step.
Thank you.
Here is the current glibc implementation:
/* Return a random integer between 0 and RAND_MAX.  */
int
rand (void)
{
  return (int) __random ();
}
That's not much help, but __random eventually calls __random_r:
/* If we are using the trivial TYPE_0 R.N.G., just do the old linear
   congruential bit.  Otherwise, we do our fancy trinomial stuff, which is the
   same in all the other cases due to all the global variables that have been
   set up.  The basic operation is to add the number at the rear pointer into
   the one at the front pointer.  Then both pointers are advanced to the next
   location cyclically in the table.  The value returned is the sum generated,
   reduced to 31 bits by throwing away the "least random" low bit.

   Note: The code takes advantage of the fact that both the front and
   rear pointers can't wrap on the same call by not testing the rear
   pointer if the front one has wrapped.  Returns a 31-bit random number.  */

int
__random_r (buf, result)
     struct random_data *buf;
     int32_t *result;
{
  int32_t *state;

  if (buf == NULL || result == NULL)
    goto fail;

  state = buf->state;

  if (buf->rand_type == TYPE_0)
    {
      int32_t val = state[0];
      val = ((state[0] * 1103515245) + 12345) & 0x7fffffff;
      state[0] = val;
      *result = val;
    }
  else
    {
      int32_t *fptr = buf->fptr;
      int32_t *rptr = buf->rptr;
      int32_t *end_ptr = buf->end_ptr;
      int32_t val;

      val = *fptr += *rptr;
      /* Chucking least random bit.  */
      *result = (val >> 1) & 0x7fffffff;

      ++fptr;
      if (fptr >= end_ptr)
        {
          fptr = state;
          ++rptr;
        }
      else
        {
          ++rptr;
          if (rptr >= end_ptr)
            rptr = state;
        }

      buf->fptr = fptr;
      buf->rptr = rptr;
    }
  return 0;

 fail:
  __set_errno (EINVAL);
  return -1;
}
This was 10 seconds of googling:
gcc implementation of rand()
How is the rand()/srand() function implemented in C
implementation of rand()
...
http://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a01206.html
http://www.gnu.org/software/libc/manual/html_node/Pseudo_002dRandom-Numbers.html
I was gonna list the actual search, but seeing this is clearly a dupe, I'll just vote as dupe
You can browse the source code for different implementations of the C standard.
The question has been answered before, you might find what you're looking for at What common algorithms are used for C's rand()?
That answer provides code for glibc's implementation of rand()
The simplest reasonably good pseudo-random number generators are Linear Congruential Generators (LCGs). These are iterations of a formula such as
X_{n+1} = (a * X_n + c) mod m

The constants a, c, and m are chosen to give unpredictable sequences. X_0 is the random seed value. Many other algorithms exist, but this is probably enough to get you going.
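A minimal LCG in C++, using the constants from the glibc TYPE_0 generator quoted in the other answer (the state*1103515245 + 12345 line), so it produces the same kind of sequence:

#include <cstdint>

// X_{n+1} = (a*X_n + c) mod m, with m = 2^31 via the & 0x7fffffff mask
struct Lcg {
    uint32_t state;
    explicit Lcg(uint32_t seed) : state(seed) {}
    uint32_t next() {
        state = (state * 1103515245u + 12345u) & 0x7fffffffu;
        return state;
    }
};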
Really good pseudo-random number generators are more complex, such as the Mersenne Twister.
I guess THIS is what you are looking for. It contains a detailed explanation of the random function, and a simple C program to understand the algorithm.
Edit:
You should check THIS as well. A possible duplicate.
Well, I believe rand is from the C standard library, not the C++ standard library. There is no one implementation of either library, there are several.
You could go somewhere like this page to view the source code for glibc, the C library used on most Linux distributions. For glibc you'd find it in source files under stdlib such as rand.c and random.c.
A different implementation, such as uClibc might be easier to read. Try here under the libc/stdlib folder.
Correct me if I'm wrong, but although this answer points to part of the implementation, I found that there is more to the rand() used in stdlib, which is from glibc. From the 2.32 version obtained from here, the stdlib folder contains a random.c file which explains that a simple linear congruential algorithm is used. The folder also has rand.c and rand_r.c which can show you more of the source code. stdlib.h contained in the same folder will show you the values used for macros like RAND_MAX.
/* An improved random number generation package.  In addition to the
   standard rand()/srand() like interface, this package also has a
   special state info interface.  The initstate() routine is called
   with a seed, an array of bytes, and a count of how many bytes are
   being passed in; this array is then initialized to contain
   information for random number generation with that much state
   information.  Good sizes for the amount of state information are
   32, 64, 128, and 256 bytes.  The state can be switched by calling
   the setstate() function with the same array as was initialized with
   initstate().  By default, the package runs with 128 bytes of state
   information and generates far better random numbers than a linear
   congruential generator.  If the amount of state information is less
   than 32 bytes, a simple linear congruential R.N.G. is used.
   Internally, the state information is treated as an array of longs;
   the zeroth element of the array is the type of R.N.G. being used
   (small integer); the remainder of the array is the state
   information for the R.N.G.  Thus, 32 bytes of state information
   will give 7 longs worth of state information, which will allow a
   degree seven polynomial.  (Note: The zeroth word of state
   information also has some other information stored in it; see setstate
   for details).  The random number generation technique is a linear
   feedback shift register approach, employing trinomials (since there
   are fewer terms to sum up that way).  In this approach, the least
   significant bit of all the numbers in the state table will act as a
   linear feedback shift register, and will have period 2^deg - 1
   (where deg is the degree of the polynomial being used, assuming
   that the polynomial is irreducible and primitive).  The higher order
   bits will have longer periods, since their values are also
   influenced by pseudo-random carries out of the lower bits.  The
   total period of the generator is approximately deg*(2**deg - 1); thus
   doubling the amount of state information has a vast influence on the
   period of the generator.  Note: The deg*(2**deg - 1) is an
   approximation only good for large deg, when the period of the shift
   register is the dominant factor.  With deg equal to seven, the
   period is actually much longer than the 7*(2**7 - 1) predicted by
   this formula.  */
In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?
Yes, it is possible even without trickery:
sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.
lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.
caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
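As an illustration of the caching idea, a minimal memoised wrapper (only worthwhile if the same inputs really do recur, and note that the hash lookup itself is not free):

#include <cmath>
#include <unordered_map>

double cached_sqrt(double x) {
    static std::unordered_map<double, double> cache; // input -> sqrt(input)
    auto it = cache.find(x);
    if (it == cache.end())
        it = cache.emplace(x, std::sqrt(x)).first;
    return it->second;
}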
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now even more than then is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this - the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray: it is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, which would probably translate pretty well to other architectures too.
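For example, a sketch with the older SSE intrinsic _mm_sqrt_ps (four floats per instruction; the AVX-512 _mm512_sqrt_ps version is analogous with sixteen):

#include <immintrin.h>
#include <cmath>
#include <cstddef>

// compute out[i] = sqrt(in[i]) four elements at a time
void sqrt_array(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_sqrt_ps(_mm_loadu_ps(in + i)));
    for (; i < n; ++i)           // scalar tail for the remainder
        out[i] = std::sqrt(in[i]);
}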
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE's sqrtss is about 2x faster than FPU fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, except if they measured with a stopwatch.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
    _asm fld qword ptr [esp+4]
    _asm fsqrt
    _asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
#include <math.h>    // sqrtf
#include <stdint.h>  // uint32_t

typedef uint32_t UINT32;
#define FALSE 0
#define TRUE  1

// LUT for fast sqrt of floats. Table will consist of 2 parts, half for sqrt(X) and half for sqrt(2X).
const int nBitsForSQRTprecision = 11;                    // Use only the 11 most significant bits from the 23 of a float. We could use 15 bits instead: less error but more space in memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision;      // Amount of bits we will disregard
const int tableSize = (1 << (nBitsForSQRTprecision+1));  // 2^nBits*2 because we have 2 halves of the table.
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = FALSE;     // Once initialized will be true

// Table of precalculated sqrt() for future fast calculation. Approximates the exact with an error of about 0.5%
// Note: To access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void){
    unsigned short i;
    float f;
    UINT32 *fi = (UINT32*)&f;

    if (is_sqrttab_initialized)
        return;

    const int halfTableSize = (tableSize>>1);
    for (i=0; i < halfTableSize; i++){
        *fi = 0;
        *fi = (i << nUnusedBits) | (127 << 23); // Build a float with the bit pattern i as mantissa, and an exponent of 0, stored as 127

        // Take the square root then strip the first 'nBitsForSQRTprecision' bits of the mantissa into the table
        f = sqrtf(f);
        sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);

        // Repeat the process, this time with an exponent of 1, stored as 128
        *fi = 0;
        *fi = (i << nUnusedBits) | (128 << 23);
        f = sqrtf(f);
        sqrtTab[i+halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = TRUE;
}

// Calculation of a square root. Divide the exponent of float by 2 and sqrt() its mantissa using the precalculated table.
float fast_float_sqrt(float n){
    if (n <= 0.f)
        return 0.f; // On 0 or negative return 0.

    UINT32 *num = (UINT32*)&n;
    short e;                  // Exponent
    e = (*num >> 23) - 127;   // In 'float' the exponent is stored with 127 added.
    *num &= 0x7fffff;         // leave only the mantissa

    // If the exponent is odd we have to look it up in the second half of the lookup table, so we set the high bit.
    const int halfTableSize = (tableSize>>1);
    const int secondHalfTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01)
        *num |= secondHalfTableIdBit;
    e >>= 1; // Divide the exponent by two (note that in C the shift operators are sign preserving for signed operands)

    // Do the table lookup, based on the quaternary mantissa, then reconstruct the result back into a float
    *num = ((sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((e + 127) << 23);
    return n;
}