Fast hash function for long string key

Fast hash function for long string key - c++

I am using an extendible hash and I want to have strings as keys. The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad for the program's performance since the hash function is called multiple times especially when I am splitting buckets.
Current hash function
int hash(const string& key)
{
int seed = 131;
unsigned long hash = 0;
for(unsigned i = 0; i < key.length(); i++)
{
hash = (hash * seed) + key[i];
}
return hash;
}
The keys could be as long as 40 characters.
Example of string/key
string key = "from-to condition"
I have searched over the internet for a better one but I didn't find anything to match my case. Any suggestions?

You should prefer to use std::hash unless measurement shows that you can do better. To limit the number of characters it uses, use something like:
const auto limit = min(key.length(), 16);
for(unsigned i = 0; i < limit; i++)
You will want to experiment to find the best value of 16 to use.
I would actually expect the performance to get worse (because you will have more collisions). If your strings were several k, then limiting to the first 64 bytes might well be worth while.
Depending on your strings, it might be worth starting not at the beginning. For example, hashing filenames you would probably do better using the characters between 20 and 5 from the end (ignore the often constant pathname prefix, and the file extension). But you still have to measure.

I am using an extendible hash and I want to have strings as keys.
As mentioned before, use std::hash until there is a good reason not to.
The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad...
It's an understandable thought, but is actually unlikely to be a real concern.
(anticipating) why?
A quick scan over stack overflow will reveal many experienced developers talking about caches and cache lines.
(forgive me if I'm teaching my grandmother to suck eggs)
A modern CPU is incredibly quick at processing instructions and performing (very complex) arithmetic. In almost all cases, what limits its performance is having to talk to memory across a bus, which is by comparison, horribly slow.
So chip designers build in memory caches - extremely fast memory that sits in the CPU (and therefore does not have to be communicated with over a slow bus). Unhappily, there is only so much space available [plus heat constraints - a topic for another day] for this cache memory so the CPU must treat it like an OS does a disk cache, flushing memory and reading in memory as and when it needs to.
As mentioned, communicating across the bus is slow - (simply put) it requires all the electronic components on the motherboard to stop and synchronise with each other. This wastes a horrible amount of time [This would be a fantastic point to go into a discussion about the propagation of electronic signals across the motherboard being constrained by approximately half the speed of light - it's fascinating but there's only so much space here and I have only so much time]. So rather than transfer one byte, word or longword at a time, the memory is accessed in chunks - called cache lines.
It turns out that this is a good decision by chip designers because they understand that most memory is accessed sequentially - because most programs spend most of their time accessing memory linearly (such as when calculating a hash, comparing strings or objects, transforming sequences, copying and initialising sequences and so on).
What's the upshot of all this?
Well, bizarrely, if your string is not already in-cache it turns of that reading one byte of it is almost exactly as expensive as reading all the bytes in the first (say) 128 bytes of it.
Plus, because the cache circuitry assumes that memory access is linear, it will begin a fetch of the next cache line as soon as it has fetched your first one. It will do this while your CPU is performing its hash computation.
I hope you can see that in this case, even if your string was many thousands of bytes long, and you chose to only hash (say) every 128th byte, all you would be doing would be to compute a very much inferior hash which still causing the memory cache to halt the processor while it fetched large chunks of unused memory. It would take just as long - for a worse result!
Having said that, what are good reasons not to use the standard implementation?
Only when:
The users are complaining that your software is too slow to be useful, and
The program is verifiably CPU-bound (using 100% of CPU time), and
The program is not wasting any cycles by spinning, and
Careful profiling has revealed that the program's biggest bottleneck is the hash function, and
Independent analysis by another experienced developer confirms that there is no way to improve the algorithm (for example by calling hash less often).
In short, almost never.

You can directly use std::hashlink instead of implementing your own function.
#include <iostream>
#include <functional>
#include <string>
size_t hash(const std::string& key)
{
std::hash<std::string> hasher;
return hasher(key);
}
int main() {
std::cout << hash("abc") << std::endl;
return 0;
}
See this code here: https://ideone.com/4U89aU

Related

is using an integer to store many bool worth the effort?

I was considering ways to reduce memory footprint, and it is constantly mentioned that a bool takes up more memory than it logically needs to, as a byproduct of processor design.
it is also sometimes mentioned that one could store several bool within an int.
I am wondering if this would actually be more memory efficient?
if we have a usecase where we can use a significant portion of 32 (or 64) bool. and we decide to store all of them in a single int. then on the surface we have saved
7 (bits) * 32 (size of int) = 224 (bits) or 28 (bytes)
but in order to get each of those bits from the int, we needed to use some method of masking
such as:
bit shifting the int both directions (int<<x)>>y here we need to load and store x,y which are probably an int, but you could get them smaller depending on the use case
masking the int: int & int2 here we also store an additional int, which is stored and loaded
even if these aren't stored as variables, and they are defined statically within the code, it still ends up using additional memory, as it will increase the memory footprint of the instructions. as well as the instructions for the masking steps.
is there any way to do this that isn't actually worse for memory usage than just taking the hit on 7 wasted bits?

You are describing a text book example of a trade-off.
Yes, several bools in one int is hugeley more memory efficient - in itself.
Yes, you need to spend code to use that.
Yes, for only a few bools (for different values of "few"), the code might take more space than you save.
However, you could look at the kind of memory which is used. In some environments, RAM (which is saved by your idea) is much more expensive than ROM (which has to be paid for your idea).
Also, the price to pay is mostly paid once for implementation and only paid a fraction for using, especially when the using code is reused, e.g. in loops.
Altogether, in case of many bools, you can save more than you pay.
The point of actually saving needs to be determined for the special case.
On the other hand, you have missed on "currency" on the price-tag for the idea. You not only pay in memory, you also pay in execution time. You focused your question on memory, so I won't elaborate here. But for anything time critical, you should take the longer execution time into conisderation. You might find that saving memory is quite achievable with your idea, but the whole thing gets unbearably slow.
Again from the other side, as Eric Postpischil points out in a comment, execution speed can also improve due to cache effects from better memory footprint.

I am wondering if this would actually be more memory efficient?
Potentially yes. Storing multiple bools inside a single object may use less storage compared to having distinct bool object for each, if the number of bools is great enough to offset the cost in memory use of the instructions.
Also consider that there are more considerations than space efficiency. Most typically, people are concerned about time efficiency as well. In this regard, compacting bools may more or less efficient depending on the details of the use case.
is using an integer to store many bool worth the effort?
It can be worth the effort. It can also be counter productive. The difference can be minuscule or significant. Both in terms of time and space efficiency. Most accurate way to find out is to measure it.
It's not necessary to implement this yourself though, since there are solutions in the standard library. std::vector<bool> and std::bitset both implement compact storage of bools. Using bitfields may also be an option (just remember to not rely on the internal representation).

How to set number of threads in C++

I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program grainSize is a parameter. Since grainSize or the number of threads which can spawn is a system dependent feature. Therefore, I am not getting what should be the optimal value to which I should set the grainSize to? I work on Linux?
int compare(const char*,const char*)
{
//some complex user defined logic
}
void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
if(len < grainsize)
{
std::sort(data, data + len, compare);
}
else
{
auto future = std::async(multThreadedSort, data, len/2, grainsize);
multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
future.wait();
std::inplace_merge(data, data + len/2, data + len, compare);
}
}
int main(int argc, char** argv) {
vector<unsigned> items;
int grainSize=10;
multThreadedSort(items.begin(),items.size(),grainSize);
std::sort(items.begin(),items.end(),CompareSorter(compare));
return 0;
}
I need to perform multi-threaded sorting. So, that for sorting large vectors I can take advantage of multiple cores present in today's processor. If anyone is aware of an efficient algorithm then please do share.
I dont know why the value returned by multiThreadedSort() is not sorted, do you see some logical error in it, then please let me know about the same

This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread number is not equal to grainSize : it will depend on list size, and will potentially be much more than grainSize.
Just replace grainSize by :
unsigned int grainSize= std::max(items.size()/nThreads, 40);
The 40 is arbitrary but is there to avoid starting threads for sorting to few items which will be suboptimal (the time starting the thread will be larger than sorting the few items). It may be optimized by trial-and-error, and is potentially larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);

Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::futureshould already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "ten thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location beween 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).

Ideally, you cannot assume more than n*2 thread running in your system. n is number of CPU cores.
Modern OS uses concept of Hyperthreading. So, now on one CPU at a time can run 2 threads.
As mentioned in another answer, in C++11 you can get optimal number of threads using std::thread::hardware_concurrency();

Ring buffer: Disadvantages by moving through memory backwards?

This is probably language agnostic, but I'm asking from a C++ background.
I am hacking together a ring buffer for an embedded system (AVR, 8-bit). Let's assume:
const uint8_t size = /* something > 0 */;
uint8_t buffer[size];
uint8_t write_pointer;
There's this neat trick of &ing the write and read pointers with size-1 to do an efficient, branchless rollover if the buffer's size is a power of two, like so:
// value = buffer[write_pointer];
write_pointer = (write_pointer+1) & (size-1);
If, however, the size is not a power of two, the fallback would probably be a compare of the pointer (i.e. index) to the size and do a conditional reset:
// value = buffer[write_pointer];
if (++write_pointer == size) write_pointer ^= write_pointer;
Since the reset occurs rather rarely, this should be easy for any branch prediction.
This assumes though that the pointers need to be advancing foreward in memory. While this is intuitive, it requires a load of size in every iteration. I assume that reversing the order (advancing backwards) would yield better CPU instructions (i.e. jump if not zero) in the regular case, since size is only required during the reset.
// value = buffer[--write_pointer];
if (write_pointer == 0) write_pointer = size;
so
TL;DR: My question is: Does marching backwards through memory have a negative effect on the execution time due to cache misses (since memory cannot simply be read forward) or is this a valid optimization?

You have an 8 bit avr with a cache? And branch prediction?
How does forward or backwards matter as far as caches are concerned? The hit or miss on a cache is anywhere within the cache line, beginning, middle, end, random, sequential, doesnt matter. You can work from the back to the front or the front to back of a cache line, it is the same cost (assuming all other things held constant) the first mist causes a fill, then that line is in cache and you can access any of the items in any pattern at a lower latency until evicted.
On a microcontroller like that you want to make the effort, even at the cost of throwing away some memory, to align a circular buffer such that you can mask. There is no cache the instruction fetches are painful because they are likely from a flash that may be slower than the processor clock rate, so you do want to reduce instructions executed, or make the execution a little more deterministic (same number of instructions every loop until that task is done). There might be a pipeline that would appreciate the masking rather than an if-then-else.
TL;DR: My question is: Does marching backwards through memory have a
negative effect on the execution time due to cache misses (since
memory cannot simply be read forward) or is this a valid optimization?
The cache doesnt care, a miss from any item in the line causes a fill, once in the cache any pattern of access, random, sequential forward or back, or just pounding on the same address, takes less time being in faster memory. Until evicted. Evictions wont come from neighboring cache lines they will come from cache lines larger powers of two away, so whether the next cache line you pull is at a higher address or lower, the cost is the same.

Does marching backwards through memory have a negative effect on the
execution time due to cache misses (since memory cannot simply be read
forward)
Why do you think that you will have a cache miss? You will have a cache miss if you try to access outside the cache (forward or backward).

There are a number of points which need clarification:
That size needs to be loaded each and every time (it's const, therefore ought to be immutable)
That your code is correct. For example in a 0-based index (as used in C/C++ for array access) the value 0' is a valid pointer into the buffer, and the valuesize' is not. Similarly there is no need to xor when you could simply assign 0, equally a modulo operator will work (writer_pointer = (write_pointer +1) % size).
What happens in the general case with virtual memory (i.e. the logically adjacent addresses might be all over the place in the real memory map), paging (stuff may well be cached on a page-by-page basis) and other factors (pressure from external processes, interrupts)
In short: this is the kind of optimisation that leads to more feet related injuries than genuine performance improvements. Additionally it is almost certainly the case that you get much, much better gains using vectorised code (SIMD).
EDIT: And in interpreted languages or JIT'ed languages it might be a tad optimistic to assume you can rely on the use of JNZ and others at all. At which point the question is, how much of a difference is there really between loading size and comparing versus comparing with 0.

As usual, when performing any form of manual code optimization, you must have extensive in-depth knowledge of the specific hardware. If you don't have that, then you should not attempt manual optimizations, end of story.
Thus, your question is filled with various strange assumptions:
First, you assume that write_pointer = (write_pointer+1) & (size-1) is more efficient than something else, such as the XOR example you posted. You are just guessing here, you will have to disassemble the code and see which yields the less CPU instructions.
Because, when writing code for a tiny, primitive 8-bit MCU, there is not much going on in the core to speed up your code. I don't know AVR8, but it seems likely that you have a small instruction pipe and that's it. It seems quite unlikely that you have much in the way of branch prediction. It seems very unlikely that you have a data and/or instruction cache. Read the friendly CPU core manual.
As for marching backwards through memory, it will unlikely have any impact at all on your program's performance. On old, crappy compilers you would get slightly more efficient code if the loop condition was a comparison vs zero instead of a value. On modern compilers, this shouldn't be an issue. As for cache memory concerns, I doubt you have any cache memory to worry about.
The best way to write efficient code on 8-bit MCUs is to stick to 8-bit arithmetic whenever possible and to avoid 32-bit arithmetic like the plague. And forget you ever heard about something called floating point. This is what will make your program efficient, you are unlikely to find any better way to manually optimize your code.

Insert MANY key value pairs fast into berkeley db with hash access

i'm trying to build a hash with berkeley db, which shall contain many tuples (approx 18GB of key value pairs), but in all my tests the performance of the insert operations degrades drastically over time. I've written this script to test the performance:
#include<iostream>
#include<db_cxx.h>
#include<ctime>
#define MILLION 1000000
int main () {
long long a = 0;
long long b = 0;
int passes = 0;
int i = 0;
u_int32_t flags = DB_CREATE;
Db* dbp = new Db(NULL,0);
dbp->set_cachesize( 0, 1024 * 1024 * 1024, 1 );
int ret = dbp->open(
NULL,
"test.db",
NULL,
DB_HASH,
flags,
0);
time_t time1 = time(NULL);
while ( passes < 100 ) {
while( i < MILLION ) {
Dbt key( &a, sizeof(long long) );
Dbt data( &b, sizeof(long long) );
dbp->put( NULL, &key, &data, 0);
a++; b++; i++;
}
DbEnv* dbep = dbp->get_env();
int tmp;
dbep->memp_trickle( 50, &tmp );
i=0;
passes++;
std::cout << "Inserted one million --> pass: " << passes << " took: " << time(NULL) - time1 << "sec" << std::endl;
time1 = time(NULL);
}
}
Perhaps you can tell me why after some time the "put" operation takes increasingly longer and maybe how to fix this.
Thanks for your help,
Andreas

You may want to look at the information provided by the db_stat utility and the HASH-specific tuning functions that are available. Please see BDB Reference Guide section on configuring a HASH database.
I would expect you to get 10s of thousands of inserts per second on commodity hardware. What are you experiencing and what is your performance target?
Regards,
Dave

I would recommend trying the bulk insert API, you can read about that in the documentation here:
http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/CXX/dbput.html#put_DB_MULTIPLE_KEY
Also, I would guess that your call to memp_trickle is responsible for most of the slowdown. As the cache becomes dirtier, finding pages to trickle becomes more expensive. In fact, since you are only writing, having a large cache only hurts (once you've written the data, you don't use it again, so you don't want it to hang around in the cache.) I would recommend testing different (smaller) cache sizes.
Finally, if your sole concern is insert performance, using a larger page size will help. You'll be able to fit more data on each page and that will result in fewer disk writes.
-Ben

The memp_trickle is almost certainly slowing things down. It's often good to use trickle, but it belongs in its own thread to be effective. BDB (except when you get into the higher level replication APIs) creates no threads for you -- nothing happens behind the scenes (thread-wise). trickle will be effective when you have dirty pages forced from the cache (look at a stat output to see if that's what's happening).
You might also consider using a BTREE instead of a HASH. Yeah, I know you specifically said hash, but why? If you are looking to maximize performance, why add that restriction? You may be able to take advantage of locality of reference to reduce your cache footprint - there is often much more locality that you believe, or perhaps you can create some -- if you generate keys that are random digits, for example, prepend the date and time. That usually introduces locality into a perceived 'random' system. If you do use btree, you'll need to pay some attention to byte ordering for your keys for your system (look up Endianness in Wikipedia), if you are using a Little Endian system, you'll need to swap the bytes. Using BTREE with the right ordering and introduced locality means your key/value pairs will be stored in 'key-generation-time' order, so if you see the most action on the recent keys, you'll be tending to hit the same pages over and over (see your cache hit rate in your stats). So you'll need less cache. Another way to think of it is with the same cache amount, your solution will scale by a larger multiple.
I expect your actual app really doesn't insert integers keys in order (if it does, you'll be lucky). So you should write a benchmark that closely simulates your access patterns, at least with respect to: size of key, size of data, access pattern, number of items in the database, mix of read/write. Once you have that, look at the stats -- pay close attention to anything that implies IO or contention.
BTW, I've recently started a blog at http://libdb.wordpress.com to discuss BDB performance tuning (and other matters related to BDB). You may get some good ideas there. There can be a huge difference of latency and throughput depending on the kind of tuning you do. In particular, see http://libdb.wordpress.com/2011/01/31/revving-up-a-benchmark-from-626-to-74000-operations-per-second/

Your performance degrade can have several reasons, which aren't actually connected with your code. I may be mistaken, but I think this is all about internal database structure (and used data structure).
Think of a situation where database uses some approach other than a hash table, for example an RB Tree. Inserting in that tree would take O(logN) in Big-O sense and every inserted element increases the time needed for the next insert.
Unfortunately, the same could happen with a plain hash-table, so that initial O(1) insertion operation time degrades to something worse. This could have several reasons, but it's all about hash collisions, which could happen due to wrong hash function, wrong data (that is bad for currently used hash function) or even due to the moon phase.
If I were you, I would try to dig into your db internal structure. Also, I think testing your keys with something other than your db (e.g boost::unordered_map) could also benefit your testing and profiling.
Edit: Also to mention, did you try to change that cache_size stuff in your sample? Or maybe there are some other performance-related parameters that could be modified?

Which is more efficient: Bit, byte or int?

Let's say that you were to have a structure similar to the following:
struct Person {
int gender; // betwwen 0-1
int age; // between 0-200
int birthmonth; // between 0-11
int birthday; // between 1-31
int birthdayofweek; // between 0-6
}
In terms of performance, which would be the best data type to store each of the fields? (e.g. bitfield, int, char, etc.)
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
Edit: OK, let me rephrase the question. If memory usage is not important, and the entire dataset will not fit into the cache no matter which datatypes are used, is it generally better to use smaller datatypes to fit more of the data into the CPU cache, or is it better to use larger datatypes to allow the CPU to perform faster operations? I am asking this for reference only, so code readability and such should not be considered.

Don't worry about it; use what is semantically correct, and most readable
There is no answer that is correct all the time. It depends on the platform and compiler. If you really care, then you have to test.
In general, I would stay stick with ints... except for gender which should probably be an enum.

int_fast#_t from <stdint.h> or <boost/cstdint.hpp>.
That said, you'll give up simplicity and consistency (these types may be a character type, for example, which are integer types in C/C++, and that might lead to surprising function resolutions) instead of just using an int.
You'll see much more significant performance benefits by concentrating on other areas, like algorithmic complexity and access patterns.
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
You still have to worry about cache (after you're at that level of optimization), even if the whole data won't be cached. For example, do you access every item in sequence? unpredictable? or just one field from every item in sequence? Compare struct { int a, b; } data[N]; to int data_a[N], data_b[N];. (Imagine you need all the 'a' at once, but can ignore the other, which way is more cache friendly?) Again, this doesn't sound like the main area on which you should focus.

Amount of bits used:
gender; 1/2 (2 if you want to include intersexuality :))
age; 8 (0-255)
birthmonth; 4 (16)
birthday; 5 (32)
birthdayofweek; 3 (8)
Bits at all: less than 22.
Knowing that it is running on x86 we have the int datatype with 32 bits.
So build your own routines which can accept an int
read(gender, int* pValue);
write(gender, int* pValue);
by using shift and bit mask operators to store and
retrieve the information. For the fields you can
use typesafe enums.
That is very fast and has an extremely low memory footprint.

it depends. Are you running out of memory? Then memory efficiency becomes paramount. Is it taking too long? Then CPU time, or at least perceived user response time, becomes paramount.

It depends. Many processors can directly access words, but require additional instructions to access octets or bits. Depending on the the compiler and processor, int might be the same as the addressable word. But is speed really going to matter? Readability and maintainability is likely to be more important.

In General
Each type has advantages and disadvantages, and specifically there are scenarios where each one will have the highest performance.
Addressable types (byte, char, short, int, and on x86-64 "long int") can all be loaded from memory in a single operation and so they have the least CPU overhead on a per-operation basis.
But, bit fields or flags packed into one or more bits might result in an overall faster program because:
they use the cache more efficiently, and this is a huge win, easily paying for a few extra cpu ops needed to unpack each item
they require fewer I/O operations to read in from disk, and this additional huge win easily pays for more CPU ops, even tho once again the cpu ops must be paid per item
Processor speeds have been advancing faster than disk and network speeds for decades, and now individual CPU ops are rarely a concern, particularly in your C/C++ case. You are already using the fastest code generator in the arsenal.
The in-RAM/not-in-cache scenario you mentioned
As it happens there is a still a cache factor to consider. Because the CPU is so fast, it is likely that execution time will be dominated by DRAM access on cache loads. If this is true, there is still an advantage to packing the data but it is dimished somewhat for a linear scan through the table. As it happens, modern DRAM is far more efficiently read in order, so you can fill an entire cache block in not much more time than is required to randomly read a single address. If execution time is dominated by an in-order traversal of the data structure, this works in your favor and would tend to flatten the performance difference between using addressable units and packed data structures.
Worry about important things
Finally, it's probably obvious but I will say it anyway: the data structure in terms of maps like hashes and trees, and the choice of algorithm typically has much more influence than machine ops tuning, which gives only an essentially linear optimization.
Worrying about memory bloat does matter, and it matters a lot if there is any possibility that your app won't fit in memory. Virtual storage turned out to be really important for protection and OS-kernel-level memory management, but one thing it never managed to do was allow programs to grow bigger than available RAM without bogging everything down.

Int is the fastest.
If you're using it in an array, you'll waste more memory however, so you may want to stick with byte in that case.

What Chris said. If this is a hypothetical program you're designing, trying to pick int versus uint8 at this stage isn't going to help you one bit in the long run. Focus your effort elsewhere.
If, when it comes down to it, you have a giant complex system you've made several rounds of optimizations on, and you're curious what the effect is, switching int to uint8 is probably (should be anyway) pretty easily anyway. At that stage you can make a statistically valid comparison in a real-world use case - not before.

enum Gender { Female, Male };
struct Date {...};
class Person
{
Gender gender;
Date date_of_birth;
//no age field - can be computed from date_of_birth
};

There are many reasons why either version could be faster.
Packing the structure by making each member into a minimal number of bits means less memory traffic and cache usage, which will cause a speedup. It also means more work to extract and access/modify individual fields, which will cost you performance. It will probably mean marginally more code as well, which will take up more of the instruction cache and so cost a bit of performance.
It's really impossible to say for sure which factor will be dominant. In most cases, it will be most efficient to use ints and bytes as needed. Don't bother with bitfields.
But in some cases, that won't be true. So measure, benchmark, profile. Find out how each implementation performs in your code.

In most cases you should stick to the default word size supported by the CPU. In many situations most CPUs will pad or require the compiler to pad values that are smaller than the default word size (so that structures can be word aligned), so any memory usage gain from a smaller data type can be rendered moot.
The main exception to this where you may have a large array, for example loading a file into an array of byte[] is preferable to int[].
In general model the problem cleanly and do the simple obvious thing first, then profile to see if you have memory issues. Looking at the case supplied, it is worth noting a couple of points:
The field age is a de-normalisation of the model and is a function of birthmonth, birthday, birthdayofweek, which could be represented as a single long birthtimestamp. So, just by looking at the model you can trim 8 bytes easily.
4 (bytes per int) * 5 (number of fields) * 50 000 (number of instances) = ~1MB. My laptop has 4MB L2 cache, you're unlikely to have memory issues.

This depends on exactly one thing:
What is the ratio of blindly copying the data and actually reading the data?
If you treat the data mostly as cookies, and seldom have to access an actual record, use bit fields which consume fewer space and are more IO-friendly.
If you access it a lot, and seldom copy it (as a part of container reallocation) or serialize it to disk, use larger integers that are CPU-friendly.

The processors word size is the smallest unit the computer can read/write to/from memory. If you read or write a bit, char, short or int the CPU, northbridge overhead is the same in all cases given reasonable system/compiler behavior. There is obviously a sram cache benefit to smaller field sizes. The one gotcha for x86 is to ensure proper alignment at word sized multiples. Other architectures such as sparc are much less forgiving and will actually crash if memory access is unaligned.
I tend to shy away from data types smaller than the computers word size in performant multi-threaded apps as changes to one field within a word shared with other fields in the same word can have unintended dependancies if access to all fields within the word are not externally synchronized.

struct Person
{
uint8_t gender; // betwwen 0-1
uint8_t age; // between 0-200
uint8_t birthmonth; // between 0-11
uint8_t birthday; // between 1-31
uint8_t birthdayofweek; // between 0-6
}
If you need to access them sequentially, cache optimisation might win. Thus, reducing the data structure to the strict minimum of information may be better.
One might argue that the CPU will have to fetch a 32 bits block in memory anyway, then need to extract the byte thus requiring more time. However that is an arbitrary choice of the CPU and it should not influence your data structures. Imagine if tomorrow CPUs started working with 64 bits blocks, your "optimisation" would immediately become futile because there would be an extraction anyway. However, it's almost certain that memory cache optimisation will always be relevant. This is why I like to privilege making the application as light as it can be, memory-wise, even when concerned with performance.
Furthermore, if you are very concerned with memory you can obviously go below bytes, by packing some of those variables together. But then you'll need to use bit operations, thus additional instructions, and your code could quickly become unreadable / unmaintainable unless you formalize the storage into some sort of generic container / structure.

Thing is, I don't see large fields of Person being evaluated. Therefore, speed is not the issue here.
What IS an issue is consistence. You have a structure with public members that one could easily set to the wrong data, being it out of range or off by an index.
I mean, you are not consistent yourself here, are you?
int birthmonth; // between 0-11
int birthday; // between 1-31
Why are months starting at zero but days are not?
And even if you had this consistent, maybe one guy who uses your structure goes with "months start by zero", one guy with "months start by one". This invites error. Why has age a restriction? Sure, we will not find a 200 years old guy, but why not say max of type?
Rather use things like an enum class, a special type that restricts your range like template<int from, int to> class RangedInt or an already existing type, like Date (especially since such a type can check how many days february has in that year et cetera).
Some people commented on the gender with gender theory stuff here, but in any case, zero is not a gender. You mean male and female, but while we have some intuition when you say that day is in range 0..31, how would we guess if zero is male or female? Shouldn't be governed by comments. (And if you actually want to address gender theory stuff, I'd say that the right type is string.)
Also, in most cases, you would not change anything about a person once it is created. Changes that happen in real life (like getting a different name through marriage) would happen at database level or the like, not in class level. Therefore, I'd made this a class in which you can set only by constructor. Although, of course, this is dependent on the usage obviously. But if you can do it immutable, do it immutable.
Edit: And it's easy to make age inconsistent with the date of birth. That's no good. Remove the redundant data, make it a method or the like. Although the year is missing? Strange design, I must say.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js