Node size for unordered_map buckets - c++

I have a program where I want to store kmers (substrings of size k) and the number of times they appear. For this particular application, I'm reading in a file with these values and if the number of times they appear is > 255, it is ok to round down to 255. I thought that if I store the key-value pairs as (string, unsigned char) that might save space compared to storing the key-value pairs as (string, int), but this did not seem to be the case when I checked the max resident size by running /usr/bin/time.
To confirm, I also tried running the following test program where I alternated the type of the value in the unordered_map:
#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>
int main() {
std::unordered_map<std::string, unsigned char> kmap;
std::ifstream infile("kmers_from_reads");
std::string kmer;
int abun;
while(infile >> kmer >> abun) {
unsigned char abundance = (abun > 255) ? 255 : abun;
kmap[kmer] = abundance;
}
std::cout << sizeof(*kmap.begin(0)) << std::endl;
}
This did not seem to impact the size of the nodes in the bucket (on my machine it returned 40 for both unsigned char and int values).
I was wondering how the size of the nodes in each bucket is determined.
My understanding of unordered maps is that the C++ standard more or less requires separate chaining and each node in a bucket must have at least one pointer so that the elements are iterable and can be erased (http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html). However, I don't understand how the amount of space to store a value is determined, and it seems like it must also be flexible to accommodate larger values. I also tried looking at the gcc libstdc++ unordered_map header (https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/unordered_map.h) but had a hard time understanding what was going on.

Compile and execute this code:
#include <iostream>
#include <unordered_map>
#include <utility>
#include <string>
#include <fstream>
class foo
{
std::string kmer;
unsigned char abun;
};
class bar
{
std::string kmer;
int abun;
};
int main() {
std::cout << sizeof(foo) << " " << sizeof(bar) << std::endl;
}
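I get, and you probably will too, 40 40. This is because of alignment requirements. If, for example, std::string contains at least one pointer (which it almost certainly does), it has to be aligned on a pointer-sized boundary: typically 8 bytes on a 64-bit platform, 4 bytes on a 32-bit one. The size of a class is always a multiple of its alignment, so the compiler pads both classes up to the same size.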
Imagine if sizeof(foo) was 39 and you had code that did foo foos[2]. If the pointer in foos[0].kmer was properly aligned, the pointer in foos[1].kmer wouldn't be. That would be a disaster.

Related

Storing heterogeneous data continuously in memory as a sequence of chars

I have a large number of strings and some data associated with each string. For simplicity, let's assume that the data is an int for each string. Let's assume I have an std::vector<std::tuple<std::string, int>>. I want to try to store this data contiguously in memory with a single heap allocation. I will not need to worry about adding or deleting strings in the future.
A simple example
Constructing an std::string requires a heap allocation, and accessing entry chars of the std::string requires a dereference. If I have a bunch of strings, I may make better use of memory by storing all of the strings in one std::string and storing each string's starting index and size as a separate variable. If I want, I could try to store the starting index and size within the std::string itself.
Back to my problem
One idea I had was to store everything in an std::string or std::vector<char>. Each entry of the std::vector<std::tuple<std::string, int>> would be laid out in memory like this:
length of next string (int or size_t)
sequence of chars representing the string (chars)
some number of zero chars for correct int alignment (chars)
data (int)
This requires being able to interpret a sequence of chars as an int. There have been questions about this before, but it seems to me that trying to do this can result in undefined behavior. I believe that I can help this slightly by checking the sizeof(int).
Another option I have is to create a union
union CharInt {
char some_chars[sizeof(int)];
int data;
};
Here, I would need to be careful that the number of chars per int used is determined at compile-time based on the result of sizeof(int). I would then store an std::vector<CharInt>. This seems more "C++" than using reinterpret_cast. One downside of this is that accessing the second char member of a CharInt would require an additional pointer addition (the pointer to the CharInt + 1). This cost still seems small relative to the benefit of making everything contiguous.
Is this the better option? Are there other options available? Are there pitfalls I need to account for using the union method?
Edit:
I wanted to provide clarity about how CharInt would be used. I provided an example below:
#include <iostream>
#include <string>
#include <vector>
class CharIntTest {
public:
CharIntTest() {
my_trie.push_back(CharInt{ 42 });
std::string example_string{ "this is a long string" };
my_trie.push_back(CharInt{ example_string, 5 });
my_trie.push_back(CharInt{ 106 });
}
int GetFirstInt() {
return my_trie[0].an_int;
}
char GetFirstChar() {
return my_trie[1].some_chars[0];
}
char GetSecondChar() {
return my_trie[1].some_chars[1];
}
int GetSecondInt() {
return my_trie[2].an_int;
}
private:
union CharInt {
// here I would need to be careful that I only insert sizeof(int) number of chars
CharInt(std::string s, int index) : some_chars{ s[index], s[index+1], s[index+2], s[index+3]} {
}
CharInt(int i) : an_int{ i } {
}
char some_chars[sizeof(int)];
int an_int;
};
std::vector<CharInt> my_trie;
};
Note that I do not access the first or third CharInts as though they were chars. I do not access the second CharInt as though it were an int. Here is the main:
int main() {
CharIntTest tester{};
std::cout << tester.GetFirstInt() << "\n";
std::cout << tester.GetFirstChar() << "\n";
std::cout << tester.GetSecondChar() << "\n";
std::cout << tester.GetSecondInt();
}
which produces the desired output
42
i
s
106

Safe way to unpack integer values from a remote device

I get bytes from a remote device via USB protocol. These bytes contain integer data. Is the following code a safe way to unpack them without portability issues (except endianness, which is known)?
#include <iostream>
#include <string>
#include <cstdint>
#include <cstring>
int main()
{
std::uint8_t someArray[4] = {1,0,0,0};
std::int32_t someValue = 0;
std::memcpy(&someValue, someArray, 4);
std::cout << someValue << std::endl;
}
Yes.
std::memcpy is indeed the way to go. In real-life code, I'd static_assert on the size of types used and check data size at run-time, but nothing more.

C++ priority_queue size() issue

I tried to get the size of an empty priority_queue. Something strange happened. Could anybody explain why this happened? Thanks a lot.
#include <iostream>
#include <queue>
using namespace std;
int main()
{
priority_queue<int, vector<int>, less<int> > asc_queue;
cout << asc_queue.size() << " " << asc_queue.size() - 1 << endl;
}
Output:
0 18446744073709551615
std::priority_queue::size() returns the size of the container as a std::size_t (technically the size_type of the underlying container of the priority queue), which is an unsigned integer type, 64 bits wide on your platform. Unsigned arithmetic wraps around instead of going negative, so subtracting 1 from the size of an empty container gives 2^64 - 1 = 18446744073709551615, the decimal representation of 0xffffffffffffffff, which is the large value you see.

Making maps using the boost::variant library. How to store and display things as the proper type?

I'm trying to create a map using the boost::variant library, but I can't seem to get any of the data held within the map to print properly.
Code:
#include <string>
#include <iostream>
#include "boost_1_55_0/boost/any.hpp"
#include "boost_1_55_0/boost/variant.hpp"
#include <map>
#include <list>
#include <complex>
#include <vector>
using namespace std;
int main()
{
std::map <string, boost::variant <int, double, bool, string> > myMap;
myMap.insert(pair<string, int>("PAGE_SIZE",2048));
myMap.insert(pair<string, boost::variant <int, double,bool, string> > ("PAGE_SIZE2", "hello, this is a string" )); //setup an enum to specify the second value maybe?
cout << "data page 1: " << myMap["PAGE)SIZE1"] << "\ntype page 1: " << myMap["PAGE_SIZE"].which() << endl;
cout << "data: " << myMap["PAGE_SIZE2"] << "\ntype: "<< myMap["PAGE_SIZE2"].which()<< endl;
return 0;
}
Ignore all the extra includes, I've been using this file to play around with lots of different ideas. When I compile with g++, I get the following:
data page 1: 0
type page 1: 0
data page 2: 1
type page 2: 2
I get that the first variable is being stored as an int, and is therefore of type 0, but why is it displaying a value of 0?
Same thing with the second output, except that I don't understand why it's being stored as a bool, is the value a 1 (true)?
All help is appreciated.
Thanks!
The second is being stored as a bool because const char* -> bool is a standard (built-in) conversion, while const char* -> std::string is a user-defined conversion through a constructor, and overload resolution prefers the standard conversion. As the pointer is non-null, it assigns the boolean value true. You can explicitly construct a std::string here, e.g. std::string("hello, this is a string"), to work around the default conversion.
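As for why the first value prints as 0: the first cout line looks up the key "PAGE)SIZE1", which does not match the inserted key "PAGE_SIZE". operator[] on a std::map inserts a default-constructed value for a missing key, and a default-constructed boost::variant holds a value-initialized instance of its first type, here an int equal to 0. Looking up myMap["PAGE_SIZE"] instead should print 2048.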

Efficiently store array of up to 2048 characters?

I am getting input from another source, which populates a string of up to 2048 characters.
What is the most efficient way of populating and comparing this string? I also want to be able to easily append to the string.
Here are three attempts of mine:
C-style version
#include <cstdio>
#include <cstring>
int main(void) {
char foo[2048];
foo[0]='a', foo[1]='b', foo[2]='c', foo[3]='\0'; // E.g.: taken from user-input
puts(strcmp(foo, "bar")? "false": "true");
}
C++-style version 0
#include <iostream>
#include <string>
int main() {
std::string foo;
foo.reserve(2048);
foo += "abc"; // E.g.: taken from user-input
std::cout << std::boolalpha << (foo=="bar");
}
C++-style version 1
#include <iostream>
#include <string>
int main() {
std::string foo;
foo += "abc"; // E.g.: taken from user-input
std::cout << std::boolalpha << (foo=="bar");
}
What is most efficient depends on what you optimize for.
Some common criteria:
1. Program Speed
2. Program Size
3. Working Set Size
4. Code Size
5. Programmer Time
6. Safety
The undoubted king for 1 and 2, and in your example probably also 3, is the C style.
For 4 and 5, C++ style 1.
Point 6 probably goes to the C++ style.
Still, a proper mix of these goals is called for, which IMHO favors C++ option 0.