I am creating a program that compresses files with Huffman compression. Originally I was using a vector of uint8_t to store bytes from the file, but the performance was horrible (2 hours to decompress a 74 MB file).
I have decided to use 16-bit chunks to represent values from the file.
Originally, I had this (the input bitset has 520 million bits in it):
std::vector<uint8_t> bytes;
boost::dynamic_bitset<unsigned char> input;
boost::to_block_range(input, std::back_inserter(bytes));
This worked great, and it filled a vector with 8-bit integers representing each byte of the file. The frequency of each byte value is recorded in a vector of size 256. Decoding still runs horribly, though: it takes absolutely forever to decode a string, since the frequencies of these values in my file are huge. I thought it would be better to use 16-bit integers and store frequencies in a vector of size 65536. Here is my attempt at filling my vector of "bytes":
std::vector<uint16_t> bytes;
boost::dynamic_bitset<unsigned char> input;
boost::to_block_range(input, std::back_inserter(bytes));
The problem here is that the to_block_range() function takes 8 bits out of my bitset and pads them with 8 zeroes, rather than taking 16 bits out at a time.
Is there any way to fill a vector of uint16_t from a dynamic bitset in this fashion?
The problem here might not be what you think.
In your byte-based approach, adding a reserve call is likely to improve things considerably.
std::vector<uint8_t> bytes;
boost::dynamic_bitset<unsigned char> input;
bytes.reserve(input.num_blocks());
boost::to_block_range(input, std::back_inserter(bytes));
The problem with just inserting at the back of a vector is that its contents get copied multiple times while it grows. You can avoid that by giving it enough memory to work with up front.
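If you do still want 16-bit symbols, one option (a sketch of my own, not something from the question) is to make the bitset's block type 16 bits wide, so that to_block_range emits whole uint16_t blocks instead of single bytes:

#include <boost/dynamic_bitset.hpp>
#include <cstdint>
#include <iterator>
#include <vector>

std::vector<uint16_t> words;
boost::dynamic_bitset<uint16_t> input;   // block type is now 16 bits wide
// ... fill input ...
words.reserve(input.num_blocks());       // still reserve, to avoid reallocations
boost::to_block_range(input, std::back_inserter(words));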
I need to write a boolean vector to a binary file. I searched stackoverflow for similar questions and didn't find anything that worked. I tried to write it myself, here's what I came up with:
vector<bool> bits = { 1, 0, 1, 1 };
ofstream file("file.bin", ios::out | ios::binary);
uint32_t size = (bits.size() / 8) + ((bits.size() % 8 == 0) ? 0 : 1);
file.write(reinterpret_cast<const char*>(&bits[0]), size);
I was hoping it would write 1011**** (where * is a random 1/0). But I got an error:
error C2102: '&' requires l-value
Theoretically, I could loop and pack 8 bools into a char one at a time, write that char to the file, and repeat. But I have a rather large vector, so that would be very slow and inefficient. Is there any other way to write the entire vector at once? A bitset is not suitable, since I need to keep appending bits to the vector.
vector<bool> may or may not be packed, and you can't access its internal data directly, at least not portably.
So you have to iterate over the bits one by one and combine them into bytes (yes, bytes: C++ has std::byte now; on older C++ use uint8_t rather than char).
As you say, writing out each byte individually is slow. But why would you? You know how big the vector is, so create a suitably sized buffer, fill it, and write it out in one go. At a minimum, write out chunks of bytes at a time.
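A minimal sketch of that idea, assuming the bits and file variables from the question (packed MSB-first so that {1, 0, 1, 1} comes out as 1011 followed by zero padding):

#include <cstdint>
#include <vector>

std::vector<std::uint8_t> buffer((bits.size() + 7) / 8, 0);
for (std::size_t i = 0; i < bits.size(); ++i)
    if (bits[i])
        buffer[i / 8] |= static_cast<std::uint8_t>(1u << (7 - i % 8));

file.write(reinterpret_cast<const char*>(buffer.data()),
           static_cast<std::streamsize>(buffer.size()));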
Since vector<bool> doesn't have a data() function, getting the address of its internal storage requires an ugly hack (it happens to work with libstdc++, but I strongly discourage it):
// Treats the first member of the vector<bool> object as the pointer to its
// packed bit storage - implementation-specific and not portable:
file.write(
    reinterpret_cast<const char*>(
        *reinterpret_cast<std::uint64_t* const*>(&bits)),
    size);
For example, take 10^7 32-bit integers. The memory needed to store them in an array is 32 * 10^7 / 8 bytes = 40 MB. However, inserting the same 10^7 integers into a std::set takes more than 300 MB of memory. Code:
#include <iostream>
#include <set>

int main(int argc, const char * argv[]) {
    std::set<int> aa;
    for (int i = 0; i < 10000000; i++)
        aa.insert(i);
    return 0;
}
Other containers like map and unordered_set take even more memory in similar tests. I know that set is implemented as a red-black tree, but the data structure itself does not explain the high memory usage.
I am wondering what causes this 5-8x overhead over the raw data, and whether there is a workaround or a more memory-efficient alternative to set.
Let's examine the std::set implementation in GCC (other compilers are not much different). std::set is implemented as a red-black tree. Each node has pointers to its parent, left, and right nodes, plus a color enumerator (_S_red and _S_black). This means that besides the int (probably 4 bytes) there are 3 pointers (8 * 3 = 24 bytes on a 64-bit system) and one enumerator (since it comes before the pointers in _Rb_tree_node_base, it is padded to an 8-byte boundary, so it effectively takes an extra 8 bytes).
So far that is 24 + 8 + 4 = 36 bytes for each integer in the set. But since the node has to be aligned to 8 bytes, it is padded so that its size is divisible by 8, which means each node takes 40 bytes (10 times bigger than the int itself).
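A rough sketch of the node layout just described (field names are simplified; the real libstdc++ type is _Rb_tree_node):

struct RbNodeSketch {
    int            color;   // _S_red / _S_black, padded out to 8 bytes before the pointers
    RbNodeSketch*  parent;  // 8 bytes
    RbNodeSketch*  left;    // 8 bytes
    RbNodeSketch*  right;   // 8 bytes
    int            value;   // the stored int: 4 bytes + 4 bytes tail padding
};                          // sizeof(RbNodeSketch) == 40 on a typical 64-bit system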
But this is not all. Each such node is allocated by std::allocator. This allocator uses new to allocate each node. Since delete can't know how much memory to free, each node also has some heap-related meta-data. The meta-data should at least contain the size of the allocated block, which usually takes 8 bytes (in theory, it is possible to use some kind of Huffman coding and store only 1 byte in most cases, but I never saw anybody do that).
Considering everything, the total for each int node is 48 bytes. This is 12 times more than int. Every int in the set takes 12 times more than it would have taken in an array or a vector.
Your numbers suggest that you are on a 32-bit system, since your data takes only 300 MB. On a 32-bit system pointers take 4 bytes, which gives 3 * 4 + 4 = 16 bytes of red-black-tree data per node, plus 4 for the int and 4 for heap meta-data. That totals 24 bytes for each int instead of 4, making the set roughly 6 times larger than a vector for a big set. Your measurements actually suggest that the heap meta-data takes 8 bytes rather than 4 (maybe due to an alignment constraint).
So on your system, instead of the 40 MB a std::vector would need, the set is expected to take about 280 MB.
If you want to save some peanuts, you can use a non-standard allocator for your sets. You can avoid the meta-data overhead by using Boost's segregated-storage node allocators (Boost.Pool). That is not such a big win in terms of memory, but it could improve performance, since the pool allocators are simpler than the machinery behind new and delete.
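A sketch of that idea using Boost.Pool's node allocator (illustrative only; measure on your own system before relying on it):

#include <boost/pool/pool_alloc.hpp>
#include <functional>
#include <set>

int main() {
    // Nodes come from pooled segregated storage instead of individual new calls,
    // so the per-node heap meta-data described above is avoided.
    std::set<int, std::less<int>, boost::fast_pool_allocator<int>> aa;
    for (int i = 0; i < 10000000; i++)
        aa.insert(i);
    return 0;
}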
I have a program that takes in a file that is just a list of sets, each with its own integer identifier, and sorts the sets by their number of members.
File format (the number after R is the ID):
R1
0123
0000
R2
0321
R3
0002
...
struct CodeBook {
    string Residue;            // stores the R lines
    vector<string> CodeWords;  // stores the lines between the R lines
};

vector<CodeBook> DATA;
With each file I run, the number of sets gets larger, and currently I am just storing everything in the huge vector DATA. My latest file is large enough that I have taken over the server's memory and am spilling into swap. This will be the last file I process before I possibly switch to a more RAM-friendly algorithm. With that file, the number of sets is larger than an unsigned 32-bit int can hold.
I can calculate how many there will be, and the number of sets matters for later calculations, so overflow is not an option. Going all the way up to an unsigned long long int is not an option either, because I have already pretty much maxed out memory usage.
How could I implement a variable-length integer so I can store everything more efficiently and still do my calculations?
Ex: small id ints get 1 or 2 bytes and the largest ints get 5 bytes
PS: Given the size of what I'm working with, speed is also a factor if it can be helped, but it's not the most important concern :/
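For reference, the kind of variable-length encoding being asked about here is essentially LEB128. A minimal encoder sketch (illustrative only, and not the approach taken in the answer below):

#include <cstdint>
#include <vector>

// Emit 7 bits per byte, low bits first; the high bit means "more bytes follow".
void encode_varint(std::uint64_t value, std::vector<std::uint8_t>& out) {
    do {
        std::uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;
        out.push_back(byte);
    } while (value != 0);
}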
Store everything in two huge vectors.
Define two structs with associated vectors:
struct u   { int id; unsigned count; };
struct ull { int id; unsigned long long count; };
std::vector<u> u_vector;
std::vector<ull> ull_vector;
Read the count into an unsigned long long. If the count fits in an unsigned, store it in u_vector, otherwise in ull_vector. Sort both vectors; output u_vector first and ull_vector second.
Don't be tempted to try doing the same thing with unsigned char - the structure will be the same size as u (because of padding to make the id aligned).
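A minimal sketch of that dispatch-and-sort idea, assuming the u/ull structs and vectors defined above (the reading and sorting details are assumptions, not the asker's actual code):

#include <algorithm>
#include <limits>

// Route each record to the vector whose count type is just big enough.
void add_record(int id, unsigned long long count) {
    if (count <= std::numeric_limits<unsigned>::max())
        u_vector.push_back({id, static_cast<unsigned>(count)});
    else
        ull_vector.push_back({id, count});
}

// Sort both vectors by count; output u_vector first, then ull_vector.
void sort_by_count() {
    auto by_count = [](const auto& a, const auto& b) { return a.count < b.count; };
    std::sort(u_vector.begin(), u_vector.end(), by_count);
    std::sort(ull_vector.begin(), ull_vector.end(), by_count);
}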
So I have a buffer:
uint32_t buff[2];
buff[0] = 12;
buff[1] = 13;
...
I can write this to the flash memory with the method:
HAL_FLASH_Program(TYPEPROGRAM_WORD, (uint32_t)(startAddress+(i*4)), *buff)
The definition of HAL_FLASH_Program is:
HAL_StatusTypeDef HAL_FLASH_Program(uint32_t TypeProgram, uint32_t Address, uint64_t Data)
That works perfectly. Now, is there a way I can store chars instead of ints?
You can use HAL_FLASH_Program with TYPEPROGRAM_BYTE to write a single 1-byte char.
If your data is a bit longer (a struct, a string, ...), you can also write the bulk of it with TYPEPROGRAM_WORD, or even TYPEPROGRAM_DOUBLEWORD (8 bytes at a time), and then either finish off with single bytes as needed or pad the excess with zeros. That would certainly be a bit faster, but maybe it's not significant for you.
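A minimal sketch of writing a short string one byte at a time (it reuses startAddress from the question; the exact macro name, TYPEPROGRAM_BYTE vs FLASH_TYPEPROGRAM_BYTE, depends on your HAL version, and sector erase plus error handling are omitted):

const char msg[] = "hello";

HAL_FLASH_Unlock();
for (uint32_t i = 0; i < sizeof(msg); i++) {
    // Each call programs one byte of the string at consecutive flash addresses.
    HAL_FLASH_Program(TYPEPROGRAM_BYTE, startAddress + i, (uint64_t)msg[i]);
}
HAL_FLASH_Lock();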
I have a program storage optimization question.
I have, let's say, 4096 "knots" stored in a:
boost::dynamic_bitset<>
I am now considering refactoring my program and building a CKnot class which will contain a bool.
The question is what will consume more space:
boost::dynamic_bitset<>(4096, false);
CKnot Knot[4096];   // contains one bool
Thanks
The bitset will be considerably smaller, as a bool in C++ must be at least a byte in size, whereas each bit in a bitset is exactly that, a bit.
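For a rough size comparison (a sketch; CKnot is assumed here to contain nothing but the bool):

#include <boost/dynamic_bitset.hpp>

struct CKnot { bool value; };               // sizeof(CKnot) is at least 1 byte

boost::dynamic_bitset<> bits(4096, false);  // ~4096 / 8 = 512 bytes of blocks, plus a small header
CKnot knots[4096];                          // at least 4096 bytes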