I think this is a very common problem. Let me give an example.
I have a file, which contains many many lines (e.g. one million lines), and each line is of the following form: first comes a number X, and then follows a string of length X.
Now I want to read the file and store all the strings (for whatever reason). Usually, what I will do is: for every line I read the length X, and use malloc (in C) or new (in C++) to allocate X bytes, and then read the string.
The reason that I don't like this method: it might happen that most of the strings are very short, say under 8 bytes. In that case, according to my understanding, the allocation will be very wasteful, both in time and in space.
(First question here: am I understanding correctly, that allocating small pieces of memory is wasteful?)
I have thought about the following optimization: I allocate a big chunk at a time, say 1024 bytes, and whenever a small piece is needed, I just cut it from the big chunk. The problem with this method is that deallocation becomes almost impossible...
It might sound like I want to do the memory management myself... but still, I would like to know if there exists a better method? If needed, I don't mind using some data structure to do the management.
If you have some good idea that only works conditionally (e.g. with the knowledge that most pieces are small), I will also be happy to know it.
The "natural" way to do memory allocation is to ensure that every memory block is at least big enough to contain a pointer and a size, or some similar book-keeping that's sufficient to maintain a structure of free nodes. The details vary, but you can observe the overhead experimentally by looking at the actual addresses you get back from your allocator when you make small allocations.
This is the sense in which small allocations are "wasteful". Actually, with most C or C++ implementations all blocks get rounded up to a multiple of some power of 2 (the power depending on the allocator and sometimes on the order of magnitude of the allocation size). So all allocations are wasteful, but proportionally speaking there's more waste if a lot of 1 and 2 byte allocations are padded out to 16 bytes than if a lot of 113 and 114 byte allocations are padded out to 128 bytes.
If you're willing to do away with the ability to free and reuse just a single allocation (which is fine, for example, if you're planning to free it all together once you're done worrying about the contents of this file) then sure, you can allocate lots of small strings in a more compact way. For example, put them all end to end in one or a few big allocations, each string nul-terminated, and deal in pointers to the first byte of each. The overhead is either 1 or 0 bytes per string depending on how you count the nul. This can work particularly neatly in the case of splitting a file into lines, if you just overwrite the linebreaks with nul bytes. Obviously you'd need to not mind that the linebreak has been removed from each line!
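For illustration, here's a rough sketch of that line-splitting variant (split_lines is my own name for it, and it assumes the whole file has already been slurped into buf; this is one way among many to write it):

#include <cstring>
#include <vector>

// Overwrite each '\n' with '\0' and keep a pointer to the first byte of
// every line. The lines can't be freed individually; the whole buffer
// goes away at once when 'buf' is destroyed.
std::vector<char*> split_lines(std::vector<char>& buf)
{
    buf.push_back('\0');                       // make sure the last line is terminated
    std::vector<char*> lines;
    char* p = buf.data();
    char* end = p + buf.size() - 1;
    while (p < end) {
        lines.push_back(p);                    // start of this line
        char* nl = static_cast<char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;                        // last line had no newline
        *nl = '\0';                            // terminate the line in place
        p = nl + 1;
    }
    return lines;
}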
If you need freeing and re-use, and you know that all allocations are the same size, then you can do away with the size from the book-keeping, and write your own allocator (or, in practice, find an existing pool allocator you're happy with). The minimum allocated size could be one pointer. But that's only an easy win if all the strings are below the size of a pointer, "most" isn't so straightforward.
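For the fixed-size case, a pool can be as simple as the following toy sketch (my own code, not any library's API; it ignores alignment and out-of-memory handling, and the free list stores its "next" pointer inside each freed block, which is exactly why the block size can't go below sizeof(void*)):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy fixed-size pool: carves equal-sized blocks out of big chunks and
// threads freed blocks onto a singly linked free list. No per-block size
// field is needed because every block is the same size.
class FixedPool {
public:
    explicit FixedPool(std::size_t block_size)
        : block_(block_size < sizeof(void*) ? sizeof(void*) : block_size) {}

    ~FixedPool() { for (char* c : chunks_) std::free(c); }

    void* allocate() {
        if (free_list_) {                        // reuse a freed block
            void* p = free_list_;
            free_list_ = *static_cast<void**>(p);
            return p;
        }
        if (used_ + block_ > CHUNK) {            // current chunk is full
            chunks_.push_back(static_cast<char*>(std::malloc(CHUNK)));
            used_ = 0;
        }
        void* p = chunks_.back() + used_;
        used_ += block_;
        return p;
    }

    void deallocate(void* p) {                   // push onto the free list
        *static_cast<void**>(p) = free_list_;
        free_list_ = p;
    }

private:
    static constexpr std::size_t CHUNK = 4096;
    std::size_t block_;
    std::size_t used_ = CHUNK;                   // forces a fresh chunk on first use
    void* free_list_ = nullptr;
    std::vector<char*> chunks_;
};

A freed block costs nothing beyond the pointer temporarily stored inside it, and a live block carries no header at all.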
Yes, statically-allocating a large-ish buffer and reading into that is the usual way to read data.
Say you pick 1KB for the buffer size, because you expect most reads to fit into that.
Are you able to chop rare reads that go above 1KB into multiple reads?
Then do so.
Or not?
Then you can dynamically allocate if and only if necessary. Some simple pointer magic will do the job.
static const unsigned int BUF_SIZE = 1024;
static char buf[BUF_SIZE];

while (something) {
    const unsigned int num_bytes_to_read = foo();
    char* data = 0;

    if (num_bytes_to_read <= BUF_SIZE) {
        read_into(&buf[0]);
        data = buf;
    }
    else {
        data = new char[num_bytes_to_read];
        read_into(data);
    }

    // use data

    if (num_bytes_to_read > BUF_SIZE)
        delete[] data;
}
This code is a delightful mashup of C, C++ and pseudocode, since you did not specify a language.
If you're actually using C++, just use a vector for goodness' sake; let it grow if needed but otherwise just re-use its storage.
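In the same pseudocode style as above, the vector version might look like this (read_into and foo are the same placeholders):

#include <vector>

std::vector<char> buf;
while (something) {
    const unsigned int num_bytes_to_read = foo();
    if (buf.size() < num_bytes_to_read)
        buf.resize(num_bytes_to_read);   // grows only when needed; capacity is reused afterwards
    read_into(buf.data());
    // use buf.data(), num_bytes_to_read bytes of it
}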
You could count the number of lines of text and their total length first, then allocate a block of memory to store the text and a block to store pointers into it. Fill these blocks by reading the file a second time. Just remember to add terminating zeros.
If the entire file will fit into memory, then why not get the size of the file, allocate that much memory and enough for pointers, then read in the entire file and create an array of pointers to the lines in the file?
I would store the length x in the buffer itself, using the largest buffer I can.
You did not tell us the maximum size of x, i.e. sizeof(x). I think it is crucial to store it in the buffer, so you can avoid keeping a separate address for every word and still access the words relatively quickly.
Something like:
char *buffer = "word1\0word2\0word3\0";
while storing each word's address somewhere else for "quick" access
becomes this:
char *buffer = "xx1word1xx2word2xx3word3\0\0\0\0";
As you can see, with x stored at a fixed size it can be really effective to jump word to word without storing each address; you only need to read x and advance the pointer by x...
x is not converted to characters; the integer is written and read using its own type size. This way the words don't need a terminating \0; only the full buffer needs an end marker (if x == 0, that's the end).
I am not that good at explaining, thanks to my English, so here is some code as a better explanation:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

void printword(char *buff){
    char *ptr;
    int i;
    union{
        uint16_t x;
        char c[sizeof(uint16_t)];
    }u;

    ptr=buff;
    memcpy(u.c,ptr,sizeof(uint16_t));
    while(u.x){
        ptr+=sizeof(u.x);
        for(i=0;i<u.x;i++)printf("%c",buff[i+(ptr-buff)]);/*jump in buff using x*/
        printf("\n");
        ptr+=u.x;
        memcpy(u.c,ptr,sizeof(uint16_t));
    }
}

void addword(char *buff,const char *word,uint16_t x){
    char *ptr;
    union{
        uint16_t x;
        char c[sizeof(uint16_t)];
    }u;

    ptr=buff;
    /* reach end x==0 */
    memcpy(u.c,ptr,sizeof(uint16_t));
    while(u.x){ptr+=sizeof(u.x)+u.x;memcpy(u.c,ptr,sizeof(uint16_t));}/*can jump easily! word2word*/

    u.x=x;
    memcpy(ptr,u.c,sizeof(uint16_t));
    ptr+=sizeof(u.x);
    memcpy(ptr,word,u.x);
    ptr+=u.x;
    memset(ptr,0,sizeof(uint16_t));/*end of buffer x=0*/
}

int main(void){
    char buffer[1024];

    memset(buffer,0,sizeof(uint16_t));/*first x=0 because its empty*/
    addword(buffer,"test",4);
    addword(buffer,"yay",3);
    addword(buffer,"chinchin",8);
    printword(buffer);
    return 0;
}
Related
Processing a dict file with variable-length ASCII words.
constexpr int MAXLINE = 1024 * 1024 * 10; // total number of words, one word per line.
Goal: read in the whole file into memory, and be able to access each word by index.
I want quick access to each word by index. We could use a two-dimensional array to achieve that; however, a MAXLENGTH needs to be set, not to mention that MAXLENGTH is not known ahead of time.
constexpr int MAXLENGTH= 1024; // since I do not have the maximum length of the word
char* aray = new char[MAXLINE * MAXLENGTH];
The code above would NOT be memory friendly if most words are shorter than MAXLENGTH; and also some words can be longer than MAXLENGTH, causing errors.
For variable-length objects, I think vector might be the best fit for this problem, so I came up with a vector of vectors to store them.
vector<vector<char>> array(MAXLINE);
This looks so promising, until I realize that is not the case.
I tested both approaches on a dict file with MAXLINE 4-ASCII-character words (here all words are 4-char words).
constexpr int MAXLINE = 1024 * 1024 * 10;
If I allocate the array with the new operator (here MAXLENGTH is just 4):
char* aray = new char[MAXLINE * 4];
the memory consumption is roughly 40MB. However, if I try to use a vector to store the words (I changed char to int32_t to fit exactly four chars):
vector<vector<int32_t>> array(MAXLINE);
You can also use a char vector and reserve space for 4 chars:
vector<vector<char>> array(MAXLINE);
for (auto & c : array) {
    c.reserve(4);
}
The memory consumption jumps up to about 720MB (debug mode) or 280MB (release mode), which is unexpectedly high. Can someone explain why?
Observation: the size of a vector is implementation dependent, and also depends on whether you are compiling in debug mode.
On my system:
sizeof(vector<int32_t>) = 16 // debug mode
and
sizeof(vector<int32_t>) = 12 // release mode
In debug mode the memory consumption is 720MB for vector<vector<int32_t>> array(MAXLINE);, while the vector objects themselves only take sizeof(vector<int32_t>) * MAXLINE = 16 * 10MB = 160 MB.
In release mode, the memory consumption is 280MB, while the expected value is sizeof(vector<int32_t>) * MAXLINE = 12 * 10MB = 120 MB.
Can someone explain the big difference between the real memory consumption and the expected consumption (calculated from the sub-vector size)?
Appreciated, and Happy New Year!
For your case:
so, does it mean vector of vectors is not a good idea to store small objects?
Generally no. A nested sub-vector isn't such a good solution for storing a boatload of teeny variable-sized sequences. You don't want to represent an indexed mesh that allows variable-polygons (triangles, quads, pentagons, hexagons, n-gons) using a separate std::vector instance per polygon, for example, or else you'll tend to blow up memory usage and have a really slow solution: slow because there's a heap allocation involved for every single freaking polygon, and explosive in memory because vector often preallocates some memory for the elements in addition to storing size and capacity in ways that are often larger than needed if you have a boatload of teeny sequences.
vector is an excellent data structure for storing a million things contiguously, but not so excellent for storing a million teeny vectors.
In such cases even a singly-linked indexed list can work better, with the indices pointing into a bigger vector: it performs much faster and sometimes even uses less memory in spite of the 32-bit link overhead.
That said, for your particular case with a big ol' random-access sequence of variable-length strings, this is what I recommend:
// stores the starting index of each null-terminated string
std::vector<int> string_start;
// stores the characters for *all* the strings in one single vector.
std::vector<char> strings;
That will reduce the overhead down to closer to 32-bits per string entry (assuming int is 32-bits) and you will no longer require a separate heap allocation for every single string entry you add.
After you finish reading everything in, you can minimize memory use with a compaction to truncate the array (eliminating any excess reserve capacity):
// Compact memory use using copy-and-swap.
vector<int>(string_start).swap(string_start);
vector<char>(strings).swap(strings);
Now to retrieve the nth string, you can do this:
const char* str = strings.data() + string_start[n];
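For completeness, filling those two vectors from the file might look roughly like this (a sketch assuming one word per line; load and the path argument are my own placeholders):

#include <fstream>
#include <string>
#include <vector>

void load(const char* path,
          std::vector<char>& strings, std::vector<int>& string_start)
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        string_start.push_back(static_cast<int>(strings.size()));
        strings.insert(strings.end(), line.begin(), line.end());
        strings.push_back('\0');             // keep each entry null-terminated
    }
}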
If you need search capabilities as well, you can actually store a boatload of strings and search them quickly (including things like prefix-based searches) while using less memory than even the above solution, by storing them in a compressed trie. It's a bit more of an involved solution, though, but it might be worthwhile if your software revolves around dictionaries of strings and searching through them, and you might just be able to find some third party library that already provides one for ya.
std::string
Just for completeness I thought I'd throw in a mention of std::string. Recent implementations often optimize for small strings by storing a buffer in advance which is not separately heap-allocated. However, in your case that can lead to even more explosive memory usage since that makes sizeof(string) bigger in ways that can consume far more memory than needed for really short strings. It does make std::string more useful though for temporary strings, making it so you might get something that performs perfectly fine if you fetched std::string out of that big vector of chars on demand like so:
std::string str = strings.data() + string_start[n];
... as opposed to:
const char* str = strings.data() + string_start[n];
That said, the big vector of chars would do much better performance and memory wise for storing all the strings. Just generally speaking, generalized containers of any sort tend to cease to perform so well if you want to store millions of teeny ones.
The main conceptual problem is that when the desire is for like a million variable-sized sequences, the variable-sized nature of the requirement combined with the generalized nature of the container will imply that you have a million teeny memory managers, all having to potentially allocate on the heap or, if not, allocate more data than is needed, along with keeping track of its size/capacity if it's contiguous, and so on. Inevitably a million+ managers of their own memory gets quite expensive.
So in these cases, it often pays to forego the convenience of a "complete, independent" container and instead use one giant buffer, or one giant container storing the element data as with the case of vector<char> strings, along with another big container that indexes or points to it, as in the case of vector<int> string_start. With that you can represent the analogical one million variable-length strings just using two big containers instead of a million small ones.
removing nth string
Your case doesn't sound like you need to ever remove a string entry, just load and access, but if you ever need to remove a string, that can get tricky when all the strings and indices to their starting positions are stored in two giant buffers.
Here I recommend, if you wanna do this, to not actually remove the string immediately from the buffer. Instead you can simply do this:
// Indicate that the nth string has been removed.
string_start[n] = -1;
When iterating over available strings, just skip the ones where string_start[n] is -1. Then, every now and then to compact memory use after a number of strings have been removed, do this:
void compact_buffers(vector<char>& strings, vector<int>& string_start)
{
    // Create new buffers to hold the new data excluding removed strings.
    vector<char> new_strings;
    vector<int> new_string_start;
    new_strings.reserve(strings.size());
    new_string_start.reserve(string_start.size());

    // Store a write position into the 'new_strings' buffer.
    int write_pos = 0;

    // Copy strings to new buffers, skipping over removed ones.
    for (int start: string_start)
    {
        // If the string has not been removed:
        if (start != -1)
        {
            // Fetch the string from the old buffer.
            const char* str = strings.data() + start;

            // Fetch the size of the string including the null terminator.
            const size_t len = strlen(str) + 1;

            // Insert the string to the new buffer.
            new_strings.insert(new_strings.end(), str, str + len);

            // Append the current write position to the starting positions
            // of the new strings.
            new_string_start.push_back(write_pos);

            // Increment the write position by the string size.
            write_pos += static_cast<int>(len);
        }
    }

    // Swap compacted new buffers with old ones.
    vector<char>(new_strings).swap(strings);
    vector<int>(new_string_start).swap(string_start);
}
You can call the above periodically to compact memory use after removing a number of strings.
String Sequence
Here's some code throwing all this stuff together that you can freely use and modify however you like.
////////////////////////////////////////////////////////
// StringSequence.hpp:
////////////////////////////////////////////////////////
#ifndef STRING_SEQUENCE_HPP
#define STRING_SEQUENCE_HPP
#include <vector>
/// Stores a sequence of strings.
class StringSequence
{
public:
    /// Creates a new sequence of strings.
    StringSequence();

    /// Inserts a new string to the back of the sequence.
    void insert(const char str[]);

    /// Inserts a new string to the back of the sequence.
    void insert(size_t len, const char str[]);

    /// Removes the nth string.
    void erase(size_t n);

    /// @return The nth string.
    const char* operator[](size_t n) const;

    /// @return The range of indexable strings.
    size_t range() const;

    /// @return True if the nth index is occupied by a string.
    bool occupied(size_t n) const;

    /// Compacts the memory use of the sequence.
    void compact();

    /// Swaps the contents of this sequence with the other.
    void swap(StringSequence& other);

private:
    std::vector<char> buffer;
    std::vector<size_t> start;
    size_t write_pos;
    size_t num_removed;
};
#endif
////////////////////////////////////////////////////////
// StringSequence.cpp:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <cassert>
#include <cstring> // for strlen
#include <utility> // for std::swap
StringSequence::StringSequence(): write_pos(1), num_removed(0)
{
    // Reserve the front of the buffer for empty strings.
    // We'll point removed strings here.
    buffer.push_back('\0');
}

void StringSequence::insert(const char str[])
{
    assert(str && "Trying to insert a null string!");
    insert(strlen(str), str);
}

void StringSequence::insert(size_t len, const char str[])
{
    const size_t str_size = len + 1;
    buffer.insert(buffer.end(), str, str + str_size);
    start.push_back(write_pos);
    write_pos += str_size;
}

void StringSequence::erase(size_t n)
{
    assert(occupied(n) && "The nth string has already been removed!");
    start[n] = 0;
    ++num_removed;
}

const char* StringSequence::operator[](size_t n) const
{
    return &buffer[0] + start[n];
}

size_t StringSequence::range() const
{
    return start.size();
}

bool StringSequence::occupied(size_t n) const
{
    return start[n] != 0;
}

void StringSequence::compact()
{
    if (num_removed > 0)
    {
        // Create a new sequence excluding removed strings.
        StringSequence new_seq;
        new_seq.buffer.reserve(buffer.size());
        new_seq.start.reserve(start.size());

        for (size_t j=0; j < range(); ++j)
        {
            const char* str = (*this)[j];
            if (occupied(j))
                new_seq.insert(str);
        }

        // Swap the new sequence with this one.
        new_seq.swap(*this);
    }

    // Remove excess capacity.
    if (buffer.capacity() > buffer.size())
        std::vector<char>(buffer).swap(buffer);
    if (start.capacity() > start.size())
        std::vector<size_t>(start).swap(start);
}

void StringSequence::swap(StringSequence& other)
{
    buffer.swap(other.buffer);
    start.swap(other.start);
    std::swap(write_pos, other.write_pos);
    std::swap(num_removed, other.num_removed);
}
////////////////////////////////////////////////////////
// Quick demo:
////////////////////////////////////////////////////////
#include "StringSequence.hpp"
#include <iostream>
using namespace std;
int main()
{
    StringSequence seq;
    seq.insert("foo");
    seq.insert("bar");
    seq.insert("baz");
    seq.insert("hello");
    seq.insert("world");
    seq.erase(2);
    seq.erase(3);

    cout << "Before compaction:" << endl;
    for (size_t j=0; j < seq.range(); ++j)
    {
        if (seq.occupied(j))
            cout << j << ": " << seq[j] << endl;
    }
    cout << endl;

    cout << "After compaction:" << endl;
    seq.compact();
    for (size_t j=0; j < seq.range(); ++j)
    {
        if (seq.occupied(j))
            cout << j << ": " << seq[j] << endl;
    }
    cout << endl;
}
Output:
Before compaction:
0: foo
1: bar
4: world
After compaction:
0: foo
1: bar
2: world
I didn't bother to make it standard-compliant (too lazy, and the result isn't necessarily that much more useful for this particular situation), but hopefully there's no strong need for that here.
The size of a vector is implementation dependent, and also depends on whether you are compiling in debug mode. Normally it's at least the size of some internal pointers (begin, end of storage, and end of reserved memory). On my Linux system, sizeof(vector<int32_t>) is 24 bytes (probably 3 x 8 bytes, one per pointer). That means that for your ~10,000,000 items it should be at least ca. 240 MB.
How much memory does a vector<uint32_t> with a length of 1 need? Here are some estimates:
4 bytes for the uint32_t. That's what you expected.
ca. 8/16 bytes dynamic memory allocation overhead. People always forget that the new implementation must remember the size of the allocation, plus some additional housekeeping data. Typically, you can expect an overhead of two pointers, so 8 bytes on a 32 bit system, 16 bytes on a 64 bit system.
ca. 4/12 bytes for alignment padding. Dynamic allocations must be aligned for any data type. How much is required depends on the CPU; typical alignment requirements are 8 bytes (fully aligned double) or 16 bytes (for the CPU's vector instructions). So, your new implementation will add 4/12 padding bytes to the 4 bytes of payload.
ca. 12/24 bytes for the vector<> object itself. The vector<> object needs to store three things of pointer size: the pointer to the dynamic memory, its size, and the number of actually used objects. Multiply with the pointer size 4/8, and you get its sizeof.
Summing this all up, I arrive at 4 + 8/16 + 4/12 + 12/24 = 28/56 bytes that are used to store 4 bytes of payload.
From your numbers, I guess that you compile in 32 bit mode, and that your max alignment is 8 bytes. In debug mode, your new implementation seems to add additional allocation overhead to catch common programming mistakes.
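If you want to check the last of these contributions on your own toolchain, a small probe like the following (throwaway code of my own, nothing standard) prints the size of the vector object and the addresses of two tiny allocations, which lets you eyeball the per-allocation overhead and alignment as well:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Size of the vector<> object itself on this implementation.
    std::printf("sizeof(std::vector<std::int32_t>) = %zu\n",
                sizeof(std::vector<std::int32_t>));

    // Look at the addresses of two tiny allocations to get a feel for
    // per-allocation overhead and alignment on this allocator.
    char* a = new char[4];
    char* b = new char[4];
    std::printf("two 4-byte allocations at %p and %p\n",
                static_cast<void*>(a), static_cast<void*>(b));
    delete[] a;
    delete[] b;
}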
You're creating 10,485,760 (MAXLINE) instances of vector<int32_t>, stored inside another vector. I'm sure that 720MB is a reasonable amount of memory for all the instances' internal data members plus the outer vector's buffer.
As others have pointed out, sizeof(vector<int32_t>) is big enough to produce such numbers when you initialize 10,485,760 instances.
What you may want is a C++ dictionary implementation, i.e. a map:
https://www.moderncplusplus.com/map/
It will still be big (even bigger), but less awkward stylistically. Now, if memory is a concern, then don't use it.
sizeof(std::map<std::string, std::string>) == 48 on my system.
I want to overwrite a file with 0's, but it only writes a few bytes.
My code:
int fileSize = boost::filesystem::file_size(filePath);
int zeros[fileSize] = { 0 };
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
rewriteFile << zeros;
Also... Is this enough to shred the file? What should I do next to make the file unrecoverable?
EDIT: OK, I rewrote my code to this. Is this code OK for the job?
int fileSize = boost::filesystem::file_size(filePath);
boost::filesystem::path rewriteFilePath{filePath};
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
for(int i = 0; i < fileSize; i++) {
    rewriteFile << 0;
}
There are several problems with your code.
int zeros[fileSize] = { 0 };
You are creating an array that is sizeof(int) * fileSize bytes in size. For what you are attempting, you need an array that is fileSize bytes in size instead. So you need to use a 1-byte data type, like (unsigned) char or uint8_t.
But, more importantly, since the value of fileSize is not known until runtime, this type of array is known as a "Variable Length Array" (VLA), which is a non-standard feature in C++. Use std::vector instead if you need a dynamically allocated array.
boost::filesystem::ofstream rewriteFile{rewriteFilePath, std::ios::trunc};
The trunc flag truncates the size of an existing file to 0. What that entails is to update the file's metadata to reset its tracked byte size, and to mark all of the file's used disk sectors as available for reuse. The actual file bytes stored in those sectors are not wiped out until overwritten as sectors get reused over time. But any bytes you subsequently write to the truncated file are not guaranteed to (and likely will not) overwrite the old bytes on disk. So, do not truncate the file at all.
rewriteFile << zeros;
ofstream does not have an operator<< that takes an int[], or even an int*, as input. But it does have an operator<< that takes a void* as input (to output the value of the memory address being pointed at). An array decays into a pointer to the first element, and void* accepts any pointer. This is why only a few bytes are being written. You need to use ofstream::write() instead to write the array to file, and be sure to open the file with the binary flag.
Try this instead:
int fileSize = boost::filesystem::file_size(filePath);
std::vector<char> zeros(fileSize, 0);
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
rewriteFile.write(zeros.data()/*&zeros[0]*/, fileSize);
That being said, you don't need a dynamically allocated array at all, let alone one that is allocated to the full size of the file. That is just a waste of heap memory, especially for large files. You can do this instead:
int fileSize = boost::filesystem::file_size(filePath);
const char zeros[1024] = {0}; // adjust size as desired...
boost::filesystem::path rewriteFilePath(filePath);
boost::filesystem::ofstream rewriteFile(rewriteFilePath, std::ios::binary);
int loops = fileSize / sizeof(zeros);
for(int i = 0; i < loops; ++i) {
    rewriteFile.write(zeros, sizeof(zeros));
}
rewriteFile.write(zeros, fileSize % sizeof(zeros));
Alternatively, if you open a memory-mapped view of the file (MapViewOfFile() on Windows, mmap() on Linux, etc) then you can simply use std::copy() or std::memset() to zero out the bytes of the entire file directly on disk without using an array at all.
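As a POSIX-only sketch of that memory-mapped idea (names and error handling kept minimal, and with the same caveat discussed below that this only overwrites whatever blocks the file currently occupies at the filesystem level):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

bool zero_file(const char* path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return false; }

    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }

    std::memset(p, 0, st.st_size);    // overwrite the mapped bytes with zeros
    msync(p, st.st_size, MS_SYNC);    // flush the changes back to the file
    munmap(p, st.st_size);
    close(fd);
    return true;
}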
Also... Is this enough to shred the file?
Not really, no. At the physical hardware layer, overwriting the file just one time with zeros can still leave behind remnant signals in the disk sectors, which can be recovered with sufficient tools. You should overwrite the file multiple times, with varying types of random data, not just zeros. That will more thoroughly scramble the signals in the sectors.
I cannot stress strongly enough the importance of the comments that overwriting a file's contents does not guarantee that any of the original data is overwritten. ALL OTHER ANSWERS TO THIS QUESTION ARE THEREFORE IRRELEVANT ON ANY RECENT OPERATING SYSTEM.
Modern filing systems are extents based, meaning that files are stored as a linked list of allocated chunks. When updating a chunk, it may be faster for the filing system to write a whole new chunk and simply adjust the linked list, so that's what they do. Indeed, copy-on-write filing systems always write a copy of any modified chunk and update their B-tree of currently valid extents.
Furthermore, even if your filing system doesn't do this, your hard drive may use the exact same technique also for performance, and any SSD almost certainly always uses this technique due to how flash memory works. So overwriting data to "erase" it is meaningless on modern systems. Can't be done. The only safe way to keep old data hidden is full disk encryption. Anything else you are deceiving yourself and your users.
Just for fun, overwriting with random data:
Live On Coliru
#include <boost/iostreams/device/mapped_file.hpp>
#include <random>
namespace bio = boost::iostreams;
int main() {
bio::mapped_file dst("main.cpp");
std::mt19937 rng { std::random_device{} () };
std::uniform_int_distribution<char> dist;
std::generate_n(dst.data(), dst.size(), [&] { return dist(rng); });
}
Note that it scrambles its own source file after compilation :)
I am working on C++ and using multimap for storing data.
struct data
{
    char* value1;
    char* value2;

    data(char* _value1, char* _value2)
    {
        int len1 = strlen(_value1);
        value1 = new char[len1+1];
        strcpy(value1,_value1);

        int len2 = strlen(_value2);
        value2 = new char[len2+2];
        strcpy(value2,_value2);
    }

    ~data()
    {
        delete[] value1;
        delete[] value2;
    }
};

struct ltstr
{
    bool operator()(const char* s1, const char* s2) const
    {
        return strcmp(s1, s2) < 0;
    }
};

multimap <char*, data*, ltstr> m;
Sample Input:
Key Value
ABCD123456 Data_Mining Indent Test Fast Might Must Favor List Myself Janki Jyoti Sepal Petal Catel Katlina Katrina Tesing Must Motor blah blah.
ABCD123456 Datfassaa_Minifasfngf Indesfsant Tfdasest Fast Might Must Favor List My\\fsad\\\self Jfasfsa Katrifasdna Tesinfasfg Must Motor blah blah.
tretD152456 fasdfa fasfsaDfasdfsafata_Mafsfining Infdsdent Tdfsest Fast Might Must Favor List Myself Janki
There are 27 million entries in the input.
Input size = 14GB
But I noticed that memory consumption reaches 56 GB. How can I reduce the memory usage?
If you can't reduce the amount of data you're actually storing, you might want to try to use a different container with less overhead (map and multimap have quite a bit) or find a way to keep only part of the data in memory.
You might want to take a look at these libraries:
STXXL: http://stxxl.sourceforge.net/
Google CPP-Btree: https://code.google.com/p/cpp-btree/
One possibility would be to use a std::map<char *, std::vector<data> > instead of a multimap. In a multimap, you're storing the key string in each entry. With a map you'd have only one copy of the key string, with multiple data items attached to it.
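As a hedged sketch of that layout, reusing the ltstr comparator from the question and keeping the data objects as pointers as in the original code (grouped and add_entry are my own names):

#include <map>
#include <vector>

// One map entry per distinct key; every data item sharing that key hangs
// off the same vector, so the key string is stored only once.
std::map<const char*, std::vector<data*>, ltstr> grouped;

void add_entry(const char* key, data* d)
{
    grouped[key].push_back(d);   // creates the vector on first use of this key
}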
The first optimization would be to store data objects instead of pointers
std::multimap <char*, data, ltstr> m;
because using data* adds additional memory overhead for the allocation.
Another one is using a pool allocator/Memory pool to reduce the footprint of dynamic memory allocation.
If you have many identical key strings, you can improve that too, if you can reuse the keys.
I would suspect that you're leaking or unnecessarily duplicating memory in the keys. Where do the key char * strings come from and how do you manage their memory?
If they are the same string(s) as are in the data object, consider using a multiset<data *, ltdata> instead of a multimap.
If there are many duplicate strings, consider pooling strings in a set<char *,ltstr> to eliminate duplicates.
Without seeing some of your data, there are several things that could improve the memory usage of your project.
First, as Olaf suggested, store the data object in the multimap instead of a pointer to it. I don't suggest using a pool for your data structure though, it just complicates things without memory saving compared to directly storing it in the map.
What you could do though is a specialized allocator for your map that allocates std::pair<char*, data> objects. This could save some overhead and heap fragmentation.
Next, the main thing you should focus on is to try to get rid of the two char* pointers in your data. With 14 gigs of data, there has to be some overlap. Depending on what data it is, you could store it a bit differently.
For example, if the data is names or keywords then it would make sense to store them in a central hash. Yes there are more sophisticated solutions like a DAWG as suggested above but I think one should try the simple solutions first.
By simply storing it in a std::set<std::string> and storing the iterator to it, you would compact all duplicates, which would save a lot of data. This assumes, though, that you don't remove the strings. Removing the strings would require you to do some reference counting, so you would use something like std::map<std::string, unsigned long>. I suggest you write a class that inherits from / contains this hash rather than putting the reference counting logic into your data class, though.
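A minimal sketch of that interning idea (no reference counting, so it assumes strings are never removed; intern and string_pool are my own names):

#include <set>
#include <string>

std::set<std::string> string_pool;

// Returns a pointer to the single shared copy of 'text'. std::set elements
// are stable in memory, so the pointer stays valid as the pool grows.
const char* intern(const std::string& text)
{
    return string_pool.insert(text).first->c_str();
}

The data struct could then hold two such const char* (or iterator) members instead of owning its own copies of the strings.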
If the data that you are storing does not have many overlaps, however, e.g. because it's binary data, then I suggest you store it in a std::string or std::vector<char> instead. The reason is that you can then get rid of the logic in your data structure and even replace it with a std::pair.
I'm also assuming that your key is not one of your pointers you are storing in your data structure. If it is, definitely get rid of it and use the first attribute of the std::pair in your multimap.
Further improvements might be possible depending on what type of data you are storing.
So, with a lot of assumptions that probably don't apply to your data you could have as little as this:
typedef std::set<std::string> StringMap;
typedef StringMap::const_iterator StringRef;
typedef std::multimap<StringRef, std::pair<StringRef, StringRef>> DataMap;
I'm still not entirely sure what is going on here, but it seems that memory overhead is at least some portion of the problem. However, the overall memory consumption is about 4x that which is needed for the data structure. There are approximately 500 bytes per record if there are 27M records taking up 14GB, yet the space taken up is 56GB. To me, this indicates that there is either more data stored than we're shown here, or at least some of the data is stored more than once.
And the "extra data for heap storage" isn't really doing it for me. In linux, a memory allocation takes somewhere around 32 bytes of data minimum. 16 bytes of overhead, and the memory allocated itself takes up a multiple of 16 bytes.
So for one data * record stored in the multimap, we need:
16 bytes of header for the memory allocation
8 bytes for pointer of `value1`
8 bytes for pointer of `value2`
16 bytes of header for the string in value1
16 bytes of header for the string in value2
8 bytes (on average) "size rounding" for string in value 1
8 bytes (on average) "size rounding" for string in value 2
?? bytes from the file. (X)
80 + X bytes total.
We then have char * in the multimap:
16 bytes of header for the memory allocation.
8 bytes of rounding on average.
?? bytes from the file. (Y)
24 + Y bytes total.
Each node of the multimap will have two pointers (I'm assuming it's some sort of binary tree):
16 bytes of header for the memory allocation of the node.
8 bytes of pointer to "left"
8 bytes of pointer to "right"
32 bytes total.
So, that makes 136 bytes of "overhead" per entry in the file. For 27M records, that is just over 4GB.
The file, as I said, contains 500 bytes per entry, so makes 14GB.
That's a total of 18GB.
So, somewhere, something is either leaking, or the math is wrong. I may be off by my calculations here, but even if everything above takes double the space I've calculated, there's STILL 20GB unaccounted for.
There are certainly some things we could do to save memory:
1) Don't allocate TWO strings in data. Calculate both lengths first, allocate one lump of memory, and store the strings immediately after each other:
data(char* _value1, char* _value2)
{
    int len1 = strlen(_value1);
    int len2 = strlen(_value2);
    value1 = new char[len1 + len2 +2];
    strcpy(value1,_value1);
    value2 = value1 + len1 + 1;
    strcpy(value2,_value2);
}
That would save on average 24 bytes per entry. We could possibly save even more by being clever and allocating the memory for data, value1 and value2 all at once. But that could be a little "too clever".
2) Allocating a large slab of data items, and doling them out one at a time would also help. For this to work, we need an empty constructor, and a "setvalues" method:
struct data
{
    ...
    data() {};
    ...
    void set_values(char* _value1, char* _value2)
    {
        int len1 = strlen(_value1);
        int len2 = strlen(_value2);
        value1 = new char[len1 + len2 +2];
        strcpy(value1,_value1);
        value2 = value1 + len1 + 1;
        strcpy(value2,_value2);
    }
};
std::string v1[100], v2[100], key[100];
for(i = 0; i < 100; i++)
{
    if (!read_line_from_file(key[i], v1[i], v2[i]))
    {
        break;
    }
}
data* data_block = new data[i];
for(j = 0; j < i; j++)
{
    data_block[j].set_values(v1[j].c_str(), v2[j].c_str());
    m.insert(std::make_pair(key[j].c_str(), &data_block[j]));
}
Again, this wouldn't save a HUGE amount of memory, but each 16 byte region saves SOME memory. The above is of course not complete code, and more of an "illustration of how it could be done".
3) I'm still not sure where the "Key" comes from in the multimap, but if the key is one of the value1 and value2 entries, then you could reuse one of those, rather than storing another copy [assuming that's how it's done currently].
I'm sorry if this isn't a true answer, but I do believe that it is an answer in the sense that "somewhere, something is unaccounted for in your explanation of what you are doing".
Understanding what allocations are made in your program would definitely help.
How do I find out the length of an array of chars that is not null terminated/zero terminated or anything like that?
Because I wrote a writeFile function and I wanna get rid of that 'len' parameter.
int writeFile(FILE * handle, char * data, int len)
{
    fseek(handle, 0, SEEK_SET);
    for(int i=0; i < len; i++)
        fputc(data[i], handle);
}
You cannot get rid of the len parameter. The computer is not an oracle to guess your intentions. But you can use the fwrite() function which will write your data much more efficiently than fputc().
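For example, the same function with fwrite could look like this (a sketch; I've also made data const, used size_t for the length, and added a return status, none of which is in the original):

#include <stdio.h>

int writeFile(FILE* handle, const char* data, size_t len)
{
    fseek(handle, 0, SEEK_SET);
    /* one library call instead of a loop of fputc calls */
    return fwrite(data, 1, len, handle) == len ? 0 : -1;
}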
There is no portable way*, which is why sentinel values like null terminators are used.
In fact, it's better to specify a length parameter, as it allows partial writes from data buffers (though in your case I would use a size_t/std::size_t for the length).
*you could try using _msize on Windows, but it will lead to tears.
#define writeFile(handle, data) writeFileImpl(handle, data, sizeof data)
As Seith Carnegie commented and others answered, you cannot do that (getting the length of any array of char).
Some C libraries provide you with an extension giving an (over-sized) estimate of the length of heap-allocated memory (e.g. pointers obtained by malloc).
If you use Boehm's garbage collector (which is very useful!), it gives you the GC_size function in <gc/gc.h>.
But when the array of char is inside a structure, or on the call stack, there is no way to get its size at runtime. Only your program knows it.
You can't get rid of the len parameter unless you have another way of determining the length of your data (usually by using a null terminator). This is because C and C++ don't store the length of the data. Furthermore, programmers might appreciate the len parameter. You don't always want to write out all the bytes in your array.
I don't understand how the reallocation of memory for a struct allows me to insert a larger char array into my struct.
Struct definition:
typedef struct props
{
char northTexture[1];
char southTexture[1];
char eastTexture[1];
char westTexture[1];
char floorTexture[1];
char ceilingTexture[1];
} PROPDATA;
example:
void SetNorthTexture( PROPDATA* propData, char* northTexture )
{
    if( strlen( northTexture ) != strlen( propData->northTexture ) )
    {
        PROPDATA* propPtr = (PROPDATA*)realloc( propData, sizeof( PROPDATA ) +
                                                sizeof( northTexture ) );
        if( propPtr != NULL )
        {
            strcpy( propData->northTexture, northTexture );
        }
    }
    else
    {
        strcpy( propData->northTexture, northTexture );
    }
}
I have tested something similar to this and it appears to work, I just don't understand how it does work. Now I expect some people are thinking "just use a char*" but I can't for whatever reason. The string has to be stored in the struct itself.
My confusion comes from the fact that I haven't resized my struct for any specific purpose. I haven't somehow indicated that I want the extra space to be allocated to the north texture char array in that example. I imagine the extra bit of memory I allocated is used for actually storing the string, and somehow when I call strcpy, it realises there is not enough space...
Any explanations on how this works (or how this is flawed even) would be great.
Is this C or C++? The code you've posted is C, but if it's actually C++ (as the tag implies) then use std::string. If it's C, then there are two options.
If (as you say) you must store the strings in the structure itself, then you can't resize them. C structures simply don't allow that. That "array of size 1" trick is sometimes used to bolt a single variable-length field onto the end of a structure, but can't be used anywhere else because each field has a fixed offset within the structure. The best you can do is decide on a maximum size, and make each an array of that size.
Otherwise, store each string as a char*, and resize with realloc.
This answer is not meant to promote the practice described below, but to explain things. There are good reasons not to use malloc, and the suggestions in other answers to use std::string are valid.
I think You have come across the trick used, for example, by Microsoft to avoid the cost of a pointer dereference. In the case of Unsized Arrays in Structures (please check the link), it relies on a non-standard extension to the language. You can use a trick like that even without the extension, but only for the struct member that is positioned at the end of the structure in memory. Usually the last member in the structure declaration is also the last one in memory, but check this question to know more about it. For the trick to work, You also have to make sure the compiler won't add padding bytes at the end of the structure.
The general idea is like this: Suppose You have a structure with an array at the end like
struct MyStruct
{
    int someIntField;
    char someStr[1];
};
When allocating on the heap, You would normally say something like this
MyStruct* msp = (MyStruct*)malloc(sizeof(MyStruct));
However, if You allocate more space than Your struct actually occupies, You can reference the bytes that are laid out in the memory right behind the struct with "out of bounds" access to the array elements. Assuming some typical sizes for the int and the char, and a lack of padding bytes at the end, if You write this:
MyStruct* msp = (MyStruct*)malloc(sizeof(MyStruct) + someMoreBytes);
The memory layout should look like:
|  msp  | msp+1 | msp+2 | msp+3 |   msp+4    | msp+5 | msp+6 | ... |
|<-------- someIntField ------->| someStr[0] |<-- someMoreBytes -->|
In that case, You can reference the byte at the address msp+6 like this:
msp->someStr[2];
strcpy is not that intelligent, and it is not really working.
The call to realloc() allocates enough space for the string - so it doesn't actually crash but when you strcpy the string to propData->northTexture you may be overwriting anything following northTexture in propData - propData->southTexture, propData->westTexture etc.
For example, if you called SetNorthTexture(prop, "texture");
and printed out the different textures then you would probably find that:
northTexture is "texture"
southTexture is "exture"
eastTexture is "xture" etc (assuming that the arrays are byte aligned).
Assuming you don't want to statically allocate char arrays big enough to hold the largest strings, and if you absolutely must have the strings in the structure then you can store the strings one after the other at the end of the structure. Obviously you will need to dynamically malloc your structure to have enough space to hold all the strings + offsets to their locations.
This is very messy and inefficient as you need to shuffle things around if strings are added, deleted or changed.
My confusion comes from the fact that I haven't resized my struct for any specific purpose.
In low level languages like C there is some kind of distinction between structs (or types in general) and actual memory. Allocation basically consists of two steps:
Allocation of raw memory buffer of right size
Telling the compiler that this piece of raw bytes should be treated as a structure
When you do realloc, you do not change the structure, but you change the buffer it is stored in, so you can use the extra space beyond the structure.
Note that, although your program will not crash, it's not correct. When you put text into northTexture, you will overwrite other structure fields.
NOTE: This has no char array example, but it is the same principle. It is just a guess of mine as to what you are trying to achieve.
My opinion is that you have seen somewhere something like this:
typedef struct tagBITMAPINFO {
    BITMAPINFOHEADER bmiHeader;
    RGBQUAD bmiColors[1];
} BITMAPINFO, *PBITMAPINFO;
What you are trying to obtain can happen only when the array is at the end of the struct (and only one array).
For example you allocate sizeof(BITMAPINFO)+15*sizeof(RGBQUAD) when you need to store 16 RGBQUAD structures (1 from the structure and 15 extra).
PBITMAPINFO info = (PBITMAPINFO)malloc(sizeof(BITMAPINFO)+15*sizeof(RGBQUAD));
You can access all the RGBQUAD structures like they are inside the BITMAPINFO structure:
info->bmiColors[0]
info->bmiColors[1]
...
info->bmiColors[15]
You can do something similar to an array declared as char bufStr[1] at the end of a struct.
Hope it helps.
One approach to keeping a struct and all its strings together in a single allocated memory block is something like this:
struct foo {
    ptrdiff_t s1, s2, s3, s4;
    size_t bufsize;
    char buf[1];
} bar;
Allocate sizeof(struct foo)+total_string_size bytes and store the offsets to each string in the s1, s2, etc. members and bar.buf+bar.s1 is then a pointer to the first string, bar.buf+bar.s2 a pointer to the second string, etc.
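A rough sketch of building one of these (error handling omitted; make_foo is my own name, and only two of the four offsets are used for this two-string example):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct foo* make_foo(const char* a, const char* b)
{
    const size_t la = strlen(a) + 1, lb = strlen(b) + 1;
    struct foo* p = (struct foo*)malloc(sizeof(struct foo) + la + lb);

    p->bufsize = la + lb;
    p->s1 = 0;                      /* offset of the first string  */
    p->s2 = (ptrdiff_t)la;          /* offset of the second string */
    memcpy(p->buf + p->s1, a, la);  /* both strings live in the same block as the struct */
    memcpy(p->buf + p->s2, b, lb);
    return p;                       /* a single free(p) releases everything */
}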
You can use pointers rather than offsets if you know you won't need to realloc the struct.
Whether it makes sense to do something like this at all is debatable. One benefit is that it may help fight memory fragmentation or malloc/free overhead when you have a huge number of tiny data objects (especially in threaded environments). It also reduces error handling cleanup complexity if you have a single malloc failure to check for. There may be cache benefits to ensuring data locality. And it's possible (if you use offsets rather than pointers) to store the object on disk without any serialization (keeping in mind that your files are then machine/compiler-specific).