C++ slows over time reading 70,000 files

I have a program which needs to analyze 100,000 files spread over multiple filesystems.
After processing around 3000 files it starts to slow down. I ran it through gprof, but since the slow down doesn't kick in until 30-60 seconds into the analysis, I don't think it tells me much.
How would I track down the cause? top doesn't show high CPU and the process memory does not increase over time, so I/O?
At top level, we have:
scanner.init(); // build a std::vector<std::string> of pathnames.
scanner.scan(); // analyze those files
Now, init() completes in 1 second. It populates the vector with 70,000 actual filenames and 30,000 symbolic links.
scan() traverses the entries in the vector, looks at the file names, reads the contents (say 1KB of text), and builds a "segment list" [1]
I've read conflicting views on the evils of using std::strings, especially passing them as arguments. All the functions pass &references for both std::strings, structures, etc.
But it does use a lot of string processing to parse filenames, extract substrings and search for substrings (and if strings were evil, the program should always be slow, not just slow down after a while).
Could that be a reason for slowing down over time?
The algorithm is very straightforward and doesn't have any new / delete operators...
Abbreviated, scan():
while (tsFile != mFileMap.end())
{
    curFileInfo.filePath = tsFile->second;
    mpUtils->parseDateTimeString(tsFile->first, curFileInfo.start);
    // Ignore files too small
    size_t fs = mpFileActions->fileSize(curFileInfo.filePath);
    mDvStorInfo.tsSizeBytes += fs;
    if (fileNum++ % 200 == 0)
    {
        usleep(LONGNAPUSEC); // long nap to give others a turn
    }
    // collect file information
    curFileInfo.locked = isLocked(curFileInfo.filePath);
    curFileInfo.sizeBytes = mpFileActions->fileSize(curFileInfo.filePath);
    getTsRateAndPktSize(curFileInfo.filePath, curFileInfo.rateBps, curFileInfo.pktSize);
    getServiceIdList(curFileInfo.filePath, curFileInfo.svcIdList);
    std::string fileBasePath;
    fileBasePath = mpUtils->strReplace(".ts", "", curFileInfo.filePath.c_str());
    fileBasePath = mpUtils->strReplace(".lockts", "", fileBasePath.c_str()); // chained replace
    // Extract the last part of the filename, ie. /mnt/das.b/20160327.104200.to.20160327.104400
    getFileEndTimeAndDuration(fileBasePath, curFileInfo);
    // Update machine info for both actual ts duration and span including gaps
    mDvStorInfo.tsDurationSec += curFileInfo.durSec;
    if (!firstTime)
    {
        // beef is here.
        if (hasGap(curFileInfo, prevFileInfo) ||
            lockChanged(curFileInfo, prevFileInfo) ||
            svcIdListChanged(curFileInfo, prevFileInfo) ||
            lastTsFile(tsFile))
        {
            // This current file differs from those before it so
            // close off previous segment and push to list
            curSegInfo.prevFileStart = curFileInfo.start;
            mSegmentList.push_back(curSegInfo);
            prevFileInfo = curFileInfo; // do this before resetting everything!
            // initialize the new segment
            resetSegmentInfo(curSegInfo);
            copyValues(curSegInfo, curFileInfo);
            resetFileInfo(curFileInfo);
        }
        else
        {
            // still running. Update current segment info
            curSegInfo.durSec += curFileInfo.durSec;
            curSegInfo.sizeBytes += curFileInfo.sizeBytes;
            curSegInfo.end = curFileInfo.end;
            curSegInfo.prevFileStart = prevFileInfo.start;
            prevFileInfo = curFileInfo;
        }
    }
    else // first time
    {
        firstTime = false;
        prevFileInfo = curFileInfo;
        copyValues(curSegInfo, curFileInfo);
        resetFileInfo(curFileInfo);
    }
    ++tsFile;
}
where:
curFileInfo/prevFileInfo is a plain struct. The other functions do string processing, returning a &reference to std::strings
fileSize is calculated by calling stat()
getServiceIdList opens the file with fopen, reads each line and closes the file.
UPDATE
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (eg. strstr(), strcpy() etc) now shows constant performance.
Culprit was the std::strings – despite passing as &refs, I guess too many construct/destroy/copy.
[1] the file names are named by YYYYMMDD.HHMMSS date/time, eg 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments.

This could be a heap fragmentation issue. Over time, the heap can turn into Swiss cheese making it much harder for the memory manager to allocate blocks, and potentially forcing swap even if there is free RAM because there aren't any large-enough contiguous free blocks. Here's an MSDN article about heap fragmentation.
You mentioned using std::vector which guarantees contiguous memory and therefore can be a major culprit in heap fragmentation, as it must free and reallocate each time the collection grows beyond a boundary. If you don't require the contiguous guarantee, you might try a different container.
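If the number of entries is known (or can be estimated) before the vector is filled, one way to avoid the repeated free-and-reallocate cycle described above is to reserve the capacity up front. A minimal sketch, assuming a made-up helper buildPathList and made-up path names (neither is from the question):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds a list of `count` path names. The names are invented for illustration.
std::vector<std::string> buildPathList(std::size_t count) {
    std::vector<std::string> paths;
    paths.reserve(count);  // one allocation for the element array instead of repeated growth
    for (std::size_t i = 0; i < count; ++i) {
        paths.push_back("/mnt/data/file" + std::to_string(i));
    }
    return paths;
}
```

This only removes reallocations of the vector's own element array; each std::string may still own a separate heap buffer.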

the file names are named by YYYYMMDD.HHMMSS date/time, eg 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments
Comparing strings is slow; O(N). Comparing integers is fast; O(1). Rather than storing the filenames as strings, consider storing them as integers (or pairs of integers).
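As a rough sketch of the integer idea, a YYYYMMDD.HHMMSS name can be converted once into a count of seconds, after which gap detection is plain integer subtraction. The function names below (daysFromCivil, readDigits, parseStamp) are hypothetical, and the sketch assumes the names carry no time zone and that every minute has 60 seconds:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Days since 1970-01-01 for a Gregorian calendar date (standard civil-date arithmetic).
inline std::int64_t daysFromCivil(int y, int m, int d) {
    y -= (m <= 2) ? 1 : 0;                        // treat Jan/Feb as months of the previous year
    const int era = (y >= 0 ? y : y - 399) / 400; // 400-year cycle number
    const int yoe = y - era * 400;                // year of era, 0..399
    const int doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // day of year, March-based
    const int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;            // day of era
    return static_cast<std::int64_t>(era) * 146097 + doe - 719468;
}

// Reads `len` decimal digits starting at `pos`; returns -1 if a non-digit is found.
inline int readDigits(const std::string& s, std::size_t pos, std::size_t len) {
    int value = 0;
    for (std::size_t i = pos; i < pos + len; ++i) {
        if (s[i] < '0' || s[i] > '9') return -1;
        value = value * 10 + (s[i] - '0');
    }
    return value;
}

// Parses "YYYYMMDD.HHMMSS" into seconds since 1970-01-01 00:00:00.
// Returns false (and leaves `seconds` untouched) on malformed input.
inline bool parseStamp(const std::string& s, std::int64_t& seconds) {
    if (s.size() != 15 || s[8] != '.') return false;
    const int y = readDigits(s, 0, 4);
    const int mo = readDigits(s, 4, 2);
    const int d = readDigits(s, 6, 2);
    const int h = readDigits(s, 9, 2);
    const int mi = readDigits(s, 11, 2);
    const int sec = readDigits(s, 13, 2);
    if (y < 0 || mo < 1 || mo > 12 || d < 1 || d > 31 ||
        h < 0 || h > 23 || mi < 0 || mi > 59 || sec < 0 || sec > 59) {
        return false;
    }
    seconds = daysFromCivil(y, mo, d) * 86400 + h * 3600 + mi * 60 + sec;
    return true;
}
```

With this, two consecutive files have a gap when the start of the second minus the end of the first is greater than zero, and no substring work is needed after the initial parse.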
And I strongly suggest that you use hash maps, if possible. See std::unordered_set and std::unordered_map. These will greatly cut down on the number of comparisons.
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (eg. strstr(), strcpy() etc) now shows constant performance.
std::set<char*> is sorting pointer addresses, not the strings that they contain.
And don't forget to std::move your strings to cut down on allocations.

Related

Reducing memory footprint of c++ program utilising large vectors

In scaling up the problem size I'm handing to a self-coded program I started to bump into Linux's OOM killer. Both Valgrind (when run on CPU) and cuda-memcheck (when run on GPU) do not report any memory leaks. The memory usage keeps expanding while iterating through the inner loop, while I explicitly clear the vectors holding the biggest chunk of data at the end of this loop. How can I ensure this memory hogging will disappear?
Checks for memory leaks were performed, all the memory leaks are fixed. Despite this, Out of Memory errors keep killing the program (via the OOM Killer). Manual monitoring of memory consumption shows an increase in memory utilisation, even after explicitly clearing the vectors containing the data.
Key to know is having three nested loops, one outer containing the sub-problems at hand. The middle loop loops over the Monte Carlo trials, with an inner loop running some sequential process required inside the trial. Pseudo-code looks as follows:
std::vector<object*> sub_problems;
sub_problems.push_back(retrieved_subproblem_from_database);
for(int sub_problem_index = 0; sub_problem_index < sub_problems.size(); ++sub_problem_index){
    std::vector< std::vector<float> > mc_results(100000, std::vector<float>(5, 0.0));
    for(int mc_trial = 0; mc_trial < 100000; ++mc_trial){
        for(int sequential_process_index = 0; sequential_process_index < 5; ++sequential_process_index){
            mc_results[mc_trial][sequential_process_index] = specific_result;
        }
    }
    sub_problems[sub_problem_index]->storeResultsInObject(mc_results);
    // Do some other things
    sub_problems[sub_problem_index]->deleteMCResults();
}
deleteMCResults looks as follows:
bool deleteMCResults() {
    for (int i = 0; i < asset_values.size(); ++i){
        object_mc_results[i].clear();
        object_mc_results[i].shrink_to_fit();
    }
    object_mc_results.clear();
    object_mc_results.shrink_to_fit();
    return true;
}
How can I ensure that memory consumption depends solely on the middle and inner loops instead of the outer loop? The second, third, fourth and later iterations could theoretically use exactly the same memory space/addresses as were used for the first iteration.

Perhaps I'm reading your pseudocode too literally, but it looks like you have two mc_results variables, one declared inside the for loop and one that deleteMCResults is accessing.
In any case, I have two suggestions for how to debug this. First, rather than letting the OOM killer strike, which takes a long time, is unpredictable, and might kill something important, use ulimit -v to put a limit on process size. Set it to something reasonable like, say, 1000000 (about 1GB) and work on keeping your process under that.
Second, start deleting or commenting out everything except the parts of the program that allocate and deallocate memory. Either you will find your culprit or you will make a program small enough to post in its entirety.

deleteMCResults() can be written a lot simpler.
void deleteMCResults() {
    decltype(object_mc_results) empty;
    std::swap(object_mc_results, empty);
}
But in this case, I'm wondering if you really want to release the memory. As you say, the iterations could reuse the same memory, so perhaps you should replace deleteMCResults() with returnMCResultsMemory(). Then hoist the declaration of mc_results out of the loop, and just reset its values to 0.0 after returnMCResultsMemory() returns.
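A minimal sketch of the hoisting idea, assuming hypothetical names (TrialResults, resetResults, runSubProblems) and a placeholder value in place of specific_result. The buffer is allocated once and only its contents are reset between outer iterations:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using TrialResults = std::vector<std::vector<float>>;

// Sets every stored value back to 0 while keeping all allocations alive.
void resetResults(TrialResults& results) {
    for (std::vector<float>& trial : results) {
        std::fill(trial.begin(), trial.end(), 0.0f);
    }
}

// Runs the outer loop with one reused buffer. Returns how many times the
// first row's storage moved between iterations (expected to be 0).
std::size_t runSubProblems(std::size_t subProblems, std::size_t trials, std::size_t steps) {
    if (trials == 0) return 0;
    TrialResults mcResults(trials, std::vector<float>(steps, 0.0f));  // hoisted out of the loop
    const float* firstRow = mcResults[0].data();
    std::size_t relocations = 0;
    for (std::size_t sp = 0; sp < subProblems; ++sp) {
        for (std::size_t t = 0; t < trials; ++t) {
            for (std::size_t s = 0; s < steps; ++s) {
                mcResults[t][s] = static_cast<float>(sp + t + s);  // stand-in for specific_result
            }
        }
        // ... hand mcResults to the sub-problem here (by reference, not by copy) ...
        resetResults(mcResults);
        if (mcResults[0].data() != firstRow) ++relocations;
    }
    return relocations;
}
```

Whether this helps in the real program depends on storeResultsInObject: if it copies the data into the object, that copy is a second full-size allocation per iteration.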

There is one thing that could easily be improved from the code you show. However, it is really not enough and not precise enough info to make a full analysis. Extracting a relevant example ([mcve]) and perhaps asking for a review on codereview.stackexchange.com might improve the outcome.
The simple thing that could be done is to replace the inner vector of five floats with an array of five floats. Each vector consists (in typical implementations) of three pointers: to the beginning and end of the allocated memory, and another one to mark the used amount. The actual storage requires a separate allocation, which in turn incurs some overhead (and also performance overhead when accessing the data, keyword "locality of reference"). These three pointers require 24 octets on a common 64-bit machine. Compare that with five floats, which only require 20 octets. Even if those floats were padded to 24 octets, you would still benefit from eliding the separate allocation.
In order to try this out, just replace the inner vector with a std::array (https://en.cppreference.com/w/cpp/container/array). Odds are that you won't have to change much code, raw arrays, std::array and std::vector have very similar interfaces.
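A small sketch of what that replacement could look like, assuming the inner size is fixed at compile time. The names kSteps, Trial and makeResults are made up for illustration:

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kSteps = 5;        // inner size, fixed at compile time
using Trial = std::array<float, kSteps>; // five floats stored inline, no separate allocation

// One allocation holds all trials back to back.
std::vector<Trial> makeResults(std::size_t trials) {
    return std::vector<Trial>(trials, Trial{});  // Trial{} zero-initializes the floats
}
```

Element access stays the same as before, for example results[mc_trial][sequential_process_index] = value.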

Find next available chunk in a memory pool

So, I've been spending some time implementing a memory pool class in C++. Except for some minor problems along the way, it's gone fairly well. However, when I tried testing it today by allocating 1000 chunks by first using the memory pool and then comparing it to using new, I was actually getting close to three times worse performance (in nanoseconds) when using the memory pool. My allocation method looks like this:
template <class T> T* MemPool<T>::allocate()
{
    Chunk<T>* tempChunk = _startChunk;
    while (tempChunk->_free == false)
    {
        if (tempChunk->_nextChunk == NULL)
            throw std::runtime_error("No available chunks");
        tempChunk = tempChunk->_nextChunk;
    }
    tempChunk->_free = false;
    return &tempChunk->object;
}
I am starting at the first chunk in the pool and doing a search through the pool's linked list until I find a free chunk, or reach the end of the pool. Now, the bigger the pool, the longer this will take as the search has an O(n) time complexity where n is the number of chunks in the pool.
So I was curious whether anyone has any thoughts on how to improve the allocation? My initial thought was to use two linked lists instead of just the one, where one contains free chunks and the other allocated chunks. When a new chunk is to be allocated, I would simply take the first element in the free list and move it to the allocated list. As far as I can see, this would eliminate the need to do any searching when allocating, and leave only deallocating requiring a search to find the correct chunk.
Any thoughts are appreciated as this is my first time working directly with memory in this way. Thanks!

Instead of using a hand-crafted linked list, it would probably be more effective to use a std::list (particularly if you use it with a custom allocator). Less error prone, and probably better optimised.
Using two lists will simplify things a lot. There is no need to track, in the list itself, whether a chunk is free or not, since that is specified by which list the chunk is in (all that is needed is to ensure a chunk somehow doesn't appear in both lists).
Your current implementation means you are having to walk the linked list, both when allocating and deallocating.
If the chunks are fixed size, then allocation would simply be implemented by moving the first available chunk from the free to the allocated list - no need to search. To deallocate a chunk, you would still need to find it in the allocated list, which means you would need to map a T* to an entry in the list (e.g. perform a search), but then the act of deallocation will be simply moving the entry from one list to the other.
If the chunks are variable size, you'll need to do a bit more work. Allocating would require finding a chunk that is at least the requested size. Overallocating (allocating a larger chunk than needed) would make allocation and deallocation more efficient in terms of performance, but also mean that fewer chunks can be allocated from the pool. Alternatively, break a large chunk (from the free list) in two, and place one entry on each list (representing the allocated part, and the part left unallocated). If you do this, when deallocating, it may be desirable to merge chunks that are adjacent in memory (effectively, implement defragmentation of the free memory in the pool).
You will need to decide whether the pool can be used from multiple threads, and use appropriate synchronisation.
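For the fixed-size case, here is a sketch of a slightly different variant from the two-list design above: an intrusive free list, where a free slot stores the "next free" pointer in its own memory. No allocated list is kept at all, so both allocation and deallocation are O(1) with no search. The class name FixedPool is made up, the pool is not thread-safe, and deallocate trusts the caller to pass a pointer that came from this pool:

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <utility>
#include <vector>

template <class T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity) : slots_(capacity) {
        // Thread every slot onto the free list.
        for (std::size_t i = 0; i < capacity; ++i) {
            slots_[i].next = freeHead_;
            freeHead_ = &slots_[i];
        }
    }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop the head of the free list and construct a T in it.
    template <class... Args>
    T* allocate(Args&&... args) {
        if (freeHead_ == nullptr) throw std::runtime_error("No available chunks");
        Slot* slot = freeHead_;
        Slot* next = slot->next;  // save before the constructor overwrites the slot
        T* object = nullptr;
        try {
            object = new (slot->storage) T(std::forward<Args>(args)...);
        } catch (...) {
            slot->next = next;    // constructor failed: slot stays on the free list
            throw;
        }
        freeHead_ = next;
        return object;
    }

    // O(1): destroy the object and push its slot back onto the free list.
    // No check is made that `p` really belongs to this pool.
    void deallocate(T* p) {
        if (p == nullptr) return;
        p->~T();
        Slot* slot = reinterpret_cast<Slot*>(p);
        slot->next = freeHead_;
        freeHead_ = slot;
    }

private:
    union Slot {
        Slot* next;                                   // meaningful while the slot is free
        alignas(T) unsigned char storage[sizeof(T)];  // meaningful while the slot is in use
    };
    std::vector<Slot> slots_;
    Slot* freeHead_ = nullptr;
};
```

Objects still alive when the pool is destroyed do not get their destructors run in this sketch, so callers are expected to deallocate everything first.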

Use a fixed number of size bins, and make each bin a linked list.
For instance, let's say your bins are simply the integer multiples of the system page size (usually 4KiB), and you use 1MiB chunks; then you have 1MiB/4KiB = 256 bins. If a free makes an n-page region available in a chunk, append it to bin n. When allocating an n-page region, walk through the bins from n to 256 and choose the first available chunk.
To maximize performance, associate the bins with a bitmap, then scan from bit n-1 to bit 255 to find the first set bit (count the leading or trailing zeroes using compiler intrinsics like __builtin_clz and _BitScanForward). That's still not quite O(1) due to the number of bins, but it's pretty close.
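The bitmap scan can be sketched like this. BinBitmap is a made-up name, bit i stands for bin i+1 (so an n-page request scans from bit n-1), and __builtin_ctzll is a GCC/Clang builtin; with MSVC the equivalent would be _BitScanForward64:

```cpp
#include <cstdint>

// 256 bits, one per size bin. A set bit means "this bin has a free region".
struct BinBitmap {
    std::uint64_t words[4] = {0, 0, 0, 0};

    void set(unsigned bit)   { words[bit / 64] |=  (std::uint64_t{1} << (bit % 64)); }
    void clear(unsigned bit) { words[bit / 64] &= ~(std::uint64_t{1} << (bit % 64)); }

    // Index of the first set bit at position >= from, or -1 if there is none.
    int findFirstSetFrom(unsigned from) const {
        if (from >= 256) return -1;
        unsigned w = from / 64;
        // Mask off the bits below `from` in the first word.
        std::uint64_t current = words[w] & (~std::uint64_t{0} << (from % 64));
        for (;;) {
            if (current != 0) {
                return static_cast<int>(w * 64 + __builtin_ctzll(current));
            }
            if (++w == 4) return -1;
            current = words[w];
        }
    }
};
```

At most four 64-bit words are inspected per lookup, which is where the "not quite O(1), but close" comes from.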
If you're worried about memory overhead, you could append each chunk only once for each bin. That is, even if a chunk has 128 1-page regions available (maximally fragmented), bin 1 will still only link to the chunk once and reuse it 128 times.
To do this you'd have to link these regions together inside each chunk, which means each chunk will also need to store a list of size bins - but this can be more memory efficient because there are only at most 256 valid offsets inside each chunk, whereas the list needs to store full pointers.
Note that either way, if you don't want the free space inside each chunk to get fragmented, you'll want a quick way to remove chunks from bins in your list - which means using doubly linked lists. Obviously that adds additional memory overhead, but it might still be preferable to doing periodic free space defragmentation on the whole list.

Searching in large memory mapped files

I have a large data structure stored in memory mapped file. Data structure is very simple:
struct Header {
    ...some metadata...
    uint32_t index_size;
    uint64_t index[];
};
This header is placed in the beginning of the file, it uses a structure hack - variable sized structure, size of the last element is not set in stone and can be changed.
char* mmaped_region = ...; // This memory comes from memory mapped file!
Header* pheader = reinterpret_cast<Header*>(mmaped_region);
Memory mapped region starts with Header and Header::index_size contains correct length of the Header::index array. This array contains offsets of the data elements, we can do this:
uint64_t offset = pheader->index[x];
DataItem* item = reinterpret_cast<DataItem*>(mmaped_region + offset);
// At this point, variable item contains pointer to data element
// if variable x contains correct index value (less than pheader->index_size)
All the data elements are sorted (a less-than relation is defined for data elements). They are stored in the same memory mapped region as Header, but starting from the end and growing toward the beginning. Data elements can't be moved, because they are of variable size; instead, the indexes in the header are moved during the sort procedure. This is very much like a B-tree page in modern databases, where the index array is usually called an indirection vector.
Searches
This data structure is searched with an interpolation search (with a limited number of steps) and then with a binary search. First, I have the whole index array to search, and I estimate where the searched element would be stored if the distribution were uniform. I look at the element at the estimated index, and it usually doesn't match. Then I narrow the search range and repeat. The number of interpolation search steps is limited to some small number. After that, the remaining range is searched with binary search. This works very well with small data sets, because the distribution is usually uniform: a few iterations of the interpolation search and we're done.
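To make the two-phase search concrete, here is a simplified sketch over a plain in-memory vector of 64-bit keys. It stands in for the real structure, where each probe would go through the index array to a variable-size data element; the function name hybridSearch and the parameter maxInterpolationSteps are made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Searches sorted `keys` for `key`: a bounded number of interpolation probes,
// then binary search on whatever range is left. Returns the index or -1.
std::ptrdiff_t hybridSearch(const std::vector<std::uint64_t>& keys,
                            std::uint64_t key, int maxInterpolationSteps) {
    std::size_t lo = 0;
    std::size_t hi = keys.size();  // half-open range [lo, hi)

    // Phase 1: interpolation probes, assuming roughly uniform key distribution.
    for (int step = 0; step < maxInterpolationSteps && hi - lo > 1; ++step) {
        const std::uint64_t first = keys[lo];
        const std::uint64_t last = keys[hi - 1];
        if (key < first || key > last) return -1;
        if (first == last) break;
        const long double fraction =
            static_cast<long double>(key - first) / static_cast<long double>(last - first);
        const std::size_t probe = lo + static_cast<std::size_t>(fraction * (hi - 1 - lo));
        if (keys[probe] == key) return static_cast<std::ptrdiff_t>(probe);
        if (keys[probe] < key) lo = probe + 1; else hi = probe;
    }

    // Phase 2: plain binary search on the narrowed range.
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] == key) return static_cast<std::ptrdiff_t>(mid);
        if (keys[mid] < key) lo = mid + 1; else hi = mid;
    }
    return -1;
}
```

Every keys[...] access in this sketch corresponds to a potential page fault in the memory mapped version, which is why the number of probes matters so much more than the arithmetic.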
Problem definition.
The memory mapped region can be very large in reality. For testing I use a 32GB file as backing storage and search for some random keys. This is very slow because this access pattern causes a lot of random disk reads (all the data can't be cached in memory).
What can be done here? I think that setting MADV_RANDOM with madvise syscall can help, but probably not very much. I want to get on par with B-tree search speed. Maybe it is possible to use mincore syscall to check what data-elements can be painlessly checked during interpolation search? Maybe I can use prefetching of some sort?

The interpolation search appears to be a good idea here. It usually has a small benefit, but in this case even a small number of iterations saved helps a lot since they're so slow (disk I/O).
However, real databases duplicate the actual key values in their indices. The space overhead for that is fully justified in the performance improvement. Btrees are a further improvement because they pack multiple related nodes in a single contiguous block of memory, further reducing disk seeks.
This is probably the correct solution for you as well. You should duplicate the keys to avoid disk I/O. You can probably get away with duplicating the keys in a separate structure and keeping that fully in memory, if you can't alter the existing header.
A compromise is possible, where you just cache the top (2^N)-1 keys for the first N levels of binary search. That means you have to give up your interpolation for that part of the search, but as noted before interpolation is not a huge win anyway. The disk seeks saved will easily pay off. Even caching just the median key (N=1) will already save you one disk seek per lookup. And you can still use interpolation once you've run out of the cache.
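A sketch of that compromise, under the assumption that a key can be reduced to a 64-bit integer and that some function reads key number i from the mapped file. ReadKey, TopLevelKeyCache and cachedBinarySearch are hypothetical names, and the diskReads counter exists only to show how many probes the cache saves:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Stand-in for "fetch key number i from the memory mapped file" (the slow part).
using ReadKey = std::function<std::uint64_t(std::size_t)>;

// Remembers the keys at the midpoints of the first `levels` binary-search levels.
class TopLevelKeyCache {
public:
    TopLevelKeyCache(std::size_t count, int levels, const ReadKey& readKey) {
        fill(0, count, levels, readKey);
    }
    // Returns true and sets `key` if the key at `index` is held in memory.
    bool lookup(std::size_t index, std::uint64_t& key) const {
        const auto it = keys_.find(index);
        if (it == keys_.end()) return false;
        key = it->second;
        return true;
    }
    std::size_t size() const { return keys_.size(); }

private:
    void fill(std::size_t lo, std::size_t hi, int levels, const ReadKey& readKey) {
        if (levels == 0 || lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;  // same midpoint rule as the search below
        keys_[mid] = readKey(mid);
        fill(lo, mid, levels - 1, readKey);
        fill(mid + 1, hi, levels - 1, readKey);
    }
    std::unordered_map<std::size_t, std::uint64_t> keys_;
};

// Binary search that consults the cache first; `diskReads` counts the slow lookups.
std::ptrdiff_t cachedBinarySearch(std::size_t count, std::uint64_t wanted,
                                  const TopLevelKeyCache& cache, const ReadKey& readKey,
                                  std::size_t& diskReads) {
    std::size_t lo = 0;
    std::size_t hi = count;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        std::uint64_t key = 0;
        if (!cache.lookup(mid, key)) {
            key = readKey(mid);
            ++diskReads;
        }
        if (key == wanted) return static_cast<std::ptrdiff_t>(mid);
        if (key < wanted) lo = mid + 1; else hi = mid;
    }
    return -1;
}
```

With N cached levels the cache holds at most (2^N)-1 keys, and each lookup skips up to N slow reads.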
In comparison, any attempt to fiddle with memory mapping parameters will give you a few percent speed improvement at best. "On par with B-trees" is not going to happen. If your algorithm needs those physical seeks, you lose. No magical pixie dust will fix a bad algorithm or a bad datastructure.

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large amount (multiple terabytes) of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++) and using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.

You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a time-complexity guarantee that is also stable you can go for a mergesort (O(nlogn) guaranteed) although it requires an additional O(n) space.
If severely memory-bound you might want to try Smoothsort (constant memory, time O(nlogn))
Otherwise you might want to take a look at the research stuff in the gpgpu accelerators field like GPUTeraSort.
Google servers usually have this sort of problems.

Simply construct a digital tree (trie).
Memory use will be much less than the input data, because many words will share a common prefix. While adding data to the tree, you mark the last node of each word as an end of word (by incrementing a counter). Once all the words are added, you do a DFS (visiting children in the order you want to sort by, e.g. a->z) and output the data to a file. Time complexity is the same as the memory size. It is hard to state the complexity exactly because it depends on the strings (many short strings give better complexity), but it is still better than the input data size O(n*k), where n is the count of strings and k is the average length of a string. I'm sorry for my English.
PS. To solve the problem with memory size, you can split the file into smaller parts and sort each of them with my method. If you end up with, for example, 1000 files, you keep the first word of each in memory (like queues), then repeatedly output the right (smallest) word and read in the next one from that file, which takes very little time.
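A minimal in-memory sketch of the trie approach, assuming the data (or one part of it) fits in memory. TrieNode, insert and emitSorted are made-up names; std::map is used so that children are visited in char order, which matches byte order only for plain ASCII on platforms where char is signed:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::size_t endCount = 0;                            // how many inserted strings end here
    std::map<char, std::unique_ptr<TrieNode>> children;  // ordered by character
};

// Adds one word, creating nodes as needed, and bumps the end-of-word counter.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    ++node->endCount;
}

// Depth-first walk in child order; emits each word as many times as it was inserted.
void emitSorted(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    for (std::size_t i = 0; i < node.endCount; ++i) out.push_back(prefix);
    for (const auto& entry : node.children) {
        prefix.push_back(entry.first);
        emitSorted(*entry.second, prefix, out);
        prefix.pop_back();
    }
}
```

In a real run the output would go to a file rather than a vector, and a map per node carries noticeable overhead, so the memory saving depends heavily on how much the strings actually share prefixes.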

I suggest you use the Unix "sort" command that can easily handle such files.
See How could the UNIX sort command sort a very large file? .
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D
Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
polyphase merge sort
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
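The merge phase with many intermediate files is usually done as a k-way merge driven by a min-heap. The sketch below uses in-memory vectors as stand-ins for the sorted intermediate files, so it shows the merge logic only, not the buffering or disk I/O discussed above. The function name mergeSortedRuns is made up:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merges already-sorted runs into one sorted sequence.
// Each inner vector stands in for one sorted intermediate file.
std::vector<std::string> mergeSortedRuns(const std::vector<std::vector<std::string>>& runs) {
    using Entry = std::pair<std::string, std::size_t>;  // (current string, index of its run)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;  // min-heap

    std::vector<std::size_t> next(runs.size(), 0);  // read position within each run
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) {
            heap.push(Entry(runs[r][0], r));
            next[r] = 1;
        }
    }

    std::vector<std::string> out;
    while (!heap.empty()) {
        const Entry top = heap.top();
        heap.pop();
        out.push_back(top.first);
        const std::size_t r = top.second;
        if (next[r] < runs[r].size()) {
            heap.push(Entry(runs[r][next[r]], r));  // refill from the run we just consumed
            ++next[r];
        }
    }
    return out;
}
```

With k runs and n strings in total this performs about n*log2(k) string comparisons, and only one string per run needs to be resident at a time.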

Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. But to avoid copying your strings around and having to deal with variable sized strings and memory allocations (at this point it will be interesting and relevant if you actually have fixed or max size strings or not), preallocate a standard std array or other fitting container with pointers to your strings that point to a memory region inside the current data chunk. => So your IntroSort swaps the pointers to your strings, instead of swapping actual strings.
Loop over each entry in your sort-array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as many strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String), where pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit into the window), you read the next memory region of that chunk into the window.
Use mapped IO (or a buffered writer) and write the resulting strings into a giant final sorted strings file.
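Step 1 above (sorting pointers into the chunk instead of moving the strings) could look roughly like this. It assumes newline-separated strings and uses std::sort, which typical standard libraries implement as an introsort. Slice, sliceLines, sortSlices and sortedLines are made-up names:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// A view of one string inside the chunk buffer: no copy, no allocation per string.
struct Slice {
    const char* data;
    std::size_t size;
};

// Splits a newline-separated chunk into slices that point into `chunk`.
std::vector<Slice> sliceLines(const std::string& chunk) {
    std::vector<Slice> slices;
    std::size_t start = 0;
    while (start < chunk.size()) {
        std::size_t end = chunk.find('\n', start);
        if (end == std::string::npos) end = chunk.size();
        slices.push_back(Slice{chunk.data() + start, end - start});
        start = end + 1;
    }
    return slices;
}

// Sorts the slices by byte order; only the small Slice structs are swapped.
void sortSlices(std::vector<Slice>& slices) {
    std::sort(slices.begin(), slices.end(), [](const Slice& a, const Slice& b) {
        const int c = std::memcmp(a.data, b.data, std::min(a.size, b.size));
        return c < 0 || (c == 0 && a.size < b.size);
    });
}

// Convenience for writing out: materializes the strings in sorted order.
std::vector<std::string> sortedLines(const std::string& chunk) {
    std::vector<Slice> slices = sliceLines(chunk);
    sortSlices(slices);
    std::vector<std::string> out;
    out.reserve(slices.size());
    for (const Slice& s : slices) out.emplace_back(s.data, s.size);
    return out;
}
```

The slices are only valid while the chunk buffer is alive and unmodified, so the sorted output has to be written before the next chunk is loaded into the same buffer.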
I think this could be an IO efficient way. But I've never implemented such thing.
However, in regards to your file size and yet unknown to me "non-functional" requirements, I suggest you to also consider benchmarking a batch-import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html

Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, however could be applied to a multiple file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not exceed much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
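The distribution step (steps 1 to 8) can be sketched in memory as follows, with vectors standing in for the 676 files. The handling of strings that don't start with two lower-case ASCII letters is my own assumption, chosen so that bucket order still agrees with byte-wise string order; bucketIndex and sortViaBuckets are made-up names:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

const std::size_t kBucketCount = 26 * 26;  // "aa" .. "zz"

// Maps a string to a bucket by its first two characters.
// Assumption: a first character below 'a' goes to the first bucket and one
// above 'z' to the last; a missing or out-of-range second character is clamped.
std::size_t bucketIndex(const std::string& s) {
    if (s.empty()) return 0;
    const unsigned char c0 = static_cast<unsigned char>(s[0]);
    if (c0 < 'a') return 0;
    if (c0 > 'z') return kBucketCount - 1;
    std::size_t second = 0;
    if (s.size() > 1) {
        const unsigned char c1 = static_cast<unsigned char>(s[1]);
        if (c1 > 'z') second = 25;
        else if (c1 >= 'a') second = static_cast<std::size_t>(c1 - 'a');
    }
    return static_cast<std::size_t>(c0 - 'a') * 26 + second;
}

// Distributes, sorts each bucket on its own, and concatenates the buckets in order.
std::vector<std::string> sortViaBuckets(const std::vector<std::string>& input) {
    std::vector<std::vector<std::string>> buckets(kBucketCount);
    for (const std::string& s : input) buckets[bucketIndex(s)].push_back(s);

    std::vector<std::string> out;
    out.reserve(input.size());
    for (std::vector<std::string>& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());  // each bucket fits in memory by construction
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}
```

Real data is rarely spread evenly over two-letter prefixes, so some buckets will be much larger than others; that is the case step 9 covers by splitting an oversized file again.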
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
Good luck with your project.

Why is deque using so much more RAM than vector in C++?

I have a problem I am working on where I need to use some sort of 2 dimensional array. The array is fixed width (four columns), but I need to create extra rows on the fly.
To do this, I have been using vectors of vectors, and I have been using some nested loops that contain this:
array.push_back(vector<float>(4));
array[n][0] = a;
array[n][1] = b;
array[n][2] = c;
array[n][3] = d;
n++;
to add the rows and their contents. The trouble is that I appear to be running out of memory with the number of elements I was trying to create, so I reduced the number that I was using. But then I started reading about deque, and thought it would allow me to use more memory because it doesn't have to be contiguous. I changed all mentions of "vector" to "deque" in this loop, as well as in all declarations. But then it appeared that I ran out of memory again, this time even with the reduced number of rows.
I looked at how much memory my code is using, and when I am using deque, the memory rises steadily to above 2GB, and the program closes soon after, even when using the smaller number of rows. I'm not sure exactly where in this loop it is when it runs out of memory.
When I use vectors, the memory usage (for the same number of rows) is still under 1GB, even when the loop exits. It then goes on to a similar loop where more rows are added, still only reaching about 1.4GB.
So my question is. Is this normal for deque to use more than twice the memory of vector, or am I making an erroneous assumption in thinking I can just replace the word "vector" with "deque" in the declarations/initializations and the above code?
Thanks in advance.
I'm using:
MS Visual C++ 2010 (32-bit)
Windows 7 (64-bit)

The real answer here has little to do with the core data structure. The answer is that MSVC's implementation of std::deque is especially awful and degenerates to an array of pointers to individual elements, rather than the array of arrays it should be. Frankly, only twice the memory use of vector is surprising. If you had a better implementation of deque you'd get better results.

It all depends on the internal implementation of deque (I won't speak about vector since it is relatively straightforward).
Fact is, deque has completely different guarantees than vector (the most important one being that it supports O(1) insertion at both ends while vector only supports O(1) insertion at the back). This in turn means the internal structures managed by deque have to be more complex than vector.
To allow that, a typical deque implementation will split its memory in several non-contiguous blocks. But each individual memory block has a fixed overhead to allow the memory management to work (eg. whatever the size of the block, the system may need another 16 or 32 bytes or whatever in addition, just for bookkeeping). Since, contrary to a vector, a deque requires many small, independent blocks, the overhead stacks up which can explain the difference you see. Also note that those individual memory blocks need to be managed (maybe in separate structures?), which probably means some (or a lot of) additional overhead too.
As for a way to solve your problem, you could try what @BasileStarynkevitch suggested in the comments. This will indeed reduce your memory usage, but it will get you only so far, because at some point you'll still run out of memory. And what if you try to run your program on a machine that only has 256MB RAM? Any other solution whose goal is to reduce your memory footprint while still trying to keep all your data in memory will suffer from the same problems.
A proper solution when handling large datasets like yours would be to adapt your algorithms and data structures so that they handle small partitions of the whole dataset at a time, and load/save those partitions as needed in order to make room for the other partitions. Unfortunately, since it probably means disk access, it also means a big drop in performance but hey, you can't eat the cake and have it too.

Theory
There are two common ways to efficiently implement a deque: either with a modified dynamic array or with a doubly linked list.
The modified dynamic array is basically a dynamic array that can grow from both ends, sometimes called an array deque. Array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end.
There are several implementations of modified dynamic array:
1. Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
2. Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings.
3. Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
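The third approach in the list above (multiple smaller arrays plus a table of pointers to them) could be sketched roughly as follows. This is an illustration only, not how any particular standard library implements std::deque: `BlockSequence` and `BlockSize` are hypothetical names, growth at the front is left out to keep it short, and the element type is assumed to be default-constructible and copy-assignable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: fixed-size blocks plus a table of pointers to them.
// Only growth at the back is shown.
template <typename T, std::size_t BlockSize = 64>
class BlockSequence {
public:
    BlockSequence() : size_(0) {}

    void push_back(const T& value) {
        if (size_ % BlockSize == 0) {
            // All existing blocks are full: allocate one more.
            // Elements already stored are never moved.
            blocks_.push_back(std::unique_ptr<T[]>(new T[BlockSize]));
        }
        blocks_[size_ / BlockSize][size_ % BlockSize] = value;
        ++size_;
    }

    // Indexing: first pick the block, then the slot inside it.
    T& operator[](std::size_t i) {
        assert(i < size_);
        return blocks_[i / BlockSize][i % BlockSize];
    }

    std::size_t size() const { return size_; }
    std::size_t block_count() const { return blocks_.size(); }

private:
    std::vector<std::unique_ptr<T[]> > blocks_;  // the "array of pointers"
    std::size_t size_;
};
```

For example, a `BlockSequence<int, 4>` holding 10 elements uses 3 blocks, and the address of element 0 stays the same as more elements are appended. The price is one pointer per block plus whatever the heap charges for each separate allocation.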
Conclusion
Different libraries may implement deques in different ways, but generally as a modified dynamic array. Most likely your standard library uses approach #1 to implement std::deque, and since you append elements at one end only, you ultimately waste a lot of space. That creates the impression that std::deque takes up more space than a plain std::vector.
Furthermore, if std::deque were implemented as a doubly linked list, that would waste space too, since each element would need to accommodate 2 pointers in addition to your custom data.
An implementation using approach #3 (also a modified dynamic array) would again waste space on additional metadata, such as the pointers to all those small arrays.
In any case, std::deque is less efficient in terms of storage than plain old std::vector. Without knowing what you want to achieve, I cannot confidently suggest which data structure you need. However, it seems like you don't even know what deques are for, so what you really want in your situation is std::vector. Deques, in general, have a different application.

A deque can have additional memory overhead compared to a vector because it is made of several blocks instead of one contiguous block.
From en.cppreference.com/w/cpp/container/deque:
As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays.

The primary issue is running out of memory.
So, do you need all the data in memory at once?
You may never be able to accomplish this.
Partial Processing
You may want to consider processing the data into "chunks" or smaller sub-matrices. For example, using the standard rectangular grid:
1. Read the data of the first quadrant.
2. Process the data of the first quadrant.
3. Store the results (in a file) of the first quadrant.
4. Repeat for the remaining quadrants.
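The read/process/store cycle above might look something like this in code. It is a minimal sketch under several assumptions: the data is a whitespace-separated stream of numbers rather than a grid of quadrants, "processing" is just a sum, and `process_in_chunks` and its parameters are hypothetical names. The point is only that no more than `chunk_size` values are ever held in memory at once.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <numeric>
#include <ostream>
#include <sstream>
#include <vector>

// Reads at most chunk_size values at a time from `in`, processes them
// (here: sums them, as a placeholder), and writes one result per chunk to `out`.
// Returns the number of chunks processed.
std::size_t process_in_chunks(std::istream& in, std::ostream& out,
                              std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<double> chunk;
    chunk.reserve(chunk_size);  // the only data buffer that is ever allocated
    std::size_t chunks_done = 0;
    for (;;) {
        // 1. Read one chunk.
        chunk.clear();
        double value;
        while (chunk.size() < chunk_size && in >> value) chunk.push_back(value);
        if (chunk.empty()) break;  // nothing left to read
        // 2. Process it.
        const double result = std::accumulate(chunk.begin(), chunk.end(), 0.0);
        // 3. Store the result, then loop to reuse the buffer for the next chunk.
        out << result << '\n';
        ++chunks_done;
    }
    return chunks_done;
}
```

With real data, `in` and `out` would typically be a `std::ifstream` and a `std::ofstream`; string streams are enough to try it out.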
Searching
If you are searching for a particle or a set of data, you can do that without reading the entire data set into memory.
1. Allocate a block (array) of memory.
2. Read a portion of the data into this block of memory.
3. Search the block of data.
4. Repeat steps 2 and 3 until the data is found.
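As a rough illustration of the block-wise search above, the sketch below looks for a single byte in a stream while holding only one block in memory. The function name `find_byte` and the choice of a one-byte target are assumptions made to keep the example short; searching for a multi-byte record would also have to handle matches that straddle two blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Returns the offset of the first occurrence of `target` in `in`,
// or -1 if it does not occur. Only block_size bytes are held at a time.
long long find_byte(std::istream& in, char target, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<char> block(block_size);  // step 1: allocate one block
    long long offset = 0;                 // stream offset of block[0]
    for (;;) {
        // Step 2: read the next portion (the last one may be short).
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;  // end of data
        // Step 3: search only the bytes that were actually read.
        const char* begin = block.data();
        const char* end = begin + got;
        const char* hit = std::find(begin, end, target);
        if (hit != end) return offset + (hit - begin);
        offset += got;  // step 4: repeat with the next block
    }
    return -1;
}
```

For instance, searching the ten bytes `abcdefghij` for `'h'` with a block size of 4 finds it in the second block, at offset 7.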
Streaming Data
If your application is receiving the raw data from an input source (other than a file), you will want to store the data for later processing.
This will require more than one buffer and is more efficient using at least two threads of execution.
The Reading Thread will be reading data into a buffer until the buffer is full. When the buffer is full, it will read data into another empty one.
The Writing Thread will initially wait until either the first read buffer is full or the read operation is finished. Next, the Writing Thread takes data out of the read buffer and writes it to a file. The Writing Thread then starts writing from the next read buffer.
This technique is called Double Buffering or Multiple Buffering.
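A bare-bones version of this reader/writer hand-off, using two buffers, could look like the sketch below. It makes several simplifying assumptions: the input is any `std::istream`, the "file" is any `std::ostream`, the writer runs on the calling thread while the reader gets its own thread, and `Slot` and `copy_double_buffered` are hypothetical names. Error handling is omitted.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <istream>
#include <mutex>
#include <ostream>
#include <sstream>
#include <thread>
#include <vector>

struct Slot {
    std::vector<char> data;
    std::size_t used = 0;
    bool full = false;  // true: waiting for the writer; false: free for the reader
};

void copy_double_buffered(std::istream& in, std::ostream& out,
                          std::size_t buffer_size) {
    assert(buffer_size > 0);
    std::array<Slot, 2> slots;
    for (Slot& s : slots) s.data.resize(buffer_size);
    std::mutex m;
    std::condition_variable cv;
    bool done = false;  // set by the reader once the input is exhausted

    // Reading thread: fills the two slots alternately.
    std::thread reader([&] {
        std::size_t i = 0;
        for (;;) {
            Slot& s = slots[i];
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !s.full; });  // wait until drained
            }
            in.read(s.data.data(), static_cast<std::streamsize>(buffer_size));
            const std::size_t got = static_cast<std::size_t>(in.gcount());
            {
                std::lock_guard<std::mutex> lock(m);
                if (got == 0) {
                    done = true;
                } else {
                    s.used = got;
                    s.full = true;
                }
            }
            cv.notify_all();
            if (got == 0) return;
            i ^= 1;  // switch to the other buffer
        }
    });

    // Writing side: drains the slots in the same alternating order.
    std::size_t i = 0;
    for (;;) {
        Slot& s = slots[i];
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return s.full || done; });
            if (!s.full) break;  // input finished and nothing left to write
        }
        out.write(s.data.data(), static_cast<std::streamsize>(s.used));
        {
            std::lock_guard<std::mutex> lock(m);
            s.full = false;  // hand the buffer back to the reader
        }
        cv.notify_all();
        i ^= 1;
    }
    reader.join();
}
```

While the writer is busy with one buffer, the reader can already fill the other, so neither side waits unless it gets a full buffer ahead. On some toolchains the program has to be built with thread support enabled (for example `-pthread` with GCC or Clang).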
Sparse Data
If there is a lot of zero or unused data in the matrix, you should try using Sparse Matrices. Essentially, this is a list of structures that hold the data's coordinates and the value. This also works when most of the data is a common value other than zero. This saves a lot of memory space; but costs a little bit more execution time.
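A very small sparse matrix along these lines is sketched below. Instead of a plain list of (coordinates, value) structures it uses a `std::map` keyed by the coordinates, which is a substitution made here so that lookups do not need a linear scan. `SparseMatrix` and its member names are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <utility>

// Stores only the cells whose value differs from a common default value.
class SparseMatrix {
public:
    explicit SparseMatrix(double default_value = 0.0) : default_(default_value) {}

    void set(std::size_t row, std::size_t col, double value) {
        const Key key(row, col);
        if (value == default_) {
            cells_.erase(key);  // default-valued cells cost no memory
        } else {
            cells_[key] = value;
        }
    }

    double get(std::size_t row, std::size_t col) const {
        const std::map<Key, double>::const_iterator it = cells_.find(Key(row, col));
        return it == cells_.end() ? default_ : it->second;
    }

    // Number of cells that actually occupy memory.
    std::size_t stored() const { return cells_.size(); }

private:
    typedef std::pair<std::size_t, std::size_t> Key;  // (row, column)
    double default_;
    std::map<Key, double> cells_;
};
```

A matrix with billions of logical cells but only a handful of non-default entries then costs a handful of map nodes. Each access becomes a tree lookup instead of an array index, which is the extra execution time mentioned above.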
Data Compression
You could also change your algorithms to use data compression. The idea here is to store the data location, the value and the number of contiguous equal values (a.k.a. runs). So instead of storing 100 consecutive data points of the same value, you would store the starting position (of the run), the value, and 100 as the quantity. This saves a lot of space, but requires more processing time when accessing the data.
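The run-based storage described above can be sketched as follows. `Run`, `rle_encode` and `rle_at` are hypothetical names, and the sketch assumes the values are doubles that compare exactly equal within a run.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// One run: where it starts, the repeated value, and how many times it repeats.
struct Run {
    std::size_t start;
    double value;
    std::size_t count;
};

// Collapse consecutive equal values into runs.
std::vector<Run> rle_encode(const std::vector<double>& data) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (!runs.empty() && runs.back().value == data[i]) {
            ++runs.back().count;  // extend the current run
        } else {
            const Run r = {i, data[i], 1};  // start a new run at position i
            runs.push_back(r);
        }
    }
    return runs;
}

// Comparison used by the binary search below.
static bool index_before_run(std::size_t index, const Run& run) {
    return index < run.start;
}

// Look up the value at a logical index without decompressing anything.
double rle_at(const std::vector<Run>& runs, std::size_t index) {
    // Find the first run that starts after index, then step back one run.
    std::vector<Run>::const_iterator it =
        std::upper_bound(runs.begin(), runs.end(), index, index_before_run);
    if (it == runs.begin()) throw std::out_of_range("rle_at: index out of range");
    --it;
    if (index - it->start >= it->count)
        throw std::out_of_range("rle_at: index out of range");
    return it->value;
}
```

A sequence of 100 equal values followed by three others is stored as a few `Run` entries rather than 103 doubles. The extra processing time shows up in `rle_at`, which needs a binary search over the runs instead of a direct array access.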
Memory Mapped File
There are libraries that can treat a file as memory. Essentially, they read in a "page" of the file into memory. When the requests go out of the "page", they read in another page. All this is performed "behind the scenes". All you need to do is treat the file like memory.
Summary
Arrays and deques are not your primary issue; the quantity of data is. Your primary issue can be resolved by processing small pieces of data at a time, compressing the data storage, or treating the data in the file as memory. If you are trying to process streaming data, don't. Ideally, streaming data should be placed into a file and then processed later.
A historical purpose of a file is to contain data that doesn't fit into memory.
