How many characters can the STL string class hold? - c++

I need to work with a series of characters. The number of characters can be up to 10^11.
In a usual array, it's not possible. What should I use?
I wanted to use the gets() function to read the string. But is this possible with STL containers?
If not, then what's the way?
Example:
input:
AMIRAHID
output: A.M.I.R.A.H.I.D
How could this be made possible if the number of characters were reduced to 10^10 on a 32-bit machine?
Thank you in advance.

Well, that's roughly 100 GB of data. No usual string class will be able to hold more than fits into your main memory. You might want to look at STXXL, an implementation of the STL that allows storing part of the data on disk.

If your machine has 10^11 bytes == roughly 93 GB of memory, then it's probably a 64-bit machine, so string will work. Otherwise nothing will help you.
Edited answer for the edited question: In that case you don't really need to store the whole string in memory. You can store only a small part of it that fits into memory.
Just read every character from the input, write it to the output and write a dot after it. Repeat until you get an EOF on the input. To increase performance you can read and write large chunks of the data, as long as each chunk still fits into memory.
Such algorithms are called online algorithms.
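A minimal sketch of that per-character approach, using stdin/stdout; the separator logic is an assumption based on the example output above:

#include <cstdio>

// Minimal sketch of the online approach: echo each character with a dot
// between characters, never storing more than one character in memory.
int main()
{
    bool first = true;
    for (int c; (c = std::getchar()) != EOF; ) {
        if (c == '\n') {          // pass line breaks through unchanged
            std::putchar(c);
            first = true;
            continue;
        }
        if (!first)
            std::putchar('.');    // dot between characters, not after the last
        std::putchar(c);
        first = false;
    }
    return 0;
}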

It is possible for an array that large to be created. But not on a 32-bit machine. Switching to STL will likely not help, and is unnecessary.

You need to contemplate how much memory that is, and if you have any chance of doing it at all.
1011 is roughly 100 gigabytes, which means you will need a 64-bit system (and compiler) to even be able to address it.
STL's strings support a max of max_size() characters, so the answer can change with the implementation.
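The limit can be queried at run time; a quick check (the exact number printed is implementation-specific):

#include <iostream>
#include <string>

// Print the implementation-defined upper bound on string length. This is a
// theoretical limit; real allocations fail much earlier, once address space
// and physical memory run out.
int main()
{
    std::string s;
    std::cout << "std::string::max_size() = " << s.max_size() << '\n';
    return 0;
}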

A string suffers from the same problem as an array: it has to fit in memory.
10^11 characters would take up roughly 93 GB, far more than the 4 GB address space of a 32-bit machine. You either need to split up your data into smaller chunks and only load a bit of it at a time, or switch to 64-bit, in which case both arrays and strings should be able to hold the data (although it may still be preferable to split it up into multiple smaller strings/arrays).

The SGI version of the STL has a rope class (a rope is a big string, get it?).
I am not sure it is designed to handle that much data, but you can have a look:
http://www.sgi.com/tech/stl/Rope.html
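GCC still ships a descendant of that rope as a non-standard extension; a small usage sketch, assuming libstdc++'s <ext/rope> header is available:

#include <ext/rope>   // GCC extension, not part of standard C++
#include <iostream>

// Sketch: a rope of char behaves much like a string but is stored as a tree
// of chunks, which makes huge concatenations and inserts cheap.
int main()
{
    __gnu_cxx::crope r;           // crope == rope<char>
    r.append("AMIRAHID");
    r.push_back('!');
    std::cout << r.c_str() << " (" << r.size() << " chars)\n";
    return 0;
}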

If all you're trying to do is read in some massive file and write to another file the same data with periods interspersed between each character, why bother reading the whole thing into memory at once? Pick some reasonable buffer size and do it in chunks.
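A sketch of that chunked approach; the buffer size is arbitrary and the dot placement is simplified to a dot after every character:

#include <cstdio>
#include <vector>

// Chunked variant: read a block, expand it into an output buffer with dots
// interspersed, write it out, repeat. Memory use is bounded by the two
// buffers regardless of how long the input is.
int main()
{
    const std::size_t kChunk = 1 << 20;          // 1 MiB per read, arbitrary
    std::vector<char> in(kChunk), out(2 * kChunk);

    std::size_t n;
    while ((n = std::fread(in.data(), 1, in.size(), stdin)) > 0) {
        std::size_t o = 0;
        for (std::size_t i = 0; i < n; ++i) {
            out[o++] = in[i];
            out[o++] = '.';
        }
        std::fwrite(out.data(), 1, o, stdout);
    }
    return 0;
}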

Related

Create array up to 10^12

I tried to create an array with up to 10^12 elements in C++, but I can only make an array of up to about 1000001 elements, i.e.
long long int dp[1000001];
But I want to store up to 10^12 values in the array. Any idea how I can implement this in C++?
First, you must realize that the size of that array is nearly 8 TB. Does your computer have that much memory? Probably not. In that case, you cannot store that much data in memory, and practically cannot have such a large array.
Any idea how I can implement this
Instead of an array in memory, you could store the data in the file system... assuming you have 8 TB of free storage. You can use a paging mechanism to read and write small pieces of the file at a time.
The simplest way to implement that in C++ is to use operating system functionality to map the file into the memory. That way the operating system takes care of the paging. There is no standard way to map files into memory in C++, so first step is to figure out what operating system you're using. POSIX standard specifies mmap function for this purpose.
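On a POSIX system, a sketch might look like the following (the file name is a placeholder and error handling is minimal):

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <cstdio>

// POSIX-only sketch: map an existing large file and treat it as an in-memory
// array. The operating system pages data in and out on demand.
int main()
{
    const char* path = "huge_data.bin";          // placeholder file name
    int fd = open(path, O_RDWR);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }

    data[0] = 42;                                // use it like a plain array
    munmap(data, st.st_size);
    close(fd);
    return 0;
}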
Before doing that however, I recommend considering whether you actually need to store that much data. Perhaps you need a smarter algorithm instead.

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I thought it would not be a problem to read the whole file into memory at once and only process it afterwards.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    ifstream file;
    file.open("big_file.dat");
    if (!file.is_open())
        cerr << "The file was not found\n";

    stringstream sstr;
    sstr << file.rdbuf();
    string text = sstr.str();

    cout << "Successfully read file!\n";
    return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?
The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM – it's totally up to your operating system to decide how much memory is available to you here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. On top of that, libc++/libc will probably have their own allocators that add some allocation overhead here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
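A rough sketch of what that looks like with Boost.Iostreams' read-only mapping, mapped_file_source (the file name is a placeholder):

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

// Sketch: map the file read-only and access its bytes through a plain
// pointer, letting the OS handle the paging.
int main()
{
    boost::iostreams::mapped_file_source file("big_file.dat");  // placeholder
    const char* data = file.data();
    std::size_t size = file.size();

    std::cout << "Mapped " << size << " bytes, first byte = "
              << static_cast<int>(data[0]) << '\n';
    return 0;
}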
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
It depends; I actually think the fastest way of dealing with this (since file I/O is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables. Thanks to your OS's prefetching, you may not even gain that much speed by spawning separate threads for file reading and float conversion. std::copy, used to read from a std::ifstream into a std::vector<float>, should work fine here.
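For the comma-separated layout described by the asker, a plain istream_iterator<float> would stop at the first comma, so a minimal sketch might normalize the separators line by line (the file name is a placeholder):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Sketch: parse the file incrementally into a vector<float>, holding at most
// one line of text in memory at a time.
int main()
{
    std::ifstream in("big_file.dat");            // placeholder file name
    std::vector<float> values;
    std::string line;

    while (std::getline(in, line)) {
        std::replace(line.begin(), line.end(), ',', ' ');  // commas -> spaces
        std::istringstream row(line);
        float v;
        while (row >> v)
            values.push_back(v);
    }
    std::cout << "Parsed " << values.size() << " floats\n";
    return 0;
}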
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: First, of course, you will have more context switches than you'd have if you just ordered the whole thing to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libcs know quite well how to optimize reads, and thus will fetch a whole lot of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: Performance seems to be important to you (is it really that important here? You're using a brain-dead file format for interchanging floats, which is bloated, loses precision, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilobytes to megabytes, read up to \n boundaries and exchanged with a thread that builds the in-memory table of floats, sounds like it would basically reduce your read+parse time down to read+non-measurable, without sacrificing read performance and without the need for gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
A typical 32-bit single-precision IEEE 754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponent marker, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE 754 32-bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, each character is 1 byte, meaning that you blow up your 4 bytes of actual data by a factor of 3, while still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.
In a 32-bit executable, the total memory address space is 4 GB. Of that, sometimes 1-2 GB is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous address space. To copy it, you need two 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64-bit executable. This will not run on a 32-bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and/or streaming it starts making a lot of sense. Done right, you'll also be able to start processing it before you finish reading it.
There are many file I/O data structures, from STXXL to Boost, or you can roll your own.
The size of the heap (a pool of memory used for dynamic allocations) is limited independently of the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations, which will probably force you to change the way you read from the file.
If you are running on a UNIX-based system you can look at the mmap function (vmalloc is a kernel-space allocator), or at the VirtualAlloc function if you are running on the Windows platform.

Declaring large character array in c++

I am trying right now to declare a large character array. I am using the character array as a bitmap (as in a map of booleans, not the image file type). The following code generates a compilation error.
//This is code before main. I want these as globals.
unsigned const long bitmap_size = (ULONG_MAX/(sizeof(char)));
char bitmap[bitmap_size];
The error is overflow in array dimension. I recognize that I'm trying to have my process consume a lot of data and that there might be some limit in place that prevents me from doing so. I am curious as to whether I am making a syntax error or if I need to request more resources from the kernel. Also, I have no interest in creating a bitmap with some class. Thank you for your time.
EDIT
ULONG_MAX is very much dependent upon the machine you are using. On the particular machine I was compiling my code on it was well over 4.2 billion. All in all, I wouldn't use a constant like that, at least not for the purpose of memory allocation.
ULONG_MAX/sizeof(char) is the same as ULONG_MAX, which is a very large number. So large, in fact, that you don't have room for it even in virtual memory (because ULONG_MAX is probably the number of bytes in your entire virtual memory).
You definitely need to rethink what you are trying to do.
It's impossible to declare an array that large on most systems -- on a 32-bit system, that array is 4 GB, which doesn't fit into the available address space, and on most 64-bit systems, it's 16 exabytes (16 million terabytes), which doesn't fit into the available address space there either (and, incidentally, may be more memory than exists on the entire planet).
Use malloc() to allocate large amounts of memory. But be realistic. :)
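For example, a heap-allocated bitmap of a more realistic size might look like this (the 256 MiB figure below is arbitrary):

#include <cstdio>
#include <cstdlib>

// Sketch: allocate the bitmap on the heap and check the result, instead of
// declaring a global array whose size exceeds the address space.
int main()
{
    const std::size_t bitmap_bytes = 256UL * 1024 * 1024;   // 256 MiB, arbitrary
    unsigned char* bitmap =
        static_cast<unsigned char*>(std::calloc(bitmap_bytes, 1));
    if (!bitmap) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    bitmap[0] = 1;        // use it...
    std::free(bitmap);
    return 0;
}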
As I understand it, the maximum size of an array in C++ is limited by the largest integer the platform supports. It is likely that your long-type bitmap_size constant exceeds that limit.

C++ Qt WordCount and large data sets

I want to count words occurrences in a set of plain text files. Just like here http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very large number of plain text files, so my results stored in a QMap could not fit into memory.
I googled external memory (file-based) merge sort algorithms, but I'm too lazy to implement one myself. So I want to divide the result set into portions so that each of them fits into memory, then store these portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge) but for Qt containers.
You might wanna check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates the list, changing its type shouldn't be a problem. As far as I remember, you can use the standard STL sorting algorithms on those lists. The only remaining problem is performance.
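If you do end up rolling your own, here is a rough, non-Qt sketch of the merge step over sorted run files. Each run file is assumed to hold lines of the form "word count", sorted by word, and all file names are placeholders:

#include <cstddef>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Sketch of a k-way merge over sorted run files; equal words coming from
// different runs are summed while merging.
struct Entry {
    std::string word;
    long long count;
    std::size_t run;            // which run file this entry came from
    bool operator>(const Entry& o) const { return word > o.word; }
};

int main()
{
    std::vector<std::string> run_names = {"run0.txt", "run1.txt", "run2.txt"};
    std::vector<std::ifstream> runs;
    for (const auto& name : run_names)
        runs.emplace_back(name);

    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    auto refill = [&](std::size_t i) {       // push the next entry of run i
        Entry e;
        e.run = i;
        if (runs[i] >> e.word >> e.count)
            heap.push(e);
    };
    for (std::size_t i = 0; i < runs.size(); ++i)
        refill(i);

    std::ofstream out("result.txt");
    std::string current;
    long long total = 0;
    while (!heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        if (e.word != current) {             // flush the finished word
            if (!current.empty())
                out << current << ' ' << total << '\n';
            current = e.word;
            total = 0;
        }
        total += e.count;
        refill(e.run);
    }
    if (!current.empty())
        out << current << ' ' << total << '\n';
    return 0;
}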
I presume that the map contains the association between the word and the number of occurrences. In that case, why do you say you have such significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption per word?
Considering 1,000,000 words, with 1 KB memory consumption per word (that includes the word text and the QMap-specific storage), that would lead to (approx.) 1 GB of memory, which... doesn't seem like much to me.

How can I manage bits/binary in c++?

What I need to do is open a text file with 0s and 1s to find patterns between the columns in the file.
So my first thought was to parse each column into a big array of bools, and then do the logic between the columns (now in arrays). Then I found out that a bool is actually stored as a byte, not a bit, so I would be wasting 7/8 of the memory by assigning each value to a bool.
Is it even relevant in a grid of 800x800 values? What would be the best way to handle this?
I would appreciate a code snippet in case it's a complicated answer.
You could use std::bitset or Boost's dynamic_bitset, which provide various methods that will help you manage your bits.
For example, they support constructors that create bitsets from other built-in types like int or char. You can also export the bitset to an unsigned long or to a string (which could then be turned into a bitset again, etc.).
I once asked about concatenating those, which turned out not to be possible efficiently. But perhaps you could use the information in that question too.
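A small sketch of the std::bitset route, flattening the 800x800 grid into one fixed-size bitset (about 80 KB of storage):

#include <bitset>
#include <cstddef>
#include <iostream>

// Sketch: one fixed-size bitset holding an 800x800 grid, one bit per cell.
const std::size_t kWidth = 800, kHeight = 800;
std::bitset<kWidth * kHeight> grid;            // global, roughly 80 KB

inline std::size_t idx(std::size_t row, std::size_t col)
{
    return row * kWidth + col;
}

int main()
{
    grid.set(idx(3, 5));                        // set cell (3,5)
    bool v = grid.test(idx(3, 5));              // read it back
    std::cout << v << ' ' << grid.count() << '\n';   // count() = bits set
    return 0;
}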
You can use std::vector<bool>, which is a specialization of vector that uses compact storage for booleans: 1 bit per element, not 8 bits.
I think it was Knuth who said "premature optimization is the root of all evil." Let's find out a little bit more about the problem. Your array is 800*800 == 640,000 bytes, which is no big deal on anything more powerful than a digital watch.
Storing it as bytes may seem wasteful -- as you say, 7/8ths of the memory is redundant -- but on the other hand, most machines don't do bit operations as efficiently as byte operations; by saving the memory, you might waste so much effort masking and testing that you would have been better off with the byte model.
On the other hand, if what you want to do with it is look for larger patterns, you might want to use a bitwise representation because you can do things with 8 bits at a time.
The real point here is that there are several possibilities, but no one can tell you the "right" representation without knowing what the problem is.
For that size grid, your array of bools would be about 640 KB. Whether that is a problem depends on how much memory you have. It would probably be the simplest option for the logic-analysis code.
By grouping the bits and storing them in an array of ints, you could drop the memory requirement to 80 KB, but the logic code would be more complicated, as you'd always be isolating the bits you wanted to check; see the sketch below.
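A sketch of that packed representation and the shifting/masking it requires (32-bit words assumed):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch: pack the 800x800 grid into 32-bit words; get/set require shifting
// and masking, which is the extra complexity mentioned above.
const std::size_t kWidth = 800, kHeight = 800;

struct PackedGrid {
    std::vector<std::uint32_t> words;
    PackedGrid() : words((kWidth * kHeight + 31) / 32, 0) {}

    void set(std::size_t row, std::size_t col, bool value)
    {
        std::size_t bit = row * kWidth + col;
        if (value)
            words[bit / 32] |=  (std::uint32_t(1) << (bit % 32));
        else
            words[bit / 32] &= ~(std::uint32_t(1) << (bit % 32));
    }

    bool get(std::size_t row, std::size_t col) const
    {
        std::size_t bit = row * kWidth + col;
        return (words[bit / 32] >> (bit % 32)) & 1u;
    }
};

int main()
{
    PackedGrid g;
    g.set(3, 5, true);
    std::cout << g.get(3, 5) << '\n';
    return 0;
}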