Reading in a large file - c++

I have to read in a large file, character by character, and put each character in a map with a corresponding key. My question is, is there a way to read in the file to the map and save it there, so the program doesn't have to read the whole file character by character each time (takes forever)?
The characters are used later in the program to do an encode/decode thing.

Well, yes, it's going to take a while, but you can use a std::unordered_multimap to speed it up a little by skipping the sort phase.

Related

Is there any way to read from a .csv file without the use of getline()?

In a project that was assigned to me, that has to do with reading .csv files, my teacher made it clear that I cannot use strings. Wherever I have looked for help, the only solution I have found is the use of the function getline(), which only takes strings. Is there any other possible way to do it?
It sounds as though your teacher is expecting you to test the input one character at a time (until end of line), and so is looking to see you do things such as
test for end of file
read characters and save progressively into variables
test for the comma character and write out the current variable
ignore characters that aren't part of your required input. (You didn't mention whether the CSV is of strings or of numeric values, but if the exercise was to identify integer values then you'd be rejecting all non-numeric characters.)
Avoiding strings is technically not possible (since that is what a CSV is!) but I would guess the intention is to get you understanding streams as characters and writing logic to do with each character at a time.

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say, for example, that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine; I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers, which might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair, won't be. Depending on the OS you're using, it might not even mean that much, though; for example, on Linux, there's normally no difference between binary and text mode at all.
Well, there are no rules broken and you'll get away with that just fine, except that you may miss the precision of reading binary from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed to std::getline. But the stream will no longer encapsulate the number of bytes you consumed in the last unformatted operation.

How to go to the m-th line and n-th character of a file?

If I want to insert or copy something from the m-th line and n-th character in a file, what should I do? Is there a way better than using getline for m times and seekp? Thanks.
Is there a way better than using getline for m times and seekp?
Not really! Lines aren't "special" at the operating system level; they're just parts of a text file separated by the newline character. The only way to get to line m of a text file is to read through all of the file until you've seen m - 1 newlines. Your C++ library's getline() function is likely to have a pretty efficient implementation of that operation already, so you're probably best off just using that.
If your application needs to seek to specific lines of a large file many times during a single run, it may make sense to read the whole file into a data structure at startup (e.g., an array of structures, each one representing a single line of text); once you've done this, seeking to a specific line is as easy as an array lookup. But if you only need to seek to a specific line once, that's not necessary.
A more memory-efficient approach for repeated seeks in larger files may be to record the file offset for each line number as you encounter it, so that you can easily return to a given line without starting over from the beginning. Again, though, this is only necessary if seeks will be repeated many times.

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read characters that satisfy a certain condition from stdin with the iostream library, while leaving those that don't satisfy the condition in stdin so that the skipped characters can be read later. Is it possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First, read in all characters in a-c (abca); after this input has finished, start reading the remaining characters (dddx). (These two reads can't happen simultaneously; they might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals: it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
cin >> myString;
// Do stuff to fill putbackBuf[] with characters in reverse order to be put back
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
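The split-into-two-strings approach might look like the sketch below. `split_by_range` is an invented name, and the range check assumes the wanted characters are contiguous, like a-c in the question.

```cpp
#include <string>
#include <utility>

// Splits `input` into (characters inside [lo, hi], everything else),
// preserving the original order within each part.
std::pair<std::string, std::string> split_by_range(const std::string& input,
                                                   char lo, char hi) {
    std::pair<std::string, std::string> parts;
    for (char c : input) {
        (c >= lo && c <= hi ? parts.first : parts.second).push_back(c);
    }
    return parts;
}
```

With the input from the question, `split_by_range("abdddcxa", 'a', 'c')` gives "abca" and "dddx". The second string can then be wrapped in a std::istringstream and handed to whichever function expects to read the leftover characters from a stream.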
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think that but iostreams gurus may differ).

Incremental Stream Parsing in C++

I am reading data from a (serial) port using a non-blocking read() function (in C/C++).
This means the data arrives in chunks of undetermined (but reported) size (including 0) each time I "poll" the port. I then need to parse this "stream" for certain patterns (not XML).
My naive implementation concatenates the new string to the previous stream-string each time read() returns a non-zero buffer, and re-parses the whole string.
When a pattern is matched, the relevant part is discarded leaving only the tail of the string for the next time.
Obviously there are much more efficient ways to do this, for example incremental parsing à la SAX, deque-like buffers, string slices, etc.
Also, obviously, I am not the first to have to do this type of stream parsing.
Does anyone know of any library that already does this sort of thing?
Preventing memory-overflow in case of missing pattern matches would also be a big plus.
Thanks,
Adi
You can do some tricks depending on your pattern.
When looking for a single character like a newline, you only need to scan the new portion of the string.
If you are looking for \r\n then you only need to scan the new portion starting with the last character of the previous portion.
If you have a pattern with a known ending part then you only have to scan for that ending to know if you need to scan the whole buffer.
If you have some kind of synchronizing character, like semicolon in C/C++, then you can parse or declare a parse error whenever you find one.
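Here is a sketch of the \r\n case, assuming messages are terminated by "\r\n" (the question doesn't say what the real patterns are). `ChunkParser` and `max_pending` are invented names. Each call scans only the new chunk plus the last character already buffered, and the buffer is capped so a missing terminator can't make it grow forever.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical accumulator for "\r\n"-terminated messages arriving in chunks.
class ChunkParser {
public:
    explicit ChunkParser(std::size_t max_pending) : max_pending_(max_pending) {}

    // Appends a chunk and returns every message completed by it.
    std::vector<std::string> feed(const std::string& chunk) {
        // Resume one character before the new data so that a "\r\n" split
        // across two chunks is still found; older bytes are never rescanned.
        std::size_t from = buffer_.empty() ? 0 : buffer_.size() - 1;
        buffer_ += chunk;

        std::vector<std::string> messages;
        std::size_t start = 0;  // start of the message currently being built
        std::size_t hit;
        while ((hit = buffer_.find("\r\n", from)) != std::string::npos) {
            messages.push_back(buffer_.substr(start, hit - start));
            start = from = hit + 2;  // continue after the terminator
        }
        buffer_.erase(0, start);  // keep only the unfinished tail

        // Cap memory use when no terminator ever arrives: drop the oldest bytes.
        if (buffer_.size() > max_pending_) {
            buffer_.erase(0, buffer_.size() - max_pending_);
        }
        return messages;
    }

private:
    std::string buffer_;
    std::size_t max_pending_;
};
```

Call `feed()` with whatever each non-blocking read() returned; an empty chunk is harmless and simply yields no messages.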