Reading a file without reading the whole thing into memory

I am trying to read an extremely large text file. I want to write a C++ program that reads it line by line until it reaches a certain set of characters, then writes the text that follows into a string until it reaches another set of characters.
It is an XML file, so I'm looking at
<flag>info</flag>
I need my program to read the file until it reaches <flag>, put "info" into a string, and treat </flag> as the point to stop adding to the string. What tools could I use to actually read the file? As for detecting <flag>, I can do that myself.

Use an XML SAX parser such as Xerces; they will allow you to parse the XML file in a streaming fashion, so you don't need to load it into memory all at once. Reading line-by-line will not give you correct results on general XML files.

Related

"Input is provided as CSV format via STDIN"

I'm working on a programming problem in C++, where I need to write a CSV parser. I've written this before for files, but the instructions state:
Input is provided as CSV format via STDIN. The first line is a header. The subsequent lines are data.
Here's an example of the input "file":
#id,time,amount
0,4,5
2,8,3
8,1,2
...
Now I'm a bit confused by this because I haven't worked with STDIN much since picking up C++ a few years ago. How exactly does one read in a csv file through STDIN? When I've used std::cin in the past, if I try to paste multiple lines, only the first line will get read.
The instructions of the programming problem did not make it clear how the input "file" will be fed in through STDIN, or perhaps there's some classical way it's done and my lack of knowledge makes me think it's unclear? Is there some standard way a CSV file is read in through STDIN?
All I am tasked to do is to process what comes through STDIN, and I'm not given how things are passed into STDIN. I feel like I need to know how things are passed in to know what I'm supposed to do? Like it could be passed in character by character, line by line, entry by entry, or the entire file at a time?
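
For what it's worth, a program cannot tell how its standard input was supplied: whether the data is typed, piped, or redirected from a file, std::cin presents it as a single stream of characters that can be consumed line by line with std::getline until end of input. Below is a minimal sketch of that idea, assuming fields never contain quoted or embedded commas. The names split_csv_line and read_csv are hypothetical helpers, not part of any library.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV line on commas.
// Assumption: no quoted fields and no commas inside a field.
std::vector<std::string> split_csv_line(std::string line) {
    // Drop a trailing carriage return left over from Windows line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    std::vector<std::string> fields;
    std::string field;
    std::istringstream ss(line);
    while (std::getline(ss, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Read a header line followed by data rows from any input stream.
// The header fields are stored in `header`; the data rows are returned.
std::vector<std::vector<std::string>> read_csv(std::istream& in,
                                               std::vector<std::string>& header) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    if (std::getline(in, line)) {
        header = split_csv_line(line);
    }
    // The loop ends when the stream reaches end of input.
    while (std::getline(in, line)) {
        std::vector<std::string> fields = split_csv_line(line);
        if (!fields.empty()) rows.push_back(fields); // skip blank lines
    }
    return rows;
}
```

Calling read_csv(std::cin, header) from main and starting the program with its input redirected from a file (for example ./parser < input.csv, where both names are placeholders) sends the whole file through STDIN.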

How to read .inp file in c++?

I have a dataset in a ".inp" file that I need to read in C++. However, fopen() and fread() seem to read the wrong data (e.g. the first integer should be 262144, but fread yields a much larger integer).
To be more specific, my ".inp" file contains a few integers and floating-point numbers. How can I read them correctly in C++?
Opened in Notepad++ (screenshot not reproduced here), the "*.inp" file is basically a text file.

I solved it by copying the data into a .txt file. However, I still don't know how to read "*.inp" directly.

I found some info about the INP file extension. It seems there are multiple variants of it, each meant for a different purpose. Where does your file come from? As for a solution, if you can't open the file normally using fopen/fstream, you could treat it as binary and read each value in the way you specify. Other than that, you could call system utilities to get the file contents (cat on Linux, for example); if the output contains stray characters, you could parse the string to omit them.
Here is an example of how to call cat from C++:
Simple way to call 'cat' from c++?
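
Since the question reports that the file is plain text, one plausible explanation for the oversized integer is that fread copies the raw character bytes of the digits into the int instead of parsing them. Formatted extraction with operator>> parses the text. The sketch below assumes a hypothetical layout (one integer count followed by that many floating-point values); the real ".inp" layout is not known from the question, and InpData and read_inp are illustrative names.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout: one integer count, then `count` floating-point values,
// all separated by whitespace (spaces or newlines).
struct InpData {
    long count = 0;
    std::vector<double> values;
};

// Parse the numbers as text with operator>>. Returns false if the count is
// missing or negative, or if fewer values than announced could be read.
bool read_inp(std::istream& in, InpData& data) {
    if (!(in >> data.count) || data.count < 0) return false;
    const std::size_t wanted = static_cast<std::size_t>(data.count);
    data.values.clear();
    double value = 0.0;
    while (data.values.size() < wanted && in >> value) {
        data.values.push_back(value);
    }
    return data.values.size() == wanted;
}
```

The same function works on a std::ifstream opened on the ".inp" file, because the file extension does not affect how the stream reads the contents.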

Efficiently read the last row of a csv file

Is there an efficient C or C++ way to read the last row of a CSV file? The naive approach involves reading in the entire file and then going to the end. Is there a quicker way this can be done (particularly if the CSV files are large)?

What you can do is guess the line length, then jump 2-3 lines before the end of the file and read the remaining lines. The last line you read is the last line of the file, as long as you read at least one line before it (otherwise, start again with a bigger offset).
I posted some sample code for doing a similar thing (reading the last N lines) in this answer (in PHP, but it serves as an illustration).
For implementations in a variety of languages, see
C++ : c++ fastest way to read only last line of text file?
Python : Efficiently finding the last line in a text file
Perl : How can I read lines from the end of file in Perl?
C# : Get last 10 lines of very large text file > 10GB c#
PHP : how to read only 5 last line of the txt file
Java: Read last n lines of a HUGE file
Ruby: Reading the last n lines of a file in Ruby?
Objective-C : How to read data from NSFileHandle line by line?

You can try working backwards. Read some size block of bytes from the end of the file, and look for the newline. If there is no newline in that block, then read the previous block, and so on.
Note that if a row is large relative to the size of the file, this may result in worse performance, because most file caching schemes assume the file is read forward.
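
The backwards block scan described above could look roughly like this in C++. This is a sketch, not a tuned implementation: last_line is a hypothetical name, the default block size is arbitrary, and the stream is assumed to be seekable (a std::ifstream should be opened with std::ios::binary so that offsets correspond to bytes).

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Return the last non-empty line of a seekable stream (for example a
// std::ifstream or std::istringstream) by reading blocks from the end.
std::string last_line(std::istream& in, std::size_t block_size = 4096) {
    if (block_size == 0) block_size = 4096;
    in.clear();
    in.seekg(0, std::ios::end);
    std::streamoff pos = in.tellg();
    if (pos <= 0) return "";

    std::string tail; // bytes collected so far, counted from the end of the file
    while (pos > 0) {
        const std::streamoff chunk =
            std::min<std::streamoff>(pos, static_cast<std::streamoff>(block_size));
        pos -= chunk;
        std::string block(static_cast<std::size_t>(chunk), '\0');
        in.seekg(pos);
        in.read(&block[0], chunk);
        tail.insert(0, block);

        // Ignore newline characters that terminate the file.
        const std::size_t end = tail.find_last_not_of("\r\n");
        if (end == std::string::npos) continue; // only newlines so far
        // A newline before `end` marks the start of the last line.
        const std::size_t nl = tail.find_last_of('\n', end);
        if (nl != std::string::npos) return tail.substr(nl + 1, end - nl);
    }
    // No newline found before the text: the whole file is a single line.
    const std::size_t end = tail.find_last_not_of("\r\n");
    return end == std::string::npos ? "" : tail.substr(0, end + 1);
}
```

Only as many blocks are read as it takes to find the start of the last line, so the cost depends on the length of that line and not on the size of the file.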

You can use Perl module File::ReadBackwards.

Your problem falls into the same domain as searching for a string within a file. As you rightly point out, it's not always a great idea to read the entire file into memory and then search for your string. But you can always do the next best thing: memory map your file, then use your string searching functions to search backwards from the end of the mapped data for your newline.
It's an extremely efficient mechanism with minimal memory footprint and optimum disk I/O.

Read with what and on what? On a Unix system, if you want the last line, it is as simple as
tail -n1 file.csv
If you want this approach from within your C++ app, you can do something like
system("tail -n1 file.csv")
if you want a quick and dirty way to accomplish this task.

Retrieving file from .dat via getline() w/ c++

I posted this over at Code Review Beta but noticed that there is much less activity there.
I have the following code and it works just fine. Its function is to grab the input from a file and display it (to confirm that it's been grabbed). My task is to write a program that counts how many times a certain word (string) "abc" is found in the input file.
Is it better to store the input as a single string, or in arrays/vectors with each line stored separately (a[1], a[2], etc.)? Perhaps someone could also point me to a resource that I can use to learn how to filter through the input data.
Thanks.
input_file.open("in.dat");
while (getline(input_file, STRING)) // Reads one line per iteration; stops at end of file or on a read error.
{
    cout << STRING << '\n'; // Prints the line; getline drops the newline, so add it back.
}
input_file.close();

Reading as much of the file as possible into memory is generally more efficient than reading one character or one line of text at a time. Disk drives take a long time to spin up and seek to a sector, so your program will run faster if you minimize the number of reads from the file.
Memory is fast to search.
My recommendation is to read the entire file, or as much of it as you can, into memory, then search the memory for a "word". Remember that in English, words can contain hyphens ('-') and apostrophes, as in "don't". Word recognition may become more difficult if a word is split across lines or if you include abbreviations (with periods).
Good luck.
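
As a rough illustration of the counting step, here is a sketch that counts plain substring matches of "abc" one line at a time; the same find loop would apply unchanged to a single string holding the whole file. The name count_occurrences is hypothetical, and matches are non-overlapping substrings, not whole words, so "xabcx" counts as one hit.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count non-overlapping occurrences of `word` in everything read from `in`.
// Assumption: `word` contains no newline, so a match never spans two lines.
std::size_t count_occurrences(std::istream& in, const std::string& word) {
    std::size_t count = 0;
    if (word.empty()) return count;
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = 0;
        // find returns npos once there are no more matches in this line.
        while ((pos = line.find(word, pos)) != std::string::npos) {
            ++count;
            pos += word.size(); // continue searching after this match
        }
    }
    return count;
}
```

Passing a std::ifstream opened on "in.dat" together with the string "abc" would give the count the question asks for, under the substring interpretation above.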

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters obviously just display as little ascii squiggles -arrows, smiley faces etc, which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;
    //open text file
    codeFile.open(inFilePath,ios::in);
    //read file line by line; the loop stops as soon as getline fails
    while ( getline(codeFile,line) )
    {
        //check non-zero length
        if (!line.empty())
            ProcessLine(&line[0]);
    }
    //close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.

From reading about your issues, it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
1. If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process.
2. You can of course also do this programmatically in your code by reading in X amount of data, storing it in a string, and then removing those characters that fall outside the standard ASCII character range. This will most likely cause you to lose any Unicode that may be stored in the file.
3. You change your program to understand the format and basically write a parser that allows you to parse the document in a more sane way.
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you by any chance know what file format this is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the utility file, and maybe it can give you a clue (worst case, it will tell you it is simply data).
If possible, try getting a "clean" file that you can post the hex dump of, so that we can try to provide better help than what we are currently providing. By "clean" I mean that there is no personally identifying information in the file.
For number 2, open the file in binary mode. You mentioned using Windows, where std::fstream objects handle binary and non-binary files differently, whereas on UNIX systems this is not the case (on most systems; I'm sure I'll get a comment regarding the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, sizeof input);
std::streamsize actual_read = codeFile.gcount(); // number of bytes actually read
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
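
To make option 2 a bit more concrete, here is a sketch that reads a stream in fixed-size chunks with read() and gcount() and keeps only printable ASCII, newlines, and tabs. The name read_printable and the choice of which characters to keep are assumptions for illustration. One reason binary mode may matter here is that text mode on Windows can treat the byte 0x1A (Ctrl-Z) as end of file, although the question does not confirm that this is what is happening.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Read everything from `in` in 1024-byte chunks and drop every byte that is
// not printable ASCII, a newline, or a tab. For a file, open the stream with
// ios::in | ios::binary so that no byte is interpreted specially.
std::string read_printable(std::istream& in) {
    std::string out;
    char buffer[1024];
    // read() fails on the final, partial chunk, so also check gcount().
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        const std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            const unsigned char c = static_cast<unsigned char>(buffer[i]);
            if (std::isprint(c) || c == '\n' || c == '\t') {
                out.push_back(static_cast<char>(c));
            }
        }
    }
    return out;
}
```

The cleaned string can then be wrapped in a std::istringstream and split into lines with getline() as before.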
Another option, as mentioned, is to change the locale for the current stream and change the separator it treats as a newline; this may fix your issue without requiring the unformatted operations:
imbue the stream with a new locale that only knows about the newline. This method may or may not let getline() work without issues.
