C++ File I/O Tab separated data - c++

I have a text file with numbers as follows:
num1 TAB num2 TAB.... num22 newline
.
.
.
I would like to read num1 check to see if it is equal to 3 and if yes copy the entire row to a new file. What is the fastest way to do this? The file is quite big 80Mb+. Also, num 1 is repetitive, i.e it goes from 0 to 3 in steps of 0.001. So I just have to read every so many steps. I am not sure how to tell the computer to a-priori skip x-lines?
Thanks.

Given you've said that runtime performance is not a primary concern, then the following is clear and concise:
#include <string>
#include <fstream>
void foo(std::string const& in_fn, std::string const& out_fn)
{
std::ifstream is(in_fn);
std::ofstream os(out_fn);
std::string line;
while (std::getline(is, line))
if (line.size() && std::stoi(line) == 3)
os << line << '\n';
}
(C++11 support assumed; error handling omitted for brevity.)

Pseudocode could look like this:
while (not eof) {
    fgets(...);
    find the first TAB symbol or the end of the line
    take the string before that mark
    strip spaces and other unnecessary symbols from it
    float fval = atof(...);
    if (fval == 3) {
        write the whole line to the new file
    }
}

Related

Why is C++'s file I/O ignoring initial empty lines when reading a text file? How can I make it NOT do this?

I'm trying to build myself a mini programming language using my own custom regular expression and abstract syntax tree parsing library 'srl.h' (aka. "String and Regular-Expression Library") and I've found myself an issue I can't quite seem to figure out.
The problem is this: When my custom code encounters an error, it obviously throws an error message, and this error message contains information about the error, one bit being the line number from which the error was thrown.
The issue comes in the fact that C++ seems to just be flat out ignoring the existence of lines which contain no characters (ie. line that are just the CRLF) until it finds a line which does contain characters, after which point it stops ignoring empty lines and treats them properly, thus giving all errors thrown an incorrect line number, with them all being incorrect by the same offset.
Basically, if given a file which contains the contents "(crlf)(crlf)abc(crlf)def", it'll be read as though its content were "abc(crlf)def", ignoring the initial new lines and thus reporting the wrong line number for any and all errors thrown.
Here's a copy of the (very messily coded) function I'm using to get the text of a text file. If one of y'all could tell me what's going on here, that'd be awesome.
template<class charT> inline std::pair<bool, std::basic_string<charT>> load_text_file(const std::wstring& file_path, const char delimiter = '\n') {
std::ifstream fs(file_path);
std::string _nl = srl::get_nlp_string<char>(srl::newline_policy);
if (fs.is_open()) {
std::string s;
char b[SRL_TEXT_FILE_MAX_CHARS_PER_LINE];
while (!fs.eof()) {
if (s.length() > 0)
s += _nl;
fs.getline(b, SRL_TEXT_FILE_MAX_CHARS_PER_LINE, delimiter);
s += std::string(b);
}
fs.close();
return std::pair<bool, std::basic_string<charT>>(true, srl::string_cast<char, charT>(s));
}
else
return std::pair<bool, std::basic_string<charT>>(false, std::basic_string<charT>());
}
std::ifstream::getline() does not input the delimiter (in this case, '\n') into the string and also flushes it from the stream, which is why all the newlines from the file (including the leading ones) are discarded upon reading.
The reason it seems the program does not ignore newlines between other lines is because of:
if (s.length() > 0)
s += _nl;
All the newlines are really coming from here, but this cannot happen at the very beginning, since the string is empty.
This can be verified with a small test program:
#include <iostream>
#include <fstream>
#include <string>
int main()
{
std::ifstream inFile{ "test.txt" }; //(crlf)(crlf)(abc)(crlf)(def) inside
char line[80]{};
int lineCount{ 0 };
std::string script;
while (inFile.peek() != EOF) {
inFile.getline(line, 80, '\n');
lineCount++;
script += line;
}
std::cout << "***Captured via getline()***" << std::endl;
std::cout << script << std::endl; //prints "abcdef"
std::cout << "***End***" << std::endl << std::endl;
std::cout << "Number of lines: " << lineCount; //result: 5, so leading \n processed
}
If the if condition is removed, leaving just:
s += _nl;
then a newline is inserted for every line read, replacing the ones discarded from the file. But as long as '\n' is the delimiter, std::ifstream::getline() will continue discarding the original ones.
As a final touch, I would suggest using
while (fs.peek() != EOF){};
instead of
while(fs){}; or while(!fs.eof()){};
If you look at int lineCount's final value in the test program, the latter two give 6 instead of 5, as they make a redundant iteration in the end.

C++ file conversion: pipe delimited to comma delimited

I am trying to figure out how to turn this input file that is in pipe delimited form into comma delimited. I have to open the file, read it into an array, convert it into comma delimited in an output CSV file and then close all files. I have been told that the easiest way to do is within excel but I am not quite sure how.
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
ifstream inFile;
string myArray[5];
cout << "Enter the input filename:";
cin >> inFileName;
inFile.open(inFileName);
if(inFile.is_open())
std::cout<<"File Opened"<<std::endl;
// read file line by line into array
cout<<"Read";
for(int i = 0; i < 5; ++i)
{
file >> myArray[i];
}
// File conversion
// close input file
inFile.close();
// close output file
outFile.close();
...
What I need to convert is:
Miles per hour|6,445|being the "second" team |5.54|9.98|6,555.00
"Ending" game| left at "beginning"|Elizabeth, New Jersey|25.25|6.78|987.01
|End at night, or during the day|"Let's go"|65,978.21|0.00|123.45
Left-base night|10/07/1900|||4.07|777.23
"Let's start it"|Start Baseball Game|Starting the new game to win
What the output should look like in comma-delimited form:
Miles per hour,"6,445","being the ""second"" team member",5.54,9.98,"6,555.00",
"""Ending"" game","left at ""beginning""","Denver, Colorado",25.25,6.78,987.01,
,"End at night, during the day","""Let's go""","65,978.21",0.00,123.45,
Left-base night, 10/07/1900,,,4.07,777.23,
"""Let's start it""", Start Baseball Game, Starting the new game to win,
I will show you a complete solution and explain it to you. But let's first have a look at it:
#include <iostream>
#include <vector>
#include <fstream>
#include <regex>
#include <string>
#include <algorithm>
// I omit in the example here the manual input of the filenames. This exercise can be done by somebody else
// Use fixed filenames in this example.
const std::string inputFileName("r:\\input.txt");
const std::string outputFileName("r:\\output.txt");
// The delimiter for the source csv file
std::regex re{ R"(\|)" };
std::string addQuotes(const std::string& s) {
// If there are quote characters (") in the string, double each of them ("" is the CSV escape for ")
std::string result = std::regex_replace(s, std::regex(R"(")"), R"("")");
// If there is any quote (") or comma in the string, then quote the complete string
if (std::any_of(result.begin(), result.end(), [](const char c) { return ((c == '\"') || (c == ',')); })) {
result = "\"" + result + "\"";
}
return result;
}
// Some output function
void printData(std::vector<std::vector<std::string>>& v, std::ostream& os) {
// Go through all rows
std::for_each(v.begin(), v.end(), [&os](const std::vector<std::string>& vs) {
// Define delimiter
std::string delimiter{ "" };
// Show the delimited strings
for (const std::string& s : vs) {
os << delimiter << s;
delimiter = ",";
}
os << "\n";
});
}
int main() {
// We first open the output file, because if it cannot be opened, there is no point in doing the rest of the exercise
// Open output file and check, if it could be opened
if (std::ofstream outputFileStream(outputFileName); outputFileStream) {
// Open the input file and check, if it could be opened
if (std::ifstream inputFileStream(inputFileName); inputFileStream) {
// In this variable we will store all lines from the CSV file, including the split-up columns
std::vector<std::vector<std::string>> data{};
// Now read all lines of the CSV file and split it into tokens
for (std::string line{}; std::getline(inputFileStream, line); ) {
// Split line into tokens and add to our resulting data vector
data.emplace_back(std::vector<std::string>(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}));
}
std::for_each(data.begin(), data.end(), [](std::vector<std::string>& vs) {
std::transform(vs.begin(), vs.end(), vs.begin(), addQuotes);
});
// Output, to file
printData(data, outputFileStream);
// And to the screen
printData(data, std::cout);
}
else {
std::cerr << "\n*** Error: could not open input file '" << inputFileName << "'\n";
}
}
else {
std::cerr << "\n*** Error: could not open output file '" << outputFileName << "'\n";
}
return 0;
}
So, let's have a look. We have three functions:
main: reads the csv file, splits it into tokens, converts them, and writes the result
addQuotes: adds quotes where necessary
printData: prints the converted data to an output stream
Let's start with main. main first opens the output file and then the input file.
The input file contains structured data of the kind usually called csv (comma separated values). Here, however, the delimiter is not a comma but a pipe symbol.
The result is typically stored in a 2d-vector: one dimension holds the rows, the other the columns.
So, what do we need to do next? First we need to read all complete text lines from the source stream. This can easily be done with a one-liner:
for (std::string line{}; std::getline(inputFileStream, line); ) {
As you can see here, the for statement has a declaration/initialization part, then a condition, and then a statement that is carried out at the end of each iteration. This is well known.
We first define a variable "line" of type std::string and use the default initializer to create an empty string. Then we use std::getline to read a complete line from the stream into our variable. std::getline returns a reference to the stream, and the stream has an overloaded bool operator that evaluates to false once a read has failed (for example at end of file). So the for loop does not need an additional check for the end of file. And we do not use the last statement of the for loop, because reading a line advances the file position automatically.
This gives us a very simple for loop for reading a complete file line by line.
Please note: defining the variable "line" in the for statement scopes it to the loop, meaning it is only visible inside the loop. This is generally a good way to avoid polluting the outer scope.
OK, now the next line:
data.emplace_back(std::vector<std::string>(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}));
Uh oh, what is that?
OK, let's go step by step. First, we obviously want to add something to our 2-dimensional data vector. We use the std::vector member function emplace_back. We could also have used push_back; either way, because we pass a temporary, the new row is moved into our 2-dimensional data vector and not copied.
And what do we want to add? We want to add a complete row, so a vector of columns. In our case a std::vector<std::string>. And because we want to construct this vector directly from the tokens, we call the vector's range constructor. The range constructor takes 2 iterators, a begin and an end iterator, as parameters, and copies all values in the range denoted by the iterators into the vector.
So, we expect a begin and an end iterator. And what do we see here:
The begin iterator is: std::sregex_token_iterator(line.begin(), line.end(), re, -1)
And the end iterator is simply {}
But what is this thing, the sregex_token_iterator?
This is an iterator that iterates over patterns in a line. The pattern is given by a regex. The C++ regex library is very powerful, so learning it takes a while, and I cannot cover it here. But let us describe its basic functionality for our purpose: you describe a pattern in a kind of meta language, and the
std::sregex_token_iterator looks for that pattern in the line. In our case the pattern is very simple: the pipe symbol. It is written as R"(\|)" because | has a special meaning in a regex and must be escaped. The last argument, -1, tells the iterator to return not the matches themselves but the text between them, which is exactly the column contents.
Now to the {} as the end iterator. You may have read that {} performs default construction/initialization. A default-constructed std::sregex_token_iterator is the end-of-sequence iterator. So, exactly what we need.
After we have read all data, we will transform the single strings, to the required output. This will be done with std::transform and the function addQuotes.
The strategy here is to first double every quote character ("), which is how a quote is escaped in CSV.
Then we check whether there is any comma or quote in the string, and if so, we additionally enclose the whole string in quotes.
And last, but not least, we have a simple output function and print the converted data into a file and on the screen.
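The two steps described above, splitting on the pipe with the token iterator and quoting each field, can also be tried on a single line in isolation. The following is a minimal sketch under that assumption; the names quote_csv_field and pipe_line_to_csv are made up for illustration, and the quoting is done with a plain loop instead of std::regex_replace.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: doubles every '"' and wraps the field in quotes
// if it contains a quote or a comma.
std::string quote_csv_field(const std::string& field)
{
    bool needs_quotes = false;
    std::string escaped;
    for (char c : field) {
        if (c == '"') {
            escaped += "\"\"";            // "" is the CSV escape for "
            needs_quotes = true;
        } else {
            if (c == ',')
                needs_quotes = true;
            escaped += c;
        }
    }
    return needs_quotes ? "\"" + escaped + "\"" : escaped;
}

// Hypothetical helper: converts one pipe-delimited line to a CSV line.
std::string pipe_line_to_csv(const std::string& line)
{
    static const std::regex pipe{R"(\|)"};
    std::string out;
    bool first = true;
    // Submatch index -1 yields the text between the delimiters.
    for (std::sregex_token_iterator it(line.begin(), line.end(), pipe, -1), end;
         it != end; ++it) {
        if (!first)
            out += ',';
        out += quote_csv_field(it->str());
        first = false;
    }
    return out;
}
```

For example, the line a||b,c|say "hi" becomes a,,"b,c","say ""hi""". Unlike the expected output in the question, this sketch does not append a trailing comma to each line.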

Split English text into sentences (multiple lines)

I am wondering about an efficient way to split text into sentences.
Sentences are split by a dot + space
Example text
The quick brown fox jumps
over the lazy dog. I love eating toasted cheese and tuna sandwiches.
My algorithm works like this
Read first line from text file to string
Find what is needed
Write to file
However, sometimes half of a sentence can be on the next line.
So I was wondering what the best way to tackle this problem is.
Yes, I tried googling "search across multiple lines", and I don't want to use regex.
Initially my idea is to check whether the first line ends with a dot + space and, if not, grab another line and search through it. But I have a feeling I am missing something.
EDIT: Sorry forgot to mention that I am doing this in C++
You can use something like an accumulator.
1. Read a line
2. Check the last symbols in this line.
3. If the last symbols are dot or dot+space
3.1 Split the line and write all strings to output
3.2 GOTO 1
ELSE
3.3 Split the line, write all but the last string to output
3.4 Keep the last piece in some variable and append the next line read to it.
Hope my idea is clear.
Here is my approach for this problem
#include <iostream>
void to_sentences()
{
    // Do not skip whitespace
    std::cin >> std::noskipws;
    char c;
    // Loop until there is no input
    while (std::cin >> c) {
        // Skip new lines
        if (c == '\n')
            continue;
        // Output the character
        std::cout << c;
        // Check if there is a dot followed by a space;
        // if so, consume the space and start a new line.
        // peek() looks at the next character without removing it,
        // so a character that is not a space is not lost.
        if (c == '.' && std::cin.peek() == ' ') {
            std::cin.get();
            std::cout << std::endl;
        }
    }
    // Reset skip whitespace
    std::cin >> std::skipws;
}
You can read the comments and ask if there is something unclear.
You can use std::getline() with a custom delimiter, '.':
#include <sstream>
#include <string>
#include <vector>
auto split_to_sentences(std::string inp)
{
std::istringstream ss(inp); // make a stream using the string
std::vector< std::string > sentences; // return value
while(true) {
std::string this_sentence;
std::getline(ss, this_sentence, '.');
if (this_sentence != "")
sentences.push_back(std::move(this_sentence));
else
return sentences;
}
}
Note that if you have the input text as a stream, then you can skip the std::stringstream step, and give the stream directly to std::getline, in the place of ss.
The use of std::move is not necessary, but might increase performance, by preventing a copy and a deletion of the dynamic parts (on heap) of std::string.

What's the correct way to read a text file in C++?

I need to make a program in C++ that must read and write text files line by line with a specific format, but the problem is that on my PC I work in Windows, and in college they have Linux, and I am having problems because line endings are different in these OSes.
I am new to C++ and don't know how I could make my program able to read the files no matter whether they were written in Linux or Windows. Can anybody give me some hints? Thanks!
The input is like this:
James White 34 45.5 10 black
Miguel Chavez 29 48.7 9 red
David McGuire 31 45.8 10 blue
Each line being a record of a struct of 6 variables.
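Using the std::getline overload without the last (i.e. delimiter) parameter takes care of the end-of-line conversion automatically when the file uses the platform's own convention and is opened in text mode (a file written on Windows and read on Linux will still leave a trailing '\r' on each line):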
std::ifstream in("TheFile.txt");
std::string line;
while (std::getline(in, line)) {
// Do something with 'line'.
}
Here's a simple way to strip a string of an extra "\r":
std::ifstream in("TheFile.txt");
std::string line;
std::getline(in, line);
// check for an empty line first, otherwise line.size() - 1 is out of range
if (!line.empty() && line[line.size() - 1] == '\r')
    line.resize(line.size() - 1);
If you can already read the files, just check for all of the newline characters like "\n" and "\r". I'm pretty sure that linux uses "\r\n" as the newline character.
You can read this page: http://en.wikipedia.org/wiki/Newline
and here is a list of all the ascii codes including the newline characters:
http://www.asciitable.com/
Edit: Linux uses "\n", Windows uses "\r\n", classic Mac OS (before OS X) used "\r". Thanks to Seth Carnegie
Since the result will be CR LF, I would add something like the following to consume the extras if they exist. So once you have read your record, call this before trying to read the next one.
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
If you know the number of values you are going to read for each record you could simply use the ">>" method. For example:
fstream f("input.txt", std::ios::in);
string tempStr;
double tempVal;
for (number of records) {
// read the first name
f >> tempStr;
// read the last name
f >> tempStr;
// read the number
f >> tempVal;
// and so on.
}
Shouldn't that suffice ?
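As a minimal sketch of the ">>" approach applied to one record of the input shown in the question: the struct Person, its member names, and the function parse_person are assumptions for illustration (the question does not say what the numbers mean). Since operator>> treats '\r' as whitespace, a line that still carries a Windows carriage return parses the same way as a Linux one.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type; the meaning of the numeric fields is a guess.
struct Person {
    std::string first_name;
    std::string last_name;
    int age = 0;
    double score = 0.0;
    int rank = 0;
    std::string color;
};

// Parses one line such as "James White 34 45.5 10 black".
// Returns false if any of the six fields is missing or malformed.
bool parse_person(const std::string& line, Person& p)
{
    std::istringstream fields(line);
    return static_cast<bool>(fields >> p.first_name >> p.last_name >> p.age
                                    >> p.score >> p.rank >> p.color);
}
```

A caller would read each line with std::getline and pass it to parse_person, stopping or reporting an error when it returns false.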
Hi, I will give you the answer in stages. Please go through them in order to understand the code.
Stage 1: Design our program:
Our program based on the requirements should...:
...include a definition of a data type that would hold the data, i.e. our structure of 6 variables.
...provide user interaction, i.e. the user should be able to provide the program with the file name and its location.
...be able to open the chosen file.
...be able to read the file data and write/save them into our structure.
...be able to close the file after the data is read.
...be able to print out the saved data.
Usually you should split your code into functions representing the above.
Stage 2: Create an array of the chosen structure to hold the data
...
#define MAX 10
...
strPersonData sTextData[MAX];
...
Stage 3: Enable user to give in both the file location and its name:
.......
string sFileName;
cout << "Enter a file name: ";
getline(cin,sFileName);
ifstream inFile(sFileName.c_str(),ios::in);
.....
->Note 1 for stage 3. The user should type the path in its normal form, e.g.:
c:\SomeFolder\someTextFile.txt
Two backslashes (\\) are only needed when a path is written as a string literal in the source code, where a single \ starts an escape sequence. Text typed by the user at run time is not processed that way.
->Note 2 for stage 3. We use ifstream, i.e. input file stream, because we want to read data from a file. Before C++11
its constructor expects the file name as a C-style string instead of a C++ string (since C++11 a std::string is accepted as well). For this reason we use:
..sFileName.c_str()..
Stage 4: Read all data of the chosen file:
...
while (inFile >> ...) { //we loop while a complete record can still be read
...
}
...
So finally the code is as follows:
#include <iostream>
#include <fstream>
#include <string>
#define MAX 10
using namespace std;
int main()
{
string sFileName;
struct strPersonData {
char c1stName[25];
char c2ndName[30];
int iAge;
double dSomeData1; //i had no idea what the next 2 numbers represent in your code :D
int iSomeDate2;
char cColor[20]; //i dont remember the lengths of the different colors.. :D
};
strPersonData sTextData[MAX];
cout << "Enter a file name: ";
getline(cin,sFileName);
ifstream inFile(sFileName.c_str(),ios::in);
int i=0;
//loop while there is room in the array and a complete record could be read
//(testing eof() before the read would process one bogus record at the end)
while (i < MAX && inFile >>sTextData[i].c1stName>>sTextData[i].c2ndName>>sTextData[i].iAge
>>sTextData[i].dSomeData1>>sTextData[i].iSomeDate2>>sTextData[i].cColor) {
++i;
}
inFile.close();
cout << "Reading the file finished. See it yourself: \n"<< endl;
for (int j=0;j<i;j++) {
cout<<sTextData[j].c1stName<<"\t"<<sTextData[j].c2ndName
<<"\t"<<sTextData[j].iAge<<"\t"<<sTextData[j].dSomeData1
<<"\t"<<sTextData[j].iSomeDate2<<"\t"<<sTextData[j].cColor<<endl;
}
return 0;
}
I am going to give you some exercises now :D :D
1) In the last loop:
for (int j=0;j<i;j++) {
cout<<sTextData[j].c1stName<<"\t"<<sTextData[j].c2ndName
<<"\t"<<sTextData[j].iAge<<"\t"<<sTextData[j].dSomeData1
<<"\t"<<sTextData[j].iSomeDate2<<"\t"<<sTextData[j].cColor<<endl;}
Why do I use variable i instead of, let's say, MAX?
2) Could you change the program, based on stage 1, into something like:
int main(){
function1()
function2()
...
functionX()
...return 0;
}
I hope i helped...

C++ length of file and vectors

Hi I have a file with some text in it. Is there some easy way to get the number of lines in the file without traversing through the file?
I also need to put the lines of the file into a vector. I am new to C++ but I think vector is like ArrayList in java so I wanted to use a vector and insert things into it. So how would I do it?
Thanks.
There is no way of finding the number of lines in a file without reading it. To read all lines:
1) create a std::vector of std::string
2) open a file for input
3) read a line as a std::string using getline()
4) if the read failed, stop
5) push the line into the vector
6) goto 3
You would need to traverse the file to detect the number of lines (or at least call a library method that traverse the file).
Here is a sample code for parsing text file, assuming that you pass the file name as an argument, by using the getline method:
#include <string>
#include <vector>
#include <fstream>
#include <iostream>
int main(int argc, char* argv[])
{
std::vector<std::string> lines;
std::string line;
// the file name is expected as the first argument
if (argc < 2)
    return 1;
// open the desired file for reading
std::ifstream infile (argv[1], std::ios_base::in);
// read each line individually (watch out for Windows new lines)
while (getline(infile, line, '\n'))
{
// add line to vector
lines.push_back (line);
}
// do anything you like with the vector. Output the size for example:
std::cout << "Read " << lines.size() << " lines.\n";
return 0;
}
Update: The code could fail for many reasons (e.g. file not found, concurrent modifications to file, permission issues, etc). I'm leaving that as an exercise to the user.
1) No way to find number of lines without reading the file.
2) Take a look at getline function from the C++ Standard Library. Something like:
string line;
fstream file;
vector <string> vec;
...
while (getline(file, line)) vec.push_back(line);
Traversing the file is fundamentally required to determine the number of lines, regardless of whether you do it or some library routine does it. New lines are just another character, and the file must be scanned one character at a time in its entirety to count them.
Since you have to read the lines into a vector anyways, you might as well combine the two steps:
// Read lines from input stream in into vector out
// Return the number of lines read
int getlines(std::vector<std::string>& out, std::istream& in = std::cin) {
out.clear(); // remove any data in vector
std::string buffer;
while (std::getline(in, buffer))
out.push_back(buffer);
// return number of lines read
return out.size();
}