Read only given columns when streaming a parquet file in C++ - c++

The Parquet C++ documentation gives this example to stream from a parquet file:
{
std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(
infile,
arrow::io::ReadableFile::Open("test.parquet"));
parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};
std::string article;
float price;
uint32_t quantity;
while ( !stream.eof() )
{
stream >> article >> price >> quantity >> parquet::EndRow;
// ...
}
}
Suppose I want only to read price: how can I do that?
Is there a way to select which columns to read by name or by column number?

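Thanks to @kiner_shah for the comment: one can use the SkipColumns method of parquet::StreamReader to skip the columns before and after price:
while ( !stream.eof() )
{
stream.SkipColumns(1); // skip article
stream >> price;
stream.SkipColumns(1); // skip quantity
stream >> parquet::EndRow;
// ...
}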

Related

Four variable file I/O C++ into hash table for processing

I have a problem with C++ File input.
I have a file with n lines, and each line contains 4 variables. I need to read them into a hash table that I created. My problem is that I can't read the file correctly.
For example here the input file variables:
line id cont uniq-number Band
0 10 B 02020213456 DaftPunk
1 11 A 02030213456 Dazy
and so on..
The main problem is reading each variable on a line, line by line, until EOF.
So for each line I need to read the variables id, cont, uniq-number and band, and while reading put the data into a hash table so I can process it further in C++.
Example
cout << "File" << endl;
int date,id;
string group,line,ch;
datar d;
hasher h1;
ifstream inFile;
inFile.open("file.txt");
while (getline (inFile,line))
{
// "reads" file each variable
inFile >> id >> ch >> date >> group;
//add these variable in line one to first hash line
d.id = id;
d.data = ch;
d.date= date;
d.group = group;
h1.add(d);
//must repeat until file EOF for each line
}
inFile.close();
In this code,
while (getline (inFile,line))
{
// "reads" file each variable
inFile >> id >> ch >> date >> group;
//add these variable in line one to first hash line
d.id = id;
d.data = ch;
d.date= date;
d.group = group;
h1.add(d);
//must repeat until file EOF for each line
}
the getline reads a line.
Then the loop body reads items from the next line.
Instead of that
inFile >> id >> ch >> date >> group;
do e.g.
istringstream linestream( line );
linestream >> id >> ch >> date;
getline( linestream, group );
The final getline is there to handle possible spaces in a name.
This assumes that the name is last on the line, as it currently is, and that skipping every second line was a bug rather than the intent.
It's also a good idea to add failure checking.
If a stream operation fails then the stream enters a failure state, and you can check that via the member function .fail().
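Putting the suggestions above together, here is a minimal sketch of the line-by-line approach with failure checking. The Record struct and the functions parseRecord and readRecords are hypothetical names standing in for the asker's datar and hash table code, and the sketch assumes the file really contains the five columns and the header line shown in the sample.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type standing in for the asker's `datar`.
struct Record {
    int line = 0;
    int id = 0;
    std::string cont;
    std::string uniqNumber;  // kept as text so the leading zero survives
    std::string band;
};

// Parses one line such as "0 10 B 02020213456 DaftPunk".
// Returns false if a field is missing or malformed.
bool parseRecord(const std::string& line, Record& out) {
    std::istringstream linestream(line);
    if (!(linestream >> out.line >> out.id >> out.cont >> out.uniqNumber)) {
        return false;
    }
    linestream >> std::ws;  // skip the blanks in front of the name
    if (!std::getline(linestream, out.band)) {
        return false;  // nothing left on the line, so no name
    }
    return !out.band.empty();
}

// Reads every well-formed line; the header and broken lines are skipped.
std::vector<Record> readRecords(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        Record r;
        if (parseRecord(line, r)) {
            records.push_back(r);
        }
    }
    return records;
}
```

The header line fails on the first integer extraction, so it is dropped without special handling. In the asker's program, the push_back would be replaced by the call that adds the record to the hash table.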

C++ file IO getline not pulling a string

I have a file that I open, and it is laid out like this:
STRING
INT
INT
INT
filename.txt
STRING
INT
INT
INT
filename1.txt
etcetera
I have code that is supposed to read from the file, pull the string, the integers, and the file names. It is able to pull the string and the integers, but it won't pull the file name.
Here is the code:
while( !input.eof() )
{
string name;
int type = UNKNOWN;
int pages;
float ounces;
getline( input, name );
input >> type >> pages >> ounces;
getline(input, reviewFile); //reviewFile is a static string called in the header file
input.ignore(INT_MAX, '\n');
}
It should work if you put the ignore before the getline, to eat the newline after the ounces:
while( !input.eof() )
{
string name;
int type = UNKNOWN;
int pages;
float ounces;
getline( input, name );
input >> type >> pages >> ounces;
input.ignore(INT_MAX, '\n');
getline(input, reviewFile); //reviewFile is a static string called in the header file
}
operator>> skips leading whitespace and stops at the first whitespace character after the value, so the newline that follows ounces is left in the stream.
Therefore, I would write something like this (As Massa wrote in a comment):
input >> type >> pages >> ounces >> ws;
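Also, please note that loops of the form while( !input.eof() ) are error-prone and best avoided: eof() only becomes true after a read has already run into the end of the file, so the body can execute once more with reads that fail. Moreover, you do not check after the reads within the loop whether they succeeded.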
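As a sketch of how the loop could look without the eof() check, the reads themselves can serve as the loop condition. The Item struct and the functions readItem and readItems are hypothetical names; the field names follow the variables in the question, and the five-line record layout is assumed to be exactly as shown there.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; the fields mirror the question's variables.
struct Item {
    std::string name;
    int type = 0;
    int pages = 0;
    float ounces = 0.0f;
    std::string reviewFile;
};

// Reads one five-line record. The result is the stream state after the
// last read, so it is false on malformed input and at the end of the file.
bool readItem(std::istream& input, Item& item) {
    std::getline(input, item.name);
    // std::ws consumes the newline left behind after ounces.
    input >> item.type >> item.pages >> item.ounces >> std::ws;
    std::getline(input, item.reviewFile);
    return static_cast<bool>(input);
}

// Keeps reading until a record can no longer be read completely.
std::vector<Item> readItems(std::istream& input) {
    std::vector<Item> items;
    Item item;
    while (readItem(input, item)) {
        items.push_back(item);
    }
    return items;
}
```

Because a record is stored only after all of its reads succeeded, a truncated final record is not added to the result.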

How to extract specific data from text file containing whitespace and newlines?

I would like to extract and analyze data from a large text file. The data contains floats, integers and words.
The way I thought of doing this is to extract a complete line (up to newline) using std::getline(). Then extract individual data from the line extracted before (extract until whitespace, then repeat).
So far I have this:
int main( )
{
std::ifstream myfile;
myfile.open( "example.txt", std::ios::in );
if( !(myfile.is_open()) )
{ std::cout << "Error Opening File";
std::exit(0); }
std::string firstline;
while( myfile.good() )
{
std::getline( myfile, firstline);
std::cout<< "\n" << firstline <<"\n";
}
myfile.close();
return 0;
}
I have several problems:
1) How do I extract up to a whitespace?
2) What would be the best method of storing the data? There are about 7-9 data types, and the data file is large.
EDIT: An example of the file would be:
Result Time Current Path Requirements
PASS 04:31:05 14.3 Super_Duper_capacitor_413 -39.23
FAIL 04:31:45 13.2 Super_Duper_capacitor_413 -45.23
...
Ultimately I would like to analyze the data, but so far I'm more concerned about proper input/reading.
You can use std::stringstream to parse the data and let it worry about skipping the whitespace. Since each element in the input line appears to require additional processing, just parse them into local variables, and once all post-processing is done store the final results in a data structure.
#include <sstream>
#include <iomanip>
std::stringstream templine(firstline);
std::string passfail;
float floatvalue1;
std::string timestr;
std::string namestr;
float floatvalue2;
// split to two lines for readability
templine >> std::skipws; // no need to worry about whitespaces
templine >> passfail >> timestr >> floatvalue1 >> namestr >> floatvalue2;
If you do not need or want to validate that the data is in the correct format you can parse the lines directly into a data structure.
struct LineData
{
std::string passfail;
float floatvalue1;
int hour;
int minute;
int seconds;
std::string namestr;
float floatvalue2;
};
LineData a;
char sep;
// parse the pass/fail
templine >> a.passfail;
// parse time value
templine >> a.hour >> sep >> a.minute >> sep >> a.seconds;
// parse the rest of the data
templine >> a.floatvalue1 >> a.namestr >> a.floatvalue2;
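The struct-based parsing above can be assembled into a self-contained sketch that also reports whether a line had the expected shape. The function parseLine and the field names current, path and requirement are hypothetical; they are taken from the column headers in the sample file, and the time is assumed to always be written as hh:mm:ss.

```cpp
#include <sstream>
#include <string>

// One row of the sample file: Result Time Current Path Requirements.
struct LineData {
    std::string passfail;
    int hour = 0;
    int minute = 0;
    int seconds = 0;
    float current = 0.0f;
    std::string path;
    float requirement = 0.0f;
};

// Parses a line such as
// "PASS 04:31:05 14.3 Super_Duper_capacitor_413 -39.23".
// Returns false if any field is missing or is not of the expected type.
bool parseLine(const std::string& line, LineData& out) {
    std::istringstream in(line);
    char sep1 = 0;
    char sep2 = 0;
    in >> out.passfail
       >> out.hour >> sep1 >> out.minute >> sep2 >> out.seconds
       >> out.current >> out.path >> out.requirement;
    return !in.fail() && sep1 == ':' && sep2 == ':';
}
```

The header line is rejected automatically, because "Time" cannot be read as an integer.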
For the first question, you can do this:
while( myfile.good() )
{
std::getline( myfile, firstline);
std::cout<< "\n" << firstline <<"\n";
std::stringstream ss(firstline);
std::string word;
while (std::getline(ss,word,' '))
{
std::cout << "Word: " << word << std::endl;
}
}
As for the second question, can you give us more detail about the data types and what you want to do with the data once it is stored?
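One caveat with std::getline(ss, word, ' '): it splits on every single space, so two spaces in a row produce an empty word, and tabs are not treated as separators. If the columns may be aligned with runs of blanks, a sketch based on operator>> avoids this; splitWords is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits a line on runs of whitespace (spaces, tabs), never yielding
// empty words.
std::vector<std::string> splitWords(const std::string& line) {
    std::istringstream ss(line);
    std::vector<std::string> words;
    std::string word;
    while (ss >> word) {
        words.push_back(word);
    }
    return words;
}
```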

Reading in a .txt file word by word to a struct in C++

I am having some trouble with my lab assignment for my CMPT class...
I am trying to read a text file that has two words and a string of numbers per line, and the file can be as long as anyone makes it.
An example is
Xiao Wang 135798642
Lucie Chan 122344566
Rich Morlan 123456789
Amir Khan 975312468
Pierre Guertin 533665789
Marie Tye 987654321
I have to make each line a separate "student", so I was thinking of using struct to do so, but I don't know how to do that as I need the first, last, and ID number to be separate.
struct Student{
string firstName;
string secondName;
string idNumber;
};
All of the tries done to read in each word separately have failed (ended up reading the whole line instead) and I am getting mildly frustrated.
With the help from @Sylence I have managed to read in each line separately. I am still confused about how to split the lines by whitespace though. Is there a split function in ifstream?
Sylence, is 'parts' going to be an array? I saw you had indexes in []'s.
What exactly does the students.add( stud ) do?
My code so far is:
int getFileInfo()
{
Student stdnt;
ifstream stdntFile;
string fileName;
char buffer[256];
cout<<"Please enter the filename of the file";
cin>>fileName;
stdntFile.open(fileName.c_str());
while(!stdntFile.eof())
{
stdntFile.getline(buffer,100);
}
return 0;
}
This is my modified and final version of getFileInfo(), thank you Shahbaz, for the easy and quick way to read in the data.
void getFileInfo()
{
int failed=0;
ifstream fin;
string fileName;
vector<Student> students; // A place to store the list of students
Student s; // A place to store data of one student
cout<<"Please enter the filename of the student grades (ex. filename_1.txt)."<<endl;
do{
if(failed>=1)
cout<<"Please enter a correct filename."<<endl;
cin>>fileName;
fin.open(fileName.c_str());// Open the file
failed++;
}while(!fin.good());
while (fin >> s.firstName >> s.secondName >> s.idNumber)
students.push_back(s);
fin.close();
cout<<students.max_size()<<endl<< students.size()<<endl<<students.capacity()<<endl;
return;
}
What I am confused about now is how to access the data that was read in! I know it was put into a vector, but how do I go about accessing the individual elements of the vector, and how exactly is the data stored in it? If I try to cout an element of the vector, I get an error, I guess because Visual Studio doesn't know what to output.
The other answers are good, but they look a bit complicated. You can do it simply by:
vector<Student> students; // A place to store the list of students
Student s; // A place to store data of one student
ifstream fin("filename"); // Open the file
while (fin >> s.firstName >> s.secondName >> s.idNumber)
students.push_back(s);
Note that when an extraction fails, for example because the end of the file has been reached, the stream object (fin) evaluates to false. Therefore while (fin >> ....) stops when the file is exhausted.
P.S. Don't forget to check if the file is opened or not.
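A sketch of the same loop with the open check added, plus one way to get the data back out of the vector. Writing cout << students[0] does not compile because no operator<< is defined for Student, so the members are printed one by one. The function names readStudents, loadStudents and printStudents are hypothetical.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

struct Student {
    std::string firstName;
    std::string secondName;
    std::string idNumber;
};

// Reads whitespace-separated "first second id" triples until a read fails.
std::vector<Student> readStudents(std::istream& in) {
    std::vector<Student> students;
    Student s;
    while (in >> s.firstName >> s.secondName >> s.idNumber) {
        students.push_back(s);
    }
    return students;
}

// Returns false if the file cannot be opened.
bool loadStudents(const std::string& fileName, std::vector<Student>& students) {
    std::ifstream fin(fileName);
    if (!fin.is_open()) {
        return false;
    }
    students = readStudents(fin);
    return true;
}

// Each element is a whole Student; its fields are reached with the dot
// operator, e.g. students[0].firstName.
void printStudents(const std::vector<Student>& students, std::ostream& out) {
    for (const Student& s : students) {
        out << s.firstName << ' ' << s.secondName << ' ' << s.idNumber << '\n';
    }
}
```

After a successful loadStudents call, printStudents(students, std::cout) lists every student, and students[i].idNumber gives the ID of the i-th one for any i below students.size().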
Define a stream reader for student:
std::istream& operator>>(std::istream& stream, Student& data)
{
std::string line;
std::getline(stream, line);
std::stringstream linestream(line);
linestream >> data.firstName >> data.secondName >> data.idNumber;
return stream;
}
Now you should be able to stream objects from any stream, including a file:
int main()
{
std::ifstream file("data");
Student student1;
file >> student1; // Read 1 student;
// Or Copy a file of students into a vector
std::vector<Student> studentVector;
std::copy(std::istream_iterator<Student>(file),
std::istream_iterator<Student>(),
std::back_inserter(studentVector)
);
}
Simply read a whole line and then split the string at the spaces and assign the values to an object of the struct.
pseudo code:
while( !eof )
line = readline()
parts = line.split( ' ' )
Student stud = new Student()
stud.firstName = parts[0]
stud.secondName = parts[1]
stud.idNumber = parts[2]
students.add( stud )
end while
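The pseudo code above could look roughly like this in C++. There is no split function on ifstream or std::string, so the hypothetical helper split builds one from std::istringstream; parts is a std::vector<std::string>, and students.add( stud ) corresponds to push_back on a vector. Lines that do not have exactly three parts are skipped, which is an assumption about how malformed input should be handled.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

struct Student {
    std::string firstName;
    std::string secondName;
    std::string idNumber;
};

// Splits a line on whitespace into its parts.
std::vector<std::string> split(const std::string& line) {
    std::istringstream in(line);
    return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                    std::istream_iterator<std::string>());
}

// Reads one student per line.
std::vector<Student> readStudents(std::istream& in) {
    std::vector<Student> students;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> parts = split(line);
        if (parts.size() != 3) {
            continue;  // blank or malformed line
        }
        Student stud;
        stud.firstName = parts[0];
        stud.secondName = parts[1];
        stud.idNumber = parts[2];
        students.push_back(stud);
    }
    return students;
}
```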

readout a big txt file

I want to know if there is another way to read a big file like this:
Hans //name
Bachelor // study_path
WS10_11 //semester
22 // age
and not like this:
fout >> name; //string
fout >> study_path; //string
fout >> Semester ; //string
fout >> age; //int
When my file grows to more than 20 lines, do I have to write 20+ of these fout statements?
Is there another way?
You could define a class to hold the data for each person:
class Person {
public:
std::string name;
std::string study_path;
std::string semester;
unsigned int age;
};
Then you can define a stream extraction operator for that class:
std::istream & operator>>(std::istream & stream, Person & person) {
stream >> person.name >> person.study_path >> person.semester >> person.age;
return stream;
}
And then you can just read the entire file like this:
std::ifstream file("datafile.txt");
std::vector<Person> data;
std::copy(std::istream_iterator<Person>(file), std::istream_iterator<Person>(),
std::back_inserter(data));
This will read the entire file and store all the extracted records in a vector. If you know the number of records you will read in advance, you can call data.reserve(number_of_records) before reading the file. Thus, the vector will have enough memory to store all records without reallocation, which might potentially speed up the loading if the file is large.
If you are on Linux, you can mmap() the big file and access its contents as if they were already in memory.
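A minimal sketch of the mmap() approach, assuming a POSIX system. The function countLines is a hypothetical example that only counts newline characters; real parsing code would walk the mapped bytes in the same way.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>

// Maps the file read-only and counts its newline characters.
// Returns -1 if the file cannot be opened or mapped.
long countLines(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;  // mmap rejects a length of zero
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* mapping = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mapping == MAP_FAILED) {
        return -1;
    }
    const char* data = static_cast<const char*>(mapping);
    long lines = 0;
    for (std::size_t i = 0; i < size; ++i) {
        if (data[i] == '\n') {
            ++lines;
        }
    }
    munmap(mapping, size);
    return lines;
}
```

The operating system pages the file in on demand, so nothing is copied into a separate buffer before the loop runs.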