Converting 900 MB .csv into ROOT (CERN) TTree - c++

I am new to programming and ROOT (CERN), so go easy on me. Simply, I want to convert a ~900 MB (11M lines x 10 columns) .csv file into a nicely organized .root TTree. Could someone provide the best way to go about this?
Here is an example line of data with headers (it's 2010 US census block population and population density data):
"Census County Code","Census Tract Code","Census Block Code","County/State","Block Centroid Latitude (degrees)","Block Centroid W Longitude (degrees)","Block Land Area (sq mi)","Block Land Area (sq km)","Block Population","Block Population Density (people/sq km)"
1001,201,1000,Autauga AL,32.469683,-86.480959,0.186343,0.482626154,61,126.3918241
I've pasted what I've written so far below.
I particularly can't figure out this error when running: "C:41:1: error: unknown type name 'UScsvToRoot'".
This may be really really stupid, but how do you read in strings in ROOT (for reading in the County/State name)? Like what is the data type? Do I just have to use char’s? I’m blanking.
#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TNtuple.h"
#include "TSystem.h"
void UScsvToRoot() {
TString dir = gSystem->UnixPathName(__FILE__);
dir.ReplaceAll("UScsvToRoot.C","");
dir.ReplaceAll("/./","/");
ifstream in;
in.open(Form("%sUSPopDens.csv",dir.Data()));
Int_t countyCode,tractCode,blockCode;
// how to import County/State string?
Float_t lat,long,areaMi,areaKm,pop,popDens;
Int_t nlines = 0;
TFile *f = new TFile("USPopDens.root","RECREATE");
TNtuple *ntuple = new TNtuple("ntuple","data from csv file","countyCode:tractCode:blockCode:countyState:lat:long:areaMi:areaKm:pop:popDens");
while (1) {
in >> countyCode >> tractCode >> blockCode >> countyState >> lat >> long >> areaMi >> areaKm >> pop >> popDens;
if (!in.good()) break;
ntuple->Fill(countyCode,tractCode,blockCode,countyState,lat,long,areaMi,areaKm,pop,popDens);
nlines++;
}
in.close();
f->Write();
}`

Ok, so I am going to give this a shot, but a few comments up front:
For questions on root, you should strongly consider going to the root homepage and then to the forum. While stackoverflow is an excellent source of information, specific questions on the root framework are better suited to the root forum.
If you are new to root, you should take a look at the tutorial page; it has many examples on how to use the various features of root.
You should also make use of the root reference guide that has documentation on all root classes.
To your code: if you look at the documentation for the class TNtuple that you are using, you will see that the description plainly says:
A simple tree restricted to a list of float variables only.
so trying to store any string into a TNtuple will not work. You need to use the more general class TTree for that.
To read your file and store the information in a tree you have two options:
either you manually define the branches and then fill the tree as you loop over the file:
#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TTree.h"
#include "TSystem.h"

void UScsvToRoot() {
   TString dir = gSystem->UnixPathName(__FILE__);
   dir.ReplaceAll("UScsvToRoot.C","");
   dir.ReplaceAll("/./","/");

   ifstream in;
   in.open(Form("%sUSPopDens.csv",dir.Data()));

   Int_t countyCode,tractCode,blockCode;
   char countyState[1024];
   Float_t lat,lon,areaMi,areaKm,pop,popDens;
   Int_t nlines = 0;

   TFile *f = new TFile("USPopDens.root","RECREATE");
   TTree *tree = new TTree("ntuple","data from csv file");
   tree->Branch("countyCode",&countyCode,"countyCode/I");
   tree->Branch("tractCode",&tractCode,"tractCode/I");
   tree->Branch("blockCode",&blockCode,"blockCode/I");
   tree->Branch("countyState",countyState,"countyState/C");
   tree->Branch("lat",&lat,"lat/F");
   tree->Branch("lon",&lon,"lon/F");
   tree->Branch("areaMi",&areaMi,"areaMi/F");
   tree->Branch("areaKm",&areaKm,"areaKm/F");
   tree->Branch("pop",&pop,"pop/F");
   tree->Branch("popDens",&popDens,"popDens/F");

   // N.B. operator>> splits on whitespace, so for a comma-separated file you
   // still need to handle the commas (and the space inside the County/State
   // field), e.g. by reading whole lines and tokenizing them first.
   while (1) {
      in >> countyCode >> tractCode >> blockCode >> countyState >> lat >> lon >> areaMi >> areaKm >> pop >> popDens;
      if (!in.good()) break;
      tree->Fill();
      nlines++;
   }
   in.close();
   f->Write();
}
The command TTree::Branch basically tells root
the name of your branch
the address of the variable from which root will read the information
the format of the branch
The TBranch that contains the string information is of type C which if you look at the TTree documentation means
C : a character string terminated by the 0 character
N.B. I gave the character array a fixed size of 1024; you should check yourself what size is appropriate for your data.
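For completeness, a minimal read-back sketch (assuming the file and tree created above): the /C branch is attached to a plain char buffer in the same way when reading:
TFile *f = TFile::Open("USPopDens.root");
TTree *tree = (TTree*)f->Get("ntuple");
char countyState[1024]; // same size as when writing
tree->SetBranchAddress("countyState", countyState);
for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
   tree->GetEntry(i); // countyState now holds the string for entry i
}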
The other possibility is to do away with the ifstream entirely and simply make use of the ReadFile method of TTree, which you would employ like this:
#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TTree.h"
#include "TSystem.h"
void UScsvToRoot() {
TString dir = gSystem->UnixPathName(__FILE__);
dir.ReplaceAll("UScsvToRoot.C","");
dir.ReplaceAll("/./","/");
TFile *f = new TFile("USPopDens.root","RECREATE");
TTree *tree = new TTree("ntuple","data from csv file");
tree->ReadFile("USPopDens.csv","countyCode/I:tractCode/I:blockCode/I:countyState/C:lat/F:lon/F:areaMi/F:areaKm/F:pop/F:popDens/F",',');
f->Write();
}
You can read the section on TTrees in the root users guide for more information; among many other things it also has an example using TTree::ReadFile.
Let me know if this helps

I think you might be better off just using root_pandas. In the comprehensive answer by @Erik you still end up specifying the variables of interest by hand (countyCode/I, …). That has its advantages (to name a generic one: you know what you'll get, and you get an error message in case an expected variable is missing). But on the other hand it gives you the chance of introducing typos, and if you read multiple csv files you won't notice if any of them have more variables; ultimately, copying variable names and determining variable types is something a computer should be very good at.
In root_pandas your code would be something like:
import pandas
import root_pandas  # importing root_pandas attaches a to_root method to DataFrame

df = pandas.read_csv("USPopDens.csv")
df.to_root("USPopDens.root")

I'd like to highlight one detail from Erik's answer: the fact that the TFile is created BEFORE the TTree has implications for the size of the root file resulting from the program. I was dealing with a similar problem (the need to read a CSV file of ~1 GB into a root tree and save it to a file), but I created the TTree first and then the TFile to store the tree. The resulting root file was a factor ~10 larger than when creating first the TFile and then the TTree.
The reason for this behavior is the difference in the compression of the branches in the TTree. Basically, no compression is applied while the tree lives only in memory, whereas a higher compression ratio is achieved when the tree is written directly to a file on disk.
ref: https://root-forum.cern.ch/t/ttree-compression-factor-1-00/31850/11
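For reference, a minimal sketch of the ordering that yields the compressed file (same classes as in the answers above):
TFile *f = new TFile("USPopDens.root","RECREATE"); // create the file first...
TTree *tree = new TTree("ntuple","data from csv file"); // ...so the tree is attached to it
// ... define branches and fill as shown above ...
f->Write(); // branch baskets are compressed as they are flushed to disk
f->Close();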

Related

C++: How to read a lot of data from formatted text files into program?

I'm writing a CFD solver for specific fluid problems. So far the mesh is generated every time the simulation runs, and whenever the geometry or the fluid properties change, the program needs to be recompiled.
For small problems with a low number of cells it works just fine. But for cases with over 1 million cells, where fluid properties need to be changed very often, it is quite inefficient.
Obviously, we need to store simulation setup data in a config file, and geometry information in a formatted mesh file.
Simulation.config file
% Dimension: 2D or 3D
N_Dimension= 2
% Number of fluid phases
N_Phases= 1
% Fluid density (kg/m3)
Density_Phase1= 1000.0
Density_Phase2= 1.0
% Kinematic viscosity (m^2/s)
Viscosity_Phase1= 1e-6
Viscosity_Phase2= 1.48e-05
...
Geometry.mesh file
% Dimension: 2D or 3D
N_Dimension= 2
% Points (index: x, y, z)
N_Points= 100
x0 y0
x1 y1
...
x99 y99
% Faces (Lines in 2D: P1->p2)
N_Faces= 55
0 2
3 4
...
% Cells (polygons in 2D: Cell-Type and Points clock-wise). 6: triangle; 9: quad
N_Cells= 20
9 0 1 6 20
9 1 3 4 7
...
% Boundary Faces (index)
Left_Faces= 4
0
1
2
3
Bottom_Faces= 6
7
8
9
10
11
12
...
It's easy to write config and mesh information to formatted text files. The problem is: how do we read these data efficiently into the program? I wonder if there is any easy-to-use C++ library to do this job.
Well, well.
You can implement your own API based on a finite-element collection, a dictionary, some regexes and, after all that, apply best practices according to some international standard.
Or you can take a look at these:
GMSH_IO
OpenMesh
I just used OpenMesh in my last implementation for a C++ OpenGL project.
As a first-iteration solution to just get something tolerable - take @JosmarBarbosa's suggestion and use an established format for your kind of data - which also probably has free, open-source libraries for you to use. One example is OpenMesh, developed at RWTH Aachen. It supports:
Representation of arbitrary polygonal (the general case) and pure triangle meshes (providing more efficient, specialized algorithms)
Explicit representation of vertices, halfedges, edges and faces.
Fast neighborhood access, especially the one-ring neighborhood (see below).
[Customization]
But if you really need to speed up your mesh data reading, consider doing the following:
Separate the limited-size meta-data from the larger, unlimited-size mesh data;
Place the limited-size meta-data in a separate file and read it whichever way you like, it doesn't matter.
Arrange the mesh data as several arrays of fixed-size elements or fixed-size structures (e.g. cells, faces, points, etc.).
Store each of the fixed-width arrays of mesh data in its own file - without streaming individual values anywhere: just read or write the whole array as-is, directly. (A sketch of what such a read could look like follows this list.) You'll know the appropriate size of the read either by looking at the file size or at the metadata.
Finally, you could avoid explicit reading altogether and use memory-mapping for each of the data files. See
fastest technique to read a file into memory?
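For illustration, the raw array read mentioned above could look like this (a sketch assuming a hypothetical fixed-size Point2D record and that the point count is known from the metadata):
#include <cstddef>
#include <fstream>
#include <vector>

struct Point2D { double x, y; }; // fixed-size record (hypothetical layout)

std::vector<Point2D> read_points(const char* path, std::size_t n_points) {
    std::vector<Point2D> points(n_points);
    std::ifstream in(path, std::ios::binary);
    // read the whole array in one call - no per-value streaming
    in.read(reinterpret_cast<char*>(points.data()), n_points * sizeof(Point2D));
    return points;
}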
Notes/caveats:
If you write and read binary data on systems with different memory layout of certain values (e.g. little-endian vs big-endian) - you'll need to shuffle the bytes around in memory. See also this SO question about endianness.
It might not be worth it to optimize the reading speed as much as possible. You should consider Amdahl's law, and only optimize it to a point where it's no longer a significant fraction of your overall execution time. It's better to lose a few percentage points of execution time, but get human-readable data files which can be used with other tools supporting an established format.
In the following answer I assume:
That if the first character of a line is % then it shall be ignored as a comment.
Any other line is structured exactly as follows: identifier= value.
The code I present will parse a config file following the mentioned assumptions correctly. This is the code (I hope that all needed explanation is in comments):
#include <fstream>        // required for file IO
#include <iostream>       // required for console IO
#include <string>
#include <unordered_map>  // hashtable to store the identifiers

int main()
{
    std::unordered_map<std::string, double> identifiers;
    std::string configPath;
    std::cout << "Enter config path: ";
    std::cin >> configPath;

    std::ifstream config(configPath); // open the specified file
    if (!config.is_open()) // error if failed to open file
    {
        std::cerr << "Cannot open config file!";
        return -1;
    }

    std::string line;
    while (std::getline(config, line)) // read each line of the file
    {
        if (line.empty() || line[0] == '%') // skip blank lines and comments
            continue;
        std::size_t identifierLength = 0;
        while (line[identifierLength] != '=')
            ++identifierLength;
        identifiers.emplace(
            line.substr(0, identifierLength),
            std::stod(line.substr(identifierLength + 1)) // stod skips leading spaces
        ); // add entry to identifiers
    }

    for (const auto& entry : identifiers)
        std::cout << entry.first << " = " << entry.second << '\n';
}
After reading the identifiers you can, of course, do whatever you need to do with them. I just print them as an example to show how to fetch them. For more information about std::unordered_map look here. For a lot of very good information about making parsers have a look here instead.
If you want to make your program process input faster insert the following line at the beginning of main: std::ios_base::sync_with_stdio(false). This will desynchronize C++ IO with C IO and, in result, make it faster.
Assuming:
you don't want to use an existing format for meshes
you don't want to use a generic text format (json, yml, ...)
you don't want a binary format (even though you want something efficient)
In a nutshell, you really need your own text format.
You can use any parser generator to get started. While you could probably parse your config file as it is using only regexps, they can become really limiting in the long run. So I'll suggest a context-free grammar parser, generated with Boost spirit::x3.
AST
The Abstract Syntax Tree will hold the final result of the parser.
#include <string>
#include <utility>
#include <vector>
#include <variant>

namespace AST {
    using Identifier = std::string;                 // Variable name.
    using Value      = std::variant<int,double>;    // Variable value.
    using Assignment = std::pair<Identifier,Value>; // Identifier = Value.
    using Root       = std::vector<Assignment>;     // Whole file: all assignments.
}
Parser
Grammar description:
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/home/x3.hpp>

namespace x3 = boost::spirit::x3;

namespace Parser {
    using namespace x3;

    // Line: Identifier = value
    const x3::rule<class assignment, AST::Assignment> assignment = "assignment";
    // Line: comment
    const x3::rule<class comment> comment = "comment";
    // Variable name
    const x3::rule<class identifier, AST::Identifier> identifier = "identifier";
    // File
    const x3::rule<class root, AST::Root> root = "root";
    // Any valid value in the config file
    const x3::rule<class value, AST::Value> value = "value";

    // Semantic action
    auto emplace_back = [](const auto& ctx) {
        x3::_val(ctx).emplace_back(x3::_attr(ctx));
    };

    // Grammar
    const auto assignment_def = skip(blank)[identifier >> '=' >> value];
    const auto comment_def    = '%' >> omit[*(char_ - eol)];
    const auto identifier_def = lexeme[alpha >> +(alnum | char_('_'))];
    const auto root_def       = *((comment | assignment[emplace_back]) >> eol) >> omit[*blank];
    const auto value_def      = double_ | int_;

    BOOST_SPIRIT_DEFINE(root, assignment, comment, identifier, value);
}
Usage
// Takes iterators on string/stream...
// Returns the AST of the input.
template<typename IteratorType>
AST::Root parse(IteratorType& begin, const IteratorType& end) {
    AST::Root result;
    bool parsed = x3::parse(begin, end, Parser::root, result);
    if (!parsed || begin != end) {
        throw std::domain_error("Parser received an invalid input.");
    }
    return result;
}
Evolutions
To change where blank spaces are allowed, add/move x3::skip(blank) in the xxxx_def expressions.
Currently the file must end with a newline. Rewriting the root_def expression can fix that.
You'll certainly want to know why the parsing failed on invalid inputs. See the error handling tutorial for that.
You're just a few rules away from parsing more complicated things:
// 100 X_n Y_n
const auto point_def = lit("N_Points") >> '=' >> int_ >> eol >> *(double_ >> double_ >> eol);
If you don't need a specific text file format, but have a lot of data and care about performance, I recommend using an existing data serialization framework instead.
E.g. Google protocol buffers allow efficient serialization and deserialization with very little code. The file is binary, so it is typically much smaller than a text file, and binary serialization is much faster than parsing text. It also supports structured data (arrays, nested structs), data versioning, and other goodies.
https://developers.google.com/protocol-buffers/
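As a rough sketch of what that could look like (the mesh.proto message below and the generated Mesh class are assumptions for illustration; protoc generates the actual C++ class):
// Hypothetical schema in mesh.proto:
//   message Mesh {
//     int32 n_dimension = 1;
//     repeated double points = 2; // x0, y0, x1, y1, ...
//   }
#include <fstream>
#include "mesh.pb.h" // header generated by protoc (assumption)

void save(const Mesh& mesh) {
    std::ofstream out("geometry.bin", std::ios::binary);
    mesh.SerializeToOstream(&out); // compact binary write
}

bool load(Mesh& mesh) {
    std::ifstream in("geometry.bin", std::ios::binary);
    return mesh.ParseFromIstream(&in); // fast binary read
}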

parse huge csv file with C++

In order to simulate my network I am using a trace file (csv file) with a size between 5 and 30 GB.
The csv file is row-based, where each row contains multiple fields delimited by a space, together forming the information of a network packet:
3 53 4 12 1 1 2 6
Since the file's size could reach several GBs (millions of lines), is it better to divide it into small chunks (myfile00.csv, myfile01.csv, ...), or can I process the entire file from the hard drive without loading it into memory?
I want to read the file line by line at a specific time, which is the clock cycle of the simulation, and get all the information in the line to create an OMNeT++ message.
packet MyTrace::getpacket() {
    int id;                   // first field
    int cycle;                // second field
    int source;               // third field
    int destination;          // fourth field
    int numberofDep;          // fifth field
    std::list<int> listofDep; // remaining fields
    if (!traceFile.is_open()) {
        // get id
        // get cycle
        // ....
    }
Any suggestion would be helpful.
EDIT:
string line;
ifstream myfile ("BlackSmall.csv");
int currentline = 0;
if (myfile.is_open())
{
    while (getline(myfile, line)) {
        istringstream ss(line);
        string request;
        int id, cycle, source, dest, srcType, destType, packetSize, dependency;
        int listdep;
        std::list<int> dep;
        ss >> id;
        ss >> cycle;
        ss >> source;
        ss >> dest;
        ss >> request;
        ss >> srcType;
        ss >> destType;
        ss >> packetSize;
        ss >> dependency;
        while (ss >> listdep) dep.push_back(listdep);
        // Create my packet
    }
    myfile.close();
}
else cout << "Unable to open file";
With the above code, I can get all information that I need from a line.
The problem is that I need to use this code inside a class which, when I call it, returns just one line's information. Is there a way to point to a specific line when I call this class?
It seems like your application requires a single sequential pass through the input, so processing a file that is 1 GB or 100 GB is perhaps just a matter of patience and perhaps parallelism.
The approach should be to translate records line by line. You should avoid strategies that attempt to read the entire file into memory. The standard library offers the easy-to-use std::ifstream class together with std::getline, which fills a std::string with the line to be converted.
If you are feeling more ambitious and want to control the amount of data read or buffered more carefully then you would not be the first developer to roll-your-own code to implement a buffered reader. This is a fairly empowering exercise and will help you think through some corner cases with reading partial lines and such. But in the end, it probably will not give you a significant boost toward your goal. I suspect the ifstream approach will get you up and running without the hassle and will not ultimately be the bottleneck in processing these files.
If you were really concerned about optimizing execution time then having multiple files might help you launch parallel processing tasks.
#include <fstream>
#include <string>

// define a class to hold your custom record
class Record {
};

// create a parser function to convert a line of text into the record
bool parse(std::string const &line, Record &record) {
    return true; // fill in the field-by-field conversion here
}

// create a translator method to convert a record into the desired output
bool write(Record const &record, std::ofstream &os) {
    return true; // fill in the formatting of the output here
}

int main() {
    // actually open the input and output streams (file names are examples)
    std::ifstream is("input.csv");
    std::ofstream os("output.txt");
    std::string line;
    while (std::getline(is,line)) {
        Record record;
        if (!parse(line,record)) break;
        if (!write(record,os)) break;
    }
}
You can re-use the Record instance by moving it outside the while loop, so long as you are careful to reset the variable so that information from preceding records does not taint the current record. You can also dive head first into the C++ ecosystem by writing stream input and output operators ("<<", ">>"), but I personally find this approach to be more confusing than it is worth.
Perhaps the best approach for you would be to import your CSV file into an SQLite database.
Once you import it and add some indexes, you can easily and very efficiently query the necessary rows from that database. SQLite has lots of ready-to-use C/C++ client libraries available; you can start with the default one at https://www.sqlite.org/cintro.html.
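For illustration, a minimal sketch of querying such a database with the SQLite C API (the table name and schema are made up; the CSV could be imported beforehand, e.g. with the sqlite3 shell's .import command):
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("trace.db", &db) != SQLITE_OK) return 1;

    // Assumed schema: CREATE TABLE packets(id INT, cycle INT, src INT, dest INT, ...)
    sqlite3_stmt* stmt = nullptr;
    const char* sql = "SELECT id, src, dest FROM packets WHERE cycle = ?;";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return 1;

    sqlite3_bind_int(stmt, 1, 42); // fetch all packets for clock cycle 42
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("packet %d: %d -> %d\n",
                    sqlite3_column_int(stmt, 0),
                    sqlite3_column_int(stmt, 1),
                    sqlite3_column_int(stmt, 2));
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
With an index on the cycle column (CREATE INDEX idx_cycle ON packets(cycle)), fetching the rows for one clock cycle stays fast even for multi-GB traces.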

Save the variables of an object to then be able to initialise another object with those variables

What I am trying to achieve is this:
Let's say I have a class Score. This class has an int variable and a char* variable.
Now when I have an object Score score, I would like to be able to save the value of those variables (I guess to a file). So now this file has an int variable and a char* variable that I can then access later to create a new Score object.
So I create Score score(10, "Bert");. I either do something like score.SaveScore(); or the score gets saved when the game is over or the program exits, it doesn't matter.
Basically I am looking for the equivalent/correct way of doing this:
score.SaveScore(FILE file)
{
    file.var1 = score.score;
    file.var2 = score.name;
}
I realize this is probably very stupid and not done this way whatsoever! This is just me trying to explain what I am trying to achieve in the simplest way possible.
Anyway, when I run the program again, that original Score score(10, "Bert") does not exist any more. But I would like to be able to access the saved score(from file or wherever it may be) and create another Score object.
So it may look something like:
LoadScore(FILE file)
{
    Score newScore(file.var1, file.var2);
}
Again, just trying to show what I am trying to achieve.
The reason why I want to be able to access the variables again is to eventually have a Scoreboard, the Scoreboard would load a bunch of scores from the file.
Then when a new score is created, it is added to the scoreboard, compared to the other scores currently in the scoreboard and inserted in the right position (like a score of 6 would go in between 9 and 4).
I feel like this was a bit long winded but I was trying to really explain myself well! Which I hope I did!
Anyway, I am not looking for someone to tell me how to do all of that.
All I am after is how to do the initial save to a file.
Thank you for any suggestions.
I would use the <fstream> library, like this:
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    // example values
    int x = 10;
    float y = 10.5;
    const char* chars = "some random value";
    string str(chars); // make a string buffer for sizing
    str.resize(20);    // make sure it's a fixed size

    // open a test.txt file, in the same dir, for binary output
    ofstream os("test.txt", ios::out | ios::binary);
    // (char*) cast the address, sizeof(type) for the length
    os.write((char*)&x, sizeof(int));   // only sizeof(int) bytes, starting at &x
    os.write((char*)&y, sizeof(float)); // cast as a char pointer
    os.write(str.data(), str.size());   // write the str data, fixed at 20 chars
    os.close();
    // the file test.txt will now have binary data in it

    // to read it back in, just use ifstream and put that info in new containers
    int in_x = 0; // new containers set to 0 for debugging
    float in_y = 0;
    char inchar[20]; // buffer to read 20 chars into
    ifstream is("test.txt", ios::in | ios::binary); // read in binary
    is.read((char*)&in_x, sizeof(int)); // read into the new containers
    is.read((char*)&in_y, sizeof(float));
    is.read(inchar, 20); // read the chars, assuming size 20
    is.close();

    // outputting shows the values were correctly read into the new containers
    cout << in_x << endl;
    cout << in_y << endl;
    cout << inchar << endl;
}
I realize this is probably very stupid and not done this way whatsoever!
The entire software industry was stupid enough to have done it so many times that a special term was even invented for this operation - serialization - and nearly all C++ frameworks and libraries have implemented it in various ways.
Since the question is tagged with C++ I would suggest you look at boost serialization, but there are many other implementations.
Do you need that file to be readable by a human? If yes, then consider, for example, the XML or JSON formats.
You don't need it to be readable but want it to be as compact as possible? Consider google protobuf.
Just start doing it and come back with more specific question(s).
As was mentioned before, keep strings as std::string objects rather than char*.
About writing/reading to/from files in C++, read about fstream.
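To make the boost serialization suggestion concrete, a minimal sketch for the Score example from the question (member names are assumed; note the char* became a std::string as recommended above):
#include <fstream>
#include <string>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/string.hpp>

class Score {
    friend class boost::serialization::access;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & score & name; // the same function handles both saving and loading
    }
public:
    Score() = default;
    Score(int s, std::string n) : score(s), name(std::move(n)) {}
    int score = 0;
    std::string name;
};

int main() {
    { // save
        std::ofstream os("score.txt");
        boost::archive::text_oarchive oa(os);
        const Score s(10, "Bert");
        oa << s;
    }
    { // load into a fresh object
        std::ifstream is("score.txt");
        boost::archive::text_iarchive ia(is);
        Score s;
        ia >> s;
    }
}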

Reading a CSV to vectors in objects

I'm trying to write code that will, on a line-by-line basis, pass numerical data from a CSV to an object's vector. The object's structure is as follows: the object itself (let's call it CS) is an enclosed space, within which resides a vector of objects (called Points) which each have a vector of objects (Features) with 3 variables. The first two variables in these Features are descriptors of the feature and the third is the actual value taken by a specific Point[i].Feature[j]. Each point has the same set of Features, and aside from third value being different, the descriptors are likewise identical. (edit: Sadly I can't change this structure as it's part of a larger framework which is out of my hands)
Currently, my CSV has one column per feature, the first two rows being the descriptors which apply for all points, and the rest of the rows being each individual point's third feature value. It's been a while since my introductory C++ course and I'm finding it hard to think of a fast way to implement this, as my CSVs could become fairly large (my current upper limit is 50000 points with 2000 features, and this will probably grow) and I wouldn't want to do something silly like rereading the first two lines for each point. I've looked around, and most CSV solutions involve string CSVs, which I don't have to deal with, and simpler array objects in which the CSV is stored. The problem for me is simply going up a level each time I reach the end of a line and restarting the procedure for the next point, and I can't think of anything. Any tips?
You could just create a temporary array of Descriptor objects which holds the two descriptors for each column, then read in your first row and create your Point objects from that. Afterwards you can just copy the descriptors from the Point one row above, e.g. Point[i-csvWidth], and deallocate the Descriptor array.
I guess I was nearly there, just used the wrong kind of variable to read in.
fstream myFile;
myFile.open(filePath.c_str());
if (!myFile) {
    cout << "File \"" << filePath << "\" doesn't exist, exiting program." << endl;
    exit(EXIT_FAILURE);
}

string line, line2, line3;
Points.clear();

// gets the range row
getline(myFile, line);
istringstream lineStream(line);
// gets the nomin row
getline(myFile, line2);
istringstream lineStream2(line2);
// gets the first person's traits
getline(myFile, line3);
istringstream lineStream3(line3);

CultVec originalCultVec = CultVec(RNG);
int val, val2, val3, val4;
while (lineStream >> val && lineStream2 >> val2 && lineStream3 >> val3) {
    Feature feature;
    feature.Range = (char)val;
    feature.Nomin = (bool)val2;
    feature.Trait = (char)val3;
    originalCultVec.addFeature(feature);
} // while
Points.push_back(originalCultVec);

while (getline(myFile, line)) {
    int i = 0;
    CultVec newVec = CultVec(RNG);
    istringstream lineStream4(line);
    while (lineStream4 >> val4) {
        Feature newFeat = originalCultVec.getFeature(i);
        newFeat.Trait = (char)val4;
        newVec.addFeature(newFeat);
        i++;
    }
    Points.push_back(newVec);
}

having issues reading in files within a folder C++

I am currently working on a project that requires the assessment of mystery text files, cross-referencing them with signatures that are provided to me.
One issue I am facing is that we have not gone over reading in files from a folder within the project's folder. (I'm using Visual Studio 2010.)
I am provided with a simple 'data.txt' file that contains an integer representing the number of signature file names, followed by that many signature paths, then another integer representing the number of mystery texts, followed by that many mystery text paths.
My question is: how does one read in a file from a path given within another text document?
the 'data.txt' file is as follows:
13
signatures/agatha.christi.stats
signatures/alexandre.dumas.stats
signatures/brothers.grim.stats
signatures/charles.dickens.stats
signatures/douglas.adams.stats
signatures/emily.bronte.stats
signatures/fyodor.dostoevsky.stats
signatures/james.joyce.stats
signatures/jane.austen.stats
signatures/lewis.caroll.stats
signatures/mark.twain.stats
signatures/sir.arthur.conan.doyle.stats
signatures/william.shakespeare.stats
5
documents/mystery1.txt
documents/mystery2.txt
documents/mystery3.txt
documents/mystery4.txt
documents/mystery5.txt
One of the signature files is as follows (don't ask why my prof decided to use .stats, because I have no clue):
agatha christie
4.40212537354
0.103719383127
0.0534892315963
1
0.0836888743
1.90662947161
I cannot change the files, nor can I change the area in which they are saved.
I can easily read in the 'data.txt' file but cannot seem to find the signature files at all.
Any help would be appreciated.
Once I read in the signatures, I plan on saving them as structs in an array so I can reference them later in the project to compare them to the signatures of the mystery texts.
This program is using namespace std, if that matters to anyone...
Example of reading doubles:
#include <cstdio> // std::fopen, std::fread

FILE *file = std::fopen(filename.c_str(), "rb");
double value;
std::fread(&value, sizeof(double), 1, file); // read one double into value
Is this what you're looking for?
I assume your directory structure is as follows:
the_program
data.txt
signatures/...
documents/...
Then it should be straightforward to read the files:
std::ifstream in("data.txt");
std::vector<std::string> files;
int num_files;
in >> num_files;
for (int i = 0; i < num_files; ++i) {
    std::string file;
    in >> file;
    files.push_back(file);
}
// read the mystery filenames the same way here

std::vector<std::string>::iterator it;
for (it = files.begin(); it != files.end(); ++it) {
    std::ifstream sig(it->c_str());
    // sig is your signature file. Read it here
}
@ptic12
Your answer helped a lot; I managed to edit/manipulate it to get what I needed out of it.
I created a class for signatures to make it a bit more simple.
This code is written rather simply, and it's longer than it needs to be, but it works; later in the project I plan on 'slimming it down' a little bit.
There are a few things missing from this, of course, but it would be a long post if I included them.
vector<Signature> MakeSignatures(string DataFile)
{
    string SigFile = "", MystFile = "";
    int NumSig = 0, NumMystery = 0;
    ifstream infile(DataFile); // opens data.txt
    infile >> NumSig;
    vector<Signature> SigStorage; // a vector in which to store Signature objects
    for (int i = 0; i < NumSig; i++)
    {
        infile >> SigFile;
        Signature Sig(SigFile);
        SigStorage.push_back(Sig); // push the constructed Signature, not the path
    }
    infile >> NumMystery;
    for (int i = 0; i < NumMystery; i++)
    {
        infile >> MystFile;
        // not quite done here yet
        // large part of project will be called here
    }
    return SigStorage;
}
And the constructor in the Signature class's .cpp:
Signature::Signature(string SigFile)
{
    ifstream in(SigFile);
    getline(in, AuthName); // gets the author name
    in >> AverageWord;     // then the next 5 floats
    in >> TypeToken;
    in >> HapaxLego;
    in >> AverageNumber;
    in >> SentenceCom;
    // no !in.eof() loop needed - the file is read exactly once
}
Hopefully this will help anyone who needs the same sort of help I did.
(It didn't really help that the data.txt included a misspelled path; that took a while to figure out.)