Reading using fstream - C++

I am using fstream to read a binary file, but strangely I get different values for the same input file each time I execute the code.
if(fs->is_open())
{
    while (!fs->eof())
    {
        fs->seekg( pos );
        fs->read( (char *)&mdfHeader, sizeof(mdfHeader_t) );
        pos += mdfHeader.length;
        fs->read( (char *)&eventHeader, sizeof(eventHeader_t) );
        fs->read( (char *)&rawHeader, sizeof(rawHeader_t) );
        fs->read( (char *)&ingressHeader, sizeof(ingressHeader_t) );
        fs->read( (char *)&l1Header_xc0, sizeof(l1Header_xc0_t) );
        fs->read(data, dataLength);
        printf("Data=%#x\n",data);
        std::cout << "counter: " << c << "\n";
        c++;
    }
    fs->close();
}
As you can see, I print out data, which should be the same on every run, but it yields a different value each time. mdfHeader.length is the length of one block of data.

The first things to change are:
The condition eof() is only really useful to determine why reading data failed but it isn't a useful condition for a loop.
You need to check after reading that you successfully read the data you are interested in.
That is, the loop would look something like this:
while (*fs) {
    // read data from fs
    if (*fs) {
        // do something with the data
    }
    else if (!fs->eof()) {
        std::cout << "ERROR: failed to read record\n";
    }
}
I'd also guess that you don't need the seeks, and it is a good idea to get rid of them: seeking is relatively expensive because it discards any buffered data. You didn't show the entire code, but the initial value of pos has a fair chance of providing some level of randomness. Also, you assume that the sequence of bytes you are reading matches how the data is laid out in your computer. Typically, that isn't the case, and you generally need to adjust for the binary format, e.g., to accommodate different word sizes, different endianness, padding, etc.
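As an illustration of the endianness point, here is a minimal sketch of reading a 32-bit length field in a byte-order-independent way. It assumes the field is stored little-endian in the file; the helper read_le32 is a made-up name, not part of the question's code.

#include <cstdint>
#include <istream>

// Hypothetical helper: read a 32-bit little-endian integer from the stream,
// independent of the host machine's byte order.
bool read_le32(std::istream& in, std::uint32_t& value)
{
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof bytes)) {
        return false;                                   // read failed; check in.eof() to find out why
    }
    value =  static_cast<std::uint32_t>(bytes[0])
          | (static_cast<std::uint32_t>(bytes[1]) << 8)
          | (static_cast<std::uint32_t>(bytes[2]) << 16)
          | (static_cast<std::uint32_t>(bytes[3]) << 24);
    return true;
}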

A computer is like mathematics: everything is deterministic (even a function like rand gives the same output if the input and state are the same as before). So if you run a piece of code a hundred times with the same input and the same state, you will certainly get the same output, unless the input or the running state changes.
You say that the input is the same each time you execute the code, so the only thing that changes is the running state (for example, malloc may return two different values on two different runs of the program, because it may be working in a different state, and its state is determined by the OS).
In your code you use printf("Data=%#x\n",data); to output your data, but that actually just prints the address of data as a hex value, so it is entirely natural that this address changes between runs, because the OS maps your executable to different positions or for other reasons. You should output the contents of data, and you will see that they are the same as on the previous run.
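For example, a minimal sketch of dumping the buffer's bytes instead of the pointer (dump_buffer is a made-up helper; data and dataLength mirror the variables in the question):

#include <cstddef>
#include <cstdio>

// Print every byte of the buffer in hex, rather than the buffer's address.
void dump_buffer(const unsigned char *data, std::size_t dataLength)
{
    for (std::size_t i = 0; i < dataLength; ++i) {
        std::printf("%02x ", (unsigned)data[i]);
    }
    std::printf("\n");
}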

Related

how to parse stream data (string) to different data files

Everyone, I have had some problems reading data from an IMU recently.
Below is the data I got from my device. It is ASCII, all chars, and my data size is [122], which is really big. I need to convert the values to short and then to float, but I don't know why and how...
unsigned char data[33];
short x,y,z;
float x_fl,y_fl,z_fl,t_fl;
float bias[3]={0,0,0}; //array initialization
unsigned char sum_data=0;
int batch=0;
if ( !PurgeComm(file,PURGE_RXCLEAR ))
    cout << "Clearing RX Buffer Error" << endl; // these two lines clear the RX buffer
//---------------- read data from IMU ----------------------
do {
    ReadFile(file,&data_check,1,&read,NULL);
    //if ((data_check==0x026))
    { ReadFile(file,&data,33,&read,NULL); }
    /// Wx Values
    {
        x=(data[8]<<8)+data[9];
        x_fl=(float)6.8664e-3*x;
        bias[0]+=(float)x_fl;
    }
    /// Wy Values
    {
        y=(data[10]<<8)+data[11];
        y_fl=(float)6.8664e-3*y;
        bias[1]+=(float)y_fl;
    }
    /// Wz Values
    {
        z=(data[12]<<8)+data[13];
        z_fl=(float)6.8664e-3*z;
        bias[2]+=(float)z_fl;
    }
    batch++;
} while(batch<NUM_BATCH_BIAS);
$VNYMR,+049.320,-017.922,-024.946,+00.2829,-00.2734,+00.2735,-02.961,+03.858,-08.325,-00.001267,+00.000213,-00.001214*64
$VNYMR,+049.322,-017.922,-024.948,+00.2829,-00.2714,+00.2735,-02.958,+03.870,-08.323,+00.004923,-00.000783,+00.000290*65
$VNYMR,+049.321,-017.922,-024.949,+00.2821,-00.2655,+00.2724,-02.984,+03.883,-08.321,+00.000648,-00.000391,-00.000485*61
$VNYMR,+049.320,-017.922,-024.947,+00.2830,-00.2665,+00.2756,-02.983,+03.874,-08.347,-00.003416,+00.000437,+00.000252*6C
$VNYMR,+049.323,-017.921,-024.947,+00.2837,-00.2773,+00.2714,-02.955,+03.880,-08.326,+00.002570,-00.001066,+00.000690*67
$VNYMR,+049.325,-017.922,-024.948,+00.2847,-00.2715,+00.2692,-02.944,+03.875,-08.344,-00.002550,+00.000638,+00.000022*6A
$VNYMR,+049.326,-017.921,-024.945,+00.2848,-00.2666,+00.2713,-02.959,+03.876,-08.309,+00.002084,+00.000449,+00.000667*6A
All I want to do is:
Extract the last 6 numbers separated by commas; by the way, I don't need the last 3 chars (like *66).
Save the extracted data to 6 .dat files.
What is the best way to do this?
I got this raw data from the IMU, and I need the last 6 values, which are the accelerations (x, y, z) and the gyros (x, y, z).
If someone could tell me how to set a counter at the end of each data record, that would be perfect, because I also need the time stamp of the IMU.
Lastly, I am doing the data acquisition under Windows, in C++.
I hope someone can help me; I am freaking out because there are so many things to do and that's really annoying!
There's a whole family of scanf functions (fscanf, sscanf and some "secure" ones).
Assuming you have read a line into a C string s:
sscanf( s, "$VNYMR,%*f,%*f,%*f,%*f,%*f,%*f,%f,%f,%f,%f,%f,%f", &accX, &accY, &accZ, &gyroX, &gyroY, &gyroZ )
And assuming I have counted correctly! This will verify that the literal $VNYMR is there, followed by six floats that you don't assign, and finally the six that you care about. &accX, etc. are the addresses of your floats. Test the result - the number of assignments made.
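A minimal sketch of the whole job, assuming the lines can be fed in through a std::istream (the function name split_imu_log and the output file names are made up for illustration):

#include <cstdio>
#include <fstream>
#include <istream>
#include <string>

// Parse each $VNYMR line and append the last six values to six .dat files,
// together with a running line counter usable as a crude time index.
void split_imu_log(std::istream& in)
{
    const char *names[6] = { "accX.dat", "accY.dat", "accZ.dat",
                             "gyroX.dat", "gyroY.dat", "gyroZ.dat" };
    std::ofstream out[6];
    for (int i = 0; i < 6; ++i)
        out[i].open(names[i]);

    std::string line;
    long counter = 0;
    while (std::getline(in, line))
    {
        float v[6];
        if (std::sscanf(line.c_str(),
                        "$VNYMR,%*f,%*f,%*f,%*f,%*f,%*f,%f,%f,%f,%f,%f,%f",
                        &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) == 6)
        {
            for (int i = 0; i < 6; ++i)
                out[i] << counter << ' ' << v[i] << '\n';
            ++counter;
        }
    }
}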

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma-separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] should yield 100000 (though if applying a translation function or any non-readable format allows for faster reading, that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
ram restrictions: 750MB
initialization time restriction: 5s
computation time per hand restriction: 0.5s
As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory-mapped files.
Bottleneck 2
The std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion if the number of collisions is low. As the number of elements that you need to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash in order to keep collisions low.
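A minimal sketch of that change applied to the question's get_file_contents, assuming roughly 2 million entries as stated in the question:

#include <fstream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> get_file_contents(const char *filename)
{
    std::unordered_map<std::string, int> outcomes;
    outcomes.reserve(2000000);              // pre-allocate buckets for ~2 million entries
    std::ifstream infile(filename);
    std::string hand;
    int score;
    while (infile >> hand >> score)         // test the read itself, not infile.good()
    {
        outcomes[hand] = score;
    }
    return outcomes;
}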
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the commas you can save ~2/3 of the file size, and it allows you to determine a card by reading only one character per card.
If you want to keep the file text based, you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already drawn cards.
--> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a special sorting scheme for the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB) we do not even need a map; we can simply use an array.
Of course most cells are unused, but access may be very fast. We can even omit the ordering of the cards, since the cells exist whether we fill them or not, so we can use them. But in this case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we (maybe) can also further optimize our file reading speed, if we only store the hand's number and the rating, so that only 2 values need to be parsed.
In fact, we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible hands. Additionally, the ordering is irrelevant as mentioned, but I think that this saving is not worth the increased complexity of encoding the hands.
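A minimal sketch of such a hand key, assuming the five cards have already been mapped to indices 0-51 (encode_hand is a made-up name, and the base-52 scheme is just one possible encoding):

#include <algorithm>
#include <cstdint>

// Map a 5-card hand (each card an index 0..51) to a single base-52 number,
// usable as an array index or hash key; sorting first makes the key
// independent of the order the cards were listed in.
std::uint32_t encode_hand(int cards[5])
{
    std::sort(cards, cards + 5);
    std::uint32_t key = 0;
    for (int i = 0; i < 5; ++i)
        key = key * 52 + static_cast<std::uint32_t>(cards[i]);
    return key;                             // < 52^5 = 380,204,032, so it fits in 32 bits
}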
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>
int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
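For illustration, a minimal sketch of what such a fixed-width binary record might look like (the layout, names and file handling below are made up; a real format would also need to pin down padding and endianness):

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical fixed-width record: 5 card indices (0..51) plus a score.
struct HandRecord
{
    std::uint8_t cards[5];
    std::int32_t score;
};

std::vector<HandRecord> load_binary(const char *filename)
{
    std::vector<HandRecord> records;
    if (std::FILE *f = std::fopen(filename, "rb"))
    {
        HandRecord rec;
        while (std::fread(&rec, sizeof rec, 1, f) == 1)     // one whole record per iteration
            records.push_back(rec);
        std::fclose(f);
    }
    return records;
}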
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}

No method of reading a file seems to work, all return nothing - C++

EDIT: Problem solved! It turns out Windows 7 won't let me read/write files without explicitly running as administrator. So if I run as admin it works fine; if I don't, I get the weird results I explain below.
I've been trying to get a part of a larger program of mine to read a file.
Despite trying multiple methods (istream::getline, std::getline, using the >> operator, etc.), all of them return either \0, blank, or a random number/whatever I initialised the variable with.
My first thought was that the file didn't exist or couldn't be opened. However, the state flags .good, .bad and .eof all indicate no problems, and the file I'm trying to read is certainly in the same directory as the debug .exe and contains data.
I'd most like to use istream::getline to read lines into a char array, but reading lines into a string array is possible too.
My current code looks like this:
void startup::load_settings(char filename[]) //master function for opening a file.
{
    int i = 0;   //count variable
    int num = 0; //var containing all the lines we read.
    char line[5];
    ifstream settings_file (settings.inf);
    if (settings_file.is_open());
    {
        while (settings_file.good())
        {
            settings_file.getline(line, 5);
            cout << line;
        }
    }
    return;
}
As said above, it compiles but just puts \0 into every element of the char array, much like all the other methods I've tried.
Thanks for any help.
Firstly, your code is not complete; what is settings.inf?
Secondly, most probably you're reading everything fine, but the way you are printing is cumbersome:
cout << line; where char line[5]; - be sure that the last element of the array is \0.
You can do something like this:
line[4] = '\0', or you can manually print the value of each element of the array in a loop.
You can also try printing the character codes in hex, because the values (character codes) in the array might not be from the visible character range of ASCII symbols. You can do it like this, for example:
cout << hex << (int)line[i]
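Putting that together, a minimal sketch of such a hex dump (dump_line is a made-up helper; line mirrors the char line[5] from the question):

#include <iostream>

// Print the character code of each element in hex, so that non-printable
// bytes become visible.
void dump_line(const char line[5])
{
    for (int i = 0; i < 5; ++i)
        std::cout << std::hex << (int)(unsigned char)line[i] << ' ';
    std::cout << std::dec << '\n';
}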

getline while reading a file vs reading whole file and then splitting based on newline character

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on the basis of the newline character (using boost), or is it better to use getline()? My question is: does getline() read a single line when called (resulting in multiple hard disk accesses), or does it read the whole file and give it out line by line?
getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just reading the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.
But the "load the whole file in one go" has several, distinct, problems:
You don't start processing until you have read the whole file.
You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?
Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.
Edit: So, I wrote a program to measure this, since I think it's quite interesting.
And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all source files in a directory with about a dozen different source files, then copying this file several times over to "multiply" it, until it took over 1.5 seconds to run the test, which is how long I think you need to run things to make sure the timing isn't too susceptible to random "network packet came in" or other outside influences taking time out of the process).
I also decided to measure the system and user time used by the process.
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)
Here are the three different functions that read the file (there's some code to measure time and such too, of course, but to reduce the size of this post I chose not to include all of it - and I played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here):
void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    f.seekg(0, ios::end);
    streampos size = f.tellg();
    f.seekg(0, ios::beg);
    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);
    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }
    f.close();
    cout << "Lines=" << lines << endl;
    delete [] buffer;
}
void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());
    if (!f)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }
    string str;
    int lines = 0;
    while(getline(f, str))
    {
        lines++;
    }
    cout << "Lines=" << lines << endl;
    f.close();
}
void func_mmap(const char *name)
{
    char *buffer;
    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);
    off_t size = lseek(f, 0, SEEK_END);
    lseek(f, 0, SEEK_SET);
    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);
    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }
    munmap(buffer, size);
    cout << "Lines=" << lines << endl;
}
The OS will read a whole block of data (depending on how the disk is formatted, typically 4-8k at a time) and do some of the buffering for you. Let the OS take care of it for you, and read the data in the way that makes sense for your program.
The fstreams are buffered reasonably. The underlying accesses to the hard disk by the OS are buffered reasonably. The hard disk itself has a reasonable buffer. You will most certainly not trigger more hard disk accesses if you read the file line by line. Or character by character, for that matter.
So there is no reason to load the whole file into a big buffer and work on that buffer, because it already is in a buffer. And there often is no reason to buffer one line at a time, either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly; don't bother tossing everything around twice or more from one buffer to the next. Unless it improves readability and/or your profiler told you that disk access is slowing your program down significantly.
I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found.
If you want to know specifically which method is faster and by how much, you'll have to profile your code.
It's better to fetch all the data if it can be accommodated in memory, because whenever you request I/O your program loses the processor and is put in a wait queue.
However, if the file is big, then it's better to read only as much data at a time as is required for the processing, because a bigger read operation takes much longer to complete than small ones, and the CPU's process-switching time is much smaller than the time needed to read the entire file.
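As a rough sketch of that middle ground, reading the file in fixed-size chunks rather than all at once (the 64 KB chunk size and the helper name are just illustrative choices):

#include <fstream>
#include <vector>

// Read a file in 64 KB chunks and process each chunk as it arrives,
// so the whole file never has to sit in memory at once.
void process_in_chunks(const char *filename)
{
    std::ifstream in(filename, std::ios::binary);
    std::vector<char> chunk(64 * 1024);
    while (in)
    {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        std::streamsize got = in.gcount();          // may be a short final chunk
        if (got <= 0)
            break;
        // ... process chunk[0 .. got-1], e.g. scan it for newlines ...
    }
}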
If it's a small file on disk, it's probably more efficient to read the entire file and parse it line by line rather than reading one line at a time - that would take lots of disk accesses.

Reading file byte by byte with ifstream::get

I wrote this binary reader after a tutorial on the internet. (I'm trying to find the link...)
The code reads the file byte by byte, and the first 4 bytes together form the magic word. (Let's say MAGI!) My code looks like this:
std::ifstream in(fileName, std::ios::in | std::ios::binary);
char *magic = new char[4];
while( !in.eof() ){
    // read the first 4 bytes
    for (int i=0; i<4; i++){
        in.get(magic[i]);
    }
    // compare it with the magic word "MAGI"
    if (strcmp(magic, "MAGI") != 0){
        std::cerr << "Something is wrong with the magic word: "
                  << magic << ", couldn't read the file further! "
                  << std::endl;
        exit(1);
    }
    // read the rest ...
}
Now here comes the problem: when I open my file, I get this error output:
Something is wrong with the magic word: MAGI?, couldn't read the file further! So there is always one (mostly random) character after the word MAGI, like the character ? in this example.
I think it has something to do with how a string in C++ is stored and compared. Am I right, and how can I avoid this?
PS: this implementation is included in another program and works totally fine ... weird.
strcmp assumes that both strings are nul-terminated (end with a nul-character). When you want to compare strings which are not terminated, like in this case, you need to use strncmp and tell it how many characters to compare (4 in this case).
if (strncmp(magic, "MAGI", 4) != 0){
When you try to use strcmp to compare char arrays that are not nul-terminated, it can't tell how long the arrays are (you can't tell the length of an array in C/C++ just by looking at the array itself - you need to know the length it was allocated with; the standard library is not exempt from this limitation). So it reads whatever data happens to be stored in memory after the char array until it hits a 0-byte.
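A slightly different sketch of the same idea: read the magic word into a buffer with room for a terminator, and check that the read actually succeeded (read_magic is just an illustrative name, not code from the question):

#include <cstring>
#include <istream>

// Read the 4-byte magic word and compare it without relying on the
// on-disk data being nul-terminated.
bool read_magic(std::istream& in)
{
    char magic[5] = {};                 // the extra byte keeps the array nul-terminated
    if (!in.read(magic, 4))
        return false;                   // fewer than 4 bytes could be read
    return std::memcmp(magic, "MAGI", 4) == 0;
}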
By the way: note the comment on your question by Lightness Races in Orbit, which is unrelated to the issue you are having now, but which hints at a different bug that might cause you some problems later on.