how to parse stream data(string) to different data files

Hi everyone, I have recently had some problems reading data from an IMU.
Below is the data I got from my device. It is ASCII, all chars, and my data size is 122 chars, which is really big. I need to convert the values to short and then to float, but I don't know how.
unsigned char data[33];
short x,y,z;
float x_fl,y_fl,z_fl,t_fl;
float bias[3]={0,0,0};//array initialization
unsigned char sum_data=0;
int batch=0;
if ( !PurgeComm(file,PURGE_RXCLEAR ))
cout << "Clearing RX Buffer Error" << endl;//this if two sentence aim to clean the buffer
//---------------- read data from IMU ----------------------
do { ReadFile(file,&data_check,1,&read,NULL);
//if ((data_check==0x026))
{ ReadFile(file,&data,33,&read,NULL); }
/// Wx Values
{
x=(data[8]<<8)+data[9];
x_fl=(float)6.8664e-3*x;
bias[0]+=(float)x_fl;
}
/// Wy Values
{
y=(data[10]<<8)+data[11];
y_fl=(float)6.8664e-3*y;
bias[1]+=(float)y_fl;
}
/// Wz Values
{
z=(data[12]<<8)+data[13];
z_fl=(float)6.8664e-3*z;
bias[2]+=(float)z_fl;
}
batch++;
}while(batch<NUM_BATCH_BIAS);
$VNYMR,+049.320,-017.922,-024.946,+00.2829,-00.2734,+00.2735,-02.961,+03.858,-08.325,-00.001267,+00.000213,-00.001214*64
$VNYMR,+049.322,-017.922,-024.948,+00.2829,-00.2714,+00.2735,-02.958,+03.870,-08.323,+00.004923,-00.000783,+00.000290*65
$VNYMR,+049.321,-017.922,-024.949,+00.2821,-00.2655,+00.2724,-02.984,+03.883,-08.321,+00.000648,-00.000391,-00.000485*61
$VNYMR,+049.320,-017.922,-024.947,+00.2830,-00.2665,+00.2756,-02.983,+03.874,-08.347,-00.003416,+00.000437,+00.000252*6C
$VNYMR,+049.323,-017.921,-024.947,+00.2837,-00.2773,+00.2714,-02.955,+03.880,-08.326,+00.002570,-00.001066,+00.000690*67
$VNYMR,+049.325,-017.922,-024.948,+00.2847,-00.2715,+00.2692,-02.944,+03.875,-08.344,-00.002550,+00.000638,+00.000022*6A
$VNYMR,+049.326,-017.921,-024.945,+00.2848,-00.2666,+00.2713,-02.959,+03.876,-08.309,+00.002084,+00.000449,+00.000667*6A
All I want to do is:
Extract the last 6 numbers separated by commas (I don't need the last 3 chars, like *66).
Save the extracted data to 6 .dat files.
What is the best way to do this?
I got this raw data from the IMU, and I need the last 6 values, which are the accelerations (x, y, z) and the gyro rates (x, y, z).
If someone could tell me how to add a counter to the end of each data line, that would be perfect, because I also need the IMU time stamp.
Finally, I am doing the data acquisition under Windows, in C++.
I hope someone can help; I am freaking out because there is so much to do and it's really annoying!

There's a whole family of scanf functions (fscanf, sscanf and some "secure" ones).
Assuming you have read a line into a string:-
sscanf( s, "$VNYMR,%*f,%*f,%*f,%*f,%*f,%*f,%f,%f,%f,%f,%f,%f", &accX, &accY, &accZ, &gyroX, &gyroY, &gyroZ );
And assuming I have counted correctly, this will verify that the literal $VNYMR is there, followed by six floats that you don't assign and finally the six that you care about. &accX, etc. are the addresses of your floats. Test the result, which is the number of assignments made (it should be 6).

Related

How do I take data from a CSV file, and parse the data into an INT between two commas?

First off, apologies for how unspecific this question may be; this is my first time asking a question on StackOverflow.
To get down to business: I am working on a project which reads a file (a CSV, specifically) and then saves certain data from it into an int that I will work with later.
Essentially, my code is currently as follows:
if (!infile) {
cout << "You entered something that cannot be opened, please try again." << endl;
Continue = true;
}
else {
while (infile.good()) {
getline(infile, value, ','); // read a string until next comma
cout << string(value, 1, value.length() - 2); // display value removing the first and the last character from it
}
Continue = false;
}
}
So far, this reads the data and outputs lines that look like this:
2014-01-03,"2014","01","03","†","-12.8","","-31.0","","-21.9","","39.9","","0.0","","","M","","M","0.0","","","","","","<31",""
2014-01-04,"2014","01","04","†","-2.3","","-12.8","","-7.6","","25.6","","0.0","","","M","","M","2.9","","40","","18","","39",""
2014-01-05,"2014","01","05","†","-2.1","","-4.1","","-3.1","","21.1","","0.0","","","M","","M","16.2","","52","","8","","32",""
The information I need is between the 5th and 6th commas (it represents the average temperature for the day), and I somehow need to put the information between those two commas into some kind of int.
Also, between the 3rd and 4th commas is the day, which I will need later to calculate the average temperature for each month, so I need to figure that out as well.
Does anyone know the correct way to go about doing this? Unfortunately, my knowledge of string parsing leaves much to be desired.

Firstly, the values between the 5th and 6th commas are floating-point numbers, so you should convert them into either float or double.
Then,
If both
you always have a “.” as the decimal separator in the floating-point numbers
C++17 or a later version of the language is acceptable for you
use std::from_chars (https://en.cppreference.com/w/cpp/utility/from_chars)
If not, consider the following:
for signed integers - std::stoi/std::stol/std::stoll (https://en.cppreference.com/w/cpp/string/basic_string/stol)
and std::strtol/std::strtoll (https://en.cppreference.com/w/cpp/string/byte/strtol)
for unsigned integers - std::stoul/std::stoull (https://en.cppreference.com/w/cpp/string/basic_string/stoul) and std::strtoul/std::strtoull (https://en.cppreference.com/w/cpp/string/byte/strtoul)
for floating-point numbers - std::stof/std::stod/std::stold (https://en.cppreference.com/w/cpp/string/basic_string/stof), std::strtof/std::strtod/std::strtold (https://en.cppreference.com/w/cpp/string/byte/strtof) and for wide-character strings std::wcstof/std::wcstod/std::wcstold (https://en.cppreference.com/w/cpp/string/wide/wcstof)
If your compiler does not support the functions above, your only options are
atoi/atol (https://en.cppreference.com/w/cpp/string/byte/atoi)
and atof (https://en.cppreference.com/w/cpp/string/byte/atof), and sscanf (https://en.cppreference.com/w/cpp/io/c/fscanf).

String parsing to extract int in C++ for Arduino

I'm trying to write a sketch that allows a user to access data in EEPROM using the serial monitor. In the serial monitor the user should be able to type one of two commands: "read" and "write". "Read" should take one argument, an EEPROM address. "Write" should take two arguments, an EEPROM address and a value. For example, if the user types "read 7" then the contents of EEPROM address 7 should be printed to the serial monitor. If the user types "write 7 12" then the value 12 should be written into address 7 of the EEPROM. Any help is much appreciated. I'm not an expert in Arduino, still learning ;). In the code below I defined inByte to be the result of Serial.read(). Now how do I extract numbers from the string "inByte" to assign to "val" and "addr"?
void loop() {
String inByte;
if (Serial.available() > 0) {
// get incoming byte:
inByte = Serial.read();
}
if (inByte.startsWith("Write")) {
EEPROM.write(addr, val);
}
if (inByte.startsWith("Read")) {
val= EEPROM.read(addr);
}
delay(500);
}

Serial.read() only reads a single character. You should loop until no more input while filling your buffer or use a blocking function like Serial.readStringUntil() or Serial.readBytes() to fill a buffer for you.
https://www.arduino.cc/en/Serial/ReadStringUntil
https://www.arduino.cc/en/Serial/ReadBytes
Or you can use Serial.parseInt() twice to grab the two values directly into a pair of integers. This function will skip the non numerical text and grab the values. This method is also blocking.
https://www.arduino.cc/en/Reference/StreamParseInt
A patch I wrote to improve this function is available in the latest hourly build, but the old versions in previous IDEs still work fine for simple numbers.
The blocking methods can be tweaked using Serial.setTimeout() to change how long they wait for input (1000 ms by default).
https://www.arduino.cc/en/Serial/SetTimeout

[missed the other answer, there's half my answer gone]
I was going to say use Serial.readStringUntil('\n') in order to read a line at a time.
To address the part:
how do I extract numbers from the string "inByte" to assign to "val" and "addr"
This is less trivial than it might seem and a lot of things can go wrong. For simplicity, let's assume the input string is always in the format /^(Read|Write) (\d+)( \d+)?$/.
A simple way to parse it would be to find the spaces, isolate the number strings and call .toInt().
...
int val, addr;
int addrStart = 0;
while(inByte[addrStart] != ' ' && addrStart < inByte.length())
addrStart++;
addrStart++; //skip the space
int addrEnd = addrStart + 1;
while(inByte[addrEnd] != ' ' && addrEnd < inByte.length())
addrEnd++;
String addrStr = inByte.substring(addrStart, addrEnd); //excludes addrEnd
addr = addrStr.toInt();
if (inByte.startsWith("Write")) {
int valEnd = addrEnd+1;
while(inByte[valEnd] != ' ' && valEnd < inByte.length())
valEnd++;
String valStr = inByte.substring(addrEnd+1, valEnd);
val = valStr.toInt();
EEPROM.write(addr, val);
}
else if (inByte.startsWith("Read")) {
val = EEPROM.read(addr);
}
This can fail in all sorts of horrible ways if the input string has a double space, the numbers are malformed, or there is any other subtle error.
If you're concerned with correctness, I suggest you look into a regex library, or even a standard format such as JSON - see ArduinoJson.
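As an alternative to walking the string by hand, the same "keyword, address, optional value" format can be sketched with sscanf on a plain C string. This is only an illustration: the Command type and parseCommand function are hypothetical names, and the sketch assumes the whole line has already been read (for example with Serial.readStringUntil('\n')) and converted to a C string.

```cpp
#include <cstdio>
#include <cstring>

enum class CommandKind { Invalid, Read, Write };

// Hypothetical result type: which command was given and its arguments.
struct Command {
    CommandKind kind;
    int addr;
    int val;
};

// Accepts "Read <addr>" or "Write <addr> <val>". Anything else is Invalid.
// sscanf skips runs of whitespace, so double spaces are tolerated, but
// trailing junk after the last number is not detected.
Command parseCommand(const char* line) {
    Command cmd{CommandKind::Invalid, 0, 0};
    char word[8] = {0};
    int addr = 0;
    int val = 0;
    const int n = std::sscanf(line, "%7s %d %d", word, &addr, &val);
    if (n == 2 && std::strcmp(word, "Read") == 0) {
        cmd = Command{CommandKind::Read, addr, 0};
    } else if (n == 3 && std::strcmp(word, "Write") == 0) {
        cmd = Command{CommandKind::Write, addr, val};
    }
    return cmd;
}
```

A caller would then check cmd.kind before touching the EEPROM, so that a malformed line never reaches EEPROM.write().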

How to get more performance when reading file

My program downloads files from a site (via curl, every 30 minutes). The size of these files can reach 150 MB.
So I thought that getting data from these files could be inefficient (I search for a line every 5 seconds).
These files can have ~10.000 lines.
To parse this file (values are separated by ",") I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push them to a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow IO is not at fault.
I'm reading the data line by line with fstream.
How can I speed up this operation?
Maybe I should parse these values and push them to SQL?
Best Regards

You can probably get some performance boost by avoiding regex, and use something along the lines of std::strtok, or else just hard-code a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you use vector::reserve before you begin a sequence of push_back for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
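A minimal sketch of the two suggestions above (searching for commas by hand instead of using a regex, and reserving the vector up front) might look like this. The Ally struct and its field names are hypothetical, chosen only to match the eight values in the question, and std::atoi is used for brevity even though it cannot report conversion errors.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical record with the eight comma-separated values.
struct Ally {
    int id;
    std::string nick;
    std::string tag;
    int stats[5];
};

// Splits a line into exactly eight fields by searching for commas.
// Returns false if the line has too few or too many fields.
bool parseAlly(const std::string& line, Ally& out) {
    const int kFields = 8;
    std::string fields[kFields];
    std::size_t start = 0;
    for (int i = 0; i < kFields; ++i) {
        const std::size_t comma = line.find(',', start);
        const bool last = (i == kFields - 1);
        if (last != (comma == std::string::npos)) {
            return false;
        }
        fields[i] = last ? line.substr(start) : line.substr(start, comma - start);
        if (!last) {
            start = comma + 1;
        }
    }
    out.id = std::atoi(fields[0].c_str());
    out.nick = fields[1];
    out.tag = fields[2];
    for (int i = 0; i < 5; ++i) {
        out.stats[i] = std::atoi(fields[3 + i].c_str());
    }
    return true;
}

// Parses all lines, reserving the vector once to avoid repeated reallocation.
std::vector<Ally> parseAll(const std::vector<std::string>& lines) {
    std::vector<Ally> allys;
    allys.reserve(lines.size());
    for (const std::string& line : lines) {
        Ally ally;
        if (parseAlly(line, ally)) {
            allys.push_back(ally);
        }
    }
    return allys;
}
```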

Your problem here is most likely additional overhead introduced by the regular expression, since you're using many variable length and greedy matches (the regex engine will try different alignments for the matches to find the largest matching result).
Instead, you might want to try to manually parse the lines. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite some duplicate code in there, but there's lots of room for optimization). It should explain the basic idea though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>
const char *input = "1,Mario,Stuff,4,5,6,7,8";
struct data {
int id;
std::string nick;
std::string tag;
} myData;
int main(int argc, char **argv){
char buffer[256];
std::istringstream in(input);
// Read an entry and convert/store it:
in.get(buffer, 256, ','); // read
myData.id = atoi(buffer); // convert and store
// Skip the comma
in.seekg(1, std::ios::cur);
// Read the next entry and convert/store it:
in.get(buffer, 256, ','); // read
myData.nick = buffer; // store
// Skip the comma
in.seekg(1, std::ios::cur);
// Read the next entry and convert/store it:
in.get(buffer, 256, ','); // read
myData.tag = buffer; // store
// Skip the comma
in.seekg(1, std::ios::cur);
// Some test output
std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
std::map<std::string, int> outcomes;
std::ifstream infile(filename);
std::string c;
int d;
while (infile.good())
{
infile >> c;
infile >> d;
//std::cout << c << d << std::endl;
outcomes[c] = d;
}
return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
ram restrictions: 750MB
initialization time restriction: 5s
computation time per hand restriction: 0.5s

As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory mapped files.
Bottleneck 2
The std::map is usually implemented with a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion on average if the number of collisions is low. As the number of elements that you need to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash table to keep the number of collisions low.

Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So 6 bits are enough to store a card. Your current file format uses 24 bits per card (including the comma).
So by simply enumerating the cards and omitting the comma you can save ~2/3 of the file size, and you can determine a card by reading only one character per card.
If you want to keep the file text based you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already drawn cards.
--> 52^5 = 380.204.032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a special sorting scheme for the cards (since order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB) we do not even need a map; we can simply use an array.
Of course most cells are unused, but access may be very fast. We can even omit the ordering of the cards, since the cells are present regardless of whether we fill them or not, so we can use them. But in this case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we may also be able to further optimize our file reading speed, if we store only the hand's number and the rating, so that only 2 values need to be parsed.
In fact we can optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311.875.200 possible hands. In addition, the ordering is irrelevant as mentioned, but I think that this saving is not worth the increased complexity of the encoding of the hands.
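The enumeration idea can be sketched like this: each card becomes an index from 0 to 51, the five indices are sorted so that order does not matter, and the hand is read as a five-digit number in base 52. The function names and the rank and suit alphabets ("23456789TJQKA" and "cdhs") are assumptions; only cards like "2s" appear in the question's sample, so the real file may spell ranks differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Maps a card such as "2s" to an index in [0, 51], or -1 if it is invalid.
// The rank and suit alphabets are assumed, not taken from the question.
int cardIndex(char rank, char suit) {
    static const std::string ranks = "23456789TJQKA";
    static const std::string suits = "cdhs";
    const std::size_t r = ranks.find(rank);
    const std::size_t s = suits.find(suit);
    if (r == std::string::npos || s == std::string::npos) {
        return -1;
    }
    return static_cast<int>(r * 4 + s);
}

// Encodes a hand written as "2s,3s,4s,5s,6s" into a key below 52^5.
// Cards are sorted first, so every ordering of a hand gets the same key.
bool handKey(const std::string& hand, std::uint32_t& key) {
    if (hand.size() != 14) {
        return false;
    }
    int cards[5];
    for (int i = 0; i < 5; ++i) {
        if (i > 0 && hand[3 * i - 1] != ',') {
            return false;
        }
        cards[i] = cardIndex(hand[3 * i], hand[3 * i + 1]);
        if (cards[i] < 0) {
            return false;
        }
    }
    std::sort(cards, cards + 5);
    std::uint32_t k = 0;
    for (int c : cards) {
        k = k * 52u + static_cast<std::uint32_t>(c);
    }
    key = k;
    return true;
}
```

The resulting key can serve as the key of a hash map or, memory permitting, as a direct index into an array, as described above.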

A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>
int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
int main()
{
std::vector<char> data;
// Read entire file to memory
{
data.reserve(100000000);
char buf[4096];
for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
{
data.insert(data.end(), buf, buf + n);
}
data.push_back('\0');
}
// Tokenize the in-memory data
char * p = &data.front();
for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
{
if (char * r = std::strtok(nullptr, "\n"))
{
char * e;
errno = 0;
int const n = std::strtol(r, &e, 10);
if (*e != '\0' || errno != 0) { continue; }
// At this point we have data:
// * the string is "q"
// * the integer is "n"
}
}
}

Reading using fstream

I am using fstream to read a binary file, but strangely I get different values for the same input file each time I execute the code.
if(fs->is_open())
{
while (!fs->eof())
{
fs->seekg( pos );
fs->read( (char *)&mdfHeader, sizeof(mdfHeader_t) );
pos += mdfHeader.length;
fs->read( (char *)&eventHeader, sizeof(eventHeader_t) );
fs->read( (char *)&rawHeader, sizeof(rawHeader_t) );
fs->read( (char *)&ingressHeader, sizeof(ingressHeader_t) );
fs->read( (char *)&l1Header_xc0, sizeof(l1Header_xc0_t) );
fs->read(data, dataLength);
printf("Data=%#x\n",data);
std::cout << "counter: " << c << "\n";
c++;
}
fs->close();
}
As you can see, I print out data, which should be the same each time, but yields a different value. mdfHeader.length is the length of one block of data.

The first things to change are:
The condition eof() is only really useful to determine why reading data failed but it isn't a useful condition for a loop.
You need to check after reading that you successfully read the data you are interested in.
That is, the loop would look something like this:
while (*fs) {
// read data from fs
if (*fs) {
// do something with the data
}
else if (!fs->eof()) {
std::cout << "ERROR: failed to read record\n";
}
}
I'd also guess that you don't need the seeks and it is a good idea to get rid of them: seeking is relatively expensive because it loses any buffer. You didn't show the entire code, but the initial value of pos has a fair chance of providing some level of randomness. Also, you assume that the sequence of bytes you are reading matches how the data is laid out in your computer. Typically, that isn't the case and you generally need to adjust for the binary format, e.g., to accommodate different word sizes, different endianness, padding, etc.

Computers are like mathematics: everything is deterministic (even for functions like rand, if the input is the same, the output is also the same as before). So if you run code a hundred times with the same input and state, you will certainly get the same output, unless the input or the running state has changed.
You say that the input is the same each time you execute the code, so the only thing that changes is the running state (for example, malloc may return a different value each time you run the program, because its state is determined by the OS).
In your code you use printf("Data=%#x\n",data); to output your data, but that actually just prints the address of data as a hex value, so it is natural that this address may change across runs of the program, because the OS maps your executable to different positions, among other reasons. You should output the content of data, and you will see that it is the same as in the previous run.
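To print the content rather than the address, one option is a small helper that formats each byte of the buffer as two hex digits. The name toHex is made up for this sketch, and the caller is assumed to pass the number of bytes actually read (dataLength in the question).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Formats the bytes of a buffer as lowercase hex, two digits per byte.
std::string toHex(const void* data, std::size_t length) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    out.reserve(length * 2);
    char buf[3];
    for (std::size_t i = 0; i < length; ++i) {
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned int>(bytes[i]));
        out += buf;
    }
    return out;
}
```

In the question's loop this would be used as something like printf("Data=%s\n", toHex(data, dataLength).c_str()); and that output should be identical from run to run for the same file.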
