What's the easiest way to parse this data? - c++

For my game, I'm creating a tilemap that has tiles with different numeric ids, such as 1, 28, etc. This map data is saved into a .dat file that the user edits, and looks a little bit like this:
0, 83, 7, 2, 4
Now, in order for the game to generate the correct tiles, it obviously needs to read the map data. What I want this small parser to do is skip over whitespace and the commas, and fetch the numeric data. I also want it to handle not just single-digit id numbers, but 2-3 digit ones as well (83, etc.).
Thanks for any and all help!

Sounds like a job for the strtok function:
#include <stdlib.h>
#include <string.h>

// Tokenizes a comma/space separated list of numbers in `data`, storing up to
// `maxAllowed` integers in `parsedData`. Note: strtok modifies `data` in place.
int parseData(char* data, size_t maxAllowed, int* parsedData) {
    int idx = 0;
    char* pch = strtok(data, " ,");
    while (pch != NULL)
    {
        parsedData[idx++] = atoi(pch); // convert the token to an integer
        if ((size_t)idx == maxAllowed) {
            break; // reached the maximum allowed
        }
        pch = strtok(NULL, " ,");
    }
    return idx; // return the number of integers found
}
// E.g.
char data[] = "0, 83, 7, 2, 4";
int parsedData[5];
int numFound = parseData(data, 5, parsedData);
The above sample skips over all spaces and commas, returning an integer value for each token found along with the total number of elements found.
Reading the file could be done easily using C functions. You could read it either all at once, or chunk by chunk (calling the function for each chunk).
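For instance, here is a minimal sketch of the "read it all at once" route, handing the whole file to the parseData function above (the file name "map.dat" and the 1024-tile cap are placeholder assumptions):
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("map.dat");                       // placeholder file name
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    buf.push_back('\0');                               // strtok needs a writable, NUL-terminated buffer

    int tiles[1024];                                   // placeholder upper bound on the tile count
    int numFound = parseData(buf.data(), 1024, tiles);
    for (int i = 0; i < numFound; ++i)
        std::cout << tiles[i] << '\n';
}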

This is CSV parsing, but easy is in the eye of the beholder. Depends on whether you want to "own" the code or use what someone else did. There are two good answers on SO, and two good libraries on a quick search.
How can I read and parse CSV files in C++?
Fast, Simple CSV Parsing in C++
https://code.google.com/p/fast-cpp-csv-parser/
https://code.google.com/p/csv-parser-cplusplus/
The advantage of using a pre-written parser is that when you need some other feature, it's probably already there.

Is there a way to seek to the "\n" character that is faster than looping through the chars one at a time?

Looking at the sample implementation of wc.c, when counting the number of lines it loops through the file one character at a time, accumulating '\n' occurrences to count the newlines:
#define COUNT(c)       \
    ccount++;          \
    if ((c) == '\n')   \
        lcount++;
Is there a way to just seek the file for '\n' and keep jumping to the newline characters and do a count?
Would seeking for '\n' be the same as just reading characters one at a time until we see '\n' and count it?
Well, nearly all of the characters are not '\n', and you can't tell which ones are without looking at each of them.
A branch-less algorithm is likely to be faster.
Have you tried std::count, though?
#include <string>
#include <algorithm>

int main() {
    const auto s = std::string("Hello, World!\nfoo\nbar\nbaz");
    const auto lines_in_s = std::count(s.cbegin(), s.cend(), '\n');
    return lines_in_s;
}
Compiler Explorer
Or with a file:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    if (std::ifstream is("filename.txt"); is) {
        const auto lines_in_file =
            std::count(std::istreambuf_iterator<char>(is),
                       std::istreambuf_iterator<char>{}, '\n');
        std::cout << lines_in_file << '\n';
    }
}
Compiler Explorer
The only way you could skip looking at every character would be if you had domain knowledge about the string you're currently looking at:
If you knew that you're handling a text with continuous paragraphs of at least 50 words or so, you could, after each '\n', advance by 100 or 200 chars, thus saving some time. You'd need to test and refine that jump length, of course, but then you wouldn't need to check every single char.
For a general-purpose counting function you're stuck with looking at every possible char.
Q: Is there a faster way to count the number of lines in a file than reading one character at a time?
A: The quick answer is no, but one can parallelize the counting, which might shorten the runtime; the program would still have to run through every byte once. Such a program may be I/O bound, so how useful parallelization is in this case depends on the hardware involved.
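To illustrate, here is a minimal sketch of the parallel idea, assuming the file content is already in memory as a std::string (the chunk count of 4 is an arbitrary choice):
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Split the buffer into chunks and count '\n' in each chunk concurrently.
long count_newlines_parallel(const std::string& data, unsigned chunks = 4) {
    std::vector<std::future<long>> futures;
    const std::size_t step = data.size() / chunks + 1;
    for (std::size_t begin = 0; begin < data.size(); begin += step) {
        const std::size_t end = std::min(begin + step, data.size());
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return static_cast<long>(
                std::count(data.begin() + begin, data.begin() + end, '\n'));
        }));
    }
    long total = 0;
    for (auto& f : futures)
        total += f.get();   // sum the per-chunk counts
    return total;
}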
Q: Is there a way to skip from one newline character to the next without having to read through all the bytes in between?
A: The quick answer is no, but if one had a really large text file, for example, what one could do is make an 'index' file of offsets. One would still have to make one pass over the file in order to generate such an index, but once it was made, one could find the nth line by reading the nth offset in the index and then seeking to it. The index would have to be maintained or regenerated every time the file changed, though. If one used fixed-width offsets, one could jump straight to the required entry with some simple arithmetic, read the offset from the index, then seek to the correct position in the file. A line count can be obtained at the same time as generating the index. Once the index is generated, a line count could quickly be determined from the size of the index file if it has to be computed again.
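A minimal sketch of building such an offset index in a single pass (the file name "data.txt" is a placeholder; in practice the offsets would then be written out as the index file):
#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.txt", std::ios::binary);
    std::vector<std::uint64_t> line_offsets{0};   // line 0 starts at offset 0
    std::uint64_t pos = 0;
    char c;
    while (in.get(c)) {
        ++pos;
        if (c == '\n')
            line_offsets.push_back(pos);          // the next line starts right after '\n'
    }
    // The number of '\n' bytes seen is line_offsets.size() - 1; seeking to line n
    // later is just in.seekg(line_offsets[n]).
}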
It probably should be mentioned that the number of lines in a text file might not be derivable from the number of '\n' bytes, because of multi-byte character encodings. To count the number of lines, one needs to scan the file character by character rather than just byte by byte, and to do that, one needs to know which character encoding scheme is being used.
You can use the strchr function to "jump" to the next '\n' in a string, and it will be faster on some platforms, because strchr is usually implemented in assembly language and uses processor instructions that can scan memory faster, where such instructions are available.
Something like this:
#include <string.h>

unsigned int count_newlines(const char *str) {
    unsigned int result = 0;
    const char *s = str;   /* note: this must be a pointer, not a char */
    while ((s = strchr(s, '\n')) != NULL) {
        ++result;          /* found one '\n' */
        ++s;               /* and start searching again from the next character */
    }
    return result;
}

Binary file containing all ASCII characters

Is there a binary file (.dat file) that contains all 256 ASCII characters? I'd like to test it on my Huffman compression algorithm.
When creating a file normally using a text editor like vim, I won't be able to add in the weird characters like control characters, etc. I wonder if anyone knows whether there is such a file ready to use or if I can make one myself.
You can create one with for(int i = 0; i < 256; i++) printf("%c", i); and redirect the output to a file.
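Or, as a slightly more complete sketch, write all 256 byte values straight to a file in binary mode (the file name "all_bytes.dat" is just a placeholder):
#include <fstream>

int main() {
    std::ofstream out("all_bytes.dat", std::ios::binary);  // binary mode: no byte gets translated
    for (int i = 0; i < 256; ++i)
        out.put(static_cast<char>(i));
}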
Sounds pretty straightforward.
What is the problem?
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

void MakeFile(void)
{
    unsigned char ascii[128] = {0};
    for (size_t i = 1; i < ARRAY_SIZE(ascii); ++i)
    {
        ascii[i] = ascii[i - 1] + 1;   /* fills the array with 0..127 */
    }
    FILE* fp = fopen("all_ascii.txt", "wb"); /* binary mode so control characters are written untouched */
    if (fp == NULL)
    {
        return;
    }
    fwrite(ascii, ARRAY_SIZE(ascii), 1, fp);
    fclose(fp);
}
I don't know of such a file, but you could use the integer representation, or the hex, etc., of those characters. You can check whether a character has a printable form with the function isprint(), which returns zero if it can't be printed.
So what I would do is loop over the characters starting at the first one, and use that function to determine whether each one can be represented in character form or whether you will need to express it some other way.
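For illustration, a small sketch of that loop (printing the hex value for non-printable characters is just one possible way to express them):
#include <cctype>
#include <cstdio>

int main() {
    for (int c = 0; c < 128; ++c) {
        if (std::isprint(c))
            std::printf("%3d  '%c'\n", c, c);   // printable: show the character itself
        else
            std::printf("%3d  0x%02X\n", c, c); // control characters etc.: show the hex value
    }
}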
Take a look at Random.org.
There is a white noise generator; it generates an audio file, but you may use it for your compression tests.

How to get more performance when reading a file

My program downloads files from a site (via curl, every 30 minutes). It is possible for these files to reach 150 MB in size.
So I figured that getting data from these files could be inefficient (searching for a line every 5 seconds).
These files can have ~10,000 lines.
To parse this file (values are separated by ","), I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push it to a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so this is not slow IO's fault.
I'm reading the data line by line with fstream.
How can I speed this operation up?
Maybe I should parse these values and push them to SQL?
Best Regards
You can probably get some performance boost by avoiding regex, and use something along the lines of std::strtok, or else just hard-code a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you use vector::reserve before you begin a sequence of push_back for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
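For example, here is a minimal sketch of the "hard-coded comma search plus reserve" idea (the helper name split_line and the fixed count of 8 fields per line are assumptions taken from the question):
#include <string>
#include <vector>

// Split one CSV line into its fields without a regex.
std::vector<std::string> split_line(const std::string& line) {
    std::vector<std::string> fields;
    fields.reserve(8);                              // the question states there are 8 values per line
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));   // last field
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}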
Your problem here is most likely additional overhead introduced by the regular expression, since you're using many variable length and greedy matches (the regex engine will try different alignments for the matches to find the largest matching result).
Instead, you might want to try to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite a bit of duplicated code, but there's lots of room for optimization). It should explain the basic idea though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv) {
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer;     // store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer;      // store

    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
std::map<std::string, int> outcomes;
std::ifstream infile(filename);
std::string c;
int d;
while (infile.good())
{
infile >> c;
infile >> d;
//std::cout << c << d << std::endl;
outcomes[c] = d;
}
return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
ram restrictions: 750MB
initialization time restriction: 5s
computation time per hand restriction: 0.5s
As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory-mapped files.
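For the reading part, here is a minimal sketch of the single istream::read approach (no memory mapping; error handling kept to a bare minimum):
#include <fstream>
#include <string>

// Open in binary mode at the end to learn the size, then pull the whole
// file into one buffer with a single read call.
std::string read_whole_file(const char* filename) {
    std::ifstream in(filename, std::ios::binary | std::ios::ate);
    if (!in)
        return {};
    const std::streamsize size = in.tellg();
    in.seekg(0, std::ios::beg);
    std::string buffer(static_cast<std::string::size_type>(size), '\0');
    in.read(&buffer[0], size);
    return buffer;
}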
Bottleneck 2
std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion if the number of collisions is low. As the number of elements you need to read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash to keep the number of collisions low.
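A minimal sketch of that switch, with an up-front reserve (the 2,000,000 figure comes from the question; the load factor is just an assumption):
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> outcomes;
    outcomes.max_load_factor(0.7f);   // keep collisions low (assumed value)
    outcomes.reserve(2000000);        // roughly the number of lines in the file
    // ... insert the parsed key/value pairs here ...
}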
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma).
So simply by enumerating the cards and omitting the commas you can save ~2/3 of the file size, and it allows you to determine a card by reading only one character per card.
If you want to keep the file text based, you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not consider the already drawn cards.
--> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a fixed sorting scheme for the cards (since their order is irrelevant), we can assign a number to the hand and use this number as the key in our map, which is a lot faster than using strings. A sketch of such a key function is shown below.
If we have enough memory (1.5 GB) we do not even need a map but can simply use an array.
Of course most cells are unused, but access can be very fast. We can even omit the ordering of the cards, since the cells exist whether we fill them or not, so we can use them all; but in that case you should not forget to fill all possible permutations of the hand read from the file.
With this scheme we can (maybe) also further optimize file reading speed, if we store only the hand's number and the rating, so that only 2 values need to be parsed.
In fact we could optimize the required storage space by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible hands. Additionally, the ordering is irrelevant as mentioned, but I think this saving is not worth the increased complexity of encoding the hands.
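Here's a minimal sketch of such a key function (the 0..51 card numbering and the helper name hand_key are assumptions; only the base-52 packing is shown):
#include <algorithm>
#include <array>
#include <cstdint>

// Cards are assumed to be numbered 0..51 (e.g. rank * 4 + suit). Sorting first
// makes the key independent of the order the cards were listed in.
std::uint32_t hand_key(std::array<int, 5> cards) {
    std::sort(cards.begin(), cards.end());
    std::uint32_t key = 0;
    for (int c : cards)
        key = key * 52 + static_cast<std::uint32_t>(c);  // base-52: 52^5 < 2^32, so it fits
    return key;
}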
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>

int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }
            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}

Reading integers from file in separate arrays, one digit at a time, in C++

I have a text file that contains several lines, each line containing two very large integers.
I need to read the first integer on the line, store each one of its digits in an int array, read the second integer on the line, store each one of its digits in another int array. Then I should perform some operations (adding them, multiplying them etc), then repeat the procedure for the second line in the text file and so on.
I don't know how to read the integers this way. I would be able to read a single integer as an array of digits, but I don't know how to differentiate between the two integers separated by a space, much less how to tell when to move on to the next line.
The reason why I can't read the integers as int variables is, as I said, that they are too large for common numeric operations, so I must do them the same way I would by hand. I've written functions to replicate the process, but they need arrays of digits.
I tried to use fscanf or getline, but anything similar will read both integers on the line into one single array. Also, anything that reads until a space is encountered will read ALL of my numbers, not only the ones on the line I'm at.
The ideal would be two arrays, each containing the digits of one integer, that I keep reinitialising every time I switch the line.
Any suggestions on how to do this (or ideas that follow a different approach to do the same) would be appreciated.
Using the Boost library (boost::algorithm::split for the string splitting, and boost::lexical_cast for the conversion), you may take a look at this code snippet (without validation):
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/lexical_cast.hpp>

typedef std::vector<int> intarray;

intarray da[2];                        // one digit array per integer on the line
std::string s;
std::fstream f(filename, std::ios::in);
while (std::getline(f, s))
{
    da[0].clear();                     // reinitialise the digit arrays for each line
    da[1].clear();
    std::vector<std::string> v;
    boost::algorithm::split(v, s, boost::algorithm::is_any_of(" "));
    for (int j = 0; j < 2; ++j)        // the two integers on the line
    {
        std::string fs = v.at(j);
        for (std::size_t i = 0; i < fs.size(); ++i)
        {
            try
            {
                int d = boost::lexical_cast<int>(fs.at(i));   // convert one digit character
                da[j].push_back(d);
            }
            catch (boost::bad_lexical_cast& e)
            {
                std::cout << "caught exception.\n";
                break;
            }
        }
    }
}
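If you'd rather avoid the Boost dependency, here's a minimal standard-library sketch of the same idea (the file name "numbers.txt" is a placeholder; digits are extracted by subtracting '0', which assumes plain ASCII digits and no sign characters):
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream f("numbers.txt");
    std::string line;
    while (std::getline(f, line)) {
        std::istringstream iss(line);
        std::string a, b;
        if (!(iss >> a >> b))
            continue;                          // expect two whitespace-separated integers per line
        std::vector<int> da, db;               // fresh digit arrays for every line
        for (char c : a) da.push_back(c - '0');
        for (char c : b) db.push_back(c - '0');
        // ... perform the long-hand arithmetic on da and db here ...
    }
}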