Is there a way to seek to the "\n" character that is faster than looping through the characters one at a time? - c++

Looking at the sample implementation of wc.c: when counting the number of lines, it loops through the file one character at a time, counting each '\n' to accumulate the number of newlines:
#define COUNT(c) \
    ccount++; \
    if ((c) == '\n') \
        lcount++;
Is there a way to just seek the file for '\n', jumping from one newline character to the next and counting as we go?
Or would seeking for '\n' be the same as reading characters one at a time until we see '\n' and counting it?

Well, there is no way to tell whether a character is '\n' without looking at it, so every character has to be examined either way.
A branch-less algorithm is likely to be faster than the character-by-character loop, though.
Have you tried std::count?
#include <string>
#include <algorithm>
int main() {
    const auto s = std::string("Hello, World!\nfoo\nbar\nbaz");
    const auto lines_in_s = std::count(s.cbegin(), s.cend(), '\n');
    return lines_in_s;
}
Or with a file:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
int main() {
    if (std::ifstream is("filename.txt"); is) {
        const auto lines_in_file =
            std::count(std::istreambuf_iterator<char>(is),
                       std::istreambuf_iterator<char>{}, '\n');
        std::cout << lines_in_file << '\n';
    }
}

The only way you could skip looking at every character would be if you had domain knowledge about the string you're currently looking at:
If you knew that you're handling a text with continuous paragraphs of at least 50 words or so, you could, after each '\n', advance by 100 or 200 chars, thus saving some time. You'd need to test and refine that jump length, of course, but then you wouldn't need to check every single char.
For a general-purpose counting function, you're stuck with looking at every single char.
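To illustrate, here is a minimal sketch of that idea. The minimum line length MIN_LINE is an assumption invented for this example, as is the function name; the code only counts correctly while that assumption holds:

#include <cstddef>

// ASSUMPTION: every line in the input is at least MIN_LINE characters long.
constexpr std::size_t MIN_LINE = 100; // tune against real data

std::size_t count_lines_with_skips(const char *buf, std::size_t len) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < len;) {
        if (buf[i] == '\n') {
            ++count;
            i += MIN_LINE; // safe skip: no '\n' can occur this soon
        } else {
            ++i;
        }
    }
    return count;
}

If the assumption is ever violated, newlines get skipped over and the count is wrong, which is exactly why this only works with domain knowledge.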

Q: Is there a faster way to count the number of lines in a file than reading one character at a time?
A: The quick answer is no, but one can parallelize the counting, which might shorten the runtime; the program would still have to run through every byte once. Such a program may be I/O bound, so how much parallelization helps depends on the hardware involved.
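As a rough sketch of what that parallelization could look like, assuming the file has already been read into an in-memory buffer (the chunk count here is arbitrary):

#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

long long parallel_newline_count(const std::string &data, unsigned parts = 4) {
    std::vector<std::future<long long>> futures;
    const std::size_t chunk = data.size() / parts + 1;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return static_cast<long long>(
                std::count(data.begin() + begin, data.begin() + end, '\n'));
        }));
    }
    long long total = 0;
    for (auto &f : futures)
        total += f.get();
    return total;
}

Every byte is still examined exactly once; the work is merely spread across threads.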
Q: Is there a way to skip from one newline character to the next without having to read through all the bytes in between?
A: The quick answer is no, but if one had a really large text file, for example, one could build an 'index' file of offsets. One would still have to make one pass over the file in order to generate such an index, but once it was made, one could find the nth line by reading the nth offset in the index and then seeking to it. The index would have to be maintained or regenerated every time the file changed. If one used fixed-width offsets, one could compute the location of the nth offset with simple arithmetic, read the offset there, then seek to the corresponding position in the file. A line count can be obtained at the same time as generating the index; once the index exists, the line count can be quickly determined from the size of the index file whenever it is needed again.
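A sketch of how such an index might be generated; the file names and function name are hypothetical, and the offsets are fixed-width 64-bit values as described above:

#include <cstdint>
#include <fstream>

void build_line_index(const char *text_path, const char *index_path) {
    std::ifstream in(text_path, std::ios::binary);
    std::ofstream out(index_path, std::ios::binary);
    std::uint64_t pos = 0; // current byte offset; line 0 starts at 0
    out.write(reinterpret_cast<const char *>(&pos), sizeof pos);
    char c;
    while (in.get(c)) {
        ++pos;
        if (c == '\n') // the next line starts right after the '\n'
            out.write(reinterpret_cast<const char *>(&pos), sizeof pos);
    }
}

With this layout, the offset of line n sits at byte n * sizeof(std::uint64_t) in the index, and the line count falls out of the index file's size, as noted above.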
It should probably be mentioned that the number of lines in a text file might not be derivable from the number of '\n' bytes because of multi-byte character encodings. To count the number of lines correctly, one needs to scan the file character by character rather than just byte by byte, and to do that, one needs to know which character encoding scheme is in use.

You can use the strchr function to "jump" to the next '\n' in a string, and it will be faster on some platforms, because strchr is usually implemented in assembly language and uses processor instructions that can scan memory faster, where such instructions are available.
Something like this:
#include <string.h>

unsigned int count_newlines(const char *str) {
    unsigned int result = 0;
    const char *s = str;
    while ((s = strchr(s, '\n')) != NULL) {
        ++result; // found one '\n'
        ++s;      // start searching again from the next character
    }
    return result;
}

Related

String Management C/C++ & Writing and Reading From txt File

I am facing a problem with reading and writing a string from and to a file respectively.
Purpose:
To enter a string into a text file as a complete sentence, read the string from the text file and separate all words that start from a vowel using a function and display them as a sentence. (The sentence just needs to consist of the words from the string that start with a vowel.)
Problem:
The code mostly works as intended, but because I used the getline() function to obtain the string from the txt file, when I extract a substring from it, the substring includes the entire rest of the line after the vowel instead of just the word. I cannot understand how to make the substring include only the word.
Code:
#include <fstream>
#include <string>
#include <iostream>
#include <cstring>
using namespace std;
string vowels(string a)
{
    int c = sizeof(a);
    string b[c];
    string d;
    static int n;
    for (int i = 1; i <= c; i++)
    {
        if (a.find("a") != -1)
        {
            b[i] = a.substr(a.find("a", n));
            d += b[i];
            n = a.find("a") + 1;
        }
        else if (a.find("e") != -1)
        {
            b[i] = a.substr(a.find("e", n));
            d += b[i];
            n = a.find("e") + 1;
        }
        else if (a.find("i") != -1)
        {
            b[i] = a.substr(a.find("i", n));
            d += b[i];
            n = a.find("i") + 1;
        }
        else if (a.find("o") != -1)
        {
            b[i] = a.substr(a.find("o", n));
            d += b[i];
            n = a.find("o") + 1;
        }
        else if (a.find("u") != -1)
        {
            b[i] = a.substr(a.find("u", n));
            d += b[i];
            n = a.find("u") + 1;
        }
    }
    return d;
}

int main()
{
    string input, lne, e;
    ofstream file("output.txt", ios::app);
    cout << "Please input text for text file input: ";
    getline(cin, input);
    file << input;
    file.close();
    ifstream myfile("output.txt");
    getline(myfile, lne);
    e = vowels(lne);
    cout << endl << "Text inside file reads: ";
    cout << lne;
    cout << endl;
    cout << e << endl;
    system("pause");
    myfile.close();
    return 0;
}
I haven't read your code VERY carefully, but several things stand out:
Look up find_first_of - it'll simplify your code A LOT (a short example follows this list).
sizeof(a) certainly doesn't do what you think it does [unless you think it gives you the size of the std::string class type - which makes it a rather strange use-case; why not just write 12 or 24?]
find (and find_first_of), technically speaking, doesn't return -1 when it doesn't find what you want. It returns std::string::npos [which may appear to be -1, but a) is not guaranteed to be, and b) is unsigned, so it can't be negative].
Your program only reads one line.
x.substr(n) will give you the substring of x from position n to the end - is that what you want?
Don't repeat the find; use p = x.find("X"); and then x.substr(p) [assuming that is what you want].
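For instance, here is a minimal example of find_first_of, locating the first vowel without chaining five separate find calls:

#include <iostream>
#include <string>

int main() {
    const std::string s = "xyz apple";
    const std::string::size_type p = s.find_first_of("aeiou");
    if (p != std::string::npos)
        std::cout << "first vowel at index " << p << ": " << s[p] << '\n';
}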
There are various problems with your code.
int c = sizeof( a );
This is the number of bytes that a std::string object takes up in memory, not the length of the text. And you certainly don't want to create an array of this many strings, as it makes no sense for what you're trying to achieve. Don't do this to yourself. You're only copying one string inside the loop; all you need is one string, and you already have string d.
To get the actual size of a string, you have to call
str.size()
string::substr(...) has a couple of overloads. One of them takes only one argument, an index; this returns the substring starting at that index and running all the way through to the end of the original string (the string starting at the vowel through to the end).
What you are probably looking for is the overload that takes two arguments: the start index (the beginning of the word) and a count (the length of the word, i.e., up to where the word ends).
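A small sketch of that two-argument overload; note the second argument is a length, not an end index, so you subtract the start position:

#include <iostream>
#include <string>

int main() {
    const std::string a = "an apple a day";
    const std::string::size_type start = 3;                // beginning of "apple"
    const std::string::size_type end = a.find(' ', start); // one past the last letter
    std::cout << a.substr(start, end - start) << '\n';     // prints "apple"
}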
The string input will not contain the newline that you enter to flush cin. And you then add it to the file in append mode, so after running the program a few times your file is one huge single line. Did you really intend to do this?
Maybe you should explicitly add a new line to the file after entering the input. Something like file << std::endl;
Also, the conditions in the ifs
if (a.find("a")!=-1)
Don't match what you do next,
b[i]=a.substr(a.find("a",n));
Then you use a static int,
static int n;
This is bad, because the function will only work correctly once: n keeps its value across calls. You're relying on the fact that statics are zero-initialized, but you should always initialize explicitly. In your case, n doesn't need to be static at all.
Finally: "so i was unsure of how many loops to run"
When you don't know how many loops you have to run, then a for loop is not adequate.
You should use a while loop or a do while.
You shouldn't try to learn C++ by guessing, because that's what it looks like you're doing. You're trying to do more than you know and making some very silly mistakes. Find a good book to learn from, or at the very least google the functions you're using to see what they do and how to use them properly (e.g., http://www.cplusplus.com/reference/string/string/substr/).
Here's a list of books from stackoverflow's FAQ: The Definitive C++ Book Guide and List
The last thing is about finding vowels. When you find a vowel, you have to make sure it's at the beginning of a word. Then you want to read until the word ends, that is, until you find a character that is not part of a word (whitespace, certain punctuation, ...). This marks the beginning and end of the word.
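Putting those pieces together, one possible (not the only) sketch splits each line into words with std::istringstream and keeps the words whose first character is a vowel:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    const std::string line = "enter a string and keep only some words";
    std::istringstream words(line);
    std::string word, result;
    while (words >> word) {
        if (std::string("aeiouAEIOU").find(word[0]) != std::string::npos)
            result += word + ' ';
    }
    std::cout << result << '\n'; // "enter a and only"
}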

How to get more performance when reading file

My program downloads files from a site (via curl, every 30 minutes). (The size of these files can reach 150 MB.)
So I suspect that getting data from these files is inefficient (I search for a line every 5 seconds).
These files can have ~10,000 lines.
To parse a file (values are separated by ",") I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Then I push each match into a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow I/O is not at fault.
I'm reading the data line by line with fstream.
How can I speed this operation up?
Maybe I should parse these values and push them into SQL?
Best Regards
You can probably get some performance boost by avoiding regex, and use something along the lines of std::strtok, or else just hard-code a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you use vector::reserve before you begin a sequence of push_back for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
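As a small illustration of the reserve() advice; the record type here is hypothetical, standing in for the 8-value rows from the question:

#include <vector>

struct Ally { int id, a, b, c, d, e, f, g; }; // hypothetical row type

void fill(std::vector<Ally> &allys) {
    allys.reserve(10000); // one allocation up front for ~10,000 lines
    for (int i = 0; i < 10000; ++i)
        allys.push_back(Ally{i, 0, 0, 0, 0, 0, 0, 0});
}

Without the reserve, the vector reallocates and moves its contents several times as it grows.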
Your problem here is most likely additional overhead introduced by the regular expression, since you're using many variable length and greedy matches (the regex engine will try different alignments for the matches to find the largest matching result).
Instead, you might want to try to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite a bit of duplicated code, but there's lots of room for optimization). It should explain the basic idea, though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv) {
    char buffer[256];
    std::istringstream in(input);

    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);

    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Reading key-value pairs as fast as possible in C++ from file

I have a file with roughly 2 million lines like this:
2s,3s,4s,5s,6s 100000
2s,3s,4s,5s,8s 101
2s,3s,4s,5s,9s 102
The first comma separated part indicates a poker result in Omaha, while the latter score is an example "value" of the cards. It is very important for me to read this file as fast as possible in C++, but I cannot seem to get it to be faster than a simple approach in Python (4.5 seconds) using the base library.
Using the Qt framework (QHash and QString), I was able to read the file in 2.5 seconds in release mode. However, I do not want to have the Qt dependency. The goal is to allow quick simulations using those 2 million lines, i.e. some_container["2s,3s,4s,5s,6s"] to yield 100 (though if applying a translation function or any non-readable format will allow for faster reading that's okay as well).
My current implementation is extremely slow (8 seconds!):
std::map<std::string, int> get_file_contents(const char *filename)
{
    std::map<std::string, int> outcomes;
    std::ifstream infile(filename);
    std::string c;
    int d;
    while (infile.good())
    {
        infile >> c;
        infile >> d;
        //std::cout << c << d << std::endl;
        outcomes[c] = d;
    }
    return outcomes;
}
What can I do to read this data into some kind of a key/value hash as fast as possible?
Note: The first 16 characters are always going to be there (the cards), while the score can go up to around 1 million.
Some further information gathered from various comments:
sample file: http://pastebin.com/rB1hFViM
RAM restriction: 750 MB
initialization time restriction: 5 s
computation time per hand restriction: 0.5 s
As I see it, there are two bottlenecks in your code.
Bottleneck 1
I believe that the file reading is the biggest problem there. Having a binary file is the fastest option. Not only can you read it directly into an array with a raw istream::read in a single operation (which is very fast), but you can even map the file into memory if your OS supports it. Here is a link that's very informative on how to use memory mapped files.
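For completeness, a hedged POSIX-only sketch of the memory-mapping option (Windows would use CreateFileMapping/MapViewOfFile instead; error handling is minimal):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char *map_whole_file(const char *path, std::size_t *out_len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;
    *out_len = static_cast<std::size_t>(st.st_size);
    return static_cast<const char *>(p); // release with munmap when done
}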
Bottleneck 2
std::map is usually implemented as a self-balancing BST that stores all the data in order. This makes insertion an O(log n) operation. You can change it to std::unordered_map, which uses a hash table instead. A hash table has constant-time insertion as long as the number of collisions is low. Since the number of elements to be read is known, you can reserve a suitable number of buckets before inserting the elements. Keep in mind that you need more buckets than the number of elements that will be inserted into the hash to keep the number of collisions low.
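A minimal sketch of that change, pre-sized for the roughly 2 million entries mentioned in the question:

#include <string>
#include <unordered_map>

std::unordered_map<std::string, int> make_table() {
    std::unordered_map<std::string, int> outcomes;
    outcomes.reserve(2000000); // pre-size the hash table for the known element count
    return outcomes;
}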
Ian Medeiros already mentioned the two major bottlenecks.
A few thoughts about data structures:
The number of different cards is known: 4 suits of 13 cards each -> 52 cards.
So a card requires less than 6 bits to store. Your current file format uses 24 bits per card (including the comma).
By simply enumerating the cards and omitting the commas you can save ~2/3 of the file size, and it lets you identify a card by reading only one character per card.
If you want to keep the file text based, you may use a-m, n-z, A-M and N-Z for the four suits.
Another thing that bugs me is the string-based map. String operations are inefficient.
One hand contains 5 cards.
That means 52^5 possibilities if we keep it simple and do not exclude already-drawn cards.
-> 52^5 = 380,204,032 < 2^32
That means we can enumerate every possible hand with a uint32 number. By defining a canonical ordering of the cards (since order is irrelevant), we can assign a number to each hand and use this number as the key in our map, which is a lot faster than using strings.
If we have enough memory (1.5 GB), we do not even need a map; we can simply use an array.
Of course most cells would be unused, but access could be very fast. We could even omit the ordering of the cards, since the cells exist whether we fill them or not, so we can use them all - but in that case you must not forget to fill all possible permutations of each hand read from the file.
With this scheme we can (maybe) also further optimize file reading speed: if we store only the hand's number and the rating, only 2 values need to be parsed.
In fact, we could reduce the required storage further by using a more complex addressing scheme for the different hands, since in reality there are only 52*51*50*49*48 = 311,875,200 possible ordered hands; additionally the ordering is irrelevant, as mentioned. But I think that saving is not worth the increased complexity of the hand encoding.
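A sketch of the enumeration idea; the rank/suit mapping is made up for illustration, and the five cards are assumed to be pre-sorted so that equal hands always produce the same key:

#include <cstdint>

// card: rank 0..12, suit 0..3 -> number 0..51
inline std::uint32_t card_number(int rank, int suit) {
    return static_cast<std::uint32_t>(rank * 4 + suit);
}

// Pack five card numbers into one integer key:
// 5 cards * 6 bits = 30 bits, which fits in a uint32_t.
inline std::uint32_t hand_key(const std::uint32_t cards[5]) {
    std::uint32_t key = 0;
    for (int i = 0; i < 5; ++i)
        key = (key << 6) | cards[i];
    return key;
}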
A simple idea might be to use the C API, which is considerably simpler:
#include <cstdio>

int n;
char s[128];
while (std::fscanf(stdin, "%127s %d", s, &n) == 2)
{
    outcomes[s] = n;
}
A rough test showed a considerable speedup for me compared to the iostreams library.
Further speedups may be achieved by storing the data in a contiguous array, e.g. a vector of std::pair<std::string, int>; it depends on whether your data is already sorted and how you need to access it later.
For a serious solution, though, you should probably step back further and think of a better way to represent your data. For example, a fixed-width, binary encoding would be much more space-efficient and faster to parse, since you won't need to look ahead for line endings or parse strings.
Update: From some quick experimentation I've found it fairly fast to first read the entire file into memory and then perform alternating strtok calls with either " " or "\n" as the delimiter; whenever a pair of calls succeed, apply strtol on the second pointer to parse the integer. Here's a skeleton:
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main()
{
    std::vector<char> data;

    // Read entire file to memory
    {
        data.reserve(100000000);
        char buf[4096];
        for (std::size_t n; (n = std::fread(buf, 1, sizeof buf, stdin)) > 0; )
        {
            data.insert(data.end(), buf, buf + n);
        }
        data.push_back('\0');
    }

    // Tokenize the in-memory data
    char * p = &data.front();
    for (char * q = std::strtok(p, " "); q; q = std::strtok(nullptr, " "))
    {
        if (char * r = std::strtok(nullptr, "\n"))
        {
            char * e;
            errno = 0;
            int const n = std::strtol(r, &e, 10);
            if (*e != '\0' || errno != 0) { continue; }

            // At this point we have data:
            // * the string is "q"
            // * the integer is "n"
        }
    }
}

Count the number of unique words and occurrence of each word

CSCI-15 Assignment #2, String processing. (60 points) Due 9/23/13
You MAY NOT use C++ string objects for anything in this program.
Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].
You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines comprise words (contiguous lower-case letters [a-z]) separated by spaces, terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, like in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line. You may assume that the lines will be no longer than 100 characters, the individual words will be no longer than 15 letters and there will be no more than 100 unique words in the file.
Read the lines from the input file, and echo-print them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, and the collected statistics to the output file. You will also need to create other test files of your own. Also, your program must work correctly with an EMPTY input file – which has NO statistics.
Test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):
the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.
Copy and paste this into a small file for one of your tests.
Hints:
Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of ints with 100 elements to hold the associated counts. For each word, scan through the occupied lines in the array for a match (use strcmp()), and if you find a match, increment the associated count, otherwise (you got past the last word), add the word to the table and set its count to 1.
The separate longest word and the shortest word need to be saved off in their own C-strings. (Why can't you just keep a pointer to them in the tokenized data?)
Remember – put NO NEWLINE at the end of the last line, or your test for end-of-file might not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)
This is not a long program – no more than about 2 pages of code
Here is what I have so far:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;

void totalwordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words.
    char *token;
    int totalCount = 0; // Counts the total number of words.

    // Read every word in the file.
    while (inputFile >> words[99])
    {
        totalCount++; // Increment the total number of words.
        // Tokenize each word and remove spaces, periods, and newlines.
        token = strtok(words[99], " .\n");
        while (token != NULL)
        {
            token = strtok(NULL, " .\n");
        }
    }
    cout << "Total number of words in file: " << totalCount << endl;
}

void uniquewordCount(ifstream &inputFile)
{
    char words[100][16]; // Holds the unique words
    int counter[100];
    char *tok = "0";
    int uniqueCount = 0; // Counts the total number of unique words

    while (!inputFile.eof())
    {
        uniqueCount++;
        tok = strtok(words[99], " .\n");
        while (tok != NULL)
        {
            tok = strtok(NULL, " .\n");
            inputFile >> words[99];
            if (strcmp(tok, words[99]) == 0)
            {
                counter[99]++;
            }
            else
            {
                words[99][15] += 1;
            }
            uniqueCount++;
        }
    }
    cout << counter[99] << endl;
}

int main(int argc, char *argv[])
{
    ifstream inputFile;
    char inFile[12] = "string1.txt";
    char outFile[16] = "word result.txt";

    // Get the name of the file from the user.
    cout << "Enter the name of the file: ";
    cin >> inFile;

    // Open the input file.
    inputFile.open(inFile);

    // If successfully opened, process the data.
    if (inputFile)
    {
        while (!inputFile.eof())
        {
            totalwordCount(inputFile);
            uniquewordCount(inputFile);
        }
    }
    return 0;
}
I already took care of how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function, I am having trouble counting the total number of unique words and counting the number of occurrences of each word. Is there anything that I need to change in the uniquewordCount() function?
This program contains several issues which are to be considered harmful! To prevent bad software being created based on entirely nonsensical assignments like the above, here are a number of hints:
Always test the stream for success after reading from it. Using in.eof() to determine if the stream is in a good state does not work! One of the problems is that you will get an infinite loop if the stream goes bad for a different reason than end of file, e.g., failure to correctly parse a value (this will set std::ios_base::failbit but not std::ios_base::eofbit).
Reading into a fixed-size char array a using in >> a without having set up a limit for the number of characters to be read is the C++ way to spell gets()! If you really think that using in >> a is the right way to go (but see the next item), you absolutely need to set up the array's width, e.g., using in >> std::setw(sizeof(a)) >> a. You still need to check that this extraction was successful, of course.
From the looks of it, your teacher wants you to actually use std::istream::getline() to read the array, e.g., using in.getline(a, sizeof(a)) (which, of course, needs to be checked for success).
Note that the formatted input, i.e., in >> a, already tokenizes the input at spaces! There is no need to faff about with strtok() after that.
Once you have consumed a stream, it is consumed. Assuming the characters don't come from a file but rather from something like standard input, you also can't rewind the stream to read it again. I'd think you want to tokenize the values once and use them for both purposes.
This is more of a sidenote: after you created a stream, its nature should be entirely immaterial for the processing of the stream's content (although, e.g., for string streams you might want to eventually collect the result using the str() member): implement your stream processing functions in terms of std::istream rather than std::ifstream!
Since you have a concrete question ("Is there anything that I need to change in the uniquewordCount() function?"): yes, everything! Throw away this function entirely and rethink what you need to do. Basically, the structure of the functionality should be along the lines of
char buffer[100];
while (in.getline(buffer, sizeof(buffer))) {
    // tokenize buffer into words
    // for each word check if it already exists
    // if the word does not exist, append it to the array of known words and set count to 1
    // if the word exists, increment the count
    // determine if the word is shorter or longer than the shortest or longest word so far
    // if it is the case, remember the word's index or a pointer to it
}

Removing punctuation marks using ispunct()

ispunct() works well when words are separated like this: "one, two; three". Then it will remove ", ;" and replace them with a given character.
But if a string is given as "ts='TOK_STORE_ID';", it will treat "ts='TOK_STORE_ID';" as one single token, or
"one,one, two;four$three two" as three tokens: 1. "one,one" 2. "two;four$three" 3. "two".
Is there any way so that "one,one, two;four$three two" could be treated as "one one two four three two", each a separate token?
Writing manual code like:
for (i = 0; i < str.length(); i++)
{
    // validating each character
}
This operation will become very costly when the string is very, very long.
So is there any other function like ispunct()? or anything else?
In C we would compare each character like this:
for (i = 0; i < str.length(); i++)
{
    if (str[i] == ',' || str[i] == ';') // Is there any way here to compare with all punctuations in one shot?
        str[i] = ' '; // replace with space
}
What is the correct way to do this in C++?
This operation will become very costly when the string is very, very long.
No, it won't. It will be an O(n) operation, which is as good as it gets for this problem: any which way, you have to look at each and every character in the string, because there is no way to know where the punctuation is without examining every character.
Assuming you're dealing with a typical 8-bit character set, I'd start by building a translation table:
std::vector<char> trans(UCHAR_MAX + 1); // one entry for every unsigned char value
for (int i = 0; i <= UCHAR_MAX; i++)
    trans[i] = ispunct(i) ? ' ' : i;
Then processing a string of text can be something like this:
for (auto &ch : str)
    ch = trans[(unsigned char)ch];
For an 8-bit character set, the translation table will typically all fit in your L1 cache, and the loop has only one branch that's highly predictable (always taken except when you reach the end of the string) so it should be fairly fast.
Just to be clear, when I say "fairly fast", I mean it's extremely unlikely that this would be the bottleneck in the process you've described. You'd need a combination of a slow processor and fast network connection to stand any chance of this being the bottleneck in processing data you're obtaining over a network.
If you have a Raspberry Pi with a 10 GbE network connection, you might need to do a little more optimization work for this to keep up (but I'm not sure even then). For any less radical mismatch, the network is clearly going to be the bottleneck.
So is there any other function like ispunct()? or anything else?
As a matter of fact, there is. man ispunct gives me this beautiful list:
int isalnum(int c);
int isalpha(int c);
int isascii(int c);
int isblank(int c);
int iscntrl(int c);
int isdigit(int c);
int isgraph(int c);
int islower(int c);
int isprint(int c);
int ispunct(int c);
int isspace(int c);
int isupper(int c);
int isxdigit(int c);
Take whichever you want.
You can also use std::remove_copy_if to remove the punctuation completely:
#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string words = "I,love.punct-uation!";
    std::string result; // this will be our final string after it has been purified

    // iterates through each character in the string
    // to remove all punctuation
    // (std::ptr_fun was removed in C++17; a lambda does the job)
    std::remove_copy_if(words.begin(), words.end(),
                        std::back_inserter(result), // store output
                        [](unsigned char c) { return std::ispunct(c); });

    // ta-da!
    std::cout << result << '\n'; // prints "Ilovepunctuation"
}