How to extract formatted text in C++?

This might have appeared before, but I couldn't understand how to extract formatted data. Below is my code to extract all the text between the strings "[87]" and "[90]" in a text file.
Apparently, the positions reported for [87] and [90] come out the same in the output.
void ExtractWebContent::filterContent(){
    string str, str1;
    string positionOfCurrency1 = "[87]";
    string positionOfCurrency2 = "[90]";
    size_t positionOfText1, positionOfText2;
    ifstream reading;
    reading.open("file_Currency.txt");
    while (!reading.eof()){
        getline (reading, str);
        positionOfText1 = str.find(positionOfCurrency1);
        positionOfText2 = str.find(positionOfCurrency2);
        cout << "positionOfCurrency1 " << positionOfText1 << endl;
        cout << "positionOfCurrency2 " << positionOfText2 << endl;
        //str1 = str.substr(positionOfText);
        cout << "String" << str1 << endl;
    }
    reading.close();
}
An Update on the currency file:
[79]More »Brent slips to $102 on worries about euro zone economy
Market Data
* Currencies
CAPTION: Currencies
Name Price Change % Chg
[80]USD/SGD
1.2606 -0.00 -0.13%
USD/SGD [81]USDSGD=X
[82]EUR/SGD
1.5242 0.00 +0.11%
EUR/SGD [83]EURSGD=X

That really depends on what 'extracting data' means. In simple cases you can just read the file into a string and then use the string member functions (especially find and substr) to extract the segment you are interested in. If you are interested in data per line, getline is the way to go for line extraction; apply find and substr as before to get the segment.
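A minimal sketch of that route for the markers in the question, assuming both appear on the same line (if they can span lines, read the whole file into one string first and apply the same calls):
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream reading("file_Currency.txt");
    std::string str;
    while (std::getline(reading, str)) {
        std::string::size_type p1 = str.find("[87]");
        std::string::size_type p2 = str.find("[90]");
        if (p1 != std::string::npos && p2 != std::string::npos && p2 > p1) {
            // keep only what lies between the end of "[87]" and the start of "[90]"
            std::cout << str.substr(p1 + 4, p2 - (p1 + 4)) << '\n';
        }
    }
    return 0;
}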
Sometimes a simple find won't get you far and you will need a regular expression to get to the parts you are interested in easily.
Often simple parsers evolve and soon outgrow even regular expressions. That usually signals it is time for the very large hammer of C++ parsing, Boost.Spirit.

Boost.Tokenizer can be helpful for parsing out a string, but it gets a little trickier if the delimiters have to be bracketed numbers like the ones you have. With the delimiters as described, a regex is probably adequate.
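For illustration, a hedged std::regex sketch (the input string here is made up) that captures whatever sits between the two bracketed markers:
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "[87]USD/SGD 1.2606 [90]next entry";  // illustrative input
    // \[87\] and \[90\] are the literal bracketed delimiters; (.*?) captures
    // the shortest stretch of text between them
    std::regex pattern(R"(\[87\](.*?)\[90\])");
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        std::cout << match[1] << '\n';  // prints "USD/SGD 1.2606 "
    }
    return 0;
}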

All that does is concatenate the output of reading and the strings "[1]" and "[2]". I'm guessing this code resulted from a rather literal extrapolation of similar code using scanf. scanf (as well as the rest of C) still works in C++, so if that works for you I would use it.
That said, there are various levels of sophistication at which you can do this. Using regexes is one of the most powerful/flexible ways, but it might be overkill. The quickest way in my opinion is just to do something like:
1. Find the index of substring "[1]", i1.
2. Find the index of substring "[2]", i2.
3. Get the substring between i1+3 and i2.
In code, supposing std::string line has the text:
size_t i1 = line.find("[1]");
size_t i2 = line.find("[2]");
std::string out(line.substr(i1 + 3, i2 - (i1 + 3)));
Warning: no error checking.
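A hedged variant with those checks added, in case either marker is missing from the line:
std::size_t i1 = line.find("[1]");
std::size_t i2 = (i1 == std::string::npos) ? std::string::npos
                                           : line.find("[2]", i1 + 3);
std::string out;
if (i2 != std::string::npos) {
    out = line.substr(i1 + 3, i2 - (i1 + 3));  // text strictly between the markers
}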

Related

Parsing data from a file

I have this project due, but I am unsure how to parse the data into the word, the part of speech, and its definition... I know that I should make use of the tab spacing to read it, but I have no idea how to implement it. Here is an example of the file:
Recollection n. The power of recalling ideas to the mind, or the period within which things can be recollected; remembrance; memory; as, an event within my recollection.
Nip n. A pinch with the nails or teeth.
Wodegeld n. A geld, or payment, for wood.
Xiphoid a. Of or pertaining to the xiphoid process; xiphoidian.
NB: Each word, part of speech, and definition is on one line in the text file.
If you can be sure that the definition will always follow the first period on a line, you could use an implementation like this. But it will break if there are ever more than 2 periods on a single line.
string str = "";
vector<pair<string, string>> v;              // <word, definition>
while (getline(fileStream, str, '.')) {      // grab text up to the next '.'
    size_t sp = str.find_last_of(' ');
    if (sp != string::npos)
        str.erase(sp);                       // drop the trailing " n", " v", etc. from the word
    v.push_back(make_pair(str, ""));         // push the word
    getline(fileStream, str, '.');           // grab the rest of the line
    v.back().second = str;                   // store the definition in the last added element
}
for (const auto& x : v) {                    // check your results
    cout << "word -> " << x.first << endl;
    cout << "definition -> " << x.second << endl << endl;
}
The better solution would be to learn Regular Expressions. It's a complicated topic but absolutely necessary if you want to learn how to parse text efficiently and properly:
http://www.cplusplus.com/reference/regex/
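For instance, a hedged std::regex sketch (assuming every entry really is a single line shaped like "word pos. definition"):
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string line = "Nip n. A pinch with the nails or teeth.";  // sample entry
    // group 1: the word, group 2: the part of speech (text before the first '.'),
    // group 3: the rest of the line as the definition
    std::regex entry(R"(^(\S+)\s+([^.]+)\.\s*(.*)$)");
    std::smatch m;
    if (std::regex_match(line, m, entry)) {
        std::cout << "word: " << m[1] << "\npart of speech: " << m[2]
                  << "\ndefinition: " << m[3] << '\n';
    }
    return 0;
}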

Issue in Loop with Reading and Writing from a File in C++

The problem I'm having is that in the following loop, I'm attempting to read sentences one by one from the input file (inSentences) and output each one into another file (outMatch) if it contains a certain word. I am not allowed to use any string functions except .at() and .size().
The problem lies in that I'm trying to output the sentence first into an intermediary file and then use the extraction operator to pull words out one by one to see if it contains the word. If it does, it outputs the sentence into the outMatch file. While debugging I found that the sentence variable receives all of the sentences one by one, but the sentMore variable always extracts only the first sentence of the file, so it's not able to test the rest of the sentences. I can't seem to fix this bug. Help would be greatly appreciated.
P.S. I don't really want the answer fleshed out, just a nudge or a clue in the right direction.
outMatch.open("match");
outTemp.open("temp");
while(getline(inSentences, sentence, '.')){
    outTemp << sentence;
    outTemp.close();
    inTemp.open("temp");
    while(inTemp >> sentMore){
        if(sentMore == word){
            cout << sentence << "." << endl;
            inTemp.close();
            sentCount++;
        }
    }
}
You should use string streams! Your issue right now appears to be that you never reopen outTemp, so after the first loop you're trying to write to it after you've closed it. But you can avoid all these problems by switching to stringstreams!
Since you don't want a full solution, an example may look like:
string line_of_text, word;
istringstream strm(line_of_text);
while(strm >> word) {
    cout << word << ' ';
}
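Since you only asked for a nudge, here is just a hedged sketch of how the loop could look with a stringstream instead of the temp file (the names inSentences, outMatch, word and sentCount are taken from your code; matchSentences is made up):
#include <fstream>
#include <sstream>
#include <string>

int matchSentences(std::ifstream& inSentences, std::ofstream& outMatch,
                   const std::string& word) {
    std::string sentence, candidate;
    int sentCount = 0;
    while (std::getline(inSentences, sentence, '.')) {
        std::istringstream words(sentence);   // splits the sentence into words,
        while (words >> candidate) {          // replacing the temporary file
            if (candidate == word) {
                outMatch << sentence << "." << std::endl;
                ++sentCount;
                break;                        // sentence matched; stop scanning it
            }
        }
    }
    return sentCount;
}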

How to get more performance when reading a file

My program downloads files from a site (via curl, every 30 minutes). It is possible for these files to reach about 150 MB.
So I thought that getting data out of these files could become inefficient (I search for a line every 5 seconds).
These files can have ~10,000 lines.
To parse such a file (the values are separated by ",") I use a regex:
regex wzorzec("(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)");
There are 8 values.
Now I have to push them into a vector:
allys.push_back({ std::stoi(std::string(wynik[1])), nick, tag, stoi(string(wynik[4])), stoi(string(wynik[5])), stoi(string(wynik[6])), stoi(string(wynik[7])), stoi(string(wynik[8])) });
I use std::async to do that, but for 3 files (~7 MB) the processor jumps to 80% and the operation takes about 10 seconds. I read from an SSD, so slow I/O is not the cause.
I'm reading the data line by line with fstream.
How can I speed this operation up?
Maybe I should parse these values and push them into SQL?
Best regards
You can probably get some performance boost by avoiding regex, and use something along the lines of std::strtok, or else just hard-code a search for commas in your data. Regex has more power than you need just to look for commas. Next, if you use vector::reserve before you begin a sequence of push_back for any given vector, you will save a lot of time in both reallocation and moving memory around. If you are expecting a large vector, reserve room for it up front.
This may not cover all available performance ideas, but I'd bet you will see an improvement.
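As a rough illustration of that advice, here is a hedged sketch (the helper name splitLine and the field count of 8 are assumptions based on the question) that splits one line on commas without a regex; reserving the outer vector of parsed rows before the per-file loop gives the same benefit:
#include <string>
#include <vector>

// Split one CSV line on commas without regex. reserve() avoids repeated
// reallocation while the fields are pushed back.
std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> fields;
    fields.reserve(8);
    std::size_t start = 0;
    while (true) {
        std::size_t comma = line.find(',', start);
        if (comma == std::string::npos) {
            fields.push_back(line.substr(start));   // last field
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}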
Your problem here is most likely the additional overhead introduced by the regular expression, since you're using many variable-length, greedy matches (the regex engine will try different alignments of the matches to find the largest matching result).
Instead, you might want to try to parse the lines manually. There are many different ways to achieve this. Here's one quick and dirty example (it's not flexible and has quite a bit of duplicate code, but there's lots of room for optimization). It should explain the basic idea though:
#include <iostream>
#include <sstream>
#include <string>
#include <cstdlib>

const char *input = "1,Mario,Stuff,4,5,6,7,8";

struct data {
    int id;
    std::string nick;
    std::string tag;
} myData;

int main(int argc, char **argv){
    char buffer[256];
    std::istringstream in(input);
    // Read an entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.id = atoi(buffer); // convert and store
    // Skip the comma
    in.seekg(1, std::ios::cur);
    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.nick = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);
    // Read the next entry and convert/store it:
    in.get(buffer, 256, ','); // read
    myData.tag = buffer; // store
    // Skip the comma
    in.seekg(1, std::ios::cur);
    // Some test output
    std::cout << "id: " << myData.id << "\nnick: " << myData.nick << "\ntag: " << myData.tag << std::endl;
    return 0;
}
Note that there isn't any error handling in case entries are too long or too short (or broken in some other way).
Console output:
id: 1
nick: Mario
tag: Stuff

Convert string to int and get the number of characters consumed in C++ with stringstream

I am new to C++ (coming from a C# background) and am trying to learn how to convert a string to an int.
I got it working by using a stringstream and outputting it into a double, like so:
const char* inputIndex = "5+2";
double number = 0;
stringstream ss(inputIndex);
ss >> number;
// number = 5
This works great. The problem I'm having is that the strings I'm parsing start with a number but may have other, non-digit characters after the digits (e.g. "5+2", "9-(3+2)", etc.). The stringstream parses the digits at the beginning and stops when it encounters a non-digit, like I need it to.
The problem comes when I want to know how many characters were used to parse into the number. For example, if I parse 25+2, I want to know that two characters were used to parse 25, so that I can advance the string pointer.
So far, I got it working by clearing the stringstream, inputting the parsed number back into it, and reading the length of the resulting string:
ss.str("");
ss << number;
inputIndex += ss.str().length();
While this does work, it seems really hacky to me (though that might just be because I'm coming from something like C#), and I have a feeling that might cause a memory leak because the str() creates a copy of the string.
Is there any other way to do this, or should I stick with what I have?
Thanks.
You can use std::stringstream::tellg() to find out the current get position in the input stream. Store this value in a variable before you extract from the stream. Then get the position again after you extract from the stream. The difference between these two values is the number of characters extracted.
double x = 3435;
std::stringstream ss;
ss << x;
double y;
std::streampos pos = ss.tellg();
ss >> y;
std::cout << (ss.tellg() - pos) << " characters extracted" << std::endl;
The solution above using tellg() will fail on modern compilers (such as gcc 4.6).
The reason for this is that tellg() really reports the position of the read cursor, which at that point is past the end of the input. See e.g. "file stream tellg/tellp and gcc-4.6 is this a bug?"
Therefore you also need to test for eof() (meaning the entire input was consumed).
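Putting the two answers together, a sketch of the combined check might look like this (the input "25+2" is taken from the question):
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::istringstream ss("25+2");
    double number = 0;
    ss >> number;                                  // parses 25, stops at '+'
    // If the whole input was consumed, eofbit is set and tellg() may return -1,
    // so fall back to the full length of the input in that case.
    std::streamoff consumed = ss.eof()
        ? static_cast<std::streamoff>(ss.str().size())
        : static_cast<std::streamoff>(ss.tellg());
    std::cout << consumed << " characters consumed\n";  // prints 2
    return 0;
}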

How to read a file and get words in C++

I am curious as to how I would go about reading the input from a text file with no set structure (Such as notes or a small report) word by word.
The text for example might be structured like this:
"06/05/1992
Today is a good day;
The worm has turned and the battle was won."
I was thinking maybe getting the line using getline, and then seeing if I can split it into words via whitespace from there. Then I thought using strtok might work! However I don't think that will work with the punctuation.
Another method I was thinking of was getting everything char by char and omitting the characters that were undesired. Yet that one seems unlikely.
So, to keep it short:
Is there an easy way to read an input from a file and split it into words?
Since it's easier to write than to find the duplicate question,
#include <iterator>

std::istream_iterator<std::string> word_iter(my_file_stream), word_iter_end;
size_t wordcnt = 0;
for (; word_iter != word_iter_end; ++word_iter) {
    std::cout << "word " << wordcnt++ << ": " << *word_iter << '\n';
}
The std::string argument to istream_iterator tells it to return a string when you do *word_iter. Every time the iterator is incremented, it grabs another word from its stream.
If you have multiple iterators on the same stream at the same time, you can choose between data types to extract. However, in that case it may be easier just to use >> directly. The advantage of an iterator is that it can plug into the generic functions in <algorithm>.
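For example (a hedged sketch; the file name notes.txt is made up), the whole file can be pulled into a vector of words with a single std::copy:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main() {
    std::ifstream my_file_stream("notes.txt");   // hypothetical file name
    std::vector<std::string> words;
    // Copy every whitespace-delimited word in the file into the vector.
    std::copy(std::istream_iterator<std::string>(my_file_stream),
              std::istream_iterator<std::string>(),
              std::back_inserter(words));
    std::cout << words.size() << " words read\n";
    return 0;
}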
Yes. You're looking for std::istream::operator>> :) Note that it skips (consecutive) whitespace, but I doubt that's a problem here.
i.e.
std::ifstream file("filename");
std::vector<std::string> words;
std::string currentWord;
while(file >> currentWord)
    words.push_back(currentWord);
You can use the stream's getline with a space character as the delimiter, e.g. file.getline(buffer, 1000, ' ');
Or perhaps you can use this function to split a string into several parts, with a certain delimiter:
string StrPart(string s, char sep, int i) {
    string out = "";
    int n = 0, c = 0;
    for (c = 0; c < (int)s.length(); c++) {
        if (s[c] == sep) {
            n += 1;
        } else {
            if (n == i) out += s[c];
        }
    }
    return out;
}
Notes: This function assumes that you have declared using namespace std;.
s is the string to be split.
sep is the delimiter
i is the part to get (0 based).
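A quick usage illustration (the input string is just an example):
string second = StrPart("Today is a good day", ' ', 1);  // second == "is"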
You can use the scanner technique to grab words, numbers, dates, etc. It is very simple and flexible. The scanner normally returns tokens (word, number, real, keyword, etc.) to a parser.
If you later intend to interpret the words, I would recommend this approach.
I can warmly recommend the book "Writing Compilers and Interpreters" by Ronald Mak (Wiley Computer Publishing).
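As a rough, hedged illustration of the idea (this is not the book's code), a scanner can be little more than a loop that looks at the next character, decides what kind of token starts there, and consumes it:
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Minimal scanner sketch: classifies the input into WORD and NUMBER tokens
// and skips whitespace and punctuation.
int main() {
    std::istringstream in("Today is a good day; the battle was won in 1992.");
    int c;
    while ((c = in.peek()) != std::char_traits<char>::eof()) {
        if (std::isalpha(c)) {
            std::string word;
            while (std::isalpha(in.peek()))
                word += static_cast<char>(in.get());
            std::cout << "WORD: " << word << '\n';
        } else if (std::isdigit(c)) {
            std::string number;
            while (std::isdigit(in.peek()))
                number += static_cast<char>(in.get());
            std::cout << "NUMBER: " << number << '\n';
        } else {
            in.get();                            // skip separators and punctuation
        }
    }
    return 0;
}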