C++ boost/regex regex_search - c++

Consider the following string content:
string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
I have never used regex_search and I have been searching around for ways to use it - I still do not quite get it. From that random string (it's from an API) how could I grab two things:
1) the price - in this example it is 47.1000
2) the name - in this example Fantastic gloves
From what I have read, regex_search would be the best approach here. I plan on using the price as an integer value, I will use regex_replace in order to remove the "." from the string before converting it. I have only used regex_replace and I found it easy to work with, I don't know why I am struggling so much with regex_search.
Keynotes:
Content is contained inside ' '
Content id and value is separated by :
Conent/value are separated by ,
Value of id's name and price will vary.
My first though was to locate for instance price and then move 3 characters ahead (':') and gather everything until the next ' - however I am not sure if I am completely off-track here or not.
Any help is appreciated.

boost::regex would not be needed. Regular expressions are used for more general pattern matching, whereas your example is very specific. One way to handle your problem is to break the string up into individual tokens. Here is an example using boost::tokenizer:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <map>
int main()
{
std::map<std::string, std::string> m;
std::string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
boost::char_separator<char> sep("{},':");
boost::tokenizer<boost::char_separator<char>> tokenizer(content, sep);
std::string id;
for (auto tok = tokenizer.begin(); tok != tokenizer.end(); ++tok)
{
// Since "current" is a special case I added code to handle that
if (*tok != "current")
{
id = *tok++;
m[id] = *tok;
}
else
{
id = *++tok;
m[id] = *++tok; // trend
id = *++tok;
m[id] = *++tok; // price
}
}
std::cout << "Name: " << m["name"] << std::endl;
std::cout << "Price: " << m["price"] << std::endl;
}
Link to live code.

As the string you are attempting to parse appears to be JSON (JavaScript Object Notation), consider using a specialized JSON parser.
You can find a comprehensive list of JSON parsers in many languages including C++ at http://json.org/. Also, I found a discussion on the merits of several JSON parsers for C++ in response to this SO question.

Related

Insert into array specific strings from text file

ArticlesDataset.txt file contains all the metadata information of documents. unigramCount contains all unique words and their number of occurrences for each document. There are 1500 publications recorded in the txt file. Here is an example entry for a document:
{"creator":["Romain Allais","Julie Gobert"],
"datePublished":"2018-05-30",
"docType":"article",
"doi":"10.1051\/mattech\/2018010",
"id":"ark:\/\/27927\/phz10hn2bh3",
"isPartOf":"Mat\u00e9riaux & Techniques",
"issueNumber":"5-6",
"language":["eng"],
"outputFormat":["unigram","bigram","trigram"],
"pageCount":7,
"pagination":"pp. null-null",
"provider":"portico",
"publicationYear":2018,
"publisher":"EDP Sciences",
"sequence":3.0,
"tdmCategory":["Applied sciences -Engineering"],
"title":"Environmental assessment of PSS",
"url":"http:\/\/doi.org\/10.1051\/mattech\/2018010",
"volumeNumber":"105",
"wordCount":4446,
"unigramCount":{"others":1,"air":1,"networks,":1,"conventional":1,"IEEE":1}}
My purpose is to pull out the unigram counts for each document and store them in a suitable array. How can I do it by using fstream library?
How can i improve below code to reach my goal.
std::string dummy;
std::ifstream data("PublicationsDataSet.txt");
while (data.good())
{
getline(data, dummy, ',');
}
your question delves in two different topics, one is parsing the data and the other into storing it in memory.
To the first point the answer is, you'll need a parser, you either write one which will involve a syntax parser to convert each "key words" into tokens, for then an interpreter to compile them into a data object based on the token parameter the data is preceded or succeeded eg:
'[' = start an array, every values after this are part of the array
']' = end of the an array, return to previous parsing state
':' = separate key and values, left hand side is key, right hand side is value
...
this is a fine exercise to sharpen one's skills but way too arduous and with potential never-ending-bug-fixing road, as recommended also by other comments finding an already made library is probably the easier road on a time pinch or on a project time crunching scenario.
Another thing to point out, plain arrays in c++ are size fixed, so mostly likely since you are parsing the values you'll probably use std::vectors, which allow insertion, and once you are done processing the file and really intend to send the data back as an array you can do that directly from the object
std::vector<YourObjectType> parsedObject;
char* arr = new char[parsedObject.size()];
std::copy(v.begin(), v.end(), arr);
this is a psudo code, lots of things will depend on the implementation, but it gives the idea.
A starting point to write a parse is this article goes in great details on how it works and it's components, mind you every parser implements it's own language (yes just like c++ and other languages, are all parsed) so you'll need to expand on the concept with your commands
expression parser
Here's a simplified solution of what you could do using std::regex:
Read the lines of a stream (std::cin in this case) one by one.
Check if the line contains a unigramCount element.
If that's the case, walk the different entries within the unigramCount element.
About the regular expressions used:
"unigramCount":{}, allowing:
zero or more whitespaces basically everywhere, and
zero or more characters within the braces.
"<key>":<value>, where:
<key> is one or more characters other than a double quote,
<value> is one or more digits, and
you could have whitespaces at both sides of the :.
A good data structure for storing your unigramCount entries could be a std::map.
[Demo]
#include <iostream> // cout
#include <map>
#include <regex> // regex_match, regex_search, sregex_iterator
#include <string> // stoi
int main()
{
std::string line{};
std::map<std::string, int> unigram_counts{};
while (std::getline(std::cin, line))
{
const std::regex unigram_count_pattern{R"(^\s*\"unigramCount\"\s*:\s*\{.*\}\s*$)"};
if (std::regex_match(line, unigram_count_pattern))
{
const std::regex entry_pattern{R"(\"([^\"]+)\"\s*:\s*([0-9]+))"};
for (auto entry_it{std::sregex_iterator(line.cbegin(), line.cend(), entry_pattern)};
entry_it != std::sregex_iterator{};
++entry_it)
{
auto matches{*entry_it};
auto& key{matches[1]};
auto& value{matches[2]};
unigram_counts[key] = std::stoi(value);
}
}
}
for (auto& [key, value] : unigram_counts)
{
std::cout << "'" << key << "' : " << value << "\n";
}
}
// Outputs:
//
// 'IEEE' : 1
// 'air' : 1
// 'conventional' : 1
// 'networks,' : 1
// 'others' : 1

In C++ STL, how do I remove non-numeric characters from std::string with regex_replace?

Using the C++ Standard Template Library function regex_replace(), how do I remove non-numeric characters from a std::string and return a std::string?
This question is not a duplicate of question
747735
because that question requests how to use TR1/regex, and I'm
requesting how to use standard STL regex, and because the answer given
is merely some very complex documentation links. The C++ regex
documentation is extremely hard to understand and poorly documented,
in my opinion, so even if a question pointed out the standard C++
regex_replace
documentation,
it still wouldn't be very useful to new coders.
// assume #include <regex> and <string>
std::string sInput = R"(AA #-0233 338982-FFB /ADR1 2)";
std::string sOutput = std::regex_replace(sInput, std::regex(R"([\D])"), "");
// sOutput now contains only numbers
Note that the R"..." part means raw string literal and does not evaluate escape codes like a C or C++ string would. This is very important when doing regular expressions and makes your life easier.
Here's a handy list of single-character regular expression raw literal strings for your std::regex() to use for replacement scenarios:
R"([^A-Za-z0-9])" or R"([^A-Za-z\d])" = select non-alphabetic and non-numeric
R"([A-Za-z0-9])" or R"([A-Za-z\d])" = select alphanumeric
R"([0-9])" or R"([\d])" = select numeric
R"([^0-9])" or R"([^\d])" or R"([\D])" = select non-numeric
Regular expressions are overkill here.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
inline bool not_digit(char ch) {
return '0' <= ch && ch <= '9';
}
std::string remove_non_digits(const std::string& input) {
std::string result;
std::copy_if(input.begin(), input.end(),
std::back_inserter(result),
not_digit);
return result;
}
int main() {
std::string input = "1a2b3c";
std::string result = remove_non_digits(input);
std::cout << "Original: " << input << '\n';
std::cout << "Filtered: " << result << '\n';
return 0;
}
The accepted answer if fine for the specifics of the given sample.
But it will fail for a number such as "-12.34" (it would result in "1234").
(note how the sample could be negative numbers)
Then the regex should be:
(-|\+)?(\d)+(.(\d)+)*
explanation: (optional ( "-" or "+" )) with (a number, repeated 1 to n times) with (optionally end's with: ( a "." followed by (a number, repeated 1 to n times) )
A bit over-reaching, but I was looking for this and the page showed up first in my search, so I'm adding my answer for future searches.

c++ Is there a way to find sentences within strings?

I'm trying to recognise certain phrases within a user defined string but so far have only been able to get a single word.
For example, if I have the sentence:
"What do you think of stack overflow?"
is there a way to search for "What do you" within the string?
I know you can retrieve a single word with the find function but when attempting to get all three it gets stuck and can only search for the first.
Is there a way to search for the whole string in another string?
Use str.find()
size_t find (const string& str, size_t pos = 0)
Its return value is the starting position of the substring. You can test if the string you are looking for is contained in the main string by performing the simple boolean test of returning str::npos:
string str = "What do you think of stack overflow?";
if (str.find("What do you") != str::npos) // is contained
The second argument can be used to limit your search from certain string position.
The OP question mentions it gets stuck in the attempt to find a three word string. Actually, I believe you are misinterpreting the return value. It happens that the return for the single word search "What" and the string "What do you" have coincidental starting positions, therefore str.find() returns the same. To search for individual words positions, use multiple function calls.
Use regular expressions
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("What do you think of stack overflow?");
std::smatch m;
std::regex e ("\\bWhat do you think\\b");
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
Also you can find wildcards implementing with boost (regex in std library was boost::regex library before c++11) there

Remove specific format of string from a List?

I'm writing a program for an Arduino that takes information in a sort of NMEA format which is read from a .txt file stored in a List< String >. I need to strip out strings that begin with certain prefixes ($GPZDA, $GPGSA, $GPGSV) because these are useless to me and therefore I only need $GPRMC and $GPGGA which contains a basic time stamp and the location which is all I'm using anyway. I'm looking to use as little external libraries (SPRINT, BOOST) as possible as the DUE doesn't have a fantastic amount of space as-is.
All I really need is a method to remove lines from the LIST<STRING> that doesn't start with a specific prefix, Any ideas?
The method I'm currently using seems to have replaced the whole output with one specific string yet kept the file length/size the same (1676 and 2270, respectively), these outputs are achieved using two While statements that put the two input files into List<STRING>
Below is a small snipped from what I'm trying to use, which is supposed to sort the file into a correct order (Working, they are current ordered by their numerical value, which works well for the time which is the second field in the string) however ".unique();" appears to have taken each "Unique" value and replaced all the others with it so now I have a 1676 line list that basically goes 1,1,1,2,2,2,3,3,4... 1676 ???
while (std::getline(GPS1,STRLINE1)){
ListOne.push_back("GPS1: " + STRLINE1 + "\n");
ListOne.sort();
ListOne.unique();
std::cout << ListOne.back() << std::endl;
GPSO1 << ListOne.back();
}
Thanks
If I understand correctly and you want to have some sort of white list of prefixes.
You could use remove_if to look for them, and use a small function to check whether one of the prefixes fits(using mismatch like here) for example:
#include <iostream>
#include <algorithm>
#include <string>
#include <list>
using namespace std;
int main() {
list<string> l = {"aab", "aac", "abb", "123", "aaw", "wws"};
list<string> whiteList = {"aa", "ab"};
auto end = remove_if(l.begin(), l.end(), [&whiteList](string item)
{
for(auto &s : whiteList)
{
auto res = mismatch(s.begin(), s.end(), item.begin());
if (res.first == s.end()){
return false; //found allowed prefix
}
}
return true;
});
for (auto it = l.begin(); it != end; ++it){
cout<< *it << endl;
}
return 0;
}
(demo)

Read file and extract certain part only

ifstream toOpen;
openFile.open("sample.html", ios::in);
if(toOpen.is_open()){
while(!toOpen.eof()){
getline(toOpen,line);
if(line.find("href=") && !line.find(".pdf")){
start_pos = line.find("href");
tempString = line.substr(start_pos+1); // i dont want the quote
stop_pos = tempString .find("\"");
string testResult = tempString .substr(start_pos, stop_pos);
cout << testResult << endl;
}
}
toOpen.close();
}
What I am trying to do, is to extrat the "href" value. But I cant get it works.
EDIT:
Thanks to Tony hint, I use this:
if(line.find("href=") != std::string::npos ){
// Process
}
it works!!
I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.
Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
You might want to look at the answers to this previous question for suggestions about an HTML parser to use.
As a start, you might want to take some shortcuts in the way you write the loop over lines in order to make it clearer. Here is the conventional "read line at a time" loop using C++ iostreams:
#include <fstream>
#include <iostream>
#include <string>
int main ( int, char ** )
{
std::ifstream file("sample.html");
if ( !file.is_open() ) {
std::cerr << "Failed to open file." << std::endl;
return (EXIT_FAILURE);
}
for ( std::string line; (std::getline(file,line)); )
{
// process line.
}
}
As for the inner part the processes the line, there are several problems.
It doesn't compile. I suppose this is what you meant with "I cant get it works". When asking a question, this is the kind of information you might want to provide in order to get good help.
There is confusion between variable names temp and tempString etc.
string::find() returns a large positive integer to indicate invalid positions (the size_type is unsigned), so you will always enter the loop unless a match is found starting at character position 0, in which case you probably do want to enter the loop.
Here is a simple test content for sample.html.
<html>
<a href="foo.pdf"/>
</html>
Sticking the following inside the loop:
if ((line.find("href=") != std::string::npos) &&
(line.find(".pdf" ) != std::string::npos))
{
const std::size_t start_pos = line.find("href");
std::string temp = line.substr(start_pos+6);
const std::size_t stop_pos = temp.find("\"");
std::string result = temp.substr(0, stop_pos);
std::cout << "'" << result << "'" << std::endl;
}
I actually get the output
'foo.pdf'
However, as Jerry pointed out, you might not want to use this in a production environment. If this is a simple homework or exercise on how to use the <string>, <iostream> and <fstream> libraries, then go ahead with such a procedure.