Can I use a string as delimiter while using boost split method? - c++

I am trying to parse an HTML string using the split method from Boost. Can it be used with a string delimiter like "<td>"? Can someone give me an example of how to do it efficiently?
I am trying to do something like
vector <string> fields;
split( fields, str, is_any_of( "<td>" ) );
But I understand that it treats '<', 't', 'd' and '>' each as a separate delimiter character. I am trying to find a way to use a whole string as the delimiter.

Looking at the documentation for split, it works on a character-by-character basis, treating the string as a sequence of characters. The predicate it uses to decide whether something is a delimiter can therefore only test a single character, so if you want to split on a complete string you will need to use something else. A regular expression library would certainly be able to do it, but you could fairly easily hand-code one by searching for substrings.
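A minimal sketch of the hand-coded substring search, using only the standard library. The helper name split_on_string is hypothetical (it is not part of Boost), and keeping empty fields is an assumption about the desired behaviour.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Boost function): split `text` on every
// occurrence of the whole string `delim`. Empty fields are kept.
std::vector<std::string> split_on_string(const std::string& text,
                                         const std::string& delim) {
    std::vector<std::string> fields;
    if (delim.empty()) {  // an empty delimiter would never advance the search
        fields.push_back(text);
        return fields;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    // Repeatedly look for the next full occurrence of the delimiter.
    while ((pos = text.find(delim, start)) != std::string::npos) {
        fields.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }
    fields.push_back(text.substr(start));  // remainder after the last delimiter
    return fields;
}
```

Calling split_on_string(str, "<td>") treats "<td>" as one delimiter, so a closing "</td>" is left inside the field rather than being broken up at '<', 't', 'd' or '>'.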

Related

How can I put a small bit of a string in another string?

I have a question. I want to copy something from a string. I have this:
string buffer = "
{"clientToken":"clientToken","accessToken":"abdjuhsdhjsksdnasfldafgkuadbkghubdfhlujgbdfhulgdfbhugfdbgujhlfdanhjgkhdfanhnjkgbafdhkugbadcjgfdabhgfdabgjhdfabkhgfdbghujfdabghkjfdabghujfadbgfdjhaugbafdhjujgjbfuhkgbf
dhugdbfauhgbfaluhgbdafilgbdfhgfdigujladbijfbghdufjbvgfbhgadbfgbdfjgfbgjfdbjflbgjedfbgauiadfbuigbuifgdabhf
juhgbdaihjfhgbdiuflghbdfiugbfdugbbbbb","selectedProfile":
{"name":"secret","id":"secret"},"availableProfiles":[{"name":"secret","id":"nothing"}]}";
string token;
and I want to put the accessToken (the long gibberish) into the string token (the rest is not needed, only the accessToken). Can anyone help me?
There are a few ways to do it, but the simplest is to locate "accessToken" with std::string::find(), then locate the leading and ending " of the token value itself, and then use std::string::substr() to extract the necessary piece. In general, though, you are better off using a JSON parsing library, such as https://github.com/nlohmann/json
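The find()/substr() steps can be sketched as follows. The helper name extract_string_value is hypothetical, and the sketch assumes the value is a quoted string with no escaped quotes inside it; it is not a substitute for a real JSON parser.

```cpp
#include <string>

// Hypothetical helper: copy the quoted string value stored under `key`
// into `out`. Returns false if the key or its quotes cannot be found.
// Assumption: the value contains no escaped quotes (\").
bool extract_string_value(const std::string& json, const std::string& key,
                          std::string& out) {
    const std::string quoted_key = "\"" + key + "\"";
    const std::string::size_type key_pos = json.find(quoted_key);
    if (key_pos == std::string::npos) return false;

    // Skip past the key to the colon that separates key and value.
    const std::string::size_type colon =
        json.find(':', key_pos + quoted_key.size());
    if (colon == std::string::npos) return false;

    // Leading and ending quote of the value itself.
    const std::string::size_type open = json.find('"', colon + 1);
    if (open == std::string::npos) return false;
    const std::string::size_type close = json.find('"', open + 1);
    if (close == std::string::npos) return false;

    out = json.substr(open + 1, close - open - 1);
    return true;
}
```

With the buffer from the question, extract_string_value(buffer, "accessToken", token) would fill token with the text between the quotes that follow "accessToken":.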

Regular expression to extract either integer or string from JSON

I am working in an environment without a JSON parser, so I am using regular expressions to parse some JSON. The value I'm looking to isolate may be either a string or an integer.
For instance
Entry1
{"Product_ID":455233, "Product_Name":"Entry One"}
Entry2
{"Product_ID":"455233-5", "Product_Name":"Entry One"}
I have been attempting to create a single regex pattern to extract the Product_ID whether it is a string or an integer.
I can successfully extract both results with separate patterns using lookaround, with either (?<=Product_ID":")(.*?)(?=") or (?<=Product_ID":)(.*?)(?=,)
However, since I don't know ahead of time which one I will need, I would like a one-size-fits-all pattern.
I have tried to use [^"] in the pattern, but I just can't seem to piece it together.
I expect to receive 455233-5 and 455233 but currently I receive "455233-5"
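(?<="Product_ID"\s*:\s*"?)[^"]+(?="?\s*,)
The optional quote in the lookbehind and in the lookahead keeps the quotes out of the match. Note that this relies on variable-length lookbehind, which not every regex engine supports.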
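Where lookbehind is not available at all (C++ std::regex, for example, uses ECMAScript syntax, which has none), a capture group can stand in for the lookaround. This is a swapped-in technique rather than the pattern above; the function name extract_product_id is hypothetical, and the sketch assumes the value contains no quote, comma or closing brace.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: put the Product_ID value into `out`, whether it is
// written as an integer or as a quoted string. Returns false on no match.
// Assumption: the value itself contains no '"', ',' or '}'.
bool extract_product_id(const std::string& json, std::string& out) {
    // "?     optionally consumes the opening quote
    // (...)  captures everything up to the next quote, comma or brace
    static const std::regex pattern(
        R"re("Product_ID"\s*:\s*"?([^",}]+))re");
    std::smatch match;
    if (!std::regex_search(json, match, pattern)) return false;
    out = match[1].str();
    return true;
}
```

The value is read from capture group 1 instead of from the whole match, which is what lets the quotes and the key stay outside the result.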

Advanced Lua Pattern Matching

I would like to know if either/both of these two scenarios are possible in Lua:
I have a string that looks like such: some_value=averylongintegervalue
Say I know there are exactly 21 characters after the = sign in the string, is there a short way to replace the string averylongintegervalue with my own? (i.e. a simpler way than typing out: string.gsub("some_value=averylongintegervalue", "some_value=.....................", "some_value=anewintegervalue"))
Say we edit the original string to look like such: some_value=averylongintegervalue&
Assuming we do not know how many characters there are after the = sign, is there a way to replace the string in between the some_value= and the &?
I know this is an oddly specific question but I often find myself needing to perform similar tasks using regex and would like to know how it would be done in Lua using pattern-matching.
Yes, you can use something like the following (%1 refers to the first capture in the pattern, which in this case captures some_value=):
local str = ("some_value=averylongintegervalue"):gsub("(some_value=)[^&]+", "%1replaced")
This should assign some_value=replaced.
Do you know if it is also possible to replace every character between the = and & with a single character repeated (such as a * symbol repeated 21 times instead of a constant string like replaced)?
Yes, but you need to use a function:
local str = ("some_value=averylongintegervalue")
:gsub("(some_value=)([^&]+)", function(a,b) return a..("#"):rep(#b) end)
This will assign some_value=#####################. If you need to limit this to just one replacement, then add ,1 as the last parameter to gsub (as Wiktor suggested in the comment).

Parse tab delimited file with Boost.Spirit where entries may contain whitespace in

I want to parse a tab delimited file using Boost.Spirit (Qi). My file looks something like this:
John Doe\tAge 23\tMember
Jane Doe\tAge 25\tMember
...
Is it possible to parse this with a skip parser? The problem I have right now is that boost::spirit::ascii::space also skips the whitespace within the name of the person. What would the phrase_parse(...) call look like?
I am also using Boost.Fusion tuples for convenient storage of the results in a struct:
struct Person
{
string name;
int age;
string status;
};
This seems to work for the name:
String %= lexeme[+(char_-'\t')];
It matches every char that is not a tab. It is then used as part of the bigger rule:
Start %= Name >> Age >> Status;
Q. Is it possible to parse this with a skip parser?
A. No, it's not possible to parse anything with the skip parser. Skippers achieve the opposite: they disregard certain input information.
However, what you seem to be looking for is something like this hack (I don't recommend it):
Read empty values with boost::spirit
Now, you could look at my other answers for proper ways to parse CSV/TSV dealing with embedded whitespace, quoted values, escaped quotes etc. (I believe one even shows line-continuation characters)
How to parse csv using boost::spirit
Parse quoted strings with boost::spirit
How to make my split work only on one real line and be capable to skip quoted parts of string?
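For input as regular as the sample in the question, plain tab splitting may already be enough. The following is a minimal sketch that uses only the standard library instead of Spirit, so it is not the approach of the answers linked above. The helper name parse_person is hypothetical, and the sketch assumes every line has exactly the form name, tab, "Age N", tab, status.

```cpp
#include <sstream>
#include <string>

struct Person {
    std::string name;
    int age;
    std::string status;
};

// Hypothetical helper: parse one line of the form "name\tAge N\tstatus".
// Spaces inside a field are preserved because only '\t' separates fields.
bool parse_person(const std::string& line, Person& out) {
    std::istringstream in(line);
    std::string name, age_field, status;
    if (!std::getline(in, name, '\t')) return false;
    if (!std::getline(in, age_field, '\t')) return false;
    if (!std::getline(in, status)) return false;

    // The middle field is assumed to read "Age <number>".
    std::istringstream age_in(age_field);
    std::string label;
    int age = 0;
    if (!(age_in >> label >> age) || label != "Age") return false;

    out.name = name;
    out.age = age;
    out.status = status;
    return true;
}
```

Quoted values, escaped tabs and similar cases are not handled here; those are the situations where a proper grammar pays off.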

False word elimination using Regex replacement

I need to perform a content/keyword-based search in a list of files. For that I need to extract the keywords and store them in a MySQL database. The keywords are extracted in the following manner:
Read the file content
Remove special characters and additional white spaces if any using
Regex.Replace(input, "[^a-zA-Z0-9_]+", " ")
Remove am/is/are/be/being/been, have/has/having/had, do/does/doing/did, adjectives, phrases, adverbs, etc.
Remove endings like:
-IC-ATION fortification
-IC-ITY electricity
-IC-MENT fantastically
-AT-IV contemplative
-AT-OR conspirator
-IV-ITY relativity
-IV-MENT instinctively
-ABLE-ITY incapability
-ABLE-MENT charitably
-OUS-MENT famously
Can I do the whole operation using a single regular expression? Is there a simpler method for this? Here I have a reference algorithm for this operation.
I don't think it would be possible to implement a stemming algorithm using regular expressions exclusively. Maybe you should take a look at already existing implementations to get ideas. Here is a link to the Porter stemming algorithm in VB.net