I have a string that contains tags like this (there are multiple such tags):
|{{nts|-2605.2348}}
I want to use Boost.Regex to remove |{{nts| and }}, so that the whole tag shown above is replaced with
-2605.2348
in the original string.
To make it more clear:
Suppose string is:
number is |{{nts|-2605.2348}}
I want string as:
number is -2605.2348
I am quite new to Boost.Regex and have read many things online, but I have not been able to find an answer to this. Any help would be appreciated.
It really depends on how specific you want to be. Do you want to always remove exactly |{{nts|, or do you want to remove a pipe, followed by {{, followed by any number of letters, followed by a pipe? Or do you want to remove everything that isn't whitespace between the last space and the first part of the number?
One of the many ways to do this would be something like:
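#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main()
{
    std::string str = "number is |{{nts|-2605.2348}}";
    // A pipe, then anything that cannot start a number, then the captured number, then }}
    boost::regex re("\\|[^-\\d.]*(-?[\\d.]*)\\}\\}");
    std::cout << boost::regex_replace(str, re, "$1") << '\n';
}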
online demo: http://liveworkspace.org/code/2B290X
However, since you're using boost, consider the much simpler and faster parsers generated by boost.spirit.
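For the strictest reading of the question (always remove exactly |{{nts| and }}), a tighter pattern is possible. The sketch below is an illustration rather than part of the original answer: it uses std::regex instead of boost::regex so that it builds without Boost (the regex_replace call has the same shape in both), and the function name strip_nts is made up for this example. It also assumes the number is an optional minus sign, digits, and an optional fractional part.

```cpp
#include <regex>
#include <string>

// Replaces every "|{{nts|NUMBER}}" tag with just NUMBER.
// strip_nts is a hypothetical helper name, not a library function.
std::string strip_nts(const std::string& input) {
    // Literal "|{{nts|", a captured number, then literal "}}".
    static const std::regex tag(R"(\|\{\{nts\|(-?\d+(?:\.\d+)?)\}\})");
    return std::regex_replace(input, tag, "$1");
}
```

Calling strip_nts on "number is |{{nts|-2605.2348}}" should give "number is -2605.2348". Text that does not contain the exact tag is left untouched.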
Using C++, I would like to split the rows of a string (a CSV file in this case) where some of the fields may contain delimiters that are escaped (using "") and should be treated as literals. I have looked at the various questions already posted but have not found a direct answer to my problem.
Example of CSV file data:
Header1,Header2,Header3,Header4,Header5
Hello,",,,","world","!,,!,",","
Desired string vector after splitting:
["Hello"],[",,,"],["world"],["!,,!,"],[","]
Note: The CSV is only valid if the number of data columns equals the number of header columns.
I would prefer a solution that does not use Boost or other third-party libraries. Efficiency is not a priority.
EDIT:
The code below, which implements the regex from @ClasG, at least satisfies the scenario above. I am drafting fringe test cases but would love to hear when / where it breaks down...
std::string s = "Hello,\",,,\",\"world\",\"!,,!,\",\",\"";
std::string rx_string = "(\"[^\"]*\"|[^,]*)(?:,|$)";
std::regex e(rx_string);
std::regex_iterator<std::string::iterator> rit(s.begin(), s.end(), e);
std::regex_iterator<std::string::iterator> rend;
while (rit != rend)
{
    // Prints the whole match, including the trailing comma if there is one
    std::cout << rit->str() << std::endl;
    ++rit;
}
This is not a complete C++ solution, but a regex that might nudge you in the right direction.
A regex like
("[^"]*"|[^,]*)(?:,|$)
will match the individual columns. (Note that it doesn't handle escaped quotes.)
See it here at regex101.
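Since the regex does not handle escaped quotes, a small hand-written scanner is one way to cover that case without Boost. The following is a minimal sketch, not the answerer's code: the name split_csv_line is invented here, and it assumes the common CSV convention that a quote inside a quoted field is written as two quotes (""). It works on a single row and does not handle quoted fields that span several lines.

```cpp
#include <string>
#include <vector>

// Splits one CSV row into fields. Commas inside double quotes are kept,
// and a doubled quote inside a quoted field becomes one literal quote.
// split_csv_line is a hypothetical helper name.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // "" inside quotes: literal quote
                    ++i;
                } else {
                    in_quotes = false;  // closing quote
                }
            } else {
                current += c;           // includes commas inside quotes
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote, not part of the field
        } else if (c == ',') {
            fields.push_back(current);  // field boundary
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field has no trailing comma
    return fields;
}
```

For the data row in the question this should yield the five fields Hello, ",,,", world, "!,,!," and "," with the surrounding quotes removed. Comparing fields.size() against the header row's field count gives the validity check mentioned in the question.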
This is not an answer, but it's too long to put as a comment IMHO.
CSV is one of those seemingly-simple-but-actually-quite-fiendish storage formats.
The droid you're looking for is Boost.Spirit.
The Spirit Master's name (on Stack Overflow) is @sehe.
See his answer here: https://stackoverflow.com/a/18366335/2015579
Please credit sehe, not me.
Question
How to minify HTML using C++?
Resources
An external library could be the answer, but I'm mainly looking for improvements to my current code, although I'm all ears for other possibilities.
Current code
This is my interpretation in c++ of the following answer.
The only part I had to change from the original post is this part on top: "(?ix)"
...and a few escape signs
#include <string>
#include <boost/regex.hpp>
void minifyhtml(std::string* s) {
    boost::regex nowhitespace(
        "(?ix)"
        "(?>" // Match all whitespace runs other than a single space.
        "[^\\S ]\\s*" // Either one [\t\r\n\f\v] and zero or more ws,
        "| \\s{2,}" // or two or more consecutive-any-whitespace.
        ")" // Note: The remaining regex consumes no text at all...
        "(?=" // Ensure we are not in a blacklist tag.
        "[^<]*+" // Either zero or more non-"<" {normal*}
        "(?:" // Begin {(special normal*)*} construct
        "<" // or a < starting a non-blacklist tag.
        "(?!/?(?:textarea|pre|script)\\b)"
        "[^<]*+" // more non-"<" {normal*}
        ")*+" // Finish "unrolling-the-loop"
        "(?:" // Begin alternation group.
        "<" // Either a blacklist start tag.
        "(?>textarea|pre|script)\\b"
        "| \\z" // or end of file.
        ")" // End alternation group.
        ")" // If we made it here, we are not in a blacklist tag.
    );
    // @todo Don't remove conditional html comments
    // Non-greedy, so each comment is removed on its own instead of
    // everything between the first "<!--" and the last "-->".
    boost::regex nocomments("<!--.*?-->");
    *s = boost::regex_replace(*s, nowhitespace, " ");
    *s = boost::regex_replace(*s, nocomments, "");
}
Only the first regex is from the original post; the other one is something I'm working on and should be considered far from complete. It should hopefully give a good idea of what I am trying to accomplish, though.
Regexps are a powerful tool, but I think using them here is a bad idea. For example, the regexp you provided is a maintenance nightmare: you cannot look at it and quickly understand what it is supposed to match.
You need an HTML parser that tokenizes the input file, or lets you access the tokens either as a stream or as an object tree. Basically, you read the tokens, discard the tokens and attributes you don't need, and write what remains to the output. Using something like this would let you develop a solution faster than tackling it with regexps.
I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.
In C++, candidates include libxml (which might have an HTML support module), Qt 4, and tinyxml; libstrophe also uses some kind of XML parser that could work.
Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which would work very well for this kind of problem.
Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).
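To make the "read, discard, write what remains" idea concrete, here is a rough sketch of a hand-written scan that collapses whitespace runs to a single space everywhere except inside pre, textarea, and script elements. It is not a real HTML parser and is not taken from any of the libraries named above: the names collapse_whitespace and matches_at are invented for illustration, and it assumes ASCII tag names and ignores comments and quoted attribute values.

```cpp
#include <cctype>
#include <string>

// True if `text` holds `word` (given in lower case) at `pos`, ignoring case.
static bool matches_at(const std::string& text, std::size_t pos,
                       const std::string& word) {
    if (pos + word.size() > text.size()) return false;
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(text[pos + i])) != word[i])
            return false;
    }
    return true;
}

// Collapses each whitespace run to one space, except inside protected elements.
std::string collapse_whitespace(const std::string& html) {
    static const std::string protected_tags[] = {"pre", "textarea", "script"};
    std::string out;
    std::string closing;  // e.g. "</pre" while inside a protected element
    std::size_t i = 0;
    while (i < html.size()) {
        const unsigned char c = static_cast<unsigned char>(html[i]);
        if (!closing.empty()) {
            // Inside a protected element: copy verbatim until its end tag starts.
            if (matches_at(html, i, closing)) closing.clear();
            out += html[i++];
        } else if (std::isspace(c)) {
            // Skip the whole run and emit a single space.
            while (i < html.size() &&
                   std::isspace(static_cast<unsigned char>(html[i]))) ++i;
            out += ' ';
        } else {
            if (c == '<') {
                // Does a protected element start here, e.g. <pre> or <pre class=...>?
                for (const std::string& tag : protected_tags) {
                    const std::size_t end = i + 1 + tag.size();
                    if (matches_at(html, i + 1, tag) &&
                        (end >= html.size() ||
                         !std::isalnum(static_cast<unsigned char>(html[end])))) {
                        closing = "</" + tag;
                    }
                }
            }
            out += html[i++];
        }
    }
    return out;
}
```

Each branch does one thing, which is the maintainability argument made above; a real tokenizer would add further states for comments and attribute values in the same style.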
I am trying to parse an HTML string using the split method from Boost. Can it be used with a string delimiter like "<td>"? Can someone give me an example of how to do it efficiently?
I am trying to do something like
vector <string> fields;
split( fields, str, is_any_of( "<td>" ) );
But, as I understand it, this treats '<', 't', 'd' and '>' all as individual delimiter characters. I am trying to find a way to use a whole string as the delimiter.
According to the documentation, split works on a character-by-character basis, treating the string as a sequence of characters. The predicate it uses to decide whether something is a delimiter can therefore only test a single character, so if you want to split on a complete string you will need to use something else. A regular expression library would certainly be able to do it, but you could fairly easily hand-code a splitter by searching for substrings.
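A hand-coded version based on substring search could look roughly like the sketch below. It uses only std::string::find; the function name split_on is made up for this example, and the choice to return the whole input as a single piece when the delimiter is empty is an assumption.

```cpp
#include <string>
#include <vector>

// Splits `text` at every occurrence of the whole string `delim`.
// split_on is a hypothetical helper name.
std::vector<std::string> split_on(const std::string& text,
                                  const std::string& delim) {
    std::vector<std::string> parts;
    if (delim.empty()) {          // assumption: nothing to split on
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = text.find(delim, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();   // continue after the delimiter
    }
    parts.push_back(text.substr(start));  // remainder after the last delimiter
    return parts;
}
```

Note that splitting "<td>a</td><td>b</td>" on "<td>" gives an empty first piece, because the input starts with the delimiter; the closing tags stay attached to the pieces.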
I'm just trying to mess around and get familiar with using regex in c++.
Let's say I want the user to input the following: ###-$$-###, make #=any number between 0-9 and $=any number between 0-5. This is my idea for accomplishing this:
regex rx("[0-9][0-9][0-9]""\\-""[0-5][0-5]")
That's not the exact code, but it is the general idea for checking whether or not the user's input is a valid string of numbers. However, let's say I won't allow numbers starting with a 0, so 099-55-999 is not acceptable. How can I check for something like that and report the input as invalid? Thanks
[0-9]{3}-[0-5]{2}-[0-9]{3}
matches a string that starts with three digits between 0 and 9, followed by a dash, followed by two digits between 0 and 5, followed by a dash, followed by three digits between 0 and 9.
Is that what you're looking for? This is very basic regex stuff. I suggest you look at a good tutorial.
EDIT: (after you changed your question):
[1-9][0-9]{2}-[0-5]{2}-[0-9]{3}
would match the same as above except for not allowing a 0 as the first character.
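As a small sketch of how that pattern could be used for validation: the example below assumes a C++11 compiler with std::regex (with TR1 the names live in std::tr1 instead), and the function name is_valid_code is invented for illustration. regex_match requires the whole input to match, so no anchors are needed.

```cpp
#include <regex>
#include <string>

// True if `input` has the form ###-$$-### with no leading zero.
// is_valid_code is a hypothetical helper name.
bool is_valid_code(const std::string& input) {
    static const std::regex pattern("[1-9][0-9]{2}-[0-5]{2}-[0-9]{3}");
    return std::regex_match(input, pattern);  // must match the entire string
}
```

With this, "123-45-678" is accepted, while "099-55-999" is rejected because of the leading zero and "123-65-678" because 6 is outside 0-5.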
std::tr1::regex rx("[0-9]{3}-[0-5]{2}-[0-9]{3}");
You're talking about using TR1 regex in C++, right, and not managed C++? If so, go here, where this is explained.
Also, you should know that if you're using VS2010, you no longer need the Boost library for regex.
Try this:
#include <regex>
#include <iostream>
#include <string>
int main()
{
    std::tr1::regex rx("\\d{3}-[0-5]{2}-\\d{3}");
    std::string s;
    std::getline(std::cin,s);
    if(regex_match(s.begin(),s.end(),rx))
    {
        std::cout << "Matched!" << std::endl;
    }
}
For an explanation, check @Tim's answer. Do note the double \ for the digit metacharacter.
Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file exactly as it is and perform some actions only on the alphanumeric words. Afterwards I have to print it with the changed words, but with the spaces, newlines, and punctuation unchanged and in the same positions. The length of each alphanumeric word stays the same when it is changed. I have tried this with a library that checks for alphanumeric characters, but the code gets very messy. Is there any other way?
You can read your file line by line with the fgets() function. It fills a char array that you can then work with: iterate over the array, split it into alphanumeric words, change the words, and write the modified string to a new file with the fwrite() function.
If you prefer the C++ way of working with files (iostream), you can use istream::getline. It preserves spaces, but it consumes the "\n". If you need to preserve the line ending as well (it can sometimes be '\r' or '\r\n'), you can use istream::get.
Maybe you should look at Boost Tokenizer. It can break a string into a series of tokens and iterate through them. The following sample breaks up a phrase into words:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
int main()
{
    std::string s = "Hi, Guys! There is more...";
    boost::tokenizer<> tok(s);
    for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        std::cout << *beg << "\n";
    }
    return 0;
}
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information see Boost Tokenizer documentation and implementation of an already provided char_separator, offset_separator and escaped_list_separator.
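If writing a custom TokenizerFunc feels heavy, the same boundary logic can be hand-rolled. The sketch below is an alternative to Boost Tokenizer, not an example of its API: transform_words is a made-up name, and it assumes that a "word" is a maximal run of characters for which std::isalnum is true in the default locale. Everything else is copied through unchanged, so spaces, newlines, and punctuation keep their positions as long as the edit keeps each word's length.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Applies `edit` to every alphanumeric run in `text` and copies every other
// character through unchanged. transform_words is a hypothetical helper name.
std::string transform_words(
    const std::string& text,
    const std::function<std::string(const std::string&)>& edit) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::isalnum(static_cast<unsigned char>(text[i]))) {
            // Find the end of this alphanumeric run.
            std::size_t j = i;
            while (j < text.size() &&
                   std::isalnum(static_cast<unsigned char>(text[j]))) ++j;
            out += edit(text.substr(i, j - i));
            i = j;
        } else {
            out += text[i++];  // whitespace and punctuation stay as they are
        }
    }
    return out;
}
```

For instance, passing an edit that reverses each word turns "Hi,  Guys! 123" into "iH,  syuG! 321", with both spaces after the comma preserved.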
Code usually gets messy because the problem was not broken down into clear functions and classes. If you do break it down, you will have a few functions that each do precisely one thing (not messy). Your main function will then just call these simple functions. If the function names are well chosen, the main function will become short and clear, too.
In this case, your main function needs to do:
Loop: Read every line of a file
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it
Extra hints: a line of text can be stored as a std::string and can be read by std::getline(std::cin, line)
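The steps above can be sketched as a few small functions. This is only an illustration under assumptions: the names replace_word, process, and process_text are invented, "special" is taken to mean one fixed word matched as a whole word, and streams are used in place of a concrete file. Because std::getline drops the '\n', the loop writes it back unless the last line ended at end-of-file without one.

```cpp
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Steps 2 and 3: replace whole-word occurrences of `word` in one line.
std::string replace_word(std::string line, const std::string& word,
                         const std::string& replacement) {
    if (word.empty()) return line;
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::size_t pos = 0;
    while ((pos = line.find(word, pos)) != std::string::npos) {
        const std::size_t end = pos + word.size();
        const bool starts_word = pos == 0 || !is_word_char(line[pos - 1]);
        const bool ends_word = end == line.size() || !is_word_char(line[end]);
        if (starts_word && ends_word) {
            line.replace(pos, word.size(), replacement);
            pos += replacement.size();
        } else {
            ++pos;  // part of a longer word: keep searching
        }
    }
    return line;
}

// Step 1: loop over every line, writing each processed line to `out`.
void process(std::istream& in, std::ostream& out, const std::string& word,
             const std::string& replacement) {
    std::string line;
    while (std::getline(in, line)) {
        out << replace_word(line, word, replacement);
        if (!in.eof()) out << '\n';  // restore the newline getline removed
    }
}

// Convenience wrapper so the loop can be tried on an in-memory string.
std::string process_text(const std::string& text, const std::string& word,
                         const std::string& replacement) {
    std::istringstream in(text);
    std::ostringstream out;
    process(in, out, word, replacement);
    return out.str();
}
```

With a file, process would be called with a std::ifstream and a std::ofstream instead of string streams. Using a replacement of the same length as the word keeps every other character at its original position.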