I thought that $ indicates the end of string. However, the following piece of code gives "testbbbccc" as a result, which is quite astonishing to me... This means that $ actually matches end of line, not end of the whole string.
#include <iostream>
#include <regex>
using namespace std;
int main()
{
tr1::regex r("aaa([^]*?)(ogr|$)");
string test("bbbaaatestbbbccc\nddd");
vector<int> captures;
captures.push_back(1);
const std::tr1::sregex_token_iterator end;
for (std::tr1::sregex_token_iterator iter(test.begin(), test.end(), r, captures); iter != end; )
{
string& t1 = iter->str();
iter++;
cout << t1;
}
}
I have been trying to find a "multiline" switch (which actually can be easily found in PCRE), but without success... Can someone point me to the right direction?
Regards,
R.P.
As Boost::Regex was selected for tr1, try the following:
From Boost::Regex
Anchors:
A '^' character shall match the start
of a line when used as the first
character of an expression, or the
first character of a sub-expression.
A '$' character shall match the end of
a line when used as the last character
of an expression, or the last
character of a sub-expression.
So the behavior you observed is correct.
From: Boost Regex as well:
\A Matches at the start of a buffer
only (the same as \`).
\z Matches at
the end of a buffer only (the same as
\').
\Z Matches an optional sequence
of newlines at the end of a buffer:
equivalent to the regular expression
\n*\z
I hope that helps.
There is no multiline switch in TR1 regexs. It's not exactly the same, but you could get the same functionality matching everything:
(.|\r|\n)*?
This matches non-greedily every character, including new line and carriage return.
Note: Remember to escape the backslashes '\' like this '\\' if your pattern is a C++ string in code.
Note 2: If you don't want to capture the matched contents, append '?:' to the opening bracket:
(?:.|\r|\n)*?
Related
In verilog language, the statements are enclosed in a begin-end delimiter instead of bracket.
always# (*) begin
if (condA) begin
a = c
end
else begin
b = d
end
end
I'd like to parse outermost begin-end with its statements to check coding rule in python. Using regular expression, I want results with regular expression like:
if (condA) begin
a = c
end
else begin
b = d
end
I found similar answer for bracket delimiter.
int funcA() {
if (condA) {
b = a
}
}
regular expression:
/({(?>[^{}]+|(?R))*})/g
However, I don't know how to modify atomic group ([^{}]) for "begin-end"?
/(begin(?>[??????]+|(?R))*end)/g
The point of the [??????]+ part is to match any text that does not match a char that is equal or is the starting point of the delimiters.
So, in your case, you need to match any char other than a char that starts either begin or end substring:
/begin(?>(?!begin|end).|(?R))*end/gs
See the regex demo
The . here will match any char including line break chars due to the s modifier. Note that the actual implementation might need adjustments (e.g. in PHP, the g modifier should not be used as there are specific functions/features for that).
Also, since you recurse the whole pattern, you need no outer parentheses.
I am trying to match single line comment pattern in flex. Patterns of the comment could be:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So, it's not catching the multi-line comment. Also the line_count is not incremented. It's staying the same. Can anybody help me figuring out what I have done wrong?
Link to code
In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab matches an a followed by a b
[ab] matches either an a or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a* matches 0 or more as
[a*] matches either an a or a *
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier} matches whatever the expanded macro named identifier would match.
[{identifier}] matches a single character which is {, } or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.
string "I am 5 years old"
regex "(?!am )\d"
if you go to http://regexr.com/ and apply regex to the string you'll get 5.
I would like to get this result with std::regex, but I do not understand how to use match results and probably regex has to be changed as well.
std::regex expression("(?!am )\\d");
std::smatch match;
std::string what("I am 5 years old.");
if (regex_search(what, match, expression))
{
//???
}
The std::smatch is an instantiation of the match_results class template for matches on string objects (with string::const_iterator as its iterator type). The members of this class are those described for match_results, but using string::const_iterator as its BidirectionalIterator template parameter.
std::match_results supports a operator[]:
If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression).
If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.
if n >= size(), returns a reference to a std::sub_match representing an unmatched sub-expression (an empty subrange of the target sequence).
In your case, regex_search finds the first match only and then match[0] holds the entire match text, match[1] would contain the text captured with the first capturing group (the fist parenthesized pattern part), etc. In this case though, your regex does not contain capturing groups.
Here, you need to use a capturing mechanism here since std::regex does not support a lookbehind. You used a lookahead that checks the text that immediately follows the current location, and the regex you have is not doing what you think it is.
So, use the following code:
#include <regex>
#include <string>
#include <iostream>
using namespace std;
int main() {
std::regex expression(R"(am\s+(\d+))");
std::smatch match;
std::string what("I am 5 years old.");
if (regex_search(what, match, expression))
{
cout << match.str(1) << endl;
}
return 0;
}
Here, the pattern is am\s+(\d+)". It is matching am, 1+ whitespaces, and then captures 1 or more digits with (\d+). Inside the code, match.str(1) allows access to the values that are captured with capturing groups. As there is only one (...) in the pattern, one capturing group, its ID is 1. So, str(1) returns the text captured into this group.
The raw string literal (R"(...)") allows using a single backslash for regex escapes (like \d, \s, etc.).
Here's the regular expression:
let legalStr = "(?:[eE][\\+\\-]?[0-9]{1,3})?$"
Here's the invocation:
if let match = sender.stringValue.rangeOfString(legalStr, options: .RegularExpressionSearch) {
print("\(sender.stringValue) is legal")
}
else {
print( "\(sender.stringValue) is not legal")
}
If I type garbage, like "abcd" is returns illegal string.
If I type something like "e123" it returns legal string.
(note that the empty string is also legal.)
However, if I type "e1234" it still returns "legal". I'd expect it to return "not legal". Am I missing something here? BTW, note the "$" at the end of the regular expression. The three digits should appear at the end of the string.
If it's not immediately clear, the source of the string is a text edit box.
Your pattern is only anchored at the end, and matches the empty string. So any string at all will match successfully by just matching your pattern as an empty string at the end.
Add a ^ to the front to anchor it on that side, too.
May we have similar question here stackoverflow:
But my question is:
First I tried to match all x in the string so I write the following code, and it's working well:
string str = line;
regex rx("x");
vector<int> index_matches; // results saved here
for (auto it = std::sregex_iterator(str.begin(), str.end(), rx);
it != std::sregex_iterator();
++it)
{
index_matches.push_back(it->position());
}
Now if I tried to match all { I tried to replace
regex rx("x"); with regex rx("{"); andregex rx("\{");.
So I got an exception and I think it should throw an exception because we use {
sometimes to express the regular expression, and it expect to have } in the regex at the end that's why it throw an exception.
So first is my explanation correct?
Second question I need to match all { using the same code above, is that possible to change the regex rx("{"); to something else?
You need to escape characters with special meaning in regular expressions, i.e. use \{ regular expression. But, \ has special meaning in C++ string literals. So, next you need to escape characters with special meaning in C++ string literals, i.e. write:
regex rx("\\{");