Character class has duplicated range - regex

The regex is: [%{1,2}|\/\/]
It is meant to match %, %% and //.
FYI: the warning in the title is generated by flycheck.

The warning means that you have duplicates in your character class. When you put something between square brackets, it means "one of those": [abc] means any of a, b or c. [a|b] means any of a, | or b.
So when you do [\/\/], you mean "either /, or /" which is obviously a duplicate. For the same reason, [%{1,2}] means "either % or { or 1 or , or 2 or }" which is clearly not what you want.
Grouping is done with parentheses, not square brackets, so use this regex instead:
(%{1,2}|\/\/)
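The difference can be checked directly. The following is a minimal sketch using Python's re module; the choice of Python and the variable names are assumptions for illustration, since the question does not say which regex engine the pattern is written for.

```python
import re

# Character class: matches exactly ONE character from the set % { 1 , 2 } | /
char_class = re.compile(r"[%{1,2}|\/\/]")

# Group with alternation: matches "%", "%%" or "//"
group = re.compile(r"(%{1,2}|\/\/)")

for text in ["%", "%%", "//", "{", "|"]:
    print(repr(text),
          "class:", bool(char_class.fullmatch(text)),
          "group:", bool(group.fullmatch(text)))
```

The character class accepts stray single characters such as { and | but rejects the two-character inputs, while the group accepts exactly the three intended strings.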

What happens with extra spaces and newlines in C/C++ code?

Is there a difference between:
int main(){
return 0;
}
and
int main(){return 0;}
and
int main(){
return
0;
}
They will all likely compile to the same executable. How does the C/C++ compiler treat the extra spaces and newlines, and are newlines treated differently from spaces in C code?
Also, how about tabs? What's the significance of using tabs instead of spaces in code, if there is any?
Any sequence of one or more whitespace characters (space, line break, tab, ...) is equivalent to a single space.
Exceptions:
Whitespace is preserved in string literals. They can't contain line-breaks, except C++ raw literals (R"(...)"). The same applies to file names in #include.
Single-line comments (//) are terminated with line-breaks only.
Preprocessor directives (starting with #) are terminated with line-breaks only.
\ followed by a line-break removes both, allowing multi-line // comments, preprocessor directives, and string literals.
Also, whitespace symbols are ignored if there is punctuation (anything except letters, numbers, and _) to the left and/or to the right of it. E.g. 1 + 2 and 1+2 are the same, but return a; and returna; are not.
Exceptions:
Whitespace is not ignored inside string literals, obviously. Nor in #include file names.
Operators consisting of more than one punctuation symbol can't be split by whitespace, e.g. cout < < 1 is illegal. The same applies to things like // and /* */.
A space between punctuation might be necessary to prevent it from coalescing into a single operator. Examples:
+ +a is different from ++a.
a+++b is equivalent to a++ +b, but not to a+ ++b.
Pre-C++11, closing two template argument lists in a row required a space: std::vector<std::vector<int> >.
When defining a function-like macro, the space is not allowed before the opening parenthesis (adding it turns it into an object-like macro). E.g. #define A() replaces A() with nothing, but #define A () replaces A with ().
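The coalescing behaviour in the ++ examples follows from the lexer always taking the longest operator it can. The following Python sketch imitates that rule for a tiny subset of tokens (identifiers, integers, + and ++). It is a toy illustration under those assumptions, not how any real C or C++ compiler is implemented, and the names TOKEN and tokenize are hypothetical.

```python
import re

# "++" is listed before "+" so the alternation prefers the longer operator,
# imitating the longest-match behaviour described above.
TOKEN = re.compile(r"\s*(\+\+|\+|[A-Za-z_]\w*|\d+)")

def tokenize(source):
    """Split a tiny C-like expression into tokens, skipping whitespace."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

for expr in ["a+++b", "a++ +b", "a+ ++b", "+ +a", "++a"]:
    print(repr(expr), tokenize(expr))
```

Here a+++b and a++ +b produce the same token list, while a+ ++b produces a different one, which matches the equivalences listed above.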

Regex search for strings with extra spaces for a given word

Here is a string:
foo
I want to find all occurrences of this string that have at least one space between its characters, i.e.:
(f oo|fo o|f o o)
However, if the string is longer, I cannot simply enumerate the variants like this.
I tried
f\s?o\s?o
But in this case, "foo" will be matched too.
UPDATE
As @CertainPerformace clarified:
The whole string needs to contain at least one space, but it doesn't matter where and how many, as long as there's at least one somewhere
However, I don't want words like b ar to be matched. I want foo-but-with-extra-spaces strings to be matched.
For example, given a string
foo f oo f o o b ar
I want only f oo and f o o to be matched.
After matching the first character, look ahead for zero to n - 2 non-space characters followed by a space, where n is the length of the word. For example, with foo, you'd allow at most 1 non-space character. That ensures a space occurs before the characters of the word run out.
Then end the lookahead and match the letters normally, possibly with spaces in between:
f(?=\S{0,1} ) *o *o
https://regex101.com/r/nzejCS/3
For a longer word, like foobar, you'd do:
f(?=\S{0,4} ) *o *o *b *a *r
You can also use a negative lookahead at the beginning:
(?!foo)f *o *o
https://regex101.com/r/nzejCS/4
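Both patterns can be tried on the example string from the question. The sketch below uses Python's re module; the helper spaced_word_pattern is a hypothetical name introduced here to build the lookahead pattern for an arbitrary word, following the "zero to n - 2" rule described above.

```python
import re

def spaced_word_pattern(word):
    """Build the lookahead pattern for `word` (hypothetical helper)."""
    if len(word) < 2:
        raise ValueError("need at least two characters to fit a space between them")
    first, rest = re.escape(word[0]), word[1:]
    # At most n - 2 non-space characters may precede the first space.
    lookahead = r"(?=\S{0,%d} )" % (len(word) - 2)
    return first + lookahead + "".join(" *" + re.escape(ch) for ch in rest)

text = "foo f oo f o o b ar"

print(spaced_word_pattern("foo"))
print(re.findall(spaced_word_pattern("foo"), text))   # lookahead version
print(re.findall(r"(?!foo)f *o *o", text))            # negative lookahead version
```

Both calls return only the spaced variants, f oo and f o o; the plain foo and the unrelated b ar are skipped.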

How can I use regular expressions to find all lines of source code defining default arguments for a function?

I want to find lines of code which declare functions with default arguments, such as:
int sum(int a, int b=10, int c=20);
I was thinking I would look for:
Exactly one left parenthesis "("
One or more of any character except "="
Exactly one equals sign "="
One character that is not an equals sign
One or more characters other than a right parenthesis ")"
A right parenthesis ")"
The following is my attempt:
([^=]+=[^=][^)]+)
I would like to avoid matching condition-clauses for if-statements and while-loops.
For example,
int x = 5;
if (x = 10) {
x = 7;
}
The regex should find functions with default arguments in any one of Python, Java, or C++. Let us not assume that function declarations end with a semicolon or begin with a data type.
Try this:
\([^)]*\w+\s+\w+\s*=[^),][^)]*\)
It looks for word characters (the parameter type), one or more spaces, word characters (the parameter name), optional spaces, then an equals sign.
Add ".*" to each end to match the whole line.
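As a quick check, the pattern can be run over the lines from the question. This is a minimal sketch with Python's re module; the name DEFAULT_ARG and the extra Python-style line at the end are assumptions added for illustration.

```python
import re

# The pattern from the answer above, unchanged.
DEFAULT_ARG = re.compile(r"\([^)]*\w+\s+\w+\s*=[^),][^)]*\)")

lines = [
    "int sum(int a, int b=10, int c=20);",
    "int x = 5;",
    "if (x = 10) {",
    "x = 7;",
    "def total(a, b=10):",   # untyped parameters, added for illustration
]

for line in lines:
    print(bool(DEFAULT_ARG.search(line)), line)
```

Only the first line matches. The if-statement is rejected because the pattern needs two words separated by whitespace before the equals sign. For the same reason the untyped Python-style signature is not matched, so a different pattern would be needed for parameters that have no type in front of the name.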
Please check this one:
\(((?:\w+\s+[\w][\w\s=]*,*\s*){1,})\)
The above expression matches the parameter list and returns it as $1 (Group 1), in case it is needed for further processing.

recursive matching for string delimiter with regular expression

In Verilog, statements are enclosed in begin-end delimiters instead of braces.
always @(*) begin
if (condA) begin
a = c
end
else begin
b = d
end
end
I'd like to parse the outermost begin-end block together with its statements, in order to check coding rules in Python. Using a regular expression, I want to get a result like:
if (condA) begin
a = c
end
else begin
b = d
end
I found a similar answer for brace delimiters:
int funcA() {
if (condA) {
b = a
}
}
regular expression:
/({(?>[^{}]+|(?R))*})/g
However, I don't know how to modify the atomic group ([^{}]) for begin-end:
/(begin(?>[??????]+|(?R))*end)/g
The point of the [??????]+ part is to match any text that contains neither delimiter. With single-character delimiters a negated character class does that; with multi-character delimiters you need a lookahead instead.
So, in your case, you need to match any character that is not the start of a begin or end substring:
/begin(?>(?!begin|end).|(?R))*end/gs
The . here will match any char including line break chars due to the s modifier. Note that the actual implementation might need adjustments (e.g. in PHP, the g modifier should not be used as there are specific functions/features for that).
Also, since you recurse the whole pattern, you need no outer parentheses.
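Since the question mentions Python: the standard re module does not support (?R) recursion (the third-party regex module does). If only the standard library is available, one alternative is to find the keywords with a simple regex and track the nesting depth by hand. The sketch below swaps in that depth-counting technique instead of the recursive pattern; the function name outermost_blocks is hypothetical, and the code ignores comments and string literals that might contain begin or end.

```python
import re

# Whole-word begin/end only, so identifiers such as "endmodule" are ignored.
KEYWORD = re.compile(r"\b(begin|end)\b")

def outermost_blocks(source):
    """Return every outermost begin...end block found in `source`."""
    blocks = []
    depth = 0
    start = None
    for match in KEYWORD.finditer(source):
        if match.group(1) == "begin":
            if depth == 0:
                start = match.start()      # an outermost block opens here
            depth += 1
        elif depth > 0:                    # an "end" that closes an open block
            depth -= 1
            if depth == 0:
                blocks.append(source[start:match.end()])
    return blocks

SOURCE = """always @(*) begin
if (condA) begin
a = c
end
else begin
b = d
end
end"""

for block in outermost_blocks(SOURCE):
    print(block)
```

Dropping the leading begin and the trailing end from each returned block leaves the inner statements shown in the question.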

Comment pattern match in flex using states

I am trying to match single-line comments in flex. A comment could look like this:
//this is a single /(some random stuff) line comment
Or it could be like this:
// this is also a comment\
continuation of the comment from previous line
From the example it's obvious that I have to handle the multi-line case too.
Now my approach was using states. This is what I have so far:
"//" {
yymore();
BEGIN (SINGLE_COMMENT);
}
<SINGLE_COMMENT>([^{NEWLINE}]|\\[(.){NEWLINE}]) {
yymore();
}
<SINGLE_COMMENT>([^{NEWLINE}]|[^\\]{NEWLINE}) {
logout << "Line no " << line_count << ": TOKEN <COMMENT> Lexeme " << string(yytext) << "\nfound\n\n";
BEGIN (INITIAL);
}
NEWLINE is declared as:
NEWLINE \r?\n
My declaration unit:
%option noyywrap
%x SINGLE_COMMENT
int line_count = 1;
const int bucketSize = 10; // change if necessary
ofstream logout;
ofstream tokenout;
SymbolTable symbolTable(bucketSize);
Action of NEWLINE:
{NEWLINE} {
line_count++;
}
If I run it with the following input:
// hello\
int main
This is my log file:
Line no 1: TOKEN <COMMENT> Lexeme // hello\
found
Line no 1: TOKEN <INT> Lexeme int found
Line no 1: TOKEN <ID> Lexeme main found
ScopeTable # 1
6 --> < main , ID >
So it is not catching the multi-line comment. Also, line_count is not incremented; it stays the same. Can anybody help me figure out what I have done wrong?
In (f)lex, as in most regular expression engines, [ and ] enclose a character class description. A character class is a set of individual characters, and it always matches exactly one character which is a member of that set. There are also negated character classes which are written the same way except that they start with [^ and match exactly one character which is not a member of the set.
Character classes are not the same as sequences of characters:
ab matches an a followed by a b
[ab] matches either an a or a b
Since character classes are just sets of characters, it is meaningless for the individual characters in the class to be repeated or optional, etc. Consequently, almost no regular expression operators (*, +, ?, etc.) are meaningful inside a character class. If you put one of them in a character class expression, it is handled just like an ordinary character:
a* matches 0 or more as
[a*] matches either an a or a *
One of the features flex provides which is not provided by most other regular expression systems is macro expansions, of the form {name}. Here the { and } indicate the expansion of a defined macro, whose name is contained between the braces. These characters are also not special inside a character class:
{identifier} matches whatever the expanded macro named identifier would match.
[{identifier}] matches a single character which is {, } or one of the letters definrt
Macro definitions seem to be overused by beginners. My advice is always to avoid them, and thereby avoid the confusion which they create.
It's also worth noting that (f)lex does not have an operator which negates a subpattern. Only character classes can be negated; there is no easy way to write "match anything other than foo". However, you can generally rely on the first longest-match rule to effectively implement negations: if some pattern p executes, then there cannot be any pattern which would match more than p. Thus, it might not be necessary to explicitly write the negation.
For example, in your comment detector where the only real issue is dealing with carriage return (\r) characters which are not followed by newline characters, you could use (f)lex's pattern matching algorithm to your advantage:
<SINGLE_COMMENT>{
[^\\\r\n]+ ;
\\\r?\n { ++line_count; }
\\. ; /* only matches if the above rule doesn't */
\r?\n { ++line_count; BEGIN(INITIAL); }
\r ; /* only matches if the above rule doesn't */
}
By the way, it's usually much easier to provide %option yylineno than to try to track newlines manually.
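To experiment with the same matching logic outside of flex, the rules in the start-condition block can be folded into one regular expression. The following is a rough Python sketch under that assumption; it is not flex code, the name COMMENT is hypothetical, and it counts lines after the match instead of in per-rule actions.

```python
import re

# "//" followed by any mix of:
#   \\\r?\n     a backslash-newline continuation
#   \\.         a backslash followed by any other character
#   \r(?!\n)    a lone carriage return
#   [^\\\r\n]   any character that is not a backslash, CR or LF
COMMENT = re.compile(r"//(?:\\\r?\n|\\.|\r(?!\n)|[^\\\r\n])*")

source = "// hello\\\nint main\nint x;\n"

match = COMMENT.match(source)
comment = match.group(0)
lines_spanned = comment.count("\n") + 1

print(repr(comment))
print("lines spanned:", lines_spanned)
```

With the input from the question, the match covers both physical lines of the comment and stops at the first newline that is not preceded by a backslash.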