I am trying to create a simple state machine in flex which has to ensure that strings spanning multiple lines must have \ for line breaks. Concretely:
"this is \
ok"
"this is not
ok"
The first one is valid. The second one is not.
I have the following state machine:
expectstring BEGIN(expectstr);
<expectstr>[^\n] {num_lines++;}
<expectstr>\ {flag = true;}
<expectstr>\n {printf("%s\n", flag ? "True" : False);}
But when I try to compile this state machine, flex tells me that the rule with \ can not be matched. Why is that?
I have looked at this but cannot figure it out.
In flex, the following pattern matches anything other than a newline:
.
You can also write that as
[^\n]
but . is more normal.
In order to match a backslash you can write
\\
"\\"
[\\]
Again, the first would be the usual way.
It's important to understand that [...] is an way of representing a set of characters, and that most regular expression operators are just ordinary characters inside the brackets. Similarly, "..." is a way of representing a sequence of characters and most regular expression operators are just ordinary characters inside the quotes.
Thus,
[a|b] matches one character if it is an a, a |, or a b
"a|b" matches the three-character sequence a | b
and|but matches either of the three-character sequences and or but.
Since flex lets you match regular expressions, you really don't need to manually build a state machine. Just use an appropriate regular expression. For example, the following will match strings which start and end with ", in which \ may be used to escape itself as well as newlines, and in which newlines (other than escaped ones) are illegal. I think that's your goal.
\"([^"\n\\]|\\(.|\n))*\"
You should make sure you understand how it works; there are lots of good explanations of regular expressions on the internet (and even more bad ones, so try to find one written by someone who knows what they are talking about). Here's the summary:
\" A literal double-quote
(...)* Any number of repetitions of:
[^"\n\\] Anything other than a double-quote, newline, or backslash
| Or
\\ A literal backslash, followed by
(...) Grouping
. Anything other than a newline
| Or
\n a newline
Related
I'mm starting to learn about Regular Expressions and I have written code in c++
my task is : Implement a function that replaces each digit in the given string with a '#' character.
For my example, the inputstring = "12 points".
I know I need to use \d for matches a digit. I tried to use this : std::regex_replace(input,std::regex("\d"),"#");
but it is not working: the output is still "12 points";
Then I searched the internet and the result is:
std::regex_replace(input,std::regex("\\d"),"#");
with the output is "## points".
Can anyone help me to understand what is "\\d" ?
\d means decimal, however, in the regular expression, the \ is a special character, which needs to be escaped on its own as well, hence in \\d you escape the \ to mark it to be used as a regular character instead of its special meaning.
When you use "\d" in a C++ application, the \ is an escape character in C++. So it doesn't treat the following d as a d.
Regex then gets a string that doesn't have \d in it, but most likely an empty string (since \d doesn't evaluate to anything in C++ to my knowledge).
When you use "\d" you are escaping the . So C++ reads the string as "\d" as you intended.
An example of when you'd use an escape character, is when you want to output a quote. "\"" would output a single double quote.
I have been looking through other regex questions, but have not been able to find an answer. I am working on a grammar in ANTLR4 and there is a regular expression that has been eluding me.
I am trying to match any character, except for \ followed by anything other than [btnrf"\].
I have tried ( ~([\\][.]) | [\\][btnrf"\] ) but the ~ only negates a single character as far as I can tell. I get the error:
error AC0050: extraneous input '[.]' expecting RPAREN while looking for lexer rule element
It seems like it shouldn't be too hard to exclude \* but allow the small list of acceptable escaped characters. I have been on http://www.regex101.com and I don't have any trouble matching the allowable characters, but for some reason I just can't figure out how to disallow escape characters besides the ones mentioned above, while also allowing all other characters.
Manually specifying every valid input character seems like overkill, but that may be what it comes down to. Something like:
[a-ZA-Z0-9_!##$%^&*()\-+=/.,<>;':\b\t\n\r\f\"\\]*
That may not be 100% valid, but the idea is just listing all valid possible characters, which by default would exclude any invalid escape characters. It seems like there should be a simpler way. Any tips or links to useful information would greatly appreciated.
The actual rule that I have so far, which allows anything enclosed in double quotes as a valid string:
STRING : '"' (~[\"] | '\\"')* '"';
I don't have ANTLR handy, but the following seems to do what you're after :
\([^\\].\)\|\(\\[btnrf\\"\\\\]\)
so effectively allow "EITHER anything other than a backslash followed by any character, OR a backslash followed by a specified character".
eg, putting that string in a file regexfile, and given a datafile containing
\a
\b
\\
xy
then performing grep -f regexfile datafile will exclude the \a, and return :
\b
\\
xy
Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!
I think you have tested it the wrong way because I'm getting the opposite results that you got.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.
The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right hand constant is interpreted as '^\sw*$' where \\s escapes the \
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.
Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgresSQL does it. PostgresSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgresSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the lvalue because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the #"string literal" feature to disable backslash escaping in strings they intend to use as regexes.
I can only find negative lookbehind for this , something like (?<!\\).
But this won't compile in c++ and flex. It seems like both regex.h nor flex support this?
I am trying to implement a shell which has to get treat special char like >, < of | as normal argument string if preceded by backslash. In other word, only treat special char as special if not preceded by 0 or even number of '\'
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.
In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\\\<
LITERAL_GREATERTHAN \\\\>
LITERAL_VERTICALBAR \\\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\" (i.e. a literal backslash)
\\> matches any instance of ">" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
Hope that helps!
You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
I am a newbie of regular expressions, I try to understand what kind of string of the following regular expressions trying to match:
set result [regexp "$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" [split $outPut \n]]
what does the regular expressions above trying to match?what is the value of result?
The complication here is that the regex specification is protected from the Tcl's string interpolation rules.
To detangle, you should think along these lines:
"$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" is a double-quoted string, so the usual interpolation rules apply:
Each backslash escapes the following character;
Each $variable reference is substituted for its value;
[command ...] is substituted for the string returned by the executed command.
So each occurence of \\ is there to produce a single '\' character in the interpolated string, and \[ are meant to prevent Tcl from interpreting those [^\n] as commands (named "^\n") to be executed.
So if we suppose that the PersonName variable contains "Joe", PersonId contains DEAD and gender contains "male", Tcl will get Joe\|[^\n]*\|[^\n]*\|\s*0xDEAD\|\s*male after performing all substitutions on the source string.
Now the resulting string is passed to the RE engine which applies its own syntacting rules when it parses the string denoting a regex, as described in the re_syntax manual page.
According to these rules, each backslash, again, escapes the following character unless it's a special "character-entry escape" so here we have:
\s denotes "any whitespace character";
\| escapes the '|' making it lose its usual meaning—to introduce an alteration—so that it literally matches the character '|'.
The [^\n]* construct means "a longest series of zero or more characters not including the newline character". Read up on "character classes" in regexes for more info.
The value of result will be the number of times the regular expression matched. In the absence of the -all option, that will always be 0 or 1 (i.e., not-found/found).
Overall, that regular expression (which #kostix's answer explains well) is really ugly though. REs are a powerful tool, but you can get very confused with them very easily. Moreover, if you're splitting the output on newlines then you don't need to try to exclude them in the RE match; there will definitely be no newlines in the result of split in that case.
If we better understood what you were trying to do, we could direct you to far more effective methods of matching (e.g., using lsearch with suitable options, loading the data into an in-memory SQLite database).