Unexpected end of regex when ascii character - c++

Minimal Verifiable Example
#include<regex>
int main(){
std::regex re("\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out [0]");
}
Why is this giving me a regex_error? My debugger's error message is "unexpected end of regex when ascii character", but I'm just trying to match the literal above and I don't see where the issue is.

\u is the beginning of the escape sequence for a Unicode code point, so to match a literal backslash followed by u you need to escape the backslash. Also, [...] is a character set match; the brackets need to be escaped if you want to match them literally.
std::regex re("\\\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \\[0\\]");
If you're using C++11 or newer, it's helpful to use raw strings when writing regular expressions, so you don't have to double the backslashes.
std::regex re(R"(\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \[0\])");
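The doubled backslashes are only relevant if you're writing the regexp as a C++ string literal. If you're constructing it dynamically at run time (for example, reading it from a file or from user input), you don't need to double the escapes, since you're feeding the string directly to the regexp engine and it's not being parsed as C++ source code. The regex-level escapes for the backslash and the brackets are still required.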
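To make the two escaping layers concrete, here is a minimal sketch. The names kTarget, kEscaped, kRaw, fails_to_compile, matches_target and escape_for_regex are made up for illustration and are not part of the question. It checks that the unescaped text is rejected as a pattern, that the two spellings of the fixed pattern are the same string, and that a run-time string can be escaped for the regex engine without any backslash doubling for the compiler.

```cpp
#include <regex>
#include <string>

// The text we want to match literally (raw string: no C++ escaping applied).
const std::string kTarget =
    R"(\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out [0])";

// The same pattern written twice: with doubled backslashes and as a raw string.
const std::string kEscaped =
    "\\\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \\[0\\]";
const std::string kRaw =
    R"(\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \[0\])";

// Returns true if compiling `pattern` throws std::regex_error.
bool fails_to_compile(const std::string& pattern) {
    try {
        std::regex re(pattern);
    } catch (const std::regex_error&) {
        return true;
    }
    return false;
}

// Returns true if `pattern` matches the whole target text.
bool matches_target(const std::string& pattern) {
    return std::regex_match(kTarget, std::regex(pattern));
}

// Hypothetical helper: prefixes every regex metacharacter in `text` with a
// backslash, so a string obtained at run time can be matched literally.
std::string escape_for_regex(const std::string& text) {
    static const std::regex meta(R"([\\^$.|?*+()\[\]{}])");
    return std::regex_replace(text, meta, R"(\$&)");
}
```

Calling matches_target(kRaw) should succeed, while using kTarget itself as the pattern is expected to throw, as in the question. escape_for_regex(kTarget) should produce the same string as kRaw.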

Related

C++ regex for properly matching strings that contain c-style escape characters (ECMAScript style, no look-behind)

I'm a regex noob attempting to match either the contents or the entirety of a quoted segment of text without breaking on escaped quotation marks.
Put another way, I need a regex that, between two quotation marks, will match all characters that are not quotation marks and also any quotation mark that has an odd number of consecutive backslashes preceding it. It has to be an odd number of backslashes, as a pair of backslashes escapes to a single backslash.
I've successfully created a regex that does this but it relied on look-behind and because this project is in C++ and because the regex implementation of standard C++ does not have look-behind functionality, I could not use said regex.
Here is the regex with look-behind that I came up with: "(((?<!\\)(\\\\)*\\"|[^"])*)"
The following text should produce 8 matches:
"Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."
Here's a picture of my (probably naive) look-behind version for the visual people among you (You know who you are):
And here's a link to this example: https://regex101.com/r/uOxqWl/1
If this is impossible to do without look-behind, please let me know.
Also, if there is a well-regarded C++ regex library that allows regex look-behind, please let me know (It doesn't have to be ECMAScript, though I would slightly prefer that).
Let's derive a garden variety regular expression for C-style strings from an English description.
A string is a quotation mark, followed by a sequence of string-characters, followed by another quotation mark.
std::regex stringMatcher ( R"("<string-character>*")" );
Obviously this doesn't work as we didn't define the string-character yet. We can do so piece by piece.
Firstly, a string character could be any character except a quotation mark and a backslash.
R"([^\\"])"
Secondly, a string character could be an escape sequence consisting of a backslash and a single other character from a fixed set.
R"(\\[abfnrtv'"\\?])"
Thirdly, it can be an octal escape sequence that consists of a backslash and three octal digits
R"(\\[0-7][0-7][0-7])"
(We simplify here a bit because the real C standard allows 1, 2 or 3 octal digits. This is easy to add.)
Fourthly, it can be a hexadecimal escape sequence that consists of a backslash, a letter x, and a hexadecimal number. The range of the number is implementation defined, so we need to accept any one.
R"(\\x[0-9a-fA-F][0-9a-fA-F]*)"
We omit universal character names; they could be added in exactly the same way. There are none in the given test example.
So, to bring this all together:
std::regex stringMatcher ( R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")" );
// collapsed the leading backslashes of all the escape sequence types together
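As a quick check, the sketch below runs the combined matcher over the test text from the question and counts the matches with std::sregex_iterator. The names kStringMatcher, kSample and count_strings are made up for this example. The sample is stored in a raw string with a custom delimiter so that none of its backslashes or quotes need extra escaping.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// The matcher assembled above: ordinary characters, simple escapes,
// three-digit octal escapes and hexadecimal escapes.
const std::regex kStringMatcher(
    R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")");

// Counts the non-overlapping string-literal matches in `text`.
std::ptrdiff_t count_strings(const std::string& text) {
    return std::distance(
        std::sregex_iterator(text.begin(), text.end(), kStringMatcher),
        std::sregex_iterator());
}

// The test text from the question, kept as-is inside a raw string.
const std::string kSample = R"txt("Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed.")txt";
```

count_strings(kSample) should report the 8 matches the question asks for. No look-behind is needed because each backslash is consumed together with the character it escapes.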

Regex c++ crashing while initialization

I'm currently working on finding registry paths match using regex.
I have initialized regex as
regex regx("HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\\}")
and the program throws a std::tr1::regex_error exception.
I tried to escape the curly braces using "\\\\" but it still didn't work.
Any idea on how to fix it?
I'm on Windows 10, Visual Studio 2010.
Let's look at a C++ string literal (a slightly shorter one that we can read):
"A\\B\\C"
This, taking account of the literal escaping, is really the string:
A\B\C
Now you're passing this string to the regex engine. But regex has its own escaping, so \B and \C are read as regex escape sequences rather than as a backslash followed by a letter (\B happens to be a valid one, but not every sequence produced by your actual pattern is).
Hence the regex is invalid and trying to instantiate it throws an exception.
You will need an extra layer of escaping:
"A\\\\B\\\\C"
Or use a raw string literal:
R"(A\\B\\C)"
In other words:
regex regx(R"(HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\})");
(Yuck!)
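Here is a small sketch of the two layers, assuming a compiler with C++11 raw string literal support. The names kOneLayer, kTwoLayers, kRawPattern and is_uninstall_key are invented for the example. It also assumes the braces around the GUID are meant literally, so they are escaped for the regex engine as \{ and \}.

```cpp
#include <regex>
#include <string>

// Layer 1: the compiler turns "A\\B\\C" into the 5-character string A\B\C.
const std::string kOneLayer = "A\\B\\C";

// Layer 2: the regex engine needs \\ to match one backslash, so the string
// it receives must be A\\B\\C. Both spellings below produce that string.
const std::string kTwoLayers = "A\\\\B\\\\C";
const std::string kRawPattern = R"(A\\B\\C)";

// Returns true if `path` is exactly the uninstall key from the question.
// Assumption: the braces are literal characters, hence \{ and \}.
bool is_uninstall_key(const std::string& path) {
    static const std::regex re(
        R"(HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall\\\{0398BFBC-913B-3275-9463-D2BF91B3C80B\})");
    return std::regex_match(path, re);
}
```

With this, std::regex_match(kOneLayer, std::regex(kRawPattern)) should succeed, and is_uninstall_key should accept the registry path written with single backslashes.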

string Regex using lex [duplicate]

I am learning to make a compiler and it's got some rules like single string:
char ch[] ="abcd";
and multi string:
printf("This is\
a multi\
string");
I wrote the regular expression
STRING \"([^\"\n]|\\{NEWLINE})*\"
It works fine with single line string but it doesn't work with multi line string where one line ends with a '\' character.
What should I change?
A common string pattern is
\"([^"\\\n]|\\(.|\n))*\"
This will match strings which include escaped double quotes (\") and backslashes (\\). It uses \\(.|\n) to allow any character after a backslash. Although some backslash sequences are longer than one character (\x40), none of them include non-alphanumerics after the first character.
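For readers who want to try the pattern outside of lex, here is a rough C++ std::regex rendering of it. This is an illustration only: std::regex is not lex, and the names kCString and first_string_literal are invented. The lex-level \" becomes a plain " inside the raw string, and the ECMAScript . also stops at a newline, so the (.|\n) alternative is kept.

```cpp
#include <regex>
#include <string>

// std::regex (ECMAScript) version of the common string pattern:
// ordinary characters, or a backslash followed by any character or a newline.
const std::regex kCString(R"("([^"\\\n]|\\(.|\n))*")");

// Returns the first string literal found in `source`, or "" if there is none.
std::string first_string_literal(const std::string& source) {
    std::smatch m;
    return std::regex_search(source, m, kCString) ? m.str() : std::string();
}
```

Fed the multi-line printf example from the question, first_string_literal should return the whole literal including the backslash-newline pairs, while a string that hits an unescaped newline before its closing quote yields no match.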
It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than throwing an error (which might be more appropriate), you need to do so explicitly:
\"([^"\\\n]|\\(.|\r?\n))*\"
But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.
That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)
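To illustrate the difference between recognising and interpreting, here is a sketch of the second step written as a plain C++ loop rather than with flex start conditions. The function name decode_string_body is hypothetical, and it only handles a handful of escapes (\n, \t, and backslash-newline, with everything else copied through), which is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Decodes the text between the quotes of a string literal.
// Assumption: only \n, \t and backslash-newline get special treatment;
// any other escaped character (including \\ and \") is copied as-is.
std::string decode_string_body(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '\\' || i + 1 == body.size()) {
            out += body[i];  // ordinary character, or a lone trailing backslash
            continue;
        }
        const char next = body[++i];  // the character after the backslash
        switch (next) {
            case 'n':  out += '\n'; break;  // byte 10
            case 't':  out += '\t'; break;
            case '\n': break;               // line continuation: drop both
            case '\r':                      // CR-LF line continuation
                if (i + 1 < body.size() && body[i + 1] == '\n') ++i;
                break;
            default:   out += next; break;  // covers \\ and \"
        }
    }
    return out;
}
```

For the body of the multi-line printf example, this should yield the single line "This isa multistring", since each backslash-newline pair is removed.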
Additionally, you need to think about error-handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly closed.
If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable because it will then scan the contents of the string as program text leading to a cascade of errors. If you are not attempting error recovery it doesn't matter, but if you are it is usually better to at least recognize the unterminated string as such up to the newline, using a different pattern.

Why do regexes and string literals use different escape sequences?

The handling of escape sequences varies across languages and between string literals and regular expressions. For example, in Python the \s escape sequence can be used in regular expressions but not in string literals, whereas in PHP the \f form feed escape sequence can be used in regular expressions but not in string literals.
In PHP, there is a dedicated page for PCRE escape sequences (http://php.net/manual/en/regexp.reference.escape.php) but it does not have an official list of escape sequences that are exclusive to string literals.
As a beginner in programming, I am concerned that I may not have a full understanding of the background and context of this topic. Are these concerns valid? Is this an issue that others are aware of?
Why do different programming languages handle escape sequences differently between regular expressions and string literals?
The escape sequences found in string literals are there to stop the programming language from getting confused. For example, in many languages a string literal is denoted as characters between quotes, like so:
my_string = 'x string'
But if your string contains a quote character then you need a way to tell the programming language that this should be interpreted as a literal character
my_string = 'x's string' # syntax error: the string ends after the x
my_string = 'x\'s string' # lets the programming language know that the internal quote is literal and not the end of the string
I think that most programming languages have a broadly similar set of escape sequences for string literals.
Regexes are a different story. You can think of them as their own separate language that happens to be written inside a string literal. In a regex, some characters, like the period (.), have a special meaning and must be escaped to match their literal counterpart, whereas other characters gain a special meaning only when preceded by a backslash.
For example
regex_string = 'A.C' # match an A, followed by any character, followed by C
regex_string = 'A\.C' # match an A, followed by a period, followed by C
regex_string = 'AsC' # match an A, followed by s, followed by C
regex_string = 'A\sC' # match an A, followed by a whitespace character, followed by C
Because regexes are their own mini-language it doesn't make sense that all of the escape sequences in regexes are available to normal string literals.
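The same four patterns can be tried in C++ with std::regex. This is just a sketch: full_match is a hypothetical helper, and raw string literals are used so that the backslashes reach the regex engine untouched.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches all of `text`.
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// The four patterns from the examples above.
const std::string kAnyChar = "A.C";       // A, any character, C
const std::string kPeriod = R"(A\.C)";    // A, a literal period, C
const std::string kLetterS = "AsC";       // A, the letter s, C
const std::string kSpace = R"(A\sC)";     // A, one whitespace character, C
```

For instance, full_match("AxC", kAnyChar) should be true and full_match("AxC", kPeriod) false, while kSpace accepts "A C" and a tab-separated "A\tC" but not "AsC".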
Regular expressions are best thought of as a language in themselves, which have their own syntax. Some programming languages offer a literal syntax specifically for describing a regex, but usually a regex will be compiled from an existing string. If you create that string from literal syntax, that uses a different set of escape sequences because it is a different kind of thing, created with a different syntax, for a different context, in a different language. That's the simple and direct answer to the question.
There are different needs and requirements. Regexes have to be able to describe things that aren't a single, specific sequence of text. String literals obviously don't have that problem, but they do need a way to, say, include quotation marks in the text. That usually isn't a problem for regex syntax, because the content of the string is already determined by that point. (Some languages have a "regex literal" syntax, typically enclosing the regex in forward slashes. In these languages, forward slashes that are supposed to be part of the regex need to be escaped.)
Although I understand the obvious (\s represents multiple characters and would introduce ambiguity)
Ambiguity isn't actually a concern for most languages that support regex. It often happens that the string literal syntax and the regex syntax use the same sequence to mean different things. For example: \b represents a word boundary in regex syntax, but many languages' string literal syntax also uses it to represent a backspace character, Unicode code point 8. (Unless you meant that \s to mean "any whitespace character" doesn't make sense in the string literal context but only in the regex context - then yes, of course.)
But keep in mind - if the regex is being compiled from a string literal, then first the string literal is interpreted to figure out what the string actually contains, and then that string is used to create the regex. These are separate steps that can and do apply separate rules, so there is no conflict.
This sometimes means that code has to use a double escaping mechanism: first for the string literal, and then for the regex syntax. If you want a regex that matches a literal backslash, you might end up typing four backslashes in a string literal - since that code will create a string that actually contains only two backslashes, which in turn is what the regex syntax requires. (Some languages offer some kind of "raw" string literal facility to work around this.)
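A short C++ sketch of the two separate steps, with invented names (kBackslashPattern, kBackspace, kWord). The compiler interprets the literal first, and only the resulting characters are handed to the regex engine.

```cpp
#include <regex>
#include <string>

// Four backslashes in the literal leave two characters in the string,
// which the regex engine reads as "one literal backslash".
const std::string kBackslashPattern = "\\\\";

// In a string literal, \b is a single backspace character (code point 8)...
const std::string kBackspace = "\b";

// ...while the two characters \ and b mean "word boundary" to the regex
// engine. The raw string keeps the compiler from touching the backslash.
const std::regex kWord(R"(\bcat\b)");
```

Here kBackslashPattern.size() is 2 and it equals R"(\\)". As a regex it matches a one-character string holding a single backslash. kWord should find "cat" in "a cat sat" but not inside "concatenate".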

How do I scan for a "string" constant in a flex scanner?

As part of a class assignment to create a flex scanner, I need to create a rule that recognizes a string constant. That is, a collection of characters between a set of quotation marks. How do I identify a bad string?
The only way a string literal can be "bad" is if it is missing the closing quote mark. Unfortunately, that is not easy to detect, since it is likely that there is another string literal in the program, and the opening quote of the following string literal will be taken as the missing close quote. Once the quote marks are out of synch, the lexical scan will continue incorrectly until the end of file is detected inside a supposed string literal, at which point an error can be reported.
Languages like the C family do not allow string literals to contain newline characters, which allows missing quotes to be detected earlier. In that case, a "bad" string literal is one which contains a newline. It's quite possible that the lexical scan will incorrectly include characters which were intended to be outside of the string literal, but error recovery is somewhat easier than in languages in which a missing quote effectively inverts the entire program.
It's worth noting that it is almost as common to accidentally fail to escape a quote inside a quoted string, which will result in the string literal being closed prematurely; the intended close quote will then be lexed as an open quote, and the eventual lexical error will again be delayed.
(F)lex uses the "longest match" rule to identify which pattern to recognize. If the string pattern doesn't allow newlines, as in C, it might be (in a simplified version, leaving out the complication of escapes) something like:
\"[^"\n]*\"
(remembering that in flex, unlike ., a negated character class such as [^"] does match a newline, so \n has to be excluded explicitly.) If the closing quote is not present in the line, this pattern will not match, and it is likely that the fallback pattern will succeed, matching only the open quote. That's good enough if immediate failure is acceptable, but if you want to do error recovery, you probably want to ignore the rest of the line. In that case, you might add a pattern such as
\"[^"\n]*
That will match every valid string as well, of course (not including the closing quote) but it doesn't matter because the valid string literal pattern's match will be longer (by one character). So the pattern without the closing quote will only match unterminated string literals.
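The interaction between the two rules can be imitated in C++ to see why the longer match wins. This is only an emulation of the longest-match rule with std::regex, not flex itself. The names Token and classify are invented, and the patterns assume newlines are excluded from the character class.

```cpp
#include <regex>
#include <string>

enum class Token { String, UnterminatedString, Other };

// Tries both rules at the start of `input` and keeps the longer match,
// imitating how a flex scanner would choose between them.
Token classify(const std::string& input) {
    static const std::regex closed(R"("[^"\n]*")");  // terminated string
    static const std::regex open(R"("[^"\n]*)");     // no closing quote
    const auto flags = std::regex_constants::match_continuous;
    std::smatch a, b;
    const bool has_closed = std::regex_search(input, a, closed, flags);
    const bool has_open = std::regex_search(input, b, open, flags);
    if (has_closed && (!has_open || a.length() > b.length())) {
        return Token::String;  // longer by exactly the closing quote
    }
    if (has_open) {
        return Token::UnterminatedString;
    }
    return Token::Other;
}
```

For a well-formed literal both rules match and the terminated one is one character longer, so it is chosen. When a newline arrives before the closing quote, only the second rule matches and the input is reported as an unterminated string.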