String regex using lex [duplicate]

I am learning to write a compiler, and the language has rules for single-line strings like:
char ch[] ="abcd";
and multi-line strings like:
printf("This is\
a multi\
string");
I wrote the regular expression
STRING \"([^\"\n]|\\{NEWLINE})*\"
It works fine with single-line strings, but it doesn't work with multi-line strings where a line ends with a '\' character.
What should I change?

A common string pattern is
\"([^"\\\n]|\\(.|\n))*\"
This will match strings which include escaped double quotes (\") and backslashes (\\). It uses \\(.|\n) to allow any character after a backslash. Although some backslash sequences are longer than one character (\x40), none of them include non-alphanumerics after the first character.
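As a rough way to experiment with this pattern outside of flex, here is a minimal sketch using std::regex. The function name is_string_token is made up for illustration, and the pattern is transcribed into ECMAScript syntax (where the double quote needs no backslash), so treat it as an approximation of the flex rule rather than the rule itself.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` is exactly one string token
// according to the common string pattern.
bool is_string_token(const std::string& text) {
    // Either an ordinary character (not a quote, backslash or newline),
    // or a backslash followed by any character, including a newline.
    static const std::regex pattern(R"re("([^"\\\n]|\\(.|\n))*")re");
    return std::regex_match(text, pattern);
}
```

Called with the multi-line printf argument from the question, with an ordinary one-line string, or with a string containing \" it should report a match; a string with a missing closing quote or a raw, unescaped newline should be rejected.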
It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than throwing an error (which might be more appropriate), you need to do so explicitly:
\"([^"\\\n]|\\(.|\r?\n))*\"
But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.
That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)
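If you go the "rescan" route, the conversion step might look something like the following sketch. It is written in plain C++ rather than as flex start conditions, the name decode_string_body is hypothetical, and it only handles a handful of single-character escapes; \x and octal escapes are deliberately left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: takes the text between the quotes of a matched
// string token and returns the bytes it denotes.
std::string decode_string_body(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        char c = body[i];
        if (c != '\\' || i + 1 == body.size()) {
            out += c;  // ordinary character
            continue;
        }
        char next = body[++i];  // character after the backslash
        switch (next) {
            case 'n': out += '\n'; break;  // \n becomes byte 10
            case 't': out += '\t'; break;
            case 'r': out += '\r'; break;
            case '0': out += '\0'; break;
            case '\n': break;  // backslash-newline: drop both
            case '\r':         // backslash CR LF: drop all three
                if (i + 1 < body.size() && body[i + 1] == '\n') ++i;
                break;
            default: out += next; break;  // \" and \\ (and unknown escapes)
        }
    }
    return out;
}
```

Dropping the backslash of an unknown escape is just one possible policy; a real compiler would probably want to report it.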
Additionally, you need to think about error-handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly closed.
If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable because it will then scan the contents of the string as program text leading to a cascade of errors. If you are not attempting error recovery it doesn't matter, but if you are it is usually better to at least recognize the unterminated string as such up to the newline, using a different pattern.

Related

Why does it give me an error when opening a txt file? [duplicate]

I'm really confused about the escape character " \ " and its relation to the windows file system. In the following example:
char* fwdslash = "c:/myfolder/myfile.txt";
char* backslash = "c:\myfolder\myfile.txt";
char* dblbackslash = "c:\\myfolder\\myfile.txt";
std::ifstream file(fwdslash); // Works
std::ifstream file(dblbackslash); // Works
std::ifstream file(backslash); // Doesn't work
I get what you are doing here is escaping a special character so you can use it in this string. In no way by placing a backslash in a string literal or std::string do you actually change the string ---
---Edit: This is completely wrong, and the source of my confusion---
So it seems that the escape character is only treated by certain classes or things to mean something other than a backslash, like outputting on the console, ie., std::cout << "\hello"; will not print the backslash. In the case of ifstream (or I'm not sure if the same applies with the C fopen() version), it must be that this class or function treats backslashes as escape characters. I'm wondering, since the Windows file system uses backslashes wouldn't it make sense for it to accept the simple string with backslashes, ie., "c:\myfolder\myfile.txt" ? Trying it this way fails.
Also, in my compiler (Visual Studio) when I include headers I can use .\ and ..\ to mean either the current folder, or the parent folder. I'm pretty sure the \ in this isn't related to the escape character, but are these forms specific to Windows, part of the C preprocessor, or part of the C or C++ language? I know that backslashes are a Windows thing, so I can't see any reason another system would expect backslashes even when using .\ and ..\
Thanks.
In no way by placing a backslash in a string literal[...] do you actually change the string
You do. The compiler actually modifies the literal you wrote before embedding it into the compiled program. If a backslash is found in a string or character literal while parsing source code, the backslash itself is not stored; together with the next character it forms an escape sequence. \n becomes a newline (line feed), \r becomes a carriage return, etc. For escaped characters without special meaning, treatment is implementation-defined. Usually the backslash is simply dropped and the character is left unchanged.
You cannot just pass "c:\myfolder\myfile.txt" because that is not the string which will be seen by your program. With a compiler that drops the backslash of an unrecognised escape, your program will see "c:myfoldermyfile.txt" instead. This is why an escaped backslash has a special meaning: it allows embedding backslashes in the actual string your program will see.
The solution is to either escape your backslashes, or use raw string literals (C++11 onwards):
const char* path = R"(c:\myfolder\file.txt)";
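A quick way to convince yourself is to compare what the different spellings contain at run time. This is only a small sketch; the function names are invented for the comparison, and the unescaped variant is left out on purpose because compilers typically warn about it.

```cpp
#include <string>

// Each function returns the characters the program actually sees.
std::string escaped_path() { return "c:\\myfolder\\myfile.txt"; }
std::string raw_path()     { return R"(c:\myfolder\myfile.txt)"; }
std::string forward_path() { return "c:/myfolder/myfile.txt"; }

// Three characters only: a newline, a single backslash and a tab.
std::string newline_backslash_tab() { return "\n\\\t"; }
```

The escaped and raw versions should compare equal, each containing exactly two single backslashes, while the forward-slash version has the same length but different separators.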
Filenames given to #include directive are not string literals, even if they are in form "path\to\header", so substitution rules are not applied to them.
A single backslash escapes the next character. To get a literal backslash you need to escape the backslash itself by doubling it. As for the forward slash, Windows accepts it as a path separator as well, probably for compatibility with the Unix tradition.
The same applies in the Java world: a single forward slash works as a path separator on both Windows and Unix, and on Windows so does a doubled backslash.
To make it more clear why single backslash doesn't work, just remember that the following String practically produces a newline, a backslash and a tab:
"\n\\\t"
i.e. an example like:
"c:\my\next\file.txt"
would actually produce:
"c:my
ext<FF>ile.txt"
(where <FF> stands for the form feed character produced by \f; \m is not a standard escape, and most compilers warn about it and keep the m)
When you declare a C string literal, a backslash escapes the next character. This is what lets you write newlines (\n), nulls (\0), carriage returns (\r) and so on, and it is why the following does not contain the path you expect:
char* backslash = "c:\myfolder \myfile.txt";

Unexpected end of regex when ascii character

Minimal Verifiable Example
#include<regex>
int main(){
std::regex re("\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out [0]");
}
Why is this giving me a regex_error? My debugger's error message is "unexpected end of regex when ascii character", but I am just trying to match the literal above and I don't see where the issue is.
\u is the beginning of the escape sequence for a Unicode code point, so you need to escape the backslash. Also, [...] is a character set match, so the brackets need to be escaped if you want to match them literally.
std::regex re("\\\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \\[0\\]");
If you're using C++11 or newer, it's helpful to use raw strings when writing regular expressions, so you don't have to double the backslashes.
std::regex re(R"(\\u_nic400_ib_ext_m_ib_ar_fifo_wr_mux/mux_0_1_out \[0\])");
The doubling of backslashes is only relevant if you're writing the regexp as a string literal. If you're constructing it dynamically at run time, you don't need to double the escapes, because you're feeding the string directly to the regexp engine and it is not being parsed as C++ source code. The regexp-level escapes (\\ for a backslash, \[ and \] for brackets) are still required.
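When the text to match only becomes known at run time, one option is to add the regexp-level escapes programmatically. The following is a minimal sketch, assuming the default ECMAScript syntax of std::regex; escape_for_regex and matches_literally are hypothetical helper names, not part of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: puts a backslash in front of every character that
// has a special meaning in an ECMAScript regular expression.
std::string escape_for_regex(const std::string& literal) {
    static const std::string meta = R"(\^$.|?*+()[]{})";
    std::string out;
    for (char c : literal) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// True when `text` is exactly the literal, compared through std::regex.
bool matches_literally(const std::string& literal, const std::string& text) {
    return std::regex_match(text, std::regex(escape_for_regex(literal)));
}
```

For the signal name in the question, escape_for_regex should produce the same pattern text as the raw-string version above.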

How do I scan for a "string" constant in a flex scanner?

As part of a class assignment to create a flex scanner, I need to create a rule that recognizes a string constant. That is, a collection of characters between a set of quotation marks. How do I identify a bad string?
The only way a string literal can be "bad" is if it is missing the closing quote mark. Unfortunately, that is not easy to detect, since it is likely that there is another string literal in the program, and the opening quote of the following string literal will be taken as the missing close quote. Once the quote marks are out of synch, the lexical scan will continue incorrectly until the end of file is detected inside a supposed string literal, at which point an error can be reported.
Languages like the C family do not allow string literals to contain newline characters, which allows missing quotes to be detected earlier. In that case, a "bad" string literal is one which contains a newline. It's quite possible that the lexical scan will incorrectly include characters which were intended to be outside of the string literal, but error recovery is somewhat easier than in languages in which a missing quote effectively inverts the entire program.
It's worth noting that it is almost as common to accidentally fail to escape a quote inside a quoted string, which will result in the string literal being closed prematurely; the intended close quote will then be lexed as an open quote, and the eventual lexical error will again be delayed.
(F)lex uses the "longest match" rule to identify which pattern to recognize. If the string pattern doesn't allow newlines, as in C, it might be (in a simplified version, leaving out the complication of escapes) something like:
\"[^"\n]*\"
(Remember that in flex a negated character class such as [^"] matches a newline unless \n is listed explicitly; only . excludes newlines by itself.) If the closing quote is not present in the line, this pattern will not match, and it is likely that the fallback pattern will succeed, matching only the open quote. That's good enough if immediate failure is acceptable, but if you want to do error recovery, you probably want to ignore the rest of the line. In that case, you might add a pattern such as
\"[^"\n]*
That will match every valid string as well, of course (not including the closing quote) but it doesn't matter because the valid string literal pattern's match will be longer (by one character). So the pattern without the closing quote will only match unterminated string literals.
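To illustrate how the longest-match rule picks between the two patterns, here is a small simulation in C++ with std::regex. It is not how flex is implemented; the Token and Match types and the longest_match function are invented for this sketch, and both patterns exclude newlines, as in C.

```cpp
#include <cstddef>
#include <regex>
#include <string>

enum class Token { String, UnterminatedString, Other };

struct Match {
    Token kind;
    std::size_t length;  // number of characters consumed
};

// Tries both patterns at the start of `input` and keeps the longer match.
Match longest_match(const std::string& input) {
    static const std::regex terminated(R"re("[^"\n]*")re");
    static const std::regex unterminated(R"re("[^"\n]*)re");
    const auto flags = std::regex_constants::match_continuous;

    Match best{Token::Other, 0};
    std::smatch m;
    if (std::regex_search(input, m, terminated, flags)) {
        best = {Token::String, static_cast<std::size_t>(m.length(0))};
    }
    if (std::regex_search(input, m, unterminated, flags) &&
        static_cast<std::size_t>(m.length(0)) > best.length) {
        best = {Token::UnterminatedString, static_cast<std::size_t>(m.length(0))};
    }
    return best;
}
```

For a correctly closed literal the first pattern wins by one character; the second pattern only wins when the line ends before a closing quote is found.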

What's the magic behind the escape (\) character?

How does the C/C++ compiler manipulate the escape character ["\"] in source code? How is compiler grammar written for processing that character? What does the compiler do after encountering that character?
Most compilers are divided into parts. The first stage of the compiler front end is called a lexical analyzer or a scanner. This part of the compiler reads the actual characters and creates tokens. It has a state machine which decides, upon seeing a backslash, whether it stands for a literal backslash (as \\ does inside a string) or modifies the next character (as in \t or \n). The resulting character, such as a backslash, a tab or a newline, is then passed on as part of a token to the next part of the compiler (the parser). The state machine can group several characters into a token.
An interesting note on this subject is Ken Thompson's "Reflections on Trusting Trust".
The paper describes one way a compiler could handle exactly this problem. It shows how a C compiler written in C need not contain an explicit translation of the escape codes into ASCII values, and how to bootstrap a new escape code into the compiler so that the understanding of the ASCII value for the new code is also implicit.
It generally escapes the following character:
In a string literal or character literal, it means escape the next character. For example, \a means 'alert' (flashing the terminal, beeping or whatever), \n means 'linefeed', and \xNUM gives a character by its hexadecimal code.
If it appears as the last character before a newline, whether within a string or not (and even within a single-line // comment!), it acts as a line continuation: the following newline character is ignored, and the next line is merged with the current line.
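The line-continuation case is easy to picture as a pass over the source text that runs before tokens are formed. Here is a minimal sketch of such a pass; splice_lines is a hypothetical name, and it assumes plain \n line endings.

```cpp
#include <cstddef>
#include <string>

// Removes every backslash that is immediately followed by a newline,
// together with that newline, so the two lines are merged into one.
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the newline as well
            continue;
        }
        out += source[i];
    }
    return out;
}
```

Because it does not look at context, the same merge happens inside strings, in ordinary code and at the end of a // comment, which is why a trailing backslash there pulls the next line into the comment.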
An escape character with a following character (like \n) is a single character to the C compiler: the scanner presents it to the parser as part of a character or string token, so the parser needs no special syntax rules for the escape character.