How does the C/C++ compiler manipulate the escape character ["\"] in source code? How is compiler grammar written for processing that character? What does the compiler do after encountering that character?
Most compilers are divided into stages. The first stage of the compiler front end is called a lexical analyzer or a scanner. This part of the compiler reads the actual characters and creates tokens. It has a state machine which decides, upon seeing an escape character, whether it is a genuine backslash (for example, the second backslash of \\ inside a string) or whether it modifies the next character. The token is output accordingly as the escape character or some other token (such as a tab or a newline) to the next part of the compiler (the parser). The state machine can group several characters into a token.
An interesting note on this subject is Ken Thompson's "Reflections on Trusting Trust".
The paper describes one way a compiler can handle exactly this problem. It shows that a C compiler written in C need not contain an explicit translation of the escape codes into ASCII values, and how to bootstrap a new escape code into the compiler so that knowledge of the ASCII value for the new code is also implicit.
It generally escapes the following character:
In a string literal or character literal, it escapes the next character. For example, \a means 'alert' (flashing the terminal, beeping, or whatever), \n means 'newline', and \xNUM denotes the character whose code is the hexadecimal number NUM.
If it is the last character on a line (some compilers also tolerate trailing whitespace after it), whether within a string or not (and even within a single-line // comment!), it acts as a line continuation: the following newline character is ignored, and the next line is merged with the current line.
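The line-continuation rule can be illustrated with a small stand-alone function. This is only a sketch of the idea, not how any particular compiler implements it; the name splice_lines is hypothetical, and it assumes the backslash is immediately followed by a '\n' character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper mimicking line splicing: every backslash that is
// immediately followed by a newline is deleted together with that newline,
// regardless of whether it sits in a string, a comment, or ordinary code.
std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;        // skip the newline as well as the backslash
            continue;
        }
        out.push_back(src[i]);
    }
    return out;
}
```

Applied to a // comment that ends in a backslash, the function merges the next source line into the comment, which is why such a comment can silently swallow a line of code.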
An escape character together with the following character (like \n) is a single character to the C compiler: the scanner presents it to the parser as a character token, so the parser needs no special syntax rules for the escape character.
I am learning to write a compiler, and it has some rules, such as a single-line string:
char ch[] ="abcd";
and a multi-line string:
printf("This is\
a multi\
string");
I wrote the regular expression
STRING \"([^\"\n]|\\{NEWLINE})*\"
It works fine with single-line strings, but it doesn't work with multi-line strings where a line ends with a '\' character.
What should I change?
A common string pattern is
\"([^"\\\n]|\\(.|\n))*\"
This will match strings which include escaped double quotes (\") and backslashes (\\). It uses \\(.|\n) to allow any character after a backslash. Although some backslash sequences are longer than one character (\x40), none of them include non-alphanumerics after the first character.
It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than throwing an error (which might be more appropriate), you need to do so explicitly:
\"([^"\\\n]|\\(.|\r?\n))*\"
But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.
That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)
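The second step, turning the matched text into the bytes it represents, can also be sketched outside of flex. The function below is a minimal illustration in C++, not a complete implementation: decode_literal_body is a hypothetical name, it expects the text between the quotes, and it handles only \n, \t, backslashed newlines, and single-character escapes such as \" and \\ (no octal or hexadecimal escapes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode the text found between the quotes of a
// C-style string literal. Only a small subset of escapes is handled.
std::string decode_literal_body(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        char c = body[i];
        if (c != '\\' || i + 1 == body.size()) {
            out.push_back(c);            // ordinary character (or a lone trailing backslash)
            continue;
        }
        char e = body[++i];              // the character after the backslash
        switch (e) {
        case 'n':  out.push_back('\n'); break;   // \n becomes byte 10
        case 't':  out.push_back('\t'); break;
        case '\n': break;                        // backslashed newline: emit nothing
        case '\r':                               // backslash before CR or CR-LF: emit nothing
            if (i + 1 < body.size() && body[i + 1] == '\n') ++i;
            break;
        default:   out.push_back(e); break;      // \" and \\ (assumption: unknown escapes keep the character)
        }
    }
    return out;
}
```

For the multi-line printf example in the question, the decoded body is "This isa multistring", since each backslash and the newline after it are dropped.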
Additionally, you need to think about error-handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly closed.
If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable because it will then scan the contents of the string as program text leading to a cascade of errors. If you are not attempting error recovery it doesn't matter, but if you are it is usually better to at least recognize the unterminated string as such up to the newline, using a different pattern.
As part of a class assignment to create a flex scanner, I need to create a rule that recognizes a string constant. That is, a collection of characters between a set of quotation marks. How do I identify a bad string?
The only way a string literal can be "bad" is if it is missing the closing quote mark. Unfortunately, that is not easy to detect, since it is likely that there is another string literal in the program, and the opening quote of the following string literal will be taken as the missing close quote. Once the quote marks are out of synch, the lexical scan will continue incorrectly until the end of file is detected inside a supposed string literal, at which point an error can be reported.
Languages like the C family do not allow string literals to contain newline characters, which allows missing quotes to be detected earlier. In that case, a "bad" string literal is one which contains a newline. It's quite possible that the lexical scan will incorrectly include characters which were intended to be outside of the string literal, but error recovery is somewhat easier than in languages in which a missing quote effectively inverts the entire program.
It's worth noting that it is almost as common to accidentally fail to escape a quote inside a quoted string, which will result in the string literal being closed prematurely; the intended close quote will then be lexed as an open quote, and the eventual lexical error will again be delayed.
(F)lex uses the "longest match" rule to identify which pattern to recognize. If the string pattern doesn't allow newlines, as in C, it might be (in a simplified version, leaving out the complication of escapes) something like:
\"[^"\n]*\"
(remembering that in flex, a negated character class such as [^"] does match a newline unless \n is excluded explicitly, whereas . does not match a newline.) If the closing quote is not present in the line, this pattern will not match, and it is likely that the fallback pattern will succeed, matching only the open quote. That's good enough if immediate failure is acceptable, but if you want to do error recovery, you probably want to ignore the rest of the line. In that case, you might add a pattern such as
\"[^"\n]*
That will match every valid string as well, of course (not including the closing quote) but it doesn't matter because the valid string literal pattern's match will be longer (by one character). So the pattern without the closing quote will only match unterminated string literals.
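The same distinction can be made in a hand-written scanner. The following is a rough C++ sketch of the idea, again ignoring escapes; StringKind, StringMatch, and scan_string are hypothetical names, and the caller is assumed to pass the index of an opening quote.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class StringKind { Terminated, Unterminated };

struct StringMatch {
    StringKind kind;
    std::size_t length;   // characters consumed, counting the quote(s)
};

// Hypothetical helper: `pos` must be the index of an opening '"' in `src`.
// Scans to the closing quote, or stops just before a newline or at end of input.
StringMatch scan_string(const std::string& src, std::size_t pos) {
    std::size_t i = pos + 1;
    while (i < src.size() && src[i] != '"' && src[i] != '\n') ++i;
    if (i < src.size() && src[i] == '"')
        return {StringKind::Terminated, i - pos + 1};   // includes the closing quote
    return {StringKind::Unterminated, i - pos};         // the newline is not consumed
}
```

An unterminated result lets the caller report one error and resume scanning at the next line, instead of treating the rest of the string as program text.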
I am writing a C++ program to solve a common problem of message decoding. Part of the problem requires me to get a bunch of random characters, including '\', and map them to a key, one by one.
My program works fine in most cases, except that when I read characters such as '\' from a string, I obviously get a completely different character representation (e.g. '\0' yields a null character, or '\' simply escapes itself when it needs to be treated as a character).
Since I am not supposed to have any control on what character keys are included, I have been desperately trying to find a way to treat special control characters such as the backslash as the character itself.
My questions are basically these:
Is there a way to turn all special characters off within the scope of my program?
Is there a way to override current digraphs definitions of special characters and define them as something else (like digraphs using very rare keys)?
Is there some obscure method on the String class that I missed which can force the actual character on the string to be read instead of the pre-defined constant?
I have been trying to look for a solution for hours now but all possible fixes I've found are for other languages.
Any help is greatly appreciated.
If you read in a string like "\0" from stdin or a file, it will be treated as two separate characters: '\\' and '0'. There is no additional processing that you have to do.
Escaping characters is only used for string/character literals. That is to say, when you want to hard-code something into your source code.
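A quick way to convince yourself is to read such text through a stream and inspect what arrives. This is a small sketch; read_word and simulated_input are hypothetical names, and a std::istringstream stands in for stdin or a file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read one whitespace-delimited word from any input stream.
std::string read_word(std::istream& in) {
    std::string word;
    in >> word;
    return word;
}

// Simulates a user typing the two characters \ and 0. The doubled backslash
// is needed only because this test input is itself written as a literal.
std::string simulated_input() {
    std::istringstream in("\\0");
    return read_word(in);
}
```

The word that comes back has length 2: a backslash followed by the digit 0. No null character is produced, because escape processing happens only when the compiler reads a literal in the source code.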
This is a question mostly concerning WinAPI RegSetValueEx. If you look at its description on MSDN, you'll find:
lpData [in] The data to be stored.
For string-based types, such as REG_SZ, the string must be null-terminated. With the REG_MULTI_SZ data
type, the string must be terminated with two null characters. A
backslash must be preceded by another backslash as an escape
character. For example, specify "C:\\mydir\\myfile" to store the
string "C:\mydir\myfile".
The question I have is: do I really need to escape backslashes? I've never done that before and it worked perfectly fine.
This is indeed a documentation error. You do not need to escape backslashes here. The exact string that you send to this API is what will be stored in the registry. No processing of backslashes will be performed.
Now, it's true that in C and C++ you need to escape certain characters in string literals, but that's not pertinent to Win32 API documentation. That's an issue of source code to object code translation for specific languages and quite beyond the remit of this documentation.
Yes, because \ has a meaning in C++, whereas \\ means an ordinary backslash.
When \ appears in a string literal, the C++ compiler looks at the next character and converts the combination into something else (for example, \n is converted into a "newline" character). \\ is converted into a regular backslash. This is called "escaping" (historically, on old terminals, the ESC+key combination was used for many keys that were not on the keyboard).
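To see what actually ends up in the program, compare the escaped literal from the documentation with the same path written as a C++11 raw string literal. This is only an illustration; escaped_path and raw_path are hypothetical names.

```cpp
#include <cassert>
#include <string>

// The compiler turns each \\ into one backslash, so the string the program
// holds (and would pass on to an API) contains single backslashes only.
std::string escaped_path() { return "C:\\mydir\\myfile"; }    // 15 characters

// A raw string literal (C++11) has no escape processing at all.
std::string raw_path()     { return R"(C:\mydir\myfile)"; }   // the same 15 characters
```

Both functions return identical strings, which is why the doubled backslash is a matter of how the source code is written and not of what the API receives.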
I need to convert strings from one encoding (UTF-8) to another. The problem is that the target encoding does not contain all the characters of the source encoding, and the libc iconv(3) function fails in that situation. What I want is to perform the conversion anyway, with each problematic character replaced in the output string by some symbol, say '?'.
Programming language is C or C++.
Is there a way to address this issue?
Try appending "//TRANSLIT" or "//IGNORE" to the end of the destination charset string. Note that these suffixes are GNU extensions (supported by the GNU C library and GNU libiconv), so other iconv implementations may not accept them.
From iconv_open(3):
//TRANSLIT
When the string "//TRANSLIT" is appended to tocode, transliteration
is activated. This means that when a character cannot be
represented in the target character set, it can be approximated
through one or several similarly looking characters.
//IGNORE
When the string "//IGNORE" is appended to tocode, characters
that cannot be represented in the target character set will be
silently discarded.
Alternatively, manually skip a character and insert a substitution in the output when iconv(3) returns (size_t)-1 with errno set to EILSEQ.
Another option is a regex built from the source ranges that can be translated, used to swap in a placeholder for any characters that don't match.
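The manual approach might look roughly like the sketch below. It assumes UTF-8 input and the behaviour of glibc's iconv, which stops with errno set to EILSEQ and leaves the input pointer at the start of the offending character; other implementations may behave differently. The function name convert_with_placeholder is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-8 text to `to_charset`, writing '?' for
// every character that iconv refuses to convert.
std::string convert_with_placeholder(const std::string& in, const char* to_charset) {
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string src = in;                 // iconv needs a mutable char*
    char* inbuf = &src[0];
    std::size_t inleft = src.size();
    std::string out;
    char buf[256];

    while (inleft > 0) {
        char* outbuf = buf;
        std::size_t outleft = sizeof buf;
        std::size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
        out.append(buf, sizeof buf - outleft);   // keep whatever was converted
        if (rc != (std::size_t)-1) break;        // all input consumed
        if (errno == E2BIG) continue;            // output buffer full: go round again
        if (errno == EILSEQ || errno == EINVAL) {
            // Skip one UTF-8 character: its lead byte plus any continuation bytes.
            ++inbuf; --inleft;
            while (inleft > 0 && (static_cast<unsigned char>(*inbuf) & 0xC0) == 0x80) {
                ++inbuf; --inleft;
            }
            out.push_back('?');
            continue;
        }
        iconv_close(cd);
        throw std::runtime_error("iconv failed");
    }
    iconv_close(cd);
    return out;
}
```

Unlike "//IGNORE", this keeps one visible placeholder per dropped character, at the cost of knowing how to step over a character in the source encoding.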