Parsing quotes within a string literal - regex

Why do strings in almost all languages require that you escape quotation marks?
For instance, if you have a string such as
"hello world""
why do languages want you to write it as
"hello world\""
Isn't it enough to require that the string starts and ends with a quotation mark?
You can treat the end quote as the terminating quote for the string. If there is no end quote then there is an error. You can also assume that a string starts and ends on a single line and does not span multiple lines.

Suppose I want to put ", " into a string literal (so the literal contains quotes).
If I did that without escaping, I’d write "", "". This looks like two empty string literals separated by a comma. If I want to, for example, call a function with this string literal, I would write f("", ""). This looks to the compiler like I am passing two arguments, both empty strings. How can it know the difference?
The answer is, it can’t. Perhaps in simple cases like "hello world"", it might be able to figure it out, for at least some languages. But the set of strings which were unambiguous and didn’t need escaping would be different for different languages and it would be hard to keep track of which was which, and for any language there would be some ambiguous case which would need escaping anyway. It is much easier for the compiler writer to skip all those edge cases and just always require you to escape quotation marks, and it is probably also easier for the programmer.

Otherwise, the compiler would see the second quotation mark as the end of your string, and then a stray quotation mark following it, causing an error.
"The use of the word "escape" really means to temporarily escape out of parsing the text and into another mode where the subsequent character is treated differently." Source: https://softwareengineering.stackexchange.com/questions/112731/what-does-backslash-escape-character-really-escape

How would the compiler know which quote ended the string?
UPDATE:
In C & C++, this is a perfectly fine string:
printf("Hel" "lo" "," "Wor""ld" "!");
It prints Hello,World! (with no space after the comma, because none of the literals contains one).
Or how about in C#:
Console.WriteLine("Hello, "+"World!");
Now should that print Hello, World! or Hello, "+"World! ?

The reason you have to escape the second quotation mark is so the compiler knows that the quotation mark is part of the string, and not a terminator. If you weren't escaping it, the compiler would only pick up hello world rather than hello world"
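To make the rule concrete, here is a minimal sketch of how a scanner might look for the end of a literal. The helper name find_closing_quote is hypothetical, and the sketch assumes backslash is the only escape mechanism and that a literal may not span lines.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: `open` is the index of the opening quote in `src`.
// Returns the index of the quote that closes the literal, or
// std::string::npos if the line or the input ends first.
std::size_t find_closing_quote(const std::string& src, std::size_t open) {
    for (std::size_t i = open + 1; i < src.size(); ++i) {
        if (src[i] == '\\') {
            ++i;        // skip the escaped character, whatever it is
        } else if (src[i] == '"') {
            return i;   // the first unescaped quote ends the literal
        } else if (src[i] == '\n') {
            break;      // assumption: literals stay on one line
        }
    }
    return std::string::npos;
}
```

For "hello world"" the function stops at the first of the two trailing quotes and leaves a stray quote behind, while for "hello world\"" it skips the escaped quote and stops at the final one.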

Let's do a practical example.
How should this be translated?
"Hello"+"World"
'HelloWorld' or 'Hello"+"World'
vs
"Hello\"+\"World"
By escaping the quote characters, you remove the ambiguity, and code should have zero ambiguity to the compiler: every conforming compiler should read the same source the same way. It's basically a way of telling the compiler "I know this looks weird, but I really mean that this is how it should look".

Related

Tokenize parse option

Consider a slightly different toy example from my previous question:
. local string my first name is Pearly,, and my surname is Spencer
. tokenize "`string'", parse(",,")
. display "`1'"
my first name is Pearly
. display "`2'"
,
. display "`3'"
,
. display "`4'"
and my surname is Spencer
I have two questions:
Does tokenize work as expected in this case? I thought local macro 2 should be ,, instead of , while local macro 3 would contain the rest of the string (and local macro 4 would be empty).
Is there a way to force tokenize to respect the double comma as a parsing character?
tokenize -- and gettoken too -- won't, from what I can see, accept repeated characters such as ,, as a composite parsing character. ,, is not illegal as a specification of parsing characters, but is just understood as meaning that , and , are acceptable parsing characters. The repetition in practice is ignored, just as adding "My name is Pearly" after "My name is Pearly" doesn't add information in a conversation.
To back up: know that without other instructions (such as might be given by a syntax command) Stata will parse a string according to spaces, except that double quotes (or compound double quotes) bind harder than spaces separate.
tokenize -- and gettoken too -- will accept multiple parse characters pchars and the help for tokenize gives an example with space and + sign. (It's much more common, in my experience, to want to use space and comma , when the syntax for a command is not quite what syntax parses completely.)
A difference between space and the other parsing characters is that spaces are discarded but other parsing characters are not discarded. The rationale here is that those characters often have meaning you might want to take forward. Thus in setting up syntax for a command option, you might want to allow something like myoption( varname [, suboptions])
and so whether a comma is present and other stuff follows is important for later code.
With composite characters, so that you are looking for, say, ,, as a separator, I think you'd need to loop around using substr() or an equivalent. In practice an easier work-around might be first to replace your composite characters with some neutral single character and then apply tokenize. That relies on knowing that the neutral character does not occur otherwise. Thus I often use # as a placeholder character because I know that it will not occur as part of variable or scalar names and it's not part of function names or an operator.
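The placeholder idea is not specific to Stata. As an illustration only, here is a sketch in C++ (not Stata code); the function name split_on_composite and the default placeholder # are assumptions, and unlike tokenize it keeps surrounding spaces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on a multi-character separator by
// first replacing it with a single placeholder character.
// Assumption: the placeholder does not otherwise occur in `text`.
std::vector<std::string> split_on_composite(std::string text,
                                            const std::string& sep,
                                            char placeholder = '#') {
    if (sep.empty()) {
        return {text};  // nothing to split on
    }
    // Step 1: replace every occurrence of the composite separator.
    for (std::size_t pos = text.find(sep); pos != std::string::npos;
         pos = text.find(sep, pos + 1)) {
        text.replace(pos, sep.size(), 1, placeholder);
    }
    // Step 2: split on the single placeholder character.
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (std::size_t pos = text.find(placeholder); pos != std::string::npos;
         pos = text.find(placeholder, start)) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    parts.push_back(text.substr(start));
    return parts;
}
```

Applied to the example string with ",," as the separator, this yields two pieces, one on each side of the double comma.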
For what it's worth, I note that in first writing split I allowed composite characters as separators. As I recall, a trigger to that was a question on Statalist which was about data for legal cases with multiple variations on VS (versus) to indicate which party was which. This example survives into the help for the official command.
On what is a "serious" bug, much depends on judgment. I think a programmer would just discover on trying it out that composite characters don't work as desired with tokenize in cases like yours.

Regex For Strings in C

I'm looking to make a regular expression for some strings in C.
This is what I have so far:
Strings in C are delimited by double quotes (") so the regex has to be surrounded by \" \".
The string may not contain newline characters so I need to do [^\n] (I think).
The string may also contain double quotes or backslash characters if and only if they're escaped. Therefore [\\ \"] (again, I think).
Other than that anything else goes.
Any help is much appreciated. I'm kind of lost on how to start writing this regex.
A simple flex pattern to recognize string literals (including literals with embedded line continuations):
["]([^"\\\n]|\\.|\\\n)*["]
That will allow
"string with \
line continuation"
But not
"C doesn't support
multiline strings"
If you don't want to deal with line continuations, remove the \\\n alternative. If you need trigraph support, it gets more irritating.
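Outside flex, roughly the same pattern can be tried with std::regex. This is a sketch under the assumption that the default ECMAScript grammar is used; the function name is_c_string_literal is hypothetical, and trigraphs are ignored.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is, in its entirety, one
// double-quoted literal in the sense of the flex pattern above.
bool is_c_string_literal(const std::string& text) {
    // Alternatives inside the group, in order:
    //   [^"\\\n]  any character except quote, backslash, newline
    //   \\.       backslash followed by any non-newline character
    //   \\\n      backslash followed by a newline (line continuation)
    static const std::regex pattern(R"re("([^"\\\n]|\\.|\\\n)*")re");
    return std::regex_match(text, pattern);
}
```

std::regex_match requires the whole input to match, so a quote in the middle that is not escaped makes the check fail rather than matching a prefix.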
Although that recognizes strings, it doesn't attempt to make sense of them. Normally, a C lexer will want to process strings with backslash sequences, so that "\"\n" is converted to the two characters "NL (0x22 0x0A). You might, at some point, want to take a look at, for example, Optimizing flex string literal parsing (although that will need to be adapted if you are programming in C).
Flex patterns are documented in the flex manual. It might also be worthwhile reading a good reference on regular expressions, such as John Levine's excellent book on Flex and Bison.
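The backslash processing mentioned above could be sketched like this. The helper name decode_simple_escapes is hypothetical, and the sketch deliberately handles only a few simple escapes plus line continuations; octal, hexadecimal and universal character names are left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode simple escapes in the body of a literal
// (the text between the quotes, without the quotes themselves).
std::string decode_simple_escapes(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '\\' || i + 1 == body.size()) {
            out += body[i];   // ordinary character (or a trailing backslash)
            continue;
        }
        const char next = body[++i];
        switch (next) {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case '\n': break;                 // line continuation: drop both
            default:   out += next; break;    // \" \\ \' and anything else
        }
    }
    return out;
}
```

Given the four source characters \"\n, this returns the two characters " and NL (0x22 0x0A).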

How can I detect string literals in code?

I want to write a string-detecting function for my obfuscator, but I'm stuck debugging it. I can write a pattern for strings like cout<<"Hello world" or cout<<"2+2=4"
but not for
cout<<"2+2"<<"Trolll";
cout<<"asd \" trololo";
I simply want to extract whatever is between " and ". I tried
["][\x20-\x74]*["]
but for, e.g.,
cout<<"asdfg"<<"asdsfgh";
it gives me "asdfg"<<"asdsfgh", not "asdfg".
Any ideas how to build the expression for string extraction?
Regular expressions, by default, are greedy. This means that they try to match as much as possible. There are several ways of preventing this. The easiest is to just make them non-greedy. You can make the quantifier * non-greedy by appending ?:
"[\x20-\x74]*?"
(Incidentally, there’s no need for the […] around the quotes.)
However, it’s helpful to be explicit and precise in descriptions. One reason for this is that the above expression is still buggy. For instance, it doesn’t match "\"" correctly.
A string literal in C++ is quite well-defined, and your definition simply doesn’t match it. The actual definition (§2.14.5 of the C++11 standard, with escape sequences defined in §2.14.3) is (simplified): a char-sequence surrounded by ", where a char-sequence is a sequence of zero or more characters except ", \ and newline, or an escape-sequence.
An escape-sequence, in turn, is defined as either simple, octal or hexadecimal. Taken together, this leaves us with (again, slightly simplified):
"([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]+|x[0-9a-fA-F]+))*"
There is no need for the non-greedy specifier now, since we explicitly exclude " from matching earlier, unless escaped.
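As a sketch of how such a pattern might be used for extraction, the following applies a variant of it (one that also excludes a raw newline) with std::regex. The function name extract_string_literals is hypothetical, and the sketch assumes the input contains no comments, character literals such as '"', or raw string literals, all of which would confuse a plain regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every double-quoted literal found in `code`,
// quotes included, in order of appearance.
std::vector<std::string> extract_string_literals(const std::string& code) {
    static const std::regex literal(
        R"re("([^"\\\n]|\\(['"?\\abfnrtv]|[0-7]+|x[0-9a-fA-F]+))*")re");
    std::vector<std::string> found;
    for (std::sregex_iterator it(code.begin(), code.end(), literal), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For cout<<"asdfg"<<"asdsfgh"; this finds two separate literals, and for cout<<"asd \" trololo"; it finds a single literal that contains the escaped quote.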

C++ - Escaping or disabling backslash on string

I am writing a C++ program to solve a common problem of message decoding. Part of the problem requires me to get a bunch of random characters, including '\', and map them to a key, one by one.
My program works fine in most cases, except that when I read characters such as '\' from a string, I obviously get a completely different character representation (e.g. '\0' yields a null character, or '\' simply escapes itself when it needs to be treated as a character).
Since I am not supposed to have any control on what character keys are included, I have been desperately trying to find a way to treat special control characters such as the backslash as the character itself.
My questions are basically these:
Is there a way to turn all special characters off within the scope of my program?
Is there a way to override the current digraph definitions of special characters and define them as something else (like digraphs using very rare keys)?
Is there some obscure method on the String class that I missed which can force the actual character on the string to be read instead of the pre-defined constant?
I have been trying to look for a solution for hours now but all possible fixes I've found are for other languages.
Any help is greatly appreciated.
If you read in a string like "\0" from stdin or a file, it will be treated as two separate characters: '\\' and '0'. There is no additional processing that you have to do.
Escaping characters is only used for string/character literals. That is to say, when you want to hard-code something into your source code.
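A small sketch can confirm this. The helper name read_raw_line is hypothetical; it just wraps std::getline, which performs no escape processing.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read one line exactly as it arrives from a stream.
// No escape processing happens here; a backslash is just a character.
std::string read_raw_line(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}
```

If a std::istringstream (or std::cin) supplies the two characters \ and 0, the returned string has length 2 and its first character is '\\'. Only the literal "\0" written in source code is turned into a single null character by the compiler.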

Why must C/C++ string literal declarations be single-line?

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";
Why must C/C++ string literal declarations be single-line?
The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
 Some\n\
 Formatted\n\
 String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
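Adjacent string literals are concatenated in translation phase 6, after preprocessing directives have been executed and before the program is analyzed and translated, so these will end up as a single string literal at compile-time.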
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
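The equivalence of the concatenated form and the single-literal form can be checked directly. This is a minimal sketch; the names concatenated, single and same_text are hypothetical.

```cpp
#include <cstring>

// Three adjacent literals, which the compiler joins into one.
const char* const concatenated =
    " Some\n"
    " Formatted\n"
    " String Literal\n";

// The same text written as one literal.
const char* const single = " Some\n Formatted\n String Literal\n";

// Returns true when both spellings produce identical characters.
bool same_text() {
    return std::strcmp(concatenated, single) == 0;
}
```

The line-splicing form is left out of the sketch because its result depends on the exact whitespace at the start of each continued source line.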
One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline string literals? Not really. It can actually be less work for the compiler. But you were going to come back for it the next day anyhow in most cases. But nobody who was filling out a punch card was going to put large amounts of text that wasn't needed in their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.
In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string
literal results in a new-line in the resulting execution
string-literal. Assuming no whitespace at the beginning of lines in
the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).
Others have mentioned some excellent workarounds, I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.
The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.
Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.
Strings can span multiple lines, but each line has to be quoted individually:
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";
"I am writing a programming language (similar to C) and would like to let users write multi-line strings easily (like in the above example)."
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (which is a C-like scripting language for the VEDIT text editor) allows multi-line strings:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.
You can also do:
string useMultiple = "this "
"is "
"a string in C.";
Place one literal after another without any special chars.
Literal declarations don't have to be single-line.
GPUImage inlines multiline shader code. Check out its SHADER_STRING macro.