Why must C/C++ string literal declarations be single-line? - c++

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";

The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
 Some\n\
 Formatted\n\
 String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated in translation phase 6, which runs after preprocessing and before compilation proper, so these will end up as a single string literal by the time the code is compiled.
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
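As a quick check of that claim, here is a minimal sketch that spells the same text both ways and compares the results. The names spliced, concatenated and same_text are made up for this illustration, and the text is shortened and left unindented so that whitespace handling cannot affect the comparison.

```cpp
#include <cstring>

// Line splicing: every backslash-newline pair is deleted before tokenization,
// so the continuation lines start in column one to avoid stray spaces.
const char* spliced = "\
Some\n\
Formatted\n\
";

// Adjacent literals: the pieces are joined into one literal.
const char* concatenated =
    "Some\n"
    "Formatted\n";

// True when both spellings produced the same characters.
bool same_text() {
    return std::strcmp(spliced, concatenated) == 0;
}
```

Calling same_text() from main should return true; any indentation placed before a spliced line would become part of the string and break the equality.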

One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline string literals? Not really. It can actually be less work for the compiler. But you were going to come back for it the next day anyhow in most cases. But nobody who was filling out a punch card was going to put large amounts of text that wasn't needed in their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.

In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string
literal results in a new-line in the resulting execution
string-literal. Assuming no whitespace at the beginning of lines in
the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).
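Applied to the string from the question, a raw literal could look like the sketch below. The variable and function names are assumptions for illustration only. Note that the newline right after the opening parenthesis is part of the string.

```cpp
#include <string>

// Raw literal: source newlines become newlines in the string, including
// the one that directly follows the opening "(".
const std::string raw_script = R"(
Some
Formatted
String Literal
)";

// The same text written as an ordinary literal with escapes.
const std::string escaped_script = "\nSome\nFormatted\nString Literal\n";

// True when both spellings denote the same characters.
bool raw_matches_escaped() {
    return raw_script == escaped_script;
}
```

If the leading newline is unwanted, the text can start on the same line as the opening R"(.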

Others have mentioned some excellent workarounds; I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.

The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.
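For contrast, here is a sketch of how the same intent can be written today with adjacent literals, where each directive falls between complete tokens. WITH_BAR and message are hypothetical names chosen for this example.

```cpp
#include <cstring>

// WITH_BAR is a made-up switch for this sketch; set it to 0 to drop the middle piece.
#define WITH_BAR 1

// Directives sit between complete tokens, so the preprocessor never has to
// look inside a literal. The surviving pieces are concatenated afterwards.
const char* message =
    "foo\n"
#if WITH_BAR
    "bar\n"
#endif
    "baz\n";
```

With WITH_BAR set to 1 the result is the three-line text; with 0 only the first and last pieces remain.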

Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.

Strings can span multiple lines, but each line has to be quoted individually:
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";

I am writing a programming language (similar to C) and would like to allow multi-line strings to be written easily (as in the above example).
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (a C-like scripting language for the VEDIT text editor) allows multi-line strings:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.
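To make that concrete, here is a minimal sketch of how a hand-written lexer for such a language might scan a literal that is allowed to contain raw newlines. The function name scan_string and the small set of escapes it accepts are assumptions for this illustration, not part of any existing language.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lexer helper for a C-like language that allows raw newlines
// inside double-quoted literals. On success it stores the decoded text in
// `out`, advances `pos` past the closing quote, and returns true.
bool scan_string(const std::string& src, std::size_t& pos, std::string& out) {
    if (pos >= src.size() || src[pos] != '"') return false;
    std::string value;
    std::size_t i = pos + 1;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '"') {              // closing quote ends the literal
            pos = i + 1;
            out = value;
            return true;
        }
        if (c == '\\') {             // escape: decode the following character
            if (i + 1 >= src.size()) return false;
            switch (src[i + 1]) {
                case 'n':  value += '\n'; break;
                case '"':  value += '"';  break;
                case '\\': value += '\\'; break;
                default:   return false; // unknown escape in this sketch
            }
            i += 2;
            continue;
        }
        value += c;                  // ordinary character, newline included
        ++i;
    }
    return false;                    // ran off the end: unterminated literal
}
```

The trade-off of this design is error reporting: a forgotten closing quote swallows the rest of the file, whereas a single-line rule lets the lexer report the error at the end of the offending line.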

You can also do:
string useMultiple = "this "
"is "
"a string in C.";
Place one literal after another without any special chars.

Literal declarations don't have to be single-line.
GPUImage inlines multiline shader code. Check out its SHADER_STRING macro.

Related

Expand escape sequences detected by flex

In my scanner.lex file I have this:
{Some rule that matches strings} return STRING; //STRING is enum
In my C++ file I have this:
if (yylex() == STRING) {
cout << "STRING: " << yytext << endl;
}
Obviously with some logic to take the input from stdin.
Now if this program gets the input "Hello\nWorld", my output is "STRING: Hello\nWorld", while I would want my output to be:
Hello
World
The same goes for other escape characters such as \",\0, \x<hex_number>, \t, \\... But I'm not sure how to achieve this. I'm not even sure if that's a flex issue or if I can solve this using only c++ tools...
How can I get this done?
As @Some programmer dude mentions in a comment, there is an example of how to do this using start conditions in the Flex documentation. That example puts the escaping rules into a separate start condition; each rule is implemented by appending the unescaped text to a buffer. And that's the way it's normally done.
Of course, you might find an external library which unescapes C-style escaped strings, which you could call on the string returned by flex. But that would be both slower and less flexible than the approach suggested in the Flex manual: slower because it requires a second scan of the string, and less flexible because the library is likely to have its own idea of what escapes to handle.
If you're using C++, you might find it more elegant to modify that example to use a std::string buffer instead of an arbitrary fixed-size character array. You can compile a flex-generated scanner with C++, so there is no problem using C++ standard library objects in your scanner code.
Depending on the various semantic value types you are managing, you will probably want to modify the yylex prototype to either use an additional reference parameter or a more structured return type, in order to return the token value to the caller. Note that while it is OK to use yytext before the next call to yylex, it's not generally considered good style since it won't work with most parsers: in general, parsers require the ability to look one or more tokens ahead, and thus yytext is likely to be overwritten by the time your parser needs its value. The flex manual documents the macro hook used to modify the yylex() prototype.
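For comparison, the second-pass alternative mentioned above (unescaping the matched text after flex returns it) could be sketched in plain C++ as follows. This is not the start-condition approach from the Flex manual. The names unescape and hex_value are hypothetical, and the sketch limits \x to at most two hex digits, which is a simplification compared with C.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Value of one hex digit; the caller guarantees std::isxdigit(h).
static int hex_value(char h) {
    const unsigned char u = static_cast<unsigned char>(h);
    return std::isdigit(u) ? u - '0' : std::tolower(u) - 'a' + 10;
}

// Hypothetical helper: converts C-style escapes in `in` to the characters
// they denote. Handles \n \t \r \0 \\ \" and \x with one or two hex digits;
// anything else is copied through unchanged.
std::string unescape(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\' || i + 1 == in.size()) {
            out += in[i];            // ordinary char or trailing backslash
            continue;
        }
        const char e = in[++i];      // character after the backslash
        switch (e) {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case 'r':  out += '\r'; break;
            case '0':  out += '\0'; break;
            case '\\': out += '\\'; break;
            case '"':  out += '"';  break;
            case 'x': {
                int value = 0, digits = 0;
                while (digits < 2 && i + 1 < in.size() &&
                       std::isxdigit(static_cast<unsigned char>(in[i + 1]))) {
                    value = value * 16 + hex_value(in[++i]);
                    ++digits;
                }
                if (digits == 0) out += "\\x";   // no digits: keep as written
                else out += static_cast<char>(value);
                break;
            }
            default:
                out += '\\';         // unknown escape: keep both characters
                out += e;
        }
    }
    return out;
}
```

Calling unescape(yytext) on the matched token would then print Hello and World on separate lines for the input in the question, at the cost of the extra scan the answer warns about.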

Make variable string ignore escape sequences

I'm currently facing an issue with a method passing a string to another method. The problem is that I want to prevent it from using possible escape sequences.
The string I want to pass is not constant, so (as far as I know) using the R prefix to make it a raw literal is not applicable here, since I have to use variables.
Furthermore, in some cases there is user input included into the string (unconverted), so simply escaping those sequences by replacing a "\" character with "\\" is not an option either, the input can include those sequences too.
To be more precise on the issue:
A string such as "\x10\x4 \x6(" is automatically converted into a non-human-readable form as soon as it is passed to the next function. I want to prevent that conversion so that the next function, which needs to work with it, receives exactly the same string.
Hope someone can help me since I'm new to c++ programming. Thanks in advance :D
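#include "pch.h"
#include <iostream>
#include <string>
// Minimal class declaration so the snippet compiles; the question omitted it.
class stringTester {
public:
std::string exampleString();
void stringOutput(std::string test);
};
int main()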
{
stringTester stringtester;
std::string test = stringtester.exampleString();
stringtester.stringOutput(test);
}
std::string stringTester::exampleString()
{
std::string exampleInput = "\x10\x5\x1a\aTestInput\\n \x6(";
return exampleInput;
}
void stringTester::stringOutput(std::string test)
{
std::cout << test << std::endl;
}
The actual output here (copied from the console) is " TestInput\n ( ", whereas the wanted output would be the original string "\x10\x5\x1a\aTestInput\n \x6("
Edit: It seems that SO can't show the unprintable characters. There are extra characters before and after the "TestInput\n ("
When you write a string literal in your source code the compiler replaces escape sequences with the character that they represent. That's why the quoted string in your example gets turned into nonsense. The way to fix that is to either replace each backslash with two backslashes or to make it a raw string literal.
When your program reads text input it doesn't do any of those adjustments. So if the code does
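std::string input;
std::getline(std::cin, input);
and the user types the characters \x10\x5\x1a\aTestInput\\n \x6( into the console, input will end up with exactly the characters \x10\x5\x1a\aTestInput\\n \x6(. (std::getline is used here because std::cin >> input would stop at the first space.)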
Once you've got the string, whether as a string literal or as text from the console, you can do whatever you want with it.
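The difference can be seen in a small sketch. Here a string stream stands in for the console, and the names compiled, raw and read_line are invented for this example.

```cpp
#include <sstream>
#include <string>

// Escape sequences are translated by the compiler, and only in ordinary literals.
const std::string compiled = "\x41\n";     // two characters: 'A' and a newline
const std::string raw      = R"(\x41\n)";  // six characters, backslashes kept

// Stands in for console input: text read from a stream is never escape-processed.
std::string read_line(const std::string& typed) {
    std::istringstream in(typed);
    std::string line;
    std::getline(in, line);   // getline keeps spaces, unlike operator>>
    return line;
}
```

Whatever characters go into read_line come back unchanged, backslashes included, which is why no doubling is needed for text that arrives at run time.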
You have two possibilities for a backslash to remain a backslash in your C/C++ strings (and Java, JavaScript, PHP...)
Double all the Backslashes
Just as you said, you want to double all backslashes. This is fine. If the input was:
\\\\
Then your C/C++ string is going to be:
"\\\\\\\\"
(a mouthful, I know...)
Use the Hex/Octal Character
The other way, if you don't like the double backslash too much (if it scares you, somehow), is to use the character sequence in octal or hex (or Unicode in newer versions):
\ becomes "\134" or "\x5C"
As you may notice, though, this means 4 characters per backslash, so most people will just double the backslash (only 2 characters). Plus, the double backslash is well understood; the code point may not be as familiar to programmers coming after you.
As a side note, if your user can enter any character, then they can also enter the double quote (") character. It is important that you also escape those. You can similarly use the backslash and the double quote character or its code point:
\" or \042 or \x22
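If the goal is to generate source text from arbitrary input, the doubling rule can be sketched as a small function. The name to_c_literal is hypothetical, and the sketch covers only backslashes and double quotes; control characters such as newlines would need further handling.

```cpp
#include <string>

// Hypothetical helper: wraps `text` in double quotes, doubling each backslash
// and escaping each double quote, so the result can be pasted into C or C++
// source as a string literal.
std::string to_c_literal(const std::string& text) {
    std::string out = "\"";
    for (char c : text) {
        if (c == '\\' || c == '"') out += '\\';  // prefix with a backslash
        out += c;
    }
    out += '"';
    return out;
}
```

Given the four backslashes from the example above, the function returns a quoted literal containing eight.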

Parsing quotes within a string literal

Why do strings in almost all languages require that you escape the quotations?
for instance if you have a string such as
"hello world""
why do languages want you to write it as
"hello world\""
Do you not only require that the string starts and ends with a quotation?
You can treat the end quote as the terminating quote for the string. If there is no end quote then there is an error. You can also assume that a string starts and ends on a single line and does not span multiple lines.
Suppose I want to put ", " into a string literal (so the literal contains quotes).
If I did that without escaping, I’d write "", "". This looks like two empty string literals separated by a comma. If I want to, for example, call a function with this string literal, I would write f("", ""). This looks to the compiler like I am passing two arguments, both empty strings. How can it know the difference?
The answer is, it can’t. Perhaps in simple cases like "hello world"", it might be able to figure it out, for at least some languages. But the set of strings which were unambiguous and didn’t need escaping would be different for different languages and it would be hard to keep track of which was which, and for any language there would be some ambiguous case which would need escaping anyway. It is much easier for the compiler writer to skip all those edge cases and just always require you to escape quotation marks, and it is probably also easier for the programmer.
Otherwise, the compiler would see the second quotation mark as the end of your string, followed by a stray quotation mark, causing an error.
"The use of the word "escape" really means to temporarily escape out of parsing the text and into a another mode where the subsequent character is treated differently." Source: https://softwareengineering.stackexchange.com/questions/112731/what-does-backslash-escape-character-really-escape
How would the compiler know which quote ended the string?
UPDATE:
In C & C++, this is a perfectly fine string:
printf("Hel" "lo" ", " "Wor""ld" "!");
It prints Hello, World!
Or how about in C#:
Console.WriteLine("Hello, "+"World!");
Now should that print Hello, World! or Hello, "+"World! ?
The reason you have to escape the second quotation mark is so the compiler knows that the quotation mark is part of the string, and not a terminator. If you weren't escaping it, the compiler would only pick up hello world rather than hello world"
Let's do a practical example.
How should this be translated?
"Hello"+"World"
'HelloWorld' or 'Hello"+"World'
vs
"Hello\"+\"World"
By escaping the quote characters, you remove the ambiguity, and code should have zero ambiguity to the compiler: every compiler should read the same source the same way. It's basically a way of telling the compiler "I know this looks weird, but I really mean that this is how it should look".
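A toy scanner makes the point visible. The sketch below (string_literals is a made-up name) treats a quote as the end of the literal unless a backslash precedes it, which is roughly the rule most C-like lexers follow.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scanner: returns the body of every double-quoted literal in
// `src`. A backslash makes the next character part of the body. A literal
// left open at the end of the input is returned as far as it goes.
std::vector<std::string> string_literals(const std::string& src) {
    std::vector<std::string> found;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] != '"') { ++i; continue; }
        std::string body;
        ++i;                                         // skip the opening quote
        while (i < src.size() && src[i] != '"') {
            if (src[i] == '\\' && i + 1 < src.size()) ++i;  // escaped char
            body += src[i++];
        }
        if (i < src.size()) ++i;                     // skip the closing quote
        found.push_back(body);
    }
    return found;
}
```

Fed the text f("", ""), it reports two empty literals; fed f("\", \""), it reports a single literal whose body is the four characters quote, comma, space, quote.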

Escaping blocks of foreign code in C++

I'm currently working on a toy language that works like this: one can embed blocks written in this language into a C++ source, and before compilation, these blocks are translated into C++ in an extra preprocessing step, producing a valid C++ source.
I want to make sure that these blocks can always be identified in the source unambiguously and also, whenever such a block is present in the source, it cannot be valid C++. Moreover, I want to achieve these by putting as few constraints to the embedded language as possible (the language itself is still somewhat fluid).
The obvious way would be to introduce a pair of special multi-character parentheses, made of characters that cannot appear together in valid C++ code (or in the embedded language). However, I'm not sure how to ensure that a particular character sequence is good for this purpose (not after GotW #78, anyway (: ).
So what is a good way to escape these blocks?
If your compiler can be made to accept the C++11 standard, you could use raw string literals, e.g.:
std::cout << R"*(<!DOCTYPE html>
<html>
<head>
<title>Title with a backslash \ here
and double " quote</title>)*";
With raw string literals, almost any sequence of characters can appear in the body. The only sequence that cannot appear is the closing delimiter (here )*" ), and you choose that delimiter yourself.
And you could use #{ and }# like I do in MELT macro-strings; MELT is a Lisp-like domain-specific language to extend GCC, and you can embed code in it with e.g.
(code_chunk hellocount_chk
#{ /* $HELLOCOUNT_CHK chunk */
static int $HELLOCOUNT_CHK#_counter;
$HELLOCOUNT_CHK#_counter++;
$HELLOCOUNT_CHK#_lab:
printf ("Hello World, counted %d\n",
$HELLOCOUNT_CHK#_counter);
if (random() % 4 == 0) goto $HELLOCOUNT_CHK#_lab;
}#)
The #{ and }# are enclosing macro-strings (these character sequences are unlikely to appear in C or C++ code, except in string literals and comments), with the $ starting symbols in such macro-strings (up to a non-letter or # character).
Using #{ and }# is not fool-proof (e.g. because of raw string literals) but good enough: a cooperative user could manage to avoid them.
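A first pass of the extra preprocessing step could locate such blocks roughly as follows. This is a naive sketch with an invented function name, and it shares the weakness noted above: it does not skip C++ string literals or comments.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pre-pass: collects the text between each "#{" and the next "}#".
// It deliberately ignores C++ string literals and comments, so a delimiter
// hidden inside one of those would be misread.
std::vector<std::string> embedded_blocks(const std::string& src) {
    std::vector<std::string> blocks;
    std::size_t from = 0;
    while (true) {
        const std::size_t open = src.find("#{", from);
        if (open == std::string::npos) break;
        const std::size_t close = src.find("}#", open + 2);
        if (close == std::string::npos) break;   // unterminated block: stop
        blocks.push_back(src.substr(open + 2, close - (open + 2)));
        from = close + 2;
    }
    return blocks;
}
```

A more careful version would tokenize the C++ around the blocks first, so that delimiters inside literals and comments are left alone.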

How to avoid backslash escape when writing regular expression in C/C++

The regular expression \w+\d can be written literally in many scripting languages such as Perl and Python. But in C/C++, I must write it as:
const char *re_str = "\\w+\\d";
which is ugly to the eye.
Is there any way to avoid this? Macros are also acceptable.
Just as an FYI, the next C++ standard (C++ 0x) will have something called raw string literals which should let you do something like:
const char *re_str = R"(\w+\d)";
However until then I think you're stuck with the pain of doubling up your backslashes if you want the regex to be a literal in the source file.
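With a C++11 compiler, the two spellings can be compared directly. The sketch below uses std::regex from the standard library purely as an illustration; the variable and function names are assumptions, and the question does not say which regex engine is in use.

```cpp
#include <regex>
#include <string>

// Both spellings denote the same five characters: \w+\d
const std::string escaped_pattern = "\\w+\\d";
const std::string raw_pattern     = R"(\w+\d)";

// Matches the whole of `text` against the raw-literal pattern,
// using std::regex's default ECMAScript grammar.
bool matches(const std::string& text) {
    static const std::regex re(raw_pattern);
    return std::regex_match(text, re);
}
```

The raw form only changes how the pattern is written in the source file; the regex engine receives identical characters either way.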
While reading Chapter 3 (the preprocessor chapter) of [C: A Reference Manual], an idea emerged:
#define STR(a) #a
#define R(var, re) static char var##_[] = STR(re);\
const char * var = ( var##_[ sizeof(var##_) - 2] = '\0', (var##_ + 1) );
R(re, "\w\d");
printf("Hello, world[%s]\n", re);
It's portable across C and C++ and uses only standard preprocessing features. The trick is to let the # operator escape each \ when it stringizes the literal, and then to strip the leading and trailing double-quote characters.
Now I think it's the best way until C++0x actually introduces the new R"..." literal syntax. And for C I think it will remain the best way for a long time.
The side effect is that we cannot define such a variable at global scope in C, because the initializer contains an assignment that removes the trailing double-quote character. In C++ it's OK.
You can put your regexp in a file and read the file if you have a lot or need to modify them often. That's the only way I see to avoid backslashes.
No, not before C++11. Until raw string literals arrived, there was only one kind of string literal in C++, and it's the kind that processes escape sequences.