Make variable string ignore escape sequences - c++

I'm currently facing an issue with a method parsing a string to another method. The problem is that I want to prevent it from using possible escape sequences.
The string I want to parse is not constant so (as far as I know) using the R declaration to make it a raw literal is not applicable here since I have to use variables.
Furthermore, in some cases there is user input included into the string (unconverted), so simply escaping those sequences by replacing a "\" character with "\\" is not an option either, the input can include those sequences too.
To be more precise on the issue:
A string formatted like f.e. " "\x10\x4 \x6(" " is getting auto compiled and converted into a non-human readable format as soon as it gets parsed to the next function. I want to prevent that conversion without In order to get the exact same string in the next function which needs to work with it.
Hope someone can help me since I'm new to c++ programming. Thanks in advance :D
#include "pch.h"
#include <iostream>
int main()
{
stringTester stringtester;
std::string test = stringtester.exampleString();
stringtester.stringOutput(test);
}
std::string stringTester::exampleString()
{
std::string exampleInput = "\x10\x5\x1a\aTestInput\\n \x6(";
return exampleInput;
}
void stringTester::stringOutput(std::string test)
{
std::cout << test << std::endl;
}
The actual output her (copied from console) is " TestInput\n ( ", whereas the wanted output would be the original string "\x10\x5\x1a\aTestInput\n \x6("
Edit: It seems like on SO it can't show the unknown characters. There are xtra characters in front and after the "TestInput\n ("

When you write a string literal in your source code the compiler replaces escape sequences with the character that they represent. That's why the quoted string in your example gets turned into nonsense. The way to fix that is to either replace each backslash with two backslashes or to make it a raw string literal.
When your program reads text input it doesn't do any of those adjustments. So if the code does
std::string input;
std::cin >> input;
and the user types the characters \x10\x5\x1a\aTestInput\\n \x6( into the console, input will end up with the characters \x10\x5\x1a\aTestInput\\n \x6(.
Once you've got the string, whether as a string literal or as text from the console, you can do whatever you want with it.

You have two possibilities for a backslash to remain a backslash in your C/C++ strings (and Java, JavaScript, PHP...)
Double all the Backslashes
Just as you said, you want to double all backslashes. This is fine. If the input was:
\\\\
Then your C/C++ string is going to be:
"\\\\\\\\"
(a mouthful, I know...)
Use the Hex/Octal Character
The other way, if you don't like the double backslash too much (if it scares you, somehow), is to use the character sequence in octal or hex (or Unicode in newer versions):
\ becomes "\134" or "\x5C"
As you may notice, though, this means 4 characters per backslash. So most people will generally just double the backslash (one 2 characters). Plus the double backslash is well understood. The code point may not be as well known by programmers coming behind you.
As a side note, if your user can enter any character, then they can also enter the double quote (") character. It is important that you also escape those. You can similarly use the backslash and the double quote character or its code point:
\" or \042 or \x22

Related

Why does it give me an error when opening a txt fiile? [duplicate]

I'm really confused about the escape character " \ " and its relation to the windows file system. In the following example:
char* fwdslash = "c:/myfolder/myfile.txt";
char* backslash = "c:\myfolder\myfile.txt";
char* dblbackslash = "c:\\myfolder\\myfile.txt";
std::ifstream file(fwdslash); // Works
std::ifstream file(dblbackslash); // Works
std::ifstream file(backslash); // Doesn't work
I get what you are doing here is escaping a special character so you can use it in this string. In no way by placing a backslash in a string literal or std::string do you actually change the string ---
---Edit: This is completely wrong, and the source of my confusion---
So it seems that the escape character is only treated by certain classes or things to mean something other than a backslash, like outputting on the console, ie., std::cout << "\hello"; will not print the backslash. In the case of ifstream (or I'm not sure if the same applies with the C fopen() version), it must be that this class or function treats backslashes as escape characters. I'm wondering, since the Windows file system uses backslashes wouldn't it make sense for it to accept the simple string with backslashes, ie., "c:\myfolder\myfile.txt" ? Trying it this way fails.
Also, in my compiler (Visual Studio) when I include headers I can use .\ and ..\ to mean either the current folder, or the parent folder. I'm pretty sure the \ in this isn't related to the escape character, but are these forms specific to Windows, part of the C preprocessor, or part of the C or C++ language? I know that backslashes are a Windows thing, so I can't see any reason another system would expect backslashes even when using .\ and ..\
Thanks.
In no way by placing a backslash in a string literal[...] do you
actually change the string
You do. Compiler actually modifies literal you wrote before embedding it into compiled program. If a backslash is found in string or character literal while parsing source code it is ignored and next character is treated specially. \n becomes carriage return, etc. For escaped characters without special meaning threatment is implementation defined. Usually it just means character unchanged.
You cannot just pass "c:\myfolder\file.txt" because it is not a string which will be seen by your program. Your program will see "c:myfolderfile.txt" instead. This is why escaped backslash has a special meaning, to allow embedding backslashes in actual string your program will see.
The solution is to either escape your backslashes, or use raw string literals (C++11 onwards):
const char* path = R"(c:\myfolder\file.txt)"
Filenames given to #include directive are not string literals, even if they are in form "path\to\header", so substitution rules are not applied to them.
The single backwards slash practically escapes the next character. In order to get rid of this behavior you need to double escape it. Now for the forward slash, it is probably a compatibility issue which follows the Unix tradition.
Similar thing to this is also in the Java world. A single forward slash is treated for path separation on both Windows and Unix, while also a double backslash.
To make it more clear why single backslash doesn't work, just remember that the following String practically produces a newline, a backslash and a tab:
"\n\\\t"
i.e. in an example like:
""c:\my\next\file.txt"
would actually produce:
"c:my
ext
ile.txt"
(the double space is form feed, see here)
Because when declaring a cstring literal the backslashes escape the next character, for special characters. This is so you can do newlines (\n), nulls (\0), carriage returns (\r) etc...
char* backslash = "c:\myfolder \myfile.txt";

string Regex using lex [duplicate]

I am learning to make a compiler and it's got some rules like single string:
char ch[] ="abcd";
and multi string:
printf("This is\
a multi\
string");
I wrote the regular expression
STRING \"([^\"\n]|\\{NEWLINE})*\"
It works fine with single line string but it doesn't work with multi line string where one line ends with a '\' character.
What should I change?
A common string pattern is
\"([^"\\\n]|\\(.|\n))*\"
This will match strings which include escaped double quotes (\") and backslashes (\\). It uses \\(.|\n) to allow any character after a backslash. Although some backslash sequences are longer than one character (\x40), none of them include non-alphanumerics after the first character.
It is possible that your input includes Windows line endings (CR-LF), in which case the backslash will not be directly followed by a newline; it will be followed by a carriage return. If you want to accept that input rather than throwing an error (which might be more appropriate), you need to do so explicitly:
\"([^"\\\n]|\\(.|\r?\n))*\"
But recognising a string and understanding what the string represents are two different things. Normally a compiler will need to turn the representation of a string into a byte sequence and that requires, for example, turning \n into the byte 10 and removing backslashed newlines altogether.
That task can easily be done in a (f)lex scanner using start conditions. (Or, of course, you can rescan the string using a different lexical scanner.)
Additionally, you need to think about error-handling. Once you ban strings with unescaped newlines (as C does), you open the door to the possibility of an unterminated string, where a newline is encountered before the closing quote. The same could happen at the end of the file if a string is not correctly​ closed.
If you have a single-character fallback rule, it will recognise the opening quote of an unterminated string. This is not desirable because it will then scan the contents of the string as program text leading to a cascade of errors. If you are not attempting error recovery it doesn't matter, but if you are it is usually better to at least recognize the unterminated string as such up to the newline, using a different pattern.

Could someone explain C++ escape character " \ " in relation to Windows file system?

I'm really confused about the escape character " \ " and its relation to the windows file system. In the following example:
char* fwdslash = "c:/myfolder/myfile.txt";
char* backslash = "c:\myfolder\myfile.txt";
char* dblbackslash = "c:\\myfolder\\myfile.txt";
std::ifstream file(fwdslash); // Works
std::ifstream file(dblbackslash); // Works
std::ifstream file(backslash); // Doesn't work
I get what you are doing here is escaping a special character so you can use it in this string. In no way by placing a backslash in a string literal or std::string do you actually change the string ---
---Edit: This is completely wrong, and the source of my confusion---
So it seems that the escape character is only treated by certain classes or things to mean something other than a backslash, like outputting on the console, ie., std::cout << "\hello"; will not print the backslash. In the case of ifstream (or I'm not sure if the same applies with the C fopen() version), it must be that this class or function treats backslashes as escape characters. I'm wondering, since the Windows file system uses backslashes wouldn't it make sense for it to accept the simple string with backslashes, ie., "c:\myfolder\myfile.txt" ? Trying it this way fails.
Also, in my compiler (Visual Studio) when I include headers I can use .\ and ..\ to mean either the current folder, or the parent folder. I'm pretty sure the \ in this isn't related to the escape character, but are these forms specific to Windows, part of the C preprocessor, or part of the C or C++ language? I know that backslashes are a Windows thing, so I can't see any reason another system would expect backslashes even when using .\ and ..\
Thanks.
In no way by placing a backslash in a string literal[...] do you
actually change the string
You do. Compiler actually modifies literal you wrote before embedding it into compiled program. If a backslash is found in string or character literal while parsing source code it is ignored and next character is treated specially. \n becomes carriage return, etc. For escaped characters without special meaning threatment is implementation defined. Usually it just means character unchanged.
You cannot just pass "c:\myfolder\file.txt" because it is not a string which will be seen by your program. Your program will see "c:myfolderfile.txt" instead. This is why escaped backslash has a special meaning, to allow embedding backslashes in actual string your program will see.
The solution is to either escape your backslashes, or use raw string literals (C++11 onwards):
const char* path = R"(c:\myfolder\file.txt)"
Filenames given to #include directive are not string literals, even if they are in form "path\to\header", so substitution rules are not applied to them.
The single backwards slash practically escapes the next character. In order to get rid of this behavior you need to double escape it. Now for the forward slash, it is probably a compatibility issue which follows the Unix tradition.
Similar thing to this is also in the Java world. A single forward slash is treated for path separation on both Windows and Unix, while also a double backslash.
To make it more clear why single backslash doesn't work, just remember that the following String practically produces a newline, a backslash and a tab:
"\n\\\t"
i.e. in an example like:
""c:\my\next\file.txt"
would actually produce:
"c:my
ext
ile.txt"
(the double space is form feed, see here)
Because when declaring a cstring literal the backslashes escape the next character, for special characters. This is so you can do newlines (\n), nulls (\0), carriage returns (\r) etc...
char* backslash = "c:\myfolder \myfile.txt";

Parsing quotes within a string literal

Why do strings in almost all languages require that you escape the quotations?
for instance if you have a string such as
"hello world""
why do languages want you to write it as
"hello world\""
Do you not only require that the string starts and ends with a quotation?
You can treat the end quote as the terminating quote for the string. If there is no end quote then there is an error. You can also assume that a string starts and ends on a single line and does not span multiple lines.
Suppose I want to put ", " into a string literal (so the literal contains quotes).
If I did that without escaping, I’d write "", "". This looks like two empty string literals separated by a comma. If I want to, for example, call a function with this string literal, I would write f("", ""). This looks to the compiler like I am passing two arguments, both empty strings. How can it know the difference?
The answer is, it can’t. Perhaps in simple cases like "hello world"", it might be able to figure it out, for at least some languages. But the set of strings which were unambiguous and didn’t need escaping would be different for different languages and it would be hard to keep track of which was which, and for any language there would be some ambiguous case which would need escaping anyway. It is much easier for the compiler writer to skip all those edge cases and just always require you to escape quotation marks, and it is probably also easier for the programmer.
Otherwise, the compiler would see the second quotation mark as the end of you string, and then a random quotation mark following it, causing an error.
"The use of the word "escape" really means to temporarily escape out of parsing the text and into a another mode where the subsequent character is treated differently." Source: https://softwareengineering.stackexchange.com/questions/112731/what-does-backslash-escape-character-really-escape
How would the compiler know which quote ended the string?
UPDATE:
In C & C++, this is a perfectly fine string:
printf("Hel" "lo" "," "Wor""ld" "!");
It prints Hello, World!
Or how 'bout is C#
Console.WriteLine("Hello, "+"World!");
Now should that print Hello, World or Hello, "+"World! ?
The reason you have to escape the second quotation mark is so the compiler knows that the quotation mark is part of the string, and not a terminator. If you weren't escaping it, the compiler would only pick up hello world rather than hello world"
Lets do a practical example.
How should this be translated?
"Hello"+"World"
'HelloWorld' or 'Hello"+"World'
vs
"Hello\"+\"World"
By escaping the quote characters, you remove the ambiguity, and code should have 0 ambiguity to the compiler. All compilers should compile the same code to identical executable's. It's basically a way of telling the compiler "I know this looks weird, but I really mean that this is how it should look"

Why must C/C++ string literal declarations be single-line?

Is there any particular reason that multi-line string literals such as the following are not permitted in C++?
string script =
"
Some
Formatted
String Literal
";
I know that multi-line string literals may be created by putting a backslash before each newline.
I am writing a programming language (similar to C) and would like to allow the easy creation of multi-line strings (as in the above example).
Is there any technical reason for avoiding this kind of string literal? Otherwise I would have to use a python-like string literal with a triple quote (which I don't want to do):
string script =
"""
Some
Formatted
String Literal
""";
Why must C/C++ string literal declarations be single-line?
The terse answer is "because the grammar prohibits multiline string literals." I don't know whether there is a good reason for this other than historical reasons.
There are, of course, ways around this. You can use line splicing:
const char* script = "\
Some\n\
Formatted\n\
String Literal\n\
";
If the \ appears as the last character on the line, the newline will be removed during preprocessing.
Or, you can use string literal concatenation:
const char* script =
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated during preprocessing, so these will end up as a single string literal at compile-time.
Using either technique, the string literal ends up as if it were written:
const char* script = " Some\n Formatted\n String Literal\n";
One has to consider that C was not written to be an "Applications" programming language but a systems programming language. It would not be inaccurate to say it was designed expressly to rewrite Unix. With that in mind, there was no EMACS or VIM and your user interfaces were serial terminals. Multiline string declarations would seem a bit pointless on a system that did not have a multiline text editor. Furthermore, string manipulation would not be a primary concern for someone looking to write an OS at that particular point in time. The traditional set of UNIX scripting tools such as AWK and SED (amongst MANY others) are a testament to the fact they weren't using C to do significant string manipulation.
Additional considerations: it was not uncommon in the early 70s (when C was written) to submit your programs on PUNCH CARDS and come back the next day to get them. Would it have eaten up extra processing time to compile a program with multiline strings literals? Not really. It can actually be less work for the compiler. But you were going to come back for it the next day anyhow in most cases. But nobody who was filling out a punch card was going to put large amounts of text that wasn't needed in their programs.
In a modern environment, there is probably no reason not to include multiline string literals other than designer's preference. Grammatically speaking, it's probably simpler because you don't have to take linefeeds into consideration when parsing the string literal.
In addition to the existing answers, you can work around this using C++11's raw string literals, e.g.:
#include <iostream>
#include <string>
int main() {
std::string str = R"(a
b)";
std::cout << str;
}
/* Output:
a
b
*/
Live demo.
[n3290: 2.14.5/4]: [ Note: A source-file new-line in a raw string
literal results in a new-line in the resulting execution
string-literal. Assuming no whitespace at the beginning of lines in
the following example, the assert will succeed:
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
—end note ]
Though non-normative, this note and the example that follows it in [n3290: 2.14.5/5] serve to complement the indication in the grammar that the production r-char-sequence may contain newlines (whereas the production s-char-sequence, used for normal string literals, may not).
Others have mentioned some excellent workarounds, I just wanted to address the reason.
The reason is simply that C was created at a time when processing was at a premium and compilers had to be simple and as fast as possible. These days, if C were to be updated (I'm looking at you, C1X), it's quite possible to do exactly what you want. It's unlikely, however. Mostly for historical reasons; such a change could require extensive rewrites of compilers, and so will likely be rejected.
The C preprocessor works on a line-by-line basis, but with lexical tokens. That means that the preprocessor understands that "foo" is a token. If C were to allow multi-line literals, however, the preprocessor would be in trouble. Consider:
"foo
#ifdef BAR
bar
#endif
baz"
The preprocessor isn't able to mess with the inside of a token - but it's operating line-by-line. So how is it supposed to handle this case? The easy solution is to simply forbid multiline strings entirely.
Actually, you can break it up thus:
string script =
"\n"
" Some\n"
" Formatted\n"
" String Literal\n";
Adjacent string literals are concatenated by the compiler.
Strings can lay on multiple lines, but each line has to be quoted individually :
string script =
" \n"
" Some \n"
" Formatted \n"
" String Literal ";
I am writing a programming language
(similar to C) and would like to let
write multi-line strings easily (like
in above example).
There is no reason why you couldn't create a programming language that allows multi-line strings.
For example, Vedit Macro Language (which is C-like scripting language for VEDIT text editor) allows multi-line strings, for example:
Reg_Set(1,"
Some
Formatted
String Literal
")
It is up to you how you define your language syntax.
You can also do:
string useMultiple = "this"
"is "
"a string in C.";
Place one literal after another without any special chars.
Literal declarations doesn't have to be single-line.
GPUImage inlines multiline shader code. Checkout its SHADER_STRING macro.