Include )" in raw string literal without terminating said literal - c++

The two characters )" terminate the raw string literal in the example below.
The sequence )" could appear in my text at some point, and I want the string to continue even if this sequence is found within it.
R"(
Some Text)"
)"; // ^^
How can I include the sequence )" within the string literal without terminating it?

Raw string literals let you specify an almost arbitrary* delimiter:
//choose ### as the delimiter so only )###" ends the string
R"###(
Some Text)"
)###";
*The exact rules are: "any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline" (N3936 §2.14.5 [lex.string] grammar) and "at most 16 characters" (§2.14.5/2).

Escaping won't help you since this is a raw literal, but the syntax is designed to allow clear demarcation of start and end, by introducing a little arbitrary phrase like aha.
R"aha(
Some Text)"
)aha";
By the way, note the order of ) and " at the end, the opposite of your example.
Regarding the formal rules: at first sight (studying the standard) it might seem as if escaping works the same in raw string literals as in ordinary literals. One knows that it doesn't, so how is that possible when no exception is noted in the rules? When raw string literals were introduced in C++11, it was by way of an extra, undoing translation step that reverts the effect of the early translation phases, to wit:
C++11 §2.5/3
” Between the initial and final double quote characters of the raw string, any transformations performed in phases 1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion shall apply before any d-char, r-char, or delimiting parenthesis is identified.
This takes care of Unicode character specifications (the universal-character-names like \u0042), which although they look and act like escapes are formally, in C++, not escape sequences.
The true formal escapes are handled, or rather not handled, by a custom grammar rule for the content of a raw string literal. Namely, in C++11 §2.14.5 the raw-string grammar entity is defined as
" d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
where an r-char-sequence is defined as a sequence of r-char, each of which is
” any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence [like aha above] (which may be empty) followed by a double quote "
Essentially the above means that not only can you not use escapes directly in raw strings (which is much of the point, it's positive, not negative), you can't use Unicode character specifications directly either.
Here's how to do it indirectly:
#include <iostream>
using namespace std;

auto main() -> int
{
    cout << "Ordinary string with a '\u0042' character.\n";
    cout << R"(Raw string without a '\u0042' character, and no \n either.)" "\n";
    cout << R"(Raw string without a '\u0042' character, i.e. no ')" "\u0042" R"(' character.)" "\n";
}
Output:
Ordinary string with a 'B' character.
Raw string without a '\u0042' character, and no \n either.
Raw string without a '\u0042' character, i.e. no 'B' character.

You can use a custom delimiter:
R"aaa(
Some Text)"
)aaa";
Here aaa will be your string delimiter.

Related

Detecting Quotes in a String C++ [duplicate]

I am having issues detecting double quotes ("") or quotation marks in general from a string.
I tried using str.find(""") and str.find(""""); however, the first one doesn't compile and the second does not find the location: it returns 0. I have to read a string from a file, for example:
testFile.txt
This is the test "string" written in the file.
I would read the string and search it using
string str;
size_t x;
getline(inFile, str);
x = str.find("""");
however the value returned is 0. Is there another way to find the quotation marks that enclose 'string'?
The string """" doesn't contain any quotes. It is actually an empty string: when two string literals are next to each other, they get merged into just one string literal. For example "hello " "world" is equivalent to "hello world".
To embed quotes into a string you can do one of the following:
Escape the quotes you want to have inside your string, e.g. "\"\"".
Use a raw character string, e.g. R"("")".
You should use a backslash to protect your quotes.
size_t pos = str.find("\"");
will find the position of the first ".
The " character is a special one, used to mark the beginning and end of a string literal.
String literals have the unusual property that consecutive ones are concatenated in translation phase 6, so for example the sequence "Hello" "World" is identical to "HelloWorld". This also means that """" is identical to "" "", which in turn is identical to "": it's just a long way to write the empty string.
The documentation on string literals does say they can contain both unescaped and escaped characters. An escape is a special character that suppresses the special meaning of the next character. For example
\"
means "really just a double quote, not with the special meaning that it begins or ends a string literal".
So, you can write
"\""
for a string consisting of a single double quote.
You can also use a character literal, since you only want one character anyway: str.find('"') uses the overload of std::string::find that takes a single char.

Replace every " with \" in Lua

X-Problem: I want to dump an entire lua-script to a single string-line, which can be compiled into a C-Program afterwards.
Y-Problem: How can you replace every " with \" ?
I think it makes sense to try something like this
data = string.gsub(line, "c", "\c")
where c is the "-character. But this does not work of course.
You need to escape both quotes and backslashes, if I understand your Y problem:
data = string.gsub(line, "\"", "\\\"")
or use the other single quotes (still escape the backslash):
data = string.gsub(line, '"', '\\"')
A solution to your X-Problem is to safely escape any sequence that could interfere with the interpreter.
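Lua has the %q option for string.format, which formats and escapes the provided string so that it can be safely read back by Lua. The result will often work as a C string literal too, but that is not guaranteed, since Lua and C do not share exactly the same escape rules (for example, a numeric escape like \13 is decimal in Lua but octal in C).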
Example string: This \string's truly"tricky
If you just enclosed it in either single or double-quotes, there'd still be a quote that ended the string early. Also there's the invalid escape sequence \s.
Imagine this string was already properly handled in Lua, so we'll just pass it as a parameter:
string.format("%q", 'This \\string\'s truly"tricky')
returns (notice, I used single-quotes in code input):
"This \\string's truly\"tricky"
Now that's a completely valid Lua string that can be written and read from a file. No need to manually escape every special character and risk implementation mistakes.
To correctly implement your Y approach, to escape (invalid) characters with \, use proper pattern matching to replace the captured string with a prefix+captured string:
string.gsub('he"ll"o', "[\"']", "\\%1") -- will prepend backslash to any quote

Regular expression with backslash in Python3

I'm trying to match a specific substring in one string with regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc").
The returned m is an empty object, even when I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
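The two passes can be made visible with a small sketch. The variable names below are only illustrative; the point is that Python reduces "\t" to one character before the regex engine ever sees it, while r"\t" reaches the engine as two characters, which the engine then interprets itself.

```python
import re

plain = "\t"    # Python's own pass turns this into a single tab character
raw = r"\t"     # no Python pass: backslash followed by 't'

print(len(plain), len(raw))

# The regex engine reads the two-character pattern \t as "match a tab".
tab_match = re.match(raw, "\tabc")

# To match a literal backslash, the engine itself must see two backslashes.
backslash_match = re.match(r"\\t", r"\tabc")

print(tab_match is not None, backslash_match is not None)
```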
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA, or whether you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above shows how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
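To see how different the two interpretations are, here is a minimal sketch that puts them side by side. The names target_char and target_text are made up for illustration; the patterns are the ones given above.

```python
import re

# Interpretation 1: the target holds the single code point U+E04A.
target_char = "\ue04a abc"
m1 = re.match(r'[\ue000-\uf8ff]', target_char)

# Interpretation 2: the target holds a backslash, 'u', and four hex digits.
target_text = r"\ue04a abc"
m2 = re.match(r'\\u[0-9a-fA-F]{4}', target_text)

# The two targets do not even have the same length.
print(len(target_char), len(target_text))
print(len(m1.group()), len(m2.group()))
```

Neither pattern matches the other target, so it is worth checking with len() which kind of string you actually have before choosing a pattern.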
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

C++ Unrecognized escape sequence

I want to create a string that contains all possible special chars.
However, the compiler gives me the warning "Unrecognized escape sequence" in this line:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
Can anybody please tell me which one of the characters I may not add to this string the way I did?
You have to escape \ as \\, otherwise \¯ will be interpreted as an (invalid) escape sequence:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
An escape sequence is a character sequence that has a different meaning than the literal characters themselves. In C and C++ the sequence begins with \, so if your string contains a double quote or backslash it must be escaped properly using \" and \\.
In long copy-pasted strings it may be difficult to spot those characters, and escapes make the string harder to maintain, so it's recommended to use raw string literals with the prefix R, which need no escapes at all:
wstring s = LR"(.,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`')"
+ wstring(1,34);
A special delimiter string may be inserted outside the parentheses, like this: LR"delim(special string)delim", in case your raw string contains a )" sequence.

Help with specific Regex: need to match multiple instances of multiple formats in a single string

I apologize for the terrible title...it can be hard to try to summarize an entire situation into a single sentence.
Let me start by saying that I'm asking because I'm just not a Regex expert. I've used it a bit here and there, but I just come up short with the correct way to meet the following requirements.
The Regex that I'm attempting to write is for use in an XML schema for input validation, and used elsewhere in Javascript for the same purpose.
There are two different possible formats that are supported. There is a literal string, which must be surrounded by quotation marks, and a Hex-value string which must be surrounded by braces.
Some test cases:
"this is a literal string" <-- Valid string, enclosed properly in "s
"this should " still be correct" <-- Valid string, "s are allowed within (if possible, this requirement could be forgiven if necessary)
"{00 11 22}" <-- Valid string, {}s are allowed in strings. Another one that can be forgiven if necessary
I am bad output <-- Invalid string, no "s
"Some more problemss"you know <-- Invalid string, must be fully contained in "s
{0A 68 4F 89 AC D2} <-- Valid string, hex characters enclosed in {}s
{DDFF1234} <-- Valid string, spaces are ignored for Hex strings
DEADBEEF <-- Invalid string, must be contained in either "s or {}s
{0A 12 ZZ} <-- Invalid string, 'Z' is not a valid Hex character
To satisfy these general requirements, I had come up with the following Regex that seems to work well enough. I'm still fairly new to Regex, so there could be a huge hole here that I'm missing:
".+"|\{([0-9]|[a-f]|[A-F]| )+\}
If I recall correctly, the XML Schema regex automatically assumes beginning and end of line (^ and $ respectively). So, essentially, this regex accepts any string that starts and ends with a ", or starts and ends with {}s and contains only valid hexadecimal characters. This has worked well for me so far, except that I had forgotten about another (less common, and thus forgotten) input option that completely breaks my regex.
Where I made my mistake:
Valid input should also allow a user to separate valid strings (of either type, literal/hex) by a comma. This means that a single string should be able to contain more than one of the above valid strings, separated by commas. Luckily, however, a comma is not a supported character within a literal string (although I see that my existing regex does not care about commas).
Example test cases:
"some string",{0A F1} <-- Valid
{1122},{face},"peanut butter" <-- Valid
{0D 0A FF FE},"string",{FF FFAC19 85} <-- Valid (Spaces don't matter in Hex values)
"Validation is allowed to break, if a comma is found not separating values",{0d 0a} <-- Invalid, comma is a delimiter, but "Validation is allowed to break" and "if a comma..." are not marked as separate strings with "s
hi mom,"hello" <-- Invalid, String1 was not enclosed properly in "s or {}s
My thoughts are that it is possible to use commas as a delimiter to check each "section" of the string against a regex similar to the original, but I am just not advanced enough in regex yet to come up with a solution on my own. Any help would be appreciated, but ultimately a final solution with an explanation would be just stellar.
Thanks for reading this huge wall of text!
According to http://www.regular-expressions.info/xml.html the regex language to be used in XSD is less expressive than the one used in Java, but expressive enough for your task.
Now for a construction, take your own regex. Replace the dot with a negated character class [^,] to match everything except the comma, and (for increased clarity) merge the hexadecimal character classes into one. You get the following regex:
"[^,]+"|\{[0-9a-fA-F ]+\}
If we name this regex <S> (for "single string"), the additional feature is validated by a regex that matches any number of repetitions of <S> followed by a comma, and then a single <S>:
(<S>,)*<S>
Expanded, this yields the desired regex:
(("[^,]+"|\{[0-9a-fA-F ]+\}),)*("[^,]+"|\{[0-9a-fA-F ]+\})
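The construction can be checked against the test cases from the question. The sketch below uses Python's re module as a stand-in for the XSD and JavaScript engines, which is an assumption about equivalent behaviour for this simple pattern; fullmatch imitates XSD's implicit anchoring, and the names SINGLE, LIST_PATTERN and is_valid are made up for illustration.

```python
import re

# One value: a quoted literal without commas, or hex digits/spaces in braces.
SINGLE = r'(?:"[^,]+"|\{[0-9a-fA-F ]+\})'

# (<S>,)*<S>: any number of "value," repetitions, then one final value.
LIST_PATTERN = re.compile(rf'(?:{SINGLE},)*{SINGLE}')


def is_valid(text: str) -> bool:
    """Return True only if the whole input matches the list pattern."""
    return LIST_PATTERN.fullmatch(text) is not None


samples = [
    '"some string",{0A F1}',
    '{1122},{face},"peanut butter"',
    '{0D 0A FF FE},"string",{FF FFAC19 85}',
    '"Validation is allowed to break, if a comma is found",{0d 0a}',
    'hi mom,"hello"',
    '{0A 12 ZZ}',
]
for sample in samples:
    print(is_valid(sample), sample)
```

In JavaScript the same pattern would need explicit ^ and $ anchors, since only XSD anchors implicitly.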
Maybe something along the lines of
(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}),)*(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}))