Matching a string containing special characters with regex in Perl

I have a line in my file which contains the following string:
$print = "SM_sdo_debugss_cxct6_CSCTM_4 \csctm_gen[4]_ctm_i_nctm_I_csctm (4+5)";
$my_meta = '\csctm_gen[4]_ctm_i_nctm_I_csctm';
print "I got this\n" if($print =~ /\Q$my_meta\E/);
But it's not able to find the $my_meta string in $print. Why?

Your first string is in double quotes, so backslash escape sequences are processed.
\cs stands for Ctrl-S, which can also be written chr(19) or "\x13".
Your second string is in single quotes, which ignores backslash escapes (apart from \\ and \').
So your regex ends up looking for a 3-character sequence \ c s, but your target string contains a single byte 0x13.
To fix this, either write "... \\cs ..." in your first string (the first backslash escapes the second one), or use single quotes for your first string ('... \cs ...').
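A minimal sketch of the mismatch, written in Python rather than Perl. It rests on a few assumptions: Python has no \c escape, so the Ctrl-S character is written as \x13; re.escape stands in for \Q...\E; and the variable names are made up for illustration.

```python
import re

# What the double-quoted Perl string really holds: "\cs" became Ctrl-S (0x13).
interpolated = "SM_sdo_debugss_cxct6_CSCTM_4 \x13ctm_gen[4]_ctm_i_nctm_I_csctm (4+5)"

# What the fixed target holds: a literal backslash followed by "cs...".
literal = r"SM_sdo_debugss_cxct6_CSCTM_4 \csctm_gen[4]_ctm_i_nctm_I_csctm (4+5)"

# The single-quoted search text: backslash, c, s, ...
needle = r"\csctm_gen[4]_ctm_i_nctm_I_csctm"

# re.escape plays the role of \Q...\E: every metacharacter is matched literally.
pattern = re.compile(re.escape(needle))

found_in_interpolated = pattern.search(interpolated) is not None
found_in_literal = pattern.search(literal) is not None

print(found_in_interpolated, found_in_literal)
```

The search fails on the first string only because the backslash is no longer there, not because of the quoting of the pattern.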

Related

Detecting Quotes in a String C++ [duplicate]

I am having trouble detecting double quotes ("") or quotation marks in general in a string.
I tried str.find(""") and str.find(""""); the first one doesn't compile, and the second does not find the location: it returns 0. I have to read a string from a file, for example:
testFile.txt
This is the test "string" written in the file.
I read the string using getline and search it:
string str;
size_t x;
getline(inFile, str);
x = str.find("""");
however the value returned is 0. Is there another way to find the quotation marks that enclose 'string'?
The string """" doesn't contain any quotes. It is actually an empty string: when two string literals are next to each other, they get merged into just one string literal. For example "hello " "world" is equivalent to "hello world".
To embed quotes into a string you can do one of the following:
Escape the quotes you want to have inside your string, e.g. "\"\"".
Use a raw character string, e.g. R"("")".
You should use a backslash to protect your quotes.
size_t pos = str.find("\"");
will find the first ".
The " character is a special one, used to mark the beginning and end of a string literal.
String literals have the unusual property that consecutive ones are concatenated in translation phase 6, so for example the sequence "Hello" "World" is identical to "HelloWorld". This also means that """" is identical to "" "" is identical to "" - it's just a long way to write the empty string.
The documentation on string literals does say they can contain both unescaped and escaped characters. An escape is a special character that suppresses the special meaning of the next character. For example
\"
means "really just a double quote, not with the special meaning that it begins or ends a string literal".
So, you can write
"\""
for a string consisting of a single double quote.
You can also use a character literal, since you only want one character anyway, with the 4th overload of std::string::find: str.find('"').
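The two effects described in these answers can be sketched in Python, used here only as a stand-in: it also merges adjacent string literals, and its str.find behaves like std::string::find for these cases (searching for an empty string succeeds at index 0). The variable names are illustrative.

```python
line = 'This is the test "string" written in the file.'

# Two adjacent empty literals merge into one empty string.
empty = "" ""

# Searching for an empty string always succeeds at index 0,
# which is why the original code reported 0.
pos_empty = line.find(empty)

# An escaped quote is a one-character string containing ".
first = line.find("\"")
second = line.find("\"", first + 1)

# The text enclosed by the two quotation marks.
inner = line[first + 1:second]

print(pos_empty, first, second, inner)
```

Searching again from first + 1 is one simple way to locate the closing quote.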

RegEx: Grabbing all text between quotation marks (including nested), except if in a comment line (starting with //)

I'm trying to put together an expression that will grab text between quotation marks (single or double), including text in nested quotes, but will ignore text in comments, so except if the line starts with //.
Code example:
// this is a "comment" and should be ignored
//this is also a 'comment' and should be ignored
printf("This is text meant to be "captured", and can include any type of character");
printf("This is the same as above, only with 'different' nested quotes");
This can be quite useful to extract translatable content from a file.
So far, I have managed to use ^((?!\/\/).)* to exclude comment lines from being imported, and ["'](.+)["'] to extract text between quotes, but I haven't been able to combine it on a single expression.
Running them in a sequence also doesn't work, I think because of the greedy quantifier in the first expression.
The question says nothing about the type of input files, so I assume C source code files.
I suggest the following regular expression, tested with the text editor UltraEdit, which uses the Boost C++ Perl regular expression library.
^(?:(?!//|"|').)*(["'])(?!\1)\K(?:\\\1|.)+?(?=\1)
It matches the first single- or double-quoted string on a line. Any other strings on the same line are ignored by this search expression in Perl syntax, which is not optimal.
It ignores single- or double-quoted strings in line comments starting with //, regardless of whether the comment begins the line (with or without leading spaces/tabs) or follows code on the same line.
It also ignores empty strings like "" or ''. If a line contains "" or '' first, followed by a non-empty single- or double-quoted string, the non-empty string is ignored too. This is not optimal.
The delimiting quote characters on either side of the string are not part of the match.
A string-delimiting character inside the string must be escaped with a backslash to be interpreted by this search expression as a literal character. The first printf in the question's example would definitely result in a syntax error when compiled by a C compiler.
Strings in block comments are not ignored by this expression, nor are strings in code that the compiler ignores because of a preprocessor macro.
Example:
// This is a "comment" and should be ignored.
//This is also a 'comment' which should be ignored.
printf("This is text meant to be \"captured\", and can include any type of character.\n"); // But " must be escaped with a backslash.
printf("This is the same as above, only with 'different' nested quotes.\n");
putchar('I');
putchar('\'');
printf(""); printf("m thinking."); // First printf is not very useful.
printf("\"OK!\"");
printf("Hello"); printf(" world!\n");
printf("%d file%s found.\n",iFileCount,((iFileCount != 1) ? "s" : ""));
printf("Result is: %s\n",sText); // sText is "success" or "failure".
return iReturnCode; // 0 ... "success", 1 ... "error"
The search expression matches for this example:
This is text meant to be \"captured\", and can include any type of character.\n
This is the same as above, only with 'different' nested quotes.\n
I
\'
\"OK!\"
Hello
%d file%s found.\n
Result is: %s\n
So it does not find every non-empty string that the C code example would output when run.
Explanation for search string:
^ ... start the search at the beginning of a line.
This is the main reason why this expression cannot match the second, third, ... string on a line outside a line comment.
(?:(?!//|"|').)* ... a non-marking group matching zero or more characters other than newline, where a negative look-ahead containing an OR expression verifies at each position that neither // nor " nor ' comes next.
This expression is responsible for ignoring everything after // once found in a line, because the following part requires a quote character at that position:
(["']) ... " or ' must be found next, and the character found is marked for back-referencing.
(?!\1) ... but the match is only positive if the next character is not the same character, which ignores empty strings.
\K ... resets the start location of $0 to the current text position. In other words, everything to the left of \K is "kept back" and does not form part of the regular expression match. So everything matched from the beginning of the line up to the " or ' at the beginning of the non-empty string is no longer matched (selected).
(?:\\\1|.)+? ... a non-marking group that matches, non-greedily and one or more times, either a backslash followed by the character that opened the string, or any character other than newline.
This expression matches the string of interest.
(?=\1) ... matching of non-newline characters stops when the next character (positive look-ahead) is the same quote character, not escaped with a backslash, that opened the string; the quote character itself is not matched.
To match the first non-empty string outside a line comment together with the delimiting character on both sides:
^(?:(?!//|"|').)*\K(["'])(?!\1)(?:\\\1|.)+?\1
How can you really get all non-empty strings outside of comments?
Copy the content of the entire C source code file into a new file.
In the new file, remove all non-nested block comments and all line comments with the search expression:
^[\t ]*//.*[\r\n]+|[\t ]*/\*[\s\S]+?\*/[\r\n]*|[\t ]*//.*$
The replace string is an empty string.
Note: // inside a string is also interpreted by the third part of the OR expression as the beginning of a line comment, although this is not correct.
Use (["'])(?!\1)(?:\\\1|.)+?\1 as the search string to find really all non-empty strings, also matching the string delimiting character on both sides of every string.
The best approach would be to parse the C source code file for non-empty strings with a program written in C, because such a program could determine much better what is a line comment, what is a block comment (even in a source file containing nested block comments), and what are non-empty strings. It would also be possible to let the C compiler just preprocess the C source code files of a project, producing output files with all line and block comments already removed, and then search those preprocessed files for non-empty strings.
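The two-step procedure (strip comments, then search for quoted strings) can be sketched in Python. This assumes Python's re module accepts both expressions unchanged when the multi-line flag is set; the sample input and the names COMMENTS, STRINGS, stripped and strings are hypothetical.

```python
import re

# Hypothetical sample input, loosely based on the example above.
source = "\n".join([
    r'''// This is a "comment" and should be ignored.''',
    r'''printf("This is text meant to be \"captured\".\n"); // But " must be escaped.''',
    r'''printf("Same as above, with 'different' nested quotes.\n");''',
    r'''printf("Hello"); printf(" world!\n");''',
])

# Step 1: the comment-removal expression, with ^ and $ working per line.
COMMENTS = re.compile(
    r'^[\t ]*//.*[\r\n]+|[\t ]*/\*[\s\S]+?\*/[\r\n]*|[\t ]*//.*$',
    re.MULTILINE,
)

# Step 2: the expression for non-empty strings including their delimiters.
STRINGS = re.compile(r'''(["'])(?!\1)(?:\\\1|.)+?\1''')

stripped = COMMENTS.sub("", source)
strings = [m.group(0) for m in STRINGS.finditer(stripped)]

for s in strings:
    print(s)
```

One caveat, based on tracing the expression by hand: an empty literal such as "" earlier on the same line appears to throw the second expression off, because the closing quote of the empty string can then be taken as an opening quote.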

Regex is grabbing preceding character

So I am experiencing some inconsistent behavior in my regex
My regex:
(?<=test\\\\)(.*)(?=\",)
The input string:
"test.exe /c echo teststring > \\\\.\\test\\teststring",
When I run this on https://Regex101.com
I get the value teststring; however, when I run this in F#
Regex.Match(inputString, "(?<=test\\\\)(.*)(?=\",)")
I get \teststring back. My goal is to get just teststring. I'm not sure what I'm doing wrong.
I had success using triple quoted strings. Then only the regex escapes need be considered, and not the F# string escapes.
let inputString = """test.exe /c echo teststring > \\\\.\\test\\teststring","""
let x = Regex.Match(inputString, """(?<=test\\\\)(.*)(?=\",)""")
"teststring" comes out
The string in your source comes out as
(?<=test\\)(.*)(?=",)
If you don't want to use triple-quoted or verbatim strings, you will have to write this in F#:
"(?<=test\\\\\\\\)(.*)(?=\\\",)"
This string in F# uses backslashes to escape backslashes and a quote character. There are eight backslashes in a row in one place, and this then becomes four actual backslashes in the string value. There is also this:
\\\"
which translates to one actual \ and one actual " in the actual string value.
So then we end up with a string value of
(?<=test\\\\)(.*)(?=\",)
This then is the actual string value fed to the regex engine. The regex engine, like the F# compiler, also uses the backslash to escape characters. That's why any actual backslash had to be doubled and then doubled again.
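The two layers of escaping can be reproduced in Python, whose ordinary string literals also halve backslashes and whose raw literals play a role similar to F# triple-quoted strings. This is a sketch under the assumption that the input text really contains the doubled backslashes shown in the question; all variable names are illustrative.

```python
import re

# The text being searched, with its backslashes exactly as shown (raw literal).
text = r'"test.exe /c echo teststring > \\\\.\\test\\teststring",'

# Ordinary literal: the language halves the backslashes first, so the regex
# engine receives (?<=test\\)(.*)(?=",) and looks behind for ONE backslash.
halved = "(?<=test\\\\)(.*)(?=\",)"

# Raw literal: the regex engine receives the pattern exactly as typed online.
as_typed = r'(?<=test\\\\)(.*)(?=\",)'

# Ordinary literal with every backslash doubled once more.
doubled = "(?<=test\\\\\\\\)(.*)(?=\\\",)"

got_halved = re.search(halved, text).group(1)
got_as_typed = re.search(as_typed, text).group(1)

print(got_halved)
print(got_as_typed)
print(doubled == as_typed)
```

With the halved pattern the look-behind is satisfied one character early, so the second backslash ends up inside the captured group.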

Detect \ using regex in R [duplicate]

I'm writing strings which contain backslashes (\) to a file:
x1 = "\\str"
x2 = "\\\str"
# Error: '\s' is an unrecognized escape in character string starting "\\\s"
x2="\\\\str"
write(file = 'test', c(x1, x2))
When I open the file named test, I see this:
\str
\\str
If I want to get a string containing 5 backslashes, should I write 10 backslashes, like this?
x = "\\\\\\\\\\str"
[...] If I want to get a string containing 5 \, should I write 10 \ [...]
Yes, you should. To write a single \ in a string, you write it as "\\".
This is because the \ is a special character, reserved to escape the character that follows it. (Perhaps you recognize \n as newline.) It's also useful if you want to write a string containing a single ". You write it as "\"".
The reason why \\\str is invalid, is because it's interpreted as \\ (which corresponds to a single \) followed by \s, which is not valid, since "escaped s" has no meaning.
Have a read of this section about character vectors.
In essence, it says that when you enter character string literals you enclose them in a pair of quotes (" or '). Inside those quotes, you can create special characters using \ as an escape character.
For example, \n denotes a new line, and \" can be used to enter a " without R thinking it's the end of the string. Since \ is an escape character, you need a way to enter an actual \. This is done by using \\. Escaping the escape!
Note that the doubling of backslashes is because you are entering the string at the command line and the string is first parsed by the R parser. You can enter strings in different ways, some of which don't need the doubling. For example:
> tmp <- scan(what='')
1: \\\\\str
2:
Read 1 item
> print(tmp)
[1] "\\\\\\\\\\str"
> cat(tmp, '\n')
\\\\\str
>
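The same counting can be checked in Python, whose ordinary string literals follow the same doubling rule. This is only an analogy: repr() stands in for R's print(), which shows the escaped form, and print() stands in for cat(), which shows the actual characters.

```python
# Ten backslashes typed in the literal give five in the actual string.
x = "\\\\\\\\\\str"

n_backslashes = x.count("\\")       # backslashes actually stored
n_chars = len(x)                    # five backslashes plus "str"
n_escaped = repr(x).count("\\")     # the escaped form doubles them again

print(x)          # actual characters
print(repr(x))    # escaped form, as it would be typed in source code
print(n_backslashes, n_chars, n_escaped)
```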

C++ Unrecognized escape sequence

I want to create a string that contains all possible special chars.
However, the compiler gives me the warning "Unrecognized escape sequence" in this line:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
Can anybody please tell me which one of the characters I may not add to this string the way I did?
You have to escape \ as \\, otherwise \¯ will be interpreted as an (invalid) escape sequence:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
An escape sequence is a sequence of characters that has a meaning different from the literal characters themselves. In C and C++ the sequence begins with \, so if your string contains a double quote or a backslash, it must be escaped as \" or \\.
In long copy-pasted strings it may be difficult to spot those characters, and such strings are also harder to maintain, so it's recommended to use raw string literals with the prefix R, which need no escapes at all:
wstring s = LR"(.,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`')"
+ wstring(1,34);
A special delimiter string may be inserted outside the parentheses, like this: LR"delim(special string)delim", in case your raw string contains a )" sequence.
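For a long pasted literal, the offending backslash can be located mechanically. The following is a hypothetical helper written in Python, not part of any compiler or library; the sets of accepted escape characters are an assumption covering the classic C and C++ escapes (simple escapes, octal digits, \x, \u and \U) and ignore newer additions.

```python
# Characters that may follow a backslash in a classic C/C++ literal (assumed set).
SIMPLE = set("'\"?\\abfnrtv")
NUMERIC = set("01234567xuU")


def unrecognized_escapes(body):
    """Return (index, sequence) pairs for backslashes that start no known escape.

    `body` is the text between the quotes of the literal, as it appears in source.
    """
    hits = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            nxt = body[i + 1] if i + 1 < len(body) else ""
            if nxt not in SIMPLE and nxt not in NUMERIC:
                hits.append((i, "\\" + nxt))
            i += 2  # skip the backslash and the character it applies to
        else:
            i += 1
    return hits


# Fragment of the literal from the question, and the corrected version.
print(unrecognized_escapes("№\\¯/?"))
print(unrecognized_escapes("№\\\\¯/?"))
```

The helper only checks the character right after each backslash; it does not validate the digits that follow \x, \u or \U.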