Processing a string with the null character - regex

I have a text file full of strings (computer paths) that I want to process by replacing every backslash, and every number (integer or float), with an underscore. The original string looks like this:
string = "\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"
Usually I could easily replace the backslash with the following command:
string=string.replace('\\','_')
and apply a regular expression such as '(\d(?:\.\d)?)' to replace the numbers.
However, in my case I could do neither, because Python always interprets '\0' as the null character and the '\5' in '\5.0' as ENQ; in fact, any number that follows a backslash is treated in a similar way.
Is there a suggested way to replace them?
For example, is there a way to convert my string to a raw string as a start?

Always remember: in a string literal, a backslash (\) escapes special characters. If you want the backslash itself, you need to escape it too. Your string should look like this:
string = "\\Software\\Microsoft\\0\\Windows\\CurrentVersion\\Internet Settings\\5.0\\Cache"
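As a rough sketch of the whole task: escape processing applies only to string literals in source code, so paths read from a text file already contain real backslashes, and in source code a raw string literal (the r prefix) avoids the doubling. The helper name normalize_path is hypothetical, and the pattern \d+(?:\.\d+)? is an assumption about which numbers should be replaced (whole integers and decimals such as 5.0).

```python
import re

def normalize_path(path):
    # Replace integers and decimals (e.g. 0 or 5.0) with a single underscore.
    without_numbers = re.sub(r'\d+(?:\.\d+)?', '_', path)
    # Then replace every backslash with an underscore.
    return without_numbers.replace('\\', '_')

# Raw string literal: the backslashes are kept exactly as typed.
path = r"\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"
print(normalize_path(path))
# _Software_Microsoft___Windows_CurrentVersion_Internet Settings___Cache
```

The numbers are replaced first so that the regular expression still sees the original digits and dots; either order gives the same result for this input.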

Related

Regular expression

I need a regular expression for the string cases below:
String value = "�江苏银行股份有限公司南京迈皋桥支行";
String value = "�/CNYXB/02112";
In both cases only the character "�" needs to be removed, and the final string values after applying the regular expression should be:
String value = "江苏银行股份有限公司南京迈皋桥支行";
String value = "/CNYXB/02112";
Thanks in advance!
Yes, I have tried the regex below:
value = value.replaceAll("[^\\p{ASCII}]", "");
I'm not sure if this is what you're actually asking, but you can easily remove the first character from the string:
^.
matches the first character at the start of the string.
If you want to remove an out-of-range character, you need to define your range. Use multiple classes with octal escapes, something like:
[\o{2444}-\o{3444}\o{40}-\o{77}]
Without knowing what the characters you're looking for really are, it's difficult to be more specific.
Try using replaceFirst instead of replaceAll:
value = value.replaceFirst("[^\\p{ASCII}]", "");
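For comparison, here is a minimal Python sketch of the same "first match only" idea. It assumes the unwanted character is the first non-ASCII character in the string (for instance U+FFFD, the replacement character); the function name strip_first_non_ascii is hypothetical.

```python
import re

def strip_first_non_ascii(value):
    # count=1 limits the substitution to the first match,
    # which mirrors Java's replaceFirst.
    return re.sub(r'[^\x00-\x7f]', '', value, count=1)

print(strip_first_non_ascii("\ufffd/CNYXB/02112"))  # /CNYXB/02112
```

If a string has no stray leading character, this would remove its first legitimate non-ASCII character instead, so matching the exact unwanted code point is safer once it is known.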

Replace every " with \" in Lua

X-Problem: I want to dump an entire Lua script into a single-line string, which can then be compiled into a C program.
Y-Problem: How can you replace every " with \" ?
I think it makes sense to try something like this:
data = string.gsub(line, "c", "\c")
where c is the " character. But this does not work, of course.
You need to escape both quotes and backslashes, if I understand your Y problem:
data = string.gsub(line, "\"", "\\\"")
or use single quotes for the Lua strings instead (you still need to escape the backslash):
data = string.gsub(line, '"', '\\"')
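The same two-step escaping can be sketched in Python, which makes the ordering easy to test. The function name escape_for_c_literal is hypothetical, and this sketch only covers backslashes and double quotes; newlines and other control characters would need their own handling.

```python
def escape_for_c_literal(text):
    # Escape backslashes first; otherwise the backslashes added
    # in front of the quotes would themselves be doubled.
    return text.replace("\\", "\\\\").replace('"', '\\"')

# The input contains: print("a\n") with a literal backslash and n.
print(escape_for_c_literal('print("a\\n")'))  # print(\"a\\n\")
```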
A solution to your X-Problem is to safely escape any sequence that could interfere with the interpreter.
Lua has the %q option for string.format, which formats and escapes the given string so that it can be safely read back by Lua. The result should largely work in a C string literal as well, though Lua and C escape sequences are not identical, so it is worth checking.
Example string: This \string's truly"tricky
If you just enclosed it in either single or double quotes, one of the quotes inside would still end the string early. There is also the invalid escape sequence \s.
Imagine this string was already properly handled in Lua, so we'll just pass it as a parameter:
string.format("%q", 'This \\string\'s truly"tricky')
returns (note that I used single quotes in the input code):
"This \\string's truly\"tricky"
Now that's a completely valid Lua string that can be written and read from a file. No need to manually escape every special character and risk implementation mistakes.
To implement your Y approach correctly, that is, to escape (invalid) characters with \, use pattern matching and replace each match with a backslash followed by the matched text:
string.gsub('he"ll"o', "[\"']", "\\%1") -- will prepend backslash to any quote
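A comparable sketch in Python uses an explicit capture group and a backreference in the replacement. The function name backslash_quotes is hypothetical.

```python
import re

def backslash_quotes(text):
    # (["']) captures a single or double quote; in the replacement,
    # \\ emits one backslash and \1 re-inserts the captured quote.
    return re.sub(r'(["\'])', r'\\\1', text)

print(backslash_quotes('he"ll"o'))  # he\"ll\"o
```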

Regular expression with backslash in Python3

I'm trying to match a specific substring in one string with regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc").
The returned m is empty, even when I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
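A quick check of the difference, as a minimal sketch (the variable names are only for illustration):

```python
tab = "\t"    # Python interprets the escape: one tab character
raw = r"\t"   # raw string: a backslash followed by the letter t

print(len(tab), len(raw))  # 1 2
print(raw == "\\t")        # True
```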
It's not clear from your question how the target string "\ue04a abc" should be interpreted. It could be a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, or PUA, meaning it has no standard assigned use) followed by a space, a, b, c, in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to match any single code point in the Basic Multilingual Plane PUA. Alternatively, you may want to match a literal string that begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
The former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above shows how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
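To make the two readings concrete, the following sketch runs both patterns and compares the string lengths. The variable names are chosen only for illustration.

```python
import re

decoded = "\ue04a abc"    # Python decodes \ue04a into one code point
literal = r"\ue04a abc"   # raw string: backslash, u, e, 0, 4, a stay separate

# Reading 1: a single Private Use Area code point at the start.
pua_match = re.match(r'[\ue000-\uf8ff]', decoded)
# Reading 2: a literal backslash, the letter u, and four hex digits.
literal_match = re.match(r'\\u[0-9a-fA-F]{4}', literal)

print(len(decoded), len(literal))  # 5 10
print(literal_match.group())       # \ue04a
```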
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

Regex is grabbing preceding character

I am experiencing some inconsistent behavior in my regex.
My regex:
(?<=test\\\\)(.*)(?=\",)
The input string:
"test.exe /c echo teststring > \\\\.\\test\\teststring",
When I run this on https://Regex101.com
I get the value teststring; however, when I run this in F#
Regex.Match(inputString, "(?<=test\\\\)(.*)(?=\",)")
I get \teststring back. My goal is to get just teststring. I'm not sure what I'm doing wrong.
I had success using triple quoted strings. Then only the regex escapes need be considered, and not the F# string escapes.
let inputString = """test.exe /c echo teststring > \\\\.\\test\\teststring","""
let x = Regex.Match(inputString, """(?<=test\\\\)(.*)(?=\",)""")
"teststring" comes out
After F# processes the escapes in your source, the string value handed to the regex engine is
(?<=test\\)(.*)(?=",)
If you don't want to use triple-quoted or verbatim strings, you will have to write this in F#:
"(?<=test\\\\\\\\)(.*)(?=\\\",)"
This string in F# uses backslashes to escape backslashes and a quote character. There are eight backslashes in a row in one place, and this then becomes four actual backslashes in the string value. There is also this:
\\\"
which translates to one actual \ and one actual " in the actual string value.
So then we end up with a string value of
(?<=test\\\\)(.*)(?=\",)
This then is the actual string value fed to the regex engine. The regex engine, like the F# compiler, also uses the backslash to escape characters. That's why any actual backslash had to be doubled and then doubled again.
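The two layers of escaping can be reproduced in Python, whose regular string literals treat backslashes in a similar way. This is only an analogous sketch, not F# code: Python raw strings stand in for the F# triple-quoted strings, and the variable names are hypothetical.

```python
import re

# Sample data from the question; a raw string keeps every backslash as typed.
data = r'"test.exe /c echo teststring > \\\\.\\test\\teststring",'

# Regular literal: each backslash meant for the regex engine is written twice.
escaped = "(?<=test\\\\\\\\)(.*)(?=\\\",)"
# Raw literal: only the regex-level escaping remains.
raw = r'(?<=test\\\\)(.*)(?=\",)'
assert escaped == raw

# One level short: the regex engine sees "test" followed by a single backslash.
too_few = "(?<=test\\\\)(.*)(?=\",)"

print(re.search(raw, data).group(1))      # teststring
print(re.search(too_few, data).group(1))  # \teststring
```

The second result reproduces the stray leading backslash from the question: the lookbehind consumes only one of the two backslashes in the data.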

C++ Unrecognized escape sequence

I want to create a string that contains all possible special chars.
However, the compiler gives me the warning "Unrecognized escape sequence" in this line:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
Can anybody please tell me which one of the characters I may not add to this string the way I did?
You have to escape \ as \\, otherwise \¯ will be interpreted as an (invalid) escape sequence:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
An escape sequence is a character string that has a different meaning than the literal characters themselves. In C and C++ the sequence begins with \, so if your string contains a double quote or a backslash, it must be escaped as \" or \\.
In long copy-pasted strings those characters can be hard to spot, and the result is harder to maintain, so it's recommended to use raw string literals with the prefix R, which need no escapes at all:
wstring s = LR"(.,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`')"
+ wstring(1,34);
A custom delimiter may be inserted outside the parentheses, as in LR"delim(special string)delim", in case your raw string contains a )" sequence.