Regex is grabbing preceding character - regex

I am seeing inconsistent behavior from my regex.
My regex:
(?<=test\\\\)(.*)(?=\",)
The input string:
"test.exe /c echo teststring > \\\\.\\test\\teststring",
When I run this on https://Regex101.com
I get the value teststring; however, when I run this in F#
Regex.Match(inputString, "(?<=test\\\\)(.*)(?=\",)")
I get \teststring back. My goal is to get just teststring. I'm not sure what I'm doing wrong.

I had success using triple quoted strings. Then only the regex escapes need be considered, and not the F# string escapes.
let inputString = """test.exe /c echo teststring > \\\\.\\test\\teststring","""
let x = Regex.Match(inputString, """(?<=test\\\\)(.*)(?=\",)""")
"teststring" comes out
The string in your source comes out as
(?<=test\\)(.*)(?=",)
If you don't want to use triple-quoted or verbatim strings, you will have to write this in F#:
"(?<=test\\\\\\\\)(.*)(?=\\\",)"
This string in F# uses backslashes to escape backslashes and a quote character. There are eight backslashes in a row in one place, and this then becomes four actual backslashes in the string value. There is also this:
\\\"
which translates to one actual \ and one actual " in the actual string value.
So then we end up with a string value of
(?<=test\\\\)(.*)(?=\",)
This then is the actual string value fed to the regex engine. The regex engine, like the F# compiler, also uses the backslash to escape characters. That's why any actual backslash had to be doubled and then doubled again.
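The two layers of escaping described above are not specific to F#. As a rough sketch in Python (not the F# from the question; the variable names are only illustrative), a raw string plays the role of the triple-quoted or verbatim string, and the ordinary string literal needs every backslash doubled again. The under-escaped variant reproduces the stray leading backslash from the question:

```python
import re

# The text to search: literal backslashes, ending in a quote and a comma.
input_string = r'test.exe /c echo teststring > \\\\.\\test\\teststring",'

# What the regex engine should see: "test" followed by two literal backslashes.
raw_pattern = r'(?<=test\\\\)(.*)(?=",)'

# The same pattern as an ordinary literal: eight backslashes become four.
escaped_pattern = '(?<=test\\\\\\\\)(.*)(?=",)'

# Under-escaped: four backslashes become two, so the engine sees "test"
# followed by only ONE literal backslash in the lookbehind.
under_escaped = '(?<=test\\\\)(.*)(?=",)'

full = re.search(raw_pattern, input_string).group(1)
short = re.search(under_escaped, input_string).group(1)

print(raw_pattern == escaped_pattern)  # the two spellings are the same string
print(full)
print(short)
```

With the under-escaped pattern the lookbehind is satisfied one character early, so the second backslash ends up inside the captured group.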

Related

Replace every " with \" in Lua

X-Problem: I want to dump an entire lua-script to a single string-line, which can be compiled into a C-Program afterwards.
Y-Problem: How can you replace every " with \" ?
I think it makes sense to try something like this
data = string.gsub(line, "c", "\c")
where c is the "-character. But this does not work of course.
You need to escape both quotes and backslashes, if I understand your Y problem:
data = string.gsub(line, "\"", "\\\"")
or use the other single quotes (still escape the backslash):
data = string.gsub(line, '"', '\\"')
A solution to your X-Problem is to safely escape any sequence that could interfere with the interpreter.
Lua has the %q option for string.format, which formats and escapes the provided string so that it can be safely read back by Lua. The result is often usable as a C string literal too, but verify that for control characters, since Lua's escape sequences are not identical to C's.
Example string: This \string's truly"tricky
If you just enclosed it in either single or double-quotes, there'd still be a quote that ended the string early. Also there's the invalid escape sequence \s.
Imagine this string was already properly handled in Lua, so we'll just pass it as a parameter:
string.format("%q", 'This \\string\'s truly"tricky')
returns (notice, I used single-quotes in code input):
"This \\string's truly\"tricky"
Now that's a completely valid Lua string that can be written and read from a file. No need to manually escape every special character and risk implementation mistakes.
To correctly implement your Y approach, to escape (invalid) characters with \, use proper pattern matching to replace the captured string with a prefix+captured string:
string.gsub('he"ll"o', "[\"']", "\\%1") -- will prepend backslash to any quote
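For comparison, here is the same capture-and-prefix idea sketched in Python rather than Lua. The helper name escape_for_literal is hypothetical, and the sketch assumes only double quotes and backslashes need escaping for the target string literal:

```python
import re

def escape_for_literal(text):
    """Prefix every double quote and backslash with a backslash."""
    # The character class captures one quote or backslash; the replacement
    # emits a literal backslash followed by the captured character.
    return re.sub(r'(["\\])', r'\\\1', text)

print(escape_for_literal('he"ll"o'))
print(escape_for_literal('a\\b'))
```

Handling backslashes in the same pass as quotes avoids escaping the backslashes that the quote replacement has just inserted.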

Regular expression with backslash in Python3

I'm trying to match a specific substring in one string with regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc").
The returned m is empty (None), even when I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA; -- or if you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above show how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
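A small sketch may make the two interpretations concrete. It assumes Python 3.3 or later, where the re module understands \uXXXX escapes in patterns; the variable names are only illustrative:

```python
import re

as_char = "\ue04a abc"   # U+E04A, space, a, b, c
as_text = r"\ue04a abc"  # backslash, u, e, 0, 4, a, space, a, b, c

print(len(as_char), len(as_text))

# Interpretation 1: match one code point in the BMP Private Use Area.
m_char = re.match(r'[\ue000-\uf8ff]', as_char)

# Interpretation 2: match a literal backslash, "u", and four hex digits.
m_text = re.match(r'\\u[0-9a-fA-F]{4}', as_text)

# The literal-backslash pattern finds nothing in the one-character version.
m_mismatch = re.match(r'\\u[0-9a-fA-F]{4}', as_char)

print(m_char is not None, m_text.group(), m_mismatch)
```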
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

Escaping Asterisk Grabs wrong character

When I run the code below I get the unexpected result where \* also captures É. Is there a way to make it only capture * like I wanted?
let s =
"* A
ÉR
* B"
let result = System.Text.RegularExpressions.Regex.Replace(s, "\n(?!\*)", "", Text.RegularExpressions.RegexOptions.Multiline)
printfn "%s" result
Result After Running Replace
* AÉR
* B
Expected Result
"* A
ÉR
* B"
UPDATE
This seems to be working, when I use a pattern like so \n(?=\*). I guess I needed a positive lookahead.
You're right that you need positive lookahead instead of negative lookahead to get the result you want. However, to clarify an issue that came up in the comments: in F# a string delimited by just "" is not quite like either a plain C# string delimited by "" or a C# verbatim string delimited by @"". If you want the latter, use @"" in F# as well. The difference is that in a normal F# string, a backslash starts an escape sequence only when it precedes a valid character to escape (see the table towards the top of Strings (F#)). Otherwise, it is treated as a literal backslash character. So, since '*' is not a valid character to escape, you luckily see the behavior you expected (in C#, by contrast, this would be a syntax error because it's an unrecognized escape). I would recommend that you not rely on this and use a verbatim @"" string instead.
In other words, in F# the following three strings are all equivalent:
let s1 = "\n\*"
let s2 = "\n\\*"
let s3 = @"
\*"
I think that the C# design is more sensible because it prevents confusion on what exactly is being escaped.
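To see why the original pattern touched the line starting with É, it can help to compare the two lookaheads side by side. This is a sketch in Python rather than F#, and it assumes the lines are separated by a bare \n:

```python
import re

s = "* A\nÉR\n* B"

# Negative lookahead: remove each newline that is NOT followed by "*".
# The only such newline is the one before "É".
negative = re.sub(r"\n(?!\*)", "", s)

# Positive lookahead: remove each newline that IS followed by "*".
positive = re.sub(r"\n(?=\*)", "", s)

print(repr(negative))
print(repr(positive))
```

In both cases \* only ever tests for an asterisk. The É line is affected because the newline in front of it satisfies "not followed by *".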

Processing a string with the null character

I have a text file full of strings (computer paths) that I want to process by replacing every backslash with an underscore, and also replacing every number (integer or float) with an underscore. The original string looks like this:
string = "\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"
Usually, I could replace easily the backslash with the following command:
string=string.replace('\\','_')
and apply some regular expressions such as: '(\d(?:\.\d)?)' to replace the numbers.
However, in my case I couldn't do either, because Python always interprets '\0' as a null character and '\5.0' as ENQ; in fact, any digit following the backslash is treated the same way.
Any suggested way to replace them ?
e.g. is there a way to convert my string to raw string as a start ?
Always remember: the backslash (\) escapes special characters. If you want to use the backslash itself, you need to escape it too. Your string should look like this:
string = "\\Software\\Microsoft\\0\\Windows\\CurrentVersion\\Internet Settings\\5.0\\Cache"
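On the raw-string part of the question: escape processing applies only to string literals in source code, so lines read from a text file already contain literal backslashes. For a literal in the source, an r prefix avoids the doubling. A minimal sketch, assuming numbers look like 0 or 5.0 (the variable names are illustrative):

```python
import re

# Raw string literal: every backslash stands for itself.
path = r"\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"

# Replace numbers first (integers or simple decimals), then the backslashes.
no_numbers = re.sub(r"\d+(?:\.\d+)?", "_", path)
result = no_numbers.replace("\\", "_")

print(result)
```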

Double-escaping regex from inside a Groovy expression

Note: I had to simplify my actual use case to spare SO a lot of backstory. So if your first reaction to this question is: why would you ever do this, trust me, I just need to.
I'm trying to write a Groovy expression that replaces double-quotes (""") that appear in a string with single-quotes ("'").
// BEFORE: Replace my "double" quotes with 'single' quotes.
String toReplace = "Replace my \"double-quotes\" with 'single' quotes.";
// Wrong: compiler error
String replacerExpression = "toReplace.replace(""", "'");";
Binding binding = new Binding();
binding.setVariable("toReplace", toReplace);
GroovyShell shell = new GroovyShell(binding);
// AFTER: Replace my 'double' quotes with 'single' quotes.
String replacedString = (String)shell.evaluate(replacerExpression);
The problem is, I'm getting a compile error on the line where I assign replacerExpression:
Syntax error on token ""toReplace.replace("", { expected
I think it's because I need to escape the string that contains the double-quote character (""") but since it's a string-inside-a-string, I'm not sure how to properly escape it here. Any ideas?
You need to escape the quote within quotes in this line:
String replacerExpression = "toReplace.replace(""", "'");";
The string will be evaluated twice: once as a string literal, and once as a script. This means you have to escape it with a backslash, and escape the backslash too. Also, with the embedded quotes, it'll be much more readable if you use triple quotes.
Try this (in groovy):
String replacerExpression = """toReplace.replace("\\"", "'");""";
In Java, you're stuck with using backslashes to escape all the quotes and the embedded backslash:
String replacerExpression = "toReplace.replace(\"\\\"\", \"\'\");";
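The "evaluated twice" point can be reproduced in other languages. As a rough analogue in Python (not Groovy; eval stands in for GroovyShell.evaluate, and the variable names are illustrative), the same script text can be written with backslash escapes or with a raw triple-quoted literal:

```python
to_replace = 'Replace my "double-quotes" with \'single\' quotes.'

# Pass 1 (string literal) turns \" into " and \\ into \, leaving the script
# text: to_replace.replace("\"", "'")
escaped_expr = "to_replace.replace(\"\\\"\", \"'\")"

# A raw triple-quoted literal needs no escaping for this script text:
# to_replace.replace('"', "'")
raw_expr = r"""to_replace.replace('"', "'")"""

# Pass 2 (script evaluation) interprets the script's own string literals.
from_escaped = eval(escaped_expr, {"to_replace": to_replace})
from_raw = eval(raw_expr, {"to_replace": to_replace})

print(from_escaped)
print(from_escaped == from_raw)
```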
Triple-quotes work well, but one can also use single-quoted string to specify a double-quote, and a double-quoted string for a single-quote.
Consider this:
String toReplace = "Replace my \"double-quotes\" with 'single' quotes."
// key line:
String replacerExpression = """toReplace.replace('"', "'");"""
Binding binding = new Binding(); binding.setVariable("toReplace", toReplace)
GroovyShell shell = new GroovyShell(binding)
String replacedString = (String)shell.evaluate(replacerExpression)
That is, after the string literal evaluation, this is evaluated in the Groovy shell:
toReplace.replace('"', "'");
If that is too hard on the eyes, replace the "key line" above with another style (using slashy strings):
String ESC_DOUBLE_QUOTE = /'"'/
String ESC_SINGLE_QUOTE = /"'"/
String replacerExpression = """toReplace.replace(${ESC_DOUBLE_QUOTE}, ${ESC_SINGLE_QUOTE});"""
Consider using regular expressions for this kind of problem, instead of wrestling with quote escaping.
I have put up a solution using groovy console. Please see if that helps.