Regular expression with backslash in Python3 - regex

I'm trying to match a specific substring in a string with a regular expression, for example matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc").
The returned m is an empty object, even when I tried using three backslashes in the pattern. What's wrong?

Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
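To make the two passes concrete, here is a small sketch (the variable names are purely illustrative) that compares a regular literal with a raw one, and then hands the raw one to the regex engine:

```python
import re

# Pass 1: Python's own string-literal parsing.
regular = "\t"    # one character: an actual tab
raw = r"\t"       # two characters: a backslash and the letter t
print(len(regular), len(raw))

# Pass 2: the regex engine interprets whatever backslashes are left.
# Given the two-character pattern \t, the engine itself reads it as "tab".
tab_via_regex = re.fullmatch(raw, "\t") is not None

# To match one literal backslash, the engine needs to see two of them.
backslash_via_regex = re.fullmatch(r"\\", "\\") is not None
print(tab_via_regex, backslash_via_regex)
```

With raw strings, only the second pass remains, so the pattern you type is the pattern the engine receives.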
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA; -- or if you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above show how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
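As a rough sketch of that extension, assuming you want every occurrence anywhere in a longer string rather than only at the start, re.findall can be used with the same two patterns. The sample texts below are made up for illustration:

```python
import re

# Interpretation 1: the text contains real Private Use Area code points.
text_pua = "prefix \ue04a middle \ue04b end"
pua_hits = re.findall(r'[\ue000-\uf8ff]', text_pua)

# Interpretation 2: the text contains literal backslash-u sequences.
text_literal = r"prefix \ue04a middle \u00e9 end"
literal_hits = re.findall(r'\\u[0-9a-fA-F]{4}', text_literal)

print(len(pua_hits), [len(hit) for hit in pua_hits])
print(literal_hits)
```

Each PUA hit is a single character, while each literal hit is six characters long (backslash, u, and four hex digits).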

This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

Related

regex that will check expression is valid arithmetic expression, for example 'a+b*c=d' is valid and accepted

I made separate regexes for both, but they are not giving the desired result. It should check the whole input string and return valid if it is valid, or invalid if it is not.
import re
identifiers = r'^[^\d\W]\w*\Z'
operators = r'[\+\*\-\/=]'
a = re.compile(identifiers, re.UNICODE)
b = re.compile(operators, re.UNICODE)
c = re.findall(a, 'a+b*c=d')
d = re.findall(b, 'a+b*c=d')
print(c, 'identifiers')
print(d, 'operators')
The result of this snippet is:
[] identifiers
['+', '*', '='] operators
I want a result that says whether the input string is valid or invalid, by checking all characters of the input string against both regexes.
I think the issue with your current code is that your identifiers pattern only matches if it spans the whole string.
The pattern requires that both the beginning and the end of the input be matched (by ^ and \Z respectively). That is why you are not finding any identifiers: only an input like "foo" would match, since it is a single identifier that touches both the start and the end of the string. (I'd also note that it is a bit odd to mix ^ and \Z, though it is not invalid. It would be more natural to pair ^ with $ or \A with \Z.)
I suspect that you don't actually want ^ and \Z in your pattern, but rather should be using \b in place of both. The \b escape matches "word breaks", which means either the start or end of the input, or a change between word-characters and non-word characters.
>>> re.findall(r'\b[^\d\W]\w*\b', 'a+b*c=d', re.U)
['a', 'b', 'c', 'd']
This still isn't going to do what you say you ultimately want (testing the string to ensure it's a valid expression). That's a much more difficult task, and regular expressions are not up to it in general. Certain specific forms of expressions can perhaps be matched with regex, but supporting things like parentheses will break the whole system in a hurry. To identify arbitrary arithmetic expressions, you'd need a more sophisticated parser, which might use regex in some of its steps, but not for the whole thing.
For the simple cases like your example an expression like this will work:
^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$
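A minimal sketch of how that pattern could be wrapped into a valid/invalid check. The function name is_simple_expression and the sample inputs are illustrative assumptions, not part of the original question:

```python
import re

# The pattern from the answer above; it accepts lowercase letters and digits only.
SIMPLE_EXPR = re.compile(r'^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$')

def is_simple_expression(text):
    """Return True if text has the shape operand (op operand)+ = operand."""
    return SIMPLE_EXPR.match(text) is not None

for candidate in ['a+b*c=d', 'a+*b=c', 'a+b', '(a+b)*c=d']:
    verdict = 'valid' if is_simple_expression(candidate) else 'invalid'
    print(candidate, verdict)
```

Only the first candidate is accepted. The parenthesized one is rejected, which illustrates the limitation mentioned above.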

Escaping Asterisk Grabs wrong character

When I run the code below I get the unexpected result where \* also captures É. Is there a way to make it only capture * like I wanted?
let s =
"* A
ÉR
* B"
let result = System.Text.RegularExpressions.Regex.Replace(s, "\n(?!\*)", "", Text.RegularExpressions.RegexOptions.Multiline)
printfn "%s" result
Result After Running Replace
* AÉR
* B
Expected Result
"* A
ÉR
* B"
UPDATE
This seems to be working, when I use a pattern like so \n(?=\*). I guess I needed a positive lookahead.
You're right that you need to use positive lookahead instead of negative lookahead to get the result you want. However, to clarify an issue that came up in the comments, in F# a string delimited by just "" is not quite like either a plain C# string delimited by "" or a C# string delimited by @"" - if you want the latter you should also use @"" in F#. The difference is that in a normal F# string, a backslash is treated as the start of an escape sequence only when it is used in front of a valid character to escape (see the table towards the top of Strings (F#)). Otherwise, it is treated as a literal backslash character. So, since '*' is not a valid character to escape, you luckily see the behavior you expected (in C#, by contrast, this would be a syntax error because it's an unrecognized escape). I would recommend that you not rely on this and use a verbatim @"" string instead.
In other words, in F# the following three strings are all equivalent:
let s1 = "\n\*"
let s2 = "\n\\*"
let s3 = @"
\*"
I think that the C# design is more sensible because it prevents confusion on what exactly is being escaped.
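To see what the two lookaheads do to the sample text, here is a sketch in Python rather than F#, on the assumption that Python's re module treats these particular constructs the same way as .NET does:

```python
import re

s = "* A\nÉR\n* B"

# Negative lookahead: remove newlines that are NOT followed by '*'.
joined_continuations = re.sub(r"\n(?!\*)", "", s)

# Positive lookahead: remove newlines that ARE followed by '*'.
joined_bullets = re.sub(r"\n(?=\*)", "", s)

print(repr(joined_continuations))
print(repr(joined_bullets))
```

In neither case does \* consume the É. A lookahead only tests what follows the newline, so the negative form removes the newline in front of ÉR precisely because É is not an asterisk.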

Regex is grabbing preceding character

So I am experiencing some inconsistent behavior in my regex
My regex:
(?<=test\\\\)(.*)(?=\",)
The input string:
"test.exe /c echo teststring > \\\\.\\test\\teststring",
When I run this in https://Regex101.com
I get the value teststring however when I run this in F#
Regex.Match(inputString, "(?<=test\\\\)(.*)(?=\",)")
I get \teststring back. My goal is to get just teststring. I'm not sure what I'm doing wrong.
I had success using triple quoted strings. Then only the regex escapes need be considered, and not the F# string escapes.
let inputString = """test.exe /c echo teststring > \\\\.\\test\\teststring","""
let x = Regex.Match(inputString, """(?<=test\\\\)(.*)(?=\",)""")
"teststring" comes out
The string in your source comes out as
(?<=test\\)(.*)(?=",)
If you don't want to use triple quotes or verbatim, you will have to write this in F# :
"(?<=test\\\\\\\\)(.*)(?=\\\",)"
This string in F# uses backslashes to escape backslashes and a quote character. There are eight backslashes in a row in one place, and this then becomes four actual backslashes in the string value. There is also this:
\\\"
which translates to one actual \ and one actual " in the actual string value.
So then we end up with a string value of
(?<=test\\\\)(.*)(?=\",)
This then is the actual string value fed to the regex engine. The regex engine, like the F# compiler, also uses the backslash to escape characters. That's why any actual backslash had to be doubled and then doubled again.
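The same doubling can be reproduced in Python, whose ordinary and raw string literals behave much like F#'s ordinary and verbatim/triple-quoted ones for this purpose. This is a sketch under that assumption, with the input taken to contain the backslashes exactly as shown in the question:

```python
import re

# Input as the regex engine sees it (raw literal, so backslashes are literal).
input_string = r'"test.exe /c echo teststring > \\\\.\\test\\teststring",'

# Ordinary literal: 8 backslashes in source become 4 in the string value.
escaped_pattern = "(?<=test\\\\\\\\)(.*)(?=\\\",)"
# Raw literal: what you type is what the regex engine receives.
raw_pattern = r'(?<=test\\\\)(.*)(?=\",)'
same_pattern = escaped_pattern == raw_pattern

# The question's pattern after one round of unescaping: the lookbehind now
# covers only one literal backslash, so the second one lands in the group.
short_pattern = r'(?<=test\\)(.*)(?=",)'

full = re.search(raw_pattern, input_string).group(1)
short = re.search(short_pattern, input_string).group(1)
print(same_pattern, full, short)
```

The short pattern reproduces the stray leading backslash from the question, while the fully escaped one yields just teststring.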

Processing a string with the null character

I have a text file full of strings (computer paths) which I want to process by replacing every backslash with an underscore, and also replacing every number (integer or float) with an underscore. The original string looks like this:
string = "\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"
Usually, I could replace easily the backslash with the following command:
string=string.replace('\\','_')
and apply some regular expressions such as: '(\d(?:\.\d)?)' to replace the numbers.
However, in my case I couldn't do either, because Python always recognises '\0' as a null character and '\5' as ENQ; in fact, any number following the backslash will be treated the same way.
Is there a suggested way to replace them?
For example, is there a way to convert my string to a raw string as a start?
Always remember: Backslash(\) escapes special characters. If you want to use the backslash itself, you need to escape it too. Your string should look like this:
string = "\\Software\\Microsoft\\0\\Windows\\CurrentVersion\\Internet Settings\\5.0\\Cache"
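As an alternative to doubling every backslash, a raw string literal keeps them as typed. The sketch below combines that with the two replacements from the question. The number pattern is a slightly broadened variant of the one in the question, and the variable names are illustrative:

```python
import re

# A raw literal keeps every backslash exactly as typed.
path = r"\Software\Microsoft\0\Windows\CurrentVersion\Internet Settings\5.0\Cache"

# Replace numbers first (integers or decimals), then the backslashes.
no_numbers = re.sub(r'\d+(?:\.\d+)?', '_', path)
cleaned = no_numbers.replace('\\', '_')
print(cleaned)
```

Note that escape processing only applies to string literals in source code. Lines read from a text file already contain the backslashes as ordinary characters, so no conversion is needed for them.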

How to replace all the numbers with literal \d in scala?

I want to write a function, to replace all the numbers in a string with literal \d. My code is:
val r = """\d""".r
val s = r.replaceAllIn("123abc", """\d""")
println(s)
I expect the result is \d\d\dabc, but get:
dddabc
Then I change my code (line 2) to:
val s = r.replaceAllIn("123abc", """\\d""")
The result is correct now: \d\d\dabc
But I don't understand why the method replaceAllIn converts the string rather than using it directly.
There was a toList in my previous code, which is not what I want. I have just updated the question. Thanks to everyone.
Just remove the toList.
val r = """\d""".r
val list = r.replaceAllIn("123abc", """\\d""")
println(list)
Strings are (implicitly, via WrappedString, convertible to) Seq[Char]. If you invoke toList, you will have a List[Char].
Scala's Regex uses java.util.regex underneath (at least on the JVM). Now, if you look up replaceAll on Java docs, you'll see this:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
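Python's re.sub processes backslashes in its replacement string in a comparable way, so the same effect can be sketched there. This is an analogy in a different language, not the Scala API. On the JVM side, java.util.regex.Matcher.quoteReplacement exists for producing a literal replacement string.

```python
import re

# In the replacement template, a doubled backslash stands for one literal backslash.
via_template = re.sub(r'\d', r'\\d', '123abc')

# A function replacement is used verbatim, with no template processing.
via_function = re.sub(r'\d', lambda m: r'\d', '123abc')

print(via_template)
print(via_function)
```

Both calls give the same result, with each digit replaced by a backslash followed by d.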