How to replace all the numbers with literal \d in Scala?

I want to write a function that replaces every digit in a string with a literal \d. My code is:
val r = """\d""".r
val s = r.replaceAllIn("123abc", """\d""")
println(s)
I expected the result to be \d\d\dabc, but I get:
dddabc
Then I changed line 2 of my code to:
val s = r.replaceAllIn("123abc", """\\d""")
The result is now correct: \d\d\dabc
But I don't understand why replaceAllIn transforms the replacement string instead of using it as-is.
There was a toList in my previous code, which was not what I wanted. I have just updated the question. Thanks to everyone.

Just remove the toList.
val r = """\d""".r
val list = r.replaceAllIn("123abc", """\\d""")
println(list)
A String is implicitly convertible to Seq[Char] (via WrappedString), so if you invoke toList on it you get a List[Char].

Scala's Regex uses java.util.regex underneath (at least on the JVM). If you look up replaceAll in the Java docs, you'll see this:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
That is why a replacement of \d is read as an escaped d and comes out as d, while \\d produces a literal backslash followed by d. If you want a replacement string to be taken literally, Regex.quoteReplacement escapes it for you.

Related

Regular expression

I need a regular expression for the string cases below:
String value = "�江苏银行股份有限公司南京迈皋桥支行";
String value = "�/CNYXB/02112";
In both cases, only the character "�" needs to be removed, and the final string values after applying the regular expression should be:
String value = "江苏银行股份有限公司南京迈皋桥支行";
String value = "/CNYXB/02112";
Thanks in advance!
Yes, I have tried the regex below:
value = value.replaceAll("[^\\p{ASCII}]", "");
I'm not sure if this is what you're actually asking, but you can easily remove the first character from the string:
^.
matches the first character at the start of the string.
If you want to remove an out-of-range character, then you need to define your range. Use multiple classes with octal escapes, so something like:
[\o{2444}-\o{3444}\o{40}-\o{77}]
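Without knowing what the characters you're looking for really are, it's difficult to be more specific. (The \o{...} form is Perl/PCRE syntax; in Java the same code points would be written with \uXXXX or \x{...} escapes.)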
Try using replaceFirst instead of replaceAll:
value = value.replaceFirst("[^\\p{ASCII}]", "");

Regular expression with backslash in Python3

I'm trying to match a specific substring in one string with regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc")
The returned m is an empty object, even when I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
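A quick way to see the two passes is to compare a plain literal with its raw counterpart. This is only a minimal sketch; the variable names are illustrative and not part of the question:

```python
import re

plain = "\t"   # Python's own pass turns this into one character: a tab
raw = r"\t"    # raw literal: two characters, backslash and t

print(len(plain), len(raw))

# The regex engine reads the two-character raw string as "match a tab",
# so both patterns end up matching a literal tab character.
print(bool(re.match(plain, "\t")), bool(re.match(raw, "\t")))

# To match a literal backslash followed by t, the regex itself needs an
# escaped backslash, which is easiest to write as a raw string.
print(bool(re.match(r"\\t", r"\t")))
```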
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA, or, if you instead want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above shows how to match the "mystery sequence" "\ue04a"; extending the code to match a longer string containing this sequence should not be hard.
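As a rough sketch of such an extension, assuming the target really is the single character U+E04A followed by a space and some lowercase letters (the two capture groups are only an illustration, not something the question specifies):

```python
import re

text = "\ue04a abc"  # five characters: U+E04A, space, a, b, c

# One private-use code point, some whitespace, then a run of lowercase letters.
pattern = re.compile(r'([\ue000-\uf8ff])\s+([a-z]+)')

m = pattern.match(text)
if m:
    # Show the code point of the first group and the letters of the second.
    print(hex(ord(m.group(1))), m.group(2))
```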
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
Output:
\ue04a

How to filter a list in Kotlin using Regex [duplicate]

I've created a very simple match-all Regex with Regex.fromLiteral(".*").
According to the documentation: "Returns a literal regex for the specified literal string."
But I don't really get what "for the specified literal string" is supposed to mean.
Consider this example:
fun main(args: Array<String>) {
    val regex1 = ".*".toRegex()
    val regex2 = Regex.fromLiteral(".*")
    println("regex1 matches abc: " + regex1.matches("abc"))
    println("regex2 matches abc: " + regex2.matches("abc"))
    println("regex2 matches .* : " + regex2.matches(".*"))
}
Output:
regex1 matches abc: true
regex2 matches abc: false
regex2 matches .* : true
So apparently (and contrary to my expectations), Regex.fromLiteral() and String.toRegex() behave completely differently. I've tried dozens of different arguments to regex2.matches(), and the only one that returned true was .*
Does this mean that a Regex created with Regex.fromLiteral() always matches only the exact string it was created with?
If yes, what are possible use cases for such a Regex? (I can't think of any scenario where that would be useful)
Yes, it does indeed create a regex that matches the literal characters in the String. This is handy when you're trying to match symbols that would be interpreted in a regex - you don't have to escape them this way.
For example, if you're looking for strings that contain .*[](1)?[2], you could do the following:
val regex = Regex.fromLiteral(".*[](1)?[2]")
regex.containsMatchIn("foo") // false
regex.containsMatchIn("abc.*[](1)?[2]abc") // true
Of course you can do almost anything you can do with a Regex with just regular String methods too.
val literal = ".*[](1)?[2]"
literal == "foo" // equality checks
literal in "abc.*[](1)?[2]abc" // containment checks
"some string".replace(literal, "new") // replacements
But sometimes you need a Regex instance as a parameter, so the fromLiteral method can be used in those cases. Performance of these different operations for different inputs could also be interesting for some use cases.
The Regex.fromLiteral() instantiates a regex object while escaping the special regex metacharacters. The pattern you get is actually \.\*, and since you used matches() that requires a full string match, you can only match a .* string with it (with find() you could match it anywhere inside a string).
See the source code:
public fun fromLiteral(literal: String): Regex = Regex(escape(literal))

Regex that checks whether an expression is a valid arithmetic expression, for example 'a+b*c=d' is valid and accepted

I made separate regexes for identifiers and operators, but they are not giving the desired result. The check should look at the whole input string and report whether it is valid or invalid.
import re
identifiers = r'^[^\d\W]\w*\Z'
operators = r'[\+\*\-\/=]'
a = re.compile(identifiers, re.UNICODE)
b = re.compile(operators, re.UNICODE)
c = re.findall(a, 'a+b*c=d')
d = re.findall(b, 'a+b*c=d')
print(c, 'identifiers')
print(d, 'operators')
The result of this snippet is:
[] identifiers
['+', '*', '='] operators
I want the result to say whether the input string is valid or invalid, checking every character of the input against both regexes.
I think the issue you're having with your current code is that your identifiers pattern only works if it matches the whole string.
The problem is that the current pattern requires that both the beginning and the end of the input be matched (by the ^ and \Z respectively). That is why you are not finding any identifiers: only an input like "foo" would match, because it is a single identifier that touches both the start and the end of the string. (I'd also note that it is a bit odd to mix ^ and \Z together, though it is not invalid. It would just be more natural to pair ^ with $ or \Z with \A.)
I suspect that you don't actually want ^ and \Z in your pattern, but rather should be using \b in place of both. The \b escape matches "word breaks", which means either the start or end of the input, or a change between word-characters and non-word characters.
>>> re.findall(r'\b[^\d\W]\w*\b', 'a+b*c=d', re.U)
['a', 'b', 'c', 'd']
This still isn't going to do what you say you ultimately want (testing whether the string is a valid expression). That's a much more difficult task, and regular expressions are not up to it in general. Certain specific forms of expressions can perhaps be matched with a regex, but supporting things like parentheses will break the whole system in a hurry. To identify arbitrary arithmetic expressions, you'd need a more sophisticated parser, which might use regex in some of its steps, but not for the whole thing.
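To make that concrete, here is a minimal sketch of such a parser: a regex is used only to split the input into tokens, and a small recursive-descent checker does the validation. The grammar (operands, + - * /, parentheses, and at most one =) and the names tokenize and is_valid are assumptions made for illustration, not something taken from the question:

```python
import re

# One token per match: an identifier, an integer, or a single symbol.
TOKEN = re.compile(r'\s*(?:([^\W\d]\w*)|(\d+)|([-+*/=()]))')

def tokenize(text):
    """Return a list of tokens, or None if an unknown character appears."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            return None
        ident, number, symbol = m.groups()
        # Identifiers and numbers are interchangeable for validation purposes.
        tokens.append(symbol if symbol else 'operand')
        pos = m.end()
    return tokens

def is_valid(text):
    """Check text against: expr ['=' expr], with the usual precedence levels."""
    tokens = tokenize(text)
    if tokens is None:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        # factor := operand | '(' expr ')'
        if peek() == 'operand':
            advance()
            return True
        if peek() == '(':
            advance()
            if not expr() or peek() != ')':
                return False
            advance()
            return True
        return False

    def term():
        # term := factor (('*' | '/') factor)*
        if not factor():
            return False
        while peek() in ('*', '/'):
            advance()
            if not factor():
                return False
        return True

    def expr():
        # expr := term (('+' | '-') term)*
        if not term():
            return False
        while peek() in ('+', '-'):
            advance()
            if not term():
                return False
        return True

    if not expr():
        return False
    if peek() == '=':
        advance()
        if not expr():
            return False
    # Valid only if every token was consumed.
    return pos == len(tokens)

for candidate in ['a+b*c=d', '(a+b)*c', 'a+*b', '(a+b']:
    print(candidate, 'valid' if is_valid(candidate) else 'invalid')
```

Unary minus, floating-point numbers, and function calls are deliberately left out; each would need its own rule in the grammar.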
For simple cases like your example, an expression like this will work:
^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$
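A short sketch of how this pattern could be tried from Python; the sample inputs are made up for illustration. Note that the pattern requires at least one operator on the left-hand side and a single operand on the right, so something like x=y is rejected:

```python
import re

SIMPLE = re.compile(r'^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$')

for candidate in ['a+b*c=d', 'a+b', 'a+=d', 'x=y']:
    verdict = 'valid' if SIMPLE.match(candidate) else 'invalid'
    print(candidate, verdict)
```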

Escaping Asterisk Grabs wrong character

When I run the code below I get the unexpected result where \* also captures É. Is there a way to make it only capture * like I wanted?
let s =
"* A
ÉR
* B"
let result = System.Text.RegularExpressions.Regex.Replace(s, "\n(?!\*)", "", Text.RegularExpressions.RegexOptions.Multiline)
printfn "%s" result
Result After Running Replace
* AÉR
* B
Expected Result
"* A
ÉR
* B"
UPDATE
This seems to work when I use a pattern like \n(?=\*). I guess I needed a positive lookahead.
You're right that you need to use positive lookahead instead of negative lookahead to get the result you want. However, to clarify an issue that came up in the comments, in F# a string delimited by just "" is not quite like either a plain C# string delimited by "" or a C# string delimited by @"". If you want the latter, you should also use @"" in F#. The difference is that in a normal F# string, a backslash is treated as the start of an escape sequence only when it is in front of a valid character to escape (see the table towards the top of Strings (F#)). Otherwise, it is treated as a literal backslash character. So, since '*' is not a valid character to escape, you luckily see the behavior you expected (in C#, by contrast, this would be a syntax error because it's an unrecognized escape). I would recommend that you not rely on this and use a verbatim @"" string instead.
In other words, in F# the following three strings are all equivalent:
let s1 = "\n\*"
let s2 = "\n\\*"
let s3 = @"
\*"
I think that the C# design is more sensible because it prevents confusion on what exactly is being escaped.