kotlin String::replace removing escape sequences? - regex

I'm trying some string manipulation using regex's, but I'm not getting the expected output
var myString = "/api/<user_id:int>/"
myString.replace(Regex("<user_id:int>"), "(\\d+)")
this should give me something like /api/(\d+)/ but instead I get /api/(d+)/
However if I create an escaped string directly like var a = "\d+"
I get the correct output \d+ (that I can further use to create a regex Pattern)
is this due to the way String::replace works?
if so, isn't this a bug, why is it removing my escape sequences?

To make the replace a literal string, use:
myString.replace(Regex("<user_id:int>"), Regex.escapeReplacement("(\\d+)"))
For details, this is what kotlin Regex.replace is doing:
Pattern nativePattern = Pattern.compile("<user_id:int>");
String m = nativePattern.matcher("/api/<user_id:int>/").replaceAll("(\\d+)");
-> m = (d+)
From Matcher.replaceAll() javadoc:
Note that backslashes () and dollar signs ($) in the replacement
string may cause the results to be different than if it were being
treated as a literal replacement string. Dollar signs may be treated
as references to captured subsequences as described above, and
backslashes are used to escape literal characters in the replacement
string.
The call to Regex.escapeReplacement above does exactly that, turning (\\d+) to (\\\\d+)

You are using a .replace overload that takes a regex as the first argument, thus, the second argument is parsed as a regex replacement pattern. Inside a regex replacement pattern, a \ char is special, it may escape a dollar symbol to be treated as a literal dollar sign. So, the literal backslash inside regex replacement patterns should be doubled.
You might use
myString.replace(Regex("<user_id:int>"), """(\\d+)""")
Whenever you have to search and replace with a regex and your replacement pattern is a dynamic value, you should use Regex.escapeReplacement (see GUIDO's answer).
However, you are replacing a literal value with another literal value, you do not have to use a regex here:
myString.replace("<user_id:int>", """(\d+)""")
See this Kotlin demo yielding /api/(\d+)/.
Note the use of raw string literals where a backslash is parsed as a literal backslash.

The replacement as the regex engine see's it is interpolated as a double quoted string.
This is true with every regex engine.
This is to distinguish control codes, like tab newline or carriage return.
Nothing special here.
So the replacement as the engine wants to see it is (\\d+).
The language interpolates the same.
Final result repl_str = "(\\\\d+)"

Related

Regular expression in Snowflake - starts with string and ends with digits

I am struggling with writing regex expression in Snowflake.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123','^DEM.*\d\d$') AS regex
I would like to find all strings that starts with "DEM" and ends with two digits. Unfortunately the expression that I am using returns FALSE.
I was checking this expression in two regex generators and it worked.
In snowflake the backslash character \ is an escape character.
Reference: Escape Characters and Caveats
So you need to use 2 backslashes in a regex to express 1.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123', '^DEM.*\\d\\d$') AS regex
Or you could write the regex pattern in such a way that the backslash isn't used.
For example, the pattern ^DEM.*[0-9]{2}$ matches the same as the pattern ^DEM.*\d\d$.
You need to escape your backslashes in your SQL before it can be parsed as a regex string. (sometimes it gets a bit silly with the number of backslashes needed)
Your example should look like this
RLIKE('DEM7BZB01-123','^DEM.*\\d\\d$') AS regex
RLIKE (which is an alias in Snowflake for the SQL Standard REGEXP_LIKE function) implicitly adds ^ and $ to your search pattern...
The function implicitly anchors a pattern at both ends (i.e. '' automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
so you can remove them, and that then allows you to use $$ quoting
In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).
You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).
so you can simply use the regex DEM.*\d\d to find all strings that starts with DEM and ends with two digits without extra escaping as follows
SELECT
'DEM7BZB01-123' AS SKU
, RLIKE('DEM7BZB01-123', $$DEM.*\d\d$$) AS regex
which gives
SKU |REGEX|
-------------+-----+
DEM7BZB01-123|true |

escape apostrophe in a regex string that starts and ends with apostrophe in dart

I'm trying to create a regular expression match for an email address and I intend to use it in a dart application.
I found the following regex for that:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
now i'm really new to dart but I understood that I can create regular expression strings with r'' or r"".
now with dart I can escape characters with \ so if I want to escape an apostrophe in a string that I started and ended with apostrophe I can just do this:
final String a = 'foo\'bar';
but with final String a = r'foo\'bar' I get an error. how can I properly escape that ?
thank you
No, r'' does not mean "regular expression". It means "raw", so backslash is interpreted as a literal backslash, and not as an escape character.
Not having to escape each backslash is useful for the kind of strings which often contain a lot of backslashes, such as regular expression patterns.
Regular expressions are created as instances of the RegExp class.
You can concatenate raw strings that use different delimiters to create a single string for the whole pattern. In your case, this should work:
String pattern = r"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|" + r'"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])';
RegExp exp = new RegExp(pattern);

How to exclude part of string using regex and change add this part and the and of string?

I've got a little problem with regex.
I got few strings in one file looking like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define correct regex and change those string name for:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
Actually, I got my regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in file
But how should look like the changing regex? I need to have the numbers from a string at the and of new string name.
.D%%ODATE.SYSCOP.# - this is just string, no regex and It didn't work
Any idea?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
Demo
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
See the regex demo
Each \X in the replacement pattern refers to the Nth parentheses in the pattern, thus, you may rearrange the match value as per your needs.
Note that . must be escaped to match a literal dot.
Please mind the raw string literal, the r prefix before the string literals helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2', you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, just the one that is used to build regex escape sequences (note that r'\n' and '\n' both match a newline since the first one is a regex escape sequence matching a newline and the second is a literal LF symbol.)

Groovy : RegEx for matching Alphanumeric and underscore and dashes

I am working on Grails 1.3.6 application. I need to use Regular Expressions to find matching strings.
It needs to find whether a string has anything other than Alphanumeric characters or "-" or "_" or "*"
An example string looks like:
SDD884MMKG_JJGH1222
What i came up with so far is,
String regEx = "^[a-zA-Z0-9*-_]+\$"
The problem with above is it doesn't search for special characters at the end or beginning of the string.
I had to add a "\" before the "$", or else it will give an compilation error.
- Groovy:illegal string body character after dollar sign;
Can anyone suggest a better RegEx to use in Groovy/Grails?
Problem is unescaped hyphen in the middle of the character class. Fix it by using:
String regEx = "^[a-zA-Z0-9*_-]+\$";
Or even shorter:
String regEx = "^[\\w*-]+\$";
By placing an unescaped - in the middle of character class your regex is making it behave like a range between * (ASCII 42) and _ (ASCII 95), matching everything in this range.
In Groovy the $ char in a string is used to handle replacements (e.g. Hello ${name}). As these so called GStrings are only handled, if the string is written surrounding it with "-chars you have to do extra escaping.
Groovy also allows to write your strings without that feature by surrounding them with ' (single quote). Yet the easiest way to get a regexp is the syntax with /.
assert "SDD884MMKG_JJGH1222" ==~ /^[a-zA-Z0-9*-_]+$/
See Regular Expressions for further "shortcuts".
The other points from #anubhava remain valid!
It's easier to reverse it:
String regEx = "^[^a-zA-Z0-9\\*\\-\\_]+\$" /* This matches everything _but_ alnum and *-_ */

What does it mean "you can’t hide the terminating delimiter of a pattern inside a regex construct" in the "Programming Perl"?

Sorry, but once again I need help to understand rather complicated snippet from the "Programming Perl" book. Here it is (what is obscure to me marked as bold):
patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes
as the delimiter) and special characters indicated with backslash escapes. These are applied before the string is interpreted as a regular expression (This is one of the
few places in the Perl language where a string undergoes more than one pass of
processing). ...
Another consequence of this two-pass parsing is that the ordinary Perl tokener
finds the end of the regular expression first, just as if it were looking for the
terminating delimiter of an ordinary string. Only after it has found the end of the
string (and done any variable interpolation) is the pattern treated as a regular
expression. Among other things, this means you can’t “hide” the terminating
delimiter of a pattern inside a regex construct (such as a bracketed character class
or a regex comment, which we haven’t covered yet). Perl will see the delimiter
wherever it is and terminate the pattern at that point.
First, why it is said that Only after it has found the end of the string not the end of the regular expression which it was looking, as stated before?
Second, what does it mean you can’t “hide” the terminating delimiter of a pattern inside a regex construct? Why I can't hide the terminating delimiter /, whereas I can place it wherever I want either in the regexp directly /A\/C/ or in a interpolated variable (even without \):
my $s = 'A/';
my $p = 'A/C';
say $p =~ /$s/;
outputs 1.
While I was writing and re-reading my question I thought that this snippet tells about using a single-quote as a regexp delimiter, then it all seems quite cohesive. Is my assumption correct?
My appreciation.
It says "end of the string" instead of "end of the regular expression" because at that point it's treating the regex as if it were just a string.
It's trying to say that this does not work:
/foo[-/_]/
Even though normal regex metacharacters are not special inside [], Perl will see the regex as /foo[-/ and complain about an unterminated class.
It's trying to say that Perl does not parse the regex as it reads it. First it finds the end of the regex in your source code as if it were a quoted string, so the only special character is \. Then it interpolates any variables. Then it parses the result as a regular expression.
You can hide the terminating delimiter with \ because that works in ordinary strings. You can hide the delimiter inside an interpolated variable, because interpolation happens after the delimiter is found. If you use a bracketing delimiter (e.g. { } or [ ]), you can nest matching pairs of delimiters inside the regex, because q{} works like that too.
But you can't hide it inside any other regex construct.
Say you want to match a *. You would use
m/\*/
But what if you were using you used * as your delimiter? The following doesn't work:
m*\**
because it's interpreted as
m/*/
as seen in the following:
$ perl -e'm*\**'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at -e line 1.
Take the string literal
"a\"b"
It produces the string
a"b
Similarly, the match operator
m*a\*b*
produces the regex pattern
a*b
If you want to match a literal *, you have to use other means. In other words.
m*a\*b* === m/a*b/ matches pattern a*b
m*a\x{2A}b* === m/a\*b/ matches pattern a\*b