Perl regex: string literal containing metacharacters as variable for match expression

I need to define a string literal as a variable which will be later used as a match expression.
I want my variable $regex_op to match the string alt_id: ID: as well as the string id: ID:.
my $regex_op = "(id|alt_id):\sID:";
my $searchword = "4";
Later on, I'm joining the variables in a regular expression:
/^($regex_op)($searchword)/m
Unfortunately, Perl warns about the whitespace escape \s: "Unrecognized escape \s passed through".
The problem apparently consists in the string literal containing backslashes (which are needed as part of the regex later on!).
Any ideas how to solve this?

For regexes, use regex quotes qr//. This ensures the correct parsing rules for regexes are used, not those for double quoted strings:
my $regex_op = qr/(?:id|alt_id):\sID:/; # I think that group should be non-capturing
my $searchword = 4;
/^$regex_op($searchword)/m; # no need to group $regex_op; unless you want to capture
In a double-quoted string, if a backslash is followed by a character that is not a known escape, that character is left as is, but the backslash is removed:
"\s" eq "s"

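The same idea carries over to other languages. As a rough Python analogue (not Perl; the sample text and the variable names below are made up for illustration), a raw string plays the role that qr// plays above: it keeps \s intact until the regex engine sees it. This is only a sketch of the composition step, assuming the search word should be matched literally:

```python
import re

# Raw string: the backslash in \s is passed to the regex engine untouched.
regex_op = r"(?:id|alt_id):\sID:"
searchword = "4"

# re.escape guards against regex metacharacters inside the search word.
pattern = re.compile(rf"^{regex_op}({re.escape(searchword)})", re.MULTILINE)

# Hypothetical input, one record per line.
text = "id: ID:4711\nalt_id: ID:42\nname: ID:4"

full_matches = [m.group(0) for m in pattern.finditer(text)]
captured = [m.group(1) for m in pattern.finditer(text)]
print(full_matches)
```

Only the first two lines match, because the third does not start with id: or alt_id:.
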
Related

Regex replace between 2 words and modify result

I have this string example:
var s = 'type=audio&hls=&mp3=foo';
I would like to find everything between = and & and replace with quotes + matched value so I get this:
type="audio" hls="" mp3="foo"
(the match is in quotes even if it's empty, and & gets replaced with a space)
This is my regex, but it's not working:
s = s.replace(/=.+?\\&/g,function(a,inside){
return '="'+inside+'" ';
})
If we consider your regex one token at a time, here's what it means:
= matches a literal equal sign
.+? matches one or more characters, as few as possible (a lazy match)
\\ matches a literal backslash
& matches a literal ampersand
Among other problems, this requires a literal backslash to be in the input string, and requires every value to be followed by an ampersand. Also, .+? needs at least one character, so an empty value such as hls= cannot be handled correctly, and because . matches any character, the match might include ampersands and equal signs.
Also, there is no need to replace with a function, as JavaScript can do what you are doing with just a string replacement.
A better regex might have these tokens:
= matches a literal equal sign
[^&]* matches a string (possibly empty) that does not contain ampersands
&? matches an optional ampersand
As Wiktor pointed out above, this could all be combined together like this:
s = s.replace(/=([^&]*)&?/g, '="$1" ').trim();
Here the parentheses capture the part of the match that should be kept, $1 refers to that captured text in the replacement, and .trim() removes the trailing space.
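For comparison, here is a minimal Python sketch of the same substitution. It is a port of the JavaScript one-liner, not part of the original answer; the function name quote_values is made up, and \1 is Python's spelling of $1:

```python
import re


def quote_values(query: str) -> str:
    # =([^&]*) captures the value (possibly empty); &? consumes the separator.
    # The replacement wraps the captured value in quotes and adds a space.
    return re.sub(r'=([^&]*)&?', r'="\1" ', query).strip()


result = quote_values('type=audio&hls=&mp3=foo')
print(result)  # type="audio" hls="" mp3="foo"
```
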

How to exclude part of a string using regex and add this part at the end of the string?

I've got a little problem with regex.
I have a few strings in one file that look like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define the correct regex and rename those strings to:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
Currently, my regex is:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in the file
But what should the replacement look like? I need the numbers from the string at the end of the new name.
.D%%ODATE.SYSCOP.# - this is just a string, not a regex, and it didn't work
Any ideas?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
Each \N in the replacement pattern refers to the Nth capturing group (pair of parentheses) in the pattern, so you can rearrange the matched parts as needed.
Note that . must be escaped to match a literal dot.
Please mind the raw string literals: the r prefix before a string literal helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2'; you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, which is exactly what is needed to build regex escape sequences (note that r'\n' and '\n' both match a newline: the first is a regex escape sequence matching a newline, and the second is a literal LF character).
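Putting the pieces together, a small self-contained sketch might look like this. The helper name rename and the list of sample lines are assumptions for illustration; the pattern and replacement are the ones from the answer. The last lines show why the r prefix matters for the replacement:

```python
import re

# Three groups: ".SYSCOP", the two digits, and ".D%%ODATE".
PATTERN = re.compile(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)')


def rename(line: str) -> str:
    # Reorder the groups: group 3, then group 1, then ".#" and the digits.
    return PATTERN.sub(r'\3\1.#\2', line)


lines = [
    "TEST.SYSCOP01.D%%ODATE",
    "TEST.SYSCOP02.D%%ODATE",
    "TEST.SYSCOP03.D%%ODATE",
]
renamed = [rename(line) for line in lines]
print(renamed)

# Without the r prefix, '\3', '\1' and '\2' are octal escapes, i.e. single
# control characters, not group references.
plain = '\3\1.#\2'
raw = r'\3\1.#\2'
print(len(plain), len(raw))
```
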

kotlin String::replace removing escape sequences?

I'm trying some string manipulation using regexes, but I'm not getting the expected output:
var myString = "/api/<user_id:int>/"
myString.replace(Regex("<user_id:int>"), "(\\d+)")
This should give me something like /api/(\d+)/, but instead I get /api/(d+)/.
However, if I create an escaped string directly, like var a = "\\d+",
I get the correct output \d+ (which I can further use to create a regex Pattern).
Is this due to the way String::replace works?
If so, isn't this a bug? Why is it removing my escape sequences?
To make the replace a literal string, use:
myString.replace(Regex("<user_id:int>"), Regex.escapeReplacement("(\\d+)"))
For details, this is what kotlin Regex.replace is doing:
Pattern nativePattern = Pattern.compile("<user_id:int>");
String m = nativePattern.matcher("/api/<user_id:int>/").replaceAll("(\\d+)");
// m is now "/api/(d+)/"
From Matcher.replaceAll() javadoc:
Note that backslashes (\) and dollar signs ($) in the replacement
string may cause the results to be different than if it were being
treated as a literal replacement string. Dollar signs may be treated
as references to captured subsequences as described above, and
backslashes are used to escape literal characters in the replacement
string.
The call to Regex.escapeReplacement above applies exactly that escaping, turning the string literal "(\\d+)" into the equivalent of "(\\\\d+)".
You are using a .replace overload that takes a regex as the first argument, thus, the second argument is parsed as a regex replacement pattern. Inside a regex replacement pattern, a \ char is special, it may escape a dollar symbol to be treated as a literal dollar sign. So, the literal backslash inside regex replacement patterns should be doubled.
You might use
myString.replace(Regex("<user_id:int>"), """(\\d+)""")
Whenever you have to search and replace with a regex and your replacement pattern is a dynamic value, you should use Regex.escapeReplacement (see GUIDO's answer).
However, you are replacing a literal value with another literal value, you do not have to use a regex here:
myString.replace("<user_id:int>", """(\d+)""")
This yields /api/(\d+)/.
Note the use of raw string literals where a backslash is parsed as a literal backslash.
The regex engine treats a backslash in the replacement as an escape character, much as a double-quoted string does.
Many regex engines behave this way.
In engines that support them, this is also how control codes such as tab, newline, or carriage return are written.
Nothing special here.
So the replacement, as the engine needs to see it, is (\\d+).
The language's string literal applies the same escaping once more.
Final result: repl_str = "(\\\\d+)"
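The same two layers of escaping exist outside Kotlin. As a hedged illustration in Python (a different language from the question, used here only to show the principle), re.sub also treats backslashes in a replacement template specially, and recent Python versions reject an unknown escape like \d in the template outright. Python has no direct counterpart to Regex.escapeReplacement, so the sketch below shows three workarounds; the variable names are made up:

```python
import re

path = "/api/<user_id:int>/"
placeholder = re.escape("<user_id:int>")  # match the placeholder literally
literal = r"(\d+)"                        # the text we want in the output

# 1. A function replacement: its return value is inserted as-is,
#    without any template processing.
via_function = re.sub(placeholder, lambda m: literal, path)

# 2. Doubling every backslash so the template yields a literal backslash
#    (a manual stand-in for an escape-replacement helper).
via_escaped = re.sub(placeholder, literal.replace("\\", "\\\\"), path)

# 3. No regex at all: plain string replacement has no template layer.
via_plain = path.replace("<user_id:int>", literal)

print(via_function, via_escaped, via_plain)
```

All three produce /api/(\d+)/. When both the search text and the replacement are literals, the third form is usually the simplest.
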

Regex replace escaped characters

I would like to define a regex pattern which replaces escaped characters with the corresponding value.
For example the string
xy\tz\\x
Should be converted to
xy{tab}z\x
The problem is how to handle things like
xy\\\\\t
this string should become
xy\\{tab}
I don't know how to create a pattern which matches only odd backslashes.
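This is hard to do with a single pattern and a fixed replacement string. To start, collapse the doubled backslashes:
s/\\\\/\\/g
This replaces two backslashes with a single one.
Then you can just apply one pattern per escaped character:
s/\\t/\t/g
The trick here is to escape the backslash you want to replace. What this does is replace the literal string "\t" with a tab character.
Be aware that the order of the passes matters: an input such as \\t (an escaped backslash followed by a plain t) collapses to \t in the first pass and then wrongly becomes a tab in the second. Scanning left to right and handling each backslash-plus-character pair exactly once avoids this.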
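A single left-to-right pass avoids having to count backslashes at all: match a backslash together with whatever character follows it, and look that character up in a table. The Python sketch below is a different technique from the two-step substitution above, not a transcription of it. The escape table and the function name unescape are assumptions, and unknown escapes are left unchanged here as a design choice:

```python
import re

# Hypothetical escape table; extend it as needed.
ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}

# A backslash plus the next character, consumed together, so a pair is
# never split and reinterpreted by a later match.
ESCAPE_PAIR = re.compile(r"\\(.)", re.DOTALL)


def unescape(text: str) -> str:
    # Unknown escapes fall back to the original two characters.
    return ESCAPE_PAIR.sub(lambda m: ESCAPES.get(m.group(1), m.group(0)), text)


print(repr(unescape(r"xy\tz\\x")))   # tab, then a single backslash
print(repr(unescape(r"xy\\\\\t")))   # two backslashes, then a tab
print(repr(unescape(r"a\\tb")))      # backslash followed by a plain t
```

Because each match consumes both characters, an even run of backslashes turns into half as many backslashes, and only an odd trailing backslash can pair with the t.
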

How can I match double-quoted strings with escaped double-quote characters?

I need a Perl regular expression to match a string. I'm assuming only double-quoted strings, that a \" is a literal quote character and NOT the end of the string, and that a \\ is a literal backslash character and should not escape a quote character. If it's not clear, some examples:
"\"" # string is 1 character long, contains double quote
"\\" # string is 1 character long, contains backslash
"\\\"" # string is 2 characters long, contains backslash and double quote
"\\\\" # string is 2 characters long, contains two backslashes
I need a regular expression that can recognize all 4 of these possibilities, and all other simple variations on those possibilities, as valid strings. What I have now is:
/".*[^\\]"/
But that's not right - it won't match any of those except the first one. Can anyone give me a push in the right direction on how to handle this?
/"(?:[^\\"]|\\.)*"/
This is almost the same as Cal's answer, but has the advantage of matching strings containing escape codes such as \n.
The ?: characters are there to make the group non-capturing, so its contents are not saved for backreferences; they can be removed.
NOTE: as pointed out by Louis Semprini, this is limited to 32kb texts due to a recursion limit built into Perl's regex engine (that unfortunately silently returns a failure when hit, instead of crashing loudly).
How about this?
/"([^\\"]|\\\\|\\")*"/
This matches zero or more characters that aren't backslashes or quotes, OR two backslashes, OR a backslash followed by a quote.
A generic solution (matching all backslashed characters):
/ \A " # Start of string and opening quote
(?: # Start group
[^\\"] # Anything but a backslash or a quote
| # or
\\. # Backslash and anything
)* # End of group
" \z # Closing quote and end of string
/xms
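To check the generic pattern against the four examples from the question, here is a small Python sketch. It is a port for illustration, not Perl: re.fullmatch stands in for the \A and \z anchors, re.DOTALL for the /s flag, and the helper name is_quoted_string is made up:

```python
import re

# Opening quote, then any mix of ordinary characters and backslash pairs,
# then the closing quote.
QUOTED = re.compile(r'"(?:[^\\"]|\\.)*"', re.DOTALL)


def is_quoted_string(text: str) -> bool:
    # fullmatch requires the whole input to be one quoted string.
    return QUOTED.fullmatch(text) is not None


# The four examples from the question, written as raw strings.
valid = [r'"\""', r'"\\"', r'"\\\""', r'"\\\\"']
# The closing quote is escaped, so these strings never terminate.
invalid = [r'"\"', r'"\\\"']

for sample in valid + invalid:
    print(sample, is_quoted_string(sample))
```

The two alternatives inside the group can never match at the same position, so the pattern does not need to backtrack between them.
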
See Text::Balanced. It's better than reinventing the wheel. Use gen_delimited_pat to see the resulting pattern and learn from it.
Regexp::Common is another useful tool to be aware of. It contains regexps for many common cases, including quoted strings:
use Regexp::Common;
my $str = '" this is a \" quoted string"';
if ($str =~ $RE{quoted}) {
# do something
}
Here's a very simple way:
/"(?:\\?.)*?"/
Just remember if you're embedding such a regex in a string to double the backslashes.
Try this piece of code : (\".+")