Perl split pattern - regex

According to the perldoc, the syntax for split is:
split /PATTERN/,EXPR,LIMIT
But the PATTERN can also be a single- or double-quoted string: split "PATTERN", EXPR. What difference does it make?
Edit: A difference I'm aware of is splitting on backslashes: split /\\/ vs split '\\'. The second form doesn't work.

It looks like it uses that as "an expression to specify patterns":
The pattern /PATTERN/ may be replaced
with an expression to specify patterns
that vary at runtime. (To do runtime
compilation only once, use
/$variable/o .)
edit: I tested it with this:
my $foo = 'a:b:c,d,e';
print join(' ', split("[:,]", $foo)), "\n";
print join(' ', split(/[:,]/, $foo)), "\n";
print join(' ', split(/\Q[:,]\E/, $foo)), "\n";
Except for the ' ' special case, it looks just like a regular expression.

PATTERN is always interpreted as... well, a pattern -- never as a literal value. It can be either a regex [1] or a string. Strings are compiled to regexes. For the most part the behavior is the same, but there can be subtle differences caused by the double interpretation.
The string '\\' only contains a single backslash. When interpreted as a pattern, it's as if you had written /\/, which is invalid:
C:\>perl -e "print join ':', split '\\', 'a\b\c'"
Trailing \ in regex m/\/ at -e line 1.
Oops!
Additionally, there are two special cases:
The empty pattern //, which splits on the empty string.
A single space ' ', which splits on whitespace after first trimming any
leading or trailing whitespace.
[1] Regexes can be supplied either inline (/.../) or as a precompiled qr// quoted string.
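A minimal sketch of that double interpretation, assuming the sample string a\b\c from the command above. To split on a backslash with a string pattern, the string itself has to contain two backslashes, which takes four in single quotes:

```perl
use strict;
use warnings;

my $text = 'a\b\c';                      # a, backslash, b, backslash, c

# Regex literal: \\ in the pattern matches one literal backslash.
my @by_regex = split /\\/, $text;

# String pattern: '\\\\' is a two-character string (two backslashes),
# which the regex engine then reads as one escaped backslash.
my @by_string = split '\\\\', $text;

# A precompiled pattern behaves like the inline regex.
my $re    = qr/\\/;
my @by_qr = split $re, $text;

print join(':', @by_regex),  "\n";       # a:b:c
print join(':', @by_string), "\n";       # a:b:c
print join(':', @by_qr),     "\n";       # a:b:c
```

All three calls give the same fields; only the amount of escaping differs.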

I believe there's no difference. A string pattern is also interpreted as a regular expression.

perl -e 'print join("-",split("[a-e]","regular"))';
r-gul-r
As you see, the delimiter is interpreted as a regular expression, not a string literal.
So, it's mostly the same - with one important exception: split(" ",... ) and split(/ /,... ) are different.
I prefer to use /PATTERN/ to avoid confusion; otherwise it's easy to forget that it's a regexp.

Two observable rules:
the special case split(" ") behaves like split(/\s+/), except that leading whitespace is skipped first, so no empty leading field is produced.
for everything else (it seems, don't nail me), split("something") is equal to split(/something/)
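To make the whitespace special case concrete, here is a small sketch. The input string is made up for illustration; it has two leading spaces and a double space between the first two words:

```perl
use strict;
use warnings;

my $line = '  alpha  beta gamma';

my @awk_mode  = split ' ',   $line;   # special case: leading whitespace is skipped
my @ws_regex  = split /\s+/, $line;   # leading whitespace yields an empty first field
my @one_space = split / /,   $line;   # every single space is a separator

print scalar(@awk_mode),  "\n";       # 3
print scalar(@ws_regex),  "\n";       # 4
print scalar(@one_space), "\n";       # 6
```

The counts differ only because of the empty fields: /\s+/ adds one at the front, and / / adds one for each extra space.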

Related

powershell -replace: surround captured regex group with dollar signs like: $group$

I want to replace strings like url: `= this.url` with url: $url$
I got quite close with this:
(Get-Content '.\file') -Replace "``= this.(\w+)``", "$ `$1$"
with output url: $ url$.
But when I remove extra space then the output breaks.
How can I escape/modify "$`$1$" so that it works?
You can use
-Replace "``= this\.(\w+)``", '$$$1$$'
Note that
The . must be escaped in the regex pattern
'$$$1$$' is a $$$1$$ string that contains:
$$ - a literal single $ char
$1 - the backreference to the first capturing group
$$ - a literal single $ char.
PowerShell 7 version of -replace with a scriptblock as the 2nd argument. I'm assigning $_ to $a just to look at it. Note that the backquote is a special character inside double quotes, which I'm avoiding by using single quotes.
'url: `= this.url`' -replace '`= this\.(\w+)`', {$a = $_; '$' + $_.groups[1] + '$'}
url: $url$
$a
Groups : {0, 1}
Success : True
Name : 0
Captures : {0}
Index : 5
Length : 12
Value : `= this.url`
ValueSpan :
tl;dr
# * Consistent use of '...', obviating the need to `-escape ` and $
# * Verbatim $ chars. in the substitution string escaped as $$
# * Capture-group reference $1 represented as ${1} for visual clarity.
(Get-Content .\file) -replace '`= this\.(\w+)`', '$$${1}$$'
Background information and guidance:
In the substitution operand of PowerShell's regex-based -replace operator, a verbatim $ character must be escaped as $$, given that $-prefixed tokens have special meaning, namely to refer to results of the regex matching operation, such as $1 in your command (a reference to what the 1st, unnamed capture group in the search regex captured).
Unlike what the docs partially suggest, such a substitution string is not itself a regex, and any other characters are used verbatim.
To programmatically escape $ for verbatim use in a substitution string, it's simplest to use the .Replace() .NET string method, which performs verbatim (literal) replacements (assuming that all $ instances are to be escaped); e.g. '$foo$'.Replace('$', '$$')
Note that, situationally, a capture-group reference such as $1 may need to be disambiguated as ${1}, and you may always choose to do that for visual clarity, as shown above.
Only the search operand is a regex, and there all characters that are regex metacharacters must be \-escaped in order to be used verbatim, which can be done:
character-individually, in string literals (amount: \$)
programmatically, for entire strings, using [regex]::Escape() ([regex]::Escape('amount: $'))
To avoid confusion over up-front string interpolation by PowerShell vs. what the .NET regex engine ends up seeing, it's best to consistently use verbatim (single-quoted) strings ('...') rather than expandable (double-quoted) strings ("...").
If embedding PowerShell variable values is needed, use techniques such as:
string concatenation ('^' + [regex]::Escape($foo) + '$')
or -f, the format operator ('^{0}$' -f [regex]::Escape($foo))
In your case, using '...' helps you avoid the `-escaping that "..." requires to make PowerShell treat $ and ` (and ") verbatim, as shown above.
For a comprehensive overview of PowerShell's -replace operator, see this answer.

matching cond in perl using double exclamation

if ($a =~ m!^$var/!)
$var is a key in a two dimensional hash and $a is a key in another hash.
What is the meaning of this expressions?
This is a regular expression ("regex"), where the ! character is used as the delimiter for the pattern that is to be matched in the string that it binds to via the =~ operator (the $a† here).
It may clear it up to consider the same regex with the usual delimiter instead, $a =~ /^$var\// (then m may be omitted); but now any / used in the pattern clearly must be escaped. To avoid that unsightly and noisy \/ combo one often uses another character for the delimiter, as nearly any character may be used (my favorite is the curlies, m{^$var/}). ‡ §
This regex in the question tests whether the value in the variable $a begins with (by ^ anchor) the value of the variable $var followed by / (variables are evaluated and the result used). §
† Not a good choice for a variable name since $a and $b are used by the builtin sort
‡ With the pattern prepared ahead of time the delimiter isn't even needed
my $re = qr{^$var/};
if ($string =~ $re) ...
(but I do like to still use // then, finding it clearer altogether)
Above I use qr but a simple q() would work just fine (while I absolutely recommend qr). These take nearly any characters for the delimiter, as well.
§ Inside a pattern the evaluated variables are used as regex patterns, which is wrong in general (when this is intended they should be compiled using qr and thus used as subpatterns).
An unimaginative example: a variable $var = q(\s) (literal backslash followed by letter s) evaluated inside a pattern yields the \s sequence which is then treated as a regex pattern, for whitespace. (Presumably unintended; we just wanted \ and s.)
This is remedied by using quotemeta, /\Q$var\E/, so that possible metacharacters in $var are escaped; this results in the correct pattern for the literal characters, \\s. So a correct way to write the pattern is m{^\Q$var\E/}.
Failure to do this also allows the injection bug. Thanks to ikegami for commenting on this.
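A short sketch of the difference \Q...\E makes. The value of $var and the test strings are hypothetical, chosen so that the variable contains a regex metacharacter:

```perl
use strict;
use warnings;

my $var = q(a.c);                                      # contains a metacharacter

my $loose  = ('abc/file' =~ m{^$var/})     ? 1 : 0;    # "." matches any character
my $quoted = ('abc/file' =~ m{^\Q$var\E/}) ? 1 : 0;    # "." must be a literal dot
my $exact  = ('a.c/file' =~ m{^\Q$var\E/}) ? 1 : 0;    # literal text still matches

print "$loose $quoted $exact\n";                       # 1 0 1
```

Without quoting, the key a.c also accepts abc; with \Q...\E only the literal text matches.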
The match operator (m/.../) is one of Perl's "quote-like" operators. The standard usage is to use slashes before and after the regex that goes in the middle of the operator (and if you use slashes, then you can omit the m from the start of the operator). But if the regex itself contains a slash then it is convenient to use a different delimiter instead to avoid having to escape the embedded slash. In your example, the author has decided to use exclamation marks, but any non-whitespace character can be used.
Many Perl operators work like this - m/.../, s/.../.../, tr/.../.../, q/.../, qq/.../, qr/.../, qw/.../, qx/.../ (I've probably forgotten some).
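As an illustration that the delimiter does not change what is matched, the following sketch runs the same test with three delimiters. The values of $var and $path are made up for the example:

```perl
use strict;
use warnings;

my $var  = 'docs';                    # hypothetical value
my $path = 'docs/readme.txt';         # hypothetical string to test

my @results = (
    ($path =~ /^$var\// ? 1 : 0),     # default delimiter, slash escaped
    ($path =~ m!^$var/! ? 1 : 0),     # the form from the question
    ($path =~ m{^$var/} ? 1 : 0),     # bracketing delimiters
);

print "@results\n";                   # 1 1 1
```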

How to use a variable as part of a regular expression in PowerShell

I want to Select-String parts of a file path starting at a string value that is contained in a variable. Let me explain this in an abstracted example.
Let's assume this path: /docs/reports/test reports/document1.docx
Using a regular expression I can get the required string like so:
'^.*(?=\/test\s)'
https://regex101.com/r/6mBhLX/5
The resulting string is '/test reports/document1.docx'.
Now, for this to work I have to use the literal string 'test'. However, I would like to know how to use a variable that contains 'test', e.g. $myString.
I already looked at How do you use a variable in a regular expression?, but I couldn't figure out how to adapt this for PowerShell.
I suggest using $([regex]::escape($myString)) inside a double quoted string literal:
$myString="[test]"
$pattern = "^.*(?=/$([regex]::escape($myString))\s)"
Or, in case you do not want to worry with additional escaping, use a regular concatenation using + operator:
$pattern = '^.*(?=/' + [regex]::escape($myString) +'\s)'
The resulting $pattern will look like ^.*(?=/\[test]\s). Since the $myString variable is a literal string, you need to escape all special regex metacharacters (with [regex]::escape()) that may be inside it for the regex engine to interpret it as literal chars.
In your case, you may use
$s = '/docs/reports/test reports/document1.docx'
$myString="test"
$pattern = "^.*(?=/$([regex]::escape($myString))\s)"
$s -replace $pattern
Result: /test reports/document1.docx
Wiktor Stribiżew's helpful answer provides the crucial pointer:
Use [regex]::Escape() in order to escape a string for safe inclusion in a regex (regular expression) so that it is treated as a literal;
e.g., [regex]::Escape('$10?') yields \$10\? - the characters with special meaning to a regex were \-escaped.
However, I suggest using '...', i.e., building the regex from single-quoted aka verbatim strings:
$myString='test'
$regex = '^.*(?=/' + [regex]::escape($myString) + '\s)'
Using the -f operator - $regex = '^.*(?=/{0}\s)' -f [regex]::Escape($myString) - works too and is perhaps visually cleaner, but note that -f - unlike string concatenation with + - is culture-sensitive, which can lead to different results.
Using '...' strings in regex contexts in PowerShell is a good habit to form:
By avoiding "...", so-called expandable strings, you avoid additional up-front interpretation (interpolation a.k.a. expansion) of the string, which can have unexpected effects, given that $ has special meaning in both contexts: the start of a variable reference or subexpression when string-expanding, and the end-of-input marker in regexes.
Using "..." can be especially tricky in the replacement string of the regex-based -replace operator, in whose replacement string operand tokens such as $1 refer to capture-group results, and if you used "$1", PowerShell would try to expand a $1 variable, which presumably doesn't exist, resulting in the empty string.
Just write the variable within double quotes ("pattern"), like this:
PS > $pattern = "^\d+\w+"
PS > "357test*&(fdnsajkfj" -match $pattern # return true
PS > "357test*&(fdnsajkfj" -match "$pattern.*\w+$" # return true
PS > "357test*&(fdnsajkfj" -match "$pattern\w+$" # return false
Please have a try. :)

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
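A minimal sketch of a qr pattern held in a variable, using the input string from the question. The replacement is still written inline here; the second pattern with /i is an added illustration of a modifier travelling with the compiled regex:

```perl
use strict;
use warnings;

my $p = qr/b(.)d/;                    # compiled once, reusable in s/// or m//
(my $out = 'abcdeabCde') =~ s/$p/_$1$1_/g;

my $ci = qr/b(.)d/i;                  # the modifier is part of the pattern object
(my $out_ci = 'aBcDe') =~ s/$ci/_$1_/g;

print "$out\n";                       # a_cc_ea_CC_e
print "$out_ci\n";                    # a_c_e
```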
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
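sub repl {
    my ($r) = @_;
    # split with a capturing group keeps the $N separators in the list;
    # grep drops the empty strings that split can produce (length keeps a literal 0)
    my @terms = grep { length } split /(\$\d)/, $r;
    # quote everything that is not a $N, then join the terms with .
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}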
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
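Putting the pieces together on the data from the question, as a sketch. It assumes the replacement template contains no single quotes or backslashes and uses only $1 to $9, since repl does not escape anything:

```perl
use strict;
use warnings;

# Turn a template such as '_$1$1_' into the Perl expression '_'.$1.$1.'_'
sub repl {
    my ($r) = @_;
    my @terms = grep { length } split /(\$\d)/, $r;
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}

my $p = 'b(.)d';                      # pattern, as read from a data file
my $r = '_$1$1_';                     # replacement template, same source

my $expr = repl($r);                  # the code that the second /e evaluates

(my $result = 'abcdeabCde') =~ s/$p/repl($r)/gee;

print "$expr\n";                      # '_'.$1.$1.'_'
print "$result\n";                    # a_cc_ea_CC_e
```

The first /e calls repl($r) and gets the expression string; the second /e evaluates that string with $1 set by the current match.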
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge.
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings break many patterns because interpolation is done first -- for example, "\w" is just an escaped w (which is unrecognized). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we are then getting a true regex. Then all modifiers may be used as well.

Regex to find text between second and third slashes

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.
How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, you'd want:
/.*/(.*?)/
/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]
The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
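A small comparison of the approaches suggested in this thread, as a sketch. The second input string with an extra path segment is made up here to show where the greedy pattern goes wrong:

```perl
use strict;
use warnings;

my $text = '/ipaddress/databasename/';

my ($by_class) = $text =~ m!/[^/]*/([^/]*)/!;    # negated character class
my ($by_lazy)  = $text =~ m!/.*?/(.*?)/!;        # non-greedy quantifiers
my $by_split   = (split '/', $text)[2];          # field 0 is the empty string before the first slash

# With an extra segment the greedy form drifts to the last segment.
my $longer      = '/ipaddress/databasename/extra/';
my ($greedy)    = $longer =~ m!/.*/(.*)/!;
my ($class_too) = $longer =~ m!/[^/]*/([^/]*)/!;

print "$by_class $by_lazy $by_split\n";          # databasename databasename databasename
print "$greedy $class_too\n";                    # extra databasename
```

The negated character class is the most predictable choice because it cannot cross a slash, no matter how many segments follow.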
You can even more shorten the pattern by going this way:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority, since I have seen it perform better than regex functions wherever it is possible to use one.
You can use the explode function in PHP, or split in other languages, to do such an operation.
Anyway, here is a regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/
I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by delimiters (in this case a forward slash), then choose the part you need (in this case, the 3rd field - the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.