Decoding a regular expression in Perl - regex

I am trying to decode the following Perl regular expression:
$fname =~ /\/([^\/]+)\.txt$/
What are we trying to match for here?

Here's how you break it down.
\/ - the literal character /
(...) - followed by a group that will be captured to $1
[ ... ] - a character class
^ - in a character class, this means take the inversion of the specified set
\/ - the literal character /
+ - one or more times
\. - the literal character .
txt - the literal string txt
$ - the end of the string
So, in other words, this matches a / followed by one or more characters that are not /, followed by .txt at the end of the string, and puts the part between that / and .txt into $1.
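The breakdown above can also be written as a commented pattern. This is a minimal Python sketch rather than the original Perl: Python's re.VERBOSE flag stands in for Perl's /x modifier, and the names FNAME_RE and capture_basename as well as the sample paths are made up for illustration.

```python
import re

# Verbose form of /\/([^\/]+)\.txt$/; whitespace and comments are ignored.
FNAME_RE = re.compile(r"""
    /           # the literal character /
    ( [^/]+ )   # group 1: one or more characters that are not /
    \.          # the literal character .
    txt         # the literal string txt
    $           # the end of the string
""", re.VERBOSE)


def capture_basename(fname):
    """Return what Perl would leave in $1, or None when there is no match."""
    m = FNAME_RE.search(fname)
    return m.group(1) if m else None


print(capture_basename("/var/log/report.txt"))  # report
print(capture_basename("report.txt"))           # None (no / before the name)
```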

\/([^\/]+)\.txt
This regular expression matches a file name as it appears in a path,
capturing it minus the .txt extension, and
only when the name is preceded by a forward slash (the match starts at that slash).
Examples:
\folder\path\file.txt
Nothing is matched.
folder/path/file.txt
/file.txt is matched (and file is placed in capture group 1: $1).
/folder/path/file.txt
Again, /file.txt is matched (and file captured).
You can try it yourself at Debuggex
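If you would rather check the three examples locally, here is a small Python sketch of the same pattern. The helper name match_parts is hypothetical; it returns the whole match together with capture group 1, which shows that the whole match includes the leading slash while the group does not.

```python
import re

PATTERN = re.compile(r"/([^/]+)\.txt")


def match_parts(path):
    """Return (whole match, group 1), or None when nothing matches."""
    m = PATTERN.search(path)
    return (m.group(0), m.group(1)) if m else None


for path in [r"\folder\path\file.txt", "folder/path/file.txt", "/folder/path/file.txt"]:
    print(path, "->", match_parts(path))
```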

Related

Replace characters within a specific string

I have a text file with URLs where space is + and it needs to be %20 to work.
For example:
http://myserver/abc/this+is+my+document.doc
I want it to be:
http://myserver/abc/this%20is%20my%20document.doc
How can I replace + with %20, but only when the string starts with http://myserver/abc? I don't want to replace any other +'s in the document.
Thanks in advance!
You can use the following regex:
(?:http://myserver/abc|\G(?!\A))[^\s+]*\K\+
Replace with %20
How does the regex work?
(?:http://myserver/abc|\G(?!\A)) matches either http://myserver/abc literally, or the previously matched location (\G is previously matched location or start of the string and (?!\A) prevents \G from matching the start of the string)
[^\s+]* matches any character except whitespace and + (literally) any number of times
\K resets the match. Any previously consumed characters are excluded from the final match
\+ match this character literally
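Note that \G and \K are PCRE features; Python's built-in re module supports neither. As an alternative technique (not the regex above), a Python sketch can first match each whole URL that starts with the prefix and then replace the + characters inside that match with a callback. The names PREFIX, URL_RE and encode_plus are illustrative, and the sketch assumes a URL ends at the next whitespace character.

```python
import re

PREFIX = "http://myserver/abc"
# A URL that starts with the prefix and runs up to the next whitespace.
URL_RE = re.compile(re.escape(PREFIX) + r"\S*")


def encode_plus(text):
    """Replace + with %20 only inside URLs that start with PREFIX."""
    return URL_RE.sub(lambda m: m.group(0).replace("+", "%20"), text)


sample = "a+b http://myserver/abc/this+is+my+document.doc c+d"
print(encode_plus(sample))
# a+b http://myserver/abc/this%20is%20my%20document.doc c+d
```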

Powershell regex for string between two special characters

I have a file name as below:
$inpFiledev = "abc_XYZ.bak"
I need only XYZ in a variable so that I can compare it with another file name.
I tried the following:
[String]$findev = [regex]::match($inpFiledev ,'_*.').Value
Write-Host $findev
Asterisks in regex don't behave in the same way as they do in filesystem listing commands. As it stands your regex is looking for underscore, repeated zero or more times, followed by any character (represented in regex by a period). So the regex finds zero underscores right at the start of the string, then it finds 'a', and that's the match it returns.
First, correct that bit:
'_*.'
Becomes "underscore, followed by any number of characters, followed by a literal period". The 'literal period' means we need to escape the period in the regex, by using \., remembering that period means any character:
'_.*\.'
_ underscore
.* any number of characters
\. a literal period
That returns:
_XYZ.
So, not far off.
If you're looking to return something from between characters, you'll need to use capturing groups. Put parentheses around the bit you want to keep:
'_(.*)\.'
Then you'll need to use PowerShell regex groups to get the value:
[regex]::match($inpFiledev ,'_(.*)\.').Groups[1].Value
Which returns: XYZ
The number 1 in Groups[1] just means the first capturing group. You can add as many groups as you like to the expression by using more parentheses, but you only need one in this case.
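The three stages of that fix behave the same way in other regex engines. Here is a rough Python equivalent for comparison (Python's re module instead of .NET's [regex]; the variable names are illustrative):

```python
import re

inp_file_dev = "abc_XYZ.bak"

# '_*.'  : zero underscores, then any one character
first_try = re.search(r"_*.", inp_file_dev).group(0)
# '_.*\.' : underscore, any characters, then a literal period
second_try = re.search(r"_.*\.", inp_file_dev).group(0)
# '_(.*)\.' : same match, but keep only the parenthesised part
captured = re.search(r"_(.*)\.", inp_file_dev).group(1)

print(first_try)   # a
print(second_try)  # _XYZ.
print(captured)    # XYZ
```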
To complement mjsqu's helpful answer with two PowerShell-idiomatic alternatives:
For an overview of how regexes (regular expressions) are used in PowerShell, see Get-Help about_regular_expressions.
Using -split to split by _ and ., extracting the resulting 3-element array's middle element:
PS> ("abc_XYZ.bak" -split '[_.]')[1]
XYZ
-split's (first) RHS operand is a regex; regex [_.] is a character set ([...]) that matches a single char. that is either a literal _ or a literal . Therefore, input abc_XYZ.bak is broken into an array containing the strings abc, XYZ, and bak. Applying index [1] therefore extracts the middle token, XYZ.
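For readers more at home in Python, the same splitting idea looks roughly like this with re.split (a sketch for comparison only, not PowerShell):

```python
import re

# Split on every single character that is either _ or .
parts = re.split(r"[_.]", "abc_XYZ.bak")

print(parts)     # ['abc', 'XYZ', 'bak']
print(parts[1])  # XYZ
```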
Using -replace to extract the token of interest via a capture group ((...), referred to in the replacement operand as $1):
PS> "abc_XYZ.bak" -replace '^.+_([^.]+).+$', '$1'
XYZ
-replace too operates on a regex as the first RHS operand - what to replace - whereas the second operand specifies what to replace the matched (sub)string with.
Regex ^.+_([^.]+).+$:
^.+_ matches one or more (+) characters (.) at the start of the input (^) - note how . - used outside of a character set ([...]) - is a regex metacharacter that represents any character (in a single-line input string).
([^.]+) is a capture group ((...)) that matches a negated character set ([^...]): [^.] matches any literal char. that isn't a literal ., one or more times (+).
Whatever matched the sub-expression inside (...) can be referenced in the replacement operand as $<n>, where <n> represents the 1-based index of the capture group in the regex; in this case, $1 can be used to refer to this first (and only) capture group.
.+$ matches one or more (+) remaining characters (.) until the end of the input is reached ($).
Replacement operand $1 simply refers to what the first capture group matched; in this case: XYZ.
For a comprehensive overview of the syntax of -replace replacement operands, see this answer.
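The replace-with-capture-group technique carries over to other languages as well. In this Python sketch, re.sub plays the role of -replace and \1 the role of $1; the second call uses a made-up file name to show that, as with -replace, input that does not match is returned unchanged.

```python
import re

PATTERN = r"^.+_([^.]+).+$"

extracted = re.sub(PATTERN, r"\1", "abc_XYZ.bak")
unchanged = re.sub(PATTERN, r"\1", "nounderscore.bak")

print(extracted)  # XYZ
print(unchanged)  # nounderscore.bak
```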
Because you're using the [regex] accelerator, you need the backslash to escape your end . (if you want to match it), and you need a dot before your asterisk to match any characters after your underscore. If the characters in between are all letters, then use \w+
$findev = [regex]::match($inpFiledev ,'_.*\.')
$findev
_XYZ.
this demos two other ways to get the desired info from the sample string. the 1st uses the basic .Split() string method on the raw string. the 2nd presumes you are dealing with file objects and starts off by getting the .BaseName for the file. that already removes the extension, so you need not bother doing it yourself.
if you are dealing with a large number of strings, and not file objects, then the previous regex answers will likely be faster. [grin]
$inpFiledev = 'abc_XYZ.bak'
$findev = $inpFiledev.Split('.')[0].Split('_')[-1]
# fake reading in a file with Get-Item or Get-ChildItem
$File = [System.IO.FileInfo]'c:\temp\testing\abc_XYZ.bak'
$WantedPart = $File.BaseName.Split('_')[-1]
'split on a string = {0}' -f $findev
'split on BaseName of file = {0}' -f $WantedPart
output ...
split on a string = XYZ
split on BaseName of file = XYZ
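The same two approaches, plain string splitting and splitting the extension-free file name, can be sketched in Python. Here PureWindowsPath is assumed as a stand-in for [System.IO.FileInfo] (it only parses the path and never touches the file system), and .stem is the counterpart of .BaseName.

```python
from pathlib import PureWindowsPath

inp_file_dev = "abc_XYZ.bak"
# Keep what precedes the first '.', then take the last '_'-separated field.
from_string = inp_file_dev.split(".")[0].split("_")[-1]

# .stem is the file name minus its final extension.
file_path = PureWindowsPath(r"c:\temp\testing\abc_XYZ.bak")
from_stem = file_path.stem.split("_")[-1]

print(f"split on a string = {from_string}")
print(f"split on stem of file = {from_stem}")
```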

regex for first instance of a specific character that DOESN'T come immediately after another specific character

I have a function, translate(), that takes multiple parameters. The first param is the only required one and is a string that I always wrap in single quotes, like this:
translate('hello world');
The other params are optional, but could be included like this:
translate('hello world', true, 1, 'foobar', 'etc');
And the string itself could contain escaped single quotes, like this:
translate('hello\'s world');
To the point, I now want to search through all code files for all instances of this function call, and extract just the string. To do so I've come up with the following grep, which returns everything between translate(' and either ') or ',. Almost perfect:
grep -RoPh "(?<=translate\(').*?(?='\)|'\,)" .
The problem with this though, is that if the call is something like this:
translate('hello \'world\', you\'re great!');
My grep would only return this:
hello \'world\
So I'm looking to modify this so that the part that currently looks for ') or ', instead looks for the first occurrence of ' that hasn't been escaped, i.e. doesn't immediately follow a \
Hopefully I'm making sense. Any suggestions please?
You can use this grep with PCRE regex:
grep -RoPh "\btranslate\(\s*\K'(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*'" .
RegEx Breakup:
\b # word boundary
translate # match literal translate
\( # match a (
\s* # match 0 or more whitespace
\K # reset the matched information
' # match starting single quote
(?: # start non-capturing group
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
) # end non-capturing group
(?: # start non-capturing group
\\\\. # match a backslash followed by char that is "escaped"
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
)* # end non-capturing group
' # match ending single quote
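To experiment with this pattern outside of grep, here is a Python sketch. Python's re module has no \K, so the sketch wraps the string body in a capture group instead, which also drops the surrounding quotes; the name TRANSLATE_RE and the sample source text are assumptions made for illustration. Because the pattern lives in a raw Python string rather than a double-quoted shell string, each backslash is written twice instead of four times.

```python
import re

# Unrolled form: quote, non-special*, (escaped-char non-special*)*, quote.
# Group 1 holds the string body without the surrounding quotes.
TRANSLATE_RE = re.compile(r"\btranslate\(\s*'((?:[^'\\]*)(?:\\.[^'\\]*)*)'")

source = r"""
translate('hello world');
translate('hello world', true, 1, 'foobar', 'etc');
translate('hello \'world\', you\'re great!');
"""

strings = TRANSLATE_RE.findall(source)
for s in strings:
    print(s)
```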
Here is a version without \K using look-arounds:
grep -oPhR "(?<=\btranslate\(')(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*(?=')" .
I think the problem is the .*? part: the ? makes it a non-greedy pattern, meaning it'll take the shortest string that matches the pattern. In effect, you're saying, "give me the shortest string that's followed by quote+close-paren or quote+comma". In your example, "world\" is followed by a single quote and a comma, so it matches your pattern.
In these cases, I like to use something like the following reasoning:
A string is a quote, zero or more characters, and a quote: '.*'
A character is anything that isn't a quote (because a quote terminates the string): '[^']*'
Except that you can put a quote in a string by escaping it with a backslash, so a character is either "backslash followed by a quote" or, failing that, "not a quote": '(\\'|[^'])*'
Put it all together, doubling each backslash because the shell's double quotes consume one level of escaping, and you get
grep -RoPh "(?<=translate\(')(\\\\'|[^'])*(?='\)|'\,)" .
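The difference between the lazy pattern from the question and the escape-aware one can be checked side by side. This is a Python sketch using the question's sample call; with a raw string there is no shell quoting layer, so the backslash in the pattern is written as \\ only. The variable names are illustrative.

```python
import re

call = r"translate('hello \'world\', you\'re great!');"

# Lazy version: stops at the first quote followed by ) or ,
lazy = re.search(r"(?<=translate\(').*?(?='\)|',)", call).group(0)
# Escape-aware version: a character is either \' or anything that is not '
escape_aware = re.search(r"(?<=translate\(')(?:\\'|[^'])*(?='\)|',)", call).group(0)

print(lazy)          # hello \'world\
print(escape_aware)  # hello \'world\', you\'re great!
```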

How to regexp match surrounding whitespace or beginning/end of line

I am trying to find lines in a file that contain a / (slash) character which is not part of a word, like this:
grep "\</\>" file
But no luck, even if the file contains the "/" alone, grep does not find it.
I want to be able to match lines such as
some text / pictures
/ text
text /
but not e.g.
/home
Why your approach does not work
\<, \> only match against the beginning (or end, respectively) of a word. That means that they can never match if put adjacent to / (which is not treated as a word-character) – because e.g. \</ basically says "match the beginning of a word directly followed by something other than a word (a 'slash', in this case)", which is impossible.
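The impossibility can be demonstrated outside of grep. Python has no \< or \> anchors, so this sketch emulates them with lookarounds; that emulation is an assumption of the example, not something grep does internally. Because the start-of-word anchor demands a word character at the very position where the pattern then demands a /, no input can match.

```python
import re

# Lookaround emulation of the GNU anchors (an assumption of this sketch):
#   \<  ~  (?<!\w)(?=\w)   start of a word
#   \>  ~  (?<=\w)(?!\w)   end of a word
START_OF_WORD = r"(?<!\w)(?=\w)"
END_OF_WORD = r"(?<=\w)(?!\w)"
pattern = re.compile(START_OF_WORD + "/" + END_OF_WORD)

samples = ["some text / pictures", "/ text", "text /", "/home", "a/b"]
hits = [s for s in samples if pattern.search(s)]
print(hits)  # []
```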
What will work
This will match / surrounded by whitespace (\s) or beginning/end of line:
egrep '(^|\s)/($|\s)' file
(egrep implies the -E option, which turns on processing of extended regular expressions.)
What might also work
The following slightly simpler expression will work if a / is never adjacent to non-word characters (such as *, #, -, and characters outside the ASCII range); it might be of limited usefulness in OP's case:
grep '\B/\B' file
for str in 'some text / pictures' ' /home ' '/ text' ' text /'; do
echo "$str" | egrep '(^|\s)/($|\s)'
done
This will match /:
if the entire input string is /
if the input string starts with / and is followed by at least 1 whitespace
if the input string ends with / and is preceded by at least 1 whitespace
if / is inside the input string surrounded by at least 1 whitespace on either side.
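The four cases listed above can be verified with a short Python sketch that uses the same expression; the sample lines are taken from the question, plus a bare / and a/b added for illustration.

```python
import re

LONE_SLASH = re.compile(r"(^|\s)/($|\s)")

lines = ["some text / pictures", "/ text", "text /", "/", "/home", "a/b"]
matching = [line for line in lines if LONE_SLASH.search(line)]
print(matching)  # ['some text / pictures', '/ text', 'text /', '/']
```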
As for why grep "\</\>" file did not work:
\< and \> match the left/right boundaries between words and non-words. However, / does not qualify as a word, because words are defined as a sequence of one or more instances of characters from the set [[:alnum:]_], i.e.: sequences of at least length 1 composed entirely of letters, digits, or _.
This seems to work for me.
grep -rni " / \| /\|/ " .

Regular Expression match lines starting with a certain character OR whitespace and then that character

I am trying to write a regular expression that matches lines beginning with a hyphen (-) OR that begins with spaces or tabs and then has a hyphen. So it should match the following:
- hello!
    - hello!
Here's what I've got so far: ^(\-). But that doesn't match the second example above because it requires the first character to be a hyphen.
You can try
^\s*-
^: start of string
\s*: zero or more whitespace characters
-: a literal - (you don't need to escape this outside a character class)
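As a quick check, here is a Python sketch that filters a few made-up lines with this pattern. Since re.match already anchors at the start of the string, the ^ is redundant there but harmless.

```python
import re

LEADING_HYPHEN = re.compile(r"^\s*-")

lines = ["- hello!", "    - hello!", "\t- hello!", "+ hello!", "hello - there"]
matching = [line for line in lines if LEADING_HYPHEN.match(line)]
print(matching)  # ['- hello!', '    - hello!', '\t- hello!']
```

Note that \s also matches newlines and other whitespace; if only spaces and tabs should be allowed, [ \t]* can be used in place of \s*.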
You can use this regex, which allows zero or more whitespace characters before the hyphen:
^\s*-
the above (using \s*) is the easiest one for this case, but in general, you can always use the | syntax:
>>> import re
>>> re.match(r'^-|^\s+-', '- hello')
<_sre.SRE_Match object at 0x0000000054E72030>
>>> re.match(r'^-|^\s+-', ' - hello')
<_sre.SRE_Match object at 0x0000000054E72030>
>>> print(re.match(r'^-|^\s+-', ' + hello'))
None
^- handles a - at the very beginning, ^\s+- handles one or more whitespace characters followed by a -, and | chooses either one.