perl regular expression explanations - regex

I am completely lost on this line of perl code
$path =~ s|^\./|~/|; #change the path for prettier output
I am assuming it has to do with regex. I have some understanding of regex but i just cant seem to figure this one out.
what is =~ and why is there s and how does regex expressed in perl?

=~ is a binding operator. It applies the substitution (hence the s) to the variable $path. The substitution has two parts - a regular expression and the replacement. They are delimited by the | character in this case. The regular expression is
^\./
^ stands for the beginning of the string. \. stands for a literal dot, / stands for itself. So, ./ at the beginning of the string is replaced by ~/.

the =~ binds a scalar expression to a pattern match, the s is for replacement
what its doing is matching start of line with a ./ then replacing it with a ~/
as far as the | pipes, you can use any non-whitespace character to delimit parts of the regex you can use ^ or & or q or m or { whatever.. most people use / for readability but for cases where you might match on / use something else.
Hope this helps.

Related

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just messed up slightly. any file I currently run through it evaluates to "yes" but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me as you can probably tell..
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.
Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression matches only the end of the string ($). You likely want to match the whole string. To match the entire string, use the ^ anchor to match the beginning of the line.
Escape the command line arguments
If you still want to use egrep, you should escape its arguments as you should escape any command line arguments to avoid reinterpretation of special characters, or rather wrap the argument in single, or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo, as its behavior is considered unreliable. Use printf instead:
printf '%s\n' "$1"
Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$.
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more character
The problem in your regex seems to be at the end with \.*, which matches a period \. any number of times, see here. Also the [A-Z0-9-] will only match uppercase letters, might not be what you wanted.

How do I extract a string using a regex in a shell script?

I want to extract part of a string using a regular expression. For example, how do I extract the domain name from the $name variable?
name='here'
domain_name=... # apply some regex on $name
Using bash regular expressions:
re="http://([^/]+)/"
if [[ $name =~ $re ]]; then echo ${BASH_REMATCH[1]}; fi
Edit - OP asked for explanation of syntax. Regular expression syntax is a large topic which I can't explain in full here, but I will attempt to explain enough to understand the example.
re="http://([^/]+)/"
This is the regular expression stored in a bash variable, re - i.e. what you want your input string to match, and hopefully extract a substring. Breaking it down:
http:// is just a string - the input string must contain this substring for the regular expression to match
[] Normally square brackets are used say "match any character within the brackets". So c[ao]t would match both "cat" and "cot". The ^ character within the [] modifies this to say "match any character except those within the square brackets. So in this case [^/] will match any character apart from "/".
The square bracket expression will only match one character. Adding a + to the end of it says "match 1 or more of the preceding sub-expression". So [^/]+ matches 1 or more of the set of all characters, excluding "/".
Putting () parentheses around a subexpression says that you want to save whatever matched that subexpression for later processing. If the language you are using supports this, it will provide some mechanism to retrieve these submatches. For bash, it is the BASH_REMATCH array.
Finally we do an exact match on "/" to make sure we match all the way to end of the fully qualified domain name and the following "/"
Next, we have to test the input string against the regular expression to see if it matches. We can use a bash conditional to do that:
if [[ $name =~ $re ]]; then
echo ${BASH_REMATCH[1]}
fi
In bash, the [[ ]] specify an extended conditional test, and may contain the =~ bash regular expression operator. In this case we test whether the input string $name matches the regular expression $re. If it does match, then due to the construction of the regular expression, we are guaranteed that we will have a submatch (from the parentheses ()), and we can access it using the BASH_REMATCH array:
Element 0 of this array ${BASH_REMATCH[0]} will be the entire string matched by the regular expression, i.e. "http://www.google.com/".
Subsequent elements of this array will be subsequent results of submatches. Note you can have multiple submatch () within a regular expression - The BASH_REMATCH elements will correspond to these in order. So in this case ${BASH_REMATCH[1]} will contain "www.google.com", which I think is the string you want.
Note that the contents of the BASH_REMATCH array only apply to the last time the regular expression =~ operator was used. So if you go on to do more regular expression matches, you must save the contents you need from this array each time.
This may seem like a lengthy description, but I have really glossed over several of the intricacies of regular expressions. They can be quite powerful, and I believe with decent performance, but the regular expression syntax is complex. Also regular expression implementations vary, so different languages will support different features and may have subtle differences in syntax. In particular escaping of characters within a regular expression can be a thorny issue, especially when those characters would have an otherwise different meaning in the given language.
Note that instead of setting the $re variable on a separate line and referring to this variable in the condition, you can put the regular expression directly into the condition. However in bash 3.2, the rules were changed regarding whether quotes around such literal regular expressions are required or not. Putting the regular expression in a separate variable is a straightforward way around this, so that the condition works as expected in all bash versions that support the =~ match operator.
One way would be with sed. For example:
echo $name | sed -e 's?http://www\.??'
Normally the sed regular expressions are delimited by `/', but you can use '?' since you're searching for '/'. Here's another bash trick. #DigitalTrauma's answer reminded me that I ought to suggest it. It's similar:
echo ${name#http://www.}
(DigitalTrauma also gets credit for reminding me that the "http://" needs to be handled.)

Seperate backreference followed by numeric literal in perl regex

I found this related question : In perl, backreference in replacement text followed by numerical literal
but it seems entirely different.
I have a regex like this one
s/([^0-9])([xy])/\1 1\2/g
^
whitespace here
But that whitespace comes up in the substitution.
How do I not get the whitespace in the substituted string without having perl confuse the backreference to \11?
For eg.
15+x+y changes to 15+ 1x+ 1y.
I want to get 15+1x+1y.
\1 is a regex atom that matches what the first capture captured. It makes no sense to use it in a replacement expression. You want $1.
$ perl -we'$_="abc"; s/(a)/\1/'
\1 better written as $1 at -e line 1.
In a string literal (including the replacement expression of a substitution), you can delimit $var using curlies: ${var}. That means you want the following:
s/([^0-9])([xy])/${1}1$2/g
The following is more efficient (although gives a different answer for xxx):
s/[^0-9]\K(?=[xy])/1/g
Just put braces around the number:
s/([^0-9])([xy])/${1}1${2}/g

Replace specific capture group instead of entire regex in Perl

I've got a regular expression with capture groups that matches what I want in a broader context. I then take capture group $1 and use it for my needs. That's easy.
But how to use capture groups with s/// when I just want to replace the content of $1, not the entire regex, with my replacement?
For instance, if I do:
$str =~ s/prefix (something) suffix/42/
prefix and suffix are removed. Instead, I would like something to be replaced by 42, while keeping prefix and suffix intact.
As I understand, you can use look-ahead or look-behind that don't consume characters. Or save data in groups and only remove what you are looking for. Examples:
With look-ahead:
s/your_text(?=ahead_text)//;
Grouping data:
s/(your_text)(ahead_text)/$2/;
If you only need to replace one capture then using #LAST_MATCH_START and #LAST_MATCH_END (with use English; see perldoc perlvar) together with substr might be a viable choice:
use English qw(-no_match_vars);
$your_string =~ m/aaa (bbb) ccc/;
substr $your_string, $LAST_MATCH_START[1], $LAST_MATCH_END[1] - $LAST_MATCH_START[1], "new content";
# replaces "bbb" with "new content"
This is an old question but I found the below easier for replacing lines that start with >something to >something_else. Good for changing the headers for fasta sequences
while ($filelines=~ />(.*)\s/g){
unless ($1 =~ /else/i){
$filelines =~ s/($1)/$1\_else/;
}
}
I use something like this:
s/(?<=prefix)(group)(?=suffix)/$1 =~ s|text|rep|gr/e;
Example:
In the following text I want to normalize the whitespace but only after ::=:
some text := a b c d e ;
Which can be achieved with:
s/(?<=::=)(.*)/$1 =~ s|\s+| |gr/e
Results with:
some text := a b c d e ;
Explanation:
(?<=::=): Look-behind assertion to match ::=
(.*): Everything after ::=
$1 =~ s|\s+| |gr: With the captured group normalize whitespace. Note the r modifier which makes sure not to attempt to modify $1 which is read-only. Use a different sub delimiter (|) to not terminate the replacement expression.
/e: Treat the replacement text as a perl expression.
Use lookaround assertions. Quoting the documentation:
Lookaround assertions are zero-width patterns which match a specific pattern without including it in $&. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Lookbehind matches text up to the current match position, lookahead matches text following the current match position.
If the beginning of the string has a fixed length, you can thus do:
s/(?<=prefix)(your capture)(?=suffix)/$1/
However, ?<= does not work for variable length patterns (starting from Perl 5.30, it accepts variable length patterns whose length is smaller than 255 characters, which enables the use of |, but still prevents the use of *). The work-around is to use \K instead of (?<=):
s/.*prefix\K(your capture)(?=suffix)/$1/

Vim regex backreference

I want to do this:
%s/shop_(*)/shop_\1 wp_\1/
Why doesn't shop_(*) match anything?
There's several issues here.
parens in vim regexen are not for capturing -- you need to use \( \) for captures.
* doesn't mean what you think. It means "0 or more of the previous", so your regex means "a string that contains shop_ followed by 0+ ( and then a literal ). You're looking for ., which in regex means "any character". Put together with a star as .* it means "0 or more of any character". You probably want at least one character, so use .\+ (+ means "1 or more of the previous")
Use this: %s/shop_\(.\+\)/shop_\1 wp_\1/.
Optionally end it with g after the final slash to replace for all instances on one line rather than just the first.
If I understand correctly, you want %s/shop_\(.*\)/shop_\1 wp_\1/
Escape the capturing parenthesis and use .* to match any number of any character.
(Your search is searching for "shop_" followed by any number of opening parentheses followed by a closing parenthesis)
If you would like to avoid having to escape the capture parentheses and make the regex pattern syntax closer to other implementations (e.g. PCRE), add \v (very magic!) at the start of your pattern (see :help \magic for more info):
:%s/\vshop_(*)/shop_\1 wp_\1/
#Luc if you look here: regex-info, you'll see that vim is behaving correctly. Here's a parallel from sed:
echo "123abc456" | sed 's#^([0-9]*)([abc]*)([456]*)#\3\2\1#'
sed: -e expression #1, char 35: invalid reference \3 on 's' command's RHS
whereas with the "escaped" parentheses, it works:
echo "123abc456" | sed 's#^\([0-9]*\)\([abc]*\)\([456]*\)#\3\2\1#'
456abc123
I hate to see vim maligned - especially when it's behaving correctly.
PS I tried to add this as a comment, but just couldn't get the formatting right.