Why does perl ignore extra characters in my regex?

I have this line in bash:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
and get:
a=-1
Wait... I didn't tell perl to match the -. I made a minor change:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*$/){print;}'
and it correctly ignores the line. Why?

You might find it useful, while developing a regex, to print only the part of the string that the regex actually matched, instead of the entire line. This will give you better insight into what your regex is doing. You can do this with the special $& variable. So instead of:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
use
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print $&;}'
You will now get different output:
a=
And this new information may give you a head start in understanding how your regex is [mis]behaving with regard to the input data.

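[0-9]* can match the empty string, so the unanchored regex succeeds by matching a= followed by zero digits; the -1 is simply left outside the match. When you anchored to the end of the string with $, the digits had to run all the way to the end, and the - prevents that, so the empty match no longer suffices.
You probably want to say [0-9]+ to mean "at least one digit".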

In the first example, you say that [0-9] can occur "*" times, which means zero or more, so the regex matches only the "a=". When you add the "$", it no longer matches, because the string does not end right after the optional digits that follow the "=".
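The same behaviour can be reproduced without perl. The following is a minimal sketch, assuming a grep that supports the -o option (GNU and BSD grep do); show_match is a hypothetical helper name, not something from the original posts. It prints only the matched text, much like $& does in perl.

```shell
# show_match PATTERN STRING: print only the text the pattern matched.
# (show_match is a hypothetical helper; grep -o is assumed to be available.)
show_match() {
    printf '%s\n' "$2" | grep -oE -- "$1"
}

show_match '.*=[0-9]*'  'a=-1'                      # prints: a=  (zero digits matched)
show_match '.*=[0-9]*$' 'a=-1' || echo 'no match'   # "-1" blocks the end anchor
show_match '.*=[0-9]+'  'a=-1' || echo 'no match'   # at least one digit is required
show_match '.*=[0-9]+'  'a=1'                       # prints: a=1
```

Seeing a= rather than the whole line makes it clear that the -1 was never part of the match.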

Related

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just messed up slightly. Any file I currently run through it evaluates to "yes", but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.
Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so it can match a suffix of the input. You likely want to match the whole string; to do that, also anchor the beginning with ^.
Escape the command line arguments
If you still want to use egrep, quote its pattern argument, as you should quote any command-line argument that contains characters special to the shell. Wrap the argument in single or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo for arbitrary data, as its behavior is unreliable: depending on the shell, it may interpret arguments such as -n or backslash escapes. Use printf instead:
printf '%s\n' "$1"
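Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$. Note that (?:...) is Perl-style syntax (available, for example, with GNU grep -P); POSIX extended regular expressions, as used by egrep, have no non-capturing groups, so write (_[0-9]+){2} there.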
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end with \.*, which matches a period \. any number of times, including zero. Also, [A-Z0-9-] matches only uppercase letters, which might not be what you wanted.
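Putting the answers above together, here is a minimal sketch of the check as a shell function, assuming POSIX extended regular expressions via grep -E; is_valid_name and the sample file names are hypothetical and only serve as illustration.

```shell
# is_valid_name NAME: succeed if NAME looks like Name_Major_Minor.extension.
# (is_valid_name is a hypothetical helper name.)
is_valid_name() {
    # The pattern is single-quoted so the shell passes it to grep unchanged.
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+(_[0-9]+){2}\..+$'
}

for name in 'my-app_1_2.tar' 'my-app_1.tar' 'my-app_1_2'; do
    if is_valid_name "$name"; then
        echo "yes: $name"
    else
        echo "nope: $name"
    fi
done
```

Only the first name passes: the second lacks a Minor field and the third lacks an extension.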

Unable to figure out regex bash or sed or awk

I wanted to split the following jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried the following sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter -?
Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation: the engine tries to match the literal - first. That first succeeds at the first - in the string. Then [^-]* matches as many non-hyphen characters as possible, so it consumes 1.6.0_30 (because the next character is a hyphen). Now the engine tries to match $, but that fails because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine abandons the attempt at the first - and continues through the string. It then matches the literal - against the second -, and [^-]* consumes fcs.x86_64. Now we are actually at the end of the string and $ matches, so the full match (which will be removed) is -fcs.x86_64.
Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2
Try this:
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.
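The original wish to "start from the end" and cut at the last - can also be sketched with parameter expansion alone, without hard-coding fcs. This is a minimal sketch, assuming a POSIX shell; strip_last_hyphen_part is a hypothetical helper name.

```shell
# strip_last_hyphen_part STRING: drop the last "-" and everything after it.
# (strip_last_hyphen_part is a hypothetical helper name.)
strip_last_hyphen_part() {
    # "%" removes the shortest suffix matching the glob "-*",
    # which begins at the last hyphen in the string.
    printf '%s\n' "${1%-*}"
}

pkg='jdk-1.6.0_30-fcs.x86_64'
strip_last_hyphen_part "$pkg"               # prints: jdk-1.6.0_30
printf '%s\n' "$pkg" | sed 's/-[^-]*$//'    # same result via the anchored regex
```

A string without any hyphen is returned unchanged by both variants.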

Perl regex substitute last occurrence

I have this input:
AB2.HYNN.KABCDSEG.L000.G0001V00
AB2.HYNN.GABCDSEG.L000.G0005V00
I would like to remove the trailing part of the form .GXXXXVXX from each string.
When I use this code:
$result =~ s/\.G.*V.*$//g;
print "$result \n";
The result is :
AB2.HYNN.KABCDSEG.L000
AB2.HYNN
It seems that as soon as the regex finds ".G", it removes everything from there on.
I don't understand why.
I would like to have this:
AB2.HYNN.KABCDSEG.L000
AB2.HYNN.GABCDSEG.L000
How can I do this with a regex?
Update:
After talking in the comments, the final solution was:
s/\.G\w+V\w+$//;
In your regex:
s/\.G.*V.*$//g;
those .* are greedy and will match as much as possible. The only requirement is that there must be a V somewhere after the .G, so the substitution truncates the string from the first .G it finds, as long as a V follows it. In your second line that first .G is the one in .GABCDSEG, which is why the result is AB2.HYNN. With \w+ instead of .*, the match cannot cross a dot, so only the final .G0005V00 field is removed. There is no need for the /g modifier here, because any match that occurs deletes the rest of the string (unless the string contains newlines, since . does not match newlines without the /s modifier).
$result =~ s/\.G\d+V\d+//g;
Works on given input.
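The difference between the greedy pattern and the restricted one is not specific to Perl. The following minimal sketch reproduces it with sed -E instead of perl (a swapped-in tool, not what the posts above use); strip_greedy and strip_field are hypothetical helper names.

```shell
# Two sed substitutions applied to the question's sample input.
# (strip_greedy and strip_field are hypothetical helper names.)
strip_greedy() {
    # ".*" may cross dots, so the match starts at the first ".G".
    printf '%s\n' "$1" | sed -E 's/\.G.*V.*$//'
}
strip_field() {
    # "[0-9]+" cannot cross a dot, so only a trailing .G<digits>V<digits> goes.
    printf '%s\n' "$1" | sed -E 's/\.G[0-9]+V[0-9]+$//'
}

line='AB2.HYNN.GABCDSEG.L000.G0005V00'
strip_greedy "$line"   # prints: AB2.HYNN
strip_field  "$line"   # prints: AB2.HYNN.GABCDSEG.L000
```

The first sample line, AB2.HYNN.KABCDSEG.L000.G0001V00, comes out the same under both functions because its only .G is the one in the last field.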

Is this line of Perl meaningless? s/^(\d+)\b/$1/sg

Does this line of Perl really do anything?
$variable =~ s/^(\d+)\b/$1/sg;
The only thing I can think of is that $1 or $& might be re-used, but the line is immediately followed by:
$variable =~ s/\D//sg;
With these two lines together, is the first line meaningless and removable? It seems like it would be, but I have seen it multiple times in this old program, and wanted to make sure.
$variable =~ s/^(\d+)\b/$1/sg;
The anchor ^ at the beginning makes the /g modifier useless.
The lack of the wildcard character . in the regex makes the /s modifier useless, since it serves to make . also match newline.
Since \b and ^ are zero-width assertions, and the only things outside the capture group, this substitution will not change the variable at all.
The only thing this regex does is capture the digits into $1, if they are found.
The subsequent substitution
$variable =~ s/\D//sg;
will remove all non-digits, making the variable just one long number. If one wanted to separate the first part (matched by the first regex), the only way to do so would be by accessing $1 from the first regex.
However, the first regex in that case would be better written simply:
$variable =~ /^(\d+)\b/;
And if the capture is supposed to be used:
my ($num) = $variable =~ /^(\d+)\b/;
Is "taint mode" in use? (Script is invoked with -T option.)
Maybe it's used to sanitize (i.e. untaint) user input.

substitution regex, with capture

Maybe this is a stupid question, but:
I run Perl 5.8.8 and I need to replace any underscore preceded by a digit with "0".
Running: $var =~s /(\d)_/$10/g;
obviously does not work, as $10 is interpreted as... well... $10, not "$1 followed by 0".
Moreover, as I am running Perl 5.8, I can't do
$var=~s/(?<n1>\d)\_/$+{n1}0/g;
Any ideas?
Thanks in advance.
Just like in various Unix shells, you can enclose the variable name in braces for disambiguation.
$var =~s /(\d)_/${1}0/g;
Or you can use a look-behind to prevent the digit from being part of the match:
$var =~s /(?<=\d)_/0/g;
This would also be a good place for a zero width look-behind assertion:
$var =~ s/(?<=\d)_/0/g;
It looks for a digit without actually slurping the digit into the matched text.
$var =~s/(\d)_/${1}0/g;
Other possibilities (\K requires Perl 5.10 or later, so only the look-behind form works on 5.8.8):
s/\d\K_/0/g
s/(?<=\d)_/0/g
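For comparison with the shell analogy made above, here is a minimal sketch of the same substitution using sed rather than Perl (a swapped-in tool), followed by the shell's own brace disambiguation. underscore_to_zero and the variable digit are hypothetical names used only for illustration.

```shell
# Replace every underscore that directly follows a digit with "0".
# In sed, back-references are single digits (\1 to \9), so \10 reads as
# "\1 followed by a literal 0" and needs no braces.
# (underscore_to_zero is a hypothetical helper name.)
underscore_to_zero() {
    printf '%s\n' "$1" | sed -E 's/([0-9])_/\10/g'
}

underscore_to_zero 'a1_b_c2_'   # prints: a10b_c20

# Shell variables do have the ambiguity the answer alludes to:
# "$digit0" would expand a variable named digit0, so braces are needed.
digit=7
printf '%s\n' "${digit}0"       # prints: 70
```

The underscore after b is left alone because it is not preceded by a digit.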