Only allow some characters with grep? - regex

I would like to check a string, so it only contains the characters 0-9 a-z -.
When I do
regex='[-a-z0-9]*'
string='abcd!'
if [[ $string =~ $regex ]]
then
echo "valid"
else
echo "not valid"
fi
it outputs valid, where I would have expected not valid because $string contains a !.

Try this: regex='^[-a-z0-9]*$'. It forces the complete string to match this class. Without the anchors, a match on any part of the string, or even an empty match (due to *), returns valid. ^...$ says that nothing between the start and the end of the string may fail to match.
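A minimal sketch of the anchored check, wrapped in a function so it can be tried on several inputs. The name check_chars is made up for this example and is not from the original post.

```shell
#!/usr/bin/env bash
# check_chars is a hypothetical helper name used only for this sketch.
check_chars() {
    # Anchored at both ends: every character must come from the class.
    local regex='^[-a-z0-9]*$'
    if [[ $1 =~ $regex ]]; then
        echo "valid"
    else
        echo "not valid"
    fi
}

check_chars 'abcd!'     # prints "not valid"
check_chars 'abc-123'   # prints "valid"
```

Because of the *, an empty string also counts as valid here; replacing * with + would reject it, if that is what you want.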

You will have to add anchors for this regex to work.
'[-a-z0-9]*' says: match these characters 0 or more times anywhere in the string.
So adding the start-of-line (^) and end-of-line ($) anchors to the regex will do what you are looking for:
regex='^[-a-z0-9]*$'
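A possible next step is to allow the '-' character at most once:
regex='^[a-z0-9]*-?[a-z0-9]*$'
Can the dash character occur at the start or at the end of the string? The pattern above still accepts that. If it must not, require at least one character on each side of the dash:
regex='^[a-z0-9]+(-[a-z0-9]+)?$'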
Hope this helps.
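To see how a single-dash pattern behaves on edge cases, here is a small sketch that compares the pattern from this answer with a stricter variant. The stricter pattern and the helper name classify are assumptions added for illustration.

```shell
#!/usr/bin/env bash
loose='^[a-z0-9]*-?[a-z0-9]*$'      # pattern from the answer above
strict='^[a-z0-9]+(-[a-z0-9]+)?$'   # assumed variant: dash only between characters

# classify is a hypothetical helper: prints which patterns accept the argument.
classify() {
    local out=()
    [[ $1 =~ $loose ]] && out+=(loose)
    [[ $1 =~ $strict ]] && out+=(strict)
    echo "${out[*]:-none}"
}

classify 'abc-123'   # prints "loose strict"
classify '-abc'      # prints "loose"
classify 'a-b-c'     # prints "none"
```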

Related

Perl: How to match last n digits of a string with n consecutive digits or more?

I use the following:
if ($content =~ /([0-9]{11})/) {
my $digits = $1;
}
to extract 11 consecutive digits from a string. However, it grabs the first 11 consecutive digits. How can I get it to extract the last 11 consecutive digits so that I would get 24555199361 from a string with hdjf95724555199361?
/([0-9]{11})/
means
/^.*?([0-9]{11})/s # Minimal lead that allows a match.
You get what you want by making the .* greedy.
/^.*([0-9]{11})/s # Maximal lead that allows a match.
If the digits appear at the very end of the string, you can also use the following:
/([0-9]{11})\z/
Whenever you want to match something at the end of a string, use the end of line anchor $.
$content =~ m/(\d{11})$/;
If that pattern is not at the very end, but you want to match the "last" occurrence of that pattern, you would first match "the entire string" with /.*/ and then backtrack to the final occurrence of the pattern. The /s flag permits the . metacharacter to match a line feed.
$content =~ m/.*(\d{11})/s;
See the Perl regexp tutorial for more information.

Perl regular expression to retrieve the first digit

I have a string with the value Validation_File_2_3.45.2017.csv.
How do I extract the first digit which is 2 in this case using a regular expression?
I have tried the expression ($Filedigit) = ($Filename =~ m/^[0-9]/g) but it didn't work.
In a comment, you said you tried this:
($Filedigit)= ($Filename =~ m/^[0-9]/g);
A couple of things. You should always check that a match succeeded before continuing with your script, especially when trying to capture. Next, ^ anchors the match at the beginning of the string, so the pattern only matches if the very first character is a digit 0-9. This won't match unless you had a filename such as 2_blah.csv. Also, the pattern contains no capture group. With /g in list context a successful match returns the matched text itself, so $Filedigit would hold that leading digit; without /g it would simply be 1 (signifying that a match happened).
Here's an example that does what you want:
use warnings;
use strict;
my $str = 'Validation_File_2_3.45.2017.csv';
# confirm there's a match
if (my ($num) = $str =~ /^.*?(\d+)/){
print "$num\n";
}
else {
print "no match\n";
}
Explanation of the regex:
^ - start from beginning of string
.*? - anything, non-greedy
( - begin capture
\d+ - one or more contiguous digit chars
) - end capture
So, it starts from the beginning of the string, throws away anything before the first set of contiguous digits and captures them and puts that into the variable.
See perlretut and perlre.

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just slightly off. Any file I currently run through it evaluates to "yes", but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.
Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so it can start matching anywhere. To match the entire string, also add the ^ anchor at the beginning.
Escape the command line arguments
If you still want to use egrep, quote the pattern, as you should quote any command-line argument, so that the shell does not reinterpret special characters. Wrap it in single or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo, as its behavior is considered unreliable. Use printf instead:
printf '%s\n' "$1"
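Putting the points above together, a minimal sketch of the check as a bash function. The function name is_versioned_name and the sample file names are invented for this example.

```shell
#!/usr/bin/env bash
# is_versioned_name is a hypothetical helper name for this sketch.
is_versioned_name() {
    # Name_Major_Minor.extension, anchored at both ends.
    local re='^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
    [[ $1 =~ $re ]]
}

# Sample names are made up; the second lacks Minor, the third has a bad prefix.
for f in 'my-tool_1_2.tar.gz' 'my-tool_1.txt' 'junk my-tool_1_2.txt'; do
    if is_versioned_name "$f"; then
        printf '%s: yes\n' "$f"
    else
        printf '%s: nope\n' "$f"
    fi
done
```

Keeping the pattern in a variable and expanding it unquoted on the right-hand side of =~ avoids quoting surprises, since quoted parts of the right-hand side are matched literally.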
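Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$. Note that the non-capturing group (?:...) is Perl-style syntax; with egrep or bash's =~, use a plain group (...) instead.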
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end with \.*, which matches a period \. any number of times. Also, [A-Z0-9-] will only match uppercase letters, digits and the dash, which might not be what you wanted.
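As a sketch of how this pattern could look in plain ERE, the version below uses an ordinary group for the {2} repetition, plus a variant with capturing groups so the parts can be read back from BASH_REMATCH. The helper names matches_ere and parse_name and the sample file name are assumptions for illustration.

```shell
#!/usr/bin/env bash
# matches_ere: hypothetical helper; same pattern with (...) instead of (?:...).
matches_ere() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+(_[0-9]+){2}\..+$'
}

# parse_name: hypothetical helper; spells out Major and Minor as separate
# capturing groups so that each part ends up in BASH_REMATCH.
parse_name() {
    local re='^([A-Za-z0-9-]+)_([0-9]+)_([0-9]+)\.(.+)$'
    [[ $1 =~ $re ]] || return 1
    printf 'name=%s major=%s minor=%s ext=%s\n' \
        "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
        "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
}

parse_name 'Report-A_12_3.csv'   # prints "name=Report-A major=12 minor=3 ext=csv"
```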

What's special about a "space" character in an "expr match" regexp?

In a bash shell, I set line like so:
line="total active bytes: 256"
Now, I just want to get the digits from that line so I do:
echo $(expr match "$line" '.*\([[:digit:]]*\)' )
and I don't get anything. But, if I add a space character before the first backslash in the regexp, then it works:
echo $(expr match "$line" '.* \([[:digit:]]*\)' )
Why?
The space isn't special at all. What's happening is that in the first case, the .* matches the entire string (i.e., it matches "greedily"), including the numbers, and since you've quantified the digits with * (as opposed to \+), that part of the regex is allowed to match 0 characters.
By putting a space before the digit match, the first part can only match up to but not including the last space in the string, leaving the digits to be matched by \([[:digit:]]*\).
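The effect can be reproduced without relying on the space. The sketch below uses the portable `expr STRING : REGEX` form and, as an assumed alternative, restricts the leading part to non-digits so it cannot swallow the number.

```shell
#!/usr/bin/env bash
line="total active bytes: 256"

# Greedy .* takes the whole line; the digit group is left with zero characters.
# expr exits non-zero when its result is empty, hence the "|| true".
greedy=$(expr "$line" : '.*\([[:digit:]]*\)' || true)

# Assumed alternative: the prefix may only contain non-digits,
# so it has to stop right before 256.
digits=$(expr "$line" : '[^[:digit:]]*\([[:digit:]]*\)' || true)

printf 'greedy=[%s] digits=[%s]\n' "$greedy" "$digits"
# prints "greedy=[] digits=[256]"
```

Note that this alternative picks up the first run of digits after a non-digit prefix, which happens to be the only number in this line.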

Why do these two regexes behave differently?

Why do the following two regexes behave differently?
$millisec = "1391613310.1";
$millisec =~ s/.*(\.\d+)?$/$1/;
vs.
$millisec =~ s/\d*(\.\d+)?$/$1/;
This code prints nothing:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/.*(\.\d+)?$/$1/; print "$millisec";'
While this prints the decimal portion of the string:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/\d*(\.\d+)?$/$1/; print "$millisec";'
In the first regex, the .* is taking up everything to the end of the string, so there's nothing the optional (.\d+)? can pick up. $1 will be empty, so the string is replaced by an empty string.
In the second regex, only digits are grabbed from the beginning so that \d* stops in front of the dot. (.\d+)? will pick the dot, including the trailing digits.
You're using .\d+ inside parentheses, which will match any character plus digits. If you want to match a dot explicitly, you have to use \..
To make the first regex behave similarly to the second one you would have to write
$millisec =~ s/.*?(\.\d+)?$/$1/;
so that the initial .* doesn't take up everything.
Greed.
Perl's regex engine will match as much as possible with each term before moving on to the next term. So for .*(.\d+)?$ the .* matches the entire string, then (.\d+)? matches nothing, as it is optional.
\d*(.\d+)?$ can match only up to the dot, so then has to match .1 against (.\d+)?