Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just slightly off. Any file I currently run through it evaluates to "yes", but if I use "[A-Z]$" instead of "*$" it always evaluates to "nope". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
    echo "yes"
else
    echo "nope"
    exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.

Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots at the end of the string". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so it can match any suffix of the input. To match the entire string, also add the ^ anchor at the beginning of the pattern.
Escape the command line arguments
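If you still want to use egrep, quote the pattern, as you should quote any command-line argument that contains special characters, so that the shell does not reinterpret it. Left unquoted, the shell removes the backslash before egrep ever sees it, so \. arrives as a plain . (any character). Wrap the pattern in single quotes (or double quotes), e.g.: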
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Avoid echo for arbitrary data: depending on the shell, it may interpret arguments such as -n or backslash escapes instead of printing them. Use printf instead:
printf '%s\n' "$1"
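The points above can be combined into one small check. This is only a sketch: the function name is_valid_name and the sample file names are made up for illustration, and the pattern is the anchored one suggested above.

```bash
# Hypothetical helper: succeeds if "$1" looks like Name_Major_Minor.extension.
is_valid_name() {
    # printf avoids echo's option/escape quirks; the single quotes keep the
    # shell from touching the pattern; ^ and $ anchor the whole string.
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
}

# Made-up sample names: only the first one has Name, Major, Minor and a dot.
for f in "my-app_1_22.tar" "my-app_1.tar" "my-app_1_22"; do
    if is_valid_name "$f"; then
        printf 'yes: %s\n' "$f"
    else
        printf 'nope: %s\n' "$f"
    fi
done
```

grep -E is the standard spelling of egrep, so the pattern itself is unchanged.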

Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end with \.*, which matches a period (\.) any number of times. Also, [A-Z0-9-] only matches uppercase letters, which might not be what you wanted. Note that the non-capturing group (?:...) is Perl-style syntax; with egrep or bash's =~, use a plain group (...) instead.
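For use with grep -E, here is a sketch of the grouped pattern written with an ordinary group, since (?:...) is not part of POSIX extended regular expressions. The function name matches_grouped is hypothetical.

```bash
# Hypothetical helper: same idea as the grouped regex, in plain ERE syntax.
matches_grouped() {
    # (_[0-9]+){2} repeats "_digits" exactly twice: _Major_Minor
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+(_[0-9]+){2}\..+$'
}

matches_grouped "Name_1_2.txt"   && printf 'match\n'
matches_grouped "Name_1_2_3.txt" || printf 'no match: three numeric parts\n'
matches_grouped "Name_1_2."      || printf 'no match: empty extension\n'
```

Unlike \..*$, the \..+$ ending requires at least one character after the dot.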

Related

$cmd =~ s#-fp [^ ]+##; What does it mean in Perl?

$cmd =~ s#-fp [^ ]+##;
Can anyone tell me what this regex means in Perl?
I couldn't find anything like it by googling.

This removes the -fp optional parameter and its value from the command.
This takes the string stored by variable $cmd and replaces a section matching -fp [^ ]+ with nothing.
This command uses the fact that Perl's substitution operator (like its other regex operators) accepts almost any delimiter character. What is normally written as s/.../.../ is written as s#...#...# here, which may be the source of confusion.
=~ is the binding operator: it applies the operation on its right, in this case a substitution, to the string on its left.
-fp [^ ]+
-fp matches literally.
[^ ]+ matches one or more characters which are not space.

Let's get the easy bit out of the way first. The $cmd =~ simply means "do the substitution on the variable $cmd".
Not all of this expression is a regex. It's actually the substitution operator - s/REGEX/STRING/. It matches the REGEX and replaces it with the STRING.
Like many similar operators in Perl, the substitution operator allows you to choose the delimiter character that you use. In this case, the programmer has made the slightly bizarre choice to use #.
So, we have this:
$cmd =~ s/-fp [^ ]+//;
And we now know what it means: "Match the variable $cmd against the regex -fp [^ ]+ and replace the matched text with an empty string". Why an empty string? Because the replacement part (between the second and third /) is empty.
All we need to do now is to understand the actual regex - -fp [^ ]+. And it's not very complicated.
-fp - the first four characters (up to and including the space) match themselves. So this matches the literal string "-fp ".
[^ ] - this is a "character class". Normally, it means "match any of the characters inside [...]". But the ^ at the start inverts that meaning to "match any character except the ones inside [^...]". So this matches anything that isn't a space.
+ - this is a quantifier that means "match one or more of the previous expression".
So, put together, this is "match the string '-fp ' followed by one or more non-space characters".
And, adding in the rest of the expression, we get:
Look at the string in $cmd. If you find the string '-fp ' followed by one or more non-space characters, replace the matched portion with an empty string.
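The same substitution can be tried from the shell with sed, which also accepts an alternative delimiter. This is a sketch rather than the Perl original: the function name strip_fp and the sample command line are invented, and [^ ][^ ]* stands in for [^ ]+ because basic sed regexes have no + quantifier.

```bash
# Hypothetical helper: removes the first "-fp VALUE" pair from a command string.
strip_fp() {
    # '#' is the delimiter, as in the Perl line; the replacement part is empty.
    printf '%s\n' "$1" | sed 's#-fp [^ ][^ ]*##'
}

# Made-up command line. The space before "-fp" and the one after its value
# are both left in place, so two spaces remain where the option used to be.
strip_fp 'tool -fp /tmp/out.log -v input.txt'
```

Only the first occurrence is removed, because neither the Perl line nor this sed command uses the g flag.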

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Basically, I'm inside a loop and only sometimes get files with this pattern. Currently I use grep "$i" test.txt, which works in all cases except when I encounter such patterns.
I'm grepping for the exact file number and file sequence, so if it says 123_29887 it will be 123_29887. Thanks.

You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
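When the whole pattern is a literal, as in the loop described in the question, a simpler alternative to \Q...\E is grep's -F option, which treats the pattern as a fixed string so that * needs no escaping. This is a different technique from the -P approach above; the function name contains_literal and the temporary file are assumptions made for the sketch.

```bash
# Hypothetical helper: succeeds if file "$2" has a line containing "$1" verbatim.
contains_literal() {
    # -F: fixed string, no regex metacharacters; -e protects patterns that
    # start with "-"; -q: no output, just the exit status.
    grep -Fq -e "$1" -- "$2"
}

tmp=$(mktemp)
printf '%s\n' 'NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv' > "$tmp"

for i in '**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887' 'no_such_name'; do
    if contains_literal "$i" "$tmp"; then
        printf 'found: %s\n' "$i"
    else
        printf 'missing: %s\n' "$i"
    fi
done
rm -f "$tmp"
```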

How do I extract a string using a regex in a shell script?

I want to extract part of a string using a regular expression. For example, how do I extract the domain name from the $name variable?
name='here'
domain_name=... # apply some regex on $name

Using bash regular expressions:
re="http://([^/]+)/"
if [[ $name =~ $re ]]; then echo "${BASH_REMATCH[1]}"; fi
Edit - OP asked for explanation of syntax. Regular expression syntax is a large topic which I can't explain in full here, but I will attempt to explain enough to understand the example.
re="http://([^/]+)/"
This is the regular expression stored in a bash variable, re - i.e. what you want your input string to match and, hopefully, extract a substring from. Breaking it down:
http:// is just a string - the input string must contain this substring for the regular expression to match
[] Normally square brackets are used to say "match any character within the brackets". So c[ao]t would match both "cat" and "cot". The ^ character within the [] modifies this to say "match any character except those within the square brackets". So in this case [^/] will match any character apart from "/".
The square bracket expression will only match one character. Adding a + to the end of it says "match 1 or more of the preceding sub-expression". So [^/]+ matches 1 or more of the set of all characters, excluding "/".
Putting () parentheses around a subexpression says that you want to save whatever matched that subexpression for later processing. If the language you are using supports this, it will provide some mechanism to retrieve these submatches. For bash, it is the BASH_REMATCH array.
Finally we do an exact match on "/" to make sure we match all the way to the end of the fully qualified domain name and the following "/"
Next, we have to test the input string against the regular expression to see if it matches. We can use a bash conditional to do that:
if [[ $name =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"
fi
In bash, the [[ ]] specify an extended conditional test, and may contain the =~ bash regular expression operator. In this case we test whether the input string $name matches the regular expression $re. If it does match, then due to the construction of the regular expression, we are guaranteed that we will have a submatch (from the parentheses ()), and we can access it using the BASH_REMATCH array:
Element 0 of this array ${BASH_REMATCH[0]} will be the entire string matched by the regular expression, i.e. "http://www.google.com/".
Subsequent elements of this array hold the results of the submatches. Note you can have multiple submatches () within a regular expression - the BASH_REMATCH elements will correspond to these in order. So in this case ${BASH_REMATCH[1]} will contain "www.google.com", which I think is the string you want.
Note that the contents of the BASH_REMATCH array only apply to the last time the regular expression =~ operator was used. So if you go on to do more regular expression matches, you must save the contents you need from this array each time.
This may seem like a lengthy description, but I have really glossed over several of the intricacies of regular expressions. They can be quite powerful, and I believe with decent performance, but the regular expression syntax is complex. Also regular expression implementations vary, so different languages will support different features and may have subtle differences in syntax. In particular escaping of characters within a regular expression can be a thorny issue, especially when those characters would have an otherwise different meaning in the given language.
Note that instead of setting the $re variable on a separate line and referring to this variable in the condition, you can put the regular expression directly into the condition. However in bash 3.2, the rules were changed regarding whether quotes around such literal regular expressions are required or not. Putting the regular expression in a separate variable is a straightforward way around this, so that the condition works as expected in all bash versions that support the =~ match operator.
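Putting the pieces together, a minimal sketch might look like this. It needs bash rather than plain sh; the function name extract_host and the sample input are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the host part of the first http:// URL in "$1",
# or returns non-zero if there is none.
extract_host() {
    # Keeping the regex in a variable sidesteps the quoting differences
    # between bash versions.
    local re='http://([^/]+)/'
    if [[ $1 =~ $re ]]; then
        # Copy the capture right away: the next =~ overwrites BASH_REMATCH.
        local host=${BASH_REMATCH[1]}
        printf '%s\n' "$host"
    else
        return 1
    fi
}

extract_host 'see http://www.google.com/search for details'   # prints www.google.com
```

Because the regex ends in "/", a URL with nothing after the host name (no trailing slash) does not match.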

One way would be with sed. For example:
echo "$name" | sed -e 's?http://www\.??'
Normally sed regular expressions are delimited by '/', but you can use '?' instead since the pattern itself contains '/'. Here's another bash trick, which DigitalTrauma's answer reminded me to suggest. It's similar:
echo "${name#http://www.}"
(DigitalTrauma also gets credit for reminding me that the "http://" needs to be handled.)
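Both tricks can be extended to return just the host name. The following is a sketch under assumptions: the function names are hypothetical, the sed version uses a capture group with \1 instead of only deleting a prefix, and the expansion version assumes the string starts with http://.

```bash
# Hypothetical helper: capture the host with sed; prints nothing if no URL is found.
extract_host_sed() {
    # '|' is the delimiter because the pattern contains '/';
    # -n plus the p flag prints only lines where the substitution happened.
    printf '%s\n' "$1" | sed -n 's|.*http://\([^/][^/]*\)/.*|\1|p'
}

# Hypothetical helper: the same with parameter expansion only.
extract_host_expansion() {
    rest=${1#http://}             # drop a leading "http://"
    printf '%s\n' "${rest%%/*}"   # drop everything from the first "/"
}

extract_host_sed 'http://www.google.com/search'         # prints www.google.com
extract_host_expansion 'http://www.google.com/search'   # prints www.google.com
```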

perl regular expression explanations

I am completely lost on this line of perl code
$path =~ s|^\./|~/|; #change the path for prettier output
I am assuming it has to do with regex. I have some understanding of regex, but I just can't figure this one out.
What is =~, why is there an s, and how are regexes expressed in Perl?

=~ is a binding operator. It applies the substitution (hence the s) to the variable $path. The substitution has two parts - a regular expression and the replacement. They are delimited by the | character in this case. The regular expression is
^\./
^ stands for the beginning of the string. \. stands for a literal dot, / stands for itself. So, ./ at the beginning of the string is replaced by ~/.
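For experimenting outside Perl, sed accepts the same substitution with the same | delimiter. This is a sketch, not the original code; the function name prettify_path and the sample paths are invented.

```bash
# Hypothetical helper: rewrites a leading "./" to "~/", as the Perl line does.
prettify_path() {
    # ^ anchors the match at the start, \. is a literal dot, / matches itself.
    printf '%s\n' "$1" | sed 's|^\./|~/|'
}

prettify_path './src/main.c'   # prints ~/src/main.c
prettify_path 'lib/./util.c'   # unchanged: the "./" is not at the start
```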

The =~ operator binds a scalar expression to a pattern match; the s stands for substitution.
What it's doing is matching ./ at the start of the string and replacing it with ~/.
As for the | pipes: you can use almost any non-whitespace character to delimit the parts of the substitution, such as ^ or & or q or m or {. Most people use / for readability, but when the pattern itself contains /, another delimiter saves you from escaping it.
Hope this helps.

Why does perl ignore extra characters in my regex?

I have this line in bash:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
and get:
a=-1
Wait..I didn't say perl should match on the -. I made a minor change to:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*$/){print;}'
and it correctly ignores the line. Why?

You might find it useful, while developing a regex, to print only the part of the string that the regex actually matched, instead of the entire line. This will give you better insight into what your regex is doing. You can do this with the special $& variable. So instead of:
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print;}'
use
echo "a=-1"|perl -nle 'if (/.*=[0-9]*/){print $&;}'
You will now get different output:
a=
And this new information may give you a head start in understanding how your regex is [mis]behaving with regards to the input data.

[0-9]* can match the empty string. When you anchored to the end of the string, you prevented this empty match.
You probably want to say [0-9]+ to mean "at least one digit".
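The difference is easy to see from the shell as well. This sketch assumes a grep that supports the -o option (GNU and BSD grep do) to print only the matched text; the function name matched_part is hypothetical.

```bash
# Hypothetical helper: prints only the part of "$2" matched by the ERE in "$1".
matched_part() {
    printf '%s\n' "$2" | grep -oE -e "$1"
}

matched_part '.*=[0-9]*' 'a=-1'                         # prints a=
matched_part '.*=[0-9]+' 'a=-1' || printf 'no match\n'  # at least one digit required
matched_part '.*=-?[0-9]+$' 'a=-1'                      # prints a=-1
```

The last pattern allows an optional minus sign, which is one way to accept the whole of a=-1 while still anchoring at the end.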

In the first example, [0-9]* means "zero or more digits", so the regex matches just "a=" and the rest of the line is simply ignored. Once you add the $, the match has to extend to the end of the string, and it can't, because the "-" after "=" is not a digit.
