wildcard in regular expression in shell script - regex

I have a question regarding wildcards in shell script regular expressions.
vi a.sh
if [[ $1 == 1*3 ]]; then
echo "matching"
else
echo "not matching"
fi
If I run sh a.sh 123 the output is: "matching".
But according to http://www.tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm:
* (asterisk)
the proceeding item is to be matched zero or more times. ie. n* will
match n, nn, nnnn, nnnnnnn but not na or any other character.
So it should match only 3, 13, 113, 111..3.
Why is it matching 123?

The part of the linked documentation you are quoting is the one about "Regular expressions".
However, what is important here is what is within "Standard Wildcards (globbing patterns)":
Standard Wildcards (globbing patterns)
this can represent any number of characters (including zero, in other words, zero or more characters). If you specified a "cd*" it
would use "cda", "cdrom", "cdrecord" and anything that starts with
“cd” also including “cd” itself. "m*l" could by mill, mull, ml, and
anything that starts with an m and ends with an l.
That is, it does not refer to the previous character but to a set of characters (zero or more). It is equivalent to .* in normal regular expressions.
So the expression 1*3 matches anything starting with 1 + zero or more characters + 3 at the end.
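The difference can be sketched as follows, assuming bash (the [[ ]] construct is not part of plain POSIX sh). The function names glob_match and regex_match are made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helpers: compare glob matching with regex matching in [[ ]].

# Glob: * stands for any run of characters (zero or more).
glob_match() {
    if [[ $1 == 1*3 ]]; then echo "matching"; else echo "not matching"; fi
}

# Regex: * repeats the preceding item, so ^1*3$ means "zero or more 1s, then 3".
regex_match() {
    local re='^1*3$'
    if [[ $1 =~ $re ]]; then echo "matching"; else echo "not matching"; fi
}

glob_match 123    # the glob 1*3 accepts 123
regex_match 123   # the regex ^1*3$ rejects 123
regex_match 113   # the regex ^1*3$ accepts 113
```

Keeping the regex in a variable sidesteps quoting surprises on the right-hand side of =~.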

There are two different kinds of pattern matching in the Unix/Linux world:
Regular Expressions: This is the complex pattern matching you find in grep, sed, and many other utilities. Many extensions have accumulated over time. POSIX defines two flavors: basic regular expressions (the original form, sometimes called obsolete) and extended regular expressions (sometimes called modern).
Globbing: This refers to the file matching you find when you do things such as *.txt to match all text files. These are much simpler and less extensive. There are a few extensions (like ** to match subdirectories in Ant).
When you use [[ ... == ... ]] without quotes in Bash, the right-hand side is treated as a glob pattern. If you want to use regular expressions, you need to use the =~ operator:
if [[ $foo =~ ^11*3$ ]] # Matches 13, 113, 1113, 11113
then
    echo "matching"
fi

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What am I missing here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
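sub repl {
    my ($r) = @_;
    # Split on $N, keeping the separators; drop empty strings (but keep a literal "0")
    my @terms = grep { length } split /(\$\d)/, $r;
    # Quote every term that is not $N so the joined string is a valid Perl expression
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;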
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings break many patterns because interpolation is done first: for example, "\w" is just an escaped w (which is unrecognized). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we then get a true regex. Then all modifiers may be used as well.

Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname and city can vary in length.
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to get the standard xx_ from the beginning, then any code that has at least 3 characters, and after that it must have another underscore, and so on.
Could anybody help?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
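The same extglob pattern can also be tried against plain strings, without touching the filesystem. This is only a sketch: the helper name matches_convention and the sample names are invented for illustration, and the pattern is kept in a variable so that it is interpreted when the match runs.

```shell
#!/bin/bash
shopt -s extglob   # turn on extended globbing syntax

# Hypothetical helper: succeeds when a name follows
# xx_code_lastname_firstname_city.ext with the lengths described above.
matches_convention() {
    local a='[[:alpha:]]'
    # code: three or more letters; the other three fields: two or more letters
    local pat="xx_$a$a+($a)_$a+($a)_$a+($a)_$a+($a).@(doc|docx|pdf)"
    [[ $1 == $pat ]]   # unquoted right-hand side is matched as a pattern
}

for name in xx_abc_smith_john_paris.doc xx_ab_smith_john_paris.doc; do
    if matches_convention "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected"
    fi
done
```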
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
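A hedged sketch of how this could be run with grep -E: POSIX extended regular expressions do not define lazy quantifiers, so the ? suffixes are dropped here, and the ^ and $ anchors are an addition of this sketch rather than part of the pattern above. The helper name filter_names and the sample names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: keep only the names on stdin that follow the convention.
filter_names() {
    grep -E '^xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)$'
}

# Sample names for illustration; in practice the input could come from ls -1.
printf '%s\n' xx_abc_smith_john_paris.doc xx_ab_smith_john_paris.doc notes.txt |
    filter_names
```

Because every field is delimited by a literal underscore that the letter classes cannot match, greedy quantifiers give the same result here as lazy ones would.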

Linux Bash Script Regex malfunction

I would like to make a bash script, which should decide about the given strings, if they fulfill the term or not.
The terms are:
The string's first 3 character must be "le-"
Between hyphens there can be any number of consonants in any arrangement and just one "e"; it cannot contain any other vowel.
Between hyphens there must be something
The string must not end with hyphen
I made this script:
#!/bin/bash
# Testing regex
while read -r line; do
if [[ $line =~ ^le((-[^aeiou\W]*e+[^aeiou\W]*)+)$ ]]
then
printf "\""$line"\"\t\t\t-> True\n";
else
printf "\""$line"\"\t\t\t-> False\n";
fi
done < <(cat "$@")
It does everything fine, except one thing:
It says true no matter how many hyphens are next to each other.
For example:
It says true for this string "le--le"
I tried this regex expression on websites (like this) and they worked without this malfunction.
All I can think of is that there must be some difference between the web page and Linux bash. (All I can see on the web page is that it runs PHP.)
Do you have any idea how I could make it work?
Thank you for your answers!
sweaver2112 rightly points out that the \W is causing you problems, but fails to provide a working example of a bash test regex that does what you ask (at least, i couldn't get it to work).
this seems to do it (adapting Laurel's consonant regex):
[[ "$line" =~ ^le(-[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*)+$ ]]
it matches (e.g.):
le-e
le-e-le
le-e-e-e-e-e
and more generally (writing [[:consonant:]] as shorthand for the consonant set; it is not a real POSIX class):
^le(-[[:consonant:]]*e[[:consonant:]]*)+$
and doesn't match (e.g.):
le-
le--le
le-lea-le
also, you can write it more cleanly this way:
c='[b-df-hj-np-tv-z]'
[[ "$line" =~ ^le(-$c*e$c*)+$ ]]
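A small self-contained sketch that runs this regex over the sample strings listed above. The function name check_term is hypothetical; the regex is the one from this answer, stored in a variable.

```shell
#!/bin/bash
# Hypothetical helper: prints True or False for one candidate string.
check_term() {
    local c='[b-df-hj-np-tv-z]'     # consonants only
    local re="^le(-$c*e$c*)+\$"     # each hyphen-separated part holds exactly one e
    if [[ $1 =~ $re ]]; then echo True; else echo False; fi
}

for s in le-e le-e-le le- le--le le-lea-le; do
    printf '"%s"\t-> %s\n' "$s" "$(check_term "$s")"
done
```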
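There's at least one problem with your regex: [^aeiou\W]. In PCRE, a negated "non-word" means "word", so the class accepts every word character other than a, e, i, o and u, which is more than just the consonants. In bash's POSIX regex, \W is not a shorthand inside a bracket expression at all: it is a literal backslash and a literal W, so the negated set also accepts the hyphen, which is why "le--le" passes. We're better off just listing all the consonants (and for your case, we'll add 'e' and '-' to the set as well).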
So try this one (edit: using @Laurel's more concise char class):
`(?=^le-)(?!.*--)(?!.*-[^-]*e[^-]*e[^-]*-)[b-hj-np-tv-z-]*[^-]$`
(?=^le-) starts with 'le-'
(?!.*--) no double dashes allowed
(?!.*-[^-]*e[^-]*e[^-]*-) do NOT see two e's between dashes
[b-hj-np-tv-z-]* - consume consonants, e, and dashes (same as [bcdefghjklmnpqrstvwxyz-])
[^-]$ last character must be non-dash

Bash regex to match dots and characters

I'm trying to use the =~ operator to execute a regular expression pattern against a curl response string.
The pattern I'm currently using is:
name\":\"(\.[a-zA-Z]+)\"
Currently, however, this pattern only extracts values that contain only the characters a-z and A-Z. I need this pattern to also pick up values that contain a '.' character and a '@' character. How would I do this?
Also, is there any way this pattern can be improved performance wise? It takes quite a long time to execute against the string.
Cheers.
I recently ran into this problem in my script that sets my bash prompt according to my git status, and found that it was because of the placement of other things (namely, a hyphen) I wanted to match inside the expression.
For example, I wanted to match a certain part of a git status output, e.g. the part where it says "Your branch is ahead of 'origin/mybranch' by 1 commit."
This was my original pattern:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)' by ([0-9]+) commit".
One day I created a branch that had a . in it and found that my bash prompt wasn't showing me the right thing, and modified the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-.]+)' by ([0-9]+) commit".
I expected it to work just fine, but instead there was no match at all.
After reading a lot of posts, I realized it was because of the placement of the hyphen (-); I had to put it right after the first square bracket, otherwise it would be interpreted as a range (in this case, it was trying to interpret the range of _-., which is invalid or just somehow makes the whole expression fall over).
It started working when I changed the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit".
So basically what I meant to say is that it could be something else in your expression (like the hyphen in mine) that is interfering with the matching of the dot and at sign.
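A minimal sketch of the hyphen-first placement described above, reduced to just the branch-name character class. The function name valid_branch and the sample names are invented for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the name uses only the allowed characters.
valid_branch() {
    local re='^[-a-zA-Z0-9_.]+$'   # hyphen comes first, so it is a literal hyphen
    [[ $1 =~ $re ]]
}

valid_branch 'release-1.2' && echo "release-1.2 accepted"
valid_branch 'bad name'    || echo "'bad name' rejected"
```

Placing the hyphen last in the bracket expression is the other common way to keep it literal.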
Working example script:
#!/bin/bash
# Capture the value of "name"; letters, dots and at signs are allowed
regex='"name":"([a-zA-Z.@]+)"'
input='"name":"internal.action.retry.queue@temp"'
if [[ $input =~ $regex ]]
then
    echo "$input matches regex $regex"
    # BASH_REMATCH[0] is the whole match, BASH_REMATCH[1] the first group
    for (( i=0; i<${#BASH_REMATCH[@]}; i++))
    do
        echo -e "\tGroup[$i]: ${BASH_REMATCH[$i]}"
    done
else
    echo "$input does not match regex $regex"
fi
Just add dot ('.') and at sign ('@'):
name\":\"(\.[a-zA-Z.@]+)\"
If you don't need a mandatory dot at the beginning of the value, use this:
\"name\":\"([a-zA-Z.@]+)\"

Using the star sign in grep

I am trying to search for the substring "abc" in a specific file in linux/bash
So I do:
grep '*abc*' myFile
It returns nothing.
But if I do:
grep 'abc' myFile
It returns matches correctly.
Now, this is not a problem for me. But what if I want to grep for a more complex string, say
*abc * def *
How would I accomplish it using grep?
The asterisk is just a repetition operator, but you need to tell it what you repeat. /*abc*/ matches a string containing ab and zero or more c's (because the second * is on the c; the first is meaningless because there's nothing for it to repeat). If you want to match anything, you need to say .* -- the dot means any character (within certain guidelines). If you want to just match abc, you could just say grep 'abc' myFile. For your more complex match, you need to use .* -- grep 'abc.*def' myFile will match a string that contains abc followed by def with something optionally in between.
Update based on a comment:
* in a regular expression is not exactly the same as * in the console. In the console, * is part of a glob construct, and just acts as a wildcard (for instance ls *.log will list all files that end in .log). However, in regular expressions, * is a modifier, meaning that it only applies to the character or group preceding it. If you want * in regular expressions to act as a wildcard, you need to use .* as previously mentioned: the dot is a wildcard character, and the star, when modifying the dot, means zero or more occurrences of it, i.e. zero or more of any character.
The dot character means match any character, so .* means zero or more occurrences of any character. You probably mean to use .* rather than just *.
Use grep -P - which enables support for Perl style regular expressions.
grep -P "abc.*def" myfile
The "star sign" is only meaningful if there is something in front of it. If there isn't, the tool (grep in this case) may treat it as an error or as a literal asterisk. For example:
'*xyz' is meaningless
'a*xyz' means zero or more occurrences of 'a' followed by xyz
This worked for me:
grep ".*${expr}" - with double-quotes, preceded by the dot.
Where ${expr} is whatever string you need in the end of the line.
So in your case:
grep ".*abc.*" myFile
Standard unix grep.
The expression you tried, like those that work on the shell command line in Linux for instance, is called a "glob". Glob expressions are not full regular expressions, which is what grep uses to specify strings to look for. Here is an (old, small) post about the differences. The glob expressions (as in "ls *") are interpreted by the shell itself.
It's possible to translate from globs to REs, but you typically need to do so in your head.
You're not using regular expressions, so your grep variant of choice should be fgrep, which will behave as you expect it to.
Try grep -E for extended regular expression support
Also take a look at:
The grep man page
'*' works as a modifier for the previous item. So 'abc*def' searches for 'ab' followed by 0 or more 'c's followed by 'def'.
What you probably want is 'abc.*def' which searches for 'abc' followed by any number of characters, followed by 'def'.
This may be the answer you're looking for:
grep abc MyFile | grep def
Only thing is... it will output lines where "def" is before OR after "abc"
$ cat a.txt
123abcd456def798
123456def789
Abc456def798
123aaABc456DEF
* matches the preceding character zero or more times.
$ grep -i "abc*def" a.txt
$
It would match, for instance "abdef" or "abcdef" or "abcccccccccdef". But none of these are in the file, so no match.
. means "match any character" Together with *, .* means match any character any number of times.
$ grep -i "abc.*def" a.txt
123abcd456def798
Abc456def798
123aaABc456DEF
So we get matches.
There are a lot of online references about regular expressions, which is what is being used here.
I summarize other answers, and make these examples to understand how the regex and glob work.
There are three files
echo 'abc' > file1
echo '*abc' > file2
echo '*abcc' > file3
Now I execute the same commands on these 3 files; let's see what happens.
(1)
grep '*abc*' file1
As you said, this one returns nothing. * repeats whatever is in front of it. For the first *, there is nothing in front of it to repeat, so grep treats it as a literal * character. The string in the file is abc, which contains no *, so nothing is found. The second *, after c, means c is repeated 0 or more times.
(2)
grep '*abc*' file2
This one returns *abc: there is a literal * at the front, so the line matches the pattern *abc*.
(3)
grep '*abc*' file3
This one returns *abcc because there is a * at the front and two c's at the tail, so it matches the pattern *abc*.
(4)
grep '.*abc.*' file1
This one returns abc because .* indicates 0 or more repetitions of any character.
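The four cases can be reproduced in one go with a sketch like the following. It assumes a grep that treats a leading * in a basic regular expression as a literal character, which is the behavior described in this answer; the scratch directory and the helper name try_pattern are made up for illustration.

```shell
#!/bin/bash
# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
echo 'abc'   > "$workdir/file1"
echo '*abc'  > "$workdir/file2"
echo '*abcc' > "$workdir/file3"

# Hypothetical helper: report whether a pattern finds anything in a file.
try_pattern() {
    if grep -q -- "$1" "$workdir/$2"; then
        echo "$2: match"
    else
        echo "$2: no match"
    fi
}

try_pattern '*abc*'   file1   # case (1)
try_pattern '*abc*'   file2   # case (2)
try_pattern '*abc*'   file3   # case (3)
try_pattern '.*abc.*' file1   # case (4)
```

The scratch directory is left in place so the files can be inspected; remove it with rm -r "$workdir" when done.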