Linux Bash Script Regex malfunction - regex

I would like to write a bash script that decides whether the given strings satisfy a set of rules.
The rules are:
The string's first 3 characters must be "le-"
Between hyphens there can be any number of consonants in any arrangement and just one "e"; no other vowel is allowed.
Between hyphens there must be something
The string must not end with a hyphen
I made this script:
#!/bin/bash
# Testing regex
while read -r line; do
if [[ $line =~ ^le((-[^aeiou\W]*e+[^aeiou\W]*)+)$ ]]
then
printf '"%s"\t\t\t-> True\n' "$line"
else
printf '"%s"\t\t\t-> False\n' "$line"
fi
done < <(cat "$@")
It does everything fine, except for one thing:
It says true no matter how many hyphens are next to each other.
For example:
It says true for the string "le--le"
I tried this regex on websites (like this) and it worked there without this malfunction.
All I can think of is that there must be some difference between the web page and Linux bash. (All I can see on the web page is that it runs PHP.)
Do you have any idea how I could make it work?
Thank you for your answers!

sweaver2112 rightly points out that the \W is causing you problems, but fails to provide a working example of a bash test regex that does what you ask (at least, i couldn't get it to work).
this seems to do it (adapting Laurel's consonant regex):
[[ "$line" =~ ^le(-[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*)+$ ]]
it matches (e.g.):
le-e
le-e-le
le-e-e-e-e-e
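and more generally anything of the form (where [[:consonant:]] is shorthand for the consonant class above, not a real POSIX class):
le(-[[:consonant:]]*e[[:consonant:]]*)+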
and doesn't match (e.g.):
le-
le--le
le-lea-le
also, you can write it more cleanly this way:
c='[b-df-hj-np-tv-z]'
[[ "$line" =~ ^le(-$c*e$c*)+$ ]]
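A minimal sketch of how the pattern above could be wrapped in a reusable check. The function name is_le_string and the sample strings are assumptions for illustration, not part of the original script:

```bash
#!/bin/bash
# Hypothetical helper: returns 0 when the string starts with "le" followed by
# one or more "-<consonants>e<consonants>" groups, and 1 otherwise.
is_le_string() {
    local c='[b-df-hj-np-tv-z]'          # lowercase consonants only
    local re="^le(-${c}*e${c}*)+\$"      # keep the regex in a variable, unquoted in =~
    [[ $1 =~ $re ]]
}

# Sample inputs taken from the answer above.
for s in "le-e" "le-e-le" "le-" "le--le" "le-lea-le"; do
    if is_le_string "$s"; then
        printf '"%s"\t-> True\n' "$s"
    else
        printf '"%s"\t-> False\n' "$s"
    fi
done
```

Storing the regex in a variable avoids quoting surprises, since anything quoted on the right-hand side of =~ is matched literally.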

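There's at least one problem with your regex: [^aeiou\W]. In PCRE (which the PHP-based website uses) the \W inside the class is the "non-word" shorthand, so the negated class means "a word character that is not a vowel". In the POSIX regex that bash uses, \W is not a shorthand inside a bracket expression, so the class merely excludes the vowels (plus a literal W, and possibly the backslash) and therefore matches a hyphen as well. We're better off just listing all the consonants (and for your case, we'll add 'e' and '-' to the set as well).
So try this one, which needs a PCRE engine because bash's =~ does not support lookaheads: (edit: using @Laurel's more concise char class)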
`(?=^le-)(?!.*--)(?!.*-[^-]*e[^-]*e[^-]*-)[b-hj-np-tv-z-]*[^-]$`
(?=^le-) starts with 'le-'
(?!.*--) no double dashes allowed
(?!.*-[^-]*e[^-]*e[^-]*-) do NOT see two e's between dashes
[b-hj-np-tv-z-]* - consume consonants, e, and dashes (same as [bcdfghjklmnpqrstvwxyze-])
[^-]$ last character must be non-dash

Related

Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname and city can vary in length.
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to match the standard xx_ at the beginning, then a code of at least 3 letters, then another underscore, and so on.
Could anybody help ?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
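Lazy quantifiers such as {3,}? are a Perl/PCRE feature and are not part of the POSIX extended syntax used by grep -E and by bash's =~, so a sketch for those tools would use plain greedy quantifiers and anchor both ends instead. The function name matches_convention and the sample file names below are made up for illustration:

```bash
#!/bin/bash
# Hypothetical helper: succeeds when a file name follows
# xx_code_lastname_firstname_city.(doc|docx|pdf), letters only.
matches_convention() {
    local re='^xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)$'
    [[ $1 =~ $re ]]
}

# Invented sample names: a valid one, one with a too-short code,
# and one with an extension that is not allowed.
for f in "xx_abc_smith_john_boston.doc" \
         "xx_ab_smith_john_boston.doc" \
         "xx_abc_smith_john_boston.txt"; do
    if matches_convention "$f"; then
        echo "match:    $f"
    else
        echo "no match: $f"
    fi
done
```

Because the underscore is not in [a-zA-Z], the greedy quantifiers cannot run past a field boundary, so laziness is not needed here.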

wildcard in regular expression in shell script

I have a question regarding wildcards in shell script regular expressions
vi a.sh
if [[ $1 == 1*3 ]]; then
echo "matching"
else
echo "not matching"
fi
If I run sh a.sh 123 the output is: "matching".
But according to http://www.tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm:
* (asterisk)
the proceeding item is to be matched zero or more times. ie. n* will
match n, nn, nnnn, nnnnnnn but not na or any other character.
it should match to only 3,13,113,111..3.
But why is it matching 123?
In the documentation you linked you are taking the part in which it talks about "Regular expressions".
However, what is important here is what is within "Standard Wildcards (globbing patterns)":
Standard Wildcards (globbing patterns)
this can represent any number of characters (including zero, in other words, zero or more characters). If you specified a "cd*" it
would use "cda", "cdrom", "cdrecord" and anything that starts with
“cd” also including “cd” itself. "m*l" could by mill, mull, ml, and
anything that starts with an m and ends with an l.
That is, it does not refer to the previous character but to any sequence of characters (zero or more). It is equivalent to .* in regular expressions.
So the expression 1*3 matches anything starting with 1 + zero or more characters + 3 at the end.
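To make the difference concrete, here is a small sketch that runs the same inputs through a glob comparison (==) and a regular-expression comparison (=~). The helper names glob_match and regex_match are invented for this example:

```bash
#!/bin/bash
# Glob: a literal "1", then any characters, then a literal "3".
glob_match() { [[ $1 == 1*3 ]]; }

# Regex: zero or more "1" characters followed by "3", anchored at both ends.
regex_match() { local re='^1*3$'; [[ $1 =~ $re ]]; }

for s in 3 13 113 123; do
    glob_match "$s"  && g=yes || g=no
    regex_match "$s" && r=yes || r=no
    printf '%-4s glob=%-3s regex=%s\n' "$s" "$g" "$r"
done
```

With these inputs, 123 matches only the glob and 3 matches only the regex, while 13 and 113 match both.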
There are two different kinds of pattern matching in the Unix/Linux world:
Regular Expressions: This is the complex pattern matching you find in grep, sed, and many other utilities. Over time, many extensions have been added. POSIX defines two flavors: basic (the older one, called "obsolete" in some manual pages) and extended (called "modern" there).
Globbing: This refers to the file matching you find when you do things such as *.txt to match all text files. These are much simpler and less extensive. There are a few extensions (like ** to match subdirectories in Ant).
When you use [[ ... == ... ]] with an unquoted right-hand side in Bash, you are using glob pattern matching. If you want to use regular expressions, you need to use the =~ operator:
if [[ $foo =~ ^11*3$ ]] # Matches 13, 113, 1113, 11113
then
echo "matching"
fi

Bash regex to match dots and characters

I'm trying to use the =~ operator to execute a regular expression pattern against a curl response string.
The pattern I'm currently using is:
name\":\"(\.[a-zA-Z]+)\"
Currently, however, this pattern only extracts values that contain only the characters a-z and A-Z. I need this pattern to also pick up values that contain a '.' character and a '@' character. How would I do this?
Also, is there any way this pattern can be improved performance wise? It takes quite a long time to execute against the string.
Cheers.
I recently ran into this problem in my script that sets my bash prompt according to my git status, and found that it was because of the placement of other things (namely, a hyphen) I wanted to match inside the expression.
For example, I wanted to match a certain part of a git status output, e.g. the part where it says "Your branch is ahead of 'origin/mybranch' by 1 commit."
This was my original pattern:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)' by ([0-9]+) commit".
One day I created a branch that had a . in it and found that my bash prompt wasn't showing me the right thing, and modified the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-.]+)' by ([0-9]+) commit".
I expected it to work just fine, but instead there was no match at all.
After reading a lot of posts, I realized it was because of the placement of the hyphen (-); I had to put it right after the first square bracket, otherwise it would be interpreted as a range (in this case, it was trying to interpret the range of _-., which is invalid or just somehow makes the whole expression fall over).
It started working when I changed the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit".
So basically what I meant to say is that it could be something else in your expression (like the hyphen in mine) that is interfering with the matching of the dot and the at sign.
Working example script:
#!/bin/bash
regex='"name":"([a-zA-Z.@]+)"'
input='"name":"internal.action.retry.queue@temp"'
if [[ $input =~ $regex ]]
then
echo "$input matches regex $regex"
for (( i=0; i<${#BASH_REMATCH[@]}; i++))
do
echo -e "\tGroup[$i]: ${BASH_REMATCH[$i]}"
done
else
echo "$input does not match regex $regex"
fi
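The hyphen-placement advice and the extra characters can be combined in one small sketch. The function name extract_name and the sample JSON fragment are assumptions for illustration; the hyphen is placed first in the bracket expression so that it is taken literally rather than as a range:

```bash
#!/bin/bash
# Hypothetical helper: print the value of the first "name" field and return 0,
# or return non-zero when no such field matches.
extract_name() {
    # Leading hyphen is literal; dot and at sign need no escaping inside [...].
    local re='"name":"([-a-zA-Z0-9_.@]+)"'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

# Invented sample input containing a dot, a hyphen and an at sign.
extract_name '{"id":7,"name":"internal.action-retry.queue@temp"}'
```

Inside a bracket expression the dot loses its special meaning, so it matches only a literal dot.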
Just add dot ('.') and at sign ('@'):
name\":\"(\.[a-zA-Z.@]+)\"
If you don't need a mandatory dot at the beginning of the value, use this:
\"name\":\"([a-zA-Z.@]+)\"

Remove characters and numbers from a string in perl

I'm trying to rename a bunch of files in my directory and I'm stuck at the regex part of it.
I want to remove certain characters from a filename which appear at the beginning.
Example1: _00-author--book_revision_
Expected: Author - Book (Revision)
So far, I am able to use regex to remove underscores and capitalize the first letter
$newfile =~ s/_/ /g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^[0-9]//g;
$newfile =~ s/^-//g;
$newfile = ucfirst($newfile);
This is not a good method. I need help in removing all characters until you hit the first letter, and when you hit the first '-' I want to add a space before and after '-'.
Also when I hit the second '-' I want to replace it with '('.
Any guidance, tips or even suggestions on taking the right approach is much appreciated.
So do you want to capitalize all the components of the new filename, or just the first one? Your question is inconsistent on that point.
Note that if you are on Linux, you probably have the rename command, which will take a perl expression and use it to rename files for you, something like this:
rename 'my ($a,$b,$r);$_ = "$a - $b ($r)"
if ($a, $b, $r) = map { ucfirst $_ } /^_\d+-(.*?)--(.*?)_(.*?)_$/' _*
Your instructions and your example don't match.
According to your instructions,
s/^[^\pL]+//; # Remove everything until first letter.
s/-/ - /; # Replace first "-" with " - "
s/-[^-]*\K-/(/; # Replace second "-" with "("
According to your example,
s/^[^\pL]+//;
s/--/ - /;
s/_/ (/;
s/_/)/;
s/(?<!\pL)(\pL)/\U$1/g;
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u\1 - \u\2 (\u\3),;
My Perl interpreter (using strict and warnings) says that this is better written as:
$filename =~ s,^_\d+-(.*?)--(.*?)_(.*?)_$,\u$1 - \u$2 (\u$3),;
The first one is probably too sed-like for its taste! (Of course both versions work just the same.)
Explanation (as requested by stema):
$filename =~ s/
^ # matches the start of the line
_\d+- # matches an underscore, one or more digits and a hyphen-minus
(.*?)-- # matches (non-greedily) anything before two consecutive hyphen-minuses
# and captures the entire match (as the first capture group)
(.*?)_ # matches (non-greedily) anything before a single underscore and
# captures the entire match (as the second capture group)
(.*?)_ # does the same as the one before (but captures the match as the
# third capture group obviously)
$ # matches the end of the line
/\u$1 - \u$2 (\u$3)/x;
The \u$1, \u$2 and \u$3 in the replacement simply tell Perl to insert capture groups 1 to 3 with their first character made upper-case. If you wanted to make an entire captured group upper-case, you would have to use \U instead.
The x flag turns on verbose mode, which tells the Perl interpreter that we want to use # comments, so it will ignore these (and any white space in the regular expression, so if you want to match a space you have to use either \s or a backslash-escaped space). Unfortunately I couldn't figure out how to tell Perl to ignore white space in the replacement specification, which is why I've written that on a single line.
(Also note that I've changed my s terminator from , to / - Perl barked at me if I used the , with verbose mode turned on ... not exactly sure why.)
If they all follow that format then try:
my ($author, $book, $revision) = $newfile =~ /-(.*?)--(.*?)_(.*?)_/;
print ucfirst($author ) . " - $book ($revision)\n";
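For comparison, the same capture-and-capitalize idea can be sketched in bash itself. This assumes bash 4 or later (for the ${var^} expansion) and that every name follows the _NN-author--book_revision_ shape from the question; the function name pretty_name is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: turn "_00-author--book_revision_" into
# "Author - Book (Revision)"; returns 1 when the name has another shape.
pretty_name() {
    # Negated classes keep each capture from running past its delimiter.
    local re='^_[0-9]+-([^-]+)--([^_]+)_([^_]+)_$'
    if [[ $1 =~ $re ]]; then
        local author=${BASH_REMATCH[1]} book=${BASH_REMATCH[2]} rev=${BASH_REMATCH[3]}
        # ${var^} upper-cases the first character (bash 4+).
        printf '%s - %s (%s)\n' "${author^}" "${book^}" "${rev^}"
    else
        return 1
    fi
}

pretty_name "_00-author--book_revision_"
```

This sketch capitalizes all three components, matching the expected output in the question; drop the ^ on book and rev to capitalize only the author.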

Regular Expression to match an ssh connection string

I'm trying in vain to write a regular expression to match valid ssh connection strings.
I really only need to recognise strings of the format:
user@hostname:/some/path
but it would be nice to also match an implicit home directory:
user@hostname:
I've so far come up with this regex:
/^[:alnum:]+\@\:(\/[:alnum:]+)*$/
which doesn't work as intended.
Any suggestions welcome before my brain explodes and I start speaking in line noise :)
Your supplied regex doesn't have the hostname section. Try:
/^[:alnum:]+\@[:alnum:\.]\:(\/[:alnum:]+)*$/
or
/^[A-Za-z][A-Za-z0-9_]*\@[A-Za-z][A-Za-z0-9_\.]*\:(\/[A-Za-z][A-Za-z0-9_]*)*$/
since I don't trust alnum without double brackets.
Also, :alnum: may not give you the required range for your sections. You can have "." characters in your host name and may also need to allow for "_" characters. And it's rare I've seen usernames or hostnames start with a non-alphabetic.
Just as a side note, I try to avoid the enhanced regexes since they don't run on all regex engines (I've been using UNIX for a long time). Unfortunately that makes my regexes ungainly (see above) and not overly internationalizable. Apologies for that.
Character classes such as [:alnum:] only work inside a bracket expression, i.e. [[:alnum:]]. As written, you are matching any one of colon, 'a', 'l', 'm', 'n', or 'u'.
And like Pax said, you missed the hostname. But the bracket expressions are still wrong.
After some more revisions I'm using:
/^\w+\@(\w|\.)+\:(\/\w+)*$/
which seems to match my test cases and accounts for hostnames, FQDNs and IP addresses in the host portion. It also makes the path after the colon optional to allow for implicit home directories.
Thanks for the help so far - I didn't spot the lack of hostname until it was pointed out.
What sgm is getting at, is you are doing
/^[:alnum:]+\@\:(\/[:alnum:]+)*$/
Where you should be doing
/^[[:alnum:]]+\@\:(\/[[:alnum:]]+)*$/
Pax's answer is also practical, but won't work without the proper double bracketing.
my $at = q{@};
my @res = (
qr/^[:alnum:]+${at}[:alnum:]+:(\/[:alnum:]+)*$/,
qr/^[[:alnum:]]+${at}[[:alnum:]]+:(\/[[:alnum:]]+)*$/,
qr/^[a-z][[:alnum:]_]*${at}[a-z][[:alnum:]_.]*:(\/[^\/]*)*$/i,
);
my @u = qw{
user@hostname:/some/path
bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_
9foo@9foo.org:/9foo/9foo
baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually)
};
for my $str (@u) {
for my $re (@res) {
if ( $str =~ $re ) {
print "$str =~ $re\n";
}
else {
print "NOT $str =~ $re\n";
}
}
}
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:] <-- HERE +@[:alnum:]+:(/[:alnum:]+)*$/ at /tmp/egl.pl line 27.
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:]+@[:alnum:] <-- HERE +:(/[:alnum:]+)*$/ at /tmp/egl.pl line 27.
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/^[:alnum:]+@[:alnum:]+:(/[:alnum:] <-- HERE +)*$/ at /tmp/egl.pl line 27.
NOT user@hostname:/some/path =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
user@hostname:/some/path =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
user@hostname:/some/path =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
bob_foo@bobs.fish.stores.com:/foo/bar/baz/quux_ =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
NOT 9foo@9foo.org:/9foo/9foo =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
NOT baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?-xism:^[:alnum:]+@[:alnum:]+:(/[:alnum:]+)*$)
NOT baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?-xism:^[[:alnum:]]+@[[:alnum:]]+:(/[[:alnum:]]+)*$)
baz@foo.org:/9foo/_#_numerics_are_fine_in_URI_and_so_is_anything_else_(virtually) =~ (?i-xsm:^[a-z][[:alnum:]_]*@[a-z][[:alnum:]_.]*:(/[^/]*)*$)
OK, further revision to:
/^\w+\@(\w|\.)+\:(\/(\w|.)+)*$/
to account for the . that might be present in a file name.
Final go:
/^\w+\@(\w|\.)+\:(\/(\w|.)+\/?)*$/
This also allows for an optional trailing slash.
These didn't quite do it for what I needed, as some were broken or not liberal enough. For example, if you have a folder named stackoverflow.com, a pattern that does not allow dots in the path would reject it. Implementations are inconsistent about what \w means, so I wouldn't recommend using it, especially since we know darn well what characters we need.
The following is a bash example to construct the regex:
#should match 99.9% of SSH users
user_regex='[a-zA-Z][a-zA-Z0-9_]+'
#match domains
host_regex='([a-zA-Z][a-zA-Z0-9\-]*\.)*[a-zA-Z][a-zA-Z0-9\-]*'
#match paths starting with / and empty strings (which is valid for our use!)
path_regex='(\/[A-Za-z0-9_\-\.]+)*\/?'
#the complete regex
master_regex="^$user_regex\@$host_regex\:$path_regex\$"
This affords the modularity to check your parts later, if need be. To enable IP addresses in the match add 0-9 to the two first letter match portions of the host regex.
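One caveat when these pieces are used with bash's own =~ operator: in a POSIX bracket expression a backslash is an ordinary character, so \- does not escape the hyphen. A sketch of a variant that places the hyphen last in each bracket expression instead; the function name is_ssh_target and the test strings are assumptions for illustration:

```bash
#!/bin/bash
# Same building blocks as above, rewritten for POSIX bracket expressions:
# the hyphen sits last in each class and nothing inside [...] is escaped.
user_regex='[a-zA-Z][a-zA-Z0-9_]+'
host_regex='([a-zA-Z][a-zA-Z0-9-]*\.)*[a-zA-Z][a-zA-Z0-9-]*'
path_regex='(/[A-Za-z0-9_.-]+)*/?'
master_regex="^${user_regex}@${host_regex}:${path_regex}\$"

# Hypothetical helper: succeeds when the argument looks like user@host:path.
is_ssh_target() { [[ $1 =~ $master_regex ]]; }

# Invented test strings, including one with an empty host.
for s in "user@hostname:/some/path" \
         "user@hostname:" \
         "bob@my-host.example.com:/var/www/stackoverflow.com/" \
         "user@:/some/path"; do
    if is_ssh_target "$s"; then
        echo "valid:   $s"
    else
        echo "invalid: $s"
    fi
done
```

Like the original, user_regex requires at least two characters, so single-letter user names are rejected.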