Bash regex to match dots and characters - regex

I'm trying to use the =~ operator to match a regular expression pattern against a curl response string.
The pattern I'm currently using is:
name\":\"(\.[a-zA-Z]+)\"
Currently, however, this pattern only extracts values that contain only the characters a-z and A-Z. I need this pattern to also pick up values that contain a '.' character and a '#' character. How would I do this?
Also, is there any way this pattern can be improved performance-wise? It takes quite a long time to execute against the string.
Cheers.

I recently ran into this problem in my script that sets my bash prompt according to my git status, and found that it was because of the placement of other things (namely, a hyphen) I wanted to match inside the expression.
For example, I wanted to match a certain part of a git status output, e.g. the part where it says "Your branch is ahead of 'origin/mybranch' by 1 commit."
This was my original pattern:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-]+)' by ([0-9]+) commit".
One day I created a branch that had a . in it and found that my bash prompt wasn't showing me the right thing, and modified the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([a-zA-Z0-9_-.]+)' by ([0-9]+) commit".
I expected it to work just fine, but instead there was no match at all.
After reading a lot of posts, I realized it was because of the placement of the hyphen (-); I had to put it right after the first square bracket, otherwise it would be interpreted as a range operator. In this case it was read as the range _-., which is invalid because . comes before _ in ASCII, so the whole expression is rejected.
It started working when I changed the expression to the following:
"Your branch is (ahead of|behind) '([a-zA-Z0-9_-]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit".
So basically what I meant to say is that something else in your expression (like the hyphen in mine) could be interfering with the matching of the dot and the hash sign.
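To illustrate the hyphen placement described above, here is a minimal sketch. The function name branch_from_status and the sample status line are made up for the example; the pattern is the corrected one from this answer, with the hyphen moved to the front of both bracket expressions.

```shell
#!/bin/bash
# Hypothetical helper: prints the branch name found in a git status line.
# The hyphen comes first inside each bracket expression, so it is taken
# literally instead of being read as a range operator.
branch_from_status() {
    local re="Your branch is (ahead of|behind) '([-a-zA-Z0-9_]+)/([-a-zA-Z0-9_.]+)' by ([0-9]+) commit"
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[3]}"   # third group is the branch
    else
        return 1
    fi
}

branch_from_status "Your branch is ahead of 'origin/my.branch-1' by 1 commit."
```

Keeping the pattern in a variable and expanding it unquoted on the right of =~ avoids quoting problems with the single quotes and parentheses.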

Working example script:
#!/bin/bash
regex='"name":"([a-zA-Z.#]+)"'
input='"name":"internal.action.retry.queue#temp"'
if [[ $input =~ $regex ]]
then
    echo "$input matches regex $regex"
    # BASH_REMATCH[0] is the whole match; [1] onward are the capture groups
    for (( i=0; i<${#BASH_REMATCH[@]}; i++ ))
    do
        echo -e "\tGroup[$i]: ${BASH_REMATCH[$i]}"
    done
else
    echo "$input does not match regex $regex"
fi

Just add a dot ('.') and a hash sign ('#') to the character class:
name\":\"(\.[a-zA-Z.#]+)\"
If you don't need a mandatory dot at the beginning of the value, use this:
\"name\":\"([a-zA-Z.#]+)\"


Perl extract group with lookbehind from different line

I've searched the web and read several answers on Stack Exchange, but I still cannot grasp why my command does not extract anything. In the end I want to extract a group using a lookbehind from a different line, e.g. from
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
by finding the needed key between the Type tags and extracting the first Code above it, so in the case above I would get test2. But I cannot manage to extract anything across multiple lines, i.e.
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Type>(.*)<Type/'<test.txt prints nothing.
I've played with removing the -l and -n switches, adding and removing the non-greedy ?, and trying just . in place of [\s\S\n].
perl -lne 'print $1,"_",$2 if /Code>(.*)<Code[\s\S\n]*?Code2>(.*)<Code2/'<test.txt
gives TEST1_best, so same-line extraction works.
What am I missing? Can what I want be done in a one-line command?
With -n, Perl reads the input one line at a time, so a pattern can never match across lines; the -0777 switch used below slurps the whole file into a single string instead. The following command collects all values contained in a Code>...<Code pattern if they are followed by a Type>...<Type pattern (with potentially other patterns in between, but no other occurrences of Code>...<Code in between):
perl -lne 's/^.*?(?=Code>)//s; for (split /Code>/) { print qq($1:$2\n) if /(.*?)<Code.*?Type>(.*?)<Type/s }' -0777 <test.txt
If e.g. test.txt contains the following lines,
Code>test4<Code Type>false<Type
Code>test3<Code
Type>true<Type
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
then the command will collect the following value pairs:
test4:false
test3:true
test2:false
Edited on 04/08/2019, 17:38 CEST I edited the command to remove the "header part" of the file (the part before the first occurrence of Code>), as it might - by some error of the file's editor - contain a closing tag <Code which had not been opened with Code> but instead with a typo like e.g. Cde>. My assumption was that the complete file was "syntactically correct" in the sense that it consists of elements of type /(\w+)>.*?<\1/, separated by whitespace (including newlines). For files which do not conform to this syntax, the statement was not waterproof.
Another way to do it, using progressive matching and embedded code
perl -lne 'while (/\b(?:Code>(.*?)<Code(?{$c=$1})|Type>(.*?)<Type(?{print qq($c:$2\n) if defined $c;undef $c}))\b/g){}' -0777 <test.txt
Explanations:
Basically, the expression finds occurrences of Code>(.*?)<Code or Type>(.*?)<Type. This gives the basic form of an alternation in a non-capturing group: (?:Code>(.*?)<Code|Type>(.*?)<Type).
The word boundary assertions \b around the group ensure that the keywords Code and Type are matched, but not e.g. Code2 or TType.
The modifier g ensures progressive application of the regular expression on the string. Since I want to extract the result inside of the expression itself, I place the regex in an empty loop, i.e. while (/.../g) {}.
You suppose a grammar rule Code ⟶ Type, i.e. you look for occurrences of a Type token following a Code token. For this, a Code token is memorized in a variable $c with the code expression (?{$c=$1}). If a Type token is found, it is considered a match only if a Code token has been found before it, indicated by the fact that the variable $c is defined. In any case, once a Type token has been found, the variable $c is undefined again to prepare it for the next search. This gives the code evaluation (?{print qq($c:$2\n) if defined $c;undef $c}) in the Type branch of the regular expression.
Note that the captures of the Code>(.*?)<Code and Type>(.*?)<Type tokens may be the empty string. This is why I am working with undef $c and if defined $c instead of the simpler $c='' and if $c.
If your data is in a file named d, with GNU sed:
sed -Ez 's/.*Code>(\w+)<Code\sType>\w*<Type.*/\1/' d
With Perl:
perl -ne 'BEGIN{undef $/} /Code>(\w+)<Code\nType>\w*<Type/; print $1' d
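For comparison, the same "remember the last Code, print it when a Type shows up" idea can be sketched in plain bash. This is only an illustration under assumptions: the function name pair_code_type is invented, tag values are assumed not to contain <, and when both tags share a line the Code is assumed to come before the Type.

```shell
#!/bin/bash
# Sketch: print "code:type" for each Type> value, paired with the most
# recent Code> value seen before it.
pair_code_type() {
    local line code='' have_code=0
    # ERE has no \b, so a non-word character (or line start/end) is
    # required around the tag names; this keeps Code2 from matching.
    local code_re='(^|[^[:alnum:]_])Code>([^<]*)<Code([^[:alnum:]_]|$)'
    local type_re='(^|[^[:alnum:]_])Type>([^<]*)<Type([^[:alnum:]_]|$)'
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $code_re ]]; then
            code=${BASH_REMATCH[2]}; have_code=1
        fi
        if [[ $line =~ $type_re ]] && (( have_code )); then
            printf '%s:%s\n' "$code" "${BASH_REMATCH[2]}"
            code=''; have_code=0   # a Code value is used at most once
        fi
    done
}

pair_code_type <<'EOF'
Code>TEST1<Code Code2>best<Code2
Code>test2<Code
Type>false<Type
EOF
```

For the three sample lines from the question this prints test2:false. The have_code flag plays the role of the defined $c check in the Perl version, so an empty Code value is still paired correctly.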

Grep a filename with a specific underscore pattern

I am trying to match a filename pattern using egrep and a regex, without success.
What I need is to find files that follow a naming convention such as:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname and city can vary in length.
I am trying the command below, but it fails to achieve what I want:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to match the standard xx_ at the beginning, then any code that has at least 3 letters, after which there must be another underscore, and so on.
Could anybody help?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
# an unmatched glob stays literal, so check that the first entry really exists
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
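The same extglob pattern can also be tried against plain strings rather than files on disk, which makes it easy to experiment with. A minimal sketch, assuming a made-up helper name matches_convention and invented sample names:

```shell
#!/bin/bash
shopt -s extglob   # enable +( ) and @( ) pattern syntax

# Hypothetical helper: succeeds if the name follows the convention.
matches_convention() {
    # Same pattern as above, kept in a variable for readability.
    local pat='xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf)'
    [[ $1 == $pat ]]   # unquoted right-hand side is matched as a pattern
}

for name in xx_abc_smith_john_paris.doc xx_ab_smith_john_paris.doc xx_abc_smith_john_paris.txt; do
    if matches_convention "$name"; then
        echo "$name: matches"
    else
        echo "$name: no match"
    fi
done
```

Only the first name matches: the second has a two-letter code and the third has an extension outside the allowed list.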
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
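A quick way to check the final pattern is bash's own =~ operator, which uses POSIX extended regular expressions like grep -E. Lazy quantifiers such as {3,}? are not part of POSIX ERE, and greediness does not change whether a name matches, so this sketch drops the ? markers and anchors the pattern with ^ and $ instead. The function name is_valid_name and the sample filenames are assumptions for the example.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the filename follows the convention.
is_valid_name() {
    # Pattern from the answer, without lazy "?" markers, anchored at both ends.
    local re='^xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)$'
    [[ $1 =~ $re ]]
}

for f in xx_abc_smith_john_paris.docx xx_ab_smith_john_paris.doc notes.pdf; do
    if is_valid_name "$f"; then
        echo "$f: ok"
    else
        echo "$f: rejected"
    fi
done
```

The same anchored pattern should also work with ls -1 | grep -E, since both tools accept POSIX ERE syntax.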

Linux Bash Script Regex malfunction

I would like to write a bash script that decides whether given strings satisfy a set of rules.
The rules are:
The string's first 3 characters must be "le-"
Between hyphens there can be any number of consonants in any arrangement and just one "e"; no other vowel is allowed.
Between hyphens there must be something
The string must not end with a hyphen
I made this script:
#!/bin/bash
# Testing regex
while read -r line; do
    if [[ $line =~ ^le((-[^aeiou\W]*e+[^aeiou\W]*)+)$ ]]
    then
        printf '"%s"\t\t\t-> True\n' "$line"
    else
        printf '"%s"\t\t\t-> False\n' "$line"
    fi
done < <(cat "$@")
It does everything fine, except one thing:
It says true no matter how many hyphens are next to each other.
For example:
It says true for this string "le--le"
I tried this regular expression on regex-testing websites and it worked there without this malfunction.
All I can think of is that there must be some difference between the web page and bash on Linux. (All I can see on the web page is that it runs PHP.)
Do you have any idea how I could make it work?
Thank you for your answers!
sweaver2112 rightly points out that the \W is causing you problems, but fails to provide a working example of a bash test regex that does what you ask (at least, i couldn't get it to work).
this seems to do it (adapting Laurel's consonant regex):
[[ "$line" =~ ^le(-[b-df-hj-np-tv-z]*e[b-df-hj-np-tv-z]*)+$ ]]
it matches (e.g.):
le-e
le-e-le
le-e-e-e-e-e
and more generally:
^le(-[[:consonant:]]*e[[:consonant:]]*)+$ (where [[:consonant:]] is only shorthand for the consonant class above, not a real POSIX class)
and doesn't match (e.g.):
le-
le--le
le-lea-le
also, you can write it more cleanly this way:
c='[b-df-hj-np-tv-z]'
[[ "$line" =~ ^le(-$c*e$c*)+$ ]]
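The pattern above can be exercised with a small test loop. This is a minimal sketch; the function name classify is invented for the example, and the sample strings are the ones listed in this answer.

```shell
#!/bin/bash
# Hypothetical helper: prints True or False for one candidate string.
classify() {
    local c='[b-df-hj-np-tv-z]'   # consonants only, no vowels, no hyphen
    if [[ $1 =~ ^le(-$c*e$c*)+$ ]]; then
        echo True
    else
        echo False
    fi
}

for s in le-e le-e-le le-e-e-e-e-e le- le--le le-lea-le; do
    printf '%-14s -> %s\n' "$s" "$(classify "$s")"
done
```

Because the class lists consonants explicitly, a hyphen can never be swallowed by it, so le--le is reported as False.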
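There's at least one problem with your regex: [^aeiou\W]. In PCRE (which the PHP-based site uses), \W inside a character class means "non-word character", so the negated class matches word characters other than the lowercase vowels, and a hyphen is excluded. Bash's =~ uses POSIX extended regular expressions, where a backslash inside a bracket expression is a literal character, so there the class matches anything except a, e, i, o, u, \ and W, hyphens included. That is why "le--le" passes. We're better off just listing all the consonants (and for your case, we'll add 'e' and '-' to the set as well).
So try this one (edit: using @Laurel's more concise char class); note that it relies on lookaheads, so it needs a PCRE engine such as grep -P rather than bash's =~:
(?=^le-)(?!.*--)(?!.*-[^-]*e[^-]*e[^-]*-)[b-hj-np-tv-z-]*[^-]$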
(?=^le-) starts with 'le-'
(?!.*--) no double dashes allowed
(?!.*-[^-]*e[^-]*e[^-]*-) do NOT see two e's between dashes
[b-hj-np-tv-z-]* - consume consonants, e, and dashes (same as [bcdfghjklmnpqrstvwxyze-])
[^-]$ last character must be non-dash

wildcard in regular expression in shell script

I have a question regarding wildcards in shell script regular expressions
vi a.sh
if [[ $1 == 1*3 ]]; then
echo "matching"
else
echo "not matching"
fi
If I run sh a.sh 123 the output is: "matching".
But according to http://www.tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm:
* (asterisk)
the proceeding item is to be matched zero or more times. ie. n* will
match n, nn, nnnn, nnnnnnn but not na or any other character.
it should match only 3, 13, 113, 111..3.
But why is it matching 123?
In the documentation you linked, you are quoting the section that talks about "Regular expressions".
However, what is important here is what is within "Standard Wildcards (globbing patterns)":
Standard Wildcards (globbing patterns)
this can represent any number of characters (including zero, in other words, zero or more characters). If you specified a "cd*" it
would use "cda", "cdrom", "cdrecord" and anything that starts with
“cd” also including “cd” itself. "m*l" could by mill, mull, ml, and
anything that starts with an m and ends with an l.
That is, it does not refer to the previous character but to a set of characters (zero or more). It is equivalent to .* in normal regular expressions.
So the expression 1*3 matches anything starting with 1 + zero or more characters + 3 at the end.
There are two different kinds of pattern matching in the Unix/Linux world:
Regular Expressions: This is the complex pattern matching you find in grep, sed, and many other utilities. Over time, many extensions have appeared. These extensions are referred to in POSIX as original (now considered obsolete), modern, and extended.
Globbing: This refers to the file matching you find when you do things such as *.txt to match all text files. These are much simpler and less extensive. There are a few extensions (like ** to match subdirectories in Ant).
When you use [[ ... == ... ]] with an unquoted right-hand side in Bash, you are using glob-style pattern matching. If you want to use regular expressions, you need to use the =~ operator:
if [[ $foo =~ ^11*3$ ]] # Matches 13, 113, 1113, 11113
then
    echo "matching"
fi
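The difference between the two operators can be seen side by side with a short sketch. The helper names glob_match and regex_match are invented for the example; the patterns are 1*3 as a glob and ^1*3$ as a regular expression.

```shell
#!/bin/bash
# Hypothetical helpers contrasting the two kinds of matching.
glob_match() {
    [[ $1 == 1*3 ]]        # glob: * stands for any run of characters
}
regex_match() {
    local re='^1*3$'       # regex: * repeats the preceding 1 zero or more times
    [[ $1 =~ $re ]]
}

for s in 3 13 113 123; do
    glob_match "$s"  && g=yes || g=no
    regex_match "$s" && r=yes || r=no
    printf '%-4s glob=%-3s regex=%s\n' "$s" "$g" "$r"
done
```

The string 123 matches only as a glob, while 3 matches only as a regular expression, because the glob still requires a leading 1.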

Regex matching filenames

I know this will sound silly to some of you, but I am not good with regular expressions. I came across the following expressions in a function someone else wrote and can't figure out what they were doing.
REGEX 1
[ ! -d ${2%/*}/ ]
REGEX 2
cmp -s $2 ${2##*/}
As you can guess, these expressions are used in a script that updates files and moves them around. I was wondering about the meaning of
${2%/*}/
and
${2##*/}
Let's take an example to understand better:
s='abc/def/foo'
echo "${s%/*}/"
abc/def/
echo "${s##*/}"
foo
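The first expression discards the text after the last / in the input (the trailing / is then appended again).
The second expression discards all the text up to and including the last / in the input.
These are shell parameter expansions that use glob patterns, not regular expressions. You can see more details in man bash, under Parameter Expansion:
##*/ removes the longest prefix matching */, i.e. everything up to and including the last /.
%/* removes the shortest suffix matching /*, i.e. the last / and everything after it.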
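The four related operators (#, ##, %, %%) can be compared on the same sample path. A minimal sketch; the helper names dir_part and file_part are made up, and the expansions inside them are the ones from the question.

```shell
#!/bin/bash
# Hypothetical helpers wrapping the two expansions from the question.
dir_part()  { printf '%s\n' "${1%/*}/"; }   # drop shortest suffix matching /*, re-add /
file_part() { printf '%s\n' "${1##*/}"; }   # drop longest prefix matching */

p='abc/def/foo'
dir_part "$p"       # abc/def/
file_part "$p"      # foo
echo "${p%%/*}"     # abc      (%% removes the longest matching suffix)
echo "${p#*/}"      # def/foo  (# removes the shortest matching prefix)
```

In the original script, [ ! -d ${2%/*}/ ] therefore tests whether the directory part of the second argument is missing, and cmp -s $2 ${2##*/} compares that file with a same-named file in the current directory.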