Regex doesn't match with the lines in txt file - regex

I'm reading lines from a text file and checking whether each one matches a regex I've created.
The script always reports that the line didn't match, even though a regex tool shows that the pattern matches my sample lines.
while read line
do
name=$line
BRANCH_REGEX="\d{10}\-[^_]*\_\d{13}"
if [[ $name =~ $BRANCH_REGEX ]];
then
echo "BRANCH '$name' matches BRANCH_REGEX '$BRANCH_REGEX'"
else
echo "BRANCH '$name' DOES NOT MATCH BRANCH_REGEX '$BRANCH_REGEX'"
fi
done < names.txt
names.txt contains lines such as:
9000999484-suchocka_1416578464908
9000989944-schubertk_1416582641605
9001026342-extbeerfelde_1416586904787
9000687045-sturmjo_1416573131629
9001059401-extburghartswieser_1416405627982
9000806302-PDPUPDATE_1357830207068
9000658783-PDPUPDATE_1360445087963

BRANCH_REGEX="/\d{10}\-[^_]*\_\d{13}"
↑
Remove the leading /, none of your lines begin with it.
Also note that _ doesn't need to be escaped, you can write _ instead of \_.

Change your regex to:
BRANCH_REGEX="[0-9]{10}-[^_]*_[0-9]{13}"
Or else:
BRANCH_REGEX="[[:digit:]]{10}-[^_]*_[[:digit:]]{13}"
Bash's =~ operator uses POSIX extended regular expressions, which don't support the \d shorthand. There is also no need to escape the hyphen.
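A minimal sketch of the corrected check, run against two of the sample lines from the question plus one made-up non-matching name. The ^ and $ anchors and the helper name matches_branch are additions for illustration, not part of the original answers.

```bash
#!/usr/bin/env bash
# [[ =~ ]] uses POSIX ERE, which has no \d; [0-9] or [[:digit:]] works instead.
# The ^ and $ anchors are an assumption: they require the whole line to match.
branch_regex='^[0-9]{10}-[^_]*_[0-9]{13}$'

# Hypothetical helper: succeeds when its argument looks like a branch name.
matches_branch() {
  [[ $1 =~ $branch_regex ]]
}

for name in 9000999484-suchocka_1416578464908 9000806302-PDPUPDATE_1357830207068 not-a-branch; do
  if matches_branch "$name"; then
    echo "match: $name"
  else
    echo "no match: $name"
  fi
done
```

Keeping the pattern in a single-quoted variable and expanding it unquoted on the right-hand side of =~ avoids the quoting pitfalls of writing the regex inline.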

Related

Regex for matching directory path depth

I'm trying to regex match a specific folder depth of varying path strings using bash scripts.
I want to match two levels down from packages eg. /packages/[any-folder-name]/[any-folder-name]/.
So for example for /packages/frontend/react-app/src/index.ts I want to match /packages/frontend/react-app/ and store it in an array
array=()
string="/packages/frontend/react-app/src/index.ts"
[[ $string =~ packages/.*/.*/ ]] && array+=(${BASH_REMATCH[0]})
almost works, but it returns /packages/frontend/react-app/src/
I've been going round in circles on this for a few hours now.
Probably this:
#!/usr/bin/env bash
array=()
string="/packages/frontend/react-app/src/index.ts"
[[ $string =~ /packages/([^/]+/){2} ]] && array+=("${BASH_REMATCH[0]}")
Explanation:
[^/]+ matches any non-empty string that does not contain a /, and ([^/]+/){2} requires exactly two such segments, each followed by a /.
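As a rough illustration of the same pattern applied to several paths, here is a sketch. The second and third paths are made up for the example, and the leading / in the pattern is an assumption so that the stored match starts with /packages.

```bash
#!/usr/bin/env bash
array=()
# Only the first path comes from the question; the others are hypothetical.
paths=(
  "/packages/frontend/react-app/src/index.ts"
  "/packages/backend/api/main.go"
  "/docs/readme.md"
)

# ([^/]+/){2}: a non-empty path segment followed by "/", exactly twice.
re='/packages/([^/]+/){2}'

for p in "${paths[@]}"; do
  if [[ $p =~ $re ]]; then
    # BASH_REMATCH[0] holds the whole matched text.
    array+=("${BASH_REMATCH[0]}")
  fi
done

printf '%s\n' "${array[@]}"
```

The path under /docs does not contain /packages/ and is skipped, so the array ends up with two entries.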
$ echo '/packages/frontend/react-app/src/index.ts' | sed 's|^\(/packages/[^/]*/[^/]*/\).*$|\1|'
/packages/frontend/react-app/
Explanation:
use sed regex:
|^...$| - match the whole string, anchor at beginning and end
^\(...\) - capture stuff inside parenthesis
/packages/ - expect this text
[^/]*/ - followed by anything non-slash, followed by a slash
[^/]*/ - rinse and repeat
.* - discard anything after the captured text
|\1| - replace matched text with the captured text
Looks like a glob expression would be enough.
# enable nullglob to get an empty array if there is no match
shopt -s nullglob
array=(/packages/*/*/)
printf '%s\n' "${array[@]}"
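Note that a glob matches directories that actually exist on disk, not an arbitrary string. A sketch that makes this visible, assuming a throwaway directory tree created with mktemp (the directory names are made up):

```bash
#!/usr/bin/env bash
# Build a temporary tree so the glob has something to match.
tmp=$(mktemp -d)
mkdir -p "$tmp/packages/frontend/react-app/src" "$tmp/packages/backend/api"
touch "$tmp/packages/frontend/react-app/src/index.ts"

# nullglob: an unmatched pattern expands to nothing instead of itself.
shopt -s nullglob

# The trailing slash restricts matches to directories, two levels below packages.
array=("$tmp"/packages/*/*/)

printf '%s\n' "${array[@]}"

# Clean up the temporary tree; the array keeps the path strings.
rm -rf "$tmp"
```

If the paths only exist as strings (for example, output from another tool), the regex or sed approaches are the ones that apply.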

Regex with characters, dots and numbers in Bash

I am trying to find a regex that it will check the following pattern:
>chr28.1.1.24407.24473
So, this pattern consists of 5 parts separated by dots. The first part is the string ">chr" followed by a number (one or more digits), and all the other parts should be numbers with one or more digits.
This regex should be part of a small script that first finds these lines and then validates them.
HCE=$1
hceregex='^>chr[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+$'
grep ">" $HCE > HCE.headers
file="HCE.headers"
lines=`cat $file`
for line in $lines
do
if [[ ! $line =~ $hceregex ]]
then
echo "Invalid fasta header in HCE sequence. Check the G-Anchor manual for the headers format"
exit 1
else
echo "Brilliant!!!!"
fi
done
My problem is that the regex without the escape character for the dots returns all the headers. By using the escape character it excludes everything, even the right ones.
What am I doing wrong?
Many thanks in advance.
First problem is using [1-9] which will match only digits 1-9. You should be using [0-9] to match any digit.
Second problem is use of unnecessary cat and unquoted variables. You should be using this code:
hceregex='^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
while read -r line; do
if [[ ! $line =~ $hceregex ]]; then
echo 'Invalid fasta header in HCE sequence'
else
echo 'Brilliant!!!!'
fi
done < HCE.headers
As a further optimization, you can shorten your regex to this:
hceregex='^>chr[0-9]+(\.[0-9]+){4}$'
Your sample text contains a zero (in 24407), but [1-9]+ does not match the digit 0, so you have to update the regex to:
^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
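A sketch of how the shortened regex could be wrapped in a reusable check. The function name check_headers and the second sample header are hypothetical; only the first header comes from the question.

```bash
#!/usr/bin/env bash
hceregex='^>chr[0-9]+(\.[0-9]+){4}$'

# Hypothetical helper: reads headers from stdin, reports each invalid one on
# stderr, and returns non-zero if at least one invalid header was seen.
check_headers() {
  local line status=0
  while IFS= read -r line; do
    if [[ ! $line =~ $hceregex ]]; then
      echo "Invalid fasta header: $line" >&2
      status=1
    fi
  done
  return "$status"
}

# The second header is made up and has only four parts, so it is rejected.
if printf '%s\n' '>chr28.1.1.24407.24473' '>chr3.10.20.300' | check_headers; then
  echo "all headers valid"
else
  echo "found invalid headers"
fi
```

Reading with while IFS= read -r keeps each line intact, which avoids the word splitting that for line in $lines performs.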

Unexpected behavior in a regular expression in bash

I created this regular expression and tested it out successfully
https://regex101.com/r/a7qvuw/1
However the regular expression behaves differently in this bash code that I wrote
# Splitting by semicolon
IFS=';' read -ra statements <<< $contents
# Splitting by the = sign.
regex="\s*(.*?)\s*=\s*(.*)\b"
for i in "${statements[@]}"; do
if [[ $i =~ $regex ]]; then
key=${BASH_REMATCH[1]}
params=${BASH_REMATCH[2]}
echo "KEY: $key| PARAMS: $params"
fi
done
The variable $contents has the text as is used in the link. The problem is that the $key has a space at its end, while the regular expression I tried matches the words without the space.
I get output like this:
KEY: vclock_spec | PARAMS: clk_i 1 1
As you can see there is a space between vclock_spec and the | which should not be there. What am I doing wrong?
As @Cyrus mentioned, lazy quantifiers are not supported in Bash regex. They act as greedy ones.
You may fix your pattern to work in Bash using
regex="\s*([^=]*\S)\s*=\s*(.*)\b"
           ^^^^^^^
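The [^=]* matches zero or more characters other than = and \S matches any non-whitespace character, so the captured key can no longer end in whitespace. Keep in mind that \s, \S and \b are extensions of the system's regex library rather than POSIX ERE; the portable spellings are [[:space:]] and [^[:space:]]. To also keep = out of the last character of the key, use [^[:space:]=] rather than [^\s=]: inside a bracket expression the backslash is literal, so [^\s=] means "any character except \, s and =".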
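A portable sketch of the same idea using POSIX character classes only. The sample contents is made up (the real text is only available behind the regex101 link), and the extra nested group is an implementation choice of this sketch, so the value ends up in group 3 rather than group 2.

```bash
#!/usr/bin/env bash
# Made-up input standing in for the real $contents.
contents='vclock_spec = clk_i 1 1; reset_spec=rst_i 0'

# Group 1: the key; it must start and end with a character that is neither
#          whitespace nor "=" (a single-character key is allowed).
# Group 2: helper group nested inside the key (unused).
# Group 3: the value, starting at the first non-whitespace char after "=".
regex='[[:space:]]*([^[:space:]=]([^=]*[^[:space:]=])?)[[:space:]]*=[[:space:]]*([^[:space:]].*)'

# IFS is changed only for this read, not globally.
IFS=';' read -ra statements <<< "$contents"

keys=()
for stmt in "${statements[@]}"; do
  if [[ $stmt =~ $regex ]]; then
    keys+=("${BASH_REMATCH[1]}")
    echo "KEY: ${BASH_REMATCH[1]}| PARAMS: ${BASH_REMATCH[3]}"
  fi
done
```

Unlike the \b version, this sketch does not trim trailing whitespace from the value.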

Getting exact match for pattern in bash

I am having trouble getting bash to match a pattern exactly. For example, I only want to match the letters before my file extension, as in "test.bam", but when the name includes a number, as in "t1st.bam", I get this output: "st".
hello="t1st.bam"
re="([a-zA-Z]+)\.bam"
if [[ $hello =~ $re ]]; then
result=${BASH_REMATCH[1]}
else
echo "unable to parse string"
fi
echo "$result"
What I would like it to do is not match the pattern at all if a non-alpha character is present, and go into the 'else' block instead. Thanks
If you want the match to start at the beginning of the string, add the ^ anchor:
re='^([a-zA-Z]+)\.bam'
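A sketch of the anchored pattern in use, assuming the intended character class is [a-zA-Z]. The trailing $ and the helper name bam_stem are additions for illustration: the $ also rejects names with extra text after ".bam".

```bash
#!/usr/bin/env bash
# ^ forces the letters to start at the beginning of the string, so a digit
# anywhere before ".bam" makes the whole match fail.
re='^([a-zA-Z]+)\.bam$'

# Hypothetical helper: prints the part before ".bam" and succeeds only when
# that part consists of letters alone.
bam_stem() {
  if [[ $1 =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "unable to parse string: $1" >&2
    return 1
  fi
}

bam_stem "test.bam"
bam_stem "t1st.bam" || echo "rejected t1st.bam"
```

Without the ^ anchor, the regex engine is free to start matching at "st", which is why the original code printed a partial name.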

Regex match string UNTIL string in a comma separated line

All words start with "Passed", but I only want to match those that also end with "Unique".
Input:
PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,PassedFailedDayUnique,Passed1Week,Passed1WeekUnique
Desired output:
PassedShownWeekUnique,PassedShownDayUnique,PassedFailedDayUnique,Passed1WeekUnique
I tried regex Passed.* and it matches everything. Passed.*Unique isn't working, anyone help?
Just use the following. Match from Passed, then everything, until Unique
Passed.*Unique
if [[ $line =~ Passed.*Unique ]]; then echo "line matched: $line"; fi
EDIT: Since the OP revised the question to use a comma-separated line, split the line on commas and test each word:
line=PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,PassedFailedDayUnique,Passed1Week,Passed1WeekUnique
REGEX=Passed.*Unique
IFS=',';
for word in $line; do
if [[ $word =~ $REGEX ]]; then
echo matched $word
fi
done
Output:
matched PassedShownWeekUnique
matched PassedShownDayUnique
matched PassedFailedDayUnique
matched Passed1WeekUnique
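To get the comma-separated output the question asks for, the matching words can be collected and joined again. A sketch under a few assumptions: the ^ and $ anchors and the variable names words, matched and result are not from the original answer.

```bash
#!/usr/bin/env bash
line='PassedShownWeekUnique,PassedShownDayUnique,PassedFailedWeek,PassedFailedDayUnique,Passed1Week,Passed1WeekUnique'

# Anchored so that each word must start with "Passed" and end with "Unique".
regex='^Passed.*Unique$'

# Split on commas into an array; IFS is changed only for this read.
IFS=',' read -ra words <<< "$line"

matched=()
for word in "${words[@]}"; do
  if [[ $word =~ $regex ]]; then
    matched+=("$word")
  fi
done

# "${matched[*]}" joins with the first character of IFS; the command
# substitution runs in a subshell, so the IFS change does not leak.
result=$(IFS=','; echo "${matched[*]}")
echo "$result"
```

Splitting with read -ra avoids changing IFS for the rest of the script, which the plain IFS=','; assignment in the loop version does.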
You can either use the regex:
Unique$
to get lines that end with the word "Unique", or:
^Passed.+?Unique$
to get lines that start with "Passed" and end with "Unique". Depending on your specific implementation, you may want to choose one or the other.
And if you have comma-separated input, as you described:
(Passed.+?Unique),|$
This will capture each instance of a word that starts with "Passed" and ends with "Unique". You can check each capture group to print out the item that it matched.
How about you try to use ^ and $
^ Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.
$ Matches the empty string at the end of a line.
So something like this
^Passed.*?Unique$