Getting exact match for pattern in bash - regex

I am having trouble getting bash to match a pattern exactly. For example, I only want to match the letters before my file extension, as in "test.bam". But when a number is included, as in "t1est.bam", I get this output: "est".
hello="t1est.bam"
re="([a-zA-Z]+)\.bam"
if [[ $hello =~ $re ]]; then
  result=${BASH_REMATCH[1]}
else
  echo "unable to parse string"
fi
echo "$result"
What I would like is for the pattern not to match at all if a non-alpha character is present, so that execution goes into the 'else' block. Thanks

If you want the match to start at the beginning of the string, add the ^ anchor:
re='^([a-zA-Z]+)\.bam'
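To make the effect of the anchor concrete, here is a minimal sketch. The helper name alpha_stem is made up for illustration, and the trailing $ anchor is an extra assumption (it also rejects names such as test.bam.bak):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the alphabetic stem of NAME.bam, or fail.
alpha_stem() {
  local re='^([a-zA-Z]+)\.bam$'   # ^ and $ force the whole string to match
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

for name in test.bam t1est.bam; do
  if stem=$(alpha_stem "$name"); then
    echo "$name -> $stem"
  else
    echo "$name -> unable to parse string"
  fi
done
```

Without the ^ anchor, the regex is free to start matching in the middle of the string, which is why the unanchored version returns only the letters after the digit.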

Related

Regex/Shell - how to match all except those with specific pattern

I need a regex in shell to match all strings except those containing a specific pattern.
The pattern can vary: any string containing (i|I)<two digits>(u|U)<two digits> should not match.
For example :
Some.text.1234.text => should match
Some.text.1234.i10u20.text => shouldn't match
Some.text.1234.I01U02.text => shouldn't match
Some.text.1234.i83U23.text => shouldn't match
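You can try this (it needs a PCRE engine such as grep -P, because Bash's =~ uses POSIX ERE and does not support lookaheads):
^(?!.*[iI]\d{2}[uU]\d{2}).*$
Explanation:
^ start of the line
(?!.* ... ) negative lookahead: fail if the enclosed pattern occurs anywhere in the line
[iI]\d{2}[uU]\d{2} an i or I followed by two digits, then a u or U followed by two digits
.*$ if the lookahead did not fail, match the entire line up to the end of the string $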
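With tools limited to POSIX ERE, the same effect can be had by inverting a plain match instead of using a lookahead. A minimal sketch, where the function name filter_valid is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical filter: read lines on stdin, print only those that do NOT
# contain i/I + two digits + u/U + two digits.
filter_valid() {
  grep -Ev '[iI][0-9]{2}[uU][0-9]{2}'
}

printf '%s\n' \
  'Some.text.1234.text' \
  'Some.text.1234.i10u20.text' \
  'Some.text.1234.I01U02.text' \
  'Some.text.1234.i83U23.text' | filter_valid
```

Here -v inverts the match, so only the first sample line is printed.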
A Bash script that checks whether a string matches a regex can look like this:
f='It_is_your_string_to_check'
if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
  echo "$f is invalid"
else
  echo "$f is valid"
fi
Here, "${f^^}" turns the string into uppercase (so there is no need for (U|u) and (I|i)); this expansion requires Bash 4.0 or later. The =~ operator then performs a regex check, since the pattern on the right side is not quoted. To play it safe, you can define the regex pattern in a separate single-quoted string variable and use
rx='I[0-9]{2}U[0-9]{2}'
if [[ "${f^^}" =~ $rx ]]; then ...
A complete Bash example:
s='Some.text.1234.text
Some.text.1234.i10u20.text
Some.text.1234.I01U02.text
Some.text.1234.i83U23.text'
for f in $s; do
  if [[ "${f^^}" =~ I[0-9]{2}U[0-9]{2} ]]; then
    echo "$f is invalid"
  else
    echo "$f is valid"
  fi
done
Output:
Some.text.1234.text is valid
Some.text.1234.i10u20.text is invalid
Some.text.1234.I01U02.text is invalid
Some.text.1234.i83U23.text is invalid
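On a Bash older than 4.0, where "${f^^}" is unavailable, a sketch of an equivalent check can spell out both cases in bracket expressions instead. The helper name is_valid_name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the name does NOT contain the
# i##u## marker (either letter in either case).
is_valid_name() {
  local rx='[iI][0-9]{2}[uU][0-9]{2}'
  [[ ! $1 =~ $rx ]]
}

for f in Some.text.1234.text Some.text.1234.i83U23.text; do
  if is_valid_name "$f"; then
    echo "$f is valid"
  else
    echo "$f is invalid"
  fi
done
```

Keeping the pattern in a single-quoted variable and expanding it unquoted on the right of =~ avoids quoting surprises inside [[ ]].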

Regex with characters, dots and numbers in Bash

I am trying to find a regex that will check the following pattern:
>chr28.1.1.24407.24473
So, this pattern consists of 5 parts separated by dots. The first part is the string ">chr" followed by a number (one or more digits), and all the other parts should be numbers with one or more digits.
This regex should be part of a small script which first finds these lines and then validates them.
HCE=$1
hceregex='^>chr[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+$'
grep ">" $HCE > HCE.headers
file="HCE.headers"
lines=`cat $file`
for line in $lines
do
  if [[ ! $line =~ $hceregex ]]
  then
    echo "Invalid fasta header in HCE sequence. Check the G-Anchor manual for the headers format"
    exit 1
  else
    echo "Brilliant!!!!"
  fi
done
My problem is that the regex without the escape character for the dots returns all the headers. By using the escape character it excludes everything, even the right ones.
What am I doing wrong?
Many thanks in advance.
First problem is using [1-9] which will match only digits 1-9. You should be using [0-9] to match any digit.
Second problem is use of unnecessary cat and unquoted variables. You should be using this code:
hceregex='^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
while read -r line; do
  if [[ ! $line =~ $hceregex ]]; then
    echo 'Invalid fasta header in HCE sequence'
  else
    echo 'Brilliant!!!!'
  fi
done < file
As a further optimization, you can shorten your regex to this:
hceregex='^>chr[0-9]+(\.[0-9]+){4}$'
Your sample header contains a zero (in 24407), but [1-9]+ does not match 0. Update the regex to:
^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
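Putting the points from the answers together, the check might be sketched as a function that reads the FASTA file directly, skips sequence lines, and stops at the first bad header. The name check_headers, the error message wording, and the temporary demo file are assumptions made for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate every header line (lines starting with ">")
# in the file given as $1. Returns 1 at the first invalid header.
check_headers() {
  local file=$1 line
  local re='^>chr[0-9]+(\.[0-9]+){4}$'
  # "|| [[ -n $line ]]" also handles a last line without a trailing newline
  while IFS= read -r line || [[ -n $line ]]; do
    [[ $line == '>'* ]] || continue          # skip sequence lines
    if [[ ! $line =~ $re ]]; then
      printf 'Invalid fasta header: %s\n' "$line" >&2
      return 1
    fi
  done < "$file"
  return 0
}

# Small demo on a throwaway file
tmp=$(mktemp)
printf '%s\n' '>chr28.1.1.24407.24473' 'ACGT' > "$tmp"
check_headers "$tmp" && echo "all headers valid"
rm -f "$tmp"
```

Reading the file in a while loop avoids both the intermediate HCE.headers file and the word splitting that comes with iterating over the output of cat.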

RegEx : How can I extract a certain part and modify it?

I'd like to extract a certain part of a string and modify it by using a regular expression.
A given string is TestcaseVzwPerformance_8_2_1_4_1_FDD2.
I'd like to extract the part 8_2_1_4_1 from the string and then replace the underscores _ with dots . So the expected result needs to be 8.2.1.4.1.
The numbers and length of the given string can be different.
For example,
Given string // Expected result
TestcaseVzwCqi_3_9_Test2 // 3.9
TestcaseVzwSvd1xRttAclr_6_6_2_3 // 6.6.2.3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4 // 9.4.1.1.1
Here is my RegEx:
((?:\D{0,}_)(\d(_\d)*)(.*))
The numbered capturing group - $2 - contains 8_2_1_4_1 but with underscores.
Can I replace the underscores with dots?
It needs to be done in one RegEx and a Replace.
A regex by itself cannot modify text; you need a tool that applies a substitution, for example sed:
echo TestcaseVzwPerformance_8_2_1_4_1_FDD2 |
sed -E 's/[^_]*_(([_0-9])+)_.*/\1/;s/_/./g'
8.2.1.4.1
If you have a Bash string, you can use a Bash regex to capture and Bash parameter expansions to replace:
$ s="TestcaseVzwSvd1xRttAclr_6_6_2_3"
$ [[ $s =~ ^[^_]*_([[:digit:]_]+)_* ]] && tmp=${BASH_REMATCH[1]//_/.} && echo "${tmp%.}"
6.6.2.3
Which can be in a loop:
while read -r line; do
  if [[ $line =~ ^[^_]*_([[:digit:]_]+)_* ]]; then
    tmp=${BASH_REMATCH[1]//_/.}
    echo "\"$line\" => ${tmp%.}"
  fi
done <<< 'Given string
TestcaseVzwCqi_3_9_Test2
TestcaseVzwSvd1xRttAclr_6_6_2_3
TestcaseVzwCsiFading_9_4_1_1_1_FDD4'
Prints:
"TestcaseVzwCqi_3_9_Test2" => 3.9
"TestcaseVzwSvd1xRttAclr_6_6_2_3" => 6.6.2.3
"TestcaseVzwCsiFading_9_4_1_1_1_FDD4" => 9.4.1.1.1
You can use the same loop to process a file.
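As a variation on the Bash approach above, the capture group can be restricted to digit runs separated by single underscores, so no trailing dot has to be trimmed afterwards. This is only a sketch; the function name section_number and the exact pattern are assumptions, not part of the original answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first run of underscore-separated numbers
# in $1 with the underscores replaced by dots; fail if there is none.
section_number() {
  # group 1: digits, then zero or more "_digits"; must be followed by "_" or end
  local re='_([0-9]+(_[0-9]+)*)(_|$)'
  [[ $1 =~ $re ]] || return 1
  printf '%s\n' "${BASH_REMATCH[1]//_/.}"
}

for name in TestcaseVzwCqi_3_9_Test2 \
            TestcaseVzwSvd1xRttAclr_6_6_2_3 \
            TestcaseVzwCsiFading_9_4_1_1_1_FDD4; do
  printf '%s => %s\n' "$name" "$(section_number "$name")"
done
```

The regex does the extraction and the parameter expansion ${var//_/.} does the replacement, which matches the point that a regex match alone cannot rewrite text.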
If you have a file, you may as well use gawk:
$ gawk 'BEGIN{FPAT="_[[:digit:]_]+"}
/_[[:digit:]]/ {sub(/^_/,"", $1); sub(/_$/,"",$1); gsub(/_/,".",$1); print $1}' file
3.9
6.6.2.3
9.4.1.1.1

Regex doesn't match with the lines in txt file

I'm reading lines from a text file and checking whether each one matches the regex I've created.
But it always reports that the line didn't match, even though a regex tool shows that the lines do match my regular expression.
while read line
do
  name=$line
  BRANCH_REGEX="\d{10}\-[^_]*\_\d{13}"
  if [[ $name =~ $BRANCH_REGEX ]];
  then
    echo "BRANCH '$name' matches BRANCH_REGEX '$BRANCH_REGEX'"
  else
    echo "BRANCH '$name' DOES NOT MATCH BRANCH_REGEX '$BRANCH_REGEX'"
  fi
done < names.txt
names.txt contains lines such as:
9000999484-suchocka_1416578464908
9000989944-schubertk_1416582641605
9001026342-extbeerfelde_1416586904787
9000687045-sturmjo_1416573131629
9001059401-extburghartswieser_1416405627982
9000806302-PDPUPDATE_1357830207068
9000658783-PDPUPDATE_1360445087963
BRANCH_REGEX="/\d{10}\-[^_]*\_\d{13}"
              ↑
Remove the leading /, none of your lines begin with it.
Also note that _ doesn't need to be escaped, you can write _ instead of \_.
Change your regex to:
BRANCH_REGEX="[0-9]{10}-[^_]*_[0-9]{13}"
Or else:
BRANCH_REGEX="[[:digit:]]{10}-[^_]*_[[:digit:]]{13}"
This is needed because Bash's regex (POSIX ERE) doesn't support the \d shorthand. There is also no need to escape hyphens outside a bracket expression.
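A minimal sketch of the corrected check, wrapped in a function. The name is_branch is hypothetical, and the ^ and $ anchors are an added assumption (they require the whole line to have the expected shape, not just a part of it):

```shell
#!/usr/bin/env bash
# POSIX ERE: [0-9] instead of \d, no escapes needed for - or _
branch_re='^[0-9]{10}-[^_]*_[0-9]{13}$'

# Hypothetical helper: succeed when $1 looks like <10 digits>-<name>_<13 digits>
is_branch() {
  [[ $1 =~ $branch_re ]]
}

for name in 9000999484-suchocka_1416578464908 9000806302-PDPUPDATE_1357830207068; do
  if is_branch "$name"; then
    echo "BRANCH '$name' matches"
  else
    echo "BRANCH '$name' DOES NOT MATCH"
  fi
done
```

Defining the pattern once outside the loop also avoids reassigning it on every iteration.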

bash regex to parse text of the form +incdir+<dir1>+<dir2>

I have an input string of the form +incdir+<dir1>+<dir2>, where <dir1> and <dir2> are directory names. I want to parse this using a bash regex and have the values of the directories inside BASH_REMATCH[1], [2], ...
Here is what I tried:
function match {
  if [[ "$1" =~ \+incdir(\+.*)+ ]]; then
    for i in $(seq $(expr ${#BASH_REMATCH[@]} - 1)); do
      echo $i ":" ${BASH_REMATCH[$i]}
    done
  else
    echo "no match"
  fi
}
This works for match +incdir+foo, but not for match +incdir+foo+bar, because the match is greedy and it outputs +foo+bar. There isn't any non-greedy matching in bash, as the question "regex in bash expression" mentions, so I tried the following pattern instead: \+incdir(\+[^+]*)+ but this just gives me +bar.
The way I would interpret the regex is the following: find the beginning +incdir, then match me at least one group starting with a + followed by as many characters as you can find that are not +. When you hit a + this is the start of the next group. I guess my reasoning is incorrect.
Does anyone have any idea what I'm doing wrong?
Using only bash builtins (but NOT regular expressions, which are the wrong tool for this job):
match() {
  [[ $1 = *+incdir+* ]] || return              # noop if no +incdir present
  IFS=+ read -r -a pieces <<<"${1#*+incdir+}"  # read everything after +incdir+
                                               # into +-separated array
  for idx in "${!pieces[@]}"; do               # iterate over keys in array
    echo "$idx: ${pieces[$idx]}"               # ...and emit key/value pairs
  done
}
$ match "yadda yadda +incdir+foo+bar+baz"
0: foo
1: bar
2: baz
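The reason the original attempt only yields +bar is that a repeated capture group keeps just the text of its last repetition, so a single =~ cannot return a variable number of pieces. If a regex is still wanted, one way is to match one +dir at a time in a loop. This is only a sketch: the function name parse_incdir and the global array dirs are made up for illustration, and unlike match above it requires the string to start with +incdir+:

```shell
#!/usr/bin/env bash
# Hypothetical helper: split "+incdir+a+b+..." into the global array "dirs".
parse_incdir() {
  local rest=$1
  local re='^\+([^+]+)(.*)$'        # group 1: one directory, group 2: remainder
  [[ $rest == '+incdir+'* ]] || return 1
  rest=${rest#'+incdir'}            # leaves "+a+b+..."
  dirs=()
  while [[ $rest =~ $re ]]; do
    dirs+=("${BASH_REMATCH[1]}")
    rest=${BASH_REMATCH[2]}         # continue with what is left
  done
}

if parse_incdir '+incdir+foo+bar'; then
  for i in "${!dirs[@]}"; do
    echo "$i: ${dirs[$i]}"
  done
fi
```

Each pass consumes one directory and feeds the remainder back into the next match, so the loop ends when nothing is left. The IFS-based version above remains the simpler option.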