Regex with characters, dots and numbers in Bash

I am trying to find a regex that checks the following pattern:
>chr28.1.1.24407.24473
This pattern consists of 5 parts separated by dots. The first part is the string ">chr" followed by a number (one or more digits), and all the other parts should be numbers with one or more digits.
The regex should be part of a small script that first finds these lines and then checks their validity.
HCE=$1
hceregex='^>chr[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+\.[1-9]+$'
grep ">" $HCE > HCE.headers
file="HCE.headers"
lines=`cat $file`
for line in $lines
do
if [[ ! $line =~ $hceregex ]]
then
echo "Invalid fasta header in HCE sequence. Check the G-Anchor manual for the headers format"
exit 1
else
echo "Brilliant!!!!"
fi
done
My problem is that the regex without the escape character for the dots returns all the headers. By using the escape character it excludes everything, even the right ones.
What am I doing wrong?
Many thanks in advance.

The first problem is [1-9], which matches only the digits 1 through 9. Use [0-9] to match any digit.
The second problem is the unnecessary cat and the unquoted variables. Use this code instead:
hceregex='^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
while read -r line; do
if [[ ! $line =~ $hceregex ]]; then
echo 'Invalid fasta header in HCE sequence'
else
echo 'Brilliant!!!!'
fi
done < "$file"
As a further optimization, you can shorten your regex to this:
hceregex='^>chr[0-9]+(\.[0-9]+){4}$'
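As a minimal sketch of how the shortened regex behaves, the check can be wrapped in a small function and tried on a few headers. The function name is_valid_header and the two invalid sample headers are assumptions made for illustration; only the first header comes from the question.

```shell
#!/usr/bin/env bash
# Sketch: validate FASTA headers of the form >chrN.N.N.N.N
# is_valid_header is a hypothetical helper name, not part of the original script.

hceregex='^>chr[0-9]+(\.[0-9]+){4}$'

is_valid_header() {
    # The regex variable is left unquoted so it is treated as a regex.
    [[ $1 =~ $hceregex ]]
}

# The first header is from the question; the other two are made-up invalid cases.
for header in '>chr28.1.1.24407.24473' '>chr28.1.1.24407' '>chrX.1.1.2.3'; do
    if is_valid_header "$header"; then
        echo "valid:   $header"
    else
        echo "invalid: $header"
    fi
done
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ is what lets the escaped dots work as intended.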

Your sample header contains a zero (in 24407), but [1-9]+ does not match zero. Update the regex to:
^>chr[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$

Related

Mix of regex and non-regex in bash if-statement

Inside of my $foo variable I have this data (please pay close attention to the .s and ,s):
,example.com,de.wikipedia.org,reddit,stackoverflow.com.,amazon.,
I am trying to write an if statement in bash that basically works like this:
if [[ "${foo}" =~ *','[a-z0-9]','* || "${foo}" =~ *','[a-z0-9]'.,'* ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
It would echo Invalid input detected since reddit and amazon. are in $foo.
If I change the contents of $foo to be:
,example.com,de.wikipedia.org,www.reddit.com,stackoverflow.com.,amazon.com,
Then it would echo OK.
I am using bash 3.2.57(1)-release on OS X 10.11.6 El Capitan.

Try:
if [[ $foo =~ ,[a-z0-9]*, || $foo =~ ,[a-z0-9]*\., ]]; then
echo "Invalid input detected"
else
echo "OK"
fi
Notes:
=~ is a regular expression operator. The right-hand-side needs to be a regular expression, not a glob.
, is not a shell-active character. Thus, it does not need any special quoting.
[a-z0-9] matches exactly one alphanumeric. Since we want to allow any number of them, use [a-z0-9]*
In regular expressions, ','* matches zero or more commas. This is not what you want. One might write ,.* which, because . is a wildcard, matches a comma followed by zero or more of anything. Since the regex is not anchored to the end, adding a final .* makes no difference.
Inside of [[...]] there is no word splitting. So shell variables do not need the double-quoting that they need elsewhere.
Note that, in [a-z0-9], the exact characters that match a-z or 0-9 depend on the collation order in the locale.
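The notes above can be sketched as a small function run against both inputs from the question. The function name has_invalid_entry and the variable names are assumptions for illustration. The patterns are stored in variables, which tends to be the most portable way to use =~ across bash versions.

```shell
#!/usr/bin/env bash
# has_invalid_entry is a hypothetical helper name.
has_invalid_entry() {
    local list=$1
    # An entry made only of alphanumerics (no dot at all), e.g. ",reddit,"
    local no_dot=',[a-z0-9]*,'
    # An entry whose only dot is the trailing one, e.g. ",amazon.,"
    local trailing_dot_only=',[a-z0-9]*\.,'
    [[ $list =~ $no_dot || $list =~ $trailing_dot_only ]]
}

for foo in ',example.com,de.wikipedia.org,reddit,stackoverflow.com.,amazon.,' \
           ',example.com,de.wikipedia.org,www.reddit.com,stackoverflow.com.,amazon.com,'; do
    if has_invalid_entry "$foo"; then
        echo "Invalid input detected"
    else
        echo "OK"
    fi
done
```

Note that an empty entry (two adjacent commas) would also be flagged, because [a-z0-9]* can match zero characters.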

Getting exact match for pattern in bash

I am having trouble getting bash to match a pattern exactly. For example, I only want to match the letters before my file extension, as in "test.bam", but when a number is included, as in "t1st.bam", I get this output: "st".
hello="t1st.bam"
re="([a-zA-Z]+)\.bam"
if [[ $hello =~ $re ]]; then
result=${BASH_REMATCH[1]}
else
echo "unable to parse string"
fi
echo "$result"
What I would like is for the pattern not to match at all if a non-alpha character is present, so that execution goes into the 'else' block. Thanks

If you want the match to start at the beginning of the string, add the ^ anchor:
re='^([a-zA-Z]+)\.bam'
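A minimal sketch of the anchored match, wrapped in a function so both outcomes are easy to see. The function name extract_stem is hypothetical, and the trailing $ anchor is an extra assumption added here so that text after .bam is rejected as well.

```shell
#!/usr/bin/env bash
# extract_stem is a hypothetical helper name.
# ^ forces the letters to start at the beginning of the string;
# $ (an added assumption) forces .bam to be the end of the string.
re='^([a-zA-Z]+)\.bam$'

extract_stem() {
    if [[ $1 =~ $re ]]; then
        # BASH_REMATCH[1] holds the text captured by the first group.
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        echo "unable to parse string" >&2
        return 1
    fi
}

extract_stem "test.bam"
extract_stem "t1st.bam" || echo "no match for t1st.bam"
```

Without the ^ anchor the regex is free to start matching after the digit, which is why the original version returned only the trailing letters.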

shell script odd regex

I have some regex that is behaving oddly in my shell script. I have variables, and I have tried every which way to get them to behave, but they don't seem to do any regex matching, even though I know my regex quite well thanks to regex101. Here is what a sample looks like:
fname="direcheck"
FIND="*"
if [[ $fname =~ $FIND ]]; then
echo "no quotes"
fi
if [[ "$fname" =~ "$FIND" ]]; then
echo "with quotes"
fi
right now it will display nothing
if i change find to
FIND="[9]*"
then it prints no quotes
if i say
FIND="[a-z]*"
then it prints no quotes
if i say
FIND="dircheck"
then nothing prints
if i say
FIND="*ck"
then nothing prints
I don't understand how this regex is working.
How do I use these variables, and what is the proper syntax?

* and *ck are invalid regular expressions. It would work (with no quotes) if you were comparing with ==, not =~. If you want to use the same functionality that you get in == for them, the equivalent regexps are .* and .*ck.
[9]* is any number (including zero) of 9 characters. There are zero 9 characters in your direcheck, so it matches.
dircheck is not found in direcheck, so not printing anything is hardly surprising.
[a-z]* is any number of characters that are between a and z (i.e. any number of lowercase letters). This will match, assuming it's not quoted.
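The difference between glob matching with == and regex matching with =~ can be sketched as follows. The helper names glob_match and regex_match are assumptions for illustration.

```shell
#!/usr/bin/env bash
# glob_match and regex_match are hypothetical helper names.

# == with an unquoted right-hand side does glob (shell pattern) matching
# against the whole string.
glob_match() { [[ $1 == $2 ]]; }

# =~ with an unquoted right-hand side does regex matching anywhere in the string.
regex_match() { [[ $1 =~ $2 ]]; }

fname="direcheck"

if glob_match "$fname" '*ck'; then echo "glob *ck matches"; fi
if regex_match "$fname" '.*ck'; then echo "regex .*ck matches"; fi
if regex_match "$fname" '[9]*'; then echo "regex [9]* matches (zero 9s is enough)"; fi
if ! regex_match "$fname" 'dircheck'; then echo "regex dircheck does not match"; fi
```

The glob * corresponds to the regex .* because in a regex * is only a repetition operator and needs something before it to repeat.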

I finally figured out why it was behaving so oddly.
[a-z]*, [9]* and [anythinghere]* all match because * means zero or more times, so "direcheck" contains [9] zero times, which still counts as a match.
So
if [[ "$fname" =~ $FIND ]]; then
or
if [[ $fname =~ $FIND ]]; then
are both correct, and
if [[ "$fname" =~ "$FIND" ]]; then
matches only when the literal text of $FIND occurs in $fname, because a quoted right-hand side is treated as a literal string, not as a regex
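A short sketch of the quoting behaviour, assuming bash 3.2 or later with default compatibility settings. The helper names matches_as_regex and matches_as_literal are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper names, for illustration only.

# Unquoted right-hand side: the value is interpreted as a regex.
matches_as_regex() { [[ $1 =~ $2 ]]; }

# Quoted right-hand side: the value is matched as literal text.
matches_as_literal() { [[ $1 =~ "$2" ]]; }

fname="direcheck"

if matches_as_regex "$fname" '[a-z]*ck'; then
    echo "unquoted [a-z]*ck: matches as a regex"
fi

if ! matches_as_literal "$fname" '[a-z]*ck'; then
    echo "quoted [a-z]*ck: no match, the brackets and * are taken literally"
fi

if matches_as_literal "$fname" 'check'; then
    echo "quoted check: matches, the literal text occurs inside direcheck"
fi
```

Quoting the left-hand side makes no difference here; only quoting the pattern on the right changes how it is interpreted.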

Matching exactly one whitespace inside if statement

I'm trying to match a file right now to change the name of the file
tempString="hi"
end="_hi.pdf"
for c in *.pdf; do
tempString="$(echo ${c})"
if [[ $tempString =~ $AA[0-9][0-9][0-9]\.pdf$ ]]
then
echo "inside if"
tempString="$(echo $tempString | tr -d ' ')"
tempString=${tempString%.pdf}
mv "%c" "$monthyear$tempString$end"
fi
done
tempString is set to something like "AA 111.pdf".
I need it to match something like "AA 111.pdf" but not "AA  111.pdf" (one space, not two). I want it to match exactly one whitespace character between AA and 111.
It keeps matching both of those examples or neither. I've tried \s, [\s], [:space:], [[:space:]], etc.
I've tried looking it up everywhere, but to no avail. Can somebody help me out?

The following matches names with one (and only one) space, like AA 111.pdf:
if [[ "$tempString" =~ ^AA" "[0-9][0-9][0-9]\.pdf$ ]];
The trick is to quote your spaces inside the regex.
Update: the following code rejects the example with two (or more) spaces:
tempString="AA  111.pdf"
if [[ "$tempString" =~ ^AA" "[0-9][0-9][0-9]\.pdf$ ]]; then
echo "yes"
else
echo "no"
fi
This prints no
One-liner version:
tempString="AA  111.pdf"; if [[ "$tempString" =~ ^AA" "[0-9][0-9][0-9]\.pdf$ ]]; then echo "yes"; else echo "no"; fi
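An alternative sketch keeps the pattern in a variable, which avoids having to quote the space inside [[ ]]. The function name has_single_space and the sample file names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# The single space in the pattern is literal; [0-9]{3} means exactly three digits.
re='^AA [0-9]{3}\.pdf$'

# has_single_space is a hypothetical helper name.
has_single_space() { [[ $1 =~ $re ]]; }

# One space, two spaces, and no space between AA and the digits.
for name in 'AA 111.pdf' 'AA  111.pdf' 'AA111.pdf'; do
    if has_single_space "$name"; then
        echo "yes: '$name'"
    else
        echo "no:  '$name'"
    fi
done
```

Because the pattern is anchored at both ends, a second space has nowhere to go and the name with two spaces is rejected.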

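Try this regex: [a-zA-Z]+ [0-9]+
If you want any word character before the space, use [[:alnum:]_]+ [0-9]+
If you want any word character before and after the space, use [[:alnum:]_]+ [[:alnum:]_]+
To take the file extension into account, add \.pdf$ at the end of any of the regexes above.
These use bracket expressions rather than \d, because bash's =~ uses POSIX extended regular expressions, which do not support \d.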

Regex doesn't match with the lines in txt file

I'm reading lines from a text file and checking whether each one matches the regex I've created.
It always reports that the regex did not match, even though a regex tool shows that the lines match my regular expression.
while read line
do
name=$line
BRANCH_REGEX="\d{10}\-[^_]*\_\d{13}"
if [[ $name =~ $BRANCH_REGEX ]];
then
echo "BRANCH '$name' matches BRANCH_REGEX '$BRANCH_REGEX'"
else
echo "BRANCH '$name' DOES NOT MATCH BRANCH_REGEX '$BRANCH_REGEX'"
fi
done < names.txt
names.txt includes lines for example :
9000999484-suchocka_1416578464908
9000989944-schubertk_1416582641605
9001026342-extbeerfelde_1416586904787
9000687045-sturmjo_1416573131629
9001059401-extburghartswieser_1416405627982
9000806302-PDPUPDATE_1357830207068
9000658783-PDPUPDATE_1360445087963

BRANCH_REGEX="/\d{10}\-[^_]*\_\d{13}"
Remove the leading / at the start of the pattern; none of your lines begin with it.
Also note that _ doesn't need to be escaped, you can write _ instead of \_.

Change your regex to:
BRANCH_REGEX="[0-9]{10}-[^_]*_[0-9]{13}"
Or else:
BRANCH_REGEX="[[:digit:]]{10}-[^_]*_[[:digit:]]{13}"
Bash regex does not support the \d shorthand, and there is no need to escape hyphens.
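A minimal sketch of the corrected loop, using a here-document in place of names.txt so it runs on its own. The function name is_branch_name, the ^ and $ anchors, and the last sample line are assumptions added for illustration; the first two sample lines come from the question.

```shell
#!/usr/bin/env bash
# Anchors are an added assumption so the whole line has to fit the pattern.
BRANCH_REGEX='^[0-9]{10}-[^_]*_[0-9]{13}$'

# is_branch_name is a hypothetical helper name.
is_branch_name() { [[ $1 =~ $BRANCH_REGEX ]]; }

# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r name; do
    if is_branch_name "$name"; then
        echo "BRANCH '$name' matches BRANCH_REGEX '$BRANCH_REGEX'"
    else
        echo "BRANCH '$name' DOES NOT MATCH BRANCH_REGEX '$BRANCH_REGEX'"
    fi
done <<'EOF'
9000999484-suchocka_1416578464908
9000806302-PDPUPDATE_1357830207068
not-a-branch_name
EOF
```

To read from the real file, the here-document can be replaced with a redirect from names.txt as in the original loop.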
