grepping variables containing special characters in a shell script - regex

I am trying to grep out some lines from a file based on patterns stored in a variable in bash script that may contain (, ), [ or ]. I get the desired output with patterns that do not contain the special characters but with ( or ), I get a blank output and with [ or ], I get the following error:
grep: range out of order in character class
Sample of pattern file:
14-3-3-like protein B
14-3-3-like protein B (Fragment)
3-oxoacyl-[acyl-carrier-protein] synthase 2
Sample of input file:
seq1 gi|703124372 380 285 + 2e-154 14-3-3-like protein B sp
seq2 Q96451 69 51 + 3e-16 14-3-3-like protein B (Fragment) sp
seq3 P0AAI5 104 84 - 4e-20 3-oxoacyl-[acyl-carrier-protein] synthase 2 sp
My code is as below:
if [ $# -eq 0 ]
then echo -e "\nUSAGE: $0 [pattern file] [in file] > [out file]\n"
exit
else
while read line; do
echo -e "Pattern: $line"
grep -P "\t$line\t" "$2"
echo -e "\n"
done < "$1"
fi
Sample of the output:
Pattern: 14-3-3-like protein B
seq1 gi|703124372 380 285 + 2e-154 14-3-3-like protein B sp
Pattern: 14-3-3-like protein B (Fragment) sp
Pattern: 3-oxoacyl-[acyl-carrier-protein] synthase 2
grep: range out of order in character class
I've tried using grep -Fw, but that doesn't give the desired output either.
I've also tried substituting \( and \[ for ( and [ in the patterns in the two input files, but that doesn't work either.
Any idea how I can achieve this? Is there anything else I could use instead of grep?

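Use grep -F so that the pattern is treated as a fixed string rather than a regular expression, and put a literal tab in a variable:
tab=$(echo -e \\t)
grep -F "$tab$line$tab" "$2"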
Edit:
See also the suggestion from @anubhava: grep -F $'\t'"$line"$'\t' "$2"
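As a rough illustration of why the fixed-string approach copes with ( ) [ ], here is a minimal sketch. The helper name match_field and the temporary sample file are assumptions made for the demo, not part of the original answer, and the sample rows are shortened versions of the question's input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of FILE that contain PATTERN as a whole
# tab-delimited field. -F makes grep treat PATTERN literally, so ( ) [ ] are
# ordinary characters.
match_field() {
    local pattern=$1 file=$2
    # $'\t' is bash ANSI-C quoting for a literal tab character.
    grep -F -- $'\t'"$pattern"$'\t' "$file"
}

# Sample data modelled on the question (tab-separated columns, shortened).
infile=$(mktemp)
trap 'rm -f "$infile"' EXIT
printf 'seq1\tgi|703124372\t14-3-3-like protein B\tsp\n' > "$infile"
printf 'seq2\tQ96451\t14-3-3-like protein B (Fragment)\tsp\n' >> "$infile"
printf 'seq3\tP0AAI5\t3-oxoacyl-[acyl-carrier-protein] synthase 2\tsp\n' >> "$infile"

match_field '14-3-3-like protein B (Fragment)' "$infile"
match_field '3-oxoacyl-[acyl-carrier-protein] synthase 2' "$infile"
```

Because the tabs are part of the fixed string, the pattern "14-3-3-like protein B" matches only the seq1 row and not the "(Fragment)" row.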

Related

AWK negative regular expression with variable

I am using awk in a bash script to compare two files to get just the not-matching lines.
I need to compare all three fields of the second file (as one pattern?) with all lines of the first file:
First file:
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 628885 635117 HumanGM18558_peak_2 2509 . 83.77238 255.95094 250.99944 5270
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
Second file:
chr1 628885 635117
chr1 1250086 1250413
chr1 16613629 16613934
chr1 16644496 16644800
chr1 16895871 16896489
chr1 16905126 16905616
The current idea is to load one file into an array and use AWK's negative regular expression to compare.
readarray a < file2.txt
for i in "${a[@]}"; do
awk -v var="$i" '!/var/' file1.narrowPeak | cat > output.narrowPeak
done
The problem is that '!/var/' is not working with variables.
With awk alone:
$ awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1
chr1 9997 10330 HumanGM18558_peak_1 150 . 10.78887 18.86368 15.08777 100
chr1 15966215 15966638 HumanGM18558_peak_3 81 . 7.61567 11.78841 8.17169 200
NR==FNR this will be true only for the first file, which is file2 in this example
a[$1,$2,$3] create keys based on first three fields, if spacing is exactly same between the two files, you can simply use $0 instead of $1,$2,$3
next to skip remaining commands and process next line of input
($1,$2,$3) in a to check if first three fields of file1 is present as key in array a. Then invert the condition.
Here's another way to write it (thanks to Ed Morton)
awk '{key=$1 FS $2 FS $3} NR==FNR{a[key]; next} !(key in a)' file2 file1
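The two-file idiom above can be wrapped up and tried on a couple of rows like this. This is only a sketch: the function name anti_join, the array name seen and the shortened sample rows are made up for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the lines of DATA whose first three fields
# do not occur together as a line of KEYS.
anti_join() {
    local keys=$1 data=$2
    awk 'NR==FNR { seen[$1,$2,$3]; next }   # first file: remember each key
         !(($1,$2,$3) in seen)              # second file: keep unseen keys
        ' "$keys" "$data"
}

keys=$(mktemp); data=$(mktemp)
trap 'rm -f "$keys" "$data"' EXIT
printf 'chr1 628885 635117\nchr1 1250086 1250413\n' > "$keys"
printf 'chr1 9997 10330 peak_1\nchr1 628885 635117 peak_2\n' > "$data"

anti_join "$keys" "$data"   # prints: chr1 9997 10330 peak_1
```

One caveat with this idiom: if the key file is empty, NR==FNR is also true while reading the data file, so nothing is printed.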
When the pattern is stored in a variable, you have to use the match operator:
awk -v var="something" '
$0 !~ var {print "this line does not match the pattern"}
'
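Note that the variable is interpreted as a regular expression by the match operator. If the text should be taken literally, index() may be a better fit. A small sketch of the difference, with helper names invented for the illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers contrasting regex and literal matching of an awk variable.

# Print the lines that do NOT match the regex passed as $1.
not_matching_regex() {
    awk -v var="$1" '$0 !~ var'
}

# Print the lines that do NOT contain the literal string passed as $1.
# index() returns 0 when the substring is absent, so no escaping is needed.
not_containing_literal() {
    awk -v var="$1" 'index($0, var) == 0'
}

printf 'a.c\nabc\nxyz\n' | not_matching_regex 'a.c'       # prints: xyz
printf 'a.c\nabc\nxyz\n' | not_containing_literal 'a.c'   # prints: abc, then xyz
```

In the first call the dot matches any character, so both a.c and abc are filtered out; in the second only the line containing the exact text a.c is dropped.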
With this problem, regular expression matching looks a bit awkward. I'd go with Sundeep's solution, but if you really want regex:
awk '
NR == FNR {
# construct and store the regex
patt["^" $1 "[[:blank:]]+" $2 "[[:blank:]]+" $3 "[[:blank:]]"] = 1
next
}
{
for (p in patt)
if ($0 ~ p)
next
print
}
' second first

awk: how to extract 2 patterns from a single line and then concatenate them?

I want to find 2 patterns in each line and then print them with a dash between them as a separator. Here is a sample of lines:
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
20200323: #5358 BULL_SPX_X10_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205556, IR=NRB, LN=BULL SPX X10 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193132, SG=250, SN=193132, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X10_NORDNET_D2, TY=W, UQ=1}
20200323: #5359 BULL_SPX_X12_NORDNET_D2 {CU=DKK, ES=E, II=DK0061205630, IR=NRB, LN=BULL SPX X12 NORDNET D2, MIC=FNDK, NS=1, PC=P, SE=193131, SG=250, SN=193131, TK="0.01 to 100,0.05 to 500,0.1", TS=BULL_SPX_X12_NORDNET_D2, TY=W, UQ=1}
Given the above lines, my desired output after running a script should look like this:
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
The first alphanumeric value (e.g. BULL_SPX_X12_NORDNET_D2) is always in the 3rd position of a line.
The second alphanumeric value (e.g. DK0061205630) can be at various positions but it's always preceded by "II=" and is always exactly 12 characters length.
I tried to implement my task with the following script:
13 regex='II=.\{12\}'
14 while IFS="" read -r line; do
15 matchedString=`grep -o $regex littletest.txt | tr -d 'II=,'`
16 awk /II=/'{print $3, " - ", $matchedString}' littletest.txt > temp.txt
17 done <littletest.txt
My thought process and intentions/assumptions:
Line 13 defines a regex pattern to match the alphanumeric string preceded with "II="
In line 15 variable "matchedString" gets assigned a value that is extracted from a line via regex, with the preceding "II=" being deleted.
Line 16 uses an awk expression to detect all lines that contain "II=" and then print the third string found on each line of the input file, followed by the value of the matched string that was extracted in the previous line of the script. So I expect that at this point a pair of extracted patterns (e.g. BEAR_SPX_X15_NORDNET_D1 - DK0061205473) should be transferred to the temp.txt file.
Line 17 is taking an input file for a script to consume.
However, after running the script I did not get the desired output. Here is a sample of what I got:
BEAR_SPX_X15_NORDNET_D1
20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, LN=BEAR SPX X15 NORDNET D1, MIC=FNDK, NS=1, PC=C, SE=193133, SG=250, SN=193133, TK="0.01 to 100,0.05 to 500,0.1", TS=BEAR_SPX_X15_NORDNET_D1, TY=W, UQ=1}
How could I achieve my desired output that I described earlier?
$ awk -v OFS=' - ' 'match($0,/II=/){print $3, substr($0,RSTART+3,12)}' file
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
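A variation on the same idea that also checks the "exactly 12 characters" rule from the question might look like the sketch below. The function name extract_pairs is hypothetical, and it assumes the II value ends at the next comma or closing brace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each input line print "<3rd field> - <II value>",
# but only when the value after "II=" is exactly 12 characters long.
# Reads the files given as arguments, or standard input if there are none.
extract_pairs() {
    awk -v OFS=' - ' '
        match($0, /II=[^,}]*/) {
            id = substr($0, RSTART + 3, RLENGTH - 3)   # drop the leading "II="
            if (length(id) == 12) print $3, id
        }' "$@"
}

printf '%s\n' '20200323: #5357 BEAR_SPX_X15_NORDNET_D1 {CU=DKK, ES=E, II=DK0061205473, IR=NRB, UQ=1}' |
    extract_pairs   # prints: BEAR_SPX_X15_NORDNET_D1 - DK0061205473
```

match() sets RSTART and RLENGTH to the position and length of the matched text, which is what makes the substr() call independent of where II= appears on the line.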
Just trying out awk.
awk 'BEGIN{ FS="[II=, ]+" ; OFS=" - " } {print $3, $8}' file.txt
Using gawk (GNU awk), which supports a regex as the Field Separator (FS), and assuming that each line in your file has exactly the same format and the same number of fields, this works fine in my tests:
awk '{print $3,$9}' FS="[ ]|II=" OFS=" - " file1
#or FS="[[:space:]]+|II=|[,]" if you might have more than one space between fields
Results
BEAR_SPX_X15_NORDNET_D1 - DK0061205473
BULL_SPX_X10_NORDNET_D2 - DK0061205556
BULL_SPX_X12_NORDNET_D2 - DK0061205630
Since the II= part could be anywhere, this trick could also work with a penalty of parsing the file twice:
paste -d "-" <(awk '{print $3}' file1) <(awk '/II/{print $2}' RS="[ ]" FS="=|," file1)

Match string and print output specified field side by side for multiple files

I'm new to programming so I might need explanation for each step and I have an issue:
Say I have these (tab delimited) files:
genelist.txt contains:
start_position end_position description
1 840 putative replication protein
1839 2030 hypothetical protein
2095 2328 hypothetical protein
3076 4020 transposase
4209 4322 hypothetical protein
a.txt contains:
NA1.fa
NA1:0-840 scaffold40|size16362 100.000
NA1:1838-2030 scaffold40|size16362 100.000
NA1:3075-4020 scaffold40|size16362 100.000
NA1:4208-4322 scaffold40|size16362 92.105
b.txt contains:
NA4.fa
NA4:1838-2030 scaffold11|size142511 84.707
NA4:2094-2328 scaffold11|size142511 84.599
NA4:3075-4020 scaffold11|size142511 84.707
And my desired output is:
start_position end_position description NA1 NA4
1 840 putative replication protein 100 -
1839 2030 hypothetical protein 100 84.707
2095 2328 hypothetical protein - 84.599
3076 4020 transposase 100 84.707
4209 4322 hypothetical protein 92.105 -
Basically, I want to match the genes based on the end position and print out the percentage matches (of the 3rd field) side by side according to the respective IDs so I can get a comparison table of their percentage identity. And if there's no match, print - or 0 so I know which exactly has a match and which doesn't.
I'm open to bash/regex/perl/python or any sort of scripting that will do the job. Apologies if this has been asked before but I couldn't find any solutions so far.
Well that was a challenge. So here is the code:
#!/bin/bash
#
# Process genelist file
#
################################################################################
usage()
{
echo "process.bash <GENELIST> <DATAFILE1> [<DATAFILE n>]"
echo "Requires at least the genelist and 1 data file."
exit 1
}
# Process arguments
if [ $# -lt 2 ]
then
usage
else
genelistfile=$1
# Remove the first argument from $*
shift
datafiles=$*
fi
# Setup the output file ########################################################
# %m is the month and %M the minute
processdate=$(date +%Y%m%d-%H%M%S)
outputfile="process_$processdate.out"
# Build the header:
# the first line of the genelist.txt
# and the first line of each datafile (processed)
header="start_position\tend_position\tdescription"
for datafile in $datafiles
do
datafileheader=$(grep -v ":" $datafile | cut -d'.' -f1)
header="$header\t$datafileheader"
done
echo -e $header >$outputfile
# Process the genelistfile #####################################################
# Read each line from the genelistfile
while read -r line
do
# Do nothing with the header line
if [ $(echo $line | grep -c start_position) -gt 0 ]
then
continue
fi
# Setup the output line, which is the line from genelistfile
# The program will add values from the datafiles as they are processed
outputline=$line
# Extract the second field in the line, endposition
endposition=$(echo $line | awk '{print $2}')
# loop on each file in argument
for datafile in $datafiles
do
foundsomething='false'
# for each line in the datafile...
while read -r line2
do
# If the line is a range line, process it
if [ $(echo $line2 | grep -c ":") -gt 0 ]
then
# Extract the range
startrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f1)
endrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f2)
#echo "range= $startrange --> $endrange"
# Verify if endposition fits within the range...
if [ $endposition -ge $startrange -a $endposition -le $endrange ]
then
percentage=$(echo $line2 | awk '{print $3}')
outputline="$outputline\t$percentage"
foundsomething='true'
fi
fi
done < $datafile
# When done processing the file, we must check if something was found
if [ $foundsomething == 'false' ]
then
outputline="$outputline\t-"
fi
done
# When done processing that line from genelist, output it
echo -e $outputline >>$outputfile
done < $genelistfile
I have put in lots of comments to explain what is going on, but here are some assumptions I made to simplify the code:
all data files have a first line of the form SOMETHING1.SOMETHING2. I keep SOMETHING1 as the column header.
there will not be NA1 and NAx mixed data in the same file.
the range data is always specified like NAx:start-end.
the value to extract from the range data is always the 3rd element in a line.
It worked for me with your sample data.
Have fun!
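For comparison, the same table can probably be built in a single awk pass. The sketch below is not the script above: it matches on the exact end position (as the question describes) rather than on a range, it assumes every file is tab-delimited, and the names gene_table, nfile, pct and so on are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: gene_table GENELIST DATAFILE...
# Assumes tab-delimited input and data lines shaped like "NAx:start-end<TAB>...<TAB>percent".
gene_table() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfile++ }                          # a new input file starts
        nfile == 1 && FNR == 1 { header = $0; next }  # genelist header line
        nfile == 1 { gene[++n] = $0; endpos[n] = $2; next }
        FNR == 1 { name[nfile] = $1; sub(/\..*/, "", name[nfile]); next }
        {
            stop = $1; sub(/.*-/, "", stop)           # "NA1:0-840" -> "840"
            pct[nfile, stop] = $3
        }
        END {
            for (f = 2; f <= nfile; f++) header = header OFS name[f]
            print header
            for (i = 1; i <= n; i++) {
                line = gene[i]
                for (f = 2; f <= nfile; f++)
                    line = line OFS (((f, endpos[i]) in pct) ? pct[f, endpos[i]] : "-")
                print line
            }
        }' "$@"
}

# Tiny demo with shortened copies of the sample files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'start_position\tend_position\tdescription\n1\t840\tputative replication protein\n1839\t2030\thypothetical protein\n' > "$tmp/genelist.txt"
printf 'NA1.fa\nNA1:0-840\tscaffold40|size16362\t100.000\n' > "$tmp/a.txt"
printf 'NA4.fa\nNA4:1838-2030\tscaffold11|size142511\t84.707\n' > "$tmp/b.txt"

gene_table "$tmp/genelist.txt" "$tmp/a.txt" "$tmp/b.txt"
```

Each data file is read once and its percentages are stored under (file number, end position), so the gene list does not have to be rescanned for every line.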

I want to find some string in front of another string pattern, how to do it?

I want to use bash shell to split string like:
Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet
I want to get the brand names "Daivonex Cream" and "DBL Aspirin".
That is, I want the name in front of the pattern ***mg, ***mcg or ***g.
How can I do it?
If your sample input is representative, awk may offer the simplest solution:
awk -F'- | [0-9]+(mc?)?g' '{ print $2 }' <<'EOF'
Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet
Foo - Foo Bar 22g [1] other
EOF
yields:
Daivonex Cream
DBL Aspirin
Foo Bar
In Bash you can do:
while IFS= read -r line || [[ -n "$line" ]]; do
if [[ "$line" =~ ^([[:alpha:]]+)[[:space:][:punct:]]+([[:alpha:][:space:]]+)[[:space:]](.*)$ ]]
then
printf "1:'%s' 2:'%s' 3:'%s'\n" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
fi
done <<<"Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]
Aspirin - DBL Aspirin 100mg [1] tablet"
Prints:
1:'Calcipotriol' 2:'Daivonex Cream' 3:'50mcg/1g 30 g [1]'
1:'Aspirin' 2:'DBL Aspirin' 3:'100mg [1] tablet'
You can use sed this way:
sed -E 's/^[[:alpha:]]+ - ([[:alpha:] ]+) [[:digit:]]+.*/\1/' <<< "Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]"
=> Daivonex Cream
^[[:alpha:]]+ - => matches all the characters until the pattern we need to extract
([[:alpha:] ]+) => this is the part we want to extract
[[:digit:]]+.* => this is everything that comes after; we assume this part starts with a space and one or more digits, followed by any number of characters
\1 => the part extracted by the (...) expression above;
we replace the entire string with the matched part
You can check out this site to learn more about regexes: http://regexr.com/
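The same capture-group idea can be combined with the dose pattern from the question (digits followed by mg, mcg or g) in pure Bash. This is only a sketch: the function name brand_name is invented, and it assumes the brand name itself contains no digits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the brand name that sits between "- " and the
# first dose such as 50mcg, 100mg or 22g. Returns 1 when nothing matches.
brand_name() {
    # Group 1 is the brand name; the regex is kept in a variable so that
    # [[ =~ ]] treats it as a regex rather than a quoted literal string.
    local re='^[^-]+- ([^0-9]+) [0-9]+(mc?)?g'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

brand_name 'Calcipotriol - Daivonex Cream 50mcg/1g 30 g [1]'   # prints: Daivonex Cream
brand_name 'Aspirin - DBL Aspirin 100mg [1] tablet'            # prints: DBL Aspirin
```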

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
And I need to extract the words and numbers that appear 2-4 times, i.e. {2,4}.
I've tried many regex lines and even regex101.
I can't really put my finger on what's not working.
This is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
By default grep uses basic regular expressions, where {2,4} is not an interval unless it is written as \{2,4\}. You have to use extended regular expressions.
Use
-E option as,
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use
-w to match words, so that it matches the entire words instead of partial.
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
Note
You can eliminate the unnecessary use of cat by giving the file to grep as an argument instead.
You were very close; within character class notation [], the special notation \w is being treated literally, put it out of []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It scans the whole file and prints the words whose number of occurrences (in the file) falls in the range [2,4].
Output is:
uu
ab
88
1
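The same whole-file count can also be sketched with standard tools instead of an awk array. The function name words_in_range and its optional min and max arguments are made up for the example; the output is sorted, unlike the arbitrary order of awk's for (x in a).

```shell
#!/usr/bin/env bash
# Hypothetical helper: words_in_range FILE [MIN [MAX]]
# Prints each whitespace-separated word that occurs between MIN and MAX times
# (default 2 to 4) in the whole file, one word per line, sorted in C locale order.
words_in_range() {
    local file=$1 min=${2:-2} max=${3:-4}
    tr -s '[:space:]' '\n' < "$file" |    # one word per line
        LC_ALL=C sort |                   # group identical words together
        uniq -c |                         # prefix each word with its count
        awk -v min="$min" -v max="$max" \
            'NF == 2 && $1 >= min && $1 <= max { print $2 }'
}

sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'ab 12ab 1cd uu 88 ab 33 33 1 1\nab cd uu 88 88 33 33 33 cw ab\n' > "$sample"

words_in_range "$sample"
```

With the sample file this lists 1, 88, ab and uu; 33 is left out because it occurs five times.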
Using AWK, this solution counts the word instances per line not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
delete clears the array for each new line. Each field is used as an array index and its value is incremented by one. Finally, the indexes (fields) whose counts are between 2 and 4 inclusive are printed.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
@_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for @_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w with word boundaries around) into the @_ array. Then it counts the number of times each word occurred in the file and stores the result in the %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
Note that in Perl you should always use strict; and use warnings; in order to detect issues at an early stage.