bash: extracting substrings by pattern, with empty fields and multiple occurrences

I would like to extract the Pfam_A information from each line of a file:
item_1 ID=HJNANFJJ_180142;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_180142;partial=01;product=unannotated protein;KEGG=K03531
item_4 ID=HJNANFJJ_87662;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_87662;partial=10;product=unannotated protein;KEGG=K15725;Pfam_A=OEP;Resfams=adeC-adeK-oprM
item_8 ID=HJNANFJJ_328505;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_328505;partial=11;product=unannotated protein;KEGG=K03578;Pfam_A=OB_NTP_bind
item_2 ID=HJNANFJJ_512995;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_512995;partial=11;product=unannotated protein;KEGG=K00674;Pfam_A=Hexapep;Pfam_A=Hexapep_2;metacyc=TETHYDPICSUCC-RXN
item_0 ID=HJNANFJJ_188729;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_188729;partial=11;product=unannotated protein
In some lines this information is missing entirely; in some there can be multiple occurrences.
Finally, I want to get a table like this, where missing values become NaN and multiple occurrences are put tab-separated into different fields:
item_1 NaN
item_4 OEP
item_8 OB_NTP_bind
item_2 Hexapep Hexapep_2
item_0 NaN

You may use this awk:
awk -v OFS='\t' 'NF > 1 {
    s = ""
    # split the annotation field on ";" and collect every Pfam_A value
    n = split($NF, a, /;/)
    for (i = 1; i <= n; i++)
        if (split(a[i], b, /=/) == 2 && b[1] == "Pfam_A")
            s = s OFS b[2]
    # no Pfam_A found: print NaN instead
    print $1 (s ? s : OFS "NaN")
}' file
item_1 NaN
item_4 OEP
item_8 OB_NTP_bind
item_2 Hexapep Hexapep_2
item_0 NaN

A quick and dirty way would be:
awk -v OFS='\t' '{
    s = $0; t = ""
    # collect every Pfam_A value; match() sets RSTART and RLENGTH
    while (match(s, "Pfam_A=[^;]*")) {
        t = t (t ? OFS : "") substr(s, RSTART+7, RLENGTH-7)   # 7 = length("Pfam_A=")
        s = substr(s, RSTART+RLENGTH)
    }
    print $1, (t ? t : "NaN")
}' file
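If match(), RSTART and RLENGTH are new to you, this minimal sketch shows the bookkeeping the loop relies on (RSTART is the 1-based match position, RLENGTH its length, and 7 is the length of "Pfam_A="):
echo 'x;Pfam_A=OEP;y' | awk '{
    if (match($0, "Pfam_A=[^;]*"))
        print RSTART, RLENGTH, substr($0, RSTART+7, RLENGTH-7)
}'
# prints: 3 10 OEP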

Assuming that in each input line there are no ; characters other than those separating the data fields, and no tab characters except the one delimiting the first column, a simple sed command can do the job:
sed -E 's/\s+/;/; s/;Pfam_A=/;\t/g; s/;[^\t]*//g; /\t/!s/$/\tNaN/' file
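The same pipeline spelled out with comments (GNU sed assumed, for -E, \s and \t):
sed -E '
    # join column 1 onto the field list: the first whitespace run becomes ";"
    s/\s+/;/
    # mark every Pfam_A value with a leading tab
    s/;Pfam_A=/;\t/g
    # drop every remaining ";"-prefixed (non-Pfam_A) field
    s/;[^\t]*//g
    # no tab left means no Pfam_A was present: append NaN
    /\t/!s/$/\tNaN/
' file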

Related

File fields and columns adjustment with awk [LINUX]

I have an issue adjusting column delimiters in a file on Linux before loading it into a database.
I need 14 columns and I use "|" as a delimiter, so I applied:
awk -F'|' '{missing=14-NF;if(missing==0){print $0}else{printf "%s",$0;for(i=1;i<=missing-1;i++){printf "|"};print "|"}}' myFile
Suppose I have a row like this:
a|b|c|d|e||f||g||||h|i|
after applying the awk command it will be:
a|b|c|d|e||f||g||||h|i||
and this is not acceptable; I need the data to be 14 columns only.
Sample input (in case of a 14-field row):
a|b|c|d|e||f||g||||h|i
Do nothing
Sample input (in case of extra fields):
a|b|c|d|e||f||g||||h|i|
output:
a|b|c|d|e||f||g||||h|i
Sample input (in case of fewer fields):
a|b|c|d||e||f||g|h
output:
a|b|c|d||e||f||g|h|||
You may use this gnu-awk solution:
awk -v n=14 '
BEGIN { FS = OFS = "|" }
{
    # trim anything beyond 14 fields (13 = n-1 separators in the kept part)
    $0 = gensub(/^(([^|]*\|){13}[^|]*)\|.*/, "\\1", "1")
    # pad short rows with empty fields up to n
    for (i = NF+1; i <= n; ++i)
        $i = ""
} 1' file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i
a|b|c|d||e||f||g|h|||
Where the original file is this:
cat file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i|
a|b|c|d||e||f||g|h
Here:
Using gensub we remove all extra fields
Using a for loop we create new empty fields until NF = n
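If gensub is unfamiliar: unlike sub and gsub it returns the modified string instead of editing in place (GNU awk only). A minimal sketch keeping 3 fields instead of 14:
echo 'a|b|c|d|e' | gawk '{ print gensub(/^(([^|]*\|){2}[^|]*)\|.*/, "\\1", 1) }'
# a|b|c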
If you don't have gnu-awk then the following should work in a non-GNU awk (tested on BSD awk):
awk -v n=14 '
BEGIN { FS = OFS = "|" }
{
    # pad short rows with empty fields up to n
    for (i = NF+1; i <= n; ++i) $i = ""
    # blank any extra fields, then truncate the record to exactly n
    for (i = n+1; i <= NF; ++i) $i = ""
    NF = n
} 1' file
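A quick sanity check of the pad/truncate logic with a hypothetical width of n=5:
printf 'a|b\na|b|c|d|e|f|g\n' |
awk -v n=5 'BEGIN { FS = OFS = "|" } { for (i = NF+1; i <= n; ++i) $i = ""; NF = n } 1'
# a|b|||
# a|b|c|d|e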

Add a condition for specific row length in a script

I want to modify the following script:
awk 'NR>242 && $1 =='$t' {print $4, "\t" '$t'}' test.txt > file
I want to add a condition for the first "1 to 121" data (corresponding to the first 121 points) and then for the "122 to 242" data (which corresponds to the other 121 points).
So it becomes:
when NR>242, take the corresponding values of rows from 1 to 121 and print them to file1
when NR>242, take the corresponding values of rows from 122 to 242 and print them to file2
Thanks!
Generic solution: here is a more generic approach, where you can give all boundary line numbers in the lines variable of the awk program. Whenever the line number matches one of those values, the file counter increases by 1, e.g. from file1 to file2, or file2 to file3, and so on (a runnable toy follows the program below).
awk -v val="$t" -v lines="121,242" -v count=1 '
BEGIN {
    num = split(lines, arr, ",")
    for (i = 1; i <= num; i++)
        line[arr[i]]                  # remember each boundary line number
    outputfile = "file" count
}
FNR in line {                         # reached a boundary: switch output file
    close(outputfile)
    outputfile = "file" ++count
}
($1 == val) {
    print $4 "\t" val > (outputfile)
}
' Input_file
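A runnable toy (hypothetical boundaries 3 and 6, routing every line rather than filtering on $1) showing that the boundary row itself starts the new file:
seq 9 | awk -v lines="3,6" -v count=1 '
BEGIN { num = split(lines, arr, ","); for (i = 1; i <= num; i++) line[arr[i]]
        outputfile = "file" count }
FNR in line { close(outputfile); outputfile = "file" ++count }
{ print > outputfile }'
# file1: 1 2   file2: 3 4 5   file3: 6 7 8 9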
With your shown samples, please try the following. It will print all lines from the 1st to the 242nd to file1, and from line 243 onwards to file2. The shell variable t is passed into the awk variable val.
awk -v val="$t" '
FNR == 1   { outputfile = "file1" }
FNR == 243 { outputfile = "file2" }
($1 == val) {
    print $4 "\t" val > (outputfile)
}
' Input_file
$ awk -v val="$t" '{ c = int((NR-1)%242/121) + 1 }
       $1 == val { print $4 "\t" $1 > ("output" c) }' file
This should send the first, third, etc. blocks of 121 records to output1 and the second, fourth, etc. blocks of 121 records to output2, if they satisfy the condition.
If you want to skip the first two blocks (the first 242 records), just add && NR>242 to the existing condition.
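To see how the block index c alternates with NR, this sketch evaluates the same expression for a few sample record numbers:
awk 'BEGIN {
    n = split("1 121 122 242 243 364", nrs, " ")
    for (i = 1; i <= n; i++)
        print nrs[i], int((nrs[i]-1) % 242 / 121) + 1
}'
# 1 1
# 121 1
# 122 2
# 242 2
# 243 1
# 364 2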

add plus or minus in awk if no match

I am trying to match all the lines in the file below against bigfile. The awk does that; the problem is that lines that do not match exactly should still count as matches if they are within plus or minus 10. I am not sure how to tell awk that, if an exact match is not found, it should try the coordinates in file plus or minus 10. If no match is found after that, then there is no match in the file. Thank you :).
file
955763
957852
976270
bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1
awk
awk 'NR==FNR{A[$1];next}$3 in A' file bigfile > output
desired output (same as bigfile)
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
If there's no difference between a row that matches and one that's close, you could just set all of the keys in the range in the array:
awk 'NR == FNR { for (i = -10; i <= 10; ++i) A[$1+i]; next }
$3 in A' file bigfile > output
The advantage of this approach is that only one lookup is performed per line of the big file.
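To make the trade-off concrete, a sketch of what the preload stores for a single coordinate:
awk 'BEGIN {
    x = 955763
    for (i = -10; i <= 10; ++i) A[x+i]   # same preload as above, one value
    for (k in A) n++
    print n, "keys stored for one coordinate"
}'
# 21 keys stored for one coordinate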
You need to run a loop on array a:
awk 'NR == FNR {
    a[$1]
    next
}
{
    for (i in a)
        # i+0 forces a numeric comparison (array keys are strings);
        # break so a line is printed at most once
        if (i+0 <= $3+10 && i+0 >= $3-10) {
            print
            break
        }
}' file bigfile > output
Your data already produces the desired output (all are exact matches).
$ awk 'NR==FNR { a[$1]; next }
       $3 in a { print; next }
       { for (k in a)
             # (k-$3)^2 <= 10^2 tests the absolute difference: |k-$3| <= 10
             if ((k-$3)^2 <= 10^2) { print $0, " --> within 10 margin"; next } }' file bigfile
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2
chr1 976251 976261 chr1:976251-976261 AGRN-8|gc=57.1 --> within 10 margin
I added a fake 4th row to bigfile to get the margin match.

Process the file by replacing some of the delimiters and filtering the data within

I am trying to process a huge file and need to modify the structure of the data. My file has 117 columns, but to keep it simple, let's assume I have a file with 10 columns.
Example file:
col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
1,2,3,4,5,6,7,8,9,10
I now want to
- include the column name from col6 through col10 with the column values
- and replace the delimiter with '|' from col6 through col10 for the entire file
required output
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
Is this possible? I'm completely new to regex/awk. Can someone help, please?
P.S: Once the data is processed, I'm trying to flush out the zeros from the '|' separated columns...
So, if the data is 1,2,3,4,5,6,0,8,0,10
I would convert it to 1,2,3,4,5,col6:6|col7:0|col8:8|col9:0|col10:10
and then remove the zeros: 1,2,3,4,5,col6:6|col8:8|col10:10
so input: 1,2,3,4,5,6,0,8,0,10
Desired output: 1,2,3,4,5,col6:6|col8:8|col10:10
You can use this awk:
awk -F ', *' 'NR==1 { for (i = 1; i <= NF; i++) hdr[i] = $i; next }
{
    for (i = 1; i <= NF; i++)
        printf "%s%s", ((i>5) ? hdr[i] ":" : "") $i,
                       ((i<NF) ? ((i>5) ? "|" : ",") : ORS)
}' file
Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
hdr is the associative array holding the header column names, filled when NR==1.
Update: as per the comments, OP wants to skip columns with zero value. You can use:
awk -F ', *' 'NR==1 { for (i = 1; i <= NF; i++) hdr[i] = $i; next }
{
    for (i = 1; i <= NF; i++)
        if ($i > 0) printf "%s%s", ((i>5) ? hdr[i] ":" : "") $i,
                                   ((i<NF) ? ((i>5) ? "|" : ",") : ORS)
}' file
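One edge case in the version above: when the last column is 0 it is skipped, so the ORS attached to the last column never prints and the next record joins onto the same line. A sketch that builds the record in a variable first avoids this (assuming the fixed 5-column comma prefix of the example):
awk -F ', *' 'NR==1 { for (i = 1; i <= NF; i++) hdr[i] = $i; next }
{
    out = $1
    for (i = 2; i <= 5; i++) out = out "," $i       # comma part, kept as-is
    piped = ""
    for (i = 6; i <= NF; i++)                       # pipe part: skip zeros
        if ($i != 0) piped = piped (piped == "" ? "" : "|") hdr[i] ":" $i
    print out "," piped                             # newline survives a zero in col10
}' file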
sed '1 {
    h
    s/\([^,]*,\)\{5\}\(\([^,]*,\)\{4\}[^,]*\).*/\2/
    s/,/|/g
    x
    b
}
G
s/\(\([^,]*,\)\{5\}\)\([^,]*,\)\{4\}[^,]*\(.*\)\n\(.*\)/\1\5\4/
' YourFile
POSIX sed version:
assuming there is no , inside a field value
adapt the indexes:
5 is the starting field minus 1 (the group starts at field 6 in this sample)
4 is the number of further fields to catch [last index - start index] (10 - 6 = 4 in this sample)
it needs modification if the caught group starts at field 1 (\{0\} could give unexpected behaviour depending on the sed version)
Principle:
take the subfields from line 1, change the separator, put the result in the hold buffer, then print the original header
for every other line, append the held value to the line, replace the subfield section with the content after the embedded newline (the held value), and print the result

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it into this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n-lines at a time, and only a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF < 3 || /^(#|[[:blank:]]*$)/ { next }
     !a[$1] { b[++k] = $1; a[$1] = $3; next }
     { a[$1] = a[$1] OFS $3 }
     END { for (i = 1; i <= k; i++) print b[i], a[b[i]] }' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/{next}{a[$1] = a[$1] $3 "\t"}END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before the print if you don't like having trailing separators, as sketched below. Another option is to check whether a[$1] already has a value when deciding whether to append to a previous value or not, but that complicates the code a bit.
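For reference, the same one-liner with that trimming applied:
awk '/#/ { next }
     { a[$1] = a[$1] $3 "\t" }
     END { for (i in a) { sub(/\t$/, "", a[i]); print i, a[i] } }' file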
Using bash >= 4.0 (for associative arrays):
declare -A array
while read -r line
do
    if [[ $line && $line != \#* ]]; then
        c=$(echo "$line" | cut -f 1 -d ' ')
        value=$(echo "$line" | cut -f 3 -d ' ')
        array[$c]="${array[$c]} $value"
    fi
done < myFile.txt

for k in "${!array[@]}"
do
    echo "$k ${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
$ awk -v RS= -F'\n' '{ for (i = 2; i <= NF; i++) { split($i, f, /[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3] } }
      END { for (key in map) print key map[key] }' file
a 1234 8765
b 3333 1212
c 1357 9999
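Here RS= puts awk in paragraph mode (records are separated by blank lines) and -F'\n' makes each line of a chunk one field, so the loop starts at i=2 to skip the header line. A quick sketch of how the records split:
printf '#h\na 1 1\nb 2 2\n\n#h\na 4 3\n' |
awk -v RS= -F'\n' '{ print "record " NR " has " NF " lines" }'
# record 1 has 3 lines
# record 2 has 2 lines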