select lines with duplicate columns by a specific value - regex

I have an input like this:
LineA parameter1 parameter2 56
LineB parameter1 parameter2 87
LineB parameter1 parameter2 56
LineB parameter1 parameter2 90
LineC parameter1 parameter2 40
I want to print each line, but if the first column ($1) is duplicated, only print the line with the highest value in the last column ($4).
So the output should look like this:
LineA parameter1 parameter2 56
LineB parameter1 parameter2 90
LineC parameter1 parameter2 40

Try the below (assuming field 4 is always greater than 0; a record whose field 1 appears for the first time with field 4 equal to 0 would otherwise be skipped, because comparing against an unset array element behaves like comparing against 0).
Array b is used to track the highest value in field 4 for each unique value in field 1. Array a (keyed by field 1) contains the corresponding record. As each record is processed, the record is stored in array a and field 4 in array b if:
1. the value in field 1 is encountered for the first time, or
2. the value in field 4 exceeds the value currently held in b for that field-1 value.
Finally, array a is printed out.
awk '$4 > b[$1] {a[$1] = $0; b[$1] = $4}
END{for (x in a) {print a[x]}}'
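If field 4 can be zero or negative, a hedged variant keys the condition on whether field 1 has been seen at all rather than on the comparison alone (a sketch; file is a placeholder input name):
awk '# keep the record on first sighting of $1, or when $4 beats the stored maximum
!($1 in b) || $4 > b[$1] {a[$1] = $0; b[$1] = $4}
END{for (x in a) print a[x]}' file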

Code for GNU awk (SUBSEP, the separator inside the combined array key, is set to OFS so the key prints back out with ordinary field separators; the same >0 assumption on field 4 applies):
awk 'BEGIN {SUBSEP=OFS} $4>a[$1,$2,$3] {a[$1,$2,$3]=$4} END {for (i in a) {print i,a[i]}}' file
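For the sample input this should print the same three lines, though for (i in a) does not guarantee their order:
LineA parameter1 parameter2 56
LineB parameter1 parameter2 90
LineC parameter1 parameter2 40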

Another way in awk (this one assumes lines sharing the same first column are adjacent, as in the sample):
awk '
fld1!=$1 && NR>1 {print line}
fld1==$1 {if ($4>fld4) {line=$0; fld4=$4}; next}
{line=$0;fld1=$1;fld4=$4;next}
END{print line}' file
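If the input is not already grouped by the first column, a pre-sort restores that assumption (a sketch; only the grouping matters, not the sort order itself):
sort -k1,1 file | awk '
fld1!=$1 && NR>1 {print line}
fld1==$1 {if ($4>fld4) {line=$0; fld4=$4}; next}
{line=$0;fld1=$1;fld4=$4;next}
END{print line}'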

Add a condition for a specific row length in a script

I want to modify the following script:
awk 'NR>242 && $1 =='$t' {print $4, "\t" '$t'}' test.txt > file
I want to add a condition for the first "1 to 121" data (corresponding to the first 121 points) and then for the "122 to 242" data (which corresponds to the other 121 points).
so it becomes:
when NR>242, take the corresponding values of rows from 1 to 121 and print them to file1
when NR>242, take the corresponding values of rows from 122 to 242 and print them to file2
Thanks!
Generic solution: here is a more generic approach, where you can give all the split-point line numbers in the lines variable of the awk program. Each time the line number matches one of those values, the file counter increases by 1, e.g. from file1 to file2, or file2 to file3, and so on.
awk -v val="$t" -v lines="121,242" -v count=1 '
BEGIN{
  num=split(lines,arr,",")
  for(i=1;i<=num;i++){
    line[arr[i]]                   # build a set of split-point line numbers
  }
  outputfile="file" count
}
(FNR in line){
  close(outputfile)                # reached a split point: move on to the next file
  outputfile="file" ++count
}
($1 == val){
  print $4 "\t" val > (outputfile)
}
' Input_file
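A quick check of the boundary behaviour (a self-contained sketch reusing the same increment rule; note that a split-point line itself already lands in the next file):
awk -v lines="121,242" -v count=1 'BEGIN{
  num=split(lines,arr,","); for(i=1;i<=num;i++) line[arr[i]]
  for (nr=120; nr<=243; nr++) {
    if (nr in line) count++
    if (nr in line || nr==120 || nr==243) print nr, "-> file" count
  }
}'
This prints 120 -> file1, 121 -> file2, 242 -> file3, 243 -> file3.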
With your shown samples, please try the following. It will print all lines from the 1st to the 242nd into file1, and from line 243 onwards into file2. The shell variable t is passed into the awk variable val.
awk -v val="$t" '
FNR==1{
outputfile="file1"
}
FNR==243{
outputfile="file2"
}
($1 == val){
print $4 "\t" val > (outputfile)
}
' Input_file
$ awk -v val="$t" '{c=int((NR-1)%242/121)+1}
$1==val {print $4 "\t" $1 > ("output" c)}' file
This should take the first, third, etc. blocks of 121 records to output1 and the second, fourth, etc. blocks of 121 records to output2, if they satisfy the condition.
If you want to skip the first two blocks (the first 242 records), just add && NR>242 to the existing condition.
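The block arithmetic can be sanity-checked in isolation (a sketch printing the first record of each 121-record block and the output index it maps to):
awk 'BEGIN{for (n=1; n<=364; n+=121) print n, int((n-1)%242/121)+1}'
This prints 1 1, 122 2, 243 1, 364 2, confirming that the blocks alternate between the two outputs.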

bash: extraction of substrings by pattern, empty fields and multiple occurrences

I would like to extract the Pfam_A information from each line of a file:
item_1 ID=HJNANFJJ_180142;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_180142;partial=01;product=unannotated protein;KEGG=K03531
item_4 ID=HJNANFJJ_87662;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_87662;partial=10;product=unannotated protein;KEGG=K15725;Pfam_A=OEP;Resfams=adeC-adeK-oprM
item_8 ID=HJNANFJJ_328505;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_328505;partial=11;product=unannotated protein;KEGG=K03578;Pfam_A=OB_NTP_bind
item_2 ID=HJNANFJJ_512995;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_512995;partial=11;product=unannotated protein;KEGG=K00674;Pfam_A=Hexapep;Pfam_A=Hexapep_2;metacyc=TETHYDPICSUCC-RXN
item_0 ID=HJNANFJJ_188729;inference=ab initio prediction:Prodigal_v2.6.3;locus_tag=HJNANFJJ_188729;partial=11;product=unannotated protein
In some lines this information is missing altogether; in others there can be multiple occurrences.
Finally, I want to get a table like this, where empty fields are replaced by NaN and multiple occurrences are put, tab-separated, into different fields:
item_1 NaN
item_4 OEP
item_8 OB_NTP_bind
item_2 Hexapep Hexapep_2
item_0 NaN
You may use this awk:
awk -v OFS='\t' 'NF > 1 {
s = ""
n = split($NF, a, /;/)                    # split the annotation field on ";"
for (i=1; i<=n; i++)
if (split(a[i], b, /=/) == 2 && b[1] == "Pfam_A")
s = s OFS b[2]                        # collect every Pfam_A value
print $1 (s ? s : OFS "NaN")              # fall back to NaN when none was found
}' file
item_1 NaN
item_4 OEP
item_8 OB_NTP_bind
item_2 Hexapep Hexapep_2
item_0 NaN
A quick and dirty way would be:
awk '{ s=$0;t="";
while (match(s,"Pfam_A=[^;]*")) {
t = t (t?OFS:"") substr(s,RSTART+7,RLENGTH-7);
s = substr(s,RSTART+RLENGTH)
}
}{print $1, (t?t:"NaN")}' file
Assuming that each input line contains no ; characters other than those separating the data fields, and no tab characters except where they delimit the first column, a simple sed command can do the job:
sed -E 's/\s+/;/; s/;Pfam_A=/;\t/g; s/;[^\t]*//g; /\t/!s/$/\tNaN/' file
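Note that \s is a GNU extension; spelled with the POSIX character class, the same command reads (a sketch, and the \t escapes themselves still rely on GNU sed):
sed -E 's/[[:space:]]+/;/; s/;Pfam_A=/;\t/g; s/;[^\t]*//g; /\t/!s/$/\tNaN/' file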

Process the file by replacing some of the delimiters and filtering the data within

I am trying to process a huge file and need to modify the structure of the data. My file has 117 columns, but to keep it simple, let's assume that I have a file with 10 columns.
Example file:
col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
1,2,3,4,5,6,7,8,9,10
I now want to
- include the column name from col6 through col10 with the column values
- and replace the delimiter with '|' from col6 through col10 for the entire file
Required output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
Is this possible? I'm completely new to regex/awk. Can someone help, please?
P.S.: Once the data is processed, I'm trying to filter out the zeros from the '|'-separated columns...
So, if the data is 1,2,3,4,5,6,0,8,0,10
I would convert it to 1,2,3,4,5,col6:6|col7:0|col8:8|col9:0|col10:10
and then remove the zeros: 1,2,3,4,5,col6:6|col8:8|col10:10
so input: 1,2,3,4,5,6,0,8,0,10
Desired output: 1,2,3,4,5,col6:6|col8:8|col10:10
You can use this awk:
awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
{for (i=1; i<=NF; i++) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
((i<NF)? ((i>5)?"|":",") : ORS)}' file
Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
hdr is the associative array holding the header column names captured when NR==1.
Update: as per the comments, the OP wants to skip columns with a zero value. You can use:
awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
{for (i=1; i<=NF; i++) if ($i>0) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
((i<NF)? ((i>5)?"|":",") : ORS)}' file
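With the zero-containing sample row from the question (1,2,3,4,5,6,0,8,0,10) this should produce:
1,2,3,4,5,col6:6|col8:8|col10:10
(One caveat: a zero in the last column would also suppress the record terminator ORS, so this relies on the final column being non-zero.)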
sed '1 {h
s/\([^,]*,\)\{5\}\(\([^,]*,\)\{4\}[^,]*\).*/\2/
s/,/|/g
x
b
}
G
s/\(\([^,]*,\)\{5\}\)\([^,]*,\)\{4\}[^,]*\(.*\)\n\(.*\)/\1\5\4/
' YourFile
POSIX sed version, assuming there is no , inside field values.
Adapt the indexes:
5 is the starting field minus 1 (the fields start at 6 in this sample)
4 is the number of additional fields to capture [last index - start index] (10 - 6 = 4 in this sample)
a modification is needed if the captured fields start at field 1 (\{0\} could give unexpected behaviour depending on the sed version)
Principle:
take the sub-fields from line 1, change their separator and put the result in the hold buffer, then print the original header
for every other line, append the held value, replace the sub-field section with the content that follows the newline (i.e. the appended value), and print the result

grep, cut, sed, awk a file for 3rd column, n lines at a time, then paste into repeated columns of n rows?

I have a file of the form:
#some header text
a 1 1234
b 2 3333
c 2 1357
#some header text
a 4 8765
b 1 1212
c 7 9999
...
with repeated data in n-row chunks separated by a blank line (with possibly some other header text). I'm only interested in the third column, and would like to do some grep, cut, awk, sed, paste magic to turn it into this:
a 1234 8765 ...
b 3333 1212
c 1357 9999
where the third column of each subsequent n-row chunk is tacked on as a new column. I guess you could call it a transpose, just n lines at a time, and only for a specific column. The leading (a b c) column label isn't essential... I'd be happy if I could just grab the data in the third column.
Is this even possible? It must be. I can get things chopped down to only the interesting columns with grep and cut:
cat myfile | grep -A2 ^a\ | cut -c13-15
but I can't figure out how to take these n-row chunks and sed/paste/whatever them into repeated n-row columns.
Any ideas?
This awk does the job:
awk 'NF<3 || /^(#|[[:blank:]]*$)/{next} !a[$1]{b[++k]=$1; a[$1]=$3; next}
{a[$1] = a[$1] OFS $3} END{for(i=1; i<=k; i++) print b[i], a[b[i]]}' file
a 1234 8765
b 3333 1212
c 1357 9999
awk '/#/ || NF<3 {next} {a[$1] = a[$1] $3 "\t"} END{for(i in a){print i, a[i]}}' file
Would produce
a 1234 8765
b 3333 1212
c 1357 9999
You can change "\t" to a different output separator like " " if you like.
sub(/\t$/, "", a[i]); may be inserted before the print if you don't like having a trailing tab. Another solution is to check whether a[$1] already has a value when deciding whether to append a separator; it complicates the code a bit, though.
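Putting those suggestions together (a sketch that appends the separator only between values, and also skips blank lines, which would otherwise create an entry under an empty key):
awk '/^#/ || NF<3 {next}
{a[$1] = ($1 in a) ? a[$1] "\t" $3 : $3}
END{for (i in a) print i, a[i]}' file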
Using bash >= 4.0 (for associative arrays):
declare -A array
while read -r line
do
if [[ $line && $line != \#* ]]; then
c=$(echo "$line" | cut -f 1 -d ' ')
value=$(echo "$line" | cut -f 3 -d ' ')
array[$c]="${array[$c]} $value"
fi
done < myFile.txt
for k in "${!array[@]}"
do
echo "$k ${array[$k]}"
done
Will produce:
a 1234 8765
b 3333 1212
c 1357 9999
It stores the letter as the key of the associative array and, in each iteration, appends the corresponding value to it.
Setting RS to the empty string puts awk into paragraph mode (blank-line-separated records), and -F'\n' makes each line of a block a field; the loop starts at 2 to skip each block's header line.
$ awk -v RS= -F'\n' '{ for (i=2;i<=NF;i++) {split($i,f,/[[:space:]]+/); map[f[1]] = map[f[1]] " " f[3]} } END{ for (key in map) print key map[key]}' file
a 1234 8765
b 3333 1212
c 1357 9999

joining lines with a specific pattern in a text file

I'm trying to join rows based on the first value in a row. My file looks like this:
the structure is: ID, KEY, VALUE
1 1 Joe
1 2 Smith
1 3 30
2 2 Doe
2 1 John
2 3 20
The KEY stands for some kind of attribute of the ID, in this case KEY '1' is first name, '2' is surname and '3' is age.
The output should look like this:
1 Joe Smith 30
2 John Doe 20
I know that this can be done by fairly simple awk script, but I'm having trouble finding it on SO or with Google.
{
a[$1,$2]=$3                       # store VALUE under the (ID, KEY) pair
if ($1>m) {m=$1}                  # track the highest ID seen
}
END {
for(i=1;i<=m;i++)
{
j=1
printf "%s ", i
while (a[i,j] != "")
{
printf "%s ", a[i,j]              # print the values in KEY order
j++
}
printf "\n"
}
}
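This is a bare awk program body; saved in a file, say join.awk (a hypothetical name), it would be run as:
awk -f join.awk file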
This awk command should work, assuming each ID occupies exactly three consecutive rows as in the sample:
awk '$2==1{fn=$3} $2==2{ln=$3} $2==3{age=$3} NR>1 && NR%3==0 {print $1,fn,ln,age}' file
One way with awk, pre-sorting numerically on ID then KEY so the values are appended in order (length() on an array requires gawk or a similarly modern awk):
awk '{a[$1]=(a[$1])?a[$1]FS$3:$3}END{for(;x<length(a);)print ++x,a[x]}' <(sort -k1,1n -k2,2n file)