if statement within awk script - if-statement

I have two files: $wrkfile and $mapfile. I need to look at $5 in $wrkfile, and if $5 == "2R" I need to pull the new rate from $2 in $mapfile and write that into $5 in $wrkfile. If $5 != "2R", do nothing and move on to the next line.
Below is an example of $wrkfile, $mapfile, and $expected. I have also included the awk script I am using that is failing. Any help would be greatly appreciated.
awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{ if ( $5 == "2R" ) {print($1,$2,$3,$4,a[$1],$6} else
{print($1,$2,$3,$4,$5,$6} }' OFS="|" "$mapfile" "$wrkfile" > "$output"
$wrkfile
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|2R|000000050.00|
12345678912|C|01|A|2R|000000050.00|
12345678912|C|01|A|2R|000000050.00|
12345678912|C|01|A|2R|000000050.00|
12345678912|C|01|A|2R|000000050.00|
$mapfile
12345678912|9.00|
12345678914|10.00|
12345678993|11.00|
12345678983|12.00|
12345678963|13.00|
12345678917|14.00|
$expected
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|

Given your new input and output files:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { map[$1] = $2; next }
$5=="2R" { $6 = sprintf("%0*.2f",length($6),map[$1]) }
{ print }
$ awk -f tst.awk mapfile wrkfile
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|01|000000050.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
12345678912|C|01|A|2R|000000009.00|
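One case the script above leaves open is an ID in wrkfile that has no entry in mapfile: referencing map[$1] for a missing key yields an empty string, which %f formats as 0.00, so the rate would silently become zero. Below is a minimal sketch of a guarded variant, assuming such rows should be left unchanged. The function name remap_rates is hypothetical, and the field width is built by string concatenation in case the awk at hand does not support the * width in printf formats.

```shell
# Sketch only: rewrite $6 for "2R" rows, but only when the ID exists in the map.
# remap_rates is a hypothetical helper name, not part of the original answer.
remap_rates() {
    # $1 = map file, $2 = work file (both pipe-delimited)
    awk '
        BEGIN { FS = OFS = "|" }
        NR == FNR { map[$1] = $2; next }      # first file: load ID -> rate
        $5 == "2R" && ($1 in map) {           # rows with unmapped IDs stay as they are
            # build a format such as %012.2f from the current width of $6
            $6 = sprintf("%0" length($6) ".2f", map[$1])
        }
        { print }
    ' "$1" "$2"
}
```

It would be called as remap_rates "$mapfile" "$wrkfile" > "$output". Whether unmapped IDs should be skipped, reported, or zeroed is a decision for the data owner; skipping is only the assumption made here.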

I modified some input entries to get some matches; your code can be changed as follows:
$ awk -F"|" -v OFS="|" '
NR==FNR{a[$1]=$2;next}
$3~/2R/{$6=a[$1]}
1' mapfile wrkfile
EMPLID |TYPE1|REC|TYPE2|CODE|RATE |
12345678912|C |01 |A |01 |000000.50.00|
12345678912|C |01 |A |01 |000000.50.00|
12345678912|C |01 |A |01 |000000.50.00|
12345678912|C |2R |A |01 |9.00 |
12345678912|C |2R |A |01 |9.00 |
12345678912|C |2R |A |01 |9.00 |
12345678912|C |2R |A |01 |9.00 |
12345678912|C |2R |A |01 |9.00 |
Formatting the last column is possible, but I'm not sure what the double decimal points mean.

Related

Parse default Salt highstate output

Parsing the highstate output of Salt has proven to be difficult. I don't want to change the output to JSON because I still want it to be human-readable.
What's the best way to convert the Summary into something machine readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea I'm trying to grep and awk the output and insert it into a csv.
These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails, even though it works in Reger:
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: @vintnes @ikegami I agree, I'd much rather parse the JSON output, but Salt doesn't offer a summary of changes when outputting to JSON. So far this is what I have, and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0
Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
BEGIN { print "instance,unchanged,changed,warning,failed" }
/^Summary/ { instance=$NF }
/^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
/^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" Succeeded: ### and using any number of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.
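To make the split() step concrete, here is a small sketch that applies the same call to a single Succeeded line from the question. For that line, $3 $4 $5 concatenates to "(unchanged=12,changed=6)"; the leading "(unchanged=" acts as a separator, so S[1] is empty and the two counts land in S[2] and S[3]. The function name succeeded_counts is hypothetical.

```shell
# Sketch only: show what split() extracts from a "Succeeded:" line.
# succeeded_counts is a hypothetical helper name.
succeeded_counts() {
    # stdin: Salt summary text; stdout: "unchanged,changed" per Succeeded line
    awk '/^Succeeded/ {
        split($3 $4 $5, S, /[^0-9]+/)   # S[1] is the empty text before the first digit run
        print S[2] "," S[3]
    }'
}

printf 'Succeeded: 278 (unchanged=12, changed=6)\n' | succeeded_counts
```

For the sample line this prints 12,6.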
perl -e'
use strict;
use warnings qw( all );
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
my ( $instance, $unchanged, $changed, $warning, $failed );
while (<>) {
if (/^Summary for (\S+)/) {
( $instance, $unchanged, $changed, $warning, $failed ) = $1;
}
elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
( $unchanged, $changed ) = ( $1, $2 );
}
elsif (/^Warning:\s+(\d+)/) {
$warning = $1;
}
elsif (/^Failed:\s+(\d+)/) {
$failed = $1;
$csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
}
}
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
Terse version:
perl -MText::CSV_XS -ne'
BEGIN {
$csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
$csv->say(select(), [qw( instance unchanged changed warning failed )]);
}
/^Summary for (\S+)/ and @row=$1;
/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and @row[1,2]=($1,$2);
/^Warning:\s+(\d+)/ and $row[3]=$1;
/^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \@row);
'
Improving on the answer from @vintnes.
Producing output as tab separated CSV
Write awk script that reads values from lines by their order.
Print each record as it is read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
run script
Tab separated CSV output
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma separated CSV output
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract the arr[1] value from 3rd field in (first line from 8 lines)
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract the arr[2,3,4] values from 2nd,3rd,4th fields in (third line from 8 lines)
FNR%8 == 4 {arr[5] = $2;}
Extract the arr[5] value from 2nd field in (4th line from 8 lines)
FNR%8 == 6 {arr[6] = $4;}
Extract the arr[6] value from 4th field in (6th line from 8 lines)
FNR%8 == 7 {arr[7] = $4;
Extract the arr[7] value from 4th field in (7th line from 8 lines)
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
print the array elements for the extracted variable at the completion of reading 7th line from 8 lines.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract numbers from text field.
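Note that the three-argument form of match() used in extractNum is a gawk extension. The following is a sketch of a portable equivalent that relies on RSTART and RLENGTH, which awk sets after a successful two-argument match(). The shell wrapper name extract_num is hypothetical and only exists to make the function easy to try on its own.

```shell
# Sketch only: portable replacement for the gawk-specific extractNum.
# extract_num is a hypothetical wrapper that applies it to each input line.
extract_num() {
    awk '
        function extractNum(str) {
            # match() sets RSTART/RLENGTH to the first run of digits, if any
            if (match(str, /[0-9]+/)) return substr(str, RSTART, RLENGTH)
            return ""
        }
        { print extractNum($0) }
    '
}

printf '(unchanged=12,\n' | extract_num
```

For the sample field this prints 12. The function body can be dropped into script.awk in place of the original definition.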

Process FASTA file to output header and all matching subsequences

I need to search a FASTA file for certain regions of the DNA sequence. For each match, I need to print the sequence header followed by all matches within that sequence. I want to print the header once followed by the matching sections.
The output of the following code is close, but the matched regions of DNA are printed above their header instead of beneath it. I cannot flip the two blocks of code because that cuts off the first results.
# First, I open my file and print a warning if it fails.
unless ( open FILE, "<", '/scratch/SampleDataFiles/test.fasta' ) {
die "Sorry", $!;
}
$/ = ">"; # This changes the record separator from \n to >, so I can chomp it later.
my @file = <FILE>;
my $file = "@file";
chomp $file;
# To view the file I can--
# print $file;
my $count = 0; # here I will count the matched regions
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
foreach $file ( @file ) {
# I look for each header and its following sequence
# And count the total sequences in the file
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now, I use the sequences I matched and search for a
# hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I want to know what the position of the match is
my $pos = pos( $sequence ) - 7;
print "\n", $1, " found at ", $pos;
}
# I use the count variable I made earlier to count up each
# time I match a sequence that has one or more hydrophobic region
if ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
print "\n",
"Hydrophobic region(s) found in ",
$head,
"\n",
"-------------------------------------",
"\n";
$count = $count + 1;
}
}
}
print "Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
This is the output:
AVVAAVMW found at 325
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
VAVLMLCL found at 170
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
-------------------------------------
Hydrophobic region(s) found in 2 out of 15 sequences.
This is the output I get if I switch them:
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in 2 out of 15 sequences.
Per my teacher's recommendation, I have adjusted my code as follows so that I include everything within the larger while loop and restrict the number of prints with a counter. This new code prints each new header one time, and below it prints each instance of a found region of DNA (essentially flipping what I had before).
New code:
my $count = 0; # here I will count the matched regions
my $temp_count = 0; # this I will use temporarily to count
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now I use the sequences that I found, and
# search them for a hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I use the count variables I made earlier
# I count all times I match a sequence that has one or more hydrophobic region
$temp_count = $temp_count + 1;
# But I don't want the header repeated for the same sequence, so I limit the
# times that it can print
if ( $temp_count <= 2 ) {
print "\n", "Hydrophobic region(s) found in ", $head, "\n";
$count = $count + 1;
}
# I want to know what the position of the match is
# within the sequence
my $pos = pos( $sequence ) - 7;
print $1, " found at ", $pos, "\n", "\n";
}
}
}
print "\n",
"\n",
"-------------------------",
"\n",
"Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
If useful, here is what the file looks like:
>P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAB;
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRL
GLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 | Name=YWHAE;
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIR
LGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAH; Synonyms=YWHA1;
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHP
IRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRK
WETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRK
GGSYSQAASSDSAQGSDMSLTACKV
>Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 | Name=ATXN8;
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 | Name=AURKC; Synonyms=AIE2, AIK3, ARK3, STK13;
MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVALKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLDEQRTATIIEELADALTYCHDKKVIHRDIKPE
NLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPPEMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQPLERLPLAQILKHPWVQAHSRRVLPPCAQMAS
>O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 | Name=AVIL;
MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQDEQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHVKGKRNIRATEVEMSWDSFNRGDVFLLDLGKV
IIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVIEGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLLNHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFK
QLFQKWSVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWYGFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRHFMAIFKGKLVIFEGGTSRKGNAEPDP
PVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYLWYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECSNKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLH
THPSGRDPDTPILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYPIAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF
>Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 | Name=BSN; Synonyms=KIAA0434, ZNF231;
MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPGPGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSGRSPSVSPDRGSTPTSPYSVPQIAPLPSSTLC
PICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPHLTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSRQAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATE
QTQEGLTGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTEPGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNTCTTCRLQVCNLCGFNPTPHLVEKTEW
LCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPLKQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKPPPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSL
SDTGYSSDGISSSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQRPHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQPPARAAELTDEDFMRRQILEM
SAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLPHNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESLTGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAK
QQRKARHRSHGPLLPTIEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASPTEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTTPSSGRPLKSAEEAYEE
MMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQPAAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGTSPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSP
ASPSSASKEIGMPFSQGPGTPATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAREKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPRSPGAPTPSPMV
AQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQPTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPGPRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLAL
DPIPGRQSTAVQPLVINLNAQEHTFLATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPREAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVVVQMGEG
TAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAPPGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIGQLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYG
SVTDLRHPTDLLAHPLPMRRYSSVSNIYSDHRYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQSLVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPITT
QPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEGPVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSGALSRPGFEKEEASQEERQRKQQEQLLQLERERV
ELEKLRQLRLQEELERERVELQRHREEEQLLVQRELQELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPGRGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASR
SGIKKRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWEQPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITAATDPKVEIVRYISAPEKTGRGESLACQT
EPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKKKPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTADSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLV
RKVKRTLPSPPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEEEKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPATAAAPATPSGPTAFQQPRFQPPA
PQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGLPNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAPSETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRN
YVMIDDISELTKDSTSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGRPASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVESDLASYPPPAVSSSLVSR
GRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSGEKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGPLPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSD
SEAYHLGQEETDWFDKPRDARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHTGEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQAHSGPAALQSKAE
PQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPGPGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGK
LTEAVSAFGKKFSSFW
>Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 | Name=BRWD1; Synonyms=C21orf107, WDR9;
MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSNKHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEMPVNYGSPPNLVEIHRGKQLTGCSTFSTAFPG
TMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDCLVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQFSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLA
TGSTDHVIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISGDLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIMLSAGHDGSIFIWDITKGTKMKHYFNM
IEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEKIPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQLGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRG
LSNGEETPRRGFRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLEKGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHETSCDQSEGSGSSEEDEWRSDRK
SESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDEISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIYFRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLC
CLKLAFIDPATGKLMDKSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYIVRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGIDQLLNLDIAAAFAGPVD
LCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPESVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATNYVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDF
GTVRETLDAGNYDSPLEFCKDIRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQPNKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNRPITNGSTLSES
EVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVSLANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCSGRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPL
PVSSSHTAQSNVDESENRDSESESDLRVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEPGRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHGCETDLD
SDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAATAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPDSEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDA
PSKRKSSSVTSSGEDSKSHIPGSETDRTFSSESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKISETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHNQ
KDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKKKLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ
>Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 | Name=BTBD6; Synonyms=BDPL;
MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLRERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLILLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCWEVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLGRALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAYRSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFPVWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFYA
>P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 | Name=BZW1L1;
MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSMDNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKEEMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYCYDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEGD
>Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 | Name=CCDC144C;
MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALDQPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPDWHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDCDREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRGTDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHLHENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNNYKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKHRSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEECGKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISHRHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETILQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEWVRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKETEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAESEKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSMKAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAAVSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELSRTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYLLKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI
>A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKPPAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDTDIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENFFCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAVLSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFLLDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHFLVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSSVCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFMGTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLFQRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFEQLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQEGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHGSLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVVLGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNYCISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASFFFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSPSFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPSPKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDSVGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQPDLPDVWLRRPSTHTSGYSS
>Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 | Name=DIRAS2;
MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPAMQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEALARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM
>Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 | Name=DNAJC22;
MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPRGVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAFLTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATLSYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQLAYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR
Desired Output
Main Question
I just feel like my original code, although upside down, was more reliable because I didn't have to tell it how many times to print the header, it just looked for and printed only unique headers on its own. Is there a better way to print only new instances of headers but to print all matches of the desired sequence that follows? I could not find a way to specify print only unique matches, and I was uncertain about trying to send all headers and matched regions into a hash (and I have no idea how to do that).
Here's what I would write.
I've made several significant changes
I use chomp to remove the > from the start of the header of the next sequence, and then check that there are still some non-space characters left. The very first record read will be just >, so this discards the empty record
I've removed the semicolons from your regex pattern, as they don't make much difference, and added some optional whitespace to trim any leading and trailing whitespace on the header
I've removed the non-greedy modifier ? from [VILMFWCA]{8,} as I didn't see a reason for it. Maybe I'm wrong. I've also changed it so that all matching sequences will be found even if they overlap. Again, maybe that's a bad call: I'm not a bioinformatician!
Calculating the position of each region as pos() - 7 is wrong as it depends on the length of the match. I've used the built-in array @- instead. $-[1] contains the position of the start of capture $1, $-[2] is for $2, etc., and $-[0] is the position of the entire match
I keep the matching region and its start position in array @regions. When the search is finished I can test whether any were found by checking the size of that array
use strict;
use warnings 'all';
my $FASTA_FILE = 'test.fasta';
open my $fh, '<', $FASTA_FILE or die qq{Unable to open "$FASTA_FILE" for input: $!};
local $/ = '>';
while ( <$fh> ) {
chomp;
next unless /\S/;
next unless my ( $head, $seq ) = /\s*(.*\S)\s*\n(\w+)/;
my @regions;
while ( $seq =~ / (?= ( [VILMFWCA]{8,} ) ) /gxi ) {
push @regions, [ $1, $-[1] ];
}
next unless @regions; # Skip this sequence if no regions found
printf "%d Hydrophobic region%s found in %s\n",
scalar @regions,
@regions == 1 ? "" : "s",
$head;
printf " %s found at %d\n", @$_ for @regions;
}
output
14 Hydrophobic regions found in A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742; DGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQP
VAVLMLCLAVIFLC found at 88
AVLMLCLAVIFLC found at 89
VLMLCLAVIFLC found at 90
LMLCLAVIFLC found at 91
MLCLAVIFLC found at 92
LCLAVIFLC found at 93
CLAVIFLC found at 94
LLALVAIFF found at 411
LALVAIFF found at 412
IWICWFAALAA found at 623
WICWFAALAA found at 624
ICWFAALAA found at 625
CWFAALAA found at 626
LALALAFA found at 888
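The point about match length can also be seen in a small awk sketch, given here as an alternative technique rather than a translation of the Perl above. It assumes one sequence per input line, reports only maximal, non-overlapping runs of at least 8 hydrophobic residues (unlike the overlapping lookahead matches shown above), and prints 1-based start positions, whereas @- is 0-based. The start is taken from where the match begins, so it stays correct whatever the run length. The function name hydrophobic_runs is hypothetical.

```shell
# Sketch only: find maximal runs of >= 8 hydrophobic residues with awk.
# hydrophobic_runs is a hypothetical helper; positions are 1-based.
hydrophobic_runs() {
    # stdin: one sequence per line; stdout: "<run> <start>" for each run
    awk '{
        seq = $0; offset = 0
        while (match(seq, /[VILMFWCAvilmfwca]+/)) {
            if (RLENGTH >= 8)
                print substr(seq, RSTART, RLENGTH), offset + RSTART
            offset += RSTART + RLENGTH - 1     # characters consumed so far
            seq = substr(seq, RSTART + RLENGTH)
        }
    }'
}

printf 'GGGAVVAAVMWRRKLL\n' | hydrophobic_runs
```

For the sample line this prints AVVAAVMW 4: the run starts at the fourth residue, and the short LL run near the end is ignored.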
What I did was remove the global modifier from sequence match if statement regex, but leave the modifier after the while statement regex. This way I am not losing the first matches like I was before, but I am still able to print the header before the sequence matches.
unless (open FILE, "<", '/scratch/SampleDataFiles/test.fasta') {
die "Cannot Open File", $!;
}
$/ = ">";
my @file = <FILE>;
my $file = "@file";
chomp $file;
my $count = 0;
my $sequence_count = 0;
foreach $file (@file) {
if ($file =~ /(.*;.*;?\n)(\w+)/) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count +1;
if ($sequence =~ /([VILMFWCA]{8,}?)/i) {
print "\n", "Hydrophobic region(s) found in ", $head, "\n";
$count = $count +1;
}
while ($sequence =~ /([VILMFWCA]{8,}?)/gi) {
my $pos = pos($sequence)-7;
print $1, " found at ", $pos, "\n", "\n";
}
}
}
print "\n", "\n", "-------------------------", "\n", "Hydrophobic region(s)
found in ", $count, " out of ", $sequence_count , " sequences.", "\n", "\n";
close FILE;
Thanks for your help, guys!

Dynamic pattern for matching incorrect characters in egrep

I have the following lines in files:
UserParameter=cassandra.status[*], curl -s "http://$1:$2/server-status?auto" | grep -e $3 | awk '{ print $$2 }'
UserParameter=ping.status[*],curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
and so on.
I want to display every line where the comma after [*] is missing or where there are any extra characters besides the comma.
For example:
UserParameter=ping.status[*],,,curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
UserParameter=ping.status[*] curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
UserParameter=ping.status[*],;!curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
will be printed, because there are extra characters or spaces besides a single comma.
But:
UserParameter=ping.status[*],curl -s --retry 3 --max-time 3 'http://localhost:1111/engines?$1' | awk '/last_seen = / {split($$1, a, "/"); print a[2]}; END { if (!NR) print "NO_MATCHING_ENGINES" }' | tr "\n" "
will not be printed, because there is a single comma after [*].
I tried to write a pattern for egrep, but it does not cover all cases, for example when some character other than a comma follows [*]:
egrep (\[\*\].(|;|:|,|\.|))
I'll appreciate any help! Thank you!
grep -vE '\[\*\],[$/[:alpha:] ]' input
Do not print lines that match the pattern: [*], followed by any of: $, /, alphabetic character, or a space.
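One caveat with the inverted match above: -v also prints lines that contain no [*] at all. A minimal sketch of one way to handle that, assuming the file also holds other settings, is to keep only the lines containing [*] first. The function name find_bad_params and the sample keys and URLs are made up for illustration:

```shell
# Sketch: list lines whose "[*]" is NOT followed by a single comma and then
# one of: $, /, a letter, or a space.
find_bad_params() {
  # grep -F keeps only lines containing the literal string "[*]";
  # grep -vE then drops the well-formed ones.
  grep -F '[*]' "$1" | grep -vE '\[\*\],[$/[:alpha:] ]'
}

# Hypothetical sample data, not taken from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
UserParameter=ok.one[*],curl -s http://localhost/a
UserParameter=ok.two[*], curl -s http://localhost/b
UserParameter=bad.one[*],,,curl -s http://localhost/c
UserParameter=bad.two[*] curl -s http://localhost/d
UserParameter=bad.three[*],;!curl -s http://localhost/e
Timeout=3
EOF

find_bad_params "$sample"
```

This should print only the three bad.* lines; the Timeout=3 line is filtered out by the first grep because it has no [*].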

Unable to display multiple captures from same group of each line in powershell

The PowerShell script works correctly if the regex matches only one entry per line in the master file, but if more than one entry matches on a line, the output is not what I expect. I am using this:
$FILE_RE = '^[^\:]*[^\.txt:]'
$TEXT_RE = '"(.*?)"'
$TEXT = (Get-Content .\masterfile.txt).ToUpper() |
ForEach-Object {
New-Object psobject -Property @{
fileName = [regex]::Matches($_, $FILE_RE).Value
Value = [regex]::Matches($_, $TEXT_RE).value
}
}
$TEXT | Select-Object fileName , Value | Sort-Object * -Unique |
Export-Csv TEXT.CSV -NoTypeInformation
The data of the file and corresponding regex can be found here.
The Value column of the output contains System.Object[], which holds 2 values. How can I display every entry inside System.Object[] on its own row, with its corresponding file name in the first column?
The intended output is:-
fileName Value
FILE1.TXT VALUE1
FILE1.TXT VALUE2
FILE2.TXT VALUE3
FILE2.TXT TEST STRING1
FILE3.TXT VALUE4
FILE3.TXT 3456789
FILE4.TXT VALUE5
FILE4.TXT TEXT1
FILE5.TXT VALUE6
FILE5.TXT LOREM IPSUM
FILE6.TXT VALUE7
Since $TEXT_RE can match several times per line, you need to store the file name and iterate over the matches.
I removed the intermediate $TEXT variable and had to trim the surrounding " from each value.
$FILE_RE = '^[^:]+'
$TEXT_RE = '"([^ ].+?)"'
(Get-Content .\masterfile.txt).ToUpper() | ForEach-Object {
$File = [regex]::Matches($_, $FILE_RE).Value
ForEach ($Value in ([regex]::Matches($_, $TEXT_RE).value)) {
New-Object psobject -Property @{
fileName = $File
Value = $Value.Trim('"')
}
}
} | Select-Object fileName , Value | Sort-Object * -Unique |
Export-Csv .\TEXT.CSV -NoTypeInformation
Import-Csv .\TEXT.CSV
Sample Output:
fileName Value
-------- -----
FILE1.TXT VALUE1
FILE1.TXT VALUE2
FILE2.TXT TEST STRING1
FILE2.TXT VALUE3
FILE3.TXT 3456789
FILE3.TXT VALUE4
FILE4.TXT TEXT1
FILE4.TXT VALUE5
FILE5.TXT LOREM IPSUM
FILE5.TXT VALUE6
FILE6.TXT VALUE7

use grep to extract multiple values from one line

file:
timestamp1 KKIE ABC=123 [5454] GHI=547 JKL=877 MNO=878
timestamp2 GGHI ABC=544 [ 24548] GHI=883 JKL=587 MNO=874
timestamp3 GGGIO ABC=877 [3487] GHI=77422 JKL=877 MNO=877
timestamp4 GGDI ABC=269 [ 1896] GHI=887 JKL=877 MNO=123
Note: there is sometimes a space between '[' and the next digit.
When JKL=877, I want timestampx, ABC and GHI.
solution 1:
timestamp1 ABC=123 GHI=547
timestamp3 ABC=877 GHI=77422
timestamp4 ABC=269 GHI=887
solution 2 (the best one):
TIMESTAMP ABC GHI
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
I know how to get these values individually, but not all of them at once.
A. solution 1:
grep JKL=877 file | awk '{print $1}'
grep JKL=877 file | grep -o '.ABC=[0-9]\{3\}'
grep JKL=877 file | grep -o '.GHI=[0-9]\{3,5\}'
without the '[' issue, I would do:
grep JKL=877 file | awk '{print $1,$3,$5}'
B. for solution 2:
grep JKL=877 file | grep -o '.ABC=[0-9]\{3\}' | tr 'ABC=' ' ' | awk '{print $1}'
(I use awk to remove the space created by tr function)
grep JKL=877 file | grep -o '.GHI=[0-9]\{3,5\}' | tr 'GHI=' ' ' | awk '{print $1}'
without the '[' issue, I would do:
printf "TIMESTAMP ABC GHI\n";
awk '{print $1,$3,$5}' file | tr 'ABC=' ' ' | tr 'GHI=' ' '
C. Now, to get them all at once, I was thinking of a loop and putting the matches in a variable (see https://unix.stackexchange.com/questions/37313/how-do-i-grep-for-multiple-patterns):
MATCH=".ABC=[0-9]\{3\} .GHI=[0-9]\{3,5\}" but something is wrong with my syntax; furthermore, it does not include timestampx.
printf "TIMESTAMP ABC GHI\n"
grep JKL=877 file | while read line
do
?
done
Thanks for your help.
Try using sed
printf "TIMESTAMP\tABC\tGHI\n"
sed -nr '/JKL=877/s/^(\w+).*ABC=([0-9]+).*GHI=([0-9]+).*/\1\t\2\t\3/p' file
Output:
TIMESTAMP ABC GHI
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
With these kinds of problems it's usually best to first build an array that maps the names to the values for the name=value type of fields. That way you can use the field values however you like, simply by addressing the array with their names:
$ cat file
timestamp1 KKIE ABC=123 [5454] GHI=547 JKL=877 MNO=878
timestamp2 GGHI ABC=544 [ 24548] GHI=883 JKL=587 MNO=874
timestamp3 GGGIO ABC=877 [3487] GHI=77422 JKL=877 MNO=877
timestamp4 GGDI ABC=269 [ 1896] GHI=887 JKL=877 MNO=123
$
$ cat tst.awk
{
for (i=1;i<=NF;i++) {
split($i,tmp,/=/)
val[tmp[1]] = tmp[2]
fld[tmp[1]] = $i
}
if (val["JKL"] == 877) {
print $1, fld["ABC"], fld["GHI"]
}
}
$
$ awk -f tst.awk file
timestamp1 ABC=123 GHI=547
timestamp3 ABC=877 GHI=77422
timestamp4 ABC=269 GHI=887
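The same name-to-value array can also produce the "solution 2" layout (a header plus bare values). The following is only a sketch under a few assumptions: the function name print_table and the temp file are hypothetical, the output is tab-separated, and the array is cleared on every line so a value from an earlier line cannot leak into a later one:

```shell
# Sketch: print TIMESTAMP/ABC/GHI as a tab-separated table for JKL=877 lines.
print_table() {
  awk '
    BEGIN { OFS = "\t"; print "TIMESTAMP", "ABC", "GHI" }
    {
      split("", val)                      # clear values from the previous line
      for (i = 1; i <= NF; i++) {
        n = split($i, tmp, "=")
        if (n == 2) val[tmp[1]] = tmp[2]  # only name=value fields are stored
      }
      if (val["JKL"] == 877) print $1, val["ABC"], val["GHI"]
    }
  ' "$1"
}

# Sample input copied from the question into a temp file.
log_sample=$(mktemp)
cat > "$log_sample" <<'EOF'
timestamp1 KKIE ABC=123 [5454] GHI=547 JKL=877 MNO=878
timestamp2 GGHI ABC=544 [ 24548] GHI=883 JKL=587 MNO=874
timestamp3 GGGIO ABC=877 [3487] GHI=77422 JKL=877 MNO=877
timestamp4 GGDI ABC=269 [ 1896] GHI=887 JKL=877 MNO=123
EOF

print_table "$log_sample"
```

Because the fields are looked up by name, the optional space inside the brackets does not matter here: "[" and "24548]" contain no "=" and are simply skipped.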
#!/bin/bash
cat input.txt
echo ""
echo "############"
echo "TIMESTAMP ABC GHI"
sed -ne 's/\(timestamp[0-9]\).*ABC=\([0-9]*\).*GHI=\([0-9]*\).*JKL=877.*$/\1 \2 \3/gp' input.txt
The output is:
timestamp1 KKIE ABC=123 [5454] GHI=547 JKL=877 MNO=878
timestamp2 GGHI ABC=544 [ 24548] GHI=883 JKL=587 MNO=874
timestamp3 GGGIO ABC=877 [3487] GHI=77422 JKL=877 MNO=877
timestamp4 GGDI ABC=269 [ 1896] GHI=887 JKL=877 MNO=123
############
TIMESTAMP ABC GHI
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
If you are not using the things between [ and ], then just ignore them.
Here's an awk version:
awk -F'=| +' -v OFS=$'\t' 'BEGIN {
print "TIMESTAMP", "ABC", "GHI"
}{
sub(/\[[^]]+\]/, "");
if ($8==877) print $1, $4, $6
}' input-file
With perl :
$ perl -lne '
print "$1 $2 $3"
if m/^(timestamp\d+).*?(ABC=\d+).*?(GHI=\d+)\s+JKL=877/i
' file
Output
timestamp1 ABC=123 GHI=547
timestamp3 ABC=877 GHI=77422
timestamp4 ABC=269 GHI=887
For solution 1, you could try something like:
[ ~]$ awk 'BEGIN {str=""}{str=str"\n"; for (i=1;i<=NF;i++){if($i ~ "^(timestamp|(ABC|GHI)=)"){str=str""$i" "}}} END {print str}' file.txt|sed "1d;s/\ $//g"
timestamp1 ABC=123 GHI=547
timestamp2 ABC=544 GHI=883
timestamp3 ABC=877 GHI=77422
timestamp4 ABC=269 GHI=887
If you need to catch all values which match the pattern "[A-Z]+=[0-9]+":
[ ~]$ awk 'BEGIN {str=""} {str=str"\n"; for (i=1;i<=NF;i++){if($i ~ "^(timestamp|[A-Z]+=[0-9]+)"){str=str""$i" "}}} END {print str}' file.txt|sed "1d;s/\ $//g"
timestamp1 ABC=123 GHI=547 JKL=877 MNO=878
timestamp2 ABC=544 GHI=883 JKL=587 MNO=874
timestamp3 ABC=877 GHI=77422 JKL=877 MNO=877
timestamp4 ABC=269 GHI=887 JKL=877 MNO=123
For solution 2:
[ ~]$ head=$(head -n1 file.txt|egrep -o "[A-Z]+=[0-9]+"|awk -F "=" 'BEGIN{s=""}{s=s""$1" "} END {print "TIMESTAMP "s}'|sed "s/\ $//g")
[ ~]$ content=$(i=1; while read; do echo $REPLY|egrep -o "[A-Z]+=[0-9]+"|awk -F "=" 'BEGIN{s=""} {s=s""$2" "} END {print "timestamp'$i' "s}'|sed "s/\ $//g"; ((i++)); done < file.txt)
[ ~]$ echo -e "$head\n$content"
TIMESTAMP ABC GHI JKL MNO
timestamp1 123 547 877 878
timestamp2 544 883 587 874
timestamp3 877 77422 877 877
timestamp4 269 887 877 123
If the number of matches on a line is constant, you can get away with a grep-only solution with a little help from paste:
grep JKL=877 file |
grep -o -e '^timestamp[0-9]' -e '\bABC=[0-9]\{3\}' -e '\bGHI=[0-9]\{3,5\}' |
grep -o '[^=]*$' |
paste - - -
Output:
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
To include the desired header do something like this:
(
printf "TIMESTAMP\tABC\tGHI\n"
grep JKL=877 file |
grep -o -e '^timestamp[0-9]' -e '\bABC=[0-9]\{3\}' -e '\bGHI=[0-9]\{3,5\}' |
grep -o '[^=]*$' |
paste - - -
)
Output:
TIMESTAMP ABC GHI
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
If you can make some assumptions about the order of the input and the number of fields, e.g. no white space at the end of lines, you can use the simple field referencing you attempted in "solution 2", e.g.:
awk '/JKL=877/ { print $1, $4, $(NF==11 ? 7 : 8) }' FS='=| +' file
Output:
timestamp1 123 547
timestamp3 877 77422
timestamp4 269 887
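To see why the NF==11 test above works, it can help to print the field count per line with the same separator. This is just an illustrative sketch; the function name count_fields and the temp file are made up:

```shell
# Sketch: show how the optional space after "[" changes the field count
# when fields are split on "=" or on runs of spaces.
count_fields() {
  awk -F'=| +' '{ print $1, NF }' "$1"
}

# Sample input copied from the question into a temp file.
fields_sample=$(mktemp)
cat > "$fields_sample" <<'EOF'
timestamp1 KKIE ABC=123 [5454] GHI=547 JKL=877 MNO=878
timestamp2 GGHI ABC=544 [ 24548] GHI=883 JKL=587 MNO=874
timestamp3 GGGIO ABC=877 [3487] GHI=77422 JKL=877 MNO=877
timestamp4 GGDI ABC=269 [ 1896] GHI=887 JKL=877 MNO=123
EOF

count_fields "$fields_sample"
```

Lines with "[5454]" split into 11 fields, so the GHI value is field 7; lines with "[ 1896]" gain one extra field from the space, so the GHI value moves to field 8.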