Process FASTA file to output header and all matching subsequences - regex
I need to search a FASTA file for certain regions of the DNA sequence. For each match, I need to print the sequence header followed by all matches within that sequence. I want to print the header once followed by the matching sections.
The output of the following code is close, but the matched regions of DNA are printed above their header instead of beneath it. I cannot flip the two blocks of code because that cuts off the first results.
# First, I open my file and print a warning if it fails.
unless ( open FILE, "<", '/scratch/SampleDataFiles/test.fasta' ) {
die "Sorry", $!;
}
$/ = ">"; # This changes the record separator from \n to >, so I can chomp it later.
my @file = <FILE>;
my $file = "@file";
chomp $file;
# To view the file I can--
# print $file;
my $count = 0; # here I will count the matched regions
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
foreach $file ( @file ) {
# I look for each header and its following sequence
# And count the total sequences in the file
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now, I use the sequences I matched and search for a
# hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I want to know what the position of the match is
my $pos = pos( $sequence ) - 7;
print "\n", $1, " found at ", $pos;
}
# I use the count variable I made earlier to count up each
# time I match a sequence that has one or more hydrophobic region
if ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
print "\n",
"Hydrophobic region(s) found in ",
$head,
"\n",
"-------------------------------------",
"\n";
$count = $count + 1;
}
}
}
print "Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
This is the output:
AVVAAVMW found at 325
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
VAVLMLCL found at 170
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
-------------------------------------
Hydrophobic region(s) found in 2 out of 15 sequences.
This is the output I get if I switch them:
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
-------------------------------------
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). |
NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in 2 out of 15 sequences.
Per my teacher's recommendation, I have adjusted my code as follows so that I include everything within the larger while loop and restrict the number of prints with a counter. This new code prints each new header one time, and below it prints each instance of a found region of DNA (essentially flipping what I had before).
New code:
my $count = 0; # here I will count the matched regions
my $temp_count = 0; # this I will use temporarily to count
my $sequence_count = 0; # here I will count the sequences
# that contain a matched region
if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count + 1;
# Now I use the sequences that I found, and
# search them for a hydrophobic region
while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {
# I use the count variables I made earlier
# I count all times I match a sequence that has one or more hydrophobic region
$temp_count = $temp_count + 1;
# But I don't want the header repeated for the same sequence, so I limit the
# times that it can print
if ( $temp_count <= 2 ) {
print "\n", "Hydrophobic region(s) found in ", $head, "\n";
$count = $count + 1;
}
# I want to know what the position of the match is
# within the sequence
my $pos = pos( $sequence ) - 7;
print $1, " found at ", $pos, "\n", "\n";
}
}
}
print "\n",
"\n",
"-------------------------",
"\n",
"Hydrophobic region(s) found in ",
$count,
" out of ",
$sequence_count,
" sequences.",
"\n",
"\n";
If useful, here is what the file looks like:
>P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAB;
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRL
GLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 | Name=YWHAE;
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIR
LGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 | Name=YWHAH; Synonyms=YWHA1;
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHP
IRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 | Name=HLA-A; Synonyms=HLAA;
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRK
WETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRK
GGSYSQAASSDSAQGSDMSLTACKV
>Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 | Name=ATXN8;
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 | Name=AURKC; Synonyms=AIE2, AIK3, ARK3, STK13;
MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVALKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLDEQRTATIIEELADALTYCHDKKVIHRDIKPE
NLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPPEMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQPLERLPLAQILKHPWVQAHSRRVLPPCAQMAS
>O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 | Name=AVIL;
MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQDEQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHVKGKRNIRATEVEMSWDSFNRGDVFLLDLGKV
IIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVIEGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLLNHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFK
QLFQKWSVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWYGFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRHFMAIFKGKLVIFEGGTSRKGNAEPDP
PVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYLWYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECSNKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLH
THPSGRDPDTPILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYPIAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF
>Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 | Name=BSN; Synonyms=KIAA0434, ZNF231;
MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPGPGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSGRSPSVSPDRGSTPTSPYSVPQIAPLPSSTLC
PICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPHLTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSRQAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATE
QTQEGLTGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTEPGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNTCTTCRLQVCNLCGFNPTPHLVEKTEW
LCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPLKQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKPPPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSL
SDTGYSSDGISSSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQRPHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQPPARAAELTDEDFMRRQILEM
SAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLPHNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESLTGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAK
QQRKARHRSHGPLLPTIEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASPTEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTTPSSGRPLKSAEEAYEE
MMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQPAAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGTSPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSP
ASPSSASKEIGMPFSQGPGTPATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAREKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPRSPGAPTPSPMV
AQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQPTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPGPRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLAL
DPIPGRQSTAVQPLVINLNAQEHTFLATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPREAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVVVQMGEG
TAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAPPGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIGQLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYG
SVTDLRHPTDLLAHPLPMRRYSSVSNIYSDHRYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQSLVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPITT
QPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEGPVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSGALSRPGFEKEEASQEERQRKQQEQLLQLERERV
ELEKLRQLRLQEELERERVELQRHREEEQLLVQRELQELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPGRGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASR
SGIKKRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWEQPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITAATDPKVEIVRYISAPEKTGRGESLACQT
EPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKKKPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTADSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLV
RKVKRTLPSPPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEEEKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPATAAAPATPSGPTAFQQPRFQPPA
PQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGLPNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAPSETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRN
YVMIDDISELTKDSTSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGRPASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVESDLASYPPPAVSSSLVSR
GRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSGEKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGPLPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSD
SEAYHLGQEETDWFDKPRDARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHTGEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQAHSGPAALQSKAE
PQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPGPGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGK
LTEAVSAFGKKFSSFW
>Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 | Name=BRWD1; Synonyms=C21orf107, WDR9;
MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSNKHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEMPVNYGSPPNLVEIHRGKQLTGCSTFSTAFPG
TMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDCLVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQFSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLA
TGSTDHVIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISGDLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIMLSAGHDGSIFIWDITKGTKMKHYFNM
IEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEKIPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQLGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRG
LSNGEETPRRGFRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLEKGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHETSCDQSEGSGSSEEDEWRSDRK
SESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDEISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIYFRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLC
CLKLAFIDPATGKLMDKSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYIVRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGIDQLLNLDIAAAFAGPVD
LCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPESVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATNYVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDF
GTVRETLDAGNYDSPLEFCKDIRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQPNKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNRPITNGSTLSES
EVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVSLANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCSGRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPL
PVSSSHTAQSNVDESENRDSESESDLRVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEPGRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHGCETDLD
SDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAATAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPDSEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDA
PSKRKSSSVTSSGEDSKSHIPGSETDRTFSSESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKISETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHNQ
KDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKKKLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ
>Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 | Name=BTBD6; Synonyms=BDPL;
MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLRERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLILLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCWEVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLGRALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAYRSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFPVWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFYA
>P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 | Name=BZW1L1;
MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSMDNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKEEMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYCYDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEGD
>Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 | Name=CCDC144C;
MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALDQPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPDWHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDCDREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRGTDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHLHENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNNYKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKHRSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEECGKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISHRHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETILQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEWVRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKETEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAESEKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSMKAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAAVSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELSRTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYLLKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI
>A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742;
MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKPPAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDTDIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENFFCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAVLSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFLLDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHFLVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSSVCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFMGTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLFQRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFEQLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQEGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHGSLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVVLGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNYCISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASFFFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSPSFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPSPKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDSVGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQPDLPDVWLRRPSTHTSGYSS
>Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 | Name=DIRAS2;
MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPAMQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEALARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM
>Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 | Name=DNAJC22;
MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPRGVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAFLTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATLSYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQLAYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR
Desired Output
Main Question
I just feel like my original code, although upside down, was more reliable because I didn't have to tell it how many times to print the header, it just looked for and printed only unique headers on its own. Is there a better way to print only new instances of headers but to print all matches of the desired sequence that follows? I could not find a way to specify print only unique matches, and I was uncertain about trying to send all headers and matched regions into a hash (and I have no idea how to do that).
Here's what I would write.
I've made several significant changes
I use chomp to remove the > from the start of the header of the next sequence, and then check that there are still some non-space characters left. The very first record read will be just >, so this discards the empty record
I've removed the semicolons from your regex pattern, as they don't make much difference, and added some optional whitespace to trim any leading and trailing whitespace on the header
I've removed the non-greedy modifier ? from [VILMFWCA]{8,} as I didn't see a reason for it. Maybe I'm wrong. I've also changed it so that all matching sequences will be found even if they overlap. Again, maybe that's a bad call: I'm not a bioinformatician!
Calculating the position of each region as pos() - 7 is wrong as it depends on the length of the match. I've used the built-in array @- instead. $-[1] contains the position of the start of capture $1, $-[2] is for $2, etc., and $-[0] is the position of the entire match
I keep the matching region and its start position in array @regions. When the search is finished I can test whether any were found by checking the size of the array
use strict;
use warnings 'all';
my $FASTA_FILE = 'test.fasta';
open my $fh, '<', $FASTA_FILE or die qq{Unable to open "$FASTA_FILE" for input: $!};
local $/ = '>';
while ( <$fh> ) {
    chomp;
    next unless /\S/;
    next unless my ( $head, $seq ) = /\s*(.*\S)\s*\n(\w+)/;
    my @regions;
    while ( $seq =~ / (?= ( [VILMFWCA]{8,} ) ) /gxi ) {
        push @regions, [ $1, $-[1] ];
    }
    next unless @regions; # Skip this sequence if no regions found
    printf "%d Hydrophobic region%s found in %s\n",
        scalar @regions,
        @regions == 1 ? "" : "s",
        $head;
    printf " %s found at %d\n", @$_ for @regions;
}
output
14 Hydrophobic regions found in A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 | Name=DISP2; Synonyms=DISPB, KIAA1742; DGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQP
VAVLMLCLAVIFLC found at 88
AVLMLCLAVIFLC found at 89
VLMLCLAVIFLC found at 90
LMLCLAVIFLC found at 91
MLCLAVIFLC found at 92
LCLAVIFLC found at 93
CLAVIFLC found at 94
LLALVAIFF found at 411
LALVAIFF found at 412
IWICWFAALAA found at 623
WICWFAALAA found at 624
ICWFAALAA found at 625
CWFAALAA found at 626
LALALAFA found at 888
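If the overlapping suffixes in the output above are more than is needed, a plain global match (without the lookahead) should report each maximal run only once, and $-[1] still gives the start offset whatever the length of the run. The following is a minimal sketch under that assumption; the find_regions helper and the sample sequence are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns a list of
# [ region, zero-based start offset ] pairs, one per maximal run of
# eight or more hydrophobic residues in $seq.
sub find_regions {
    my ($seq) = @_;
    my @regions;
    while ( $seq =~ /([VILMFWCA]{8,})/gi ) {
        # $-[1] is the offset where capture group 1 started
        push @regions, [ $1, $-[1] ];
    }
    return @regions;
}

# Made-up sequence, for illustration only
my $seq = 'MDGKVAVLMLCLAVIFLCTLAGKKLALALAFATR';

printf "%s found at %d\n", @$_ for find_regions($seq);
```

The offsets in @- are zero-based, so add 1 if positions should be counted from 1, as the pos() - 7 calculation in the question effectively does for eight-character matches.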
What I did was remove the global modifier from the regex in the if statement that tests the sequence, but leave it on the regex in the while statement. This way I am not losing the first matches like I was before, but I am still able to print the header before the sequence matches.
unless (open FILE, "<", '/scratch/SampleDataFiles/test.fasta') {
die "Cannot Open File", $!;
}
$/ = ">";
my @file = <FILE>;
my $file = "@file";
chomp $file;
my $count = 0;
my $sequence_count = 0;
foreach $file (@file) {
if ($file =~ /(.*;.*;?\n)(\w+)/) {
my $head = $1;
my $sequence = $2;
$sequence_count = $sequence_count +1;
if ($sequence =~ /([VILMFWCA]{8,}?)/i) {
print "\n", "Hydrophobic region(s) found in ", $head, "\n";
$count = $count +1;
}
while ($sequence =~ /([VILMFWCA]{8,}?)/gi) {
my $pos = pos($sequence)-7;
print $1, " found at ", $pos, "\n", "\n";
}
}
}
print "\n", "\n", "-------------------------", "\n", "Hydrophobic region(s) found in ", $count, " out of ", $sequence_count , " sequences.", "\n", "\n";
close FILE;
Thanks for your help, guys!
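The reason this works seems to come down to how Perl tracks match positions: a /g match in scalar context records where it stopped (readable through pos()), so an if test that uses /g consumes the first match before the while loop gets to it. Below is a small sketch of that effect; the count_after_test helper and the sample string are hypothetical and only serve to isolate the behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: run an "is there a match?" test first, then count
# how many matches a following while loop still sees on the same string.
sub count_after_test {
    my ( $seq, $use_g_in_test ) = @_;

    my $found = $use_g_in_test
        ? scalar( $seq =~ /[VILMFWCA]{8,}?/gi )    # /g advances pos($seq)
        : scalar( $seq =~ /[VILMFWCA]{8,}?/i );    # no /g: pos($seq) untouched

    my $count = 0;
    $count++ while $seq =~ /[VILMFWCA]{8,}?/gi;    # resumes from pos($seq)
    return $count;
}

# Made-up sequence containing two 8-residue hydrophobic regions
my $seq = 'KKVAVLMLCLKKLALALAFAKK';

printf "test with /g:    loop sees %d match(es)\n", count_after_test( $seq, 1 );
printf "test without /g: loop sees %d match(es)\n", count_after_test( $seq, 0 );
```

With the /g test the loop sees only one of the two regions; without it the loop sees both, which matches the missing first results described in the question.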
Related
Fetch line starting with a specific string in powershell
I am executing a command for which I am getting output like below
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 7004404000000000 4,8,16_Gbps M5 sw Short_dist
Encoding: 6 64B66B
Baud Rate: 140 (units 100 megabaud)
Length 9u: 0 (units km)
Length 9u: 0 (units 100 meters)
Length 50u (OM2): 3 (units 10 meters)
Length 50u (OM3): 10 (units 10 meters)
Length 62.5u:0 (units 10 meters)
Length Cu: 0 (units 1 meter)
Vendor OUI: 00:05:1e
Vendor PN: 57-0000088-01
Vendor Rev: A
Wavelength: 850 (units nm)
Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max: 0
BR Min: 0
Date Code: 180316
DD Type: 0x68
Enh Options: 0xfa
Status/Ctrl: 0xb2
Pwr On Time: 3.52 years (30822 hours)
E-Wrap Control: 0
O-Wrap Control: 0
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40
Temperature: 32 Centigrade
Current: 8.082 mAmps
Voltage: 3311.6 mVolts
RX Power: -5.5 dBm (280.4uW)
TX Power: -2.8 dBm (523.1 uW)
I need to fetch only the last 2 lines, that is, the ones starting with RX and TX. I was trying like
#$streamOut | Select-String -Pattern "^RX" | select -ExpandProperty line
$streamOut | Select-String -Pattern "RX" | select -ExpandProperty line
This is my code
$session = New-SSHSession -ComputerName $SAN_IP -Credential $cred
$Strem = New-SSHShellStream -SSHSession $Session
$streamOut=@()
$SystemView = $Strem.WriteLine("sfpshow $port_Num")
sleep -Seconds 5
$streamOut = @($Strem.read())
sleep -Seconds 5
$RXTX_Data = @($streamOut | ? { $_ -match "^RX" -or $_ -match "^TX"})
$RXTX_Data
When I use the below solution in the above code, it returns blank. The $streamOut is an array
PS C:\Windows\system32> $streamOut.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True True Object[] System.Array
But it is returning the entire output. Please let me know on this
As you clarified, the $streamOut variable contains a single, multiline string. In this case you can use Select-String with -AllMatches like this:
$RXTX_Data = ($streamOut | Select-String -Pattern '(?m)^(RX|TX).*$' -AllMatches).Matches.Value
$RXTX_Data # Output array of matching lines
Output:
RX Power: -5.5 dBm (280.4uW)
TX Power: -2.8 dBm (523.1 uW)
Note that it is important to use the (?m) inline modifier for multiline mode, which changes behaviour of ^ to anchor beginning of line and $ to anchor end of line, instead of beginning and end of whole input string.
An easier alternative is to split the string into lines first, so you can use the -match operator to filter for matching lines:
$streamLines = $streamOut -split '\r?\n'
$RXTX_Data = $streamLines -match '^(RX|TX)'
This works because -match and other comparison operators act as a filter when the LHS operand is a collection like an array. We don't need multiline mode, because the RegEx gets applied to each array element (line) individually.
The above two lines could even be condensed into a one-liner:
$RXTX_Data = ($streamOut -split '\r?\n') -match '^(RX|TX)'
Try this:
$streamOut = Get-Content -Path "Your_Output_FilePath"
$streamOut | ? { $_ -match "^RX" -or $_ -match "^TX"}
Parse default Salt highstate output
Parsing the highstate output of Salt has proven to be difficult. I don't want to change the output to JSON, because I still want it to be human-legible. What's the best way to convert the Summary into something machine-readable?
Summary for app1.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.383 s
--
Summary for app2.domain.com
--------------
Succeeded: 278 (unchanged=12, changed=6)
Failed: 0
--------------
Total states run: 278
Total run time: 7.448 s
--
Summary for app0.domain.com
--------------
Succeeded: 293 (unchanged=13, changed=6)
Failed: 0
--------------
Total states run: 293
Total run time: 7.510 s
Without a better idea, I'm trying to grep and awk the output and insert it into a CSV. These two work:
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
But this one fails, although it works in Reger:
cat ${_FILE} | grep -oP '(?<=\schanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate.csv;
EDIT1: @vintnes @ikegami I agree, I'd much rather parse the JSON output, but Salt doesn't offer a summary of changes when outputting to JSON. So far this is what I have, and while very ugly, it's working.
cat ${_FILE} | grep Summary | awk '{ print $3} ' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep -oP '(?<=unchanged=)[0-9]+' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | grep unchanged | awk -F' ' '{ print $4}' | \
grep -oP '(?<=changed=)[0-9]+' | tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Warning" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
cat ${_FILE} | { grep "Failed" || true; } | awk -F: '{print $2+0} END { if (!NR) print "null" }' | \
tr '\n' ',' | sed '$s/,$/\n/' >> /tmp/highstate_tmp.csv;
csvtool transpose /tmp/highstate_tmp.csv > /tmp/highstate.csv;
sed -i '1 i\instance,unchanged,changed,warning,failed' /tmp/highstate.csv;
Output:
instance,unchanged,changed,warning,failed
app1.domain.com,12,6,,0
app0.domain.com,13,6,,0
app2.domain.com,12,6,,0
Here you go. This will also work if your output contains warnings. Please note that the output is in a different order than you specified; it's the order in which each record occurs in the file. Don't hesitate with any questions.
$ awk -v OFS=, '
    BEGIN { print "instance,unchanged,changed,warning,failed" }
    /^Summary/ { instance=$NF }
    /^Succeeded/ { split($3 $4 $5, S, /[^0-9]+/) }
    /^Failed/ { print instance, S[2], S[3], S[4], $2 }
' "$_FILE"
split($3 $4 $5, S, /[^0-9]+/) handles the possibility of warnings by disregarding the first two "words" Succeeded: ### and using any number of non-digits as a separator.
edit: Printed on /^Fail/ instead of using /^Summ/ and END.
perl -e'
    use strict;
    use warnings qw( all );
    use Text::CSV_XS qw( );
    my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
    $csv->say(select(), [qw( instance unchanged changed warning failed )]);
    my ( $instance, $unchanged, $changed, $warning, $failed );
    while (<>) {
        if (/^Summary for (\S+)/) {
            ( $instance, $unchanged, $changed, $warning, $failed ) = $1;
        }
        elsif (/^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/) {
            ( $unchanged, $changed ) = ( $1, $2 );
        }
        elsif (/^Warning:\s+(\d+)/) {
            $warning = $1;
        }
        elsif (/^Failed:\s+(\d+)/) {
            $failed = $1;
            $csv->say(select(), [ $instance, $unchanged, $changed, $warning, $failed ]);
        }
    }
'
Provide input via STDIN, or provide path to file(s) from which to read as arguments.
Terse version:
perl -MText::CSV_XS -ne'
    BEGIN {
        $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
        $csv->say(select(), [qw( instance unchanged changed warning failed )]);
    }
    /^Summary for (\S+)/ and @row=$1;
    /^Succeeded:\s+\d+ \(unchanged=(\d+), changed=(\d+)\)/ and @row[1,2]=($1,$2);
    /^Warning:\s+(\d+)/ and $row[3]=$1;
    /^Failed:\s+(\d+)/ and ($row[4]=$1), $csv->say(select(), \@row);
'
Improving on the answer from @vintnes, producing output as tab-separated CSV.
Write an awk script that reads values from lines by their order and prints each record as it is read.
script.awk
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
FNR%8 == 1 {arr[1] = $3}
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
FNR%8 == 4 {arr[5] = $2;}
FNR%8 == 6 {arr[6] = $4;}
FNR%8 == 7 {arr[7] = $4; print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
run script
Tab separated CSV output
awk -v OFS="\t" -f script.awk input-1.txt input-2.txt ...
Comma separated CSV output
awk -v OFS="," -f script.awk input-1.txt input-2.txt ...
Output
computer succeeded unchanged changed failed states run run time
app1.domain.com 278 12 6 0 278 7.383
app2.domain.com 278 12 6 0 278 7.448
app0.domain.com 293 13 6 0 293 7.510
computer,succeeded,unchanged,changed,failed,states run,run time
app1.domain.com,278,12,6,0,278,7.383
app2.domain.com,278,12,6,0,278,7.448
app0.domain.com,293,13,6,0,293,7.510
Explanation
BEGIN {print("computer","succeeded","unchanged","changed","failed","states run","run time");}
Print the heading CSV line
FNR%8 == 1 {arr[1] = $3}
Extract the arr[1] value from the 3rd field in the first line of every 8 lines
FNR%8 == 3 {arr[2] = $2; arr[3] = extractNum($3); arr[4] = extractNum($4)}
Extract the arr[2,3,4] values from the 2nd, 3rd and 4th fields in the third line of every 8 lines
FNR%8 == 4 {arr[5] = $2;}
Extract the arr[5] value from the 2nd field in the 4th line of every 8 lines
FNR%8 == 6 {arr[6] = $4;}
Extract the arr[6] value from the 4th field in the 6th line of every 8 lines
FNR%8 == 7 {arr[7] = $4;
Extract the arr[7] value from the 4th field in the 7th line of every 8 lines
print arr[1],arr[2],arr[3],arr[4],arr[5],arr[6],arr[7];}
Print the array elements for the extracted variables once the 7th line of every 8 lines has been read.
function extractNum(str){match(str,/[[:digit:]]+/,m);return m[0];}
Utility function to extract numbers from a text field.
Loop through a text file and Extract a set of 100 IP's from a text file and output to separate text files
I have a text file that contains around 900 IPs. I need to split that file into batches of 100 IPs and write each batch to a new file, which would create around 9 text files. Our API only allows us to POST 100 IPs at a time. Could you please help me out here?
Below is the format of the text file:
10.86.50.55,10.190.206.20,10.190.49.31,10.190.50.117,10.86.50.57,10.190.49.216,10.190.50.120,10.190.200.27,10.86.50.58,10.86.50.94,10.190.38.181,10.190.50.119,10.86.50.53,10.190.50.167,10.190.49.30,10.190.49.89,10.190.50.115,10.86.50.54,10.86.50.56,10.86.50.59,10.190.50.210,10.190.49.20,10.190.50.172,10.190.49.21,10.86.49.18,10.190.50.173,10.86.49.49,10.190.50.171,10.190.50.174,10.86.49.63,10.190.50.175,10.13.12.200,10.190.49.27,10.190.49.19,10.86.49.29,10.13.12.201,10.86.49.28,10.190.49.62,10.86.50.147,10.86.49.24,10.86.50.146,10.190.50.182,10.190.50.25,10.190.38.252,10.190.50.57,10.190.50.54,10.86.50.78,10.190.50.23,10.190.49.8,10.86.50.80,10.190.50.53,10.190.49.229,10.190.50.58,10.190.50.130,10.190.50.22,10.86.52.22,10.19.68.61,10.41.43.130,10.190.50.56,10.190.50.123,10.190.49.55,10.190.49.66,10.190.49.68,10.190.50.86,10.86.49.113,10.86.49.114,10.86.49.101,10.190.50.150,10.190.49.184,10.190.50.152,10.190.50.151,10.86.49.43,10.190.192.25,10.190.192.23,10.190.49.115,10.86.49.44,10.190.38.149,10.190.38.151,10.190.38.150,10.190.38.152,10.190.38.145,10.190.38.141,10.190.38.148,10.190.38.142,10.190.38.144,10.190.38.147,10.190.38.143,10.190.38.146,10.190.192.26,10.190.38.251,10.190.49.105,10.190.49.110,10.190.49.137,10.190.49.242,10.190.50.221,10.86.50.72,10.86.49.16,10.86.49.15,10.190.49.112,10.86.49.32,10.86.49.11,10.190.49.150,10.190.49.159,10.190.49.206,10.86.52.28,10.190.49.151,10.190.49.207,10.86.49.19,10.190.38.103,10.190.38.101,10.190.38.116,10.190.38.120,10.190.38.102,10.190.38.123,10.190.38.140,10.190.198.50,10.190.38.109,10.190.38.108,10.190.38.111,10.190.38.112,10.190.38.113,10.190.38.114,10.190.49.152,10.190.50.43,10.86.49.23,10.86.49.205,10.86.49.220,10.190.50.230,10.190.192.238,10.190.192.237,10.190.192.239,10.190.50.7,10.190.50.10,10.86.50.86,10.190.38.125,10.190.38.127,10.190.38.126,10.190.50.227,10.190.50.149,10.86.49.59,10.190.49.158,10.190.49.157,10.190.44.11,10.190.38.124,10.190.50.153,10.190.49.40,10.190.192.235,10.190.192.236,10.190.50.241,10.190.50.240,10.86.46.8,10.190.38.234,10.190.38.233,10.86.50.163,10.86.50.180,10.86.50.164,10.190.49.245,10.190.49.244,10.190.192.244,10.190.38.130,10.86.49.142,10.86.49.102,10.86.49.141,10.86.49.67,10.190.50.206,10.190.192.243,10.190.192.241
I tried looking online to come up with a bit of working code, but I can't work out what would work best in this situation:
$IP = 'H:\IP.txt'
$re = '\d*.\d*.\d*.\d*,'
Select-String -Path $IP -Pattern $re -AllMatches |
  Select-Object -Expand Matches |
  ForEach-Object {
    $Out = 'C:\path\to\out.txt' -f |
    Set-Content $clientlog
  }
This will do what you are after:
$bulkIP = (Get-Content H:\IP.txt) -split ','
$i = 0
# Loop over every index in the array
Do {
    # Act once every 100 entries (including index 0)
    If (0 -eq $i % 100) {
        # Only write when the entry is valid; without this check an empty
        # entry (e.g. after a trailing comma) would create a junk file
        If ($bulkIP[$i]) {
            # Write up to 100 lines to a file named after the 1-based position of its first entry.
            # E.g. entries 1-100 go to 1.txt, entries 501-600 go to 501.txt, etc.
            $bulkIP[$i..($i + 99)] | Out-File "C:\path\to\$($i + 1).txt"
        }
    }
    $i++
} While ($i -lt $bulkIP.Count)
What this does ...
- calculates the number of batches
- calculates the start & end index of each batch
- creates a range from the above
- creates a PSCustomObject to hold each batch
- creates an array slice from the range
- sends that out to the $Result collection
- shows what is in the collection & in the 1st batch from that collection

Here's the code ...
# fake reading in a raw text file
# in real life, use Get-Content -Raw
$InStuff = @'
10.86.50.55,10.190.206.20,10.190.49.31,10.190.50.117,10.86.50.57,10.190.49.216,10.190.50.120,10.190.200.27,10.86.50.58,10.86.50.94,10.190.38.181,10.190.50.119,10.86.50.53,10.190.50.167,10.190.49.30,10.190.49.89,10.190.50.115,10.86.50.54,10.86.50.56,10.86.50.59,10.190.50.210,10.190.49.20,10.190.50.172,10.190.49.21,10.86.49.18,10.190.50.173,10.86.49.49,10.190.50.171,10.190.50.174,10.86.49.63,10.190.50.175,10.13.12.200,10.190.49.27,10.190.49.19,10.86.49.29,10.13.12.201,10.86.49.28,10.190.49.62,10.86.50.147,10.86.49.24,10.86.50.146,10.190.50.182,10.190.50.25,10.190.38.252,10.190.50.57,10.190.50.54,10.86.50.78,10.190.50.23,10.190.49.8,10.86.50.80,10.190.50.53,10.190.49.229,10.190.50.58,10.190.50.130,10.190.50.22,10.86.52.22,10.19.68.61,10.41.43.130,10.190.50.56,10.190.50.123,10.190.49.55,10.190.49.66,10.190.49.68,10.190.50.86,10.86.49.113,10.86.49.114,10.86.49.101,10.190.50.150,10.190.49.184,10.190.50.152,10.190.50.151,10.86.49.43,10.190.192.25,10.190.192.23,10.190.49.115,10.86.49.44,10.190.38.149,10.190.38.151,10.190.38.150,10.190.38.152,10.190.38.145,10.190.38.141,10.190.38.148,10.190.38.142,10.190.38.144,10.190.38.147,10.190.38.143,10.190.38.146,10.190.192.26,10.190.38.251,10.190.49.105,10.190.49.110,10.190.49.137,10.190.49.242,10.190.50.221,10.86.50.72,10.86.49.16,10.86.49.15,10.190.49.112,10.86.49.32,10.86.49.11,10.190.49.150,10.190.49.159,10.190.49.206,10.86.52.28,10.190.49.151,10.190.49.207,10.86.49.19,10.190.38.103,10.190.38.101,10.190.38.116,10.190.38.120,10.190.38.102,10.190.38.123,10.190.38.140,10.190.198.50,10.190.38.109,10.190.38.108,10.190.38.111,10.190.38.112,10.190.38.113,10.190.38.114,10.190.49.152,10.190.50.43,10.86.49.23,10.86.49.205,10.86.49.220,10.190.50.230,10.190.192.238,10.190.192.237,10.190.192.239,10.190.50.7,10.190.50.10,10.86.50.86,10.190.38.125,10.190.38.127,10.190.38.126,10.190.50.227,10.190.50.149,10.86.49.59,10.190.49.158,10.190.49.157,10.190.44.11,10.190.38.124,10.190.50.153,10.190.49.40,10.190.192.235,10.190.192.236,10.190.50.241,10.190.50.240,10.86.46.8,10.190.38.234,10.190.38.233,10.86.50.163,10.86.50.180,10.86.50.164,10.190.49.245,10.190.49.244,10.190.192.244,10.190.38.130,10.86.49.142,10.86.49.102,10.86.49.141,10.86.49.67,10.190.50.206,10.190.192.243,10.190.192.241
'@

$SplitInStuff = $InStuff.Split(',')
$BatchSize = 25
# round up so a partial last batch is counted, without adding an empty one
$BatchCount = [math]::Ceiling($SplitInStuff.Count / $BatchSize)

$Result = foreach ($BC_Item in 1..$BatchCount)
    {
    # zero-based, inclusive index range of this batch (no overlap between batches)
    $Start = ($BC_Item - 1) * $BatchSize
    $End = $Start + $BatchSize - 1
    $Range = $Start..$End

    [PSCustomObject]@{
        IP_List = $SplitInStuff[$Range]
        }
    }

$Result
'=' * 20
$Result[0]
'=' * 20
$Result[0].IP_List.Count
'=' * 20
$Result[0].IP_List

Screen output ...
IP_List
-------
{10.86.50.55, 10.190.206.20, 10.190.49.31, 10.190.50.117...}
{10.190.50.173, 10.86.49.49, 10.190.50.171, 10.190.50.174...}
{10.190.50.53, 10.190.49.229, 10.190.50.58, 10.190.50.130...}
{10.86.49.44, 10.190.38.149, 10.190.38.151, 10.190.38.150...}
{10.86.49.11, 10.190.49.150, 10.190.49.159, 10.190.49.206...}
{10.86.49.205, 10.86.49.220, 10.190.50.230, 10.190.192.238...}
{10.86.46.8, 10.190.38.234, 10.190.38.233, 10.86.50.163...}
====================
{10.86.50.55, 10.190.206.20, 10.190.49.31, 10.190.50.117...}
====================
25
====================
10.86.50.55
10.190.206.20
10.190.49.31
10.190.50.117
10.86.50.57
10.190.49.216
10.190.50.120
10.190.200.27
10.86.50.58
10.86.50.94
10.190.38.181
10.190.50.119
10.86.50.53
10.190.50.167
10.190.49.30
10.190.49.89
10.190.50.115
10.86.50.54
10.86.50.56
10.86.50.59
10.190.50.210
10.190.49.20
10.190.50.172
10.190.49.21
10.86.49.18
Try this:
$cpt = 0
$Rang = 1

# remove old files
Get-ChildItem "H:\FileIP_*.txt" -File | Remove-Item -Force

(Get-Content "H:\IP.txt") -split ',' | %{
    # build a new filename whenever cpt is divisible by 100
    if (!($cpt++ % 100)) { $FileResult = "H:\FileIP_{0:D3}.txt" -f $Rang++ }
    $_ | Out-File $FileResult -Append
}
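For comparison, the same batching idea can be sketched outside PowerShell. The following is a minimal Python sketch, not a replacement for the answers above; the function names `batch_ips` and `write_batches` and the `FileIP_` prefix (borrowed from the last answer) are illustrative assumptions.

```python
from pathlib import Path


def batch_ips(text, size=100):
    """Split a comma-separated string of IPs into lists of at most `size` items."""
    # Drop empty entries such as the one left by a trailing comma or newline.
    ips = [ip.strip() for ip in text.split(",") if ip.strip()]
    return [ips[i:i + size] for i in range(0, len(ips), size)]


def write_batches(batches, out_dir, prefix="FileIP_"):
    """Write each batch to its own numbered file, one IP per line (hypothetical naming)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, batch in enumerate(batches, start=1):
        path = out_dir / f"{prefix}{number:03d}.txt"
        path.write_text("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Stepping through the list with `range(0, len(ips), size)` avoids tracking start and end indexes by hand, so batches cannot overlap and the last one is simply shorter.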
Removing symbols and making a tab delimited file while keeping all the words after a certain string in one column
I have a file full of lines like these:
>Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
>Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
>Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
What I want to get is something like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8]|midbrain (mesencephalon)[3/8]|other[7/8]
That is, all the words after "positive" should be in one column of their own, separated by pipes, and all the columns should be separated by tabs.
This is what I did:
sed -E 's/ *[>\|:-] */\t/g' mouse_genome_vista1.txt > mouse_genome_vista2.txt
sed "s/^[ \t]*//" -i mouse_genome_vista2.txt
My output was like this:
Mouse chr9 95713136 95716028 element 1367 positive hindbrain (rhombencephalon)[5/8] midbrain (mesencephalon)[3/8] other[7/8]
Mouse chr16 90449561 90451327 element 1672 positive forebrain[4/8] heart[6/8]
Mouse chr3 137446183 137449401 element 4 positive heart[3/4]
It works if there is just one word after "positive": it ends up alone in its column. However, if there is more than one, I get multiple columns. For instance, hindbrain, midbrain, and other each end up in their own tab-delimited column, but I want them pipe-separated in one column.
You may try this regex with Perl (it relies on a lookahead and \K, so it needs a PCRE-style engine rather than awk):
[|:-](?=.*positive)|positive\s+\K\|
Regex 101 Demo
Sample Perl solution (note that it operates on a string rather than on a file):
use strict;
use warnings;

my $str = 'Mouse|chr9:95713136-95716028 | element 1367 | positive | hindbrain (rhombencephalon)[5/8] | midbrain (mesencephalon)[3/8] | other[7/8]
Mouse|chr16:90449561-90451327 | element 1672 | positive | forebrain[4/8] | heart[6/8]
Mouse|chr3:137446183-137449401 | element 4 | positive | heart[3/4]
';
my $regex = qr/[|:-](?=.*positive)|positive\s+\K\|/xmp;
my $subst = "\t";    # double quotes, so this is a real tab character
my $result = $str =~ s/$regex/$subst/rg;
print $result;
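As an alternative to a single substitution, the line can be split into fields and reassembled. Here is a rough Python sketch of that approach, assuming every line has the layout shown in the question (species, locus, element, status, then one or more tissue entries); the function name `to_tsv` is made up for illustration.

```python
import re


def to_tsv(line):
    """Turn '>Species|chr:start-end | element N | status | tissue | ...' into tab-separated text."""
    # Split on pipes and trim the padding around each field.
    fields = [field.strip() for field in line.strip().lstrip(">").split("|")]
    # Assumed layout: the first four fields are fixed, the rest are tissue entries.
    species, locus, element, status, *tissues = fields
    chrom, start, end = re.split(r"[:-]", locus)
    # Tissue entries stay together in one column, joined by pipes.
    return "\t".join([species, chrom, start, end, element, status, "|".join(tissues)])
```

Because only the locus field is split on `:` and `-`, hyphens or colons inside tissue names would be left untouched.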
Capturing select data between certain lines in a file in Perl.
I have a file with contents of this sort:
*** X REGION ***
|-------------------------------------------------------------------------------------------------|
| X |
| addr tag extra data |
|-------------------------------------------------------------------------------------------------|
| $A1 label_A1X | 1 |
| $A2 label_A2X | 2 |
| $A3 label_A3X | 3 |
*** Y REGION ***
|-------------------------------------------------------------------------------------------------|
| Y |
| addr tag extra data |
|-------------------------------------------------------------------------------------------------|
| $0 label_0Y | 99 |
| $1 | 98 |
I need to capture the data under 'addr' and 'tag', separated by commas, separately for the records under 'X REGION' and 'Y REGION'. Here's what I tried:
open($fh1, "<", $memFile) or warn "Cannot open $memFile, $!"; #input file with contents as described above.
open($fh, "+<", $XFile) or warn "Cannot open $XFile, $!";
open($fh2, "+<", $YFile) or warn "Cannot open $YFile, $!";
while(my $line = <$fh1>) {
    chomp $line;
    $line = $line if (/\s+\*\*\*\s+X REGION\s+\*\*\*/ .. /\s+\*\*\*\s+Y REGION\s+\*\*\*/); #Trying to get at the stuff in the X region.
    if($line =~ /\s+|\s+\$(.*)\s+(.*)\s+|(.*)/) {
        $line = "$1,$2";
        print $fh $line;
        print $fh "\n";
    }
    my $lastLineNum = `tail -1 filename`;
    $line = $line if (/\*\*\* Y REGION \*\*\*/ .. $lastLineNum); #Trying to get at the stuff in the Y region.
    if($line =~ /\s+|\s+\$(.*)\s+(.*)\s+|(.*)/) {
        $line = "$1,$2";
        print $fh2 $line;
        print $fh2 "\n";
    }
}
This says $1 and $2 are uninitialized. Is the regex incorrect? If not (or in addition), what else is wrong?
This snippet does what you need (taking full advantage of Perl's default implicit variable $_):
# use die instead of warn: don't go ahead if there is no file
open(my $fin, "<", $memFile) or die "Cannot open $memFile, $!";
while(<$fin>) {
    # Flip-flop between the X and Y regions
    if (/[*]{3}\h+X REGION\h+[*]{3}/../[*]{3}\h+Y REGION\h+[*]{3}/) {
        print "X: $1,$2\n" if (/.*\$(\S*)\h*(\S*)\h*[|]/);
    }
    # Flip-flop from Y till the end; using undef means no external tail is needed
    if (/[*]{3}\h+Y REGION\h+[*]{3}/..undef) {
        print "Y: $1,$2\n" if (/.*\$(\S*)\h*(\S*)\h*[|]/);
    }
}
This is the output:
X: A1,label_A1X
X: A2,label_A2X
X: A3,label_A3X
Y: 0,label_0Y
Y: 1,
Online running demo
As for your code, there are several points to fix:
- In your regex to select the elements between the delimiters, the pipe | needs escaping, either with a backslash (\|) or with a character class ([|]); I prefer the latter.
- \s also matches line breaks (newline \n and carriage return \r), so don't use it as a general replacement for space plus tab \t. Use \h (horizontal whitespace only) instead.
- You start the regex with \s+, but in the example the first character of the table lines is always '|'.
- .* matches anything (spaces included) except a newline (\n). So a regex like .*\s+ matches the rest of the line plus the newline (\s), and possibly leading spaces on the next line too.
- The flip-flop operator .. gives you the lines in the selected region (edges included), but still one line at a time, so even the escaped-pipe form of your regex, \s+[|]\s+\$(.*)\s+(.*)\s+[|](.*), can't match at all; see how it behaves on the text.
So I've replaced the data-extracting regex with this one:
.*\$(\S*)\h*(\S*)\h*[|]
Regex breakdown:
.*\$    # matches everything up to a literal dollar '$'
(\S*)   # capturing group $1: zero or more non-space characters, [^\s]
        # can be replaced with (\w*) if your labels match [0-9a-zA-Z_]
\h*     # matches zero or more horizontal spaces
(\S*)   # capturing group $2, as above
\h*     # matches zero or more horizontal spaces
[|]     # matches a literal pipe '|'
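The region tracking can also be written with an explicit state variable instead of the flip-flop operator. Below is a minimal Python sketch of that variant, assuming region headers look like `*** NAME REGION ***` and data rows carry a `$`-prefixed address as in the question; `parse_regions` and the two pattern names are hypothetical.

```python
import re

# Header such as '*** X REGION ***'; group 1 is the region name.
REGION = re.compile(r"\*{3}\s+(\w+) REGION\s+\*{3}")
# Data row: '$addr', optional tag, then the pipe closing the first column.
ROW = re.compile(r"\$(\S*)[ \t]*([^\s|]*)[ \t]*\|")


def parse_regions(lines):
    """Return {region_name: [(addr, tag), ...]} for every region found in `lines`."""
    regions = {}
    current = None  # name of the region the current line belongs to
    for line in lines:
        header = REGION.search(line)
        if header:
            current = header.group(1)
            regions.setdefault(current, [])
            continue
        row = ROW.search(line)
        if row and current is not None:
            regions[current].append((row.group(1), row.group(2)))
    return regions
```

Unlike two separate flip-flops, this handles any number of regions with one loop, and a row with no tag yields an empty string for the second value, matching the `Y: 1,` line in the output above.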