Run a script on variables from a list file - list

I have a script which includes the following step as the first step in a series of filters of genomics data:
--option ~/folder$1/file$1 --option2 ~/folder$1/file$1 --indv NA12775 --options...
The script already uses a for-loop to go through folder/file indices 1-22. The option --indv takes a string, which is an identifier. I have a separate list file which is just a column of "indv" identifiers:
NA06984
NA06986
NA06989
NA06994
NA07000
I have many such lists and I am looking for a solution to automatically take a single identifier from my list file, run the filtering script for "indv X" and then take the next consecutive identifier and repeat. Something like "for line in ID-list, run filter-script"...

You can use xargs for this:
xargs -I {} ./myprogram --indv {} < indvlist.txt

A couple of bash methods for doing this:
for indv in $(<list-of-indv-values.txt)
do
    something something .... "${indv}" .....
done
or
while read -r indv
do
    something something ... "${indv}" .....
done < list-of-indv-values.txt
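The same loop can also be driven from Python, which avoids shell word-splitting altogether. This is only a minimal sketch: it assumes the list holds one identifier per line, and `./filter-script.sh` is a placeholder name for the real filtering script, not something from the question. The call that would actually launch the script is left commented out.

```python
import subprocess


def clean_ids(lines):
    """Return the non-empty identifiers, with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]


def build_command(indv, script="./filter-script.sh"):
    """Build the argument list for one identifier (script name is hypothetical)."""
    return [script, "--indv", indv]


def run_all(id_file):
    """Build (and optionally run) one command per identifier in id_file."""
    with open(id_file) as handle:
        for indv in clean_ids(handle):
            cmd = build_command(indv)
            print(" ".join(cmd))
            # subprocess.check_call(cmd)  # uncomment to really run the script
```

Calling `run_all("indvlist.txt")` would print one command line per identifier, in the order they appear in the list file.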

Python Regex to Extract Genome Sequence

I’m trying to use a Python Regular Expression to extract a genome sequence from a genome database; I’ve pasted a snippet of the database below.
>GSVIVT01031739001 pacid=17837850 polypeptide=GSVIVT01031739001 locus=GSVIVG01031739001 ID=GSVIVT01031739001.Genoscope12X annot-version=Genoscope.12X ATGAAAACGGAACTCTTTCTAGGTCATTTCCTCTTCAAACAAGAAAGAAGTAAAAGTTGCATACCAAATATGGACTCGAT TTGGAGTCGTAGTGCCCTGTCCACAGCTTCGGACTTCCTCACTGCAATCTACTTCGCCTTCATCTTCATCGTCGCCAGGT TTTTCTTGGACAGATTCATCTATCGAAGGTTGGCCATCTGGTTATTGAGCAAGGGAGCTGTTCCATTGAAGAAAAATGAT GCTACACTGGGAAAAATTGTAAAATGTTCGGAGTCTTTGTGGAAACTAACATACTATGCAACTGTTGAAGCATTCATTCT TGCTATTTCCTACCAAGAGCCATGGTTTAGAGATTCAAAGCAGTACTTTAGAGGGTGGCCAAATCAAGAGTTGACGCTTC CCCTCAAGCTTTTCTACATGTGCCAATGTGGGTTCTACATCTACAGCATTGCTGCCCTTCTTACATGGGAAACTCGCAGG AGGGATTTCTCTGTGATGATGTCTCATCATGTAGTCACTGTTATCCTAATTGGGTACTCATACATATCAAGTTTTGTCCG GATCGGCTCAGTTGTCCTTGCCCTGCACGATGCAAGTGATGTCTTCATGGAAGCTGCAAAAGTTTTTAAATATTCTGAGA AGGAGCTTGCAGCAAGTGTGTGCTTTGGATTTTTTGCCATCTCATGGCTTGTCCTACGGTTAATATTCTTTCCCTTTTGG GTTATCAGTGCATCAAGCTATGATATGCAAAATTGCATGAATCTATCGGAGGCCTATCCCATGTTGCTATACTATGTTTT CAATACAATGCTCTTGACACTACTTGTGTTCCATATATACTGGTGGATTCTTATATGCTCAATGATTATGAGACAGCTGA AAAATAGAGGACAAGTTGGAGAAGATATAAGATCTGATTCAGAGGACGATGAATAG
>GSVIVT01031740001 pacid=17837851 polypeptide=GSVIVT01031740001 locus=GSVIVG01031740001 ID=GSVIVT01031740001.Genoscope12X annot-version=Genoscope.12X ATGGGTATTACTACTTCCCTCTCATATCTTTTATTCTTCAACATCATCCTCCCAACCTTAACGGCTTCTCCAATACTGTT TCAGGGGTTCAATTGGGAATCATCCAAAAAGCAAGGAGGGTGGTACAACTTCCTCATCAACTCCATTCCTGAACTATCTG CCTCTGGAATCACTCATGTTTGGCTTCCTCCACCCTCTCAGTCTGCTGCATCTGAAGGGTACCTGCCAGGAAGGCTTTAT GATCTCAATGCATCCCACTATGGTACCCAATATGAACTAAAAGCATTGATAAAGGCATTTCGCAGCAATGGGATCCAGTG CATAGCAGACATAGTTATAAACCACAGGACTGCTGAGAAGAAAGATTCAAGAGGAATATGGGCCATCTTTGAAGGAGGAA CCCCAGATGATCGCCTTGACTGGGGTCCATCTTTTATCTGCAGTGATGACACTCTTTTTTCTGATGGCACAGGAAATCCT GATACTGGAGCAGGCTTCGATCCTGCTCCAGACATTGATCATGTAAACCCCCGGGTCCAGCGAGAGCTATCAGATTGGAT GAATTGGTTAAAGATTGAAATAGGCTTTGCTGGATGGCGATTCGATTTTGCTAGAGGATACTCCCCAGATTTTACCAAGT TGTATATGGAAAACACTTCGCCAAACTTTGCAGTAGGGGAAATATGGAATTCTCTTTCTTATGGAAATGACAGTAAGCCA AACTACAACCAAGATGCTCATCGGCGTGAGCTTGTGGACTGGGTGAAAGCTGCTGGAGGAGCAGTGACTGCATTTGATTT TACAACCAAAGGGATACTCCAAGCTGCAGTGGAAGGGGAATTGTGGAGGCTGAAGGACTCAAATGGAGGGCCTCCAGGAA TGATTGGCTTAATGCCTGAAAATGCTGTGACTTTCATAGATAATCATGACACAGGTTCTACACAAAAAATTTGGCCATTC CCATCAGACAAAGTCATGCAGGGATATGTTTATATCCTCACTCATCCTGGGATTCCATCCATATTCTATGACCACTTCTT TGACTGGGGTCTGAAGGAGGAGATTTCTAAGCTGATCAGTATCAGGACCAGGAACGGGATCAAACCCAACAGTGTGGTGC GTATTCTGGCATCTGACCCAGATCTTTATGTAGCTGCCATAGATGAGAAAATCATTGCTAAGATTGGACCAAGGTATGAT GTTGGGAACCTTGTACCTTCAACCTTCAAACTTGCCACCTCTGGCAACAATTATGCTGTGTGGGAGAAACAGTAA
>GSVIVT01031741001 pacid=17837852 polypeptide=GSVIVT01031741001 locus=GSVIVG01031741001 ID=GSVIVT01031741001.Genoscope12X annot-version=Genoscope.12X ATGTCCAAATTAACTTATTTATTATCTCGGTACATGCCAGGAAGGCTTTATGATCTGAATGCATCCAAATATGGCACCCA AGATGAACTGAAAACACTGATAAAGGTGTTTCACAGCAAGGGGGTCCAGTGCATAGCAGACATAGTTATAAACCACAGAA CTGCAGAGAAGCAAGACGCAAGAGGAATATGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGACCCCAT CTTTCCTTTGCAAGGACGACACTCCTTATTCCGACGGCACCGGAAACCCTGATTCTGGAGATGACTACAGTGCCGCACCA GACATCGACCACATCAACCCACGGGTTCAGCAAGAGCTAA
What I’m trying to do is get the genome (ACGT) sequence for GSVIVT01031740001 (the middle sequence), and none of the others. My current regex is
sequence = re.compile('(?<=>GSVIVT01031740001) pacid=.*annot-version=.*\n[ACGT\n]*[^(?<!>GSVIVT01031740001) pacid]')
with my logic being find the header with the genbank ID for the correct organism, give me that line, then go to a new line and give me all ACGT and new lines until I get to a header for an organism with a different genbank ID. This fails to give any results.
Yes, I know that re.compile doesn’t actually perform a search; I’m searching against a file opened as ‘target’ so my execution looks like
>>> for nucl in target:
...     if re.search(sequence, nucl):
...         print(nucl)
Can someone tell me what I’m doing wrong, either in my regex or by using regex in the first place? When I try this on regex101.com, it works, but when I try it in the Python interpreter (2.7.1), it fails.
Thanks!
If I understand correctly, you want just the genomic sequence for a given locus. You can do something like this (it assumes your data is in a file, with each record on a single line):
# Split every record into its space-separated fields.
lines = [line.split(' ') for line in open('results.txt')]
somedict = {}
for each in lines:
    locus = each[3].split('=')[-1]   # 'locus=GSVIVG...' -> 'GSVIVG...'
    seq = ''.join(each[6:])          # everything after the six header fields
    somedict[locus] = seq
print(somedict)
It outputs a dictionary with the locus as key and the sequence as value:
{'GSVIVG01031741001': 'ATGTCCAAATTAACTTATTTATTATCTCGGTACATGCCAGGAAGGCTTTATGATCTGAATGCATCCAAATATGGCACCCAAGATGAACTGAAAACACTGATAAAGGTGTTTCACAGCAAGGGGGTCCAGTGCATAGCAGACATAGTTATAAACCACAGAACTGCAGAGAAGCAAGACGCAAGAGGAATATGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGACCCCATCTTTCCTTTGCAAGGACGACACTCCTTATTCCGACGGCACCGGAAACCCTGATTCTGGAGATGACTACAGTGCCGCACCAGACATCGACCACATCAACCCACGGGTTCAGCAAGAGCTAA\n', 'GSVIVG01031740001': 'ATGGGTATTACTACTTCCCTCTCATATCTTTTATTCTTCAACATCATCCTCCCAACCTTAACGGCTTCTCCAATACTGTTTCAGGGGTTCAATTGGGAATCATCCAAAAAGCAAGGAGGGTGGTACAACTTCCTCATCAACTCCATTCCTGAACTATCTGCCTCTGGAATCACTCATGTTTGGCTTCCTCCACCCTCTCAGTCTGCTGCATCTGAAGGGTACCTGCCAGGAAGGCTTTATGATCTCAATGCATCCCACTATGGTACCCAATATGAACTAAAAGCATTGATAAAGGCATTTCGCAGCAATGGGATCCAGTGCATAGCAGACATAGTTATAAACCACAGGACTGCTGAGAAGAAAGATTCAAGAGGAATATGGGCCATCTTTGAAGGAGGAACCCCAGATGATCGCCTTGACTGGGGTCCATCTTTTATCTGCAGTGATGACACTCTTTTTTCTGATGGCACAGGAAATCCTGATACTGGAGCAGGCTTCGATCCTGCTCCAGACATTGATCATGTAAACCCCCGGGTCCAGCGAGAGCTATCAGATTGGATGAATTGGTTAAAGATTGAAATAGGCTTTGCTGGATGGCGATTCGATTTTGCTAGAGGATACTCCCCAGATTTTACCAAGTTGTATATGGAAAACACTTCGCCAAACTTTGCAGTAGGGGAAATATGGAATTCTCTTTCTTATGGAAATGACAGTAAGCCAAACTACAACCAAGATGCTCATCGGCGTGAGCTTGTGGACTGGGTGAAAGCTGCTGGAGGAGCAGTGACTGCATTTGATTTTACAACCAAAGGGATACTCCAAGCTGCAGTGGAAGGGGAATTGTGGAGGCTGAAGGACTCAAATGGAGGGCCTCCAGGAATGATTGGCTTAATGCCTGAAAATGCTGTGACTTTCATAGATAATCATGACACAGGTTCTACACAAAAAATTTGGCCATTCCCATCAGACAAAGTCATGCAGGGATATGTTTATATCCTCACTCATCCTGGGATTCCATCCATATTCTATGACCACTTCTTTGACTGGGGTCTGAAGGAGGAGATTTCTAAGCTGATCAGTATCAGGACCAGGAACGGGATCAAACCCAACAGTGTGGTGCGTATTCTGGCATCTGACCCAGATCTTTATGTAGCTGCCATAGATGAGAAAATCATTGCTAAGATTGGACCAAGGTATGATGTTGGGAACCTTGTACCTTCAACCTTCAAACTTGCCACCTCTGGCAACAATTATGCTGTGTGGGAGAAACAGTAA\n', 'GSVIVG01031739001': 
'ATGAAAACGGAACTCTTTCTAGGTCATTTCCTCTTCAAACAAGAAAGAAGTAAAAGTTGCATACCAAATATGGACTCGATTTGGAGTCGTAGTGCCCTGTCCACAGCTTCGGACTTCCTCACTGCAATCTACTTCGCCTTCATCTTCATCGTCGCCAGGTTTTTCTTGGACAGATTCATCTATCGAAGGTTGGCCATCTGGTTATTGAGCAAGGGAGCTGTTCCATTGAAGAAAAATGATGCTACACTGGGAAAAATTGTAAAATGTTCGGAGTCTTTGTGGAAACTAACATACTATGCAACTGTTGAAGCATTCATTCTTGCTATTTCCTACCAAGAGCCATGGTTTAGAGATTCAAAGCAGTACTTTAGAGGGTGGCCAAATCAAGAGTTGACGCTTCCCCTCAAGCTTTTCTACATGTGCCAATGTGGGTTCTACATCTACAGCATTGCTGCCCTTCTTACATGGGAAACTCGCAGGAGGGATTTCTCTGTGATGATGTCTCATCATGTAGTCACTGTTATCCTAATTGGGTACTCATACATATCAAGTTTTGTCCGGATCGGCTCAGTTGTCCTTGCCCTGCACGATGCAAGTGATGTCTTCATGGAAGCTGCAAAAGTTTTTAAATATTCTGAGAAGGAGCTTGCAGCAAGTGTGTGCTTTGGATTTTTTGCCATCTCATGGCTTGTCCTACGGTTAATATTCTTTCCCTTTTGGGTTATCAGTGCATCAAGCTATGATATGCAAAATTGCATGAATCTATCGGAGGCCTATCCCATGTTGCTATACTATGTTTTCAATACAATGCTCTTGACACTACTTGTGTTCCATATATACTGGTGGATTCTTATATGCTCAATGATTATGAGACAGCTGAAAAATAGAGGACAAGTTGGAGAAGATATAAGATCTGATTCAGAGGACGATGAATAG\n'}
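A likely reason the original loop finds nothing is that `for nucl in target` hands the regex one line at a time, while the pattern needs to span the header line and the sequence lines after it. If the file is an ordinary multi-line FASTA file (header on one line, sequence wrapped over the following lines), a small parser avoids the regex entirely. The sketch below is one way to do that, not the answerer's method; `read_fasta` is a made-up helper name, and the example records are shortened copies of the first two sequences from the question.

```python
def read_fasta(lines):
    """Map each record ID (the first word after '>') to its joined sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: the ID is everything up to the first space.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Shortened example records, for illustration only.
example = [
    ">GSVIVT01031739001 pacid=17837850\n",
    "ATGAAAACGG\n",
    "AACTCTTTCT\n",
    ">GSVIVT01031740001 pacid=17837851\n",
    "ATGGGTATTA\n",
    "CTACTTCCCT\n",
]
sequences = read_fasta(example)
print(sequences["GSVIVT01031740001"])
```

With a real file, passing the open file object (`read_fasta(open('results.txt'))`) works the same way, since the function only needs an iterable of lines.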

how to use regexp to select files in a specific order? - matlab

Let's say I have 14 files with names:
file_001.txt file_002.txt file_003.txt file_004.txt ... file_014.txt
I'm trying to write a regex that selects my files in a specific order. Assuming ls outputs:
file_001.txt file_002.txt file_003.txt ... file_014.txt
regexp(ls ,'file_0+([135]|[246])\.txt','match') gives me:
file_001.txt
file_002.txt
file_003.txt
file_004.txt
file_005.txt
file_006.txt
but what I'm aiming at is:
file_001.txt
file_003.txt
file_005.txt
file_002.txt
file_004.txt
file_006.txt
Regex is simply not the right tool for this.
You'll end up with an expression that looks like this:
file_([1-9][0-9]?|100)[1-5][5-8](12[1-9]|1[3-9][0-9]|[2-4][0-9]{2}|5[0-2][0-9])(73[89]|7[4-9][0-9]|8[0-9]{2}|9[0-8][0-9]|99[01])(9|[1-9][0-9]{1,2}|[1-7][0-9]{3}|80[0-9]{2}|81[01][0-9]|812[0-8])(83[4-9]|8[4-9][0-9]|9[0-9]{2}|[1-9][0-9]{3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])\.txt
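A regex only decides which names match; the matches come back in the order they appear in the input, so the reordering has to be a separate sorting step. Here is a minimal sketch of that idea, written in Python rather than MATLAB: extract the number from each name and sort odd numbers before even ones. The helper name `odd_then_even` is invented for this example.

```python
import re


def odd_then_even(names):
    """Sort names so odd-numbered files come first, then even, each ascending."""
    def key(name):
        number = int(re.search(r"(\d+)", name).group(1))
        # False sorts before True, so odd numbers (False) come first.
        return (number % 2 == 0, number)
    return sorted(names, key=key)


files = ["file_%03d.txt" % i for i in range(1, 7)]
print(odd_then_even(files))
```

The same two-step approach (match first, then sort on a computed key) should carry over to MATLAB, although the exact functions would differ.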

For loop using enumerate through a list with an if statement to search lines for a particular string

I am trying to compile a list of recurring strings (transaction IDs).
I am flummoxed. I've researched the correct method and feel like this code should work.
However, I'm doing something wrong in the second block.
This first block correctly compiles a list of the strings that I want.
I can't get this second block to work. If I simplify, I can print each value in the list
by using
for idx, val in enumerate(tidarray): print val
It seems like I should now be able to use that value to search each line for that string,
then print the line (actually I'll be using it in conjunction with another search term to
reduce the number of line reads, but this is my basic test before honing in further).
def main():
    pass
samlfile= "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG org.sourceid.saml20.domain.AttributeMapping] Source attributes:{SAML_AUTHN_CTX=urn:oasis:names:tc:SAML:2.0:ac:classes"
tidarray = []
for line in samlfile:
    if "tid:" in line:
        str=line
        tid = re.search(r'(tid:.*?)(?= )', str)
        if tid.group() not in tidarray:
            tidarray.append(tid.group())
for line in samlfile:
    for idx, val in enumerate(tidarray):
        if val in line:
            print line
Can someone suggest a correction for the second block of code? I recognize that reading the file twice isn't the most elegant solution... My main goal here is to learn how to enumerate through the list and use each value in the subsequent code.
Iterating over a file twice
Basically what you do is:
for line in somefile: pass # first run
for line in somefile: pass # second run
The first run will complete just fine, the second run will not run at all.
This is because the file was read until the end and there's no more data to read lines from.
Call somefile.seek(0) to go to the beginning of the file:
for line in somefile: pass # first run
somefile.seek(0)
for line in somefile: pass # second run
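The effect is easy to check without a real file. In this small sketch an in-memory `io.StringIO` object stands in for the open file (an assumption made only to keep the example self-contained); it behaves the same way when iterated and rewound.

```python
import io

# Stand-in for an open text file with two lines.
somefile = io.StringIO(u"line one\nline two\n")

first = [line for line in somefile]    # reads both lines
second = [line for line in somefile]   # already at the end: reads nothing
somefile.seek(0)                       # rewind to the beginning
third = [line for line in somefile]    # reads both lines again

print(len(first), len(second), len(third))
```

The second pass is empty and the third pass, after `seek(0)`, sees both lines again.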
Storing things uniquely
Basically, what you seem to want is a way to store the IDs from the file in a
data structure in which every ID appears only once.
If you want to store elements uniquely you use, for example, dictionaries (help(dict))
or sets (help(set)). Example with sets:
myset = set()
myset.add(2) # set([2])
myset.add(3) # set([2,3])
myset.add(2) # set([2,3])
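Applied to the transaction IDs, a set removes the need for the `not in tidarray` check. The following is a rough sketch under a few assumptions: a tid is taken to be `tid:` followed by non-space characters, the function name `unique_tids` is invented here, and all log lines after the first are made up for illustration.

```python
import re

# Assumption: a tid runs from 'tid:' up to the next whitespace.
TID_PATTERN = re.compile(r"tid:\S+")


def unique_tids(lines):
    """Collect every distinct transaction ID found in the given lines."""
    tids = set()
    for line in lines:
        match = TID_PATTERN.search(line)
        if match:  # lines without a tid are skipped instead of raising an error
            tids.add(match.group())
    return tids


# Illustrative log lines; only the first timestamp and tid come from the question.
log = [
    "2013-08-18 06:24:27,410 tid:5af193fdc DEBUG first message",
    "2013-08-18 06:24:27,411 tid:5af193fdc DEBUG second message",
    "2013-08-18 06:24:28,002 tid:0c77e1a2b INFO another transaction",
    "a line without any transaction id",
]
tids = unique_tids(log)
matching = [line for line in log if any(tid in line for tid in tids)]
```

Because `log` is a list here, it can be scanned twice without rewinding; with a real file object the second pass would need `seek(0)` first.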

Stacking related lines together in notepad++

Hi so I'm trying to use find and replace in notepad++ with regular expression to do the following:
I have two set of lines
first set:
[c][eu][e]I37ANKCB[/e]
[c][eu][e]OIL8ZEPW[/e]
[c][eu][e]4OOEL75O[/e]
[c][eu][e]PPNW5FN4[/e]
[c][eu][e]E2BXCWUO[/e]
[c][eu][e]SD9UQNT8[/e]
[c][eu][e]E6BK6IGO[/e]
second set:
[u]7ubju2jvioks[u2]_261
[u]89j408tah1lz[u2]_262
[u]j673xnd49tq0[u2]_263
[u]dv73osmh1wzu[u2]_264
[u]twz3u4yiaeqr[u2]_265
[u]cuhtg6r71kud[u2]_266
[u]yts0ktvt9a3r[u2]_267
Now I want the second set to be placed after each line of the first set, like this:
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
any suggestions?
You can mark the second block in column mode using ALT and the left mouse button. Then just copy paste it at the end of the first row.
No need/Not possible using regex.
I would solve this via a simple script written in Python or Ruby or something equally quick. This works, for example:
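import os

path = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(path, 'file1')) as file1:
    with open(os.path.join(path, 'file2')) as file2:
        # Pair line N of file1 with line N of file2.
        lines = zip(file1.readlines(), file2.readlines())
        # Drop the newline of the first line so the second is appended to it.
        print(''.join([a.rstrip() + b for a, b in lines]))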
Running it gives the correct result:
> python join.py
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
Customize to suit your needs.

Use bash to concatenate a list of items

I have a list of items like:
ERR001268_chr6
ERR001312_chr6
ERR001332_chr6
ERR001361_chr6
ERR001369_chr6
ERR001413_chr6
ERR001433_chr6
ERR001462_chr6
ERR001698_chr6
ERR001734_chr6
ERR001763_chr6
ERR001774_chr6
ERR001799_chr6
Say I want to concatenate ERR001268_chr6 through ERR001763_chr6.
I can do cat ERR001268_chr6 ERR001269_chr6....ERR001763_chr6 > xxx
But obviously I don't want to type in these items one by one...So any simple bash commands to do this?
thx
Assuming that the item list is the full list of 'files' under current directory:
cat `ls -1 ERR*_chr6 | head -n11` > xxx
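The `head -n11` approach depends on counting how many names fall in the range. As an alternative, here is a hedged sketch in Python that selects by the first and last name instead of by count; both helper names are invented for this example, and it assumes plain string sorting gives the intended order, as it does for these fixed-width names.

```python
def names_in_range(names, first, last):
    """Return the sorted names from `first` through `last`, inclusive."""
    ordered = sorted(names)
    return ordered[ordered.index(first):ordered.index(last) + 1]


def concatenate(paths, out_path):
    """Write the contents of each path to out_path, in the given order."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                out.write(src.read())


items = [
    "ERR001268_chr6", "ERR001312_chr6", "ERR001332_chr6", "ERR001361_chr6",
    "ERR001369_chr6", "ERR001413_chr6", "ERR001433_chr6", "ERR001462_chr6",
    "ERR001698_chr6", "ERR001734_chr6", "ERR001763_chr6", "ERR001774_chr6",
    "ERR001799_chr6",
]
selected = names_in_range(items, "ERR001268_chr6", "ERR001763_chr6")
```

With real files, `concatenate(selected, "xxx")` would then play the role of the `cat ... > xxx` command, and `glob.glob("ERR*_chr6")` could supply the list of names.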