Biopython translate output error - regex

I'm constructing a bash script that combines grep and small Python scripts to search a genetic sequence file (FASTA format) for stretches of sequence of a given length between two search strings and to translate those stretches into peptides. The bash script runs two grep commands followed by a Biopython script that prints the first few lines corresponding to the desired region.
grep -E -o "ATGAGTCTT(.*)TCAGTACG" search_script_testdata.fasta > ./output1.txt
grep -E -o "(.*)TCAGTACG" output1.txt > ./output2.txt
python print_int.py > ./output3.txt
python translate.py > ./output4.txt
The code works until python translate.py.
from Bio.Seq import translate
for line in open("output3.txt"):
    translate(line)
The output of translate.py, when run from within Python, is the following:
Bio/Seq.py:1976: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
'LVS'
'SLD'
The penultimate file that I hope to generate would have just the information
LVS
SLD
However, when the bash script runs, only the warning message (and not the two amino acid sequences) is printed to the screen, and nothing is written to output4.txt. The amino acid sequences aren't supposed to start with a methionine, which is the source of the warning. I need the sequences in this format. Can anyone with experience in Biopython lend a hand and suggest how I can get only the amino acid sequences written to a file?
Edit:
I've since changed the search_script_testdata.fasta file so that the expected output3.txt file would have only three lines of ATGAGTCTT, which translates to MSL.
output3.txt
ATGAGTCTT
ATGAGTCTT
ATGAGTCTT
The resulting error is the same as before.
translate.py is a file with the following lines of code:
for line in open("output2.txt", "r"):
print(line[:9])
This time I get
'MSL'
'MSL'
'MSL'
with the same error message. My understanding is that this code should work with files set up so that each line has one string of genomic sequence to be translated. There is a separate method in the Biopython cookbook for dealing with FASTA-format translation.
Any thoughts?

If your file translate.py really looks like this:
from Bio.Seq import translate
for line in open("output3.txt"):
    translate(line)
you shouldn't expect any data to be written to standard output, and therefore nothing will be redirected to the output4.txt file.
Print the translated sequence to stdout:
print(translate(line))
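For reference, a minimal sketch of a corrected translate.py, assuming each line of output3.txt holds one nucleotide string; it strips the newline and trims any trailing partial codon so the BiopythonWarning is not raised, then prints one peptide per line (redirect to output4.txt as before):
from Bio.Seq import translate

for line in open("output3.txt"):
    seq = line.strip()
    if not seq:
        continue  # skip blank lines
    seq = seq[:len(seq) - len(seq) % 3]  # drop any trailing partial codon
    print(translate(seq))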

Related

Why is this vim regex so expensive: s/\n/\\n/g

Attempting this on a sufficiently large file (say 80,000+ lines and about 500 KB+) will eventually crash or stall, both on my server and on my local Mac.
I've tried this at the command line as well, with the same result:
vim -es -c '%s/\n/\\n/g' -c wq $file
Also, the problem appears to be with the selection (\n) and not the replacement (\\n).
For my larger files I can of course split them and cat them back when finished, but the split points cannot be arbitrary in my case and must be adjusted manually for each and every split.
I appreciate that there are other ways to do this -- sed, etc. -- but I have similar and additional problems there, and I would like to be able to do this with vim.
I'm adding my comment as an answer:
Text editors usually don't like 'gigantic' lines (which is what you'll get with that replacement).
To test whether this is due to the 'big line' and not to the substitution itself, I did this test:
I created a simple ~500 KB file with a script. No newline characters, just a single line. Then I tried to load the file with vim. Result? I had to kill it :-).
However, if in the same script I write some newlines every now and then, I have no problem opening the file.
Also, one thing you could try is the following: in vim, replace \n with \n\n. If that is fast, it should also confirm the 'big line' issue.
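For what it's worth, a small sketch (in Python, purely for illustration; the file name and size are arbitrary) of the kind of generator script used for the test:
# Write roughly 500 KB of text as one single line, with no newline
# characters, to reproduce the 'gigantic line' behaviour in an editor.
with open("one_long_line.txt", "w") as out:
    out.write("abcdefghij" * 50000)  # ~500 KB, no '\n'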

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them, normally to the screen; the pipe | takes the output of the command on its left and pipes it through to the command on its right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which is told to split after 1000000 lines by the switch -l1000000. The - (with spaces around it) tells it to read its input not from a file but from "standard input"; the output of sort -u in this case. The last word, outfile_, can be changed by you, if you want.
Written as it is, this will result in files like outfile_aa, outfile_ab and so on - you can modify this with the last word of the command.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters its input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match, but now it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing but zero or more whitespace characters (like spaces or tabs).
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
Save this as a file (for example combine.sh) and run it with
./combine.sh
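If you'd rather avoid Cygwin and stay with plain Python on Windows, here is a rough sketch of the same combine / de-duplicate / split job (the file pattern, output names and chunk size are placeholders, and it assumes the combined data fits in memory):
import glob

CHUNK = 1000000  # lines per output file

# Collect unique, non-blank lines from every input file.
unique_lines = set()
for name in glob.glob("file*.txt"):
    with open(name) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.strip():
                unique_lines.add(line)

# Write the sorted unique lines out in chunks of CHUNK lines each.
ordered = sorted(unique_lines)
for part, start in enumerate(range(0, len(ordered), CHUNK)):
    with open("outfile_%02d.txt" % part, "w") as out:
        out.write("\n".join(ordered[start:start + CHUNK]) + "\n")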

Is there a good way to find exact matches of an extremely long string (~500 characters) in a couple-megabyte CSV file?

I'm trying to find a match for a ~500-character-long DNA sequence in a few-megabyte CSV file containing different sequences. Before each sequence in the CSV file there is some metadata I would like to have. Each sequence and its metadata take up exactly one line. I've tried
grep -B 1 "extremelylongstringofDNATACGGCATAGAGGCCGAGACCTAGGATTAACGTTACTGACGAT" csvfile.csv
However, that returns filename too long.
An interesting and frustrating thing I bumped into was when I tried to find the line count of the csv file by using
wc -l csvfile.csv
it returned
0 csvfile.csv
And without the -l flag, it returned
0 161410 41507206 csvfile.csv
This is the result even after I added a newline between the end of each sequence and the start of the next sequence's metadata.
The issue was that the file had CR line terminators, so the GNU tools were not detecting any line endings and were therefore reading the file as one huge line. I solved the issue by using mac2unix to convert the file so that its line endings are readable by GNU tools.
Thanks to Etan Reisner for providing the hint
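If mac2unix isn't at hand, a small Python sketch of the same conversion works too (the file names are placeholders, and it assumes the file fits in memory):
# Replace old-Mac CR (and CRLF) line terminators with LF so that
# line-oriented tools such as wc and grep see the individual lines.
with open("csvfile.csv", "rb") as handle:
    data = handle.read()
with open("csvfile_unix.csv", "wb") as out:
    out.write(data.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))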

How to handle a Unix command containing \x in Python code

I want to execute the command
sed -e 's/\x0//g' file.xml
using Python code.
But I'm getting the error ValueError: invalid \x escape.
You are not showing your Python code, so there is room for speculation here.
But first, why does the file contain null bytes in the first place? It is not a valid XML file. Can you fix the process which produces this file?
Secondly, why do you want to do this with sed? You are already using Python; use its native functions for this sort of processing. If you expect to read the file line by line, something like
with open('file.xml', 'r') as xml:
    for line in xml:
        line = line.replace('\x00', '')
        # ... your processing here
or if you expect the whole file as one long byte string:
with open('file.xml', 'r') as handle:
    xml = handle.read()
xml = xml.replace('\x00', '')
If you really do want to use an external program, tr would be more natural than sed. Exactly which syntax to use depends on the dialect of tr or sed as well, but the fundamental problem is that backslashes in Python strings are interpreted by Python. If there is a shell involved, you also need to take the shell's processing into account. But in very simple terms, try this:
os.system("sed -e 's/\\x0//g' file.xml")
or this:
os.system(r"sed -e 's/\x0//g' file.xml")
Here, the single quotes inside the double quotes are required because a shell interprets this. If you use another form of quoting, you need to understand the shell's behavior under that quoting mechanism, and how it interacts with Python's quoting. But you don't really need a shell here in the first place, and I'm guessing in reality your processing probably looks more like this:
import subprocess

sed = subprocess.Popen(['sed', '-e', r's/\x0//g', 'file.xml'],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE)
result, err = sed.communicate()
Because no shell is involved here, all you need to worry about is Python's quoting. Just like before, you can relay a literal backslash to sed either by doubling it, or by using a r'...' raw string.
Hex escapes in Python need two hex digits:
\x00

I need to create list in python from OpenOffice Calc columns

The problem is that I have large amounts of data in OpenOffice Calc, approximately 3600 entries for each of 4 different categories and 3 different sets of this data, and I need to run some calculations on it in Python. I want to create lists corresponding to each of the four categories. I am hoping someone can help guide me to an easy-ish, efficient way to do this, whether it be a script or importing the data. I am using Python 2.7 on a Windows 8 machine. Any help is greatly appreciated.
The current method I am trying is to save the ODF file as CSV and then use genfromtxt (from numpy).
from numpy import genfromtxt
my_data = genfromtxt('C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv', delimiter=',')
print(my_data)
File "C:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 5, in <module>
File "c:\Python27\Lib\site-packages\numpy\lib\npyio.py", line 1352, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rbU'))
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 147, in open
return ds.open(path, mode)
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 496, in open
raise IOError("%s not found." % path)
IOError: C:\Users omdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv not found.
The error stems from this code in _datasource.py:
# NOTE: _findfile will fail on a new file opened for writing.
found = self._findfile(path)
if found:
    _fname, ext = self._splitzipext(found)
    if ext == 'bz2':
        mode.replace("+", "")
    return _file_openers[ext](found, mode=mode)
else:
    raise IOError("%s not found." % path)
Your problem is that your path string 'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv' contains an escape sequence - \t. Since you are not using a raw string literal, the \t is being interpreted as a tab character, similar to the way a \n is interpreted as a newline. If you look at the line starting with IOError:, you'll see a tab has been inserted in its place. You don't get this problem with UNIX-style paths, as they use forward slashes /.
There are two ways around this. The first is to use a raw string literal:
r'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv'
(note the r at the beginning). Raw string literals don't interpret backslashes \ as beginning an escape sequence.
The second way is to use a UNIX-style path with forward slashes as path delimiters:
'C:/Users/tomdi_000/Desktop/Load modeling(WSU)/PMU Data/Data18-1fault-Alvey-csv trial.csv'
This is fine if you're hard-coding the paths into your code, or reading from a file that you generate, but if the paths are getting generated automatically, such as reading the results of an os.listdir() command for example, it's best to use raw strings instead.
If you're going to be using numpy to do the calculations on your data, then using np.genfromtxt() is fine. However, for working with CSV files, you'd be much better off using the csv module. It includes all sorts of functions for reading columns and rows, and doing data transformation. If you're just reading the data then storing it in a list, for example, csv is definitely the way to go.
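A minimal sketch with the csv module, assuming the exported CSV has a header row and one column per category (the file name and column layout are assumptions about your data, not something the question specifies):
import csv

# Build one list per column, keyed by the header names.
columns = {}
with open('data.csv', 'rb') as handle:  # on Python 3, use open('data.csv', newline='')
    reader = csv.reader(handle)
    header = next(reader)
    for name in header:
        columns[name] = []
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(float(value))

# columns now maps each category name to a plain Python list of values.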