Use regex to add a string to multiple similar lines - regex

I have a .txt file with multiple lines (different yet similar to one another) to which I want to add "*.tmp" at the end.
I'm trying to use Python 2.7's re module to do this.
Here is what I have for the python script:
import sys
import os
import re
import shutil

# Set the buildpath variable: replace each "\" with "\\" so the path is
# escaped correctly when used in the replacement
buildpath = sys.argv[1]
buildpath = buildpath.replace('\\', '\\\\')
# Open a tmp file with read/append permissions
tf = open('tmp', 'a+')
# Open the dump list
with open('dumplist.txt') as f:
    # Iterate over every line in the file
    for line in f.readlines():
        # Replace the \\build\path string in the line with buildpath
        build = re.sub (r'\\\\''.*',''+buildpath+'',line)
        # Write the line, with the path replaced, to the tmp file
        tf.write(build)
# Save both files by closing them
tf.close()
f.close()
# Copy what was re-written in the tmp file over dumplist.txt
shutil.copy('tmp', 'dumplist.txt')
# Delete the tmp file
os.remove('tmp')
# Exit the script
exit()
Current file before replacing the string:
\\server\dir1\dir2\dir3
DUMP3f2b.tmp
1 File(s) 1,034,010,207 bytes
\\server\dir1_1\dir2_1\dir3_1
DUMP3354.tmp
1 File(s) 939,451,120 bytes
\\server\dir1_2\dir2_2\dir3_2
Current file after replacing the string:
\*.tmp
DUMP3f2b.tmp
1 File(s) 1,034,010,207 bytes
\*.tmp
DUMP3354.tmp
1 File(s) 939,451,120 bytes
\*.tmp
Desired file after replacing the string:
\\server\dir1\dir2\dir3\*.tmp
DUMP3f2b.tmp
1 File(s) 1,034,010,207 bytes
\\server\dir1_1\dir2_1\dir3_1\*.tmp
DUMP3354.tmp
1 File(s) 939,451,120 bytes
\\server\dir1_2\dir2_2\dir3_2\*.tmp
If anyone could help me in solving this that would be great. Thanks :)

You should use capturing groups:
>>> import re
>>> s = "\\server\dir1\dir2\dir3"
>>> print re.sub(r'(\\.*)', r'\\\1\*.tmp', s)
\\server\dir1\dir2\dir3\*.tmp
Then, modify the build = re.sub (r'\\\\''.*',''+buildpath+'',line) line this way:
build = re.sub (r'(\\.*)', r'\\\1%s' % buildpath, line)
Also, you shouldn't call readlines(), just iterate over f:
for line in f:
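Putting both fixes together, here is a minimal sketch of the whole script (keeping the dumplist.txt name from the question; it rewrites the file in place instead of going through a tmp copy, which is a simplification of your approach):

import re

# read all lines up front so the same file can be rewritten afterwards
with open('dumplist.txt') as f:
    lines = f.readlines()

with open('dumplist.txt', 'w') as out:
    for line in lines:
        # capture the whole UNC path and append \*.tmp after it;
        # lines that don't start with \\ pass through unchanged
        out.write(re.sub(r'^(\\\\.*)$', r'\1\\*.tmp', line))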

Related

Python code for import not working properly

I am testing the Python code below to export content from one txt file to another, but in the destination the contents get copied in some different language (maybe Chinese) with improper characters.
# -*- coding: utf-8 -*-
from sys import argv
from os.path import exists
script,from_file,to_file = argv
print "Does the output file exist ? %r" %exists(to_file)
print "If 'YES' hit Enter to proceed or terminate by CTRL+C "
raw_input('?')
infile=open(from_file)
file1=infile.read()
outfile=open(to_file,'w')
file2=outfile.write(file1)
infile.close()
outfile.close()
try this code to copy one file to another:
with open(from_file) as f:
    with open(to_file, "w") as f1:
        for line in f:
            f1.write(line)
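If the strange characters come from an encoding mismatch (for example a UTF-16 source file read back as if it were ASCII), copying in binary mode sidesteps decoding entirely. A minimal sketch, assuming that is the cause:

import shutil
from sys import argv

script, from_file, to_file = argv

# binary mode copies the raw bytes untouched, so the destination ends up
# byte-for-byte identical to the source, whatever its encoding is
with open(from_file, 'rb') as src, open(to_file, 'wb') as dst:
    shutil.copyfileobj(src, dst)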

The glob.glob function to extract data from files

I am trying to run the script below. The intention of the script is to open different fasta files one after the other and extract the geneID. The script works well if I don't use the glob.glob function. With it, I get this message: TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
    fastas=sorted(glob.glob(files + '/*.fasta'))
    #print fastas[0]
    output_handle=(open(fastas, 'r+'))
    genes_files=list(SeqIO.parse(output_handle, 'fasta'))
    geneID=genes_files[0].id
    print geneID
I am running out of ideas on how to direct the script to open one file after another to give me the required information.
I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
As an aside (and this is more a problem on Windows than on Linux or Mac), it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames, to prevent the unwanted expansion of backslash-escaped sequences like \n and \t to newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine but gives an error on the machine of a colleague who has a different operating system.
I would also recommend using the with statement when opening files. This assures that the filehandle gets properly closed when you're done with it.
As a final remark, file is a built-in function in Python and it is bad practice to use a variable with the same name as a built-in function because that can cause bugs or confusion later on.
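A quick illustration of that shadowing in the Python 2 REPL (the filenames are made up):

>>> file
<type 'file'>
>>> file = 'a.txt'        # the name now hides the built-in
>>> file('out.txt', 'w')  # any later call through it breaks
Traceback (most recent call last):
  ...
TypeError: 'str' object is not callable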
Combining all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO

path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')

for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id
This is the way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO

def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file
    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()
        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()

extracting_information_gene_id()
print 'Reference gene sequence have been added'

How to change the values of a parameter in multiple files using python

I am a new user of Python. I have learned a way of changing the value of a parameter in a single file. The script:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('3.7',sigma)
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
It's run from the command line on test.txt as
test.py test.txt 2
and 3.7 is replaced by 2 in test.txt as a result.
Now if I want to do the same for all the .txt files in the directory e.g.
test.py *.txt 2
what are the suggested modifications?
Your suggestions are highly appreciated.
Hafiz.
bash (or whatever your shell is) will expand the *.txt (to test0.txt test1.txt ... or whatever the *.txt files in your current directory are called) before passing it to your python script. your python script will therefore get many arguments (and not just 2 as you expect). print sys.argv to inspect.
you could solve that in bash itself with something like
for name in *.txt; do test.py ${name} 2; done
otherwise you would need to treat sys.argv differently in python and allow for more than 2 arguments.
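a sketch of that second option, in case you want the script itself to cope with test.py *.txt 2 as typed: take the last argument as the value and everything in between as filenames:

from sys import argv

sigma = argv[-1]        # last argument is the replacement value
filenames = argv[1:-1]  # everything between script name and value

for filename in filenames:
    with open(filename) as f:
        txt = f.read()
    with open(filename, 'w') as f:
        f.write(txt.replace('3.7', sigma))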
Importing glob solved that issue. But I've got some queries.
Query 1:
I'm rewriting my code as:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('3.7'|'3',sigma) #gives syntax error
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
I want to replace 3.7 or 3 by sigma. What will be the corrected code?
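For what it's worth, str.replace() only takes a single literal string, so the | has no meaning there; a regex alternation is what the | was reaching for. A sketch (the \b word boundaries are an assumption about what should count as a standalone number):

import re

# try the longer literal first so '3.7' is not matched as a bare '3'
txt = re.sub(r'\b(3\.7|3)\b', sigma, txt)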
Query 2:
I'm rewriting it in the following manner:
#####test.py##########
from sys import argv
script,filename,sigma = argv
file_data = open(filename,'r')
txt = file_data.read()
txt=txt.replace('x="2"','x=sigma')
file_data = open(filename,'w')
file_data.write(txt)
file_data.close()
With
py test.py test.txt 3
I get x=sigma, but I want to get x=3.
What would be the modification?
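'x=sigma' is a literal string, so the variable never gets substituted. A sketch that splices the value in with string formatting instead (whether to keep the quotes around the value is a guess here):

# build the replacement from the value of sigma instead of naming it
txt = txt.replace('x="2"', 'x=%s' % sigma)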
Regards,
Hafiz

Read & write txt file error - 'str' object has no attribute 'name', Polish diacritical chars in path error

I use Python 2.7 on Win 7 Pro SP1.
I am trying this code:
import os

path = "E:/data/keyword"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
result = open("query.txt", "w")
for file in files:
    if file.endswith(".txt"):
        file_path = file.name
        dane = open(file_path, "r")
        query.append(dane)
        result.append(" OR ")
result.write(query)
result.write(")")
result.close()
I get an error:
file_path = file.name
AttributeError: 'str' object has no attribute 'name'
I can't figure out why.
I have a second error when the path contains Polish diacritical chars like "ąęłńóżć". I get an error for:
path = "E:/Bieżące projekty/keyword"
I tried to fix it with:
path = u"E:/Bieżące projekty/keyword"
but it does not help. I'm starting with Python and I can't find out why this code is not working.
What I want:
Find all text files in the directory.
Join all text files into one text file named "query.txt".
E.g.:
file 1
data1 data2
file 2
data 3 data 4
Output in "query.txt":
data1 data2 data 3 data 4
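For what it's worth, a minimal sketch of that join, fixing the file.name and .append() mistakes from the code above (the {keyword} AND NOT( ... OR ... ) shape is kept from it):

import glob
import os

path = r"E:/data/keyword"
os.chdir(path)

parts = []
for name in glob.glob("*.txt"):
    if name == "query.txt":   # skip the output file on reruns
        continue
    with open(name) as dane:
        parts.append(dane.read().strip())

with open("query.txt", "w") as result:
    result.write("{keyword} AND NOT(" + " OR ".join(parts) + ")")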
My code works fine when the path variable contains no Polish diacritical characters. When I change the path I get this error:
SyntaxError: Non-ASCII character '\xc5' in file query.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
In the Python doc PEP 263 I found the magic comment. The standard coding for Polish characters like "ąęłńóźżć" is ISO-8859-2, so I tried to add that encoding to the code. I tried UTF-8 too and I get the same error. My whole code is (without the first 5 lines of comments describing what the code does):
import os
#path = r"E:/data"
# -*- coding: iso-8859-2 -*-
path = r"E:/Bieżące przedsięwzięcia"
os.chdir(path)
files = os.listdir(path)
query = "{keyword} AND NOT("
for file in files:
    if file.endswith(".txt"):
        dane = open(file, "r")
        text = dane.read()
        query += text
        print(query)
        dane.close()
query.join(" OR ")
result = open("query.txt", "w")
result.write(query)
result.write(")")
result.close()
In a Unicode/UTF-8 character table I found that the Polish char "ż" is coded in UTF-8 as "\xc5\xbc". Commenting out the line containing "ż" with # causes the error too. When I remove the line with this char:
path = r"E:/Bieżące przedsięwzięcia"
it works fine and I get the result I want.
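One likely reason the declaration was ignored: PEP 263 only honours the coding comment when it sits on the first or second line of the file, and in the code above it is on the third line, below import os. A sketch with it moved to the top (UTF-8 is assumed here; it has to match whatever encoding Notepad++ actually saved the file in):

# -*- coding: utf-8 -*-
import os

# the coding comment above is on line 1, as PEP 263 requires
path = u"E:/Bieżące przedsięwzięcia"
os.chdir(path)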
For editing I use Notepad++ with default settings. I have only set tabs to be replaced by four spaces in Python code.
Second question:
I tried to find in the Python documentation what the r before the path value means. I can't find it in the Python 2.7 string documentation. Could someone tell me what this part of Python (the u or r before a string value) is called, as in
path = u"somedata"
path = r"somedata"?
I would like some documentation to read about it.
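Those prefixes are called string literal prefixes; the Python 2.7 Language Reference describes them under "String literals" in the lexical analysis chapter, not in the str type docs. A quick demonstration:

# r marks a raw string literal: backslash escapes are not processed
print r"C:\new\test"   # prints C:\new\test
print "C:\new\test"    # the \n and \t become a newline and a tab

# u marks a unicode literal: type unicode instead of byte string str
print type(u"somedata")   # <type 'unicode'>
print type("somedata")    # <type 'str'>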

Formatting text file

I have a txt file that I would like to alter so I will be able to place the data into columns (see example below). The reason behind this is so I can import this data into a database / array and perform calculations on them. I tried importing/pasting the data into LibreCalc, but it just imports everything into one column or it opens the file in LibreWriter. I'm using Ubuntu 10.04. Any ideas? I'm willing to use another program to work around this issue. I could also work with a comma-delimited file, but I'm not too sure how to convert the data to that format automatically.
Trying to get this:
WAVELENGTH, WAVENUMBER, INTENSITY, CLASSIFICATION, CODE,
1132.8322, 88274.326, 2300, PT II, 9356- 97630, 05,
Here's a link to the full file.
pt.txt file
Try this:
sed -E "s/(\s+)/,\1/g" pt.txt
is this what you want?
awk 'BEGIN{OFS=","}NF>1{$1=$1;print}' pt.txt
if you want the output format looks better, and you have "column" installed, you can try this too:
awk 'BEGIN{OFS=", "}NF>1{$1=$1;print}' pt.txt|column -t
The awk and sed one-liners are cool, but I expect you'll end up needing to do more than simply splitting up the file. If you do, and if you have access to Python 2.7, the following little script will get you going.
# -*- coding: utf-8 -*-
"""Convert to comma-delimited"""

import csv
from os import path
import re
import sys

def splitline(line):
    return re.split('\s{2,}', line)

def main():
    srcpath = path.abspath(sys.argv[1])
    targetpath = path.splitext(srcpath)[0] + '.csv'
    with open(srcpath) as infile, open(targetpath, 'w') as outfile:
        writer = csv.writer(outfile)
        for line in infile:
            if line.startswith(' '):
                line = line.strip()
                cols = splitline(line)
                writer.writerow(cols)

if __name__ == '__main__':
    main()
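Assuming the script is saved as, say, convert.py (the name is arbitrary), it would be run as
python convert.py pt.txt
and it writes pt.csv next to the source file.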
The easiest way turned out to be importing using a fixed width, as tohuwawohu suggested.
Thanks
Without transforming it to a comma-separated file, you could access the csv import options by simply changing the file extension to .csv (maybe you should remove the "header" part manually, so that only the column heads and the data rows remain). After that, you can try to use whitespace as the column delimiter, or even easier: select "fixed width" and set the columns manually. – tohuwawohu Oct 20 at 9:23