I have a text file which contains special characters. I want to replace ");" with "firstdata);seconddata".
The catch is that ")" and ";" must appear together, as the two-character sequence ");", before being replaced with "firstdata);seconddata".
I have the following code.
import re
string = open('trial.txt').read()
new_str = re.sub('[);]', 'firstdata);seconddata', string)
open('b.txt', 'w').write(new_str)
Please suggest how I should change my code to get the right output.
This should do:
import re
with open("file.txt", "r") as rfile:
    s = rfile.read()
# Escape ")" so it is matched literally; the raw string keeps the backslash intact.
rplce = re.sub(r'\);', "REPLACED", s)
with open("file.txt", "w") as wfile:
    wfile.write(rplce)
You can use the built-in str.replace() method in Python:
string = "foobar);"
string.replace(");", 'firstdata);seconddata') # -> 'foobarfirstdata);seconddata'
Here are the docs for common string operations like this in Python:
https://docs.python.org/3/library/string.html
You may use a simpler way.
with open('input_file.txt', 'r') as input_file:
    with open('output_file.txt', 'w') as output_file:
        for line in input_file:
            x = line.replace('findtext', 'replacetext')
            output_file.write(x)
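To see why the pattern in the question misbehaves, here is a small sketch on a made-up sample string (the string and variable names are only for illustration). `[);]` is a character class, so it matches ")" or ";" one character at a time, while an escaped `\);` or a plain str.replace() matches the two characters only when they appear together.

```python
import re

# Made-up sample text, only for illustration.
text = "call(a); x = (1;2) done);"

# '[);]' is a character class: it matches ")" OR ";" individually.
wrong = re.sub(r'[);]', '#', text)

# Escaping the parenthesis matches the literal two-character sequence ");".
right_regex = re.sub(r'\);', 'firstdata);seconddata', text)

# For a fixed string, no regex is needed at all.
right_plain = text.replace(');', 'firstdata);seconddata')

print(wrong)
print(right_regex)
print(right_plain == right_regex)
```

Since the search text is fixed, str.replace() is probably the simpler choice here; the regex version only pays off once the pattern needs to vary.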
Here is my program. I want to replace one word with another in a text file using a regular expression, but I am not able to save the changed words back to the text file. Can anyone please help me save the file? Thank you in advance.
Given below is my code:
import re
with open(r"c:\Users\Desktop\hh.txt", "r+") as f:
    for i in f.readlines():
        content = re.sub("hai", "welcome", i)
        # after replacing, how can I save these words in the text file again?
A simple approach for small files is to do your reading and writing separately.
import re
path = 'hh.txt'
with open(path, "r") as f:
    oldlines = f.readlines()
newlines = []
for line in oldlines:
    newlines.append(re.sub("hai", "welcome", line))
with open(path, "w") as f:
    f.writelines(newlines)
If you're dealing with huge files, I suggest writing to a temporary file while reading from your input file, then deleting the original and renaming the temporary file to take its place.
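The temporary-file approach could look roughly like the sketch below. The function name replace_in_file and its parameters are made up for this example, and it assumes the file can be read with the default text encoding. It uses os.replace, which overwrites the destination, so the delete and the rename happen in one call.

```python
import os
import re
import tempfile


def replace_in_file(path, pattern, repl):
    """Rewrite `path` line by line, substituting `pattern` with `repl`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as dst, open(path, "r") as src:
            for line in src:  # only one line is held in memory at a time
                dst.write(re.sub(pattern, repl, line))
        # Swap the rewritten copy into place of the original.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        os.remove(tmp_path)
        raise
```

For the question above, the call would be replace_in_file('hh.txt', 'hai', 'welcome').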
I have a csv file that looks like this (obviously < anystring > means just that).
<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9
I am trying to extract the rows containing _UP<anystring> (rows 3 and 6 in this example), using a negative lookahead to exclude rows 1, 2, 4 and 5.
import re
import csv
search = re.compile(r'.*_UP(?!early|late)')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        if row[0] == search:
            output.append(row)
print(output)
>>>[]
whereas what I am after is
print (output)
[<anystring>tony_UP<anystring>_start,7,8,9, <anystring>jane_UP<anystring>_start,7,8,9]
The regex works when I test it on an online regex tester, but not in Python. Why?
Thanks for the comments: the search code now looks like
search = re.compile(r'^.*?_UP(?!early|late).*$')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        search.search(row[0])  # I think this needs an "if ... is True" but it won't accept a boolean here?
        output.append(row)
This now returns all rows (i.e. it filters nothing, whereas before it filtered everything).
You want to return a list of rows that contain _UP not followed by early or late.
The pattern should look like
search = re.compile(r'_UP(?!early|late)')
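You do not need any ^, .*, etc., because re.search looks for a pattern match anywhere inside a string.
Then, all you need is to test each row for a regex match. Since csv.reader yields every row as a list of fields, pass the first field, row[0], to the search:
if search.search(row[0]):
    output.append(row)
See the Python demo (here each row is a plain string, so it is passed to search directly):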
import re
csvfile="""<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9""".splitlines()
search = re.compile(r'_UP(?!early|late)')
output = []
for row in csvfile:
    if search.search(row):
        output.append(row)
print(output)
And the output is your expected list:
['<anystring>tony_UP<anystring>_start,7,8,9', '<anystring>jane_UP<anystring>_start,7,8,9']
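For completeness, here is a sketch of the same filter running through csv.reader, where each row is a list rather than a string. The io.StringIO object stands in for test.csv, and the xx_ prefixes and the "mid" value are invented stand-ins for the <anystring> parts.

```python
import csv
import io
import re

# Stand-in for test.csv; the "xx_" and "mid" parts are invented filler.
sample = io.StringIO(
    "xx_tony_UPearly_start,1,2,3\n"
    "xx_tony_UPlate_start,4,5,6\n"
    "xx_tony_UPmid_start,7,8,9\n"
)

search = re.compile(r'_UP(?!early|late)')

# search() returns a match object (truthy) or None (falsy), so its
# result can be used directly as the filter condition.
output = [row for row in csv.reader(sample) if row and search.search(row[0])]

print(output)
```

Note that comparing with == (as in the first attempt) tests a string against the compiled pattern object, which is never equal, and calling search() without an if (as in the second attempt) discards the result, which is why one version kept nothing and the other kept everything.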
The words of the "wordslist" and the text I'm searching are in Cyrillic. The text is coded in UTF-8 (as set in Notepad++). I need Python to match a word in the text and get everything after the word until a full-stop followed by new line.
EDIT
with open('C:\....txt', 'rb') as f:
    wordslist = []
    for line in f:
        wordslist.append(line)
    wordslist = map(str.strip, wordslist)
/EDIT
for i in wordslist:
    print i #so far, so good, I get Cyrillic
    wantedtext = re.findall(i+".*\.\r\n", open('C:\....txt', 'rb').read())
    wantedtext = str(wantedtext)
    print wantedtext
"Wantedtext" shows and saves as "\xd0\xb2" (etc.).
What I tried:
This question is different, because there is no variable involved:
Convert bytes to a python string. Also, the solution from the chosen answer
wantedtext.decode('utf-8')
didn't work, the result was the same. The solution from here didn't help either.
EDIT: Revised code, returning "[]".
with io.open('C:....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
for i in wordslist:
    print i
    with io.open('C:....txt', 'r', encoding='utf-8') as my_file:
        my_file_test = my_file.read()
        print my_file_test #works, prints cyrillic characters, but...
        wantedtext = re.findall(i+".*\.\r\n", my_file_test)
        wantedtext = str(wantedtext)
        print wantedtext #returns []
(Added after a comment below: This code works if you erase \r from the regular expression.)
Python 2.x only
Your find is probably not working because you're mixing byte strs and Unicode strs, or strs containing different encodings. If you don't know the difference between a Unicode str and a str, see: https://stackoverflow.com/a/35444608/1554386
Don't start decoding stuff unless you know what you're doing. It's not voodoo :)
You need to get all your text into Unicode objects first.
Split your read into a separate line - it's easier to read
Decode your text file. Use io.open(), which supports Python 3 style decoding. I'm going to assume your text file is UTF-8 (we'll soon find out if it's not):
with io.open('C:\....txt', 'r', encoding='utf-8') as my_file:
    my_file_test = my_file.read()
my_file_test is now a Unicode str.
Now you can do:
# finds lines beginning with i and ending in "."
regex = u'^{i}.*?\.$'.format(i=i)
wantedtext = re.findall(regex, my_file_test, re.M)
Look at wordslist. You don't say what you do with it but you need to make sure it's a Unicode str too. If you read from a file, use the same io.open from above.
Edit:
For wordslist, you can decode and read the file into a list while removing line feeds in one go:
with io.open('C:\....txt', 'r', encoding='utf-8') as f:
    wordslist = f.read().splitlines()
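For comparison, here is a rough Python 3 sketch of the same idea, where every str is already Unicode. The sample text and word list are invented, and re.escape is an addition of mine to guard against regex metacharacters in the words. It also shows why dropping \r from the pattern helped: text-mode reads translate "\r\n" to "\n", so the decoded text no longer contains "\r".

```python
import re

# Invented sample; in text mode, "\r\n" line endings arrive as "\n".
text = "кот сидит на окне.\nсобака спит\nи видит сны.\nкот ушёл.\n"
wordslist = ["кот", "собака"]

results = {}
for word in wordslist:
    # With re.M, ^ and $ anchor at every line; "." never crosses a newline.
    pattern = r'^{}.*?\.$'.format(re.escape(word))
    results[word] = re.findall(pattern, text, re.M)
    print(word, results[word])
```

Because "." does not match a newline, this only finds sentences that start with the word and end with a full stop on the same line; the "собака" line has no full stop, so it yields an empty list.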
from __future__ import division
import nltk
from nltk.corpus import wordnet as wn
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("inpsyn.txt")
data = fp.read()
#to tokenize input text into sentences
print '\n-----\n'.join(tokenizer.tokenize(data))# splits text into sentences
#to tokenize the tokenized sentences into words
tokens = nltk.wordpunct_tokenize(data)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
print words #to print the tokens
for a in words:
    print a
syns = wn.synsets(a)
print "synsets:", syns
for s in syns:
    for l in s.lemmas:
        print l.name
    print s.definition
    print s.examples
I could not find any code related to my question. If there is any, please send me the link.
This code does not find synonyms for the words in a given text file or sentence.
It's a simple coding error: look at where you define a in the loop (for a in words), then at where you call syns = wn.synsets(a). That call sits outside the loop, so it runs only once, with a holding just the last word. What you want is to include all your synsets code within the for a in words loop. Here is what you want altogether:
...
words = [w.lower() for w in nltk.wordpunct_tokenize(data)] # other lines in your code are just excessive
for a in words:
    syns = wn.synsets(a)
    print "synsets:", syns
    for s in syns:
        for l in s.lemmas:
            print l.name
        print s.definition
        print s.examples
That's a bit of a silly mistake. Also, please learn some cleaner coding: the current code is very untidy and painful to look at.
I am trying to write a Python script to analyze a text data file. I want the script to find all the time values in one line and compare them.
This is my first time writing regular expressions, so I started with a small script.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
The output is always None.
My text file looks like the picture:
What's the problem? Please help me. Thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to find the pattern anywhere in the text, use search instead of match; match only succeeds when the pattern starts at the very beginning of the string. Second, to access the matched text you need to call group() on the match object returned by search.
Try it:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    if match:  # search returns None for lines without a time
        print match.group()
#print a
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall instead on the entire text. Like this:
import re
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile(r'\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
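Putting the pieces together for the original goal (find all the times in one line and compare them), a Python 3 sketch might look like this. The sample lines are hypothetical, since the real layout of 1.txt is not shown, and times_in_line is a helper name made up for the example.

```python
import re
from datetime import datetime

TIME_RE = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical sample lines; the real contents of 1.txt are not shown.
lines = [
    "start 08:15:30 end 09:45:00",
    "no timestamps here",
    "start 23:10:05 end 22:00:00",
]


def times_in_line(line):
    """Return every HH:MM:SS value in the line as a datetime.time."""
    # strptime raises ValueError for impossible values such as 99:99:99.
    return [datetime.strptime(t, "%H:%M:%S").time() for t in TIME_RE.findall(line)]


# match() only looks at the start of the string; search() scans all of it.
print(TIME_RE.match(lines[0]))           # the line starts with "start", so no match
print(TIME_RE.search(lines[0]).group())  # first time found anywhere in the line

for line in lines:
    times = times_in_line(line)
    if len(times) >= 2:
        order = "increasing" if times[0] < times[1] else "not increasing"
        print(line, "->", order)
```

Converting the matched strings to datetime.time objects makes the comparison explicit, although zero-padded HH:MM:SS strings would also compare correctly as plain text.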