I'm trying to import the first 6 lines of a .txt file into a collection of objects:
from Header import Header

class HeaderReader(list):
    def __init__(self):
        super(HeaderReader, self).__init__()

    @staticmethod
    def headerCreator(filePath):
        with open(filePath, 'r') as file:
            headerHolder = HeaderReader()
            for line in file:
                splittedEls = line.split('\n')
                if len(splittedEls) != 6:
                    continue
                header = Header(
                    splittedEls[0],
                    splittedEls[1],
                    splittedEls[2],
                    splittedEls[3],
                    splittedEls[4],
                    splittedEls[5]
                )
                headerHolder.append(header)
            return headerHolder
Header is an object with 6 attributes. The first 6 lines of the .txt file are separated by newlines (\n); after that there is other content. I think the problem is probably in the line if len(splittedEls) != 6: but I'm not sure whether it is, or how to resolve it. It might also be an error to define a class, give it an __init__, and then call a staticmethod. When I pass it a .txt file, it returns an empty list. Any ideas?
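A likely cause of the empty list: iterating over a file yields one line at a time, so line.split('\n') returns at most two pieces (the text and an empty string), never six, and every line is skipped by the continue. The following is only a minimal sketch of one way to read the first six lines instead, using itertools.islice. The Header namedtuple and its field names are hypothetical stand-ins for the real Header class, and header_creator is an illustrative name.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for the asker's Header class (six attributes).
Header = namedtuple("Header", "field1 field2 field3 field4 field5 field6")


class HeaderReader(list):
    @staticmethod
    def header_creator(file_path):
        """Build one Header from the first six lines of file_path."""
        holder = HeaderReader()
        with open(file_path, "r") as file:
            # islice stops after six lines; the rest of the file is not read.
            first_six = [line.rstrip("\n") for line in islice(file, 6)]
        if len(first_six) == 6:  # the file may be shorter than six lines
            holder.append(Header(*first_six))
        return holder
```

Calling a staticmethod on a class that also defines __init__ is not a problem in itself; the method simply creates and returns its own HeaderReader instance.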
I am trying to modify the code to apply to multiple text files in the same directory. The code looks as follows but there is an error "NameError: name 'output' is not defined". Can you help me to suggest improvements to the code?
import re

def replaceenglishwords(filename):
    mark_pattern = re.compile("\\*CHI:.*")
    word_pattern = re.compile("([A-Za-z]+)")
    for line in filename:
        # Split into possible words
        parts = line.split()
        if mark_pattern.match(parts[0]) is None:
            output.write()
            continue
        # Got a CHI line
        new_line = line
        for word in parts[1:]:
            matches = word_pattern.match(word)
            if matches:
                old = f"\\b{word}\\b"
                new = f"{matches.group(1)}#s:eng"
                new_line = re.sub(old, new, new_line, count=1)
        output.write(new_line)

import glob

for file in glob.glob('*.txt'):
    outfile = open(file.replace('.txt', '-out.txt'), 'w', encoding='utf8')
    for line in open(file, encoding='utf8'):
        print(replaceenglishwords(line), '\n', end='', file=outfile)
    outfile.close()
replaceenglishwords needs two parameters: one for the file you are searching and one for the file where you write your results, i.e. replaceenglishwords(filename, output). Your function already reads the input file line by line by itself, so the caller should not loop over the lines as well.
Now you can open both files in your loop and pass them to replaceenglishwords:
for file in glob.glob('*.txt'):
    textfile = open(file, encoding='utf8')
    outfile = open(file.replace('.txt', '-out.txt'), 'w', encoding='utf8')
    replaceenglishwords(textfile, outfile)
    textfile.close()
    outfile.close()
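For reference, here is a self-contained sketch of how the two-parameter version could look end to end. It is an illustration under assumptions rather than the asker's exact logic: the tagging rule is simplified to "append #s:eng to every run of ASCII letters after the *CHI: marker", and process_directory is a hypothetical helper name. The with statement closes both files even if an error occurs.

```python
import glob
import os
import re

# Simplified tagging rule (an assumption): whole words made of ASCII letters.
WORD_PATTERN = re.compile(r"\b([A-Za-z]+)\b")
MARKER = "*CHI:"


def replaceenglishwords(textfile, output):
    """Copy textfile to output, tagging the words of every *CHI: line."""
    for line in textfile:
        if not line.startswith(MARKER):
            output.write(line)  # other lines are copied unchanged
            continue
        rest = line[len(MARKER):]
        output.write(MARKER + WORD_PATTERN.sub(r"\1#s:eng", rest))


def process_directory(pattern="*.txt"):
    """Write a '-out' copy next to every file matching pattern."""
    for name in glob.glob(pattern):
        root, ext = os.path.splitext(name)
        if root.endswith("-out"):
            continue  # do not re-process results from an earlier run
        with open(name, encoding="utf8") as textfile, \
                open(root + "-out" + ext, "w", encoding="utf8") as outfile:
            replaceenglishwords(textfile, outfile)
```

Because the output files also match *.txt, the sketch skips names ending in -out so that a second run does not tag its own results.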
I have to monitor an XML file that is written by a tool running all day. However, the XML file is properly completed and closed only at the end of the day.
Same constraints as XML stream processing:
Parse an incomplete XML file on-the-fly and trigger actions
Keep track of the last position within the file to avoid processing it again from the beginning
In an answer to "Need to read XML files as a stream using BeautifulSoup in Python", slezica suggests xml.sax, xml.etree.ElementTree and cElementTree, but my attempts to use xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see support for "on-the-fly parsing" in them.
I need more obvious examples...
I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x, so please also provide tips on new Python 3.x features. I also use watchdog to detect XML file modifications, so optionally reuse the watchdog mechanism. Optionally, also support Windows.
Please provide easy to understand/maintain solutions. If it is too complex, I may just use tell()/seek() to move within the file, use stupid text search in the raw XML and finally extract the values using basic regex.
XML sample:
<dfxml xmloutputversion='1.0'>
  <creator version='1.0'>
    <program>TCPFLOW</program>
    <version>1.4.6</version>
  </creator>
  <configuration>
    <fileobject>
      <filename>file1</filename>
      <filesize>288</filesize>
      <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
    </fileobject>
    <fileobject>
      <filename>file2</filename>
      <filesize>352</filesize>
      <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
    </fileobject>
    <fileobject>
      <filename>file3</filename>
      <filesize>456</filesize>
      ...
...
First test using SAX failed:
import xml.sax

class StreamHandler(xml.sax.handler.ContentHandler):

    def startElement(self, name, attrs):
        print 'start: name=', name

    def endElement(self, name):
        print 'end: name=', name
        if name == 'root':
            raise StopIteration

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)
Shell:
$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end: name= program
start: name= version
end: name= version
Traceback (most recent call last):
  File "./test-using-sax.py", line 17, in <module>
    parser.parse(f)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
    self.feed("", isFinal = 1)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.
This example is similar to the other one but uses xml.etree.ElementTree (and watchdog).
It does not work when ElementTree is replaced by cElementTree :-/
import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()

        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node

        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate the on-the-fly writing using this one-line script:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:
from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)
My output:
end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
Three hours after posting my question, I had received no answer, but I have finally implemented the simple example I was looking for.
It is inspired by saaj's answer and is based on xml.sax and watchdog.
from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):

    def startElement(self, tag, attributes):
        print(tag, 'attributes=', attributes.items())
        self.tag = tag

    def characters(self, content):
        print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):

    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.file = None
        self.parser = xml.sax.make_parser()
        self.parser.setContentHandler(XmlStreamHandler())

    def on_modified(self, event):
        if not self.file:
            self.file = open(event.src_path)
        self.parser.feed(self.file.read())

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate the on-the-fly writing using the following command:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
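Regarding the request for Python 3 tips: since Python 3.4 the standard library provides xml.etree.ElementTree.XMLPullParser, which accepts data in chunks through feed() and never requires the document to be complete. The sketch below combines it with a remembered file offset, so each call only reads what was appended since the previous one. The class name XmlTail, the poll method and the choice to return (filename, filesize) tuples are assumptions made for illustration, based on the XML sample in the question.

```python
import xml.etree.ElementTree as ET


class XmlTail:
    """Feed newly appended bytes of a growing XML file to a pull parser."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # file position reached by the previous poll
        self.parser = ET.XMLPullParser(events=("end",))

    def poll(self):
        """Return (filename, filesize) for each <fileobject> closed since the last call."""
        # Binary mode: the parser copes with a multi-byte character split across reads.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        if data:
            self.parser.feed(data)
        completed = []
        for _event, elem in self.parser.read_events():
            if elem.tag == "fileobject":
                completed.append((elem.findtext("filename"),
                                  int(elem.findtext("filesize"))))
                elem.clear()  # drop the children that are no longer needed
        return completed
```

With watchdog, on_modified could simply call poll() on an XmlTail instance created for event.src_path. Nothing here is specific to Linux, so the same sketch should also work on Windows.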
I need to list all files in a directory, save the names in a list, and then read those files one by one.
I don't want to use an external module like the glob module, so I am trying two different approaches:
First approach:
import os

file_list = os.listdir("jsons")
for files in file_list:
    data = open(files,"r")
output:
['A03DUrQz1BM9SQ2.json', 'A04D5V1u1BMxaV6.json', 'A0kxiHL81AN9pH5.json', 'A1Fxs5Ag1A8vuB5.json', 'A2Dsv7RE1BDqYt5.json', 'A2HkZPkn1BpvvG5.json']
The issue here is that the filenames are stored as strings, and I am not able to open the files because the names are read with quotes ('').
2nd approach:
file_list = os.system("ls jsons/")
print file_list.split()
for files in file_list:
    data = open(files,"r")
    print data
output:
Traceback (most recent call last):
  File "asn-1_q3.py", line 9, in <module>
    print file_list.split()
AttributeError: 'int' object has no attribute 'split'
Here, the result is stored as an int, so it cannot be split.
How should I solve these problems?
You need to read your file object and os.path.join the file name with the original directory name (or it will look for the files in the current directory):
import os
import os.path

file_list = os.listdir("jsons")
for file_name in file_list:
    with open(os.path.join("jsons", file_name), "r") as src_file:
        data = src_file.read()
    print(data)
Here's an example that uses generators to limit the amount of data in memory (vs loading all the data into an array):
import os
import os.path

def all_file_content(directory_name):
    file_list = os.listdir(directory_name)
    for file_name in file_list:
        with open(os.path.join(directory_name, file_name), "r") as src_file:
            yield src_file.read()

for file_content in all_file_content("jsons"):
    print(file_content)
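Two side notes on the question itself. The quotes in the printed list are only how Python displays strings; they are not part of the names. And os.system returns an integer status code rather than the command's output, which is why the second approach fails on split. As a small sketch of the "save into a list first" step, the helper below (list_files is a hypothetical name) collects full paths of regular files so that each one can be opened directly afterwards:

```python
import os


def list_files(directory_name):
    """Return the full paths of the regular files in directory_name, sorted by name."""
    paths = []
    for name in sorted(os.listdir(directory_name)):
        path = os.path.join(directory_name, name)
        if os.path.isfile(path):  # skip sub-directories
            paths.append(path)
    return paths
```

Each entry of the returned list already contains the directory, so open(path) works regardless of the current working directory.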
My code currently takes in a CSV file and outputs to a text file. The piece of code below, which I am having trouble with, searches the CSV for a keyword such as issues; every row that contains that word should be written to a text file. Currently I have it printing to a JSON file, but everything ends up on one line, like this:
"something,something1,something2,something3,something4,something5,something6,something7\r\n""something,something1,something2,something3,something4,something5,something6,something7\r\n"
But I want it to print out like this:
"something,something1,something2,something3,something4,something5,something6,something7"
"something,something1,something2,something3,something4,something5,something6,something7"
Here is the code I have so far:
def search(self, filename):
    with open(filename, 'rb') as searchfile, open("weekly_test.txt", 'w') as text_file:
        for line in searchfile:
            if 'PBI 43125' in line:
                #print (line)
                json.dump(line, text_file, sort_keys=True, indent = 4)
So again I just need a little guidance on how to get my json file to be formatted the way I want.
Just replace print line with print >>file, line (this is Python 2 syntax):
def search(self, filename):
    with open('test.csv', 'r') as searchfile, open('weekly_test.txt', 'w') as search_results_file:
        for line in searchfile:
            if 'issue' in line:
                print >>search_results_file, line
    # At this point, both the files will be closed automatically
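The print >>file form only exists in Python 2. Below is a hedged Python 3 sketch of the same idea that also produces the one-quoted-row-per-line layout shown in the question. It assumes the quotes are wanted in the output, so each matching row is passed through json.dumps after its line ending has been stripped; the keyword and out_path parameters and the returned count are additions for illustration.

```python
import json


def search(filename, keyword, out_path):
    """Write every line of filename containing keyword to out_path, one quoted row per line."""
    count = 0
    with open(filename, "r") as searchfile, open(out_path, "w") as results:
        for line in searchfile:
            if keyword in line:
                # Strip the trailing \r\n so it is not embedded in the quoted string.
                results.write(json.dumps(line.rstrip("\r\n")) + "\n")
                count += 1
    return count
```

The original output ran together because json.dump writes no separator between successive calls, and the \r\n of each row was escaped inside the quoted string instead of ending the line.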
I have 5 files in a folder App:
App|
|--A.txt
|--B.txt
|--C.txt
|--D.txt
|--E.txt
|--Run.py
|--Other Folders or Files
Now I want to know whether the files (A.txt, B.txt, C.txt, D.txt, E.txt) are present, and if they are, I want to call a function Cleaner, passing the names of these files to it. I have written this code, but nothing is happening: the function is not getting called.
import glob
import csv
import itertools

files = glob.glob("*.txt")
i = 0

def sublist(a, b):
    seq = iter(b)
    try:
        for x in a:
            while next(seq) != x: pass
        else:
            return True
    except StopIteration:
        pass
    return False

required_files = ['Alternate_ADR6_LFB1.txt', 'Company_Code.txt', 'Left_LIFNR.txt', 'LFA1.txt', 'LFB1.TXT', 'LFBK.TXT']
if sublist(required_files,files):
    for files in required_files:
        try:
            f = open(files , 'r')
            f.close()
        except IOError as e:
            print 'Error opening or accessing files'
    i = 1
else:
    print 'Required files are not in correct folder'

if i == 1:
    for files in required_files:
        Cleansing(files)

def Cleansing(filename):
    with open('filename', 'rb') as f_input:
        ...
        ...
        break
    with open('filename', 'rb') as f_input, open('filename_Cleaned.csv', 'wb') as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow('something')
Update
I think I am now able to call the function and check for the valid files, but it is not very Pythonic. I am also not able to open or create a file named after the input file plus _cleaned: filename_cleaned.csv.
You want to check if a list of files (required_files) are in a folder.
You successfully get the complete list of text files in the folder with files = glob.glob("*.txt")
So the first question is: Checking for sublist in list
As the order is not important, we can use sets:
if set(required_files) <= set(files):
    # do stuff
    pass
else:
    # print warning
    pass
Next question: How to open the files and create outputs with names like "filename_Cleaned.csv"
A very important thing you have to understand: "filename" is not the same thing as filename. The first is a string, it will always be the same thing, it will not be replaced by real filenames. When writing open('filename', 'rb') you're trying to open a file called "filename".
filename however can be a variable name and take different values.
def Cleansing(filename):
    with open(filename, 'rb') as f_input, open(filename+'_Cleaned.csv', 'wb') as f_output:
        # read stuff from f_input
        # write stuff to f_output
        pass

# Call the function only after it has been defined
for filename in required_files:
    Cleansing(filename)
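Putting both parts together, here is a self-contained Python 3 sketch. Note that filename+'_Cleaned.csv' produces names like LFA1.txt_Cleaned.csv; the sketch uses os.path.splitext to drop the old extension first. The function names, the folder parameter and the cleaning step itself (stripping whitespace around each field) are placeholders chosen for illustration, not the asker's real logic.

```python
import csv
import os


def cleaned_name(filename):
    """Return e.g. 'LFA1_Cleaned.csv' for 'LFA1.txt'."""
    root, _ext = os.path.splitext(filename)
    return root + "_Cleaned.csv"


def missing_files(required_files, folder="."):
    """Return the required files that are not present in folder, sorted."""
    present = set(os.listdir(folder))
    return sorted(set(required_files) - present)


def cleanse(filename, folder="."):
    """Copy filename to its _Cleaned.csv counterpart, stripping whitespace around each field."""
    src = os.path.join(folder, filename)
    dst = os.path.join(folder, cleaned_name(filename))
    # newline="" is what the csv module expects for both reading and writing.
    with open(src, newline="") as f_input, open(dst, "w", newline="") as f_output:
        writer = csv.writer(f_output)
        for row in csv.reader(f_input):
            writer.writerow([field.strip() for field in row])
    return dst
```

A caller would first check that missing_files(required_files) is empty and only then loop over required_files calling cleanse on each name. Returning the missing names also makes the warning message more useful than a bare True or False.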