I have to monitor an XML file that is written by a tool running all day long. But the XML file is properly completed and closed only at the end of the day.
Same constraints as XML stream processing:
Parse an incomplete XML file on-the-fly and trigger actions
Keep track of the last position within the file to avoid processing it again from the beginning
In an answer to Need to read XML files as a stream using BeautifulSoup in Python, slezica suggests xml.sax, xml.etree.ElementTree and cElementTree. But my attempts to use xml.etree.ElementTree and cElementTree were unsuccessful. There are also xml.dom, xml.parsers.expat and lxml, but I do not see any support for on-the-fly parsing.
I need more obvious examples...
I am currently using Python 2.7 on Linux, but I will migrate to Python 3.x, so please also provide tips on relevant new Python 3.x features. I also use watchdog to detect XML file modifications, so feel free to reuse the watchdog mechanism. Windows support is optional.
Please provide solutions that are easy to understand and maintain. If it is too complex, I may just use tell()/seek() to move within the file, search the raw XML as plain text, and extract the values with basic regexes.
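For reference, this is roughly what that fallback would look like (a rough sketch only; the file name and the tag regex are placeholders, the file is assumed to already exist, and a tag split across two reads would be missed):
from __future__ import print_function
import re
import time

FILENAME_RE = re.compile(r'<filename>([^<]*)</filename>')

def poll(path, offset):
    with open(path) as f:
        f.seek(offset)                      # resume where the previous poll stopped
        chunk = f.read()                    # read only the newly appended bytes
        for match in FILENAME_RE.finditer(chunk):
            print(match.group(1))
        return f.tell()                     # remember the position for the next poll

offset = 0
while True:
    offset = poll('out.xml', offset)
    time.sleep(1)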
XML sample:
<dfxml xmloutputversion='1.0'>
  <creator version='1.0'>
    <program>TCPFLOW</program>
    <version>1.4.6</version>
  </creator>
  <configuration>
  <fileobject>
    <filename>file1</filename>
    <filesize>288</filesize>
    <tcpflow packets='12' srcport='1111' dstport='2222' family='2' />
  </fileobject>
  <fileobject>
    <filename>file2</filename>
    <filesize>352</filesize>
    <tcpflow packets='12' srcport='3333' dstport='4444' family='2' />
  </fileobject>
  <fileobject>
    <filename>file3</filename>
    <filesize>456</filesize>
    ...
...
First test using SAX failed:
import xml.sax

class StreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, name, attrs):
        print 'start: name=', name
    def endElement(self, name):
        print 'end: name=', name
        if name == 'root':
            raise StopIteration

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())
    with open('f.xml') as f:
        parser.parse(f)
Shell:
$ while read line; do echo $line; sleep 1; done <i.xml >f.xml &
...
$ ./test-using-sax.py
start: name= dfxml
start: name= creator
start: name= program
end: name= program
start: name= version
end: name= version
Traceback (most recent call last):
  File "./test-using-sax.py", line 17, in <module>
    parser.parse(f)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/xml/sax/xmlreader.py", line 125, in parse
    self.close()
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 220, in close
    self.feed("", isFinal = 1)
  File "/usr/lib64/python2.7/xml/sax/expatreader.py", line 214, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib64/python2.7/xml/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: report.xml:15:0: no element found
Since yesterday, I have found Peter Gibson's answer about the undocumented xml.etree.ElementTree.XMLTreeBuilder._parser.EndElementHandler.
This example is similar to the previous one, but uses xml.etree.ElementTree (and watchdog).
It does not work when ElementTree is replaced by cElementTree :-/
import time
import watchdog.events
import watchdog.observers
import xml.etree.ElementTree

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.xml_file = None
        self.parser = xml.etree.ElementTree.XMLTreeBuilder()
        def end_tag_event(tag):
            node = self.parser._end(tag)
            print 'tag=', tag, 'node=', node
        self.parser._parser.EndElementHandler = end_tag_event

    def on_modified(self, event):
        if not self.xml_file:
            self.xml_file = open(event.src_path)
        buffer = self.xml_file.read()
        if buffer:
            self.parser.feed(buffer)

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate the on-the-fly writing using this one-line script:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
For information, xml.etree.ElementTree.iterparse does not seem to support a file that is still being written. My test code:
from __future__ import print_function, division
import xml.etree.ElementTree

if __name__ == '__main__':
    context = xml.etree.ElementTree.iterparse('f.xml', events=('end',))
    for action, elem in context:
        print(action, elem.tag)
My output:
end program
end version
end creator
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
end tcpflow
end fileobject
end filename
end filesize
Traceback (most recent call last):
  File "./iter.py", line 9, in <module>
    for action, elem in context:
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1281, in next
    self._root = self._parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 20, column 0
Three hours after posting my question, I had received no answer. But I have finally implemented the simple example I was looking for.
My inspiration comes from saaj's answer, and it is based on xml.sax and watchdog.
from __future__ import print_function, division
import time
import watchdog.events
import watchdog.observers
import xml.sax

class XmlStreamHandler(xml.sax.handler.ContentHandler):
    def startElement(self, tag, attributes):
        print(tag, 'attributes=', attributes.items())
        self.tag = tag
    def characters(self, content):
        print(self.tag, 'content=', content)

class XmlFileEventHandler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.xml'])
        self.file = None
        self.parser = xml.sax.make_parser()
        self.parser.setContentHandler(XmlStreamHandler())

    def on_modified(self, event):
        if not self.file:
            self.file = open(event.src_path)
        self.parser.feed(self.file.read())

if __name__ == '__main__':
    observer = watchdog.observers.Observer()
    event_handler = XmlFileEventHandler()
    observer.schedule(event_handler, path='.')
    try:
        observer.start()
        while True:
            time.sleep(10)
    finally:
        observer.stop()
        observer.join()
While the script is running, do not forget to touch an XML file, or simulate the on-the-fly writing using the following command:
while read line; do echo $line; sleep 1; done <in.xml >out.xml &
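As a Python 3.x tip: since Python 3.4, xml.etree.ElementTree.XMLPullParser is the documented way to do this kind of incremental parsing, so the undocumented XMLTreeBuilder internals are no longer needed. A minimal sketch (not wired into watchdog; feed it whatever bytes were appended since the last read):
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=('start', 'end'))

def feed_chunk(chunk):
    # Feed the newly appended data, then drain the events that became available
    parser.feed(chunk)
    for event, elem in parser.read_events():
        print(event, elem.tag)

feed_chunk("<dfxml xmloutputversion='1.0'><creator>")
feed_chunk("<program>TCPFLOW</program></creator>")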
Related
I'm still learning Python, so I apologize up front. I am trying to write a watcher that watches for the creation of files matching a specific pattern. I am using watchdog with some code that is out there to get it to watch a directory.
I don't quite understand how to have the watcher watch for a filename that matches a pattern. I've edited the regexes=[] field with what I thought might work, but I have not had luck.
Here is an example of a file that might come in: Hires For Monthly TM_06Jan_0946.CSV
I can get the watcher to tell me when a .CSV is created in the directory; I just can't get it to only tell me when, essentially, Hires.*\.zip is created.
I've checked this link but have not had luck:
How to run the RegexMatchingEventHandler of Watchdog correctly?
Here is my code:
import time
import watchdog.events
from watchdog.observers import Observer
from watchdog.events import RegexMatchingEventHandler
import re
import watchdog.observers
import time
import fnmatch
import os

class Handler(watchdog.events.RegexMatchingEventHandler):
    def __init__(self):
        watchdog.events.RegexMatchingEventHandler.__init__(
            self, regexes=['.*'], ignore_regexes=[],
            ignore_directories=False, case_sensitive=False)
    def on_created(self, event):
        print("Watchdog received created event - % s." % event.src_path)
    def on_modified(self, event):
        print("Watchdog received modified event - % s." % event.src_path)

if __name__ == "__main__":
    src_path = r"C:\Users\Downloads"
    event_handler = Handler()
    observer = watchdog.observers.Observer()
    observer.schedule(event_handler, path=src_path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
The reason why the '.*' regex works while the Hires.*\.zip regex doesn't is that watchdog tries to match the regex against the full path. In regex terms, r'Hires.*\.zip' will not give you a full match on "C:\Users\Downloads\Hires blah.zip".
Instead, try:
def __init__(self):
    watchdog.events.RegexMatchingEventHandler.__init__(
        self, regexes=[r'^.*?\.csv$', r'.*?Hires.*?\.zip'],
        ignore_regexes=[], ignore_directories=False, case_sensitive=False)
This will match all the csv files and the zip files whose names start with "Hires". The '.*?' matches "C:\Users\Downloads\", and the rest pins down the file type, file name, etc.
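A quick standalone check of that difference, independent of watchdog (the path is just an example):
import re

path = r"C:\Users\Downloads\Hires For Monthly TM_06Jan_0946.zip"

# match() anchors at the start of the full path, so the bare pattern fails...
print(re.match(r'Hires.*\.zip', path))        # None
# ...while a leading '.*?' lets the pattern reach the file name part
print(re.match(r'.*?Hires.*?\.zip', path))    # a match object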
I am using grpc with protobuf 2.6.1 in Python 2.7, and when I run my client-side code, I get the following errors:
Traceback (most recent call last):
  File "debate_client.py", line 31, in <module>
    run_client()
  File "debate_client.py", line 17, in run_client
    reply = stub.Answer(debate_pb2.AnswerRequest(question=question, timeout=timeout), 30)
  File "/Users/elaine/Desktop/gitHub/grpc/python2.7_virtual_environment/lib/python2.7/site-packages/grpc/framework/crust/implementations.py", line 73, in __call__
    protocol_options, metadata, request)
  File "/Users/elaine/Desktop/gitHub/grpc/python2.7_virtual_environment/lib/python2.7/site-packages/grpc/framework/crust/_calls.py", line 109, in blocking_unary_unary
    return next(rendezvous)
  File "/Users/elaine/Desktop/gitHub/grpc/python2.7_virtual_environment/lib/python2.7/site-packages/grpc/framework/crust/_control.py", line 412, in next
    raise self._termination.abortion_error
grpc.framework.interfaces.face.face.RemoteError: RemoteError(code=StatusCode.UNKNOWN, details="")
Here is my client-side code:
from grpc.beta import implementations
import debate_pb2
import sys

def run_client():
    params = sys.argv
    print params
    how = params[1]
    question = params[2]
    channel = implementations.insecure_channel('localhost', 29999)
    stub = debate_pb2.beta_create_Candidate_stub(channel)
    if how.lower() == "answer":
        timeout = int(params[3])
        reply = stub.Answer(debate_pb2.AnswerRequest(question=question, timeout=timeout), 30)
    elif how.lower() == "elaborate":
        blah = params[3:len(sys.argv)]
        for i in range(0, len(blah)):
            blah[i] = int(blah[i])
        reply = stub.Elaborate(debate_pb2.ElaborateRequest(topic=question, blah_run=blah), 30)
    if reply is None:
        print "No comment"
    else:
        print reply.answer

if __name__ == "__main__":
    run_client()
And here is my server-side code:
import debate_pb2
import consultation_pb2
import re
import random
from grpc.beta import implementations

class Debate(debate_pb2.BetaCandidateServicer):
    def Answer(self, request, context=None):
        # Answer implementation
        pass
    def Elaborate(self, request, context=None):
        # Elaborate implementation
        pass

def run_server():
    server = debate_pb2.beta_create_Candidate_server(Debate())
    server.add_insecure_port('localhost:29999')
    server.start()

if __name__ == "__main__":
    run_server()
Any idea where the remote error comes from? Thank you so much!
Hello Elaine and thank you for trying out gRPC Python.
Nothing leaps out at me as an obvious smoking gun, but a couple of things I see are:
gRPC Python isn't known to work with protobuf 2.6.1. Have you tried working with the very latest protobuf release (3.0.0a3 at this time)?
context isn't an optional keyword parameter in servicer methods; it's a required, positional parameter. Does dropping =None from your servicer method implementations effect any change?
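In other words, the suggestion is to declare the servicer methods like this (a sketch with the method bodies elided):
class Debate(debate_pb2.BetaCandidateServicer):
    # context is a required positional parameter, not an optional keyword
    def Answer(self, request, context):
        # Answer implementation
        pass
    def Elaborate(self, request, context):
        # Elaborate implementation
        pass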
The same happened to me just now, and I figured out why: make sure the messages in your proto definition and the messages in your implementation match.
For example, my Response message took a message= param in my Python server, but that field was not in my proto definition.
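To illustrate with hypothetical names: if the .proto declares a reply message whose only field is answer, but the server constructs it with a different field name, the failure happens server-side during serialization, and the client only sees StatusCode.UNKNOWN:
# AnswerReply and its fields are hypothetical names for illustration
reply = debate_pb2.AnswerReply(message="hello")  # raises on the server: no field 'message'
reply = debate_pb2.AnswerReply(answer="hello")   # matches the .proto, so this works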
I think your function implementations should be outside class Debate, or maybe your functions are not correctly implemented to give the desired result.
I faced a similar error because my functions were inside the class; moving them outside the class fixed it.
I am following an example to read a config file from the following: https://wiki.python.org/moin/ConfigParserExamples
But I get a KeyError and can't figure out why. It is reading the file, and I can even print the sections. I think I am doing something really stupid. Any help is greatly appreciated.
Here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import ConfigParser
import logging

config_default = ConfigParser.ConfigParser()

class Setting(object):
    def get_setting(self, section, my_setting):
        default_setting = self.default_section_map(section)[my_setting]
        return default_setting

    def default_section_map(self, section):
        dict_default = {}
        config_default.read('setting.cfg')
        sec = config_default.sections()
        options_default = config_default.options(section)
        logging.info('options_default: {0}'.format(options_default))
        for option in options_default:
            try:
                dict_default[option] = config_default.get(section, option)
                if dict_default[option] == -1:
                    print("skip: %s" % option)
            except:
                print("exception on %s!" % option)
                dict_default[option] = None
        return dict_default
        return complete_path

if __name__ == '__main__':
    conf = Setting()
    host = conf.get_setting('mainstuff', 'name')
    #host = conf.setting
    print 'host setting is :' + host
My config file is named setting.cfg and looks like this:
[mainstuff]
name = test1
domain = test2
[othersection]
database_ismaster = no
database_master = test3
database_default = test4
[mysql]
port = 311111
user = someuser
passwd = somecrazylongpassword
[api]
port = 1111
And the error is this:
exception on domain!
Traceback (most recent call last):
  File "./t.py", line 51, in <module>
    host=conf.get_setting('mainstuff','name')
  File "./t.py", line 14, in get_setting
    default_setting = self.default_section_map(section)[my_setting]
KeyError: 'name'
Be sure that your complete file path is setting.cfg. If you put your file in another folder, or if it is named differently, Python will report the same KeyError.
You have no [general] section. In order to get the hostname, you need something like
[general]
hostname = 'hostname.net'
in your setting.cfg. Now your config file matches the program -- or maybe you would prefer to adapt your program to match the config file? ...this should get you started at least.
UPDATE:
As my answer is useless now, here is something you could try to build on (assuming it works for you...):
import ConfigParser

class Setting(object):
    def __init__(self, cfg_path):
        self.cfg = ConfigParser.ConfigParser()
        self.cfg.read(cfg_path)

    def get_setting(self, section, my_setting):
        try:
            ret = self.cfg.get(section, my_setting)
        except ConfigParser.NoOptionError:
            ret = None
        return ret

if __name__ == '__main__':
    conf = Setting('setting.cfg')
    host = conf.get_setting('mainstuff', 'name')
    print 'host setting is :', host
This error occurs mainly for two reasons:
1. The config file could not be read because the path is not right. An absolute path may be used. Try reading the config file first to check for any issue:
f = open("config.ini", "r")
print(f.read())
2. The mentioned section could not be found in the config file.
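A minimal sketch covering both checks (read() returns the list of files it actually managed to parse):
import ConfigParser  # configparser in Python 3

config = ConfigParser.ConfigParser()
parsed = config.read('setting.cfg')
print parsed             # an empty list means the file was not found or not readable
print config.sections()  # confirm that the section you query is really there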
I'm trying to run the following script with about 3000 items. The script takes the link provided by self.book and returns the result using getPage. It loops through each item in self.book until there are no more items in the dictionary.
Here's the script:
from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.web.error import Error
from twisted.internet.defer import DeferredList
import logging
from src.utilitybelt import Utility

class getPages(object):
    """ Return contents from HTTP pages """

    def __init__(self, book, logger=False):
        self.book = book
        self.data = {}
        util = Utility()
        if logger:
            log = util.enable_log("crawler")

    def start(self):
        """ get each page """
        for key in self.book.keys():
            page = self.book[key]
            logging.info(page)
            d1 = getPage(page)
            d1.addCallback(self.pageCallback, key)
            d1.addErrback(self.errorHandler, key)
            dl = DeferredList([d1])
            # This should stop the reactor
            dl.addCallback(self.listCallback)

    def errorHandler(self, result, key):
        # Bad thingy!
        logging.error(result)
        self.data[key] = False
        logging.info("Appended False at %d" % len(self.data))

    def pageCallback(self, result, key):
        ########### I added this, to hold the data:
        self.data[key] = result
        logging.info("Data appended")
        return result

    def listCallback(self, result):
        #print result
        # Added for effect:
        if reactor.running:
            reactor.stop()
        logging.info("Reactor stopped")
About halfway through, I experience this error:
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 303, in _handleSignals
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 205, in __init__
File "/usr/lib64/python2.7/site-packages/twisted/internet/posixbase.py", line 138, in __init__
exceptions.OSError: [Errno 24] Too many open files
libgcc_s.so.1 must be installed for pthread_cancel to work
libgcc_s.so.1 must be installed for pthread_cancel to work
As of right now, I'll try to run the script with fewer items to see if that resolves the issue. However, there must be a better way to do it, and I'd really like to learn.
Thank you for your time.
It looks like you are hitting the open file descriptors limit (ulimit -n) which is likely to be 1024.
Each new getPage call opens a new file descriptor, which maps to the client TCP socket opened for the HTTP request. You might want to limit the number of getPage calls you run concurrently. Another way around it is to raise the file descriptor limit for your process, but then you might still exhaust ports or FDs if self.book grows beyond 32K items.
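For instance, concurrency can be capped with twisted.internet.defer.DeferredSemaphore (a minimal sketch; the limit value and the sample dictionary are assumptions to adapt):
from twisted.internet import defer, reactor
from twisted.web.client import getPage

MAX_CONCURRENT = 100  # assumed limit; keep it well below ulimit -n

def fetch_all(book):
    # sem.run() waits for a free slot before calling getPage, so at most
    # MAX_CONCURRENT sockets are open at any time
    sem = defer.DeferredSemaphore(MAX_CONCURRENT)
    deferreds = [sem.run(getPage, url) for url in book.values()]
    return defer.DeferredList(deferreds, consumeErrors=True)

def done(results):
    print 'fetched %d pages' % len(results)
    reactor.stop()

if __name__ == '__main__':
    book = {1: 'http://example.com/'}  # stand-in for the real 3000-item dict
    fetch_all(book).addCallback(done)
    reactor.run()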
I've written a small Python script to fix the checksums of L3/L4 protocols using scapy. When I run the script, it is not taking the command-line arguments, or maybe for some other reason it is not generating the pcap with fixed checksums. I've verified rdpcap() from the scapy command line and it works fine; from the script it is not getting executed. My program is:
import sys
import logging
logging.getLogger("scapy").setLevel(1)

try:
    from scapy.all import *
except ImportError:
    import scapy

if len(sys.argv) != 3:
    print "Usage: ./ChecksumFixer <input_pcap_file> <output_pcap_file>"
    print "Example: ./ChecksumFixer input.pcap output.pcap"
    sys.exit(1)

#------------------------Command Line Arguments--------------------------------------
input_file = sys.argv[1]
output_file = sys.argv[2]

#------------------------Get the layer and fix checksum------------------------------
def getLayer(p):
    for paktype in (scapy.IP, scapy.TCP, scapy.UDP, scapy.ICMP):
        try:
            p.getlayer(paktype).chksum = None
        except AttributeError:
            pass
    return p

#-----------------------Fix pcap from input file and write to output file------------
def fixpcap():
    paks = scapy.rdpcap(input_file)
    fc = map(getLayer, paks)
    scapy.wrpcap(output_file, fc)
The reason your function is not executed is that you never invoke it. Adding a call to fixpcap() at the end of your script will fix this issue.
Furthermore, here are a few more corrections & suggestions:
The statement following except ImportError: should be indented as well, as follows:
try:
    from scapy.all import *
except ImportError:
    import scapy
Use argparse to parse command-line arguments.
Wrap your main code in an if __name__ == '__main__': block. A sketch combining these suggestions follows.
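Putting the pieces together (a sketch, not tested against your capture files; note that under your star import the name scapy itself is never bound, so scapy.IP would raise a NameError, which is why the layer classes are imported explicitly here):
import argparse
from scapy.all import IP, TCP, UDP, ICMP, rdpcap, wrpcap

def get_layer(p):
    # Clearing chksum makes scapy recompute the checksum when the packet is written
    for paktype in (IP, TCP, UDP, ICMP):
        layer = p.getlayer(paktype)
        if layer is not None:
            layer.chksum = None
    return p

def fix_pcap(input_file, output_file):
    packets = rdpcap(input_file)
    wrpcap(output_file, [get_layer(p) for p in packets])

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Fix L3/L4 checksums in a pcap file')
    parser.add_argument('input_pcap')
    parser.add_argument('output_pcap')
    args = parser.parse_args()
    fix_pcap(args.input_pcap, args.output_pcap)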