Python multiprocessing queues and streaming data - python-2.7

I am connecting to an API that sends streaming data in string format. This stream runs for about 9 hours per day. Each string that is sent (about 1-5 forty-character strings per second) needs to be parsed and certain values need to be extracted. Those values are then stored in a list, which is parsed and read at various intervals to create another list, which in turn needs to be parsed. What is the best way to accomplish this with multiprocessing and queues? Is there a better way?
from multiprocessing import Process
import requests

data_stream = requests.get("http://myDataStreamUrl", stream=True)
lines = data_stream.iter_lines()
first_line = next(lines)

for line in lines:
    pass  # find what I need and append to first_list

def parse_first_list_and_create_second_list():
    pass  # find what I need in first_list to create second_list

def parse_second_list():
    pass  # find what I need
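One possible structure (a rough sketch only; the URL, the 40-character slice, and the 5-second interval below are placeholders, not details from the question) is a producer/consumer pair around a multiprocessing.Queue: one process reads the stream and pushes extracted values, while a second drains the queue at intervals and builds the derived lists:
from multiprocessing import Process, Queue
import requests
import time

def stream_reader(q):
    # producer: read the stream and push each extracted value onto the queue
    data_stream = requests.get("http://myDataStreamUrl", stream=True)
    for line in data_stream.iter_lines():
        value = line[:40]  # placeholder for the real extraction logic
        q.put(value)

def aggregator(q, interval=5):
    # consumer: every `interval` seconds, drain the queue into first_list
    # and derive second_list from it
    first_list = []
    while True:
        time.sleep(interval)
        while not q.empty():
            first_list.append(q.get())
        second_list = [v for v in first_list if v]  # placeholder parsing
        # parse second_list here

if __name__ == "__main__":
    q = Queue()
    Process(target=stream_reader, args=(q,)).start()
    aggregator(q)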

Related

AWS Textract (OCR) not detecting some cells

I am using AWS Textract to read and parse tables from a PDF into CSV. Lovely, AWS has documentation for it! https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I have set up the asynchronous method as they suggest, and it works for a POC. However, for some documents, some lines are not shown in my CSV.
After digging a bit into the JSON produced (the issue persists if I use the AWS CLI to run the document analysis), I noticed that the missing values have no CELL block referencing them. Those missing values are referenced in WORD and LINE blocks, but not in any CELL block. According to the script, that's exactly why they're not added to my CSV.
We could assume the OCR algorithm simply isn't that good. But the fun fact is that if I use the same PDF in the AWS Textract console, all the data is parsed into the table!
Are any of you aware of any parameters I would need to use to make sure the values are detected as CELL blocks? Or do you think that behind the scenes they simply use a more powerful script (one that actually uses the (x, y) coordinates of each WORD to reconstruct the table)?
I also compared the JSON produced by the CLI to the one from the console, and they are actually different! (Not only the IDs: as said, some values are in CELL blocks for the console output, while they appear only in LINE/WORD blocks for the CLI.)
Important fact: my PDF is 3 pages long. The first page works perfectly fine with all the values, but the second one is basically missing the first 10 lines of the table. After those 10 lines, everything on that page is parsed as well.
Any suggestions? Or a script to parse the provided JSON more efficiently?
Thank you!
Update: Basically the issue was the pagination of the results. There is a maximum of 1000 objects per response according to the AWS documentation: https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentAnalysis.html#API_GetDocumentAnalysis_RequestSyntax
If you have more than that number of objects in a single table, the IDs are in the first 1000 results while the objects themselves are referenced in the second batch (1001 --> 2000). So when the script tries to add a cell to the table, it can't find the reference.
The solution is quite easy: we need to alter the GetResults function to concatenate every response, and THEN run the other functions.
Here is the working code:
def GetResults(jobId, file_name):
    maxResults = 1000
    paginationToken = None
    finished = False
    blocks = []

    while not finished:
        response = None
        if paginationToken is None:
            response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults)
        else:
            response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults,
                                                      NextToken=paginationToken)
        blocks += response['Blocks']
        if 'NextToken' in response:
            paginationToken = response['NextToken']
        else:
            finished = True

    table_csv = get_table_csv_results(blocks)
    output_file = file_name + ".csv"

    # replace content
    with open(output_file, "w") as fout:  # Important to change "at" to "w"
        fout.write(table_csv)

    # show the results
    print('Detected Document Text')
    print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
    print('OUTPUT TO CSV FILE: ', output_file)
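For context, here is a rough sketch of how GetResults might be driven end to end; the boto3 client setup, bucket and document names are placeholders, and the simple polling loop is an assumption standing in for the SNS/SQS notification flow used in the AWS example:
import time
import boto3

textract = boto3.client('textract')

# start the asynchronous table analysis (placeholder bucket/document names)
job = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'my-bucket', 'Name': 'my-document.pdf'}},
    FeatureTypes=['TABLES'])
jobId = job['JobId']

# naive polling until the job finishes
while True:
    status = textract.get_document_analysis(JobId=jobId, MaxResults=1)['JobStatus']
    if status in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(5)

if status == 'SUCCEEDED':
    GetResults(jobId, 'my-document')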
Hope this will help people.

How to send namedtuples between transformations?

How do I send out a row as a namedtuple from one Bonobo transformation, so that the receiving transformation has field-level access to the row data?
I'm now using dicts to send data between transformations, but they have a disadvantage: they're mutable (bad things can happen if you forget to create a fresh one at the output of a transformation).
I thought that simply replacing the dict with a namedtuple would do the trick, but apparently Bonobo doesn't support sending out a namedtuple. I read something about context.set_output_fields([list of keys]), but can't figure out how to use it. A small example would be great!
Using a namedtuple is very straightforward: you can yield a namedtuple instance and receive it expanded as the input of the next transformation:
import bonobo
import collections

Hero = collections.namedtuple("Hero", ["name", "power"])

def produce():
    yield Hero(name="Road Runner", power="speed")
    yield Hero(name="Wile E. Coyote", power="traps")
    yield Hero(name="Guido", power="dutch")

def consume(name, power):
    print(name, "has", power, "power")

def get_graph():
    graph = bonobo.Graph()
    graph >> produce >> consume
    return graph

if __name__ == "__main__":
    with bonobo.parse_args() as options:
        bonobo.run(get_graph())
The "output fields" of produce() will be set from the namedtuple fields, and the "input fields" of consume(...) will be detected from the first input row.
The context.set_output_fields(...) method is only useful if, for whatever reason, you don't want to use named data structures (like namedtuples) but prefer plain tuples, yet still need to name the values in the tuple.
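For completeness, here is a minimal sketch of that tuple-based approach; it assumes the use_context decorator from bonobo.config to get hold of the execution context, so double-check against your Bonobo version:
import bonobo
from bonobo.config import use_context  # assumed import path

@use_context
def produce(context):
    # name the positions of the plain tuples yielded below
    context.set_output_fields(["name", "power"])
    yield "Road Runner", "speed"
    yield "Wile E. Coyote", "traps"

def consume(name, power):
    print(name, "has", power, "power")

def get_graph():
    graph = bonobo.Graph()
    graph >> produce >> consume
    return graph

if __name__ == "__main__":
    with bonobo.parse_args() as options:
        bonobo.run(get_graph())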
Hope that helps!

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which says that the following line takes the dict created from a .csv file and dumps it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns wide (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert line elements, trying int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))

ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert the strings in the CSV file to more useful types and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
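In case it is useful, a minimal sketch of reading the same file with unicodecsv on Python 2 (the utf-8 encoding is an assumption about your input file):
import unicodecsv

with open('xyz.csv', 'rb') as inf:
    for line in unicodecsv.reader(inf, encoding='utf-8'):
        print(line)  # each element arrives as a unicode string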

How can I share updated values in a dictionary among different processes?

I am a newbie here and also in Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the first is running a GUI application). So I created a dictionary (20 keys, each with an array of length 256; the script below is just a short example). I initialized this dictionary in a separate script and set all its values to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read the values I get from the serial port and put them in the appropriate place in this dictionary (after parsing and conversion). Since I get a lot of information through the serial port, the information changes all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial
import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16  ## equals to hexadecimal
    num_of_bits = len(hex_num) * 4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits)  # hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE,
                        stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n")  # new
    while True:
        for c in ser.read():
            line.append(c)
            a = ''.join(line[-2:])
            if a == '\r\n':
                b = ''.join(line)
                print("what I get:" + b)
                c = b[b.rfind(":"):len(b)]  # string between last ":" start delimiter and stop delimiter
                reps = {':': '', '\r\n': ''}  # throw away start and stop delimiter
                txt = replace_all(c, reps)
                print("hex num: " + txt)
                bina_num = hexTobin(txt)  # convert hex to bin
                print("bin num: " + bina_num)
                ssbit = bina_num[:3]  # first three bits
                print("select first three bits: " + ssbit)
                abit = int(ssbit, 2)  # binary to integer
                if abit == 5:
                    table1.diction['FrameID'][0] = abit
                if abit == 7:
                    table1.diction['FrameID'][5] = abit
                print("int num: ", abit)
                print(table1.diction)
                line = []
                break
    ser.close()

p1 = Process(target=readSerial)
p1.start()
During that time I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how do I create a dictionary that is available to both processes and can be updated based on the data received from the serial port?
Thank you for your answer in advance.
In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
import multiprocessing
# create shared dictionary sd
p1 = multiprocessing.Process(target=readSerial, args=(sd,))
p1.start()
# create the GUI behaviour you want
But hang on a minute. This won't work either, because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.
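To make that concrete, here is a small illustration of the difference, reusing the sd manager dict from the example above:
# mutating a nested list in place is lost: sd["test"] returns a copy,
# and the manager process never hears about the append
sd["test"].append(1)

# re-assigning the key sends the updated list back through the proxy,
# so the other process sees the change
l = sd["test"]
l.append(1)
sd["test"] = l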

How to replay a list of events consistently

I have a file containing a list of events spaced in time. Here is an example:
0, Hello World
0.5, Say Hi
2, Say Bye
I would like to be able to replay this sequence of events. The first column is the delta between two consecutive events (the first starts immediately, the second happens 0.5 s later, the third 2 s after that, ...).
How can I do that on Windows? Is there anything that can ensure that I am very accurate on the timing? The idea is to be as close as what you would get listening to music: you don't want your audio events to happen close to the right time, but exactly on time.
This can be done easily by using the sleep function from the time module. The code works like this:
import time

# Change data.txt to the name of your file
data_file = open("data.txt", "r")
# Get rid of blank lines (often the last line of the file)
vals = [i for i in data_file.read().split('\n') if i]
data_file.close()

for i in vals:
    i = i.split(',')
    i[1] = i[1][1:]
    time.sleep(float(i[0]))
    print i[1]
This is an imperfect algorithm, but it should give you an idea of how this can be done. We read the file, split it to a newline delimited list, then go through each comma delimited couplet sleeping for the number of seconds specified, and printing the specified string.
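Since the question also asks about accuracy: sleeping for each delta lets small errors accumulate over a long sequence. One option (a sketch, not part of the answer above; the events list mirrors the example file) is to sleep until an absolute target time computed from the accumulated deltas:
import time

events = [(0.0, "Hello World"), (0.5, "Say Hi"), (2.0, "Say Bye")]

start = time.time()
elapsed_target = 0.0
for delta, message in events:
    elapsed_target += delta
    # sleep until the absolute target so per-event sleep jitter does not accumulate
    remaining = start + elapsed_target - time.time()
    if remaining > 0:
        time.sleep(remaining)
    print(message)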
You're looking for time.sleep(...) in Python.
If you load that file as a list, and then print the values,
import time

with open("datafile.txt", "r") as infile:
    lines = infile.read().split('\n')

for line in lines:
    wait, response = line.split(',')
    time.sleep(float(wait))
    print response