How can I share updated values in a dictionary among different processes? - python-2.7

I am a newbie here and also in Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the GUI application runs on the first). So, I created a dictionary (20 keys, each holding an array of length 256; the script below is just a short example). I initialized this dictionary in a separate script and set all of its values to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read the values I get from the serial port and put them in this dictionary (parsing + conversion) at the appropriate place. Since a lot of information comes through the serial port, the values change all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial
import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16  # hexadecimal base
    num_of_bits = len(hex_num) * 4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits)  # hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE,
                        stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n")  # new
        while True:
            for c in ser.read():
                line.append(c)
                a = ''.join(line[-2:])
                if a == '\r\n':
                    b = ''.join(line)
                    print("what I get: " + b)
                    c = b[b.rfind(":"):len(b)]  # string between last ":" start delimiter and stop delimiter
                    reps = {':': '', '\r\n': ''}  # throw away start and stop delimiters
                    txt = replace_all(c, reps)
                    print("hex num: " + txt)
                    bina_num = hexTobin(txt)  # convert hex to bin
                    print("bin num: " + bina_num)
                    ssbit = bina_num[:3]  # first three bits
                    print("select first three bits: " + ssbit)
                    abit = int(ssbit, 2)  # binary to integer
                    if abit == 5:
                        table1.diction['FrameID'][0] = abit
                    if abit == 7:
                        table1.diction['FrameID'][5] = abit
                    print("int num: ", abit)
                    print(table1.diction)
                    line = []
                    break
    ser.close()

p1 = Process(target=readSerial)
p1.start()
During that time I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how can I create a dictionary that is available to both processes and can be updated with the data read from the serial port?
Thank you in advance for your answer.

In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
import multiprocessing
# create shared dictionary sd
p1 = multiprocessing.Process(target=readSerial, args=(sd,))
p1.start()
# create the GUI behaviour you want
But hang on a minute. This won't work either. Because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.

Related

Parse CSV efficiently in python

I am writing a CSV parser which has the following structure:
class decode:
    def __init__(self):
        self.fd = open('test.csv')

    def decodeoperation(self):
        for row in self.fd:
            cmd = self.decodecmd(row)
            if cmd == 'A':
                self.decodeAopt()
            elif cmd == 'B':
                self.decodeBopt()

    def decodeAopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd A till
            # a condition occurs on any further row
            return

    def decodeBopt(self):
        for row in self.fd:
            # decode further dependencies based on cmd B till
            # a condition occurs on any further row
            return
The current code works fine for me, but I don't feel good about iterating through the CSV file in all the methods. Could it be done in a better way?
There is nothing inherently wrong with using a common iterator across multiple methods, as long as you can determine in advance which method to dispatch to at any given point in the sequence (which you are doing by decoding the cmd from the row and getting 'A', 'B', etc.). The design has issues if you have to read several items before you could determine which method to call, and might have to back up if you picked the wrong method and needed to try another. In parsing, this is called backtracking. Since you are passing around a file object, backing up is difficult. Note that your separate decoder methods will have to know when to stop before reading the next row that contains a command, so they will need some sort of terminating sentinel row that they can recognize.
Some general comments on your Python and class design:
You have a nice simple if-elif-elif dispatch table that can translate to a Python dict like this:
# put this code in place of your "if cmd == ... elif elif elif..." code
dispatch = {
    # note - no ()'s, we just want to reference the methods, not call them
    'A': self.decodeAopt,
    'B': self.decodeBopt,
    'C': self.decodeCopt,
    # look how easy it is to add more decoders
}

# look up which decoder to use for the current cmd
decoder = dispatch[cmd]
# run it
decoder()

# or do it all in one line
dispatch[cmd]()
Instead of having your __init__ method open a file, let it accept an iterator object. This will make it much easier to write tests for your object, since you'll be able to pass simple Python lists containing CSV rows.
class decode:
    def __init__(self, sequence):
        self.fd = sequence
You might want to rename this var from 'fd' to something like 'seq', since it doesn't have to be a file, but could be any iterable that gives you decodable rows.
If you are doing your own CSV parsing, look at using the builtin csv module. It will do quite a bit of work for you, like parsing quoted strings that could contain commas, and can give you easy-to-work-with dicts for each row, given headers read from the input file, or specified by you. If you have modified __init__ as I suggested, you can use it like:
import csv
# assuming test.csv has a header row
reader = csv.DictReader(open('test.csv'))
# or specify headers if not - I encourage you to give these columns better names
reader.fieldnames = ['cmd', 'val1', 'val2', 'val3']
decoder = decode(reader)
decoder.decodeoperation()
Then you can write in decodeoperation:
cmd = row['cmd']
Note that this would impart a slightly different design to your class, that it would expect to be given a sequence of dicts, rather than a sequence of strings.
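Putting these suggestions together, a minimal sketch might look like this (the decoder bodies and column names are placeholders for your real logic):

```python
class decode:
    def __init__(self, seq):
        self.seq = seq          # any iterable of row dicts, not necessarily a file
        self.decoded = []

    def decodeAopt(self, row):
        self.decoded.append(('A', row['val1']))

    def decodeBopt(self, row):
        self.decoded.append(('B', row['val1']))

    def decodeoperation(self):
        # dict-based dispatch instead of if/elif chains
        dispatch = {'A': self.decodeAopt, 'B': self.decodeBopt}
        for row in self.seq:
            dispatch[row['cmd']](row)

# testing with a plain list of dicts - no file needed
rows = [{'cmd': 'A', 'val1': '1'}, {'cmd': 'B', 'val1': '2'}]
d = decode(rows)
d.decodeoperation()
# d.decoded == [('A', '1'), ('B', '2')]
```

The same object works unchanged when you hand it a csv.DictReader instead of the list.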

Mapreduce wordcount in python

Well, I am new to MapReduce programs. When I search for examples of MapReduce programs, all I find is the word count program. All the word count programs use a text file as the input. I tried using a CSV file as the input, and the reducer does not work as it does for a text file.
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
This is the example I am currently looking at. Could someone explain the reason for this?
You can use the collections.Counter class:
from collections import Counter

with open(filename) as handler:
    counter = Counter(handler.read().split())

print(counter.most_common(10))
In the documentation you will find a lot more useful information.
Distinguish between:
- MapReduce: a programming model for parallel execution
- Python: a programming language
- Hadoop: a cluster platform Python can run on
Do you really need MapReduce for Hadoop, or just a pure Python example? If the latter, it can be much simpler than your link:
import multiprocessing

def word_count(line, delimiter=","):
    """Worker"""
    summary = {}
    for word in line.strip().split(delimiter):
        if word in summary:
            summary[word] += 1
        else:
            summary[word] = 1
    return summary

pool = multiprocessing.Pool()
result = {}
# Map: each line to a separate worker
for summary in pool.imap_unordered(word_count, open("/path/to/file.csv")):
    # Reduce: aggregate the result of each line into one result
    for (word, count) in summary.items():
        result[word] = result[word] + count if word in result else count
print(result)
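As an aside, collections.Counter can replace both the hand-written counting loop and the aggregation, since Counters support +=. A sketch of the same map/reduce shape, shown sequentially here (pool.imap_unordered would replace map in the parallel version):

```python
from collections import Counter

def word_count(line, delimiter=","):
    # Map: one per-line summary
    return Counter(line.strip().split(delimiter))

lines = ["a,b,a", "b,c"]   # stand-in for the lines of file.csv
result = Counter()
for summary in map(word_count, lines):
    result += summary       # Reduce: merge each per-line summary into the total
# result == Counter({'a': 2, 'b': 2, 'c': 1})
```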

View execute time is very long (above one minute)

In our Django project, there is a view which creates multiple objects (from 5 up to even 100). The problem is that the creation phase takes a very long time.
I don't know why that is, but I suppose it could be because for n objects there are n database lookups and commits.
For example, 24 objects take 67 seconds.
I want to speed up this process.
There are two things I think may be worth considering:
- Create these objects in one query so only one commit is executed.
- Create a ThreadPool and create these objects in parallel.
This is the part of the view which causes the problems (we use Postgres on localhost, so the connection is not the problem):
@require_POST
@login_required
def heu_import(request):
    ...
    ...
    product = Product.objects.create(user=request.user, name=name,
                                     manufacturer=manufacturer, category=category)
    product.groups.add(*groups)
    occurences = []
    counter = len(urls_xpaths)
    print 'Occurences creating'
    start = datetime.now()
    eur_currency = Currency.objects.get(shortcut='eur')
    for url_xpath in urls_xpaths:
        counter -= 1
        print counter
        url = url_xpath[1]
        xpath = url_xpath[0]
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath,
                                       active=True if xpath else False,
                                       currency=eur_currency)
        occurences.append(occ)
    print 'End'
    print datetime.now() - start
    ...
    return render(request, 'main_app/dashboard/new-product.html', context)
Output:
Occurences creating
24
.
.
.
0
End
0:01:07.727000
EDIT:
I tried to put the for loop into a with transaction.atomic(): block, but it seems to help only a bit (47 seconds instead of 67).
EDIT 2:
I'm not sure, but it seems that the SQL queries are not the problem.
Please use bulk_create for inserting multiple objects.
occurences = []
for url_xpath in urls_xpaths:
    counter -= 1
    print counter
    url = url_xpath[1]
    xpath = url_xpath[0]
    occurences.append(Occurence(product=product, url=url, xpath=xpath,
                                active=True if xpath else False,
                                currency=eur_currency))
Occurence.objects.bulk_create(occurences)

Python multiprocessing queues and streaming data

I am connecting to an API that sends streaming data in string format. This stream runs for about 9 hours per day. Each string that is sent (about 1-5 forty-character strings per second) needs to be parsed, and certain values need to be extracted. Those values are then stored in a list which is parsed, and data is retrieved from it at various intervals to create another list, which also needs to be parsed. What is the best way to accomplish this with multiprocessing and queues? Is there a better way?
from multiprocessing import Process
import requests

data_stream = requests.get("http://myDataStreamUrl", stream=True)
lines = data_stream.iter_lines()
first_line = next(lines)
for line in lines:
    # find what I need and append to first_list
    pass

def parse_first_list_and_create_second_list():
    # find what I need in first_list to create second_list
    pass

def parse_second_list():
    # find what I need
    pass

Share a big dictionary in Python multiprocessing on Windows

I am writing a feature-collection program which extracts information from over 20000 files with Python 2.7. I store some pre-computed information in a dictionary. To make my program faster, I use multiprocessing, and the big dictionary must be used in these processes. (Actually, I just want each process to handle a part of the files; every process runs the same function.) This dictionary is not changed in these processes; it is just a parameter for the function.
I found that each process creates its own address space, each with a copy of this big dictionary. My computer does not have enough memory to store many copies of this dictionary. I wonder if there is a way to create a static dict object that can be used by every process. Below is my code for the multiprocessing part; pmi_dic is the big dictionary (maybe several GB), and it is not changed in the function get_features.
processNum = 2
pool = mp.Pool(processes=processNum)
fileNum = len(filelist)
offset = fileNum / processNum
for i in range(processNum):
    if (i == processNum - 1):
        start = i * offset
        end = fileNum
    else:
        start = i * offset
        end = start + offset
    print str(start) + ' ' + str(end)
    pool.apply_async(get_features, args=(df_dic, pmi_dic, start, end, filelist, wordNum))
pool.close()
pool.join()