I am writing a feature-extraction program in Python 2.7 that collects information from over 20,000 files. I store some pre-computed information in a dictionary. To make my program faster, I use multiprocessing, and the big dictionary has to be available in each process. (Each process just handles a part of the files; every process runs the same function.) The dictionary is never modified in these processes; it is only a parameter of the function.
I found that each process creates its own address space, each with its own copy of this big dictionary. My computer does not have enough memory to hold that many copies. Is there a way to create a single, shared dict object that every process can use? Below is the multiprocessing part of my code; pmi_dic is the big dictionary (possibly several GB) and it is not modified inside get_features.
processNum = 2
pool = mp.Pool(processes = processNum)
fileNum = len(filelist)
offset = fileNum / processNum
for i in range(processNum):
    if (i == processNum - 1):
        start = i * offset
        end = fileNum
    else:
        start = i * offset
        end = start + offset
    print str(start) + ' ' + str(end)
    pool.apply_async(get_features, args = (df_dic, pmi_dic, start, end, filelist, wordNum))
pool.close()
pool.join()
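(Not part of the original question, just a commonly used pattern:) on platforms where multiprocessing forks its workers, a dictionary that is reachable as a module-level global before the Pool is created can be read by the workers without being pickled into every task, because the children inherit the parent's memory copy-on-write. A rough sketch of that idea, with hypothetical names (load_pmi_dict, get_features_range):

import multiprocessing as mp

# Hypothetical stand-in for however pmi_dic is actually built.
def load_pmi_dict():
    return {('w1', 'w2'): 0.5}

# Module-level global: on fork-based platforms (e.g. Linux) the workers
# inherit this object copy-on-write instead of receiving a pickled copy.
PMI_DIC = load_pmi_dict()

def get_features_range(args):
    start, end, filelist = args
    for name in filelist[start:end]:
        # read-only lookups against the shared dictionary
        score = PMI_DIC.get(('w1', 'w2'), 0.0)
        # ... compute and save features for `name` using `score` ...

if __name__ == '__main__':
    filelist = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
    pool = mp.Pool(processes=2)
    pool.map(get_features_range, [(0, 2, filelist), (2, 4, filelist)])
    pool.close()
    pool.join()

CPython's reference counting still writes to the pages that hold the dictionary, so copy-on-write reduces rather than eliminates the duplication, but for a large read-only structure it usually helps considerably.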
I have a file of binary values. The section I am looking at consists of 4-byte values in the pattern MW1, MVAR1, MW2, MVAR2, ...
I read the values in with
temp = array.array("f")
temp.fromfile(file, length *2)
mw_mvar = temp.tolist()
I then calculate the magnitude like this.
mag = [0] * length
for x in range(0, length * 2, 2):
    a = mw_mvar[x]
    b = mw_mvar[x + 1]
    mag[x / 2] = sqrt(a*a + b*b)
The calculations (not the read) are doubling the total run time of my script. I know there is (theoretically) a way to do this faster, because I am mimicking a script that ultimately calls Fortran (a .pyd calling Fortran DLLs, I think) and that version does this calculation with a negligible effect on run time.
This is the best I can come up with. Any suggestions for improvements?
I have also tried math.pow(), ** .5, and ** 2 with no difference.
Having had no luck speeding up the calculations, I went around the problem. I realised that I only needed about 1% of the calculated values, so I created a class that computes them on demand. It was important (to me) that the resulting code behave as if it were a list of pre-calculated values, because a lot of the remainder of the process uses those values and different versions of the data are pre-calculated. The class means I don't need a separate set of procedures for each version of the data.
class mag:
    def __init__(self, mw_mvar):
        self._mw_mvar = mw_mvar
        #_sgn = sgn

    def __len__(self):
        return len(self._mw_mvar) / 2

    def __getitem__(self, item):
        return sqrt(self._mw_mvar[2*item] ** 2 + self._mw_mvar[2*item+1] ** 2)
P.S. This could also be done with a function that takes both versions; I would just have had to make more changes to the overall script.
def magnitude(a, b, x):
    if b[x] == 0:
        return a[x]
    else:
        return sqrt(a[x]**2 + b[x]**2)
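As an aside (not from the original post): if numpy is available, the whole magnitude array can likely be computed in one vectorized pass, which is presumably what the Fortran routine being mimicked does internally. A minimal sketch, using a stand-in for the interleaved list read from the file:

import numpy as np

# Stand-in for the interleaved [MW1, MVAR1, MW2, MVAR2, ...] list read from the file
mw_mvar = [3.0, 4.0, 6.0, 8.0]

pairs = np.asarray(mw_mvar, dtype=np.float32).reshape(-1, 2)
mag = np.hypot(pairs[:, 0], pairs[:, 1])   # element-wise sqrt(mw**2 + mvar**2)
print(mag)                                 # [  5.  10.]

numpy.fromfile could similarly replace the array.array read, so the data never has to pass through a Python list at all.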
I'm using the map function with pandas to read in files with multiprocessing, like this:
files = glob.glob(r'C:\Desktop\Folder\*.xlsx')

def read_excel(filename):
    return pd.read_excel(filename)

file_list = [filename for filename in files]
pool = Pool(processes=4)
pool.map(read_excel, file_list)
But the problem is that before, when I was using a for loop, I could keep a counter (count += 1) on each iteration and print count / len(files) to get a sense of how far along the process was; I can't do that here. I realize that with multiprocessing it could get a little funky, but there should be some way to implement this.
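One common way to get a progress count back (a sketch, not from the original question) is to switch from Pool.map to Pool.imap_unordered, which hands each result to the parent process as soon as a worker finishes it:

import glob
from multiprocessing import Pool

import pandas as pd

def read_excel(filename):
    return pd.read_excel(filename)

if __name__ == '__main__':
    file_list = glob.glob(r'C:\Desktop\Folder\*.xlsx')
    pool = Pool(processes=4)
    results = []
    # imap_unordered yields each DataFrame as soon as a worker finishes it,
    # so the parent process can count completed files as they arrive.
    for i, df in enumerate(pool.imap_unordered(read_excel, file_list), 1):
        results.append(df)
        print('%d / %d files read' % (i, len(file_list)))
    pool.close()
    pool.join()

If the original input order matters, Pool.imap preserves it at the cost of slightly less smooth progress output.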
I am a newbie here and also in Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the first one is running a GUI application). So I created a dictionary (20 keys, each holding an array of length 256; the script below is just a short example). I initialized this dictionary in a separate script and set all of its values to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read values from the serial port and, after parsing and conversion, store them in the appropriate place in this dictionary. Since a lot of information comes in through the serial port, the values change all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial
import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16  # hexadecimal
    num_of_bits = len(hex_num) * 4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits)  # hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE, stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n")  # new
        while True:
            for c in ser.read():
                line.append(c)
                a = ''.join(line[-2:])
                if a == '\r\n':
                    b = ''.join(line)
                    print("what I get:" + b)
                    c = b[b.rfind(":"):len(b)]  # string between last ":" start delimiter and stop delimiter
                    reps = {':': '', '\r\n': ''}  # throw away start and stop delimiters
                    txt = replace_all(c, reps)
                    print("hex num: " + txt)
                    bina_num = hexTobin(txt)  # convert hex to bin
                    print("bin num: " + bina_num)
                    ssbit = bina_num[:3]  # first three bits
                    print("select first three bits: " + ssbit)
                    abit = int(ssbit, 2)  # binary to integer
                    if abit == 5:
                        table1.diction['FrameID'][0] = abit
                    if abit == 7:
                        table1.diction['FrameID'][5] = abit
                    print("int num: ", abit)
                    print(table1.diction)
                    line = []
                    break
    ser.close()

p1 = Process(target=readSerial)
p1.start()
Meanwhile, I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how do I create a dictionary that is available to both processes and can be updated with the data received from the serial port?
Thank you in advance for your answers.
In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
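A quick way to see this (hypothetical file names, shown here as one annotated sketch) is to run two small scripts that both import table1: each interpreter mutates only its own copy of the module.

# table1.py
diction = {'FrameID': [0] * 10}

# script_a.py -- run in one terminal
import table1
table1.diction['FrameID'][0] = 5
print(table1.diction)    # shows the change, but only in this process

# script_b.py -- run in another terminal while script_a is running
import table1
print(table1.diction)    # still all zeros: this process has its own copy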
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
import multiprocessing
# create shared dictionary sd
p1 = multiprocessing.Process(target=readSerial, args=(sd,))
p1.start()
# create the GUI behaviour you want
But hang on a minute. This won't work either. Because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.
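For illustration (a sketch, not part of the original answer), the difference between mutating the list in place and reassigning the key looks like this:

import multiprocessing

def broken(d):
    # Appends to a local copy of the list; the proxy is never told,
    # so the change is not sent back to the manager process.
    d["test"].append(1)

def working(d):
    # Reassigning the key makes the proxy push the new value across the pipe.
    l = d["test"]
    l.append(1)
    d["test"] = l

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    sd = manager.dict()
    sd["test"] = []

    p = multiprocessing.Process(target=broken, args=(sd,))
    p.start()
    p.join()
    print(sd["test"])   # [] -- the append was lost

    p = multiprocessing.Process(target=working, args=(sd,))
    p.start()
    p.join()
    print(sd["test"])   # [1]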
To make it easier, let's assume I have 10 files named "1", "2", ..., "10".
I am in a situation where I want to load those 10 files in a script, one at a time.
I am using this code, written ten times in a row, with the mathematical operations I want to apply to the data from those files in between:
Tk().withdraw()
filename2 = askopenfilename()
with load(filename2) as data:
    ..."mathematical operations"...

Tk().withdraw()
filename3 = askopenfilename()
with load(filename3) as data:
    etc, etc ...
This opens 10 dialog boxes, one after another, where I need to type the name of each file to load it (so I type "1", hit Enter, then type "2" in the next box, hit Enter, and so on).
I am looking for a way to have only one dialog box open (or maybe you know something even smarter), where I type the right sequence of numbers once so the script loads the files one at a time on its own.
In other words, I will soon have 300 files; I just want to type 1,2,3,4,5,...,300 once and hit Enter, rather than doing what I described earlier.
Or maybe a way to just type "300" so the script knows it has to look for files starting at "1" and incrementing one by one.
The open function just takes a string, and you can create that string any way you want. You can concatenate the static parts of your filename with a changing number in a for loop:
s_pre = 'file'
s_ext = '.txt'
numFiles = int(raw_input("number of files: "))
for i in range(1, numFiles + 1):
    filename = s_pre + str(i) + s_ext
    with open(filename) as data:
        ## input stuff
        ## math stuff
I assume load is your function, and you can just pass this filename in the loop to load as well.
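If the files still need to be picked interactively, one more option (a sketch, not part of the original answer; open() stands in for whatever load actually is) is a single dialog that returns every selected path at once, using tkinter's askopenfilenames:

from Tkinter import Tk                      # tkinter in Python 3
from tkFileDialog import askopenfilenames   # tkinter.filedialog in Python 3

Tk().withdraw()
filenames = askopenfilenames(title="Select all data files")
for filename in filenames:
    with open(filename) as data:
        pass  # mathematical operations on `data` go here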
Summary: Can I cause time to (appear to) pass in my script without busywaiting? Using os.execute('sleep 1') doesn't cut it (and also wastes time).
Details
I have a library that does work after specific intervals. I currently schedule this by injecting a table into a time-sorted list. Roughly:
local delayedWork = {}
function handleLater( data, delaySeconds )
  local delayedItem = { data=data, execTime=os.clock() + delaySeconds }
  local i=1
  for _,existing in ipairs(delayedWork) do
    if existing.execTime > delayedItem.execTime then break else i=i+1 end
  end
  table.insert(delayedWork,i,delayedItem)
end
Later on I periodically check for work to do. Roughly:
function processDelayedWork()
  local i,last = 1,#delayedWork
  while i<=last do
    local delayedItem = delayedWork[i]
    if delayedItem.execTime <= os.clock() then
      table.remove(delayedWork,i)
      -- use delayedItem.data
      last = last-1
    else
      i=i+1
    end
  end
end
(The fact that this uses CPU time instead of wall time is an issue for a different question. The fact that I repeatedly shift the array during processing instead of one compacting pass is an optimization not relevant here.)
When I am running unit tests of the system I need to cause time to pass, preferably faster than normal. However, calling os.execute('sleep 1') consumes one wall clock second, but does not cause os.clock() to increment.
$ lua -e "require 'os' print(os.clock()) os.execute('sleep 1') print(os.clock())"
0.002493
0.002799
I can't use os.time() because it only has integer-second resolution (and even less precision on systems where Lua is compiled to use float instead of double for numbers):
$ lua -e "require 'os' print(os.time()) os.execute('sleep 0.4') print(os.time())"
1397704209
1397704209
Can I force time to pass in my script without just busywaiting?
Paul answered this already: make os.clock return whatever you need.
I'm adding an implementation that lets you control how fast time passes according to os.clock.
do
  local realclock = os.clock
  local lasttime = realclock()
  local faketime = lasttime
  local clockMultiplier = 1

  function setClockMultiplier(multiplier)
    clockMultiplier = multiplier
  end

  function adjustTime(offsetAmount)
    faketime = faketime + offsetAmount
  end

  function os.clock()
    local now = realclock()
    adjustTime((now - lasttime) * clockMultiplier)
    lasttime = now
    return faketime
  end
end
You can now call setClockMultiplier to freely slow down or speed up the passage of time as reported by os.clock. You can call adjustTime to advance or retard the clock by arbitrary amounts.
Why not write your own clock function that does what you want?
do
  local clock = os.clock
  local increment = 0
  os.clock = function(inc)
    increment = increment + (inc or 0)
    return clock() + increment
  end
end
print(os.clock())
os.clock(1)
print(os.clock())
This will print 0.001 1.001 or something similar.