How can I keep track of files processed while using map? - python-2.7

I'm using the map function with pandas to read in files with multiprocessing, like:
import glob
from multiprocessing import Pool
import pandas as pd

files = glob.glob(r'C:\Desktop\Folder\*.xlsx')

def read_excel(filename):
    return pd.read_excel(filename)

file_list = [filename for filename in files]
pool = Pool(processes=4)
pool.map(read_excel, file_list)
The problem is that before, when I used a for loop, I could keep a counter
count += 1
for each iteration and print count / len(files) to get a sense of how far along the process was; I can't do that here. I realize that with multiprocessing it could get a little funky, but there should be some way to implement this.
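One possible way to see progress (a sketch, not the original code) is to have the pool hand results back lazily with imap_unordered and count them in the parent process as they complete; note that, unlike map, the results then arrive in completion order rather than input order:
import glob
from multiprocessing import Pool
import pandas as pd

def read_excel(filename):
    return pd.read_excel(filename)

if __name__ == '__main__':
    files = glob.glob(r'C:\Desktop\Folder\*.xlsx')
    pool = Pool(processes=4)
    results = []
    # imap_unordered yields each result as soon as a worker finishes it,
    # so the parent process can count and report progress
    for i, frame in enumerate(pool.imap_unordered(read_excel, files), 1):
        results.append(frame)
        print('%d / %d files processed' % (i, len(files)))
    pool.close()
    pool.join()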

Related

How to get 3 unique values using random.randint() in Python?

I am trying to populate a list in Python 3 with 3 random items read from a file using a regex, but I keep getting duplicate items in the list.
Here is an example.
import re
import random as rn

data = '/root/Desktop/Selenium[FILTERED].log'
with open(data, 'r') as inFile:
    index = inFile.read()

URLS = re.findall(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', index)

list_0 = []
for i in range(3):
    list_0.append(URLS[rn.randint(1, 30)])

for i in range(len(list_0)):
    print(list_0[i])
What would be the cleanest way to prevent duplicate items being appended to the list?
(EDIT)
This is the code that I think has done the job quite well.
def random_sample(data):
    r_e = ['https://www\.\w{1,10}\.com/view\?i=\w{1,20}', '..']
    with open(data, 'r') as inFile:
        urls = re.findall(r'%s' % r_e[0], inFile.read())
    x = list(set(urls))
    return x

data = '/root/Desktop/[TEMP].log'
sample = random_sample(data)

for i in range(3):
    print(sample[i])
A set is an unordered collection with no duplicate entries, so converting to a set already removes duplicates. Better still, use the builtin random.sample:
random.sample(population, k)
    Return a k length list of unique elements chosen from the population sequence or set. Used for random sampling without replacement.
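For example, applied directly to the URLS list from the question (a sketch assuming URLS holds at least three matches; if the list itself contains duplicate strings, de-duplicate it first, as in Addendum 2 below):
import random

# three distinct positions are drawn, so the result contains no repeated picks
unique_three = random.sample(URLS, 3)
print(unique_three)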
Addendum
After seeing your edit, it looks like you've made things much harder than they have to be. I've hard-wired a list of URLs in the following, but the source doesn't matter. Selecting the (guaranteed-unique) subset is essentially a one-liner with random.sample:
import random
# the following two lines are easily replaced
URLS = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8']
SUBSET_SIZE = 3
# the following one-liner yields the randomized subset as a list
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
print(urlList) # produces, e.g., => ['url7', 'url3', 'url4']
Note that by using len(URLS) and SUBSET_SIZE, the one-liner that does the work is not hardwired to the size of the set nor the desired subset size.
Addendum 2
If the original list of inputs contains duplicate values, the following slight modification will fix things for you:
URLS = list(set(URLS)) # this converts to a set for uniqueness, then back for indexing
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
Or even better, because it doesn't need two conversions:
URLS = set(URLS)
urlList = [u for u in random.sample(URLS, SUBSET_SIZE)]
Another approach is to keep a set of the values you have already drawn and only append values that haven't been seen yet:
seen = set(list_0)
randValue = URLS[rn.randint(1, 30)]
# [...]
if randValue not in seen:
    seen.add(randValue)
    list_0.append(randValue)
Now you just need to check that the size of list_0 equals 3 to stop the loop.
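Putting that together, a minimal sketch of the loop (an illustration, not the original answer's code; it assumes URLS contains at least three distinct entries, and it indexes the whole list rather than the hard-coded 1-30 range):
import random as rn

list_0 = []
seen = set()
while len(list_0) < 3:
    randValue = URLS[rn.randint(0, len(URLS) - 1)]  # pick any URL at random
    if randValue not in seen:                       # skip values already drawn
        seen.add(randValue)
        list_0.append(randValue)
print(list_0)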

View execution time is very long (over one minute)

In our Django project there is a view which creates multiple objects (from 5 up to even 100). The problem is that the creation phase takes a very long time.
I don't know why, but I suppose it is because for n objects there are n database queries and commits.
For example, 24 objects take 67 seconds.
I want to speed up this process.
There are two things I think may be worth considering:
Creating these objects in one query, so only one commit is executed.
Creating a ThreadPool and creating these objects in parallel.
This is the part of the view that causes problems (we use Postgres on localhost, so the connection is not the problem):
@require_POST
@login_required
def heu_import(request):
    ...
    ...
    product = Product.objects.create(user=request.user, name=name, manufacturer=manufacturer, category=category)
    product.groups.add(*groups)
    occurences = []
    counter = len(urls_xpaths)
    print 'Occurences creating'
    start = datetime.now()
    eur_currency = Currency.objects.get(shortcut='eur')
    for url_xpath in urls_xpaths:
        counter -= 1
        print counter
        url = url_xpath[1]
        xpath = url_xpath[0]
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency)
        occurences.append(occ)
    print 'End'
    print datetime.now() - start
    ...
    return render(request, 'main_app/dashboard/new-product.html', context)
Output:
Occurences creating
24
.
.
.
0
End
0:01:07.727000
EDIT:
I tried to put the for loop into a with transaction.atomic(): block, but it seems to help only a bit (47 seconds instead of 67).
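For reference, the wrapping I mean looks roughly like this (a sketch based on the view above; transaction comes from django.db):
from django.db import transaction

with transaction.atomic():  # a single transaction/commit around the whole loop
    for url_xpath in urls_xpaths:
        Occurence.objects.create(product=product, url=url_xpath[1], xpath=url_xpath[0],
                                 active=bool(url_xpath[0]), currency=eur_currency)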
EDIT2:
I'm not sure, but it seems that the SQL queries are not the problem.
Please use bulk_create for inserting multiple objects; it inserts the whole list in one query instead of issuing one query per object.
occurences = []
for url_xpath in urls_xpaths:
    counter -= 1
    print counter
    url = url_xpath[1]
    xpath = url_xpath[0]
    occurences.append(Occurence(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency))
Occurence.objects.bulk_create(occurences)
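If the list ever grows large, bulk_create also takes an optional batch_size argument that splits the insert into several bounded queries; a small sketch:
# hypothetical tweak: insert at most 100 rows per INSERT statement
Occurence.objects.bulk_create(occurences, batch_size=100)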

Share a big dictionary in Python multiprocessing on Windows

I am writing a feature-collection program which extracts information from over 20,000 files with Python 2.7. I store some pre-computed information in a dictionary. In order to make my program faster, I use multiprocessing, and the big dictionary must be used in these processes. (Actually, I just want each process to handle a part of the files; every process runs the same function.) This dictionary will not be changed in these processes; it is just a parameter for the function.
I found that each process creates its own address space, each with a copy of this big dictionary. My computer does not have enough memory to store that many copies of this dictionary. I wonder if there is a way to create a static dict object that can be used by every process? Below is my code for the multiprocessing part; pmi_dic is the big dictionary (maybe several GB), and it is not changed in the function get_features.
processNum = 2
pool = mp.Pool(processes = processNum)
fileNum = len(filelist)
offset = fileNum / processNum
for i in range(processNum):
    if (i == processNum - 1):
        start = i * offset
        end = fileNum
    else:
        start = i * offset
        end = start + offset
    print str(start) + ' ' + str(end)
    pool.apply_async(get_features, args = (df_dic, pmi_dic, start, end, filelist, wordNum))
pool.close()
pool.join()
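One possibility (a sketch, not the original code, with stand-in data for filelist, df_dic and pmi_dic) is to hold a single copy of the dictionary in a multiprocessing.Manager and hand the workers proxies to it; this avoids duplicating the data in every process, at the price of each lookup going through the manager process and therefore being much slower than a local dict:
import multiprocessing as mp

def get_features(df_dic, pmi_dic, start, end, filelist, wordNum):
    # placeholder worker: pmi_dic here is a DictProxy, so lookups are
    # served by the manager process instead of a per-process copy
    for name in filelist[start:end]:
        pmi_dic.get(name)

if __name__ == '__main__':
    filelist = ['file%d' % i for i in range(10)]          # stand-in data
    df_dic, wordNum = {}, 100                             # stand-in data
    pmi_dic = dict(('file%d' % i, i) for i in range(10))  # stand-in for the big dict

    manager = mp.Manager()
    shared_pmi = manager.dict(pmi_dic)  # one server-side copy, shared via proxies

    processNum = 2
    pool = mp.Pool(processes=processNum)
    fileNum = len(filelist)
    offset = fileNum // processNum
    for i in range(processNum):
        start = i * offset
        end = fileNum if i == processNum - 1 else start + offset
        pool.apply_async(get_features, args=(df_dic, shared_pmi, start, end, filelist, wordNum))
    pool.close()
    pool.join()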

How can I share updated values in a dictionary among different processes?

I am a newbie here and also to Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the GUI application runs on the first one). So I created a dictionary (20 keys, each holding an array of length 256; the script below is just a short example). I initialized this dictionary in a separate script and set all of its values to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read the values that I get from the serial port and put them into this dictionary (parsing + conversion) at the appropriate place. Since I receive a lot of information through the serial port, the information changes all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial
import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16  ## equals to hexadecimal
    num_of_bits = len(hex_num) * 4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits)  # hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE, stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n")  # new
        while True:
            for c in ser.read():
                line.append(c)
                a = ''.join(line[-2:])
                if a == '\r\n':
                    b = ''.join(line)
                    print("what I get:" + b)
                    c = b[b.rfind(":"):len(b)]  # string between last ":" start delimiter and stop delimiter
                    reps = {':': '', '\r\n': ''}  # throw away start and stop delimiter
                    txt = replace_all(c, reps)
                    print("hex num: " + txt)
                    bina_num = hexTobin(txt)  # convert hex to bin
                    print("bin num: " + bina_num)
                    ssbit = bina_num[:3]  # first three bits
                    print("select first three bits: " + ssbit)
                    abit = int(ssbit, 2)  # binary to integer
                    if abit == 5:
                        table1.diction['FrameID'][0] = abit
                    if abit == 7:
                        table1.diction['FrameID'][5] = abit
                    print("int num: ", abit)
                    print(table1.diction)
                    line = []
                    break
    ser.close()
p1=Process(target=readSerial)
p1.start()
During that time I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how do I create a dictionary that is available to both processes and can be updated based on the data received from the serial port?
Thank you in advance for your answers.
In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
import multiprocessing
# create shared dictionary sd
p1 = multiprocessing.Process(target=readSerial, args=(sd,))
p1.start()
# create the GUI behaviour you want
But hang on a minute. This won't work either. Because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.

Python Generator: Not able to generate multiple files

I have a file of 65,000 docs and their contents. I have split this file into two data sets, training and test. I want to break the training data set into small files by number of lines and train my model, but the code produces only the first chunk and keeps producing it. Most probably, I am consuming the already-used generator every time. I have posted the code for reference below. Any improvement or spotting of the logical error will be greatly appreciated. Thanks.
Code to create training and test data sets :
import itertools

fo = open('desc_py_output.txt', 'rb')

def generate_train_test(doc_iter, size):
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

for i, line in enumerate(generate_train_test(fo, 50000)):
    if i == 0:
        training_data = line
    else:
        test_data = line
Now I am trying to create small files of 5000 docs using the following code:
def generate_in_chunks(doc_iter, size):
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

for i, line in enumerate(generate_in_chunks(training_data, 5000)):
    x = [member.split('^')[2] for member in line]
    y = [member.split('^')[1] for member in line]
    print x[0]
This prints the same documents again and again.
The generate_train_test function yields lists, so in your generate_in_chunks function doc_iter is a list, not an iterator. A list does not get consumed, so islice always starts again from the beginning. Make sure doc_iter really is an iterator at the beginning, and it will work. Also, it seems you can use the same function for both.
def chunkify(doc_iter, size):
    doc_iter = iter(doc_iter)  # make sure doc_iter really is an iterator
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data
Alternatively, you could return a generator instead of a list, but this will only work if you consume that generator before yielding the next one (otherwise you'd get into an infinite loop). In that case, you could use something like this.
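A sketch of that lazier variant (hypothetical helper name chunkify_lazy; it peeks at the first element so the loop still terminates, and each yielded chunk must be exhausted before the next one is requested):
import itertools

def chunkify_lazy(doc_iter, size):
    doc_iter = iter(doc_iter)
    while True:
        chunk = itertools.islice(doc_iter, size)
        try:
            first = next(chunk)  # stop once the input is exhausted
        except StopIteration:
            break
        # yield an iterator over this chunk instead of a materialized list
        yield itertools.chain([first], chunk)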