pytorch dataloader stucked if using opencv resize method - python-2.7

I can run all the cells of the tutorial notebook of Pytorch about dataloading (pytorch tutorial).
But when I use OpenCV in place of Skimage to resize the image, the dataloader gets stuck, i.e nothing happens.
In the Rescale class:
class Rescale(object):
.....
def __call__(self, sample):
....
#img = transform.resize(image, (new_h, new_w))
img = cv2.resize(image, (new_h, new_w))
.....
The dataloader and the for loop are defined with:
dataloader = DataLoader(transformed_dataset, batch_size=4,
shuffle=True, num_workers=4)
for i_batch, sample_batched in enumerate(dataloader):
print(i_batch, sample_batched['image'].size(),
sample_batched['landmarks'].size())
I can get the iterator to print something if num_workers=0. It looks like opencv does not play well with the Multiprocessing of pytorch.
I would really prefer to use same package to transform the images at train time and test time (and I am already using OpenCV for the image rescale at test time).
Any suggestions would be greatly appreciated.

I had a very similar problem and that's how I solved it:
when you import cv2 set cv2.setNumThreads(0)
and then you can set num_workers>0 in the dataloader in PyTorch.
Seems like OpenCV tries to multithread and somewhere something goes into a deadlock.
Hope it helps.

Except for cv2.setNumThreads(0),
import multiprocessing
multiprocessing.set_start_method('spawn')
can also solve this problem.
From Pytorch issues these five may help (not recommended):
1. time.sleep(0.003)
2. pin_memory = True/False
3. num_workers = 0/1
4. from torch.utils.data.dataloader import DataLoader
5. writing 8192 to /proc/sys/kernel/shmmni
OpenCV and Pytorch multiprocessing don't play well together, sometimes. When running a code with OpenCV functions calls embedded in a homebrewed function parallelized in a "multiprocessing" pool, the code eventually ends up with idle processors after several calls to the pool than can fluctuate from run to run.
Forking could be the problem, forking only clones the current thread. In this case, the thread pool may wrongly assume that it has more threads than it has.
The code is waiting infinitely for a condition never signaled because the number of threads is not as expected after forking.
Spawning resets the memory after forking, forcing a re-initialization of all data structures (as the thread pool) within OpenCV.
More discussion in Github of OpenCV issues and Pytorch issues.

Related

Pathos multiprocessing pool hangs

I'm trying to use multiprocessing inside docker container. However, I'm facing two issues.
(I'm using python 2.7)
Creating ProcessingPool()/Pool() (I tried both) takes abnormally long time to create. Maybe over a minute or two.
After it processes the function, it hangs.
I basically trying to run a very simple case inside my container. Here's what I have..
import pathos.multiprocessing import ProcessingPool
import multiprocessing
class MultiprocessClassExample():
.
.
.
def worker(self, number):
return "Printing number %s" %(number)
.
.
def generateNumber(self):
PROCESSES = multiprocessing.cpu_count() - 1
NUMBER = ['One', 'Two', 'Three', 'Four', 'Five']
result = ProcessingPool(PROCESSES).map(self.worker, NUMBER)
print("Finished processing.")
print(result)
and I call using the following code.
MultiprocessClassExample().generateNumber()
Now, this seems fairly straight forward enough. I ran this on a jupyter notebook and it ran without an issue. I also tried running python inside my docker container, and tried running the above code inside, and it went fine. So I'm assuming it has to do with the complete code that I have. Obviously I didn't write out all the code, but that's the main section of the code I'm trying to handle right now.
I would expect the above code to work as well. However, first thing I notice is that when I call ProcessingPool(), it takes a long time. I tried regular multiprocessing.Pool() before, and had the same effect. Whereas, in the notebook, it ran very quick and smoothly.
After waiting several minutes, it prints :
Printing number One
Printing number Two
Printing number Three
Printing number Four
Printing number Five
and that's it. It never prints out Finished processing. and it just hangs there.
But when the print statements appear, I notice that several debug message appear at the same time. It says
[CRITICAL] WORKER TIMEOUT
[WARNING] Worker graceful timeout
[INFO] Worker exiting
[INFO] Booting worker with pid:
Any suggestions would be greatly appreciated.

Process Several Pcap Files Simultaneously - Django

In essence, the following function, called by the user of the django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to initially parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, as the CPU is not being utilised to it's full capacity, ideally using multi-threading
def analyseall(request):
allpcaps = Pcaps.objects.all()
for individualpcap in allpcaps:
strfilename = str(individualpcap.filename)
print(strfilename)
pcapuuid = individualpcap.uuid
print(pcapuuid)
packets = rdpcap(strfilename)
print("hokay")
for packet in packets:
if packet.haslayer(IP):
# print(packet[IP].src)
# print(packet[IP].dst)
dstofpacket = packet[IP].dst
PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
return render(request, 'about.html', {"list": list})
You can use above answer (multiprocessing), and also improve scapy’s reading speed, by using the PcapReader generator rather than rdpcap
with PcapReader(filename) as fdesc:
for pkt in fdesc:
[actions on the pkt]
I consider mixing multiprocessing and Django tricky. I was working on such solution once and finally I decided to use Celery and RabbitMQ.
Using Celery you can easily define task of processing single pcap. Then you can start a few independent workers for processing files in the background. Such solution will result in a little more complicated architecture (you need to provide message queue e. g. RabbitMQ and the Celery workers), however you can gain a much simpler code.
http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
In my case Celery saved a lot of time.
You can also check this question and answers:
How to use python multiprocessing module in django view

multiprocessing for keras model predict with single GPU

Background
I want to predict pathology images using keras with Inception-Resnet_v2. I have trained the model already and got a .hdf5 file. Because the pathology image is very large (for example: 20,000 x 20,000 pixels), so I have to scan the image to get small patches for prediction.
I want to speed up the prediction procedure using multiprocessing lib with python2.7. The main idea is using different subprocesses to scan different lines and then sending patches to model.
I saw somebody suggests importing keras and loading model in subprocesses. But I don't think it is suitable for my task. Loading model usingkeras.models.load_model() one time will take about 47s, which is very time-consuming. So I can't reload the model every time when I start a new subprocess.
Question
My question is can I load the model in my main process and pass it as a parameter to subprocesses?
I have tried two methods but both of them didn't work.
Method 1. Using multiprocessing.Pool
The code is :
import keras
from keras.models import load_model
import multiprocessing
def predict(num,model):
print dir(model)
print num
model.predict("image data, type:list")
if __name__ == '__main__':
model = load_model("path of hdf5 file")
list = [(1,model),(2,model),(3,model),(4,model),(5,model),(6,model)]
pool = multiprocessing.Pool(4)
pool.map(predict,list)
pool.close()
pool.join()
The output is
cPickle.PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed
I searched the error and found Pool can't map unpickelable parameters, so I try method 2.
Method 2. Using multiprocessing.Process
The code is
import keras
from keras.models import load_model
import multiprocessing
def predict(num,model):
print num
print dir(model)
model.predict("image data, type:list")
if __name__ == '__main__':
model = load_model("path of hdf5 file")
list = [(1,model),(2,model),(3,model),(4,model),(5,model),(6,model)]
proc = []
for i in range(4):
proc.append(multiprocessing.Process(predict, list[i]))
proc[i].start()
for i in range(4):
proc[i].join()
In Method 2, I can print dir(model). I think it means the model is passed to subprocesses successfully. But I got this error
E tensorflow/stream_executor/cuda/cuda_driver.cc:1296] failed to enqueue async memcpy from host to device: CUDA_ERROR_NOT_INITIALIZED; GPU dst: 0x13350b2200; host src: 0x2049e2400; size: 4=0x4
The environment which I use:
Ubuntu 16.04, python 2.7
keras 2.0.8 (tensorflow backend)
one Titan X, Driver version 384.98, CUDA 8.0
Looking forward to reply! Thanks!
Maybe you can use apply_async() instead of Pool()
and you can find more details here:
Python multiprocessing pickling error
Multi-processing works on CPU, while model prediction happened in GPU, which there is only one. I cannot see how multi-processing can help you on prediction.
Instead, I think you can use multi-processing to scan different patches, which you seems to have already managed to achieve. Then stack these patches into a batch or batches to predict in parallel in GPU.
As noted by Statham multiprocess requires all args to be compatible with pickle. This blog post describes how to save a keras model as a pickle: [http://zachmoshe.com/2017/04/03/pickling-keras-models.html][1]
It may be a sufficient workaround to get your keras model passed as an arg to multiprocess, but I have not tested the idea myself.
I will also add that I had better luck running two keras processes on a single gpu using windows rather than linux. On linux I was getting out of memory errors on the 2nd process, but the same memory allocation (45% of total GPU ram for each) worked on windows. In my case they were fits - for running predictions only, maybe the memory requirements are less.

Parallelize for loop in python

I have a genetic algorithm which I would like to speed up. I'm thinking the easiest way to achieve this is by pythons multiprocessing module. After running cProfile on my GA, I found out that most of the computational time takes place in the evaluation function.
def evaluation():
scores = []
for chromosome in population:
scores.append(costly_function(chromosome))
How would I go about to parallelize this method? It is important that all the scores append in the same order as they would if the program would run sequentially.
I'm using python 2.7
Use pool (I show both imap and map because of some results on google say map may not be OK for ordering though I have yet to see proof):
from multiprocessing import Pool
def evaluation(population):
return list(Pool(processes=nprocs).imap(costly_function,population))
or (what I use):
return Pool(processes=nprocs).map(costly_function,population)
Define nprocs to the number of parallel process you want.
From:
https://docs.python.org/dev/library/multiprocessing.html#multiprocessing.pool.Pool

Python program unexpectidely dies

I am training an LSTM network using lasagne, but my program dies without any error during the training. If I use a much smaller dataset ( only 13 data points) the entire code runs.
I was initially using my GPU and thought that the memory was running out. But even though I used CPU It dies.
import resource
rsrc = resource.RLIMIT_DATA
resource.setrlimit(rsrc,(1024, resource.RLIM_INFINITY ))
My final attempt was to try this but to no avail.