Background
I want to run prediction on pathology images using Keras with Inception-ResNet-v2. I have already trained the model and got a .hdf5 file. Because a pathology image is very large (for example: 20,000 x 20,000 pixels), I have to scan the image and extract small patches for prediction.
I want to speed up the prediction procedure using the multiprocessing library with Python 2.7. The main idea is to use different subprocesses to scan different lines of the image and then send the patches to the model.
I saw somebody suggest importing Keras and loading the model inside each subprocess, but I don't think that is suitable for my task. Loading the model once with keras.models.load_model() takes about 47s, which is very time-consuming, so I can't afford to reload the model every time I start a new subprocess.
Question
My question is: can I load the model in my main process and pass it as a parameter to the subprocesses?
I have tried two methods, but neither of them worked.
Method 1. Using multiprocessing.Pool
The code is :
import keras
from keras.models import load_model
import multiprocessing
def predict(num, model):
    print dir(model)
    print num
    model.predict("image data, type:list")

if __name__ == '__main__':
    model = load_model("path of hdf5 file")
    list = [(1, model), (2, model), (3, model), (4, model), (5, model), (6, model)]
    pool = multiprocessing.Pool(4)
    pool.map(predict, list)
    pool.close()
    pool.join()
The output is
cPickle.PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed
I searched for the error and found that Pool can't map unpicklable parameters, so I tried Method 2.
Method 2. Using multiprocessing.Process
The code is
import keras
from keras.models import load_model
import multiprocessing
def predict(num, model):
    print num
    print dir(model)
    model.predict("image data, type:list")

if __name__ == '__main__':
    model = load_model("path of hdf5 file")
    list = [(1, model), (2, model), (3, model), (4, model), (5, model), (6, model)]
    proc = []
    for i in range(4):
        proc.append(multiprocessing.Process(target=predict, args=list[i]))
        proc[i].start()
    for i in range(4):
        proc[i].join()
In Method 2, I can print dir(model), which I think means the model is passed to the subprocesses successfully. But then I got this error:
E tensorflow/stream_executor/cuda/cuda_driver.cc:1296] failed to enqueue async memcpy from host to device: CUDA_ERROR_NOT_INITIALIZED; GPU dst: 0x13350b2200; host src: 0x2049e2400; size: 4=0x4
The environment I use:
Ubuntu 16.04, python 2.7
keras 2.0.8 (tensorflow backend)
one Titan X, Driver version 384.98, CUDA 8.0
Looking forward to your replies! Thanks!
Maybe you can use apply_async() instead of Pool.map(),
and you can find more details here:
Python multiprocessing pickling error
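For what it's worth, here is a minimal sketch of the apply_async() pattern with a module-level worker (the function and its arguments are placeholders, and the arguments still have to be picklable, so the Keras model itself cannot be passed this way):
import multiprocessing

# Pool can only pickle module-level functions and picklable arguments.
def predict_patch(num, patch):
    return num, len(patch)  # placeholder for the real per-patch work

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    results = [pool.apply_async(predict_patch, (i, "image data")) for i in range(6)]
    pool.close()
    pool.join()
    for r in results:
        print(r.get())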
Multiprocessing works on the CPU, while model prediction happens on the GPU, of which there is only one. I cannot see how multiprocessing can help you with the prediction itself.
Instead, I think you can use multiprocessing to scan and extract the different patches, which you seem to have already managed to achieve. Then stack these patches into a batch or batches and predict them in parallel on the GPU.
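A minimal sketch of that idea, assuming a hypothetical extract_patches() helper that does CPU-only work; the model is loaded once in the main process and only plain numpy arrays cross the process boundary:
import numpy as np
import multiprocessing

def extract_patches(row_index):
    # Hypothetical CPU-only worker: scan one line of the slide and return its
    # patches as a numpy array of shape (n_patches, height, width, 3).
    return np.zeros((32, 299, 299, 3), dtype=np.float32)

if __name__ == '__main__':
    # Create the workers before loading the model so the CUDA context is not
    # forked into the subprocesses.
    pool = multiprocessing.Pool(4)

    from keras.models import load_model
    model = load_model("path of hdf5 file")  # loaded once, ~47s

    # Workers only extract patches (picklable numpy arrays); all GPU work
    # stays in the main process.
    for patches in pool.imap(extract_patches, range(100)):
        predictions = model.predict(patches, batch_size=32)

    pool.close()
    pool.join()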
As noted by Statham, multiprocessing requires all args to be compatible with pickle. This blog post describes how to make a Keras model picklable: http://zachmoshe.com/2017/04/03/pickling-keras-models.html
It may be a sufficient workaround to get your Keras model passed as an arg to a multiprocessing call, but I have not tested the idea myself.
I will also add that I had better luck running two Keras processes on a single GPU using Windows rather than Linux. On Linux I was getting out-of-memory errors on the second process, but the same memory allocation (45% of total GPU RAM for each) worked on Windows. In my case the processes were running fits; for running predictions only, the memory requirements may be lower.
Related
On my website I am now converting uploaded images into WebP, because it is smaller than the other formats, so users (including mobile users) will load my pages faster. But it takes some time to convert a medium-sized image.
import StringIO
import time
from PIL import Image as PilImage
img = PilImage.open('222.jpg')
originalThumbStr = StringIO.StringIO()
now = time.time()
img.convert('RGBA').save(originalThumbStr, 'webp', quality=75)
print(time.time() - now)
It takes 2.8 seconds to convert the following image:
860 kB, 1920 x 1080
My machine has 8 GB of RAM and a 4-core processor (Intel i5), without a GPU.
I'm using Pillow==5.4.1.
Is there a faster way to convert an image into WebP? 2.8s seems too long to wait.
If you want them done fast, use vips. So, taking your 1920x1080 image and using vips in the Terminal:
vips webpsave autumn.jpg autumn.webp --Q 70
That takes 0.3s on my MacBook Pro, i.e. it is 10x faster than the 3s your PIL implementation achieves.
If you want lots done really fast, use GNU Parallel and vips. So, I made 100 copies of your image and converted the whole lot to WEBP in parallel like this:
parallel vips webpsave {} {#}.webp --Q 70 ::: *jpg
That took 4.9s for 100 copies of your image, i.e. it is 50x faster than the 3s your PIL implementation achieves.
You could also use the pyvips binding - I am no expert on this but this works and takes 0.3s too:
#!/usr/bin/env python3
import pyvips
# VIPS
img = pyvips.Image.new_from_file("autumn.jpg", access='sequential')
img.write_to_file("autumn.webp")
So, my best suggestion would be to take the 2 lines of code above and use a multiprocessing pool or multithreading approach to get a whole directory of images processed. That could look like this:
#!/usr/bin/env python3
import pyvips
from glob import glob
from pathlib import Path
from multiprocessing import Pool
def doOne(f):
    img = pyvips.Image.new_from_file(f, access='sequential')
    webpname = Path(f).stem + ".webp"
    img.write_to_file(webpname)

if __name__ == '__main__':
    files = glob("*.jpg")
    with Pool(12) as pool:
        pool.map(doOne, files)
That takes 3.3s to convert 100 copies of your image into WEBP equivalents on my 12-core MacBook Pro with NVME disk.
I can run all the cells of the PyTorch tutorial notebook about data loading (pytorch tutorial).
But when I use OpenCV in place of skimage to resize the image, the dataloader gets stuck, i.e. nothing happens.
In the Rescale class:
class Rescale(object):
    ...

    def __call__(self, sample):
        ...
        # img = transform.resize(image, (new_h, new_w))
        img = cv2.resize(image, (new_h, new_w))
        ...
The dataloader and the for loop are defined with:
dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)

for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched['image'].size(),
          sample_batched['landmarks'].size())
I can get the iterator to print something if num_workers=0. It looks like OpenCV does not play well with the multiprocessing of PyTorch.
I would really prefer to use the same package to transform the images at train time and test time (and I am already using OpenCV for the image rescale at test time).
Any suggestions would be greatly appreciated.
I had a very similar problem and here is how I solved it:
when you import cv2, call cv2.setNumThreads(0),
and then you can set num_workers>0 in the DataLoader in PyTorch.
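A minimal, self-contained sketch of that order of operations (the dummy dataset here just stands in for the question's transformed_dataset):
import cv2
import torch
from torch.utils.data import DataLoader, TensorDataset

# Disable OpenCV's own threading before the DataLoader starts its worker processes.
cv2.setNumThreads(0)

dataset = TensorDataset(torch.zeros(16, 3, 64, 64))  # placeholder dataset

if __name__ == '__main__':
    loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=4)
    for batch in loader:
        pass  # transforms that call cv2.resize would run inside the workers here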
Seems like OpenCV tries to multithread and somewhere something goes into a deadlock.
Hope it helps.
Apart from cv2.setNumThreads(0),
import multiprocessing
multiprocessing.set_start_method('spawn')
can also solve this problem.
From the PyTorch issues, these five workarounds may also help (not all recommended):
1. time.sleep(0.003)
2. pin_memory = True/False
3. num_workers = 0/1
4. from torch.utils.data.dataloader import DataLoader
5. writing 8192 to /proc/sys/kernel/shmmni
OpenCV and PyTorch multiprocessing sometimes don't play well together. When running code with OpenCV function calls embedded in a homebrewed function parallelized in a multiprocessing pool, the code eventually ends up with idle processes after a number of calls to the pool that can fluctuate from run to run.
Forking could be the problem: forking only clones the current thread, so OpenCV's internal thread pool may wrongly assume it has more threads than it actually has.
The code then waits forever for a condition that is never signaled, because the number of threads is not what it expects after the fork.
Spawning, in contrast, starts from a fresh interpreter, forcing a re-initialization of all data structures (such as the thread pool) within OpenCV.
There is more discussion in the GitHub issues of OpenCV and PyTorch.
In essence, the following function, called by the user of the Django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, ideally using multithreading, as the CPU is not being utilised to its full capacity.
def analyseall(request):
    allpcaps = Pcaps.objects.all()
    for individualpcap in allpcaps:
        strfilename = str(individualpcap.filename)
        print(strfilename)
        pcapuuid = individualpcap.uuid
        print(pcapuuid)
        packets = rdpcap(strfilename)
        print("hokay")
        for packet in packets:
            if packet.haslayer(IP):
                # print(packet[IP].src)
                # print(packet[IP].dst)
                dstofpacket = packet[IP].dst
                PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
    return render(request, 'about.html', {"list": list})
You can use the above answer (multiprocessing) and also improve Scapy's reading speed by using the PcapReader generator rather than rdpcap:
with PcapReader(filename) as fdesc:
    for pkt in fdesc:
        ...  # actions on the pkt
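A minimal sketch of combining the two ideas in a standalone script (the per-file worker, pool size, and filenames are assumptions); returning the results and keeping the database writes in the parent process avoids sharing Django connections across processes:
from multiprocessing import Pool
from scapy.all import PcapReader
from scapy.layers.inet import IP

def destinations_in_pcap(filename):
    # CPU-bound parsing of a single pcap; return the destination IPs
    # so any database writes can stay in the parent process.
    dsts = set()
    with PcapReader(filename) as fdesc:
        for pkt in fdesc:
            if IP in pkt:
                dsts.add(pkt[IP].dst)
    return filename, dsts

if __name__ == '__main__':
    filenames = ["capture1.pcap", "capture2.pcap"]  # placeholder file list
    with Pool(4) as pool:
        for filename, dsts in pool.imap_unordered(destinations_in_pcap, filenames):
            print(filename, len(dsts))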
I consider mixing multiprocessing and Django tricky. I was working on such a solution once and finally decided to use Celery and RabbitMQ.
Using Celery you can easily define a task that processes a single pcap, and then start a few independent workers that process the files in the background. Such a solution results in a slightly more complicated architecture (you need to provide a message queue, e.g. RabbitMQ, plus the Celery workers), but the application code itself becomes much simpler.
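A rough sketch of what such a task might look like, assuming a Celery app configured as in the Django guide linked below (the import path is hypothetical; model and field names follow the question, and the task body is only illustrative):
# tasks.py in the Django app
from celery import shared_task
from scapy.all import rdpcap
from scapy.layers.inet import IP

from .models import Pcaps, PcapsIps  # hypothetical import path

@shared_task
def analyse_pcap(pcap_pk):
    individualpcap = Pcaps.objects.get(pk=pcap_pk)
    packets = rdpcap(str(individualpcap.filename))
    for packet in packets:
        if packet.haslayer(IP):
            PcapsIps.objects.update_or_create(ip=packet[IP].dst, uuid=individualpcap)
The view then only enqueues the work, e.g. analyse_pcap.delay(p.pk) for each pcap, and returns immediately while the workers process the files.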
http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
In my case Celery saved a lot of time.
You can also check this question and answers:
How to use python multiprocessing module in django view
I'm trying to use a validation monitor in skflow by passing my validation set as a numpy array.
Here is some simple code to reproduce the problem (I installed tensorflow from the provided binaries for Ubuntu/Linux 64-bit, GPU enabled, Python 2.7):
import numpy as np
from sklearn.cross_validation import train_test_split
from tensorflow.contrib import learn
import tensorflow as tf
import logging
logging.getLogger().setLevel(logging.INFO)
#Some fake data
N=200
X=np.array(range(N),dtype=np.float32)/(N/10)
X=X[:,np.newaxis]
Y=np.sin(X.squeeze())+np.random.normal(0, 0.5, N)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    train_size=0.8,
                                                    test_size=0.2)

val_monitor = learn.monitors.ValidationMonitor(X_test, Y_test, early_stopping_rounds=200)
reg = learn.DNNRegressor(hidden_units=[10, 10], activation_fn=tf.tanh, model_dir="tmp/")
reg.fit(X_train, Y_train, steps=5000, monitors=[val_monitor])
print "train error:", reg.evaluate(X_train, Y_train)
print "test error:", reg.evaluate(X_test, Y_test)
The code runs, but only the first validation step is done properly; after that, validation always returns the same value, even though training is actually going fine (which can be checked by running an evaluation on the test set at the end). The following message also appears for each validation step:
INFO:tensorflow:Input iterator is exhausted.
Any help is welcome!
Thanks,
David
I was able to solve this by adding:
config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1)
to the DNNRegressor call.
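For clarity, a sketch of where that argument goes, using the names from the question's code:
# Frequent checkpoints give the ValidationMonitor a fresh checkpoint to evaluate
# on every trigger instead of re-evaluating the same one.
config = tf.contrib.learn.RunConfig(save_checkpoints_secs=1)
reg = learn.DNNRegressor(hidden_units=[10, 10], activation_fn=tf.tanh,
                         model_dir="tmp/", config=config)
reg.fit(X_train, Y_train, steps=5000, monitors=[val_monitor])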
Improving on dbikard's solution:
Add config=tf.contrib.learn.RunConfig(save_checkpoints_steps=val_monitor._every_n_steps) to the DNNRegressor call instead.
This saves checkpoints when they are needed (i.e. each time before the monitor is triggered) rather than once per second.
I am programming in Python 2.7 with the NLTK library for both text preprocessing and classification in sentiment analysis. I am using the NLTK wrapper of the scikit-learn algorithms. The code below runs after preprocessing and separation into train and test sets.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)
#LinearSVC
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
LinearSVCAccuracy = nltk.classify.accuracy(LinearSVC_classifier, testing_set)*100
print "LinearSVC accuracy percentage:" + str(LinearSVCAccuracy)
It works fine when the training set has around 4000 tweets, but when it increases to, for example, 10000 tweets, the process gets killed with the following error.
Memory cgroup out of memory: Kill process 24293 (python) score 848 or
sacrifice child
Killed process 24293, UID 29091, (python) total-vm:14569168kB,
anon-rss:14206656kB, file-rss:3412kB
Clocksource tsc unstable (delta = -17179861691 ns). Enable clocksource
failover by adding clocksource_failover kernel parameter.
My PC has 8 GB of RAM, but I even tried with 16 GB and still have the problem. How can I classify this number of tweets without running out of memory?
Which OS are you running? Which Python distribution? Try installing Cython and/or using scikit-learn directly. Have a look at scikit-learn's optimization techniques.
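A minimal sketch of what "using scikit-learn directly" could look like, assuming the tweets are plain strings; HashingVectorizer keeps the features as a fixed-size sparse matrix instead of the per-tweet feature dictionaries that NLTK's wrapper builds, which is usually far less memory-hungry (the data below is a placeholder standing in for the question's trainTweets/testTweets):
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Placeholder data; in the question these would be the preprocessed tweet sets.
trainTweets = [("i love this", "pos"), ("i hate this", "neg")]
testTweets = [("love it", "pos"), ("hate it", "neg")]

train_texts = [text for text, label in trainTweets]
train_labels = [label for text, label in trainTweets]
test_texts = [text for text, label in testTweets]
test_labels = [label for text, label in testTweets]

# Fixed number of hashed features: memory use does not grow with the vocabulary.
vectorizer = HashingVectorizer(n_features=2**18)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LinearSVC()
clf.fit(X_train, train_labels)
print("LinearSVC accuracy percentage: " + str(accuracy_score(test_labels, clf.predict(X_test)) * 100))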