Matcaffe training net produces "Data layer prefetch queue empty"

I'm trying to figure out why my MatCaffe implementation cannot pop from my train lmdb, which I've created using convert_imageset.bin.
What I do is basically just this:
solver = caffe.Solver(solverFile);
solver.step(500);
and when looking at the terminal, the output after the last statement is this:
I0322 11:15:11.830241 **098 net.cpp:228] data does not need backward computation.
I0322 11:15:11.830250 **098 net.cpp:270] This network produces output accuracy
I0322 11:15:11.830257 **098 net.cpp:270] This network produces output loss
I0322 11:15:11.830281 **098 net.cpp:283] Network initialization done.
I0322 11:15:11.830377 **098 solver.cpp:60] Solver scaffolding done.
I0322 11:15:16.625566 **098 solver.cpp:341] Iteration 0, Testing net (#0)
I0322 11:15:19.976579 **098 solver.cpp:409] Test net output #0: accuracy = 0.445407
I0322 11:15:19.976654 **098 solver.cpp:409] Test net output #1: loss = 0.693147 (* 1 = 0.693147 loss)
I0322 11:15:20.317916 **098 solver.cpp:237] Iteration 0, loss = 0.693147
I0322 11:15:20.317989 **098 solver.cpp:253] Train net output #0: loss = 0.693147 (* 1 = 0.693147 loss)
I0322 11:15:20.318009 **098 sgd_solver.cpp:106] Iteration 0, lr = 0.001
I0322 11:15:21.342550 **098 blocking_queue.cpp:50] Data layer prefetch queue empty
I can reproduce this problem even after deleting locks.mdb to make sure that no locks are left over when I restart the procedure. After the message appears, the only thing I can do is force a hard Matlab shutdown.
I've checked the lmdb with Matlab LMDB, and the contents of both my train and test lmdb seem to be OK. Parameters I used to generate the lmdb: shuffle.
Note (this might be the source of the problem): I'm currently facing MEX problems with this setup. On the first run of my implementation I get the error message
"Unexpected unknown exception from MEX file.."
for which the terminal output looks like this:
I0322 11:42:09.465801 **875 layer_factory.hpp:77] Creating layer data
I0322 11:42:09.466012 **875 net.cpp:106] Creating Layer data
I0322 11:42:09.466030 **875 net.cpp:411] data -> data
I0322 11:42:09.466053 **875 net.cpp:411] data -> label
I0322 11:42:09.469091 **151 db_lmdb.cpp:38] Opened lmdb /home/user/caffe-master/data/train/lmdbTrain
What I've tried so far:
I implemented a try-catch block that calls 'caffe.reset_all();' in ANY case, so that the pointers, memory etc. are (hopefully) freed.
On the second run, I get the output shown at the top. It seems that my first run blocks the lmdb access, which led me to delete locks.mdb manually between the first and the second run, unfortunately with the same effect. Training "manually" from the command line does work with the same lmdbs. Only the MatCaffe run raises these problems. Note that I want to use MatCaffe to initialize the weights of my layers manually; "weight_filler" in the .prototxt isn't an option.
My MatCaffe implementation is from January 2016 and I've also recompiled the mex-file for caffe_ with the correct gcc version (before it gave me the warning that my gcc version should be "x" -> changed to "x" and recompiled).
Do you have any other ideas, recommendations or inputs please?
Thank you!

Related

Resource Exhausted error while using Ray Tune

I am trying to perform HPO for a CNN on the Fashion-MNIST dataset using Ray Tune and HyperOpt.
My Keras code uses one convolutional layer and a number of dense layers determined by a tunable hyperparameter. Executing it produces the following error:
status = StatusCode.RESOURCE_EXHAUSTED
details = "Received message larger than max (222322986 vs. 104857600)"
debug_error_string
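For scale, the two numbers in the error can be converted to MiB. This is only a small arithmetic sketch, assuming both values are sizes in bytes; it does not diagnose where the oversized message comes from.

```python
# Assumption: both numbers in the error message are sizes in bytes.
MIB = 1024 * 1024

received_bytes = 222322986  # size of the rejected message
limit_bytes = 104857600     # maximum size reported in the error

limit_mib = limit_bytes / MIB
received_mib = received_bytes / MIB

print(f"limit:    {limit_mib:.1f} MiB")
print(f"received: {received_mib:.1f} MiB")
print(f"ratio:    {received_bytes / limit_bytes:.2f}x the limit")
```

Under that assumption, the limit is exactly 100 MiB and the rejected message is a little over twice that size.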

Run tflite accuracy tool on official tensorflow resnet50 model

I have downloaded the official resnet50 model provided here: https://github.com/tensorflow/models/tree/master/official/resnet. I needed a tflite quantized version of this model and hence I converted the model to a tflite format as follows :
toco --output_file /tmp/resnet50_quant.tflite --saved_model_dir <path/to/saved_model_dir> --output_format TFLITE --quantize_weights QUANTIZE_WEIGHTS
After this, I wanted to run the tflite accuracy tool to verify that the accuracy of this model is still reasonable. However, I ran into the following issue:
bazel run -c opt --copt=-march=native --cxxopt='--std=c++11' -- //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval --model_file=/tmp/resnet50_quant.tflite --ground_truth_images_path=<path/to/images> --ground_truth_labels=/tmp/validation_labels.txt --model_output_labels=/tmp/tf_labels.txt --output_file_path=/tmp/accuracy_output.txt --num_images=0
INFO: Analysed target //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval (0 packages loaded).
INFO: Found 1 target...
Target //tensorflow/contrib/lite/tools/accuracy/ilsvrc:imagenet_accuracy_eval up-to-date:
bazel-bin/tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval
INFO: Elapsed time: 14.589s, Critical Path: 14.28s
INFO: 3 processes: 3 local.
INFO: Build completed successfully, 4 total actions
INFO: Running command line: bazel-bin/tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval '--model_file=/tmp/resnet50_quant.tflite' '--ground_truth_images_path=<path/to/images>' '--ground_truth_labels=/tmp/validation_labels.txt' '--model_output_labels=/tmp/tf_labels.txt' '--output_file_path=/tmp/accuracy_output.txt' 'INFO: Build completed successfully, 4 total actions
2018-10-12 15:30:06.237058: E tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval.cc:155] Starting evaluation with: 4 threads.
2018-10-12 15:30:06.536802: E tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_accuracy_eval.cc:98] Starting model evaluation: 50000
2018-10-12 15:30:06.565334: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at run_tflite_model_op.cc:89 : Invalid argument: Data shapes mismatch for tensors: 0 expected: [64,224,224,3] got: [1,224,224,3]
2018-10-12 15:30:06.565453: F tensorflow/contrib/lite/tools/accuracy/ilsvrc/imagenet_model_evaluator.cc:222] Non-OK-status: eval_pipeline->Run(CreateStringTensor(image_label.image), CreateStringTensor(image_label.label)) status: Invalid argument: Data shapes mismatch for tensors: 0 expected: [64,224,224,3] got: [1,224,224,3]
[[{{node stage_run_tfl_model_output}} = RunTFLiteModel[input_type=[DT_FLOAT], model_file_path="/tmp/resnet50_quant.tflite", output_type=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](stage_inception_preprocess_output)]]
It looks like the issue is that the official resnet model has an input tensor of [64, 224, 224, 3] whereas the accuracy tool provides an input of [1, 224, 224, 3]. So, the official model seems to expect a batch of 64 images and hence the accuracy tool fails.
I was wondering what I need to do to get the accuracy tool to run on the official resnet50 model? I'm guessing that although the input tensor for resnet 50 is [64, 224, 224, 3], there should be a way to still run a single image through the model.

There are two ways to go about it:
1. Resize the input of your model to [1, 224, 224, 3] and run the tool. You could try looking at this and then modifying this file accordingly.
2. Alternatively, modify the same tool so that it feeds in 64 images at a time instead of 1. You can look at the same code file I point to above.
If you're looking for long-term support, consider filing a feature request on GitHub so that we can support batching.
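The batching idea in the second option can be sketched in a few lines. The accuracy tool itself is C++, so this Python snippet is not the tool's code; it only illustrates the logic of grouping images into fixed batches of 64. The function name fixed_size_batches and the choice to pad the last batch by repeating its final item are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fixed_size_batches(
    items: Sequence[T], batch_size: int = 64
) -> Iterator[Tuple[List[T], int]]:
    """Yield (batch, valid_count) pairs where every batch has batch_size items.

    The last batch is padded by repeating its final item, so a model with a
    fixed batch dimension always sees a full batch. valid_count tells the
    caller how many leading entries are real and should be scored.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        chunk = list(items[start:start + batch_size])
        valid_count = len(chunk)
        # Pad only when the final chunk is short.
        chunk.extend([chunk[-1]] * (batch_size - valid_count))
        yield chunk, valid_count


if __name__ == "__main__":
    image_paths = [f"img_{i}.jpg" for i in range(130)]  # hypothetical file names
    for batch, valid in fixed_size_batches(image_paths):
        print(len(batch), valid)
```

When scoring, only the first valid_count outputs of each batch should be counted, otherwise the padded entries would skew the accuracy.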

TPUEstimator does not work with use_tpu=False

I’m trying to run a model using TPUEstimator locally on a CPU first to validate that it works by setting use_tpu=False on the estimator initialization. When running train I get this error.
InternalError: failed to synchronously memcpy host-to-device: host 0x7fcc7e4d4000 to device 0x1deffc002 size 4096: Failed precondition: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
[[Node: optimizer/gradients/neural_network/fully_connected_2/BiasAdd_grad/BiasAddGrad_G14 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-7832507818616568453, tensor_name="edge_42_op...iasAddGrad", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:0"]()]]
It looks like it’s still trying to use the TPU, as it says recv_device="/job:worker/replica:0/task:0/device:TPU:0". Why is it trying to use the TPU when use_tpu is set to False?

What optimizer are you using? This type of error can happen if you use a tf.contrib.tpu.CrossShardOptimizer and use_tpu is set to False. The optimizer is trying to shard the work across TPU cores but can’t because you’re running on your CPU.
It’s common practice to have a command line flag that sets whether the TPU is being used. This flag is used to toggle things like CrossShardOptimizer and use_tpu. For example, in the MNIST reference model:
if FLAGS.use_tpu:
  optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
https://github.com/tensorflow/models/blob/ad3526a98e7d5e9e57c029b8857ef7b15c903ca2/official/mnist/mnist_tpu.py#L102

DLIB: train_shape_predictor_ex.exe for 194 landmarks with Helen dataset gives runtime error: bad allocation

I am trying to train dlib's shape_predictor for 194 landmarks with the Helen dataset,
but it throws a bad allocation exception when I run it from the command prompt:
D:\Facial Feature Extraction>train_shape_predictor_ex.exe face_detector
Program is started
exception thrown!
bad allocation
When I reduced the number of images to only 50, it ran successfully, but the result was not satisfactory. So I tried training on a system with 64 GB of RAM, this time with increased parameters:
trainer.set_nu(0.05);
trainer.set_tree_depth(2);
but it still shows the bad allocation error. If I train with less data and smaller parameters, the trained model is not correct.

Build your application in Release mode and target the 64-bit Windows platform.
Also enable the /LARGEADDRESSAWARE flag in your project.

Analog output from USB6009 using python and NIDAQmx base on Mac OSX

All,
I'm attempting to use Python and DAQmx Base to record analog input and generate analog output from my USB 6009 device. I've been using a wrapper I found and have been able to get AI but am struggling with AO.
There is a base class NITask which handles task generation etc. The class I'm calling is below. The function throws an error when I try to configure the clock. When I do not there is no error but nor is there voltage generated on the output. Any help would be appreciated.
Thanks!
class AOTask(NITask):
    def __init__(self, min=0.0, max=5.0,
                 channels=["Dev1/ao0"],
                 timeout=10.0):
        NITask.__init__(self)
        self.min = min
        self.max = max
        self.channels = channels
        self.timeout = timeout
        self.clockSource ="OnboardClock"
        sampleRate=100
        self.sampleRate = 100
        self.timeout = timeout
        self.samplesPerChan = 1000
        self.numChan = chanNumber(channels)
        if self.numChan is None:
            raise ValueError("Channel specification is invalid")
        chan = ", ".join(self.channels)
        self.CHK(self.nidaq.DAQmxBaseCreateTask("",ctypes.byref(self.taskHandle)))
        self.CHK(self.nidaq.DAQmxBaseCreateAOVoltageChan(self.taskHandle, "Dev1/ao0", "", float64(self.min), float64(self.max), DAQmx_Val_Volts, None))
        self.CHK(self.nidaq.DAQmxBaseCfgSampClkTiming(self.taskHandle, "", float64(self.sampleRate), DAQmx_Val_Rising, DAQmx_Val_FiniteSamps, uInt64(self.samplesPerChan)))

    """Data needs to be of type ndarray"""
    def write(self, data):
        nWritten = int32()
        # data = numpy.float64(3.25)
        data = data.astype(numpy.float64)
        self.CHK(self.nidaq.DAQmxBaseWriteAnalogF64(self.taskHandle,
            int32(1000), 0,float64(-1),DAQmx_Val_GroupByChannel,
            data.ctypes.data,None,None))
        # if nWritten.value != self.numChan:
        #     print "Expected to write %d samples!" % self.numChan

Your question covers two problems:
Why does DAQmxBaseCfgSampClkTiming return an error?
Without using that function, why isn't any output generated?
1. Hardware vs Software Timing
rjb3 wrote:
The function throws an error when I try to configure the clock. When I do not there is no error but nor is there voltage generated on the output.
Your program receives the error because the USB 600x devices do not support hardware-timed analog output [1]:
The NI USB-6008/6009 has two independent analog output channels that can generate outputs from 0 to 5 V. All updates of analog output channels are software-timed. GND is the ground-reference signal for the analog output channels.
"Software-timed" means a sample is written on demand by the program whenever DAQmxBaseWriteAnalogF64 is called. If an array of samples is written, then its samples are written one at a time. You can learn more about how NI defines timing from the DAQmx help [2]. While that document is for DAQmx, the same concepts apply to DAQmx Base since the behavior is defined by the devices and not their drivers. The differences are in how much of the hardware's capabilities are implemented by the driver: DAQmx implements everything, while DAQmx Base implements a small, select subset.
2. No Output When Software Timed
rjb3 wrote:
When I do not there is no error but nor is there voltage generated on the output.
I am not familiar with the Python bindings for the DAQmx Base API, but I can recommend two things:
Try using the installed genVoltage.c C example and confirm that you can see voltage on the ao channel.
Examples are installed here: /Applications/National Instruments/NI-DAQmx Base/examples
If you see output, you've confirmed that the device and driver are working correctly, and that the bug is likely in the python file.
If you don't see output, then the device or driver has a problem, and the best place to get help troubleshooting is the NI discussion forums at http://forums.ni.com.
Try porting genVoltage.c using the python bindings. At first glance, I would try:
Use DAQmxBaseStartTask before DAQmxBaseWriteAnalogF64
Or set the autostart parameter in your call to DAQmxBaseWriteAnalogF64 to true.
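The two suggestions above could be sketched in Python roughly as follows. This is untested against real hardware and rests on several assumptions: nidaq stands for the ctypes handle to the DAQmx Base library that the question's NITask class already holds, the argument order of DAQmxBaseWriteAnalogF64 is copied from the call in the question, the value 0 for DAQmx_Val_GroupByChannel is assumed from the driver header, and the helper name write_software_timed and its interval_s parameter are made up for illustration.

```python
import ctypes
import time

# Assumption: value of DAQmx_Val_GroupByChannel in the DAQmx Base header.
DAQMX_VAL_GROUP_BY_CHANNEL = 0


def write_software_timed(nidaq, task_handle, samples,
                         interval_s=0.01, timeout_s=10.0):
    """Start the task, then write the samples one at a time.

    No sample clock is configured: each write call updates the output on
    demand, and time.sleep() provides the (approximate) spacing in software.
    Returns the number of samples written.
    """
    # Start the task explicitly before the first write.
    status = nidaq.DAQmxBaseStartTask(task_handle)
    if status < 0:
        raise RuntimeError("DAQmxBaseStartTask failed with status %d" % status)

    written = ctypes.c_int32()
    count = 0
    for value in samples:
        buf = (ctypes.c_double * 1)(float(value))  # one sample for one channel
        status = nidaq.DAQmxBaseWriteAnalogF64(
            task_handle,
            ctypes.c_int32(1),        # samples per channel in this call
            ctypes.c_uint32(0),       # autostart off, the task is already started
            ctypes.c_double(timeout_s),
            ctypes.c_uint32(DAQMX_VAL_GROUP_BY_CHANNEL),
            buf,
            ctypes.byref(written),
            None,
        )
        if status < 0:
            raise RuntimeError(
                "DAQmxBaseWriteAnalogF64 failed with status %d" % status)
        count += 1
        time.sleep(interval_s)
    return count
```

Inside the question's AOTask class, the call would be something like write_software_timed(self.nidaq, self.taskHandle, data), with the DAQmxBaseCfgSampClkTiming line removed from __init__. Passing a nonzero autostart value instead of calling DAQmxBaseStartTask is the other variant mentioned above.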
References
[1] NI USB-6008/6009 User Guide And Specifications :: Analog Output (page 16)
http://digital.ni.com/manuals.nsf/websearch/CE26701AA052E1F0862579AD0053BE19
[2] Timing, Hardware Versus Software
http://zone.ni.com/reference/en-XX/help/370466V-01/TOC11.htm
