Resource Exhausted error while using Ray Tune - ray

I am trying to perform HPO for a CNN on the Fashion-MNIST dataset using Ray Tune and HyperOpt.
The error I get when executing my Keras code on the Fashion-MNIST dataset, with one convolutional layer and the number of dense layers determined by a tunable hyperparameter, is below:
status = StatusCode.RESOURCE_EXHAUSTED
details = "Received message larger than max (222322986 vs. 104857600)"
debug_error_string
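One common cause of this gRPC limit being exceeded is that the Fashion-MNIST arrays (or the trained model weights) end up captured in the trainable's closure or returned as a result, so they get serialized into a single gRPC message. Below is a minimal sketch of one possible workaround, assuming a function-based trainable; the name train_cnn, the config keys and the model layout are illustrative and not taken from the question. The idea is to either load the data inside the trainable or hand it over through the Ray object store with tune.with_parameters, so it is not shipped inside the gRPC call:

# Sketch of a possible workaround; train_cnn, the config keys and the model layout are placeholders.
from ray import tune
from tensorflow import keras

def train_cnn(config, data=None):
    # Data arrives via tune.with_parameters (object store) ...
    if data is None:
        # ... or is loaded inside the trainable, so it never becomes part of the task payload.
        data = keras.datasets.fashion_mnist.load_data()
    (x_train, y_train), _ = data

    model = keras.Sequential([
        keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        keras.layers.Flatten(),
    ])
    for _ in range(config["num_dense_layers"]):
        model.add(keras.layers.Dense(config["dense_units"], activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train[..., None] / 255.0, y_train,
                        validation_split=0.1, epochs=config["epochs"], verbose=0)
    tune.report(accuracy=history.history["val_accuracy"][-1])

data = keras.datasets.fashion_mnist.load_data()
tune.run(
    tune.with_parameters(train_cnn, data=data),   # data goes through the Ray object store
    config={
        "num_dense_layers": tune.choice([1, 2, 3]),
        "dense_units": tune.choice([64, 128, 256]),
        "epochs": 5,
    },
    num_samples=10,
)

If you are using the HyperOpt search algorithm, it can still be passed to tune.run via its search_alg argument; it is omitted here to keep the sketch short.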

Related

EntityTooLarge ERROR while fitting RCF data in Sagemaker

I am fitting a Random Cut Forest model on AWS SageMaker to a dataset using rcf.fit(rcf.record_set(data[['Variable 1','Variable 2']].values.reshape(-1, 1))), but I am getting the error below:
An error occurred (EntityTooLarge) when calling the PutObject operation: Your proposed upload exceeds the maximum allowed size.
The size of the ndarray is 239393964
It works fine for a sample of the data, but not for the entire dataset (total records: 400M).
How can I fix this?
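One way to stay under the single PutObject size limit is to serialize the array to RecordIO-protobuf in chunks and upload each chunk to a common S3 prefix yourself (boto3's upload_fileobj uses multipart uploads for large streams), then point the estimator at that prefix instead of calling record_set on the whole in-memory array. A rough sketch under these assumptions: the bucket and prefix names and the chunk size are placeholders, and the two columns are kept as two features rather than reshaped to a single column as in the question.

# Rough sketch: chunked RecordIO-protobuf upload instead of one huge PutObject.
# Bucket, prefix, chunk size and the two-feature layout are assumptions.
import io
import boto3
from sagemaker.amazon.common import write_numpy_to_dense_tensor
from sagemaker.amazon.amazon_estimator import RecordSet

bucket = "my-bucket"        # placeholder
prefix = "rcf/train"        # placeholder
features = data[['Variable 1', 'Variable 2']].values.astype('float32')

s3 = boto3.Session().resource("s3")
chunk_size = 1000000        # rows per part; pick something well under the size limits

for i, start in enumerate(range(0, len(features), chunk_size)):
    buf = io.BytesIO()
    write_numpy_to_dense_tensor(buf, features[start:start + chunk_size])
    buf.seek(0)
    # upload_fileobj performs multipart uploads automatically for large parts
    s3.Bucket(bucket).Object("%s/part-%05d" % (prefix, i)).upload_fileobj(buf)

train_records = RecordSet(
    s3_data="s3://%s/%s/" % (bucket, prefix),
    num_records=len(features),
    feature_dim=features.shape[1],
    s3_data_type="S3Prefix",
)
rcf.fit(train_records)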

ANN training progress resets every new training session using FANN

I have a standard neural network which I have trained for some time, but not to perfection. After the training session is complete, I save the network to disk.
After some time I want to resume training the network from where it left off. The problem is that every time I start a new training session, the weights and biases seem to be completely reset, which means I'm training the network from scratch all over again:
Previous session:
New session:
Here is the excerpt from my training function:
void trainNet(fann *net) {
    const unsigned int
        max_epochs = 1000,
        epochs_between_reports = 10;
    const float desired_error = 0.01f;

    net->learning_momentum = 0.1f;

    fann_train_on_file(net, "sessions.data", max_epochs, epochs_between_reports, desired_error);
    fann_save(net, "network.net");
    fann_destroy(net);
}
What am I missing? It seems intuitive to me that you should be able to train a network over a span of multiple sessions. Am I wrong? Is it a limitation of the library?
The training data has remained constant between sessions. This isn't limited to this specific network, either -- networks of any configuration seem to exhibit the same issue.
As per the documentation (FANN Training > Training Data Manipulation > fann_set_training_algorithm):
Set the training algorithm.
Example:
fann_set_training_algorithm(net, FANN_TRAIN_INCREMENTAL)
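For completeness, here is a minimal resume-training sketch using the fann2/pyfann Python bindings; the binding and constant names below are my assumption about how the SWIG wrapper exposes the C API (the equivalent C calls are fann_create_from_file, fann_set_training_algorithm, fann_train_on_file and fann_save):

# Assumed fann2/pyfann binding names; the equivalent C calls are
# fann_create_from_file, fann_set_training_algorithm, fann_train_on_file, fann_save.
from fann2 import libfann

net = libfann.neural_net()
net.create_from_file("network.net")                    # load the previously saved weights
net.set_training_algorithm(libfann.TRAIN_INCREMENTAL)  # incremental (online) training

net.train_on_file("sessions.data",
                  1000,    # max_epochs
                  10,      # epochs_between_reports
                  0.01)    # desired_error
net.save("network.net")                                # persist progress for the next session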

DLIB: train_shape_predictor_ex.exe for 194 landmarks with HELEN dataset gives runtime error: bad allocation

I am trying to train dlib's shape_predictor for 194 landmarks with the HELEN dataset,
but it throws a bad allocation exception when I run it from the command prompt:
D:\Facial Feature Extraction>train_shape_predictor_ex.exe face_detector
Program is started
exception thrown!
bad allocation
I reduced the number of images to only 50 and then it ran successfully, but the result is not satisfactory. So I tried to train on a system with 64 GB of RAM and increased the parameters:
trainer.set_nu(0.05);
trainer.set_tree_depth(2);
but it is still showing the bad allocation error. If I train with less data and smaller parameters, the trained model is not correct.
Build your application in Release mode and target the 64-bit Windows platform.
Also enable the /LARGEADDRESSAWARE flag in your project.
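If the 64-bit Release build still exhausts memory, the training options that drive memory use can also be reduced. A rough sketch using dlib's Python API (the XML and output file names are placeholders; the question uses the C++ example program, but these options correspond to the trainer setters used there):

# Rough sketch with dlib's Python API; file names are placeholders.
import dlib

options = dlib.shape_predictor_training_options()
options.nu = 0.05
options.tree_depth = 2
options.cascade_depth = 10
options.oversampling_amount = 5   # fewer synthetic initial shapes per image, much less memory
options.num_threads = 4           # number of CPU threads to use
options.be_verbose = True

dlib.train_shape_predictor("training_with_194_landmarks.xml",
                           "predictor_194_landmarks.dat",
                           options)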

Matcaffe training net produces "Data layer prefetch queue empty"

I'm trying to figure out why my MatCaffe implementation cannot pop from my train lmdb, which I've created using convert_imageset.bin.
What I do is basically just this:
solver = caffe.Solver(solverFile);
solver.step(500);
and when looking at the terminal, the output after the last statement is this:
I0322 11:15:11.830241 **098 net.cpp:228] data does not need backward computation.
I0322 11:15:11.830250 **098 net.cpp:270] This network produces output accuracy
I0322 11:15:11.830257 **098 net.cpp:270] This network produces output loss
I0322 11:15:11.830281 **098 net.cpp:283] Network initialization done.
I0322 11:15:11.830377 **098 solver.cpp:60] Solver scaffolding done.
I0322 11:15:16.625566 **098 solver.cpp:341] Iteration 0, Testing net (#0)
I0322 11:15:19.976579 **098 solver.cpp:409] Test net output #0: accuracy = 0.445407
I0322 11:15:19.976654 **098 solver.cpp:409] Test net output #1: loss = 0.693147 (* 1 = 0.693147 loss)
I0322 11:15:20.317916 **098 solver.cpp:237] Iteration 0, loss = 0.693147
I0322 11:15:20.317989 **098 solver.cpp:253] Train net output #0: loss = 0.693147 (* 1 = 0.693147 loss)
I0322 11:15:20.318009 **098 sgd_solver.cpp:106] Iteration 0, lr = 0.001
I0322 11:15:21.342550 **098 blocking_queue.cpp:50] Data layer prefetch queue empty
I can reproduce this problem even when I delete locks.mdb to make sure that no locks are left when I restart the procedure. After the message, all I can do is a hard Matlab shutdown.
I've checked the LMDBs with Matlab LMDB, and the contents of both my train and test LMDB seem to be OK. The parameter I used to generate the LMDBs was shuffle.
Note (this might be the source of the problem here): I'm currently facing MEX problems with this setup. On the first run of my implementation I get the error message
"Unexpected unknown exception from MEX file.."
for which the terminal output looks like this:
I0322 11:42:09.465801 **875 layer_factory.hpp:77] Creating layer data
I0322 11:42:09.466012 **875 net.cpp:106] Creating Layer data
I0322 11:42:09.466030 **875 net.cpp:411] data -> data
I0322 11:42:09.466053 **875 net.cpp:411] data -> label
I0322 11:42:09.469091 **151 db_lmdb.cpp:38] Opened lmdb /home/user/caffe-master/data/train/lmdbTrain
What I've tried so far:
I implemented a try-catch block so that the pointers, memory, etc. are (hopefully) freed by calling caffe.reset_all(), so that this method is called in any case.
On the second run, I get the above-mentioned output. It seems that my first run blocks the LMDB access, which led me to delete locks.mdb manually between the first and the second run, unfortunately with the same effect. A "manual" training via the command line does work with the same LMDBs; only the MatCaffe run seems to raise these problems. Note that I want to use MatCaffe for manual initialization of my layer weights, so a "weight_filler" in the .prototxt isn't an option.
My MatCaffe implementation is from January 2016, and I've also recompiled the mex-file for caffe_ with the correct gcc version (before, it gave me a warning that my gcc version should be "x"; I changed to "x" and recompiled).
Do you have any other ideas, recommendations or inputs please?
Thank you!
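In case it helps to narrow the problem down: since the plain command-line training works with the same LMDBs, the same solver.step() flow can be reproduced in pycaffe, which also allows the manual weight initialization mentioned above. A rough sketch; the solver file name, the layer name conv1 and the weight values are placeholders:

# Rough pycaffe sketch to reproduce the MatCaffe flow; 'solver.prototxt',
# the layer name 'conv1' and the weight values are placeholders.
import numpy as np
import caffe

caffe.set_mode_gpu()                      # or caffe.set_mode_cpu()
solver = caffe.SGDSolver('solver.prototxt')

# Manual weight initialization instead of a weight_filler in the .prototxt:
shape = solver.net.params['conv1'][0].data.shape
solver.net.params['conv1'][0].data[...] = np.random.randn(*shape) * 0.01
solver.net.params['conv1'][1].data[...] = 0.0     # biases

solver.step(500)                          # same stepping as in the MatCaffe script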

Text classification process gets killed when I am using linear SVM for 10000 rows

I am programming in Python 2.7 with the NLTK library for both text preprocessing and classification in sentiment analysis. I am using the NLTK wrapper of scikit-learn algorithms. The code below comes after preprocessing and splitting into train and test sets.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)
#LinearSVC
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
LinearSVCAccuracy = nltk.classify.accuracy(LinearSVC_classifier, testing_set)*100
print "LinearSVC accuracy percentage:" + str(LinearSVCAccuracy)
It works fine when the number of rows is around 4000 tweets for training, but when it increases to, for example, 10000 tweets, the process gets killed with the following error:
Memory cgroup out of memory: Kill process 24293 (python) score 848 or
sacrifice child
Killed process 24293, UID 29091, (python) total-vm:14569168kB,
anon-rss:14206656kB, file-rss:3412kB
Clocksource tsc unstable (delta = -17179861691 ns). Enable clocksource
failover by adding clocksource_failover kernel parameter.
The RAM of my PC is 8 GB, but I even tried with 16 GB of RAM and the problem persists. How can I classify this number of tweets without any problems?
Which OS are you running? Which Python distribution? Try installing Cython and/or using scikit-learn directly. Have a look at scikit-learn's optimization techniques.
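Following up on the scikit-learn suggestion above: the NLTK wrapper materializes a dictionary of features per tweet, which is far more memory-hungry than scikit-learn's sparse matrices. A rough sketch of the same LinearSVC model built directly on sparse features; trainTexts, trainLabels, testTexts and testLabels are assumed to be lists of raw tweet strings and their sentiment labels, not names from the question.

# Rough sketch: scikit-learn directly, with sparse TF-IDF features instead of
# the dict-based NLTK wrapper. trainTexts/trainLabels/testTexts/testLabels are
# assumed lists of raw tweets and labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_train = vectorizer.fit_transform(trainTexts)   # scipy sparse matrix, not dicts
X_test = vectorizer.transform(testTexts)

clf = LinearSVC()
clf.fit(X_train, trainLabels)

accuracy = accuracy_score(testLabels, clf.predict(X_test)) * 100
print("LinearSVC accuracy percentage: " + str(accuracy))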