Google AutoML training error / unable to deploy model - google-cloud-platform

I have a multi-label dataset with 727253 labeled images. The smallest label occurrence is ~15 and the largest around 200000. Model training started ~18h ago and has now failed with the following message:
Unable to deploy model
cancel_lro() got an unexpected keyword argument 'min_nodes'
Pipeline d884756f14314048b7a036f5b07f0fd2 timeout.
The automatically generated email contained the following:
Last error message
Please reference 116298312436989152 when reporting errors.
Is this already known? Also, I chose the free plan (1h) to train. Do I need to increase this for training to work properly? Is there any way to see the status during training, so I can anticipate long waits that produce no result? (I tried the API, but there was no percentage or anything else; that is only available for finished models.)
Thanks in advance!

This seems like an internal error. The main problem seems to be that the pipeline timed out. As part of the timeout it tries to do some sort of cleanup, and this cleanup seems to have a bug.
My recommendation is to re-try the pipeline.
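Regarding the status question: as noted above, the API does not show a progress percentage while training, but the training pipeline is exposed as a standard long-running operation, so you can at least poll whether it is still marked as running. Below is a rough sketch against the REST operations endpoint; the v1beta1 path, the placeholder project/location, and the exact response fields are my assumptions based on the generic long-running-operations API, so treat it as a starting point rather than a verified recipe.
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders - replace with your own project and region.
PROJECT_ID = "my-project"
LOCATION = "us-central1"

# Authenticate with application default credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# List long-running operations for the location; model training shows up here.
url = ("https://automl.googleapis.com/v1beta1/"
       "projects/{}/locations/{}/operations".format(PROJECT_ID, LOCATION))
response = session.get(url)
response.raise_for_status()

for op in response.json().get("operations", []):
    state = "done" if op.get("done") else "running"
    print(op.get("name"), state)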


k8s: getting resources from the cluster takes too much time

I need to get all resources based on a label. I used the following code, which works, but it takes too much time (~20 sec) to get the response, even when I restrict it to only one namespace (vrf). Any idea what I'm doing wrong here?
obj, err := resource.NewBuilder(flags).
    Unstructured().
    ResourceTypes(res...).
    NamespaceParam("vrf").AllNamespaces(false).
    LabelSelectorParam("a=b").SelectAllParam(selector == "").
    Flatten().
    Latest().Do().Object()
https://pkg.go.dev/k8s.io/cli-runtime@v0.26.1/pkg/resource#Builder
As I'm already using a label and a namespace, I'm not sure what else I should do in this case.
I've checked the cluster connection and everything seems OK; regular kubectl commands get a very fast response, it is just this query that takes a long time.
The search may be heavy due to the sheer size of the resources the query has to search through. Have you looked into this possibility and tried to further reduce the result set, for example with one more label or an additional filter on top of the current one?
Also check the performance of your Kubernetes API server while the operation is being performed, and optimize it if needed.
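If it helps to compare against another client, the same server-side filtering (one namespace plus a label selector, optionally narrowed further with a field selector) looks roughly like this in the official Python client; the pod resource and the extra status.phase filter are just illustrative assumptions, not taken from the question.
from kubernetes import client, config

# Load kubeconfig the same way kubectl does.
config.load_kube_config()

v1 = client.CoreV1Api()

# Filter server-side: one namespace, a label selector, and (optionally) a
# field selector to shrink the result set even further.
pods = v1.list_namespaced_pod(
    namespace="vrf",
    label_selector="a=b",
    field_selector="status.phase=Running",  # illustrative extra filter
)

for pod in pods.items:
    print(pod.metadata.name)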

AWS Step Function not recognizing Lambda step

I have the following architecture
I followed this link to a T: https://github.com/aws/amazon-sagemaker-examples/blob/main/step-functions-data-science-sdk/automate_model_retraining_workflow/automate_model_retraining_workflow.ipynb. I am not sure how to debug this to see what is going wrong. Any suggestions would be appreciated.
To provide more context, this is a machine learning deployment project. What I am doing in the picture is chaining processes together. The "Query Training Results" part is a Lambda function that pulls the training metrics data from an S3 location. For some reason this part gets cancelled.
From what I found online (Why would a step function cancels itself when there are no errors), "this happens in step functions when you have a Choice state, and the Variable you are referencing is not actually in the state input." There are also some answers in that post suggesting that the dictionary metrics need to be of string type, which I made sure of by casting them as such.
The problem I am having is that when you click on that grey box, it provides no information other than the fact that the step was cancelled, so I have no clue what is going wrong.
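For what it's worth, the fix suggested in that post boils down to two things: the Variable path the Choice state compares on must actually exist in the Lambda's output, and with string comparison operators the value must be a string. Below is a rough sketch of what such a "Query Training Results" Lambda might return; the bucket, key and metric names are placeholders, not taken from the original workflow.
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Placeholder location of the metrics file written by the training job.
    obj = s3.get_object(Bucket=event["metrics_bucket"], Key=event["metrics_key"])
    metrics = json.loads(obj["Body"].read())

    # Return the value the Choice state references, at the exact path the
    # Choice rule expects, cast to a string if the rule uses
    # StringEquals / StringLessThan style operators.
    return {
        "trainingMetrics": {
            "accuracy": str(metrics.get("accuracy", "0"))
        }
    }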

TensorFlow runs "Running per image evaluation" indefinitely

I am running my first TensorFlow job (object detection training) right now, using the TensorFlow Object Detection API. I am using the SSD MobileNet network from the model zoo. I used ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config as the config file and ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03 as the fine-tune checkpoint.
I started my training with the following command:
PIPELINE_CONFIG_PATH='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config'
MODEL_DIR='/my_path_to_tensorflow/tensorflow/models/research/object_detection/models/model/train'
NUM_TRAIN_STEPS=200000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
Now, coming to my problem, which I hope the community can help me with: I trained the network overnight; it trained for 1400 steps and then started the per-image evaluation, which kept running for the rest of the night. The next morning I saw that the network had only been evaluating and the training was still at 1400 steps. You can see part of the console output in the image below.
Console output from evaluation
I tried to take control by using the eval_config parameters in the config file.
eval_config: {
metrics_set: "coco_detection_metrics"
use_moving_averages: false
num_examples: 5000
}
I added max_evals = 1, because the documentation says I can limit the evaluation this way. I also changed eval_interval_secs = 3600, because I only wanted one evaluation per hour. Neither option had any effect.
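For reference, the eval_config after these changes looked roughly like this (the two added fields are exactly the ones described above):
eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_examples: 5000
  max_evals: 1
  eval_interval_secs: 3600
}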
I also tried other config files from the model zoo, with no luck. I searched Google for hours, only to find answers telling me to change the parameters I had already changed. So I am coming to Stack Overflow to find help in this matter.
Can anybody help me, or has maybe had the same experience? Thanks in advance for all your help!
Environment information
$ pip freeze | grep tensor
tensorboard==1.11.0
tensorflow==1.11.0
tensorflow-gpu==1.11.0
$ python -V
Python 2.7.12
I figured out a solution for the problem. The problem with TensorFlow 1.10 and later is that you can no longer set the checkpoint steps or checkpoint secs in the config file as before. By default, TensorFlow 1.10 and later saves a checkpoint every 10 minutes. If your hardware is not fast enough and you need more than 10 minutes for evaluation, you are stuck in a loop.
So to change the number of seconds or training steps until a new checkpoint is saved (which triggers the evaluation), you have to navigate to model_main.py in the following folder:
tensorflow/models/research/object_detection/
Once you have opened model_main.py, navigate to line 62. Here you will find:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
To trigger the checkpoint save after 2500 steps for example, change the entry to this:
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_steps=2500)
Now the model is saved every 2500 steps and afterwards an evaluation is done.
There are multiple parameters you can pass through this option. You can find the documentation here:
tensorflow/tensorflow/contrib/learn/python/learn/estimators/run_config.py
From line 231 to 294 you can see the parameters and their documentation.
I hope this helps and that you don't have to search for an answer as long as I did.
Could it be that evaluation takes more than 10 minutes in your case? Since 10 minutes is the default interval for saving a checkpoint, which triggers evaluation, it could be that it just keeps evaluating.
Unfortunately, the current API doesn't easily support altering the time interval for evaluation.
By default, evaluation happens after every checkpoint saving, which by default is set to 10 minutes.
Therefore you can change the checkpoint-saving interval by specifying save_checkpoint_secs or save_checkpoint_steps as an input to the MonitoredSession (or MonitoredTrainingSession) instance. Unfortunately, to the best of my knowledge, these parameters cannot be set as flags to model_main.py or from the config file. So you can either change their values by hard-coding them, or expose them so that they become configurable.
An alternative, without changing the checkpoint-saving frequency, is to modify the evaluation frequency, which is specified as throttle_secs to tf.estimator.EvalSpec.
See my explanation here as to how to export this parameter to model_main.py.
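To make both suggestions concrete, here is a minimal sketch of how save_checkpoints_steps and throttle_secs fit together with tf.estimator.train_and_evaluate; the model and input functions are placeholders, since in model_main.py they are generated from the pipeline config rather than written by hand.
import tensorflow as tf

# Placeholders - in the Object Detection API these are built from the
# pipeline config, not hand-written.
def model_fn(features, labels, mode):
    raise NotImplementedError

def train_input_fn():
    raise NotImplementedError

def eval_input_fn():
    raise NotImplementedError

# Save a checkpoint every 2500 steps; each new checkpoint can trigger an evaluation.
run_config = tf.estimator.RunConfig(
    model_dir="/path/to/model_dir",
    save_checkpoints_steps=2500)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=200000)

# throttle_secs: do not start a new evaluation more often than once per hour,
# even if checkpoints appear more frequently.
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=5000,
    throttle_secs=3600)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)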

Kibana mapping conflict: how to make sure this error won't be repeated

I am new to Kibana; we are using AWS Elasticsearch 5.5. I set up the dashboards yesterday and they were working fine, but this morning I saw that all the dashboards are empty, with no data. I found it was due to a mapping conflict. On Google, one answer I found was to reindex the data. How can we prevent this type of error in the future?
Any answers would be greatly appreciated.
You probably have the same field mapped in two different ways, for example gender defined as a string in one place and as a number in another.
You need to check for that and prevent it next time.
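One way to prevent it is to pin the mapping up front with an index template, so every new index gets the same field types. Below is a rough sketch with the Python Elasticsearch client; the endpoint, index pattern and template name are placeholders, and the _default_ type mapping shown here follows the 5.x syntax.
from elasticsearch import Elasticsearch

# Placeholder endpoint - use your AWS Elasticsearch domain endpoint.
es = Elasticsearch("https://my-es-domain-endpoint:443")

# Pin the mapping for every index matching the pattern, so the same field can
# no longer end up as a string in one index and a number in another.
template_body = {
    "template": "myindex-*",        # ES 5.x key; 6.x+ uses "index_patterns"
    "mappings": {
        "_default_": {              # default type mapping (ES 5.x syntax)
            "properties": {
                "gender": {"type": "keyword"}   # example field from the answer above
            }
        }
    }
}

es.indices.put_template(name="myindex-template", body=template_body)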

AWS Glue Error "Path does not exist"

Every time I try to run some very simple jobs (importing JSON from S3 into Redshift) I get the following error:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
s3://my-temp-glue-dir/f316d46f-eaf3-497a-927b-47ff04462e4a;'
This is not a permissions issue, since I have some other, more complex jobs (with joins) working reliably. I'm really not sure what the issue could be - any help would be appreciated.
I'm using 2 DPUs, but have tried 5. I also tried using a different temp directory. Also, there are hundreds of files, and some of them are very small (a few lines), but I'm not sure if that is relevant.
I believe the cause of this error is simply the number of files I'm attempting to load at the same time (and that the error message itself is misleading). After disabling job bookmarks and using a subset of the data, things are working as expected.
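If anyone else hits this with lots of small input files, Glue's file-grouping options are worth trying in addition to (or instead of) disabling bookmarks, since they combine many small S3 objects into fewer read tasks. A rough sketch follows; the bucket path and group size are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the JSON input with file grouping enabled; the path is a placeholder.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-source-bucket/json/"],
        "recurse": True,
        "groupFiles": "inPartition",   # combine small files within a partition
        "groupSize": "134217728",      # aim for roughly 128 MB per group
    },
    format="json",
)

job.commit()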