How to know the resource of each task assignment on RAY - ray

I newbie in Ray, I try to run some ML training task using RAY,
I'd like to know how much CPU, RAM, and GPU each task uses.
RAY supports the dashboard to view resource information, but I need to write down to log file for easy track.
How can I write the percent of CPU, RAM, and GPU each task uses to file log.
I use Python.

That's a good question--I don't think Ray supports anything like that at the moment, but I've filed an enhancement request here: https://github.com/ray-project/ray/issues/16711

Related

Exhausted Compute Units Google Colab

i'm planning to subs google colab pro to get better GPU memory when doing some research. But i was wondering if i exhaust my 100 compute units in the first day due to continues usage of GPU, can i still use GPU for my google colab?
If anyone know or already tried to use GPU after having 0 compute units, is it still possible to use a GPU? please kindly share your experience
For what I understood reading here: Google colab subscription
What happens if I don't have computing units? All users can access Colab resources based on availability. If you do not have computing units, you can only use Colab resources reserved for non-paying users.
So, if you finish your compute units you'll be downgraded to the free user version of Colab. In that case, you can be assigned another gpu as a free user but with the common limitations.
Another useful link to understand a little bit better about usage of computing units is the following: What exactly is a compute unit?.
I'll be trying the paid version in a few days for at least a month to see if it is really worth it.
Hope it helps!

Error symbol log running GPU on google cloud ML

I'm trying to use google cloud ml with GPU mode.
When I train BASIC_GPU mode, I have many error log.
But, It works training well.
I am not sure whether the learning was good working in GPU mode.
This is error log history.
enter image description here
This is the some part of print config.log_device_placement.
enter image description here
Also, I tried training complex_model_m_gpu mode.
I also have error log like BASIC_GPU.
But, I can't see gpu:/1, gpu:/2, gpu:/3 when i print config.log_device_placement. Only gpu:/0 i can see.
The important thing is that BASIC_GPU and complex_model_m_gpu have same speed for running time.
I wonder whether the learning was good working in GPU mode or there is something wrong.
Sorry for my english, anyone knows the problem then help me.
thank you.
Please refer to TensorFlow's performance guide for optimizing for GPUs for tips on how to make the most of your GPUs.
A couple things to note
You can turn on logging of device placement to see which ops get assigned to which Devices. This is a great way to check that ops are actually assigned to GPUs and that you are using all GPUs when you have multiple GPUs.
TensorBoard should also provide information about device placement so that is another way to check that you are using all GPUs.
When using multiple GPUs, you need to make sure you are assigning ops to all GPUs. The TensorFlow guide provides more information on this topic.

Speeding up model training using MITIE with Rasa

I'm training a model for recognizing short, one to three sentence strings of text using the MITIE back-end in Rasa. The model trains and works using spaCy, but it isn't quite as accurate as I'd like. Training on spaCy takes no more than five minutes, but training for MITIE ran for several days non-stop on my computer with 16GB of RAM. So I started training it on an Amazon EC2 r4.8xlarge instance with 255GB RAM and 32 threads, but it doesn't seem to be using all the resources available to it.
In the Rasa config file, I have num_threads: 32 and set max_training_processes: 1, which I thought would help use all the memory and computing power available. But now that it has been running for a few hours, CPU usage is sitting at 3% (100% usage but only on one thread), and memory usage stays around 25GB, one tenth of what it could be.
Do any of you have any experience with trying to accelerate MITIE training? My model has 175 intents and a total of 6000 intent examples. Is there something to tweak in the Rasa config files?
So I am going to try to address this from several angles. First specifically from the Rasa NLU angle the docs specifically say:
Training MITIE can be quite slow on datasets with more than a few intents.
and provide two alternatives:
Use the mite_sklearn pipeline which trains using sklearn.
Use the MITIE fork where Tom B from Rasa has modified the code to run faster in most cases.
Given that you're only getting a single cores used I doubt this will have an impact, but it has been suggested by Alan from Rasa that num_threads should be set to 2-3x your number of cores.
If you haven't evaluated both of those possibilities then you probably should.
Not all aspects of MITIE are multi-threaded. See this issue opened by someone else using Rasa on the MITIE GitHub page and quoted here:
Some parts of MITIE aren't threaded. How much you benefit from the threading varies from task to task and dataset to dataset. Sometimes only 100% CPU utilization happens and that's normal.
Specifically on training data related I would recommend that you look at the evaluate tool recently introduced into the Rasa repo. It includes a confusion matrix that would potentially help identify trouble areas.
This may allow you to switch to spaCy and use a portion of your 6000 examples as an evaluation set and adding back in examples to the intents that aren't performing well.
I have more questions on where the 6000 examples came from, if they're balanced, and how different each intent is, have you verified that words from the training examples are in the corpus you are using, etc but I think the above is enough to get started.
It will be no surprise to the Rasa team that MITIE is taking forever to train, it will be more of a surprise that you can't get good accuracy out of another pipeline.
As a last resort I would encourage you to open an issue on the Rasa NLU GitHub page and and engage the team there for further support. Or join the Gitter conversation.

Distributed Tensorflow Training of Reinpect Human detection model

I am working on Distributed Tensorflow, particularly the implementation of Reinspect model using Distributed Tensorflow given in the following paper https://github.com/Russell91/TensorBox .
We are using Between-graph-Asynchronous implementation of Distributed tensorflow settings but the results are very surprising. While bench marking, we have come to see that Distributed training takes almost more than 2 times more training time than a single machine training. Any leads about what could be happening and what else could be tried be would be really appreciated. Thanks
Note: There is a correction in the post, we are using between-graph implementation not in-graph implementation. Sorry for the mistake
In general, I wouldn't be surprised if moving from a single-process implementation of a model to a multi-machine implementation would lead to a slowdown. From your question, it's not obvious what might be going on, but here are a few general pointers:
If the model has a large number of parameters relative to the amount of computation (e.g. if it mostly performs large matrix multiplications rather than convolutions), then you may find that the network is the bottleneck. What is the bandwidth of your network connection?
Are there a large number of copies between processes, perhaps due to unfortunate device placement? Try collecting and visualizing a timeline to see what is going on when you run your model.
You mention that you are using "in-graph replication", which is not currently recommended for scalability. In-graph replication can create a bottleneck at the single master, especially when you have a large model graph with many replicas.
Are you using a single input pipeline across the replicas or multiple input pipelines? Using a single input pipeline would create a bottleneck at the process running the input pipeline. (However, with in-graph replication, running multiple input pipelines could also create a bottleneck as there would be one Python process driving the I/O with a large number of threads.)
Or are you using the feed mechanism? Feeding data is much slower when it has to cross process boundaries, as it would in a replicated setting. Using between-graph replication would at least remove the bottleneck at the single client process, but to get better performance you should use an input pipeline. (As Yaroslav observed, feeding and fetching large tensor values is slower in the distributed version because the data is transferred via RPC. In a single process these would use a simple memcpy() instead.)
How many processes are you using? What does the scaling curve look like? Is there an immediate slowdown when you switch to using a parameter server and single worker replica (compared to a single combined process)? Does the performance get better or worse as you add more replicas?
I was looking at similar thing recently, and I noticed that moving data from grpc into Python runtime is slower than expected. In particular consider following pattern
add_op = params.assign_add(update)
...
sess.run(add_op)
If add_op lies on a different process, then sess.run adds a decoding step that happens at rate of 50-100 MB/second.
Here's a benchmark and relevant discussion

Recognition of an animal in pictures

I am facing a challenging problem. On the courtyard of company I am working is a camera trap which takes a photo of every movement. On some of these pictures there are different kinds of animals (mostly deep gray mice) that cause damages to our cable system. My idea is to use some application that could recognize if there is a gray mouse on the picture or not. Ideally in realtime. So far we have developed a solution that sends alarms for every movement but most of alarms are false. Could you provide me some info about possible ways how to solve the problem?
In technical parlance, what you describe above is often called event detection. I know of no ready-made approach to solve all of this at once, but with a little bit of programming you should be all set even if you don't want to code any computer vision algorithms or some such.
The high-level pipeline would be:
Making sure that your video is of sufficient quality. Gray mice sound kind of tough, plus the pictures are probably taken at night - so you should have sufficient infrared lighting etc. But if a human can make it out whether an alarm is false or true, you should be fine.
Deploying motion detection and taking snapshot images at the time of movements. It seems like you have this part already worked out, great! Detailing your setup could benefit others. You may also need to crop only the area in motion from the image, are you doing that?
Building an archive of images, including your decision of whether they are false or true alarm (labels in machine learning parlance). Try to gather at least a few tens of example images for both cases, and make them representative of real-world variations (do you have the problem during daytime as well? is there snowfall in your region?).
Classifying the images taken from the video stream snapshot to check whether it's a false alarm or contains bad critters eating cables. This sounds tough, but deep learning and machine learning is making advances by leaps; you can either:
deploy your own neural network built in a framework like caffe or Tensorflow (but you will likely need a lot of examples, at least tens of thousands I'd say)
use an image classification API that recognizes general objects, like Clarifai or Imagga - if you are lucky, it will notice that the snapshots show a mouse or a squirrel (do squirrels chew on cables?), but it is likely that on a specialized task like this one, these engines will get pretty confused!
use a custom image classification API service which is typically even more powerful than rolling your own neural network since it can use a lot of tricks to sort out these images even if you give it just a small number of examples for each image category (false / true alarm here); vize.it is a perfect example of that (anyone can contribute more such services?).
The real-time aspect is a bit open-ended, as the neural networks take some time to process an image — you also need to include data transfer etc. when using a public API, but if you roll out your own, you will need to spend a lot of effort to get low latency as the frameworks are by default optimized for throughput (batch prediction). Generally, if you are happy with ~1s latency and have a good internet uplink, you should be fine with any service.
Disclaimer: I'm one of the co-creators of vize.it.
How about getting a cat?
Also, you could train your own custom classifier using the IBM Watson Visual Recognition service. (demo: https://visual-recognition-demo.mybluemix.net/train ) It's free to try and you just need to supply example images for the different categories you want to identify. Overall, Petr's answer is excellent.