TensorFlow profiling

Here is a technique offered for profiling TensorFlow code. In my case, I launch tf.run in several threads in parallel. How can I use this technique to profile a multithreaded architecture? When I use global metadata and options, they log only a single thread.
Thanks!

Related

Tuning Sagemaker Pipeline

Is it possible to do a hyper-parameter optimization on Sagemaker for a flow (e.g., pipeline) of a processing job followed by a training job?
In Sagemaker pipelines, I see I can use the tuner step with any training step. However, I can't see any helpful resource for integrating the processing job into the optimization.
Any ideas on how to do this task without merging two steps into one step?
A relatively old question asked about optimizing two models jointly; here, I am asking about Processing and Training jobs.
I really appreciate any help you can provide.
There is no way to run a HyperparameterTuner for a Processor; it expects an estimator as input. On the other hand, you can put your processing logic, with its hyperparameters, into an appropriate Estimator (sklearn, for example) and then output the fitted processing artefact from /opt/ml/model/model.joblib to S3 as a model artefact. When tuning is done, just load it into the Processor via model_dir, and you have a Processor with tuned hyperparameters.

Do schedulers slow down your web application?

I'm building a jewellery e-commerce store, and a feature I'm building is incorporating current market gold prices into the final product price. I intend on updating the gold price every 3 days by making calls to an api using a scheduler, so the full process is automated and requires minimal interaction with the system.
My concern is this: will a scheduler with one task executed every 72 hours slow down my server (a client-server model) and affect its performance?
The server-side application is built using Django, Django REST framework, and PostgreSQL.
The scheduler I have in mind is Advanced Python Scheduler.
As far as I can see from the docs of "Advanced Python Scheduler", they do not provide a different process to run the scheduled tasks. That is left up to you to figure out.
From their docs, they are recommending a "BackgroundScheduler" which runs in a separate thread.
Now there are multiple issues which could arise:
If you're running multiple Django instances (using gunicorn or uwsgi), APScheduler will run in each of those processes. This is a non-trivial problem to solve unless APScheduler has considered it (you will have to check the docs).
BackgroundScheduler will run in a thread, but python is limited by the GIL. So if your background tasks are CPU intensive, your Django process will get slower at processing incoming requests.
Regardless of thread or not, if your background job is CPU intensive + lasts a long time, it can affect your server performance.
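One common workaround for the multiple-instances issue is a cross-process lock, so that only the process that wins the lock starts the scheduler. Below is a minimal sketch (not from the original answer) that uses binding a localhost port as the lock; the port number is an arbitrary choice:

```python
import socket

def acquire_singleton_lock(port=47200):
    """Use a localhost TCP port as a machine-wide lock.

    Only one process can bind the port, so only the process that
    succeeds should start the scheduler. Keep a reference to the
    returned socket for the lifetime of the process; returns None
    if another process already holds the lock.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return s
    except OSError:
        s.close()
        return None

# Sketch of use at startup:
# lock = acquire_singleton_lock()
# if lock is not None:
#     scheduler.start()   # only this gunicorn/uwsgi worker runs jobs
```

This is a pragmatic trick, not APScheduler-specific; a database or file lock works equally well.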
APScheduler seems like a much lower-level library, and my recommendation would be to use something simpler:
Use system cron jobs to run the task every 3 days: create a Django management command and have cron execute it.
Use Django-supported libraries like celery, rq/rq-scheduler/django-rq, or django-background-tasks.
I think it would be wise to take a look at https://github.com/arteria/django-background-tasks as it is the simplest of all, with the least amount of setup required. Once you get a bit familiar with it, you can weigh the pros and cons of what is appropriate for your use case.
Once again, your server performance depends on what your background task is doing and how long it lasts.
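The cron option above can be sketched as a single crontab entry; the management command name `update_gold_prices` and the paths are hypothetical examples. Note that `*/3` in the day-of-month field means every 3rd day of the month, so the interval resets at month boundaries:

```shell
# Hypothetical crontab entry: run a Django management command at 02:00
# every 3rd day of the month. You would implement `update_gold_prices`
# as a management command in your Django app.
0 2 */3 * * cd /path/to/project && /path/to/venv/bin/python manage.py update_gold_prices >> /var/log/gold_prices.log 2>&1
```

Because cron runs the command in a separate process, it cannot slow down the Django workers serving requests.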

Is it possible to parallelize preprocessings with tensorflow-transform on my machine?

I am trying to preprocess larger amounts of data (one TFRecord file of ~1 GB) using tensorflow-transform v0.11.0 and Beam, running only locally.
My code is largely inspired from https://github.com/tensorflow/transform/blob/master/examples/census_example.py
I have a Beam pipeline that works on smaller datasets (<100 MB), but the processing time increases dramatically as I add more data. Being new to tf-transform and Apache Beam, I have a hard time finding the causes of and solutions to the problem... and I would like to avoid using Google Dataflow.
My pipeline runs locally using Beam's DirectRunner, if I understood correctly, but it uses only one core. Using multiple cores could be one way to improve my preprocessing time, but I do not know whether that is possible with the DirectRunner. Is there a way to make a tensorflow-transform pipeline run on multiple cores on my machine?
I looked in the options of the Beam pipeline and of the DirectRunner, and I can't find any indication about letting a runner access multiple cores or creating multiple DirectRunners for a pipeline.
Thank you very much for any help I could get!
To add to Anton's comment,
You can utilize Apache Flink to run the pipeline in parallel. More details are summarized in "Tensorflow transform on beams with flink runner".
You will also have to set the parallelism according to the total number of cores and start those many Flink TaskManagers. My recommendation would be to set parallelism to (total number of cores/2)
I don't believe that's supported. The direct runner's main purpose is to make sure the pipeline implements the Beam model correctly. It is not optimized for production use and will probably introduce inefficiencies: https://beam.apache.org/documentation/runners/direct/
As a workaround you can manually start multiple direct runner pipelines to process different portions of data.
A better option would be to use an actual parallel runner for these kinds of jobs; e.g., you can spin up a Flink cluster: https://beam.apache.org/documentation/runners/flink/
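As a sketch, switching runners is mostly a matter of pipeline options. The invocation below assumes a Flink cluster already running at localhost:8081, a script name `preprocess.py` chosen for illustration, and flag names as in recent Beam releases:

```shell
# Hypothetical invocation: the same Beam/tf.Transform pipeline, but
# executed on a local Flink cluster instead of the single-core DirectRunner.
python preprocess.py \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --parallelism=4
```

The `--parallelism` value caps how many parallel task slots the job uses; setting it to roughly half the machine's core count matches the recommendation above.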
@Ankur @Anton Thanks for your answers. I agree that this approach is not production-friendly... We will try two other solutions:
tensorflow-transform on DataFlow
removing tensorflow-transform altogether and using Presto to get vocabulary files for categorical inputs, compute means and standard deviations to scale numerical inputs, etc., on the whole dataset

Update a table after x minutes/hours

I have these tables
exam(id, start_date, deadline, duration)
exam_answer(id, exam_id, answer, time_started, status)
exam_answer.status possible values are: 0 = not yet started, 1 = started, 2 = submitted
Is there a way to update exam_answer.status once now - exam_answer.time_started is greater than exam.duration? Or if it is already past deadline?
I'll also mention this in case it helps: I'm building this for a Django project.
Django applications, like any other WSGI/web application, are only meant to handle request-response flows. If there aren't any requests, there is no activity and such changes will not happen.
You can write a custom management command that's executed periodically by a cron job, but you run the risk of displaying stale data between runs. Alternatively, you have elegant means at your disposal to compute the statuses before any related views start their processing, but this might be a wasteful use of resources.
Your best bet might be to integrate a task scheduler with your application, such as Celery. Do not be discouraged because Celery seemingly runs in a concurrent multiprocess environment across several machines; the service can be configured to run in a single thread, and it provides a clean interface for scheduling tasks that have to run at some exact point in the future.

How to distribute a program on an unreliable cluster?

What I'm looking for is any/all of the following:
automatic discovery of worker failure (computer off for instance)
detection of all running (linux) PCs on a given IP address range (computer on)
... and auto worker spawning (ping+ssh?)
load balancing so that workers do not slow down other processes (nice?)
some form of message passing
... and don't want to reinvent the wheel.
C++ library, bash scripts, standalone program ... all are welcome.
If you give an example of software, please tell us which of the above functions it has.
Check out the Spread Toolkit, a C/C++ group communication system. It will allow you to detect node/process failure and recovery/startup, in a manner that allows you to rebalance a distributed workload.
What you are looking for is called a "job scheduler". There are many job schedulers on the market; these are the ones I'm familiar with:
SGE handles any and all issues related to job scheduling on multiple machines (recovery, monitoring, priority, queuing). Your software does not have to be SGE-aware, since SGE simply provides an environment in which you submit batch jobs.
LSF is a better alternative, but not free.
To support message passing, see the MPI specification. SGE fully supports MPI-based distribution.
Depending on your application requirements, I would check out the BOINC infrastructure. They're implementing a form of client/server communication in their latest releases, and it's not clear what form of communication you need. Their API is in C, and we've written wrappers for it in C++ very easily.
The other advantage of BOINC is that it was designed to scale for large distributed computing projects like SETI or Rosetta@home, so it supports things like validation, job distribution, and management of different application versions for different platforms.
Here's the link:
BOINC website
There is Hadoop. It has MapReduce, but I'm not sure whether it has the other features I need. Does anybody know?
You are indeed looking for a "job scheduler." Nodes are "statically" registered with a job scheduler. This allows the job scheduler to inspect the nodes and determine the core count, RAM, available scratch disk space, OS, and much more. All of that information can be used to select the required resources for a job.
Job schedulers also provide basic health monitoring of the cluster. Nodes that are down are automatically removed from the list of available nodes. Nodes which are running jobs (through the scheduler) are also removed from the list of available nodes.
SLURM is a resource manager & job scheduler that you might consider. SLURM has integration hooks for LSF and PBSPro. Several MPI implementations are "SLURM aware" and can use/set environment variables that will allow an MPI job to run on the nodes allocated to it by SLURM.
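As a small illustration of the batch-job model that SGE and SLURM share, a SLURM submission script might look like the sketch below; the job name, resource requests, and program name are placeholders:

```shell
#!/bin/bash
# Hypothetical SLURM batch script: the scheduler picks healthy nodes with
# the requested resources; the program itself is scheduler-agnostic.
#SBATCH --job-name=example-job
#SBATCH --ntasks=8            # e.g. 8 MPI ranks
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00       # wall-clock limit

srun ./my_program             # srun launches the tasks on the allocated nodes
```

Submitting it with `sbatch script.sh` queues the job; if a node goes down, the scheduler simply stops assigning work to it, which is what makes this model a good fit for an unreliable cluster.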