Poor PySpark performance on Stand-alone cluster + Docker - amazon-web-services

I'm running a Spark slave inside a Docker container on AWS c4.8xlarge machines (one or more) and struggling to get the expected performance compared to just using multiprocessing on my laptop (quad-core Intel i7-6820HQ). (See the edits below: there is a huge overhead on the same hardware as well.)
I'm looking for solutions to horizontally scale analytics model training with a "Multiprocessor" which can work in a single thread, multi-process or in a distributed Spark scenario:
import multiprocessing

class Multiprocessor:
    # ...

    def map(self, func, args):
        # Distribute over Spark when available; otherwise fall back to a
        # local process pool, or to a plain sequential map.
        if has_pyspark:
            n_partitions = min(len(args), 1000)
            return _spark_context.parallelize(args, n_partitions).map(func).collect()
        elif self.max_n_parallel > 1:
            with multiprocessing.Pool(self.max_n_parallel) as pool:
                return list(pool.map(func, args))
        else:
            return list(map(func, args))
As you can see, Spark's role is simply to distribute the calculations and collect the results; parallelize().map() is the only API used. args is just a list of integer id tuples, nothing heavy.
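For context, a call looks roughly like the sketch below (the constructor arguments, train_model and the id tuples are placeholders for the real training function and data, shown only to illustrate the shape of the workload):
# Hypothetical usage sketch: train_model and the id tuples stand in for the real workload.
mp = Multiprocessor()
args = [(model_id, fold_id) for model_id in range(90) for fold_id in range(4)]  # ~360 tasks
results = mp.map(train_model, args)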
I'm using Docker 1.12.1 (--net host), Spark 2.0.0 (stand-alone cluster), Hadoop 2.7, Python 3.5 and OpenJDK 7. Results for the same training dataset (every run is CPU-bound):
5.4 minutes with local multiprocessing (4 processes)
5.9 minutes with four c4.8xlarge slaves (10 cores in use / each)
6.9 minutes with local Spark (master local[4])
7.7 minutes with three c4.8xlarge slaves (10 cores in use / each)
25 minutes with a single c4.8xlarge slave (10 cores) (!)
27 minutes with local VM Spark slave (4 cores) (!)
All 36 virtual CPUs seem to be in use and load averages are 250 - 350. There were about 360 args values to be mapped, and processing each took 15 - 45 seconds (25th and 75th percentiles). GC times were insignificant. I even tried returning "empty" results to rule out network overhead, but it did not affect the total time. Ping to AWS via VPN is 50 - 60 ms.
Any tips on which other metrics I should look into? I feel I'm wasting lots of CPU cycles somewhere. I'd really like to build the architecture around Spark, but based on these PoCs the AWS machines are far too expensive for the performance they deliver. I'll have to run tests on other local hardware I have access to.
EDIT 1: Tested on a Linux VM on the laptop; using the stand-alone cluster took 27 minutes, which is 20 minutes more than with local[4].
EDIT 2: There seem to be 7 pyspark daemons for each slave "core", all of them taking a significant amount of CPU. Is this expected behavior? (picture from the laptop's VM)
EDIT 3: Actually this happens even when starting the slave with just a single core; I get 100% CPU utilization. According to this answer the red color indicates kernel-level threads, so could Docker play a role here? In any case, I don't remember seeing this issue when I was prototyping with Python 2.7; the performance overhead was minimal then. I have now updated to OpenJDK 8 and it made no difference. I also got the same results with Spark 1.5.0 and Hadoop 2.6.
EDIT 4: I tracked down that by default scipy.linalg.cho_factor uses all available cores, which is why I'm seeing high CPU usage even with a single core for the Spark slave. Must investigate further...
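One way to confirm this from Python (a diagnostic sketch; it assumes the threadpoolctl package is installed, which was not part of my original setup) is to check which BLAS NumPy/SciPy are linked against and how many threads its pool will use:
# Diagnostic sketch (assumes the threadpoolctl package is available).
import numpy
from threadpoolctl import threadpool_info

numpy.__config__.show()           # which BLAS/LAPACK numpy was built against
for pool in threadpool_info():    # one entry per native thread pool (OpenBLAS, MKL, OpenMP, ...)
    print(pool["internal_api"], pool["num_threads"])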
Final edit: The issue seems to have nothing to do with AWS or Spark; I get the same poor performance from stand-alone Python inside the Docker container. See my answer below.

Had the same problem - for me the root cause was memory allocation.
Make sure you allocate enough memory to your spark instances.
In start-slave.sh, run --help to get the memory option (the default is 1GB per node regardless of the actual memory in the machine).
You can view in the UI (port 8080 on the master) the allocated memory per node.
You also need to set the memory per executor when you submit your application with spark-submit (again, the default is 1GB); as before, run it with --help to see the memory option.
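On the application side this could look like the sketch below (the sizes and the master URL are placeholders, not recommendations); the worker-side memory itself still has to be raised when the slave is started, e.g. with start-slave.sh -m or SPARK_WORKER_MEMORY:
# Sketch: raise executor/driver memory above the 1 GB defaults (example values only).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("training")
        .setMaster("spark://master-host:7077")    # placeholder master URL
        .set("spark.executor.memory", "8g")       # per-executor memory, default 1g
        .set("spark.driver.memory", "4g"))
sc = SparkContext(conf=conf)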
Hope this helps.

Sorry for the confusion (I'm the OP); it took me a while to dig down to what was really happening. I did lots of benchmarking and finally realized that the Docker image I was using ships OpenBLAS, which by default multithreads the linalg functions. My code runs cho_solve hundreds of times on matrices ranging from 80 x 80 to 140 x 140. There was simply tons of overhead from launching all those threads, which I don't need in the first place since I'm already parallelizing via multiprocessing or Spark.
# N_CORES=4 python linalg_test.py
72.983 seconds
# OPENBLAS_NUM_THREADS=1 N_CORES=4 python linalg_test.py
9.075 seconds
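The same pinning can also be done at the top of the script instead of on the command line, as long as it happens before NumPy/SciPy are imported. This is a sketch of the idea rather than my exact code:
# Pin the BLAS thread pools to one thread before numpy/scipy are imported;
# the outer parallelism already comes from multiprocessing or Spark.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"        # harmless if MKL is not installed
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
from scipy.linalg import cho_factor, cho_solve

a = np.random.rand(100, 100)
spd = a @ a.T + 100 * np.eye(100)          # symmetric positive definite test matrix
c, low = cho_factor(spd)
x = cho_solve((c, low), np.ones(100))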

Related

Performance of a Redis cluster with ElastiCache

I am running performance tests against a Redis cluster with cluster mode enabled (AWS ElastiCache - default.redis6.x.cluster.on), monitoring with Datadog.
For the test we made an application that just does a SET and a GET on Redis. We start some threads that call this app, making around 100 calls simultaneously, so Redis also receives around 100 calls (an operation on the key 'abc').
What we saw is that with only 1 to 10 threads (10 simultaneous calls), every Redis call takes 1-5 milliseconds, but when we make 100 or more calls simultaneously the time rises to 300-500 ms per call.
I would like to know whether there is a way to establish an expected average performance for this scenario - is 300-500 ms normal? And what would be a good approach for analysing Redis performance here?
Thank you
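Not part of the original test code, but a minimal sketch of the kind of measurement described above, using redis-py 4.x and its RedisCluster client against the ElastiCache configuration endpoint (the endpoint name is a placeholder):
# Sketch: measure per-call GET latency from ~100 concurrent threads (redis-py 4.x assumed).
import time
from concurrent.futures import ThreadPoolExecutor
from redis.cluster import RedisCluster

r = RedisCluster(host="my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com", port=6379)
r.set("abc", "value")

def timed_get(_):
    start = time.perf_counter()
    r.get("abc")
    return (time.perf_counter() - start) * 1000   # milliseconds

with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(timed_get, range(1000)))

print(f"avg={sum(latencies)/len(latencies):.1f} ms  max={max(latencies):.1f} ms")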

How can I optimise requests/second under peak load for Django, UWSGI and Kubernetes

We have an application that experiences some pretty short, sharp spikes - generally about 15-20 minutes long with a peak of 150-250 requests/second, but roughly an average of 50-100 requests/second over that time. p50 response times are around 70ms (whereas p90 is around 450ms).
The application is generally just serving models from a database/memcached cluster, but also sometimes makes requests to 3rd party APIs etc (tracking/Stripe etc).
This is a Django application running with uwsgi, running on Kubernetes.
I'll spare you the full uwsgi/kube settings, but the TLDR:
# uwsgi
master = true
listen = 128 # Limited by Kubernetes
workers = 2 # Limited by CPU cores (2)
threads = 1
# Of course much more detail here that I can load test...but will leave it there to keep the question simple
# kube
Pods: 5-7 (horizontal autoscaling)
If we assume a 150ms average response time, I'd roughly calculate a total capacity of 93 requests/second - somewhat short of our peak. In our logs we often see "uWSGI listen queue of socket ... full" messages, which makes sense.
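Spelling that estimate out (7 pods is the top of our autoscaling range, 2 single-threaded workers each):
# Back-of-the-envelope capacity for synchronous, single-threaded uWSGI workers.
pods = 7                      # top of the horizontal autoscaling range
workers_per_pod = 2
avg_response_time_s = 0.150

capacity_rps = pods * workers_per_pod / avg_response_time_s
print(round(capacity_rps))    # ~93 requests/second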
My question is...what are our options here to handle this spike? Limitations:
It seems the 128 listen queue limit is determined by the kernel, and the kube docs suggest it's unsafe to increase it.
Our Kube nodes have 2 cores. The general advice seems to be to set your number of workers to 2 * cores (possibly + 1), so we're pretty much at our limit here. Increasing to 3 doesn't seem to have much impact.
Multiple threads in Django can apparently cause weird bugs
Is our only option to keep scaling horizontally at the Kubernetes level - aside from making our queries/caching as efficient as possible, of course?

Out of memory issue with celery and redis - need help tuning celery

I have a Django 2 app that uploads one to many photographs (~3-4 MB per image) and uses face_recognition to find the face locations and encodings. The images and generated thumbnails are saved to the file system and the data to a MySQL database. The app works, except that uploading and finding 16 faces in 3 photos takes about 2 minutes and uses about 6.5 GB of RAM and no swap (my Ubuntu 18.04 system has a total of 16 GB of RAM and normally runs with about 8 GB of free memory and 1 GB of swap).
When I use Celery (v 4.2.1) with redis (v 2.10.6) and redis-server (v 4.0.9) to offload the face recognition from my Django app, the Celery tasks run out of memory and the workers are killed before they finish. There is one Celery task per photo for face recognition, and Django handles the file uploading, thumbnail creation and db writing for the three photos. Usually one task will finish, but not always. I even added exponential retries for the Celery tasks, but that did not help. Watching top during the Celery face recognition, I noticed that the amount of free RAM stayed around 6 GB, but the 1 GB of swap gets totally consumed and then the Celery tasks start to die. The error message I get when they die is "out of memory".
From my observations I conclude that my system has enough RAM for these three pictures, but somehow I need to tune Celery so that it uses less swap and more memory. I am not sure whether this is the real problem, nor have I found any way to tune Celery. BTW, I have the same issue whether I run Django in debug mode or not.
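For concreteness, the sort of Celery settings that would control this behaviour (a sketch only; these are Celery 4's lowercase setting names and the values are untested guesses, not something I have verified) would look like:
# celeryconfig.py sketch - limit prefork concurrency and recycle workers so a single
# face_recognition task cannot exhaust memory (the values below are guesses).
worker_concurrency = 1                   # run one face_recognition task at a time
worker_max_tasks_per_child = 1           # restart the child process after every task
worker_max_memory_per_child = 4000000    # KiB (~4 GB); Celery 4+ setting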
Thanks for any suggestions you may have to solve this problem of celery running out of memory.
Mark

high cpu in redis 2.8 (elasticache) cache.r3.large

Looking for some help with ElastiCache.
We're using ElastiCache Redis to run a Resque-based queueing system.
This means it's a mix of sorted sets and lists.
In normal operation everything is OK and we're seeing good response times and throughput.
CPU level is around 7-10% and Get+Set commands are around 120-140K operations. (All metrics are CloudWatch-based.)
But when the system experiences a (mild) burst of data, enqueueing several thousand messages, we see the server become nearly unresponsive:
the CPU is steady at 100% utilization (the metric says 50%, but it's using a single core)
the number of operations drops to ~10K
response times slow to a matter of SECONDS per request
We would expect that even if the CPU got loaded to that extent, the throughput would stay the same; that's what we see when running Redis locally: Redis can saturate a CPU but throughput stays high, and since it's natively single-threaded there is no context switching.
AFAWK we do NOT impose any limits, persistence or replication; we're using the basic config.
The instance size is cache.r3.large.
We are not using periodic snapshotting.
This sounds like the characteristic of a rogue Lua script.
A defect in such a script can cause a big CPU load while degrading the overall throughput.
Are you using one? Try looking in the Redis slow log for it.
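For example, a quick way to read the slow log from Python (a sketch using redis-py; the ElastiCache endpoint is a placeholder):
# Sketch: dump the most recent slow-log entries (redis-py assumed).
import redis

r = redis.Redis(host="my-cache.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)
for entry in r.slowlog_get(25):                  # last 25 slow commands
    print(entry["duration"], entry["command"])   # duration is in microseconds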

RabbitMQ on EC2 Consuming Tons of CPU

I am trying to get RabbitMQ with Celery and Django going on an EC2 instance to do some pretty basic background processing. I'm running rabbitmq-server 2.5.0 on a large EC2 instance.
I downloaded and installed the test client per the instructions here (at the very bottom of the page). I have been just letting the test script go and am getting the expected output:
recving rate: 2350 msg/s, min/avg/max latency: 588078478/588352905/588588968 microseconds
recving rate: 1844 msg/s, min/avg/max latency: 588589350/588845737/589195341 microseconds
recving rate: 1562 msg/s, min/avg/max latency: 589182735/589571192/589959071 microseconds
recving rate: 2080 msg/s, min/avg/max latency: 589959557/590284302/590679611 microseconds
The problem is that it is consuming an incredible amount of CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
668 rabbitmq 20 0 618m 506m 2340 S 166 6.8 2:31.53 beam.smp
1301 ubuntu 20 0 2142m 90m 9128 S 17 1.2 0:24.75 java
I was testing on a micro instance earlier and it was completely consuming all resources on the instance.
Is this to be expected? Am I doing something wrong?
Thanks.
Edit:
The real reason for this post is that celerybeat seemed to run okay for a while and then suddenly consume all resources on the system. I installed the RabbitMQ management tools and have been investigating how the queues are created by Celery and by the RabbitMQ test suite. It seems to me that Celery is orphaning these queues and they are not going away.
Here is the queue as generated by the test suite. One queue is created and all the messages go into it and come out:
Celerybeat creates a new queue every time it runs the task:
It sets the auto-delete parameter to true, but I'm not entirely sure when these queues will get deleted. They seem to just slowly build up and eat resources.
Does anyone have an idea?
Thanks.
Ok, I figured it out.
Here's the relevant piece of documentation:
http://readthedocs.org/docs/celery/latest/userguide/tasks.html#amqp-result-backend
Old results will not be cleaned automatically, so you must make sure to consume the results or else the number of queues will eventually go out of control. If you’re running RabbitMQ 2.1.1 or higher you can take advantage of the x-expires argument to queues, which will expire queues after a certain time limit after they are unused. The queue expiry can be set (in seconds) by the CELERY_AMQP_TASK_RESULT_EXPIRES setting (not enabled by default).
To add to Eric Conner's solution to his own problem, http://docs.celeryproject.org/en/latest/userguide/tasks.html#tips-and-best-practices states:
Ignore results you don’t want
If you don’t care about the results of a task, be sure to set the ignore_result option, as storing results wastes time and resources.
@app.task(ignore_result=True)
def mytask(…):
    something()
Results can even be disabled globally using the CELERY_IGNORE_RESULT setting.
That, along with Eric's answer, is probably the bare minimum of best practices for managing your results backend.
If you don't need a results backend, set CELERY_IGNORE_RESULT or don't set a results backend at all. If you do need a results backend, set CELERY_AMQP_TASK_RESULT_EXPIRES to safeguard against unused results building up. If you don't need it for a specific app, set the local ignore as above.
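Put together as a settings sketch (old-style uppercase setting names, matching the docs quoted above; the one-day expiry is only an example):
# Celery settings sketch: either ignore results you never read, or let the AMQP
# backend expire per-result queues (names from the docs above, values are examples).
CELERY_IGNORE_RESULT = True                       # if task results are never consumed
# ...or, if you do need results with the AMQP backend:
CELERY_AMQP_TASK_RESULT_EXPIRES = 24 * 60 * 60    # seconds until unused result queues expire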