AWS Batch job long STARTING time - amazon-web-services

I have an issue with long starting time for AWS Batch job. It's random, sometimes it takes second to transition from STARTING to RUNNING, but sometimes it takes more than 10 minutes. According to the documentation, in STARTING state container initiation operations are done, so I understand it can take some time to download and run container on newly created machine in compute environment, but it happens also on machines that were used just before and should have the container already prepared.
Is there any way I can optimise job's STARTING time?

Duration of STARTING state depends on how fast an environment can pull and start a container.
You can speed things up if you use smaller Docker image (smaller the image, faster pulling and starting the container) and higher vCPU and RAM. All of these can be configured in Job Definition.
it happens also on machines that were used just before and should have the container already prepared
You don't control how AWS manages an environment.

Related

large number of docker containers and aws resources

My application allow users to create multiple nodered instances using docker containers , each docker container handle one instance , the number of containers could reach 65000 container , so:
how can i configure host resources to handle that number of containers
if my host memory is 16 gb how to check if it can handle all those instances
if No should i increase memroy size or should i create another instance (aws instance) and use it
You don't, 16GB / 65000 = 0.25mb per container, there is no way a NodeJS process will start, run and do anything useful. So you are never going to run 65000 containers on a single machine.
As I said in the comment on your previous question, you need to spend some time running tests and determine how much memory/CPU your typical use case will need and you then set resource limits on the container when started to limit it to those values. How to set resource limits are in the Docker docs
If your work load is variable then you may need to have different sizes of container depending on the workload.
You then do the basic maths that is CPU in the box/ %CPU per container and memory in the box / memory per container and which ever is the smaller number is the max number of containers you can run. (You will also need to include some overhead for monitoring/management and other house keeping tasks)
It's up to you to decide what approach to sizing/choosing cloud infrastructure to support that work load and what will be economically viable.
(or you could outsource this to people that have already done a bunch of this e.g. FlowForge inc [full disclosure, I am lead developer at FlowForge])

Task takes too much time pending on ECS

I've been with a weird problem for some days.
I'm implementing the ECS logic to drain instances on termination (specifically on Spot interruption notice) using the ECS_ENABLE_SPOT_INSTANCE_DRAINING=true env var on the ecs-agent.
The process works fine, when an interruption notice arrives, ECS drains the instance and moves the containers to another one, but here is the problem, if the instance never started that image before, it takes too much time to start (About 3 min, when the spot interruption time is in 2 min) causing availability issues. If the image started in that instance before, it only takes 20 sec to spin up the task!
Have you experienced this problem before using ECS?
PD: The images are about 500MB is that large for an image??
There are some strategies available to you:
Reduce the size of the image by optimising the Dockerfile. A smaller image is quicker to pull from the repository.
Bake the large image into the AMI used in the cluster. Now every new spot machine will have the image. Depending on how the Dockerfile is created, a significant number of layers could be reused resulting on quicker image pulls.
Once the image is pulled to the machine, the image is cached and subsequent pulls will almost be instantaneous.

Working around AWS Step Function Map concurrency limit

I have a Map task in an AWS Step Function which executes 100-200 lambdas in parallel, each running for a few minutes, then collects the results. However, I'm running into throttling where not all lambdas are started for some time. The relevant AWS documentation says you may experience throttling with more than ~40 items, which is what I believe I'm running into.
Does anyone have any experience with working around this concurrency limitation? Can i have nested Maps, or could I bucket my tasks into several Maps I run in parallel?
Use nested state machine inside your map state, so you can have ~40 child state machines executing in parallel. Then inside each child state machine use a map state to process ~40 items in parallel.
This way you can reach processing ~1600 items in parallel.
But before reaching that you will reach AWS Step Functions Quotas:
https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
I ended up working around this 40 item limit by creating 10 copies of the Map task in a Parallel, and bucketing up the task information to split tasks between these 10 copies. This means I can now run ~400 tasks before running into throttling issues. My state machine looks something like this:
AWS now offers a direct solution to this, called "distributed map state", announced at re:invent 2022 https://docs.aws.amazon.com/step-functions/latest/dg/concepts-asl-use-map-state-distributed.html
It allows you to perform 10,000 concurrent map state tasks. Under the hood, it runs them as child step function workflows, which can be specified to run as either standard or express workflows

Are there any problems with running same cron job that takes 2 hours to complete every 10 minutes?

I have a script that takes two hours to run and I want to run it every 15 minutes as a cronjob on a cloud vm.
I noticed that my cpu is often at 100% usage. Should I resize memory and/or number_of_cores ?
Each time you execute your cron job, a new process will be created.
So if your job takes 120 min (2h) to complete, and you will be starting new jobs every 15 minutes, then you will be having 8 jobs running at the same time (120/15).
Thus, if the jobs are resource intensive, you will observe issues, such as 100% cpu usage.
So the question whether to up-scale or not is really dependent on the nature of these jobs. What do they do, how much cpu and memory do they take? Based on your description you are already running at 100% CPU often, thus an upgrade would be warranted in my view.
It would depend on your cron, but outside of resourcing for your server/application the following issues should be considered:
Is there overlap in data? i.e. do you retrieve a pool of data that will be processed multiple times.
Will duplicate critical actions happen? i.e. will a customer receive an email multiple times, will a payment be processed multiple times.
Is there a chance of a race condition that cause the script to exit early.
Will there be any collisions in the processing i.e. duplicate bookings made etc.
You will need to increase the CPU and Memory specification of your VM instance (in GCP) due to the high CPU load of your instance. The document [1] on upgrading the machine type of your VM instance, to do this need to shutdown your VM instance and change it´s machine type.
To know about different machine types in GCP, please have the link [2].
On the other hand, you can autoscale based on the average CPU utilization if you use managed instance group (MIG) [3]. Using this policy tells the autoscaler to collect the CPU utilization of the instances in the group and determine whether it needs to scale. You set the target CPU utilization the autoscaler should maintain and the autoscaler works to maintain that level.
[1] https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance
[2] https://cloud.google.com/compute/docs/machine-types
[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing#scaling_based_on_cpu_utilization

Neo4j performance discrepancies local vs cloud

I am encountering drastic performance differences between a local Neo4j instance running on a VirtualBox-hosted VM and a basically identical Neo4j instance hosted in Google Cloud (GCP). The task involves performing a simple load from a Postgres instance also located in GCP. The entire load takes 1-2 minutes on the VirtualBox-hosted VM instance and 1-2 hours on the GCP VM instance. The local hardware setup is a 10-year-old 8 core, 16GB desktop running VirtualBox 6.1.
With both VirtualBox and GCP I perform these similar tasks:
provision a 4 core, 8GB Ubuntu 18 LTS instance
install Neo4j Community Edition 4.0.2
use wget to download the latest apoc and postgres jdbc jars into the plugins dir
(only in GCP is the neo4j.conf file changed from defaults. I uncomment the "dbms.default_listen_address=0.0.0.0" line to permit non-localhost connections. Corresponding GCP firewall rule also created)
restart neo4j service
install and start htop and iotop for hardware monitoring
login to empty neo4j instance via broswer console
load jdbc driver and run load statement
The load statement uses apoc.periodic.iterate to call apoc.load.jdbc. I've varied the "batchSize" parameter in both environments from 100-10000 but only saw marginal changes in either system. The "parallel" parameter is set to false because true causes lock errors.
Watching network I/O, both take the first ~15-25 seconds to pull the ~700k rows (8 columns) from the database table. Watching CPU, both keep one core maxed at 100% while another core varies from 0-100%. Watching memory, neither takes more than 4GB and swap stays at 0. Initially, I did use the config recommendations from "neo4j-admin memrec" but those didn't seem to significantly change anything either in mem usage or overall execution time.
Watching disk, that is where there are differences. But I think these are symptoms and not the root cause: the local VM consistently writes 1-2 MB/s throughout the entire execution time (1-2 minutes). The GCP VM burst writes 300-400 KB/s for 1 second every 20-30 seconds. But I don't think the GCP disks are slow or the problem (I've tried with both GCP's standard disk and their SSD disk). If the GCP disks were slow, I would expect to see sustained write activity and a huge write-to-disk queue. It seems whenever something should be written to disk, it gets done quickly in GCP. It seems the bottleneck is before the disk writes.
All I can think of are that my 10-year-old cores are way faster than a current GCP vCPU, or that there is some memory heap thing going on. I don't know much about java except heaps are important and can be finicky.
Do you have the exact same :schema on both systems? If you're missing a critical index used in your LOAD query that could easily explain the differences you're seeing.
For example, if you're using a MATCH or a MERGE on a node by a certain property, it's the difference between doing a quick lookup of the node via the index, or performing a label scan of all nodes of that label checking every single one to see if the node exists or if it's the right node. Understand also that this process repeats for every single row, so in the worst case it's not a single label scan, it's n times that.