How to Make AWS Infrastructure Perform Comparably to a Local Server Running on a MacBook Pro

I have a web application that is caching data in CSV files, and then in response to an HTTP request, reading the CSV files into memory, constructing a JavaScript Object, then sending that Object as JSON to the client.
When I run this on my local server on my MacBook Pro (2022, Chip: Apple M1 Pro, 16GB Memory, 500GB Hard Drive), the 24 CSV files at about 15MB each are all read in about 2.5 seconds, and then the subsequent processing takes another 3 seconds, for a total execution time of about 5.5 seconds.
When I deploy this application to AWS, however, I am struggling to create a comparably performant environment.
I am using AWS Elastic Beanstalk to spin up an EC2 instance, and then attaching an EBS volume to store the CSV files. I know that because the EBS volume lives on separate hardware and is attached over the network, there is some network latency, but my understanding is that this is typically pretty negligible as far as overall performance is concerned.
What I have tried thus far:
Using a compute-focused instance (c5.4xlarge), which is automatically EBS-optimized, with a Provisioned IOPS (io2) volume of 1000 GiB and 400 IOPS. (Performance: about 10 seconds total.)
Using a throughput-optimized EBS volume, which is supposed to offer better performance for sequential read jobs (like what I imagined reading a CSV file would be), but that actually performed a little worse than the Provisioned IOPS volume. (Performance: about 11 seconds total.)
Can anyone recommend which EC2 instance type and EBS volume configuration would achieve performance comparable to my local machine? I don't expect an exact match, but I do expect to get closer than roughly twice as slow as the local server.
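One thing worth checking before changing instance or volume types is whether the time is really going to disk at all, and whether the 24 files are being read one after another. Below is a minimal benchmarking sketch of concurrent reads (in Python rather than the Node app itself, with a hypothetical /data/csv mount point); the same pattern applies in Node with fs.promises and Promise.all.

import glob
import time
from concurrent.futures import ThreadPoolExecutor

CSV_DIR = "/data/csv"  # hypothetical mount point of the EBS volume

def read_file(path):
    # Read the whole file in one go; large sequential reads are
    # throughput-bound, which plays to EBS's strengths.
    with open(path, "rb") as f:
        return len(f.read())

def main():
    paths = sorted(glob.glob(f"{CSV_DIR}/*.csv"))
    start = time.perf_counter()
    # Read all files concurrently so round trips to EBS overlap
    # instead of being paid one after another.
    with ThreadPoolExecutor(max_workers=len(paths) or 1) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    elapsed = time.perf_counter() - start
    print(f"read {total_bytes / 1e6:.1f} MB from {len(paths)} files "
          f"in {elapsed:.2f}s ({total_bytes / 1e6 / elapsed:.1f} MB/s)")

if __name__ == "__main__":
    main()

If the concurrent read is no faster than a plain sequential loop, the volume's throughput is likely the limit; if it is much faster, the application was paying per-file latency serially and the fix is in the code rather than the hardware.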

Related

AWS LightSail RDS - How Much RAM Do I Need

I'm just setting up a high-availability WordPress network and I need to decide how much RAM I need for the database instance. On a web server you would run "top", find out how much RAM is being used per MySQL process, and then check your config file for the maximum number of processes that are allowed to run.
How do you calculate how much RAM you will need for a high-availability MySQL database running in AWS LightSail? The plans seem very light on RAM. For example, a $20 web server gets 4GB of RAM, whereas a $60 database server gets 1GB of RAM. Why is this, and how many processes will 1GB run?
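Not a full answer, but to make the "RAM per process times max processes" rule of thumb concrete, here is a rough sketch of the arithmetic; every number in it is an illustrative assumption, not a measured value.

# Rough MySQL RAM estimate: global buffers plus per-connection overhead.
# All figures below are illustrative assumptions; substitute what `top`
# and your my.cnf actually report.
innodb_buffer_pool_mb = 512     # global buffer pool (my.cnf)
per_connection_mb = 3           # sort/join/read buffers per connection
max_connections = 150           # my.cnf max_connections

estimated_mb = innodb_buffer_pool_mb + per_connection_mb * max_connections
print(f"~{estimated_mb} MB needed at full connection load")
# ~962 MB here, which shows why a 1GB database plan leaves very little
# headroom once the OS itself is accounted for.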

What AWS service can I use to efficiently process large amounts of S3 data on a weekly basis?

I have a large number of images stored in an AWS S3 bucket.
Every week, I run a classification task on all these images. The way I'm doing it currently is by downloading all the images to my local PC, processing them, then making database changes once the process is complete.
I would like to reduce the amount of time spent downloading images to increase the overall speed of the classification task.
EDIT2:
I actually am required to process 20,000 images at a time to increase the performance of the classification engine. This means I can't use Lambdas, since the maximum RAM available is 3GB and I need 16GB to process all 20,000 images.
The classification task uses about 16GB of RAM. What AWS service can I use to automate this task? Is there a service that can be put on the same VLAN as the S3 Bucket so that images transfer very quickly?
The entire process takes about 6 hours. If I spin up an EC2 instance with 16GB of RAM, it would be very cost-inefficient, as it would finish after 6 hours and then spend the remainder of the week sitting there doing nothing.
Is there a service that can automate this task in a more efficient manner?
EDIT:
Each image is around 20-40KB. The classification is a neural network, so I need to download each image so I can feed it through the network.
Multiple images are processed at the same time (batches of 20,000), but the processing part doesn't actually take that long. The longest part of the whole process is the downloading. For example, downloading takes about 5.7 hours and processing takes about 0.3 hours in total, which is why I'm trying to reduce the download time.
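As an aside on the download step itself: running the job from an EC2 instance in the same region as the bucket and fetching many small objects in parallel usually shrinks that 5.7 hours dramatically, because 20-40KB objects are latency-bound rather than bandwidth-bound. A minimal boto3 sketch follows; the bucket name, prefix, destination directory, and thread count are all placeholders.

import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "my-image-bucket"      # placeholder bucket name
PREFIX = "images/"              # placeholder key prefix
DEST = "/mnt/images"            # local scratch directory
s3 = boto3.client("s3")

def download(key):
    local_path = os.path.join(DEST, os.path.basename(key))
    s3.download_file(BUCKET, key, local_path)
    return local_path

def main():
    os.makedirs(DEST, exist_ok=True)
    # List every object under the prefix, paginating past 1000 keys.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    # Small objects are dominated by per-request latency, so many
    # parallel downloads make a big difference from inside the region.
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(download, keys))
    print(f"downloaded {len(keys)} objects to {DEST}")

if __name__ == "__main__":
    main()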
For your purpose you can still use an EC2 instance. If you have a large amount of data to download from S3, you can attach an EBS volume to the instance.
You need to set up the instance with all the tools and software required to run your job. When you don't have any process to run, you can shut the instance down, and boot it up again when you want to run the process.
EC2 instances are not charged for the time they are in the stopped state. You will be charged for the EBS volume and any Elastic IP attached to the instance.
You will also be charged for storing the EC2 image on S3.
But I think these costs will be less than the cost of running the EC2 instance all the time.
You can schedule starting and stopping the instance using the AWS Instance Scheduler.
https://www.youtube.com/watch?v=PitS8RiyDv8
You can also use Auto Scaling, but that would be a more complex solution than using the Instance Scheduler.
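If the packaged Instance Scheduler feels like overkill, the same start/stop behaviour can be scripted directly. A minimal boto3 sketch is below (the instance ID and region are placeholders); the two handlers could run as scheduled Lambda functions or cron jobs.

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID
ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

def start_handler(event, context):
    # Run from a scheduled (e.g. weekly) trigger before the job starts.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

def stop_handler(event, context):
    # Run once the classification job reports completion.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])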
I would look into Kinesis streams for this, but it's hard to tell because we don't know exactly what processing you are doing to the images

Faster Upload Speeds with AWS EC2 Instance

I've got a t2.medium instance with an EBS volume and EFS in the US West (Oregon) region.
Users (often out of California) can upload image files using a JavaScript file uploader, but no matter how fast the user's connection is, they can't seem to upload any faster than ~500kb/s.
For example, if a user speed-tests their upload rate at 5mb/s, and then uploads a 5MB image file, it will still take nearly 11 seconds to complete.
I get similar results when using FTP to upload files.
My initial thought was that I should change my instance to something with better network performance, but since I'm uploading directly to EFS and not to an Amazon bucket or something else, I wasn't sure networking was my problem.
How can I achieve faster upload rates? Is this a limitation of my instance?
I would definitely experiment with different instance types, as the instance family and size are directly correlated with network performance. The t2 family has one of the lowest network throughputs.
Here are two resources to help you figure out what to expect for network throughput for the various instance types:
Cloudonaut EC2 Network Performance Cheat Sheet
Amazon EC2 Instance Type documentation
The t3 family is the latest generation of low-cost, burstable t instances, which includes enhanced networking with a much improved burstable network rate of up to 5 Gbps. This may work for you if your uploads are infrequent. At a minimum, you could switch to the t3 family to improve your network performance without changing your cost much at all.
Side note: if you are using an older AMI, you may not be able to reuse the AMI from your t2 instance directly, as you will need a modern OS version that supports enhanced networking.
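To check whether an existing instance (and the AMI/OS it was launched from) already has ENA enhanced networking enabled before switching to t3, something like the following sketch can be used; the instance ID and region are placeholders.

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID
ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region

# ENA support is an instance attribute; t3 (Nitro-based) instances require it.
attr = ec2.describe_instance_attribute(
    InstanceId=INSTANCE_ID, Attribute="enaSupport"
)
enabled = attr.get("EnaSupport", {}).get("Value", False)
print("ENA enhanced networking enabled:", enabled)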

Capacity planning on AWS

I need some understanding of how to do capacity planning for AWS and what kind of infrastructure components to use. I will use the example below.
I need to set up a Node.js-based server which uses Kafka, Redis, and MongoDB. There will be 250 devices connecting to the server and sending in data every 10 seconds. The size of each data packet will be approximately 10kb. I will be using the 64-bit Ubuntu image.
What I need to estimate:
MongoDB requires at least 3 servers for redundancy. How do I estimate the size of the VM and EBS volume required, e.g. should it be m4.large, m4.xlarge, or something else? The default EBS volume size is 30GB.
What should be the size of the VM for running the other application components, which include 3-4 Node.js processes, Kafka, and Redis? E.g. should it be m4.large, m4.xlarge, or something else?
Can I keep just one application server in an Auto Scaling group and increase the count as the load increases, or should I go with a minimum of 2?
I want to understand in general, given the number of devices, the data packet size, and the data frequency, how to go about estimating which VM to consider, how much storage to consider, and perhaps any other considerations too.
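For the raw ingest side, the stated numbers are small enough to estimate directly. A rough back-of-the-envelope sketch, assuming the 10kb packets are kilobytes and with the retention period as an assumption:

# Back-of-the-envelope ingest and storage estimate for the stated load.
devices = 250
packet_kb = 10          # ~10 kB per packet (assumption: kilobytes)
interval_s = 10         # one packet per device every 10 seconds
retention_days = 90     # assumption; adjust to your requirement

packets_per_s = devices / interval_s             # 25 packets/s
ingest_kb_per_s = packets_per_s * packet_kb      # 250 kB/s
per_day_gb = ingest_kb_per_s * 86400 / 1e6       # ~21.6 GB/day raw
stored_gb = per_day_gb * retention_days          # before indexes/replication

print(f"{packets_per_s:.0f} packets/s, {ingest_kb_per_s:.0f} kB/s ingest")
print(f"~{per_day_gb:.1f} GB/day raw, ~{stored_gb:.0f} GB over {retention_days} days")

At roughly 25 small writes per second and a few hundred kB/s of ingest, the workload is modest; the real sizing questions are the working-set size for MongoDB and Redis and headroom for Kafka, which is exactly what the measurement-based answer below gets at.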
Nobody can answer this question for you. It all depends on your application and usage patterns.
The only way to correctly answer this question is to deploy some infrastructure and simulate standard usage while measuring the performance of the systems (throughput, latency, disk access, memory, CPU load, etc).
Then, modify the infrastructure (add/remove instances, change instance types, etc) and measure again.
You should certainly run a minimal deployment per your requirements (e.g. instances in separate Availability Zones for high availability), and you can use Auto Scaling to add extra capacity when required, but simulated testing would also be required to determine the right trigger points at which more capacity should be added. For example, the best indicator might be memory, or CPU, or latency. It all depends on the application and how it behaves under load.
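As a very rough illustration of "simulate standard usage while measuring", even a crude single-threaded pacing loop like the sketch below lets you watch CPU, memory, and latency on the servers while traffic arrives at the expected aggregate rate of about 25 packets per second. The endpoint URL is a placeholder, and a real test would use a proper load-testing tool.

import time
import urllib.request

ENDPOINT = "http://test-env.example.internal/ingest"  # placeholder URL
PAYLOAD = b"x" * 10_000    # ~10 kB, matching the stated packet size
RATE_PER_S = 25            # 250 devices sending once every 10 seconds
REQUESTS = 500             # keep the sketch finite

for _ in range(REQUESTS):
    start = time.perf_counter()
    req = urllib.request.Request(
        ENDPOINT, data=PAYLOAD,
        headers={"Content-Type": "application/octet-stream"},
    )
    try:
        # POST one device-sized payload and time the round trip.
        urllib.request.urlopen(req, timeout=5).read()
    except Exception as exc:
        print("request failed:", exc)
    latency = time.perf_counter() - start
    print(f"latency {latency * 1000:.1f} ms")
    # Pace requests to roughly the expected aggregate rate.
    time.sleep(max(0.0, 1.0 / RATE_PER_S - latency))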

What AWS disk options should I use for my EC2 instance?

I created a new Ubuntu c3.xlarge instance, and when I get to the storage options I can change ROOT to General Purpose SSD, Provisioned IOPS, or Magnetic; if I pick Provisioned IOPS I can also set an IOPS value. The additional data storage under Instance Store 0 has no options, but if I change it to EBS then I have the same options.
I'm really struggling to understand:
The speed of each option
The costs of each option
The Amazon documentation is very unclear
I'm using this instance to transfer data from text files into a Postgres relational database. These files have to be processed line by line, with a number of INSERT statements per line, so it is slow on my local computer (5 million rows of data takes 15 hours). Originally the database was hosted separately on RDS, but that was incredibly slow, so I installed the database locally on the instance itself to remove network latency, which has sped things up a bit, but it is still considerably slower than my humble local Linux server.
Looking at the instance metrics while loading the data, CPU utilization is only at 6%, so I'm now thinking that disk may be the limiting factor. The database is using the / disk (not sure if it is SSD or magnetic; how can I find out?) and the data files are on the /mnt disk (using Instance Store 0).
I only need this instance to do two things:
Load database from datafiles
Create Lucene Search Index from database
(so the database is just an interim step)
The Search Index is transferred to an EBean Server, and then I don't need this instance for another month, when I repeat the process with new data. With that in mind, I can afford to spend more money for faster processing, because I'm only going to use it one day a month and can then stop the instance and incur no further costs?
Please, what can I do to determine the problem and speed things up?
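On the "not sure if SSD or magnetic" point: the type of every EBS volume attached to the instance can be read from the EC2 API (the /mnt instance store is ephemeral disk and won't appear there). A minimal boto3 sketch, with the instance ID as a placeholder:

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID
ec2 = boto3.client("ec2")

# List every EBS volume attached to the instance with its type
# (gp2 = General Purpose SSD, io1 = Provisioned IOPS, standard = magnetic).
volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": [INSTANCE_ID]}]
)
for vol in volumes["Volumes"]:
    devices = ", ".join(a["Device"] for a in vol["Attachments"])
    print(f"{vol['VolumeId']} on {devices}: type={vol['VolumeType']}, "
          f"size={vol['Size']} GiB, iops={vol.get('Iops')}")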
Here is my personal guideline:
If the volume is small (<33GB) and only requires an occasional burst in performance, such as a boot volume, use magnetic drives.
If you need predictable performance and high throughput, use PIOPS volumes and EBS optimized instances.
Otherwise, use General Purpose SSD.
Your CPU is only at 6%; maybe you can try using multiple processes?
Did you test the I/O performance of your remote instance's volume?
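If not, even a rough timing script gives a useful first number before reaching for dedicated tools; a sketch is below. The test path is a placeholder, and the file should be larger than free RAM so the page cache doesn't hide the disk.

import os
import time

TEST_FILE = "/mnt/testfile"   # placeholder: a path on the volume under test
SIZE_MB = 1024                # make this larger than free RAM if possible

def write_test_file():
    # Write random data and flush it all the way to disk.
    chunk = os.urandom(1024 * 1024)
    with open(TEST_FILE, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())

def timed_sequential_read():
    start = time.perf_counter()
    total = 0
    with open(TEST_FILE, "rb") as f:
        while True:
            block = f.read(1024 * 1024)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    print(f"sequential read: {total / 1e6 / elapsed:.1f} MB/s")

if __name__ == "__main__":
    write_test_file()
    timed_sequential_read()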
PIOPS is expensive, but it is not significantly better than gp2; its only real advantage is stability.
For example, I created a 500GB gp2 volume and a 500GB PIOPS volume with 1500 IOPS, then inserted and queried 1,000,000 documents with MongoDB, and checked the I/O performance with tools such as mongoperf, iostat, mongostat, and dstat.
Each volume's IOPS is expected to be 1500, but gp2's IOPS is unstable, ranging roughly from 700 to 1600 (read + write); with reads only it can burst to 4000, while with writes only it reaches just 800.
PIOPS is perfectly stable; its IOPS stays at almost exactly 1470.
For your situation, I suggest considering gp2 (the volume size depends on your IOPS demand: 500GB gp2 = 1500 IOPS, 1TB gp2 = 3000 IOPS (maximum)).