Data Engineering - infrastructure/services for efficient extraction of data (AWS) [closed] - amazon-web-services

Closed 4 years ago: seeking recommendations for tools and software libraries.
Let's assume the standard data engineering problem:
every day at 3.00 AM connect to an API
download data
store them in a data lake
Let's say there is a python script that does the API hit and storage, but that is not that important.
Ideally I would like to have some service that comes alive, runs this script and kills itself... So far, I thought about those possibilities (using AWS services):
(AWS) Lambda - FaaS, an ideal match for the use case. But there is a problem: the function's limited resources (RAM/CPU/bandwidth) and a timeout of 5 minutes.
(AWS) Lambda + Step Functions + range requests: fire multiple Lambdas in parallel, each downloading a part of the file. Coordination via Step Functions. It solves the issue of 1) but it feels very complicated.
(AWS EC2) Static VM: the classic approach: I have a VM, a Python interpreter, and a cron job -> every night I run the script. Or, every night, I can trigger a build of a new EC2 machine using CloudFormation, run the script, and then kill the machine. Problems: feels very old-school - like there has to be a better way to do it.
(AWS ECS) Docker: I have very little experience with Docker. Probably similar to the VM case, but it feels more versatile/controllable. I don't know whether there is a good orchestrator for this kind of job, or how easy it is (firing up a container and killing it).
How I see it:
Exactly what I would like to have, but not good for downloading big data because of the resource constraints.
Complicated workaround for 1)
Feels very old-school, with additional devops expenses
Don't know a lot about this topic; feels like the current state of the art
My question is: what is the current state-of-art for this kind of job? What services are useful and what are the experiences with them?

A variation on #3... Launch a Linux Amazon EC2 instance with a User Data script, with Shutdown Behavior set to Terminate.
The User Data script performs the download and copies the data to Amazon S3. It then executes sudo shutdown -h now to turn off the instance. (Or, if the script is complex, the User Data script can download a program from an S3 bucket, then execute it.)
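A minimal User Data sketch of this approach might look like the following; the API URL and bucket name are placeholders, and the instance needs an IAM role with S3 write access:

```bash
#!/bin/bash
# Runs once at first boot. Shutdown Behavior on the instance must be set to Terminate.
set -euo pipefail

# Download the data from the API (hypothetical URL)
curl -sSf "https://api.example.com/export" -o /tmp/export.json

# Copy the result into the data lake, partitioned by date (hypothetical bucket)
aws s3 cp /tmp/export.json "s3://my-data-lake/raw/$(date +%F)/export.json"

# Power off; with Terminate shutdown behavior this terminates the instance
sudo shutdown -h now
```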
Linux EC2 instances are now charged per-second, so think of it like a larger version of Lambda that has more disk space and does not have a 5-minute limit.
There is no need to use CloudFormation to launch the instance because then you'd just need to delete the CloudFormation stack. Instead, just launch the instance directly with the necessary parameters. You could even create a Launch Template with the parameters and then simply launch an instance using the Launch Template.
You could even add a few smarts to the process and launch the instance using Spot Pricing (set the bid price to normal On-Demand pricing, since worst case you'll just pay the normal price). If the Spot Instance won't launch due to insufficient spare capacity, then launch an On-Demand instance instead.
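The spot-with-fallback logic can be sketched independently of any particular SDK. Here `launch_spot` and `launch_on_demand` are hypothetical callables (in practice, thin wrappers around boto3's `run_instances`):

```python
class InsufficientCapacityError(Exception):
    """Raised when a spot request cannot be fulfilled due to lack of capacity."""


def launch_with_fallback(launch_spot, launch_on_demand):
    """Try a Spot instance first; fall back to On-Demand on capacity errors.

    Bidding the On-Demand price means the worst case is simply paying
    the normal rate, as described above.
    """
    try:
        return launch_spot()
    except InsufficientCapacityError:
        return launch_on_demand()


# Usage with stub launchers standing in for real API calls:
def no_spot_capacity():
    raise InsufficientCapacityError("no spare capacity")

instance_id = launch_with_fallback(no_spot_capacity, lambda: "i-ondemand")
```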

Run application on multiple locations and Pay as You Go [closed]

Closed 2 years ago: needs details or clarity.
Hey guys, I need advice in order to make the right architectural decision.
I need to be able to run a console application (or a Docker container in the future) in different locations (countries/cities) without paying for hundreds of always-running virtual machines.
In other words, I need to press a button and run the application for a couple of hours on a server in New York; on the next press, the same application should run in Istanbul.
The straightforward approach is to buy hundreds of virtual machines, but there are two problems with it:
It's too expensive.
Probably only a couple of them will be used but I'll have to pay for all of them.
What can you recommend?
Does Azure support it? Or maybe AWS?
First thing: cloud service providers work in terms of regions rather than cities like New York, but you can always choose the region nearest to the country/city in which you want to run your application. You can also try CloudPing to find the nearest AWS region.
In other words, I need to press the button and run the application for a couple of hours on a server in New York, next press, and the same application will be run in Istanbul.
So I would recommend a Docker container: since you want to run the same application in different regions, going with a container is better than maintaining an AMI in each region.
AWS Fargate is designed for exactly this pay-as-you-go purpose, with zero server maintenance: you just specify the Docker image and run your application, and AWS takes care of the underlying resources.
AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate makes it easy for you to focus on building your applications. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.
As you mentioned:
without paying for hundreds always running virtual machines.
So you do not need to pay for idle machines; you will only pay for the compute hours used by your application while the container is running.
With AWS Fargate, there are no upfront payments and you only pay for the resources that you use. You pay for the amount of vCPU and memory resources consumed by your containerized applications.
AWS Fargate pricing
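As an illustration, a minimal Fargate task definition could look like the following sketch; the family name, account ID, image URI, and CPU/memory values are all placeholders:

```json
{
  "family": "my-console-app",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-console-app:latest",
      "essential": true
    }
  ]
}
```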
For deployment, I would recommend Terraform: you only need to define the resources once, and you can parameterize the region so the same configuration can be applied to any other region.
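A minimal Terraform sketch of the region parameterization could be (variable name and default are illustrative):

```hcl
variable "region" {
  type    = string
  default = "us-east-1"
}

provider "aws" {
  region = var.region
}

# The same configuration can then be applied to any region, e.g.:
#   terraform apply -var="region=eu-central-1"
```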

AWS best way to handle high volume transactions [closed]

Closed 2 years ago: needs details or clarity.
I am writing a system with an extremely high volume of transactions (CRUD) and I am working with AWS. What considerations must I keep in mind, given that none of the data should be lost?
I have done some research, and it suggests using SQS queues to make sure that data is not lost. What other backup, redundancy, and quick-processing considerations should I keep in mind?
If you want to create a system that is highly resilient whilst also being redundant, I would advise you to read the AWS Well-Architected Framework. It goes into more detail than a person can provide on Stack Overflow.
Regarding individual technologies:
If you're transactional like you said, then you should look at using a relational data store. I'd recommend taking a look at Amazon Aurora; it has built-in features like auto scaling of read replicas and multi-master support. While you might be expecting large numbers, by using autoscaling you will only pay for what you use.
Try to decouple your APIs; have a dumb validation layer before handing off to your backend if you can help it. Technologies like SQS (as you mentioned) help with decoupling when combined with Lambda.
SQS guarantees at least once, so if your system should not write duplicates you'll want to account for idempotency in your application.
Also use a dead letter queue (DLQ) to handle any failed actions.
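Since SQS delivers at-least-once, a consumer needs to de-duplicate by message ID before applying side effects. A minimal in-memory sketch of that pattern follows; a real system would replace the set with a durable store (e.g. a DynamoDB conditional write):

```python
def process_once(message_id, payload, apply_change, seen_ids):
    """Apply `apply_change` at most once per message_id.

    `seen_ids` stands in for a durable idempotency store; an in-memory
    set is used here only for illustration.
    Returns True if the side effect ran, False for a duplicate delivery.
    """
    if message_id in seen_ids:
        return False  # duplicate delivery: skip the side effect
    apply_change(payload)
    seen_ids.add(message_id)
    return True


# Usage: the second delivery of "m1" is detected and skipped.
seen = set()
writes = []
process_once("m1", {"amount": 10}, writes.append, seen)
process_once("m1", {"amount": 10}, writes.append, seen)  # duplicate
```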
Ensure any resources residing in your VPC are spread across availability zones.
Use S3, AWS Backup, and RDS snapshots to ensure data is backed up. Most other services have some sort of backup functionality you can enable.
Use autoscaling wherever possible to ensure you're reducing costs.
Build any infrastructure using an IaC tool (CloudFormation or Terraform), and do any provisioning of resources via a tool like Ansible, Puppet, or Chef. Try to follow a pre-baked AMI workflow to ensure it is quick to return to the base server state.

What services of AWS to use for a microservice architecture? [closed]

Closed 2 days ago: opinion-based.
I have a small mobile application which is based on microservice architecture. Two of the microservices are in java, one is in node. My application uses a MySQL database.
Presently I am using a VPS to host my services; all the software (MySQL, Tomcat and PM2) is installed on the same VPS.
Now I am planning to move to AWS and (as I have no prior AWS experience) I am overwhelmed by the services provided by AWS.
Can anyone please help me decide on this?
Since the usage is going to be very low at this point, I have to get this setup with the least monthly costs to be incurred.
For that I am thinking to get one EC2 instance and install all the software in different Docker containers (including the DB). Will this approach work? Or will I have to get a separate RDS instance? Is Docker required? Or can I install all the software directly?
Such questions are opinion-based and the community discourages them, but there are a few things that are clear enough to explain.
Will I have to get another RDS instance
I would not recommend going with a container for the DB: it is hard to scale and to maintain backups in a container, and there is also a risk of losing data if no proper volume mounts were set in the container configuration.
I would recommend going with the RDS free tier, which is free for one year (terms and conditions apply).
It will be easier in the future to upgrade, scale, and maintain backups with RDS.
AWS Free Tier with Amazon RDS
750 hours of Amazon RDS Single-AZ db.t2.micro Instance usage running MySQL, MariaDB, PostgreSQL, Oracle BYOL or SQL Server (running SQL Server Express Edition) – enough hours to run a DB Instance continuously each month
rds-free-tier
I am thinking to get 1 EC2 instance and install all software in different docker containers (including the DB)
At the initial level it's fine to go with one instance, but here is the flow:
Create an ECS cluster
Create an ECR registry and push your image to ECR
Create a task definition for each Docker image
Create a service for each task definition
As mentioned in the comments you can explore EKS, but on AWS I would prefer ECS.
To get started, you can explore gentle-introduction-to-how-aws-ecs-works-with-example.
It's been a while since I asked this question, and though I agree it is more of an opinion-based question, I still think this answer would be a good start for getting into cloud deployments.
Application hosting
For startups and small projects like these, the best approach would be to go with serverless Lambda functions. Though it adds the overhead of structuring the code as Lambda functions, it's worth the effort, as it keeps the cost at almost zero until you start to get some tangible traffic.
Another approach for the application microservices could be Docker, but Docker containers are more about containerized deployment - making sure code runs the same in the prod environment as it does in dev. Rather than that, one should go with a small EC2 instance and deploy the microservices as separate processes (PM2 for Node.js). Though differential scaling would be tough at this point, it doesn't matter: as soon as you start seeing CPU metrics touch the roof, you can start decoupling the most-used microservices onto another machine and put a load balancer in front.
K8s is overkill at this point: with one worker node, even though the control plane is free to use, there is just no point until you have a sizable number of worker nodes.
Database Deployment
Stateful deployments are comparatively trickier, as there is a chance of data loss. The easier route is to go with managed DB hosting at this point, such as AWS Aurora/RDS, or MongoDB Atlas if you plan to use NoSQL. Managing a DB along with backups yourself would be a painful task, especially when you are saving every penny in infra costs.

Why do people use AWS over Docker Containers? [closed]

Closed 1 year ago: opinion-based.
AWS provides services like ElastiCache, Redis, and databases, all charged on an hourly basis. But these services are also available in the form of Docker containers on Docker Hub. All the AWS services listed above use an instance - meaning an independent instance for each database and so on. But what if one started with an EC2 instance and downloaded all the images for all the database dependencies? That would save them a lot of money, right?
I have used Docker before, and it has images for almost all the services AWS provides.
EC2 is not free. You can run, for example, MySQL on an EC2 instance. It will be cheaper than using RDS, but you still need to pay for the compute and storage resources it consumes. Even if you run a database on a larger shared EC2 instance you need to account for its storage and CPU cycles, and you might need more or larger instances to run more tasks there.
(As of right now, in the us-east-1 region, a MySQL db.m5.large instance is US$0.171 per hour or US$895 per year paid up front, plus US$0.115 per GB of capacity per month; the same m5.large EC2 instance is US$0.096 per hour or US$501 per year, and storage is US$0.10 per GB per month. [Assuming 1-year, all-up-front, non-convertible reserved instances.])
There are good reasons to run databases not-in-Docker. Particularly in a microservice environment, application Docker containers are stateless, replicated, update their images routinely, can be freely deleted, and can be moved across hosts (by deleting and recreating them somewhere else). (In Kubernetes/EKS, look at how a Deployment object works.) None of these are true of databases, which are all about keeping state, cannot be deleted, cannot be moved (the data has to come with), and must be backed up.
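For illustration, the stateless properties above are exactly what a Kubernetes Deployment assumes; in this sketch the names and image URI are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3               # identical, disposable copies
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0  # image updated routinely
```

A database fits none of this: you cannot run three interchangeable copies of it or delete and recreate it freely, which is why it usually lives outside the container orchestrator.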
RDS has some useful features. You can change the size of your database instance with some downtime, but no data loss. AWS will keep managed snapshots for you, and it's straightforward (if slow) to create a new database from a snapshot of your existing database. Patch updates to the database are automatically applied for you. You can pay Amazon for these features, in effect, or pay your own DBA to do the same tasks for a database running on an EC2 instance.
None of this is to say you have to use RDS: you do in fact save money on AWS by running the same software on EC2, and it may or may not be in Docker. RDS is a reasonable choice in an otherwise all-Docker world, though. The same basic tradeoffs apply to other services like ElastiCache (for Redis).

Amazon Web Services (AWS) EC2 , EMR, S3 [closed]

Closed 9 years ago: not about a specific programming problem.
I have been learning AWS for quite some time. I would like to confirm the overall picture of what I have learned so far, taking a normal PC as an analogy:
EC2 is similar to the arithmetic and logic unit (ALU) of a PC
EMR is similar to the OS of a PC
S3 is similar to the hard disk of a PC
Please correct me if I am wrong, and explain AWS EC2, EMR, and S3 in comparison to another system/service etc.
(Please don't direct me to Amazon doc links/tutorials, as I have gone through all those and I want to confirm my understanding.)
Thanks in advance
I think your analogies are reasonable from a 10,000 foot view. However, I wouldn't say they are correct since there are a lot of subtleties involved. Let me list a few.
EC2 does handle the compute side of your application, hence it has a role similar to an ALU in a microprocessor. However, there are two major differences.
a) EC2 is not like the ALU because EC2 includes the ability to launch/terminate new compute resources. An ALU by definition is a fixed compute entity, while EC2 by definition is a system for provisioning compute resources. Very different.
b) EC2 is not stateless, but an ALU is. EC2-provided instances have disk, memory, etc. Thus they can carry the entire state of an application; S3 is not a required component. In a computer, the ALU by itself isn't useful: additional memory is required.
EMR to OS: EMR is really just Hadoop, and Hadoop is a task distribution platform. EMR is like an OS in that it does task scheduling. However, a major part of an OS is arbitrating between different app threads, whereas Hadoop is about taking a big data problem and running it in a distributed fashion across many computers. It does no resource arbitration and works on one problem at a time. Thus it's not really like an OS. Apache YARN, to me, is closer to an OS, by the way.
Your S3 analogy is also partially correct. AWS has many types of storage. There is ephemeral storage, which is like memory and goes away when an instance dies. There are EBS volumes, which are permanent disks attached to instances (or sitting idle) with data on them. S3 is the third type of storage, which is like web storage: you can upload files to S3 and access them, so S3 is very much like a remote disk. To complete the picture, AWS also has Glacier, which is archival storage even more distant than S3.
Hope this helps.