I am currently looking at the possibility of using AWS to scale up our infrastructure. I am looking for the best way to set up an application that runs different computational pipelines on data provided by the user. I have already seen that it is possible to create on-demand clusters that use containers to run the analyses that are currently available (already predefined and packaged in containers).
I am looking for advice on which Amazon services are typically used to launch computations (or containers) once they have been selected by a user in a web app and stored in the backend.
Thanks
I need to run a data transformation pipeline that is composed of several scripts living in distinct projects (separate Python repos).
I am thinking of using Compute Engine to run these scripts in VMs when needed as I can manage resources required.
I need to be able to orchestrate these scripts, in the sense that I want to run steps sequentially and sometimes asynchronously.
I see that GCP provides a Workflows component which seems to suit this case.
I am thinking of creating a specific project to orchestrate the executions of scripts.
However I cannot see how I can trigger the execution of my scripts which will not be in the same repo as the orchestrator project. From what I understand of GCE, VMs are only created when scripts are executed and provide no persistent HTTP endpoints to be called to trigger the execution from elsewhere.
To illustrate, let's say I have two projects, step_1 and step_2, which contain separate steps of my data transformation pipeline.
I would also have a project, orchestrator, whose only purpose is to trigger step_1 and step_2 sequentially in VMs with GCE. This project would not have access to the code repos of the two former projects.
What would be the best practice in this case? Should I use components other than GCE and Workflows for this, or is there a way to trigger scripts in GCE from an independent orchestration project?
One possible solution would be to not use GCE (Google Compute Engine) but instead create Docker containers that contain your task steps. These would then be registered with Cloud Run. Cloud Run spins up Docker containers on demand and charges you only for the time you spend processing a request. When the request ends, you are no longer charged, so you only pay for the resources you actually consume. Various events can cause a request in Cloud Run, but the most common is a REST call. With this in mind, assume that your Python code is packaged in a container and exposed through a REST server (e.g. Flask). Effectively you have created "microservices". These services can then be orchestrated by Cloud Workflows. The microservices are invoked through REST endpoints, which can be Internet addresses protected by authorization. This would allow the microservices (tasks/steps) to be located in separate GCP projects, and the orchestrator would see them as distinct callable endpoints.
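As a rough illustration of such a microservice, here is a minimal sketch of the HTTP entry point one step container might expose. It uses Python's standard-library `http.server` as a stand-in for Flask so that it has no dependencies; `run_step`, its payload shape, and the `records` key are hypothetical placeholders for the real step_1 code. Cloud Run tells the container which port to listen on through the `PORT` environment variable.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_step(payload: dict) -> dict:
    """Hypothetical stand-in for the real step_1 transformation."""
    records = payload.get("records", [])
    return {"step": "step_1", "processed": len(records)}


class StepHandler(BaseHTTPRequestHandler):
    """Accepts a JSON POST, runs the step, and returns the result as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length) if length else b"{}"
        try:
            result = run_step(json.loads(raw))
            status = 200
        except (ValueError, AttributeError) as exc:
            # Invalid JSON, or a JSON value that is not an object.
            result = {"error": str(exc)}
            status = 400
        body = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    # Cloud Run provides the listening port in the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), StepHandler).serve_forever()
```

The container's entry point would call `main()`. The orchestrator only needs the service URL, not access to the step's repository.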
Other potential solutions to look at would be GKE (Kubernetes) and Cloud Composer (Apache Airflow).
If you DO wish to stay with Compute Engine, you can still do that using Shared VPC. Shared VPC would allow distinct projects to have network connectivity between each other, and you could use Private Catalog to have the GCE instances advertise to each other. You could then have a GCE instance choreograph the steps or, again, choreograph them through Cloud Workflows. We would have to check that Cloud Workflows supports parallel items ... I do not believe that as of the time of this post it does.
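For the Cloud Run plus Cloud Workflows option, the orchestrator can be little more than a workflow definition that calls each step's endpoint in turn. The sketch below builds such a definition in Python and prints it as JSON, which Workflows accepts alongside YAML. The step names and URLs are hypothetical, and the `OIDC` auth setting assumes the workflow's service account is allowed to invoke both services.

```python
import json
from typing import Optional


def http_step(name: str, url: str, result_var: str,
              body: Optional[dict] = None) -> dict:
    """One workflow step that POSTs to a step's endpoint and stores the response."""
    args = {"url": url, "auth": {"type": "OIDC"}}
    if body is not None:
        args["body"] = body
    return {name: {"call": "http.post", "args": args, "result": result_var}}


def build_workflow(step_1_url: str, step_2_url: str) -> dict:
    """Sequential pipeline: step_1, then step_2 fed with step_1's response body."""
    return {
        "main": {
            "steps": [
                http_step("run_step_1", step_1_url, "step_1_result"),
                http_step(
                    "run_step_2",
                    step_2_url,
                    "step_2_result",
                    body={"input": "${step_1_result.body}"},
                ),
                {"finish": {"return": "${step_2_result.body}"}},
            ]
        }
    }


if __name__ == "__main__":
    # Hypothetical Cloud Run URLs; replace with the real service URLs.
    workflow = build_workflow(
        "https://step-1-example.a.run.app",
        "https://step-2-example.a.run.app",
    )
    print(json.dumps(workflow, indent=2))
```

Because each step is addressed only by URL, the two services can live in different GCP projects from the workflow itself.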
This is a common request: to organize automation into its own project. You can set up a service account that spans multiple projects.
See a tutorial here: https://gtseres.medium.com/using-service-accounts-across-projects-in-gcp-cf9473fef8f0
On top of that, you can also consider having Workflows in both the orchestrator and the sub-level projects. This way the orchestrator Workflow can call another Workflow. The job then runs easily and stays encapsulated in the project that holds the code and the workflow body, and only the trigger comes from the other project.
I'm trying to figure out what AWS services I need for the mobile application I'm working on with my startup. The application should go into the App Store and Play Store later this year, so we need a "best-practice" solution for our case. It must be highly scalable, so that it remains stable and fast even when there are thousands of requests to the server. We may also want to deploy a website on it.
Currently we are using Uberspace servers with a Node.js application and MongoDB running on them. Everything works fine, but for the release version we want to go with AWS. What we need is something we can run Node.js / MongoDB (or something similar to MongoDB) on, and something to store images, such as profile pictures, that can be requested by the user.
I have already read some information about AWS on their website, but that didn't help a lot. There are so many services and we don't know which of these fit our needs.
A friend told me to just use AWS EC2 for the Node.js server + MongoDB and S3 to store images, but on some websites I have read that it is better to use this architecture:
We would be glad if there is someone who can share his/her knowledge with us!
To run code: you can use Lambda, but be careful. The benefit is that you don't have to worry about servers; the downside is that Lambda is sometimes unreasonably slow. If you need it to be really fast, run your code on EC2 with auto-scaling. If you tune it properly, it works like a charm.
To store data: DynamoDB if you want it really fast (single-digit milliseconds regardless of load and DB size) and in line with best practices. It REQUIRES a proper schema or it will cost you a fortune; otherwise use MongoDB on EC2.
If you need an RDBMS, use RDS (benefits: scalability, availability, no maintenance headaches).
Cache: they have both Redis and memcached.
S3: to store static assets.
I do not suggest CloudFront; there are other CDNs on the market with better prices and features.
API gateway: yes, if you have an API.
Depending on your app, you may need SQS.
Cognito is a good service if you want to authenticate your users using Google/Facebook/etc.
CloudWatch: if you're a metrics addict then it's not for you; perhaps a standalone solution on EC2 will be better. But for most people CloudWatch is absolutely OK.
Create all necessary alarms (CPU overload etc).
You should use roles to allow access to your S3/DB from Lambda/AWS.
You should not use the root account but create a separate user instead.
Create a billing alarm: you'll know if you're about to break your budget.
Create Lambda functions to back up your EBS volumes (and whatever else you may need to back up). There's no problem if a backup starts a second late, so Lambda is OK here.
Run Trusted Advisor now and then.
It would be better to set all of this up as a CloudFormation stack: you'll be able to deploy the same infrastructure with ease in another region if/when needed, and infrastructure as code is also relatively easier to manage than infrastructure that was built manually.
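To make the CloudFormation suggestion concrete, here is a minimal sketch that generates a template with two of the pieces mentioned above: an S3 bucket for images and a DynamoDB table. The logical names (`ProfileImagesBucket`, `UsersTable`), the table name, and the `userId` key are assumptions chosen for illustration, not anything from the original setup.

```python
import json


def build_template(table_name: str = "Users") -> dict:
    """Return a small CloudFormation template as a plain dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: image bucket and user table for a mobile backend",
        "Resources": {
            # Bucket for static assets such as profile pictures.
            "ProfileImagesBucket": {"Type": "AWS::S3::Bucket"},
            # On-demand DynamoDB table keyed by a hypothetical userId attribute.
            "UsersTable": {
                "Type": "AWS::DynamoDB::Table",
                "Properties": {
                    "TableName": table_name,
                    "BillingMode": "PAY_PER_REQUEST",
                    "AttributeDefinitions": [
                        {"AttributeName": "userId", "AttributeType": "S"}
                    ],
                    "KeySchema": [
                        {"AttributeName": "userId", "KeyType": "HASH"}
                    ],
                },
            },
        },
        "Outputs": {
            # Ref on an S3 bucket resolves to the generated bucket name.
            "BucketName": {"Value": {"Ref": "ProfileImagesBucket"}}
        },
    }


if __name__ == "__main__":
    with open("template.json", "w", encoding="utf-8") as handle:
        json.dump(build_template(), handle, indent=2)
```

The resulting file could then be deployed with something like `aws cloudformation deploy --template-file template.json --stack-name my-app-backend`, where the stack name is again only an example. Deploying to another region is a matter of running the same command with a different region setting.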
If you want a highly scalable application, you may need to use a serverless architecture with AWS Lambda.
There is a framework called Serverless that helps you manage and organize all your Lambda functions and put them behind API Gateway.
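For a sense of what a function behind API Gateway looks like, here is a minimal handler sketch. It is written in Python to keep the examples in this thread in one language, although the question's stack is Node.js; the `name` query parameter and the greeting are made up. The return shape (`statusCode`, `headers`, `body` as a string) is what API Gateway's Lambda proxy integration expects.

```python
import json


def lambda_handler(event, context):
    """Handle an API Gateway proxy request and return a JSON response."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Because the handler is a plain function, it can be called locally with a hand-built event dictionary before anything is deployed.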
For storage you can use AWS EC2 and install MongoDB, or you can go with AWS DynamoDB as your NoSQL storage.
If you want a frontend for both web and mobile, you may want to look at the React Native approach.
I hope I've been helpful.
I'm developing a prototype IoT application which does the following:
Receives/stores data from sensors.
Provides a web application with a web-based IDE for users to deploy simple JavaScript/Python scripts, which get executed in Docker containers.
Data from the sensors gets streamed to these containers.
User programs can use this data to do analytics, monitoring etc.
The logs of these programs are shown to the user in the web app.
Current Architecture and Services
Using one AWS EC2 instance. I chose EC2 because I was trying to figure out the architecture.
Stack is Node.js, RabbitMQ, Express, MySQL, MongoDB and Docker.
I'm not interested in using AWS IoT services like AWS IoT and Greengrass
I've ruled out Heroku since I'm using other AWS services.
Questions and Concerns
My goal is prototype development for a Beta release to a set of 50 users
(hopefully someone else will help/work on a production release)
As far as possible, I don't want to spend a lot of time migrating between services since developing the product is key. Should I stick with EC2 or move to Beanstalk?
If I stick with EC2, what is the best way to handle small-medium traffic? Use one large EC2 machine or many small micro instances?
What is a good way to manage containers? Is it worth using Swarm for container management? What if I have to use multiple instances?
I also have small scripts that hold sensor status information, which is needed by the web app and other services. If I move to multiple instances, how can I make these scripts available to multiple machines?
The above question also applies to servers, message buses, databases etc.
My goal is certainly not production release. I want to complete the product, show I have users who are interested and of course, show that the product works!
Any help in this regard will be really appreciated!
If you want to manage Docker containers with the least hassle in AWS, you can use Amazon ECS to deploy your containers, or else go with Beanstalk. You also don't need to use Swarm in AWS; ECS will do the job for you.
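As a hedged sketch of the ECS route, the snippet below builds the parameters for registering a task definition, which is how ECS is told what container to run. The family name, image, memory size, and environment variable are hypothetical; `register` assumes the third-party `boto3` library and configured AWS credentials unless a client is passed in.

```python
from typing import Dict, Optional


def build_task_definition(family: str, image: str, memory_mib: int = 256,
                          env: Optional[Dict[str, str]] = None) -> dict:
    """Build keyword arguments for ECS register_task_definition."""
    environment = [
        {"name": key, "value": value}
        for key, value in sorted((env or {}).items())
    ]
    return {
        "family": family,
        "containerDefinitions": [
            {
                "name": family,
                "image": image,
                "memory": memory_mib,  # hard memory limit in MiB
                "essential": True,
                "environment": environment,
            }
        ],
    }


def register(task_definition: dict, ecs_client=None) -> dict:
    """Register the task definition; creates a boto3 ECS client if none is given."""
    if ecs_client is None:
        import boto3  # third-party; assumed to be installed and configured
        ecs_client = boto3.client("ecs")
    return ecs_client.register_task_definition(**task_definition)
```

A user script container could then be described with, for example, `build_task_definition("user-script", "python:3.11-slim", env={"SENSOR_QUEUE": "sensors"})` and started as a task on the cluster.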
It's always better to scale out rather than scale up, using small to medium-size EC2 instances. However, the challenge you will face here is managing and scaling the underlying EC2 instances as well as your Docker containers. This may lead you to use large EC2 instances so that you can set EC2 scaling aside and focus on Docker scaling (which will add costs for you).
Another alternative for the web application part is to use the AWS Lambda and API Gateway stack with the Serverless Framework, which needs the least operational overhead and comes with DevOps tools.
You may keep your web app on Heroku and run your IoT server in AWS EC2 or AWS Lambda. Heroku runs on AWS itself, so this split setup will not affect performance. You can ease the inconvenience of "sitting on two chairs" by writing a Terraform script which provisions both the EC2 instance and the Heroku app and ties them together.
Alternatively, you can use the Dockhero add-on to run your IoT server in a Docker container alongside your Heroku app.
PS: I'm a Dockhero maintainer
Is there a way to list/view (graphically?) all created resources on Amazon? All the DBs, users, pools etc.
The best way I can think of is to run each of the aws <resource> ls CLI commands in a bash file.
What would be great would be to have a graphical tool that showed all the relationships. Is anyone aware of such a tool?
UPDATE
I decided to make my own start on this. Currently it's just on the CLI, but it might move to graphical output. Help needed!
https://github.com/QuantumInformation/aws-x-ray
No, it is not possible to easily list all services created on AWS.
Each service has a set of API calls and will typically have Describe* calls that can list resources. However, these commands would need to be issued to each service individually and they typically have different syntax.
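A minimal sketch of that per-service approach is shown below. It assumes the third-party `boto3` library and configured credentials when no client factory is passed in. The table of calls is only a small illustrative selection; each entry names a service, a list/describe method, and the response key that holds the results. Pagination is ignored here, so large accounts would return only the first page of each call.

```python
# (service name, client method, key in the response that holds the items)
DESCRIBE_CALLS = (
    ("ec2", "describe_instances", "Reservations"),
    ("rds", "describe_db_instances", "DBInstances"),
    ("s3", "list_buckets", "Buckets"),
    ("iam", "list_users", "Users"),
)


def list_resources(client_factory=None, calls=DESCRIBE_CALLS) -> dict:
    """Call each service's list/describe method and collect the raw results."""
    if client_factory is None:
        import boto3  # third-party; assumed to be installed and configured
        client_factory = boto3.client
    inventory = {}
    for service, method, key in calls:
        client = client_factory(service)
        response = getattr(client, method)()
        inventory[f"{service}.{method}"] = response.get(key, [])
    return inventory


if __name__ == "__main__":
    for call, items in list_resources().items():
        print(f"{call}: {len(items)} item(s)")
```

Every additional service needs its own row in the table, which illustrates the point above: there is no single call that covers everything.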
There are third-party services (eg Kumolus) that offer functionality to list and visualize services but they are typically focussed on Amazon EC2 and Amazon VPC-based services. They definitely would not go 'into' a database to list DB users, but they would show Amazon RDS instances.
I am new to AWS (Amazon Web Services) as well as to our own custom boto-based Python deployment scripts, but wanted to ask for advice or best practices for a simple configuration management task. We have a simple web application with configuration data for several different backend environments, selected by a Java system property defined with -D on the command line. Sometimes we need to switch from one backend environment to another because of maintenance or deployment schedules of our backend services.
The current procedure requires python scripts to completely destroy and rebuild all the virtual infrastructure (load balancers, auto scale groups, etc.) to redeploy the application with a change to the command line parameter. On a traditional server infrastructure, we would log in to the management console of the container, change the variable, bounce the container, and we're done.
Is there a best practice for this operation on AWS environments, or is the complete destruction and rebuilding of all the pieces the only way to accomplish this task in an AWS environment?
It depends on what resources you have to change. AWS is evolving every day at a fast pace. I would suggest you take a look at the AWS API for the resources you need to deal with and check whether you can change a resource without destroying it.
For example: today you cannot change a launch configuration once it is created. You must delete it and create it again with the new configuration. But if you have an auto scaling group attached to that launch configuration, you will have to delete the auto scaling group as well, and so on.
IMHO I see no problems with your approach, but as I believe there is always room for improvement, I think you can refactor it with the help of the AWS API documentation.
HTH
I think I found the answer to my own question. I know the interface to AWS is constantly changing, and I don't think this functionality is available yet in the Python boto library, but the ability I was looking for is best described as "Modifying Attributes of a Stopped Instance" with --user-data as being the attribute in question. Documentation for performing this action using HTTP requests and the command line interface to AWS can be found here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_ChangingAttributesWhileInstanceStopped.html
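For readers on the newer boto3 library (an assumption; the post above refers to the original boto), the stop, modify, start sequence might look roughly like the sketch below. The instance ID and the user-data string are hypothetical. The function takes the EC2 client as a parameter, for example `boto3.client("ec2")`.

```python
def change_user_data(ec2, instance_id: str, user_data: str) -> None:
    """Stop an instance, replace its user data, and start it again.

    `ec2` is expected to be a boto3 EC2 client. User data can only be
    modified while the instance is stopped.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until the instance has fully stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        UserData={"Value": user_data.encode("utf-8")},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```

A call such as `change_user_data(ec2, "i-0123456789abcdef0", "BACKEND_ENV=staging")` would then switch the value that the instance's startup script reads, without rebuilding the load balancers or auto scaling groups. Whether the application picks up the new value depends on how its startup script consumes user data.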