AWS containerised apps and database on same Redshift cluster - amazon-web-services

I have a simple question for someone with experience with AWS, but I am getting a little confused by the terminology and don't know which node type to purchase.
At my company we currently have a Postgres db that we insert into continuously.
We probably insert ~600M rows a year at the moment but would like to be able to scale up.
Each row is basically a timestamp, two floats, one int and one enum type.
So the workload is write-intensive, but with constant small reads as well.
(There will be the occasional large read.)
There are also two services that need to be run (both Rust based):
1. We have a Rust application that abstracts the db data, allowing clients to access it through a RESTful interface.
2. We have a Rust app that gets the data to import from thousands of individual devices through Modbus.
These devices are on a private mobile network. Can I set up AWS cluster nodes to be able to access a private network through a VPN?
We would like to move to Amazon Redshift but I am confused by the node types.
Amazon recommends choosing RA3 or DC2.
If we chose ra3.4xlarge, that means we get one cluster of nodes, right?
Can I run our Rust services on that cluster along with a number of Redshift database instances?
I believe AWS uses Docker, and I think I could containerise my services easily.
Or am I misunderstanding things, and when you purchase a Redshift cluster you can only run Redshift on it, and have to get a different one for containerised applications, possibly an EC2 cluster?
Can anyone recommend a better fit for scaling this workload ?
Thanks

I would not recommend Redshift for this application, and I'm a Redshift guy. Redshift is designed for analytic workloads (lots of reads and few, large writes). Constant updates are not what it is designed for.
I would point you to Postgres RDS as the best fit. It has a RESTful API interface already. This will be more of the transactional database you are looking for, with little migration change.
When your data gets really large (TB+) you can add Redshift to the mix to quickly perform the analytics you need.
Just my $.02

Redshift is a managed service: you don't get any access to it for installing stuff, nor is there any possibility of installing/running custom software of your own.
Or am I misunderstanding things and when you purchase a Redshift cluster you can only run Redshift on this cluster
Yes, you don't run stuff - AWS manages the cluster and you run your analytics/queries etc.
have to get a different one for containerised applications, possibly an ec2 cluster ?
Yes, you could possibly make use of EC2, running the orchestration yourself, or make use of ECS/Fargate/EKS, depending on your budget, how skilled your team members are, etc.

Related

Optimizing latency between application server (EC2) and RDS

Here's how the story goes.
We started transforming a monolith, single-machine, e-commerce application (Apache/PHP) to cloud infrastructure. Obviously, the application and the database (MySQL) were on the same machine.
We decided to move to AWS. As the first step of the transformation, we decided to split the database and the application: hosting the application on a c4.xlarge machine, and the database on RDS Aurora MySQL on a db.r5.large machine, with default options.
This setup performed well; the database performance in particular improved noticeably.
Unfortunately, when traffic spiked, we started experiencing long response times. It looked like RDS, although really fast at executing queries, wasn't returning results fast enough over the network to the EC2 machine.
That was our conclusion after an in-depth analysis of the setup, including Apache/MySQL/PHP tuning parameters. The delayed response time was definitely due to the network latency between the EC2 and RDS/Aurora machines, both being in the same region.
Before adding additional resources (ex: ElastiCache etc) we'd first like to look into any default configuration we can play around with to solve this problem.
What do you think we missed there?
One of the biggest strengths of the cloud is scalability, and you should always design your application to utilise it. It sounds like your RDS instance is getting choked by the number of requests more than by the processing time of the queries. So rather go with more small instances behind load balancing than one big instance doing all the work. And with load balancers you get away from a single point of failure, since you can have replicas of your database and they can even be placed in different AZs.
Here is a blogpost you can read on the topic:
https://aws.amazon.com/blogs/database/scaling-your-amazon-rds-instance-vertically-and-horizontally/
Good luck on your AWS journey.
The best answer to your question is to use read replicas, but remember that only read requests can be sent to your read replicas, so you would need to design your application that way.
Also, for some cost savings, you should try Aurora Serverless.
One more option is passing traffic between EC2 and RDS through a private network rather than using the public internet; connecting EC2 to RDS over the public internet can be one of the mistakes that is happening here.
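To show what "designing the application that way" can look like, here is a minimal sketch of read/write splitting at the application level; the endpoint names are placeholders for your own writer and replica hosts:

```python
import random

# Placeholder endpoints -- substitute your real RDS writer and replica hosts.
WRITER = "mydb.cluster-xxxx.us-east-1.rds.amazonaws.com"
READ_REPLICAS = [
    "mydb-replica-1.xxxx.us-east-1.rds.amazonaws.com",
    "mydb-replica-2.xxxx.us-east-1.rds.amazonaws.com",
]

def pick_endpoint(sql: str) -> str:
    """Route plain SELECTs to a random replica, everything else to the writer."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    return random.choice(READ_REPLICAS) if first_word == "SELECT" else WRITER
```

Note that statements such as SELECT ... FOR UPDATE still belong on the writer, and replicas lag slightly behind, so read-your-own-writes flows need to stay on the writer too.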

70 TB Cassandra migration to AWS

We have a 70TB cluster which has around 200 keyspaces, and we are planning to move this to AWS. A few approaches we are considering:
Replace a node in one of the clusters with a node in AWS, and do that one by one for all nodes.
Create a new cluster in AWS, bulk copy each keyspace, do dual writes to both clusters, and cut over during downtime.
Any other better ways to do this? Could we use the AWS as a new DC and change one keyspace at a time?
Yes, for a live migration you can use a hybrid cloud model and create a new DC in AWS. This is probably the best approach if you want to migrate data without downtime and you can do this keyspace-by-keyspace to manage the I/O streaming.
This blog article by Alain Rodriguez on Cassandra Data Center Switch provides a walkthrough of how to do this in great detail.
Using AWS Snowball is a faster and cheaper approach if downtime is an option.
You can use AWS as a new cluster, but you need to be careful. Not all Cassandra sstable versions are compatible with each other, so you need to verify the compatibility between sstables. Another issue is that you can cause high load on your "old" cluster.
So I highly recommend that you start with these parameters very low, to test the capacity of both your cluster and the AWS cluster:
compaction_throughput_mb_per_sec (Default 16)
stream_throughput_outbound_megabits_per_sec (Default 200)
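As a sketch, the corresponding cassandra.yaml settings turned down for the initial streaming (the values here are illustrative starting points, not recommendations; you can also adjust them live with `nodetool setcompactionthroughput` and `nodetool setstreamthroughput` and raise them gradually while watching load on the source cluster):

```yaml
# cassandra.yaml -- illustrative low starting values while streaming to AWS
compaction_throughput_mb_per_sec: 8                # default 16
stream_throughput_outbound_megabits_per_sec: 100   # default 200
```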
Bootstrapping a new AWS node inside your current cluster is not a good idea, because every time you bootstrap a new node you tell Cassandra to redistribute the keys across the cluster, and you will be left without a "plan B" if anything goes wrong.
Another good solution is to make a separate cluster in AWS (without connecting them) and move the data with Spark. Just moving the data without transformation is very simple, and you are in control of the process.

Dynamodb vs Redis

We're using AWS, and considering using DynamoDB or Redis for our new service.
Below are our service's characteristics:
Inserts/Deletes occur at between hundreds and thousands per minute, and this will grow later.
We don't need quick search, we only need to find a value by key.
Data should not be lost.
There is other data that doesn't have a lot of Inserts/Deletes, unlike 1.
I'm worried about the Redis server going down.
If Redis fails, our data will be lost.
That's why I'm considering Amazon DynamoDB.
Because DynamoDB is NoSQL, Inserts/Deletes are fast (slower than Redis, but we don't need that much speed), and it stores data permanently.
But I'm not sure that my thinking is right.
If I'm thinking wrong or missing another important point, I would appreciate it if you could point it out.
Thanks.
There are two types of Redis deployment in the AWS ElastiCache service:
Standalone
Multi-AZ cluster
With a standalone installation it is possible to turn on persistence for a Redis instance, so the service can recover data after a reboot. But in some cases, like underlying hardware degradation, AWS can migrate Redis to another instance and lose the persistence log.
In a Multi-AZ cluster installation it is not possible to enable persistence; only replication occurs. In case of failure it takes time to promote a replica to master. Another way is to use the master and slave endpoints in the application directly, which is complicated. In case of a failure which causes a restart of both Redis nodes at the same time, it is possible to lose all data and the cluster configuration too.
So, in general, Redis doesn't provide high durability of data, while giving you very good performance.
DynamoDB is highly available and durable storage for your data. Internally it replicates data across several availability zones, so it is highly available by default. It is also a fully managed AWS service, so you don't need to care about clusters, nodes, monitoring etc., which is considered the right cloud way.
DynamoDB charges per R/W operation (on-demand or reserved capacity model) and for the amount of stored data. It may be really cheap for testing the service, but much more expensive under heavy load. You should carefully analyze your workload and calculate total service costs.
As for performance: DynamoDB is an SSD database compared to Redis's in-memory store, but it is possible to use DAX, an in-memory read cache for DynamoDB, as an accelerator under heavy load. So you won't be strictly limited by raw DynamoDB performance.
Here is the link to the DynamoDB pricing calculator, which is one of the most complicated parts of using the service: https://aws.amazon.com/dynamodb/pricing/
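It's worth putting rough numbers on that workload analysis. A back-of-the-envelope sketch, where the unit prices are illustrative assumptions (check the pricing page for your region; these change):

```python
# Illustrative on-demand prices (USD) -- assumptions, not current quotes.
PRICE_PER_MILLION_WRITES = 1.25   # per million write request units
PRICE_PER_MILLION_READS = 0.25    # per million read request units
PRICE_PER_GB_MONTH = 0.25         # storage

def monthly_cost(writes: int, reads: int, stored_gb: float) -> float:
    """Rough monthly bill for an on-demand DynamoDB table."""
    return (writes / 1e6 * PRICE_PER_MILLION_WRITES
            + reads / 1e6 * PRICE_PER_MILLION_READS
            + stored_gb * PRICE_PER_GB_MONTH)

# e.g. ~2000 inserts/deletes per minute sustained over a 30-day month
writes_per_month = 2000 * 60 * 24 * 30
print(round(monthly_cost(writes_per_month, writes_per_month // 10, 50), 2))
```

Even a crude model like this makes it obvious how quickly a write-heavy workload dominates the bill compared to storage.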

How to scale horizontally Amazon RDS instance?

How do I scale an Amazon RDS instance horizontally? EC2 with a load balancer + autoscaling is extremely easy to implement, but what if I want to scale Amazon RDS?
I can upgrade my RDS instance to a more powerful instance, or I can create a read replica and direct SELECT queries to it. But this way I don't really scale anything if I have a read-oriented web application. So, can I create RDS read replicas with autoscaling and balance them with a load balancer?
You can use HAProxy to load-balance Amazon RDS Read Replicas. Check this http://harish11g.blogspot.ro/2013/08/Load-balancing-Amazon-RDS-MySQL-read-replica-slaves-using-HAProxy.html.
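A minimal sketch of what such an haproxy.cfg section can look like (the replica endpoints are placeholders for your own hosts):

```
listen mysql-read-pool
    bind *:3306
    mode tcp
    balance roundrobin
    server replica1 mydb-replica-1.xxxx.rds.amazonaws.com:3306 check
    server replica2 mydb-replica-2.xxxx.rds.amazonaws.com:3306 check
```

Your application then points its read connections at the HAProxy host instead of at a single replica.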
Hope this helps.
Note RDS covers several database engines: MySQL, PostgreSQL, Oracle, MSSQL.
Generally speaking, you can scale up (larger instance), use readonly databases, or shard. If you are using mysql, look at AWS Aurora. Think about using the database optimally- perhaps combining with memcached or Redis (both available under AWS Elasticache). Think about using a search engine (lucene, elasticsearch, cloudsearch).
Some general resources:
http://highscalability.com/
https://gigaom.com/2011/12/06/facebook-shares-some-secrets-on-making-mysql-scale/
If you are using PostgreSQL and have a workload that can be partitioned by a certain key and doesn't require complex transactions, then you could take a look at the pg_shard extension. pg_shard lets you create distributed tables that are sharded across multiple servers. Queries on the distributed table will be transparently routed to the right shard.
Even though RDS doesn't have the pg_shard extension installed, you can set up one or more PostgreSQL servers on EC2 with the pg_shard extension and use RDS nodes as worker nodes. The pg_shard node only needs to store a tiny bit of metadata, which can be backed up on one of the worker nodes, so it is relatively low maintenance and can be scaled out to accommodate higher query rates.
A guide with a link to a CloudFormation template to set everything up automatically is available at: https://www.citusdata.com/blog/14-marco/178-scaling-out-postgresql-on-amazon-rds-using-masterless-pg-shard
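To illustrate the partitioning idea, here is a toy sketch of hash-based shard routing; the worker endpoints are hypothetical, and in the setup above pg_shard performs this routing for you inside PostgreSQL:

```python
import hashlib

# Hypothetical worker endpoints -- in the setup described above these would
# be the RDS instances acting as pg_shard worker nodes.
WORKERS = [
    "worker-0.xxxx.rds.amazonaws.com",
    "worker-1.xxxx.rds.amazonaws.com",
    "worker-2.xxxx.rds.amazonaws.com",
]

def shard_for(partition_key: str) -> str:
    """Hash the partition key to deterministically pick one worker."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return WORKERS[int(digest, 16) % len(WORKERS)]
```

Because the mapping is deterministic, every query for the same key lands on the same shard, which is what makes transparent routing possible.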

need some guidance on usage of Amazon AWS

Every once in a while I read/hear about AWS, and now I have tried reading the docs.
But such docs seem to be written for people who already know which AWS service they need and are only searching for how it can be used.
So, to understand AWS better, I will sketch a hypothetical web application with a few questions.
The app's purpose is to modify content like videos or images. So a user has some kind of web interface where he can upload his files and do some settings, and a server grabs the file and modifies it (e.g. re-encoding). The service also extracts the audio track of a video and tries to index the spoken words so the customer can search within his videos. (Well, it's just hypothetical.)
So my questions:
Given my own domain 'oneofmydomains.com', is it possible to host the complete web interface on AWS? I thought about using GWT to create the interface and just delivering the JS/images via AWS, but with which service, Simple Storage? What about some kind of index.html, is an EC2 instance needed to host a web server which has to run 24/7, causing costs?
Now the user has the interface with a login form; is it possible to manage logins with an AWS service? Here I also think about an EC2 instance hosting a database, but it would also cause costs and I'm not sure if there is a better way.
The user has logged in and uploads a file. Which storage solution could be used to save the customer's original and modified content?
Now the user wants to browse the status of his uploads; this means I need some kind of ACL, so that the customer only sees his own files. Do I need to use a database (e.g. on EC2) for this, or does Amazon provide some kind of ACL, so the GWT web interface will be secure without any EC2?
The customer's files are re-encoded and the audio track is indexed, and now he wants to search for a video. Which service could be used to create and maintain the index for each customer?
Hope someone can give a few answers so I understand better how one could use AWS.
thx!
Amazon AWS offers a whole ecosystem of services which should cover all aspects of a given architecture, from hosting to data storage, or messaging, etc. Whether they're the best fit for purpose will have to be decided on a case by case basis. Seeing as your question is quite broad I'll just cover some of the basics of what AWS has to offer and what the different types of services are for:
EC2 (Elastic Compute Cloud)
Amazon's cloud solution, which is basically the same as older virtual machine technology, but the 'cloud' offers additional bells and whistles such as automated provisioning, scaling, billing etc.
you pay for what you use (by the hour); a basic instance (single CPU, 1.7GB RAM) would probably cost you just under $3 a day if you run it 24/7 (on a Windows instance, that is)
there are a number of different OSes to choose from, including Linux and Windows; Linux instances are cheaper to run as they don't carry the license cost associated with Windows
once you've set up the server the way you want it, including any server updates/patches, you can create your own AMI (Amazon Machine Image) which you can then use to bring up another identical instance
however, if all your HTML is baked into the image it'll make updates difficult, so the normal approach is to include a service (a Windows service, for instance) which pulls the latest deployment package from a storage service (see S3 later) and updates the site at start-up and at intervals
there's the Elastic Load Balancer (which has its own cost but only one is needed in most cases) which you can put in front of all your web servers
there's also the CloudWatch (again, extra cost) service which you can enable on a per-instance basis to help you monitor the CPU, network in/out, etc. of your running instances
you can set up AutoScalers which can automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization is less than 50% for 5 mins, bring up 1 instance at a time if average CPU goes beyond 70% for 5 mins
you can use the instances as web servers, use them to run a DB, or a Memcache cluster, etc. choice is yours
typically, I wouldn't recommend having Amazon instances talk to a DB outside of Amazon because the round trip is much longer; the usual approach is to use SimpleDB (see below) as the database
the AmazonSDK contains enough classes to help you write some custom monitor/scaling service if you ever need to, but the AWS console allows you to do most of your configuration anyway
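The scale-up/scale-down rule from the AutoScaler bullet above can be sketched as a simple decision function; the thresholds are the example values from that bullet, and a real AutoScaler would also respect minimum/maximum group sizes and cooldown periods:

```python
from statistics import mean

SCALE_UP_CPU = 70.0    # % average over the window, example value from above
SCALE_DOWN_CPU = 50.0

def scaling_decision(cpu_samples):
    """Return +1 (launch an instance), -1 (terminate one), or 0 (hold),
    given CPU utilisation samples from the last 5 minutes."""
    avg = mean(cpu_samples)
    if avg > SCALE_UP_CPU:
        return 1
    if avg < SCALE_DOWN_CPU:
        return -1
    return 0
```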
SimpleDB
Amazon's non-relational, key-value data store. Compared to a traditional database you tend to pay a penalty on per-query performance, but you get high scalability without having to do any extra work.
you pay for usage, i.e. how much work it takes to execute your query
extremely scalable by default; Amazon scales up SimpleDB instances based on traffic without you having to do anything, and without you having any control, for that matter
data are partitioned into 'domains' (equivalent to a table in a normal SQL DB)
data are non-relational; if you need a relational model then check out Amazon RDS. I don't have any experience with it, so I'm not the best person to comment on it.
you can still execute SQL-like queries against the database, usually through some plugin or tool; Amazon doesn't provide a front end for this at the moment
be aware of 'eventual consistency': data are duplicated on multiple instances after Amazon scales up your database, and synchronization is not guaranteed when you do an update, so it's possible (though highly unlikely) to update some data, read it back straight away, and get the old data back
there are 'Consistent Read' and 'Conditional Update' mechanisms available to guard against the eventual consistency problem; if you're developing in .Net, I suggest using the SimpleSavant client to talk to SimpleDB
S3 (Simple Storage Service)
Amazon's storage service, again, extremely scalable, and safe too - when you save a file on S3 it's replicated across multiple nodes so you get some DR ability straight away.
you pay for storage and data transfer
files are stored against a key
you create 'buckets' to hold your files, and each bucket has a unique url (unique across all of Amazon, and therefore S3 accounts)
CloudBerry S3 Explorer is the best UI client I've used in Windows
using the AmazonSDK you can write your own repository layer which utilizes S3
Sorry if this is a bit long-winded, but those are the three most popular web services that Amazon provides and they should cover all the requirements you've mentioned. We've been using Amazon AWS for some time now and there are still some kinks and bugs, but it's generally moving forward and pretty stable.
One downside to using something like AWS is being vendor locked-in: whilst you could run your services outside of Amazon in your own datacenter, or move files out of S3 (at a cost though), getting out of SimpleDB will likely represent the bulk of the work during migration.