Reading through the AWS documentation, I'm quite confused by these three concepts:
Cluster: composed of one or more compute nodes; contains one or more databases
Compute node: run the query execution plans and transmit data among themselves to serve these queries
Database: User data is stored on the compute nodes
With this it is easy to assume that a compute node and a database are the same, isn't it? But when creating a Redshift cluster, a portion of the form is named "Database configuration", yet it seemingly refers to the cluster. Below is an image of it. If my understanding of the documentation is correct, the database configuration should refer to the compute nodes and not the cluster.
With all that said, what exactly is a cluster, a database, and a compute node?
With this it is easy to assume that a compute node and a database are the same, isn't it?
No, that's not the case. You can have a single-node Redshift cluster with multiple databases, or a single (large) database hosted on multiple compute nodes.
Basically, a node refers to the hardware layer of Redshift, while a database refers to the software layer only.
Your screenshot shows only a default database called dev. You can create many more if you want, all hosted on the same cluster.
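To make the hardware/software split concrete, here is a purely conceptual sketch (not an AWS API): a cluster owns compute nodes (hardware) and databases (software) independently, so the two counts vary separately. All names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    nodes: list = field(default_factory=list)      # compute nodes: the hardware layer
    databases: list = field(default_factory=list)  # databases: the software layer

# One compute node can host several databases...
single_node = Cluster(nodes=["node-1"], databases=["dev", "sales", "hr"])

# ...and one large database can span several compute nodes.
multi_node = Cluster(nodes=["node-1", "node-2", "node-3"], databases=["warehouse"])
```

The point is simply that neither count implies the other, which is why the "database configuration" section of the console is orthogonal to the node configuration.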
I have a simple question for someone with experience with AWS: I am getting a little confused with the terminology and don't know which node type to purchase.
At my company we currently have a Postgres DB that we insert into continuously.
We insert roughly 600M rows a year at the moment, but would like to be able to scale up.
Each row is basically a timestamp, two floats, one int, and one enum type.
So the workload is write-intensive, but also with constant small reads.
(There will be the occasional large read)
There are also two services that need to be run (both Rust based)
1, We have a Rust application that abstracts the DB data, allowing clients to access it through a RESTful interface.
2, We have a Rust app that imports data from thousands of individual devices over Modbus.
These devices are on a private mobile network. Can I set up AWS cluster nodes to access a private network through a VPN?
We would like to move to Amazon Redshift, but I am confused by the node types.
Amazon recommends choosing RA3 or DC2.
If we chose ra3.4xlarge, that means we get one cluster of nodes, right?
Can I run our rust services on that cluster along with a number of Redshift database instances?
I believe AWS uses Docker, and I think I could containerise my services easily.
Or am I misunderstanding things, and when you purchase a Redshift cluster you can only run Redshift on it, and have to get a different one for containerised applications, possibly an EC2 cluster?
Can anyone recommend a better fit for scaling this workload ?
Thanks
I would not recommend Redshift for this application, and I'm a Redshift guy. Redshift is designed for analytic workloads (lots of reads and few, large writes). Constant small inserts are not what it is designed for.
I would point you to Postgres on RDS as the best fit. It has a RESTful API interface already. This will be more of the transactional database you are looking for, with minimal migration effort.
When your data gets really large (TB+), you can add Redshift to the mix to quickly perform the analytics you need.
Just my $.02
Redshift is a managed service; you don't get any access to it for installing things, nor is there any possibility of installing or running custom software of your own.
Or am I misunderstanding things and when you purchase a Redshift cluster you can only run Redshift on this cluster
Yes, you don't run stuff - AWS manages the cluster and you run your analytics/queries etc.
have to get a different one for containerised applications, possibly an ec2 cluster ?
Yes, you could make use of EC2, running the orchestration yourself, or use ECS/Fargate/EKS depending on your budget, how skilled your team members are, etc.
In Kubernetes, what are the advantages and disadvantages of using a single shared PVC for all pods versus a separate PVC for each pod?
A StatefulSet with a single PV/PVC and a StatefulSet with multiple PV/PVCs have different use cases and should be chosen according to the application you want to deploy. You cannot say one is always better than the other.
Let me explain with the example of databases. If you want to deploy a relational database like PostgreSQL, where all the data is stored in one place, you need a StatefulSet with a single PV/PVC, and all replicas write to that particular volume only. This is the only way to keep the data consistent in PostgreSQL.
Now let's say you want to deploy a distributed NoSQL database like Cassandra or MongoDB, where the data is split across the different machines in the database cluster. In such databases, data is replicated over different nodes, and the StatefulSet's pods act as the different nodes of that database. So such pods need different volumes to store their data. Hence, if you're running a Cassandra StatefulSet with 3 pods, those pods must each have a different PV/PVC attached to them. Each node writes data to its own PV, which is then replicated to the other nodes.
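The multiple-PVC case is what `volumeClaimTemplates` is for: the template is stamped out into one PVC per pod. A hedged sketch for the Cassandra scenario above (image, names, and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: cassandra:4
          volumeMounts:
            - name: cassandra-data
              mountPath: /var/lib/cassandra
  # One PVC is created per pod from this template, e.g.
  # cassandra-data-cassandra-0, cassandra-data-cassandra-1, ...
  volumeClaimTemplates:
    - metadata:
        name: cassandra-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

For the single-shared-volume case you would instead reference one pre-created PVC (with an access mode like ReadWriteMany) from the pod template's `volumes` section rather than using `volumeClaimTemplates`.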
I am new to AWS, and I want to store some temporary data in Memcached. My Memcached cluster has two nodes, one in us-east-1a and one in us-east-1b. It stores data across the 2 nodes but is not syncing them. Is there any way I can get all the data from both nodes instead of going into the 2 nodes one by one?
Edit: I use telnet to connect to the endpoint of Memcached.
Memcached clusters do not replicate data between nodes; only Redis will do that, if set up that way. That is why you need to use the Configuration Endpoint for Memcached - it knows all the instances in your cluster.
The Memcached engine supports partitioning your data across multiple nodes
A Memcached cluster is a logical grouping of one or more ElastiCache Nodes. Data is partitioned across the nodes in a Memcached cluster.
If you use Automatic Discovery, you can use the cluster's configuration endpoint to configure your Memcached client. This means you must use a client that supports Automatic Discovery.
If you don't use Automatic Discovery, you must configure your client to use the individual node endpoints for reads and writes. You must also keep track of them as you add and remove nodes.
Finding the Endpoints for a Memcached Cluster (Console)
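The key point is that Memcached partitions rather than replicates: each key lives on exactly one node. A minimal sketch of the idea in Python (node names are illustrative; real clients use more sophisticated hashing, and an ElastiCache client with Automatic Discovery would fetch the node list from the configuration endpoint):

```python
import hashlib

# Hypothetical node endpoints of a two-node Memcached cluster.
NODES = [
    "node-a.cache.amazonaws.com:11211",
    "node-b.cache.amazonaws.com:11211",
]

def node_for_key(key: str) -> str:
    """Pick the single node that owns this key (simple modulo hashing)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

print(node_for_key("session:123"))
```

Because every key maps to exactly one node, "getting all the data" necessarily means querying every node; there is no single node that holds a full copy.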
I can't seem to find a lot of documentation on how the clusters in RethinkDB actually work.
In Cassandra I connect to a cluster by defining one or more hosts, so in case one of them is down, or has even been removed, I can still connect to the whole cluster until the code/configuration is updated to reflect the changes to my hosts' IP addresses.
As far as I understand it, RethinkDB doesn't have such logic and I'd need to implement it myself, but I'd still be connected to the whole cluster at all times - is that correct?
When creating a database, it is "kind of" created for the whole cluster; there is no way, and no need, to specify the exact servers that would take care of it. When creating a table, if I don't specify a primary replica tag, which server will be the primary replica? If I specify a tag which is assigned to multiple servers, the same question applies. How is the final server that will be the main replica selected?
In Cassandra I connect to a cluster by defining one or more hosts, so in case one of them is down, or has even been removed, I can still connect to the whole cluster.
In RethinkDB, you connect to the cluster by connecting to a node in the cluster. That node will take care of communicating with all the other nodes in the cluster. If that node disconnects from the cluster, then you might not be able to do reads or writes, depending on your cluster's sharding and replication. If that node fails, you won't be able to do anything. At that point, you can try connecting to another node.
As far as I've understood it, RethinkDB doesn't have such a logic and I'd need to implement it myself
Yes, RethinkDB won't automatically reconnect you to another node in the cluster if your node fails. That being said, handling this might be as simple as keeping multiple connections open and switching between them (unless I'm missing something!).
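The "try the next host" logic can be sketched in a few lines. This is a hedged sketch, not the driver's API: `connect_fn` stands in for whatever connect call your driver provides (e.g. `r.connect(host=...)` in the RethinkDB Python driver), and the stub below assumes failures surface as `ConnectionError`.

```python
def connect_with_failover(hosts, connect_fn):
    """Try each known host in order; return the first connection that succeeds."""
    last_error = None
    for host in hosts:
        try:
            return connect_fn(host)
        except ConnectionError as exc:
            last_error = exc  # this node is down; try the next one
    raise last_error or ConnectionError("no hosts given")

# Demo with a stub standing in for the real driver connect call.
def fake_connect(host):
    if host == "db1.example.com":
        raise ConnectionError("node down")
    return "connected:" + host

conn = connect_with_failover(["db1.example.com", "db2.example.com"], fake_connect)
print(conn)  # connected:db2.example.com
```

A production version would also want retry backoff and periodic health checks, but the core idea is just iterating over a host list.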
When creating a database, it is "kind of" created for the whole cluster, there is no way and no need to specify the exact servers which would be taking care of it.
Yes, when you create a database it's created for the whole cluster. A database doesn't really 'live' in a specific node. It's only tables that live in a specific node.
When creating a table and I don't specify a primary replica tag, which server will be the primary replica?
RethinkDB will automatically take care of that. It will pick the server where the primary replica will be, based on the following:
Server distribution load (which servers have more tables and data).
Whether a specific server was already a primary/secondary for that table.
If you want to manually control which server the primary or secondary ends up on, you can set it manually through the table_config table in the rethinkdb database. (Take a peek at that database - it gives you a better view into how RethinkDB works!)
If I specify a tag which is assigned to multiple servers - same question applies.
Same as above.
How is the final server which will be the main replica selected?
Same as above.
In terms of documentation, I would suggest the following:
Sharding and replication: http://rethinkdb.com/docs/sharding-and-replication/ (Although your questions suggest you probably already read this :))
How do I horizontally scale an Amazon RDS instance? EC2 with a load balancer and autoscaling is extremely easy to implement, but what if I want to scale Amazon RDS?
I can upgrade my RDS instance to a more powerful instance type, or I can create a read replica and direct SELECT queries to it. But with a single replica I'm not really scaling anything for a read-oriented web application. So, can I create RDS read replicas with autoscaling and balance them with a load balancer?
You can use HAProxy to load-balance Amazon RDS Read Replicas. Check this: http://harish11g.blogspot.ro/2013/08/Load-balancing-Amazon-RDS-MySQL-read-replica-slaves-using-HAProxy.html.
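Roughly, the HAProxy side of that setup looks like the sketch below: a TCP listener that round-robins read traffic across the replica endpoints. The hostnames and port are illustrative, and the health-check user is an assumption you would need to create in MySQL yourself.

```
# Hedged sketch: route read-only traffic on :3307 across two RDS read replicas.
listen mysql-readonly
    bind *:3307
    mode tcp
    balance roundrobin
    option mysql-check user haproxy_check
    server replica1 replica1.xxxx.us-east-1.rds.amazonaws.com:3306 check
    server replica2 replica2.xxxx.us-east-1.rds.amazonaws.com:3306 check
```

Your application would then point its read connections at the HAProxy host on port 3307 while writes still go to the primary RDS endpoint.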
Hope this helps.
Note that RDS covers several database engines - MySQL, PostgreSQL, Oracle, MSSQL.
Generally speaking, you can scale up (larger instance), use read-only replicas, or shard. If you are using MySQL, look at AWS Aurora. Think about using the database optimally - perhaps combining it with Memcached or Redis (both available under AWS ElastiCache). Think about using a search engine (Lucene, Elasticsearch, CloudSearch).
Some general resources:
http://highscalability.com/
https://gigaom.com/2011/12/06/facebook-shares-some-secrets-on-making-mysql-scale/
If you are using PostgreSQL and have a workload that can be partitioned by a certain key and doesn't require complex transactions, then you could take a look at the pg_shard extension. pg_shard lets you create distributed tables that are sharded across multiple servers. Queries on a distributed table are transparently routed to the right shard.
Even though RDS doesn't have the pg_shard extension installed, you can set up one or more PostgreSQL servers on EC2 with the pg_shard extension and use RDS nodes as worker nodes. The pg_shard node only needs to store a tiny bit of metadata, which can be backed up in one of the worker nodes, so it is relatively low-maintenance and can be scaled out to accommodate higher query rates.
A guide with a link to a CloudFormation template to set everything up automatically is available at: https://www.citusdata.com/blog/14-marco/178-scaling-out-postgresql-on-amazon-rds-using-masterless-pg-shard