I understand ElastiCache is faster than RDS for retrieving data. What I don't understand is where ElastiCache's data is actually stored.
AWS documentation says it is stored "in-memory", but... the memory of which machine(s)?
When creating an ElastiCache cluster, you define the type of the nodes that make up your cluster.
A node type maps to a group of supported EC2 instance types. When your cluster is created, the ElastiCache service provisions EC2 machines with that instance type - meaning the data resides in the memory of the EC2 machines provisioned by the ElastiCache service.
Note that certain node types (the r6gd family) tier their data between memory and local SSD (solid state drive) storage.
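For illustration, here is a minimal boto3 sketch of creating a cluster with an explicit node type (the cluster name is hypothetical); behind the scenes, ElastiCache provisions EC2 machines of the corresponding instance type and holds your data in their memory:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# CacheNodeType selects the underlying EC2 instance family/size; ElastiCache
# provisions those machines and your data lives in their RAM.
elasticache.create_cache_cluster(
    CacheClusterId="example-redis",      # hypothetical cluster name
    Engine="redis",
    CacheNodeType="cache.r6g.large",
    NumCacheNodes=1,
)

# Data tiering between memory and local SSD is limited to r6gd node types,
# e.g. CacheNodeType="cache.r6gd.xlarge".
```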
My Java servlet web app is hosted on an AWS EC2 instance. Is storing sensitive data (say, DB credentials) in the property (config) file of my Java web app safe? When the EBS volume is deallocated, will it still contain the data I saved, and could it be read by someone else within the same or a different AWS account? Are there any security risks?
Data stored on the EBS volume is zeroed out after you delete the volume. This is carried out by AWS automatically.
Yes, the blocks on the EBS volume will be zeroed after you delete the volume.
From Amazon EBS volumes - Amazon Elastic Compute Cloud:
The data persists on the volume until the volume is deleted explicitly. The physical block storage used by deleted EBS volumes is overwritten with zeroes before it is allocated to another account. If you are dealing with sensitive data, you should consider encrypting your data manually or storing the data on a volume protected by Amazon EBS encryption.
For more information on EBS encryption, see Amazon EBS encryption - Amazon Elastic Compute Cloud
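For reference, a minimal boto3 sketch of creating a volume with EBS encryption enabled (size, AZ and volume type are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# With Encrypted=True, data at rest on the volume, its snapshots, and data in
# transit between the volume and the instance are encrypted.
ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=20,            # GiB, placeholder
    VolumeType="gp3",
    Encrypted=True,     # uses the default aws/ebs KMS key unless KmsKeyId is set
)
```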
I went with another approach, because anyone who has access to the file (remotely or otherwise) can read the credentials and pass them on. I used AWS Systems Manager Parameter Store to store the sensitive values as SecureString parameters. The app retrieves them from Parameter Store and uses them at run time. To avoid hitting the service repeatedly, the value is cached for a configurable time. The original question is about the security of EBS, not about alternatives, but I'm sharing my approach so others are aware of the option.
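A minimal Python/boto3 sketch of that pattern (the parameter name and cache TTL are hypothetical; the same idea applies to the Java SDK):

```python
import time
import boto3

ssm = boto3.client("ssm")
_cache = {}               # simple in-process cache: name -> (value, fetched_at)
CACHE_TTL_SECONDS = 300   # configurable cache time

def get_secret(name: str) -> str:
    """Fetch a SecureString parameter, caching it to avoid repeated calls."""
    entry = _cache.get(name)
    if entry and time.time() - entry[1] < CACHE_TTL_SECONDS:
        return entry[0]
    resp = ssm.get_parameter(Name=name, WithDecryption=True)
    value = resp["Parameter"]["Value"]
    _cache[name] = (value, time.time())
    return value

# Usage (hypothetical parameter name):
db_password = get_secret("/myapp/prod/db_password")
```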
I'm trying AWS Auto Scaling for the first time. As far as I understand, it creates instances if, for example, my CPU utilization reaches a critical level that I define.
So I am curious: after I launch my instance I spend a fair amount of time configuring it and copying the data. If AWS auto-scales my instance, how will it configure the new instances and move the data to them?
You can't store any data that you want to keep on an instance that is part of an autoscaling group (well you can, but you will lose it).
There are (at least) two ways to answer your question:
Create a 'golden image': in other words, spin up an instance, configure it, install the software etc., and then save it as an AMI (Amazon Machine Image). Then tell the autoscaling group to use that AMI each time an instance starts - it will be pre-configured when it starts.
Put a script on the instance (in the user data) that tells the instance how to configure itself when it starts up. So basically each time an instance scales up, it runs the script and does all the steps it needs to configure itself.
As for your data, best practice would be to store any data you want to keep in a database or object store that is not on the instance - so something like RDS, DynamoDB or even S3 objects.
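A hedged boto3 sketch combining both ideas - a pre-baked "golden" AMI plus a user-data script - in a launch template an Auto Scaling group could reference (the template name, AMI ID and bucket are hypothetical, and the user data assumes an instance profile with S3 read access):

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# User data script that configures the instance at boot (option 2).
user_data = """#!/bin/bash
yum install -y httpd
aws s3 cp s3://example-config-bucket/app.conf /etc/app/app.conf   # hypothetical bucket
systemctl enable --now httpd
"""

# Launch template the Auto Scaling group references; the ImageId would be
# your pre-baked "golden" AMI (option 1), and UserData covers option 2.
ec2.create_launch_template(
    LaunchTemplateName="example-asg-template",      # hypothetical name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",         # hypothetical golden AMI
        "InstanceType": "t3.micro",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```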
You could also use AWS EFS: store the data/scripts that the EC2 instances will share there, and automatically mount it every time a new EC2 instance is created via /etc/fstab.
Once you have configured the EFS file system to be mounted on the EC2 instance (/etc/fstab), you should create a new AMI, and use this new AMI to create a new Launch Configuration and Auto Scaling Group, so that the new instances automatically mount your EFS and are able to consume that shared data.
https://aws.amazon.com/efs/faq/
Q. What use cases is Amazon EFS intended for?
Amazon EFS is designed to provide performance for a broad spectrum of workloads and applications, including Big Data and analytics, media processing workflows, content management, web serving, and home directories.
Q. When should I use Amazon EFS vs. Amazon Simple Storage Service (S3) vs. Amazon Elastic Block Store (EBS)?
Amazon Web Services (AWS) offers cloud storage services to support a wide range of storage workloads.
Amazon EFS is a file storage service for use with Amazon EC2. Amazon EFS provides a file system interface, file system access semantics (such as strong consistency and file locking), and concurrently-accessible storage for up to thousands of Amazon EC2 instances. Amazon EBS is a block level storage service for use with Amazon EC2. Amazon EBS can deliver performance for workloads that require the lowest-latency access to data from a single EC2 instance. Amazon S3 is an object storage service. Amazon S3 makes data available through an Internet API that can be accessed anywhere.
https://docs.aws.amazon.com/efs/latest/ug/mount-fs-auto-mount-onreboot.html
You can use the file fstab to automatically mount your Amazon EFS file system whenever the Amazon EC2 instance it is mounted on reboots. There are two ways to set up automatic mounting. You can update the /etc/fstab file in your EC2 instance after you connect to the instance for the first time, or you can configure automatic mounting of your EFS file system when you create your EC2 instance.
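As a rough illustration of the fstab approach (the file system ID, region, mount point and mount options are assumptions based on the standard NFS settings in the EFS docs; in practice you would bake this into the AMI or run it from user data):

```python
# Append an NFSv4 entry for an EFS file system to /etc/fstab so it re-mounts
# on every reboot. Run once, e.g. while baking the AMI or from user data.
file_system_id = "fs-0123456789abcdef0"   # hypothetical EFS file system ID
region = "us-east-1"
mount_point = "/mnt/efs"

# EFS exposes a regional DNS name of the form <fs-id>.efs.<region>.amazonaws.com
dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"

fstab_entry = (
    f"{dns_name}:/ {mount_point} nfs4 "
    "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0\n"
)

with open("/etc/fstab", "a") as fstab:
    fstab.write(fstab_entry)
```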
I recommend using a shared data container if the data is updated and the updated data is needed by all instances that might be spinning up.
If it is database data, or you could store the needed data in a database, I would consider using RDS.
If it is static data only used to configure the instances, like dumps or configuration files which are not updated by running instances, then I would recommend pulling them from CloudFlare or S3 if it is not possible to pull them from a repository.
Good luck
I have some doubts about deploying CDH on AWS. I read the reference architecture doc and other material I found on the Cloudera Engineering Blog, but I need some more suggestions.
1) Is the CDH deployment available only for certain kinds of instances, or can I deploy it on all the AWS instance types?
2) Assume I want to create a cluster that will be active 24x7. For a long-running cluster I understood it's better to have a cluster based on local-storage instances. If we consider a cluster of 2 PB, I think that d2.8xlarge should be the best choice for the datanodes.
About the Master Nodes:
- if I want to deploy only 3 Master Nodes, is it better to have them as local-storage instances too, or as EBS-attached instances so I can react quickly to a possible Master Node failure?
- are there any best practices about the master node instance type (EBS or local-storage)?
About the Data Nodes:
- if a data node fails, does CDH have some automated mechanism to spin up a new instance and connect it to the cluster in order to restore the cluster without downtime, or do we have to build a script from scratch to do this?
About the Edge Nodes:
- are there any best practices about the instance type (EBS or local-storage)?
3) If I want to do a backup of the cluster to S3:
- when I do a distcp from CDH to S3, can I move the data directly to Glacier instead of standard S3?
- if some compression is applied to the data (e.g. Snappy, gzip, etc.) and I do a distcp to S3, is the space occupied on S3 the same, or does the distcp command decompress the data for the copy?
If I have a cluster based on EBS-attached instances:
- is it possible to snapshot the disks and re-attach a datanode with the EBS disks rebuilt from the snapshot?
4) If I have the Data Nodes deployed as r4.8xlarge and I need more horsepower, is it possible to scale up the cluster from r4.8xlarge to r4.16xlarge on the fly, attaching and detaching the disks in a few minutes?
Thanks a lot for the clarifications, I hope my doubts will help also other users.
1) There's no explicit restriction on instance types where CDH components will work, but you'd need to pick types with a minimum of horsepower. For example, I don't expect that a micro size instance would work for much of anything. A type that is too small will generally cause daemons to run out of memory. The reference architecture has suggested instance types for certain situations.
2) You should stick with EBS for the root volume of instance types. There are a few reasons, including that newer instance types don't even support local instance storage for the root disk.
CDH doesn't have a mechanism for replacing data nodes when they fail. You could roll something yourself, possibly with help from Cloudera Director.
3) You can set up lifecycle rules for data in S3 to migrate it from the standard storage class into Glacier over time, or you can write directly to Glacier, although it doesn't look like direct Glacier access can be done through the s3a connector (so distcp can't target it directly).
I'm pretty sure distcp and S3 won't fiddle with compression; what you copy is opaque to S3 for sure.
You can snapshot EBS volumes (root or additionally attached), then detach them and re-attach them to a different instance. This isn't necessarily a great way to back up datanodes compared to the distcp route, because each datanode is unique and has changing data as the cluster runs.
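A minimal boto3 sketch of such a lifecycle rule (the bucket name, prefix and 30-day window are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under a prefix to Glacier 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cdh-backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-distcp-output",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```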
4) You can resize EBS-backed EC2 instances without detaching and re-attaching disks. You do have to stop an instance to resize it.
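For point 4, a hedged boto3 sketch of the stop / change-type / start cycle (the instance ID and target type are placeholders; EBS volumes stay attached throughout):

```python
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"   # hypothetical datanode instance

# Stop, change the instance type, and start again; EBS volumes stay attached.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r4.16xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])
```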
Point 3 only:
You need to distcp to S3 and then move that to Glacier via the AWS settings.
It doesn't do anything to the data, compression, etc.
See the (Hortonworks doc) Distcp and S3 and read its warnings/caveats. In particular, incremental distcp isn't checksum-based, and atomic distcp isn't atomic - it's just "really slow distcp".
Can someone help me with these questions, please:
1- The documentation states that Aurora will automatically fail over to the read replicas. My question is: how does it select the replica which will be promoted if you have more than one with different instance classes?
2- Can I disable this automatic failover (just asking, not stating that I will do it)?
3- What is the purpose of Multi-AZ in Aurora if you can have the same effect, with much more control over instance classes, by creating replicas yourself and letting Aurora do the auto failover for you? Please correct me if I am wrong with this assumption.
thanks in advance
The algorithm for election of a new master in case of failure is not really documented... but it doesn't seem to matter, because Aurora replicas seem to be different from other RDS replicas: all the instances in the cluster are necessarily of the same instance class.
Unlike other RDS offerings, read replicas in Aurora do not appear to have an independent copy of the backing store -- instead, the backing store itself provides redundancy, being replicated at the storage level with two copies in each of three availability zones.
The cluster volume is made up of multiple copies of the data for the DB cluster, but the data in the cluster volume is represented as a single, logical volume to the primary and Aurora Replicas in the DB cluster.
Because the cluster volume is shared among all instances in your DB cluster, no additional work is required to replicate a copy of the data for each Aurora Replica.
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Replication.html
Multi-AZ in Aurora is also not the same thing -- with other RDS products, multi-AZ involves a second, invisible instance, running in parallel with the master. The Aurora literature uses the phrase "multi-AZ technology," but the meaning appears to be different. Note that the Aurora pricing tables don't show a separate pricing rate for "multi-AZ" the way MySQL and MariaDB do.
Failover doesn't appear to be something that can be disabled. Even if you have no replicas, Aurora will still "fail over" if the master fails -- but it does it by spinning up a replacement master using the existing cluster volume as the backing store.
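As an aside, the RDS API does expose some control in this area nowadays; a hedged boto3 sketch with hypothetical identifiers - each Aurora instance carries a promotion tier that influences which replica gets promoted first, and you can also trigger a failover manually toward a specific replica:

```python
import boto3

rds = boto3.client("rds")

# Influence which replica gets promoted by setting its promotion tier
# (0 = highest priority); identifiers are hypothetical.
rds.modify_db_instance(
    DBInstanceIdentifier="my-aurora-replica-1",
    PromotionTier=0,
    ApplyImmediately=True,
)

# You can also trigger a failover manually and name the target replica.
rds.failover_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    TargetDBInstanceIdentifier="my-aurora-replica-1",
)
```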
The above answer is no longer valid.
Multi-AZ = an Aurora cluster with at least one read replica in a different AZ.
You can still create multiple read replicas for a cluster, but if you create them within the same AZ as your writer, the cluster will not be Multi-AZ.
Within each AWS Region, Availability Zones (AZs) represent locations that are distinct from each other to provide isolation in case of outages. We recommend that you distribute the primary instance and reader instances in your DB cluster over multiple Availability Zones to improve the availability of your DB cluster. That way, an issue that affects an entire Availability Zone doesn't cause an outage for your cluster.
You can set up a Multi-AZ cluster by making a simple choice when you create the cluster. The choice is simple whether you use the AWS Management Console, the AWS CLI, or the Amazon RDS API. You can also make an existing Aurora cluster into a Multi-AZ cluster by adding a new reader instance and specifying a different Availability Zone.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.AuroraHighAvailability.html
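A minimal boto3 sketch of adding a reader in a different AZ to make an existing cluster Multi-AZ (the cluster/instance identifiers, engine, instance class and AZ are hypothetical):

```python
import boto3

rds = boto3.client("rds")

# Add a reader in a different AZ from the writer; this makes the cluster Multi-AZ.
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-reader-2b",
    DBClusterIdentifier="my-aurora-cluster",
    Engine="aurora-mysql",
    DBInstanceClass="db.r6g.large",
    AvailabilityZone="us-east-1b",   # different AZ from the writer
)
```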
When launching an Aurora instance I have the option of "Multi-AZ Deployment", which it describes as "Specifies if the DB Instance should have a standby deployed in another Availability Zone."
However, the Aurora documentation states that Aurora already automatically spreads the database across different availability zones.
Additionally, what is the difference between an Aurora Multi-AZ standby and an ordinary Aurora replica? Is it that an ordinary replica can be read from, increasing performance, whereas a standby cannot be read from?
Aurora replicates your data across three availability zones, at the storage layer... but the database server instance, itself, is still a virtual machine running on a single physical machine that is located in a single availability zone.
The Aurora storage layer is outside that instance, and is able to let access continue uninterrupted without data loss, even in the event of the loss of up to two AZs, but the loss of the zone containing the db instance will still cause an outage for you, if you only have a single Aurora instance in your cluster (1 master, 0 replicas). Loss of an entire availability zone is one of those things that is highly improbable but not impossible. Your db instance is still a single point of failure when you only have one.
Multi-AZ makes allowance for a complete redundant database instance, in a different AZ, which will automatically take over for the primary within one minute, if it works as designed, in case of the loss of the AZ hosting the primary instance or a catastrophic failure of the primary instance. It's a second virtual machine, on a second physical machine, in a second availability zone. It's always running, but you can't access it. It's in the background, managed and monitored by the RDS infrastructure, but it is only accessible to you in the case of primary instance failure. The secondary machine can also be used to reduce downtime in the event of a software upgrade or maintenance event on the primary. When failover occurs, if you are using DNS to connect to your database (as you should), you'll find that the DNS entry is automatically pointed to the secondary.
Contrast this to a read replica, which is accessible all the time and can thus provide a significant performance benefit, by allowing the offloading of reads. Failing over to a replica involves promoting it to become a standalone master (which permanently detaches it from its own former master) and reconfiguring your application to use the alternate endpoint. This, of course, is still faster than recovering from a failure in the master by using a point-in-time snapshot to create a replacement master instance.
https://aws.amazon.com/rds/details/multi-az/
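On the read-offloading point, a hedged boto3 sketch (the cluster identifier is hypothetical) showing the two DNS names an Aurora cluster publishes - the cluster endpoint, which RDS repoints on failover, and the reader endpoint, which spreads read-only traffic across the replicas:

```python
import boto3

rds = boto3.client("rds")

cluster = rds.describe_db_clusters(
    DBClusterIdentifier="my-aurora-cluster"
)["DBClusters"][0]

writer_endpoint = cluster["Endpoint"]        # point writes here; follows failover
reader_endpoint = cluster["ReaderEndpoint"]  # point read-only traffic here

print(writer_endpoint, reader_endpoint)
```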
Storage in Aurora is replicated across three availability zones. The database head node is a single instance. So, while your data is spread across multiple targets, the head node is not.
When you enable a multi-AZ deployment, we create an Aurora read replica that is available as a failover target. Any Aurora read replicas you create (up to a max of 15 at this time) are also available as failover targets.
There isn't any meaningful difference between Multi-AZ and other Aurora replicas. This is primarily a simplification in the user interface for customers accustomed to using Multi-AZ for other RDS engines.
The answer to this is straightforward.
You can select Multi-AZ in the AWS Management Console or ignore it. Either way, the shared storage for Amazon Aurora is spread across three AZs (multiple AZs), as that is a built-in feature of Amazon Aurora. However, if you choose the Multi-AZ option, you will also have your Amazon Aurora instances in multiple AZs.
Thus, you should choose the Multi-AZ deployment option in the console.