How does AWS Athena manage to execute queries immediately?

Does Athena have a gigantic cluster of machines ready to take queries from users and run them against their data? Are they using a specific open-source cluster management software for this?

I believe AWS will never disclose how they operate the Athena service. However, since Athena is managed Presto (PrestoDB), the overall design can be deduced from that.
Presto does not require a cluster manager like YARN or Mesos. It has its own planner and scheduler that can run a SQL physical plan directly on worker nodes.
I assume that within each availability zone AWS maintains a Presto coordinator connected to the data catalog (AWS Glue) and a pool of Presto workers. The workers are elastic and autoscaled: after a period of inactivity they are scaled down, and when a burst of activity occurs new workers are added to the cluster.
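From the client's point of view, this shows up as a pure API: you submit SQL and poll for the result, with no cluster to provision or size. A minimal boto3 sketch, where the database, table, and results bucket are placeholder names:
import time
import boto3

athena = boto3.client("athena")

# Submit the query -- no cluster to create; the managed Presto fleet picks it up.
query = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the service reports a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    print(athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"])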

Related

How to setup AWS RDS standalone instance without traffic from actual RDS cluster

We need to know the best options for setting up an AWS RDS instance (Aurora MySQL) that is standalone and does not receive traffic from the actual RDS cluster.
The requirement is for our data team to run analytical queries without impacting the actual application and DB performance. Hence we need a DB that always has near-live data, but that live traffic and the application never connect to.
We need to know which fits better: a DB clone, AWS Pilot light, AWS Warm standby, AWS Hot standby, or a multi-AZ configuration.
Kindly let us know which one would fit our requirement better.
We have so far read about the 3 options below:
AWS Amazon Aurora DB clone: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html
AWS Pilot light, AWS Warm standby, or AWS Hot standby: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iii-pilot-light-and-warm-standby/
With a multi-AZ configuration, we can create a new instance in a new AZ, so that this instance has a different host (kind of a failover strategy), where traffic to this instance comes from our queries and not from the live prod application, unless there is some failover issue.
Option 1, Aurora cloning, says:
Run workload-intensive operations, such as exporting data or running analytical queries on the clone.
...which seems to be your use case here.
Just be aware that the clone will not see any changes to the original data after it is made, so you will need to periodically delete and re-clone to get the updated data.
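A rough sketch of that periodic re-clone step with boto3 (the cluster identifiers, engine, and instance class are placeholders, and any previous clone with the same name is assumed to have been deleted already):
import boto3

rds = boto3.client("rds")

SOURCE_CLUSTER = "prod-aurora-cluster"   # placeholder: the live production cluster
CLONE_CLUSTER = "analytics-clone"        # placeholder: the clone for the data team

# A clone is a copy-on-write restore from the live cluster; storage is shared
# until either side changes a page, so creation is fast and cheap.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier=CLONE_CLUSTER,
    SourceDBClusterIdentifier=SOURCE_CLUSTER,
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)

# The clone cluster starts with no instances; add one for analytical queries.
rds.create_db_instance(
    DBInstanceIdentifier=CLONE_CLUSTER + "-instance-1",
    DBClusterIdentifier=CLONE_CLUSTER,
    Engine="aurora-mysql",
    DBInstanceClass="db.r5.large",
)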
Regarding option 2, I wrote those blog posts, and I do not think that approach suits your use case. That approach is for disaster recovery.
Option 3 may work. To modify it a bit, the concept here is to create an Aurora Replica, which, as you say, is a separate instance. The problem here is the reader endpoint for your production workload: it may route traffic to that instance (which is not what you want).
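If you do go the Aurora Replica route, one possible way around the reader-endpoint problem is Aurora custom endpoints: one endpoint that contains only the analytics replica and one that excludes it. This is just a hedged boto3 sketch; the cluster, instance, and endpoint identifiers are placeholders:
import boto3

rds = boto3.client("rds")

CLUSTER = "prod-aurora-cluster"           # placeholder
ANALYTICS_INSTANCE = "analytics-replica"  # placeholder: replica reserved for the data team

# Endpoint the data team connects to: only the analytics replica.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier=CLUSTER,
    DBClusterEndpointIdentifier="analytics",
    EndpointType="READER",
    StaticMembers=[ANALYTICS_INSTANCE],
)

# Endpoint the application uses for reads: every reader except the analytics replica.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier=CLUSTER,
    DBClusterEndpointIdentifier="app-readers",
    EndpointType="READER",
    ExcludedMembers=[ANALYTICS_INSTANCE],
)
The application then reads from the "app-readers" endpoint while the data team connects to "analytics".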
EDIT: Adding new option 4
Option 4. Check out Amazon Aurora zero-ETL integration with Amazon Redshift. This zero-ETL integration also enables you to analyze data from multiple Aurora database clusters in an Amazon Redshift cluster.

Cloud Data Fusion pricing - development vs execution

We are looking to get some clarity on Cloud Data Fusion pricing. It looks like if we create a Cloud Data Fusion instance, we incur hourly rates for as long as the instance is alive. This can be quite high: $1100 per month for a development instance and $3000 per month for an Enterprise instance.
https://cloud.google.com/data-fusion/pricing
There seems to be no way to stop an instance (this was confirmed by support); it can only be deleted.
However, the pricing page talks about development vs execution. We are wondering if we can avoid the instance charges once we are done deploying a pipeline. It is not clear whether this is possible, or whether even a deployed pipeline requires an instance.
Thanks.
You can deploy your pipeline in 2 modes:
Either Cloud Data Fusion creates an ephemeral Dataproc cluster, deploys your pipeline on it, and tears the cluster down at the end. In this mode you need to keep the Data Fusion instance around so that it can tear down the cluster, so you can't delete it before the run finishes.
Or run the pipeline on an existing cluster. In this case, after the pipeline has been deployed and started, you can shut down the instance.
I agree that it's not clear, but you can deduce this once you know how a Hadoop cluster works.
Note: don't forget to export your pipeline before deleting the instance.
Note 2: the instance also offers triggers and scheduling to run the pipeline. Of course, if you delete the instance, this feature is no longer available to you!

How to Automate Redshift Cluster Start/Stop for night time?

I have an AWS Redshift cluster (dc2.8xlarge) and currently I am paying a huge bill each month for running the cluster 24/7.
Is there a way I can automate the cluster uptime so that the cluster runs during the day, and I can stop it at 8 PM in the evening and start it again at 8 AM in the morning?
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it is lower-cost. A Redshift cluster is sized for Compute and for Storage. You could reduce the number of nodes as long as you retain enough nodes for your Storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage, enabling independent compute and storage scaling, which means you might be able to scale down to a single node. (This is a new node type; I'm not sure how it works.)
Another option is to take a Snapshot and shut down the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
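For the snapshot-and-shutdown option, a minimal Lambda sketch with boto3 (the cluster identifier and snapshot naming are placeholders; the scheduled rule is assumed to pass an "action" key in its input):
import datetime
import boto3

redshift = boto3.client("redshift")
CLUSTER_ID = "my-redshift-cluster"  # placeholder

def handler(event, context):
    # Two scheduled rules invoke this function: one with {"action": "stop"} in the
    # evening and one with {"action": "start"} in the morning.
    if event.get("action") == "stop":
        redshift.delete_cluster(
            ClusterIdentifier=CLUSTER_ID,
            SkipFinalClusterSnapshot=False,
            FinalClusterSnapshotIdentifier=CLUSTER_ID + "-" + datetime.date.today().isoformat(),
        )
    elif event.get("action") == "start":
        # Restore the cluster from its most recent snapshot.
        snapshots = redshift.describe_cluster_snapshots(ClusterIdentifier=CLUSTER_ID)["Snapshots"]
        latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
        redshift.restore_from_cluster_snapshot(
            ClusterIdentifier=CLUSTER_ID,
            SnapshotIdentifier=latest["SnapshotIdentifier"],
        )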
If you are concerned with the general cost of your cluster, you might want to downsize from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which has a lower cost per TB of data stored.
Good news :)
We can now pause and resume the Redshift cluster (from both the Console and the CLI).
Check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
Now we can pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very important feature to check on the costs.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This will help you automate the cluster uptime and downtime so that the cluster runs during the day, is paused automatically at a specific time in the evening, and starts again automatically in the morning.
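If you prefer to script the pause/resume yourself rather than use the built-in scheduler, here is a minimal Lambda sketch with boto3 (the cluster identifier and the cron expressions in the comment are placeholders):
import boto3

redshift = boto3.client("redshift")
CLUSTER_ID = "my-dc2-cluster"  # placeholder

def handler(event, context):
    # Triggered by two EventBridge schedules, e.g. cron(0 20 * * ? *) with
    # {"action": "pause"} and cron(0 8 * * ? *) with {"action": "resume"}.
    if event.get("action") == "pause":
        redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)
    else:
        redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)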
It's pretty easy to use the open-source https://cloudcustodian.io to automate nighttime/weekend off-hours on Redshift and other AWS resources.

AWS Data Pipeline - Can we re-use the EC2 instance created during 'on-demand' pipeline activation?

Question -
Can I reuse the ec2 resource created during first on-demand run of the data pipeline in subsequent on-demand runs as well?
Description -
I have configured an 'on-demand' AWS Data Pipeline which needs to be activated many times during a day (say, 3 times within an hour).
(I cannot use cron or time-series style scheduling since I have to pass different parameters to the pipeline at each execution.)
In each on-demand activation, Data Pipeline seems to create a new EC2 resource. Is this the case?
Can I reuse the ec2 resource created during first on-demand run in other subsequent runs as well?
AWS Documentation provides the following information but it's not clear whether that applies to 'on-demand' pipelines as well.
AWS Data Pipeline allows you to maximize the efficiency of resources
by supporting different schedule periods for a resource and an
associated activity.
For example, consider an activity with a 20-minute schedule period. If
the activity's resource were also configured for a 20-minute schedule
period, AWS Data Pipeline would create three instances of the resource
in an hour and consume triple the resources necessary for the task.
Instead, AWS Data Pipeline lets you configure the resource with a
different schedule; for example, a one-hour schedule. When paired with
an activity on a 20-minute schedule, AWS Data Pipeline creates only
one resource to service all three instances of the activity in an
hour, thus maximizing usage of the resource.
This isn't possible with Data-Pipeline-managed resources. For this scenario, you would need to spin up the EC2 instance yourself and configure Task Runner:
You can install Task Runner on computational resources that you
manage, such as an Amazon EC2 instance, or a physical server or
workstation. Task Runner can be installed anywhere, on any compatible
hardware or operating system, provided that it can communicate with
the AWS Data Pipeline web service.
To connect a Task Runner that you've installed to the pipeline
activities it should process, add a workerGroup field to the object,
and configure Task Runner to poll for that worker group value. You do
this by passing the worker group string as a parameter (for example,
--workerGroup=wg-12345) when you run the Task Runner JAR file.
This way Data Pipeline will not create any resources for you, and all activities will run on the EC2 instance that you provided.
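To make that concrete, here is a hedged boto3 sketch of a pipeline definition whose activity targets a workerGroup instead of a Data-Pipeline-managed EC2 resource; the pipeline ID, role names, command, and worker group value are placeholders:
import boto3

dp = boto3.client("datapipeline")

PIPELINE_ID = "df-0123456789ABCDEF"  # placeholder: an existing on-demand pipeline

# The activity names a workerGroup, so Data Pipeline provisions nothing; the Task
# Runner you started with --workerGroup=wg-12345 on your own EC2 instance polls
# for the activity and executes it.
dp.put_pipeline_definition(
    pipelineId=PIPELINE_ID,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
        {
            "id": "MyActivity",
            "name": "MyActivity",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "echo run-on-my-own-instance"},
                {"key": "workerGroup", "stringValue": "wg-12345"},
            ],
        },
    ],
)
dp.activate_pipeline(pipelineId=PIPELINE_ID)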

Migrating Redis to AWS Elasticache with minimal downtime

Let's start by listing some facts:
ElastiCache can't be a slave of my existing Redis setup. Real shame, that would be so much more efficient.
I have only one Redis server to migrate, with roughly 3gb of data.
Downtime must be less than 10 mins. I assume the usual "stop the site, stop redis, provision cluster with snapshot" will take longer than this.
Similar to this question: How do I set an elasticache redis cluster as a slave?
One idea on how this might work:
Set Redis to use an AOF and trigger BGSAVE at the same time.
When BGSAVE finishes, provision the Elasticache cluster with RDB seed.
Stop the site and shut down my local Redis instance.
Use an aof-replay tool to replay the AOF into Elasticache.
Start the site again, pointed at the Elasticache cluster.
My questions:
How can I guarantee that my AOF file begins at exactly the point the RDB file ends, and that no data will be written in between?
Is there an AOF tool supported by the maintainers of Redis, or are they all third-party solutions, and therefore (potentially) of questionable reliability?*
* No offence intended to any authors of such tools, I'm sure they're great, I just feel much more confident using a tool written by the same team as the product to avoid potential compatibility bugs.
I have only one Redis server to migrate, with roughly 3gb of data
I would halt, save the Redis data to S3, and then upload it to a new cluster.
I'm guessing 10 minutes to save the file and get it into S3.
10 minutes to just launch an elasticache cluster from that data.
Leaves you ten extra minutes to configure and test.
But there is a simple way of knowing EXACTLY how long.
Do a test migration of it.
DON'T stop your live system.
Run BGSAVE and get a dump of your Redis (leave everything running as normal).
Move the dump to S3.
Launch an ElastiCache cluster from it.
Take DETAILED notes, TIME each step, and copy the commands to a notepad window.
Put it all in a Word/Excel document so you have a migration runbook. That way you know how long it takes and there are no surprises. Let us know how it goes.
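As a rough illustration of that dry run, here is a hedged Python sketch using redis-py and boto3; the dump path, bucket, ARN, node type, and cluster ID are all placeholders, and the S3 bucket must grant ElastiCache read access to the object:
import time
import boto3
import redis

r = redis.Redis(host="localhost", port=6379)  # assumes this runs on the Redis host

# 1. Trigger a background save and wait for it to finish (live traffic keeps running).
before = r.lastsave()
r.bgsave()
while r.lastsave() == before:
    time.sleep(1)

# 2. Copy the resulting dump.rdb to S3.
boto3.client("s3").upload_file("/var/lib/redis/dump.rdb", "my-migration-bucket", "dump.rdb")

# 3. Launch an ElastiCache cluster seeded from that RDB file.
boto3.client("elasticache").create_cache_cluster(
    CacheClusterId="redis-migration-test",
    Engine="redis",
    CacheNodeType="cache.r5.large",
    NumCacheNodes=1,
    SnapshotArns=["arn:aws:s3:::my-migration-bucket/dump.rdb"],
)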
ElastiCache has online migration support. You can use the start-migration API to start migration from self managed cluster to ElastiCache cluster.
aws elasticache start-migration --replication-group-id <ElastiCache Replication Group Id> --customer-node-endpoint-list "Address='<IP Address>',Port=<Port>"
The input to the API is your ElastiCache replication group id and the IP and port of the master of your self managed cluster. You need to ensure that the IP address is accessible from ElastiCache node. (An example IP address would be the private IP address of the master of your self managed cluster). This API will make the master node of the ElastiCache cluster call 'SLAVEOF' on the master of your self managed cluster. This will establish a replication stream and will start migrating data from self-managed cluster to ElastiCache cluster. During migration, the master of the ElastiCache cluster will stop accepting writes sent to it directly. You can start using ElastiCache cluster from your application for reads.
Once you have all your data in ElastiCache cluster, you can use the complete-migration API to stop the migration. This API will stop the replication from self managed cluster to ElastiCache cluster.
aws elasticache complete-migration --replication-group-id <ElastiCache Replication Group Id>
After this, the master of the ElastiCache cluster will start accepting writes. You can start using ElastiCache cluster from your application for both read and write.
There are a few limitations to be aware of for this migration method:
An existing or newly created ElastiCache deployment should meet the following requirements for migration:
It's cluster-mode disabled using Redis engine version 5.0.5 or higher.
It doesn't have either encryption in-transit or encryption at-rest enabled.
It has Multi-AZ with Auto-Failover enabled.
It has sufficient memory available to fit the data from your Redis on EC2 instance. To configure the right reserved memory settings, see Managing Reserved Memory.
There are a few ways to migrate the data without downtime. They are harder to achieve though.
You could have your app write to two Redis instances simultaneously, one of which would be on ElastiCache. Once both caches are 'warm', you could just restart your app and read from the ElastiCache one (a rough dual-write sketch follows at the end of this list).
You could initially migrate to EC2 instead of ElastiCache. Not really what you were hoping to hear, I imagine, but this is easy to do because you can set Redis on EC2 as a slave of your existing Redis instance. Also, migrating from EC2 to ElastiCache afterwards is somewhat easier (the data is already on AWS), so there's a benefit for users with huge data sets.
You could, in theory, intercept the commands from the client and send them to ElastiCache as well, thus effectively "replicating". But this requires some programming (I don't believe a tool like this exists at the moment) and would be hard with multiple, ephemeral clients.
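As mentioned in the first option above, here is a hedged sketch of the dual-write idea with redis-py; the hostnames and the DualWriteCache wrapper are purely illustrative, not an existing tool:
import redis

# Writes go to both the existing Redis and the ElastiCache endpoint;
# reads stay on the existing Redis until the new cache is warm.
old = redis.Redis(host="current-redis.internal", port=6379)
new = redis.Redis(host="my-cluster.abc123.0001.use1.cache.amazonaws.com", port=6379)

class DualWriteCache:
    def set(self, key, value, ttl=None):
        old.set(key, value, ex=ttl)
        new.set(key, value, ex=ttl)

    def get(self, key):
        return old.get(key)  # flip this to `new` once warm, then retire `old`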