Cloud Data Fusion pricing - development vs execution - google-cloud-platform

we are looking to get some clarity on Cloud Data Fusion pricing. It looks like if we create a Cloud Data Fusion instance, as long as the instance is alive, we incur hourly rates. This can be quite high: $1100 per month for development and $3000 per enterprise instance.
https://cloud.google.com/data-fusion/pricing
There seems to be no way to stop an instance - this was confirmed by support, only delete.
However, the pricing talks of development vs execution. Wondering if we can avoid the instance charges once we are done deploying a pipeline. Not clear if this is possible or even a deployed pipeline requires an instance.
Thanks.

You can deploy your pipeline in 2 modes:
Either Cloud Data fusion create an ephemeral cluster, deploy your pipeline and tear down the cluster at the end -> Here you need to keep Data Fusion to tear down the cluster. But you can delete it before
Or run the pipeline on an existing cluster. This time, after the pipeline deployment and start, you can shut down the instance.
I agree, it's not clear but you can deduce this when you know how work an Hadoop cluster.
Note: don't forget to export your pipeline before deleting the instance
Note2: the instance also offer trigger scheduling to run the pipeline. Of course, if you delete the instance this feature is useless for you!

Related

How to setup AWS RDS standalone instance without traffic from actual RDS cluster

We need to know what are the best options to set AWS RDS instance (Aurora mysql) that is standalone and does not get traffic from actual RDS cluster.
Requirement is for our data team to write analytical queries but we do not want it to impact actual application and DB performance. Hence we need a DB which always has near to live data but live traffic or application does not connect to this instance.
Need to know which fits better, DL clone OR AWS Pilot light OR AWS Warn standby OR AWS hot standby OR
multi-AZ configuration.
Kindly let us know which one would fit our requirement better.
We have so far read about below 3 options,
AWS Amazon Aurora DB clone, https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Clone.html
AWS Pilot light or AWS Warn standby or AWS hot standby
. https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iii-pilot- light-and-warm-standby/
With multi-AZ configuration, we can create a new instance in new AZ, so that his instance will have a different host (kind off, a fail over strategy), where traffic to his instance will be from our queries and not from live prod application, unless there is some fail over issue.
Option 1, Aurora cloning says
Run workload-intensive operations, such as exporting data or running analytical queries on the clone.
...which seems to be your use case here.
Just be aware that the clone will not see any changes to the original data after it is made. So you will need to periodically delete and re-clone to get the updated data
Regarding option 2, I wrote those blog posts, and I do not think that approach suits your use case. That approach is for disaster recovery
Option 3 may work. To modify it a bit, the concept here is to create an Aurora Replica, which as you say is a separate instance. The problem here is the reader endpoint for your production workload, it may hit that instance (which is not what you want)
EDIT: Adding new option 4
Option 4. Check out Amazon Aurora zero-ETL integration with Amazon Redshift. This zero-ETL integration also enables you to analyze data from multiple Aurora database clusters in an Amazon Redshift cluster.

How to Automate Redshift Cluster Start/Stop for night time?

I have a AWS Redshift Cluster dc2.8xlarge and currently I am paying huge bill each month for running the cluster 24/7.
Is there a way I can automate the cluster uptime so that the cluster will be running in day time and I can stop the cluster at 8PM in evening and again start it in 8AM in morning.
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it is a lower-cost. A Redshift Cluster is sized for Compute and for Storage. You could reduce the number of nodes as long as you retain enough nodes for your Storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage enabling independent compute and storage scaling, which means you might be able scale-down to a single node. (This is a new node type, I'm not sure of how it works.)
Another option is to take a Snapshot and Shutdown the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
If you are concerned with the general cost of your cluster, you might want to downside from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which is a lower cost per TB of data stored.
good news :)
Now we can able to pause and resume the Redshift cluster (both Console and CLI)
check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
Now we can pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very important feature to check on the costs.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This will help you in automating the cluster uptime & downtime so that the cluster will be running in day time and is paused automatically at a specific time in the evening and again start in the morning automatically.
its pretty easy to use opensource https://cloudcustodian.io to automate nightime/weekend off hours on redshift and other aws resources.

How does AWS Athena manage to execute queries immediately?

Does Athena have a gigantic cluster of machines ready to take queries from users and run them against their data? Are they using a specific open-source cluster management software for this?
I believe AWS will never disclose how they operate Athena service. However, as Athena is managed PrestoDB the overall design can be deduced based on that.
PrestoDB does not require cluster manager like YARN, Messos. It has own planner and scheduler that is able to run SQL physical plan on worker nodes.
I assume that AWS within each availability zone maintains PrestoDB coordinator connected to data catalog(AWS Glue) and set of presto worker. Workers are elastic and autoscaled. In case of inactivity, they're downscaled, but when the burst of activity occurs new workers added to the cluster.

AWS Data Pipeline - Can we re-use the EC2 instance created during 'on-demand' pipeline activation?

Question -
Can I reuse the ec2 resource created during first on-demand run of the data pipeline in subsequent on-demand runs as well?
Description -
I have configured an 'on-demand' AWS data pipeline which is required to be activated many times during a day ( say 3 times within an hour ).
( I can not go with the cron or timeseries style scheduling since I have to pass different parameters to the pipeline at each execution)
In each on-demand activation, Data pipeline seems to create a new ec2 resource ? Is this the case?
Can I reuse the ec2 resource created during first on-demand run in other subsequent runs as well?
AWS Documentation provides the following information but it's not clear whether that applies to 'on-demand' pipelines as well.
AWS Data Pipeline allows you to maximize the efficiency of resources
by supporting different schedule periods for a resource and an
associated activity.
For example, consider an activity with a 20-minute schedule period. If
the activity's resource were also configured for a 20-minute schedule
period, AWS Data Pipeline would create three instances of the resource
in an hour and consume triple the resources necessary for the task.
Instead, AWS Data Pipeline lets you configure the resource with a
different schedule; for example, a one-hour schedule. When paired with
an activity on a 20-minute schedule, AWS Data Pipeline creates only
one resource to service all three instances of the activity in an
hour, thus maximizing usage of the resource.
This isn't possible with Data-Pipeline-managed resources. For this scenario, you would need to spin up the EC2 instance yourself and configure TaskRunner:
You can install Task Runner on computational resources that you
manage, such as an Amazon EC2 instance, or a physical server or
workstation. Task Runner can be installed anywhere, on any compatible
hardware or operating system, provided that it can communicate with
the AWS Data Pipeline web service.
To connect a Task Runner that you've installed to the pipeline
activities it should process, add a workerGroup field to the object,
and configure Task Runner to poll for that worker group value. You do
this by passing the worker group string as a parameter (for example,
--workerGroup=wg-12345) when you run the Task Runner JAR file.
This way Data Pipeline will not create any resources for you, and all activities will run on the EC2 instance that you provided.

Cassandra Datastax AMI on EC2 - Recover from "Stop"/"Start"

We're looking for the best way to deploy a small production Cassandra cluster (community) on EC2. For performance reasons, all recommendations are to avoid EBS.
But when deploying the Datastax provided AMI with Ephemeral storage, whenever the ephemeral storage is wiped out the instance dies permanently. (Start + Stop manually, or sometimes triggered by AWS for maintenance) will render the instance unusable.
OpsCenter fails to fix the instance after a reboot and the instance does not recover on its own.
I'd expect the instance to launch itself back up, run some script to detect that the ephemeral storage is wiped, and sync with the cluster. Since it does not the AMI looks appropriate only for dev tasks.
Can anyone please help us understand what is the alternative? We can live with a momentary loss of a node due to replication but if the node never recovers and a new cluster is required this looks like a dead end for a production environment.
is there a way to install Cassandra on EC2 so that it will recover from an Ephemeral storage loss?
If we buy a license for an enterprise edition will this problem go away?
Does this meant that in spite of poor performance, EBS (optimized) with PIOPS is the best way to run Cassandra on AWS?
Is the recommendation to just avoid stopping + starting the instance and hope that AWS will not retire or reallocate their host machine? What is the recommendation in this case?
What about AWS rolling update? Upgrading one machine (killing it) and starting it again, then proceeding to next machine will erase all cluster data, since machines will be responsive (unlike Cassandra on those). That way it can destroy small (e.g. 3 node) cluster.
Has anyone had good experience with payed services such as Instacluster?
New docs from Datastax actually indicate that EBS Optimized GP2 SSD backed instances can be used for production workloads. With EBS backed, you can easily do snapshots which virtually eliminate the chance of data loss on a node, and it makes it so that they are easily migrated to a new host by a simple start/stop.
With ephemeral, you basically have to plan around failure, consider if your entire cluster is in a single region (SimpleSnitch) and that region goes down.
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html