Run multiple Data Fusion replication jobs on one Dataproc cluster - google-cloud-platform

I am currently analyzing GCP Data Fusion's replication feature for ingesting an initial snapshot followed by CDC.
The plan is to create one replication job per table, because adding a new table is not supported once a replication job has been created. I tried to add a table by deleting the replication job and recreating it with the same name, but that triggers the initial snapshot load for the other tables as well.
To overcome these two limitations, I am planning to create one replication job per table. However, every replication job creates its own Dataproc cluster, which will incur more cost.
Is it possible to run all replication jobs on one dataproc autoscaling cluster?
Note: The instance type is Basic. 

Yes, reusing the cluster is possible: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1683390465/Reusing+Dataproc+clusters
This marks your already provisioned cluster as reusable at the end of the job, but it only saves you the Dataproc provisioning time (approx. 90-150 sec) for every run.
I am not sure whether multiple Data Fusion jobs can be submitted to the same cluster in parallel, which is what I am looking for :)
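For what it's worth, the direction the CDAP docs point at for sharing one cluster is a compute profile built on the "Existing Dataproc" provisioner, which makes deployed jobs run against a cluster you create and autoscale yourself instead of an ephemeral one. Below is a minimal sketch of the REST payload such a profile would need; the profile endpoint path follows CDAP's profile API, but the provisioner id and property keys are assumptions that may differ by version, and whether Replication jobs honor custom compute profiles on the Basic edition is something to verify:

```python
# Sketch: build the CDAP REST call (PUT /v3/profiles/{name}) that would
# create a compute profile pointing at a pre-created autoscaling Dataproc
# cluster. Provisioner id and property keys are assumed, not verified.

def existing_dataproc_profile(profile_name, cluster_name, region):
    """Return (url_path, payload) for the CDAP profile API."""
    path = f"/v3/profiles/{profile_name}"
    payload = {
        "label": profile_name,
        "description": "Shared Dataproc cluster for all replication jobs",
        "provisioner": {
            "name": "gcp-existing-dataproc",  # assumed provisioner id
            "properties": [
                {"name": "clusterName", "value": cluster_name},  # assumed key
                {"name": "region", "value": region},             # assumed key
            ],
        },
    }
    return path, payload

path, body = existing_dataproc_profile("shared-repl", "repl-cluster-1", "us-central1")
```

With an authenticated HTTP client you would PUT `body` to `path` on the instance's CDAP endpoint, then select the profile in each job's runtime settings.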

Related

AWS Redshift Disaster Recovery - Is it possible to restore tables in another account to cluster provisioned by IAC?

I am setting up an AWS Redshift disaster recovery plan. Ideally, I would like the ability to restore table data to a new cluster (provisioned by IaC) in my DR account.
I don't believe sharing snapshots with the DR account will work, as restoring tables needs to take place within the cluster the snapshots were created from.
Restoring snapshots to a newly provisioned cluster isn't ideal either, as that cluster creation takes place outside of our IaC.
I believe my only other option would be to use the COPY/UNLOAD SQL commands?
You can have your Redshift cluster automatically backed up to a second AWS Region. In case of failover to the second Region, you can restore the Redshift cluster there.
If your IaC is CloudFormation, you can then bring the newly restored cluster into your IaC stack.
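As a minimal sketch of the first step, the cross-Region copy described above maps to Redshift's `enable_snapshot_copy` API (available via boto3). The cluster identifier, destination Region, and retention period below are placeholders; with credentials configured you would pass these parameters to the real call:

```python
# Sketch: parameters for Redshift cross-Region snapshot copy.
# boto3's redshift client exposes enable_snapshot_copy with these
# parameter names; the values here are placeholders.

def cross_region_copy_params(cluster_id, dest_region, retention_days=7):
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": dest_region,
        "RetentionPeriod": retention_days,  # days to keep copied snapshots
    }

params = cross_region_copy_params("prod-redshift", "us-west-2")
# With credentials configured, you would then call:
#   boto3.client("redshift").enable_snapshot_copy(**params)
```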

Cloud Data Fusion pricing - development vs execution

We are looking to get some clarity on Cloud Data Fusion pricing. It looks like if we create a Cloud Data Fusion instance, we incur hourly rates for as long as the instance is alive. This can be quite high: roughly $1,100 per month for a development instance and $3,000 per month for an Enterprise instance.
https://cloud.google.com/data-fusion/pricing
There seems to be no way to stop an instance (this was confirmed by support); the only option is to delete it.
However, the pricing page talks about development vs. execution. We are wondering whether we can avoid the instance charges once we are done deploying a pipeline. It is not clear whether this is possible, or whether even a deployed pipeline requires an instance.
Thanks.
You can run a deployed pipeline in 2 modes:
Either Cloud Data Fusion creates an ephemeral Dataproc cluster, deploys your pipeline, and tears down the cluster at the end. In this mode you need to keep the Data Fusion instance alive so it can tear down the cluster, so you can't delete the instance before the run completes.
Or you run the pipeline on an existing cluster. In this case, after the pipeline has been deployed and started, you can shut down the Data Fusion instance.
I agree that it's not clear, but you can deduce this once you know how a Hadoop cluster works.
Note: don't forget to export your pipeline before deleting the instance.
Note 2: the instance also offers trigger scheduling to run the pipeline. Of course, if you delete the instance, this feature is no longer available to you!
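On the export note above: a deployed pipeline's configuration can be fetched as JSON from the CDAP REST API that every Data Fusion instance exposes (`GET /v3/namespaces/{namespace}/apps/{app}`), and re-imported into a new instance later. A minimal sketch, where the endpoint and pipeline name are placeholders:

```python
# Sketch: build the URL to export a deployed pipeline's JSON before
# deleting the Data Fusion instance. The CDAP endpoint is a placeholder;
# in practice it comes from describing the Data Fusion instance.

def pipeline_export_url(cdap_endpoint, pipeline, namespace="default"):
    """CDAP REST path for fetching a deployed application's config."""
    return f"{cdap_endpoint}/v3/namespaces/{namespace}/apps/{pipeline}"

url = pipeline_export_url("https://CDAP_ENDPOINT", "my-pipeline")
# Fetching it requires an OAuth bearer token in the Authorization header;
# the returned JSON is what the Data Fusion UI's "Export" produces.
```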

restoring aurora cluster from s3 or restoring from snapshot

Well, I have a couple of questions. I have an Aurora cluster with a single MySQL RDS instance holding 450 GB of data. We use this cluster only when we are doing some specific testing, so I want to delete the cluster but keep its data available, so that I can create a new cluster whenever testing needs to be done.
There are a couple of ways this can be done, as far as I know:
1. Take a snapshot of the cluster and restore the cluster from the snapshot whenever required.
2. Back up the cluster to S3 and restore the cluster from S3 when required.
Which way is faster, and which one is more cost-efficient?
Can an entire cluster be restored from S3? If so, what are the steps involved? I found the AWS documentation a bit too messy.
If we stop an Aurora cluster, it automatically restarts within 7 days. Is there a way to prevent this automatic restart, keep the cluster stopped while it is not required, and start it when required?
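On the last point, the seven-day restart is by design, and the usual workaround is a scheduled Lambda (e.g. via an EventBridge rule) that stops the cluster again whenever it is found running. A minimal sketch, with the cluster identifier as a placeholder and boto3 imported only inside the handler:

```python
# Sketch: re-stop an Aurora cluster that AWS auto-restarted after 7 days.
# Schedule this handler daily; the cluster id below is a placeholder.

CLUSTER_ID = "my-test-aurora-cluster"

def should_stop(status):
    """Only a cluster in the 'available' state can be stopped again."""
    return status == "available"

def handler(event, context):
    import boto3  # imported here so the pure logic above stays testable
    rds = boto3.client("rds")
    cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
    if should_stop(cluster["Status"]):
        rds.stop_db_cluster(DBClusterIdentifier=CLUSTER_ID)
```

Note this keeps storage billing running; deleting the cluster with a final snapshot, as in option 1, is what actually removes the instance cost.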

How to Automate Redshift Cluster Start/Stop for night time?

I have an AWS Redshift cluster (dc2.8xlarge), and I am currently paying a huge bill each month for running the cluster 24/7.
Is there a way to automate the cluster uptime so that the cluster runs during the day, stops at 8 PM in the evening, and starts again at 8 AM in the morning?
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it costs less. A Redshift cluster is sized for Compute and for Storage; you could reduce the number of nodes as long as you retain enough nodes for your Storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage, enabling independent compute and storage scaling, which means you might be able to scale down to a single node. (This is a new node type; I'm not sure how it works.)
Another option is to take a Snapshot and Shutdown the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
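The snapshot-and-shutdown approach above can be sketched as a pair of Lambda handlers. The API calls (`delete_cluster`, `describe_cluster_snapshots`, `restore_from_cluster_snapshot`) are real boto3 Redshift operations, but the cluster name is a placeholder and the wiring to CloudWatch Events schedules is left out:

```python
# Sketch: "snapshot, shutdown, restore" on a schedule. The pure helper
# picks the newest snapshot; the handlers would be triggered by two
# CloudWatch Events schedules (evening shutdown, morning restore).
import time

def latest_snapshot(snapshots):
    """Return the identifier of the most recent snapshot."""
    newest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
    return newest["SnapshotIdentifier"]

def shutdown(cluster_id):
    import boto3
    boto3.client("redshift").delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,
        # snapshot names must be unique, hence the timestamp suffix
        FinalClusterSnapshotIdentifier=f"{cluster_id}-{int(time.time())}",
    )

def restore(cluster_id):
    import boto3
    redshift = boto3.client("redshift")
    snaps = redshift.describe_cluster_snapshots(ClusterIdentifier=cluster_id)["Snapshots"]
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=latest_snapshot(snaps),
    )
```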
If you are concerned with the general cost of your cluster, you might want to downsize from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which has a lower cost per TB of data stored.
Good news :)
We can now pause and resume the Redshift cluster (from both the Console and the CLI).
Check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
Now we can pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very important feature for keeping costs in check.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This will help you automate the cluster uptime and downtime, so that the cluster runs during the day, pauses automatically at a specific time in the evening, and starts again automatically in the morning.
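The scheduling mentioned above maps to Redshift "scheduled actions" (`create_scheduled_action` in boto3). A minimal sketch of the two actions; the action names, cluster id, cron expressions (UTC), and IAM role ARN are all placeholders:

```python
# Sketch: scheduled pause/resume via Redshift scheduled actions.
# Builds the parameter dicts for boto3's create_scheduled_action;
# all names, times, and the role ARN are placeholders.

def scheduled_action(name, target_key, cluster_id, cron, iam_role):
    return {
        "ScheduledActionName": name,
        "TargetAction": {target_key: {"ClusterIdentifier": cluster_id}},
        "Schedule": f"cron({cron})",  # cron times are UTC
        "IamRole": iam_role,
        "Enable": True,
    }

ROLE = "arn:aws:iam::123456789012:role/RedshiftScheduler"
pause = scheduled_action("pause-nightly", "PauseCluster",
                         "my-cluster", "0 20 * * ? *", ROLE)
resume = scheduled_action("resume-morning", "ResumeCluster",
                          "my-cluster", "0 8 * * ? *", ROLE)
# With credentials: boto3.client("redshift").create_scheduled_action(**pause)
```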
It's also pretty easy to use the open-source Cloud Custodian (https://cloudcustodian.io) to automate night-time/weekend off-hours for Redshift and other AWS resources.

How does AWS Athena manage to execute queries immediately?

Does Athena have a gigantic cluster of machines ready to take queries from users and run them against their data? Are they using a specific open-source cluster management software for this?
I believe AWS will never disclose how they operate the Athena service. However, since Athena is managed PrestoDB, the overall design can be deduced from that.
PrestoDB does not require a cluster manager like YARN or Mesos. It has its own planner and scheduler, which can run a SQL physical plan on the worker nodes.
I assume that within each availability zone AWS maintains a PrestoDB coordinator connected to a data catalog (AWS Glue) and a set of Presto workers. The workers are elastic and autoscaled: during inactivity they are scaled down, and when a burst of activity occurs, new workers are added to the cluster.