Whitelisting Azure Databricks Spark Cluster on AWS Redshift - amazon-web-services

Hi, I am not sure if anyone has come across this situation before. I have both Azure and AWS environments. I have a Spark cluster running on Azure Databricks, and a Python/PySpark script that I want to run on it. In this script I want to write some data into an AWS Redshift cluster, which I plan to do using the psycopg2 library. Where can I find the IP address of the Azure Databricks Spark cluster so that I can whitelist it in the security group of the AWS Redshift cluster? I think that at the moment I cannot write to the Redshift cluster because the script runs on the Azure Databricks Spark cluster and the Redshift security group does not allow connections coming from it.
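For reference, a minimal sketch of the kind of write described above, assuming psycopg2 is installed on the cluster. The host, database, credentials, and the table name `events` are placeholders, not values from the question. Setting `connect_timeout` makes a blocked security group show up as a quick timeout rather than a long hang.

```python
def redshift_params(host, dbname, user, password, port=5439, timeout=10):
    """Build keyword arguments for psycopg2.connect (libpq parameter names)."""
    return {
        "host": host,
        "port": port,  # 5439 is the default Redshift port
        "dbname": dbname,
        "user": user,
        "password": password,
        "connect_timeout": timeout,  # seconds to wait for the TCP connection
    }


def write_rows(params, rows):
    """Insert (id, name) tuples into a hypothetical table named events."""
    import psycopg2  # imported here so redshift_params works without it

    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO events (id, name) VALUES (%s, %s)", rows
            )
        conn.commit()
    finally:
        conn.close()
```

If the connection times out instead of failing with an authentication error, the request is most likely being dropped before it reaches Redshift, which points at the security group rather than the credentials.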

I have a similar use case: connecting from Azure Databricks to AWS RDS. I need to whitelist the Azure Databricks IPs in the AWS security group attached to RDS. Databricks assigns dynamic IPs to a cluster, so the address changes each time the cluster is restarted.
I am trying this solution:
Create a public IP address in the Azure portal.
Associate the public IP address with a virtual machine.
https://learn.microsoft.com/en-us/azure/virtual-network/associate-public-ip-address-vm#azure-portal
Currently I am getting an error saying that I do not have permission to update the VNet associated with Databricks.
This is the simplest solution I could come up with.
If this doesn't work, the next option is to try a site-to-site connection to set up a tunnel between Azure and AWS. This would allow traffic from all the dynamic IPs to be authorised for read and write operations on AWS.
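Once a fixed public IP is in place, it can help to confirm that the address really falls inside the security group's inbound rules. A minimal sketch using only the Python standard library; the addresses in the usage note are documentation-range placeholders, not real cluster IPs.

```python
import ipaddress


def is_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


def as_single_host_rule(source_ip):
    """Return the CIDR that allows exactly one address (/32 for IPv4)."""
    return str(ipaddress.ip_network(source_ip))
```

For example, `is_allowed("203.0.113.10", ["203.0.113.0/24"])` is true, and `as_single_host_rule("203.0.113.10")` gives the `/32` form that a security group rule for a single address expects.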

Related

BigQuery data transfer service does not work when using VPC

I have an issue when migrating from Redshift to BigQuery. Here is what I have done so far:
Created a VPN that connects the GCP VPC and the AWS VPC. (The VPC IP ranges do not overlap.)
The VPN works. (I tested it by creating an EC2 instance and pinging its private IP from a GCP Compute Engine VM, and that works.)
I created a Redshift cluster with the publicly accessible option enabled, then created a BigQuery Data Transfer Service transfer. That works.
BUT when I create a Redshift cluster with the publicly accessible option disabled and then create a BigQuery Data Transfer Service transfer, I get an error.
ERROR:
Unable to proceed: Could not connect with provided parameters: No suitable driver found for jdbc:redshift://redshift-cluster-1.cbr8ra8jmxgm.us-east-1.redshift.amazonaws.com:5439/dev
I also tried to ping the AWS Redshift IP address from the GCP Compute Engine VM. It does not respond.
What can be the reason?
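A note on the ping test: ping uses ICMP, and AWS security groups drop inbound ICMP unless a rule explicitly allows it, so a failed ping does not by itself show that the cluster is unreachable over the VPN. A TCP check against the Redshift port (5439 in the JDBC URL above) is usually more informative. A minimal sketch using only the standard library:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

Run from the Compute Engine VM, `can_connect(redshift_private_ip, 5439)` (where `redshift_private_ip` is a placeholder for the cluster's private address) tells you whether the port is reachable through the tunnel, independent of ICMP.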

AWS MSK - Debezium Postgres Connector for AWS RDS - Failed to Connect

I'm currently facing the following issue when using an AWS MSK connector (Debezium Postgres connector):
[Worker-0509fac07b9701a23] [2022-01-19 04:55:28,759] ERROR Failed testing connection for jdbc:postgresql://debezium-cdc.fac07b9701a2.ap-south-1.rds.amazonaws.com:5432/ecommerce with user 'debezium' (io.debezium.connector.postgresql.PostgresConnector:133)
I've tested AWS MSK using Kafka clients on EC2, and I'm able to produce and consume messages. I've also set up the AWS MSK S3 sink connector, and that works as well.
I've double-checked the security group config for AWS RDS, and I'm able to connect to it from EC2.
I'm not sure what's causing this issue.
Here's the connector configuration:
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
database.hostname=debezium-cdc.fac07b9701a2.ap-south-1.rds.amazonaws.com
database.port=5432
database.dbname=ecommerce
database.user=debezium
database.password=password
database.history.kafka.bootstrap.servers=b-2.awskafkatutorialclust.awskaf.c4.kafka.ap-south-1.amazonaws.com:9094,b1.awskafkatutorialclust.awskaf.c4.kafka.ap-south-1.amazonaws.com:9094,b-3.awskafkatutorialclust.awskaf.c4.kafka.ap-south-1.amazonaws.com:9094
database.server.id=1
database.server.name=debezium-cdc
database.whitelist=ecommerce
database.history.kafka.topic=dbhistory.ecommerce
include.schema.changes=true
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
You need to set your AWS RDS database to Publicly accessible: No.
Your AWS MSK is in a private network (VPC) and cannot connect to publicly accessible databases (read more: https://docs.aws.amazon.com/vpc/latest/userguide/how-it-works.html).
Please try changing your RDS Postgres database to Publicly accessible: No, and then create the MSK connector again.
(Make sure that your AWS RDS database is in the same VPC and security group as your AWS MSK.)
If you then want to connect to your private AWS RDS database yourself, you will need a bastion host (read more: https://aws.amazon.com/premiumsupport/knowledge-center/rds-connect-ec2-bastion-host/).
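To see which kind of address the connector would actually be talking to, one option is to resolve the RDS endpoint from a host in the same subnets and check whether the result is a private address. A minimal sketch with the standard library; the hostname passed in would be the `database.hostname` value from the configuration above.

```python
import ipaddress
import socket


def resolve_ips(hostname):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def is_private_ip(ip):
    """Return True if ip is in a private range (for example 10.0.0.0/8).

    Python also reports loopback and link-local addresses as private.
    """
    return ipaddress.ip_address(ip).is_private
```

If every address returned by `resolve_ips` is private, the traffic should stay inside the VPC; a public address suggests the connector needs a route to the internet that it may not have.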

Can we connect an Elasticsearch cluster running on EC2 as a data source for AWS AppSync?

I have an EC2 cluster on which my Elasticsearch (ES) is running. I want to use AWS AppSync. Can I connect that cluster to it as a data source? If so, how?
Or is AppSync tightly coupled to Amazon OpenSearch Service?

What are the steps to replicate AWS RDS PostgreSQL to on-premise PostgreSQL using AWS DMS?

I have a requirement to replicate data from an AWS RDS Postgres (12) database to an on-premise Postgres (12) database for disaster recovery purposes. I have found material about replication from on-premise to AWS RDS, but how can we implement it from AWS RDS to on-premise?
Any help will be much appreciated.
Hello, I think you have two options here:
1. Use AWS Database Migration Service: set up a source endpoint (PostgreSQL on RDS) and a target endpoint (PostgreSQL on-premises), then set up a DMS task for full load and CDC. For details, refer to: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html, https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.PostgreSQL.html
2. Set up replication from RDS PostgreSQL to on-premises PostgreSQL using PostgreSQL native logical replication. There is a very good AWS blog post covering exactly this: https://aws.amazon.com/blogs/database/using-logical-replication-to-replicate-managed-amazon-rds-for-postgresql-and-amazon-aurora-to-self-managed-postgresql/
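For the DMS option, the task needs a table-mapping document in JSON that says which schemas and tables to replicate. A minimal sketch that builds one selection rule; the schema name `public` in the usage note and the rule name are assumptions for illustration, not values from the question.

```python
import json


def build_table_mappings(schema, table="%"):
    """Return a DMS table-mapping JSON string with one selection rule.

    The default table pattern "%" matches every table in the schema.
    """
    rules = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-" + schema,
                "object-locator": {
                    "schema-name": schema,
                    "table-name": table,
                },
                "rule-action": "include",
            }
        ]
    }
    return json.dumps(rules)
```

`build_table_mappings("public")` produces a mapping that includes every table in the `public` schema. The resulting string is what you would supply as the table mappings when creating the replication task, with the migration type set to full load plus CDC.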

Do Databricks workers and Elasticsearch nodes need to be in the same VPC in AWS?

I would like to write a dataframe into Elasticsearch from within Databricks.
My Elasticsearch cluster is hosted on AWS and Databricks is spinning up EC2 instances with a certain role. That role has the permission to interact with my Elasticsearch cluster but for some reason, I seem not to be able to even PING the Elasticsearch cluster.
Do I need to find a way to squeeze both my Databricks workers and my Elasticsearch cluster into the same VPC? Sounds like a CloudFormation nightmare.
If you've got ES running in another VPC, then you'll need either PrivateLink or VPC peering to ensure the workers can access it. For isolation, and to avoid issues with IP limits for your workers, it would be better to keep ES and Databricks in different VPCs.
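If you go the peering route, note that VPC peering requires the two VPCs to have non-overlapping CIDR blocks. A quick way to check this before setting anything up, sketched with the standard library; the CIDR values in the usage note are made-up examples.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)
```

For example, `cidrs_overlap("10.0.0.0/16", "10.1.0.0/16")` is false, so those two VPCs could be peered, while `10.0.0.0/16` and `10.0.128.0/20` overlap and could not.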