Connect to "on-premise" PostgreSQL database with AWS Glue - amazon-web-services

I have a PostgreSQL database which is in effect "on premise", but I have credentials and a JDBC connection string. I want to read a table with AWS Glue, use it as a source in a job, and write the data to S3.
But it is asking for a VPC? I don't understand. Can't I just hard-code the connection in the job? This seems like such a basic task for an ETL environment. What am I missing?

Glue can connect to any database that has a JDBC driver, and it is a good toolbox for fast-tracking PySpark coding.
Basically you need to understand where the database sits relative to your AWS environment, then identify or create a VPC that can reach it. From there, establish your network ACLs and security group so the Glue connection can get through.
Good luck!
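In practice, once the VPC, subnet, and security group plumbing is in place, the job script itself stays small. A minimal sketch of such a job (the connection URL, credentials, table name, and bucket are placeholders, so adapt them to your setup):

    # Minimal sketch of a Glue (PySpark) job: read an on-premises PostgreSQL
    # table over JDBC and write it to S3 as Parquet. Runs inside the Glue
    # job runtime; URL, credentials, table and bucket are placeholders.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read straight over JDBC; alternatively reference a Glue Connection,
    # which is where the VPC/subnet/security-group settings live.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "url": "jdbc:postgresql://my-onprem-host:5432/mydb",  # placeholder
            "user": "my_user",                                    # placeholder
            "password": "my_password",                            # placeholder
            "dbtable": "public.my_table",                         # placeholder
        },
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/my_table/"},  # placeholder
        format="parquet",
    )

    job.commit()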

Related

Custom Connector for AWS Glue Studio for MS SQL Server

I'm evaluating how to set up our data lake in AWS. I've currently got connectivity working with Glue Connections in the "old" Glue console, but since we have no legacy Glue setup yet, I would think adopting the newest tools and practices now would save us converting later.
I am however at a loss of understanding how one creates a simple JDBC connection for an on prem (not AWS RDS) SQL Server. This is as simple as a jdbc-connectionstring in "old Glue", but in Glue Studio you have to either buy a container-based third party connector for $250/mo or roll your own in the form of a jar-file, which isn't exactly a simple step-by-step process.
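For reference, what I mean by the "simple jdbc-connectionstring" path in classic Glue is roughly the following boto3 sketch (connection name, host, credentials, region, and network IDs are all placeholders):

    # Rough sketch of a classic Glue JDBC connection for an on-prem SQL Server,
    # created with boto3. Every name, host, credential and network ID below is
    # a placeholder.
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")  # placeholder region

    glue.create_connection(
        ConnectionInput={
            "Name": "onprem-sqlserver",  # placeholder
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": "jdbc:sqlserver://my-onprem-host:1433;databaseName=mydb",
                "USERNAME": "my_user",       # placeholder
                "PASSWORD": "my_password",   # placeholder
            },
            # Networking the connection is attached to (placeholders):
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],
                "AvailabilityZone": "eu-west-1a",
            },
        }
    )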
I can't seem to find any information other than the official AWS docs. Is this too new and unfinished a service? Are AWS pushing people to use DMS and AWS RDS to be able to do this? (unlikely)
Should I just ignore Glue Studio until it has matured? What am I missing?
PS: I'm sorry for the non-specific and non-technical nature of the question.

Using AWS Glue with MySQL in my EC2 Instance

I am trying to connect AWS Glue to the MySQL server installed on my EC2 instance. The purpose is to extract some information and move it to Redshift, but I am receiving the following error:
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
This is the format that I am using: jdbc:mysql://host:3306/database
I am using the same VPC, same SG, and same subnet as the instance.
I know the user/password are correct because I can connect to the database with SQL Developer.
What do I need to check? Is it possible to use AWS Glue with MySQL on my instance?
Thanks in advance.
In the JDBC connection URL that you have mentioned, use the private IP of the EC2 instance (where MySQL is installed) as the host:
jdbc:mysql://ec2_instance_private_ip:3306/database
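Once the host points at the private IP, the same URL can also be used directly from the Glue job's Spark session, for example (IP, database, table, and credentials are placeholders; spark here is the SparkSession, e.g. glueContext.spark_session in a Glue job):

    # Sketch: read the EC2-hosted MySQL table over JDBC from Spark.
    # The private IP, database, table and credentials are placeholders.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://10.0.1.23:3306/database")  # private IP placeholder
        .option("driver", "com.mysql.cj.jdbc.Driver")           # older drivers: com.mysql.jdbc.Driver
        .option("dbtable", "my_table")                          # placeholder
        .option("user", "my_user")                              # placeholder
        .option("password", "my_password")                      # placeholder
        .load()
    )
    df.show(5)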
Yes, it is possible to use AWS Glue with MySQL running on your EC2 instance, but you should first use DMS to migrate your databases.
Moreover, your target database (Redshift) uses a different engine than the source database (MySQL). That is what we call a heterogeneous database migration (the schema structure, data types, and database code of the source and target are quite different), so you also need AWS SCT, the Schema Conversion Tool.
Check this out:
As you can see, I'm not sure you can migrate straight from MySQL on an EC2 instance to Amazon Redshift.
Hope this helps

Querying Amazon Aurora with Java

I apologize if this is too broad, but I am very new to AWS and I have a specific task I want to do but I can't seem to find any resources to explain how to do it.
I have a Java application that at a high level manages data, and I want that application to be able to store and retrieve information from Amazon Aurora. The simplest task I want to achieve is to be able to run the query "SELECT * FROM Table1" (where Table1 is some example table name in Aurora) from Java. I feel like I'm missing something fundamental about how AWS works, because I've thus far been drowning in a sea of links to AWS SDKs, none of which seem to be relevant to this task.
If anyone could provide some concrete information toward how I could achieve this task, what I'm missing about AWS, etc, I would really appreciate it. Thank you for your time.
You don't use the AWS SDK to query an RDS database. The API/SDK is for managing the servers themselves, not for accessing the RDBMS software running on the servers. You would connect to AWS Aurora via Java just like you would connect to any other MySQL database (or PostgreSQL if you are using that version of Aurora), via the JDBC driver. There's nothing AWS specific about that, other than making sure your code is running from a location that has access to the RDS instance.
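To make that concrete, here is a minimal sketch in Python using PyMySQL against the Aurora MySQL flavor (the Java version is the same idea with the MySQL Connector/J JDBC driver and DriverManager.getConnection); the cluster endpoint, credentials, and table name are placeholders:

    # Sketch: query an Aurora MySQL cluster like any other MySQL server.
    # Nothing AWS-specific here -- endpoint, credentials and table name
    # are placeholders.
    import pymysql

    conn = pymysql.connect(
        host="my-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
        port=3306,
        user="my_user",          # placeholder
        password="my_password",  # placeholder
        database="mydb",         # placeholder
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM Table1")
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()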

Simplest way to get data from AWS MySQL RDS to AWS Elasticsearch?

I have data in an AWS RDS, and I would like to pipe it over to an AWS ES instance, preferably updating once an hour, or similar.
On my local machine, with a local mysql database and Elasticsearch database, it was easy to set this up using Logstash.
Is there a "native" AWS way to do the same thing? Or do I need to set up an EC2 server and install Logstash on it myself?
You can achieve the same thing with your local Logstash: simply point your jdbc input at your RDS database and the elasticsearch output at your AWS ES instance. If you need to run this regularly, then yes, you'd need to set up a small instance to run Logstash on.
A more "native" AWS solution to achieve the same thing would include the use of Amazon Kinesis and AWS Lambda.
Here's a good article explaining how to connect it all together, namely:
how to stream RDS data into a Kinesis stream,
how to configure a Lambda function to handle the stream, and
how to push the data into your AWS ES instance (a minimal Lambda sketch follows below).
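A rough sketch of the Lambda piece of that pipeline, assuming JSON records on the stream (the ES endpoint and index name are placeholders, and a real AWS ES domain would additionally need SigV4 request signing or an access policy that trusts the Lambda's role):

    # Sketch of a Lambda handler: decode Kinesis records and bulk-index them
    # into Elasticsearch. Endpoint and index are placeholders; production use
    # against AWS ES also needs SigV4 signing or a permissive access policy.
    import base64
    import json
    import urllib.request

    ES_ENDPOINT = "https://my-es-domain.us-east-1.es.amazonaws.com"  # placeholder
    INDEX = "rds-data"                                               # placeholder

    def handler(event, context):
        bulk_lines = []
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])  # Kinesis data is base64-encoded
            doc = json.loads(payload)
            bulk_lines.append(json.dumps({"index": {"_index": INDEX}}))
            bulk_lines.append(json.dumps(doc))

        if not bulk_lines:
            return {"indexed": 0}

        body = ("\n".join(bulk_lines) + "\n").encode("utf-8")
        req = urllib.request.Request(
            ES_ENDPOINT + "/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
        return {"indexed": len(event["Records"])}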
Take a look at Amazon DMS. It's usually used for DB migrations, however it also supports continuous data replication. This might simplify the process and be cost-effective.
You can use AWS Database Migration Service to perform continuous data replication. Continuous data replication has a multitude of use cases, including Disaster Recovery instance synchronization, geographic database distribution, and Dev/Test environment synchronization. You can use DMS for both homogeneous and heterogeneous data replication for all supported database engines. The source or destination databases can be located on your own premises outside of AWS, running on an Amazon EC2 instance, or hosted as an Amazon RDS database. You can replicate data from a single database to one or more target databases, or consolidate data from multiple source databases and replicate it to one or more target databases.
https://aws.amazon.com/dms/

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?
However, when creating a new job flow, we can only configure an S3 bucket as the input data origin.
Any ideas/samples on how to do this?
Thanks!
P.S.: I've seen the question "How to use external data with Elastic MapReduce", but the answers do not really explain how to do or configure it, only that it is possible.
How are you processing the data? EMR is just managed Hadoop; you still need to write a process of some sort.
If you are writing a Hadoop MapReduce job, then you are writing Java and you can use the Cassandra APIs to access your data.
If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.
Try using scp to copy files to your EMR instance:
my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file
(or use ftp, or wget, or curl, or anything else you want)
then log into your EMR node with ssh and load the file into HDFS:
my-desktop-box$ ssh my-emr-node
my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file