Is it possible to query things in an RDS database using Athena? Or do I somehow have to get my data out of RDS and copy it into an S3 bucket so that Athena can query it from there? If that is the case, how can I find out which tables are in my RDS? Is there a way to explore all the schemas of a database with Glue?
A feature was created for exactly this purpose last year: Federated Queries.
With it you can query a large number of data sources, not just S3.
If you're using either MySQL or Postgres in RDS then you can make use of the JDBC connector, with additional instructions here.
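As a rough illustration, once a connector has been deployed and registered as an Athena data catalog, a federated query is submitted like any other Athena query. The sketch below only builds the arguments for boto3's start_query_execution call. The catalog name rds_mysql, the database sales, the table orders, and the results bucket are hypothetical placeholders, not names from this thread.

```python
def build_federated_query_request(catalog, database, sql, output_location):
    """Return keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        # Catalog selects the registered data source (here: the RDS connector).
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names: replace with your own catalog, database, and bucket.
request = build_federated_query_request(
    catalog="rds_mysql",
    database="sales",
    sql="SELECT * FROM orders LIMIT 10",
    output_location="s3://my-athena-results/federated/",
)
print(request)
```

Calling run_query(request) would actually submit the query; it is left out of the example so the snippet runs without an AWS account.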
Related
I'm wondering whether it is possible to easily sync an Amazon RDS PostgreSQL database to Amazon S3 in near real time so that data can be used with Amazon Athena, just as read replicas do.
We have several RDS database and we would like to consolidate all the data in a single repository such as S3.
Thanks.
There is no capability to "export RDS to S3 in real time".
However, Amazon Athena can query Amazon RDS databases, so you could have some of your data in Amazon S3 and some in Amazon RDS.
See: Query any data source with Amazon Athena’s new federated query | AWS Big Data Blog
What you are describing sounds like a data warehouse, where information is extracted from many information sources and is stored in one place for easy querying -- often in 'wide' tables to make querying simpler. However, this is very difficult to do "in real time". It is typically updated nightly, or perhaps hourly.
You might want to consider using AWS Database Migration Service to continuously sync data between RDS and S3: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-dms-target/
That said, this is only sensible when you don't have a read-only replica of the data and the queries might affect the performance of the source RDS instance.
I have stored changelogs (data with information about data) from non-relational, schemaless data tables in S3. Now I want a structured relational database so that I can query all of the data, so I need to create a database from S3. I am confused about what I should do: use another S3 bucket or a traditional database?
You can create a Glue Data Catalog over the data and query it using Athena, which is serverless.
This way you are not tied to any RDBMS and can query your data whenever you need to while keeping the files in S3.
This will also be cost-effective.
You can also spin up an RDS instance in AWS at any time if required, so keeping the files in S3 is a good option.
Can we execute a SQL query inside a DMS task so that it fetches only the required data and not the whole DB?
If that is not possible, which AWS service can be used to fetch query-based data from an on-prem data source into S3?
You can use filters and/or exclude fields: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.html
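For illustration, a table mapping that combines a source filter with a dropped column might look like the sketch below. It follows the rule layout described on the linked documentation page, but the schema, table, and column names (public, orders, order_id, internal_notes) and the threshold value are hypothetical.

```python
import json


def selection_rule(rule_id, schema, table, column, operator, value):
    """Include one table, keeping only rows that pass a source filter."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
        "filters": [
            {
                "filter-type": "source",
                "column-name": column,
                "filter-conditions": [
                    {"filter-operator": operator, "value": str(value)}
                ],
            }
        ],
    }


def remove_column_rule(rule_id, schema, table, column):
    """Exclude a single column from the migrated table."""
    return {
        "rule-type": "transformation",
        "rule-id": str(rule_id),
        "rule-name": str(rule_id),
        "rule-action": "remove-column",
        "rule-target": "column",
        "object-locator": {
            "schema-name": schema,
            "table-name": table,
            "column-name": column,
        },
    }


# Hypothetical example: migrate only orders with order_id >= 1000,
# and leave out the internal_notes column.
table_mappings = {
    "rules": [
        selection_rule(1, "public", "orders", "order_id", "gte", 1000),
        remove_column_rule(2, "public", "orders", "internal_notes"),
    ]
}
print(json.dumps(table_mappings, indent=2))
```

The printed JSON is the kind of document you would paste into the task's table mappings in the console, or pass as a string when creating the task through the API.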
Contact me if you have problems.
As an alternative to DMS, you can use AWS Glue with a PySpark DataFrame to pull data from the on-prem DB into either S3 or AWS RDS. This works very well. The only downside is the cost.
This solution supports both a table and a SQL query as input for data extraction.
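A minimal sketch of the "table or SQL as input" idea, assuming a plain Spark JDBC read inside the Glue job: Spark's JDBC source accepts a parenthesised subquery wherever a table name is expected. The JDBC URL, credentials, query, and S3 path are hypothetical, and the matching JDBC driver has to be available to the job.

```python
def jdbc_read_options(url, user, password, table=None, query=None):
    """Build Spark JDBC options for either a whole table or a SQL query."""
    if (table is None) == (query is None):
        raise ValueError("Provide exactly one of table or query")
    options = {"url": url, "user": user, "password": password}
    if query is not None:
        # A subquery with an alias is valid anywhere a table name is expected.
        options["dbtable"] = f"({query}) src"
    else:
        options["dbtable"] = table
    return options


def extract_to_s3(spark, options, s3_path):
    """Read from the source database and write Parquet files to S3.

    `spark` is an existing SparkSession, for example the one a Glue job provides.
    """
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(s3_path)


# Hypothetical connection details and query.
options = jdbc_read_options(
    url="jdbc:mysql://onprem-db.example.com:3306/crm",
    user="etl_user",
    password="change-me",
    query="SELECT id, name FROM customers WHERE active = 1",
)
print(options["dbtable"])
```

Inside a job you would then call something like extract_to_s3(spark, options, "s3://my-bucket/customers/"), so only the rows selected by the query leave the source database.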
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations, this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error, but I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
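The include-path rules from the answers above can be captured in a small helper. This is only a sketch: the function names and the connection name my-rds-connection are made up for illustration, and the returned dictionary is shaped like the Targets argument that boto3's Glue create_crawler call accepts.

```python
def jdbc_include_path(database, schema=None, table="%"):
    """Build a crawler include path; '%' matches every table."""
    parts = [database]
    if schema:
        # Only for engines that expose schemas (for example 'dbo' on SQL Server).
        parts.append(schema)
    parts.append(table)
    return "/".join(parts)


def crawler_targets(connection_name, include_path):
    """Build a JDBC target definition for a Glue crawler."""
    return {
        "JdbcTargets": [
            {"ConnectionName": connection_name, "Path": include_path}
        ]
    }


# MySQL has no schema level in the path.
print(jdbc_include_path("mydatabase"))
# SQL Server needs the schema between the database and the table.
print(jdbc_include_path("database_name", "dbo", "table_name"))
# Hypothetical connection name.
print(crawler_targets("my-rds-connection", jdbc_include_path("mydatabase")))
```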
I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface:
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour when using a database created using an S3 file. Maybe this is related to the error. Does anybody have an idea?
I had the same problem. This is the answer that I have got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database but the tables are not visible in Athena.
Athena service is designed to query tables that point to S3 as data-source. It cannot read data from non-S3 resources as of today.
So, unfortunately not possible at the moment.