I am trying to run a query from Athena against a Neptune database. I am using the Athena Neptune connector to connect to Neptune and display the data in the Athena query editor.
However, I get an error when I run the following query:
SELECT * FROM "datasource/datacatalog".<dbname>.<tablename> limit 10;
Error Message:
Encountered an exception[java.lang.NullPointerException] from your LambdaFunction[neptune_connector lambda function name] executed in context[S3SpillLocation{bucket='S3 bucketname', key='<file prefix name/2e6fedb0-9366-4d83-8a69-20472d7ff850/', directory=true}]
What I have done so far:
Created a new data catalog, database, and table (manually, componenttype: vertex).
Connected the data catalog to the Neptune connector Lambda function.
Granted full access to Athena, S3, Neptune, and Glue to the role under which the Neptune connector Lambda function runs.
Provided the name of an existing S3 bucket in the spillbucket variable of the connector Lambda function.
References:
Amazon Athena Neptune connector - https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune
Connecting to the data source - https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source-lambda.html#connect-to-a-data-source-lambda-connecting
However, I am able to see the table info when I run the following query; only the actual SELECT query fails.
describe `datasource/datacatlogname`.<dbname>.<tablename>;
Cloudwatch logs:
java.lang.NullPointerException: java.lang.NullPointerException
java.lang.NullPointerException
    at com.amazonaws.athena.connectors.neptune.propertygraph.PropertyGraphHandler.executeQuery(PropertyGraphHandler.java:112)
    at com.amazonaws.athena.connectors.neptune.NeptuneRecordHandler.readWithConstraint(NeptuneRecordHandler.java:113)
    at com.amazonaws.athena.connector.lambda.handlers.RecordHandler.doReadRecords(RecordHandler.java:19
    at com.amazonaws.athena.connector.lambda.handlers.RecordHandler.doHandleRequest(RecordHandler.java:158)
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:138)
    at com.amazonaws.athena.connector.lambda.handlers.CompositeHandler.handleRequest(CompositeHandler.java:103)
Check that the Lambda configuration is set up properly to access the Neptune database. Make sure the security group attached to the Lambda function allows access to the database.
The connector is available in the AWS Serverless Application Repository; try deploying it from there.
I found the issue. When I created the table manually in the console, I put separatorChar and componenttype under SerDe parameters.
https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-neptune/docs/aws-glue-sample-scripts/manual/sample-cli-script.sh
They should be under table properties, as in the sample script linked above.
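To make the difference concrete, here is a minimal sketch of a Glue TableInput where the two keys sit in the table-level Parameters (what the console calls table properties) and not in SerdeInfo.Parameters. The helper name, the column list, and the "," value for separatorChar are assumptions for illustration; check the linked sample script for the values your table needs.

```python
def build_neptune_table_input(table_name, columns, component_type="vertex"):
    """Build a Glue TableInput dict for a manually defined Neptune table.

    columns is a list of (name, glue_type) pairs; the ones used below are
    hypothetical and must match the properties of your own graph data.
    """
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": glue_type} for name, glue_type in columns],
            # Leave the SerDe parameters empty: the connector settings do not go here.
            "SerdeInfo": {"Parameters": {}},
        },
        # Table properties: this is where the connector settings belong.
        "Parameters": {
            "separatorChar": ",",  # assumed value
            "componenttype": component_type,
        },
    }


table_input = build_neptune_table_input("airport", [("id", "string"), ("code", "string")])
```

The resulting dict could then be passed as TableInput to the Glue create_table call (for example through boto3), together with the name of the database you created.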
Related
I have a set of AWS Athena queries that I run manually against my S3 folder. The next step is to take the query results and push them to an Aurora MySQL database. I created the Athena table using AWS Glue.
My questions are below.
Is there any AWS service that can run an Athena query and insert the resulting rows into RDS?
If the answer to question 1 is yes, can that service be scheduled?
At present I have created a Lambda function for this operation.
Can we execute a SQL query inside a DMS task so that it fetches only the required data and not the whole database?
If that is not possible, which AWS service can fetch query-based data from an on-prem data source into S3?
You can use filters and/or exclude fields: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.html
Contact me if you have problems.
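As a rough sketch of what such a table mapping could look like, the helper below builds a selection rule with a source filter plus optional remove-column transformation rules. The schema, table, and column names are made up for illustration; only the rule structure follows the table-mapping documentation linked above.

```python
import json


def build_table_mapping(schema, table, filter_column, min_value, drop_columns=()):
    """Return DMS table-mapping JSON that keeps rows with filter_column >= min_value
    and removes the listed columns from the migrated table."""
    rules = [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "select-filtered-rows",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
        "filters": [{
            "filter-type": "source",
            "column-name": filter_column,
            "filter-conditions": [
                {"filter-operator": "gte", "value": str(min_value)}
            ],
        }],
    }]
    # One transformation rule per column that should be excluded.
    for rule_id, column in enumerate(drop_columns, start=2):
        rules.append({
            "rule-type": "transformation",
            "rule-id": str(rule_id),
            "rule-name": f"drop-{column}",
            "rule-target": "column",
            "object-locator": {
                "schema-name": schema,
                "table-name": table,
                "column-name": column,
            },
            "rule-action": "remove-column",
        })
    return json.dumps({"rules": rules}, indent=2)


# Hypothetical example: orders with id >= 1000, without the "notes" column.
mapping_json = build_table_mapping("sales", "orders", "order_id", 1000, ["notes"])
```

The JSON string is what you would paste into the task's table mappings. Note that this filters rows and columns; it is not an arbitrary SQL query.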
As an alternative to DMS, you can use AWS Glue: read the data from the on-prem database into a PySpark DataFrame and write it to either S3 or AWS RDS. This works very well. The only downside is the cost.
This solution supports both a table and a SQL query as input for data extraction.
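A minimal sketch of that approach, assuming a SparkSession (for example the one a Glue job provides) is passed in and that a JDBC driver for the source database is available to the job. The function names, the JDBC URL, and the S3 path are placeholders, not part of any AWS API.

```python
def jdbc_read_options(url, user, password, table=None, query=None):
    """Build Spark JDBC reader options for either a whole table or a SQL query."""
    if (table is None) == (query is None):
        raise ValueError("pass exactly one of table or query")
    options = {"url": url, "user": user, "password": password}
    if table is not None:
        options["dbtable"] = table   # read the full table
    else:
        options["query"] = query     # read only what the query returns
    return options


def extract_to_s3(spark, options, s3_path):
    """Read from the source database over JDBC and write the result to S3 as Parquet."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(s3_path)


# Hypothetical values; replace with your own connection details.
opts = jdbc_read_options(
    "jdbc:mysql://onprem-host:3306/mydb", "etl_user", "secret",
    query="SELECT id, total FROM orders WHERE total > 100",
)
# extract_to_s3(spark, opts, "s3://my-bucket/exports/orders/")
```

Passing a query instead of a table name is what lets the job pull only the required rows, which is the part DMS cannot do with arbitrary SQL.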
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time I don't care about transformations; this is a prototype, and I simply want to dump the DB to S3 to start testing the various toolchains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to troubleshoot this further; info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in the "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and that the include path lists the data you want to crawl, e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error, but I wouldn't categorize it as a syntax error. When I ran into this, it was because the "Include path" didn't include the default schema that SQL Server lovingly provides for you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
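The two cases above (no schema for MySQL, an explicit schema such as dbo for SQL Server) can be captured in a small helper. This is only an illustration of the path format described in the answers; the function name is made up.

```python
def crawler_include_path(database, schema=None, table="%"):
    """Build a Glue crawler include path for a JDBC source.

    Use schema=None for engines without schemas in the path (e.g. MySQL);
    table defaults to "%" so that all tables are matched.
    """
    parts = [database]
    if schema:
        parts.append(schema)
    parts.append(table)
    return "/".join(parts)


mysql_path = crawler_include_path("mydatabase")                             # all tables
sqlserver_path = crawler_include_path("database_name", "dbo", "table_name")  # one table
```

Here mysql_path is mydatabase/% and sqlserver_path is database_name/dbo/table_name, matching the paths quoted above.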
Apologies if this has been answered elsewhere (but I don't think it has). I'm trying to use AWS Glue as an external metastore for Hive on an EMR cluster.
I have some data stored as text files on S3, and I created a table definition over those text files in the AWS Glue web console.
I also started up an EMR cluster following the directions here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
When I ssh into my EMR cluster and start Hive, I expect the table I created in AWS Glue to appear when I run "show tables". Instead, I get the following error message when starting the interactive Hive shell:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: Please migrate your Catalog to enable access to this database (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: e6b2a87b-fe5a-11e8-8ed4-5d1e42734679))
It seems like there's some permission error involved here. I'm using EMR_EC2_DefaultRole for my EC2 Instance Profile, so I didn't think this would happen.
Am I missing something obvious?
Thanks for any help you can provide!
Attach AWS Glue and S3 full access to your current IAM role. That should do it.
To solve this issue, you will have to migrate your existing Athena catalog to the Glue Data Catalog as explained here.
To confirm that your Athena catalog has been migrated, run the following command using the AWS CLI:
aws glue get-catalog-import-status --catalog-id <aws-account-id> --region <region>
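If you would rather check this from a script, the same call is available through boto3. The sketch below assumes boto3 is installed and credentials are configured; the helper names are made up, and the check relies on the ImportCompleted flag in the ImportStatus part of the response.

```python
def catalog_is_migrated(response):
    """Return True if a get-catalog-import-status response reports a completed import."""
    return bool(response.get("ImportStatus", {}).get("ImportCompleted", False))


def fetch_import_status(account_id, region):
    """Call Glue for the import status (needs boto3 and valid AWS credentials)."""
    import boto3  # imported here so the pure helper above works without boto3

    glue = boto3.client("glue", region_name=region)
    return glue.get_catalog_import_status(CatalogId=account_id)


# Example with a hand-written response of the same shape as the CLI output:
sample = {"ImportStatus": {"ImportCompleted": True}}
migrated = catalog_is_migrated(sample)
```

In a real check you would pass the result of fetch_import_status with your own account ID and region to catalog_is_migrated.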
I faced exactly the same issue recently and was able to get past it by upgrading the Athena catalog to the Glue Data Catalog.
I am not using Athena or Redshift Spectrum to query the table either, but the Glue and Athena consoles showed this message:
To use the AWS Glue Data Catalog as a common metadata repository for Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, you need to upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum.
Soon after upgrading, I was able to query the table through the Hive and Spark shells on the EMR cluster.
This might be because your default database and other databases were created earlier via Athena and need to be upgraded to Glue before they can be used from EMR; default is the database Hive uses when you start it on EMR. To run queries, use DBName.tableName or explicitly switch to the database that contains your table.
I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface:
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour when using a database created using an S3 file. Maybe this is related to the error. Does anybody have an idea?
I had the same problem. This is the answer I got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database but the tables are not visible in Athena.
The Athena service is designed to query tables that point to S3 as the data source. It cannot read data from non-S3 resources as of today.
So unfortunately this is not possible at the moment.