AWS Athena JDBC Driver ResultSet.getCharacterStream method not implemented

I have an Athena/PrestoDB query that returns up to 300 million device IDs. The screenshot shows the query executed in the AWS console: the results were displayed in under a minute, and I downloaded the full results (319 MB) in a few minutes from the link provided in the UI.
When I execute the same query over a JDBC connection I receive a method-not-implemented error. It appears the AthenaJDBC41-1.0.0.jar from the AWS docs has not implemented getCharacterStream yet.
ActiveRecord::StatementInvalid: ActiveRecord::JDBCError: com.amazonaws.athena.jdbc.NotImplementedException: Method ResultSet.getCharacterStream is not yet implemented: SELECT distinct(device_id) FROM presales.sightings_v3 WHERE DATE(date) BETWEEN DATE('2016-03-01') AND DATE('2016-03-02') AND ( contains(audiences, 1133) OR contains(audiences, 1149) OR contains(audiences, 1184) );
I'm using driver AthenaJDBC41-1.0.0.jar from the AWS docs and my example connection can be seen here.
My guess is that the method ResultSet.getCharacterStream is only used with large results since my other queries work fine.
Ideally I would like the response to contain the query ID or S3 path rather than streaming the full results. I'm also curious how the Athena UI generates a link to the results on S3.

You can get the query ID from the ResultSet:
((AthenaStatementClient)((AthenaResultSet)rs).getClient()).getQueryExecutionId()
With that you can build the S3 path:
<s3_staging_dir>/<query_id>.csv
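If you can make a side call outside JDBC, the GetQueryExecution API also returns the exact output location, which appears to be what the console's download link points at. A minimal sketch with Python's boto3; the execution ID placeholder is hypothetical:

    import boto3

    athena = boto3.client('athena')

    # Hypothetical ID: use the value obtained from the JDBC call above.
    query_execution_id = '12345678-0000-0000-0000-000000000000'

    # The response carries the S3 path of the results, i.e. <s3_staging_dir>/<query_id>.csv
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    print(execution['QueryExecution']['ResultConfiguration']['OutputLocation'])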

Related

AWS Glue Crawler and JDBCConnection: "Expected string length >= 1, but found 0 for params.Targets.JdbcTargets[0].customJdbcDriverClassName"

I am trying to set up an AWS Glue Crawler using a JDBC connection in order to populate my AWS Glue Data Catalog databases.
I already have a connection which passes the test, but when I submit my crawler creation I get this error: "Expected string length >= 1, but found 0 for params.Targets.JdbcTargets[0].customJdbcDriverClassName", as you can see in the first screenshot.
The only clue I have for now is that there is no Class Name attached to my connection; however, I cannot edit it while editing the connection.
Does this ring a bell for anyone?
Thanks a lot.
I've also had this issue, and even tried using the aws-cli to create/update my connection in order to manually input the required parameter.
It turns out this is an AWS UI issue caused by a recent update. According to this post you can create it using the Legacy console for now (in the sidebar there is a Legacy section where you can find the Legacy pages). I just tried it on my end and it worked =)
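Since the empty customJdbcDriverClassName value appears to be injected by the console form itself, creating the crawler programmatically may also sidestep the problem. A sketch with boto3; all resource names are hypothetical placeholders:

    import boto3

    glue = boto3.client('glue')

    # All names below are placeholders -- substitute your own resources.
    glue.create_crawler(
        Name='my-jdbc-crawler',
        Role='arn:aws:iam::123456789012:role/MyGlueCrawlerRole',
        DatabaseName='my_catalog_db',
        Targets={
            'JdbcTargets': [
                {
                    'ConnectionName': 'my-jdbc-connection',  # the connection that already passes the test
                    'Path': 'mydb/%',  # include path: database/schema/table pattern
                }
            ]
        },
    )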

Why do I get ThrottlingException - Rate Exceeded status:400 when making AWS Athena API call from API server?

We have an S3 data lake in AWS (with Lake Formation, Glue, etc.). The end goal is to query the S3 data sources using SQL in Athena.
When making the query in the AWS Athena console, everything works fine and results are provided; see screenshot: https://share.getcloudapp.com/NQuNBr5g
When making the query through the official API application domain (a Symfony 5 RESTful API that uses the aws-sdk-php vendor), the query doesn't even get to Athena; the error returned is 400: https://share.getcloudapp.com/xQuqQLrq
In the CloudTrail events I can only see errorcode=ThrottlingException and errormessage='Rate exceeded'; there is no query execution ID.
The weird thing I don't get is that when making the same call from my localhost setup of the API app, the call succeeds: https://share.getcloudapp.com/jkuv8ZGy
The call made is StartQueryExecution on the Athena API; the error as shown on the API app's side:
Error executing \"GetQueryExecution\" on \"https://athena.us-west-2.amazonaws.com\"; AWS HTTP error: Client error: `POST https://athena.us-west-2.amazonaws.com` resulted in a `400 Bad Request` response:\n{\"__type\":\"ThrottlingException\",\"message\":\"Rate exceeded\"}\n ThrottlingException (client): Rate exceeded - {\"__type\":\"ThrottlingException\",\"message\":\"Rate exceeded\"}", "class": "Aws\\Athena\\Exception\\AthenaException"
The API app server and the data lake etc. are in the same VPC, and I created a VPC endpoint from the server's VPC to the Athena us-west-2 endpoint, but it didn't help. I don't think it's an Athena quota issue, since on localhost the query works just fine. Any insight would be very helpful, thank you!
The solution was a combination of actions. Athena just doesn't work like that: it's not reasonable to expect data from an Athena query over an S3 data lake as if you were querying a relational database. What helped get results consistently and avoid this error was:
Update the PHP SDK AthenaClient constructor to also pass a retry config:
$athenaClient = new Aws\Athena\AthenaClient([
    // ... other AthenaClient constructor params ...
    'retries' => [
        'mode' => 'standard',
        'max_attempts' => 3,
    ],
]);
Athena and other elastic services (e.g. DynamoDB) work asynchronously: you issue the query, but the result is not delivered synchronously. For example, in my early tests I always received the initial ThrottlingException, yet in the Athena query console the result of that exact same query arrived slightly later, successfully. The PHP SDK for AWS seems designed with this in mind, and retries with exponential backoff are also what AWS recommends: https://docs.aws.amazon.com/general/latest/gr/api-retries.html (a boto3 equivalent is sketched after this list).
Partition your data, and in a way relevant to your queries, so that each query scans as little data as possible; this gives more consistent and faster results: https://docs.aws.amazon.com/athena/latest/ug/partitions.html (either on the Glue table directly, or via a Glue ETL job where the partitioning keys are specified). If your Athena query filters on country={country}, a good partitioning scheme is per country.
Avoid SELECT *: always name exactly the columns you need, and add a LIMIT. Queries over Athena should be relatively simple SELECTs; if you need joins or other more complex query types, Redshift is better suited for that.
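For comparison, here is what the same retry configuration could look like in Python with boto3; a minimal sketch, assuming the standard retry mode available in recent botocore versions:

    import boto3
    from botocore.config import Config

    # 'standard' mode retries throttled calls with exponential backoff and jitter,
    # matching the behaviour configured for the PHP client above.
    retry_config = Config(retries={'mode': 'standard', 'max_attempts': 3})

    athena = boto3.client('athena', config=retry_config)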

Daily AWS Lambda not creating Athena partition, however command runs successfully

I have an Athena database set up pointing at an S3 bucket containing ALB logs, and it all works correctly. I partition the table by a column called datetime, with the idea that it follows the format YYYY/MM/DD.
I can manually create partitions through the Athena console, using the following command:
ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'
I have created a Lambda that runs daily to create a new partition; however, this doesn't seem to work. I use the boto3 Python client and execute the following:

import boto3

athena = boto3.client('athena')

result = athena.start_query_execution(
    QueryString="ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'",
    QueryExecutionContext={
        'Database': 'web'
    },
    ResultConfiguration={
        'OutputLocation': 's3://aws-athena-query-results-093305704519-eu-west-1/Unsaved/'
    }
)
This appears to run successfully without any errors, and the query execution even returns a QueryExecutionId as it should. However, if I run SHOW PARTITIONS web.alb_logs; via the Athena console, the partition has not been created.
I have a feeling it could be down to permissions; however, I have given the Lambda execution role full permissions to all resources on S3 and full permissions to all resources on Athena, and it still doesn't seem to work.
Since Athena query execution is asynchronous, your Lambda function never sees the result of the query execution; it just gets the result of starting the query.
I would be very surprised if this wasn't a permissions issue, but because of the above the error will not appear in the Lambda logs. What you can do is log the query execution ID and look it up with the GetQueryExecution API call to see whether the query succeeded.
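A minimal sketch of that pattern, reusing the result dict from the question's code (polling is reasonable here because ALTER TABLE statements usually finish in seconds):

    import time
    import boto3

    athena = boto3.client('athena')

    def wait_for_query(query_execution_id):
        # Poll GetQueryExecution until Athena reports a terminal state.
        while True:
            execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
            status = execution['QueryExecution']['Status']
            if status['State'] in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
                # On failure, StateChangeReason carries the actual error message
                # (e.g. an S3 or Glue permissions problem) that otherwise never
                # shows up in the Lambda logs.
                return status['State'], status.get('StateChangeReason')
            time.sleep(1)

    state, reason = wait_for_query(result['QueryExecutionId'])
    print(state, reason)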
Even better would be to rewrite your code to use the Glue APIs directly to add the partitions. Adding a partition is a quick and synchronous operation in Glue, which means you can make the API call and get a status in the same Lambda execution. Have a look at the APIs for working with partitions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html
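A sketch of the Glue route with boto3, using the table and location from the question; copying the table's StorageDescriptor is an assumption about how the partition should be configured (in practice you would adjust it as needed):

    import boto3

    glue = boto3.client('glue')

    # Reuse the table's storage descriptor so the partition inherits its format settings.
    table = glue.get_table(DatabaseName='web', Name='alb_logs')['Table']
    descriptor = dict(table['StorageDescriptor'])
    descriptor['Location'] = 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'

    # Synchronous: if this call returns without raising, the partition exists.
    # (It raises AlreadyExistsException if the partition is already there.)
    glue.create_partition(
        DatabaseName='web',
        TableName='alb_logs',
        PartitionInput={
            'Values': ['2019-08-01'],  # one value per partition key (datetime)
            'StorageDescriptor': descriptor,
        },
    )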

AWS EMR - Hive creating new table in S3 results in AmazonS3Exception: Bad Request

I have a Hive script I'm running in EMR that is creating a partitioned Parquet table in S3 from a ~40GB gzipped CSV file also stored in S3.
The script runs fine for about 4 hours but then reaches a point (I'm pretty sure when it is just about done creating the Parquet table) where it errors out. The logs show that the error is:
HiveException: Hive Runtime Error while processing row
caused by:
AmazonS3Exception: Bad Request
There really isn't any more useful information in the logs that I can see. It reads the CSV file from S3 fine, and it creates a couple of metadata files in S3 fine as well, so I've confirmed the instance has read/write permissions to the bucket.
I really can't think of anything else that's going on, and I wish there was more info in the logs about the "Bad Request" Hive is making to S3. Anyone have any ideas?
Bad Request is a fairly meaningless response from AWS, which it sends if there is any reason it doesn't like the caller; nobody really knows what's happening.
The troubleshooting docs for the ASF S3A connector list some causes, but they aren't complete, and they're based on guesswork about what made the message go away.
If you have the request ID that failed, you can submit a support request for Amazon to see what they saw on their side.
If it makes you feel any better, I'm seeing it when I try to list exactly one directory in an object store, and I'm a co-author of the S3A connector. Like I said: guesswork. Once you find out, add a comment here or, if it's not in the troubleshooting doc, submit a patch to Hadoop on the topic.

Athena Write Performance to AWS S3

I'm executing a query in AWS Athena and writing the results to S3. It seems to take a long time (way too long, in fact) for the file to become available when I execute the query from a Lambda script.
I'm scanning 70 MB of data, and the file returned is 12 MB. I execute this from a Lambda script like so:
import boto3

athena_client = boto3.client('athena')
athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration={
        'OutputLocation': 'location_on_s3',
        # EncryptionConfiguration takes a dict, not a bare string:
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'},
    }
)
If I run the query directly in Athena it takes 2.97 seconds. However, when I run it from the Lambda script, the file only seems to be available after about 2 minutes.
Does anyone know the write performance of AWS Athena to AWS S3? I would like to know if this is normal; the docs don't state how quickly the write occurs.
Every query in Athena writes to S3.
If you check the History tab on the Athena page in the console you'll see a history of all queries you've run (not just through the console, but generally). Each of those has a link to a download path.
If you click the Settings button a dialog will open asking you to specify an output location. Check that location and you'll find all your query results there.
Why is this taking so much longer from your Lambda script? I'm guessing, but the only suggestion I have is that you're querying across regions: if your data is in one region and your result location is in another, you might experience slowness due to cross-region transfer. Even so, 12 MB should be fast.
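If the two minutes are spent noticing the file rather than Athena writing it, it may help to wait on the result object explicitly instead of checking on a schedule. A sketch with boto3's built-in S3 waiter; the bucket, prefix, and execution ID are hypothetical, and result files land at <OutputLocation>/<QueryExecutionId>.csv:

    import boto3

    s3 = boto3.client('s3')

    # Hypothetical values -- adjust to your own output location and query.
    bucket = 'my-athena-results-bucket'
    query_execution_id = '12345678-0000-0000-0000-000000000000'  # from start_query_execution
    key = f'results/{query_execution_id}.csv'

    # Blocks until the object exists (polls HeadObject under the hood).
    s3.get_waiter('object_exists').wait(Bucket=bucket, Key=key)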