Why do I get ThrottlingException - Rate Exceeded status:400 when making AWS Athena API call from API server? - amazon-web-services

We have an S3 data lake in AWS (with Lake Formation, Glue etc.) The end goal is to query the S3 data sources using SQL in Athena.
When making the query in the AWS Athena console, everything works fine and results are provided; see screenshot: https://share.getcloudapp.com/NQuNBr5g
When making the query through the official API from the application domain (a Symfony 5 RESTful API that uses the aws-sdk-php vendor), the query doesn't even reach Athena; the error returned is 400: https://share.getcloudapp.com/xQuqQLrq
In the CloudTrail events I can only see errorCode=ThrottlingException and errorMessage='Rate exceeded'; there is no query execution id.
The weird thing I don't get is that when making the same call from my localhost setup of the API app, the call again succeeds: https://share.getcloudapp.com/jkuv8ZGy
The call made is StartQueryExecution on the Athena API; the error as shown on the API app's side:
Error executing "GetQueryExecution" on "https://athena.us-west-2.amazonaws.com"; AWS HTTP error: Client error: `POST https://athena.us-west-2.amazonaws.com` resulted in a `400 Bad Request` response: {"__type":"ThrottlingException","message":"Rate exceeded"} ThrottlingException (client): Rate exceeded - {"__type":"ThrottlingException","message":"Rate exceeded"}, "class": "Aws\Athena\Exception\AthenaException"
The API app server and the data lake etc. are in the same VPC, and I created a VPC endpoint from the server's VPC to the Athena us-west-2 endpoint, but it didn't help. I don't think it's an Athena quota issue, since on localhost the query works just fine. Any insight would be very helpful, thank you!

The solution was a combination of actions. Athena just doesn't work like that: it's not reasonable to expect data from an Athena query over an S3 data lake as if you were querying a relational database. What helped get results consistently and avoid this error was:
Update the PHP SDK AthenaClient constructor to also pass a retry configuration, e.g.:
use Aws\Athena\AthenaClient;

$athenaClient = new AthenaClient([
    'region' => 'us-west-2',
    // ... other AthenaClient constructor params ...
    'retries' => [
        'mode' => 'standard',
        'max_attempts' => 3,
    ],
]);
Athena and other elastic services (e.g. DynamoDB) work asynchronously: you issue the query, but the result is not delivered synchronously. As an example, in my early tests I would always receive the initial ThrottlingException, yet in the Athena query console the result of that exact same query arrived slightly later, successfully. The PHP SDK for AWS appears to be built with this in mind, and retries with exponential backoff are also what AWS recommends: https://docs.aws.amazon.com/general/latest/gr/api-retries.html (see the sketch after this list for the equivalent start-then-poll pattern).
Partition your data, and in a relevant way, so that each query scans as little data as possible; this gives more consistent and faster results - https://docs.aws.amazon.com/athena/latest/ug/partitions.html. Partitioning can be defined either on the Glue table directly or via a Glue ETL job where the partitioning keys are specified. If your Athena query filters on country={country}, a good partitioning scheme is per country.
Avoid SELECT * - always name exactly the columns you need, and add a LIMIT. Queries over Athena should be relatively simple SELECT queries; if you need joins or other more complex query types, Redshift is better suited for that.
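For reference, here is a minimal sketch of the same start-then-poll pattern using boto3 in Python (the database and output-location names are placeholders, not taken from the question); the flow with aws-sdk-php is analogous:

import time
import boto3
from botocore.config import Config

# Standard retry mode with backoff, mirroring the PHP 'retries' config above.
athena = boto3.client("athena", config=Config(retries={"mode": "standard", "max_attempts": 3}))

def run_query(sql):
    # Start the query; Athena only returns an execution id here, not results.
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_datalake_db"},               # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )["QueryExecutionId"]

    # Poll until the query finishes; results are written to S3 asynchronously.
    while True:
        state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError("Query %s ended in state %s" % (execution_id, state))

    # Fetch the first page of results once the query has succeeded.
    return athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]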

Related

Need recommendation to create an API by aggregating data from multiple source APIs

Before I start doing this I wanted to get advice from the community on the best and most efficient manner to go about doing it.
Here is what I want to do:
Ingest data from multiple APIs that return JSON
Store it in either S3 or DynamoDB
Modify the data to use my JSON structure
Pipe out the aggregate data as an API
The data will be updated twice a day, so I would pull in the data from the source APIs and put it through my pipeline twice a day.
So basically I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
#https://stackoverflow.com/a/41765656
import requests
import json

def lambda_handler(event, context):
    #https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
    #https://stackoverflow.com/a/65896274
    response = requests.get("https://remoteok.com/api")
    #print(response.json())
    return {
        'statusCode': 200,
        'body': response.json()
    }

#https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB
This works and returns a JSON response.
Here are my questions:
Should I store the data on S3 or DynamoDB?
Which AWS service should I use to aggregate the data into my JSON structure?
Which service should I use to publish the aggregate data as an API, API Gateway?
However, before I go further I would like to know what is the best way to go about doing this.
If you have experience with this I would love to hear from you.
The answer will vary depending on the quantity of data you're planning to mine. Lambdas are designed for short-duration, high-frequency workloads and thus might not be suitable.
I would recommend looking into AWS Glue, as this seems like a fairly typical ETL (Extract, Transform, Load) problem. You can set up Glue jobs to run on a schedule, and as for data aggregation, that's the T in ETL.
It's simple to output the Glue data frame (the result of a transformation) as S3 files, which can then be queried directly by Amazon Athena (as if they were database tables).
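As a rough illustration, a minimal sketch of such a Glue job in Python (PySpark); the catalog database/table names, the field mapping and the output bucket are all hypothetical:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON previously landed in S3 (hypothetical catalog table).
raw = glue_context.create_dynamic_frame.from_catalog(database="raw_api_data", table_name="remoteok_jobs")

# Reshape/aggregate into the target JSON structure (hypothetical field mapping).
shaped = raw.apply_mapping([
    ("id", "string", "job_id", "string"),
    ("position", "string", "title", "string"),
])

# Write the result back to S3, where Athena can query it directly.
glue_context.write_dynamic_frame.from_options(
    frame=shaped,
    connection_type="s3",
    connection_options={"path": "s3://my-aggregated-data/jobs/"},
    format="json",
)

job.commit()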
As for exposing that data via an API, the serverless framework or SST are great tools for taking the sting out of spinning up a serverless API and associated resources.

Daily AWS Lambda not creating Athena partition, however command runs successfully

I have an Athena database set up pointing at an S3 bucket containing ALB logs, and it all works correctly. I partition the table by a column called datetime and the idea is that it has the format YYYY/MM/DD.
I can manually create partitions through the Athena console, using the following command:
ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'
I have created a Lambda to run daily to create a new partition, however this doesn't seem to work. I use the boto3 Python client and execute the following:
result = athena.start_query_execution(
    QueryString="ALTER TABLE alb_logs ADD IF NOT EXISTS PARTITION (datetime='2019-08-01') LOCATION 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'",
    QueryExecutionContext={
        'Database': 'web'
    },
    ResultConfiguration={
        "OutputLocation": "s3://aws-athena-query-results-093305704519-eu-west-1/Unsaved/"
    }
)
This appears to run successfully without any errors and the query execution even returns a QueryExecutionId as it should. However if I run SHOW PARTITIONS web.alb_logs; via the Athena console it hasn't created the partition.
I have a feeling it could be down to permissions, however I have given the lambda execution role full permissions to all resources on S3 and full permissions to all resources on Athena and it still doesn't seem to work.
Since Athena query execution is asynchronous, your Lambda function never sees the result of the query execution; it just gets the result of starting the query.
I would be very surprised if this wasn't a permissions issue, but because of the above the error will not appear in the Lambda logs. What you can do is to log the query execution ID and look it up with the GetQueryExecution API call to see that the query succeeded.
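A minimal sketch of that check with boto3, reusing the athena client and the result variable from the code in the question:

execution_id = result['QueryExecutionId']
status = athena.get_query_execution(QueryExecutionId=execution_id)['QueryExecution']['Status']

# If the ALTER TABLE failed (e.g. for permission reasons), the state will be
# FAILED and the state change reason will say why.
print(status['State'], status.get('StateChangeReason'))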
Even better would be to rewrite your code to use the Glue APIs directly to add the partitions. Adding a partition is a quick and synchronous operation in Glue, which means you can make the API call and get a status in the same Lambda execution. Have a look at the APIs for working with partitions: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-partitions.html
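For illustration, a sketch of adding the same partition through the Glue API; it copies the table's storage descriptor and only swaps the location, using the database, table and S3 location from the question:

import copy
import boto3

glue = boto3.client('glue')

database, table = 'web', 'alb_logs'
partition_value = '2019-08-01'
location = 's3://mybucket/AWSLogs/myaccountid/elasticloadbalancing/eu-west-1/2019/08/01/'

# Reuse the table's storage descriptor so the partition inherits the same columns/SerDe.
table_def = glue.get_table(DatabaseName=database, Name=table)['Table']
storage_descriptor = copy.deepcopy(table_def['StorageDescriptor'])
storage_descriptor['Location'] = location

# Synchronous call: success or an error comes back within the same Lambda invocation.
# (If the partition already exists, Glue returns an already-exists error.)
glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        'Values': [partition_value],
        'StorageDescriptor': storage_descriptor,
    },
)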

ELK stack (Elasticsearch, Logstash, Kibana) - is logstash a necessary component?

We're currently processing daily mobile app log data with AWS Lambda and posting it into Redshift. The Lambda structures the data but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc., for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using Quicksight and/or Kibana to create a reporting dashboard. But the typical stack as I understand it is to do the log processing with Logstash, then have it go to Elasticsearch and finally to Kibana/Quicksight. Since we're already handling the initial log processing through Lambda, is it possible to skip this step and pass it directly into Elasticsearch? If so, where does this happen - in the Lambda function, or from Redshift after it has been stored in a table? Or can Elasticsearch just read it from the same S3 where I'm posting the data for ingestion into a Redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
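As a rough sketch in Python (the endpoint and index name are placeholders): each document is preceded by an action line, and the NDJSON body must end with a newline:

import json
import requests

ES_ENDPOINT = "https://my-es-domain.us-west-2.es.amazonaws.com"  # placeholder

def bulk_index(docs, index="app-logs"):
    # NDJSON body: one action line followed by the document source, per doc.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"

    response = requests.post(
        ES_ENDPOINT + "/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    response.raise_for_status()
    return response.json()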
Also: you don't say whether you're using an AWS Elasticsearch cluster or self-managed. If the former, you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.
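For example, a sketch of signing that bulk request with aws-requests-auth, assuming the Lambda's execution role has been granted access to the domain (the host name is a placeholder):

import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

ES_HOST = "my-es-domain.us-west-2.es.amazonaws.com"  # placeholder, no scheme

# Picks up credentials from the Lambda execution role via boto3.
auth = BotoAWSRequestsAuth(
    aws_host=ES_HOST,
    aws_region="us-west-2",
    aws_service="es",
)

response = requests.post(
    "https://" + ES_HOST + "/_bulk",
    data=body,  # the NDJSON body built as in the previous sketch
    headers={"Content-Type": "application/x-ndjson"},
    auth=auth,
)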

Athena Write Performance to AWS S3

I'm executing a query in AWS Athena and writing the results to s3. It seems like it's taking a long time (way too long in fact) for the file to be available when I execute the query from a lambda script.
I'm scanning 70MB of data, and the file returned is 12MB. I execute this from a lambda script like so:
import boto3

athena_client = boto3.client('athena')
athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration={
        'OutputLocation': 'location_on_s3',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'},
    }
)
If I run the query directly in Athena it takes 2.97 seconds to run. However it looks like the file is available after 2 minutes if I run this query from the lambda script.
Does anyone know the write performance of AWS Athena to AWS S3? I would like to know if this is normal. The docs don't state how quickly the write occurs.
Every query in Athena writes to S3.
If you check the History tab on the Athena page in the console you'll see a history of all queries you've run (not just through the console, but generally). Each of those has a link to a download path.
If you click the Settings button a dialog will open asking you to specify an output location. Check that location and you'll find all your query results there.
Why is this taking so much longer from your Lambda script? I'm guessing, but the only possible suggestion I have is that you're querying across regions - if your data is in one region and your result location is in another region you might experience slowness due to transfer cost. Even so, 12MB should be fast.

AWS Athena JDBC Driver ResultSet.getCharacterStream method not implemented

I have an Athena/PrestoDB query that returns up to 300 million device ids. This screen shot shows the query when executed in the AWS UI. The results were displayed in under 1 minute and I downloaded the full results (319MB) in a few minutes from the link provided in the UI.
When I execute the same query over the JDBC connection I'm receiving a method not implemented error. It appears the AthenaJDBC41-1.0.0.jar from the AWS docs has not implemented the getCharacterStream yet.
ActiveRecord::StatementInvalid: ActiveRecord::JDBCError: com.amazonaws.athena.jdbc.NotImplementedException: Method ResultSet.getCharacterStream is not yet implemented: SELECT distinct(device_id) FROM presales.sightings_v3 WHERE DATE(date) BETWEEN DATE('2016-03-01') AND DATE('2016-03-02') AND ( contains(audiences, 1133) OR contains(audiences, 1149) OR contains(audiences, 1184) );
I'm using driver AthenaJDBC41-1.0.0.jar from the AWS docs and my example connection can be seen here.
My guess is that the method ResultSet.getCharacterStream is only used with large results since my other queries work fine.
Ideally I would like this response to contain the query_id or S3 Path vs streaming the big data results. I'm curious how the Athena UI generates a link to the results on S3?
You can get the query id from the ResultSet:
((AthenaStatementClient)((AthenaResultSet)rs).getClient()).getQueryExecutionId()
With that you can build the S3 path:
<s3_staging_dir>/<query_id>.csv
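Alternatively, a sketch of looking up the same information server-side with boto3 (the execution id below is a placeholder, e.g. the one obtained from the JDBC ResultSet above):

import boto3

athena = boto3.client('athena')

# Placeholder id, e.g. taken from the JDBC ResultSet as shown above.
execution = athena.get_query_execution(QueryExecutionId='00000000-0000-0000-0000-000000000000')

# S3 path of the CSV Athena wrote, i.e. <s3_staging_dir>/<query_id>.csv
print(execution['QueryExecution']['ResultConfiguration']['OutputLocation'])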