I am uploading a CSV that has 15,000 items to DynamoDB via the 'Import from S3' button on the DynamoDB console. However, when it finishes and I do a 'Get live item count', it says 900 when it should be 15,000. Does anyone know why it is not adding everything to my new table?
You probably had a format error. Look in CloudWatch Logs for the details.
https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/ says:
"During the import, DynamoDB might encounter errors while parsing your data. For each error, DynamoDB creates a log entry in Amazon CloudWatch Logs and keeps a count of the total number of errors encountered."
Related
I am currently facing an issue in a project where the S3 buckets contain on average 50 tables, and after running the Glue job I see the following error. I don't think it is an issue with memory or the worker nodes.
{
"Error":"States.TaskFailed",
"Cause":"{\"AllocatedCapacity\":5,\"Arguments\":{\"--quotes_col_list\":\"Null\",\"--processed_prefix\":\"processed/cat2/uber/\",\"--replicated_prefix\":\"replicated/cat2/uber/\",\"--table_folder\":\"SALES_ORDER_DOCUMENT_TYPE/\",\"--devops_prefix\":\"uber_processing/glue_configuration/rename_glue_file/replicated/uber/\",\"--tablename\":\"sales_order_document_type\",\"--companies\":\"uber\",\"--metadata_path\":\"cat2/cat2_metadata.csv\",\"--reject_prefix\":\"reject/cat2/uber/\"},\"Attempt\":0,\"CompletedOn\":1641759367801,\"ErrorMessage\":\"TooManyRequestsException: An error occurred (TooManyRequestsException) when calling the StartQueryExecution operation: You have exceeded the limit for the number of queries you can run concurrently. Please reduce the number of concurrent queries submitted by this account. Contact customer support to request a concurrent query limit increase.\",\"ExecutionTime\":51,\"GlueVersion\":\"2.0\",\"Id\":\"jr_b8haonpeno503no0n3020
\",\"JobName\":\"uber_job\",\"JobRunState\":\"FAILED\",\"LastModifiedOn\":1641759367801,\"LogGroupName\":\"/aws-glue/jobs\",\"MaxCapacity\":5.0,\"NumberOfWorkers\":5,\"PredecessorRuns\":[],\"StartedOn\":1641759312689,\"Timeout\":2880,\"WorkerType\":\"G.1X\"}"
}
When I checked the query function, it doesn't show any query running in the Glue job.
import boto3

# 'args' holds the Glue job arguments (e.g. from getResolvedOptions).
athena_client = boto3.client('athena')
response = athena_client.start_query_execution(
    QueryString='msck repair table ' + args['audit_table'],
    ResultConfiguration={
        'OutputLocation': args['athena_resultpath']
    }
)
Can someone help me with QueryString='msck repair table ' + args['audit_table']? What is the argument here?
You mentioned the word "concurrency" but didn't mention exactly what the error message is:
"ErrorMessage":"TooManyRequestsException: An error occurred (TooManyRequestsException) when calling the StartQueryExecution operation: You have exceeded the limit for the number of queries you can run concurrently"
Athena has some built-in soft limits; the docs also mention:
A DML or DDL query quota includes both running and queued queries. For example, if you are using the default DML quota and your total of running and queued queries exceeds 25, query 26 will result in a TooManyRequestsException error.
You are simply going over the limit, so your query fails; specifically the "DML query quota", I'm assuming. These soft limits are somewhat flexible and can be increased by submitting a request via the Service Quotas console.
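Until a quota increase is in place, the usual workaround is to catch the throttle and retry with backoff. A minimal sketch (the function name and arguments are illustrative, not taken from your job):

import time
import boto3

athena_client = boto3.client('athena')

def start_query_with_retry(query_string, output_location, max_attempts=5):
    # Retry StartQueryExecution with exponential backoff when Athena throttles us.
    for attempt in range(max_attempts):
        try:
            return athena_client.start_query_execution(
                QueryString=query_string,
                ResultConfiguration={'OutputLocation': output_location},
            )
        except athena_client.exceptions.TooManyRequestsException:
            if attempt == max_attempts - 1:
                raise
            # Wait 2, 4, 8, ... seconds before the next attempt.
            time.sleep(2 ** (attempt + 1))

If many Glue jobs fire MSCK REPAIR TABLE at the same time, spacing them out (or batching the repairs) reduces how often you hit the concurrency ceiling in the first place.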
I am using GCP and want to create an alert after not seeing a certain pattern in the output logs of a process.
As an example, my CLI process will output "YYYY-MM-DD HH:MM:SS Successfully checked X" every second.
I want to know when this fails (indicated by no log output). I am collecting logs using the normal GCP log collector.
Can this be done?
I am creating the alerts via the UI at:
https://console.cloud.google.com/monitoring/alerting/policies/create
You can create an alert based on a log-based metric. For that, create a log-based metric in Cloud Logging with the log filter that you want.
Then create an alert: aggregate the metric per minute and trigger when the value is below 60.
You won't get an alert for each missing message, but at one-minute granularity you will get an alert whenever the expected count isn't reached.
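As a concrete illustration (the resource type and match text below are placeholders for your setup), the log-based metric's filter could look like this; the alert policy then uses a one-minute alignment period with a count/sum aligner and a condition of "is below 60":

resource.type="gce_instance"
textPayload:"Successfully checked"

With that in place, a silent process produces a metric value of 0 for the minute, which trips the "below 60" condition.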
I have a Hive script I'm running in EMR that is creating a partitioned Parquet table in S3 from a ~40GB gzipped CSV file also stored in S3.
The script runs fine for about 4 hours but reaches a point (pretty sure when it is just about done creating the Parquet table) where it errors out. The logs show that the error is:
HiveException: Hive Runtime Error while processing row
caused by:
AmazonS3Exception: Bad Request
There really isn't any more useful information in the logs that I can see. It reads the CSV file from S3 fine and creates a couple of metadata files in S3 as well, so I've confirmed the instance has read/write permissions on the bucket.
I really can't think of anything else that's going on, and I wish there were more info in the logs about what "Bad Request" Hive is making to S3. Does anyone have any ideas?
BadRequest is a fairly meaningless response from AWS, which it sends whenever it doesn't like the caller for some reason. Nobody really knows what's happening.
The troubleshooting docs for the ASF S3A connector list some causes, but they aren't complete, and they are based on guesswork about what made the message go away.
If you have the request ID that failed, you can submit a support request for Amazon to see what they saw on their side.
If it makes you feel any better, I'm seeing it when I try to list exactly one directory in an object store, and I'm a co-author of the S3A connector. Like I said: "guesswork". Once you find out, add a comment here or, if it's not in the troubleshooting doc, submit a patch to Hadoop on the topic.
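For what it's worth, one cause that troubleshooting doc does list for a 400 "Bad Request" is talking to a V4-signature-only region (Frankfurt, Seoul, etc.) without pinning the endpoint. If you are going through the s3a connector, checking that would look something like this in core-site.xml (the region endpoint below is only an example, substitute your bucket's region):

<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>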
I'm executing a query in AWS Athena and writing the results to s3. It seems like it's taking a long time (way too long in fact) for the file to be available when I execute the query from a lambda script.
I'm scanning 70MB of data, and the file returned is 12MB. I execute this from a lambda script like so:
import boto3

athena_client = boto3.client('athena')
athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration={
        'OutputLocation': 'location_on_s3',
        # EncryptionConfiguration is a nested dict, not a bare string.
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'},
    }
)
If I run the query directly in Athena it takes 2.97 seconds. However, it looks like the file is only available after about 2 minutes when I run the query from the Lambda script.
Does anyone know the write performance of AWS Athena to AWS S3? I would like to know if this is normal. The docs don't state how quickly the write occurs.
Every query in Athena writes to S3.
If you check the History tab on the Athena page in the console you'll see a history of all queries you've run (not just through the console, but generally). Each of those has a link to a download path.
If you click the Settings button a dialog will open asking you to specify an output location. Check that location and you'll find all your query results there.
Why is this taking so much longer from your Lambda script? I'm only guessing, but one possible suggestion is that you're querying across regions: if your data is in one region and your result location is in another, you might see slowness due to the cross-region transfer. Even so, 12MB should be fast.
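One more thing to rule out (an assumption on my part, since only the start_query_execution call is shown): that call returns as soon as the query is submitted, not when it finishes, so if the Lambda immediately goes looking for the output file, the query may simply still be running or queued. A sketch of polling for completion with get_query_execution:

import time
import boto3

athena_client = boto3.client('athena')

def run_query_and_wait(query_string, output_location):
    # Submit the query; this returns immediately with an execution id.
    execution = athena_client.start_query_execution(
        QueryString=query_string,
        ResultConfiguration={'OutputLocation': output_location},
    )
    query_id = execution['QueryExecutionId']

    # Poll until Athena reports a terminal state.
    while True:
        info = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']
        state = info['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            # Athena reports where the result file was written.
            return state, info['ResultConfiguration']['OutputLocation']
        time.sleep(1)

If the state flips to SUCCEEDED after a few seconds but the file still takes minutes to show up, then something else (e.g. cross-region setup) is worth investigating.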
I've started to play with DynamoDB and I've created a "dynamo-test" table with a hash PK on userid and a couple more attributes (age, name). Read and write capacity is set to 5. I use Lambda and API Gateway with Node.js. Then I manually performed several API calls through API Gateway using a payload similar to this:
{
  "userId": "222",
  "name": "Test",
  "age": 34
}
I've tried to insert the same item a couple of times (which didn't produce an error but silently succeeded). Also, I used the DynamoDB console and browsed the inserted items several times (currently there are only 2). I haven't tracked exactly how many times I did those actions, but it was all done manually. And then, after an hour, I noticed 2 alarms in CloudWatch:
INSUFFICIENT_DATA
dynamo-test-ReadCapacityUnitsLimit-BasicAlarm
ConsumedReadCapacityUnits >= 240 for 12 minutes
No notifications
And there is a similar alarm with "...WriteCapacityLimit...". The write capacity alarm became OK after 2 minutes, but then went back again after 10 minutes. Anyway, I'm still reading and learning how to plan and monitor these capacities, but this hello-world example scared me a bit, as if I'd already exceeded my table's capacity :) Please point me in the right direction if I'm missing some fundamental part!
It's just an "INSUFFICIENT_DATA" message. It means that your table hasn't had any reads or writes in a while, so there is insufficient data available for the CloudWatch metric. This happens with the CloudWatch alarms for any DynamoDB table that isn't used very often. Nothing to worry about.
EDIT: You can now change a setting in CloudWatch alarms to ignore missing data, which will leave the alarm at its previous state instead of changing it to the "INSUFFICIENT_DATA" state.
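For example (the metric and threshold values below just mirror the alarm text above, and 'ignore' vs 'notBreaching' is your choice), the setting is the alarm's TreatMissingData field, which you can also set from code when creating or updating the alarm:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Update the alarm so that periods with no data keep (or don't break) the alarm state.
cloudwatch.put_metric_alarm(
    AlarmName='dynamo-test-ReadCapacityUnitsLimit-BasicAlarm',
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': 'dynamo-test'}],
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=12,
    Threshold=240,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='ignore',  # or 'notBreaching' to treat missing data as OK
)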