I use AWS Data Pipeline to automatically back up DynamoDB tables to S3 on a weekly basis.
All of my pipelines stopped working two weeks ago.
After some investigation, I see that EMR fails with "validation error" and "Terminated with errors No active keys found for user account". As a result, all the jobs time out.
Any ideas what this means?
I ruled out changes to the list of instance types that are allowed to be used with EMR.
I also tried to read the EMR logs, but it looks like it doesn't even get to the point of creating logs (or I am looking for them in the wrong place).
The AWS account used to launch EMR has keys (an access key and a secret key). Could you check whether those keys have been deleted? You need to log in to the AWS console and check that keys exist for your account.
If not, re-create the keys and use them in your code that launches EMR.
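If you'd rather verify this from code than from the console, a minimal boto3 sketch along these lines could do it (the user name below is a placeholder; for root-account keys you would check under "My Security Credentials" in the console instead):

```python
# Minimal sketch, assuming boto3 is configured with credentials allowed to call IAM.
# "pipeline-user" is a placeholder for the IAM user whose keys are used to launch EMR.
import boto3

iam = boto3.client("iam")

keys = iam.list_access_keys(UserName="pipeline-user")["AccessKeyMetadata"]
active = [k for k in keys if k["Status"] == "Active"]

if not active:
    # No active keys left (e.g. they were deleted), so create a new pair and plug
    # the returned AccessKeyId / SecretAccessKey into whatever launches EMR.
    new_key = iam.create_access_key(UserName="pipeline-user")["AccessKey"]
    print("Created new key:", new_key["AccessKeyId"])
else:
    print("Active keys:", [k["AccessKeyId"] for k in active])
```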
Basically #Sandesh Deshmane answered my question correctly.
For future reference and clarity, I'll explain the situation here too:
What happened was that originally I used the root account and the console to create the pipelines. Later I decided to follow best practices and removed my root account keys.
A few days later (my pipelines are scheduled to run weekly) when they all failed I did not make the connection and thought of other problems.
I think one good way to avoid this (if you want to use the console) is to log in to the console with an IAM account and create the pipelines.
Or you can use the command-line tools to create them with IAM credentials.
The real solution now (I think it was not available when the console was first introduced) is to assign the correct IAM role on the first page when you are creating your pipeline in the console. In the "Security/Access" section, change it from default to custom and select the correct roles there.
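For the command-line/API route, those roles are just fields on the Default object of the pipeline definition. A rough boto3 sketch of the idea (the pipeline name and role names below are placeholders, and the EMR activity and data-node objects are omitted):

```python
# Rough sketch, assuming boto3 and that the default Data Pipeline roles already exist.
# Pipeline name and role names are placeholders.
import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(
    name="ddb-weekly-backup", uniqueId="ddb-weekly-backup"
)["pipelineId"]

# The roles are plain fields on the Default object of the pipeline definition,
# which is what the console's "Security/Access: Custom" setting writes for you.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "cron"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
        # ... EmrActivity / DynamoDBDataNode / S3DataNode objects would go here ...
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```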
If given the correct permissions, Lambda functions will automatically create CloudWatch log groups to hold the log output from the Lambda. The same goes for Lambda@Edge functions, with the addition that the log groups are created in each region in which the Lambda@Edge function has run, and the name of the log group includes the name of the region. The problem is that the retention time is set to never expire, and there is no way to change that unless you wait for the log group to be created and then change the retention configuration after the fact.
To address this, I would like to create those log groups preemptively in Terraform. The problem is that the region would need to be set in the provider meta argument or passed in the providers argument to a module. I had originally thought that I could get the set of all AWS regions using the aws_regions data source and then dynamically create a provider for each region. However, there is currently no way to dynamically generate providers (see https://github.com/hashicorp/terraform/issues/24476).
Has anyone solved this or a similar problem in some other way? Yes, I could create a script using the AWS CLI to do this, but I'd really like to keep everything in Terraform. Using Terragrunt is also an option, but I wanted to see if there were any solutions using pure Terraform before I go that route.
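For reference, the script fallback mentioned above could look roughly like this with boto3 rather than the raw CLI; the function name, retention period, and the exact Lambda@Edge log group naming are assumptions to verify for your setup:

```python
# Rough sketch of the script fallback, assuming credentials that allow
# logs:CreateLogGroup / logs:PutRetentionPolicy in every region.
# The function name and retention period are placeholders.
import boto3

FUNCTION_NAME = "my-edge-function"   # hypothetical
RETENTION_DAYS = 30                  # hypothetical

ec2 = boto3.client("ec2", region_name="us-east-1")
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in regions:
    logs = boto3.client("logs", region_name=region)
    # Lambda@Edge typically logs to /aws/lambda/us-east-1.<function-name> in each
    # edge region; verify the exact group name for your deployment.
    group = f"/aws/lambda/us-east-1.{FUNCTION_NAME}"
    try:
        logs.create_log_group(logGroupName=group)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass  # already created by a previous run or by the function itself
    logs.put_retention_policy(logGroupName=group, retentionInDays=RETENTION_DAYS)
```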
I'm using HashiCorp Vault for creating users with the AWS secrets engine.
I have an issue using the AWS credentials I get, probably because it takes time for all the AWS servers to be updated with the newly created user, as stated here.
I'm using HashiCorp Vault for creating AWS users at runtime and using the credentials I get immediately. In practice, there can be a delay of up to a few seconds until I can actually use them. Besides implementing some retry mechanism, I wonder if there is a real solution to this issue, or at least a more elegant one.
As AWS IAM promises only eventual consistency, we cannot do anything better than delay and hope for the best. The bad part is that we don't know how long we should sleep until the new keys reach all endpoints.
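As a rough sketch of the retry idea from the question, you can probe the freshly issued keys with a cheap STS call and back off until they work; note that a successful STS call still doesn't strictly guarantee every other service has seen the keys yet:

```python
# Rough sketch of the retry mechanism: keep probing the freshly issued keys
# until IAM has propagated them, with a capped exponential backoff.
import time
import boto3
from botocore.exceptions import ClientError

def wait_until_usable(access_key, secret_key, max_attempts=10):
    delay = 1
    for _ in range(max_attempts):
        sts = boto3.client(
            "sts",
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
        try:
            sts.get_caller_identity()  # cheap call that just validates the credentials
            return True
        except ClientError:
            time.sleep(delay)
            delay = min(delay * 2, 16)
    return False
```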
This is a problem with the behavior of IAM, not really a Vault issue. There is a kind of workaround like this:
1. Make a new temporary user, generate keys for it, and hand the keys over to the Vault requester.
2. Use a non-temporary user, make a new key pair for it, etc.
I didn't test it, but as an idea to try I guess it's OK.
HashiCorp released a change in how they handle the dynamically created IAM users, and the Vault provider now accounts for this delay: https://github.com/terraform-providers/terraform-provider-vault/blob/master/CHANGELOG.md#260-november-08-2019 Since this update I rarely run into issues, but they still occur once in a while.
I'm trying to run a simple GroundTruth labeling job with a public workforce. I upload my images to S3, start creating the labeling job, generate the manifest using their tool automatically, and explicitly specify a role that most certainly has permissions on both S3 bucket (input and output) as well as full access to SageMaker. Then I create the job (standard rest of stuff -- I just wanted to be clear that I'm doing all of that).
At first, everything looks fine. All green lights, it says it's in progress, and the images are properly showing up in the bottom where the dataset is. However, after a few minutes, the status changes to Failure and I get this: ClientError: Access Denied. Cannot access manifest file: arn:aws:sagemaker:us-east-1:<account number>:labeling-job/<job name> using roleArn: null in the reason for failure.
I also get the error underneath (where there used to be images but now there are none):
The specified key <job name>/manifests/output/output.manifest isn't present in the S3 bucket <output bucket>.
I'm very confused for a couple of reasons. First of all, this is a super simple job. I'm just trying to do the most basic bounding box example I can think of. So this should be a very well-tested path. Second, I'm explicitly specifying a role arn, so I have no idea why it's saying it's null in the error message. Is this an Amazon glitch or could I be doing something wrong?
The role must include SageMakerFullAccess and access to the S3 bucket, so it looks like you've got that covered :)
Please check that:
the user creating the labeling job has Cognito permissions: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-getting-started-step1.html
the manifest exists and is at the right S3 location.
the bucket is in the same region as SageMaker.
the bucket doesn't have any bucket policy restricting access.
If that still doesn't fix it, I'd recommend opening a support ticket with the labeling job id, etc.
Julien (AWS)
There's a bug whereby sometimes the console will say something like 401 ValidationException: The specified key s3prefix/smgt-out/yourjobname/manifests/output/output.manifest isn't present in the S3 bucket yourbucket. Request ID: a08f656a-ee9a-4c9b-b412-eb609d8ce194 but that's not the actual problem. For some reason the console is displaying the wrong error message. If you use the API (or AWS CLI) to DescribeLabelingJob like
aws sagemaker describe-labeling-job --labeling-job-name yourjobname
you will see the actual problem. In my case, one of the S3 files that define the UI instructions was missing.
I had the same issue when I tried to write to a different bucket from the one that had been used successfully before.
Apparently the IAM role ARN can be assigned permissions for a particular bucket only.
I would suggest referring to CloudWatch Logs and looking for the CloudWatch >> CloudWatch Logs >> Log groups >> /aws/sagemaker/LabelingJobs group. I had ticked all the points from another post, but my pre-processing Lambda function had the wrong ID for my region, and the error was obvious in the logs.
I have a bit of a mysterious issue: I have a Lambda function which transports data from an S3 bucket to an AWS ES cluster.
My lambda function runs correctly and reports the following:
All 6 log records added to ES
However, the added documents do not appear in the AWS Elasticsearch index:
/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open logs 3N2O9CqhSwCP6sj1QK5EQw 5 1 0 0 1.2kb 1.2kb
I'm using this Lambda function: https://github.com/aws-samples/amazon-elasticsearch-lambda-samples/blob/master/src/s3_lambda_es.js
The Lambda function's role has full permissions on the ES cluster and the S3 bucket. It can access the S3 bucket, because I can print out its contents to the Lambda console log.
Any ideas for further debugging are much appreciated!
Cheers
There can be many reasons for this. Since you are asking for debugging ideas, here are a couple of them:
Add a console.log in the postDocumentToES method of the Lambda that shows exactly where it connects.
Try extracting the code from the Lambda and running it locally, just to make sure it succeeds in sending to Elasticsearch (so that the code is correct at least); see the sketch after this list.
Make sure that there are no "special restrictions" on the index (like a TTL of a couple of minutes or something), or maybe something that doesn't allow inserting into the index.
How many ES servers do you have? Maybe there is a cluster of them and the replication is not configured correctly, so when you check the state of the index on one ES node it doesn't actually have the documents, but another ES node could have them.
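For the second point, a minimal local indexing check could look like this in Python, assuming the elasticsearch (7.x), requests-aws4auth, and boto3 packages and a placeholder domain endpoint; the Lambda sample itself is Node.js, so this is only a stand-alone sanity check, not the Lambda code:

```python
# Minimal local check: sign requests with the local AWS credentials, index one
# test document into the "logs" index, and read the count back.
import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

ES_ENDPOINT = "search-mydomain-abc123.us-east-1.es.amazonaws.com"  # placeholder
REGION = "us-east-1"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

es = Elasticsearch(
    hosts=[{"host": ES_ENDPOINT, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Index a single test document and confirm that writes actually reach the cluster.
es.index(index="logs", body={"message": "test from local machine"}, refresh=True)
print(es.count(index="logs"))
```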
I'm new to AWS; I started working with it a couple of months ago.
A requirement from the client is:
Get the daily count of the `users` table from `AWS RDS` in an alert at 7 AM Pacific.
I can write a Python script to do this and run it from an AWS instance by setting up a cron job, or as a Lambda with a CloudWatch schedule.
But I've heard from the client that there is something in AWS (or AWS RDS) which allows you to:
Run an SQL (or a sequel ;)) query
And send that query result in an email alert
He added that one of our colleagues had done it for some other purpose (and the sad part is that the colleague has left our org now :( ).
So I'm curious what he might have done directly from AWS or from RDS to send an alert notification.
Please suggest if anyone has any idea about it.
Writing the query in a Lambda function and using either SES or SNS to send the notifications is how I would do it (a rough sketch is below), and either of those options would count as doing it 'in AWS'.
Depending on the flavor of RDS you are using (SQL Server, Aurora, Postgres, etc.), there may be a vendor-specific way as well, but personally I'd still choose the Lambda / CloudWatch event method.
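For what it's worth, here is a rough sketch of that Lambda for a MySQL-flavored RDS instance; the host, credentials, and topic ARN are placeholders, and you would trigger it with a CloudWatch Events / EventBridge cron rule at 7 AM Pacific and subscribe the e-mail addresses to the SNS topic:

```python
# Rough sketch, assuming a MySQL-compatible RDS instance, the pymysql package
# bundled with the deployment, and an existing SNS topic with e-mail subscribers.
import boto3
import pymysql

DB_HOST = "mydb.abc123.us-east-1.rds.amazonaws.com"                 # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:daily-user-count"   # placeholder

def handler(event, context):
    # Ideally the password comes from Secrets Manager or an environment variable.
    conn = pymysql.connect(host=DB_HOST, user="report",
                           password="REPLACE_ME", database="app")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM users")
            (count,) = cur.fetchone()
    finally:
        conn.close()

    # SNS delivers the Subject/Message to every e-mail subscriber of the topic.
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Subject="Daily users count",
        Message=f"The users table currently has {count} rows.",
    )
```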