I am trying to make use of Redshift's UNLOAD command to copy some data into S3. My query looks like this:
unload ('select ________________')
to 's3://my-bucket-name'
iam_role 'arn #';
If I run the SELECT statement on its own, I get 3 records returned. If I run the query as shown above, it completes, but I never see anything show up in my S3 bucket. Is there somewhere I can find more detailed logs?
Also, the IAM role I'm giving it is an admin role that has full access to everything; I was planning to get it working for a proof of concept and then make/use a role that is better suited. Also, can someone explain the IAM role? Why does it not use the IAM role of me, the user running the query?
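For what it's worth, here is a minimal sketch (using the Redshift Data API via boto3; the cluster, database, bucket, prefix, and role names are placeholders) of issuing an UNLOAD with an explicit key prefix and then checking whether any objects actually landed under that prefix:

import boto3

# All names below are placeholders for this sketch.
rsd = boto3.client("redshift-data")
resp = rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "unload ('select ...') "
        "to 's3://my-bucket-name/unload/my_query_' "  # explicit key prefix
        "iam_role 'arn:aws:iam::123456789012:role/my-unload-role';"
    ),
)
# The statement runs asynchronously; poll describe_statement until FINISHED/FAILED.
print(rsd.describe_statement(Id=resp["Id"])["Status"])

# Then check whether any output files were written under the prefix.
s3 = boto3.client("s3")
listed = s3.list_objects_v2(Bucket="my-bucket-name", Prefix="unload/my_query_")
print(listed.get("KeyCount", 0), "objects found under the prefix")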
Related
Background:
I am trying to generate a patch compliance data report in QuickSight. To do this I am using Terraform, and I have added all the inventory data to an S3 bucket.
I have created an Athena automation document which creates the database/tables in Athena from the S3 bucket data. Now I want to add some Terraform code which executes the automation document daily at a scheduled time.
For more information about this task: https://reinvent2019.awsmanagement.tools/mgt410/en/cont.html
Problem:
I can create a maintenance window to define a cron schedule for the automation task, but I do not have a target to add.
My Athena automation script only creates/updates the database in Athena, so there is no role for a target here.
Can someone guide me on this issue?
Thank you in advance
You can create a CloudWatch Events rule that triggers on a schedule and calls a Lambda function, which in turn invokes your Athena logic. Here is a good example: https://thedataguy.in/automate-aws-athena-create-partition-on-daily-basis/
A note on QuickSight: if you are using SPICE instead of direct query, you need to manage the SPICE refresh too, which might be tricky. The default setting only allows a once-a-day scheduled refresh.
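For reference, a minimal sketch of what the Lambda side of that could look like (the database, query, and output location are placeholders for whatever your automation document actually runs):

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Kick off the Athena work on each scheduled invocation.
    # Database, query, and result location are placeholders for this sketch.
    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE patch_compliance;",
        QueryExecutionContext={"Database": "inventory_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    return response["QueryExecutionId"]

The CloudWatch Events (EventBridge) rule then only needs a cron or rate schedule with this function as its target.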
I'm trying to run a simple GroundTruth labeling job with a public workforce. I upload my images to S3, start creating the labeling job, generate the manifest using their tool automatically, and explicitly specify a role that most certainly has permissions on both S3 bucket (input and output) as well as full access to SageMaker. Then I create the job (standard rest of stuff -- I just wanted to be clear that I'm doing all of that).
At first, everything looks fine. All green lights, it says it's in progress, and the images are properly showing up in the bottom where the dataset is. However, after a few minutes, the status changes to Failure and I get this: ClientError: Access Denied. Cannot access manifest file: arn:aws:sagemaker:us-east-1:<account number>:labeling-job/<job name> using roleArn: null in the reason for failure.
I also get the error underneath (where there used to be images but now there are none):
The specified key <job name>/manifests/output/output.manifest isn't present in the S3 bucket <output bucket>.
I'm very confused for a couple of reasons. First of all, this is a super simple job. I'm just trying to do the most basic bounding box example I can think of. So this should be a very well-tested path. Second, I'm explicitly specifying a role arn, so I have no idea why it's saying it's null in the error message. Is this an Amazon glitch or could I be doing something wrong?
The role must include SageMakerFullAccess and access to the S3 bucket, so it looks like you've got that covered :)
Please check that:
the user creating the labeling job has Cognito permissions: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-getting-started-step1.html
the manifest exists and is at the right S3 location (a quick check for this and the next point is sketched after this list).
the bucket is in the same region as SageMaker.
the bucket doesn't have any bucket policy restricting access.
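As referenced above, here is a minimal boto3 sketch for verifying the manifest location and the bucket region (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "my-labeling-bucket"            # placeholder
manifest_key = "input/dataset.manifest"  # placeholder

# Raises a ClientError if the manifest object is missing or not readable.
s3.head_object(Bucket=bucket, Key=manifest_key)

# Prints the bucket's region (None means us-east-1).
print(s3.get_bucket_location(Bucket=bucket)["LocationConstraint"])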
If that still doesn't fix it, I'd recommend opening a support ticket with the labeling job id, etc.
Julien (AWS)
There's a bug whereby sometimes the console will say something like 401 ValidationException: The specified key s3prefix/smgt-out/yourjobname/manifests/output/output.manifest isn't present in the S3 bucket yourbucket. Request ID: a08f656a-ee9a-4c9b-b412-eb609d8ce194 but that's not the actual problem. For some reason the console is displaying the wrong error message. If you use the API (or AWS CLI) to DescribeLabelingJob like
aws sagemaker describe-labeling-job --labeling-job-name yourjobname
you will see the actual problem. In my case, one of the S3 files that define the UI instructions was missing.
I had the same issue when I tried to write to a different bucket from the one that had been used successfully before.
Apparently the IAM role had been granted permissions for that particular bucket only.
I would suggest referring to the CloudWatch logs: look under CloudWatch >> CloudWatch Logs >> Log groups >> /aws/sagemaker/LabelingJobs. I had ticked off all the points from the other answer, but my pre-processing Lambda function had the wrong id for my region, and the error was obvious in the logs.
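If clicking through the console gets tedious, the same log group can also be scanned from code; a small sketch (the filter pattern is a placeholder for your job name):

import boto3

logs = boto3.client("logs")

# Pull recent events from the Ground Truth labeling-job log group.
events = logs.filter_log_events(
    logGroupName="/aws/sagemaker/LabelingJobs",
    filterPattern="yourjobname",  # placeholder: narrow the events to your job
    limit=50,
)
for event in events["events"]:
    print(event["message"])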
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time, I don't care about transformations, this is a prototype and I simply want to dump the DB to S3 to start testing the various tool chains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, the AWS Glue service role, and the S3 bucket above, with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the glue service role to eliminate IAM access issues and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schema in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schema, see Cataloging Tables with a Crawler.
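For anyone scripting this instead of using the console, here is a rough boto3 sketch of a JDBC crawler with an include path ending in /% (the crawler, role, connection, and database names are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="rds-mysql-crawler",            # placeholder
    Role="AWSGlueServiceRoleDefault",    # placeholder Glue service role
    DatabaseName="my_glue_database",     # placeholder Glue catalog database
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-rds-connection",  # placeholder Glue connection
                "Path": "mydatabase/%",                 # note the trailing /% for MySQL
            }
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE",
    },
)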
Ryan Fisher is correct in the sense that it's an error, though I wouldn't categorize it as a syntax error. When I ran into this, it was because the 'Include path' didn't include the default schema that SQL Server lovingly provides to you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
I am creating a transient EMR Spark cluster programmatically, reading a vanilla S3 object, converting it to a DataFrame and writing a parquet file.
Running on a local cluster (with S3 credentials provided) everything works.
Spinning up a transient cluster and submitting the job fails on the write to S3 with the error:
AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.
But my job is able to read the vanilla object from S3, and it logs to S3 correctly. Additionally, I see that EMR_EC2_DefaultRole is set as the EC2 instance profile, that EMR_EC2_DefaultRole has the proper S3 permissions, and that my bucket has a policy set for EMR_EC2_DefaultRole.
I get that the 'filesystem' I am trying to write the parquet file to is special, but I cannot figure out what needs to be set for this to work.
Arrrrgggghh! Basically as soon as I had posted my question, the light bulb went off.
In my Spark job I had
val cred: AWSCredentials = new DefaultAWSCredentialsProviderChain().getCredentials
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
which were necessary when running locally in a test cluster, but clobbered the good values when running on EMR. I changed the block to
overrideCredentials.foreach(cred => {
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
})
and pushed the credential retrieval into my test harness (which is, of course, where it should have been all the time.)
If you are running in EC2 with the open-source Hadoop code (not EMR), use the S3A connector, as it will use the EC2 IAM credential provider as the last of the credential providers it tries by default.
The IAM credentials are short lived and include a session key: if you are copying them then you'll need to refresh at least every hour and set all three items: access key, session key and secret.
Like I said: s3a handles this, with the IAM credential provider triggering a new GET of the instance-info HTTP server whenever the previous key expires.
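As a sketch of that approach (shown in PySpark here; the paths are placeholders, and it assumes the hadoop-aws module is on the classpath): write to an s3a:// path and set no keys at all, letting the instance-profile credentials be picked up automatically.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-instance-profile-demo").getOrCreate()

df = spark.read.text("s3a://my-input-bucket/raw/object.txt")  # placeholder path

# No fs.s3a access key or secret is set anywhere: on EC2 the S3A connector
# falls back to the instance-profile credentials on its own.
df.write.parquet("s3a://my-output-bucket/output/")            # placeholder path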
I use AWS Data Pipeline to automatically back up DynamoDB tables to S3 on a weekly basis.
All of my data pipelines stopped working two weeks ago.
After some investigation, I see that EMR fails with "validation error" and "Terminated with errors No active keys found for user account". As a result, all the jobs time out.
Any ideas what this means?
I ruled out changes to the list of instance types that are allowed to be used with EMR.
I also tried to read the EMR logs, but it looks like it doesn't even get to the point of creating logs (or I am looking for them in the wrong place).
The AWS account used to launch EMR has keys (an access key and a secret key). Could you check whether those keys have been deleted? You need to log in to the AWS console and check that keys exist for your account.
If not, re-create the keys and use them in the code that launches EMR.
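A quick way to check this from code rather than the console (the user name is a placeholder; omit UserName to check the user whose credentials you are calling with):

import boto3

iam = boto3.client("iam")

# Lists the access keys for the given IAM user and whether each is Active.
for key in iam.list_access_keys(UserName="pipeline-user")["AccessKeyMetadata"]:
    print(key["AccessKeyId"], key["Status"])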
Basically @Sandesh Deshmane answered my question correctly.
For future reference and clarity, I'll explain the situation here too:
What happened was that originally I used the root account and console to create the pipelines. Later I decided to follow the best practices and removed my root account keys.
A few days later (my pipelines are scheduled to run weekly), when they all failed, I did not make the connection and suspected other problems.
I think one good way to avoid this (if you want to use the console) is to log in to the console with an IAM account and create the pipelines.
Or you can use the command-line tools to create them with IAM credentials.
The real solution now (I think it was not available when the console was first introduced) is to assign the correct IAM roles on the first page when you are creating your pipeline in the console. In the "Security/Access" section, change it from default to custom and select the correct roles there.
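If you create pipelines from code instead of the console, the equivalent is to set the role fields on the Default object of the pipeline definition; a rough sketch (the pipeline name is a placeholder, the role names shown are the usual defaults, and the definition is abbreviated):

import boto3

dp = boto3.client("datapipeline")

pipeline = dp.create_pipeline(name="weekly-ddb-backup", uniqueId="weekly-ddb-backup-1")

# The Default object carries the pipeline role and the resource (EC2/EMR) role,
# so the pipeline never depends on a user's access keys.
dp.put_pipeline_definition(
    pipelineId=pipeline["pipelineId"],
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                {"key": "scheduleType", "stringValue": "cron"},
            ],
        },
        # ... schedule, resource, and activity objects for the actual backup go here ...
    ],
)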