I am trying to create and run an AWS Glue crawler through the boto3 library. The crawler runs against JSON files in an S3 folder. The crawler completes successfully and there are no errors when I check the logs, but it doesn't create any table in my Glue database.
It's not a permission issue: I am able to create the same crawler through a CFT, and when I run that one it creates the table as expected. In the boto3 code I'm using the same role as in my CFT.
I have tried using boto3 create_crawler() and start_crawler(). I have also tried using boto3 update_crawler() on the crawler created from the CFT and updating the S3 target path.
# 'glue' is a boto3 Glue client and 's3_target' is the S3 path to crawl,
# both created earlier in the script
response = glue.create_crawler(
    Name='my-crawler',
    Role='my-role-arn',
    DatabaseName='glue_database',
    Description='Crawler for generating table from s3 target',
    Targets={
        'S3Targets': [
            {
                'Path': s3_target
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=''
)
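For completeness, starting the crawler and waiting for it to finish looks roughly like this (the polling loop is illustrative, not the original code):

import time

glue.start_crawler(Name='my-crawler')

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name='my-crawler')['Crawler']['State'] != 'READY':
    time.sleep(30)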
Are you sure you have passed the correct region to the Glue client (when creating the glue object)?
I once copied code to a new region and forgot to change the region, then spent hours figuring out why no table was being created when there was no error. Eventually I figured out that the table was being created in the other region because I had forgotten to update the region when I copied the code.
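A quick way to rule this out is to pin the client to an explicit region and print what it resolved to (the region below is just an example):

import boto3

# Create the Glue client in the same region as the Glue database (example region)
glue = boto3.client('glue', region_name='us-east-1')

# Sanity check: which region is the client actually talking to?
print(glue.meta.region_name)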
My QuickSight currently takes everything in the S3 bucket.
(S3 sample) https://i.stack.imgur.com/cO8kL.png
But the S3 folder keeps changing based on the date (01/, 02/, 03/, and so on). Is there a way to take only the latest data rather than all of it?
This is my current manifest:
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "https://sample-S3bucket.amazonaws.com/"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "JSON"
    }
}
There might be a simple solution that I don't know about.
You could set QuickSight to read from another bucket and set up a Lambda that is triggered when a new file is uploaded into your existing bucket. This Lambda would do the following (a rough sketch follows the list):
Remove any files from the bucket which QuickSight is reading from
Copy the latest file into that bucket
Create a QuickSight SPICE ingestion via the API
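A minimal sketch of such a Lambda, assuming hypothetical bucket names, a hypothetical dataset ID, and that the function is triggered by an S3 PUT event:

import uuid
import boto3

s3 = boto3.client('s3')
quicksight = boto3.client('quicksight')

SOURCE_BUCKET = 'sample-s3bucket'             # bucket where new dated files arrive (assumed name)
QUICKSIGHT_BUCKET = 'quicksight-latest-data'  # bucket the QuickSight manifest points at (assumed name)
DATASET_ID = 'my-dataset-id'                  # QuickSight dataset ID (assumed)
ACCOUNT_ID = '123456789012'                   # AWS account ID (assumed)

def handler(event, context):
    # Key of the file that triggered the event
    new_key = event['Records'][0]['s3']['object']['key']

    # 1. Remove any existing files from the bucket QuickSight reads from
    existing = s3.list_objects_v2(Bucket=QUICKSIGHT_BUCKET)
    for obj in existing.get('Contents', []):
        s3.delete_object(Bucket=QUICKSIGHT_BUCKET, Key=obj['Key'])

    # 2. Copy the latest file over
    s3.copy_object(
        Bucket=QUICKSIGHT_BUCKET,
        Key='latest.json',
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': new_key},
    )

    # 3. Kick off a SPICE ingestion so the dataset picks up the new file
    quicksight.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=str(uuid.uuid4()),
    )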
I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps:
Export DDB table into Amazon S3
Use a Glue job to read the files from the Amazon S3 bucket and write them to the target DynamoDB table
I was able to do the first step. Unfortunately, the instructions don't say how to do the second step. I have worked with Glue a couple of times, but the console UI is very user-unfriendly and I have no idea how to achieve it.
Can somebody please explain how to import the data from S3 into DynamoDB?
You could use Glue Studio to generate a script.
Log into AWS
Go to Glue
Go to Glue Studio
Set up the source, basically point it to S3
and then use something like the snippet below. This is for a DynamoDB table with pk and sk as a composite primary key.
This is just the mapping to a DynamicFrame and writing it to DynamoDB:
# Map the exported DynamoDB attributes (pk and sk) straight through
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("Item.pk.S", "string", "Item.pk.S", "string"),
        ("Item.sk.S", "string", "Item.sk.S", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Write the mapped records to the target DynamoDB table
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my-target-table"},
)
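For context, S3bucket_node1 in the snippet above is the S3 source that Glue Studio generates for you. A rough sketch of that surrounding boilerplate, assuming a hypothetical bucket/path for the DynamoDB export (the path and job setup below are illustrative, not from the original answer):

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# S3 source node pointing at the DynamoDB export files (hypothetical path)
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-export-bucket/AWSDynamoDB/"], "recurse": True},
    format="json",
    transformation_ctx="S3bucket_node1",
)

# ... the ApplyMapping and write_dynamic_frame snippet above goes here ...

job.commit()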
Using Terraform, I have created an S3 bucket "sample-s3" in AWS using a module.
After some time I decided to change the module used for creating the S3 bucket, but the existing S3 bucket should not be deleted and re-created. Is that possible? Could someone help me out?
You can remove that S3 bucket from the Terraform state with the terraform state rm command and then, after you have changed your module, import it again under its new resource address with terraform import.
I need to create a crawler on AWS Glue to catalogue some tables that I usually query on the CLI using something like this:
$ aws s3 ls s3://bucket/path/ --request-payer requester
but when creating a crawler I can't figure out where I need to configure the requester pays option, so I'm getting this error log:
ERROR : User does not have access to target
Any thoughts?
I'm using the AWS console for that.
I am new to AWS as well as to Terraform. I have to create a Glue job using Terraform (HCL). The Terraform docs show a script that launches a Glue job using the aws_glue_job resource, but there is no way to specify the data source and where the data needs to go (the target) after it is transformed. In my scenario, the data source should be a Glue table that was created from Amazon S3 JSON files, and the target also has to be an S3 bucket, but the data files should be transformed from JSON to Parquet. I can't find a way to specify this source and target while creating the Glue job with Terraform. Help is much appreciated. Thanks in advance.
UPDATE: I generated the script by creating the job through the console and used it while creating the resources through Terraform; the script itself took care of the source and target.
You have to create an S3 bucket where you will keep the script (Python/PySpark) along with your transformation logic, and another S3 bucket where you will keep your output, and then give the script's location in the script path while creating the Glue job.
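For the script itself, here is a rough sketch of the kind of JSON-to-Parquet job you could put at that script location; the database, table, and output bucket names below are placeholders, not from the question:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Glue table crawled from the JSON files (placeholder names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_json_table",
    transformation_ctx="source",
)

# Target: write the same data back to S3 as Parquet (placeholder bucket/prefix)
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},
    format="parquet",
    transformation_ctx="target",
)

job.commit()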
Below is the Terraform code to create the Glue job:
resource "aws_glue_job" "your job name"
{
name = "your job name"
role_arn = "${aws_iam_role.yourole.arn}"
max_retries = 0
timeout = 60
number_of_workers = 5
worker_type = "Standard"
execution_property {
max_concurrent_runs = 10
}
command {
script_location = "s3://${var.scriptbucketname}/script/scriptname.py"
python_version = "3"
}