I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps:
Export the DDB table into Amazon S3
Use a Glue job to read the files from the Amazon S3 bucket and write them to the target DynamoDB table
I was able to do the first step. Unfortunately, the instructions don't say how to do the second step. I have worked with Glue a couple of times, but the console UI is very user-unfriendly and I have no idea how to achieve this.
Can somebody please explain how to import the data from S3 into DDB?
You could use Glue Studio to generate a script.
Log into AWS
Go to Glue
Go to Glue Studio
Set up the source, basically point it to S3
Then use something like the snippet below; this is for a DynamoDB table with pk and sk as a composite primary key.
This is just the mapping to a DynamicFrame and writing it to DynamoDB:
# ApplyMapping keeps the exported key attributes under their DynamoDB-JSON names;
# S3bucket_node1 is the DynamicFrame read from the S3 export (created as shown below).
from awsglue.transforms import ApplyMapping

ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("Item.pk.S", "string", "Item.pk.S", "string"),
        ("Item.sk.S", "string", "Item.sk.S", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my-target-table"},
)
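For reference, the S3bucket_node1 source that the mapping reads from can be created along these lines; the export path is a placeholder, and this assumes the files are the DynamoDB-JSON objects produced by the table export (a sketch of what Glue Studio generates for the S3 source, not the exact script):

S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        # "data/" prefix of the DynamoDB export (placeholder path)
        "paths": ["s3://my-export-bucket/AWSDynamoDB/<export-id>/data/"],
        "recurse": True,
    },
    format="json",
    transformation_ctx="S3bucket_node1",
)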
Problem Statement: CSV data is currently stored in S3 (extracted from a PostgreSQL RDS instance), and I need to query this S3 data using Athena. To achieve this, I created an AWS Glue database and ran a crawler on the S3 bucket, but the data returned by the Athena query is broken (it breaks starting from columns with large text content). I tried changing the data type in the Glue table schema from string to varchar(1000) and recrawling, but it still breaks.
Data stored in the S3 bucket:
Data coming out of the Athena query on the same bucket (using SELECT *) [note the missing row]:
I also tested loading the S3 data in a Jupyter notebook in AWS Glue Studio with this code snippet, and the output data looks correct there:
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://..."]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
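For reference, the glueContext used in that snippet comes from the standard Glue session setup, which the Glue Studio notebook normally provides; a minimal sketch if you run it elsewhere:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
# ...then build the DynamicFrame exactly as in the snippet above and inspect it:
# dynamicFrame.toDF().show(truncate=False)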
Any help on this would be greatly appreciated!
My QuickSight dataset currently takes everything in the S3 bucket
(S3 sample: https://i.stack.imgur.com/cO8kL.png)
But the S3 folders keep changing based on the date (01/, 02/, 03/, and so on). Is there a way to take only the latest data, not all of it?
This is my current manifest:
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "https://sample-S3bucket.amazonaws.com/"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "JSON"
    }
}
There might be a simple solution that I don't know about.
You could set QuickSight to read from another bucket and set up a Lambda that is triggered when a new file is uploaded into your existing bucket. This Lambda would (see the sketch after this list):
Remove any files from the bucket which QuickSight is reading from
Copy the latest file over into that bucket
Create a QuickSight SPICE ingestion via the API.
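A rough sketch of such a Lambda, assuming the staging bucket name, dataset ID, and account ID below are placeholders you would replace:

import uuid
import boto3

s3 = boto3.client("s3")
quicksight = boto3.client("quicksight")

QS_BUCKET = "quicksight-staging-bucket"  # bucket the QuickSight manifest points at (placeholder)
DATASET_ID = "my-dataset-id"             # QuickSight dataset ID (placeholder)
ACCOUNT_ID = "123456789012"              # AWS account ID (placeholder)

def handler(event, context):
    # 1. Remove whatever QuickSight is currently reading
    for obj in s3.list_objects_v2(Bucket=QS_BUCKET).get("Contents", []):
        s3.delete_object(Bucket=QS_BUCKET, Key=obj["Key"])

    # 2. Copy the newly uploaded file (from the S3 trigger event) into the staging bucket
    record = event["Records"][0]["s3"]
    s3.copy_object(
        Bucket=QS_BUCKET,
        Key=record["object"]["key"].split("/")[-1],
        CopySource={"Bucket": record["bucket"]["name"], "Key": record["object"]["key"]},
    )

    # 3. Kick off a SPICE refresh so the dataset picks up the new file
    quicksight.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=str(uuid.uuid4()),
    )

The manifest then only ever points at the staging bucket, so QuickSight only sees the latest file.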
I am new to AWS as well as Terraform. I have to create a Glue job using Terraform (HCL). When I look at the Terraform docs, they have a script which launches a Glue job using the resource aws_glue_job, but there is no way to specify that this is my data source and this is where it needs to go (target) after the data is transformed. In my scenario, I have a Glue table created from Amazon S3 JSON files, which should be the data source, and the target also has to be an S3 bucket, but the data files should be transformed from JSON to Parquet. I can't find a way to specify this source and target while creating the Glue job using Terraform. Help is much appreciated. Thanks in advance.
UPDATE: I generated the script by creating the job through the console and used it while creating the resources through Terraform; the script itself takes care of the source and target now.
You have to create an S3 bucket where you will keep the script (Python/PySpark) with your transformation logic, and another S3 bucket where you will keep your output. You can then give the script location in the script path while creating the Glue job.
Below is the Terraform code to create the Glue job:
resource "aws_glue_job" "your job name"
{
name = "your job name"
role_arn = "${aws_iam_role.yourole.arn}"
max_retries = 0
timeout = 60
number_of_workers = 5
worker_type = "Standard"
execution_property {
max_concurrent_runs = 10
}
command {
script_location = "s3://${var.scriptbucketname}/script/scriptname.py"
python_version = "3"
}
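The source and target are not attributes of the aws_glue_job resource itself; they live in the script that script_location points to. A minimal sketch of such a script for the JSON-to-Parquet case (the database, table, and output path below are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Glue catalog table the crawler built over the JSON files in S3
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",  # placeholder
    table_name="my_json_table",   # placeholder
)

# Target: write the same records back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/parquet/"},  # placeholder
    format="parquet",
)

job.commit()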
I am trying to set up a Lambda to run an AWS Athena query daily and output the result to an S3 bucket in a different AWS account. The account I am writing the Lambda in has S3 write permissions in the other account; I just can't figure out how to specify the bucket I'm looking to write to, and I haven't been able to find any documentation on this use case.
The following is how I'm running my Athena query from the Lambda:
import boto3

client = boto3.client('athena')

client.start_query_execution(
    QueryString=[QUERY],
    QueryExecutionContext={
        'Database': [DATABASE]
    },
    ResultConfiguration={
        'OutputLocation': [OUTPUT_LOCATION]
    }
)
My query works fine when storing the result in my own AWS account, but I can't just write "s3://[BUCKETNAME]" where BUCKETNAME is the name of the bucket in the other account.
I'm guessing there is something very simple I'm missing. If anyone could tell me how to format OUTPUT_LOCATION, where ACCOUNT_ID is the ID of the other account and BUCKET_NAME is the name of the bucket, that would be very helpful!
Bucket names must be unique within a partition. A partition is a grouping of Regions. AWS currently has three partitions: aws (Standard Regions), aws-cn (China Regions), and aws-us-gov (AWS GovCloud [US] Regions).
Refer to this doc on the naming standards.
When defining the name of an S3 bucket, you do not have to specify an account ID.
For example, an ARN for an S3 bucket omits the account ID:
arn:aws:s3:::<BUCKET_NAME>
whereas an ARN for an IAM resource such as a role includes the account ID:
arn:aws:iam::<MyAccountA>:role/<MyRoleA>
client = boto3.client('athena')

client.start_query_execution(
    QueryString=[QUERY],
    QueryExecutionContext={
        'Database': [DATABASE]
    },
    ResultConfiguration={
        # Works even when the bucket is in another AWS account
        'OutputLocation': 's3://<BUCKET_NAME>'
    }
)
If the query fails, start looking at permissions. Your code is right.
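One quick way to sanity-check those permissions before digging into Athena itself is to attempt a direct write to the other account's bucket from the same role (bucket name and key below are placeholders):

import boto3

s3 = boto3.client('s3')

# If this fails with AccessDenied, the Lambda role or the bucket policy in the
# other account is what needs fixing, not the Athena call.
s3.put_object(
    Bucket='<BUCKET_NAME>',        # bucket in the other account (placeholder)
    Key='athena-results/_probe',   # placeholder key
    Body=b'',
)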
I am trying to create and run an AWS Glue crawler through the boto3 library. The crawler runs against JSON files in an S3 folder. The crawler completes successfully, and when I check the logs there are no errors, but it doesn't create any table in my Glue database.
It's not a permission issue, as I am able to create the same crawler through a CFT, and when I run that it creates the table as expected. I'm using the same role as my CFT in the code I'm running with boto3 to create it.
I have tried using boto3 create_crawler() and start_crawler(), and I have tried using boto3 update_crawler() on the crawler created from the CFT and updating the S3 target path.
response = glue.create_crawler(
    Name='my-crawler',
    Role='my-role-arn',
    DatabaseName='glue_database',
    Description='Crawler for generating table from s3 target',
    Targets={
        'S3Targets': [
            {
                'Path': s3_target
            }
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'
    },
    TablePrefix=''
)
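For completeness, a crawler created through boto3 still has to be started and polled separately before any table can show up; roughly (names match the snippet above):

import time

glue.start_crawler(Name='my-crawler')

# Wait for the run to finish, then list what (if anything) it created
while glue.get_crawler(Name='my-crawler')['Crawler']['State'] != 'READY':
    time.sleep(15)

tables = glue.get_tables(DatabaseName='glue_database')['TableList']
print([t['Name'] for t in tables])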
Are you sure you have passed the correct region to the Glue client (when creating the glue object)?
I once copied code to a new region, forgot to change the region, and spent hours figuring out why no table was being created when there was no error. Eventually I figured out that the table was being created in the other region, because I had forgotten to change the region when I copied the code.
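In other words, pin the region explicitly when you build the client; the region name below is just an example:

import boto3

# Create the Glue client in the region where you expect the crawler and table to appear
glue = boto3.client('glue', region_name='us-east-1')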