How to set up multiple automated workflows on AWS Glue

We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow will be as follows:
1. The user uploads a CSV file.
2. The data is transformed from XYZ format to ABC format (mapping and renaming fields).
3. The transformed CSV file is downloaded to the local system.
Note that this flow should happen programmatically (creating crawlers and job triggers should be done from code, not from the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks.
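Every action the console performs is also exposed through the Glue API, so crawlers, jobs and triggers can be created entirely from code (the AWS SDK for JavaScript offers the same Glue operations for a Node.js project). Below is a minimal boto3 sketch of that setup, not a complete solution; the bucket names, role ARN, script location and resource names are placeholders, not values from the question.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # 1) Crawl the raw CSV files the users upload to S3.
    glue.create_crawler(
        Name="raw-csv-crawler",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",   # placeholder role
        DatabaseName="raw_db",
        Targets={"S3Targets": [{"Path": "s3://my-upload-bucket/raw/"}]},
    )

    # 2) Register the transformation job; the script holds the XYZ -> ABC field mapping.
    glue.create_job(
        Name="xyz-to-abc",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-scripts-bucket/xyz_to_abc.py",
            "PythonVersion": "3",
        },
        GlueVersion="3.0",
    )

    # 3) Chain the job after the crawler with a conditional trigger.
    glue.create_trigger(
        Name="run-after-crawl",
        Type="CONDITIONAL",
        StartOnCreation=True,
        Predicate={
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": "raw-csv-crawler",
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        Actions=[{"JobName": "xyz-to-abc"}],
    )

    # A Lambda subscribed to the S3 upload event can then kick the flow off:
    glue.start_crawler(Name="raw-csv-crawler")

The transformed output lands back in S3, from where the application can serve it to the user for download, for example through a pre-signed URL.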

Related

Is there a way to deal with changes in log schema?

I am in a situation where I need to extract log JSON data, which might have changes in its data structure, to AWS S3 in real time.
I am thinking of using AWS S3 + AWS Glue Streaming ETL. The thing is, the structure or schema of the log JSON data might change (and these changes are unpredictable), so my solution needs to be aware of such changes and still stream the log data smoothly without causing errors. But as far as I know, all the AWS Glue tutorials show demos as if the structure of the incoming data never changes.
Can you recommend or tell me a solution within AWS that is suitable for my case?
Thanks.

How can I rewind an AWS glue job bookmark using the AWS CLI or boto3

What I am looking for is a way to do, programmatically or from the command line, the same thing the Rewind job bookmark button in the AWS Glue console does.
There's no way to do it programmatically right now (there might be in the future).
A workaround for now can be:
If you are extracting AWS S3 files in your job and transforming them, you can write a function that picks the files you want to reprocess and moves them back into the same source folder (see the sketch below).
If you are using --from-date and --to-date job parameters, you can set them to the values you want.
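A rough sketch of that first workaround, assuming the job reads from S3 and that Glue's S3 bookmarks key off the objects' last-modified time, so an in-place copy makes a file look new again. The bucket and prefix are placeholders.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-etl-bucket"   # placeholder bucket
    PREFIX = "input/"          # folder the Glue job reads from

    def touch_for_reprocessing(keys):
        """Copy each object onto itself so it gets a new LastModified timestamp
        and the job bookmark treats it as an unprocessed file again."""
        for key in keys:
            s3.copy_object(
                Bucket=BUCKET,
                Key=key,
                CopySource={"Bucket": BUCKET, "Key": key},
                MetadataDirective="REPLACE",   # required when copying an object onto itself
                Metadata={"reprocess": "true"},
            )

    # Pick the files you want the next run to process again.
    to_reprocess = [
        obj["Key"]
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
        if obj["Key"].endswith(".csv")
    ]
    touch_for_reprocessing(to_reprocess)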

Index large DynamoDB table into Cloudsearch via AWS CLI

I'm currently facing an issue with CloudSearch when trying to index a large DynamoDB table via the AWS Console:
Retrieving a subset of the table...
The request took too long to complete. Please try again or use the command line tools.
After looking through the documentation [1, 2, 3], I found examples of uploading several file formats through the CLI, but no mention of uploading data from a DynamoDB table using the CLI.
How could this be done without having to download the entire database into a file and upload it?
I got in contact with AWS Support, and there is no way of indexing DynamoDB data directly using the CLI.
The only available way is downloading the data and uploading it through the CLI using a classic file format/structure (.csv, .json, .xml or .txt).
A shame, unfortunately.
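For reference, a minimal sketch of that workaround with boto3: scan the table, build a CloudSearch document batch (SDF) as JSON, and push it with the cloudsearchdomain upload-documents API. The table name, document endpoint and the assumption that each item has an id attribute are placeholders; a large table would also need to be uploaded in chunks to stay under the 5 MB batch limit.

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("my-table")   # placeholder table name

    # 1) Scan the whole table (paginated).
    items, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # 2) Convert the items into a CloudSearch document batch (SDF).
    batch = [
        {
            "type": "add",
            "id": str(item["id"]),                        # assumes an 'id' attribute
            "fields": {k: str(v) for k, v in item.items()},
        }
        for item in items
    ]

    # 3) Upload the batch to the search domain's document endpoint.
    cs = boto3.client(
        "cloudsearchdomain",
        endpoint_url="https://doc-my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
    )
    cs.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )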

Can we see or edit 'job bookmark' info in AWS glue or where it get stored?

I created a Glue job and enabled the job bookmark, and I want to see the metadata Glue stores to keep track of processed files.
Unfortunately, the AWS Glue job bookmarking details are not exposed to customers. As far as I know, the bookmark is based on the job name, the source file names, and the transformationContext strings passed to methods such as getSinkWithFormat(), applyMapping() and getCatalogSource().
Besides that, Job.init() and Job.commit() must be called at the beginning and end of the script, respectively, for bookmarking to work.
You can now use aws glue get-job-bookmark --job-name <job_name> to get the content of the bookmark. See https://docs.aws.amazon.com/cli/latest/reference/glue/get-job-bookmark.html for details.
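To make that concrete, here is a skeleton of a Glue (PySpark) script showing the pieces the bookmark depends on: the transformation_ctx strings on the read and write, and the Job.init()/Job.commit() calls. The database, table and output path are placeholders.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)   # required for bookmarking to start

    # The transformation_ctx string is the key the bookmark state is stored under.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db",             # placeholder database
        table_name="uploads",          # placeholder table
        transformation_ctx="source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-output-bucket/processed/"},
        format="csv",
        transformation_ctx="sink",
    )

    job.commit()                       # persists the bookmark state for this run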

Import XML to Dynamodb

I have a set of very large XML files that I would like to import into DynamoDB after doing some data massaging.
Is this possible through AWS Data Pipeline or some other tool? Currently this is done manually through a program that runs the ETL process.
I am not sure how much AWS Data Pipeline would help you with the custom processing of the XML.
I would like to recommend a few approaches [definitely not an exhaustive list]; either way, it would be beneficial to keep those XML files in S3.
Try the Elastic MapReduce route [bonus points for Spot instances].
Try using AWS Lambda to process the files and push them to DynamoDB (a rough sketch follows below).
Try an Elastic Beanstalk batch process.
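For the Lambda option above, a rough sketch of what the handler could look like; the table name, the <item> element name and the key attribute are made-up assumptions, and very large XML files would call for a streaming parser (e.g. ElementTree's iterparse) instead of loading the whole document into memory.

    import xml.etree.ElementTree as ET

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("my-table")   # placeholder table

    def handler(event, context):
        # Triggered by the S3 ObjectCreated event for an uploaded XML file.
        record = event["Records"][0]["s3"]
        body = s3.get_object(
            Bucket=record["bucket"]["name"], Key=record["object"]["key"]
        )["Body"].read()

        root = ET.fromstring(body)
        with table.batch_writer() as batch:
            # Assume each <item> element maps to one DynamoDB item.
            for elem in root.iter("item"):
                item = {child.tag: child.text for child in elem}
                # Massage/derive the partition key as needed (placeholder logic).
                item["pk"] = item.get("id", record["object"]["key"])
                batch.put_item(Item=item)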
Currently it is not possible to import XML directly into DynamoDB through Data Pipeline.
But if you preprocess the XML files and convert the data to the format described in DynamoDBExportDataFormat (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-dynamodbexportdataformat.html), then you should be able to use the templates provided in the Data Pipeline console to accomplish the task (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.Templates.html).