I'm a beginner in AWS, so please bear with me if certain things are a bit off :)
I have a task where I need to load a fixed-width text file that contains both a header record and a footer record, and of course a lot of data in between. The data needs some simple changes before it is written to the destination file, which should also be a fixed-width file.
I would like to use AWS Glue for this, but I'm a little unsure how to approach it. I guess that since the data has a header and a footer, Spark would be my best option to both read and write the file?
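To make the header/footer handling concrete, here is a minimal sketch in plain Python. The field layout (`LAYOUT`), the field names and the upper-casing "change" are all made up for illustration, and the sketch assumes the file has at least a header line and a footer line. In a Glue Spark job the same slicing could be done per line after reading the file as text, but this is only a sketch of the idea, not a Glue script.

```python
from typing import List, Tuple

# Hypothetical layout: (field name, start offset, width). Adjust to the real file.
LAYOUT: List[Tuple[str, int, int]] = [("id", 0, 5), ("name", 5, 10), ("amount", 15, 8)]


def parse_line(line: str) -> dict:
    """Slice one fixed-width data line into a dict of stripped field values."""
    return {name: line[start:start + width].strip() for name, start, width in LAYOUT}


def format_line(record: dict) -> str:
    """Pad each field back to its width (left-aligned, truncated if too long)."""
    return "".join(str(record[name])[:width].ljust(width) for name, _, width in LAYOUT)


def transform_file(text: str) -> str:
    """Keep the header and footer untouched and rewrite the data lines in between."""
    lines = text.splitlines()
    header, body, footer = lines[0], lines[1:-1], lines[-1]
    out = [header]
    for line in body:
        record = parse_line(line)
        record["name"] = record["name"].upper()  # placeholder for the "simple changes"
        out.append(format_line(record))
    out.append(footer)
    return "\n".join(out) + "\n"
```

The first and last lines are passed through unchanged, so whatever the header and footer contain does not need to follow the data layout.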
The Glue job should be triggered by the input file being uploaded to an S3 bucket.
What would be the flow here?
Uploading file to S3
S3 notification event triggering what? Lambda?
Lambda starting up Glue job with spark script:
a) Load txt file data into table
b) reading and transforming data
c) Writing txt file in S3
Do I need a crawler in between somewhere?
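For the "S3 notification triggers Lambda, Lambda starts the Glue job" part of the flow, a minimal sketch could look like the following. The job name `fixed-width-transform` and the argument names `--input_bucket` / `--input_key` are assumptions made up for this example; the Glue script would have to read the same argument names.

```python
import urllib.parse

GLUE_JOB_NAME = "fixed-width-transform"  # hypothetical job name


def build_job_arguments(event: dict) -> dict:
    """Extract bucket and key of the first uploaded object from an S3 event."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
    key = urllib.parse.unquote_plus(s3_info["object"]["key"])
    return {"--input_bucket": bucket, "--input_key": key}


def lambda_handler(event, context):
    import boto3  # imported here so the helper above can be tested without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments=build_job_arguments(event),
    )
    return {"JobRunId": response["JobRunId"]}
```

In this sketch the job receives the S3 location as job arguments, so nothing in it depends on a crawler or the Data Catalog.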
Thanks in advance.
Related
I'm working with a pipeline that pushes JSON entries in batches to my Gcloud Storage bucket. I want to get this data into Kafka.
The way I'm going about it now is using a lambda function that gets triggered every minute to find the files that have changed, open streams from them, read line by line and batch every so often those lines as messages into a kafka producer.
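The "read line by line and batch every so often" step can be sketched like this. It is only the batching part; the function name `batched_lines` is made up here, and handing each batch to a Kafka producer is left out on purpose since it depends on the client library in use.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_lines(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size non-empty lines, stripped of newlines."""
    cleaned = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        batch = list(islice(cleaned, batch_size))
        if not batch:
            return
        yield batch
```

Because it works on any iterable of lines, the same function can be fed an open file stream, so a whole file never has to sit in memory at once.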
This process is pretty terrible, but it works.... eventually.
I was hoping there'd be a way to do this w/ Kafka Connect or Flink, but there really isn't much development around sensing incremental file additions to a bucket.
Do the JSON entries end up in different files in your bucket? Flink has support for streaming in new files from a source.
I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS, SQS, or by a schedule in EventBridge. The function uses the code snippet included in your question to decode each new file and uploads the transformed data to another S3 bucket.
I don't know the size of these files and how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small CSV will probably be easiest and will be good enough.
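If CSV turns out to be good enough, the conversion step inside the Lambda function could be sketched roughly as below. Decoding the NetCDF file is deliberately not shown; the sketch assumes you have already turned the decoded data into plain rows. The column names, the `csv/` prefix and both function names are made up for illustration.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv(header: Sequence[str], rows: Iterable[Sequence]) -> str:
    """Serialize decoded rows to CSV text that can be uploaded as one S3 object."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def output_key(input_key: str) -> str:
    """Map e.g. 'raw/forecast_20200304.nc' to 'csv/forecast_20200304.csv'."""
    filename = input_key.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]
    return f"csv/{stem}.csv"
```

Keeping one output object per input file makes the function safe to re-run: processing the same input again just overwrites the same key.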
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
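Should that second step become necessary, combining small CSV files is mostly a matter of keeping the header once. A minimal sketch, assuming every file starts with the same header line and no field contains an embedded newline (the function name `combine_csv` is made up):

```python
from typing import Iterable


def combine_csv(files: Iterable[str]) -> str:
    """Concatenate CSV texts that share a header, keeping the header only once."""
    combined = []
    header = None
    for text in files:
        lines = text.splitlines()
        if not lines:
            continue  # skip empty files
        if header is None:
            header = lines[0]
            combined.append(header)
        elif lines[0] != header:
            raise ValueError("CSV files do not share the same header")
        combined.extend(lines[1:])
    return ("\n".join(combined) + "\n") if combined else ""
```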
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases that Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful for other people, I'm sure the Athena team would love to talk to you and help you out.
I am new to AWS. I am writing an **AWS Glue job** for some transformations, and that part works. After the transformation I used **'from_options' in the DynamicFrameWriter class** to write the data frame as a CSV file, but the file was copied to S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 and loading them into an RDS instance.
Step 2: On successful job completion, transferred the contents of the file to another S3 bucket using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.
You have to set the format of the file you are writing.
e.g.: format="csv"
This should set the CSV file extension. However, you cannot choose the name of the file that is written. The only option you have is some sort of S3 operation afterwards where you change the key name of the file.
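The "S3 operation where you change the key name" is a copy followed by a delete, since S3 has no rename call. A minimal sketch, assuming `s3_client` is a boto3 S3 client and the object is small enough for a single `copy_object` call; the function name and the keys in the usage comment are made up:

```python
def rename_object(s3_client, bucket: str, old_key: str, new_key: str) -> None:
    """Give an S3 object a new key by copying it and then deleting the original."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # Only delete once the copy call has returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=old_key)


# Usage (hypothetical names):
#   import boto3
#   rename_object(boto3.client("s3"), "my-bucket", "out/run-part-00000", "out/report.csv")
```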
Considering that Data Pipeline is not available in the Singapore region, are there any alternatives for efficiently pushing CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and the Lambda function would be responsible for loading the data into DynamoDB.
This would work better (because of the limits of Lambda) if the CSV files were not huge, so they could be processed in a reasonable amount of time. The bonus is that once it is working, the only work left is to drop new files into the right bucket; no server required.
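The loading part of such a Lambda function could be sketched as follows. It assumes the CSV has a header row whose column names include the table's key attributes, that `table` behaves like a boto3 DynamoDB `Table` resource, and that storing every value as a string is acceptable; the function name is made up.

```python
import csv
import io


def load_csv_into_table(csv_text: str, table) -> int:
    """Write each CSV row as one item; returns the number of rows written.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    # batch_writer() buffers the puts and sends them in batches behind the scenes.
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(Item=row)
            count += 1
    return count


# Usage (hypothetical table name):
#   import boto3
#   table = boto3.resource("dynamodb").Table("my-table")
#   load_csv_into_table(csv_text, table)
```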
Here is a GitHub repository that has a CSV->DynamoDB loader written in Java - it might help get you started.
I am currently using the range header for GET request on Amazon S3 but I can't find an equivalent for PUT requests.
Do I have to upload the entire file again or can I specify where in the file I want to update? Thanks
You need to upload it again. S3 does not have a concept of appending to or editing a file.
However, if it's a large file, you can use something called "Multipart Upload": send the file in several pieces and have them merged back together on the AWS side:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/uploadobjusingmpu.html
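As a rough sketch of what that looks like with boto3 (the function names and the in-memory `data` argument are choices made for this example, and it assumes `data` is not empty): every part except the last has to be at least 5 MiB, and the upload is aborted if anything fails so no orphaned parts are left behind. Note that this still replaces the whole object; it only splits the upload into pieces.

```python
from typing import Iterator

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 requires at least 5 MiB per part, except the last


def split_parts(data: bytes, part_size: int = MIN_PART_SIZE) -> Iterator[bytes]:
    """Yield consecutive chunks of at most part_size bytes."""
    for offset in range(0, len(data), part_size):
        yield data[offset:offset + part_size]


def multipart_upload(s3_client, bucket: str, key: str, data: bytes,
                     part_size: int = MIN_PART_SIZE) -> None:
    """Upload data in parts and ask S3 to assemble them into one object."""
    upload = s3_client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload["UploadId"]
    parts = []
    try:
        for number, chunk in enumerate(split_parts(data, part_size), start=1):
            response = s3_client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=number, Body=chunk,
            )
            parts.append({"PartNumber": number, "ETag": response["ETag"]})
        s3_client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Clean up the partial upload before re-raising the original error.
        s3_client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```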