No extension when using 'from_options' in DynamicFrameWriter in AWS Glue Spark context

I am new to AWS. I am writing an **AWS Glue job** for some transformations, and that part works. After the transformation I use **'from_options' in the DynamicFrameWriter class** to write the data frame to S3 as a CSV file, but the file lands in S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 into an RDS instance.
Step 2: On successful job completion, transfer the contents of the file to another S3 bucket using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.

You have to set the format of the file you are writing, e.g. format="csv".
This should set the .csv file extension. However, you cannot choose the name of the file you write. The only option you have is a follow-up S3 operation that changes the key name of the file.
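For illustration, a minimal sketch of both steps, assuming a PySpark Glue job script where `glueContext` and `dyf` already exist; the bucket, prefixes, and target file name are hypothetical placeholders:

```python
import boto3

# Write the DynamicFrame to S3 as CSV (placeholder bucket/prefix).
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/"},
    format="csv",
)

# Spark/Glue picks the part-file names, so the only way to "rename" the
# output is to copy each written object to the key you want and delete
# the original.
s3 = boto3.client("s3")
bucket = "my-output-bucket"
listing = s3.list_objects_v2(Bucket=bucket, Prefix="output/")
for i, obj in enumerate(listing.get("Contents", [])):
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": obj["Key"]},
        Key=f"output/my-report-{i}.csv",
    )
    s3.delete_object(Bucket=bucket, Key=obj["Key"])
```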

Related

Continuously write to S3 file

I want to store user action logs continuously to an S3 file for that session.
Requirements:
a single file per session
continuous write operations to S3
the file should be downloadable at the end of the session
I don't want to create a new file for a single session; I want to update the same file. Please suggest only AWS solutions.
Do I need to create a stream and use it with S3, or use an intermediate storage system and push once in a while?
Objects in Amazon S3 are immutable -- they cannot be modified after they are created.
From your description, a good solution would be to use Amazon Kinesis Data Firehose. Your app can stream data to the Firehose and it will combine data together based on size or time. A long session might therefore produce multiple output files, so you would need a separate process that combines those files together into a single file.
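For illustration, a minimal sketch of pushing session events to a Kinesis Data Firehose delivery stream (configured separately to deliver to S3) using boto3; the stream name and event fields are hypothetical:

```python
import json
import boto3

firehose = boto3.client("firehose")

def log_user_action(session_id, action):
    # Each record is a newline-delimited JSON event; Firehose batches
    # records and delivers them to S3 based on size or time.
    record = {"session_id": session_id, "action": action}
    firehose.put_record(
        DeliveryStreamName="session-action-logs",
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```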

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, store it under a separate path such as s3://your_bucket/tmp. When your updates are complete, move it to a final path such as s3://your_bucket/complete, and configure the Lambda trigger only on the /complete path.
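A minimal sketch of that "publish when complete" step with boto3; the bucket name and the tmp/ and complete/ prefixes are placeholders, and the upload-to-RDS Lambda trigger would be configured only on the complete/ prefix:

```python
import boto3

s3 = boto3.client("s3")
bucket = "your_bucket"

def publish_csv(filename):
    # S3 has no "move": copy the finished file to the trigger path,
    # then delete the temporary object.
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": f"tmp/{filename}"},
        Key=f"complete/{filename}",
    )
    s3.delete_object(Bucket=bucket, Key=f"tmp/{filename}")
```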

Amazon S3 notification for file change

Initially a CSV file is uploaded to an S3 bucket, and we append to that file via a script whenever a new row is added. What we want is for the script to run only when the CSV file is modified. Is there any watcher that can notify the script to run when the CSV file changes?
There is an S3 event notification for that; you would be interested in the s3:ObjectCreated event:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You should also take a look at the S3 documentation and note the difference between S3 and a file system: an "update" or "append" operation on S3 actually replaces the whole object, just for your information.
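A minimal sketch of wiring that notification to a Lambda function that runs the script, using boto3; the bucket name, Lambda ARN, and .csv suffix filter are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Invoke the Lambda whenever a new .csv object is created (which includes
# every "append", since each append re-uploads the whole object).
s3.put_bucket_notification_configuration(
    Bucket="my-csv-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-csv",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".csv"}]}
                },
            }
        ]
    },
)
```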

Retaining source file name while importing data from s3 to Redshift

I have a large number of files in an S3 bucket that I regularly import into Redshift. Since the number of files is large, I need a column in the Redshift table that contains the source file name from the S3 location.
Is there any way to accomplish this?
Agree with Ketan that this is currently not possible in Redshift directly. If this is what you want to achieve, it is possible by either:
Reading the S3 files programmatically and writing new S3 files with the file name as a column, then loading the new files, or
Alternatively, using Hive: create an external table on the S3 bucket location and use INPUT__FILE__NAME to get the file names, create a new table, and then write back to S3. You can also do some pre-processing in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
Hope this helps.
That isn't possible. During a COPY operation, Redshift only loads file contents into a table; it doesn't provide access to S3 file names.
To achieve what you want, you need to preprocess the data to add the file name inside the files, as sketched below.
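For illustration, a minimal sketch of the preprocessing approach with boto3, appending each object's own key as an extra column before loading; the bucket and the raw/ and staged/ prefixes are hypothetical:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
bucket = "my-data-bucket"

for obj in s3.list_objects_v2(Bucket=bucket, Prefix="raw/").get("Contents", []):
    key = obj["Key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Rewrite each row with the source file name appended as the last column.
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(body)):
        writer.writerow(row + [key])

    # Write the enriched copy to a staging prefix that Redshift COPYs from.
    s3.put_object(
        Bucket=bucket,
        Key=key.replace("raw/", "staged/", 1),
        Body=out.getvalue().encode("utf-8"),
    )
```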

Move RedShift file to S3 as CSV

I'm trying to move a file from RedShift to S3. Is there an option to move this file as a .csv?
Currently I am writing a shell script to get the Redshift data, save it as a .csv, and then upload it to S3. I'm assuming that since this is all within AWS services, there is an argument or something that lets me do this.
Use the UNLOAD command. It will create at least one file per slice; you will have to merge the files yourself.
unload ('__SQL__')
to 's3://__BUCKET__/__PATH__'
credentials 'aws_access_key_id=__S3_KEY__;aws_secret_access_key=__S3_SECRET__'
delimiter as ','
addquotes
escape;
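If you want to drive this from a script rather than a shell pipeline, here is a minimal sketch using psycopg2; the connection details, query, bucket path, and credentials are all placeholders:

```python
import psycopg2

# Connect to the Redshift cluster (placeholder connection parameters).
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)
conn.autocommit = True

# Run UNLOAD so Redshift writes the query result directly to S3 as CSV.
with conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM my_table')
        TO 's3://my-bucket/exports/my_table_'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        DELIMITER AS ','
        ADDQUOTES
        ESCAPE;
    """)

conn.close()
```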