Best way to write DB results as a CSV file to an AWS S3 bucket

I need suggestions for the best way to write database results as a CSV file to an AWS S3 bucket.
Note: the CSV data may grow from KB to GB in size.

The best way would be:
1. Write your data to a CSV file on your local computer (or wherever your app is running).
2. Upload the file to an Amazon S3 bucket using the AWS SDK for Java.
Please note that it is not possible to append data to an Amazon S3 object. So you should either upload a new file each time or, if you want all data in one file, re-upload the complete file each time.
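If you want to send the data as a stream, you can use putObject():
public PutObjectResult putObject(String bucketName,
                                 String key,
                                 InputStream input,
                                 ObjectMetadata metadata)
        throws SdkClientException,
               AmazonServiceException
When uploading from a stream, set the content length in the ObjectMetadata; otherwise the SDK has to buffer the whole stream in memory to work it out, which matters once the data reaches GB sizes.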
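The two steps above (write a CSV file locally, then upload it) can be sketched in Python as follows. This is only an illustration under stated assumptions: the answer itself uses the AWS SDK for Java, `export_query_to_s3` is a hypothetical helper name, `conn` is any DB-API connection, and `s3_client` is assumed to be a boto3 S3 client such as `boto3.client("s3")`.

```python
import csv
import os
import tempfile


def export_query_to_s3(conn, query, s3_client, bucket, key, batch_size=1000):
    """Write the rows returned by `query` to a temporary CSV file, then upload it."""
    cursor = conn.cursor()
    cursor.execute(query)
    # newline="" lets the csv module control line endings itself.
    with tempfile.NamedTemporaryFile(
        "w", newline="", suffix=".csv", delete=False
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow([col[0] for col in cursor.description])  # header row
        while True:
            rows = cursor.fetchmany(batch_size)  # fetch in batches, not all at once
            if not rows:
                break
            writer.writerows(rows)
        path = tmp.name
    try:
        # Assumption: s3_client is a boto3 S3 client; its upload_file switches
        # to multipart upload for large files.
        s3_client.upload_file(
            path, bucket, key, ExtraArgs={"ContentType": "text/csv"}
        )
    finally:
        os.remove(path)  # always clean up the temporary file
```

Fetching in batches keeps memory use roughly constant whether the result is a few KB or several GB; only the local disk has to hold the whole file.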

Related

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, then you should store it in a separate path, like s3://your_bucket/tmp and then when you have completed your updates, move it to the final path like s3://your_bucket/complete and only configure the Lambda trigger on the /complete path.
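As a rough sketch of "only configure the Lambda trigger on the /complete path", the notification configuration below restricts the trigger to keys under a prefix. The `complete/` prefix, the `.csv` suffix and both function names are assumptions for illustration, and `s3_client` is assumed to be a boto3 S3 client.

```python
def lambda_trigger_config(function_arn, prefix="complete/", suffix=".csv"):
    """Build a notification configuration that fires only for finished CSV files."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": prefix},
                            {"Name": "suffix", "Value": suffix},
                        ]
                    }
                },
            }
        ]
    }


def configure_trigger(s3_client, bucket, function_arn):
    """Apply the configuration to the bucket (hypothetical helper)."""
    # Note: this call replaces the bucket's existing notification configuration.
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=lambda_trigger_config(function_arn),
    )
```

With this in place, writes under the temporary path do not invoke the function; only an object created under `complete/` does.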

Amazon S3 notification for file change

Initially a CSV file is uploaded to an S3 bucket, and a script often appends to that file when a new row is added. We want the script to run only when the CSV file is modified. Are there any watchers that can notify the script to run when the CSV file changes?
There are S3 event notifications for that; you would be interested in the s3:ObjectCreated events.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You should also take a look at the S3 documentation and note the difference between S3 and a file system. An "update" or "append" operation on S3 actually replaces the whole object.
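If the notification is delivered to a Lambda function, the handler could pick out the changed CSV objects along these lines. This is a minimal sketch: `changed_csv_objects` is a hypothetical helper, and the `print` call stands in for whatever the real script does.

```python
from urllib.parse import unquote_plus


def changed_csv_objects(event):
    """Return (bucket, key) pairs for the .csv objects named in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        # Inside the event record the name has no "s3:" prefix, e.g. "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".csv"):
            pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    for bucket, key in changed_csv_objects(event):
        print(f"CSV changed: s3://{bucket}/{key}")  # placeholder for the real work
```

Because every "append" re-uploads the whole object, each one arrives as a fresh ObjectCreated event for the same key.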

No extension while using 'from_options' in DynamicFrameWriter in AWS Glue spark context

I am new to AWS. I am writing an **AWS Glue job** for some transformation, and that part works. After the transformation I used **'from_options' in the DynamicFrameWriter class** to write the data frame as a CSV file, but the file was copied to S3 without any extension. Also, is there any way to rename the copied file, using DynamicFrameWriter or anything else? Please help.
Step 1: Triggered an AWS Glue job for transforming files in S3 to an RDS instance.
Step 2: On successful job completion, transfer the contents of the file to another S3 bucket using 'from_options' in the DynamicFrameWriter class. But the file doesn't have any extension.
You have to set the format of the file you are writing.
e.g.: format="csv"
This should set the CSV file extension. However, you cannot choose the name of the file that is written. The only option is an S3 operation that changes the key name of the file afterwards, for example copying the object to a new key and deleting the original.
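Changing the key name after the job has finished might look like the sketch below. It is written under several assumptions: `add_csv_extension` is a hypothetical helper, `s3_client` is a boto3 S3 client, and the prefix holds fewer than 1000 objects, so a single `list_objects_v2` call returns all of them.

```python
def add_csv_extension(s3_client, bucket, prefix):
    """Give every object under `prefix` a '.csv' suffix by copying, then deleting."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    renamed = []
    for obj in response.get("Contents", []):
        old_key = obj["Key"]
        if old_key.endswith(".csv") or old_key.endswith("/"):
            continue  # already has the extension, or is a folder placeholder
        new_key = old_key + ".csv"
        # S3 has no rename operation: copy to the new key, then remove the old one.
        s3_client.copy({"Bucket": bucket, "Key": old_key}, bucket, new_key)
        s3_client.delete_object(Bucket=bucket, Key=old_key)
        renamed.append((old_key, new_key))
    return renamed
```

The same loop could map each output file to any other key name instead of just appending an extension.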

Change CSV file In S3 With AWS Lambda

Is there a way to have the DynamoDB rows for each user backed up in S3 as a CSV file?
Then, using streams, when a row is mutated, change that row in the CSV file in S3.
The csv readers that are currently out there are geared towards parsing the csv for use within the lambda.
Whereas I would like to find a specific row, given by the stream and then replace it with another row without having to load the whole file into memory as it may be quite big. The reason I would like a backup on s3, is because in the future I will need to do batch processing on it and reading 300k files from dynamo within a short period of time, is not preferable.
Read the data from S3, parse as csv using your favorite library and update, then write back to S3:
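import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
with io.BytesIO() as data:
    bucket.download_fileobj('my_key', data)
    data.seek(0)  # rewind before reading what was just downloaded
    # parse csv data and update as necessary
    # then write back to s3
    data.seek(0)  # rewind again so the upload starts at the first byte
    bucket.upload_fileobj(data, 'my_key')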
Note that S3 does not support appending to or updating an object in place, if that is what you were hoping for. You can only read and overwrite. Take this into account when designing your system.
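The whole object still has to be downloaded and re-uploaded, but it does not have to sit in memory all at once. The sketch below streams rows from one file object to another and swaps in the changed row on the way. `replace_row` is a hypothetical helper, and `src` and `dst` are assumed to be text-mode file objects, for example temporary files on disk filled by a download.

```python
import csv


def replace_row(src, dst, key_column, key_value, new_row):
    """Copy CSV rows from `src` to `dst`, swapping in `new_row` where the key matches.

    Only one row is held in memory at a time. Returns the number of rows replaced.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    replaced = 0
    for row in reader:
        if row[key_column] == key_value:
            writer.writerow(new_row)  # new_row is a dict with the same columns
            replaced += 1
        else:
            writer.writerow(row)
    return replaced
```

Each change still costs one full read and one full write of the file, so batching several stream records into a single pass may be worth considering.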

AWS S3: .csv file is downloaded as .txt

I have 2 AWS accounts, each with one S3 bucket. I uploaded two same-size .CSV files, one to each S3 bucket.
When I use Download or Download As in the first account, the file is downloaded as a .CSV file. But when I try to download the file from the second account, it is downloaded as .TXT.
How can this happen? Both files were created in the same way: through a Redshift UNLOAD query that copies the selected data from Redshift to S3.
UPDATE:
Could it be because, in this account, Server side encryption for this document is set to AWS-KMS?
I noticed that the file that is converted from .csv to .txt has "Server side encryption: AWS-KMS", while the file that is downloaded as .csv has "Server side encryption: NONE".
UPDATE: tried in different browsers, same result.
Check the headers for each object in the AWS S3 console and compare the Content-Type values. Content-Type provides a hint to web browsers on what data the object contains.
If Content-Type does not exist or does not contain text/csv, add or modify the header in the S3 console or via your favorite S3 application such as CloudBerry.
John is right about the Content-Type not being text/csv. Sometimes S3 will get it right and sometimes it won't. If you can't correct this manually, you can run a Lambda function to do it for you every time you upload a new object. You can use a Python 2.7 template Lambda function to download the object from the bucket, use the mimetypes library's guess_type to guess the type of your S3 object, and then re-upload the file to the same bucket. You will need to trigger this function on S3 object upload and give it the necessary permissions (S3:GetObject to download, and S3:PutObject to re-upload).
P.S. This will work for files with any extension. If you know you are only going to upload .csv files, you can skip mimetypes and directly re-upload the object with
bucket.upload_file(filename, key, ExtraArgs={'ContentType': 'text/csv'})
If mimetypes cannot guess the type, then you might need to add the types; look at an example here: https://www.programcreek.com/python/example/5209/mimetypes.add_type
Good Luck!
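A possible shape for such a function is sketched below. It deliberately swaps the download and re-upload for an in-place `copy_object` with `MetadataDirective="REPLACE"`, which is a different technique from the one described above. `content_type_for` and `fix_content_type` are hypothetical helper names, and `s3_client` is assumed to be a boto3 S3 client.

```python
import mimetypes

# Register .csv explicitly in case the platform's type map lacks it.
mimetypes.add_type("text/csv", ".csv")


def content_type_for(key, default="binary/octet-stream"):
    """Guess a Content-Type from the key's extension, falling back to `default`."""
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or default


def fix_content_type(s3_client, bucket, key):
    """Rewrite the object's Content-Type if it differs from the guessed one."""
    wanted = content_type_for(key)
    current = s3_client.head_object(Bucket=bucket, Key=key).get("ContentType")
    if current == wanted:
        return False  # nothing to do; also keeps an upload-triggered Lambda from looping
    # Copying an object onto itself with REPLACE rewrites its metadata.
    # Caveats: REPLACE drops any existing user metadata, and copy_object
    # only handles objects up to 5 GB.
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        ContentType=wanted,
        MetadataDirective="REPLACE",
    )
    return True
```

The early return matters when the function is triggered by uploads to the same bucket, since the copy itself creates a new object event.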
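Here is a Scala solution (to specify the content type):
import java.io.{ByteArrayInputStream, InputStream}
import com.amazonaws.services.s3.model.ObjectMetadata

// s3Client, bucketName and prefix (the object key) are assumed to be defined elsewhere
val settingsLine: String = "csvdata1,csvdata2,csvdata3"
val bytes: Array[Byte] = settingsLine.getBytes("UTF-8")
val settingsStream: InputStream = new ByteArrayInputStream(bytes)
val metadata: ObjectMetadata = new ObjectMetadata()
metadata.setContentType("text/csv")
metadata.setContentLength(bytes.length) // lets the SDK stream without buffering
s3Client.putObject(bucketName, prefix, settingsStream, metadata)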