Change CSV file In S3 With AWS Lambda

Is there a way to have the DynamoDB rows for each user backed up in S3 as a CSV file?
Then, using DynamoDB Streams, when a row is mutated, change that row in the CSV file in S3.
The CSV readers that are currently out there are geared towards parsing the CSV for use within the Lambda.
Whereas I would like to find a specific row, given by the stream, and replace it with another row without having to load the whole file into memory, as it may be quite big. The reason I would like a backup in S3 is that in the future I will need to do batch processing on it, and reading 300k rows from DynamoDB within a short period of time is not preferable.

Read the data from S3, parse as csv using your favorite library and update, then write back to S3:
import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

with io.BytesIO() as data:
    bucket.download_fileobj('my_key', data)
    # parse csv data and update as necessary
    # then rewind the buffer and write back to s3
    data.seek(0)
    bucket.upload_fileobj(data, 'my_key')
Note that S3 does not support object append or in-place update, if that was what you were hoping for (see here). You can only read and overwrite the whole object. You might take this into account when designing your system.
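To make the "update as necessary" step concrete: if each CSV row starts with the user's key, the update is just a row replacement. A minimal sketch, assuming the first CSV column holds the key from the DynamoDB stream record and that new_row is built from the stream event (both placeholders):
import csv
import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')

def replace_row(key, user_id, new_row):
    # S3 has no in-place update, so download the whole object...
    body = bucket.Object(key).get()['Body'].read().decode('utf-8')
    rows = list(csv.reader(io.StringIO(body)))
    # ...swap out the row that matches the mutated item...
    updated = [new_row if row and row[0] == user_id else row for row in rows]
    # ...and overwrite the object with the rewritten CSV.
    out = io.StringIO()
    csv.writer(out).writerows(updated)
    bucket.Object(key).put(Body=out.getvalue().encode('utf-8'))
Note that this still pulls the whole object into memory; S3's read-and-overwrite model makes that hard to avoid.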

Related

Analyze binary NetCDF files with AWS Quicksight / Athena

I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), by SNS or SQS, or by a schedule in EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket.
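A rough sketch of such a handler, assuming a direct S3 trigger and a hypothetical convert_netcdf_to_csv() helper that wraps the iris code from the question; the output bucket name and the choice of CSV as the output format are placeholders:
import os
import urllib.parse
import boto3

s3 = boto3.client('s3')
OUTPUT_BUCKET = 'my-converted-data'  # placeholder

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Download the raw NetCDF file to Lambda's temporary storage.
        local_nc = os.path.join('/tmp', os.path.basename(key))
        s3.download_file(bucket, key, local_nc)

        # Decode and convert (e.g. with iris, as in the question).
        local_csv = local_nc.replace('.nc', '.csv')
        convert_netcdf_to_csv(local_nc, local_csv)  # hypothetical helper

        # Upload the converted data for Athena/QuickSight to pick up.
        s3.upload_file(local_csv, OUTPUT_BUCKET, os.path.basename(local_csv))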
I don't know the size of these files and how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small CSV will probably be easiest and will be good enough.
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
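The table can be created in the Athena console, or from code; a sketch of the latter with boto3, where the table name, columns, and bucket locations are placeholders that depend on what your converted data actually looks like:
import boto3

athena = boto3.client('athena')

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS forecasts (
    valid_time  string,
    latitude    double,
    longitude   double,
    temperature double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-converted-data/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'},
)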
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases that Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.

Are parquet files splittable when stored in AWS S3?

I know that Parquet files are splittable if they are stored in block storage, e.g. stored on HDFS.
Are they also splittable when stored in object storage such as AWS S3?
This confuses me because object storage is supposed to be atomic: you either access the entire file or none of the file. You can't even change metadata on an S3 file without rewriting the entire file. On the other hand, AWS recommends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the Hadoop ecosystem.
Yes, Parquet files are splittable.
S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
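To make that concrete, a small sketch of a range request with boto3 (bucket and key are placeholders); a Parquet reader uses exactly this mechanism to fetch the footer and then only the column chunks it needs:
import boto3

s3 = boto3.client('s3')

resp = s3.get_object(
    Bucket='mybucket',
    Key='data/part-00000.snappy.parquet',
    Range='bytes=0-1023',  # positioned read: only this byte range is returned
)
chunk = resp['Body'].read()
print(len(chunk))  # at most 1024 bytes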
I'm not 100% sure what you mean here, but generally (I think), you partition Parquet data on partition keys and save columns into blocks of rows. When I have used it in AWS S3 it has been saved like:
|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet
This handles the splitting efficiencies you mention.
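For what it's worth, a sketch of how such a layout gets produced, assuming pandas with pyarrow (and s3fs installed for the s3:// path); the DataFrame and partition column are made up for illustration:
import pandas as pd

df = pd.DataFrame({
    'store': ['store-1', 'store-1', 'store-2'],
    'sales': [10, 20, 30],
})

# Produces one sub-folder per partition value (store=store-1/, store=store-2/)
# with one or more Parquet files of row groups inside each.
df.to_parquet('s3://mybucket/sales/', partition_cols=['store'])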

Best way to write db results as csv file to aws s3 bucket

Need suggestions for best way to write database results as csv file to aws s3 bucket.
Note: the CSV data may grow from KB to GB in size.
The best way would be:
Write your data to a CSV file on your local computer (or wherever your app is running)
Upload the file to an Amazon S3 bucket using the AWS SDK for Java
Please note that it is not possible to append data to an Amazon S3 object. So, you should either upload a new file each time or, if you want all the data in one file, re-upload the complete file each time.
If you want to send the data as a stream, you can use putObject():
public PutObjectResult putObject(String bucketName,
                                 String key,
                                 InputStream input,
                                 ObjectMetadata metadata)
        throws SdkClientException,
               AmazonServiceException
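For comparison, the same streaming idea in Python with boto3 – a minimal sketch, assuming the rows come from a database cursor and that the header, bucket, and key are placeholders:
import csv
import io
import boto3

def upload_rows_as_csv(rows, bucket, key):
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['id', 'name', 'amount'])  # placeholder header
    writer.writerows(rows)

    # upload_fileobj streams the file-like object to S3 (multipart under the
    # hood), so nothing has to be written to local disk first. For very large
    # results, write to a temporary file and use upload_file instead.
    boto3.client('s3').upload_fileobj(
        io.BytesIO(buffer.getvalue().encode('utf-8')), bucket, key
    )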

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
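A sketch of such a job in PySpark, assuming Glue's Python API (where getSourceWithFormat() corresponds roughly to create_dynamic_frame.from_options); bucket names are placeholders:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every sales.json under the root, recursing into store-*/date/ folders.
sales = glue_context.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': ['s3://my-bucket/'],
        'recurse': True,
    },
    format='json',
)

# ...apply any required transformations here (ApplyMapping, Relationalize, ...)

# Write the result out, e.g. as Parquet to another bucket.
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type='s3',
    connection_options={'path': 's3://my-output-bucket/'},
    format='parquet',
)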
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use relationalize to flatten out your JSON structure; read more about that here
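A sketch of what that relationalize step might look like inside a Glue job, assuming the crawler created a table called sales in a database called my_database (both placeholders):
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

sales = glue_context.create_dynamic_frame.from_catalog(
    database='my_database',
    table_name='sales',
)

# Relationalize returns a collection of frames: the flattened root table plus
# one frame per nested array, linked back to the root via generated keys.
flattened = Relationalize.apply(
    frame=sales,
    staging_path='s3://my-temp-bucket/relationalize/',
    name='root',
)
print(flattened.keys())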
Json and AWS Glue may not be the best match. Since AWS Glue is based on hadoop, it inherits hadoop's "one-row-per-newline" restriction, so even if your data is in json, it has to be formatted with one json object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use csv instead of json.
Edit 2022-11-29: There does appear to be some tooling now for jsonl, which is the actual format that AWS expects, making this less of an automatic win for csv. I would say if your data is already in json format, it's probably smarter to convert it to jsonl than to convert to csv.
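If you do end up doing that pre-processing yourself, it's a one-liner per record; a sketch with placeholder file names that turns a JSON array file into JSON Lines:
import json

with open('sales.json') as f:
    records = json.load(f)          # e.g. a top-level JSON array

with open('sales.jsonl', 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')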

download, process, upload large number of s3 files with spark

I have a large number of files (~500k HDF5 files) inside an S3 bucket which I need to process and re-upload to another S3 bucket.
I am pretty new to such tasks, so I am not quite sure if my approach is correct here. I do the following:
I use boto to get the list of keys inside the bucket and parallelize it with spark:
s3keys = bucket.list()
data = sc.parallelize(s3keys)
data = data.map(lambda x: download_process_upload(x))
result = data.collect()
where download_process_upload is a function which downloads the file specified by the key, does some processing on it and re-uploads it to another bucket (returning 1 if everything was successful, and 0 if there was an error)
So in the end I could do
success_rate = sum(result) / float(len(s3keys))
I have read that spark map statements should be stateless, while my custom map function definitely is not stateless. It downloads the file to disk and then loads it into memory etc.
So is this the proper way to do such a task?
I've successfully used your methodology to download and process data from S3. I have not tried to upload the data from within a map statement, but I see no reason why you wouldn't be able to read the file from S3, process it, and then upload it to a new location.
Also, you can save a few keystrokes and take the explicit lambda out of the map statement like this: data = data.map(download_process_upload)
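For completeness, a sketch of what download_process_upload might look like, assuming the keys are passed as plain strings, that boto3 is available on the executors, and that process_file() stands in for your own processing; bucket names are placeholders:
import os
import boto3

def download_process_upload(key_name):
    try:
        # Create the client inside the function so the closure shipped to the
        # executors stays picklable.
        s3 = boto3.client('s3')
        local_path = os.path.join('/tmp', os.path.basename(key_name))

        s3.download_file('source-bucket', key_name, local_path)
        processed_path = process_file(local_path)  # your own processing step
        s3.upload_file(processed_path, 'destination-bucket', key_name)
        return 1
    except Exception:
        return 0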