Analyze binary NetCDF files with AWS Quicksight / Athena

I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are python libraries that will decode the data from the binary files, such as Iris. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...

What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 event notifications (if you control the bucket), by SNS or SQS, or on a schedule via EventBridge. The function uses the code snippet included in your question to transform each new file and uploads the transformed data to another S3 bucket.
I don't know the size of these files or how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small, CSV will probably be easiest and will be good enough.
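As a rough sketch, that Lambda could look like the snippet below. It assumes an S3 event notification trigger (an SQS trigger would wrap the same payload inside each message body), uses xarray rather than Iris purely for illustration, and the output bucket name and prefix are placeholders:
import os
import boto3
import xarray as xr  # xarray reads NetCDF much like Iris; either library would work here

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-decoded-forecasts"  # placeholder output bucket

def handler(event, context):
    # Assumes an S3 event notification; an SQS-wrapped event would need
    # json.loads(record["body"]) first to get at the same structure.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        local_nc = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_nc)

        # Decode the binary NetCDF file and flatten it into a table
        ds = xr.open_dataset(local_nc)
        df = ds.to_dataframe().reset_index()

        local_csv = local_nc.replace(".nc", ".csv")
        df.to_csv(local_csv, index=False)
        s3.upload_file(local_csv, OUTPUT_BUCKET, "forecasts/" + os.path.basename(local_csv))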
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
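Creating the table is a one-off step you can do in the Athena console, but it can also be scripted. A minimal sketch, assuming CSV output, with made-up column names and placeholder bucket names:
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS forecasts (
  valid_time  string,
  latitude    double,
  longitude   double,
  temperature double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-decoded-forecasts/forecasts/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)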
If you end up with a lot of small files you might want to add a second step that combines the converted files into bigger files once per day, possibly as Parquet, but don't do anything like that unless you have to.
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases Athena Federated Query was designed for, so it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.

Related

Update data in csv table which is stored in AWS S3 bucket

I need a solution for adding new data to a CSV file that is stored in an S3 bucket in AWS.
At this point we are downloading the file, editing it, and then uploading it again to S3, and we would like to automate this process.
We need to add one row to a three-column CSV.
Thank you in advance!
I think you will be able to do that using Lambda functions. You will need to make the modifications to the CSV programmatically, and there are multiple programming languages that allow you to do that. One quick example is using Python and the csv library.
You can then invoke that Lambda directly, or put an AWS API Gateway in front of it if you want to add more logic around the operations.
You can access the CSV file (object) inside the S3 bucket from the Lambda code using the AWS SDK and append the new rows with data you pass as parameters to the function.
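A minimal sketch of that Lambda, assuming the bucket, key, and row values are passed in by the caller (the names here are placeholders; the csv module would be worth using instead of a plain join if values can contain commas):
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholder event shape: the caller supplies the object location and the new row
    bucket = event["bucket"]
    key = event["key"]
    new_row = event["row"]  # e.g. ["value1", "value2", "value3"]

    # There is no in-place edit in S3: read the whole object, modify it, write it back
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    updated = body.rstrip("\n") + "\n" + ",".join(new_row) + "\n"
    s3.put_object(Bucket=bucket, Key=key, Body=updated.encode("utf-8"))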
There is no way to directly modify the csv stored in S3 (if that is what you're asking). The process will always entail some version of download, modify, upload. There are many examples of how you can do this, for example here

Export DynamoDB table to S3 automatically

The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to a S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to only have one S3 file that gets appended with the most recent data of the day, because an extra file every day seems kinda pricey. If that's not possible, an extra file every day would also be fine.
What's the best way to go about doing so without using CLI (because I'm not allowed to install programs to my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at AWS Data Pipeline. It covers exactly this use case, and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.
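For reference, the export can be kicked off from any environment with the AWS SDK, so no CLI install or Lambda is strictly required. A minimal sketch with boto3 (the ARN and bucket are placeholders; note that point-in-time recovery must be enabled on the table):
import boto3

dynamodb = boto3.client("dynamodb")

# Point-in-time recovery must be enabled on the table for this export API to work
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/my-table",  # placeholder ARN
    S3Bucket="my-export-bucket",                                        # placeholder bucket
    S3Prefix="exports/",
    ExportFormat="DYNAMODB_JSON",                                       # or "ION"
)
print(response["ExportDescription"]["ExportStatus"])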

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case it's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result to another S3 bucket, a relational DB, or the Glue Catalog.
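getSourceWithFormat() is the Scala/Java GlueContext API; in a Python Glue job the equivalent is create_dynamic_frame.from_options. A rough sketch with placeholder bucket names:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read every sales.json under the bucket, recursing through the store/date prefixes
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# ...apply any normalization transformations here...

# Write one consolidated dataset out as Parquet (placeholder output bucket)
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales/"},
    format="parquet",
)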
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler).
Read more about it here
Then you can use Relationalize to flatten out your JSON structure; read more about that here
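As a rough sketch of what that looks like in a Python Glue job (the database, table, and staging path are placeholders, and it assumes the crawler has already created the catalog table):
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler created (placeholder database/table names)
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="sales"
)

# Relationalize flattens nested JSON into a collection of flat frames
flattened = Relationalize.apply(
    frame=sales, staging_path="s3://my-temp-bucket/relationalize/", name="root"
)
root = flattened.select("root")  # the top-level flattened table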
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say if your data is already in JSON format, it's probably smarter to convert it to JSONL than to convert to CSV.

AWS S3 storage and schema

I have an IOT sensor which sends the following message to IoT MQTT Core topic:
{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}
I have added an act/rule which stores the incoming message in an S3 bucket with the timestamp as the key (each message is stored as a separate file/row in the bucket).
I have only worked with SQL databases before, so having them stored like this is new to me.
1) Is this the proper way to work with S3 storage?
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
"Amazon ML can't retrieve the schema. If you've just created this
datasource, wait a moment and try again."
Appreciate all advice there is!
1) Is this the proper way to work with S3 storage?
With only one sensor, using the timestamp() function (https://docs.aws.amazon.com/iot/latest/developerguide/iot-sql-functions.html#iot-function-timestamp) in your IoT rule would be a way to name unique objects in S3, but there are issues that might come up.
With more than one sensor, you might have multiple messages arrive at the same timestamp and this would not generate unique object names in S3.
Timestamps from nearly the same time are going to have similar prefixes and designing your S3 keys this way may not give you the best performance at higher message rates.
Since you're using MQTT, you could use the traceId function instead of the timestamp to avoid these two issues if they come up.
2) How can I visualize the values in a schema instead of separate files?
3) I am trying to create ML Datasource from the S3 Bucket, but get the error below when Amazon ML tries to create schema:
For the third question, I think you could be running into a data format problem in ML because your S3 objects contain the JSON data from your messages and not a CSV.
For the second question, I think you're trying to combine message data from successive messages into a CSV, or at least output the message data as a single line of a CSV file. I don't think this is possible with just the IoT SQL language since it's intended to produce JSON.
One alternative is to configure your IoT rule with a Lambda action and use a Lambda function to do the JSON-to-CSV conversion and then write the CSV to your S3 bucket. If you go this direction, you may have to enrich your IoT message data with the timestamp (or traceid) in the rule before it invokes the Lambda.
A rule like select timestamp() as timestamp, traceid() as traceid, concat(ID1, ID2, ID3, ValueMax) as values, * as message would produce a JSON like
{"timestamp":1538606018066,"traceid":"abab6381-c369-4a08-931d-c08267d12947","values":[10001,1001,101,123],"message":{"ID1":10001,"ID2":1001,"ID3":101,"ValueMax":123}}
That would be straightforward to use as the source for a CSV row with the data from its values property.
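A minimal sketch of that Lambda, assuming the enriched rule output shown above is passed in as the event and the bucket name is a placeholder:
import boto3

s3 = boto3.client("s3")
BUCKET = "my-iot-csv-bucket"  # placeholder bucket name

def handler(event, context):
    # 'event' is assumed to be the enriched rule output shown above:
    # {"timestamp": ..., "traceid": "...", "values": [...], "message": {...}}
    msg = event["message"]
    row = [event["timestamp"], msg["ID1"], msg["ID2"], msg["ID3"], msg["ValueMax"]]
    line = ",".join(str(v) for v in row) + "\n"

    # One CSV row per message; the traceid keeps the object key unique
    key = "readings/{}.csv".format(event["traceid"])
    s3.put_object(Bucket=BUCKET, Key=key, Body=line.encode("utf-8"))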

How can we efficiently push data from csv file to dynamodb without using aws pipeline?

Considering the fact that Data Pipeline is not available in the Singapore region, are there any alternatives to efficiently push CSV data to DynamoDB?
If it were me, I would set up an S3 event notification on a bucket that fires a Lambda function each time a CSV file is dropped into it.
The notification would let Lambda know that a new file is available, and a Lambda function would be responsible for loading the data into DynamoDB.
This works best (because of Lambda's limits) if the CSV files are not huge, so they can be processed in a reasonable amount of time. The bonus is that once it's working, the only work needed is to drop new files into the right bucket - no server required.
Here is a GitHub repository that has a CSV->DynamoDB loader written in Java - it might help get you started.
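For comparison, a minimal Python sketch of such a loader Lambda (the table name is a placeholder, the CSV header row is assumed to match the table's attribute names, and the file is assumed to fit within Lambda's limits):
import csv
import io
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

def handler(event, context):
    # Triggered by the S3 event notification when a new CSV lands in the bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Each CSV row becomes one item; all values are written as strings here
        reader = csv.DictReader(io.StringIO(body))
        with table.batch_writer() as batch:  # handles batching around the 25-item write limit
            for row in reader:
                batch.put_item(Item=row)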