How to suppress column headers in AWS Athena query result?

I'm running a SELECT Athena query on an S3 bucket manifest. I then want to use the results of that query, in .csv format, in an S3 Batch operation.
My query runs fine and I am able to access the .csv output via S3 Batch, but since the first row is actually column headers, S3 Batch throws an unrecoverable error because it thinks that the manifest is now referring to multiple buckets.
How can I easily strip the column headers out of my results? I would prefer to just do it in SQL. The file size makes using standard unix tools prohibitive. I could use AWS Glue, but this seems like overkill for just suppressing headers in a SQL query.

Here's a hacky way to get around it:
SELECT bucket AS "my-bucket-name", key AS "fakekey"
FROM your_athena_table
This will make your header row look like the rest of the file, which will not break the S3 Batch copy job. You will end up with just one failed record, for fakekey.

Related

Create Athena table using s3 source data

Below is the S3 path where I store the files produced at the end of a process. The path is dynamic, that is, the values of the following fields will vary: partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under the output/ and intermediate_results/ directories, for each val1-val2 combination.
Each file is a CSV.
But I am not very familiar with AWS Athena, so I'm unable to figure out how to implement this. I would really appreciate any help. Thanks!
Use CREATE TABLE (see the Amazon Athena documentation). You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena automatically uses all files under that path, including subdirectories. This means that a table created with a LOCATION of output/ will also pick up everything under intermediate_results/. Therefore, your current data layout is not compatible with your desired use of Amazon Athena; you would need to put the data for each table under its own path.
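For illustration, here is a rough boto3 sketch of registering one such table; the database, table and column names, the reorganized per-table prefix, and the CSV SerDe settings are all assumptions you would adapt:
import boto3

athena = boto3.client("athena")

# DDL for one table; its LOCATION must contain only this table's CSV files
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.partner1_customer1_output (
    col_a string,
    col_b string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://bucket/tables/partner1/customer1/output/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)
You would run one such statement per output location, or script the loop over your val1/val2 combinations.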

AWS Glue - table version increases on data load even with no schema changes

I have a lambda job which infrequently dumps a parquet file into an S3 bucket/Glue table using AWS Wrangler.
This Glue table appears to be increasing the table version number every time there is new data, even though the schema is unchanged.
I do not think the problem is with the lambda job/wrangler, since it deposits the parquet files as expected. I have also tested that code separately and it works as expected.
Something is going on with the Glue data catalogue table that makes it increase versions despite no changes to the schema.
I have checked for differences in the underlying parquet files to see if there are some schema, data type etc changes between updates, and there are none.
I have checked for differences between the Glue table versions via the console and AWS CLI (aws glue get-table-versions) and found no differences there either (only the UpdateTime and VersionId changes).
I have tried to recreate my setup with the same code and do not find this issue. I have tried to delete and recreate the Glue table in the same place, but the issue reoccurs.
Question: What could be causing my Glue table version numbers to increase when there are no schema changes?
Note:
The code in question looks like this. It's part of a bigger function (this is really just generating logs of what the main lambda function is doing). It works fine on its own and doesn't use variables etc from the rest of the code. I don't see how this could be the issue but including it here anyway.
# Other functions do some things when triggered by a new file in another S3 bucket.
# This function just logs which files were processed. It's the Glue table built from these
# log files that has its version number increasing every time a new log file is added.
import awswrangler as wr

def log(resource, filename):
    log_df = build_log(resource, filename)  # builds the log df: columns of date, time, file used, etc.
    wr.s3.to_parquet(
        df=log_df,
        path=log_path(),  # S3 path where the parquet logs are written
        dataset=True,
        catalog_versioning=False,
        database="MYDB",
        partition_cols=['date'],
        table='log',
        mode='append'
    )
This is, I think, due to partitioning. You are partitioning on date, so I would guess a new partition gets added for every new day (or other time unit), and those new partitions are the reason the table version is being incremented.
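If you want to double-check that the schema really is identical across versions, a small boto3 sketch like this (reusing the MYDB/log names from the question; pagination and error handling omitted) will print the column list stored with each version, so you can confirm that only UpdateTime and VersionId differ:
import boto3

glue = boto3.client("glue")

response = glue.get_table_versions(DatabaseName="MYDB", TableName="log")
for version in response["TableVersions"]:
    table = version["Table"]
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(version["VersionId"], table["UpdateTime"], columns)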

Analyze binary NetCDF files with AWS Quicksight / Athena

I have a task to analyze weather forecast data in Quicksight. The forecast data is held in NetCDF binary files in a public S3 bucket. The question is: how do you expose the contents of these binary files to Quicksight or even Athena?
There are Python libraries, such as Iris, that will decode the data from these binary files. They are used like this:
import iris
filename = iris.sample_data_path('forecast_20200304.nc')
cubes = iris.load(filename)
print(cubes)
So what would be the AWS workflow and services necessary to create a data ingestion pipeline that would:
Respond to an SQS message that a new binary file is available
Access the new binary file and decode it to access the forecast data
Add the decoded data to the set of already decoded data from previous SQS notifications
Make all the decoded data available in Athena / Quicksight
Tricky one, this...
What I would do is probably something like this:
Write a Lambda function in Python that is triggered when new files appear in the S3 bucket – either by S3 notifications (if you control the bucket), via SNS or SQS, or on a schedule in EventBridge. The function uses the code snippet included in your question to transform each new file and upload the transformed data to another S3 bucket.
I don't know the size of these files and how often they are published, so whether to convert to CSV, JSON, or Parquet is something you have to decide – if the data is small CSV will probably be easiest and will be good enough.
With the converted data in a new S3 bucket all you need to do is create an Athena table for the data set and start using QuickSight.
If you end up with a lot of small files you might want to implement a second step where you once per day combine the converted files into bigger files, and possibly Parquet, but don't do anything like that unless you have to.
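To make the first step concrete, here is a rough sketch of such a Lambda, not a drop-in solution: it assumes a direct S3 trigger (an SQS-delivered event would need an extra layer of unwrapping), a made-up output bucket name, and that iris.pandas.as_data_frame can flatten your particular cubes; you would adapt the conversion to your actual NetCDF variables.
import boto3
import iris
import iris.pandas

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-decoded-forecasts"  # assumed bucket for the converted data

def handler(event, context):
    # a direct S3 notification carries the bucket and key of each new file
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # download the NetCDF file to Lambda's temp storage and decode it
        local_path = "/tmp/" + key.rsplit("/", 1)[-1]
        s3.download_file(bucket, key, local_path)
        cubes = iris.load(local_path)

        # flatten each cube to a table and write it as CSV alongside earlier conversions
        for i, cube in enumerate(cubes):
            df = iris.pandas.as_data_frame(cube)
            df.to_csv("/tmp/converted.csv", index=True)
            out_key = f"converted/{key.rsplit('/', 1)[-1]}.{i}.csv"
            s3.upload_file("/tmp/converted.csv", OUTPUT_BUCKET, out_key)
Packaging iris and its compiled dependencies into a Lambda layer or container image is likely to be the fiddliest part of this approach.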
An alternative way would be to use Athena Federated Query: by implementing Lambda function(s) that respond to specific calls from Athena you can make Athena read any data source that you want. It's currently in preview, and as far as I know all the example code is written in Java – but theoretically it would be possible to write the Lambda functions in Python.
I'm not sure whether it would be less work than implementing an ETL workflow like the one you suggest, but yours is one of the use cases that Athena Federated Query was designed for, and it might be worth looking into. If NetCDF files are common and a data source for such files would be useful to other people, I'm sure the Athena team would love to talk to you and help you out.

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and don't need a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result to another S3 bucket, a relational DB, or the Glue Catalog.
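A minimal sketch of such a job, assuming a PySpark Glue job and using create_dynamic_frame.from_options as the Python-side equivalent of getSourceWithFormat() (the renamed column and the output path are made up for illustration):
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# read every sales.json under every store/date prefix into one DynamicFrame
sales = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# small "normalization" transformations go here, e.g. renaming a column (assumed name)
sales = sales.rename_field("storeId", "store_id")

# write the combined data somewhere a single table can point at
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket-processed/sales/"},
    format="parquet",
)

job.commit()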
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Then you can use the Relationalize transform to flatten out your JSON structure (see the AWS Glue documentation for details).
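A hedged sketch of that Relationalize step (the database and table names the crawler would create, and the staging path, are assumptions here):
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# read the table the crawler registered in the Glue Data Catalog
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="my_bucket"
)

# Relationalize flattens nested JSON into a collection of flat, relational frames
flattened = Relationalize.apply(
    frame=sales,
    staging_path="s3://my-bucket-temp/relationalize/",
    name="root",
)
root = flattened.select("root")  # the top-level, flattened records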
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the format AWS actually expects, so this is less of an automatic win for CSV. I would say that if your data is already in JSON, it's probably smarter to convert it to JSONL than to CSV.
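If your files currently hold JSON arrays, that conversion is small; a toy illustration (file names made up) of reshaping to one object per line:
import json

with open("sales.json") as f:
    records = json.load(f)  # e.g. a JSON array of sale objects

with open("sales.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")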

Athena Write Performance to AWS S3

I'm executing a query in AWS Athena and writing the results to s3. It seems like it's taking a long time (way too long in fact) for the file to be available when I execute the query from a lambda script.
I'm scanning 70MB of data, and the file returned is 12MB. I execute this from a lambda script like so:
athena_client = boto3.client('athena')
athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration={
        'OutputLocation': 'location_on_s3',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'},
    }
)
If I run the query directly in Athena it takes 2.97 seconds. However, the file only becomes available after about 2 minutes when I run the query from the lambda script.
Does anyone know the write performance of AWS Athena to AWS S3? I would like to know if this is normal. The docs don't state how quickly the write occurs.
Every query in Athena writes to S3.
If you check the History tab on the Athena page in the console you'll see a history of all queries you've run (not just through the console, but generally). Each of those has a link to a download path.
If you click the Settings button a dialog will open asking you to specify an output location. Check that location and you'll find all your query results there.
Why is this taking so much longer from your Lambda script? I'm guessing, but the only suggestion I have is that you're querying across regions: if your data is in one region and your result location is in another, you might experience slowness due to cross-region transfer. Even so, 12MB should be fast.
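One more thing worth knowing, separate from the region question: start_query_execution only starts the query and returns immediately, so the result file appears in S3 whenever the query itself finishes. If the Lambda needs to know when that is, a common pattern (sketched below with a minimal poll loop, no error handling) is to check get_query_execution until the state is terminal:
import time
import boto3

athena_client = boto3.client('athena')

def wait_for_query(query_execution_id):
    # poll Athena until the query reaches a terminal state
    while True:
        state = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)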