Generating monthly data from daily CSV files using Apache Spark and AWS - amazon-web-services

I have CSV files with identical columns and a million matching IDs for every day of 2018. Each has 5 columns, excluding the ID.
I want to concatenate the files by month so that each monthly file has the 5 columns * the number of days so January would have 155 named Day1-Col1, Day1-Col2...Day 31-Col5 for example.
Is this something I can do with Apache Spark?
My choice of Spark is because I want to place the data into an AWS Athena dataset and it seems that AWS Glue can do this with Spark SQL queries.
I imagine we'd convert the CSVs to parquet files first and then produce a monthly dataset with this to later be visualised with AWS Quicksight.

Spark separates out the I/O from the processing a bit. So, I'd do the same here in trying to solve this.
First, I'd load your csv files using AWS Glue Catalog OR Spark's native wholeTextFiles method.
From there, you can use either AWS Glue's DynamicFrame methods, Spark SQL's DataFrame methods or you can use Spark's RDD functions for data processing. In this case, the bulk of your processing looks to be grouping your data by month based on day of year. Using RDD you can use the groupBy method with a custom function that returns the month index based on day of year. Similarly, Spark SQL's Dataframe has a groupBy method as well. Another alternative here would be to iterate through months in a loop and filter the records based on day of year to the month. In some ways the for loop is cleaner and in others it is dirtier. Finally, a 3rd way to do this would be to add a month field to each record in a map. This would allow you to partition your data by month and you'll probably want year as well.
Finally, to write each month back out depends on how you solved the grouping of data. You can use the AWS Glue Catalog to write the files out if you looped or added a month field for partitioning. If you did a groupBy then you'll need to count the rows, repartition to the number of rows and then use Spark to write the files.

Related

how to update the Athena tables and show previous values as well has the current value for a column in aws quicksight?

I am able to create tables in Athena using the glue crawler (files coming from s3 as csv) and every time that csv file's values get updated, the crawler runs again and the table's values are shown as updated.
I display this result in aws quicksight, that this x item has x price. But now I want to show the previous prices of the item and the current price in quicksight. for example something like :
October $15
November $20
December $30
how can I achieve this without creating new tables every time the CSV changes, because otherwise there would be so many tables created at one point?

How does Amazon Athena manage rename of columns?

everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
You can set a granularity based on 'On Demand' or 'Time Based' for the AWS Glue crawler, so every time your data on the S3 updates a new schema will be generated (you can edit the schema on the data types for the attributes). This way your columns will stay updated and you can query on the new field.
Since AWS Athena reads data in CSV and TSV in the "order of the columns" in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

How Amazon Athena selecting new files/records from S3

I'm adding files on Amazon S3 from time to time, and I'm using Amazon Athena to perform a query on these data and save it in another S3 bucket as CSV format (aggregated data), I'm trying to find way for Athena to select only new data (which not queried before by Athena), in order to optimize the cost and avoid data duplication.
I have tried to update the records after been selected by Athena, but update query not supported in Athena.
Is any idea to solve this ?
Athena does not keep track of files on S3, it only figures out what files to read when you run a query.
When planning a query Athena will look at the table metadata for the table location, list that location, and finally read all files that it finds during query execution. If the table is partitioned it will list the locations of all partitions that matches the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.

BigQuery Table multiple parition a day

I am trying to load some Avro format data to BigQuery through the api and I need some partitioning. According to the documentation here
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TimePartitioning
It will create only one partition a day with the ingestion partition that use the _PARTITIONTIME column. Is it possible to create multiple partition a day by using timestamp field?
Another option I can think about was the ranged partition documented here
https://cloud.google.com/bigquery/docs/reference/rest/v2/JobConfiguration#RangePartitioning
however, it was marked as experimental. Not sure it is good for production use?

AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler update a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account and I don't want to be scanning all account data for every query. I need to find a scalable way query only the relevant data. I did look into trying to us Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.
With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that correspond to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work, you can run a GetTable call to get it, though. Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29) LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) or the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation, this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed list the partitions created in the temporary table created with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions.
Finally delete all files that belong to the partitions of the query table you deleted and drop the temporary table created by the CTAS operation.
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
This looks like a lot, and looks like what Glue Crawlers and Glue ETL does – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL is also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is to convert your raw data form JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply doesn't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.