ETL Job in AWS Glue for Historical Data - amazon-web-services

I have a client who sends us data each month. However, each month's file contains all of their historical data as well as the new data from the current month. For example, in November 2020, the client sent us data from September 2018 - November 2020 inclusive. The file they will send in December 2020 will include data from September 2018 - December 2020 inclusive.
I want to perform an ETL job in Glue for only the new data (i.e. the data from the current month) each month. For example, I would like to extract just the December 2020 data from a CSV file that contains data from September 2018-December 2020. I have tried the following approaches to no avail:
Making use of job bookmarking: When enabled, the job "ignores" data files it has previously processed, but it still performs the ETL task on the entire new data file each month.
Making use of the SplitRows class: While this can accomplish what I am looking for, it would require a separate job for each month and year, since the comparison-dict would have to be tailored to the month of interest. If possible, I would like to avoid that and keep a single parameterized job, roughly along the lines of the sketch below.
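To be concrete, this is a minimal sketch of what I have in mind, assuming the target month is passed in as a Glue job argument and the CSV has a date column (here called record_date); both names are placeholders, not part of the actual data set:

# Sketch of a single parameterized Glue job. Assumes a job argument
# --target_month in YYYY-MM form and a CSV date column named record_date
# (both hypothetical), plus placeholder S3 paths.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["target_month"])  # e.g. "2020-12"

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the full historical CSV the client sends each month.
df = spark.read.option("header", "true").csv("s3://my-bucket/client-data/latest.csv")

# Keep only rows whose record_date falls in the requested month.
current = df.filter(
    F.date_format(F.to_date("record_date", "yyyy-MM-dd"), "yyyy-MM") == args["target_month"]
)

# Continue the ETL with `current` only, e.g. write it out as Parquet.
current.write.mode("overwrite").parquet(
    "s3://my-bucket/processed/month=" + args["target_month"] + "/"
)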
I am looking for solutions using AWS Glue, but if none exist I am open to using other AWS services (e.g. AWS Data Pipeline).
Thanks in advance for any assistance you can provide!

Related

AWS S3 partitioning strategy capable of supporting large data volumes for Analytic as well as ETL consumers

We currently ingest large volumes of data through Kinesis from a 3rd party; the records are aggregated through Firehose and written to S3 following a path structure based on the write timestamp (/year/month/day/hour/...). This allows the downstream daily ETL batch operations to process the last day's data by pruning down to the last 24 hours.
This is useful as our 3rd party partner has occasional latency in their publishing where events show up hours or even a few days late.
/2022/10/03/.. contains all data that ARRIVED on Oct 03, even if it includes transactions for Oct 02 or Oct 01 (it is easy to see which events arrived on this date).
We have a new proposal to interrogate the objects and write them with the path based on the transaction timestamp within the object. This would mean that late-arriving events would be written with a path for a period of time that has already been processed.
/2022/10/03/.. contains data that arrived on Oct 3 through Oct 5+ as long as the transaction date in the record is Oct 3.
Though this approach works better for teams analyzing data for a set period of time, it limits the ETL consumer's ability to identify the late-arriving events and distinguish them from previously processed records. We see 100-500 million events a day so this can be costly to process.
I'm looking for ideas on how to identify the late-arriving transactions if we move to this model.
I'm looking to use external tables from a Snowflake database to consume this.
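One possible direction, sketched below with PySpark (the column names arrival_ts/transaction_ts, the is_late flag, and the S3 paths are all hypothetical): when repartitioning by transaction date, carry the original arrival timestamp into each record, so an ETL consumer, or a Snowflake external table over these files, can isolate late arrivals with a simple column filter instead of rescanning whole partitions.

# Sketch: repartition by transaction date while preserving arrival time,
# so late-arriving events stay identifiable. Column names and paths are
# placeholders, not taken from the actual pipeline.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.json("s3://my-bucket/raw/2022/10/03/")  # arrival-date layout

enriched = (
    raw.withColumn("arrival_date", F.to_date("arrival_ts"))
       .withColumn("transaction_date", F.to_date("transaction_ts"))
       .withColumn("is_late", F.col("arrival_date") > F.col("transaction_date"))
)

# Write by transaction date (what analysts want), but keep arrival_date and
# is_late in the data so ETL consumers can filter late arrivals cheaply.
enriched.write.mode("append").partitionBy("transaction_date").parquet(
    "s3://my-bucket/by-transaction-date/"
)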

AWS Timestream: Unable to ingest records into AWS Timestream

As you all know, AWS Timestream was made generally available last week.
Since then, I have been trying to experiment with it and understand how it models and stores data.
I am facing an issue in ingesting records into Timestream.
I have certain records dated 23rd April 2020. On trying to insert these records into a Timestream table, I get the RecordRejected error.
According to this link, a record is rejected if it has the same dimensions and timestamp as an existing record, or if its timestamp is outside the retention period of the table's memory store.
I have set the retention period of my table's memory store to 12 months. According to the documentation, any record with a timestamp older than 12 months would be rejected.
However, the above-mentioned record gets rejected despite having a timestamp within 12 months of now.
On investigating further, I have noticed that records with today's date (5th Oct 2020) get ingested successfully, but records dated 30 days earlier (i.e. 5th September 2020) do not. To confirm this, I also tried inserting records dated 6th Sept and a few more dates between then and today; all of these were ingested successfully.
Could somebody explain why I am not able to insert records with a timestamp within the retention period of the memory store? It only lets me insert records that are at most 30 days old.
I would also like to know if there is a way to insert historical data directly into the magnetic store. The memory store retention period may not be sufficient for my use case, and I may need to insert data that is 2 years old or more. I understand this is not a classic use case for Timestream, but I am still curious to know.
I am stuck on this issue and would really appreciate some help.
Thank you in advance.
I had a very similar issue, and for me it turned out that I had to set the Memory Store Retention Period to 8766 hours, which is slightly MORE than one year. I have no clue why that is or why it works, but it worked for me when importing older data.
PS: I'm pretty sure it's a bug in Timestream.
PPS: I found the value by using the default set in the AWS console. No other value worked for me.
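For anyone who wants to try the same fix programmatically, a minimal boto3 sketch is below; the database/table names, region, and measure are placeholders. It simply raises the memory store retention to the 8766 hours mentioned above and then retries an older record:

# Minimal boto3 sketch: raise the memory store retention, then write a
# record dated 23 April 2020. Names and values are placeholders.
from datetime import datetime, timezone
import boto3

ts = boto3.client("timestream-write", region_name="us-east-1")

ts.update_table(
    DatabaseName="my_db",
    TableName="my_table",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 8766,   # slightly more than a year
        "MagneticStoreRetentionPeriodInDays": 1825,  # 5 years
    },
)

# 23 April 2020 as epoch milliseconds.
april_23_2020_ms = str(int(datetime(2020, 4, 23, tzinfo=timezone.utc).timestamp() * 1000))

ts.write_records(
    DatabaseName="my_db",
    TableName="my_table",
    Records=[{
        "Dimensions": [{"Name": "device_id", "Value": "sensor-1"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.5",
        "MeasureValueType": "DOUBLE",
        "Time": april_23_2020_ms,
        "TimeUnit": "MILLISECONDS",
    }],
)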
Timestream loads data into the memory store only if the timestamp is within the timespan of its retention period. So if the retention period is 1 day, the timestamp can't be more than 1 day ago.
AWS TimeStream: Records that are older than one day are rejected

Error in Google Play transfer frequency - Google BigQuery

I want to set up a weekly Google Play transfer, but it cannot be saved.
At first, I set up a daily Play transfer job and it worked. Then I tried to change the transfer frequency to weekly (every Monday 7:30) and got an error:
"This transfer config could not be saved. Please try again.
Invalid schedule [every mon 7:30]. Schedule has to be consistent with CustomScheduleGranularity [daily: true ]."
I think this document shows that the transfer frequency can be changed:
https://cloud.google.com/bigquery-transfer/docs/play-transfer
Can Google Play transfer be set to weekly?
By default, the transfer is created as daily. From the same docs:
Daily, at the time the transfer is first created (default)
Try to create a brand new weekly transfer. If that works, I would think it is a web UI bug. Here are two other options for changing your existing transfer:
BigQuery command-line tool: bq update --transfer_config
A very limited number of options is available, and schedule is not among them.
BigQuery Data Transfer API: transferConfigs.patch. Most transfer options are updatable. An easy way to try it is with the API Explorer. See the details on the transferConfig object; the schedule field needs to be defined:
Data transfer schedule. If the data source does not support a custom schedule, this should be empty. If it is empty, the default value for the data source will be used. The specified times are in UTC. Examples of valid format: 1st,3rd monday of month 15:30, every wed,fri of jan,jun 13:15, and first sunday of quarter 00:00. See more explanation about the format here:
https://cloud.google.com/appengine/docs/flexible/python/scheduling-jobs-with-cron-yaml#the_schedule_format
NOTE: the granularity should be at least 8 hours, or less frequent.
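If the API Explorer is inconvenient, the same patch can be attempted from the Python client library. The sketch below only assumes an existing transfer config (its resource name is a placeholder); if the Play data source really only supports daily granularity, this call should return the same CustomScheduleGranularity error, which would point to a source limitation rather than a UI bug.

# Sketch: update only the schedule of an existing transfer config via the
# Data Transfer API (transferConfigs.patch). The resource name below is a
# placeholder; look it up with bq or in the console.
from google.cloud import bigquery_datatransfer
from google.protobuf import field_mask_pb2

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    name="projects/123456/locations/us/transferConfigs/abcdef",
    schedule="every mon 07:30",
)

updated = client.update_transfer_config(
    transfer_config=transfer_config,
    update_mask=field_mask_pb2.FieldMask(paths=["schedule"]),
)
print(updated.schedule)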

Generating monthly data from daily CSV files using Apache Spark and AWS

I have CSV files with identical columns and a million matching IDs for every day of 2018. Each file has 5 columns, excluding the ID.
I want to concatenate the files by month so that each monthly file has 5 columns * the number of days; January, for example, would have 155 columns named Day1-Col1, Day1-Col2, ..., Day31-Col5.
Is this something I can do with Apache Spark?
I chose Spark because I want to place the data into an AWS Athena dataset, and it seems that AWS Glue can do this with Spark SQL queries.
I imagine we would convert the CSVs to Parquet files first and then produce a monthly dataset from them, to be visualised later with Amazon QuickSight.
Spark separates the I/O from the processing to some extent, so I'd do the same here when solving this.
First, I'd load your CSV files using the AWS Glue Catalog or Spark's native wholeTextFiles method.
From there, you can use AWS Glue's DynamicFrame methods, Spark SQL's DataFrame methods, or Spark's RDD functions for the data processing. In this case, the bulk of your processing is grouping the data by month based on the day of year. There are a few ways to do that:
Using RDDs, you can call the groupBy method with a custom function that returns the month index for a given day of year; Spark SQL's DataFrame has a groupBy method as well.
Another alternative is to iterate through the months in a loop and filter the records for each month by day of year. In some ways the for loop is cleaner and in others it is dirtier.
Finally, a third way is to add a month field to each record in a map. This allows you to partition your data by month, and you'll probably want year as well.
Finally, how you write each month back out depends on how you grouped the data. You can use the AWS Glue Catalog to write the files out if you looped or added a month field for partitioning (see the sketch below). If you did a groupBy, you'll need to count the rows, repartition to that number, and then use Spark to write the files.
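To make the last option concrete, here is a rough PySpark sketch of adding year/month fields and writing partitioned Parquet. The date column name (event_date) and the S3 paths are placeholders, and producing the wide Day1-Col1 ... Day31-Col5 layout would still be a separate pivot step on top of this.

# Sketch: derive year/month from each record's date and write Parquet
# partitioned by them, so Glue/Athena can query one month at a time.
# Paths and the event_date column are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily = (
    spark.read.option("header", "true")
         .csv("s3://my-bucket/daily/*.csv")
         .withColumn("event_date", F.to_date("event_date", "yyyy-MM-dd"))
)

monthly = (
    daily.withColumn("year", F.year("event_date"))
         .withColumn("month", F.month("event_date"))
)

# Partitioned Parquet output that Glue/Athena can catalog and QuickSight
# can visualise.
monthly.write.mode("overwrite").partitionBy("year", "month").parquet(
    "s3://my-bucket/monthly-parquet/"
)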

Timezone related issues in BigQuery (for partitioning and query)

We have a campaign management system. We create and run campaigns on various channels. When a user clicks or accesses any of the ads (as part of a campaign), the system generates a log entry. Our system is hosted on GCP, and the logs are exported to BigQuery using the 'Exports' feature.
In BigQuery, the log table is partitioned on the 'timestamp' field (the time the log entry is generated). We understand that BigQuery stores dates in UTC, so partitions are also based on UTC time.
Using this log table, we need to generate reports per day, for example the number of impressions per day per campaign, and we need to show these reports in ETC time.
Because the BigQuery table is partitioned in UTC, a query for an ETC day would potentially need to scan multiple partitions. Has anyone addressed this issue, or does anyone have suggestions to optimise the storage and queries so that they take full advantage of BigQuery's partitioning feature?
We are planning to use Google Data Studio for the reports.
BigQuery should be smart enough to filter for the correct timezones when dealing with partitions.
For example:
SELECT MIN(datehour) time_start, MAX(datehour) time_end, ANY_VALUE(title) title
FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
WHERE DATE(datehour) = '2018-01-03'
5.0s elapsed, 4.56 GB processed
For this query we processed the 4.56 GB in the 2018-01-03 partition. What if we want to adjust for a day in the US? Let's add this to the WHERE clause:
WHERE DATE(datehour, "America/Los_Angeles") = '2018-01-03'
4.4s elapsed, 9.04 GB processed
Now this query is automatically scanning 2 partitions, as it needs to go across days. For me this is good enough, as BigQuery is able to automatically figure this out.
But what if you wanted to permanently optimize for one timezone? You could generate a shifted DATE column and use that one to partition by.
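For instance, a one-off job along these lines (a sketch using the Python BigQuery client; the destination project/dataset/table names are placeholders, and in practice you would run it against your own log table rather than copying the large public table) would materialise a copy partitioned on a Los Angeles-shifted date:

# Sketch: materialise a table partitioned by a timezone-shifted DATE column.
# Destination names are placeholders; the source is the public Wikipedia
# pageviews table from the example above.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE TABLE `my_project.my_dataset.pageviews_2018_la`
PARTITION BY date_la AS
SELECT *, DATE(datehour, 'America/Los_Angeles') AS date_la
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
"""

client.query(sql).result()  # waits for the DDL job to finish

Queries for a Los Angeles day could then filter on date_la directly and prune down to a single partition.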