Creating a BigQuery Transfer Service with a more complex regex

I have a bucket that stores files based on a transaction time into a filename structure like
gs://my-bucket/YYYY/MM/DD/[autogeneratedID].parquet
Let's assume this structure dates back to 2015/01/01.
Some of the files might arrive late, so in theory a new file could be written to the 2020/07/27 structure tomorrow.
I now want to create a BigQuery table that loads all files with a transaction date of 2019-07-01 or newer.
My current strategy is to slice the past into small enough chunks to just run batch loads, e.g. by month. Then I want to create a transfer service that listens for all new files coming in.
I cannot just point it to gs://my-bucket/* as this would also try to load the data prior to 2019-07-01.
So basically I was thinking about encoding the "future-looking" file name structures into a suitable regex, but it seems like the wildcard names (https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames) only allow a very limited syntax, which is not as flexible as awk regex, for instance.
I know there are streaming inserts into BQ, but I'm still hoping to avoid that extra complexity and just create a smart transfer configuration to follow up the batch loads.

You can use scheduled queries with an external table. When you query your external table, you can use the pseudo column _FILE_NAME in the WHERE condition of your query and filter on it.
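As a minimal sketch of that approach (the dataset/table names and the regex over the bucket path are placeholder assumptions for your layout, and actually running the query requires the google-cloud-bigquery client and credentials), the scheduled query could parse the date out of _FILE_NAME and keep only recent files:

```python
# Sketch: filter an external table on the _FILE_NAME pseudo column.
# "my_dataset.my_external_table" and the my-bucket path pattern are
# placeholders for your own external table over gs://my-bucket/*.

def build_filter_query(table, cutoff_date):
    """Build a query that keeps only rows whose source file path
    encodes a transaction date (YYYY/MM/DD) on or after cutoff_date."""
    return rf"""
        SELECT *
        FROM `{table}`
        WHERE PARSE_DATE(
                '%Y/%m/%d',
                REGEXP_EXTRACT(_FILE_NAME, r'my-bucket/(\d{{4}}/\d{{2}}/\d{{2}})/')
              ) >= DATE '{cutoff_date}'
    """

# Running it with the BigQuery client (requires credentials):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   for row in client.query(build_filter_query("my_dataset.my_external_table", "2019-07-01")):
#       ...
```

Scheduling this as a scheduled query with a destination table gives you the "listen for new files" behavior without streaming inserts.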

Related

Sorting blobs in Google Cloud Storage by last modified date using the Python API

I have a scenario where I want to list blobs and then sort them by last modified time.
I am trying to do it with the Python API.
I want to execute this script n times, and in each execution I want to list 10 files and perform some operation (e.g. copy). I want to save the date of the last file in a config file, and in the next iteration list only the files after the last saved date.
I need some suggestions, as the Google API doesn't let us sort the files after listing.
blobs = storage_client.list_blobs(bucket_name, prefix=prefix, max_results=10)
There are several solutions I can think of:
Get a Pub/Sub notification every time a file is created. Read 10 messages each time, or save the topic data to BigQuery.
After using a file, move it to another folder with a metadata file, or update the processed file's metadata.
Use a storage trigger to invoke a function and save the event data to a database.
If you control the file names and paths, save them in an easy-to-query path by using the prefix parameter.
I think the database solution is the most flexible one; it gives you the best control over the data and the ability to create a dashboard for your data.
Knowing more about your flow would help in giving you a more fine-grained solution.
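For the original list-and-sort requirement, a client-side sort is also a workable sketch (the bucket name, prefix, and checkpoint handling below are assumptions; the actual listing requires the google-cloud-storage client):

```python
# Sketch: list blobs, sort by last-modified time on the client, and use a
# saved checkpoint so the next run resumes after the last processed date.

from datetime import datetime

def next_batch(blobs, last_processed, batch_size=10):
    """blobs: iterable of (name, updated_datetime) pairs.
    Return the next batch_size blobs modified after last_processed,
    sorted by modification time (oldest first)."""
    fresh = [b for b in blobs if b[1] > last_processed]
    fresh.sort(key=lambda b: b[1])
    return fresh[:batch_size]

# Usage (requires credentials); save batch[-1][1] as the new checkpoint:
#   from google.cloud import storage
#   client = storage.Client()
#   pairs = [(b.name, b.updated) for b in client.list_blobs("my-bucket", prefix="logs/")]
#   batch = next_batch(pairs, last_checkpoint)
```

Note this still lists every blob under the prefix on each run; it only avoids reprocessing, which is why the database or Pub/Sub options above scale better.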

AWS Athena - Query over large external table generated from Glue crawler?

I have a large set of history log files on AWS S3 that sum to billions of lines.
I used a Glue crawler with a grok deserializer to generate an external table in Athena, but querying it has proven to be unfeasible.
My queries have timed out and I am trying to find another way of handling this data.
From what I understand, through Athena, external tables are not actual database tables, but rather, representations of the data in the files, and queries are run over the files themselves, not the database tables.
How can I turn this large dataset into a query friendly structure?
Edit 1: For clarification, I am not interested in reshaping the log files from here on; those are taken care of. Rather, I want a way to work with the current file base I have on S3. I need to query these old logs, and in their current state that's impossible.
I am looking for a way to either convert these files into an optimal format or to take advantage of the current external table to make my queries.
Right now, by default of the crawler, the external tables are only partitioned by day and instance. My grok pattern explodes the formatted logs into a couple more columns that I would love to repartition on, if possible, which I believe would make my queries easier to run.
Your WHERE condition should be on partitions (at least one condition). You may be able to increase the Athena timeout by sending a support ticket; alternatively, you may use Redshift Spectrum.
But you should seriously think about optimizing the query. The Athena query timeout is 30 minutes, which means your query ran for 30 minutes before timing out.
By default Athena times out after 30 minutes. This timeout period can be increased by raising a support ticket with the AWS team. However, you should first optimize your data and query, as 30 minutes is enough time for executing most queries.
Here are a few tips for optimizing the data that will give a major boost to Athena performance:
Use columnar formats like ORC/Parquet with compression to store your data.
Partition your data. In your case you can partition your logs by year -> month -> day.
Create fewer, larger files per partition instead of many small ones.
The following AWS article gives detailed information on performance tuning in Amazon Athena:
Top 10 performance tuning tips for Amazon Athena
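One concrete way to do both the format conversion and the repartitioning in a single step is an Athena CTAS query; here is a sketch (the table names, partition columns, and S3 locations are assumptions about your schema, and running it requires boto3 and Athena permissions):

```python
# Sketch: rewrite the raw logs as compressed Parquet, partitioned by
# year/month/day. In Athena CTAS, the partition columns must be the
# last columns in the SELECT list.

def build_ctas(src_table, dst_table, output_location):
    return f"""
        CREATE TABLE {dst_table}
        WITH (
            format = 'PARQUET',
            parquet_compression = 'SNAPPY',
            external_location = '{output_location}',
            partitioned_by = ARRAY['year', 'month', 'day']
        ) AS
        SELECT * FROM {src_table}
    """

# Running it (requires credentials):
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=build_ctas("raw_logs", "logs_parquet", "s3://my-bucket/parquet/"),
#       ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#   )
```

Queries would then run against the new Parquet table, which scans far less data per query than the grok-parsed raw logs.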

AWS Athena Query Partitioning

I am trying to use AWS Athena to provide analytics for an existing platform. Currently the flow looks like this:
Data is pumped into a Kinesis Firehose as JSON events.
The Firehose converts the data to parquet using a table in AWS Glue and writes to S3 either every 15 mins or when the stream reaches 128 MB (max supported values).
When the data is written to S3 it is partitioned with a path /year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/...
An AWS Glue crawler updates a table with the latest partition data every 24 hours and makes it available for queries.
The basic flow works. However, there are a couple of problems with this...
The first (and most important) is that this data is part of a multi-tenancy application. There is a property inside each event called account_id. Every query that will ever be issued will be issued by a specific account, and I don't want to be scanning all account data for every query. I need to find a scalable way to query only the relevant data. I did look into trying to use Kinesis to extract the account_id and use it as a partition. However, this currently isn't supported, and with > 10,000 accounts the AWS 20k partition limit quickly becomes a problem.
The second problem is file size! AWS recommend that files not be < 128 MB as this has a detrimental effect on query times as the execution engine might be spending additional time with the overhead of opening Amazon S3 files. Given the nature of the Firehose I can only ever reach a maximum size of 128 MB per file.
With that many accounts you probably don't want to use account_id as partition key for many reasons. I think you're fine limits-wise, the partition limit per table is 1M, but that doesn't mean it's a good idea.
You can decrease the amount of data scanned significantly by partitioning on parts of the account ID, though. If your account IDs are uniformly distributed (like AWS account IDs) you can partition on a prefix. If your account IDs are numeric partitioning on the first digit would decrease the amount of data each query would scan by 90%, and with two digits 99% – while still keeping the number of partitions at very reasonable levels.
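A sketch of that prefix idea (the function name and prefix length here are illustrative, not any AWS API):

```python
def account_id_prefix(account_id, length=2):
    """Partition key derived from the first `length` characters of the
    account ID: two digits give at most 100 partitions, and each query
    scans only the matching ~1% of the data."""
    return str(account_id)[:length]

# A query then filters on both columns, e.g.
#   WHERE account_id_prefix = '12' AND account_id = '12345678'
# so Athena prunes every partition except '12' before reading any files.
```

The ETL process described below would compute this value when writing rows into the query table.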
Unfortunately I don't know either how to do that with Glue. I've found Glue very unhelpful in general when it comes to doing ETL. Even simple things are hard in my experience. I've had much more success using Athena's CTAS feature combined with some simple S3 operation for adding the data produced by a CTAS operation as a partition in an existing table.
If you figure out a way to extract the account ID you can also experiment with separate tables per account, you can have 100K tables in a database. It wouldn't be very different from partitions in a table, but could be faster depending on how Athena determines which partitions to query.
Don't worry too much about the 128 MB file size rule of thumb. It's absolutely true that having lots of small files is worse than having few large files – but it's also true that scanning through a lot of data to filter out just a tiny portion is very bad for performance, and cost. Athena can deliver results in a second even for queries over hundreds of files that are just a few KB in size. I would worry about making sure Athena was reading the right data first, and about ideal file sizes later.
If you tell me more about the amount of data per account and expected life time of accounts I can give more detailed suggestions on what to aim for.
Update: Given that Firehose doesn't let you change the directory structure of the input data, and that Glue is generally pretty bad, and the additional context you provided in a comment, I would do something like this:
Create an Athena table with columns for all properties in the data, and date as partition key. This is your input table, only ETL queries will be run against this table. Don't worry that the input data has separate directories for year, month, and date, you only need one partition key. It just complicates things to have these as separate partition keys, and having one means that it can be of type DATE, instead of three separate STRING columns that you have to assemble into a date every time you want to do a date calculation.
Create another Athena table with the same columns, but partitioned by account_id_prefix and either date or month. This will be the table you run queries against. account_id_prefix will be one or two characters from your account ID – you'll have to test what works best. You'll also have to decide whether to partition on date or a longer time span. Dates will make ETL easier and cheaper, but longer time spans will produce fewer and larger files, which can make queries more efficient (but possibly more expensive).
Create a Step Functions state machine that does the following (in Lambda functions):
Add new partitions to the input table. If you schedule your state machine to run once per day it can just add the partition that corresponds to the current date. Use the Glue CreatePartition API call to create the partition (unfortunately this needs a lot of information to work; you can run a GetTable call to get it, though). Use for example ["2019-04-29"] as Values and "s3://some-bucket/firehose/year=2019/month=04/day=29" as StorageDescriptor.Location. This is the equivalent of running ALTER TABLE some_table ADD PARTITION (date = '2019-04-29') LOCATION 's3://some-bucket/firehose/year=2019/month=04/day=29' – but doing it through Glue is faster than running queries in Athena and more suitable for Lambda.
Start a CTAS query over the input table with a filter on the current date, partitioned by the first character(s) of the account ID and the current date. Use a location for the CTAS output that is below your query table's location. Generate a random name for the table created by the CTAS operation; this table will be dropped in a later step. Use Parquet as the format.
Look at the Poll for Job Status example state machine for inspiration on how to wait for the CTAS operation to complete.
When the CTAS operation has completed list the partitions created in the temporary table created with Glue GetPartitions and create the same partitions in the query table with BatchCreatePartitions.
Finally, delete any files that belong to partitions of the query table that you replaced, and drop the temporary table created by the CTAS operation.
If you decide on a partitioning on something longer than date you can still use the process above, but you also need to delete partitions in the query table and the corresponding data on S3, because each update will replace existing data (e.g. with partitioning by month, which I would recommend you try, every day you would create new files for the whole month, which means that the old files need to be removed). If you want to update your query table multiple times per day it would be the same.
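The CreatePartition step above can be sketched with boto3 like this (the database, table, and bucket names are placeholders):

```python
# Sketch: copy the table's storage descriptor from GetTable and register
# one partition for a given date, as the Lambda step described above.

def partition_input(storage_descriptor, date_str, location):
    """Build the PartitionInput for glue.create_partition from a table's
    StorageDescriptor, pointing it at the partition's S3 location."""
    sd = dict(storage_descriptor)
    sd["Location"] = location
    return {"Values": [date_str], "StorageDescriptor": sd}

# Usage (requires credentials):
#   import boto3
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="logs", Name="input_table")["Table"]
#   glue.create_partition(
#       DatabaseName="logs",
#       TableName="input_table",
#       PartitionInput=partition_input(
#           table["StorageDescriptor"],
#           "2019-04-29",
#           "s3://some-bucket/firehose/year=2019/month=04/day=29",
#       ),
#   )
```

The same helper works for the BatchCreatePartitions step on the query table, just with the prefix/date values produced by the CTAS operation.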
This looks like a lot, and looks like what Glue Crawlers and Glue ETL does – but in my experience they don't make it this easy.
In your case the data is partitioned using Hive style partitioning, which Glue Crawlers understand, but in many cases you don't get Hive style partitions but just Y/M/D (and I didn't actually know that Firehose could deliver data this way, I thought it only did Y/M/D). A Glue Crawler will also do a lot of extra work every time it runs because it can't know where data has been added, but you know that the only partition that has been added since yesterday is the one for yesterday, so crawling is reduced to a one-step-deal.
Glue ETL also makes things very hard, and it's an expensive service compared to Lambda and Step Functions. All you want to do is convert your raw data from JSON to Parquet and re-partition it. As far as I know it's not possible to do that with less code than an Athena CTAS query. Even if you could make the conversion operation with Glue ETL in less code, you'd still have to write a lot of code to replace partitions in your destination table – because that's something that Glue ETL and Spark simply don't support.
Athena CTAS wasn't really made to do ETL, and I think the method I've outlined above is much more complex than it should be, but I'm confident that it's less complex than trying to do the same thing (i.e. continuously update and potentially replace partitions in a table based on the data in another table without rebuilding the whole table every time).
What you get with this ETL process is that your ingestion doesn't have to worry about partitioning more than by time, but you still get tables that are optimised for querying.

Dynamodb Update for multiple list items with multiple key values

We are updating data in an Excel sheet for particular event ids. We need to retrieve the primary key item from the DynamoDB table for each event id and update the values in the Excel sheet.
Doing this manually for a few articles is OK, but if we need to update 10,000 event id values, how can we automate this process through Python or some other method? Please assist.
If you're asking about how to automate this in Excel, then one option is to use the Office Interop APIs for Excel from your favorite .NET language (C# is really easy to use for this sort of task). Dynamo has client SDKs for .NET, again making it relatively easy to query your source table.
For the .Net SDK for Dynamo, start here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/dynamodb-intro.html
For Office automation, you have two options:
You can either write a .Net application that would interface with Excel and process the file, reading from Dynamo
You can try using the automation features from Excel via scripting (but I am not sure how well that would work with the external dependency on the AWS SDK)
For the latter you might start here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/interop/how-to-access-office-interop-objects
There are lots of examples for automating Excel using C#. If you find that you're stuck on something in particular, feel free to ask here on SO but the more focused the question the quicker and better answers you'll get.
As far as the approach for your particular task, I would:
make a console application that opens the Excel document (workbook) you want to edit
enumerate the sheets and pick the one you need to update (presumably the first one?!)
then, for each of the rows in the sheet, read the eventid from the corresponding cell
make the DynamoDB query and get the data you need for that event
update the cells for that row
repeat this for all rows until you're done
As a potential optimization, if there aren't that many records in Dynamo (10,000 is a pretty low number), I would look into scanning the Dynamo table into memory first and then doing the lookups in the memory. This has the added benefit that it will be significantly cheaper. Scanning all 10K items and storing in memory will usually be on the order of 15-20 times cheaper than making individual Get requests for each item.
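Since the question mentions Python, a sketch of that scan-into-memory approach (the table and attribute names are assumptions, and the paginated scan requires boto3 and credentials):

```python
# Sketch: scan the whole Dynamo table once, paginating past the
# 1 MB-per-page scan limit, and index the items by event id so each
# Excel row becomes an O(1) in-memory lookup instead of a Get request.

def build_lookup(items, key="event_id"):
    """Index scanned items by their event id."""
    return {item[key]: item for item in items}

# Usage (requires credentials):
#   import boto3
#   table = boto3.resource("dynamodb").Table("events")
#   items, resp = [], table.scan()
#   items.extend(resp["Items"])
#   while "LastEvaluatedKey" in resp:
#       resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
#       items.extend(resp["Items"])
#   lookup = build_lookup(items)
```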
I followed the steps below to complete the DynamoDB update:
1. We read the source CSV data and converted it into a list of dictionaries:
import csv

with open('test.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

list_1 = []
dict1 = {}
for i in range(1, len(your_list)):
    dict1[your_list[0][0]] = your_list[i][0]
    dict1[your_list[0][1]] = your_list[i][1]
    dict1[your_list[0][2]] = your_list[i][2]
    dict1[your_list[0][3]] = your_list[i][3]
    list_1.append(dict1)
    dict1 = {}
I have not copied the complete script here; I've just pasted one small batch of the script.
2. Using a DynamoDB scan operation, we compared the event ids in the source and the destination.
We faced a data retrieval issue here: a single scan returns at most 1 MB of data from DynamoDB, so the scan has to be paginated.
3. We verified each batch of records against the DynamoDB table and completed the update process.

Content replacement in S3 files when a unique id matches on both sides, using big data solutions

I am trying to explore a use case: we have huge data (50B records) in files, each file has around 50M records, and each record has a unique identifier. It is possible that a record present in file 10 is also present in file 100, but the latest state of that record is in file 100. The files sit in AWS S3.
Now let's say around 1B of the 50B records need reprocessing; once reprocessing is complete, we need to identify all the files that have ever contained these 1B records and replace the content of those files for these 1B unique ids.
Challenges: right now we don't have a mapping that tells which file contains which unique ids, and the whole file replacement needs to complete in one day, which means we need parallel execution.
We have already initiated a task for maintaining the file-to-unique-id mapping; we need to load this data while processing the 1B records, look them up in this data set, and identify all the distinct file dates for which content replacement is required.
The mapping will be huge, because it has to hold 50B records, and it may grow further, as this is a growing system.
Any thoughts around this?
You will likely need to write a custom script that will ETL all your files.
Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)
If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).
Rather than creating a system to lookup unique IDs in files, you should have your system output a timestamp. This way, the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the file.
I would suggest:
Process each existing file (yes, I know you have a lot!) and add a column that represents the filename
Once you have a new set of input files with the filename column (that acts to identify the latest record), use Amazon Athena to read all records and output one row per unique ID (with the latest date). This would be a normal SELECT... GROUP BY statement with a little playing around to get only the latest record.
Athena would output new files to Amazon S3, which will contain the data with unique records. These would then be the source records for any future processing you perform.
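The "only the latest record" query can be sketched with a window function instead of a plain GROUP BY (the table and column names here, records / unique_id / source_file, are placeholders for your schema):

```python
# Sketch: keep one row per unique id, choosing the row from the latest
# file via the filename column added in the previous step.

def latest_record_query(table, id_col="unique_id", order_col="source_file"):
    return f"""
        SELECT * FROM (
            SELECT *,
                   row_number() OVER (
                       PARTITION BY {id_col}
                       ORDER BY {order_col} DESC
                   ) AS rn
            FROM {table}
        )
        WHERE rn = 1
    """
```

Athena (Presto) supports row_number() window functions, which avoids the awkwardness of joining a GROUP BY result back to the full rows.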