Best strategy for joining two large datasets - mapreduce

I'm currently trying to find the best way of processing two very large datasets.
I have two BigQuery Tables :
One table containing streamed events (a billion rows)
One table containing tags and their associated event properties (100,000 rows)
I want to tag each event with the appropriate tags based on the event properties (an event can have multiple tags). However, a SQL cross join seems to be too slow for this dataset size.
What is the best way to proceed using a pipeline of MapReduce jobs while avoiding a very costly shuffle phase, since each event has to be compared to each tag?
Also, I'm planning to use Google Cloud Dataflow; is this tool suited to this task?

Google Cloud Dataflow is a good fit for this.
Assuming the tags data is small enough to fit in memory, you can avoid a shuffle by passing it as a side input.
Your pipeline would look like the following:
Use two BigQueryIO transforms to read from each table.
Create a DoFn to tag each event with its tags.
The input PCollection to your DoFn should be the events. Pass the table of tags as a side input.
Use a BigQueryIO transform to write the result back to BigQuery (assuming you want to use BigQuery for the output).
If your tags data is too large to fit in memory, you will most likely have to use a join (e.g. CoGroupByKey) instead.
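For the case where the tags do fit in memory, a minimal sketch of the side-input pipeline with the Beam Python SDK might look like this (the table names, field names and the tag-matching rule are placeholders to adapt to your schema):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class TagEventsFn(beam.DoFn):
    def process(self, event, tags):
        # 'tags' is the side input: the whole (small) tags table as a list of dicts.
        # Hypothetical rule: a tag applies when its 'property' value appears in the
        # event's 'properties' field.
        matched = [t['tag'] for t in tags if t['property'] in event.get('properties', '')]
        yield {'event_id': event['event_id'], 'tags': ','.join(matched)}

def run():
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, --region, ...
    with beam.Pipeline(options=options) as p:
        tags = p | 'ReadTags' >> beam.io.ReadFromBigQuery(
            table='my-project:my_dataset.tags')      # placeholder table
        events = p | 'ReadEvents' >> beam.io.ReadFromBigQuery(
            table='my-project:my_dataset.events')    # placeholder table
        tagged = events | 'TagEvents' >> beam.ParDo(
            TagEventsFn(), tags=beam.pvalue.AsList(tags))  # no shuffle: tags ride along as a side input
        tagged | 'WriteTagged' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.tagged_events',
            schema='event_id:STRING,tags:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)

if __name__ == '__main__':
    run()

Each worker gets its own copy of the tags, so every event is compared against every tag locally instead of through a shuffle.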

Related

Athena query timeout for bucket containing too many log entries

I am running a simple Athena query as in
SELECT * FROM "logs"
WHERE parse_datetime(requestdatetime,'dd/MMM/yyyy:HH:mm:ss Z')
BETWEEN parse_datetime('2021-12-01:00:00:00','yyyy-MM-dd:HH:mm:ss')
AND
parse_datetime('2021-12-21:19:00:00','yyyy-MM-dd:HH:mm:ss');
However, this times out due to the default 30-minute DML timeout.
The path I am querying contains a few million entries.
Is there a way to address this in Athena or is there a better suited alternative for this purpose?
This is normally solved with partitioning. For data that's organized by date, partition projection is the way to go (versus an explicit partition list that's updated manually or via Glue crawler).
That, of course, assumes that your data is organized by the partition key (e.g., s3://mybucket/2021/12/21/xxx.csv). If not, then I recommend changing your ingest process as a first step.
You may want to change your ingest process anyway: Athena isn't very good at dealing with a large number of small files. While the tuning guide doesn't give an optimal file size, I recommend at least a few tens of megabytes. If you're getting a steady stream of small files, use a scheduled Lambda to combine them into a single file. If you're using Firehose to aggregate files, increase the buffer sizes / time limits.
And while you're doing that, consider moving to a columnar format such as Parquet if you're not already using it.
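For reference, a partition-projection table for data laid out by day might look roughly like this (the bucket name, column list and SerDe are simplified placeholders; the interesting part is the TBLPROPERTIES block), submitted here through boto3:

import boto3

# Hypothetical DDL: one projected 'dt' partition per day, mapped to s3://mybucket/yyyy/MM/dd/
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs_partitioned (
  requestdatetime string,
  request string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://mybucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy/MM/dd',
  'projection.dt.range' = '2021/01/01,NOW',
  'storage.location.template' = 's3://mybucket/${dt}/'
)
"""

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://mybucket/athena-results/'})

With that in place, a query that restricts dt to a range only reads those day prefixes instead of scanning the whole bucket.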

Use AWS Athena With Dynamic Fields / Schemaless

We want to use AWS Athena for analytics and segmentation. Our problem is that our data is schemaless: rows differ from one another, with only some columns in common.
Is it possible to create a table without defining all the columns?
When we query, we know the type (string/int) of each column, so if there is a way to define it in the query, that would be great.
We can structure the data in any way needed to support schemaless use, and in any format: CSV / JSON.
Is Athena an option for schemaless uses?
There are many ways to use Athena for schemaless data, but you need to give specific examples of the scenarios you want to support more efficiently. In Athena you pay based on the data that you scan, so optimizing your data to minimize the amount scanned is critical to making it a useful tool at scale.
The simplest way to get you started as you are learning the tool, and the types of queries that you can run on your data, is to define a table with a single column ("line"), and then do the parsing of the data that you want using string functions, or JSON functions if the lines are in JSON format.
You will get good performance if you have multiple files, but it will be expensive, as you need to scan all your data for every query. I suggest that you start with these queries as a good way to define your requirements. As usage grows, start optimizing the more popular (and expensive) use cases with CTAS (Create Table As Select) commands that generate Parquet versions of the original raw data.
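As a rough illustration of the single-"line"-column approach and the later CTAS step (the bucket, table and field names are invented; run these via the Athena console or boto3):

# Raw table: one string column per line, assuming the chosen delimiter never appears in the data.
raw_table_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (line string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://mybucket/raw/'
"""

# Parse fields at query time with the JSON functions (flexible, but scans everything).
exploratory_query = """
SELECT json_extract_scalar(line, '$.user_id') AS user_id,
       json_extract_scalar(line, '$.event_type') AS event_type
FROM raw_events
WHERE json_extract_scalar(line, '$.event_type') = 'purchase'
"""

# Once a use case stabilizes, materialize just those columns as Parquet with CTAS.
ctas = """
CREATE TABLE purchases
WITH (format = 'PARQUET', external_location = 's3://mybucket/curated/purchases/') AS
SELECT json_extract_scalar(line, '$.user_id') AS user_id,
       CAST(json_extract_scalar(line, '$.amount') AS double) AS amount
FROM raw_events
WHERE json_extract_scalar(line, '$.event_type') = 'purchase'
"""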
You are welcome to read my blog post, which describes the strategy and tactics of a cloud environment using Athena and the other AWS tools around it.

How best cache bigquery table for fast lookup of individual row?

I have a raw data table in BigQuery that has hundreds of millions of rows. I run a scheduled query every 24 hours to produce some aggregations, which results in a table in the ballpark of 33 million rows (6 GB) that may be expected to grow slowly to approximately double its current size.
I need a way to do quick, one-row-at-a-time lookups by id against that aggregate table in a separate event-driven pipeline. I.e. a process is notified that person A just took an action; what do we know about this person's history from the aggregation table?
Clearly BigQuery is the right tool to produce the aggregate table, but not the right tool for the quick lookups, so I need to offload it to a secondary datastore like Firestore. But what is the best process to do so?
I can envision a couple strategies:
1) Schedule a dump of agg table to GCS. Kick off a dataflow job to stream contents of gcs dump to pubsub. Create a serverless function to listen to pubsub topic and insert rows into firestore.
2) A long running script on compute engine which just streams the table directly from BQ and runs inserts. (Seems slower than strategy 1)
3) Schedule a dump of agg table to GCS. Format it in such a way that can be directly imported to firestore via gcloud beta firestore import gs://[BUCKET_NAME]/[EXPORT_PREFIX]/
4) Maybe some kind of dataflow job that performs lookups directly against the bigquery table? Not played with this approach before. No idea how costly / performant.
5) some other option I've not considered?
The ideal solution would give me access to an agg row within milliseconds so that I can append its data to the real-time event.
Is there a clear best winner here in the strategy I should pursue?
Remember that you could also CLUSTER your table by id, making your lookup queries much faster and cheaper in data scanned. They will still take more than a second to run, though.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
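A sketch of what that could look like with the BigQuery Python client (the dataset, table and column names are made up):

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Hypothetical: rebuild the aggregate table clustered on the lookup key.
client.query("""
    CREATE OR REPLACE TABLE my_dataset.agg_clustered
    CLUSTER BY person_id AS
    SELECT * FROM my_dataset.agg_table
""").result()

# Point lookups then only read the blocks that contain that id.
rows = client.query(
    "SELECT * FROM my_dataset.agg_clustered WHERE person_id = @person_id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter('person_id', 'STRING', 'A123')]
    )).result()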
You could also set up exports from BigQuery to CloudSQL, for subsecond results:
https://medium.com/@gabidavila/how-to-serve-bigquery-results-from-mysql-with-cloud-sql-b7ddacc99299
And remember, now BigQuery can read straight out of CloudSQL if you'd like it to be your source of truth for "hot-data":
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229

Dynamodb Update for multiple list items with multiple key values

We are updating data in an Excel sheet for a particular event id. We need to retrieve the item by its primary key from the DynamoDB table for that event id and then update the values in the Excel sheet.
Doing this manually for a few records is OK, but if we need to update 10,000 event id values, how can we automate this process through Python or any other method? Please assist with this.
If you're asking about how to automate this in Excel, then one option is to use the Office Interop APIs for Excel from your favorite .NET language (C# is really easy to use for this sort of task). Dynamo has client SDKs for .NET, again making it relatively easy to query your source table.
For the .Net SDK for Dynamo, start here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/dynamodb-intro.html
For Office automation, you have two options:
You can either write a .Net application that would interface with Excel and process the file, reading from Dynamo
You can try using the automation features from Excel via scripting (but I am not sure how well that would work with the external dependency on the AWS SDK)
For the latter you might start here: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/interop/how-to-access-office-interop-objects
There are lots of examples for automating Excel using C#. If you find that you're stuck on something in particular, feel free to ask here on SO but the more focused the question the quicker and better answers you'll get.
As far as the approach for your particular task, I would:
make a console application that opens the Excel document (workbook) you want to edit
enumerate the sheets and pick the one you need to update (presumably the first one?!)
then, for each of the rows in the sheet, read the eventid from the corresponding cell
make the DynamoDB query and get the data you need for that event
update the cells for that row
repeat this for all rows until you're done
As a potential optimization, if there aren't that many records in Dynamo (10,000 is a pretty low number), I would look into scanning the Dynamo table into memory first and then doing the lookups in the memory. This has the added benefit that it will be significantly cheaper. Scanning all 10K items and storing in memory will usually be on the order of 15-20 times cheaper than making individual Get requests for each item.
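The answer above is framed around .NET, but since the question mentions Python, the scan-into-memory idea might look roughly like this with boto3 (the table name and attribute names are assumptions):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('events')  # hypothetical table name

# Scan the whole table once (paginating past the 1 MB-per-page limit)
# and index the items by event id for in-memory lookups.
items_by_event_id = {}
response = table.scan()
while True:
    for item in response['Items']:
        items_by_event_id[item['event_id']] = item  # 'event_id' is the assumed key attribute
    if 'LastEvaluatedKey' not in response:
        break
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])

# Later, for each row read from the spreadsheet:
record = items_by_event_id.get('EVT-12345')  # placeholder event id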
I followed the steps below to complete the DynamoDB update.
1. We read the source CSV data and converted it into a list of dictionaries:
import csv

# Read the CSV; row 0 holds the headers, the remaining rows hold the data.
with open('test.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
list_1 = []
dict1 = {}
for i in range(1, len(your_list)):
    dict1[your_list[0][0]] = your_list[i][0]
    dict1[your_list[0][1]] = your_list[i][1]
    dict1[your_list[0][2]] = your_list[i][2]
    dict1[your_list[0][3]] = your_list[i][3]
    list_1.append(dict1)
    dict1 = {}
I have not copied the complete script here, just one small batch of it.
2. Using the DynamoDB scan operation, we compared the event ids in the source and the destination.
We faced a data retrieval issue here: a DynamoDB scan returns at most 1 MB of data at a time, so the results have to be paginated.
3. We verified each batch of records against the DynamoDB table and completed the update process.
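For step 3, the per-item update could look roughly like this (the key and attribute names are placeholders that depend on your table and CSV headers):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('events')  # hypothetical table name

# For each matched row from the CSV (list_1 above), update the item in place.
# Assumes the CSV headers include 'event_id', 'status' and 'updated_at'.
for row in list_1:
    table.update_item(
        Key={'event_id': row['event_id']},           # assumed partition key
        UpdateExpression='SET #st = :status, updated_at = :ts',
        ExpressionAttributeNames={'#st': 'status'},  # 'status' is a reserved word
        ExpressionAttributeValues={
            ':status': row['status'],
            ':ts': row['updated_at'],
        })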

Amazon Redshift schema design

We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.
I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift, does the approach that I have outlined above apply here? If not, I would love some suggestions.
Thanks.
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schemas: You should create multiple schemas and create tables accordingly. When you scale, it will be easier for you to pinpoint exactly where a table is supposed to be. Say you have three schemas, named production, aggregates and rough. You know that the production schema will contain the tables that are not supposed to be changed (mostly OLTP data), such as the user, order and transactions tables. The aggregates schema will hold aggregated data built over the raw tables, such as the number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold business logic but is needed for some temporary work, say checking the genre of movies for a list of 100,000 users shared with you in an Excel file: simply create a table in the rough schema, perform your operations and drop the table. Now you know very clearly where to find a table based on whether it is raw, aggregated or simply temporary.
Public schema: Forget it exists. Any table that is not preceded by a schema name gets created there. It collects a lot of clutter, and there is no point in storing any important data there.
Cross-schema joins: There's nothing stopping you here. You may join as many tables from as many schemas as required. In fact, it is desirable to create dimension tables and join on a primary key later, rather than keeping all the information in a single table.
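A minimal sketch of that layout with psycopg2 (connection details, table names and columns are all invented, and production.users is assumed to already exist):

import psycopg2

# Hypothetical connection details.
conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='dw', user='admin', password='...')
cur = conn.cursor()

# One schema per data "stage", instead of piling everything into public.
cur.execute("CREATE SCHEMA IF NOT EXISTS production")
cur.execute("CREATE SCHEMA IF NOT EXISTS aggregates")
cur.execute("CREATE SCHEMA IF NOT EXISTS rough")

# Tables live in their schema...
cur.execute("""
    CREATE TABLE IF NOT EXISTS aggregates.orders_per_user_day (
        user_id    bigint,
        order_date date,
        n_orders   int
    )
""")

# ...and cross-schema joins work exactly like same-schema joins.
cur.execute("""
    SELECT u.user_id, a.order_date, a.n_orders
    FROM production.users u
    JOIN aggregates.orders_per_user_day a ON a.user_id = u.user_id
""")
conn.commit()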
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
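For example, the group-based model could look like the statements below (group, schema and user names are invented; kept here as a SQL string to run through psycopg2 or whichever SQL client you use):

redshift_grants = """
CREATE GROUP analysts;
GRANT USAGE ON SCHEMA aggregates TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA aggregates TO GROUP analysts;
ALTER GROUP analysts ADD USER alice;
REVOKE CREATE ON SCHEMA public FROM PUBLIC;  -- stop everyone from dumping tables into public
"""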
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using COPY command
Improve the performance of Amazon Redshift loads by using the COPY command to get data into the database. The COPY command is clever enough to automatically choose the most appropriate encoding settings for the data it uploads; you don't have to think about it. However, it does so only for the first data upload into an empty table.
So make sure to use a significant data set when uploading for the first time, which Redshift can assess to set the column encodings in the best way. Uploading only a few lines of test data will prevent Redshift from knowing how best to optimize the compression for the real workload.
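A typical first load that lets COPY pick the encodings might look like this (the bucket, table and IAM role ARN are placeholders):

import psycopg2

# Hypothetical connection and object names. COMPUPDATE ON asks COPY to sample the
# data and choose column encodings itself, which it only does for an empty table.
conn = psycopg2.connect(host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
                        port=5439, dbname='dw', user='loader', password='...')
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY production.orders
        FROM 's3://mybucket/orders/2021/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        FORMAT AS CSV
        COMPUPDATE ON
    """)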
Second: Use Best Distribution Style and Key
The distribution style decides how data is distributed across the nodes. Applying a distribution style at the table level tells Redshift how you want to distribute the table and which key to use, so how you specify the distribution style matters for good query performance with Redshift. The style you choose may affect storage requirements and cluster sizing, and it also affects the time taken by the COPY command to execute.
I recommend the ALL distribution style for smaller dimension tables. For a large dimension, distribute both the dimension and the associated fact table on their join column. To optimize a second large dimension, either take the storage hit and distribute it ALL as well, or denormalize its columns into the fact table.
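For illustration, here is roughly what those choices look like as DDL (all names invented; kept as a SQL string to execute with psycopg2 or your SQL client):

dist_ddl = """
-- Small dimension: take the storage hit and copy it to every node.
CREATE TABLE dim_country (
    country_id   int,
    country_name varchar(64)
) DISTSTYLE ALL;

-- Large dimension and its fact: distribute both on the join column
-- so the join happens without moving rows between nodes.
CREATE TABLE dim_user (
    user_id     bigint DISTKEY,
    signup_date date
);

CREATE TABLE fact_orders (
    order_id   bigint,
    user_id    bigint DISTKEY,
    country_id int,
    amount     decimal(12,2)
);
"""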
Third: Use the Best Sort Key
A Redshift table keeps its data arranged according to the sort key column, if one is specified. Since the data is sorted within each partition, each cluster node maintains its partition in a predefined order. (While designing your Redshift schema, also consider the impact on your budget: Redshift is priced by the amount of stored data and by the number of nodes.)
A sort key can improve Amazon Redshift performance significantly, in several ways. First, data filtering: if a WHERE clause filters on a sort key column, entire data blocks can be skipped. This is because Redshift saves data in blocks, and each block header records the minimum and maximum sort key values, so if the filter falls outside that range, the whole block can be skipped.
Second, when joining two tables that are sorted on their join keys, the data is read in matching order and Redshift can merge-join them without separate sort steps. Joining a large dimension to a large fact table works well with this method, because neither will fit into a hash table.
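And a sketch of the sort-key side (again, names invented; a SQL string to run through psycopg2 or your SQL client):

sort_ddl = """
-- Sorting the fact on the column you filter by lets Redshift skip whole blocks:
-- each block header stores min/max sort-key values, so a date range that falls
-- outside a block's range means that block is never read.
CREATE TABLE fact_orders_sorted (
    order_id   bigint,
    user_id    bigint DISTKEY,
    order_date date,
    amount     decimal(12,2)
) SORTKEY (order_date);

-- For the merge-join case, both tables would instead be distributed AND sorted
-- on the join column, e.g. DISTKEY (user_id) ... SORTKEY (user_id) on each side.
"""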