Joining solution using CoGroupByKey or SideInput in Apache Beam - google-cloud-platform

I have 2 tables to join; it's a left join. Below are the two scenarios showing how my pipeline behaves.
The job runs in batch mode, it's all user data, and we want to process it on Google Dataflow.
Day 1:
Table A: 5,000,000 records (size 3 TB)
Table B: 200 records (size 1 GB)
Both tables were joined through a side input, where Table B's data was taken as the side input, and it worked fine.
Day 2:
Table A: 5,000,010 records (size 3.001 TB)
Table B: 20,000 records (size 100 GB)
On the second day my pipeline slowed down, because the side input is cached and the cache size was exhausted once Table B grew.
So I tried using CoGroupByKey, but then the Day 1 data processing was pretty slow, with a log saying: "Having 10000 plus values on Single Key".
So is there a better, more performant way to perform the join when a hot key gets introduced?

It is true that performance can drop precipitously once Table B no longer fits into the cache, and there aren't many good solutions. The slowdown when using CoGroupByKey is not solely due to having many values on a single key, but also due to the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation could be to route your hot keys through a path that does the side-input join as before, and your long-tail keys through a CoGBK. This could be done by producing a truncated TableB' as a side input; your ParDo would attempt to look up the key, emitting to one PCollection if it was found in TableB' and to another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results. A sketch of this split is shown below.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
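A minimal sketch of that split, using the Python SDK with tagged outputs; the table reads, the hot-key subset used for TableB', and the left-join semantics are illustrative assumptions, not details taken from the question:

import apache_beam as beam

# Hypothetical element shapes: table_a and table_b are (key, value) pairs; hot_b
# is the truncated TableB' (hot keys only) passed as a dict side input.
class SplitAndJoinHot(beam.DoFn):
    HOT = 'hot_joined'
    COLD = 'cold'

    def process(self, element, hot_b):
        key, a_value = element
        if key in hot_b:
            # Hot key: join immediately against the small side input.
            yield beam.pvalue.TaggedOutput(self.HOT, (key, (a_value, hot_b[key])))
        else:
            # Long-tail key: defer to the CoGroupByKey path.
            yield beam.pvalue.TaggedOutput(self.COLD, (key, a_value))

with beam.Pipeline() as p:
    table_a = p | 'ReadA' >> beam.Create([('k1', 'a1'), ('k2', 'a2')])
    table_b = p | 'ReadB' >> beam.Create([('k1', 'b1'), ('k3', 'b3')])

    # Truncated TableB' side input; the hot-key filter here is a placeholder.
    hot_b = beam.pvalue.AsDict(table_b | 'HotSubset' >> beam.Filter(lambda kv: kv[0] == 'k1'))

    split = (table_a
             | 'Split' >> beam.ParDo(SplitAndJoinHot(), hot_b=hot_b)
                              .with_outputs(SplitAndJoinHot.HOT, SplitAndJoinHot.COLD))

    # Long-tail path: CoGroupByKey against all of table_b, keeping left-join semantics.
    cold_joined = (
        {'a': split[SplitAndJoinHot.COLD], 'b': table_b}
        | 'CoGroup' >> beam.CoGroupByKey()
        | 'FlattenJoin' >> beam.FlatMap(
            lambda kv: [(kv[0], (a, b))
                        for a in kv[1]['a']
                        for b in (list(kv[1]['b']) or [None])]))

    joined = (split[SplitAndJoinHot.HOT], cold_joined) | 'Merge' >> beam.Flatten()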

Related

Athena ignores LIMIT in some queries

I have a table with a lot of partitions (something that we're working on reducing).
When I query:
SELECT * FROM mytable LIMIT 10
I get:
"HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mytable' can potentially read more than 1000000 partitions"
Why isn't the "LIMIT 10" part of the query sufficient for Athena to return a result without reading more than a few partitions?
ANSWER :
During the query planning phase, Athena attempts to list all partitions potentially needed to answer the query.
Since Athena doesn't know which partitions actually contain data (i.e., which ones are not empty), it will add all partitions to the list.
Athena plans a query and then executes it. During planning it lists the partitions and all the files in those partitions. However, it does not know anything about the files, how many records they contain, etc.
When you say LIMIT 10 you're telling Athena you want at most 10 records in the result, and since you don't have any grouping or ordering you want 10 arbitrary records.
However, during the planning phase Athena can't know which partitions have files in them, or how many of those files it will need to read to find 10 records. Without listing the partition locations it can't know the partitions aren't all empty, and without reading the files it can't know the files aren't all empty either.
Therefore Athena first has to get the list of partitions, then list each partition's location on S3, even if you say you only want 10 arbitrary records.
In this case there are so many partitions that Athena short-circuits and says that you probably didn't mean to run this kind of query. If the table had fewer partitions Athena would execute the query, and each worker would read as little as possible to return 10 records and then stop – but each worker would still produce 10 records, because a worker can't assume that other workers would return any records. Finally the coordinator picks 10 records out of all the results from all workers to return as the final result.
LIMIT applies to the result set only, if I am not wrong. So the query will still read everything but only return 10 records.
Try limiting the data with a WHERE condition on the partition columns; that should solve the issue.
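For example (a hedged sketch: the partition column dt, the database name, and the S3 output location are placeholders, not from the question), adding a partition predicate lets Athena prune partitions during planning instead of listing all of them:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical layout: 'mytable' is partitioned by a 'dt' column.
response = athena.start_query_execution(
    QueryString=(
        "SELECT * FROM mytable "
        "WHERE dt = '2023-01-15' "  # partition predicate does the pruning
        "LIMIT 10"
    ),
    QueryExecutionContext={"Database": "mydatabase"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])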
I think Athena's workers try to read the maximum number of partitions (relative to the partition size of the table) to get that arbitrary chunk of data, and stop when the query is fulfilled (which in your case means the LIMIT is satisfied).
In your case, it doesn't even start executing that process because of the number of partitions involved. Therefore, if Athena won't plan your arbitrary data selection query, you have to plan it explicitly and hand it over to the execution engine.
Something like:
select * from mytable
where partition_column in (
    -- Athena/Presto requires a constant in LIMIT, so pick a fixed number of partitions
    select distinct partition_column from mytable limit 10
)
limit 100

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is a quite high number of partitions.
The old version had a database and tables with only one partition level, say id=x. Take one table, for example, where we store payment parameters per id (product); there are not that many IDs, assume around 1000-5000. While querying that table with an id in the where clause, like ".. where id = 10", the queries returned pretty fast. Assume we update the data twice a day.
Lately, we've been thinking of adding another partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means that the number of partitions grows by the number of IDs each day; after a month, with 3000 IDs, we'd get approximately 3000 x 30 = 90000 partitions. Thus, a rapid growth in the number of partitions.
On, say, 3 months of data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so.
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
The Real Case
It turns out Athena first fetches all partitions (metadata) and S3 paths (regardless of the where clause) and then filters those S3 paths down to the ones matching the where condition. The first part (fetching all S3 paths by partition) takes time proportional to the number of partitions.
The more partitions you have, the slower the query executes.
Intuitively, I expected Athena to fetch only the S3 paths referenced by the where clause; I mean, that would be part of the magic of partitioning. Maybe it fetches all paths.
Does anybody know a workaround, or are we using Athena in the wrong way?
Should Athena be used only with a small number of partitions?
Edit
In order to clarify the statement above, I add a piece from the support mail.
from Support
...
You mentioned that your new system has 360000 partitions, which is a huge number.
So when you are doing select * from <partitioned table>, Athena first downloads all partition metadata and searches the S3 paths mapped to
those partitions. This process of fetching data for each partition
leads to longer query execution times.
...
Update
An issue has been opened on the AWS forums.
Thanks.
This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL; DR I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
When it has the partitions it will issue LIST operations against the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1000 files in a partition, because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.
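If the bottleneck does turn out to be many small files per partition, one common remedy is to compact them into larger Parquet files with a CTAS query. A rough sketch (the database, table, and bucket names are placeholders, and CTAS writes at most 100 partitions per query, so a large table would need to be compacted in batches):

import boto3

athena = boto3.client("athena")

# Rewrite the table as larger Parquet files, keeping the same partition columns.
# bucket_count = 1 yields one file per partition (names below are hypothetical).
ctas = """
CREATE TABLE mydb.payments_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/compacted/',
  partitioned_by = ARRAY['id', 'dt'],
  bucketed_by = ARRAY['id'],
  bucket_count = 1
)
AS SELECT * FROM mydb.payments
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)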

Why Amazon Redshift UNLOAD performance is much better for fresh data?

I wonder why unloading from a big table (>100 bln rows), when selecting by a column which is NOT a sort key or a part of the sort key, is immensely faster for newly added data. How does Redshift know when to stop the sequential scan in the second scenario?
Time the query spent executing. 39m 37.02s:
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-01-15\' AND \'2017-01-16\'') TO ...
vs.
Time the query spent executing. 23.01s :
UNLOAD ('SELECT * FROM production.some_table WHERE daytime BETWEEN
\'2017-06-24\' AND \'2017-06-25\'') TO ...
Thanks!
Amazon Redshift uses zone maps to identify the minimum and maximum value stored in each 1 MB block on disk. Each block only stores data related to a single column (e.g. daytime).
If the SORTKEY is not set to daytime, then the data is unsorted and any particular date could appear in many different blocks. If SORTKEY is used, then a particular date will only appear in a minimum number of blocks.
Your second query possibly executes faster, even without a SORTKEY, because you are querying data that was probably added recently and is therefore all stored together in just a few blocks. The historical data might be spread in many blocks because a VACUUM probably reordered the data based upon the correct SORTKEY. In fact, if you did a VACUUM now, you might find that your second query becomes slower.
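A hedged sketch of how you could make that locality permanent (connection details and the new table name are placeholders): rebuild the table with daytime as the sort key so the per-block zone maps stay tight for date-range predicates.

import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="prod", user="admin", password="REDACTED",
)
conn.autocommit = True  # VACUUM can't run inside a transaction block
cur = conn.cursor()

# Rebuild the table sorted on daytime.
cur.execute("""
    CREATE TABLE production.some_table_sorted
    SORTKEY (daytime)
    AS SELECT * FROM production.some_table
""")

# Keep the sort order (and hence the zone maps) tight as new data arrives.
cur.execute("VACUUM production.some_table_sorted")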

Redshift -- Query Performance Issues

SELECT
a.id,
b.url as codingurl
FROM fact_A a
INNER JOIN dim_B b
ON strpos(a.url,b.url)> 0
Records Count in Fact_A: 2 Million
Records Count in Dim_B : 1500
Time Taken to Execute : 10 Mins
No of Nodes: 2
Could someone help me understand why the above query takes so long to execute?
We have declared the distribution key on Fact_A to distribute the records evenly across both nodes, and a sort key is created on URL in Fact_A.
Dim_B table is created with DISTRIBUTION ALL.
Redshift does not have full-text search indexes or prefix indexes, so a query like this (with strpos used in the filter) results in a full table scan, executing strpos 3 billion times.
Depending on which urls are in dim_B, you might be able to optimise this by extracting prefixes into separate columns. For example, if you always compare subpaths of the form http[s]://hostname/part1/part2/part3 then you can extract "part1/part2/part3" as a separate column both in fact_A and dim_B, and make it the dist and sort keys.
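A hedged sketch of that rewrite (the URL pattern, column names, and connection details are assumptions about your data, not facts from the question): derive the path once on each side, then join on equality so Redshift can co-locate the join via the dist/sort keys.

import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="prod", user="admin", password="REDACTED")
conn.autocommit = True
cur = conn.cursor()

# Strip the scheme and host, e.g. 'https://hostname/part1/part2/part3' -> 'part1/part2/part3'.
# The regex assumes all URLs share that shape.
cur.execute("ALTER TABLE fact_a ADD COLUMN url_path VARCHAR(1024)")
cur.execute("UPDATE fact_a SET url_path = REGEXP_REPLACE(url, '^https?://[^/]+/', '')")
cur.execute("ALTER TABLE dim_b ADD COLUMN url_path VARCHAR(1024)")
cur.execute("UPDATE dim_b SET url_path = REGEXP_REPLACE(url, '^https?://[^/]+/', '')")
# (Then rebuild the tables with url_path as the dist/sort key so the join is co-located.)

# Equality join instead of strpos(): no per-row substring scan of dim_b.
cur.execute("""
    SELECT a.id, b.url AS codingurl
    FROM fact_a a
    JOIN dim_b b ON a.url_path = b.url_path
""")
rows = cur.fetchall()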
You can also rely on parallelism of Redshift. If you resize your cluster from 2 nodes to 20 nodes, you should see immediate performance improvement of 8-10 times as this kind of query can be executed by each node in parallel (for the most part).

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve time series (usually the whole series) from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table, with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO 8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item: a 'value'. This gives you the ability to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and a rangeKey BETWEEN clause).
However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey, and will disperse your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (a single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take you much longer to read, as your single partition will throttle you much more heavily.
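A minimal boto3 sketch of that first design (the table name, attribute names, and throughput numbers are placeholders):

import boto3
from decimal import Decimal
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# One table, compound key: hash = time series name, range = ISO 8601 timestamp.
table = dynamodb.create_table(
    TableName="timeseries",
    KeySchema=[
        {"AttributeName": "series", "KeyType": "HASH"},
        {"AttributeName": "ts", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "series", "AttributeType": "S"},
        {"AttributeName": "ts", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
table.wait_until_exists()

# Append a data point.
table.put_item(Item={"series": "sensor-42", "ts": "2014-07-01T12:00:00Z",
                     "value": Decimal("3.14")})

# Read a whole series, or a time slice (ISO 8601 strings sort chronologically).
whole = table.query(KeyConditionExpression=Key("series").eq("sensor-42"))
window = table.query(
    KeyConditionExpression=Key("series").eq("sensor-42")
    & Key("ts").between("2014-07-01T00:00:00Z", "2014-07-02T00:00:00Z"))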
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try to spread your time series across 10 tables, where "highly read" data goes in table 1, "almost never read" data in table 10, and everything else somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at a high degree of complexity in your design. Overall, it's probably not worth it; where do you put new time series? How do you remember where they all are? How do you move a time series?
From my own experience, I think DynamoDB supports some internal "bursting" on these kinds of reads, and it's possible my numbers are off and you will get adequate performance. However, my verdict is to look into Redshift.
How about dumping each time series into JSON or similar and storing it in S3? At most you'd need a lookup from somewhere like Dynamo.
You may still need Redshift to process your inputs.