Suppose I have an employees table(with around a million employees) and a tasks table(with a few hundred tasks).
Now, I have a mechanism to predict how probable(percentage) an employee is to complete the task -- let's say I have four such mechanisms, and each of the mechanism outputs it's own probability.
Putting it all together, I now have n1(employees) times n2(tasks) times n3(mechanisms) results to store.
I was wondering what would be the best way to store these results.
I have a few options and thoughts:
Maintain a column(JSONField) in either of employees or tasks tables -- Concern: Have to update the whole column data if one of the values changes
Maintaining a third table predictions with foreign keys to employee and task with a column to store the predicted_probability -- Concern: Will have to store n1 * n2 * n3 records, I'm worried about scalability and performance
Thanks for any help.
PS: I'm using Django with postgres
The predictions table is the correct way to go. Depending on how you access the data, the size of the table won't matter. e.g. I would expect that reading the prediction for a single employee has a pretty constant performance. Large tables tend to be a problem only when you need to process all (or a large fraction) of the rows. If you hit a performance problem once you test this, you could e.g. partition that table by task or by task and mechanism (depending on how your queries are structured)
-Credits to #a_horse_with_no_name
Related
I want to take a general Idea of how I can optimise the query performance in redshift Database, I have Huge queries with lots of joins , I do understand using sort and Dist key it can be achieved but is there a method which we can follow in order to get some optimal results.
What to look in a table and how to approach query optimisation in redshift?
What are the necessary steps to look for or approach in order to have a certain plan for optimisation?
Any guidance will help a lot
Having improved many queries on Redshift there are a few things I can point you towards. First let me list a few tools / techniques to make sure you have these in your toolbox.
Ability to read and EXPLAIN plan and find expected costly points
Know where to find the query "actual" execution report
Know the system tables to find join, distribution, and disk io reports
So with those understood let's look at where many queries go sideways on Redshift. I will try to list these out in pareto order but any of these, or combos, can create significant issue.
#1 - Fat in the middle queries. When joining it is possible to expand the number of rows being operated upon many fold. Cross joining is a clear way this can happen but isn't how this usually happens. If the join on conditions create a many to many join pattern the number of rows can expand. When the table sizes are very large and the "multiplication" can make absurd data sizes. The explain plan can show this but not always - use of DISTINCT and GROUP BY can "hide" the true size of the dataset in play. Performing a SELECT COUNT(*) on your join tree can help show how big this is. You may also may need to look a pieces of the join tree if a later join is collapsing the rows (failure of the query optimizer?). Redshift is a columnar database and not well set up for the creation of data - this includes during the execution of query.
#2 - Distribution of large amounts of data. Redshift is a cluster and the node are connected together by ethernet cables and these connections are the slowest part of the cluster. A lot of work is done by the query optimizer to minimize the amount of data that needs to move around the network. However, it doesn't know your data as well as you do and doesn't always do this well. Look at the type of joins you are getting - is distribution needed? how much data is being distributed? Also, group by (and window functions) need to combine rows and therefore may need redistribution to complete. How big are the data sets entering your aggregation steps?
Moving a lot of data around the network will be slow. The difficulty is that it isn't always clear how to reduce this movement. Large join trees like you say you have can do "odd" things when it comes to the resulting distribution of the "joined" data. Joins are performed one at a time and the order these happen can matter. The query optimizer is making a number of decisions about the order of joins and how to organize the resulting data from each join. The choices it makes is based on what it sees in the table metadata so completeness of metadata matters. WHERE conditions can also impact the optimizer's choices. There are just way to many interactions to itemize them out here. Best advice is to look at the performance per step and see if data distribution is a factor. Then work to control how data is distributed in the query's execution. This may mean changing the join trees or even decomposing the query into several with temp table that have distribution set so that data movement is minimized.
#3 Excessive IO traffic - While not as slow as the networks, the disk IO subsystem is often a bottleneck. This shows up in a few ways. Are you reading more data from disk than is needed? (Metadata up to date?) Do you need a redundant WHERE clause to eliminate data? (Redundant WHERE clause is one that isn't needed functionally but is added so Redshift can perform the metadata comparisons that will reduce data read at scan.) Data spill is another way that disk IO can be strained (this goes back to #1). If data needs to spill to disk it can bring the disk IO performance down considerably. Use your metadata and Where clauses well.
Now these 3 areas often team up to kill your performance. Read too many rows from your tables, join all these extra rows together across the network while also making many new rows. This data doesn't fit in memory so now Redshift needs to spill to disk to complete the query. Things slow down real fast in these conditions.
Lastly these factors I've listed are cluster wide "resources" of Redshift. If one query take up a lot of one of these then there is less for other queries running at the same time. What often happens is that the query writers on a cluster follow similar patterns (good or bad) and when their pattern is costly on one axis then many of their queries are costly on the same axis. This shows up as queries that work "ok" when run in isolation but very badly when others are using the cluster. This generally means that many queries are contributing to pushing the cluster "over the edge" on some limited resource. There are system tables that you can look at to see aggregated IO or network traffic to see these effects.
Good queries are:
Don't make a lot of new "rows" during execution (not fat in the middle)
Keep large data sets "on node" and only redistribute data once the data has been pared down significantly
Don't read more data from disk than is necessary and don't spill
The problem is that doing all of these isn't always possible the trick is to not over subscribe the cluster resources you have.
I currently have a large set of json data that I'd like to import into Amazon Athena for visualization in Amazon Quicksight. In each json, there are two fields: one is a comma separated string of ids (orderlist), and the other field is an array of strings(locations). Because Quicksight doesn't support array searching, I'm currently resorting to creating a view where I generate crossjoins across the two string arrays:
select id,
try_CAST(orderid AS bigint) orderid_targeting,
location
from advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) as x(orderid)
CROSS JOIN UNNEST(locations) t (location)
With two cross joins, this can explode out the data to 20x-30x the original size.
If I were working on individual queries on Athena, I could use Presto array functions to search through the arrays. Is there a better way to make these fields accessible for filtering on Quicksight?
You have two options: keep doing what you're doing or implement an ETL workflow where you periodically materialise the view, for example using CTAS. The latter has the added benefit that you can produce Parquet files, which could help speed up your queries.
On the other hand it's not as simple as it sounds. If you're in luck you can use INSERT INTO to transform partitions from your current table into an optimised table after a point in time when they will not change – but in my experience most of the time your most recent data gets updated during some window of time, but you still want to be able to query it during that window. In that situation the ETL process becomes much more complicated since you need to remove data from the optimised table to avoid ending up with duplicate data. It's not hard, it's just a lot of code and juggling S3 and Glue Data Catalog operations so that you never have tables that have duplicate data nor too little data.
Unless you feel like your current setup with the view is too slow, don't go implementing something big and complicated. Remember that you pay for bytes scanned in Athena, not the amount of time Athena spends crunching your query. You get quite a lot of compute power running your queries and in my experience there's rarely any point in micro-optimisation of queries, the gains you make are orders of magnitude lower than minimising the amount of data you process, either through clever partitioning or moving to columnar file formats. Most of the time the gains from small optimisations are not measurable because the error bars caused by Athena's query queue and waiting for S3 operations. You may get your query to run 50ms faster, but sometimes it gets queued for 500ms, and spends another 2000ms doing list operations on S3 so how can you tell?
If you decide to go down the materialisation route, first do it once using CTAS and run your QuickSight visualisation against the results. Don't implement the whole ETL workflow before you've checked that you get something that is significantly more performant.
If all you are worried about is that it's less performant to apply filters after the unnesting of your arrays than using array functions, write the two versions of the query and benchmark them against each other. I suspect array functions are going to be slightly faster – but for the same reasons I mentioned above, the gains may drown in the error bars caused by Athena's queuing and other operations.
Make sure to benchmark at different points during the day, and be especially conscious of the fact that top-of-the-hour behaviour in Athena is extremely different from other times (run queries at 10:00 and then at 10:10 – your total execution times will be very different because everyone's cron jobs run at the top of the hour).
I was wondering what the usage of using a partitioned table in BigQuery is. It seems most of the queries seem to take about the same time to finish regardless of size (ignoring extremes, I'm generalizing), is this mainly a matter of using it to reduce costs on the bytes processed, or what is the main use case of partitioning tables in BQ?
https://cloud.google.com/bigquery/docs/creating-column-partitions
There are multiple benefits, mainly costs.
by writing a query to read only eg: 7 days of partitions instead of 7 years you have lower costs
partitions you don't touch for older than 90 days are at lower costs
you can clearly reload a day's data much more easier than having to work around
you are still recommended to use YEARly tables eg mytable_2018, but you are no longer required to have daily tables eg: mytable_20180101, this further leads to have simpler queries, also no longer a problem to read more than 1000 tables (which is a hard limit).
when you modify schema, you need to modify a few tables, you no longer need to script alters on thousands of table
this also means it's lover bytes processed and in the cloud platform can be better optimized and needs fewer resources
by reorganizing data into partitioned tables the query times will benefit in the future. As customers will move data, the cloud engineering team will optimize the service for better usage.
you see clear cost wise benefits if your existing data is at least a couple of terabytes.
I'm putting together some simple analysis to benchmark DWU impact on read and write based on a CTAS statement.
The query is aggregating 1.7b row table to a table of 993k rows. Source and destination tables are round-robin distribution (source won't be RR long-term, will move to HASH) the query is roughly as follows:
create table CTAS_My_DWU_Test
with (distribution = round_robin)
as
select TableKey1, TableKey2,
SumCcolumn=SUM(SalesAmt),
MaxQuantity=MAX(SalesQty),
MinQuantity=MIN(SalesQty)
from FactSales
group by TableKey1, TableKey2
option (label='MyDWUTest');
I am analysing the performance via the sys.dm_pdw_dms_workers DMV, getting an average bytes_per_second over each distribution for both type=DIRECT_READER and type=WRITER.
My process is to change the DWU, drop the CTAS, re-create it and analyse the data in the DMV.
I'm not seeing a consistent improvement in performance as I increase the DWU. My goal is to look for clear proof of increase compute, however sometimes a higher DWU is slower and returning less bytes_per_sec than a smaller DWU.
If I happen to run the CTAS statement twice on the same DWU, without going through the scale process, the second & subsequent executions run nearly 10x faster.
Looking for help to on the process based on one table, want to keep data movement/join out of the equation for the moment.
Good question! The architecture of Azure SQL Data Warehouse is more performant when there is less data movement. I recommend following the steps in this article to determine which step is slowing the process down: https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-manage-monitor/
It's possible that your query is analyzing each of the aggregations over the 1.7b rows in serial, which doesn't maximize the parallel nature of our product, but the best way to find out what is going on is to take a look at the query plan, etc. in the link above.
As for the 10x performance on a repeat run, that's coming from internal caching in our system.
Let us know what you find in the query plan, execution plan, etc.
I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get list of size 100000 to 305000. So, this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity and hence a table with one column to hold such values (can we create temp tables on the fly with JPA 2.0 vendor-independent?) and use the data from the temp table as a subquery to the original query before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of temp table which is actually a permanent one in my case thus far.
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is amount of time it takes to execute all the batches. For the mentioned numbers, this is at an unacceptable level at the moment. Takes anything between 5 to 15 mins!
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.