I tried two types of join conditions in Redshift. First I put the predicate in a WHERE clause after the JOIN ... ON; second, I added it with AND inside the JOIN ... ON. I assumed that WHERE is executed after the JOIN, so in that case many more rows would have to be scanned.
explain
select
*
from
table1 t
left join table2 t2 on t.key = t2.key
where
t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER (cost=43055.58..114637511640937.91 rows=2906695 width=3169)
Inner Dist Key: t.key
Hash Cond: (("outer".asin)::text = ("inner".asin)::text)
-> XN Seq Scan on table2 t2 (cost=0.00..39874539.52 rows=3987453952 width=3038)
-> XN Hash (cost=35879.65..35879.65 rows=2870373 width=131)
-> XN Seq Scan on table1 t (cost=0.00..35879.65 rows=2870373 width=131)
Filter: (snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
On the other hand, as follows, the AND condition is applied as part of the join, so I assumed fewer rows would be scanned in the join. But it returned far more rows and consumed a much higher cost than the WHERE clause version:
explain
select
*
from
table1 t
left join table2 t2 on t.key= t2.key
and
t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER (cost=40860915.20..380935317239623.75 rows=3268873216 width=3169)
Inner Dist Key: t.key
Hash Cond: (("outer".key)::text = ("inner".key)::text)
Join Filter: ("inner".snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
-> XN Seq Scan on table2 t2 (cost=0.00..39874539.52 rows=3987453952 width=3038)
-> XN Hash (cost=32688732.16..32688732.16 rows=3268873216 width=131)
-> XN Seq Scan on table1 t (cost=0.00..32688732.16 rows=3268873216 width=131)
What is the difference between them? Where is my misunderstanding in this case?
If anyone has opinions or materials, please let me know.
Thanks
There are several things happening here (in my original answer I missed that you were doing an outer join, so this is a complete rewrite).
WHERE happens before JOIN (i.e., real-world databases don't use relational algebra)
A join merges two result-sets on common column(s). The query optimizer will attempt to reduce the size of those result-sets by applying any predicates from the WHERE clause that are independent of the join.
Conditions in the JOIN clause control how the two result-sets are merged, nothing else.
This is a semantic difference between your two queries: when you specify the predicate on t.snapshot_day in the WHERE clause, it limits the rows selected from t. When you specify it in the JOIN clause, it controls whether a row from t2 matches a row in t, not which rows are selected from t or which rows are returned from the join.
You're doing an outer join.
In an inner join, rows between the two result-sets are joined if and only if all conditions in the JOIN clause are matched, and only those rows are returned. In practice, this will limit the result set to only those rows in t that fulfill the predicate.
An outer join, however, selects all rows from the outer result-set (in this case, all rows from t), and includes values from the inner result-set iff all join conditions are true. So you'll only include data from t2 where the key matches and the predicate applies. For all other rows in t you'll get nulls.
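To see the semantic difference concretely, here is a minimal sketch using SQLite (standard SQL outer-join semantics, which apply in Redshift as well); the tables and values are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (key TEXT, snapshot_day TEXT);
    CREATE TABLE t2 (key TEXT, val TEXT);
    INSERT INTO t1 VALUES ('a', '2021-12-18'), ('b', '2021-12-17');
    INSERT INTO t2 VALUES ('a', 'x'), ('b', 'y');
""")

# Predicate in WHERE: rows of t1 are filtered, so only 'a' survives.
where_rows = con.execute("""
    SELECT t1.key, t2.val FROM t1
    LEFT JOIN t2 ON t1.key = t2.key
    WHERE t1.snapshot_day = '2021-12-18'
""").fetchall()
print(where_rows)   # [('a', 'x')]

# Predicate in ON: every row of t1 is returned (it's the outer side);
# the predicate only decides whether a t2 match is attached or left NULL.
on_rows = con.execute("""
    SELECT t1.key, t2.val FROM t1
    LEFT JOIN t2 ON t1.key = t2.key
       AND t1.snapshot_day = '2021-12-18'
""").fetchall()
print(on_rows)      # [('a', 'x'), ('b', None)]
```

The WHERE version returns one row; the ON version returns both rows of t1, which mirrors why the second Redshift plan estimates the full row count of table1.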
That DS_DIST_INNER may be a problem
Not related to the join semantics, but for large queries in Redshift it's very expensive to redistribute rows. To avoid this, you should explicitly distribute all tables on the column that's used most frequently in a join (or used with the joins that involve the largest number of rows).
If you don't pick an explicit distribution key, then Redshift will allocate rows to nodes on a round-robin basis, and your query performance will suffer (because every join will require some amount of redistribution).
Related
How can I know the number of queries hitting a table in a particular time frame, and what are those queries?
Is it possible to get those stats for live tables hitting a redshift table?
This will give you the number of queries hitting a redshift table in a certain time frame:
SELECT
count(*)
FROM stl_wlm_query w
LEFT JOIN stl_query q
ON q.query = w.query
AND q.userid = w.userid
JOIN pg_user u ON u.usesysid = w.userid
-- Adjust your time frame accordingly
WHERE w.queue_start_time >= '2022-04-04 10:00:00.000000'
AND w.queue_start_time <= '2022-04-05 22:00:00.000000'
AND w.userid > 1
-- Set the table name here:
AND querytxt like '%my_main_table%';
If you need the actual queries text hitting the table in a certain timeframe, plus the queue and execution time and the user (remove if not needed):
SELECT
u.usename,
q.querytxt,
w.queue_start_time,
w.total_queue_time / 1000000 AS queue_seconds,
w.total_exec_time / 1000000 exec_seconds
FROM stl_wlm_query w
LEFT JOIN stl_query q
ON q.query = w.query
AND q.userid = w.userid
JOIN pg_user u ON u.usesysid = w.userid
-- Adjust your time frame accordingly
WHERE w.queue_start_time >= '2022-04-04 10:00:00.000000'
AND w.queue_start_time <= '2022-04-05 22:00:00.000000'
AND w.userid > 1
-- Set the table name here:
AND querytxt like '%my_main_table%'
ORDER BY w.queue_start_time;
If by "hitting a table you mean scan then they system table stl_scan lists all the accesses to a table and lists the query number that causes this scan. By writing a query to aggregate the information in stl_scan you can look at it by time interval and/or originating query. If this isn't what you mean you will need to clarify.
I don't understand what is meant by 'stats for live tables hitting a redshift table?'. What is meant by a table hitting a table?
I have
Flat File1 (F1) with these columns - key1, col1, col2
Flat File2 (F2) with these columns - key2, col1, col2
and one table (T1) with these columns - key3, col1, col2
Requirement is to get data from all 3 sources based on the below checks -
when key1 in flat file F1 matches key2 in flat file F2 - return all matching rows in F1 and F2
when key1 in flat file F1 doesn't match key2 in flat file F2 - only then should a check be done between flat file F1 and table T1, based on the condition key1 = key3; if a match is found, return all matching rows in T1 and F1
To achieve the above task,
I created a Joiner transformation between these 2 sources, F1 (Master) and F2 (Detail), and got the matching rows; the join type that I selected was "Detail outer join".
I am stuck on how to do the remaining checks.
Can anyone please guide me?
You can follow the steps below:
First, join FF1 and FF2 (outer join on FF2 so all data from FF1 comes in).
Then use a Router to split the data into two groups. Matching records can be sent to the target (group 1).
Non-matching records can be picked out where ff1.key is not null but ff2.key2 is null. Pick those records and match them with table T1 using a JNR (Joiner).
You can send these matching records to the target as well.
Whole map should look like this -
sq_FF1 (Master) --\
                   JNR (ff1.key = ff2.key2, Detail outer join) --> ROUTER (2 groups)
sq_FF2 (Detail) --/      |Grp 1: ff1.key and ff2.key2 both NOT NULL (matching) ----------------> To TGT
                         |Grp 2: ff1.key NOT NULL and ff2.key2 IS NULL (non-matching) --\
                                                                                         JNR (key1 = key3, inner join) --> To TGT
sq_T1 ----------------------------------------------------------------------------------/
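Independent of the Informatica objects involved, the routing logic above can be sketched in plain Python; the data and names here are invented purely for illustration:

```python
# Illustrative sketch of the mapping logic (not Informatica code):
# outer-join F1 to F2, send matches to the target, and only for the
# non-matching F1 keys fall back to a lookup against T1.
f1 = {"k1": "row1", "k2": "row2", "k3": "row3"}   # key1 -> F1 payload
f2 = {"k1": "ff2-row1", "k3": "ff2-row3"}          # key2 -> F2 payload
t1 = {"k2": "t1-row2", "k9": "t1-row9"}            # key3 -> T1 payload

target = []
for key, row in f1.items():
    if key in f2:                    # Router group 1: F1 matches F2
        target.append((key, row, f2[key]))
    elif key in t1:                  # Router group 2: fall back to T1
        target.append((key, row, t1[key]))

print(target)
```

Each F1 row reaches the target at most once, either paired with its F2 match or, failing that, with its T1 match, which is the behavior the two-Joiner-plus-Router mapping implements.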
Can't we bring the resulting outcome of both sets of data into one common transformation (like a Union) -> and from there implement the
common logic?
i.e.
return all matching rows in F1 and F2
the remaining unmatched rows of F1 should be joined with Table T1
Finally, the resulting outcome of the above 2 sets should be routed to one common transformation (like a Union) -> and from there we have one common logic.
I have used a Joiner transformation to bring in the matching rows of F1 and F2 ->
used a Filter transformation with the condition Key2 IS NULL to identify all unmatched rows of F1 ->
used a Joiner transformation to link table T1 with the records identified by the Filter ->
The results from step 1 and step 3 are routed to a Union.
But there is an issue when we merge data using the Union transformation: since we bring in data based on the "Detail outer join" join type, the data seems to get duplicated. How do I get rid of this issue?
I have a set of Django ORM models representing a directed graph, and I'm trying to retrieve all the adjacent vertices to a given vertex ignoring edge direction:
class Vertex(models.Model):
pass
class Edge(models.Model):
orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
# ... other data about this edge ...
The query Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() returns the correct result, but in my case it takes far too long to execute.
Typically for my application there will be around 50-100 vertices at any given time, and around a million edges. Even reducing it to only 20 vertices and 100000 edges, that query takes about a minute and a half to execute:
for i in range(20):
Vertex().save()
vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save()
v = vxs[0]
def f1():
return list( Vertex.objects.filter(
Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() )
t1 = timeit.Timer(f1)
print( t1.timeit(number=1) ) # 84.21138522100227
On the other hand, if I split the query up into two pieces I can get the exact same result in only a handful of milliseconds:
def f2():
q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
return list( {x for x in itertools.chain(q1, q2)} )
t2 = timeit.Timer(f2)
print( t2.timeit(number=100)/100 ) # 0.0109818680600074
This second version has issues though:
It's not atomic. The list of edges is almost guaranteed to change between the two queries in my application, meaning the results won't represent a single point in time.
I can't perform additional processing and aggregation on the results without manually looping over it. (e.g. If I wanted Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field')))
Why does the second version run so much faster than the first one?
How can I do this, and is there a way to get the first one to run fast enough for practical use?
I'm currently testing with Python 3.5.2, PostgreSQL 9.5.6, and Django 1.11, although if this is an issue with one of those I'm not stuck with them.
Here is the SQL generated by each query, as well as Postgres's EXPLAIN output:
The first one:
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061)
HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
The second one:
Vertex.objects.filter(edge_orig__dest=v).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061
HashAggregate (cost=1531.42..1531.62 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=848.11..1519.04 rows=4950 width=4)
Hash Cond: (app_edge.orig_id = app_vertex.id)
-> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4)
Recheck Cond: (dest_id = 1061)
-> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0)
Index Cond: (dest_id = 1061)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
#khampson's version also takes a minute-and-a-half to run, so it's also a no-go.
Vertex.objects.raw( ... )
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Join (cost=1.45..2917.45 rows=100000 width=12)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
The query plans for those two queries are radically different. The first (slower) one isn't hitting any indexes, and is doing two left joins, both of which result in way, way more rows being processed and returned. From what I interpret of the intention of the Django ORM syntax, it doesn't sound like you would truly want to do left joins here.
I would recommend considering dropping down into raw SQL in this case from within the Django ORM, and hybridize the two. e.g. if you take the first one, and transform it to something like this:
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
Two questions there: How does that version perform, and does it give you the results you're looking for?
For more info on raw queries, check out this section of the Django doc.
Response to comment from OP:
The query plan for the query I suggested also shows that it's not hitting any indexes.
Do you have indexes on both tables for the columns involved? I suspect not, especially since for this specific query, we're looking for a single value, which means if there were indexes, I would be very surprised if the query planner determined a sequential scan were a better choice (OTOH, if you were looking for a wide range of rows, say, over 10% of the rows in the tables, the query planner might correctly make such a decision).
I propose another query could be:
# Get edges which contain Vertex v, "only" optimizes fields returned
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig_id', 'dest_id')
# Get set of vertex id's to discard duplicates
vertex_ids = {*edges.values_list('orig_id', flat=True), *edges.values_list('dest_id', flat=True)}
# Get list of vertices, excluding the original vertex
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)
This shouldn't require any joins and shouldn't suffer from the race conditions you mention.
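As another sketch of the same idea, the two indexed lookups can also be combined into a single UNION statement, which keeps the operation atomic while still letting the planner use one index per branch. Below is an illustration against a toy SQLite schema (table and column names are made up, not the Django-generated ones):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edge (orig INTEGER, dest INTEGER);
    CREATE INDEX edge_orig ON edge (orig);
    CREATE INDEX edge_dest ON edge (dest);
    INSERT INTO edge VALUES (1, 2), (3, 1), (2, 3);
""")

# Adjacent vertices of v = 1, ignoring direction: one indexed lookup
# per branch; UNION removes duplicates, all in a single statement.
adjacent = [r[0] for r in con.execute("""
    SELECT dest FROM edge WHERE orig = ?
    UNION
    SELECT orig FROM edge WHERE dest = ?
""", (1, 1))]
print(sorted(adjacent))   # [2, 3]
```

In Django this kind of statement could be run via a raw query; unlike the OR-of-two-LEFT-JOINs form, each branch is a simple indexed join, which is essentially what the fast two-query version does by hand.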
I have an SQLite table of ~1M rows. Each row has the structure (docId, docBLOB), and each docBLOB is nearly 20 KB.
I have to perform a SELECT using an externally provided list of docIDs. Each list may be nearly 100K elements long. How can I do this more efficiently?
Maybe there is a way to make a SELECT * FROM docBlobTable WHERE docId IN ( [MEGALIST] ) statement?
Put all the IDs into a temporary table, then use:
SELECT * FROM docBlobTable WHERE docId IN (SELECT ID FROM TempTable)
or:
SELECT docBlobTable.*
FROM docBlobTable
JOIN TempTable ON docBlobTable.docId = TempTable.ID
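For illustration, here is a minimal sketch of this temp-table pattern using Python's sqlite3 module (the table names follow the answer above; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docBlobTable (docId INTEGER PRIMARY KEY, docBLOB BLOB)")
con.executemany("INSERT INTO docBlobTable VALUES (?, ?)",
                [(i, b"blob%d" % i) for i in range(10)])

wanted_ids = [2, 5, 7]   # stand-in for the externally provided 100K-element list

# Load the IDs into an indexed temp table in one batch...
con.execute("CREATE TEMP TABLE TempTable (ID INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO TempTable VALUES (?)", [(i,) for i in wanted_ids])

# ...then join against it instead of building a giant IN (...) literal.
rows = con.execute("""
    SELECT docBlobTable.* FROM docBlobTable
    JOIN TempTable ON docBlobTable.docId = TempTable.ID
""").fetchall()
print(len(rows))   # 3
```

Batching the IDs through executemany avoids both the statement-length limits and the repeated parsing cost of an enormous IN list.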
I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to get the query speedy?
Thanks!
You've provided the vital information here: the EXPLAIN ANALYZE output. That isn't showing a one-minute runtime, though; it's showing about 26 milliseconds. So either that isn't the query being run, or the problem is elsewhere.
The only difference between EXPLAIN ANALYZE and a real application is that the results aren't actually returned. You would need a lot of data to slow things down to a minute, though.
The other suggestions are all off the mark since they're ignoring the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half decent query planner).
So - you first need to figure out what exactly is slow...
Put an index on sp_article_categories.category_id
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows:
SELECT * FROM categories c
INNER JOIN articles a
ON c.category_id = 1081
AND c.category_id = a.category_id
And put an index on category_id like Steven suggests.
You can use field names instead of * too:
select [fields] from....
I assume you have run analyze on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id, numeric? If it isn't, you should perhaps consider making it numeric (integer or bigint, whatever suits your needs). It can make a big difference in performance, in my experience. Hope it helps.
Cheers!
// John