How to combine two SELECT statements in C++

For an assignment, I'm looking to make my code faster. I'm using the sqlite3 C++ API to perform tasks in order to eventually build an R-tree and a B-tree.
I am doing the assignment's tasks correctly, but unfortunately my solution is extremely slow. For my question, I'll first show simple mock tables, then a simple flow of my program.
Simplified table schemas:
areaTable (id int, closed int)
middleTable (nodeid int, areaid int)
nodeTable (id int, x float, y float)
The flow of my program is as follows:
query1
SELECT id FROM areaTable WHERE closed = 1;
Using query1 I save the resulting ids into a vector (we'll call it query1ResultsArray).
Then, using sqlite3_prepare_v2, I prepare a new SELECT query:
query2
SELECT MIN(x), MIN(y)
FROM nodeTable
WHERE id IN
(
SELECT nodeid
FROM middleTable
WHERE areaid = ?
);
The idea of query2 is to find the minimum values of the nodes that get grouped together by middleTable and areaTable. I bind individual results from query1 into query2 using a for loop like the following:
prepare query2
begin transaction (not sure if this helps)
for (auto &id : query1ResultsArray) {
bind(id)
step(stmt)
x = column 0
y = column 1
cout << "INSERT INTO ...."
reset(stmt)
}
end transaction
finalize(stmt)
This solution appears to work. It gets the proper results I need to continue with the assignment's tasks (building insert statements), but it's very, very slow. I doubt the professor expects our programs to be this slow.
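For reference, that loop in actual sqlite3 C API calls looks roughly like this sketch (not my exact code; it assumes an already-open sqlite3* handle called db and the mock table/column names above):

#include <sqlite3.h>
#include <iostream>
#include <vector>

// Sketch only: db is an open connection; query1ResultsArray holds the ids from query1.
void printAreaMinima(sqlite3 *db, const std::vector<int> &query1ResultsArray) {
    const char *sql =
        "SELECT MIN(x), MIN(y) FROM nodeTable "
        "WHERE id IN (SELECT nodeid FROM middleTable WHERE areaid = ?);";
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);        // prepare query2

    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);
    for (int id : query1ResultsArray) {
        sqlite3_bind_int(stmt, 1, id);                      // bind(id)
        if (sqlite3_step(stmt) == SQLITE_ROW) {             // step(stmt)
            double x = sqlite3_column_double(stmt, 0);      // x = column 0
            double y = sqlite3_column_double(stmt, 1);      // y = column 1
            std::cout << "INSERT INTO ... " << x << ", " << y << "\n";
        }
        sqlite3_reset(stmt);                                // reset(stmt) for the next id
    }
    sqlite3_exec(db, "END TRANSACTION", nullptr, nullptr, nullptr);
    sqlite3_finalize(stmt);
}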
This was context for my question. The question itself is essentially:
Am I able to combine my two select statements? By combining them I would avoid the constant binding and resetting, which I hope (with no knowledge to back it up) will speed up my program.
I've tried the following:
SELECT MIN(x), MIN(y), MAX(x), MAX(y)
FROM nodeCartesian
WHERE id IN
(
SELECT nodeid
FROM waypoint
WHERE wayid IN
(
SELECT id
FROM way
WHERE closed = 1
)
);
But this gets the minimum of all nodes since they don't get properly grouped together into their respective 'areas'.
P.S. I am dealing with a 2D R-tree, so I know what I wrote isn't exactly right, but I just wrote the part I'm having difficulty with. Also, I tried researching how to apply INNER JOINs to my statement but couldn't figure out how :(, so if you think that may help my performance as well, I would love to hear it. Another thing: query1 deals with 2+ million rows, while query2 deals with approximately 340,000 rows, and I estimated that it will take about 1 day for query2 to finish.
Thanks

I am not sure about your schema; however, I think something like this, with a GROUP BY on your area, should do it:
SELECT m.areaid, MIN(n.x), MIN(n.y), MAX(n.x), MAX(n.y)
FROM
nodeCartesian n
INNER JOIN waypoint wp ON n.id = wp.nodeid
INNER JOIN way w ON wp.wayid = w.id
INNER JOIN middleTable m ON n.id = m.nodeid
WHERE
w.closed = 1
GROUP BY
m.areaid
Note: calling a SELECT query multiple times in a loop is a bad idea, because each call has significant overhead, which makes it really slow. Making a single query that returns all the relevant rows and then looping through them in code is much faster.
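For example, on the C++ side the grouped query can be prepared once and its rows consumed in a single loop, roughly like this sketch (assuming the join above matches your real schema and that db is an open sqlite3* handle):

#include <sqlite3.h>
#include <iostream>

// Sketch only: one prepared statement, one pass over the grouped results.
void printAreaExtents(sqlite3 *db) {
    const char *sql =
        "SELECT m.areaid, MIN(n.x), MIN(n.y), MAX(n.x), MAX(n.y) "
        "FROM nodeCartesian n "
        "INNER JOIN waypoint wp ON n.id = wp.nodeid "
        "INNER JOIN way w ON wp.wayid = w.id "
        "INNER JOIN middleTable m ON n.id = m.nodeid "
        "WHERE w.closed = 1 "
        "GROUP BY m.areaid;";
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);
    while (sqlite3_step(stmt) == SQLITE_ROW) {              // one result row per area
        int areaid  = sqlite3_column_int(stmt, 0);
        double minX = sqlite3_column_double(stmt, 1);
        double minY = sqlite3_column_double(stmt, 2);
        double maxX = sqlite3_column_double(stmt, 3);
        double maxY = sqlite3_column_double(stmt, 4);
        std::cout << "INSERT INTO ... " << areaid << ", " << minX << ", " << minY
                  << ", " << maxX << ", " << maxY << "\n";
    }
    sqlite3_finalize(stmt);
}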

Related

MariaDB: why does a multiple-table update not update a single row multiple times?

Today I was just bitten in the rear end by something I didn't expect. Here's a little script to reproduce the issue:
create temporary table aaa_state(id int, amount int);
create temporary table aaa_changes(id int, delta int);
insert into aaa_state(id, amount) values (1, 0);
insert into aaa_changes(id, delta) values (1, 5), (1, 7);
update aaa_changes c join aaa_state s on (c.id=s.id) set s.amount=s.amount+c.delta;
select * from aaa_state;
The final result in the aaa_state table is:

id | amount
---+-------
 1 |      5

Whereas I would expect it to be:

id | amount
---+-------
 1 |     12
What gives? I checked the docs but cannot find anything that would hint at this behavior. Is this a bug that I should report, or is this by design?
The behavior you are seeing is consistent with two updates happening on the aaa_state table. One update assigns the amount to 7, and then this amount is clobbered by the second update, which sets it to 5. This could be explained by MySQL using a snapshot of the aaa_state table to fetch the amount for each step of the update. If true, the actual steps would look something like this:
1. Join the two tables.
2. Update the amount using the "first" row from the changes table.
   The cached result for the amount is now 7, but this value will not actually
   be written out to the underlying table until AFTER the entire update.
3. Update the amount using the "second" row from the changes table.
   The cached amount is now 5.
4. The update is over; write 5 out as the actual amount.
Your syntax is not really correct for what you want to do. You should be using something like the following:
UPDATE aaa_state st
INNER JOIN
(
SELECT id, SUM(delta) AS delta_sum
FROM aaa_changes
GROUP BY id
) ac
ON ac.id = st.id
SET
st.amount = st.amount + ac.delta_sum;
Here we are doing a proper aggregation of the delta values for each id in a separate, bona fide subquery. This means that the delta sums will be properly computed and materialized in the subquery before MySQL does the join to update the first table.

How to select the key corresponding to highest value from histogram map?

I'm using the histogram() function (https://prestodb.github.io/docs/current/functions/aggregate.html).
It "Returns a map containing the count of the number of times each input value occurs."
The result may look something like this:
{ORANGES=1, APPLES=165, BANANAS=1}
Is there a function that will return APPLES given the above input?
XY Problem?
The astute reader may notice that the end result of histogram(), combined with what I'm trying to do, would be equivalent to the mythical Mode function, which exists in textbooks but not in real-world database engines.
Here's my complete query at this point. I'm looking for the most frequently occurring value of upper(cmplx) for each upper(address),zip tuple:
select * from (select upper(address) as address, zip,
(SELECT max_by(key, value)
FROM unnest(histogram(upper(cmplx))) as t(key, value)),
count(*) as N
from apartments
group by upper(address), zip) t1
where N > 3
order by N desc;
And the error...
SYNTAX_ERROR: line 2:55: Constant expression cannot contain column
references
Here's what I use to get the key that corresponds to the max value from an arbitrary map:
MAP_KEYS(mapname)[
ARRAY_POSITION(
MAP_VALUES(mapname),
ARRAY_MAX(MAP_VALUES(mapname))
)
]
Substitute your histogram map for 'mapname'.
Not sure how this solution compares computationally to the other answer, but I do find it easier to read.
You can convert the map you got from histogram to an array with map_entries. Then you can UNNEST that array to a relation and you can call max_by. Please see the below example:
SELECT max_by(key, value) FROM (
SELECT map_entries(histogram(clerk)) as entries from tpch.tiny.orders
)
CROSS JOIN UNNEST (entries) t(key, value);
EDIT:
As noted by @Alex R, you can also pass histogram results directly to UNNEST:
SELECT max_by(key, value) FROM (
SELECT histogram(clerk) as histogram from tpch.tiny.orders )
CROSS JOIN UNNEST (histogram) t(key, value);
In your question, the query part (SELECT max_by(key, value) FROM unnest(histogram(upper(cmplx))) ...) is a correlated subquery, which is not yet supported. However, the error you are seeing is misleading. IIRC Athena uses Presto 0.172, and this error reporting was fixed in 0.183 (see https://docs.starburstdata.com/latest/release/release-0.183.html; that was in July 2017, and btw map_entries was also added in 0.183).

Recommendation on Query Efficiency: 2 different versions

Which of these is the more efficient query to run:
one where the INCLUDE / DON'T INCLUDE filter condition is in the WHERE clause and tested for each row:
SELECT distinct fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t, unnest(hits) as ht
WHERE (select max(if(cd.index = 1,cd.value,null))from unnest(ht.customDimensions) cd)
= 'high_worth'
or one returning all rows and then having the outer SELECT do all the filtering to INCLUDE / DON'T INCLUDE:
SELECT distinct fullvisitorid
FROM
(
SELECT
fullvisitorid
, (select max(if(cd.index = 1,cd.value,null)) FROM unnest(ht.customDimensions) cd) hit_cd_1
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t
, unnest(hits) as ht
)
WHERE
hit_cd_1 = 'high_worth'
Both produce exactly the same results!
The goal is: the list of fullvisitorids that ever sent a hit-level custom dimension (index = 1) with value = 'high_worth'.
Thanks for your inputs!
Cheers!
/Vibhor
I tried the two queries and compared their explanations; they are identical. I am assuming some sort of optimization magic occurs before the query is run.
As for your original two queries: obviously, they are identical even though you slightly rearranged their appearance, so from those two you should choose whichever is easier for you to read/maintain. I would pick the first query, but it is really a matter of personal preference.
Meantime, try the below (BigQuery Standard SQL). It looks slightly more optimized to me, but I didn't have a chance to test it on real data:
SELECT DISTINCT fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
UNNEST(hits) AS ht, UNNEST(ht.customDimensions) cd
WHERE cd.index = 1 AND cd.value = 'high_worth'
Obviously, it should produce the same result as your two queries.
The execution plan looks better to me, and the query is faster and much easier to read/manage.

How to avoid CTE or subquery in SQL?

Question
Say we have 1 as foo, and we want foo+1 as bar in SQL.
With a CTE or subquery, like:
select foo+1 as bar from (select 1 as foo) as abc;
We would get (in Postgres, which is what I am using):
bar
-----
2
However, when I tried the following:
select 1 as foo, foo+1 as bar;
The following error occurs:
ERROR: column "foo" does not exist
LINE 1: select 1 as foo, foo+1 as bar;
^
Is there any way around this without the use of CTE or subquery?
Why do I ask?
I am using Django for a web service. To order and paginate objects in the database, I have to grab the counts of the upvotes and downvotes and do some extra mathematical manipulation on those two values (i.e. calculating the Wilson score interval), and those two values are used multiple times.
All I know I can work with right now, without breaking the ORM(?), is the extra() function [alongside, for example, lazy querysets and the prefetch_related() function].
Therefore I need a way to reference those two values from somewhere instead of doing the SELECT multiple times when I calculate the score. (Or is that not actually the case?)
P.S. Currently I am storing the vote counts as database fields and updating them, but I already have a model for a vote, so it seems redundant and slow to both update the vote count and insert the vote into the database.
No, you need the sub-query or CTE to do that. There is one alternative though: create a stored procedure.
CREATE FUNCTION wilson(upvote integer, downvote integer) RETURNS float8 AS $$
DECLARE
score float8;
BEGIN
-- Calculate the score
RETURN score;
END; $$ LANGUAGE plpgsql STRICT;
In your ORM you now call the function as part of your SELECT statement:
SELECT id, upvotes, downvotes, wilson(upvotes, downvotes) FROM mytable;
Also makes for cleaner code.
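For reference, the lower bound of the Wilson score interval that the function body would typically compute is (with \hat{p} the observed fraction of upvotes, n the total number of votes, and z the normal quantile, e.g. 1.96 for a 95% interval):

\[
\frac{\hat{p} + \frac{z^2}{2n} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
\]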

Slow Postgres JOIN Query

I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to get the query speedy?
Thanks!
You've provided the vital information here: the explain analyse output. That isn't showing a one-minute runtime though, it's showing about 26 milliseconds. So either that isn't the query being run, or the problem is elsewhere.
The only difference between explain analyse and a real application is that the results aren't actually returned. You would need a lot of data to slow things down to a minute though.
The other suggestions are all off the mark since they're ignoring the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half decent query planner).
So - you first need to figure out what exactly is slow...
Put an index on sp_article_categories.category_id
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows:
SELECT * FROM categories c
INNER JOIN articles a
ON c.category_id = 1081
AND c.category_id = a.category_id
And put an index on category_id like Steven suggests.
You can use field names instead of * too:
select [fields] from ...
I assume you have run analyze on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id, numeric? If it isn't, you should perhaps consider making it numeric (integer or bigint, whatever suits your needs). It can make a big difference in performance, in my experience. Hope it helps.
Cheers!
// John