Recommendation on Query Efficiency: 2 different versions - google-cloud-platform

Which of these is the more efficient query to run:
one where the INCLUDE / DON'T INCLUDE filter condition sits in the WHERE clause and is tested for each row:
SELECT DISTINCT fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
  UNNEST(hits) AS ht
WHERE (SELECT MAX(IF(cd.index = 1, cd.value, NULL))
       FROM UNNEST(ht.customDimensions) cd) = 'high_worth'
or one returning all rows, with the outer SELECT doing all the filtering to INCLUDE / DON'T INCLUDE:
SELECT DISTINCT fullvisitorid
FROM
(
  SELECT
    fullvisitorid,
    (SELECT MAX(IF(cd.index = 1, cd.value, NULL))
     FROM UNNEST(ht.customDimensions) cd) hit_cd_1
  FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
    UNNEST(hits) AS ht
)
WHERE hit_cd_1 = 'high_worth'
Both produce exactly the same results!
The goal is: a list of fullvisitorids for users who ever sent a hit-level Custom Dimension (index = 1) with value = 'high_worth'.
Thanks for your inputs!
Cheers!
/Vibhor

I tried the two queries and compared their explanations; they are identical. I am assuming some sort of optimization magic occurs before the query is run.

As for your original two queries: obviously, they are identical even though you slightly rearranged their appearance, so from those two you should choose whichever is easier for you to read/maintain. I would pick the first query, but it is really a matter of personal preference.
Meanwhile, try the query below (BigQuery Standard SQL) - it looks slightly more optimized to me, but I didn't have a chance to test it on real data:
SELECT DISTINCT fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
  UNNEST(hits) AS ht,
  UNNEST(ht.customDimensions) cd
WHERE cd.index = 1 AND cd.value = 'high_worth'
Obviously, it should produce the same result as your two queries: a visitor is included exactly when at least one of their hits carries a customDimension with index 1 and value 'high_worth'. The execution plan looks better to me, and the query is faster and much easier to read/manage.

Special character to query from latest timestamp sharded table in BigQuery

From
https://cloud.google.com/bigquery/docs/partitioned-tables:
you can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD
This enables me to do:
SELECT count(*) FROM `xxx.xxx.xxx_*`
and query across all the shards. Is there a special notation that queries only the latest shard? For example, say I had:
xxx_20180726
xxx_20180801
could I do something along the lines of
SELECT count(*) FROM `xxx.xxx.xxx_{{ latest }}`
to query xxx_20180801?
SINGLE QUERY INSPIRED BY Mikhail Berlyant:
SELECT COUNT(*) AS c
FROM `XXX.PREFIX_*`
WHERE _TABLE_SUFFIX IN (
  SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
  FROM `XXX.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'PREFIX_%')
If you do care about cost (meaning how many tables will be scanned by your query), the only way to limit it is to do it in two steps, like below.
First query
#standardSQL
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
FROM `xxx.xxx.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX_%'
Second query
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '<result of first query>'
So, if the result of the first query is 20180801, the second query will look like this:
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '20180801'
If you don't care about cost but just need the result, you can easily combine the above two queries into one, as sketched below - but, again, remember: even though the result will come from the last table only, the cost will be as if you queried all tables that match xxx.xxx.PREFIX_*.
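A minimal combined sketch (assuming the same PREFIX_ naming scheme as above; note that BigQuery still bills for every table matched by the wildcard, since the subquery is only resolved at run time):
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX IN (
  -- resolves to the latest shard's suffix, e.g. '20180801'
  SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
  FROM `xxx.xxx.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'PREFIX_%')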
I forgot to mention (even though it should be obvious): when you have only COUNT(1) in your SELECT, the cost will be 0 (zero) for both options - but in reality you will most likely select something more valuable than just COUNT(1).
I know this is kind of an old thread, but I was surprised that no one offered an answer using variables. "Héctor Neri" already mentioned this in the comments, but I thought it might be better to have an actual answer with sample code posted.
#standardSQL
DECLARE SHARD_DATE STRING;
SET SHARD_DATE = (
  SELECT MAX(REPLACE(table_name, '{TABLE}_', ''))
  FROM `{PRJ}.{DATASET}.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE '{TABLE}_20%'
);
SELECT * FROM `{PRJ}.{DATASET}.{TABLE}_*`
WHERE _TABLE_SUFFIX = SHARD_DATE
Make sure to replace {PRJ}, {DATASET}, and {TABLE} values with your table location.
If you run this in the BigQuery Web UI, you will see this message:
WARNING: Could not compute bytes processed estimate for script.
But you can see that the variable properly reduces the table scan to the latest shard and does not incur any extra cost after running the script.

How to combine two select statements in c++

For an assignment, I'm looking to make my code faster. I'm using the sqlite3 C/C++ API to perform tasks in order to eventually build an R-tree and a B-tree.
I am doing the assignment's tasks correctly, but unfortunately it's extremely slow. For my question, I'll first show simple mock tables, then a simple flow of my program.
Simplified table schemas:
areaTable (id int, closed int)
middleTable (nodeid int, areaid int)
nodeTable (id int, x float, y float)
The flow of my program is as follows:
query1
SELECT id FROM areaTable WHERE closed = 1;
Using query1, I save the resulting ids into a vector (we'll call it query1ResultsArray).
Then, using sqlite3_prepare_v2, I prepare a new select query:
query2
SELECT MIN(x), MIN(y)
FROM nodeTable
WHERE id IN
(
SELECT nodeid
FROM middleTable
WHERE areaid = ?
);
The idea of query2 is to find the minimum values of the nodes that get grouped together by middleTable and areaTable. I bind individual results from query1 into query2 using a for loop like the following:
sqlite3_stmt *stmt = nullptr;
sqlite3_prepare_v2(db, query2, -1, &stmt, nullptr);
sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr); // not sure if this helps
for (auto &id : query1ResultsArray) {
    sqlite3_bind_int(stmt, 1, id);            // bind(id)
    if (sqlite3_step(stmt) == SQLITE_ROW) {   // step(stmt)
        double x = sqlite3_column_double(stmt, 0);
        double y = sqlite3_column_double(stmt, 1);
        std::cout << "INSERT INTO ....";
    }
    sqlite3_reset(stmt);
}
sqlite3_exec(db, "END TRANSACTION;", nullptr, nullptr, nullptr);
sqlite3_finalize(stmt);
This solution appears to work: it gets the proper results I need to continue with the assignment's tasks (building insert statements), but it's very, very slow. I doubt the professor expects our programs to be this slow.
That was context for my question. The question itself is essentially:
Can I combine my two select statements? By combining them, I would circumvent the constant binding and resetting, which I hope (with no knowledge to back it up) will speed up my program.
I've tried the following:
SELECT MIN(x), MIN(y), MAX(x), MAX(y)
FROM nodeCartesian
WHERE id IN
(
SELECT nodeid
FROM waypoint
WHERE wayid IN
(
SELECT id
FROM way
WHERE closed = 1
)
);
But this gets the minimum of all nodes since they don't get properly grouped together into their respective 'areas'.
P.S. I am dealing with a 2D R-tree, so I know what I wrote isn't correct, but I just wrote what I'm having difficulty with. Also, I tried researching how to apply inner joins to my statement but couldn't figure out how :(, so if you think that may help my performance as well, I would love to hear it. One more thing: query1 deals with 2+ million rows, while query2 deals with approximately 340,000 rows, and I estimated it would take about 1 day for query2 to finish.
Thanks
I am not sure about your schema; however, I think something like this, adding a GROUP BY on your area, should do it:
SELECT m.areaid, MIN(n.x), MIN(n.y), MAX(n.x), MAX(n.y)
FROM
nodeCartesian n
INNER JOIN waypoint wp ON n.id = wp.nodeid
INNER JOIN way w ON wp.wayid = w.id
INNER JOIN middleTable m ON n.id = m.nodeid
WHERE
w.closed = 1
GROUP BY
m.areaid
Note: calling a SELECT query multiple times in a loop is a bad idea, because each call has considerable overhead, which makes it really slow. Making a single query that returns all the relevant rows and then looping through them in code is much faster; see the sketch below.
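The same grouped-query idea mapped back onto the simplified schema from the question (table and column names taken from areaTable / middleTable / nodeTable above) would look roughly like this:
SELECT m.areaid,
       MIN(n.x) AS min_x, MIN(n.y) AS min_y,
       MAX(n.x) AS max_x, MAX(n.y) AS max_y
FROM nodeTable n
INNER JOIN middleTable m ON n.id = m.nodeid
INNER JOIN areaTable a ON m.areaid = a.id
WHERE a.closed = 1
GROUP BY m.areaid;
A single pass over this result set in code then replaces the whole bind/step/reset loop.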

A simple order_by() query

I have a table with columns a, b, c, d, e.
Each column is given a random numerical value.
I can then order by that numerical value:
query_results = UserProfile.objects.all().order_by('a')
What I want is the ability to order by the sum of 2, 3, or even 5 columns, but I can't figure out how to do it cleanly.
My answer so far has been somewhat convoluted: I add an extra column x.
cursor.execute("UPDATE mysite_userprofile SET x = a+b;")
I then order by x, and I have to update the column x each time I want different columns added together.
This is not the way to do it, I know. I know order_by('a+b') doesn't work, but is there another solution just as simple? I looked on Django's site but couldn't find anything.
The best way is to use extra():
UserProfile.objects.extra(
    select={'sum_of_fields': 'a + b + c + d + e'},
    order_by=('sum_of_fields',)
)
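Roughly speaking, that should translate into SQL along these lines (table name mysite_userprofile taken from the question; exact column quoting varies by backend):
SELECT *, (a + b + c + d + e) AS sum_of_fields
FROM mysite_userprofile
ORDER BY sum_of_fields;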
I think you can order the results in the SQL query itself: when you select the columns, build the sum expression into the query string in the background, much like the SQL shown above.

Django: Distinct foreign keys

class Log(models.Model):
    project = models.ForeignKey(Project)
    msg = models.CharField(...)
    date = models.DateField(...)
I want to select the four most recent Log entries, where each Log entry must have a unique project foreign key. I've tried the solutions found via Google search, but none of them works, and the Django documentation isn't very helpful for this kind of lookup.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But these either return nothing or Log entries from the same project.
Any help would be appreciated!
Queries don't work like that, either in Django's ORM or in the underlying SQL. If you want unique project IDs, you can only query for those IDs, so you'll need two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id', flat=True).distinct()[:4]
entries = Log.objects.filter(project_id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, MAX(logs.date) AS max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
SELECT project_id, MAX(logs.date) AS max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
    SELECT logs.field1, logs.field2, logs.field3, logs.date,
           rank() OVER ( PARTITION BY project_id
                         ORDER BY "date" DESC ) AS dateorder
    FROM logs ) AS logsort
WHERE dateorder = 1
ORDER BY "date" DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
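For reference, that should generate SQL roughly like the following, so note that it returns distinct project ids rather than full Log rows:
SELECT DISTINCT "log"."project_id"
FROM "log"
ORDER BY "log"."project_id" ASC
LIMIT 4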
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
from django.db.models import Max

latest_ids_per_project = Log.objects.values_list(
    'project').annotate(latest=Max('date')).order_by(
    '-latest').values_list('project')
log_objects = Log.objects.filter(
    id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg"
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC

Slow Postgres JOIN Query

I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to speed the query up?
Thanks!
You've provided the vital information here - the EXPLAIN ANALYZE output. That isn't showing a one-minute runtime, though; it's showing about 26 milliseconds. So either that isn't the query being run, or the problem is elsewhere.
The only difference between EXPLAIN ANALYZE and a real execution is that the results aren't actually returned to the client. You would need a lot of data to slow things down that much, though.
The other suggestions are all off the mark, since they ignore the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan), and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half-decent query planner).
So you first need to figure out what exactly is slow...
Put an index on sp_article_categories.category_id
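For example, a minimal sketch (the index name is illustrative; per the question, such an index may already be in place):
CREATE INDEX sp_article_categories_category_id_idx
    ON sp_article_categories (category_id);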
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it and the WHERE conditions are applied to that table before it joins to another.
So see if you can get Django to select from the categories side first and filter on category_id before joining to the article table.
Roughly, using the link table (articles and categories are related many-to-many):
SELECT a.*
FROM sp_article_categories c
INNER JOIN sp_article a
    ON a.id = c.article_id
WHERE c.category_id = 1081
  AND a.post_count <= 50
And put an index on category_id like Steven suggests.
You can use explicit field names instead of * too:
SELECT [fields] FROM ....
I assume you have run ANALYZE on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id - numeric? If it isn't, you should perhaps consider making it numeric - integer or bigint, whatever suits your needs. It can make a big difference in performance, in my experience.
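If you go that route, a hypothetical conversion could look like this (assuming the current values are castable to bigint; back up the table first):
ALTER TABLE sp_article_categories
    ALTER COLUMN article_id TYPE bigint USING article_id::bigint;
Hope it helps.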
Cheers!
// John