Select top X results per group - grouping

I have a bunch of RDF Data Cube observations that have an attached attribute, in my case the date on which that value was recorded.
The pattern is simple, for example (leaving out other dimensions/measures/attributes):
<obs1> a qb:Observation ;
    my:lastupdate '2017-12-31'^^xsd:date ;
    qb:dataSet <dataSet1> .
<obs2> a qb:Observation ;
    my:lastupdate '2016-12-31'^^xsd:date ;
    qb:dataSet <dataSet1> .
<obs2_1> a qb:Observation ;
    my:lastupdate '2017-12-31'^^xsd:date ;
    qb:dataSet <dataSet2> .
<obs2_2> a qb:Observation ;
    my:lastupdate '2015-12-31'^^xsd:date ;
    qb:dataSet <dataSet2> .
So I have multiple qb:DataSets in my store. Now I would like to figure out the last X my:lastupdate values per dataset. Let's say I want the last 5 values for each particular dataset.
I can do that very easily for one particular dataset:
SELECT * WHERE {
  ?observation my:lastupdate ?datenstand ;
               qb:dataSet <dataSet1>
} ORDER BY DESC(?datenstand) LIMIT 5
But I'm a bit lost as to whether this is at all possible within a single SPARQL query, per dataset. I tried various combinations of sub-selects, LIMIT and GROUP BY, but nothing led to the result I am looking for.

This query pattern was discussed at length on the now defunct SemanticOverflow Q+A site as 'get the 3 largest cities for each country', and the general consensus was that queries of the form 'get the top n related items for each master item' cannot be handled efficiently with a single SPARQL query.
The core issue is that nested queries are evaluated bottom-up and GROUP/LIMIT clauses will apply to the whole result set rather than to each group.
The only useful exception to the bottom-up rule is (NOT) EXISTS filters, which have visibility into the current bindings. You can take advantage of this fact to write queries like:
select ?country ?city ?population where {
    ?country a :Country ; :city ?city .
    ?city :population ?population .
    filter not exists {
        select * where {
            ?country :city ?_city .
            ?_city :population ?_population .
            filter ( ?_population > ?population )
        } offset 3
    }
} order by ?country desc(?population)
Unfortunately this approach is not usually viable on large real-world datasets, as it involves scanning and filtering the cartesian product of each country/city group.
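For completeness, here is a sketch of the same pattern adapted to the observations above; it assumes the qb: and my: prefixes are declared, and that your engine injects the outer bindings into the nested sub-select (support for this varies between stores):
SELECT ?dataSet ?observation ?datenstand WHERE {
    ?observation qb:dataSet ?dataSet ;
                 my:lastupdate ?datenstand .
    # keep ?observation only if fewer than 5 observations in the same dataset
    # have a strictly later my:lastupdate, i.e. it is among the 5 most recent
    # (ties can push the count above 5)
    FILTER NOT EXISTS {
        SELECT * WHERE {
            ?newer qb:dataSet ?dataSet ;
                   my:lastupdate ?newerDate .
            FILTER ( ?newerDate > ?datenstand )
        } OFFSET 4
    }
} ORDER BY ?dataSet DESC(?datenstand)
As with the country/city example, expect this to be slow on large stores, since each observation is effectively compared against every other observation in its dataset.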

BigQuery MERGE statement billing more bytes than editor shows

I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.
The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched and IdParent in unnest(parentList) then
insert (<all the fields>)
values (<all the fields>)
;
So I:
- pull the IdParent list from the staging table to know which partitions to prune,
- limit the partitions of the target table in the join predicate,
- also limit the partitions of the target table in the matched/not matched conditions.
The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) conditions, it shows ~250GB to bill in the editor (as expected, since there's no partition pruning). If I add the IdParent in unnest(parentList) back in so the script is exactly like you see it above, i.e. attempting to partition prune, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB.
The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.
Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.
That's very interesting. What you did seems totally correct.
It seems the BQ query planner can interpret your SQL correctly and knows that partition pruning is available, but when it executes, it fails to apply it.
Try removing t.IdParent in unnest(parentList) from both WHEN clauses to see if the issue still happens, that is:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched then
insert (<all the fields>)
values (<all the fields>)
;
It would be a good idea to submit a bug report to BigQuery if it can't be resolved.
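To compare the editor's estimate with what a job actually billed, you can also check the jobs view in INFORMATION_SCHEMA; this is just a sketch, and the region qualifier and time window are assumptions you'll need to adjust for your project:
-- bytes actually processed/billed by recent MERGE jobs in this project
-- (`region-us` is an assumption; use the region your dataset lives in)
SELECT
  job_id,
  creation_time,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type = 'MERGE'
ORDER BY creation_time DESC;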

Workaround to use slicer values in measures that behave like column calculations in Power BI

I'm trying to use slicer values as a calculated column, or something that works like one.
I've seen this post
https://community.powerbi.com/t5/Desktop/Slicer-Value-in-Column-Formula/m-p/214892#M95071
but I'm not sure how to proceed with the following case.
I have records from a sort of SCD (slowly changing dimension) with ValidStartDate and ValidEndDate.
The user should be able to set two slicers: AnalysisStartDate and AnalysisEndDate.
I should be able to count records based on those two dates, for instance:
how many records have ValidStartDate between AnalysisStartDate and AnalysisEndDate?
how many records have ValidEndDate between AnalysisStartDate and AnalysisEndDate?
Any help appreciated.
Looks like I managed to get what I wanted.
First you need a "measure" version of the columns you want to use in the calculation, using FIRSTDATE() for instance -- I think it's very important to create the measure in the same table.
Then capture the slicer value in a measure using something like the following:
if it has one value, take that value, otherwise use the first value (or whatever you want)
x Analisis Inicio = IF(HASONEVALUE(TD_FECHAS_INICIO[DT_ANALISIS_INICIO]);VALUES(TD_FECHAS_INICIO[DT_ANALISIS_INICIO]);FIRSTDATE(TD_FECHAS_INICIO[DT_ANALISIS_INICIO].[Date]))
Now you can start creating measures which compare both
x SW_ES_ALTA =
IF(
AND([x Inicio Measure] >= [x Analisis Inicio]
; [x Inicio Measure] <= [x Analisis Fin])
;"SI"
;"NO"
)
and even counts of this last measure
x HC_ES_ALTA = COUNTAX(FILTER(ZZ_FLAGS_INMUEBLE;[x SW_ES_ALTA]="SI");ZZ_FLAGS_INMUEBLE[ID_INMUEBLE])
Not the easiest path, and probably you can put several of these measures in a single one, but if it works, it works...
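For reference, here is a sketch of the same pattern using the column names from the question (with comma argument separators); the AnalysisDates and Registers table names and the AnalysisEnd companion measure are assumptions, and AnalysisDates is assumed to be a disconnected slicer table:
-- capture the slicer selection (or fall back to the first date)
AnalysisStart =
IF (
    HASONEVALUE ( AnalysisDates[AnalysisStartDate] ),
    VALUES ( AnalysisDates[AnalysisStartDate] ),
    FIRSTDATE ( AnalysisDates[AnalysisStartDate] )
)
-- count records whose ValidStartDate falls inside the analysis window
StartedInWindow =
CALCULATE (
    COUNTROWS ( Registers ),
    FILTER (
        Registers,
        Registers[ValidStartDate] >= [AnalysisStart]
            && Registers[ValidStartDate] <= [AnalysisEnd]
    )
)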

Recommendation on Query Efficiency: 2 different versions

Which of these is the more efficient query to run?
One where the include/exclude filter condition is in the WHERE clause and tested for each row:
SELECT distinct fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t, unnest(hits) as ht
WHERE (select max(if(cd.index = 1,cd.value,null))from unnest(ht.customDimensions) cd)
= 'high_worth'
One returning all rows, with the outer SELECT clause doing all the include/exclude filtering:
SELECT distinct fullvisitorid
FROM
(
SELECT
fullvisitorid
, (select max(if(cd.index = 1,cd.value,null)) FROM unnest(ht.customDimensions) cd) hit_cd_1
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t
, unnest(hits) as ht
)
WHERE
hit_cd_1 = 'high_worth'
Both produce exactly the same results!
The goal is a list of fullvisitorid values that ever sent a hit-level custom dimension (index = 1) with value 'high_worth'.
Thanks for your inputs!
Cheers!
/Vibhor
I tried the two queries and compared their explanations; they are identical. I am assuming some sort of optimization magic occurs before the query is run.
As for your original two queries: they are effectively identical, even though the appearance is slightly rearranged, so you should choose whichever is easier for you to read and maintain. I would pick the first query, but it is really a matter of personal preference.
In the meantime, try the query below (BigQuery Standard SQL). It looks slightly more optimized to me, but I haven't had a chance to test it on real data:
SELECT DISTINCT fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
UNNEST(hits) AS ht, UNNEST(ht.customDimensions) cd
WHERE cd.index = 1 AND cd.value = 'high_worth'
It should produce the same result as your two queries.
The execution plan looks better to me, and the query is faster and much easier to read and maintain.

How to combine two SELECT statements in C++

For an assignment, I'm looking to make my code faster. I'm using the sqlite3 C++ API to perform tasks in order to eventually build an r-tree and b-tree.
I am doing the assignment's tasks correctly, but unfortunately it's extremely slow. For my question, I'll first show simple mock tables, then show a simple flow of my program.
Simplified table schemas:
areaTable (id int, closed int)
middleTable (nodeid int, areaid int)
nodeTable (id int, x float, y float)
The flow of my program is as follows:
query1
SELECT id FROM areaTable WHERE closed = 1;
Using query1 I save the resulting ids into a vector (we'll call it query1ResultsArray).
Then using sqlite3_prepare_v2 I prepare a new select query:
query2
SELECT MIN(x), MIN(y)
FROM nodeTable
WHERE id IN
(
SELECT nodeid
FROM middleTable
WHERE areaid = ?
);
The idea of query 2 is to find the minimum values of the nodes that get grouped together by middleTable and areaTable. I bind individual results from query1 into query2 using a for loop like the following:
sqlite3_stmt *stmt = nullptr;
sqlite3_prepare_v2(db, query2, -1, &stmt, nullptr);
sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr); // not sure if this helps
for (auto &id : query1ResultsArray) {
    sqlite3_bind_int(stmt, 1, id);                 // bind the current areaid
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        double x = sqlite3_column_double(stmt, 0); // MIN(x)
        double y = sqlite3_column_double(stmt, 1); // MIN(y)
        std::cout << "INSERT INTO ....";           // build the insert statement from x and y
    }
    sqlite3_reset(stmt);
}
sqlite3_exec(db, "END TRANSACTION", nullptr, nullptr, nullptr);
sqlite3_finalize(stmt);
This solution appears to work. It gets the proper results I need to continue with the assignment's tasks (building insert statements), but it's very, very slow. I doubt the professor expects our programs to be this slow.
This was context for my question. The question itself is essentially:
Am I able to combine my two select statements? By combining the select statements I would be able to circumvent the constant binding and resetting which I hope (with no knowledge to back it up) will speed up my program.
I've tried the following:
SELECT MIN(x), MIN(y), MAX(x), MAX(y)
FROM nodeCartesian
WHERE id IN
(
SELECT nodeid
FROM waypoint
WHERE wayid IN
(
SELECT id
FROM way
WHERE closed = 1
)
);
But this gets the minimum of all nodes since they don't get properly grouped together into their respective 'areas'.
P.S. I am dealing with a 2D r-tree, so I know what I wrote isn't correct, but I just wrote what I'm having difficulty with. Also, I tried researching how to apply inner joins to my statement, but couldn't figure out how :(, so if you think that may help my performance as well, I would love to hear it. Another thing is that query1 deals with 2+ million rows, while query2 deals with approximately 340,000 rows, and I estimated that it will take about 1 day for query2 to finish.
Thanks
I am not sure about your schema; however, I think that something like this, with a GROUP BY on your area, should do it:
SELECT m.areaid, MIN(n.x), MIN(n.y), MAX(n.x), MAX(n.y)
FROM
nodeCartesian n
INNER JOIN waypoint wp ON n.id = wp.nodeid
INNER JOIN way w ON wp.wayid = w.id
INNER JOIN middleTable m ON n.id = m.nodeid
WHERE
w.closed = 1
GROUP BY
m.areaid
Note: calling a SELECT query multiple times in a loop is a bad idea, because each call has significant overhead, which makes it really slow. Making a single query that returns all the relevant rows and then looping through them in code is much faster.
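To illustrate that note, here is a minimal sketch of running the grouped query once and iterating its rows with the sqlite3 C API; it uses the mock table names from the question (nodeTable, middleTable, areaTable), so adjust them to your real schema:
#include <sqlite3.h>
#include <iostream>

// Compute one bounding box per area in a single query, then loop over the rows.
void printBoundingBoxes(sqlite3 *db) {
    const char *sql =
        "SELECT m.areaid, MIN(n.x), MIN(n.y), MAX(n.x), MAX(n.y) "
        "FROM nodeTable n "
        "INNER JOIN middleTable m ON n.id = m.nodeid "
        "INNER JOIN areaTable a ON m.areaid = a.id "
        "WHERE a.closed = 1 "
        "GROUP BY m.areaid;";

    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        std::cerr << sqlite3_errmsg(db) << '\n';
        return;
    }
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        int areaid  = sqlite3_column_int(stmt, 0);
        double minx = sqlite3_column_double(stmt, 1);
        double miny = sqlite3_column_double(stmt, 2);
        double maxx = sqlite3_column_double(stmt, 3);
        double maxy = sqlite3_column_double(stmt, 4);
        std::cout << "area " << areaid << ": "
                  << minx << ' ' << miny << ' ' << maxx << ' ' << maxy << '\n';
    }
    sqlite3_finalize(stmt);
}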

How to query disk used / available on dashDB

I would like to programmatically query the disk space used and the space remaining. How can I do this in dashDB?
In Oracle, I could do something like this:
column dummy noprint
column pct_used format 999.9 heading "%|Used"
column name format a16 heading "Tablespace Name"
column bytes format 9,999,999,999,999 heading "Total Bytes"
column used format 99,999,999,999 heading "Used"
column free format 999,999,999,999 heading "Free"
break on report
compute sum of bytes on report
compute sum of free on report
compute sum of used on report
set linesize 132
set termout off
select a.tablespace_name name,
b.tablespace_name dummy,
sum(b.bytes)/count( distinct a.file_id||'.'||a.block_id ) bytes,
sum(b.bytes)/count( distinct a.file_id||'.'||a.block_id ) -
sum(a.bytes)/count( distinct b.file_id ) used,
sum(a.bytes)/count( distinct b.file_id ) free,
100 * ( (sum(b.bytes)/count( distinct a.file_id||'.'||a.block_id )) -
(sum(a.bytes)/count( distinct b.file_id ) )) /
(sum(b.bytes)/count( distinct a.file_id||'.'||a.block_id )) pct_used
from sys.dba_free_space a, sys.dba_data_files b
where a.tablespace_name = b.tablespace_name
group by a.tablespace_name, b.tablespace_name;
How would I do similar with dashDB?
A simple and fast method is to look in the catalog, which is eventually up to date (statistics collection runs internally at certain intervals, and the catalog tables are then updated with the latest stats):
select substr(a.tabname,1,30), (a.fpages*PAGESIZE/1024) as size_k, a.card from syscat.tables a, syscat.tablespaces b where a.TBSPACEID=b.TBSPACEID ;
A more accurate but costly method is this:
SELECT TABSCHEMA, TABNAME, SUM(DATA_OBJECT_P_SIZE) + SUM(INDEX_OBJECT_P_SIZE)+ SUM(LONG_OBJECT_P_SIZE) + SUM(LOB_OBJECT_P_SIZE)+ SUM(XML_OBJECT_P_SIZE) FROM SYSIBMADM.ADMINTABINFO where tabschema='' and tabname='' group by tabschema,tabname;
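If your user has the required authority and the procedure is exposed on your dashDB plan (an assumption worth verifying), DB2's GET_DBSIZE_INFO procedure is another way to get the database size and capacity in one call:
-- Returns a snapshot timestamp, database size (bytes) and database capacity (bytes).
-- The last argument is the refresh window in minutes; -1 means use the default.
CALL SYSPROC.GET_DBSIZE_INFO(?, ?, ?, -1);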
There's currently no API call for this. (Find the available API calls here: https://developer.ibm.com/clouddataservices/docs/dashdb/rest-api/) At this time, the only way to tell how much space you're using or have left is via the dashDB UI. The dashDB team is exploring additional possibilities, I know. I'll post here again if I learn more.