How to tell which zone maps (blocks) are read by an Amazon Redshift query

Is there any way to tell whether zone maps were used by a specific query?
Is there a way to list the blocks a query read?
My query is taking more time than expected, and I just want to make sure it is using zone maps to filter out blocks.

The system table STL_SCAN contains this information:
is_rrscan indicates whether the scan used range-restricted scanning (zone maps).
rows_pre_user_filter is the row count before zone map restrictions.
rows_pre_filter is the row count after zone map restrictions.
rows is the row count after all predicates were evaluated.
-- Replace 999999 with the ID of the query you want to inspect.
SELECT query, segment, tbl, perm_table_name, is_rrscan,
       SUM(rows_pre_user_filter) AS rows_on_table,
       SUM(rows_pre_filter)      AS rows_scanned,
       SUM(rows)                 AS rows_returned
FROM stl_scan
WHERE query = 999999
GROUP BY 1, 2, 3, 4, 5
ORDER BY 1, 2, 3, 4, 5;
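If you don't know the query ID, you can look it up in STL_QUERY by searching the statement text. A minimal sketch; the LIKE pattern is a placeholder you would adapt:

-- Find the IDs of recently run queries matching some text.
SELECT query, starttime, TRIM(querytxt) AS sql_text
FROM stl_query
WHERE querytxt LIKE '%your_table_name%'
ORDER BY starttime DESC
LIMIT 10;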

Related

Data comparisons in Qubole

I am very new to Qubole. We recently migrated Oracle E-Business Suite (Ebiz) data to Salesforce, and we have both Ebiz and Salesforce data in the Qubole Data Lake. There are some discrepancies between Ebiz and Salesforce. What technology can I use on Qubole to find these discrepancies?
This is the approach I use to compare two tables.
Aggregate all metrics in both tables, grouped by all dimensions, then compare the results using a FULL JOIN: it returns all matched and unmatched records from both tables. This way you can find rows absent from either table as well as differences in metrics.
For example, using Hive:
with
sf as (
    select dimension1, dimension2, ... dimensionN,
           sum(metric1) as metric1,
           sum(metric2) as metric2,
           ...
           sum(metricN) as metricN,
           count(*) as cnt
    from Salesforce_table
    group by dimension1, dimension2, ... dimensionN
),
eb as (
    select dimension1, dimension2, ... dimensionN,
           sum(metric1) as metric1,
           sum(metric2) as metric2,
           ...
           sum(metricN) as metricN,
           count(*) as cnt
    from Ebiz_table
    group by dimension1, dimension2, ... dimensionN
)
--compare data
select sf.*, eb.*
from sf full join eb on NVL(sf.dimension1, '') = NVL(eb.dimension1, '')
    and sf.dimension2 = eb.dimension2
    ...
    and sf.dimensionN = eb.dimensionN
--filter discrepancies only
where ( sf.metric1 != eb.metric1
     or sf.metric2 != eb.metric2
     ...
     or sf.metricN != eb.metricN
     or sf.cnt != eb.cnt
     or sf.dimension1 is null
     or eb.dimension1 is null
      )
You can also export the joined result and compare it in Excel instead of filtering in the WHERE clause.
Metrics are anything that can be aggregated. You can also derive metrics from dimensions, for example count(distinct user) as user_cnt grouped by date, site_name. The query with the full join will show the differences. If dimensions used in the join condition can be null, use nvl() to match such rows, as in the example above. Do not use too many dimensions in the group by; you can skip some of them and drill down only after finding discrepancies at the aggregated level.
Once you have found a discrepancy in the aggregations, you can drill down and compare the non-aggregated rows, filtered by some of the metrics.
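As a concrete illustration of the template, here is a minimal sketch with invented names (sf_orders, eb_orders, region, amount are assumptions, not from the question), comparing one string dimension and one metric:

with
sf as (
    select region, sum(amount) as amount, count(*) as cnt
    from sf_orders
    group by region
),
eb as (
    select region, sum(amount) as amount, count(*) as cnt
    from eb_orders
    group by region
)
select sf.*, eb.*
from sf full join eb
    on nvl(sf.region, '') = nvl(eb.region, '')
-- keep only regions missing on one side or differing in metrics
where sf.amount != eb.amount
   or sf.cnt != eb.cnt
   or sf.region is null
   or eb.region is null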
See also: https://stackoverflow.com/a/67382947/2700344

SQLite query to remove duplicates from one column, where removal depends on a second column

Please have a look at the following data example:
In this table I have multiple columns and no PRIMARY KEY. As per the image I attached, there are a few duplicate values in STK_CODE. I want to remove duplicate rows based on the (min) column: one stk_code can have three different rows, each with a different value in (min), and I want to keep only the row with the minimum value in (min).
I am very new to SQLite, and I am using the C API from C++ (linking with -lsqlite3).
Is there any way to do this?
Your table has rowid as its primary key.
Use it to select the rowids of the rows that you want to keep, and delete the rest:
DELETE FROM comparison
WHERE rowid NOT IN (
    -- one rowid per STK_CODE: the only row when the code is unique,
    -- otherwise the row holding the minimum positive (min) value
    SELECT rowid
    FROM comparison
    GROUP BY STK_CODE
    HAVING (COUNT(*) = 1 OR MIN(CASE WHEN min > 0 THEN min END))
);
This code uses rowid as a bare column, relying on a documented SQLite feature: in a query that uses the MIN() or MAX() aggregate function, bare columns take their values from the row that contains the minimum or maximum value.
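Since the original table isn't shown, here is a self-contained sketch with invented data demonstrating the behaviour:

-- Hypothetical data: STK_CODE 'A' is duplicated, 'B' is not.
CREATE TABLE comparison (STK_CODE TEXT, min REAL);
INSERT INTO comparison VALUES ('A', 5), ('A', 3), ('A', 9), ('B', 7);

DELETE FROM comparison
WHERE rowid NOT IN (
    SELECT rowid
    FROM comparison
    GROUP BY STK_CODE
    HAVING (COUNT(*) = 1 OR MIN(CASE WHEN min > 0 THEN min END))
);

-- Leaves two rows: ('A', 3) and ('B', 7).
SELECT STK_CODE, min FROM comparison;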

Remove duplicates based on sort

I have a customers table with IDs and some datetime columns, but the IDs have duplicates and I just want to analyse distinct ID values.
I tried using GROUP BY, but it makes the process very slow.
Due to data sensitivity I can't share the data.
Any suggestions would be helpful.
I'd suggest using ROW_NUMBER(). This lets you rank the rows by chosen columns and then pick out the first result.
Since you've shared no data or table and column names, here's an example based on the AdventureWorks database. The technique is the same: partition by whatever makes the group of rows you want to deduplicate unique (ProductKey below), and order so that the version you want to keep comes first (total children, birth date, and customer key in this example).
USE AdventureWorksDW2017;

WITH CustomersOrdered AS
(
    SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate,
           ROW_NUMBER() OVER (
               PARTITION BY S.ProductKey
               ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC
           ) AS CustomerSequence
    FROM dbo.FactInternetSales AS S
    INNER JOIN dbo.DimCustomer AS C
        ON S.CustomerKey = C.CustomerKey
)
SELECT ProductKey, CustomerKey
FROM CustomersOrdered
WHERE CustomerSequence = 1
ORDER BY ProductKey, CustomerKey;
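If the goal is to remove the duplicate rows rather than select around them, the same ranking can drive a DELETE, since SQL Server lets you delete through a single-table CTE. A sketch with hypothetical names (dbo.Customers, CustomerID, CreatedAt), keeping the most recent row per ID:

WITH Ranked AS
(
    SELECT CustomerID,
           ROW_NUMBER() OVER (
               PARTITION BY CustomerID
               ORDER BY CreatedAt DESC
           ) AS rn
    FROM dbo.Customers
)
DELETE FROM Ranked
WHERE rn > 1;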
You can also just sort the rows by the date column, then select the ID column and remove the duplicates...

DAX Query to Get Distinct Items from Multiple Tables

Problem
I'm trying to generate a table of distinct email addresses from multiple source tables. However, with UNION as the outermost function, it isn't generating a truly distinct list.
Code
Participants = UNION(DISTINCT('Registrations'[Email Address]), DISTINCT( 'EnteredTickets'[Email]))
Note that while I'm starting with just two source tables, I'll need to expand this to three or four by the end of it.
Using VALUES on each column reference and wrapping the whole statement in one more DISTINCT did the trick:
Participants = DISTINCT(UNION(VALUES('Registrations'[Email Address]), VALUES( 'EnteredTickets'[Email])))
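To fold in the third or fourth source table mentioned in the question, note that UNION accepts more than two table arguments, so the pattern extends directly; 'Waitlist'[Email] below is a hypothetical stand-in for an additional source:

Participants = DISTINCT(UNION(VALUES('Registrations'[Email Address]), VALUES('EnteredTickets'[Email]), VALUES('Waitlist'[Email])))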
If you want a bridge table with unique values across several different tables, use DISTINCT instead of VALUES:
Participants =
FILTER (
DISTINCT (
UNION (
TOPN ( 0, ROW ("NiceEmail", "asdf") ), -- adds zero rows table with nice new column name
DISTINCT ( 'Registrations'[Email Address] ),
DISTINCT ( 'EnteredTickets'[Email] )
)
),
[NiceEmail] <> BLANK () -- removes all blank emails
)
DISTINCT and VALUES may lead to different results: with VALUES you are likely to end up with an (unwanted) blank value in your list, because VALUES also returns the blank row added for invalid relationships.
Check this documentation:
https://learn.microsoft.com/en-us/dax/values-function-dax#related-functions
You might also find this link useful for giving your table of distinct values a specific column name:
DAX create empty table with specific column names and no rows

SQLite: SELECT IN by a 100K element list

I have an SQLite table of ~1M rows. Each row has the structure (docId, docBLOB), and each docBLOB is nearly 20 KB.
I have to perform SELECTs using an externally provided list of docIds, which may be nearly 100K elements long. How can I do this efficiently?
Maybe there is a way to make a SELECT * FROM docBlobTable WHERE docId IN ( [MEGALIST] ) statement work?
Put all the IDs into a temporary table, then use:
SELECT * FROM docBlobTable WHERE docId IN (SELECT ID FROM TempTable)
or:
SELECT docBlobTable.*
FROM docBlobTable
JOIN TempTable ON docBlobTable.docId = TempTable.ID
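As a sketch of the temporary-table setup (TempTable and ID as above), declaring ID as the primary key gives the lookup an index to probe, and wrapping the ~100K inserts in a single transaction keeps the load fast:

CREATE TEMP TABLE TempTable (ID INTEGER PRIMARY KEY);

BEGIN;
-- In practice, bind each docId to one prepared INSERT statement
-- in a loop from the host language rather than writing literals.
INSERT INTO TempTable (ID) VALUES (42);
INSERT INTO TempTable (ID) VALUES (107);
COMMIT;

SELECT docBlobTable.*
FROM docBlobTable
JOIN TempTable ON docBlobTable.docId = TempTable.ID;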