BigQuery: Using LAST_VALUE() to skip values other than just NULLs

I understand that LAST_VALUE() with IGNORE NULLS can be used to skip NULL values in BigQuery.
Using "Channel" as the example column:
LAST_VALUE(Channel IGNORE NULLS) OVER (PARTITION BY OrderNo ORDER BY event_timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
However, say there is a value "ATM" (so not NULL) in Channel that I also want to skip. Is it possible to do that with LAST_VALUE()?
Thanks.

As you mentioned, LAST_VALUE supports IGNORE NULLS, so you can achieve your goal by first turning the unwanted value into NULL:
SELECT LAST_VALUE(IF(Channel = 'ATM', NULL, Channel) IGNORE NULLS) OVER (PARTITION BY OrderNo ORDER BY event_timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM `your_table`;
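
To make the logic of the query above easier to check, here is a minimal Python sketch of the same idea: values to skip are treated as NULL, then the last remaining value per OrderNo (ordered by event_timestamp) is taken. The function name, the sample rows, and the dict-based row format are all assumptions made for illustration. Unlike the window function, which repeats the result on every row, this sketch returns one value per OrderNo.

```python
from collections import defaultdict


def last_value_skipping(rows, skip=("ATM",)):
    """Return {OrderNo: latest Channel that is neither None nor in `skip`}."""
    by_order = defaultdict(list)
    for row in rows:
        by_order[row["OrderNo"]].append(row)

    result = {}
    for order_no, group in by_order.items():
        # ORDER BY event_timestamp ASC
        group.sort(key=lambda r: r["event_timestamp"])
        # Same idea as IF(Channel = 'ATM', NULL, Channel) ... IGNORE NULLS:
        # skipped values are dropped together with the real NULLs.
        kept = [
            r["Channel"]
            for r in group
            if r["Channel"] is not None and r["Channel"] not in skip
        ]
        result[order_no] = kept[-1] if kept else None
    return result


# Hypothetical sample data.
rows = [
    {"OrderNo": 1, "event_timestamp": 1, "Channel": "Web"},
    {"OrderNo": 1, "event_timestamp": 2, "Channel": "Branch"},
    {"OrderNo": 1, "event_timestamp": 3, "Channel": "ATM"},
    {"OrderNo": 1, "event_timestamp": 4, "Channel": None},
    {"OrderNo": 2, "event_timestamp": 1, "Channel": "ATM"},
]
print(last_value_skipping(rows))
```

An order whose only channel is "ATM" ends up with no value at all, which matches what the SQL returns when every value in the partition has been turned into NULL.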

Related

Sqlite Query to remove duplicates from one column. Removal depends on the second column

Please have a look at the following data example:
The table has multiple columns and no PRIMARY KEY. As the attached image shows, there are a few duplicates in STK_CODE, and I want to remove the duplicate rows depending on the (min) column.
In the image, one stk_code appears in three different rows, each with a different value in the (min) column. I want to keep the row that has the minimum value in the (min) column.
I am very new to SQLite and I am linking it into C++ with -lsqlite3.
Is there any way to do this?

Your table has rowid as primary key.
Use it to get the rowids that you don't want to delete:
DELETE FROM comparison
WHERE rowid NOT IN (
    SELECT rowid
    FROM comparison
    GROUP BY STK_CODE
    HAVING (COUNT(*) = 1 OR MIN(CASE WHEN min > 0 THEN min END))
)
This code uses rowid as a bare column and relies on a documented SQLite feature: when a query uses the MIN() or MAX() aggregate function, bare columns take their values from the row that contains the minimum or maximum value.
See a simplified demo.
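
For readers who prefer not to depend on the bare-column behaviour, here is a hedged alternative sketch using Python's built-in sqlite3 module. It is a different technique from the answer above: a row is deleted when another row with the same STK_CODE has a smaller "min" value, with ties broken by rowid. The sample data is invented, and the sketch deliberately ignores the min > 0 condition from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names follow the question; the rows are hypothetical.
conn.execute('CREATE TABLE comparison (STK_CODE TEXT, "min" REAL)')
conn.executemany(
    "INSERT INTO comparison VALUES (?, ?)",
    [("A", 5.0), ("A", 2.0), ("A", 9.0), ("B", 4.0), ("C", 7.0), ("C", 3.0)],
)

# Delete every row for which a "better" row exists in the same STK_CODE:
# one with a smaller "min", or the same "min" and a smaller rowid.
conn.execute(
    """
    DELETE FROM comparison
    WHERE EXISTS (
        SELECT 1
        FROM comparison AS other
        WHERE other.STK_CODE = comparison.STK_CODE
          AND (other."min" < comparison."min"
               OR (other."min" = comparison."min"
                   AND other.rowid < comparison.rowid))
    )
    """
)

remaining = conn.execute(
    'SELECT STK_CODE, "min" FROM comparison ORDER BY STK_CODE'
).fetchall()
print(remaining)
```

The rowid tie-break means exactly one row survives per STK_CODE even when the minimum value occurs more than once.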

Remove duplicates based on sort

I have a customers table with IDs and some datetime columns. The IDs contain duplicates, and I just want to analyse distinct ID values.
I tried using groupby, but this makes the process very slow.
Due to data sensitivity, I can't share the data.
Any suggestions would be helpful.

I'd suggest using ROW_NUMBER(). This lets you rank the rows by chosen columns, and you can then pick out the first result.
Given that you've shared no data or table and column names, here's an example based on the AdventureWorks database. The technique will be the same: partition by whatever identifies the group of rows you want to deduplicate (ProductKey below) and order so that the version you want to keep comes first (TotalChildren, BirthDate and CustomerKey in my example).
USE AdventureWorksDW2017;
WITH CustomersOrdered AS
(
    SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate
        , ROW_NUMBER() OVER (
            PARTITION BY S.ProductKey
            ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC
        ) AS CustomerSequence
    FROM dbo.FactInternetSales AS S
    INNER JOIN dbo.DimCustomer AS C
        ON S.CustomerKey = C.CustomerKey
)
SELECT ProductKey, CustomerKey
FROM CustomersOrdered
WHERE CustomerSequence = 1
ORDER BY ProductKey, CustomerKey;
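
The same ROW_NUMBER() pattern can be tried on a shape closer to the question. This is only a sketch: the customers table, its id and updated_at columns, and the sample rows are assumptions, and it runs against an in-memory SQLite database through Python's sqlite3 module (window functions need SQLite 3.25 or later) rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one id column with duplicates and one datetime column.
conn.execute("CREATE TABLE customers (id INTEGER, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "2021-01-05"),
        (1, "2021-03-10"),
        (2, "2021-02-01"),
        (3, "2021-01-20"),
        (3, "2021-01-15"),
    ],
)

# Number the rows within each id, newest first, then keep only row 1.
latest = conn.execute(
    """
    WITH ordered AS (
        SELECT id, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY updated_at DESC
               ) AS seq
        FROM customers
    )
    SELECT id, updated_at
    FROM ordered
    WHERE seq = 1
    ORDER BY id
    """
).fetchall()
print(latest)
```

Changing the ORDER BY inside the OVER clause is what decides which duplicate survives, for example ASC to keep the oldest row instead.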

You can also just sort by the date column, then select the ID column and remove duplicates.

ProxySQL data masking for multiple columns

I want to mask sensitive information on multiple columns in a table named my_table using ProxySQL.
I've followed this tutorial to successfully mask a single column named column_name in a table using the following mysql_query_rules:
/* only show the first character in column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,match_pattern,re_modifiers,replace_pattern,apply)
VALUES (1,1,'developer','my_table','(\(?)(`?\w+`?\.)?\`?column_name\`?(\)?)([ ,\n])','caseless,global',
"\1CONCAT(LEFT(\2column_name,1),REPEAT('X',CHAR_LENGTH(column_name)-1))\3 column_name\4",1);
But when I add a second rule for masking another column called second_column_name in the table, ProxySQL fails to mask the second column. Here's the second rule:
/* masking the last 3 characters in second_column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,match_pattern,re_modifiers,replace_pattern,apply)
VALUES (2,1,'developer','my_table','(\(?)(`?\w+`?\.)?\`?second_column_name\`?(\)?)([ ,\n])','caseless,global',
"\1CONCAT(LEFT(\2second_column_name,CHAR_LENGTH(second_column_name)-3),REPEAT('X',3))\3 second_column_name\4",1);
Here's the query result after the 2 rules are added:
SELECT column_name FROM my_table; returns a masked column_name.
SELECT second_column_name FROM my_table; returns a masked second_column_name.
SELECT column_name, second_column_name FROM my_table; returns data with column_name masked, but second_column_name is not masked.
SELECT second_column_name, column_name FROM my_table; also returns data with column_name masked, but second_column_name is not masked.
Does this mean that 1 query can only be matched with 1 rule?
How can I mask data in multiple columns with ProxySQL?

Using flagIN, flagOUT, and apply allows me to mask data on multiple columns.
Here's the final mysql_query_rules I have:
/* only show the first character in column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,flagIN,match_pattern,re_modifiers,flagOUT,replace_pattern,apply)
VALUES (1,1,'developer','my_db',0,'(\(?)(`?\w+`?\.)?\`?column_name\`?(\)?)([ ,\n])','caseless,global',6, "\1CONCAT(LEFT(\2column_name,1),REPEAT('X',CHAR_LENGTH(column_name)-1))\3 column_name\4",0);
/* masking the last 3 characters in second_column_name */
INSERT INTO mysql_query_rules (rule_id,active,username,schemaname,flagIN,match_pattern,re_modifiers,flagOUT,replace_pattern,apply)
VALUES (2,1,'developer','my_db',6,'(\(?)(`?\w+`?\.)?\`?second_column_name\`?(\)?)([ ,\n])','caseless,global',NULL,
"\1CONCAT(LEFT(\2second_column_name,CHAR_LENGTH(second_column_name)-3),REPEAT('X',3))\3 second_column_name\4",1);
The meanings of these three fields are as follows:
flagIN, flagOUT, apply - these allow us to create "chains of rules"
that get applied one after the other. An input flag value is set to
0, and only rules with flagIN=0 are considered at the beginning. When
a matching rule is found for a specific query, flagOUT is evaluated
and if NOT NULL the query will be flagged with the specified flag in
flagOUT. If flagOUT differs from flagIN, the query will exit the
current chain and enter a new chain of rules having flagIN as the
new input flag. If flagOUT matches flagIN, the query will be
re-evaluated against the first rule with said flagIN. This
happens until there are no more matching rules, or apply is set to 1
(which means this is the last rule to be applied)
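
To see why the chained version rewrites both columns, the description above can be modelled with a small Python simulation. This is a simplified sketch, not ProxySQL's implementation: the Rule class, the rewrite function, the col_a/col_b patterns and the MASK_A/MASK_B replacements are all hypothetical, rules are visited once in rule_id order, and the "flagOUT equals flagIN" re-evaluation case is not modelled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    rule_id: int
    flag_in: int
    pattern: str
    replacement: str
    flag_out: Optional[int]
    apply: bool


def rewrite(query, rules):
    """Simplified model of flagIN / flagOUT / apply rule chaining."""
    flag = 0  # every query starts with input flag 0
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if rule.flag_in != flag:
            continue  # rule belongs to a different chain
        if not re.search(rule.pattern, query, flags=re.IGNORECASE):
            continue  # rule does not match this query
        query = re.sub(rule.pattern, rule.replacement, query, flags=re.IGNORECASE)
        if rule.flag_out is not None:
            flag = rule.flag_out  # move to the chain with this flagIN
        if rule.apply:
            break  # last rule to be applied
    return query


# Both rules in chain 0 with apply=1, as in the question.
single = [
    Rule(1, 0, r"\bcol_a\b", "MASK_A(col_a)", None, True),
    Rule(2, 0, r"\bcol_b\b", "MASK_B(col_b)", None, True),
]
# Rule 1 hands over to chain 6 instead of stopping, as in the answer.
chained = [
    Rule(1, 0, r"\bcol_a\b", "MASK_A(col_a)", 6, False),
    Rule(2, 6, r"\bcol_b\b", "MASK_B(col_b)", None, True),
]

query = "SELECT col_a, col_b FROM my_table"
print(rewrite(query, single))
print(rewrite(query, chained))
```

Under this model, a query that mentions only col_b is left unchanged by the chained rules, because rule 1 never matches and the flag stays at 0. Whether a real ProxySQL instance behaves the same way is worth testing before relying on the chain.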

How to select the key corresponding to highest value from histogram map?

I'm using the histogram() function https://prestodb.github.io/docs/current/functions/aggregate.html
It "Returns a map containing the count of the number of times each input value occurs."
The result may look something like this:
{ORANGES=1, APPLES=165, BANANAS=1}
Is there a function that will return APPLES given the above input?
XY Problem?
The astute reader may notice the end-result of histogram() combined with what I'm trying to do, would be equivalent to the mythical Mode Function, which exists in textbooks but not in real-world database engines.
Here's my complete query at this point. I'm looking for the most frequently occurring value of upper(cmplx) for each upper(address),zip tuple:
select * from (select upper(address) as address, zip,
(SELECT max_by(key, value)
FROM unnest(histogram(upper(cmplx))) as t(key, value)),
count(*) as N
from apartments
group by upper(address), zip) t1
where N > 3
order by N desc;
And the error...
SYNTAX_ERROR: line 2:55: Constant expression cannot contain column references

Here's what I use to get the key that corresponds to the max value from an arbitrary map:
MAP_KEYS(mapname)[
    ARRAY_POSITION(
        MAP_VALUES(mapname),
        ARRAY_MAX(MAP_VALUES(mapname))
    )
]
Substitute your histogram map for 'mapname'.
Not sure how this solution compares computationally to the other answer, but I do find it easier to read.
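
The steps of that expression can be followed one by one in Python, using the sample histogram from the question. This is only an illustration of the lookup: it assumes that keys and values come back in the same order (Python dicts guarantee this, and the SQL expression appears to rely on the same property), and Python positions are 0-based whereas Presto arrays are 1-based.

```python
hist = {"ORANGES": 1, "APPLES": 165, "BANANAS": 1}

keys = list(hist.keys())              # MAP_KEYS(mapname)
values = list(hist.values())          # MAP_VALUES(mapname)
largest = max(values)                 # ARRAY_MAX(MAP_VALUES(mapname))
position = values.index(largest)      # ARRAY_POSITION(...), 0-based here
top_key = keys[position]              # MAP_KEYS(mapname)[position]
print(top_key)
```

When several keys share the largest count, the first one in map order is returned.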

You can convert the map you got from histogram to an array with map_entries. Then you can UNNEST that array to a relation and you can call max_by. Please see the below example:
SELECT max_by(key, value) FROM (
SELECT map_entries(histogram(clerk)) as entries from tpch.tiny.orders
)
CROSS JOIN UNNEST (entries) t(key, value);
EDIT:
As noted by @Alex R, you can also pass histogram results directly to UNNEST:
SELECT max_by(key, value) FROM (
SELECT histogram(clerk) as histogram from tpch.tiny.orders )
CROSS JOIN UNNEST (histogram) t(key, value);
In your question, the query part (SELECT max_by(key, value) FROM unnest(histogram(upper(cmplx))) as t(key, value)) is a correlated subquery, which is not yet supported. However, the error you are seeing is misleading. IIRC Athena is using Presto 0.172, and this error reporting was fixed in 0.183 (see https://docs.starburstdata.com/latest/release/release-0.183.html; that was in July 2017. By the way, map_entries was also added in 0.183.)
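
For comparison, the result the question is ultimately after (the most frequent upper(cmplx) per upper(address), zip group, for groups with more than three rows) can be sketched in plain Python. The function name, the tuple layout of the rows and the sample data are assumptions made for illustration; this is not a Presto or Athena query.

```python
from collections import Counter, defaultdict


def most_common_per_group(rows, min_rows=4):
    """rows: iterable of (address, zip, cmplx) tuples."""
    groups = defaultdict(Counter)
    for address, zip_code, cmplx in rows:
        # One histogram of upper(cmplx) per (upper(address), zip) group.
        groups[(address.upper(), zip_code)][cmplx.upper()] += 1

    result = {}
    for key, counts in groups.items():
        total = sum(counts.values())  # count(*) as N
        if total >= min_rows:         # mirrors WHERE N > 3
            mode, _ = counts.most_common(1)[0]  # like max_by(key, value)
            result[key] = (mode, total)
    return result


# Hypothetical sample data.
rows = [
    ("1 Main St", "10001", "Oak Court"),
    ("1 main st", "10001", "oak court"),
    ("1 Main St", "10001", "Oak Court"),
    ("1 Main St", "10001", "Elm Place"),
    ("2 Side Rd", "10002", "Pine Row"),
]
print(most_common_per_group(rows))
```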

DAX Query to Get Distinct Items from Multiple Tables

Problem
I'm trying to generate a table of distinct email addresses from multiple source tables. However, with the UNION statement on the outer part of the statement, it isn't generating a truly distinct list.
Code
Participants = UNION(DISTINCT('Registrations'[Email Address]), DISTINCT( 'EnteredTickets'[Email]))
*Note that while I'm starting with just two source tables, I need to expand this to 3 or 4 by the end of it.

A combination of using VALUES on the table selects plus wrapping the whole statement in one more DISTINCT did the trick:
Participants = DISTINCT(UNION(VALUES('Registrations'[Email Address]), VALUES( 'EnteredTickets'[Email])))

If you want a bridge table with unique values for all different tables, use DISTINCT instead of VALUES:
Participants =
FILTER (
    DISTINCT (
        UNION (
            TOPN ( 0, ROW ("NiceEmail", "asdf") ), -- adds zero rows table with nice new column name
            DISTINCT ( 'Registrations'[Email Address] ),
            DISTINCT ( 'EnteredTickets'[Email] )
        )
    ),
    [NiceEmail] <> BLANK () -- removes all blank emails
)
DISTINCT and VALUES may lead to different results. Essentially, with VALUES you are likely to end up with an (unwanted) blank value in your list.
Check this documentation:
https://learn.microsoft.com/en-us/dax/values-function-dax#related-functions
You might also like information under this link which you can use to get a specific column name for your table of distinct values:
DAX create empty table with specific column names and no rows
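
The reason the outer DISTINCT is needed is that UNION keeps duplicate rows, so an address present in both tables appears twice even when each input is already distinct. A small Python sketch of the intended result (deduplicate across all sources and drop blanks) follows. The function name and the sample addresses are made up, and the sketch compares strings exactly, which may differ from how your model compares text.

```python
def distinct_emails(*columns):
    """Union any number of email columns, keeping first-seen order."""
    seen = set()
    result = []
    for column in columns:
        for email in column:
            if email is None or email == "":
                continue  # drop blanks, like [NiceEmail] <> BLANK ()
            if email not in seen:
                seen.add(email)
                result.append(email)
    return result


# Hypothetical sample columns.
registrations = ["a@example.com", "b@example.com", "a@example.com", None]
entered_tickets = ["b@example.com", "c@example.com", ""]
participants = distinct_emails(registrations, entered_tickets)
print(participants)
```

Because the function accepts any number of columns, extending it to three or four source tables only means passing more arguments, much like adding more tables inside UNION.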
