Find all two-word phrases that appear in more than one row in a dataset - data-mining

We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja": since it appears in more than one row of our dataset, the query should return it. The query should find all such phrases across all rows by examining every pair of adjacent words (each pair forming a phrase). The pairs must come from the dataset we loaded into BigQuery.
How can we write this query in Google BigQuery?
The dataset is simply a long list of English sentences.

Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.
This is a hack, but a hack I happen to like :).
In its current form, it only works for sentences with more than 2 words, and it only extracts the first 6 pairs of each sentence. You can extend and test from here.
Try it on your data, and please report back.
SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000
Results might contain NSFW words. The sample dataset comes from an online community and has not been "cleaned up". Do not run the query if you are sensitive to such words.
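For comparison, here is a minimal Python sketch of the same idea outside BigQuery. It is not the query above: the function name `repeated_bigrams` and the sample sentences are made up for illustration, it counts distinct rows rather than occurrences, it has no 6-pair limit, and it treats words case-sensitively.

```python
from collections import defaultdict


def repeated_bigrams(rows):
    """Return {phrase: number_of_rows} for two-word phrases found in more than one row."""
    rows_by_pair = defaultdict(set)
    for row_id, text in enumerate(rows):
        words = text.split()  # split on any run of whitespace
        # zip pairs each word with its right-hand neighbour
        for first, second in zip(words, words[1:]):
            rows_by_pair[f"{first} {second}"].add(row_id)
    return {pair: len(ids) for pair, ids in rows_by_pair.items() if len(ids) > 1}


# Hypothetical sample rows, not taken from the reddit dataset.
sample_rows = [
    "Data Ninja joins the team",
    "Hire a Data Ninja today",
    "Nothing repeated here",
]
print(repeated_bigrams(sample_rows))
```

Because each pair is tracked with a set of row ids, a phrase that occurs twice inside a single row is not reported unless it also appears in another row.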

A very useful hack which inspired me to solve my problem, thanks.
My data is a combination of passenger counts and their ages, where the ages are stored as a comma-separated string of numbers:
adults  ages
------  -------------
4       "53,67,65,68"
4       "44,45,69,65"
3       "20,21,20"
3       "30,32,62"
I wanted to add a column to each row containing the difference between the highest and the lowest age:
adults  ages           agediff
------  -------------  -------
4       "53,67,65,68"  15
4       "44,45,69,65"  25
3       "20,21,20"     1
3       "30,32,62"     32
This was done by the following, heavily inspired by the hack:
SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM
(SELECT adults, ages, max(age) maxage, min(age) minage FROM
(SELECT adults, ages, age FROM
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
),
(SELECT adults, ages, age FROM
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
(SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
)
GROUP BY adults, ages
)
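As a cross-check of the expected agediff values, the same calculation can be sketched in Python. This is an illustration only, not part of the BigQuery solution; the helper name `age_diff` is made up. Unlike the query, which takes max() and min() over the extracted strings, it converts each age to an integer first, so it does not depend on every age having two digits.

```python
def age_diff(ages):
    """Difference between the highest and lowest age in a string like "53,67,65,68"."""
    values = [int(part) for part in ages.split(",") if part.strip()]
    # max()/min() raise ValueError on an empty list, i.e. on an empty ages string
    return max(values) - min(values)


# The four rows from the table above.
rows = [(4, "53,67,65,68"), (4, "44,45,69,65"), (3, "20,21,20"), (3, "30,32,62")]
for adults, ages in rows:
    print(adults, ages, age_diff(ages))
```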

Related

Is there any way to get customer duplicate data from customers table then update all others except 1 row from duplicate data?

I am trying to find the duplicate rows in my customers table and then set the isactive column to 0 for every duplicate except one row per group of duplicates.
Here is my script, using Oracle 19c:
merge into customers c
using (
WITH cte AS (
SELECT DISTINCT ROWID, fn_createfullname(firstname, middlename, lastname) as fullName, mobile, branchid, isactive,
ROW_NUMBER() OVER (PARTITION BY fn_createfullname(firstname, middlename, lastname), mobile, branchid ORDER BY ROWID) AS rn
FROM customers
)
select * from cte
WHERE rn > 1
) tbl
on (tbl.mobile = c.mobile and fn_createfullname(c.firstname, c.middlename, c.lastname) = tbl.fullname)
when matched then update
SET c.isactive = 0
WHERE rn > 1;
I expect it to find all duplicate rows and then deactivate every duplicate except a single row per group.
Any help would be appreciated.
When I run my query, it reports this error:
Error report - ORA-30926: unable to get a stable set of rows in the source tables

Why doesn't my query work in Oracle APEX?

This is my query, but when I run it in Oracle APEX it gives me the following error:
delete from
(select ename,e1.store_id,e1.sal as highest_sal
from
employees e1 inner join
(select store_id,max(sal) as sal
from employees
group by store_id
) e2
on e1.store_id=e2.store_id
and e1.sal=e2.sal
order by store_id) s
where rowid not in
(select min(rowid) from s
group by highest_sal);
The output is:
ORA-00942: table or view does not exist
ORA-06512: at "SYS.WWV_DBMS_SQL_APEX_210200", line 673
ORA-06512: at "SYS.DBMS_SYS_SQL", line 1658
ORA-06512: at "SYS.WWV_DBMS_SQL_APEX_210200", line 659
ORA-06512: at "APEX_210200.WWV_FLOW_DYNAMIC_EXEC", line 1829
4. (select store_id,max(sal) as sal
5. from employees
6. group by store_id
7. ) e2
8. on e1.store_id=e2.store_id
When I run the query in parentheses (the one aliased as s) on its own, it runs without any problems, but when it is placed inside this statement, it gives an error.
Update: my goal is to first group the data by store_id and get the maximum sal in each group, then join that back to the main table where sal and store_id match, and display the employee name; the resulting table is called s. Then I want to remove the duplicate rows (those with the same sal): to do this, I group by highest_sal, select the lowest rowid in each group, and delete the rows whose rowid is not in that subquery. As a result, only the non-duplicates remain. (This is a trick for removing duplicate rows.)
You appear to want to delete all rows with the highest sal for each store_id grouping except for the row in each group with the lowest ROWID.
You can do that with analytic functions. Either:
DELETE FROM employees
WHERE ROWID IN (
SELECT ROWID
FROM (
SELECT RANK() OVER (PARTITION BY store_id ORDER BY sal DESC) AS rnk,
ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY sal DESC, ROWID ASC)
AS rn
FROM employees
)
WHERE rnk = 1
AND rn > 1
);
or:
DELETE FROM employees
WHERE ROWID IN (
SELECT ROWID
FROM (
SELECT sal,
MAX(sal) OVER (PARTITION BY store_id) AS max_sal,
MIN(ROWID) KEEP (DENSE_RANK LAST ORDER BY sal)
OVER (PARTITION BY store_id) AS min_rid_for_max_sal
FROM employees
)
WHERE sal = max_sal
AND ROWID != min_rid_for_max_sal
);
Or, from Oracle 12, with row limiting clauses in a correlated sub-query:
DELETE FROM employees e
WHERE ROWID IN (
SELECT ROWID
FROM (
SELECT sal
FROM employees x
WHERE e.store_id = x.store_id
ORDER BY sal DESC
FETCH FIRST ROW WITH TIES
)
ORDER BY ROWID
OFFSET 1 ROW FETCH NEXT 100 PERCENT ROWS ONLY
);
Which, for the sample data:
CREATE TABLE employees (ename, store_id, sal) AS
SELECT 'A', 1, 1 FROM DUAL UNION ALL
SELECT 'B', 1, 2 FROM DUAL UNION ALL
SELECT 'C', 1, 3 FROM DUAL UNION ALL
SELECT 'D', 2, 1 FROM DUAL UNION ALL
SELECT 'E', 2, 2 FROM DUAL UNION ALL
SELECT 'F', 2, 2 FROM DUAL;
All three statements delete the F row.
db<>fiddle here
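To make the rule behind these statements concrete, here is a small plain-Python model of it (not Oracle code). It assumes that list position can stand in for ROWID order, which is a simplification: Oracle does not guarantee that ROWID order follows insertion order. The function name `rows_to_delete` is made up for illustration.

```python
from collections import defaultdict


def rows_to_delete(employees):
    """Indexes of rows that tie for the top sal in their store, except the first of each tie."""
    by_store = defaultdict(list)
    for idx, (_ename, store_id, sal) in enumerate(employees):
        by_store[store_id].append((idx, sal))
    doomed = []
    for members in by_store.values():
        top = max(sal for _, sal in members)
        top_rows = sorted(idx for idx, sal in members if sal == top)
        doomed.extend(top_rows[1:])  # keep the lowest index, drop the other ties
    return sorted(doomed)


# Same rows as the CREATE TABLE sample: (ename, store_id, sal).
employees = [
    ("A", 1, 1), ("B", 1, 2), ("C", 1, 3),
    ("D", 2, 1), ("E", 2, 2), ("F", 2, 2),
]
print([employees[i][0] for i in rows_to_delete(employees)])
```

Store 1 has a single top earner (C), so nothing is removed there; store 2 has a tie between E and F, and only the later row is dropped.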

Pivot with dynamic DATE columns

I have a query that I created from a table.
example:
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
group by regexp_substr(statistics, 'business_\w*'), trunc(createdformat)
This works great thanks to your help.
Now I want to show that in a crosstab / pivot.
That is, the first column holds the "business_statistics" values and the column headings are the days taken dynamically from business_date.
I've tried the following, but it doesn't quite work yet
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT(
count(pkey)
FOR business_date
IN ('17.06.2020','18.06.2020')
)
ORDER BY business_statistics
If I specify the dates, such as 17.06.2020 and 18.06.2020 here, it works and I get 3 columns (Business_Statistics, 17.06.2020, 18.06.2020). But from column 2 onwards it should be dynamic. That means it should show the days (dates) that actually occur in the query / table, giving X columns (Business_Statistics, Date1, Date2, Date3, Date4, ....) based on the table data.
For example, this does not work:
...
IN (SELECT DISTINCT trunc(createdformat) FROM BUSINESS_DATA WHERE statistics like '%business_%' order by trunc(createdformat))
...
The pivot clause doesn't work with dynamic values.
But there are some workarounds discussed here: How to Convert Rows to Columns and Back Again with SQL (Aka PIVOT and UNPIVOT)
You may find one workaround that suits your requirements.
Unfortunately, I am not very familiar with PL/SQL. But could I still use a start date and an end date entered by the user in the query?
For example, in the APEX environment the user enters StartDate: June 17, 2020 and EndDate: June 20, 2020.
The number of days between them is then calculated in PL/SQL, and a loop fills a variable with every date in the entered period.
Example (just an idea, I'm not that fluent in PL/SQL yet):
DECLARE
startdate := :P9999_StartDate 'Example 17.06.2020
enddate := P9999_EndDate 'Example 20.06.2020
BEGIN
LOOP 'From the startdate to the enddate day
businessdate := businessdate .... 'Example: 17.06.2020,18.06.2020,19.06.2020, ...
END LOOP
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT(
count(pkey)
FOR business_date
IN (businessdate)
)
ORDER BY business_statistics
END;
That would be my idea, but I have not managed to implement it. Is that possible? I hope you understand what I mean.
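The loop described above, which walks from the start date to the end date and collects one entry per day for the IN list, might be sketched as follows. This is Python, not PL/SQL, and only generates the statement text; the function names `pivot_in_list` and `build_pivot_query` are hypothetical. It assumes Oracle accepts constant DATE literals with column aliases in the PIVOT IN list, and the resulting text would still have to be executed as dynamic SQL, because the PIVOT clause itself cannot take a variable.

```python
from datetime import date, timedelta


def pivot_in_list(start, end):
    """Build the body of IN (...): one DATE literal per day, start and end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    items = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        iso = day.isoformat()              # e.g. 2020-06-17
        label = day.strftime("%d.%m.%Y")   # e.g. 17.06.2020, used as the column heading
        items.append(f"DATE '{iso}' AS \"{label}\"")
    return ", ".join(items)


def build_pivot_query(start, end):
    """Return the pivot statement from the question with a generated IN list."""
    return (
        "SELECT * FROM (\n"
        "  select pkey, trunc(createdformat) business_date,\n"
        "         regexp_substr(statistics, 'business_\\w*') business_statistics\n"
        "  from business_data\n"
        "  where statistics like '%business_%'\n"
        ")\n"
        f"PIVOT (count(pkey) FOR business_date IN ({pivot_in_list(start, end)}))\n"
        "ORDER BY business_statistics"
    )


print(build_pivot_query(date(2020, 6, 17), date(2020, 6, 20)))
```

The dates are formatted from `date` objects rather than pasted in from user text, so the generated list contains only values the code produced itself.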

SQL to PROC SQL- partition By alternative (min case)

I am new to SAS but know SQL, so I am trying to port SQL code to proc sql, and I realized that PARTITION BY is not available in SAS.
Table
    Customer_id  Item_type  Order Size  Date       ….
1.  A401         Fruit      Small       3/14/2016  ….
2.  A401         Fruit      Big         5/22/2016  ….
3.  A401         Vegetable  Small       7/12/2016  ….
4.  B509         Vegetable  Small       3/25/2015  ….
5.  B509         Vegetable  Big         3/15/2014  ….
6.  B509         Vegetable  Small       3/1/2014   ….
Explanation
    Customer_id  Item_Type  Count  Reason
1.  A401         Fruit      2      WRONG: the date of the Big item is later than the other dates in the group
2.  B509         Vegetable  2      RIGHT: note that the count is only 2 because one of the dates is earlier than that of the Big item (3/1/2014 is earlier than 3/15/2014)
SQL Output
    Customer_id  Item_Type  Count
1.  B509         Vegetable  2
select t.customer_id, t.item_type, count(*)
from (select t.*,
min(case when OrderSize = 'Big' then date end) over (partition by customer_id, item_type) as min_big
from t
) t
where date > min_big
group by t.customer_id, t.item_type;
In SQL dialects that do not support window functions (MS Access, MySQL, SQLite, SAS' proc sql), most PARTITION BY calls can be replaced with correlated aggregate subqueries, which are supported by all major SQL dialects. Consider the following adjustment:
select main.customer_id, main.item_type, count(*) as count
from
(select t.customer_id, t.item_type, t.date,
(select min(case when OrderSize = 'Big' then date end)
from t sub
where sub.customer_id = t.customer_id
and sub.item_type = t.item_type) as min_big
from t
) main
where main.date > main.min_big
group by main.customer_id, main.item_type;
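The correlated-subquery form can be tried without SAS. The sketch below runs it against an in-memory SQLite database from Python, purely as a stand-in for proc sql. Several things are assumptions made for illustration: the columns are renamed to `order_size` and `order_date`, the dates are stored as ISO text so that string comparison follows date order, and the six rows are typed in from the table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (customer_id TEXT, item_type TEXT, order_size TEXT, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("A401", "Fruit", "Small", "2016-03-14"),
        ("A401", "Fruit", "Big", "2016-05-22"),
        ("A401", "Vegetable", "Small", "2016-07-12"),
        ("B509", "Vegetable", "Small", "2015-03-25"),
        ("B509", "Vegetable", "Big", "2014-03-15"),
        ("B509", "Vegetable", "Small", "2014-03-01"),
    ],
)

# Same shape as the adjusted query: the scalar subquery is re-evaluated for each
# outer row and plays the role of MIN(...) OVER (PARTITION BY customer_id, item_type).
query = """
select main.customer_id, main.item_type, count(*) as cnt
from (select t.customer_id, t.item_type, t.order_date,
             (select min(case when sub.order_size = 'Big' then sub.order_date end)
                from t sub
               where sub.customer_id = t.customer_id
                 and sub.item_type = t.item_type) as min_big
        from t) main
where main.order_date > main.min_big
group by main.customer_id, main.item_type
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Groups without any Big row get a NULL min_big and drop out of the WHERE clause. Note that the strict `>` also leaves the Big row itself uncounted, so on these six rows the only result is B509 / Vegetable with a count of 1.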

How to order a query by values in different columns

I have a recordset named rsProductClass that is returned from a table in the database. It is a very simple SELECT * FROM Table WHERE ProductID = {ID Value Here} and the table is like this:
ProductID | UPPERTIER | LOWERTIER | NATIER | OTHERTIER
1 20 60 10 10
2 10 90 NULL NULL
3 NULL 40 NULL 5
The table may or may not have a value for each of the various tiers.
What I want to do is show the user which column has the highest value and what the name of that column is. For example, if you were looking at ProductID 2, the page should display "This is likely to be a LOWERTIER product".
I need to sort the rsProductClass query in such a way that it returns me a list of columns in that query ordered by the value in each column. I want to treat the NULL values as zeros.
I tried to mess about with doing valuelist() and some ArrayToList() type functions but it crashes on the NULL values. Say I add columns to an array, and then use ArraySort() to get them in some kind of order, I'll get an error saying something like "Position 1 is not numeric" because it has a NULL value.
Is this something that can be done in ColdFusion? I suppose it's some sort of pivoting of the recordset, which is beyond my ability.
Something like this would work:
<cfquery name="tiers" datasource="...">
SELECT ProductID, UPPERTIER VALUE, 'UPPERTIER' TIER
FROM Table
WHERE UPPERTIER IS NOT NULL
UNION
SELECT ProductID, LOWERTIER VALUE, 'LOWERTIER' TIER
FROM Table
WHERE LOWERTIER IS NOT NULL
UNION
SELECT ProductID, OTHERTIER VALUE, 'OTHERTIER' TIER
FROM Table
WHERE OTHERTIER IS NOT NULL
UNION
SELECT ProductID, NATIER VALUE, 'NATIER' TIER
FROM Table
WHERE NATIER IS NOT NULL
ORDER BY ProductID, VALUE DESC
</cfquery>
<cfset productGroup = StructNew()>
<cfoutput query="tiers" group="ProductID">
<cfset productGroup[ProductID].TIER = TIER>
<cfset productGroup[ProductID].VALUE = VALUE>
</cfoutput>
<cfdump var="#productGroup#">
Starting with ColdFusion 10 you can use <cfloop query="..." group="...">; before that, <cfoutput> must be used.
If you're willing to unpivot your query, you might do something like the following. I used COALESCE() instead of ISNULL() (either one works in this situation, but COALESCE() is the ANSI standard). The column tier_rank will give the rank of the given tier -- that is, the tier with the highest value will have a rank of 1. If there are two tiers that both have the highest value, then both will have a value in tier_rank of 1 (this is why you would use RANK() instead of ROW_NUMBER() -- you could also use DENSE_RANK() if it better fits your requirements):
SELECT p1.product_id, p1.tier_name, p1.tier_value
, RANK() OVER ( PARTITION BY p1.product_id ORDER BY p1.tier_value DESC ) tier_rank
FROM (
SELECT product_id, 'UPPERTIER' AS tier_name
, COALESCE(uppertier, 0) AS tier_value
FROM products
UNION ALL
SELECT product_id, 'LOWERTIER' AS tier_name
, COALESCE(lowertier, 0) AS tier_value
FROM products
UNION ALL
SELECT product_id, 'NATIER' AS tier_name
, COALESCE(natier, 0) AS tier_value
FROM products
UNION ALL
SELECT product_id, 'OTHERTIER' AS tier_name
, COALESCE(othertier, 0) AS tier_value
FROM products
) p1
Please see SQL Fiddle demo here.
It might be possible to re-pivot the above unpivoted query, but I must admit my attempts at doing so failed.
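The unpivot-and-rank logic can also be sketched outside the database. The Python below is an illustration only, using the three products from the question; the name `rank_tiers` is made up. It treats NULL (None) as 0, as COALESCE() does above, and gives tied values the same rank with a gap afterwards, mirroring RANK() rather than DENSE_RANK().

```python
# Rows from the question's table; None stands for NULL.
products = {
    1: {"UPPERTIER": 20, "LOWERTIER": 60, "NATIER": 10, "OTHERTIER": 10},
    2: {"UPPERTIER": 10, "LOWERTIER": 90, "NATIER": None, "OTHERTIER": None},
    3: {"UPPERTIER": None, "LOWERTIER": 40, "NATIER": None, "OTHERTIER": 5},
}


def rank_tiers(tiers):
    """Return [(tier_name, value, rank)] ordered by value descending; ties share a rank."""
    values = {name: (value or 0) for name, value in tiers.items()}  # NULL -> 0
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    ranked = []
    for position, (name, value) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == value:
            rank = ranked[-1][2]  # tie: reuse the previous rank
        else:
            rank = position       # RANK() semantics: gaps after ties
        ranked.append((name, value, rank))
    return ranked


for product_id, tiers in products.items():
    top_name = rank_tiers(tiers)[0][0]
    print(f"Product {product_id}: This is likely to be a {top_name} product")
```

If two tiers tie for the highest value, both carry rank 1 and this loop reports only the first of them, so a caller that cares about ties should inspect every entry with rank 1.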
I had to do something similar recently and looked into UNPIVOT in SQL Server. Going with David's suggestion to unpivot your query, you could do something like this. It doesn't add a RANK column, but it does order the values.
SELECT ProductID, Tier, TierValue
FROM
(SELECT ProductID, ISNULL(UpperTier,0) UpperTier, ISNULL(LowerTier,0) LowerTier, ISNULL(NaTier,0) NaTier, ISNULL(OtherTier,0) OtherTier
FROM products) p
UNPIVOT
(TierValue FOR Tier IN
(UpperTier, LowerTier, NaTier, OtherTier)
)AS unpvt
ORDER BY ProductID, TierValue Desc
SQL FIDDLE