Redshift: generate as many rows as the value in another column

df
customer_code contract_code product num_products
C0134 AB01245 toy_1 4
B8328 EF28421 doll_4 2
I would like to transform this table based on the integer value in column num_products and generate a unique id for each row:
Expected_df
unique_id customer_code contract_code product num_products
A1 C0134 AB01245 toy_1 1
A2 C0134 AB01245 toy_1 1
A3 C0134 AB01245 toy_1 1
A4 C0134 AB01245 toy_1 1
A5 B8328 EF28421 doll_4 1
A6 B8328 EF28421 doll_4 1
unique_id can be any random characters as long as I can use a count(distinct) on it later on.
I read that generate_series(1,10000) is available in later versions of Postgres but not in Redshift.

You need to use a recursive CTE to generate the series of numbers, then join it with your data to produce the extra rows. I used row_number() to get the unique_id in the example below.
This should meet your needs, or at least give you a start:
create table df (
    customer_code varchar(16),
    contract_code varchar(16),
    product varchar(16),
    num_products int
);

insert into df values
    ('C0134', 'AB01245', 'toy_1', 4),
    ('B8328', 'EF28421', 'doll_4', 2);

with recursive nums (n) as (
    select 1 as n
    union all
    select n + 1 as n
    from nums
    where n < (select max(num_products) from df)
)
select row_number() over () as unique_id,
       customer_code, contract_code, product, num_products
from df d
left join nums n
    on d.num_products >= n.n;
SQLfiddle at http://sqlfiddle.com/#!15/d829b/12
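If you want unique_id in the A1, A2, ... form shown in the expected output, the row number can be prefixed with a literal. A small variation on the final select above, reusing the same nums CTE (the cast and prefix are my own):

select 'A' || cast(row_number() over () as varchar) as unique_id,
       customer_code, contract_code, product, num_products
from df d
left join nums n
    on d.num_products >= n.n;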

Related

Struct Array in Bigquery with nested columns

I have two tables called source and base. The source table has a bunch of ids and all combinations of weekly dates. The base table has ids, their tagged devices, and the device start and end dates.
Example source table:
id    com_date
acc_1 11/25/2022
acc_1 11/18/2022
acc_1 11/11/2022
acc_2 11/25/2022
acc_3 11/25/2022
acc_3 11/25/2022
Example base table:
id    device_id start_date end_date
acc_1 d1        11/24/2022 12/31/2999
acc_1 d2        11/19/2022 12/31/2999
acc_1 d3        11/12/2022 11/28/2022
acc_2 d4        11/20/2022 11/26/2022
acc_3 d5        11/17/2022 11/24/2022
acc_3 d6        11/10/2022 12/31/2999
I would like my final table to look something like this, with nested columns: the count column should be the count of distinct devices applicable for that com_date, and each com_date should lie between start_date and end_date.
You might consider the query below.
(I've tested it after changing the last com_date in source_table to 11/18/2022.)
SELECT s.id, s.com_date AS dates,
       COUNT(DISTINCT device_id) AS count,
       ARRAY_AGG(STRUCT(b.device_id, b.start_date AS to_date, b.end_date AS from_date)) AS d
FROM source_table s
JOIN base_table b
  ON s.id = b.id
 AND PARSE_DATE('%m/%d/%Y', com_date) BETWEEN PARSE_DATE('%m/%d/%Y', start_date)
                                          AND PARSE_DATE('%m/%d/%Y', end_date)
GROUP BY 1, 2;
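To try this without creating tables, the sample data above can be inlined with CTEs. A sketch of the same query (with the last com_date changed to 11/18/2022, as noted above, and the struct fields kept under their original names):

WITH source_table AS (
    SELECT id, com_date FROM UNNEST([
        STRUCT('acc_1' AS id, '11/25/2022' AS com_date),
        ('acc_1', '11/18/2022'), ('acc_1', '11/11/2022'),
        ('acc_2', '11/25/2022'), ('acc_3', '11/25/2022'), ('acc_3', '11/18/2022')])
),
base_table AS (
    SELECT id, device_id, start_date, end_date FROM UNNEST([
        STRUCT('acc_1' AS id, 'd1' AS device_id, '11/24/2022' AS start_date, '12/31/2999' AS end_date),
        ('acc_1', 'd2', '11/19/2022', '12/31/2999'),
        ('acc_1', 'd3', '11/12/2022', '11/28/2022'),
        ('acc_2', 'd4', '11/20/2022', '11/26/2022'),
        ('acc_3', 'd5', '11/17/2022', '11/24/2022'),
        ('acc_3', 'd6', '11/10/2022', '12/31/2999')])
)
SELECT s.id, s.com_date AS dates,
       COUNT(DISTINCT device_id) AS count,
       ARRAY_AGG(STRUCT(b.device_id, b.start_date, b.end_date)) AS d
FROM source_table s
JOIN base_table b
  ON s.id = b.id
 AND PARSE_DATE('%m/%d/%Y', com_date) BETWEEN PARSE_DATE('%m/%d/%Y', start_date)
                                          AND PARSE_DATE('%m/%d/%Y', end_date)
GROUP BY 1, 2;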

Big query analytical function not giving expected results

I am trying to write a query in BigQuery, and I have a requirement to filter records based on a group-by column and another column in the table.
What I mean is: if the group-by column (name: mnt) has more than one row per value, then I have to check the value of col2 (name: zel) and apply the filter zel = 'X', passing only that record. Otherwise, i.e. if the group has only one distinct value, don't filter the records at all.
So I wrote the SQL below to do this. I used row_number as well as rank and dense_rank, but I noticed that rank, dense_rank, and row_number all return the same value for a group.
Please see the code below:
#standardSQL
with t1 as (
    SELECT mnt,
           lif,
           case when rank() over (partition by ltrim(rtrim(mnt))
                                  order by ltrim(rtrim(mnt)) asc) > 1
                then 'Y' else 'N' end as flag,
           rank() over (partition by mnt order by mnt) as rn,
           dense_rank() over (partition by mnt order by mnt) as drn
    FROM projectname.datasetname.tablename1),
t2 as (
    SELECT mnt, rel, lif, lts, lokez
    FROM projectname.datasetname.tablename2
    WHERE lts <> "" AND _PARTITIONTIME = TIMESTAMP(CURRENT_DATE())),
t3 as (
    SELECT lif, lifn, lts, par
    FROM `projectname.datasetname.tablename3`),
t4 as (
    SELECT rcv
    FROM `projectname.datasetname.tablename4`
    WHERE mes = 'PRO')
select * from (
    SELECT t1.mnt as mnt,
           t1.flag,
           t1.rn,
           t1.drn,
           t2.rel as zel,
           t2.lokez as ZLOEKZ,
           t4.rcv as Zrcv
    FROM t1
    left join t2
        on replace(t1.mnt, '00000000', '') = REPLACE(t2.mnt, '00000000', '')
        and t1.lif = t2.lif
        and t2.lts <> ""
        and case when t1.flag = 'Y' and t2.rel = 'X' then 1
                 when (t1.flag = 'N' and t2.rel = t2.rel)
                      or (t1.flag = 'N' and t2.rel is null) then 1
                 when t1.flag = 'Y' and t2.rel <> 'X' then 2
                 else 3
            end = 1
    left join t3
        on t1.lif = t3.lif AND t2.lts = t3.lts AND t3.par = 'BA'
    left join t4
        on t4.rcv = t3.lifn and t2.lokez is null)
where ZLOEKZ is null
order by mnt
As you can see I am using a case statement, and even that seems to not be working. I am pasting the case condition again:
case when t1.flag = 'Y' and t2.rel = 'X' then 1
     when (t1.flag = 'N' and t2.rel = t2.rel)
          or (t1.flag = 'N' and t2.rel is null) then 1
     when t1.flag = 'Y' and t2.rel <> 'X' then 2
     else 3
end = 1
But the expected record count did not match, so I added the lines below to check whether my analytic functions were giving the result I wanted:
rank() over (partition by mnt order by mnt) as rn,
dense_rank() over (partition by mnt order by mnt) as drn
Strangely, for the same mnt number, rank, dense_rank, and row_number assign the same value. What am I doing wrong here?
This is my output:
mnt flag rn drn rel  lokez rcv
100 N    1  1   X    abc   123
100 N    1  1   null xyz   123
100 N    1  1   null def   234
I mean, as per my code, for the same mnt number I am seeing the flag set to N instead of Y, and rank and dense_rank give the same number for all 3 rows: mnt generates 1 instead of 1, 2, 3 (for the rank function I understand why, but dense_rank should not do that).
I tried to convey the issue as efficiently as I could; please let me know if there are any clarifications I can provide.
Any help appreciated. Thanks.
Note that rank() and dense_rank() return 1 for every row here because you order by the same column you partition by: within each mnt partition all rows tie, so every row gets rank 1. To detect groups with more than one row, use a COUNT(*) window instead:
SELECT * EXCEPT(ct) FROM (
    SELECT *, COUNT(*) OVER (PARTITION BY mnt) AS ct
    FROM your_table  -- your table or subquery goes here
) WHERE ct = 1 OR zel = 'X'
This is the code snippet for the problem you mentioned. Use it in your code according to the logic.
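A minimal, self-contained illustration of that snippet (hypothetical sample rows; the column names come from the question):

WITH sample AS (
    SELECT '100' AS mnt, 'X' AS zel UNION ALL
    SELECT '100', CAST(NULL AS STRING) UNION ALL
    SELECT '200', CAST(NULL AS STRING)
)
SELECT * EXCEPT(ct) FROM (
    SELECT *, COUNT(*) OVER (PARTITION BY mnt) AS ct
    FROM sample
)
WHERE ct = 1 OR zel = 'X'
-- mnt 100 has two rows, so only its zel = 'X' row survives;
-- mnt 200 has a single row, so it passes through unfiltered.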

pyspark mathematical computation in a dataframe

I have extracted a DataFrame from a larger DataFrame, and now I need to do simple computations like addition and division on it.
The sample dataframe looks like:
item counts
z 23156
x 15462
What I need to do is divide x by the sum of x and z, for example:
value = x / (x + z)
You must compute the sums first and then divide x by sum(x) + sum(z).
For example:
Table 1 (original table):
x z
1 2
3 4
Table 2 (aggregated table):
table2 = sqlCtx.sql("select sum(x) + sum(z) as sum_xz from table1")
table2.registerTempTable("table2")
sum_xz
10
Then join both tables and divide:
table3 = sqlCtx.sql("select a.x / b.sum_xz from table1 a cross join table2 b")
For your reference.
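For the item/counts layout in the original question, the same idea fits in one statement with a window function. A sketch, assuming the dataframe has been registered as a temp table named df (the subquery alias t is mine):

value = sqlCtx.sql("select item, counts / total as value from (select item, counts, sum(counts) over () as total from df) t where item = 'x'")

This divides the counts for item x (15462) by the grand total of counts (23156 + 15462).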

Select nth to nth row while the table still has values unselected, with Python and pyodbc

I have a table with 10,000 rows and I want to select the first 1000 rows, then select again, this time the next set of rows, which is 1001-2001.
I am using the BETWEEN clause to select the range of values, and I can increment the values. Here is my code:
from time import sleep

count = cursor.execute("select count(*) from casa4").fetchone()[0]
ctr = 1
ctr1 = 1000
str1 = ''
while ctr1 <= count:
    sql = "SELECT AccountNo FROM ( \
        SELECT AccountNo, ROW_NUMBER() OVER (ORDER BY Accountno) rownum \
        FROM casa4 ) seq \
        WHERE seq.rownum BETWEEN " + str(ctr) + " AND " + str(ctr1)
    ctr = ctr1 + 1
    ctr1 = ctr1 + 1000
    cursor.execute(sql)
    sleep(2)  # interval in printing of the rows
    for row in cursor:
        str1 = str1 + '|'.join(map(str, row)) + '\n'
    print "Records:" + str1  # var storing the fetched rows from the database
    print sql  # prints the sql statement; I can see that ctr and ctr1 have incremented correctly, the way I want
What I want to achieve: using a message queue (RabbitMQ), I will send these rows to another database, and I want to speed up the process. Selecting everything at once and sending it to the queue returns an error.
The code returns rows 1-1000 correctly on the first loop but, on the second loop, instead of rows 1001-2001 it returns rows 1-2001, then 1-3001, and so on. It always starts at 1.
I was able to recreate your issue with both pyodbc and pypyodbc. I also tried using
WITH seq (AccountNo, rownum) AS
(
SELECT AccountNo, ROW_NUMBER() OVER (ORDER BY Accountno) rownum
FROM casa4
)
SELECT AccountNo FROM seq
WHERE rownum BETWEEN 11 AND 20
When I run that in SSMS I just get rows 11 through 20, but when I run it from Python I get all the rows (starting from 1).
The following code does work using pyodbc. It uses a temporary table named #numbered, and might be helpful in your situation since your process looks like it would do all of its work using the same database connection:
import pyodbc

cnxn = pyodbc.connect("DSN=myDb_SQLEXPRESS")
crsr = cnxn.cursor()

sql = """\
CREATE TABLE #numbered (rownum INT PRIMARY KEY, AccountNo VARCHAR(10))
"""
crsr.execute(sql)
cnxn.commit()

sql = """\
INSERT INTO #numbered (rownum, AccountNo)
SELECT
    ROW_NUMBER() OVER (ORDER BY Accountno) AS rownum,
    AccountNo
FROM casa4
"""
crsr.execute(sql)
cnxn.commit()

sql = "SELECT AccountNo FROM #numbered WHERE rownum BETWEEN ? AND ? ORDER BY rownum"
batchsize = 1000
ctr = 1
while True:
    crsr.execute(sql, [ctr, ctr + batchsize - 1])
    rows = crsr.fetchall()
    if len(rows) == 0:
        break
    print("-----")
    for row in rows:
        print(row)
    ctr += batchsize

cnxn.close()
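As an aside, if you are on SQL Server 2012 or newer, OFFSET ... FETCH pages directly and avoids the ROW_NUMBER subquery entirely. A sketch of one 1000-row batch (in practice the offset would be a parameter rather than a literal):

SELECT AccountNo
FROM casa4
ORDER BY AccountNo
OFFSET 1000 ROWS FETCH NEXT 1000 ROWS ONLY;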

Common table expression from bottom-top approach

I have an Agent table and a hierarchy table.
CREATE TABLE [dbo].[Agent](
[AgentID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](50) NULL,
[LastName] [varchar](50) NULL,
CONSTRAINT [PK_Agent] PRIMARY KEY CLUSTERED
(
[AgentID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Hierarchy](
[HierarchyID] [int] IDENTITY(1,1) NOT NULL,
[AgentID] [int] NULL,
[NextAgentID] [int] NULL,
CONSTRAINT [PK_Hierarchy] PRIMARY KEY CLUSTERED
(
[HierarchyID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
--Insert to Agent
INSERT INTO [Agent]([FirstName],[LastName])VALUES('C1','C1');
INSERT INTO [Agent]([FirstName],[LastName])VALUES('C2','C2');
INSERT INTO [Agent]([FirstName],[LastName])VALUES('C3','C3');
INSERT INTO [Agent]([FirstName],[LastName])VALUES('C4','C4');
SELECT * FROM Agent;
AgentID FirstName LastName
1 C1 C1
2 C2 C2
3 C3 C3
4 C4 C4
--Insert to Hierarchy
INSERT INTO [Hierarchy] ([AgentID],[NextAgentID]) VALUES (1,NULL);
INSERT INTO [Hierarchy] ([AgentID],[NextAgentID]) VALUES (2,1);
INSERT INTO [Hierarchy] ([AgentID],[NextAgentID]) VALUES (3,2);
INSERT INTO [Hierarchy] ([AgentID],[NextAgentID]) VALUES (2,4);
INSERT INTO [Hierarchy] ([AgentID],[NextAgentID]) VALUES (4,NULL);
SELECT * FROM Hierarchy;
HierarchyID AgentID NextAgentID
1 1 NULL
2 2 1
3 3 2
4 2 4
5 4 NULL
I used a common table expression to determine the bottom-to-top levels:
WITH AgentHierarchy(AgentID, NextAgentID, HierarchyLevel)
AS
(
SELECT
H1.AgentID,
H1.NextAgentID,
1 HierarchyLevel
FROM Hierarchy H1
WHERE NOT EXISTS (SELECT 1 FROM Hierarchy H2 WHERE H2.NextAgentID = H1.AgentID)
UNION ALL
SELECT
H.AgentID,
H.NextAgentID,
(AgentHierarchy.HierarchyLevel + 1) HierarchyLevel
FROM Hierarchy H
INNER JOIN AgentHierarchy ON AgentHierarchy.NextAgentID = H.AgentID
)
SELECT DISTINCT
AgentID,
NextAgentID,
HierarchyLevel
FROM AgentHierarchy
ORDER BY AgentID, NextAgentID, HierarchyLevel;
Result is:
AgentID NextAgentID HierarchyLevel
1 NULL 3
2 1 2
3 2 1
4 NULL 1
2 4 1
My requirement is to show this in the following way:
AgentID NextAgentID HierarchyLevel
1 NULL 1
2 1 1
3 2 1
3 1 2
4 NULL 1
2 4 1
3 4 2
In short, the whole hierarchy with levels should be pulled recursively, with a bottom-to-top approach. Please help me.
I found the answer:
WITH AgentHierarchy(AgentID, NextAgentID, HierarchyLevel)
AS
(
SELECT
H1.AgentID,
H1.NextAgentID,
1 HierarchyLevel
FROM Hierarchy H1
--WHERE NOT EXISTS (SELECT 1 FROM Hierarchy H2 WHERE H2.NextAgentID = H1.AgentID)
UNION ALL
SELECT
AgentHierarchy.AgentID,
H.NextAgentID,
(AgentHierarchy.HierarchyLevel + 1) HierarchyLevel
FROM Hierarchy H
INNER JOIN AgentHierarchy ON AgentHierarchy.NextAgentID = H.AgentID
)
SELECT
AgentHierarchy.AgentID,
NextAgentID,
HierarchyLevel
FROM AgentHierarchy
WHERE NOT (NextAgentID IS NULL AND HierarchyLevel > 1);
I made the following changes:
Removed the anchor query's WHERE clause.
Used the CTE's AgentID in the second select, after the UNION ALL.
Added a WHERE clause to the final SELECT to remove junk records for the bottom-most level (NULL NextAgentID).
Let me know if anyone has questions.
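If the report also needs the agent names rather than just IDs, the CTE result can be joined back to the Agent table. A sketch built on the corrected query above (the AgentName alias is mine):

WITH AgentHierarchy (AgentID, NextAgentID, HierarchyLevel) AS
(
    SELECT H1.AgentID, H1.NextAgentID, 1 HierarchyLevel
    FROM Hierarchy H1
    UNION ALL
    SELECT AgentHierarchy.AgentID, H.NextAgentID, AgentHierarchy.HierarchyLevel + 1
    FROM Hierarchy H
    INNER JOIN AgentHierarchy ON AgentHierarchy.NextAgentID = H.AgentID
)
SELECT A.FirstName + ' ' + A.LastName AS AgentName,
       AgentHierarchy.AgentID,
       NextAgentID,
       HierarchyLevel
FROM AgentHierarchy
INNER JOIN Agent A ON A.AgentID = AgentHierarchy.AgentID
WHERE NOT (NextAgentID IS NULL AND HierarchyLevel > 1);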