Struct Array in Bigquery with nested columns - google-cloud-platform

I have two tables called source and base. The source table has a set of ids and all combinations of weekly dates. The base table has ids, their tagged devices, and each device's start and end dates.
Example source table:
id    | com_date
------+------------
acc_1 | 11/25/2022
acc_1 | 11/18/2022
acc_1 | 11/11/2022
acc_2 | 11/25/2022
acc_3 | 11/25/2022
acc_3 | 11/25/2022
Example of base table:
id    | device_id | start_date | end_date
------+-----------+------------+------------
acc_1 | d1        | 11/24/2022 | 12/31/2999
acc_1 | d2        | 11/19/2022 | 12/31/2999
acc_1 | d3        | 11/12/2022 | 11/28/2022
acc_2 | d4        | 11/20/2022 | 11/26/2022
acc_3 | d5        | 11/17/2022 | 11/24/2022
acc_3 | d6        | 11/10/2022 | 12/31/2999
I would like my final table to have nested columns.
The count column should be the count of distinct devices applicable for that com_date,
and each com_date should lie between start_date and end_date.

You might consider the query below.
(I tested it after changing the last com_date in source_table to 11/18/2022.)
SELECT s.id, s.com_date AS dates,
COUNT(DISTINCT device_id) count,
ARRAY_AGG(STRUCT(b.device_id, b.start_date AS from_date, b.end_date AS to_date)) d
FROM source_table s JOIN base_table b
ON s.id = b.id
AND PARSE_DATE('%m/%d/%Y', com_date) BETWEEN PARSE_DATE('%m/%d/%Y', start_date) AND PARSE_DATE('%m/%d/%Y', end_date)
GROUP BY 1, 2;
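To make the join condition concrete, here is a small Python sketch of the same logic outside BigQuery. It is only an illustration under stated assumptions: the rows are the unmodified sample data from the question, and the helper names `parse` and `devices_per_date` are made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def parse(d):
    # Same format string as PARSE_DATE('%m/%d/%Y', ...) in the query.
    return datetime.strptime(d, "%m/%d/%Y").date()


source_table = [
    ("acc_1", "11/25/2022"), ("acc_1", "11/18/2022"), ("acc_1", "11/11/2022"),
    ("acc_2", "11/25/2022"), ("acc_3", "11/25/2022"), ("acc_3", "11/25/2022"),
]
base_table = [
    ("acc_1", "d1", "11/24/2022", "12/31/2999"),
    ("acc_1", "d2", "11/19/2022", "12/31/2999"),
    ("acc_1", "d3", "11/12/2022", "11/28/2022"),
    ("acc_2", "d4", "11/20/2022", "11/26/2022"),
    ("acc_3", "d5", "11/17/2022", "11/24/2022"),
    ("acc_3", "d6", "11/10/2022", "12/31/2999"),
]


def devices_per_date(source, base):
    # Hypothetical helper: inner join on id plus the BETWEEN date condition,
    # then group by (id, com_date) like GROUP BY 1, 2.
    grouped = defaultdict(list)
    for sid, com_date in source:
        for bid, device_id, start, end in base:
            if sid == bid and parse(start) <= parse(com_date) <= parse(end):
                grouped[(sid, com_date)].append(
                    {"device_id": device_id, "from_date": start, "to_date": end}
                )
    return {
        key: {"count": len({d["device_id"] for d in devs}), "d": devs}
        for key, devs in grouped.items()
    }


result = devices_per_date(source_table, base_table)
```

Because this mirrors an inner join, a com_date with no matching device (acc_1 on 11/11/2022) produces no row at all, and a duplicated source row (acc_3 on 11/25/2022) repeats its device in the nested list while the distinct count stays at 1.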

Related

Redshift generate rows as many as value in another column

df
customer_code | contract_code | product | num_products
--------------+---------------+---------+-------------
C0134         | AB01245       | toy_1   | 4
B8328         | EF28421       | doll_4  | 2
I would like to transform this table based on the integer value in column num_products and generate a unique id for each row:
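Expected_df
unique_id | customer_code | contract_code | product | num_products
----------+---------------+---------------+---------+-------------
A1        | C0134         | AB01245       | toy_1   | 1
A2        | C0134         | AB01245       | toy_1   | 1
A3        | C0134         | AB01245       | toy_1   | 1
A4        | C0134         | AB01245       | toy_1   | 1
A5        | B8328         | EF28421       | doll_4  | 1
A6        | B8328         | EF28421       | doll_4  | 1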
unique_id can be any random characters as long as I can use a count(distinct) on it later on.
I read that generate_series(1,10000) i is available in later versions of Postgres but not in Redshift.
You need to use a recursive CTE to generate the series of numbers, then join it with your data to produce the extra rows. I used row_number() to get the unique_id in the example below.
This should meet your needs or at least give you a start:
create table df (customer_code varchar(16),
contract_code varchar(16),
product varchar(16),
num_products int);
insert into df values
('C0134', 'AB01245', 'toy_1', 4),
('B8328', 'EF28421', 'doll_4', 2);
with recursive nums (n) as
( select 1 as n
union all
select n+1 as n
from nums
where n < (select max(num_products) from df) )
select row_number() over() as unique_id, customer_code, contract_code, product, num_products
from df d
left join nums n
on d.num_products >= n.n;
SQLfiddle at http://sqlfiddle.com/#!15/d829b/12
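For readers without a Redshift cluster at hand, the same row-multiplication idea can be tried locally. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module instead of Redshift, and it builds the unique_id in Python with enumerate rather than with row_number(). The "A" prefix simply mirrors the expected output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table df (customer_code text, contract_code text,
                     product text, num_products int);
    insert into df values
        ('C0134', 'AB01245', 'toy_1', 4),
        ('B8328', 'EF28421', 'doll_4', 2);
""")

# The recursive CTE counts from 1 up to the largest num_products;
# the join then keeps n copies of each row with num_products >= n.
rows = conn.execute("""
    with recursive nums (n) as (
        select 1
        union all
        select n + 1 from nums
        where n < (select max(num_products) from df)
    )
    select d.customer_code, d.contract_code, d.product
    from df d
    join nums on d.num_products >= nums.n
    order by d.rowid, nums.n
""").fetchall()

# One output row per product unit, each with num_products = 1.
expanded = [("A%d" % i,) + row + (1,) for i, row in enumerate(rows, start=1)]
```

Each entry of `expanded` has the shape (unique_id, customer_code, contract_code, product, num_products), so a count(distinct) over the first field equals the total number of units.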

Update rows with multiple values from other table - BigQuery

I have two tables in BigQuery, Table A and Table B.
Table A has two columns - name (String) and value (Float). The name column can contain null values.
Table B has 3 columns - start_value (Float), end_value (Float) and name (String). These 3 columns are never empty.
My aim is to update the rows in Table A whose name is null. The logic is to take the value of each row where name is null and find the corresponding row in Table B where
a.value >= b.start_value and a.value <= b.end_value
In this way, I have to update all the rows in Table A in a single query. How can I achieve this?
Note: No two rows in Table A will be same.
UPDATE `project.dataset.tableA` a
SET a.name = b.name
FROM `project.dataset.tableB` b
WHERE a.name IS NULL
AND value BETWEEN start_value AND end_value
Here is code that works on my end:
UPDATE `project.dataset.tableA` a
SET a.name = (
SELECT b.name
FROM `project.dataset.tableB` b
WHERE value BETWEEN start_value AND end_value)
WHERE a.name IS NULL
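The correlated-subquery form can be tried locally as well. The following is a sketch only: it swaps BigQuery for Python's built-in sqlite3 module, and the table names `table_a` and `table_b` and all sample rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_a (name text, value real);
    create table table_b (start_value real, end_value real, name text);
    -- Invented sample data: two rows of table_a have no name yet.
    insert into table_a values ('kept', 5.0), (null, 12.5), (null, 27.0);
    insert into table_b values
        (0, 10, 'low'), (10.01, 20, 'mid'), (20.01, 30, 'high');
""")

# For every unnamed row, look up the range in table_b that contains its value.
conn.execute("""
    update table_a
    set name = (
        select b.name
        from table_b b
        where table_a.value between b.start_value and b.end_value
    )
    where name is null
""")

rows = conn.execute("select name, value from table_a order by value").fetchall()
```

Rows that already had a name are left alone because of the outer WHERE clause. If the ranges in Table B overlapped, the subquery could match more than one row for a value, so non-overlapping ranges are assumed here.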

using subqueries in jpa criteria api

I'm studying the JPA Criteria API and my database contains an Employee table.
I am trying to find all the employees who are paid the second-highest salary. I was able to write the JPQL successfully as follows.
SELECT e FROM Employee e WHERE e.salary = (SELECT MAX(emp.salary) FROM Employee emp WHERE emp.salary < (SELECT MAX(employee.salary) FROM Employee employee) )
Now I am trying to convert it to the Criteria API and have tried the following.
CriteriaQuery<Employee> c = cb.createQuery(Employee.class);
Root<Employee> e1 = c.from(Employee.class);
c.select(e1);
Subquery<Number> sq = c.subquery(Number.class);
Root<Employee> e2 = sq.from(Employee.class);
sq.select(cb.max(e2.<Number> get("salary")));
Subquery<Number> sq1 = sq.subquery(Number.class);
Root<Employee> e3 = sq1.from(Employee.class);
sq1.select(cb.max(e3.<Number> get("salary")));
c.where(cb.lessThan(e2.<Number>get("salary"), e3.<Number>get("salary")));// error here
c.where(cb.equal(e1.get("salary"), sq));
I get an error saying the parameters are not compatible with the lessThan method. I do not understand how to make this query work. Is my approach right?
EDIT :- Updating the question after Mikko's answer.
The jpql provided above provides following results, which are the employees with second highest salary.
Harish Taware salary 4000000.0
Nilesh Deshmukh salary 4000000.0
Deodatta Chousalkar salary 4000000.0
Deodatta Chousalkar salary 4000000.0
but the updated criteria query as below,
CriteriaQuery<Employee> c = cb.createQuery(Employee.class);
Root<Employee> e1 = c.from(Employee.class);
c.select(e1);
Subquery<Long> sq = c.subquery(Long.class);
Root<Employee> e2 = sq.from(Employee.class);
sq.select(cb.max(e2.<Long> get("salary")));
Subquery<Long> sq1 = sq.subquery(Long.class);
Root<Employee> e3 = sq1.from(Employee.class);
sq1.select(cb.max(e3.<Long> get("salary")));
c.where(cb.lessThan(e2.<Long> get("salary"), e3.<Long> get("salary")));
c.where(cb.equal(e1.get("salary"), sq));
employees = em.createQuery(c).getResultList();
for (Employee employee : employees) {
System.out.println(employee.getName() + "salary"
+ employee.getSalary());
}
This provides the employee with highest salary. The result is as below.
Pranil Gildasalary5555555.0
Please tell me where I am going wrong. An explanation would be greatly appreciated.
After some more trial and error, I could write the query to select employees with second maximum salary. I would like to suggest that you should write a JPQL query first and write the criteria api accordingly. This is what I analyzed from JPQL.
SELECT e FROM Employee e
WHERE e.salary = (SELECT MAX(emp.salary) FROM Employee emp
WHERE emp.salary < (SELECT MAX(employee.salary) FROM Employee employee) )
Now we can see that:
There are two subqueries, i.e. the subquery of the main query contains another subquery.
The identification variables e, emp and employee correspond to the main query, the subquery of the main query, and the subquery of the subquery, respectively.
When the result of a subquery (a maximum salary) is compared with a salary in the enclosing query, the identification variable of the enclosing query is used, e.g. WHERE e.salary = (SELECT MAX(emp.salary) FROM Employee emp).
Now let us convert this query in criteria api.
First write CriteriaQuery that corresponds to outermost query i.e. SELECT e FROM Employee e WHERE e.salary =
CriteriaQuery<Employee> c1 = cb.createQuery(Employee.class);
Root<Employee> e3 = c1.from(Employee.class);
c1.select(e3);
Let us leave the WHERE e.salary = part for now and move on to the subquery.
The outer query needs a subquery that selects the maximum salary of employees, i.e. SELECT MAX(emp.salary) FROM Employee emp WHERE emp.salary <. Again, let us leave the WHERE emp.salary < part for now.
Subquery<Long> sq1 = c1.subquery(Long.class);
Root<Employee> e4 = sq1.from(Employee.class);
sq1.select(cb.max(e4.<Long> get("salary")));
Repeating this for the subquery of the above subquery:
Subquery<Long> sq2 = sq1.subquery(Long.class);
Root<Employee> e5 = sq2.from(Employee.class);
sq2.select(cb.max(e5.<Long> get("salary")));
Now we have written the subqueries, but the WHERE conditions still need to be applied. The where condition in the Criteria API corresponding to WHERE emp.salary < (SELECT MAX(employee.salary) FROM Employee employee) is as below.
sq1.where(cb.lessThan(e4.<Long> get("salary"), sq2));
Similarly, WHERE condition corresponding to WHERE e.salary = (SELECT MAX(emp.salary) FROM Employee emp will be as below.
c1.where(cb.equal(e3.<Long> get("salary"), sq1));
So the complete query which gives the employees with second highest salary can be written in criteria api as below.
CriteriaQuery<Employee> c1 = cb.createQuery(Employee.class);
Root<Employee> e3 = c1.from(Employee.class);
c1.select(e3);
Subquery<Long> sq1 = c1.subquery(Long.class);
Root<Employee> e4 = sq1.from(Employee.class);
sq1.select(cb.max(e4.<Long> get("salary")));
Subquery<Long> sq2 = sq1.subquery(Long.class);
Root<Employee> e5 = sq2.from(Employee.class);
sq2.select(cb.max(e5.<Long> get("salary")));
sq1.where(cb.lessThan(e4.<Long> get("salary"), sq2));
c1.where(cb.equal(e3.<Long> get("salary"), sq1));
employees = em.createQuery(c1).getResultList();
for (Employee employee : employees) {
System.out.println(employee.getName() + " " + employee.getSalary());
}
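The nesting of the two subqueries may be easier to follow when written out as plain computation. Here is a minimal Python sketch of the same three steps, not JPA code: the first four names and salaries come from the output shown in the question, and the last row is invented so that there is something below the second-highest salary.

```python
employees = [
    ("Pranil Gilda", 5555555.0),
    ("Harish Taware", 4000000.0),
    ("Nilesh Deshmukh", 4000000.0),
    ("Deodatta Chousalkar", 4000000.0),
    ("Example Person", 1000000.0),  # invented row, not from the question
]

# Innermost subquery (sq2): SELECT MAX(employee.salary) FROM Employee employee
highest = max(salary for _, salary in employees)

# Middle subquery (sq1): SELECT MAX(emp.salary) ... WHERE emp.salary < highest
second_highest = max(salary for _, salary in employees if salary < highest)

# Outer query (c1): SELECT e FROM Employee e WHERE e.salary = second_highest
second_paid = [name for name, salary in employees if salary == second_highest]
```

Each where call in the Criteria version attaches to the query object whose rows it filters: the lessThan predicate belongs to sq1, and the equal predicate belongs to c1.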
As documented, it cannot work because Number is not Comparable:
<Y extends java.lang.Comparable<? super Y>> Predicate lessThan(Expression<? extends Y> x,
Expression<? extends Y> y)
For expressions of type Number there is the method CriteriaBuilder.lt, which takes such arguments:
c.where(cb.lt(e2.<Number>get("salary"), e3.<Number>get("salary")));
Another option is to change the type argument from Number to something more specific. If salary is Long, the following should work:
Subquery<Long> sq = c.subquery(Long.class);
Root<Employee> e2 = sq.from(Employee.class);
sq.select(cb.max(e2.<Long> get("salary")));
Subquery<Long> sq1 = sq.subquery(Long.class);
Root<Employee> e3 = sq1.from(Employee.class);
sq1.select(cb.max(e3.<Long> get("salary")));
c.where(cb.lessThan(e2.<Long>get("salary"), e3.<Long>get("salary")));
c.where(cb.equal(e1.get("salary"), sq));

Raw query must include the primary key

I have a raw SQL statement in my views.py:
Message.objects.raw('''
SELECT s1.ID, s1.CHARACTER_ID, MAX(s1.MESSAGE) MESSAGE, MAX(s1.c) occurrences
FROM
(SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s1
LEFT JOIN
(SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s2
ON s1.CHARACTER_ID=s2.CHARACTER_ID
AND s1.c < s2.c
WHERE s2.c IS NULL
GROUP BY CHARACTER_ID
ORDER BY occurrences DESC''', [days, days])
The result of this SQL statement (tested on database directly) is:
ID | CHARACTER_ID | MESSAGE | OCCURENCES
----+--------------+---------+--------------
148 | 10 | test | 133
But all I got is an InvalidQuery exception with the message Raw query must include the primary key
Then I double checked the docs and read:
There is only one field that you can’t leave out - the primary key
field....An InvalidQuery exception will be raised if you forget to include the primary key.
As you can see, I included the primary key in my statement. What's wrong?
class Message(models.Model):
    character = models.ForeignKey('Character')
    message = models.TextField()
    location = models.ForeignKey('Location')
    ts = models.DateTimeField()

    class Meta:
        pass

    def __unicode__(self):
        return u'%s: %s...' % (self.character, self.message[0:20])
Include 1 as id in your query:
Message.objects.raw('''
SELECT 1 as id , s1.ID, s1.CHARACTER_ID, MAX(s1.MESSAGE) MESSAGE, MAX(s1.c) occurrences
FROM
(SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s1
LEFT JOIN
(SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s2
ON s1.CHARACTER_ID=s2.CHARACTER_ID
AND s1.c < s2.c
WHERE s2.c IS NULL
GROUP BY CHARACTER_ID
ORDER BY occurrences DESC''', [days, days])
I reproduced the same problem using Python 2.7.5, Django 1.5.1 and MySQL 5.5.
I've saved the result of the raw call to the results variable, so I can check what columns it contains:
>>> results.columns
['ID', 'CHARACTER_ID', 'MESSAGE', 'occurrences']
ID is in uppercase, so in your query I changed s1.ID to s1.id and it works:
>>> results = Message.objects.raw('''
... SELECT s1.id, s1.CHARACTER_ID, MAX(s1.MESSAGE) MESSAGE, MAX(s1.c) occurrences
... FROM
... (SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
... FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s1
... LEFT JOIN
... (SELECT ID, CHARACTER_ID, MESSAGE, COUNT(*) c
... FROM tbl_message WHERE ts > DATE_SUB(NOW(), INTERVAL %s DAY) GROUP BY CHARACTER_ID,MESSAGE) s2
... ON s1.CHARACTER_ID=s2.CHARACTER_ID
... AND s1.c < s2.c
... WHERE s2.c IS NULL
... GROUP BY CHARACTER_ID
... ORDER BY occurrences DESC''', [days, days])
>>> results.columns
['id', 'CHARACTER_ID', 'MESSAGE', 'occurrences']
>>> results[0]
<Message_Deferred_character_id_location_id_message_ts: Character object: hello...>
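The session above suggests that the returned column names are compared with the primary key column name in a case-sensitive way. The toy sketch below only illustrates that idea. It is not Django's source code, and `includes_primary_key` is a hypothetical helper written for this example.

```python
def includes_primary_key(columns, pk_column="id"):
    # Plain membership test on strings: 'ID' and 'id' are different values.
    return pk_column in columns


# Column lists copied from the two results.columns outputs above.
columns_before = ['ID', 'CHARACTER_ID', 'MESSAGE', 'occurrences']
columns_after = ['id', 'CHARACTER_ID', 'MESSAGE', 'occurrences']

found_before = includes_primary_key(columns_before)
found_after = includes_primary_key(columns_after)
```

With the uppercase column list the lookup fails, and with the lowercase one it succeeds, which matches the behaviour observed in the session.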
Make sure the primary key is part of the select statement.
Example:
This will not work:
`Model.objects.raw("Select Min(id), rider_id from Table_Name group by rider_id")`
But this will work:
`Model.objects.raw("Select id, Min(id), rider_id from Table_Name group by rider_id")`
For those also stuck with this problem and wondering, like me, why Django needs a pk when you don't have a pk for the query (e.g. you want multiple rows): Django just needs an id field returned; the pk does not need to be part of a where clause, i.e.:
select * from table where foo = 'bar';
or
select id, description from table where foo = 'bar';
Both of these work, if there is a field id in the table. But this throws the error described by Thomas Schwärzl, because no id field is returned:
select description from table where foo = 'bar';

Postgres Recursive query + group by + join in Django

My requirement is to write a SQL query that gets the sub-region-wise count of (fault) events that occurred for the managedobjects. My database is PostgreSQL 8.4. Let me explain using the table structure.
My tables in django:
Managedobject:
class Managedobject(models.Model):
    name = models.CharField(max_length=200, unique=True)
    iscontainer = models.BooleanField(default=False,)
    parentkey = models.ForeignKey('self', null=True)
Event Table:
class Event(models.Model):
    Name = models.CharField(verbose_name=_('Name'))
    foid = models.ForeignKey(Managedobject)
Managedobject Records:
NOC
    Chennai
        MO_1
        MO_2
        MO_3
    Mumbai
        MO_4
        MO_5
        MO_6
    Delhi
    Bangalore
IP
    Calcutta
    Cochin
Events Records:
event1 MO_1
event2 MO_2
event3 MO_3
event4 MO_5
event5 MO_6
Now I need to get the events count for all the sub-regions. For example,
for NOC region:
Chennai - 3
Mumbai - 2
Delhi - 0
Bangalore - 0
So far I am able to get the result in two different queries.
First, get the subregions:
select id from managedobject where iscontainer = True and parentkey = 3489
Then, for each of the regions (using a for loop), get the count as follows:
SELECT count(*)
from event ev
WHERE ev.foid
IN (
WITH RECURSIVE q AS (
SELECT h
FROM managedobject h
WHERE parentkey = 3489
UNION ALL
SELECT hi
FROM q
JOIN managedobject hi
ON hi.parentkey = (q.h).id
)
SELECT (q.h).id FROM q
)
Please help me combine these queries into a single query that also returns the top 5 regions. Since the query is difficult to express in Django, I am going for a raw SQL query.
I got the query:
WITH RECURSIVE q AS (
SELECT h,
1 AS level,
id AS ckey,
displayname as dname
FROM managedobject h
WHERE parentkey = 3489
and logicalnode=True
UNION ALL
SELECT hi,
q.level + 1 AS level,
ckey,
dname
FROM q
JOIN managedobject hi ON hi.parentkey = (q.h).id
)
SELECT count(ckey) as ccount,
ckey,
dname
FROM q
JOIN event as ev on ev.foid_id = (q.h).id
GROUP BY ckey, dname
ORDER BY ccount DESC
LIMIT 5
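The idea of carrying the sub-region key (ckey) down the recursion can be tried on the sample records from the question. The sketch below rests on several assumptions: it uses Python's built-in sqlite3 module instead of PostgreSQL, it replaces the composite column h with plain columns, the numeric ids are invented, and the nesting of the NOC subtree is inferred from the expected counts in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table managedobject (id integer primary key, name text,
                                iscontainer int, parentkey int);
    create table event (name text, foid int);
    -- Invented ids; NOC is 1 and its sub-regions are 2 to 5.
    insert into managedobject values
        (1, 'NOC', 1, null),
        (2, 'Chennai', 1, 1), (3, 'Mumbai', 1, 1),
        (4, 'Delhi', 1, 1), (5, 'Bangalore', 1, 1),
        (6, 'MO_1', 0, 2), (7, 'MO_2', 0, 2), (8, 'MO_3', 0, 2),
        (9, 'MO_4', 0, 3), (10, 'MO_5', 0, 3), (11, 'MO_6', 0, 3);
    insert into event values
        ('event1', 6), ('event2', 7), ('event3', 8),
        ('event4', 10), ('event5', 11);
""")

# Seed rows are the sub-regions themselves; every descendant inherits
# the ckey and dname of the sub-region it sits under.
top_regions = conn.execute("""
    with recursive q (id, ckey, dname) as (
        select id, id, name
        from managedobject
        where parentkey = ? and iscontainer = 1
        union all
        select hi.id, q.ckey, q.dname
        from q
        join managedobject hi on hi.parentkey = q.id
    )
    select q.dname, count(*) as ccount
    from q
    join event ev on ev.foid = q.id
    group by q.ckey, q.dname
    order by ccount desc
    limit 5
""", (1,)).fetchall()
```

As in the query above, the inner join to event means sub-regions without any events (Delhi and Bangalore here) do not appear in the result; a left join from q would be needed to list them with a count of 0.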