Why can't BigQuery handle a query processing 4 TB of data? - google-cloud-platform

I'm trying to run this query
SELECT
  id,
  ARRAY_AGG(DISTINCT users_ids) AS users_ids,
  MAX(date) AS date
FROM
  users,
  UNNEST(users_ids) AS users_ids
WHERE
  users_ids != " 1111"
  AND users_ids != " 2222"
GROUP BY
  id;
The users table is a sharded table with an id column, a users_ids column (a comma-separated list of user ids), and a date column, totaling over 4 TB. Running the query gives me:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations.
Any idea why?
id | userids | date
1  | 2,3,4   | 1-10-20
2  | 4,5,6   | 1-10-20
1  | 7,8,4   | 2-10-20
So the final result I'm trying to reach is:
id | userids   | date
1  | 2,3,4,7,8 | 2-10-20
2  | 4,5,6     | 1-10-20
Execution details: (screenshot of the execution graph omitted)

Judging from the execution details, it's constantly repartitioning - I would guess that you're trying to cram too much into the aggregation. Try removing the aggregation; I don't even think you need the cross join here. Use a subquery instead of this cross join + aggregation combo.
Edit: I just realized that you want to aggregate the arrays, but with distinct values.
WITH t AS (
  SELECT
    id,
    ARRAY_CONCAT_AGG(ARRAY(
      SELECT DISTINCT uid
      FROM UNNEST(users_ids) AS uid
      WHERE uid != " 1111" AND uid != " 2222"
    )) AS users_ids,
    MAX(date) AS date
  FROM
    users
  GROUP BY id
)
SELECT
  id,
  ARRAY(SELECT DISTINCT * FROM UNNEST(users_ids)) AS users_ids,
  date
FROM t
This is just a draft (I assumed id identifies a group), but it should be something along those lines. Grouping by arrays is not possible, and ARRAY_CONCAT_AGG() has no DISTINCT option, so the deduplication comes in a second step.
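If a single GROUP BY id still blows the shuffle limit, a two-level aggregation can spread the work: hash each unnested id into a bucket, deduplicate within (id, bucket), then concatenate the buckets per id. Because every user id lands in exactly one bucket, the result stays globally distinct. This is only a sketch, assuming users_ids unnests to strings; the bucket count of 16 is arbitrary:
WITH partial AS (
  SELECT
    id,
    ARRAY_AGG(DISTINCT uid) AS uids,
    MAX(date) AS date
  FROM users, UNNEST(users_ids) AS uid
  WHERE uid NOT IN (" 1111", " 2222")
  -- each uid hashes to exactly one bucket, so per-bucket DISTINCT
  -- stays distinct after the buckets are concatenated
  GROUP BY id, MOD(ABS(FARM_FINGERPRINT(uid)), 16)
)
SELECT
  id,
  ARRAY_CONCAT_AGG(uids) AS users_ids,
  MAX(date) AS date
FROM partial
GROUP BY id;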

Related

How to collapse a sheet by pivoting rows into CSV data

I have a set of data where an account id can have multiple rows of country. I'm looking for an array formula that gives me a unique list of accounts with the countries in the second column as CSV values, e.g. country1,country2,country3.
If I UNIQUE the accounts, this query will do it per row, but I'm really looking for an array formula so I don't have to maintain it as the number of rows grows.
=TEXTJOIN(",",1,UNIQUE(QUERY(A:B,"select B where A = '"&D2&"'",0)))
I have a sample sheet here.
try:
=INDEX(REGEXREPLACE(TRIM(SPLIT(FLATTEN(QUERY(QUERY(
IF(A2:A="",,{A2:A&"×", B2:B&","}),
"select max(Col2)
where not Col2 matches '^×|^$'
group by Col2
pivot Col1"),,9^9)), "×")), ",$", ))

Get the date diff between rows: the first row's serviceperiodto date minus the next row's serviceperiodfrom date

My data looks like this:
userid                               | completedat              | serviceperiodfrom        | serviceperiodto
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2020-11-25T14:20:04.293Z | 2020-11-25T14:20:04.200Z | 2021-02-25T14:20:04.200Z
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2021-03-21T10:27:34.842Z | 2021-03-21T10:27:34.800Z | 2022-03-21T10:27:34.800Z
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2020-07-24T11:22:12.410Z | 2020-07-24T11:22:12.300Z | 2020-10-24T11:22:12.300Z
I need the date diff between the serviceperiodto date of one row and the serviceperiodfrom date of the next row, repeated for as many rows as each userid has.
I tried joining the table to itself with subqueries and tried building a pivot table, but neither worked for me. Please help.
You can use lag/lead to access the previous/next row:
WITH dataset AS (
  SELECT *
  FROM (
    VALUES
      (1, from_iso8601_timestamp('2020-11-25T14:20:04.200Z'), from_iso8601_timestamp('2021-02-25T14:20:04.200Z')),
      (1, from_iso8601_timestamp('2021-03-21T10:27:34.800Z'), from_iso8601_timestamp('2022-03-21T10:27:34.800Z')),
      (1, from_iso8601_timestamp('2020-07-24T11:22:12.300Z'), from_iso8601_timestamp('2020-10-24T11:22:12.300Z'))
  ) AS t (userid, serviceperiodfrom, serviceperiodto)
)
SELECT date_diff(
    'hour',
    serviceperiodto,
    lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom))
FROM dataset
Output:
_col0
770
572
NULL
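The last row per userid has no following period, so lead() returns NULL there. If you want to drop those rows, wrap the window expression in a derived table and filter; a small sketch against the same dataset CTE as above:
SELECT userid, gap_hours
FROM (
  SELECT userid,
         date_diff('hour', serviceperiodto,
                   lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom)) AS gap_hours
  FROM dataset
) t
WHERE gap_hours IS NOT NULL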

Mimic window functions in SAS to group rows and calculate the max continuous membership

I have a table like this showing the id of a member, the begin and end dates of each membership period, the days in that period, and the gap between the member's previous period and the current one.
For example, member 11101 started their membership on 3/1/95 and paused it on 6/1/97 (822 days). They renewed on 6/1/97 without a gap and kept the membership until 11/1/97 (153 days), so the total length of this continuous span is 822+153=975 days.
They then terminated the membership on 11/1/97 and restarted it on 11/10/04; the gap between these two periods is 11/10/04-11/01/97=2565 days.
I'm trying to find each member's longest continuous-membership span, which is 2160 days in this case. In SQL I would reach for the lag/lead window functions, but window functions are not supported in SAS PROC SQL. How can I group these periods based on the gap days and calculate the max span?
Thank you for the help!
This question was originally tagged SQL/MySQL/Oracle; this answers that original version of the question.
You can summarize the data per membership group for each member. The idea is to take a running sum of gap (which the data conveniently already contains): every non-zero gap starts a new group, so the running sum serves as a grouping key, and within each group you can simply sum the span:
select id, sum(span) as total_span, min(begin_date) as begin_date, max(end_date) as end_date
from (select t.*,
             sum(gap) over (partition by id order by begin_date) as grp
      from t
     ) t
group by id, grp;
For only the longest per id, you can use window functions again:
select *
from (select id, sum(span) as total_span, min(begin_date) as begin_date, max(end_date) as end_date,
             row_number() over (partition by id order by sum(span) desc) as seqnum
      from (select t.*,
                   sum(gap) over (partition by id order by begin_date) as grp
            from t
           ) t
      group by id, grp
     ) ig
where seqnum = 1;
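To make the grouping step concrete, here is a self-contained sketch with made-up rows that mirror the numbers quoted in the question (the 822/153/2565/2160 values come from the question; the exact dates of the last period are hypothetical). It runs on MySQL 8+:
with t as (
  select 11101 as id, date '1995-03-01' as begin_date, date '1997-06-01' as end_date, 822 as span, 0 as gap
  union all
  select 11101, date '1997-06-01', date '1997-11-01', 153, 0
  union all
  select 11101, date '2004-11-10', date '2010-10-09', 2160, 2565
)
select id, sum(span) as total_span, min(begin_date) as begin_date, max(end_date) as end_date
from (select t.*,
             sum(gap) over (partition by id order by begin_date) as grp
      from t
     ) t
group by id, grp;
-- returns one row per continuous span:
-- 975 days (1995-03-01 .. 1997-11-01) and 2160 days (2004-11-10 .. 2010-10-09)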

SQLite how to limit the number of records

I want to limit the number of records in my SQLite table to, for example, 100 records: when I INSERT the 101st record, the first (oldest) record should be removed from the table. In other words, I want to prevent the table from growing beyond 100 records and always keep the latest 100. Is there a setting or query in SQLite for this, or should I handle it manually?
Thanks in advance.
You can do it with a trigger.
Say that your table is this:
CREATE TABLE tablename (
id INTEGER PRIMARY KEY,
name TEXT,
inserted_at TEXT DEFAULT (strftime('%Y-%m-%d %H:%M:%f', 'now'))
);
In the column inserted_at you will have the timestamp of the insertion of each row.
This is not necessary if you declared the column id as:
id INTEGER PRIMARY KEY AUTOINCREMENT
because in this case you could identify the 1st inserted row by the minimum value of the id.
Now create this trigger:
CREATE TRIGGER keep_100_rows AFTER INSERT ON tablename
WHEN (SELECT COUNT(*) FROM tablename) > 100
BEGIN
  DELETE FROM tablename
  WHERE id = (SELECT id FROM tablename ORDER BY inserted_at, id LIMIT 1);
  -- or, if you defined id as AUTOINCREMENT:
  -- WHERE id = (SELECT MIN(id) FROM tablename);
END;
Every time you insert a new row, the trigger checks whether the table has more than 100 rows; if it does, it deletes the earliest-inserted row.
See the demo (for max 3 rows).
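A quick way to sanity-check the trigger (a sketch assuming the tablename schema above and a table already filled to its 100-row cap):
-- Insert one more row than the cap allows.
INSERT INTO tablename (name) VALUES ('row 101');
-- The count stays at the cap...
SELECT COUNT(*) FROM tablename;
-- ...and the oldest surviving timestamp has moved forward.
SELECT MIN(inserted_at) FROM tablename;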

Amazon Athena LEFT OUTER JOIN query not working as expected

I am trying to do a left outer join in Athena and my query looks like the following:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id
WHERE price IS NULL;
Each customer can have at most one order in the orders table, and some customers have no order at all. So I expect to get some records where a customer in the customer table has no record in the orders table, which means that after the LEFT OUTER JOIN the price will be NULL. But this query returns 0 rows every time I run it. I have queried both tables separately and I'm sure there is data in both, but I'm not sure why this returns nothing, whereas it works if I remove the price IS NULL. I have also tried price = '' and price IN (''), and neither works. Has anyone had a similar experience before? Or is there something wrong with my query that I cannot see or identify?
It seems that your query is correct. To validate, I created two CTEs that should match up with your customer and orders tables and ran your query against them. The query below returns a record for customer 3, Ted Johnson, who does not have an order.
WITH customer AS (
  SELECT 1 AS id, 'John Doe' AS name
  UNION
  SELECT 2 AS id, 'Jane Smith' AS name
  UNION
  SELECT 3 AS id, 'Ted Johnson' AS name
),
orders AS (
  SELECT 1 AS customer_id, 20 AS price
  UNION
  SELECT 2 AS customer_id, 15 AS price
)
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
  ON customer.id = orders.customer_id
WHERE price IS NULL;
I'd suggest running the following queries:
SELECT COUNT(DISTINCT id) FROM customer;
SELECT COUNT(DISTINCT customer_id) FROM orders;
Based on the results you are seeing, I would expect those counts to match. Perhaps your system creates a record in the orders table, with a price of 0, whenever a customer is created.
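If the counts do match, here are a couple of follow-up checks to confirm the placeholder-order theory (a sketch reusing the table and column names from the question; it assumes price is numeric):
-- Are there zero-price placeholder orders?
SELECT COUNT(*) FROM orders WHERE price = 0;
-- Are there customers with no order row at all?
SELECT COUNT(*)
FROM customer c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);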
Perhaps you can't use WHERE to filter the orders table here; try moving the condition into the join instead:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id AND orders.price IS NULL;