Grouping multiple text values from Query in Google Sheets - regex

I have a pivot table created by a QUERY function in Google Sheets, and I want to group its rows based on a decision rule.
The pivot table looks something like this (classes as rows, student names as the header, grades as values):
| John Dough | John Though | John Doe |... | John A Hill
History | 79 | | |... | |
Chem 101 | | | 87 |... | |
Phys 101 | | | |... | 77 |
Phys 202 | | | |... | |
Geo 101 | | 75 | |... | |
... | | | ... |... | |
Sport AT | | | 85 |... | |
Now, let's say the score needed to pass the final exam is 75. What I'd like to get is this table:
| Failed | Passed
History | John Dough | John A Hill , John Deere
Chem 101 | John E , John Tra | John Son , John Snow
Phys 101 | John B Good , John Na | #N/A
Phys 202 | John Bon Jovi | John Diy , John L , John R
Geo 101 | #N/A | John Lennon
... | ... | ...
Sport AT | John Bone | John the revelator
The catch is that I'd like to wrap the existing pivot table in a formula, so that it looks something like:
=MagicFormula[Query("Data !A1:X99","select yada yada, sum(yada), Pivot(whatever)]
My question is: can this be done by wrapping?

=ARRAYFORMULA({"", "PASSED", "FAILED"; A2:A, REGEXREPLACE(TRIM({
TRANSPOSE(QUERY(TRANSPOSE(IF((B2:E>=79)*(B2:E<>""), B1:E1&",", )),,999^99)),
TRANSPOSE(QUERY(TRANSPOSE(IF((B2:E< 79)*(B2:E<>""), B1:E1&",", )),,999^99))}),
",$", )})
UPDATE:
=ARRAYFORMULA({{QUERY(QUERY({List!B5:D},
"select Col1,sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"select Col1", 0)}, {"PASSED", "FAILED";
REGEXREPLACE(TRIM({TRANSPOSE(QUERY(TRANSPOSE(IF((QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)>=79)*(QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)<>""), QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"limit 0", 1)&",", )),,999^99)), TRANSPOSE(QUERY(TRANSPOSE(IF((QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)< 79)*(QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"offset 1", 0)<>""), QUERY(QUERY({List!B5:D},
"select sum(Col3) where Col1 is not null group by Col1 pivot Col2", 0),
"limit 0", 1)&",", )),,999^99))}), ",$", )}})

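For readers who want to check the grouping logic outside of Sheets, here is a minimal Python sketch of what the formulas above compute: for each class, join the names of students at or above a pass mark into one cell and the rest into another. The first three records come from the question's table; the last two are made up so that the Failed column is not empty, and the function name is hypothetical.

```python
PASS_MARK = 75  # threshold stated in the question

# (class, student, score)
records = [
    ("History", "John Dough", 79),
    ("Chem 101", "John Doe", 87),
    ("Geo 101", "John Though", 75),
    ("History", "John E", 60),     # made-up row for illustration
    ("Chem 101", "John Tra", 40),  # made-up row for illustration
]


def group_by_outcome(rows, pass_mark):
    """Return {class: {"Passed": "a , b", "Failed": "c"}}."""
    table = {}
    for cls, student, score in rows:
        cell = table.setdefault(cls, {"Passed": [], "Failed": []})
        key = "Passed" if score >= pass_mark else "Failed"
        cell[key].append(student)
    # Join each list into one comma-separated string, like the formula does
    return {
        cls: {key: " , ".join(names) for key, names in cell.items()}
        for cls, cell in table.items()
    }


result = group_by_outcome(records, PASS_MARK)
for cls, cell in result.items():
    print(cls, "|", cell["Failed"], "|", cell["Passed"])
```

An empty string here plays the role of the #N/A cells in the desired output.
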
Concatenating two columns from different tables if one of the columns is not empty

I have two tables which are connected by an ID column (not shown in the picture). Here is how the data looks:
| column1 | column2 |
| -------- | -------------- |
| Mike | 345 |
| Steve | 987 |
| Andy | 0 |
| Lucas | 0 |
--
| column3 | column4 |
| -------- | -------------- |
| Mike | 543 |
| Lucas | 0 |
| Andy | 678 |
| Steve | 0 |
I wish to create a calculated column which concatenates the values from the second table in the picture (column3, column4) only if the value in column2 is zero. If column2 is not zero, the first table's values (column1, column2) take precedence in the concatenation.
If both column2 and column4 are zero, there should be no concatenation.
I'm expecting something like this:
| Column3 | Column4 | Concat column|
|---- |------| -----|
| Mike | 543 | Mike 345 |
| Lucas | 0 | |
| Andy | 678 | Andy 678 |
| Steve | 0 | Steve 987 |
Try this.
ConcatColumn = IF(Table1[Column2] <> 0, Table1[Column1] & " " & Table1[Column2], IF(RELATED(Table2[Column4]) <> 0, RELATED(Table2[Column3]) & " " & RELATED(Table2[Column4]), BLANK()))
Before using this calculated column, you first have to establish a relationship between Table1 and Table2 on Column1 and Column3.
It is also assumed that Column2 and Column4 have the Whole Number data type.

Deleting rows based on multiple columns conditions

Given the following table have, I would like to delete the records that satisfy the conditions based on the to_delete table.
data have;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Basket|30SEP21:00:00:00
111|Basket|31DEC20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|31DEC20:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
data to_delete;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
In the past, I used to operate with the catx() function to concatenate the conditions in a where statement, but I wonder if there is a better way of doing this
proc sql;
delete from have
where catx('|',id,item,datetime) in
(select catx('|',id,item,datetime) from to_delete);
quit;
+-----+--------+------------------+
| id | item | datetime |
+-----+--------+------------------+
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
+-----+--------+------------------+
Please note that it should allow the have table to have more columns than the table to_delete.
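You can use the EXCEPT set operator to compute the difference between the two tables. Note that EXCEPT compares whole rows, so this assumes have and to_delete contain the same columns: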
proc sql;
create table want as
select * from have except select * from to_delete
;
quit;
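Since the question asks for have to be allowed more columns than to_delete, a correlated EXISTS on just the key columns may fit better than a whole-row comparison. Below is a minimal sketch of that idea, run against SQLite through Python's sqlite3 module rather than SAS, so treat it as an illustration of the SQL pattern and not as tested PROC SQL. The extra qty column is hypothetical, and datetimes are stored as plain text here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "qty" is a hypothetical extra column that exists only in have
conn.execute("CREATE TABLE have (id INTEGER, item TEXT, datetime TEXT, qty INTEGER)")
conn.execute("CREATE TABLE to_delete (id INTEGER, item TEXT, datetime TEXT)")

conn.executemany(
    "INSERT INTO have VALUES (?, ?, ?, ?)",
    [
        (111, "Basket", "30SEP20:00:00:00", 1),
        (111, "Basket", "30SEP21:00:00:00", 2),
        (111, "Basket", "31DEC20:00:00:00", 3),
        (111, "Backpack", "31MAY22:00:00:00", 4),
        (222, "Basket", "31DEC20:00:00:00", 5),
        (222, "Basket", "30JUN20:00:00:00", 6),
    ],
)
conn.executemany(
    "INSERT INTO to_delete VALUES (?, ?, ?)",
    [
        (111, "Basket", "30SEP20:00:00:00"),
        (111, "Backpack", "31MAY22:00:00:00"),
        (222, "Basket", "30JUN20:00:00:00"),
    ],
)

# Delete every row of have whose three key columns match a row of to_delete
conn.execute(
    """
    DELETE FROM have
    WHERE EXISTS (
        SELECT 1 FROM to_delete d
        WHERE d.id = have.id
          AND d.item = have.item
          AND d.datetime = have.datetime
    )
    """
)

remaining = conn.execute(
    "SELECT id, item, datetime FROM have ORDER BY rowid"
).fetchall()
print(remaining)
```

Comparing the columns one by one also avoids building a concatenated string key per row, which is what the catx() approach does.
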

Big query PIVOT operator, how do I make it work when column data type is STRING and I cannot apply aggregate functions

I was trying to understand and work with BigQuery's new PIVOT operator.
From another post, How to Pivot table in BigQuery, I understand how it works.
The example in the Google documentation uses data with product, sales, and quarter columns and produces pivoted data with the query below.
SELECT * FROM
(SELECT * FROM Produce)
PIVOT(SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
+---------+----+----+----+----+
| product | Q1 | Q2 | Q3 | Q4 |
+---------+----+----+----+----+
| Apple | 77 | 0 | 25 | 2 |
| Kale | 51 | 23 | 45 | 3 |
+---------+----+----+----+----+
What if I have to pivot data like the example below with the PIVOT operator, where the sales column is a STRING? This is just an example; I cannot provide the real data here because it is sensitive.
+---------+-------+---------+
| product | sales | quarter |
+---------+-------+---------+
| Kale | good | Q1 |
| Kale | bad | Q2 |
| Kale | good | Q3 |
| Kale | bad | Q4 |
| Apple | bad | Q1 |
| Apple | good | Q2 |
| Apple | bad | Q3 |
| Apple | good | Q4 |
+---------+-------+---------+
And the output should be as below
+---------+------+------+------+------+
| product | Q1   | Q2   | Q3   | Q4   |
+---------+------+------+------+------+
| Apple   | bad  | good | bad  | good |
| Kale    | good | bad  | good | bad  |
+---------+------+------+------+------+
In this case SUM will not work, and neither will casting. How should the PIVOT operator be used in such cases?
Consider below
SELECT * FROM
(SELECT * FROM Produce)
PIVOT(STRING_AGG(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
Try aggregate functions that work with strings. For example MAX:
with Produce AS (
SELECT 'Kale' as product, 'good' as sales, 'Q1' as quarter UNION ALL
SELECT 'Kale', 'bad', 'Q2' UNION ALL
SELECT 'Kale', 'good', 'Q3' UNION ALL
SELECT 'Kale', 'bad', 'Q4' UNION ALL
SELECT 'Apple', 'bad', 'Q1' UNION ALL
SELECT 'Apple', 'good', 'Q2' UNION ALL
SELECT 'Apple', 'bad', 'Q3' UNION ALL
SELECT 'Apple', 'good', 'Q4')
SELECT * FROM
(SELECT product, sales, quarter FROM Produce)
PIVOT(MAX(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
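The reason MAX is enough here: each (product, quarter) pair has exactly one row, so the maximum over that single string is the string itself. The sketch below shows the same idea with plain conditional aggregation, run in SQLite through Python's sqlite3 module. SQLite has no PIVOT operator, so this emulates the result rather than reproducing the BigQuery syntax.

```python
import sqlite3

rows = [
    ("Kale", "good", "Q1"), ("Kale", "bad", "Q2"),
    ("Kale", "good", "Q3"), ("Kale", "bad", "Q4"),
    ("Apple", "bad", "Q1"), ("Apple", "good", "Q2"),
    ("Apple", "bad", "Q3"), ("Apple", "good", "Q4"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Produce (product TEXT, sales TEXT, quarter TEXT)")
conn.executemany("INSERT INTO Produce VALUES (?, ?, ?)", rows)

# One MAX(CASE ...) per output column; rows from other quarters become NULL
# and the aggregate skips them, leaving the single matching string.
pivoted = conn.execute(
    """
    SELECT product,
           MAX(CASE WHEN quarter = 'Q1' THEN sales END) AS Q1,
           MAX(CASE WHEN quarter = 'Q2' THEN sales END) AS Q2,
           MAX(CASE WHEN quarter = 'Q3' THEN sales END) AS Q3,
           MAX(CASE WHEN quarter = 'Q4' THEN sales END) AS Q4
    FROM Produce
    GROUP BY product
    ORDER BY product
    """
).fetchall()
print(pivoted)
```

If a pair could have several rows, MAX would silently pick one value, whereas an aggregate such as STRING_AGG would keep them all.
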

SAS:how to count average interval time

how to count average interval time
+-------------+----------+----------+--------+------------------+
| customer_id | date | time | answer | missed_call_type |
+-------------+----------+----------+--------+------------------+
| 101 | 2018/8/3 | 12:13:00 | no | employee |
| 102 | 2018/8/3 | 12:15:00 | no | customer |
| 103 | 2018/8/3 | 12:20:00 | no | employee |
| 102 | 2018/8/3 | 15:15:00 | no | customer |
| 101 | 2018/8/3 | 18:15:00 | no | employee |
| 105 | 2018/8/3 | 18:18:00 | no | customer |
| 102 | 2018/8/3 | 19:18:00 | no | employee |
+-------------+----------+----------+--------+------------------+
I have a table that looks like this and want to calculate the average interval time for those who did not answer the phone. For this example, the average interval time is:
[(18:15:00-12:13:00)+(19:18:00-15:15:00)+(15:15:00-12:15:00)]/3
This works in MSSQL, where I can create an interval_time column for each customer and then sum it up. How can I achieve this in SAS, with a data step or proc sql?
CREATE TABLE customer_data (
customer_id BIGINT,
date DATE,
time time,
answer VARCHAR(100),
missed_call_type VARCHAR(100)
);
INSERT INTO customer_data
VALUES
(101, '2018/8/3', '12:13:00', 'no', 'employee'),
(102, '2018/8/3', '12:15:00', 'no', 'customer'),
(103, '2018/8/3', '12:20:00', 'no', 'employee'),
(102, '2018/8/3', '15:15:00', 'no', 'customer'),
(101, '2018/8/3', '18:15:00', 'no', 'employee'),
(105, '2018/8/3', '18:18:00', 'no', 'customer'),
(102, '2018/8/3', '19:18:00', 'no', 'employee')
select cd.customer_id, answer, missed_call_type,
CAST(CAST(cd.date as VARCHAR(10))+' ' +CAST(cd.time as VARCHAR(10)) as datetime) as date,
ROW_NUMBER() OVER(PARTITION BY cd.customer_id ORDER BY date desc, time desc) as ranks
INTO #temP
from customer_data cd
order by cd.customer_Id, ranks;
select AVG(DATEDIFF(MINUTE, x1.date, x2.date)) as avg_mins
from #temP x1
INNER JOIN #temP x2 ON x1.customer_id = x2.customer_id
WHERE x2.ranks = (x1.ranks-1)
A nested query can prepare the data for the selection and computation you want. The key observation is that the datetime range (max - min) of a customer_id group equals the sum of the sequential intervals between all of that customer's unanswered calls.
data have;
input customer_id date & yymmdd8. time & time8. answer $ missed_call_type $;
format date yymmdd10. time time8.;
datetime = dhms(date,hour(time), minute(time), second(time));
format datetime datetime20.;
datalines;
101 2018/8/3 12:13:00 no employee
102 2018/8/3 12:15:00 no customer
103 2018/8/3 12:20:00 no employee
102 2018/8/3 15:15:00 no customer
101 2018/8/3 18:15:00 no employee
105 2018/8/3 18:18:00 no customer
102 2018/8/3 19:18:00 no employee
run;
proc sql;
create table want as
select
sum(range) / sum (interval_count) as mean_interval_time format=time8.
, sum(range) as sum_range format=time8.
, sum(interval_count) as sum_interval_count
, count(range) as group_count
from
( select
max(datetime) - min(datetime) as range
, count(*) - 1 as interval_count
from have
group by customer_id
having count(*) > 1
);
quit;
You do not explain what should happen if the answer=yes, so the actual query may be more complicated than shown here.
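To double-check the range-equals-sum-of-intervals shortcut and the expected number, here is a small Python sketch over the sample rows from the question. It is only a cross-check of the arithmetic, not SAS code, and it ignores the answer column because every sample row is a no.

```python
from collections import defaultdict
from datetime import datetime

# (customer_id, timestamp) taken from the question's sample table
calls = [
    (101, "2018-08-03 12:13:00"),
    (102, "2018-08-03 12:15:00"),
    (103, "2018-08-03 12:20:00"),
    (102, "2018-08-03 15:15:00"),
    (101, "2018-08-03 18:15:00"),
    (105, "2018-08-03 18:18:00"),
    (102, "2018-08-03 19:18:00"),
]

by_customer = defaultdict(list)
for customer_id, stamp in calls:
    by_customer[customer_id].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))

total_seconds = 0.0
interval_count = 0
for times in by_customer.values():
    if len(times) < 2:
        continue  # a single call has no interval (HAVING count(*) > 1)
    times.sort()
    # Sum of consecutive gaps...
    sequential = sum((b - a).total_seconds() for a, b in zip(times, times[1:]))
    # ...equals the overall range of the group
    span = (times[-1] - times[0]).total_seconds()
    assert sequential == span
    total_seconds += span
    interval_count += len(times) - 1

mean_minutes = total_seconds / interval_count / 60
print(interval_count, round(mean_minutes, 2))
```

Customer 101 contributes 362 minutes over one interval and customer 102 contributes 423 minutes over two, which gives 785 / 3 minutes on average.
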

Set value of different columns in Pandas DataFrame with indexes

I have extracted the indexes of certain rows that I'd like to assign to rows of another DataFrame with the following command:
indexes = df1[df1.iloc[:, 0].isin(df2.iloc[:, 0].values)].index.values
What I'd like to do is copy values from certain columns of df2 into certain columns of df1, but only for the rows whose indexes I have.
For example:
df1:
index | col1 | col2 | col3
0 | ABC | DEF | GHI
1 | JKL | MNO | PQR
2 | STU | VWX | YZ
df2:
index | colA | colB | colC
0 | WHAT | EVER | 123
2 | 111 | 222 | 333
What I'd like to do now for example is to assign the value of colB (df2) to col3 (df1) according to the indexes. So the result should be:
df1:
index | col1 | col2 | col3
0 | ABC | DEF | EVER <- value of colB (df2)
1 | JKL | MNO | PQR
2 | STU | VWX | 222 <- value of colB (df2)
I'm aware that I can set values with the .iloc (integer location) indexer, but I can't figure out how to do this with the corresponding indexes.
I'd also appreciate a good pandas guide (as you can see, I'm new to pandas).
Greetings,
Frame
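One possible approach, sketched under the assumption that the row labels of df2 (0 and 2 in the example) match the row labels of the df1 rows to update: use the label-based .loc indexer instead of .iloc. When the right-hand side is a Series, pandas aligns it on the index, so each value lands on the row with the same label. The frames below just rebuild the example from the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "col1": ["ABC", "JKL", "STU"],
        "col2": ["DEF", "MNO", "VWX"],
        "col3": ["GHI", "PQR", "YZ"],
    }
)
df2 = pd.DataFrame(
    {
        "colA": ["WHAT", "111"],
        "colB": ["EVER", "222"],
        "colC": ["123", "333"],
    },
    index=[0, 2],
)

# Row labels that exist in both frames (assumption: labels line up)
shared = df1.index.intersection(df2.index)

# Label-based assignment: colB of df2 overwrites col3 of df1 on those rows
df1.loc[shared, "col3"] = df2.loc[shared, "colB"]
print(df1)
```

If the rows have to be matched on the values of the first column rather than on index labels, as the isin() call in the question suggests, the labels will not necessarily line up and a merge or a map on that column would be needed instead.
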