Transposing / pivoting rows to columns in Cascalog?

Let's say I have a set of tuples to be processed by Cascalog, formatted like [Date, Name, Value], e.g.
2014-01-01 Pizza 3
2014-01-01 Hamburger 4
2014-01-01 Cheeseburger 2
2014-01-02 Pizza 1
2014-01-02 Hamburger 2
Given that I have a list of columns like [Pizza, Hamburger, Cheeseburger], I want to transpose / pivot the data so it looks like this:
Date        Pizza  Hamburger  Cheeseburger
2014-01-01  3      4          2
2014-01-02  1      2          0
What's the best way to do this in Cascalog?

Here's one way to do it:
(use 'cascalog.api)

(def input
  [["2014-01-01" "Pizza" 3]
   ["2014-01-01" "Hamburger" 4]
   ["2014-01-01" "Cheeseburger" 2]
   ["2014-01-02" "Pizza" 1]
   ["2014-01-02" "Hamburger" 2]])

;; Wrap each name/value pair in a single-entry map...
(defn init-aggregate [k v]
  {k v})

;; ...and merge the maps, summing the values of duplicate keys.
(def combine-aggregate
  (partial merge-with +))

(defparallelagg aggregate
  :init-var #'init-aggregate
  :combine-var #'combine-aggregate)

;; Look up each column's value in the aggregated map (nil when absent).
(defn select-values [hashmap keyseq]
  (map #(get hashmap %) keyseq))

(def columns
  ["Pizza" "Hamburger" "Cheeseburger"])

(defn transpose [data]
  (<- [?date !pizza !hamburger !cheeseburger]
      ((<- [?date ?sum]
           (data ?date ?name ?value)
           (aggregate ?name ?value :> ?sum))
       ?date ?sum)
      (select-values ?sum columns :> !pizza !hamburger !cheeseburger)))

(?- (stdout) (transpose input))
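For reference, running this should print something along these lines (exact formatting depends on Cascading's tuple output); note that the missing Cheeseburger value on 2014-01-02 comes out as null rather than 0, as discussed below:
2014-01-01  3  4  2
2014-01-02  1  2  null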
Let's have a quick run-through of the code:
Most of the action happens in the transpose function, which contains two queries:
The inner query aggregates all ?name ?value pairs for a given date into a ?sum map.
The outer query uses select-values to fetch the values for our columns out of the ?sum map, and into the final result rows.
Since we know the columns are Pizza, Hamburger, and Cheeseburger, we can simply hardcode them into the query. If you want to know how to make the columns dynamic, read Nathan Marz's blog post on creating a news feed in Cascalog.
Note that we have to represent the columns as nullable variables (using !) since not every column will have a value for any given row. If we wanted to avoid null results, we could change select-values to use 0 as the default value.
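For example, here is a minimal sketch of that change, passing 0 as get's default:
;; Fall back to 0 instead of nil for columns missing from the map.
(defn select-values [hashmap keyseq]
  (map #(get hashmap % 0) keyseq))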
(One caveat is that this won't produce any headers in the final output, so adding them has to be done as a post-processing step.)

Related

BigQuery Challenge: How can I correctly assign the next available value and keep it until the next update?

I've come up with quite a challenging BigQuery question here.
So basically, I have to assign the next available value to session 1's code (in this case, session 1 should get the next available value, 123).
However, we want to keep the code value at 234 in session 4 until it gets another update.
Here's what I have:
timestamp  session  user_id  code
ts1        1        User A   NULL
ts2        2        User A   NULL
ts3        2        User A   123
ts4        3        User A   NULL
ts5        3        User A   234
ts6        4        User A   NULL
And the desired output table:
timestamp  session  user_id  code
ts1        1        User A   123
ts2        2        User A   123
ts3        2        User A   123
ts4        3        User A   234
ts5        3        User A   234
ts6        4        User A   234
Thanks everyone for the help!
You might consider the approach below.
SELECT *,
  COALESCE(
    FIRST_VALUE(code IGNORE NULLS) OVER w0,  -- next non-null code at or after this row
    LAST_VALUE(code IGNORE NULLS) OVER w1    -- otherwise, the last non-null code before it
  ) AS new_code
FROM sample_table
WINDOW w  AS (PARTITION BY user_id ORDER BY timestamp),
       w0 AS (w RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
       w1 AS (w RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
(Query results screenshot omitted.)
JayTiger's answer is much cleaner, but here's what I came up with that can be used as an alternative:
SELECT
  * EXCEPT (code, LatestCodeBySession),
  IFNULL(
    FIRST_VALUE(LatestCodeBySession IGNORE NULLS) OVER
      (PARTITION BY user_id ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
    LAST_VALUE(LatestCodeBySession IGNORE NULLS) OVER
      (PARTITION BY user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  ) AS code
FROM (
  -- LatestCodeBySession: the last non-null code anywhere within each session
  SELECT *,
    LAST_VALUE(code IGNORE NULLS) OVER
      (PARTITION BY session ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
      AS LatestCodeBySession
  FROM sample_table
);

Create an indicator variable for a table relationship after an outer join

I am working with data from a data cube, meaning I cannot easily control the structure of my underlying data. My underlying data looks like the table below. As a cube data source, each of these columns is also its own table, with defined join relationships between one another. For example, selecting the Group1 column on its own would look like 'Group1'[Group1].
Group1  Group2  Group3  Desired Column
1       a       x       1
2       a       y       1
3       a       y       1
4       b       y       1
5       b       y       1
6       c       z       0
I am trying to create a variable that reflects what is shown in "Desired Column", so that I can include results for any value of Group2 that maps to the "y" value of Group3.
I am not very fluent with DAX, but my preliminary thoughts on an approach are to run a FILTER on Group3 for "y", select values of Group2, and then somehow use those Group2 values in a CONTAINS statement in a further FILTER. Here is my current, non-functional attempt:
Is_Y = CONTAINS(NATURALLEFTOUTERJOIN('Group2','Group3'), 'Group3'[Group3], "Y")
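For what it's worth, here is a rough DAX sketch of the approach described above; it is hypothetical and untested, and assumes a calculated column can be defined somewhere the current row's Group2 value is accessible:
Is_Y =
VAR Group2sWithY =
    CALCULATETABLE (
        VALUES ( 'Group2'[Group2] ),   -- Group2 values...
        'Group3'[Group3] = "y"         -- ...restricted to rows where Group3 is "y"
    )
RETURN
    IF ( 'Group2'[Group2] IN Group2sWithY, 1, 0 )
CALCULATETABLE collects the Group2 values that co-occur with "y", and IN tests the current row's membership in that single-column table.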

Sort redis hash maps based on string field

I'm trying to implement query functionality for my AWS Redis cluster. I have stored all my data as hash maps and have also created a sorted set for each of the indexed fields.
Whenever a query is received, we query the sorted sets to find ids. The query may involve multiple indexes as well, which get merged based on AND/OR conditions. Once we have the final set of ids, we need to sort the data based on some fields, so basically I'm fetching the list of hashmaps that match the ids. The hashmaps look like this:
HSET employees::1 name Arivu salary 100000 age 30
HSET employees::2 name Uma salary 300000 age 31
HSET employees::3 name Jane salary 100000 age 25
HSET employees::4 name Zakir salary 150000 age 28
Now I'm adding all the hash keys to a set so that I can use the SORT command:
SADD collection employees::1 employees::2 employees::3 employees::4
Now when I try to sort based on a string field, the sort doesn't seem to work:
127.0.0.1:6379> SORT collection by name
1) "employees::2"
2) "employees::4"
3) "employees::3"
4) "employees::1"
127.0.0.1:6379> SORT collection by name desc
1) "employees::2"
2) "employees::4"
3) "employees::3"
4) "employees::1"
I assume this is because the hashmaps are stored as byte data, but is there any way I can sort these alphabetically?
I have also tried the ALPHA param that the SORT command provides, but it doesn't seem to work either:
SORT collection by name desc ALPHA
Your usage seems to be incorrect. When the BY pattern contains no * placeholder, Redis has nothing to substitute each element into, so it skips sorting entirely; that's why the order never changes, with or without ALPHA.
Set your hashes like this (as you are doing):
HSET employees::1 name Arivu salary 100000 age 30
HSET employees::2 name Uma salary 300000 age 31
HSET employees::3 name Jane salary 100000 age 25
HSET employees::4 name Zakir salary 150000 age 28
Store your ids in the set like this:
SADD collection 1 2 3 4
Please note that in the set I store just the ids of the employees (1, 2, 3, 4).
Now it's time to sort:
SORT collection by employees::*->name ALPHA
It will sort as you expected:
1) "1"
2) "3"
3) "2"
4) "4"
If you need the field values too, do it like this:
SORT collection by employees::*->name ALPHA GET employees::*->name
1) "Arivu"
2) "Jane"
3) "Uma"
4) "Zakir"
If you need age as well as name:
SORT collection by employees::*->name ALPHA GET employees::*->name GET employees::*->age
1) "Arivu"
2) "30"
3) "Jane"
4) "25"
5) "Uma"
6) "31"
7) "Zakir"
8) "28"

How to incorporate thresholds for a limit check against a provided static value?

Here is the example:
Column 1  Column 2  Column 3
2.99      4         Price OK
1.99      4         Price below limit
12.99
5.99      6         Price OK
1.99      6         Price below limit
8.99      6         Price OK
For Power BI context: Column 2 is a custom column from Power Query. The goal is to set a threshold value based on the Column 2 pack size. In this instance, a pack size of 4 needs to check for a minimum price of $2.99 (higher is OK); below that, the result should be "Price below limit". Where Column 2 is blank, the result should also be blank. For a pack size of 6, the minimum price to check for is 5.99.
Is there a decent way to go about this?
Let's do this in two steps. First, create a column MinPrice that defines your minimum prices:
if [Column 2] = 4 then 2.99
else if [Column 2] = 6 then 5.99
else null
Then create a column that compares the actual price with the minimum:
if [Column 1] = null or [Column 2] = null then null
else if [Column 1] < [MinPrice] then "Price below limit"
else "Price OK"
If you have a bunch of unique values in Column 2 that you need to create rules for, then instead of the first step above, create a reference table that you can merge onto your original table, expanding its MinPrice column (see the sketch after the table below).
Column2 MinPrice
-----------------
4 2.99
6 5.99
8 7.99
...
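As a minimal sketch of that merge in Power Query's M (assuming the reference table is loaded as a query named MinPrices; Source stands for the previous step of the original table's query, and the step names are hypothetical), you would add steps like these:
// left-join the reference table on the pack size
Merged = Table.NestedJoin(Source, {"Column 2"}, MinPrices, {"Column2"}, "MP", JoinKind.LeftOuter),
// pull MinPrice out of the nested join column
Expanded = Table.ExpandTableColumn(Merged, "MP", {"MinPrice"})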

select last row by group with "collapse (last) ..." syntax

I want to select the last row in each subset of the data determined by one or more categorical variables.
Background. For each ticket in my data set, I have a ticketid and multiple transactions (sale, refund, sale, refund, sale...). I am only interested in keeping series that end in "sale".
My first step was to drop ticketids with evenly matched sales and refunds:
duplicates tag ticketid, gen(mult)
by ticketid: egen count_sale = total(transtatus == "Sale")
by ticketid: egen count_ref = total(transtatus == "Refund")
drop if mult & count_sale == count_ref
Now, I want to keep just the final sale when count_sale = count_ref + 1
sort ticketid time
preserve
** some collapse command
save "temp_terminal_sales.dta"
restore
append using "temp_terminal_sales.dta"
I can't figure out how (if at all) to use collapse here. I think I may just have to keep if mult, tag the last row with by ticketid: gen last = _n == _N and keep if last...? It seems like collapse should work. Here is the (wrong) syntax that seemed intuitive to me:
collapse (last), by(ticketid)
collapse (last) *, by(ticketid)
These don't work because (i) a varlist is required, and (ii) the by variables cannot be in the varlist.
Example data:
ticketid time myvar transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
Desired result:
ticketid time myvar transtatus
2 1 2 "Sale"
4 3 2 "Sale"
The easiest generic way to keep the last of a group is as follows. For a concrete example I assume panel data with identifier id and time variable time:
bysort id (time): keep if _n == _N
The generalisation is
bysort <variables defining groups> (<variable defining order first ... last>): keep if _n == _N
Many Stata commands support the in qualifier, but here we need if. The syntax hinges crucially on the fact that under the aegis of by:, the observation number _n and the number of observations _N are determined within the groups defined by by:. Thus _n == 1 identifies the first and _n == _N identifies the last observation in each group.
drop if _n < _N is a dual command here.
You touched on this approach in your question, but the intermediate step of creating an indicator variable is unnecessary.
For collapse a work-around is presumably just to use some other variable, or even to create one for the purpose as in gen anything = 1. But I would always use by: for your purpose.
There is a discursive tutorial on by: at http://www.stata-journal.com/article.html?article=pr0004. Searching the Stata Journal archives using by as a keyword will reveal more applications.
@NickCox has already provided the general answer. Now that you have given example data, I post a reproducible example with several syntaxes:
clear all
set more off
input ///
ticketid time myvar str10 transtatus
1 1 2 "Sale"
1 2 2 "Refund"
2 1 2 "Sale"
3 1 2 "Sale"
3 2 2 "Refund"
3 3 2 "Sale"
3 4 2 "Refund"
4 1 2 "Sale"
4 2 2 "Refund"
4 3 2 "Sale"
end
list, sepby(ticketid)
*-----
* Method 1
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
by ticketid: keep if _n == _N // keep last observation of subsets
list
*-----
* Method 2
// list of all variables except ticketid
unab allvars: _all
local exclvar ticketid
local mycvars: list allvars - exclvar
bysort ticketid (time): keep if transtatus[_N] == "Sale" // keep subsets
collapse (last) `mycvars', by(ticketid) // keep last observation of subsets
list
*-----
*Method 3
bysort ticketid (time): keep if transtatus[_N] == "Sale" & _n == _N
list
(Remember to reload the data for each method.)
Consider also tagging and then running the following estimation commands with if. For example, regress ... if ...
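A minimal sketch of that tagging idea with the example data's variables (the regression specification itself is hypothetical):
* tag the final observation per ticket, but only when it is a sale
bysort ticketid (time): gen byte lastsale = (_n == _N) & (transtatus == "Sale")
* run estimation commands on the tagged subset only
regress myvar time if lastsale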