SAS hash table with multiple key-value pairs

I have a lookup table which looks like this (Name: LOOKUP_TABLE):
Obs Member_id plan_id Plan_desc group_id Group_name
1 164-234 XYZ HMO_Salaried G123 Umbrellas, Inc.
2 297-123 ABC PPO_Hourly G123 Umbrellas, Inc.
3 344-123 JKL HMO_Executive G456 Toy Company
4 395-123 XYZ HMO_Salaried G123 Umbrellas, Inc.
5 495-987 ABC PPO_Hourly G456 Toy Company
6 562-987 ABC PPO_Hourly G123 Umbrellas, Inc.
7 697-123 XYZ HMO_Salaried G456 Toy Company
I have another table with the following data (Name: MAIN_TABLE):
Obs Member_id zip income svc_dt dx plan_id group_id
1 164-234 04021 $45,000 2005/01/01 250 XYZ G123
2 297-123 22003-1234 $56,999 2005/02/03 4952 ABC G123
3 344-123 45459-0306 $72,999 2005/03/15 78910 JKL G456
4 395-123 03755 $75,000 2005/04/14 250 XYZ G123
5 495-987 94305 $96,000 2005/08/19 12345 ABC G456
6 562-987 78277-8310 $32,999 2005/09/13 250 ABC G123
7 697-123 88044-3760 $47,999 2005/11/01 4952 XYZ G456
My SAS data step is as follows:
data MAIN_TABLE_1;
    set MAIN_TABLE;

    declare hash pd_lookup(dataset: "&LOOKUP_TABLE.");
    rc_pd_definekey  = pd_lookup.definekey('plan_id', 'group_id');
    rc_pd_definedata = pd_lookup.definedata('Plan_desc', 'Group_name');
    rc_pd_definedone = pd_lookup.definedone();

    call missing(Plan_desc, Group_name);

    put "rc_pd_definekey is " rc_pd_definekey;
    put "rc_pd_definedata is " rc_pd_definedata;
    put "rc_pd_definedone is " rc_pd_definedone;

    drop rc_pd_definekey rc_pd_definedata rc_pd_definedone;

    rc_pd_lookup = pd_lookup.find();
run;
My question is about what's happening behind the scenes in this lookup, mainly with regard to the key-value pairs being generated.
That is, are individual key-value pairs being generated for each key and data column, like this?
: "plan_id" -> "Plan_desc"
: "plan_id" -> "Group_name"
: "group_id" -> "Plan_desc"
: "group_id" -> "Group_name"
Or are the keys concatenated together, and the values as well, with pairs formed from those? Something like this:
: "plan_id" + "group_id" -> "Plan_desc" + "Group_name"
I ask because I have to convert the same logic into R, and if I misunderstand this, the whole R implementation will be wrong.

Each combination of plan_id and group_id is used to retrieve a single entry from the hash table containing the values of both Plan_desc and Group_name. In other words, it is your second model: the two key variables together form one composite key, and the two data variables together form the entry stored against that key; there are no separate plan_id -> Plan_desc style pairs.
However, the lookup table currently contains duplicate rows with the same combination of these ids (e.g. obs 1 and 4), which may cause errors or unexpected behaviour. You should create a deduplicated copy of the lookup table and use that to declare the hash object.
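If it helps with the port, here is a minimal sketch of the same idea outside SAS (Python rather than R, and not how SAS stores the hash internally; the values come from LOOKUP_TABLE above, deduplicated on the key): the composite key behaves like a tuple, and the stored entry holds both data values.

# Composite key -> single entry holding both data values (illustrative only).
lookup = {
    ("XYZ", "G123"): ("HMO_Salaried",  "Umbrellas, Inc."),
    ("ABC", "G123"): ("PPO_Hourly",    "Umbrellas, Inc."),
    ("JKL", "G456"): ("HMO_Executive", "Toy Company"),
    ("ABC", "G456"): ("PPO_Hourly",    "Toy Company"),
    ("XYZ", "G456"): ("HMO_Salaried",  "Toy Company"),
}

# pd_lookup.find() then corresponds to one keyed lookup per row of MAIN_TABLE:
plan_desc, group_name = lookup[("XYZ", "G123")]   # -> "HMO_Salaried", "Umbrellas, Inc."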

Related

Power BI Measure date

I have a table (SalesTable) with a list of customers and the dates of orders I received from them. I also created a table called 'Calendar' with the CALENDARAUTO function.
What I would like to do is add a measure that assigns the value 1 to all orders that were placed before a specific date, but only for those customers who did not place ANY other order after that date.
Measure =
IF(
    SELECTEDVALUE('SalesTable'[SalesDate]) < MIN(Calendar[Date]) ||
    SELECTEDVALUE('SalesTable'[SalesDate]) > MAX(Calendar[Date]),
    1, 0
)
but this in fact only shows me the orders that were placed before MIN(Calendar[Date]); it does not restrict the result to customers who did not place any other order after that MIN(Calendar[Date]).
This MIN(Calendar[Date]) is controlled by a slicer.
Could anyone help me modify this?
and here my sample data:
Customers Order no. Dates of Orders Expected Results
Customer A 1 01.01.2023 1
Customer A 2 02.01.2023 1
Customer E 3 03.01.2023 1
Customer E 4 04.01.2023 1
Customer E 5 05.01.2023 1
Customer C 6 06.01.2023 0
Customer C 7 07.01.2023 0
Customer C 8 08.01.2023 0
Customer B 9 09.01.2023 0
Customer B 10 10.01.2023 0
Customer B 11 11.01.2023 0
Customer D 12 12.01.2023 0
Customer C 13 13.01.2023 0
Customer D 14 14.01.2023 0
Customer C 15 15.01.2023 0
and here is basically what my Power BI page looks like as an example; the above slicer should control what is shown in the matrix below it.
So let's take 09.01.2023 as a reference. I would like to assign value = 1 to customers A and E, because they bought something before 09.01.2023 but did not buy anything after 09.01.2023, and value = 0 to the remaining customers, since they did buy something after 09.01.2023.
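In case it helps to pin down the intended rule before writing the DAX, here is a small sketch of the logic in pandas terms (purely illustrative, using the sample data above with abbreviated customer labels; the reference date stands in for the slicer selection): a customer gets 1 when they placed at least one order before the reference date and none after it.

import pandas as pd

# Sample data from the question; ref_date stands in for the slicer selection.
orders = pd.DataFrame({
    "Customer": ["A", "A", "E", "E", "E", "C", "C", "C", "B", "B", "B", "D", "C", "D", "C"],
    "OrderDate": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-06", "2023-01-07", "2023-01-08", "2023-01-09", "2023-01-10",
        "2023-01-11", "2023-01-12", "2023-01-13", "2023-01-14", "2023-01-15",
    ]),
})
ref_date = pd.Timestamp("2023-01-09")

# 1 if the customer ordered before ref_date and placed nothing after it, else 0.
flags = orders.groupby("Customer")["OrderDate"].agg(
    lambda d: int((d < ref_date).any() and not (d > ref_date).any())
)
print(flags)   # A -> 1, E -> 1, B, C, D -> 0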

Power BI Measure to countrows of related values several related tables deep

I need to create a measure for a card that will count the total number of Question Groups that exist for each person using the tables below.
I've tried the following but it's returning the result 10, instead of the expected result, which should be 6 (George = 2, Susan = 1, Tom = 1, Bill = 1, Sally = 1, Mark = 0, Jason = 0).
Measure = COUNTROWS(NATURALLEFTOUTERJOIN(NATURALLEFTOUTERJOIN(People,Questions),'Question Groups'))
What am I doing wrong?
Table: People
PeopleID  Name
1         George
2         Susan
3         Tom
4         Bill
5         Sally
6         Mark
7         Jason
Table: relPeopleQuestions
PeopleID  QuestionID
1         1
1         2
1         3
2         4
2         5
3         6
4         7
5         8
Table: Questions
Question ID  Question name              Question Group ID
1            How are you?               1
2            Favorite Color?            2
3            Favorite Movie?            2
4            Sister's Name              3
5            Brother's Name             3
6            What is your birthdate?    1
7            What City do you live in?  1
8            Favorite game?             2
Table: Question Groups
Question Group ID  Question Group Name
1                  Assorted
2                  Favorites
3                  Relatives
A distinct count on the Question Group ID from the Questions table would seem to be sufficient, e.g.
MyMeasure =
VAR MyTable =
SUMMARIZE (
People,
People[Name],
"Count", DISTINCTCOUNT ( Questions[Question Group ID] )
)
RETURN
SUMX ( MyTable, [Count] )
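As a cross-check of that logic outside DAX, a rough pandas equivalent (counting the distinct Question Group IDs reachable from each person and summing; purely illustrative, built from the sample tables above) reproduces the expected total of 6:

import pandas as pd

people = pd.DataFrame({"PeopleID": range(1, 8),
                       "Name": ["George", "Susan", "Tom", "Bill", "Sally", "Mark", "Jason"]})
rel = pd.DataFrame({"PeopleID":   [1, 1, 1, 2, 2, 3, 4, 5],
                    "QuestionID": [1, 2, 3, 4, 5, 6, 7, 8]})
questions = pd.DataFrame({"QuestionID":      range(1, 9),
                          "QuestionGroupID": [1, 2, 2, 3, 3, 1, 1, 2]})

# Join the bridge table to Questions first, then attach the result to each person.
q_per_person = rel.merge(questions, on="QuestionID")
joined = people.merge(q_per_person, on="PeopleID", how="left")

# Distinct Question Group IDs per person, summed across all people.
per_person = joined.groupby("Name")["QuestionGroupID"].nunique()
print(per_person.sum())   # 6  (George 2, Susan/Tom/Bill/Sally 1 each, Mark/Jason 0)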

DAX Grouping and Ranking in Calculated Columns

My raw data stops at sales; I'm looking for some DAX help adding the last two columns as calculated columns.
customer_id order_id order_date sales total_sales_by_customer total_sales_customer_rank
------------- ---------- ------------ ------- ------------------------- ---------------------------
BM 1 9/2/2014 476 550 1
BM 2 10/27/2016 25 550 1
BM 3 9/30/2014 49 550 1
RA 4 12/18/2017 47 525 3
RA 5 9/7/2017 478 525 3
RS 6 7/5/2015 5 5 other
JH 7 5/12/2017 6 6 other
AG 8 9/7/2015 7 7 other
SP 9 5/19/2017 26 546 2
SP 10 8/16/2015 520 546 2
Let's start with total sales by customer:
total_sales_by_customer =
VAR custID = orders[customer_id]
RETURN
    CALCULATE(SUM(orders[sales]), FILTER(orders, custID = orders[customer_id]))
First we get the custID, then we filter the orders table on this ID and sum the sales together per customer.
Next the ranking:
total_sales_customer_rank =
VAR rankMe = RANKX(orders, orders[total_sales_by_customer],,,Dense)
RETURN IF(rankMe > 3, "other", CONVERT(rankMe, STRING))
We rank on the per-customer sales total (taken from the first calculated column); if the rank is bigger than 3, we replace it with "other".
On your first question: DAX is not like a typical programming language. Each row is assessed individually. Let's go with your first row: custID will be "BM".
Next we calculate the sum of all the sales: we filter the whole table on the custID and sum it together, so the filter actually contains only 3 rows!
This is repeated for each row. That may seem slow, but I only describe it this way so you can understand the result you are getting back; in reality there is clever logic to return the data fast.
What you want to do, Orders[Customer ID] = Orders[Customer ID], is not possible, because the Orders[Customer ID] inside the filter changes with each row the filter iterates over.
var custid = VALUES(Orders[Customer ID]): VALUES returns a single-column table, and you cannot use that in the filter, because you would then be comparing a cell value with a table.
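For comparison only, the same grouping and ranking can be expressed in pandas (not DAX; a sketch built from the sample rows above), which reproduces both calculated columns:

import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["BM", "BM", "BM", "RA", "RA", "RS", "JH", "AG", "SP", "SP"],
    "sales":       [476, 25, 49, 47, 478, 5, 6, 7, 26, 520],
})

# total_sales_by_customer: sum of sales over all rows sharing the same customer_id.
orders["total_sales_by_customer"] = orders.groupby("customer_id")["sales"].transform("sum")

# total_sales_customer_rank: dense rank of that total, replaced by "other" beyond the top 3.
rank = orders["total_sales_by_customer"].rank(method="dense", ascending=False).astype(int)
orders["total_sales_customer_rank"] = rank.astype(str).where(rank <= 3, "other")

print(orders.drop_duplicates("customer_id"))
# BM -> 550 / 1, SP -> 546 / 2, RA -> 525 / 3, RS, JH, AG -> "other"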

Python merge based on column position

I have 2 dataframes like so,
ID employee group
1 Bob Accounting
2 Jake Engineering
3 Lisa Engineering
4 Sue HR
ID employee hire_date
1 Lisa 2004
2 Bob 2008
3 Jake 2012
4 Sue 2014
Now I'd like to merge these two dataframes on the employee column. The only thing is, rather than referring to the column name employee, I need to refer to the position of the employee column, which I will know.
Simply put, I would like to merge the two dataframes on the employee column without mentioning the column name, by mentioning the column position only.
Now I tried something like this,
import pandas as pd
df1 = pd.DataFrame({'ID':[1,2,3,4], 'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'ID':[1,2,3,4],'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
merged = pd.merge(df1, df2, left_on=df1.ix[:,[1]], right_on=df2.ix[:,[1]])
But it is throwing ValueError. So could somebody help me with this?
Try this:
df1.merge(df2, right_on=df2.columns[1], left_on=df1.columns[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014
You can use list(df) to access a list of column names which you can reference by position:
merged = pd.merge(df1, df2, left_on = list(df1)[1], right_on = list(df2)[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014

Amazon Redshift - Joining table and finding out unmatched rows

I have three tables whose pseudo structure would be something as follows:
User_master
user pfid
------------
reno 2
andrew 3
reno 4
rosh 5
rosh 8
john 7
HR_master
user pfid
-------------
andrew 3
reno 4
rosh 9
john 12
Roaster_master
user pfid
--------------
andrew 3
reno 4
rosh 10
john 12
I need to join all 3 tables on the user column and find the rows in HR_master where the pfid doesn't match any equivalent entry in User_master. Note that one of the entries for "reno" matches, while none of the entries for "rosh" match.
It would have been an easy task if there were only one entry per user in User_master; the complication arises because of the multiple rows.
The expected output is
USM.user USM.pfid HRM.pfid RM.pfid
-----------------------------------------
rosh 5|8 9 10
john 7 12 12
As asked, here is the query that I have compiled:
select
    UM.user, UM.pfid as UMpfid,
    HRM.pfid, RM.pfid
from user_master UM
left join HR_master HRM on (HRM.user = UM.user)
left join Roaster_master RM on (RM.user = UM.user)
where UM.pfid != HRM.pfid
The above query returns "reno" as well, whereas it should not appear, since one of the rows for "reno" in User_master has a matching pfid.
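To make the matching rule concrete, here is a sketch of the intended logic in pandas rather than Redshift SQL (purely illustrative, using the sample rows above): keep a user only when none of their User_master pfids equals their HR_master pfid.

import pandas as pd

um  = pd.DataFrame({"user": ["reno", "andrew", "reno", "rosh", "rosh", "john"],
                    "pfid": [2, 3, 4, 5, 8, 7]})
hrm = pd.DataFrame({"user": ["andrew", "reno", "rosh", "john"], "pfid": [3, 4, 9, 12]})
rm  = pd.DataFrame({"user": ["andrew", "reno", "rosh", "john"], "pfid": [3, 4, 10, 12]})

# Collapse User_master to one row per user, keeping the set of its pfids.
um_sets = um.groupby("user")["pfid"].agg(lambda s: set(s)).reset_index(name="um_pfids")

merged = um_sets.merge(hrm, on="user").merge(rm, on="user", suffixes=("_hrm", "_rm"))

# Keep only users whose HR_master pfid matches none of their User_master pfids.
mask = [hr not in pfids for hr, pfids in zip(merged["pfid_hrm"], merged["um_pfids"])]
print(merged[mask])   # rosh {5, 8} 9 10  and  john {7} 12 12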