Amazon Redshift - Joining tables and finding unmatched rows

I have three tables whose pseudo-structure is roughly as follows:
User_master
user pfid
------------
reno 2
andrew 3
reno 4
rosh 5
rosh 8
john 7
HR_master
user pfid
-------------
andrew 3
reno 4
rosh 9
john 12
Roaster_master
user pfid
--------------
andrew 3
reno 4
rosh 10
john 12
I need to join all three tables on the user column and find the rows in HR_master whose pfid does not match any corresponding entry in User_master. Note that one of the entries for "reno" matches, while none of the entries for "rosh" match.
This would be an easy task if there were only one entry per user in User_master; the complication arises because of the multiple rows.
The expected output is
USM.user USM.pfid HRM.pfid RM.pfid
-----------------------------------------
rosh 5|8 9 10
john 7 12 12
As asked, here is the query that I have compiled:
select
    UM.email,
    UM.pfid as UMpfid,
    HRM.pfid,
    RM.pfid
from user_master UM
left join HR_master HRM on (HRM.email = UM.email)
left join Roaster_master RM on (RM.email = UM.email)
where UM.pfid != HRM.pfid
The above query returns "reno" as well, although it should not, because one of the User_master rows for "reno" has a matching pfid.
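One possible way to exclude users such as "reno" is to test for the absence of any matching User_master row with NOT EXISTS, rather than comparing pfid row by row. The sketch below is only an illustration under several assumptions: it runs against an in-memory SQLite database from Python instead of Redshift, it names the join column user_name (the pseudo tables call it user, the posted query uses email), and it uses SQLite's group_concat to build the "5|8" list, where Redshift offers LISTAGG for a similar purpose.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_master    (user_name TEXT, pfid INTEGER);
    CREATE TABLE hr_master      (user_name TEXT, pfid INTEGER);
    CREATE TABLE roaster_master (user_name TEXT, pfid INTEGER);
    INSERT INTO user_master VALUES
        ('reno', 2), ('andrew', 3), ('reno', 4),
        ('rosh', 5), ('rosh', 8), ('john', 7);
    INSERT INTO hr_master VALUES
        ('andrew', 3), ('reno', 4), ('rosh', 9), ('john', 12);
    INSERT INTO roaster_master VALUES
        ('andrew', 3), ('reno', 4), ('rosh', 10), ('john', 12);
""")

# Keep an HR_master row only if NO User_master row for the same user
# carries the same pfid; then collapse the User_master pfids per user.
QUERY = """
SELECT um.user_name,
       group_concat(um.pfid, '|') AS um_pfids,
       hrm.pfid AS hr_pfid,
       rm.pfid  AS rm_pfid
FROM user_master um
JOIN hr_master hrm ON hrm.user_name = um.user_name
LEFT JOIN roaster_master rm ON rm.user_name = um.user_name
WHERE NOT EXISTS (
    SELECT 1
    FROM user_master m
    WHERE m.user_name = hrm.user_name
      AND m.pfid = hrm.pfid
)
GROUP BY um.user_name, hrm.pfid, rm.pfid
ORDER BY um.user_name
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With the sample data, only "john" and "rosh" remain. The order of values inside the concatenated pfid list is not guaranteed by SQLite.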

PowerBI DAX - Sum table by criteria and date

I am relatively new to PowerBI/PowerQuery/DAX and have become stuck on the following problem. I am unsure which road to go down to get the best outcome and would appreciate any help.
My data table is connected to a time tracking application. A user enters a time entry every time they complete a task. The task can be either a Project task or an Admin task. Each of these has multiple sub-categories beneath it, each with its own ID. This translates to my table as follows:
User ProjectID AdminID Hours Date
John 1 2 01/01/22
John 11 1 01/01/22
John 4 1 01/01/22
John 12 3 01/01/22
John 13 1 01/01/22
Pete 7 1 01/01/22
Pete 2 4 01/01/22
Pete 3 2 01/01/22
Mike 1 6 01/01/22
Mike 9 1 01/01/22
Mike 10 1 01/01/22
My objective is, for each Date in the table, to calculate the total hours spent doing either Project tasks or Admin tasks. I am not concerned about the specific breakdown (i.e. the sum per unique ID), only the overall total. The above example covers just one day; in reality my data covers multiple years. My expected output would look like this:
User TotalProject TotalAdmin Date
John 3 5 01/01/22
John 3 4 01/02/22
John 5 2 01/03/22
Pete 5 1 01/01/22
Pete 1 8 01/02/22
Pete 6 2 01/03/22
Mike 6 2 01/01/22
Mike 6 1 01/02/22
Mike 7 2 01/03/22
I am unsure of the best method to achieve this: should I create some kind of column in the table through PowerQuery, or a calculated column using DAX? And if so, what would the SUM syntax look like?
I am very willing to learn, so any tips would be greatly appreciated!
For your sample input, just create 2 measures.
Total Admin = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[AdminID])))
Total Project = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[ProjectID])))
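To sanity-check what these two measures compute, the same logic can be mirrored outside Power BI. The pandas sketch below is only an illustration: the rows are hypothetical (the sample table above does not show unambiguously which ID column each value belongs to), and it assumes every entry has either a ProjectID or an AdminID, with the other left blank.

```python
import pandas as pd

# Hypothetical rows: each entry fills either ProjectID or AdminID, never both.
df = pd.DataFrame({
    "User": ["John", "John", "John", "Pete", "Pete"],
    "ProjectID": [1, None, 4, None, 2],
    "AdminID": [None, 11, None, 7, None],
    "Hours": [2, 1, 1, 1, 4],
    "Date": ["01/01/22"] * 5,
})

totals = (
    df.assign(
        # Count the hours only where the corresponding ID is not blank,
        # which is what NOT(ISBLANK(...)) does in the measures.
        TotalProject=df["Hours"].where(df["ProjectID"].notna(), 0),
        TotalAdmin=df["Hours"].where(df["AdminID"].notna(), 0),
    )
    .groupby(["User", "Date"], as_index=False)[["TotalProject", "TotalAdmin"]]
    .sum()
)
print(totals)
```

In Power BI itself, the grouping by User and Date comes from the visual's rows and filters rather than from an explicit group-by.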

Power BI : line grouping

I am just starting to use Power BI, and I don't know how to group rows.
I have this kind of data:
api user 01/07/21 02/07/21 03/07/21 ...
a 25 null 3 4
b 25 1 null 2
c 25 1 4 5
a 30 4 3 5
b 30 3 2 2
c 30 1 1 3
I would like to have the sum of the values per user, rather than per api and user:
user 01/07/21 02/07/21 03/07/21 ...
25 2 7 11
30 8 6 10
Do you know how to do this, please?
I created a table with your sample data (make sure your values are treated as numbers).
Then create a Matrix visual, with "user" in Rows and your desired columns in the Values section.
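The aggregation the Matrix visual performs here is a sum per user with blanks ignored. As a cross-check outside Power BI, the same result can be sketched in pandas using the sample data from the question (the variable names are illustrative):

```python
import pandas as pd

# Sample data from the question; None stands for the null cells.
df = pd.DataFrame({
    "api": ["a", "b", "c", "a", "b", "c"],
    "user": [25, 25, 25, 30, 30, 30],
    "01/07/21": [None, 1, 1, 4, 3, 1],
    "02/07/21": [3, None, 4, 3, 2, 1],
    "03/07/21": [4, 2, 5, 5, 2, 3],
})

# Drop the api column and sum every date column per user.
# sum() skips missing values by default.
per_user = df.drop(columns="api").groupby("user").sum()
print(per_user)
```

This reproduces the totals requested in the question (2, 7, 11 for user 25 and 8, 6, 10 for user 30).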

PowerBI: How can I have two different side by side tables scroll at the same time in PBI?

I have two tables:
Table A
id  name    month_1  month_2  month_3  month_4  month_5  month_6
1   John    3        0        1        0        null     null
2   Mary    6        1        2        1        1        2
3   Angelo  1        5        null     null     null     null
4   Diane   3        2        0        1        null     null
Table B
id  name    LastYearTotal  CurrentYearTotal
1   John    2              4
2   Mary    6              13
3   Angelo  9              6
4   Diane   9              6
I want tables A and B to be displayed side by side, but not in the same table: there should be a separator between A and B. When I use a filter, both tables should reflect it. In addition, there should be only one scroll bar for both tables, so that they move at the same time.
Thanks.

How to reshape data multiple ways in Stata?

I am working with a data set covering multiple countries, variables, and years. It is currently organized wide like so (actually ~30 years and 5 different variables for each country):
country measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
What I would like is for the data to be rearranged long like so:
country year A B C
USA 1995 5 1 0
USA 1996 4 2 4
USA 1997 1 1 2
UK 1995 2 2 2
UK 1996 4 8 4
UK 1997 9 4 1
I tried using reshape long yr, i(country) j(year) but get the following error message:
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified i(country) and j(year). In
the current wide form, variable country should uniquely identify the observations.
I think this is because country is not the only long variable? (measure also is?)
Besides fixing that issue and arranging the years long instead of wide, I don't think this command will accomplish the other task of moving the different variables (A, B, C) into the wide format as column headers.
Will I need to use a separate reshape wide command for that? Or is there some way to expand the command to do both at once?
It's a double reshape. At least it can be done that way; and, further, that seems essential because years need to be long, not wide, and the measure(s) need to be wide, not long, so there are flavours of both problems.
Economic development data often arrive like this. Indeed, the problem has given rise to at least one dedicated short paper in the Stata Journal, which is visible to all.
Your data example is helpful, and almost immediately useful, but please read the Stata tag and help dataex (if necessary, install dataex first using ssc install dataex).
See also this FAQ, which includes some hints beyond the Stata help and manual entry.
A search reshape in Stata would have pointed to these resources.
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
end
reshape long yr, i(country measure) j(year)
reshape wide yr, i(country year) j(measure) string
rename (yr*) *
list, sepby(country)
     +----------------------------+
     | country   year   A   B   C |
     |----------------------------|
  1. |      UK   1995   2   2   2 |
  2. |      UK   1996   4   8   4 |
  3. |      UK   1997   9   4   1 |
     |----------------------------|
  4. |     USA   1995   5   1   0 |
  5. |     USA   1996   4   2   4 |
  6. |     USA   1997   1   1   2 |
     +----------------------------+
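For readers who want to verify the result outside Stata, the same two-step idea (years go long, then measures go wide) can be sketched in pandas. This is only a cross-check under the assumption that pandas is available; the variable names are illustrative, and melt and pivot stand in for the two reshape commands.

```python
import pandas as pd

# Same example data as in the Stata input block.
wide = pd.DataFrame({
    "country": ["USA", "USA", "USA", "UK", "UK", "UK"],
    "measure": ["A", "B", "C", "A", "B", "C"],
    "yr1995": [5, 1, 0, 2, 2, 2],
    "yr1996": [4, 2, 4, 4, 8, 4],
    "yr1997": [1, 1, 2, 9, 4, 1],
})

# Step 1: years go long (counterpart of the reshape long step).
long = wide.melt(id_vars=["country", "measure"],
                 var_name="year", value_name="value")
long["year"] = long["year"].str.replace("yr", "", regex=False).astype(int)

# Step 2: measures go wide (counterpart of the reshape wide step).
result = (
    long.pivot(index=["country", "year"], columns="measure", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(result)
```

The resulting frame has one row per country and year, with A, B and C as columns, matching the listing above.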

Python merge based on column position

I have two dataframes like so:
ID employee group
1 Bob Accounting
2 Jake Engineering
3 Lisa Engineering
4 Sue HR
ID employee hire_date
1 Lisa 2004
2 Bob 2008
3 Jake 2012
4 Sue 2014
Now I'd like to merge these two dataframes on the employee column. The catch is that, rather than naming the employee column, I need to specify only its position, which I will know.
Simply put, I would like to merge the two dataframes on the employee column by giving the column position only, without mentioning the column name.
Now I tried something like this,
import pandas as pd
df1 = pd.DataFrame({'ID':[1,2,3,4], 'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'ID':[1,2,3,4],'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
merged = pd.merge(df1, df2, left_on=df1.ix[:,[1]], right_on=df2.ix[:,[1]])
But it throws a ValueError. Could somebody help me with this?
Try this:
df1.merge(df2, right_on=df2.columns[1], left_on=df1.columns[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014
You can use list(df) to access a list of column names which you can reference by position:
merged = pd.merge(df1, df2, left_on = list(df1)[1], right_on = list(df2)[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014
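Both answers work by turning a position into a column name before merging. As a minimal sketch, this can be wrapped in a small helper; merge_on_position and its parameters are hypothetical names introduced here, not part of pandas. Note also that the .ix indexer used in the question was removed in pandas 1.0, so on current versions the attempt in the question fails before the merge is even reached.

```python
import pandas as pd

def merge_on_position(left, right, left_pos, right_pos, **kwargs):
    """Hypothetical helper: merge two frames on columns given by position."""
    return left.merge(
        right,
        left_on=left.columns[left_pos],     # position -> column name
        right_on=right.columns[right_pos],  # position -> column name
        **kwargs,
    )

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# Merge on the second column (position 1) of each frame.
merged = merge_on_position(df1, df2, 1, 1, suffixes=("_left", "_right"))
print(merged)
```

Extra keyword arguments such as how or suffixes are passed straight through to DataFrame.merge, so the helper also covers the case where the key columns have different names in the two frames.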