Python merge based on column position


I have 2 dataframes like so,
ID employee group
1 Bob Accounting
2 Jake Engineering
3 Lisa Engineering
4 Sue HR
ID employee hire_date
1 Lisa 2004
2 Bob 2008
3 Jake 2012
4 Sue 2014
I'd like to merge these two dataframes on the employee column. The catch is that I can't refer to the column by its name; I can only refer to it by its position, which I will know in advance.
Simply put, I would like to merge the 2 dataframes on the employee column by giving the column position rather than the column name.
I tried something like this:
import pandas as pd
df1 = pd.DataFrame({'ID':[1,2,3,4], 'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'ID':[1,2,3,4],'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
merged = pd.merge(df1, df2, left_on=df1.ix[:,[1]], right_on=df2.ix[:,[1]])
But it throws a ValueError. Could somebody help me with this?

Try this:
df1.merge(df2, right_on=df2.columns[1], left_on=df1.columns[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014

You can use list(df) to access a list of column names which you can reference by position:
merged = pd.merge(df1, df2, left_on = list(df1)[1], right_on = list(df2)[1])
Output:
ID_x employee group ID_y hire_date
0 1 Bob Accounting 2 2008
1 2 Jake Engineering 3 2012
2 3 Lisa Engineering 1 2004
3 4 Sue HR 4 2014
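
Both answers look the column name up by position and pass that name to merge. A minimal self-contained sketch of the same idea follows; the helper name merge_on_position is hypothetical (it is not a pandas function), and the columns= argument is only there to pin the column order.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"ID": [1, 2, 3, 4],
     "employee": ["Bob", "Jake", "Lisa", "Sue"],
     "group": ["Accounting", "Engineering", "Engineering", "HR"]},
    columns=["ID", "employee", "group"],
)
df2 = pd.DataFrame(
    {"ID": [1, 2, 3, 4],
     "employee": ["Lisa", "Bob", "Jake", "Sue"],
     "hire_date": [2004, 2008, 2012, 2014]},
    columns=["ID", "employee", "hire_date"],
)


def merge_on_position(left, right, left_pos, right_pos, **kwargs):
    """Hypothetical helper: merge on columns given by integer position.

    The position is translated to a column label via .columns, and the
    label is what pandas actually merges on.
    """
    return left.merge(
        right,
        left_on=left.columns[left_pos],
        right_on=right.columns[right_pos],
        **kwargs
    )


# Column 1 is "employee" in both frames.
merged = merge_on_position(df1, df2, 1, 1)
```

Extra keyword arguments such as how="left" are passed straight through to merge.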

Related

DAX equation to average data with different timespans

I have data for different companies. The data stops at day 10 for one of the companies (Company 1), day 6 for the others. If Company 1 is selected with other companies, I want to show the average so that the data runs until day 10, but using day 7, 8, 9, 10 values for Company 1 and day 6 values for others.
I could simply fill down days 7-10 for the other companies with the day 6 value, but that would look misleading on the graph. So I need a DAX expression that handles this in the calculation instead.
As an example, I have companies:
Company 1
Company 2
Company 3
etc. as a filter
And a table like:
Company    Date        Day of Month  Count
Company 1  1.11.2022   1             10
Company 1  2.11.2022   2             20
Company 1  3.11.2022   3             21
Company 1  4.11.2022   4             30
Company 1  5.11.2022   5             40
Company 1  6.11.2022   6             50
Company 1  7.11.2022   7             55
Company 1  8.11.2022   8             60
Company 1  9.11.2022   9             62
Company 1  10.11.2022  10            70
Company 1  11.11.2022  11            NULL
Company 2  1.11.2022   1             15
Company 2  2.11.2022   2             25
Company 2  3.11.2022   3             30
Company 2  4.11.2022   4             34
Company 2  5.11.2022   5             45
Company 2  6.11.2022   6             100
Company 2  7.11.2022   7             NULL
Every date has a row, but the count is NULL once a company's data ends (after day 10 for Company 1, after day 6 for the others). If Company 1 or Company 2 is chosen separately, I'd like to show the count as is. If they are chosen together, I'd like the average of the two so that:
Day 5: AVG(40,45)
Day 6: AVG(50,100)
Day 7: AVG(55,100)
Day 8: AVG(60,100)
Day 9: AVG(62,100)
Day 10: AVG(70,100)
Any ideas?

You want something like this?
Create a matrix (every company paired with every calendar day) from your:
company_table_dim (M rows)
calendar_Days_Table (N rows)
This gives you a new table of M x N rows.
In Power Query, sort the data and fill down your QTY column:
= Table.FillDown(#"Se expandió Fact_Table",{"QTY"})
Your last known QTY will then be filled down to the end of the time table, whichever companies are filtered.
Cons: the new M x N matrix could mean millions of rows to calculate.
Greetings
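
The fill-down idea can be checked outside Power BI. The following is a rough pandas sketch, not DAX and not the answerer's Power Query steps: it pivots the sample data from the question, carries each company's last known count forward, and averages across companies up to the last day on which any company has data. The column name Day stands in for "Day of Month".

```python
import pandas as pd

nan = float("nan")  # stands in for NULL in the question's table

# Sample data from the question.
c1 = [10, 20, 21, 30, 40, 50, 55, 60, 62, 70, nan]
c2 = [15, 25, 30, 34, 45, 100, nan]
rows = [("Company 1", day, count) for day, count in zip(range(1, 12), c1)]
rows += [("Company 2", day, count) for day, count in zip(range(1, 8), c2)]
df = pd.DataFrame(rows, columns=["Company", "Day", "Count"])

# One column per company, one row per day.
wide = df.pivot(index="Day", columns="Company", values="Count")

# Last day on which at least one company reported a value.
last_day = wide.dropna(how="all").index.max()

# Carry each company's last known value forward, then cut off at last_day
# so the chart does not run past the real data.
filled = wide.ffill().loc[:last_day]

# Average over the selected companies (here: both of them).
avg = filled.mean(axis=1)
```

With both companies selected, day 7 becomes the average of 55 and 100 and day 10 the average of 70 and 100, which matches the behaviour asked for in the question.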

PowerBI DAX - Sum table by criteria and date

I'm relatively new to PowerBI/PowerQuery/DAX and have become stuck on the following problem. I am unsure which road to go down to get the best outcome and would appreciate any help.
My data table is connected to a time tracking application. A User will enter a time entry every time they complete a task. The task can be either a Project task or an Admin task. When selecting either of these, there will be multiple sub-categories beneath each, each with its own ID. This translates to my table as the following:
User ProjectID AdminID Hours Date
John 1 2 01/01/22
John 11 1 01/01/22
John 4 1 01/01/22
John 12 3 01/01/22
John 13 1 01/01/22
Pete 7 1 01/01/22
Pete 2 4 01/01/22
Pete 3 2 01/01/22
Mike 1 6 01/01/22
Mike 9 1 01/01/22
Mike 10 1 01/01/22
My objective is, for each Date in the table, to calculate the total hours spent either doing Project tasks or Admin tasks. I am not concerned about the specific breakdown (ie the sum of the unique IDs), rather the overall total. The above example covers just one day, in reality my data covers multiple years. My expected output will look like this :
User TotalProject TotalAdmin Date
John 3 5 01/01/22
John 3 4 01/02/22
John 5 2 01/03/22
Pete 5 1 01/01/22
Pete 1 8 01/02/22
Pete 6 2 01/03/22
Mike 6 2 01/01/22
Mike 6 1 01/02/22
Mike 7 2 01/03/22
I am unsure of the best method to achieve this. Should I create some kind of column in the table through PowerQuery, or a calculated column using DAX? And if so, what would the SUM syntax look like?
Very willing to learn, so any tips would be greatly appreciated!

For your sample input, just create 2 measures.
Total Admin = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[AdminID])))
Total Project = CALCULATE( SUM('Table'[Hours]), NOT(ISBLANK('Table'[ProjectID])))
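
For comparison, the same "sum Hours where the ID is not blank" logic can be sketched in pandas. This is not DAX, and the rows below are hypothetical illustration data, because the question's sample table does not show which ID column is blank on each row.

```python
import pandas as pd

# Hypothetical time entries: each row has either a ProjectID or an AdminID.
entries = pd.DataFrame({
    "User": ["John", "John", "John", "Pete", "Pete"],
    "ProjectID": [1, None, 4, None, 2],
    "AdminID": [None, 11, None, 7, None],
    "Hours": [2, 1, 1, 1, 4],
    "Date": ["01/01/22"] * 5,
})

totals = (
    entries.assign(
        # Keep Hours only where the matching ID is present, else count 0.
        TotalProject=entries["Hours"].where(entries["ProjectID"].notna(), 0),
        TotalAdmin=entries["Hours"].where(entries["AdminID"].notna(), 0),
    )
    .groupby(["User", "Date"], as_index=False)[["TotalProject", "TotalAdmin"]]
    .sum()
)
```

The result has one row per User and Date, which is the shape of the expected output in the question.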

Power BI Measure to countrows of related values several related tables deep

I need to create a measure for a card that will count the total number of Question Groups that exist for each person using the tables below.
I've tried the following, but it returns 10 instead of the expected result of 6 (George = 2, Susan = 1, Tom = 1, Bill = 1, Sally = 1, Mark = 0, Jason = 0).
Measure = COUNTROWS(NATURALLEFTOUTERJOIN(NATURALLEFTOUTERJOIN(People,Questions),'Question Groups'))
What am I doing wrong?
Table: People
PeopleID  Name
1         George
2         Susan
3         Tom
4         Bill
5         Sally
6         Mark
7         Jason
Table: relPeopleQuestions
PeopleID  QuestionID
1         1
1         2
1         3
2         4
2         5
3         6
4         7
5         8
Table: Questions
Question ID  Question name              Question Group ID
1            How are you?               1
2            Favorite Color?            2
3            Favorite Movie?            2
4            Sister's Name              3
5            Brother's Name             3
6            What is your birthdate?    1
7            What City do you live in?  1
8            Favorite game?             2
Table: Question Groups
Question Group ID  Question Group Name
1                  Assorted
2                  Favorites
3                  Relatives
A working example file can be obtained here.

A distinct count on the Question Group ID from the Questions table would seem to be sufficient, e.g.
MyMeasure =
VAR MyTable =
    SUMMARIZE (
        People,
        People[Name],
        "Count", DISTINCTCOUNT ( Questions[Question Group ID] )
    )
RETURN
    SUMX ( MyTable, [Count] )
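
The expected total of 6 can be cross-checked with a short pandas sketch of the same logic: count distinct question groups per person, then sum. This is not DAX; the tables are rebuilt from the question, with column names written without spaces for convenience.

```python
import pandas as pd

people = pd.DataFrame({
    "PeopleID": [1, 2, 3, 4, 5, 6, 7],
    "Name": ["George", "Susan", "Tom", "Bill", "Sally", "Mark", "Jason"],
})
rel_people_questions = pd.DataFrame({
    "PeopleID": [1, 1, 1, 2, 2, 3, 4, 5],
    "QuestionID": [1, 2, 3, 4, 5, 6, 7, 8],
})
questions = pd.DataFrame({
    "QuestionID": [1, 2, 3, 4, 5, 6, 7, 8],
    "QuestionGroupID": [1, 2, 2, 3, 3, 1, 1, 2],
})

# Attach the question group to every person/question pair.
joined = rel_people_questions.merge(questions, on="QuestionID")

# Distinct groups per person; people without questions count as 0.
per_person = (
    joined.groupby("PeopleID")["QuestionGroupID"]
    .nunique()
    .reindex(people["PeopleID"], fill_value=0)
)

total = int(per_person.sum())
```

Counting distinct groups per person before summing is what avoids the inflated count from the plain row count over the joined tables.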

Amazon Redshift - Joining table and finding out unmatched rows

I have two tables whose pseudo structure would be something as follows:
User_master
user pfid
------------
reno 2
andrew 3
reno 4
rosh 5
rosh 8
john 7
HR_master
user pfid
-------------
andrew 3
reno 4
rosh 9
john 12
Roaster_master
user pfid
--------------
andrew 3
reno 4
rosh 10
john 12
I need to join all 3 tables on the user column and find the rows in HR_master whose pfid doesn't match any corresponding entry in User_master. Note that one of the entries for "reno" matches, while none of the entries for "rosh" match.
It would have been an easy task if there were only one entry per user in User_master; the complication arises because of the multiple rows.
The expected output is
USM.user USM.pfid HRM.pfid RM.pfid
-----------------------------------------
rosh 5|8 9 10
john 7 12 12
As asked, here is the query that I have compiled:
select
    UM.email, UM.pfid as UMpfid,
    HRM.pfid, RM.pfid
from user_master UM
left join HR_master HRM on (HRM.email=UM.email)
left join Roaster_master RM on (RM.email=UM.email)
where UM.pfid != HRM.pfid
The above query returns "reno" as well, whereas it should not, since one of the rows for "reno" in User_master has a matching pfid.
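
The intended logic (keep a user only when none of their User_master rows matches the HR_master pfid) can be sketched in pandas to make the expected output concrete. This is an illustration of the requirement, not a Redshift query; the column names um_pfids, hr_pfid and rm_pfid are made up for the sketch.

```python
import pandas as pd

user_master = pd.DataFrame({
    "user": ["reno", "andrew", "reno", "rosh", "rosh", "john"],
    "pfid": [2, 3, 4, 5, 8, 7],
})
hr_master = pd.DataFrame({
    "user": ["andrew", "reno", "rosh", "john"],
    "pfid": [3, 4, 9, 12],
})
roaster_master = pd.DataFrame({
    "user": ["andrew", "reno", "rosh", "john"],
    "pfid": [3, 4, 10, 12],
})

# Collapse User_master to one row per user, keeping every pfid in a list.
um = user_master.groupby("user")["pfid"].agg(list).reset_index(name="um_pfids")

merged = (
    um.merge(hr_master.rename(columns={"pfid": "hr_pfid"}), on="user", how="left")
      .merge(roaster_master.rename(columns={"pfid": "rm_pfid"}), on="user", how="left")
)

# True where the HR pfid is absent from all of the user's User_master pfids.
mask = pd.Series(
    [hr not in pfids for hr, pfids in zip(merged["hr_pfid"], merged["um_pfids"])],
    index=merged.index,
)

no_match = merged.loc[mask]
# Show the collected pfids in the "5|8" style used in the question.
unmatched = no_match.assign(
    um_pfids=no_match["um_pfids"].map(lambda p: "|".join(str(x) for x in p))
)
```

Because the comparison is made per user rather than per row, "reno" drops out (pfid 4 matches) while "rosh" and "john" remain.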

How to check if there is any missing and replace

I have a dataset covering a number of companies, with a variable for each firm's number of employees. In some years the number of employees has not been reported, so some years appear blank while the years before and after contain a value.
The data is similar to:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009
Company 3 2010 3
I would like to be able to search the dataset for any such occurrences, create an indicator for these years, and afterwards replace any blank spots with the value from the year before. If there is no previous year to use as a replacement, or the previous year is blank, I would use the year after the blank spot. I am hoping for the dataset to look like:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009 10
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009 4
Company 3 2010 3
To sum up: first, I need to check whether or not I have a problem with missing values in between two years (it is important that the code does not replace missing values before or after the last year with a non-missing value, since some firms exit the sample). Next, if there are any blank years in between two years that are non-blank, I would like to replace these blank spots as mentioned above.

The method I would use:
1. Sort the dataset company/year.
2. Replace missing values using LAG function if the missing value is not the first observation of the company group.
3. Reverse the sort order
4. Repeat step 2 on the dataset with reversed order
5. Return the dataset to the original order
Please note, I have changed your original data for Company 3 in order to have a case for your second scenario (missing value, no previous record).
DATA HAVE;
    input COMPANY $ 0-10 YEAR 13-17 N_EMPLOYEES 24-27;
    datalines;
Company 1   2007       4
Company 1   2008       5
Company 1   2009       5
Company 1   2010       5
Company 2   2007       11
Company 2   2008       10
Company 2   2009
Company 2   2010       10
Company 3   2007
Company 3   2008       3
Company 3   2009       4
Company 3   2010       3
;
run;
PROC SORT DATA=HAVE
    OUT=DOSOMEWORKHERE;
    BY COMPANY YEAR;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
    set DOSOMEWORKHERE;
    by COMPANY;
    PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
    if first.COMPANY then
    do;
        PREV_N_EMPLOYEES = .;
    end;
    if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
    OUT=DOSOMEWORKHERE;
    BY DESCENDING COMPANY DESCENDING YEAR ;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
    set DOSOMEWORKHERE;
    by DESCENDING COMPANY;
    PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
    if first.COMPANY then
    do;
        PREV_N_EMPLOYEES = .;
    end;
    if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
    OUT=WANT;
    BY COMPANY YEAR;
RUN;
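
The same five steps can be sketched in pandas for readers who want to check the logic outside SAS. This is an assumption-laden illustration, not a translation of the SAS program: it uses the modified Company 3 data from this answer, the WAS_MISSING indicator column is a made-up name, and a grouped forward fill followed by a grouped backward fill replaces the two LAG passes. Unlike a single LAG, ffill also carries a value across several consecutive blanks.

```python
import pandas as pd

nan = float("nan")  # blank cells in the datalines above

have = pd.DataFrame({
    "COMPANY": ["Company 1"] * 4 + ["Company 2"] * 4 + ["Company 3"] * 4,
    "YEAR": [2007, 2008, 2009, 2010] * 3,
    "N_EMPLOYEES": [4, 5, 5, 5, 11, 10, nan, 10, nan, 3, 4, 3],
})

# Step 1: sort by company and year.
want = have.sort_values(["COMPANY", "YEAR"]).reset_index(drop=True)

# Indicator for rows that were blank before any filling.
want["WAS_MISSING"] = want["N_EMPLOYEES"].isna()

# Step 2: fill from the previous year within each company.
want["N_EMPLOYEES"] = want.groupby("COMPANY")["N_EMPLOYEES"].ffill()

# Steps 3-5: fill what is still blank from the following year,
# again without crossing company boundaries.
want["N_EMPLOYEES"] = want.groupby("COMPANY")["N_EMPLOYEES"].bfill()
```

Like the SAS steps, this also fills a blank first year from the following year; restricting the fill to blanks strictly between two reported years would need an extra condition.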
