Pandas - calculating interaction between buyers and sellers - python-2.7

I have a programming problem that I am finding a little challenging. I have daily data for several years (stored in a pandas DataFrame) on buyers and sellers, like this:
            Seller  Buyer  Amount
2012/11/13  Bank1   Bank2      15
2012/11/13  Bank1   Bank2      17
2012/11/13  Bank5   Bank3       5
2012/11/14  Bank4   Bank2      10
2012/11/14  Bank1   Bank3      22
The index is a pandas DatetimeIndex.
I would like to calculate, on a monthly basis for each buying bank, what share of its total monthly volume comes from each seller it has interacted with. So in the above case the output (preferably a DataFrame as well) would be:
Month    Seller  Buyer  Share
2012/11  Bank1   Bank2  32/42
2012/11  Bank4   Bank2  10/42
2012/11  Bank5   Bank3  5/27
2012/11  Bank1   Bank3  22/27
Any input is greatly appreciated!

I have uploaded an IPython notebook to Github that should begin to answer your question. You can find it here.
The basic approach here is to use DataFrame.groupby() operations. Using the dummy data you provided, I first did data.groupby('Buyer').sum(), reset the index so that the buyer is a regular column again, added the sum back as a "total" column, and finally divided each amount by that total.
I am not sure how to do the time-series part (i.e. grouping by month when you only have the full date, not split into year, month and day). If others have suggestions, please leave a comment so I can update the notebook and my answer!
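For the month grouping, one possible sketch (not the code from the notebook) is to derive a monthly period from the DatetimeIndex and include it in the group keys. The sample frame below is rebuilt from the question; the column name "Month" and the variable names are assumptions made for illustration, and the code assumes a reasonably recent pandas rather than the Python 2.7 era version mentioned in the title.

```python
import pandas as pd

# Sample data from the question, indexed by date.
data = pd.DataFrame(
    {
        "Seller": ["Bank1", "Bank1", "Bank5", "Bank4", "Bank1"],
        "Buyer": ["Bank2", "Bank2", "Bank3", "Bank2", "Bank3"],
        "Amount": [15, 17, 5, 10, 22],
    },
    index=pd.to_datetime(
        ["2012-11-13", "2012-11-13", "2012-11-13", "2012-11-14", "2012-11-14"]
    ),
)

# Turn the full dates into monthly periods so they can be used as a group key.
data = data.assign(Month=data.index.to_period("M"))

# Volume per (month, buyer, seller) pair.
pair_totals = data.groupby(["Month", "Buyer", "Seller"])["Amount"].sum()

# Total volume per (month, buyer), broadcast back onto every pair.
buyer_totals = pair_totals.groupby(level=["Month", "Buyer"]).transform("sum")

# Share of each seller in the buyer's monthly volume.
shares = (pair_totals / buyer_totals).rename("Share").reset_index()
print(shares)
```

Using transform keeps the buyer totals aligned with the per-pair totals, so the division needs no explicit merge.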

Related

Calculating dynamic cost from several tables

I'm new to Power BI and struggling a bit with concepts. My simplified problem is this:
I have a table with time-track data (possible multiple entries per day for the same user):
datetime          timeTracked (hours)  user
2022-12-01 14:15  1.5                  john
2022-12-01 16:30  0.5                  john
2022-12-02 12:00  2.5                  tom
I also have a second table with yearly wages for each user (irregular dates; a wage applies from that day forward):
date        user  wage
2021-01-01  john  20000
2021-06-15  tom   25000
2022-02-01  john  22000
I want to somehow calculate the cost for a dynamic period or total time, but using other filters (like client).
For example: total cost for current month - the sum of hours for each user multiplied by hourly cost (let's say yearly wage / 1500 for simplicity). The complication is that there might be multiple applicable wages in the selected period.
I'm not sure whether I should create a new table, new column(s) in the existing table(s), measure(s), or something else. I don't know how to deal with this problem in Power BI; any hints, links or examples are much appreciated!
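The calculation described above can be sketched outside Power BI to make the logic concrete. The following is a Python/pandas illustration only, not a DAX solution: each time entry is matched with the latest wage dated on or before it for the same user, and the cost is hours times wage / 1500. The variable names and the "cost" column are assumptions introduced for the example.

```python
import pandas as pd

# Time-tracking entries from the question.
tracked = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2022-12-01 14:15", "2022-12-01 16:30", "2022-12-02 12:00"]
        ),
        "timeTracked": [1.5, 0.5, 2.5],
        "user": ["john", "john", "tom"],
    }
)

# Yearly wages; each row applies from its date forward.
wages = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-01", "2021-06-15", "2022-02-01"]),
        "user": ["john", "tom", "john"],
        "wage": [20000, 25000, 22000],
    }
)

# For every entry, pick the most recent wage at or before the entry's
# timestamp for the same user (both frames must be sorted by their key).
costed = pd.merge_asof(
    tracked.sort_values("datetime"),
    wages.sort_values("date"),
    left_on="datetime",
    right_on="date",
    by="user",
    direction="backward",
)

# Hourly cost is assumed to be yearly wage / 1500, as in the question.
costed["cost"] = costed["timeTracked"] * costed["wage"] / 1500
total_cost = costed["cost"].sum()
print(costed[["datetime", "user", "timeTracked", "wage", "cost"]])
print(total_cost)
```

Because the wage is resolved per entry, a period that spans a wage change is handled row by row; any further filtering (by client, by month) can happen on the costed rows.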

Power BI DAX - sum all of one column but keyed off different table

This FEELS like something that can be done but I am at a loss for how to do it.
I have a table that has applicants for jobs...
name, requisition id, division, date applied, date hired
Each row is an applicant. Obviously not all applicants are hired. So in every row all fields are filled out with the exception of date hired for applicants that have not been hired.
I have slicers for month/quarter/year and division.
The date slicers all key off a field in every table called data_as_of which is the last day of the month with a one-to-many relationship with a date dimension table.
Here is a sample table...
[1]
[1]: https://i.stack.imgur.com/XQO9d.png
So here is what I'd like to do.
I'd like to slice by year and show a visual of all people hired in that year. Same with Quarter and Month (ie count all people in that quarter or month as appropriate). So far so good. That's easy.
Now on the same report page I'd like to show a visual (assume bar charts) that shows me a count of all the people that applied to the same requisition id prior to the date hired of whomever was hired in that requisition id.
Using the example above...
All of these examples assume 2021.
So if I used the month slicer in December I'd get 2 hirees in HR, Diane and Mel. In the second visual I'd get 7 Applicants.
If I used the month slicer to show November I'd get two hirees - Rhys and Jody. The applicant visual would show me 8 applicants. All 6 from requisition id 4 and 2 from requisition id 2 because one applied after Rhys was hired.
Consequently if I sliced for April of 2021 I'd get 1 hiree - Remi. In the applicant visual I'd get 4 applicants who all applied prior to Remi's hire date (including Morgan who applied in March but wasn't hired until May).
Does that all make sense?
I very much appreciate your help.
Best regards,
~Don

Separating sickness between months in Power BI

I have set up a table showing number of sick days based on absence start and finish date in Power BI. The date tables have been set up.
I am having an issue with sick days that continue into the following month, e.g.:
Absence Start Date  Absence End Date
5 May 21            5 June 21
My table sums all the absence days in May.
How do I allocate the 25 days to May and the remaining days to June, even though it is a single occurrence?
How Data is being summarised
Unfortunately you don't provide enough information about your model/ data. Consider this example. Here is my dummy table:
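Table 3 =
// Dummy absences: one row per occurrence, with start and end dates
VAR __baseTable =
    DATATABLE(
        "StartD", DATETIME, "EndD", DATETIME,
        {
            {"2020-01-28", "2020-02-24"},
            {"2019-12-23", "2020-01-31"},
            {"2020-01-20", "2020-02-17"}
        }
    )
RETURN
    // Expand each absence to one row per calendar day, then tag its month and whether it is a workday
    ADDCOLUMNS(
        GENERATE(__baseTable, CALENDAR([StartD], [EndD])),
        "Month", FORMAT([Date], "yyyy-mm"),
        "WorkDay", IF(NOT(WEEKDAY([Date]) IN {1, 7}), 1, 0)
    )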
And here is the measure (it counts days excluding weekends, because I don't have a calendar with business holidays):
SumOfSickDays = CALCULATE(SUM('Table 3'[WorkDay]))
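The same idea (expand each absence into one row per day, then aggregate by month) can be illustrated in Python/pandas with the dates from the question. This is only a sketch of the logic, not Power BI code; the column names "absence_id", "calendar_days" and "workdays" are assumptions, and public holidays are ignored just as in the DAX example.

```python
import pandas as pd

# The single absence from the question: 5 May 2021 to 5 June 2021.
absences = pd.DataFrame(
    {
        "start": pd.to_datetime(["2021-05-05"]),
        "end": pd.to_datetime(["2021-06-05"]),
    }
)

# One row per calendar day of each absence (start and end inclusive).
days = pd.concat(
    [
        pd.DataFrame(
            {"absence_id": row.Index, "date": pd.date_range(row.start, row.end)}
        )
        for row in absences.itertuples()
    ],
    ignore_index=True,
)

# Tag each day with its month and whether it is a weekday (Mon=0 ... Sun=6).
days["month"] = days["date"].dt.to_period("M")
days["workday"] = (days["date"].dt.dayofweek < 5).astype(int)

# Days per month, both as calendar days and as weekdays only.
per_month = days.groupby("month").agg(
    calendar_days=("date", "size"),
    workdays=("workday", "sum"),
)
print(per_month)
```

Because every day carries its own month, an absence that crosses a month boundary is split automatically when the rows are summed.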

sas hierarchical raw file - no record type identifiers - multiple observations per records

I have a problem importing a hierarchical text file into SAS.
I've been searching for the past week and had no luck.
The problem is that this file does not contain anything that indicates that the detail records are linked with the header.
I have tried the various documented methods, such as INPUT with line pointers (#1, #2, ...) combined with IF ... THEN DO blocks.
Extract of file (every new record starts with Hong Kong but each record has a variable number of lines):
HONG KONG
STEEL GROUP
Invoice Date
09.12.2015
Number
90035565
Delivery note no.
80006292
SAP Order number
18915
Customer number
105226
Order number
RCHEB 5114 1-1 24-11
Shipped from Saldanha bay, South Africa, per vessel
LAN MAY
Bill of lading date
14.11.2015
Port of discharge
ANY CHINESE PORT
Reference no.
Agreement/Contract/Order
OMl/24/ll
Port Wet Metric Tons Dry Metric Tons
ANY CHINESE PORT 202,079.000 199,957.171
Product % USD Value
Steel Ore 50%;29% 3,500.00
HONG KONG
TRADING CORP
Invoice Date
21.12.2015
Number
90035792
Provisional Invoice No
90033952
SAP Order number
50005313
Customer number
102872
Order number
KITST 5007 1-1 21-11
Shipped from Saldanha bay, South Africa, per vessel
HEBEI SUCCESS
Bill of lading date
15.06.2015
Port of discharge
BEILUN
Reference no.
WUGANG
Agreement/Contract/Order
OM6/21/ABG
Port Wet Metric Tons Dry Metric Tons
BEILUN 124,772.000 122,214.174
Product % USD Value Sishen 63.5%, 8 mm Fine Ore
Steel Ore 50%,10% 2,500.00
Iron Ore 20%,80% 1,500.00
Unfortunately, there is no easy way to do this in SAS (that I know of). I think you are on the right track reading the file in, record by record, and writing logic in a data step to parse it.
I would do it like this:
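data blah;
    infile "stuff.txt" truncover;   /* do not flow onto the next line for short records */
    length inStr $2000;
    input inStr $char2000.;         /* read the whole line, embedded blanks included */
    if strip(inStr) = "HONG KONG" then do;   /* first line of a new record */
        ...
    end;
    else if ... then do;
        ...
    end;
    ...
run;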
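The record-splitting logic itself is language independent. As an illustration only (in Python rather than SAS), the sketch below starts a new record at every "HONG KONG" line and then picks up the value that follows each known label. The function names and the LABELS set are hypothetical, and only a shortened excerpt of the file is used.

```python
def split_records(lines, marker="HONG KONG"):
    """Group lines into records; each record starts after a marker line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == marker:
            records.append([])          # a marker opens a new record
        elif records:
            records[-1].append(line)    # everything else belongs to the open record
    return records


# Hypothetical subset of the labels that appear in the file.
LABELS = {"Invoice Date", "Number", "Customer number"}


def to_fields(record):
    """Map each known label to the line that follows it."""
    fields = {"Company": record[0] if record else None}
    for label, value in zip(record[1:], record[2:]):
        # Skip labels with no value, i.e. a label directly followed by another label.
        if label in LABELS and value not in LABELS:
            fields[label] = value
    return fields


sample = """HONG KONG
STEEL GROUP
Invoice Date
09.12.2015
Number
90035565
HONG KONG
TRADING CORP
Invoice Date
21.12.2015
Number
90035792""".splitlines()

parsed = [to_fields(r) for r in split_records(sample)]
for row in parsed:
    print(row)
```

The tabular lines at the end of each record (ports, tonnages, products) would need their own rules; they are left out of this sketch.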

How to parse a long dataframe of text using regular expressions into a dataframe [R]

I have a giant data frame which I would like to turn into a more usable format. It is based on a copy-pasted text file of a schedule, where the entries have a consistent format. Each day has the format:
###
Title - Date
First event
Time: 11:00 AM
Location: Address
Address line 2
Second event description
Time: 12:00 AM
Location: Address
Address
###
What I am having trouble with is figuring out how to parse this. Basically, I want to store everything between two "###" markers as a single day, then add events based on how many times the above format repeats, and create a string or datetime entry depending on whether the text follows "Time:" or "Location:".
I have tried putting it all into a giant dataframe where each line is a row, and then adding dummies for location rows, time rows, etc. as separate columns, but I am not sure how to translate that into discrete days or events. Any help would be appreciated.
Data is public, so a sample is below -- it is a giant dataframe with one row for each row of text:
*Text*
###
The Public Schedule for Mayor Rahm Emanuel – January 5, 2014
There are no public events scheduled at this time.
###
The Public Schedule for Mayor Rahm Emanuel – January 6, 2014
Mayor Rahm Emanuel will greet and thank snow clearing teams from the Department of Streets and Sanitation.
Time: 11:30 AM
Location: Department of Streets and Sanitation
Salt Station
West Grand Avenue and North Rockwell Street
Chicago, IL*
*There will be no media availability.
Mayor Rahm Emanuel and City Officials will provide an update on the City’s efforts to provide services during the extreme weather.
Time: 2:00 PM
Location: Office of Emergency Management and Communications
Headquarters
1411 West Madison Street
Chicago, IL*
*There will be a media availability following.
Mayor Rahm Emanuel will greet and visit with residents taking advantage of a City warming center.
Time: 3:00 PM
Location: Department of Family and Support Services
10 South Kedzie Avenue
Chicago, IL**
*There will be no media availability.
**Members of the media must respect the privacy of residents at the facility, and can only film City of Chicago employees.
###
Edit:
An example of the output I would like is something like:
Date          Time      Description            Location
December 4th  9:00 AM   A housewarming party   1211 Main St.
December 5th  11:00 AM  Another big event      1234 Main St.
If at all possible.
EDIT 2:
Basically -- I know how to pull all this stuff out of the columns. I think my issue may really be reshaping the data intelligently. I have split it into this giant dataframe with one column, where each row is a string corresponding to a line of text in the original schedule. I then have a bunch of columns like "is_time", "is_new_entry" and "is_location", which are 1 if the row is a time, the beginning of a new entry, or a location. I just don't know how to reshape this into the output above.
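One way to think about the reshaping is as a single pass over the lines that keeps track of the current day and the current event. The sketch below shows that logic in Python rather than R, purely as an illustration. It assumes there are no blank lines between entries, that a description is any line immediately followed by a "Time:" line, and that lines starting with "*" are footnotes; parse_schedule and the output keys are hypothetical names.

```python
import re


def parse_schedule(lines):
    """Turn schedule lines into one dict per event (Date, Time, Description, Location)."""
    events = []
    date = None
    expect_title = False
    current = None
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        if line == "###":                      # day separator
            expect_title = True
            current = None
            continue
        if expect_title:                       # "Title – Date": keep the part after the dash
            date = re.split(r"\s[–-]\s", line)[-1]
            expect_title = False
            continue
        if line.startswith("*"):               # footnotes such as "*There will be ..."
            continue
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if nxt.startswith("Time:"):            # a description precedes its "Time:" line
            current = {"Date": date, "Time": None, "Description": line, "Location": None}
            events.append(current)
        elif current is None:                  # e.g. "There are no public events ..."
            continue
        elif line.startswith("Time:"):
            current["Time"] = line[len("Time:"):].strip()
        elif line.startswith("Location:"):
            current["Location"] = line[len("Location:"):].strip()
        elif current["Location"] is not None:  # continuation lines of the address
            current["Location"] += ", " + line.rstrip("*")
    return events


sample = [
    "###",
    "The Public Schedule for Mayor Rahm Emanuel – January 5, 2014",
    "There are no public events scheduled at this time.",
    "###",
    "The Public Schedule for Mayor Rahm Emanuel – January 6, 2014",
    "Mayor Rahm Emanuel will greet and thank snow clearing teams from the Department of Streets and Sanitation.",
    "Time: 11:30 AM",
    "Location: Department of Streets and Sanitation",
    "Salt Station",
    "West Grand Avenue and North Rockwell Street",
    "Chicago, IL*",
    "*There will be no media availability.",
    "###",
]

events = parse_schedule(sample)
for event in events:
    print(event)
```

The resulting list of dicts has one entry per event, which maps directly onto a data frame with the four requested columns.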