Add a flag on the values of rows - SAS

I have a table (Student_classification) with two columns, Student Number and Subject (example):
Student Number Subject
122 Biology_Physics
122 Math
122 Music
125 music
125 geography
298 Math
298 Economics
My task is to get a new table where each student number is classified as follows:
Science if the student has Biology_Physics and at least one of Math, Music, geography, or economics
Humanity/arts if the student has geography or music and nothing else
EconomicsEngineering if the student has Math or Economics and nothing else
My final result should be:
Student Number Type
122 Science
125 Humanity/arts
298 EconomicsEngineering
However, I get the following table, which is incorrect:
Student_Number Type
122 Other
122 EconomicEngineering
122 Humanity/arts
125 Humanity/arts
298 EconomicEngineering
I have written the following code in SAS, but the logic seems incorrect:
Proc Sql;
create table student_classification as
(
select distinct cust_num,
case
when Subject ='Biology_Physics' and Subject in ('Math' 'Music' 'geography' 'economics') then 'Science'
When Subject in ('geography' 'music') and Subject not in ('Biology_Physics' 'Math' 'economics') then 'Humanity/arts'
When Subject in ('math' 'economics') and subject not in ('Biology_Physics' 'Geography' 'Music') then 'EconomicEngineering'
else 'Other'
end as Type
from Student_classification
Group by student_number, Type
);
quit;
My use case is different, but I am simulating a similar idea here.

You are trying to compare values from multiple rows, so you need conditional aggregation.
proc sql;
select cust_num,
       case
         /* has Biology_Physics and (either Math or Music or geography or economics) -> Science */
         when max(case when Subject = 'Biology_Physics' then 1 end) = 1
          and max(case when Subject in ('Math', 'Music', 'geography', 'economics') then 1 end) = 1
           then 'Science'
         /* has (geography or music) and nothing else -> Humanity/arts */
         when max(case when Subject in ('geography', 'music') then 0 else 1 end) = 0
           then 'Humanity/arts'
         /* has (Math or Economics) and nothing else -> EconomicEngineering */
         when max(case when Subject in ('Math', 'Economics') then 0 else 1 end) = 0
           then 'EconomicEngineering'
         else 'Other'
       end as Type
  from Student_classification
 group by cust_num;
quit;
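If you want to sanity-check the logic outside SAS, here is a minimal pandas sketch of the same conditional-aggregation idea (column names and the case-sensitive subject spellings are taken from the sample table):
import pandas as pd

df = pd.DataFrame({
    'student_number': [122, 122, 122, 125, 125, 298, 298],
    'subject': ['Biology_Physics', 'Math', 'Music', 'music',
                'geography', 'Math', 'Economics'],
})

def classify(subjects):
    # Work on the full set of a student's subjects, like the MAX(CASE ...) trick.
    s = set(subjects)
    if 'Biology_Physics' in s and s & {'Math', 'Music', 'geography', 'economics'}:
        return 'Science'
    if s <= {'geography', 'music'}:
        return 'Humanity/arts'
    if s <= {'Math', 'Economics'}:
        return 'EconomicEngineering'
    return 'Other'

print(df.groupby('student_number')['subject'].apply(classify).rename('Type'))
This reproduces the expected table: 122 Science, 125 Humanity/arts, 298 EconomicEngineering.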


Calculate the amount of the cost of tickets finalized per material divided by the total amount of the tickets finalized

I have the following need:
Calculate the ratio between the sum of the amounts of tickets with status Finalized for each material and the sum of the total amounts of all finalized tickets.
My fact table is like below:
TicketID StatusID MaterialID CategoryID Amount FKDATE
123 3 45 9 150 12/03/2021
124 5 50 4 569 11/03/2021
125 3 78 78 556 14/03/2021
126 -1 -1 -1 -1 12/03/2021
My dimension Status is like below:
StatusID Status
1 Open
2 In Process
3 Finalized
My dimension Material is like below:
MaterialID MaterielLabel
1 Bikes
.. ..
I want to exclude the TicketID with MaterialID = -1.
Try the following:
AmountFinalizedByMaterial:=
VAR AmountFinalizedByMaterialGroup =
CALCULATE (
SUM(yourFactTable[Amount]),
Status[Status] = "Finalized" ,
yourFactTable[MaterialID] <> -1)
VAR TotalAmountFinalized =
CALCULATE (
SUM(yourFactTable[Amount]),
Status[Status] = "Finalized" ,
ALL(Material)
)
RETURN
DIVIDE (
AmountFinalizedByMaterialGroup,
TotalAmountFinalized
)
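A note on how this works: the ALL(Material) inside the second CALCULATE removes any filter coming from the Material dimension, so TotalAmountFinalized is the grand total of finalized amounts, while AmountFinalizedByMaterialGroup keeps the material from the current filter context. DIVIDE is used instead of the / operator so the measure returns BLANK rather than an error when the denominator is zero.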

How can I write a query to carry a remaining balance of hours forward for load leveling a schedule?

I have a query result with a total amount of hours scheduled per week in chronological order without gaps and have a set amount of hours that can be processed each week. Any hours not processed should be carried over to one or more following weeks. The following information is available.
Week | Hours | Capacity
   1 |  2000 |      160
   2 |   100 |      160
   3 |     0 |      140
   4 |   150 |      160
   5 |   500 |      160
   6 |  1500 |      160
Each week it should reduce the new hours plus carried over hours by the Capacity but never go below zero. A positive value should carry into the following week(s).
Week | Hours | Capacity | LeftOver = (Hours + LAG(LeftOver) - Capacity)
   1 |   400 |      160 | 240  (400 +   0 - 160)
   2 |   100 |      160 | 180  (100 + 240 - 160)
   3 |     0 |      140 |  40  (  0 + 180 - 140)
   4 |    20 |      160 |   0  ( 20 +  40 - 160) (no negative, change to zero)
   5 |   500 |      160 | 340  (500 +   0 - 160)
   6 |     0 |      160 | 180  (  0 + 340 - 160)
I'm assuming this can be done with CTE recursion and a running value that doesn't go below zero, but I can't find any specific examples of how this would be written.
Well, you are not wrong: a recursive common table expression is indeed an option to construct a solution.
Construction of recursive queries can generally be done in steps. Run your query after every step and validate the result.
Define the "anchor" of your recursion: where does the recursion start? Here the start is defined by Week = 1.
Define a recursion iteration: what is the relation between iterations? Here that would be the incrementing week numbers, d.Week = r.Week + 1.
Avoiding negative numbers can be resolved with a case expression.
Sample data
create table data
(
Week int,
Hours int,
Capacity int
);
insert into data (Week, Hours, Capacity) values
(1, 400, 160),
(2, 100, 160),
(3, 0, 140),
(4, 20, 160),
(5, 500, 160),
(6, 0, 160);
Solution
with rcte as
(
select d.Week,
d.Hours,
d.Capacity,
case
when d.Hours - d.Capacity > 0
then d.Hours - d.Capacity
else 0
end as LeftOver
from data d
where d.Week = 1
union all
select d.Week,
d.Hours,
d.Capacity,
case
when d.Hours + r.LeftOver - d.Capacity > 0
then d.Hours + r.LeftOver - d.Capacity
else 0
end
from rcte r
join data d
on d.Week = r.Week + 1
)
select r.Week,
r.Hours,
r.Capacity,
r.LeftOver
from rcte r
order by r.Week;
Result
Week Hours Capacity LeftOver
---- ----- -------- --------
1 400 160 240
2 100 160 180
3 0 140 40
4 20 160 0
5 500 160 340
6 0 160 180
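As a quick cross-check of the carry-forward rule, a few lines of Python over the sample data reproduce the LeftOver column:
# LeftOver = max(Hours + previous LeftOver - Capacity, 0), week by week
rows = [(1, 400, 160), (2, 100, 160), (3, 0, 140),
        (4, 20, 160), (5, 500, 160), (6, 0, 160)]
left_over = 0
for week, hours, capacity in rows:
    left_over = max(hours + left_over - capacity, 0)
    print(week, hours, capacity, left_over)  # prints 240, 180, 40, 0, 340, 180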
I ended up writing a few CTEs, then a recursive CTE, and got what I needed. The capacity is a static number here but will be replaced later with one that takes holidays and vacations into account. I will also need to consider the initial 'LeftOver' value for the first week; I could use this query with an earlier date period to find the most recent date with a zero LeftOver value, use that as a new start date, and then filter out those earlier weeks in the final query.
DECLARE @StartDate date = (SELECT MAX(FirstDayOfWorkWeek) FROM dbo._Calendar WHERE Date <= GETDATE());
DECLARE @EndDate date = DATEADD(week, 12, @StartDate);
DECLARE @EmployeeQty int = (SELECT ISNULL(COUNT(*), 0) FROM Employee WHERE DefaultDepartment IN (4) AND Hidden = 0 AND DateTerminated IS NULL);
WITH hours AS (
/* GRAB ALL NEW HOURS SCHEDULED FOR EACH WEEK IN THE SELECTED PERIOD */
SELECT c.FirstDayOfWorkWeek as [Date]
, SUM(budget.Hours) as hours
FROM dbo.Project_Phase phase
JOIN dbo.Project_Budget_Labor budget on phase.ID = budget.Phase
JOIN dbo._Calendar c on CONVERT(date, phase.Date1) = c.[Date]
WHERE phase.CompletedOn IS NULL AND phase.Project <> 4266
AND phase.Date1 BETWEEN @StartDate AND @EndDate
AND budget.Department IN (4)
GROUP BY c.FirstDayOfWorkWeek
)
, weeks AS (
/* CREATE BLANK ROWS FOR EACH WEEK AND JOIN TO ACTUAL HOURS TO ELIMINATE GAPS */
/* ADD A ROW NUMBER FOR RECURSION IN NEXT CTE */
SELECT cal.[Date]
, ROW_NUMBER() OVER(ORDER BY cal.[Date]) as [rownum]
, ISNULL(SUM(hours.Hours), 0) as Hours
FROM (SELECT FirstDayOfWorkWeek as [Date] FROM dbo._Calendar WHERE [Date] BETWEEN @StartDate AND @EndDate GROUP BY FirstDayOfWorkWeek) as cal
LEFT JOIN hours on cal.[Date] = hours.[Date]
GROUP BY cal.[Date]
)
, spread AS (
/* GRAB FIRST WEEK AND USE RECURSION TO CREATE RUNNING TOTAL THAT DOES NOT DROP BELOW ZERO*/
SELECT TOP 1 [Date]
, rownum
, Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), 0.00) as LeftOver
, Hours as running
FROM weeks
ORDER BY rownum
UNION ALL
SELECT curr.[Date]
, curr.rownum
, curr.Hours
, @EmployeeQty * 40 as Capacity
, CONVERT(numeric(9,2), CASE WHEN curr.Hours + prev.LeftOver - (@EmployeeQty * 40) < 0 THEN 0 ELSE curr.Hours + prev.LeftOver - (@EmployeeQty * 40) END) as LeftOver
, curr.Hours + prev.LeftOver as running
FROM weeks curr
JOIN spread prev on curr.rownum = (prev.rownum + 1)
)
SELECT spread.Hours as NewHours
, spread.LeftOver as PrevHours
, spread.Capacity
, spread.running as RunningTotal
, CASE WHEN running < Capacity THEN running ELSE Capacity END as HoursThisWeek
FROM spread

For a pandas dataframe column, TypeError: float() argument must be a string or a number

Here is the code, where 'LoanAmount', 'ApplicantIncome', and 'CoapplicantIncome' are of type object:
document=pandas.read_csv("C:/Users/User/Documents/train_u6lujuX_CVtuZ9i.csv")
document.isnull().any()
document = document.fillna(lambda x: x.median())
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
I get the following error when converting the object type to float:
TypeError: float() argument must be a string or a number
Please help, as I need to train my classification model using these features. Here's a snippet of the csv file -
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358 120 360 1 Urban Y
In your code, document = document.fillna(lambda x: x.median()) does not compute medians; fillna treats the lambda as an ordinary fill value, so the missing cells end up holding a function object, and a function cannot be converted to a float (the value should be a string of digits or a number).
Hope the following code helps
median = document['LoanAmount'].median()
document['LoanAmount'] = document['LoanAmount'].fillna(median) # Or document = document.fillna(method='ffill')
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
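Alternatively, a minimal sketch (same column names as above; the file path is a placeholder) that coerces the object columns to numeric first and then fills each column with its own median:
import numpy as np
import pandas as pd

document = pd.read_csv("train.csv")  # placeholder path

for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
    # errors='coerce' turns non-numeric strings into NaN so the cast cannot fail
    document[col] = pd.to_numeric(document[col], errors='coerce')
    document[col] = document[col].fillna(document[col].median())

document['LoanAmount_log'] = np.log(document['LoanAmount'])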

Adding a new column based on values

I have the following sample data:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw red 189 165
1049 Amelia Serrano yellow 145 124
1219 Alan Nance purple 210 192
1246 Ravi Sinha yellow 194 177
1078 Ashley McKnight green 127 118
;
What I would like to do now is the following:
Create two lists of colours (e.g., list1 = "red" and "yellow", and list2 = "purple" and "green").
Classify the records according to whether they are in list1 or list2, and add a new column.
So the pseudo code is like this:
'Set new category called class
If item is in list1 then class = 1
Else if item is in list2 then class = 2
Else class = 3
Any thoughts on how I can do this most efficiently?
Your pseudocode is almost exactly it. Wrapped in a data step (Team holds the colour in your data; the output dataset name is arbitrary):
data weight_club_classified;
set weight_club;
If Team in ('red' 'yellow') then class = 1;
Else if Team in ('purple' 'green') then class = 2;
Else class = 3;
run;
This is really a lookup, so there are many other methods. One I usually recommend as well is proc format, though in a simplistic case like this I'm not sure of any gains.
Proc format;
Value $colour_cat
'red', 'yellow' = '1'
'purple', 'green' = '2'
Other = '3';
Run;
And then in a data step or SQL, either of the following can be used.
*actual conversion;
Category = put(Team, $colour_cat.);
* change display only;
Format Team $colour_cat.;
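Note that put() always returns a character value, so Category above holds the characters '1', '2', and '3'; if you need a numeric class variable, wrap it as class = input(put(Team, $colour_cat.), 8.);.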

Pandas groupby mean absolute deviation

I have a pandas dataframe like this:
Product Group Product ID Units Sold Revenue Rev/Unit
A 451 8 $16 $2
A 987 15 $40 $2.67
A 311 2 $5 $2.50
B 642 6 $18 $3.00
B 251 4 $28 $7.00
I want to transform it to look like this:
Product Group Units Sold Revenue Rev/Unit Mean Abs Deviation
A 25 $61 $2.44 $0.24
B 10 $46 $4.60 $2.00
The Mean Abs Deviation column is to be computed on the Rev/Unit column of the first table. The tricky thing is taking into account the respective weights behind the Rev/Unit calculation.
For example taking a straight MAD of Product Group A's Rev/Unit would yield $0.26. However after taking weight into consideration, the MAD would be $0.24.
I know to use groupby to get the simple summation for units sold and revenue, but I'm a bit lost on how to do the more complicated calculations of the next 2 columns.
Also, while we're giving advice/help: is there any easier way to create/paste tables into SO posts?
UPDATE:
Would a solution like this work? I know it will for the summation fields, but not sure how to implement for the latter 2 fields.
grouped_df=df.groupby("Product Group")
grouped_df.agg({
'Units Sold':'sum',
'Revenue':'sum',
'Rev/Unit':'Revenue'/'Units Sold',
'MAD':some_function})
You need to clarify what the "weights" are. I assumed the weights are the number of units sold, but that gives a different result from yours:
pv = df.pivot_table( index='Product Group',
values=[ 'Units Sold', 'Revenue' ],
aggfunc=sum )
pv[ 'Rev/Unit' ] = pv.Revenue / pv[ 'Units Sold' ]
this gives:
Revenue Units Sold Rev/Unit
Product Group
A 61 25 2.44
B 46 10 4.60
As for WMAD:
import numpy as np

def wmad( prod ):
idx = df[ 'Product Group' ] == prod
w = df[ 'Units Sold' ][ idx ]
abs_dev = np.abs ( df[ 'Rev/Unit' ][ idx ] - pv[ 'Rev/Unit' ][ prod ] )
return sum( abs_dev * w ) / sum( w )
pv[ 'Mean Abs Deviation' ] = [ wmad( idx ) for idx in pv.index ]
which, as I mentioned, gives a different result:
Revenue Units Sold Rev/Unit Mean Abs Deviation
Product Group
A 61 25 2.44 0.2836
B 46 10 4.60 1.9200
From your suggested solution, you can use a lambda function to operate on each group, e.g.:
'Rev/Unit': lambda x: calculate_revenue_per_unit(x)
Bear in mind that x is the Series of that column's values within each group (not a tuple per row), and an aggregation keyed on 'Rev/Unit' only sees that one column, so calculate_revenue_per_unit must work from those values alone.
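Putting it together in modern pandas, a minimal sketch of the whole transformation (column names from the question; weighting by Units Sold is an assumption, as in the answer above):
import pandas as pd

df = pd.DataFrame({
    'Product Group': ['A', 'A', 'A', 'B', 'B'],
    'Product ID': [451, 987, 311, 642, 251],
    'Units Sold': [8, 15, 2, 6, 4],
    'Revenue': [16, 40, 5, 18, 28],
})
df['Rev/Unit'] = df['Revenue'] / df['Units Sold']

def weighted_mad(g):
    # Unit-weighted mean absolute deviation around the group's weighted Rev/Unit.
    mean = g['Revenue'].sum() / g['Units Sold'].sum()
    return (g['Rev/Unit'].sub(mean).abs() * g['Units Sold']).sum() / g['Units Sold'].sum()

out = df.groupby('Product Group')[['Units Sold', 'Revenue']].sum()
out['Rev/Unit'] = out['Revenue'] / out['Units Sold']
out['Mean Abs Deviation'] = df.groupby('Product Group').apply(weighted_mad)
print(out)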