I have the following sample data:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw red 189 165
1049 Amelia Serrano yellow 145 124
1219 Alan Nance purple 210 192
1246 Ravi Sinha yellow 194 177
1078 Ashley McKnight green 127 118
;
What I would like to do now is the following:
Create two lists of colours (e.g. list1 = "red" and "yellow", and list2 = "purple" and "green").
Classify the records according to whether or not they are in list1 and list2 and add a new column.
So the pseudo code is like this:
'Set new category called class
If item is in list1 then class = 1
Else if item is in list2 then class = 2
Else class = 3
Any thoughts on how I can do this most efficiently?
Your pseudocode is almost exactly it.
if Team in ('red' 'yellow') then class = 1;
else if Team in ('purple' 'green') then class = 2;
else class = 3;
This is really a lookup, so there are many other methods. One I usually recommend is PROC FORMAT, though in a simple case like this I'm not sure there are any gains.
Proc format;
value $colour_cat
'red', 'yellow' = '1'
'purple', 'green' = '2'
other = '3';
Run;
Then, in a DATA step or PROC SQL, either of the following can be used (Team is the colour variable in your data).
* actual conversion (PUT returns a character value);
Category = put(Team, $colour_cat.);
* change display only;
format Team $colour_cat.;
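For comparison, the same lookup idea can be sketched outside SAS. The snippet below is a minimal pandas illustration, not part of the original answer: the dictionary name colour_class is hypothetical, and the rows are the Team values from the sample data. Colours that are not in the lookup fall through to class 3, like the "other" branch of the format.

```python
import pandas as pd

# Sample rows taken from the question; only the columns needed here.
df = pd.DataFrame({
    "IdNumber": [1023, 1049, 1219, 1246, 1078],
    "Team": ["red", "yellow", "purple", "yellow", "green"],
})

# Hypothetical lookup table: colour -> class (plays the role of the format).
colour_class = {"red": 1, "yellow": 1, "purple": 2, "green": 2}

# Colours missing from the lookup become NaN after map(), then class 3.
df["class"] = df["Team"].map(colour_class).fillna(3).astype(int)
print(df)
```

As with the format, adding a colour only means editing the lookup, not the classification logic.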
I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
data = data.frame(
type = c("A", "A", "A", "B", "B", "B"),
hours = c(60,72,61, 54,68,66),
# pain = c(85,95,69, 73, 29, 30)
pain = c(85,95,69, 85,95,69)
)
library(lsmeans)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
The lsmeans are the model's predictions for each type at the mean of hours (63.5). Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor.
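To make the three numbers in the question concrete, here is a rough NumPy sketch that refits the same linear model by least squares and evaluates it at the mean of hours. It is an illustration only, not how lsmeans is implemented; the variable names (type_b, x0, xtx_inv) are made up for this example. It suggests that the degrees of freedom are the residual degrees of freedom (6 observations minus 3 coefficients), and that the two standard errors coincide here because both groups have three observations and 63.5 is equally far from each group's mean of hours.

```python
import numpy as np

# Data from the question; type is dummy-coded (A = 0, B = 1).
hours = np.array([60, 72, 61, 54, 68, 66], dtype=float)
type_b = np.array([0, 0, 0, 1, 1, 1], dtype=float)
pain = np.array([85, 95, 69, 85, 95, 69], dtype=float)

# Design matrix for pain ~ hours + type: intercept, hours, typeB.
X = np.column_stack([np.ones_like(hours), hours, type_b])
beta, *_ = np.linalg.lstsq(X, pain, rcond=None)

# Residual degrees of freedom: observations minus estimated coefficients.
df_resid = len(pain) - X.shape[1]
resid = pain - X @ beta
sigma2 = resid @ resid / df_resid
xtx_inv = np.linalg.inv(X.T @ X)

# One row per type, both evaluated at the mean of hours (63.5).
x0 = np.array([[1.0, hours.mean(), 0.0],
               [1.0, hours.mean(), 1.0]])
lsmean = x0 @ beta
# Standard error of each prediction: sqrt(sigma^2 * x0' (X'X)^-1 x0).
se = np.sqrt(sigma2 * np.einsum("ij,jk,ik->i", x0, xtx_inv, x0))

print(lsmean, se, df_resid)
```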
I have a table (Student_classification) with two columns, Student Number and Subject (example):
Student Number Subject
122 Biology_Physics
122 Math
122 Music
125 music
125 geography
298 Math
298 Economics
My task is to get a new table where:
if the student number has Biology_Physics and any of Math, Music, geography or economics, the type is Science
if the student number has geography or music and no other subject, the type is Humanity/arts
if the student number has Math or Economics and no other subject, the type is EconomicsEngineering
My final result should be:
Student Number Type
122 Science
125 Humanity/arts
298 EconomicsEngineering
However, I get the following table, which is incorrect:
Student_Number Type
122 Other
122 EconomicEngineering
122 Humanity/arts
125 Humanity/arts
298 EconomicEngineering
I have written the following code in SAS, but the logic seems incorrect:
Proc Sql;
create table student_classification as
(
select distinct cust_num,
case
when Subject ='Biology_Physics' and Subject in ('Math' 'Music' 'geography' 'economics') then 'Science'
When Subject in ('geography' 'music') and Subject not in ('Biology_Physics' 'Math' 'economics') then 'Humanity/arts'
When Subject in ('math' 'economics') and subject not in ('Biology_Physics' 'Geography' 'Music') then 'EconomicEngineering'
else 'Other'
end as Type
from Student_classification
Group by student_number, Type
);
quit;
My use case is different, but I am simulating a similar idea here.
You are trying to compare values across multiple rows, so you need conditional aggregation.
select cust_num,
case
-- string comparisons are case-sensitive, so compare on lower(Subject)
-- has Biology_Physics and (either Math or Music or geography or economics) as Science
when max(case when lower(Subject) = 'biology_physics' then 1 end) = 1
and max(case when lower(Subject) in ('math', 'music', 'geography', 'economics') then 1 end) = 1
then 'Science'
-- has (geography or music) and does not have any other as Humanity/arts
when max(case when lower(Subject) in ('geography', 'music') then 0 else 1 end) = 0
then 'Humanity/arts'
-- has (Math or Economics) and does not have any other as EconomicsEngineering
when max(case when lower(Subject) in ('math', 'economics') then 0 else 1 end) = 0
then 'EconomicEngineering'
else 'Other'
end as Type
from Student_classification
group by cust_num
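The same "look at all of a student's rows at once" idea can be sketched in pandas, in case it helps to see the rule written per group instead of per row. This is only an illustration under assumptions: the column names student_number and subject and the helper classify are hypothetical, and subjects are lower-cased before comparing because the sample data mixes 'Music' and 'music'.

```python
import pandas as pd

# Sample rows from the question.
rows = [(122, "Biology_Physics"), (122, "Math"), (122, "Music"),
        (125, "music"), (125, "geography"),
        (298, "Math"), (298, "Economics")]
df = pd.DataFrame(rows, columns=["student_number", "subject"])

def classify(subjects):
    """Classify one student from the full set of their subjects."""
    s = {x.lower() for x in subjects}
    # Biology_Physics plus at least one of the other listed subjects.
    if "biology_physics" in s and s & {"math", "music", "geography", "economics"}:
        return "Science"
    # Only geography and/or music, nothing else.
    if s <= {"geography", "music"}:
        return "Humanity/arts"
    # Only math and/or economics, nothing else.
    if s <= {"math", "economics"}:
        return "EconomicsEngineering"
    return "Other"

result = (df.groupby("student_number")["subject"]
            .apply(classify)
            .rename("type")
            .reset_index())
print(result)
```

Each student yields exactly one row because the rule is evaluated once per group, which is the point of the conditional aggregation above.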
How can I filter a dataframe to rows with values that are contained within a list? Specifically, the values in the dataframe will only be partial matches with the list and never exact match.
I've tried using pandas.DataFrame.isin but this only works if the values in the dataframe are the same as in the list.
list = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df =
address
0 123 MAIN
1 456 BLUE
2 987 PANDA
target_df = df[df["address"].isin(list)]
Ideally the result should be
target_df =
address
0 123 MAIN
1 456 BLUE
Use str.contains and a simple regex using | to connect the terms. Note that this keeps any address that contains at least one word from any entry in the list.
f = '|'.join
mask = f(map(f, map(str.split, list)))
df[df.address.str.contains(mask)]
address
0 123 MAIN
1 456 BLUE
I ended up using a for loop, which keeps an address only if it is a substring of some entry in the list:
df[[any(x in y for y in list) for x in df.address]]
Out[257]:
address
0 123 MAIN
1 456 BLUE
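For completeness, here is a self-contained sketch of the substring approach that can be run as-is. The list is named streets here (a name chosen for this example) to avoid shadowing the built-in list; the data is the sample from the question.

```python
import pandas as pd

streets = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df = pd.DataFrame({"address": ["123 MAIN", "456 BLUE", "987 PANDA"]})

# Keep a row when its address is a substring of at least one full street.
mask = df["address"].apply(lambda a: any(a in s for s in streets))
target_df = df[mask]
print(target_df)
```

Unlike the word-level regex, this requires the whole partial address to appear inside a list entry, so an address such as "999 MAIN" would not be kept.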
Here is the code, where 'LoanAmount', 'ApplicantIncome' and 'CoapplicantIncome' have dtype object:
document=pandas.read_csv("C:/Users/User/Documents/train_u6lujuX_CVtuZ9i.csv")
document.isnull().any()
document = document.fillna(lambda x: x.median())
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
I get the following error when converting the object type to float:
TypeError: float() argument must be a string or a number
Please help, as I need to train my classification model using these features. Here's a snippet of the CSV file:
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status
LP001002 Male No 0 Graduate No 5849 0 360 1 Urban Y
LP001003 Male Yes 1 Graduate No 4583 1508 128 360 1 Rural N
LP001005 Male Yes 0 Graduate Yes 3000 0 66 360 1 Urban Y
LP001006 Male Yes 0 Not Graduate No 2583 2358 120 360 1 Urban Y
In your code, document = document.fillna(lambda x: x.median()) passes the function itself as the fill value, not a median. A function cannot be converted to a float; float() needs a number or a numeric string. Compute the median first and pass that value to fillna.
Hope the following code helps:
median = document['LoanAmount'].median()
document['LoanAmount'] = document['LoanAmount'].fillna(median) # Or forward-fill instead: document = document.ffill()
for col in ['LoanAmount', 'ApplicantIncome', 'CoapplicantIncome']:
document[col]=document[col].astype(float)
document['LoanAmount_log'] = np.log(document['LoanAmount'])
document['TotalIncome'] = document['ApplicantIncome'] + document['CoapplicantIncome']
document['TotalIncome_log'] = np.log(document['TotalIncome'])
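If several columns need the same treatment, the median fill can be done for all of them in one step. The following is a small sketch under assumptions: it builds a tiny frame from the four CSV rows shown in the question instead of reading the file, and it uses pd.to_numeric with errors="coerce" so that any non-numeric text becomes NaN before the medians are computed.

```python
import numpy as np
import pandas as pd

# The four rows from the CSV snippet; the first LoanAmount is missing.
document = pd.DataFrame({
    "LoanAmount": [np.nan, 128, 66, 120],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "CoapplicantIncome": [0, 1508, 0, 2358],
})
cols = ["LoanAmount", "ApplicantIncome", "CoapplicantIncome"]

# Coerce to numeric first, then fill each column with its own median.
document[cols] = document[cols].apply(pd.to_numeric, errors="coerce")
document[cols] = document[cols].fillna(document[cols].median())

document["LoanAmount_log"] = np.log(document["LoanAmount"])
document["TotalIncome"] = document["ApplicantIncome"] + document["CoapplicantIncome"]
document["TotalIncome_log"] = np.log(document["TotalIncome"])
print(document)
```

Passing a Series of medians to fillna fills each column from the entry with the matching column name.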
How would you create columns in the pandas DataFrame below where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data, where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1,2,3,4],'val': [10000, 25000, 20000, 40000],
'Mod_ID': [15, 35, 15, 42],'car': ['ford','honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3,1,2,4],'color': ['red', 'green', 'blue', 'grey'], 'wheel': ['4wheel','4wheel', '2wheel', '2wheel'],
'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> # pd.expanding_mean was removed from pandas; use .expanding().mean() instead
>>> df1["exp_mean"] = df1.groupby("Mod_ID_x")["val"].transform(lambda s: s.expanding().mean())
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
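The question also asks for the expanding median. Below is a minimal sketch using the current pandas expanding-window API, assuming the rows are already in time order; the frame is a trimmed copy of df1 with only the columns involved, and the column name exp_median is made up for this example.

```python
import pandas as pd

# Trimmed version of df1 from the question (rows assumed to be in time order).
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Mod_ID_x": [15, 35, 15, 42],
    "val": [10000, 25000, 20000, 40000],
})

grouped = df1.groupby("Mod_ID_x")["val"]
# transform keeps the original row order and index, so the result
# can be assigned straight back as a new column.
df1["exp_mean"] = grouped.transform(lambda s: s.expanding().mean())
df1["exp_median"] = grouped.transform(lambda s: s.expanding().median())
print(df1)
```

Using transform instead of groupby(...).expanding() avoids the extra group level in the index that would otherwise have to be dropped before assignment.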