Please help to understand why object_cols which is list of categorical variables
used as label_X_train[object_cols] why not as label_X_train.object_cols as object_cols is already a list. Same question for X_train[object_cols] in the below codes.
Get list of categorical variables
`> s = (X_train.dtypes == 'object')
> object_cols = list(s[s].index)
>
> print("Categorical variables:")
> print(object_cols)
> Categorical variables:
> ['Type', 'Method', 'Regionname']`
Make copy to avoid changing original data
` label_X_train = X_train.copy()
label_X_valid = X_valid.copy()``
Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder() label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
What happens when we put a list inside a data frame column.
Related
I would like to do a nested if statement within powerbi, I need to have multiple if statements in one column with some returning - value depending on the if statement.
I have tried to below however theyre all coming back as false.
Calculated value = if('TableName'[ColumnName1] = "exp1" && 'TableName'[columnName2] = "exp2", 'TableName'[value]|| if('TableName'[ColumnName1] = "exp1" && 'TableName'[ColumnName2] = "exp3", - 'TableName'[value],""))
In the first if you don't have any output to the false result. I Add an extra empty value in your calculated value as below only to exemplify what you need to add:
Calculated value = if('TableName'[ColumnName1] = "exp1" && 'TableName'[columnName2] = "exp2", 'TableName'[value]|| if('TableName'[ColumnName1] = "exp1" && 'TableName'[ColumnName2] = "exp3", - 'TableName'[value],""),"")
Check the example and tell something about this, please
A user can supply answers for up to 5 question, and each question has 3 columns in the database:
st_m1_id
st_m1_fam
st_m1_cip
st_m2_id
st_m2_fam
st_m2_cip
etc
When the form is submitted I get the IDs into an array and then loop over that and then query a different table to look up fam and cip.
aidms = listToArray(form['idms[]']);
for( i=1;i<=arrayLen(aidms);i++ ) {
qryData = invoke(dataCFC,'queryData', { smid = aidms[i]});
temp = invoke(userCFC,'updateUser',{
userid = session.userid,
st_m#i#_id = aidms[i],
st_m#i#_fam = qryData.fam,
etc......
});
};
How do I notate the dynamic query column names in the function so that when they are passed to updateUser the correct columns are referenced?
I am attempting to compare a column in 2 different data frames and getting an error. My goal is to determine if the playerID in df1 matches the playerID in df2. Also not sure if it makes a difference but the data in each data frame is of different lengths.
Here is my code with examples of the data frames:
cleaned_hof_df = hof_df[(hof_df.inducted == 'Y') & (hof_df.category == 'Player')]
cleaned_hof_df.reset_index(drop = True, inplace = True)
cleaned_hof_df.head(3)
cleaned_wins_losses_df = pitching_df[(pitching_df.W > 0) & (pitching_df.L > 0)]
cleaned_wins_losses_df.reset_index(drop = True, inplace = True)
cleaned_wins_losses_df.head(3)
cleaned_hof_df.playerID == cleaned_wins_losses_df.playerID
Your data frames
cleaned_hot_df
and
cleaned_win_losses_df
have different number of rows, so the corresponding series
cleaned_hot_df.playerID
and
cleaned_win_losses_df.playerID
have different lengths.
So your two series are not identically-labeled (which is the error you got).
Extracting data from xml file usesing python. Is it possible to fatch each data in single loop and store in respective array?
import xml.etree.cElementTree as etree
xmlDoc = open('C:/Users/Talha/Documents/abc.xml', 'r')
xmlDocData = xmlDoc.read()
xmlDocTree = etree.XML(xmlDocData)
srnoList, stateList, statecdList, districtList, issuedOnList, dayList, normalRainfallList, normalTempmaxList, normalTempminList = ([] for i in range(9))
for srno in xmlDocTree.iter('Srno'):
srnoList.append(srno.text)
for state in xmlDocTree.iter('State'):
stateList.append(state.text)
for statecd in xmlDocTree.iter('Statecd'):
statecdList.append(statecd.text)
for district in xmlDocTree.iter('District'):
districtList.append(district.text)
for issuedOn in xmlDocTree.iter('IssuedOn'):
issuedOnList.append(issuedOn.text)
for day in xmlDocTree.iter('Day'):
dayList.append(day.text)
for normalRainfall in xmlDocTree.iter('normal_rainfall'):
normalRainfallList.append(normalRainfall.text)
for normalTempmax in xmlDocTree.iter('normal_temp_max'):
normalTempmaxList.append(normalTempmax.text)
for normalTempmin in xmlDocTree.iter('normal_temp_min'):
normalTempminList.append(normalTempmin.text)
Down to 2 loops, but a lot cleaner:
sections = ['Srno','State','Statecd','District','IssuedOn','Day','normal_rainfall','normal_temp_max','normal_temp_min']
fetched = dict()
for sec in sections:
fetched[sec] = []
for item in xmlDocTree.iter( sec ):
fetched[sec].append( item )
Instead of distinct variables for each type of data, each is an item in a dictionary. This could be tweaked to use different keys for the dictionary.
I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit2: I written some code which should merge the data from both of my column, but I get a KeyError when I try to set the new values using my index generated from rows where my first df has values but my second df doesn't. Here's the code:
def merge_func(df):
null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
return df
merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100'\n '2004-03-30T02:00:00.000000000+0200' '2004-04-12T02:00:00.000000000+0200'\n '2004-04-15T02:00:00.000000000+0200' '2004-04-17T02:00:00.000000000+0200'\n '2004-04-19T02:00:00.000000000+0200' '2004-04-20T02:00:00.000000000+0200'\n '2004-04-22T02:00:00.000000000+0200' '2004-04-26T02:00:00.000000000+0200'\n '2004-04-28T02:00:00.000000000+0200' '2004-04-30T02:00:00.000000000+0200'\n '2004-05-05T02:00:00.000000000+0200' '2004-05-07T02:00:00.000000000+0200'\n '2004-05-10T02:00:00.000000000+0200' '2004-05-13T02:00:00.000000000+0200'\n '2004-05-17T02:00:00.000000000+0200' '2004-05-20T02:00:00.000000000+0200'\n '2004-05-24T02:00:00.000000000+0200' '2004-05-28T02:00:00.000000000+0200'\n '2004-06-04T02:00:00.000000000+0200' '2004-06-10T02:00:00.000000000+0200'\n '2004-08-27T02:00:00.000000000+0200' '2004-10-06T02:00:00.000000000+0200'\n '2004-11-02T01:00:00.000000000+0100' '2004-12-08T01:00:00.000000000+0100'\n '2011-02-21T01:00:00.000000000+0100' '2011-03-21T01:00:00.000000000+0100'\n '2011-04-04T02:00:00.000000000+0200' '2011-04-11T02:00:00.000000000+0200'\n '2011-04-14T02:00:00.000000000+0200' '2011-04-18T02:00:00.000000000+0200'\n '2011-04-21T02:00:00.000000000+0200' '2011-04-25T02:00:00.000000000+0200'\n '2011-05-02T02:00:00.000000000+0200' '2011-05-09T02:00:00.000000000+0200'\n '2011-05-23T02:00:00.000000000+0200' '2011-06-07T02:00:00.000000000+0200'\n '2011-06-21T02:00:00.000000000+0200' '2011-07-04T02:00:00.000000000+0200'\n '2011-07-18T02:00:00.000000000+0200' '2011-08-31T02:00:00.000000000+0200'\n '2011-09-13T02:00:00.000000000+0200' '2011-09-28T02:00:00.000000000+0200'\n '2011-10-10T02:00:00.000000000+0200' '2011-10-25T02:00:00.000000000+0200'\n '2011-11-08T01:00:00.000000000+0100' '2011-11-28T01:00:00.000000000+0100'\n '2011-12-20T01:00:00.000000000+0100' '2012-01-19T01:00:00.000000000+0100'\n '2012-02-14T01:00:00.000000000+0100' '2012-03-13T01:00:00.000000000+0100'\n '2012-03-27T02:00:00.000000000+0200' '2012-04-02T02:00:00.000000000+0200'\n '2012-04-10T02:00:00.000000000+0200' '2012-04-17T02:00:00.000000000+0200'\n '2012-04-26T02:00:00.000000000+0200' '2012-04-30T02:00:00.000000000+0200'\n '2012-05-03T02:00:00.000000000+0200' '2012-05-07T02:00:00.000000000+0200'\n '2012-05-10T02:00:00.000000000+0200' '2012-05-14T02:00:00.000000000+0200'\n '2012-05-22T02:00:00.000000000+0200' '2012-06-05T02:00:00.000000000+0200'\n '2012-06-19T02:00:00.000000000+0200' '2012-07-03T02:00:00.000000000+0200'\n '2012-07-17T02:00:00.000000000+0200' '2012-07-31T02:00:00.000000000+0200'\n '2012-08-14T02:00:00.000000000+0200' '2012-08-28T02:00:00.000000000+0200'\n '2012-09-11T02:00:00.000000000+0200' '2012-09-25T02:00:00.000000000+0200'\n '2012-10-10T02:00:00.000000000+0200' '2012-10-24T02:00:00.000000000+0200'\n '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close, but you actually don't need to iterate over the rows when using the isnull() functions. by default
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
Will return just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null.
Now you can do something like this to set the values for TOC_mg/L:
null_index = df[(df['DOC_mg/L'].isnull() == False) & \
(df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index] # EDIT To switch the index position.
This will use the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and set the values for TOC_mg/L to the those found in DOC_mg/L in the same rows.
Note: This is not the accepted way for setting values using an index, but it is how I've been doing it for some time. Just make sure that when setting values, the left side of the equation is df['col_name'][index]. If col_name and index are switched you will set the values to a copy which is never set back to the original.
Now to set the mean, you can create a new column, we'll call this Mean_mg/L and set the value = 0.0. Then set this new column to the mean of both columns:
# Insert a new col at the end of the dataframe columns name 'Mean_mg/L'
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this columns value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the columns where we filled null values with the corresponding column value, the average will be the same as the values.