How to add a column in dataframe - casting

I cast a data frame using the reshape package; the result is 100 obs. by 1000 variables with some NAs. How would I add columns that hold the row mean, median, min, max, total, etc. to the data frame?
I keep getting a "length of 'dimnames' [2] not equal to array extent" error when trying the apply function and simple rowMeans calls.
Thanks!

Can you try using reshape2::dcast instead of reshape::cast to cast your dataframe and then run the following:
df1$mean<-apply(df1,1,function(x) mean(x, na.rm=TRUE))
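The other statistics the question asks for (median, min, max, total) follow the same row-wise pattern as the mean. As a rough, language-neutral sketch of that pattern in plain Python (not R), assuming each row is a list of numbers with missing values stored as None; the name row_summary is made up for this illustration:

```python
from statistics import mean, median

def row_summary(row):
    """Return summary statistics for one row, skipping missing values (None)."""
    values = [v for v in row if v is not None]
    if not values:
        # A row with no observed values has no meaningful statistics.
        return {"mean": None, "median": None, "min": None, "max": None, "total": None}
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "total": sum(values),
    }

# Two illustrative rows; the first has one missing value.
rows = [
    [1.0, None, 3.0],
    [4.0, 5.0, 6.0],
]
summaries = [row_summary(r) for r in rows]
```

Dropping the missing values before computing each statistic plays the same role as na.rm=TRUE in the R call above.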

Convert excel formula to DAX / M

I am trying to work out if this is possible or not in DAX or M.
Basically I want to replicate this:
=IF(T9>0,T9-1,$Q$6)
This is the formula in T10. It counts down by one if the value above it is greater than 0; otherwise it puts in the value from $Q$6 and starts counting down again.
Here is some data and expected outcome:
When the stock on hand drops below 5000, it triggers the lead time countdown to start. When that hits 0, it adds stock to the SOH balance, 4000 in this case. Since the stock is still below its reorder point, it starts the countdown again.
We would need the data in order to answer your question properly. If you can't share the dataset, at least add an index column using
Transform data > Add Column > Index Column
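For reference, the recurrence in the Excel formula from the question, =IF(T9>0,T9-1,$Q$6), can be written out as a small Python sketch. This is only an illustration of the countdown logic, not DAX or M. The function and parameter names are invented here, and the stock and reorder-point logic from the question is not modelled:

```python
def lead_time_countdown(start, reset_value, steps):
    """Apply next = previous - 1 if previous > 0 else reset_value, `steps` times."""
    values = [start]
    for _ in range(steps):
        previous = values[-1]
        # Same branch as the Excel formula: count down, or reset once 0 is reached.
        values.append(previous - 1 if previous > 0 else reset_value)
    return values

# Hypothetical lead time of 3 periods (the role played by $Q$6 in the sheet).
countdown = lead_time_countdown(start=3, reset_value=3, steps=6)
# countdown is [3, 2, 1, 0, 3, 2, 1]
```

Each value depends on the one before it, which is why an explicit row order (the index column) matters for any translation of this formula.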

Error while converting continuous data to categorical data in Logistic Regression

I am using Logistic regression over my dataset which has its target variable in 0s and 1s. I used .replace() function and replaced them accordingly.
> data['target']=data['target'].replace({0:"No",1:"yes"})
The code ran fine. But when I am modelling the data,
model_log=sm.Logit(data['target'],data.iloc[:,2:]).fit()
it is showing the below error:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
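When you select the X data using iloc, it returns a pandas DataFrame. According to the statsmodels documentation, Logit expects X and y to be array_like, and they must be numeric. The error means that at least one of the inputs has object dtype. After your replace() call the target holds the strings "No" and "yes", so it cannot be cast to float directly; keep the target as 0/1, or map the labels back before fitting. Also make sure every column selected for X is numeric. You can use the to_numpy method to convert the DataFrame to a NumPy array.
model_log=sm.Logit(data['target'].map({"No":0,"yes":1}).astype(float),data.iloc[:,2:].to_numpy(dtype=float)).fit()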
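A minimal sketch of the target conversion on its own, using only pandas. The helper name to_binary_target and the sample values are invented for illustration; the label spellings "No" and "yes" are taken from the question:

```python
import pandas as pd

def to_binary_target(labels, mapping=None):
    """Map string labels to 0.0/1.0 and fail loudly on labels missing from the mapping."""
    mapping = mapping or {"No": 0, "yes": 1}
    numeric = labels.map(mapping)  # labels not in the mapping become NaN
    if numeric.isna().any():
        unknown = sorted(set(labels[numeric.isna()]))
        raise ValueError(f"unmapped labels: {unknown}")
    return numeric.astype(float)

# Illustrative data only; the real dataset is not shown in the question.
data = pd.DataFrame({
    "target": ["No", "yes", "yes", "No"],
    "x1": [0.5, 1.2, 3.1, 0.1],
})
y = to_binary_target(data["target"])
```

Checking data.dtypes before fitting is a quick way to spot any remaining object columns among the features.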

In DataPrep, sum a set of many columns or values in an object

I have a DataPrep dataset which contains a series of ~10 columns, each of which indicates whether or not a particular brochure was selected:
BRO_AF BRO_SAF BRO_SE ...
1 1
1 1
1
I'd like to sum/count these values into a BrochuresSelected column.
I was hoping to use ADD with a column range (i.e. BRO_AF~BRO_ITA), but ADD only takes two numbers.
I can't use COUNT, as it counts rows, not columns.
I can use NEST to create a column storing a map or array of brochures, but there doesn't seem to be a function for adding these up. I can't use ARRAYLEN on this column, as even empty columns are represented in the array (e.g. ["1","","","","",""] would have an array length of six, not one).
Has anyone solved a similar issue?
If you know the column names, you can use the + operator in a derive transform. For example:
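Independently of Dataprep's own syntax, the counting logic can be sketched in Python. This is an illustration only, not a Dataprep recipe. It assumes each row is a dict keyed by column name, with an empty string or None for an unselected brochure, and it lists only the three column names visible in the question:

```python
# Only the columns shown in the question; the full list of ~10 is not known here.
BROCHURE_COLUMNS = ["BRO_AF", "BRO_SAF", "BRO_SE"]

def count_selected(row, columns=BROCHURE_COLUMNS):
    """Count the listed columns whose value in `row` is present and non-empty."""
    count = 0
    for name in columns:
        value = row.get(name)
        if value is not None and str(value).strip() != "":
            count += 1
    return count

rows = [
    {"BRO_AF": "1", "BRO_SAF": "1", "BRO_SE": ""},
    {"BRO_AF": "", "BRO_SAF": "", "BRO_SE": "1"},
]
for row in rows:
    row["BrochuresSelected"] = count_selected(row)
```

Counting non-empty values rather than taking the length of the collection avoids the ARRAYLEN problem described in the question, where empty entries still count toward the length.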

Calculating median on an Excel file

I want to calculate the medians for a series of numbers from an excel file.
My excel spreadsheet looks like this:
CELLNOUN 9.32
CELLNOUN 10.62
CELLNOUN 8.42
CELLNOUN 10.64
CELLNOUN 11.51
CELLNOUN 12.01
CELLNOUN 8.83
CELLSNOUN/CELLNOUN 9.53
CELLSNOUN/CELLNOUN 9.21
CELLNOUN/CELLSNOUN 10.76
CELLNOUN/CELLSNOUN 7.01
CELLSNOUN/CELLNOUN 10.21
PLANTNOUN/PLANTSNOUN 3.62
PLANTNOUN/PLANTSNOUN 3.38
PLANTSNOUN/PLANTNOUN 3.92
PLANTSNOUN/PLANTNOUN 3.24
PLANTNOUN/PLANTSNOUN 3.83
PLANTNOUN/PLANTSNOUN 3.24
PLANTSNOUN/PLANTNOUN 3.00
PLANTSNOUN/PLANTNOUN 1.80
...
In the spreadsheet, each set of words is separated by a blank row, but the number of entries in each set varies: CELLNOUN/CELLSNOUN has 12 entries, for example, but PLANTNOUN/PLANTSNOUN has 8. The numbers after the words are the occurrences of these words. I want to find the median of the occurrences for CELLNOUN/CELLSNOUN, PLANTNOUN/PLANTSNOUN, etc., by using regex instead of the MEDIAN function in Excel, because I have thousands of sets like this and I can't do it one by one in Excel. But if you know a quicker way to do it in Excel, please advise.
Thank you very much.
First of all, remove the blank rows from your data set and then create an Excel Table with Insert > Table or Ctrl-T. With an Excel table object, all functions and commands that refer to the table will catch when more data is added to the table.
Now you can create a pivot table from your source data with Insert > PivotTable. If you drag the first column field into the Rows area, you will have a list of unique values in that source data column. You can drag the values column into the Values area of the pivot panel, if you want to.
I'm not sure if you are aware of the different spellings of your categories, i.e. with or without an "S". The pivot table uncovers them all.
Out of the box, Excel PivotTables do not offer the Median as an option to aggregate, but you can use a method outlined here
http://www.myonlinetraininghub.com/calculating-median-in-pivottables
to calculate a median.
The exact approach varies depending on whether or not you use Pivot tables or Power Pivot, so check out the article.
Alternatively, use an array formula as shown below and press Ctrl+Shift+Enter to enter it as an array formula:
=MEDIAN((IF($A$1:$A$20=A1,$B$1:$B$20)))
Then copy the same formula down to every row so that each row shows the median for its own category.
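Outside Excel, the grouping can also be done in a few lines of plain Python, with no regex needed. The sketch below assumes the data has been exported as (label, value) pairs, and it treats labels that differ only in the order of their parts, such as CELLNOUN/CELLSNOUN and CELLSNOUN/CELLNOUN, as one set. That merging rule is an assumption, not something the question states, and the function names are invented for illustration:

```python
from collections import defaultdict
from statistics import median

def normalize_label(label):
    """Make 'B/A' and 'A/B' compare equal by sorting the parts of the label."""
    return "/".join(sorted(part.strip() for part in label.split("/")))

def medians_by_label(rows):
    """Group values by normalized label and return the median of each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[normalize_label(label)].append(value)
    return {label: median(values) for label, values in groups.items()}

# A few rows taken from the sample data in the question.
rows = [
    ("CELLSNOUN/CELLNOUN", 9.53),
    ("CELLSNOUN/CELLNOUN", 9.21),
    ("CELLNOUN/CELLSNOUN", 10.76),
    ("CELLNOUN/CELLSNOUN", 7.01),
    ("CELLSNOUN/CELLNOUN", 10.21),
    ("PLANTNOUN/PLANTSNOUN", 3.62),
    ("PLANTSNOUN/PLANTNOUN", 3.92),
    ("PLANTSNOUN/PLANTNOUN", 1.80),
]
result = medians_by_label(rows)
```

If the sets are really defined by the blank rows rather than by the labels, the grouping key would need to be the block position instead.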

create SAS dataset with variable number of attributes

I need to create a dataset in SAS with a variable number of attribute names.
I'm not very proficient in SAS, so I'm writing the logic in pseudocode:
for (i = 1 to 10)
{
    for (j = 1 to n)
    {
        combinations(i, j);
    }
    // perform some calculations on the temporary average table and delete it
}
The problem is in the combinations function. Here
combinations(i, j)
{
    // find all possible combinations
    // find average of all combinations
}
I now need to store all the averages in a temporary table/data set.
For example, for i=2, j=5, I'll have ten combinations for each value of j,
so the column count will be 10 and the row count will be 2.
This table should be a dynamic dataset, I guess.
I'm not really sure what to do; I'm just stuck.
Any help will be much appreciated.
Thanks
Likely the best solution is to initially create the i,j dataset as vertical - with each eventual-variable as a row - and then use PROC TRANSPOSE to transpose it to horizontal. You can use the ID statement to name the variable.
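The vertical-then-transpose idea can be sketched in Python (not SAS) to show the shape of the data at each step. The input values, the avg_N naming scheme, and the function names are all made up for illustration. The sketch reads i=2, j=5 from the question as "choose 2 out of 5 values", which gives the ten combinations the question mentions:

```python
from itertools import combinations
from statistics import mean

def combination_averages(values, size):
    """Average of every combination of `size` items drawn from `values`."""
    return [mean(combo) for combo in combinations(values, size)]

def long_rows(values, sizes):
    """Vertical layout: one row per (size, combination) pair."""
    rows = []
    for size in sizes:
        averages = combination_averages(values, size)
        for index, avg in enumerate(averages, start=1):
            rows.append({"i": size, "name": f"avg_{index}", "average": avg})
    return rows

def transpose_to_wide(rows):
    """Horizontal layout: one entry per size, one named column per combination."""
    wide = {}
    for row in rows:
        wide.setdefault(row["i"], {})[row["name"]] = row["average"]
    return wide

values = [1, 2, 3, 4, 5]            # hypothetical input values
rows = long_rows(values, sizes=[2])  # 10 vertical rows for C(5, 2)
wide = transpose_to_wide(rows)       # one wide row with columns avg_1 .. avg_10
```

Because the vertical layout has a fixed set of columns no matter how many combinations there are, the variable column count only appears in the final transpose step, which is the role PROC TRANSPOSE with an ID statement plays in the answer above.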