Error while converting continuous data to categorical data in Logistic Regression - casting

I am using logistic regression on my dataset, whose target variable contains 0s and 1s. I used the .replace() function to replace them accordingly:
data['target'] = data['target'].replace({0: "No", 1: "yes"})
The code ran fine, but when I model the data with
model_log = sm.Logit(data['target'], data.iloc[:, 2:]).fit()
it raises the error below:
ValueError: Pandas data cast to numpy dtype of object. Check input
data with np.asarray(data).

When you select the X data using iloc, it returns a pandas DataFrame. According to the statsmodels documentation, Logit expects X and y to be array_like, and both must be numeric. Because you replaced the 0/1 target with the strings "No"/"yes", the target column now has object dtype, which is what triggers this error; map it back to numbers before fitting. You can use the to_numpy method to convert the DataFrame to a numpy array.
model_log = sm.Logit(data['target'].map({"No": 0, "yes": 1}).astype(float), data.iloc[:, 2:].to_numpy()).fit()
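For context, here is a minimal, self-contained sketch (synthetic data, made-up column names) that reproduces the error and the fix. Note that Logit does not add an intercept term on its own, so add_constant is used here as well:
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "id": range(100),                    # non-feature column, skipped by iloc[:, 2:]
    "target": rng.integers(0, 2, 100),
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
data['target'] = data['target'].replace({0: "No", 1: "yes"})

# sm.Logit(data['target'], data.iloc[:, 2:]).fit()  # ValueError: object dtype

y = data['target'].map({"No": 0, "yes": 1}).astype(float)
X = sm.add_constant(data.iloc[:, 2:].to_numpy())    # add an explicit intercept column
model_log = sm.Logit(y, X).fit()
print(model_log.summary())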

Related

Error during Dataset to Pandas conversion

I am getting this error while converting a dataset to pandas using ds.to_pandas. Is there any way to overcome it?
The dataset has more than the given limit of 100000 records. Use d.limit(N).topandas()
Thanks
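The error message itself points at the fix: either truncate the dataset before converting, or raise the limit. A minimal sketch, assuming ds is a Ray Data Dataset (whose to_pandas refuses to materialize more rows than its safety limit):
import ray

ds = ray.data.range(200_000)             # toy dataset larger than the 100000-row limit

df_head = ds.limit(100_000).to_pandas()  # option 1: keep only the rows you need
df_all = ds.to_pandas(limit=ds.count())  # option 2: raise the limit if the data fits in memory

print(len(df_head), len(df_all))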

Read pandas dataframe row by row, call API and then store each result in a separate pandas dataframe

Currently I am stuck on a problem and hope to find a solution here. I have a pandas DataFrame with 100000 rows and 10 columns. I want to iterate over it row by row, pass each row to an API, store the incoming results as a separate pandas DataFrame, and then write that to a table. Please let me know if this is possible; if so, any sample piece of code would be appreciated.
If you're calling the API row by row and want to collect each result in another DataFrame, note that df.append(result) does not modify df in place; it returns a new DataFrame, and DataFrame.append was removed entirely in pandas 2.0. Collect the results in a list and build the DataFrame once at the end:
import pandas as pd

results = []
for _, row in source_df.iterrows():   # source_df is your 100000 x 10 DataFrame
    results.append(call_api(row))     # call_api stands in for your API wrapper
result_df = pd.DataFrame(results)
This gathers every row you get back from the API; you can then convert or write result_df however you choose.
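For the final step, writing result_df to a table, pandas' to_sql works with any SQLAlchemy-compatible connection. A minimal sketch, with a hypothetical SQLite database and table name:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///results.db")            # hypothetical target database
result_df = pd.DataFrame([{"status": "ok", "value": 1}])  # stand-in for the API output
result_df.to_sql("api_results", engine, if_exists="append", index=False)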

Fractional logit models in SAS

Institutionally constrained to using SAS (yes, I know). I have a basic specification I run in Stata/R with no problem: a fractional logit model (Papke & Wooldridge 1996). It's a GLM with a binomial distribution assumption and a logit link function. The data context is stationary time series in the unit interval, i.e. percentage data.
In Stata this is easily run as
glm Y X, family(binomial) link(logit)
in R it is
aModel <- glm(Y ~ X, family=binomial(link=logit), data = aDataFrame)
Attempting to do this in SAS using proc GLIMMIX:
proc glimmix data=aDataTable method=rspl;
    class someClassifier anotherClassifier;
    model Y = X / dist=binomial link=logit SOLUTION;
    random _residual_;
run;
I'm dealing with a panel dataset, which doesn't matter for the R or Stata syntax but appears to be needed information for proc glimmix, hence my inclusion of the class line. I am able to fit models that are fairly close to the original from Stata/R, but they differ in non-trivial ways when we look at individual parameters or predicted values (the correlation between the two sets of predicted values is about .97). Can anyone advise on the proper way to do a fractional logit in SAS? I think the inclusion of the random line as I have it above is one source of trouble, as it seems to add random effects to the model via an extra matrix-vector operation.
The solution turns out to be simple. You need to use
method = QUAD
which uses quasi-maximum likelihood estimation, the same as is used in Stata and R.

Datatype mismatch converting SAS numeric to Teradata BIGINT

I have a SAS dataset with a numeric variable ACCT_ID (among other fields). Its attributes in a PROC CONTENTS are:
#  Variable  Type  Len  Format  Informat  Label
1  ACCT_ID   Num     8  19.     19.       ACCT_ID
I know that this field doesn't have any non-integer values in it, so I want to store it as a BIGINT in Teradata, and I've specified this with the dbtype data set option like this:
data td.output(dbtype=(ACCT_ID="BIGINT", <etc etc>));
However, this gives the following error:
ERROR: Datatype mismatch for column: ACCT_ID.
There are no missing or non-integer values in that field, and the error persists even if I round ACCT_ID using round(acct_id, 1) to explicitly remove any floating-point values that could exist.
Strangely enough, no error is given if I assign it to be a DECIMAL(18,0) in Teradata rather than a BIGINT. I guess that could be one workaround, but I'd like to understand how I can create integer fields in Teradata from SAS numeric variables like this, given that SAS doesn't distinguish between integer and floating-point types.
SAS does not support the BIGINT datatype. See http://support.sas.com/kb/34/729.html.
Teradata's BIGINT data type is not supported in SAS/ACCESS Interface
to Teradata. You cannot read or update a table containing a column
with the BIGINT data type in SAS/ACCESS Interface to Teradata.
Attempting to do so generates the following error message:
ERROR: At least one of the columns in this DBMS table has a datatype that is
not supported by this engine.

How to add a column in dataframe

I cast a data frame using the reshape package; it is 100 obs by 1000 variables with some NAs. How would I add a column containing the mean, median, min, max, total, etc. to the data frame?
I keep getting a "length of 'dimnames' [2] not equal to array extent" error when trying the apply function and simple rowMeans functions.
Thanks!
Can you try using reshape2::dcast instead of reshape::cast to cast your data frame, and then run the following:
df1$mean <- apply(df1, 1, function(x) mean(x, na.rm = TRUE))