I need to create some new variables day1, day2, day3, etc. Whenever readmit=1, each day[i] should be set to the gap value of the corresponding row in that run of consecutive readmit=1 records. For example, the first two readmit=1 rows should give day1=21 and day2=9; the next run (the third, fourth, and fifth readmit=1 rows) should give day1=29, day2=12, and day3=23; and so on. Hopefully I have expressed this well enough. Thanks in advance.
STUDYID index readmit gap
10001 1 0 .
10001 1 0 79
10001 1 0 48
10001 1 0 39
10001 1 0 74
10001 1 0 41
10001 0 1 21
10001 0 1 9
10001 0 0 130
10001 0 0 52
10001 0 0 110
10001 1 0 80
10001 0 1 29
10001 0 1 12
10001 0 1 23
10001 1 0 57
10001 0 1 28
10001 0 1 14
10001 1 0 118
10001 0 1 5
10001 0 1 22
10001 1 0 40
10001 0 1 23
10001 0 1 24
10001 0 1 19
I think the code below answers your question. It requires two passes of the data: the first calculates the maximum number of consecutive rows where READMIT=1 and stores it in a macro variable, which is then used to size the array in the second pass.
The key to solving this is the order of the data together with the NOTSORTED option on the BY statement, which lets every change in the READMIT value start a new BY group.
Hope this helps, although it would be good if someone could find a method that uses just a single pass of the data.
data have;
input STUDYID index readmit gap;
cards;
10001 1 0 .
10001 1 0 79
10001 1 0 48
10001 1 0 39
10001 1 0 74
10001 1 0 41
10001 0 1 21
10001 0 1 9
10001 0 0 130
10001 0 0 52
10001 0 0 110
10001 1 0 80
10001 0 1 29
10001 0 1 12
10001 0 1 23
10001 1 0 57
10001 0 1 28
10001 0 1 14
10001 1 0 118
10001 0 1 5
10001 0 1 22
10001 1 0 40
10001 0 1 23
10001 0 1 24
10001 0 1 19
;
run;
data _null_;
    set have (keep=readmit) end=last;
    by readmit notsorted;
    /* reset the counter at the start of each run of identical READMIT values */
    if first.readmit then days=0;
    retain max_days;
    if readmit=1 then days+1;
    max_days=max(max_days,days);
    /* pass the longest run of READMIT=1 to the next step as a macro variable */
    if last then call symput('max_days',strip(max_days));
run;
%put maximum consecutive days = &max_days.;
data want;
    set have;
    by readmit notsorted;
    array dayvar{*} day1-day&max_days.;
    /* start a fresh set of DAY variables whenever READMIT changes value */
    if first.readmit then do;
        num_day=0;
        call missing(of day:);
    end;
    retain day1-day&max_days.;
    if readmit=1 then do;
        num_day+1;
        dayvar{num_day}=gap;
        /* output one record per run of consecutive READMIT=1 rows */
        if last.readmit then output;
    end;
    keep studyid index day: ;
run;
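As a rough single-pass sketch (assuming an upper bound on the number of consecutive READMIT=1 rows, here arbitrarily set to 50), you could skip the first pass entirely by sizing the array generously; unused DAY variables simply stay missing:
data want_onepass;
    set have;
    by readmit notsorted;
    /* 50 is an assumed upper bound on consecutive readmissions; adjust to your data */
    array dayvar{*} day1-day50;
    retain day1-day50;
    if first.readmit then do;
        num_day=0;
        call missing(of day:);
    end;
    if readmit=1 then do;
        num_day+1;
        dayvar{num_day}=gap;
        if last.readmit then output;
    end;
    keep studyid index day: ;
run;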
I am using estpost and esttab to export tabulation results in Stata.
sysuse auto, clear
estpost tabulate turn foreign
esttab ., cells("b(fmt(0))") unstack
---------------------------------------------------
(1)
Domestic Foreign Total
b b b
---------------------------------------------------
31 1 0 1
32 0 1 1
33 1 1 2
34 2 4 6
35 2 4 6
36 1 8 9
37 2 2 4
38 1 2 3
39 1 0 1
40 6 0 6
41 4 0 4
42 7 0 7
43 12 0 12
44 3 0 3
45 3 0 3
46 3 0 3
48 2 0 2
51 1 0 1
Total 52 22 74
---------------------------------------------------
N 74
---------------------------------------------------
Although I can change the format of the cells, I couldn't find a way to change the format of the number of observations (N) or of the total number of observations in each column. I tried adding obs(fmt(%10.2fc)) as an esttab option, but it didn't work.
I have a table with Scores and default indicator values.
I sorted the table on the basis of descending scores and then applied proc rank to populate the group column.
Below is a sample of the dataset after the proc rank step.
Obs Scores Def group
1 100 0 9
2 100 1 9
3 99 0 9
4 97 0 9
5 97 0 9
6 95 0 9
7 94 0 9
8 92 0 9
9 92 0 9
10 91 0 9
11 91 0 9
12 89 1 8
13 88 0 8
14 87 0 8
15 87 0 8
16 86 0 8
17 85 0 8
18 84 0 8
19 84 0 8
20 83 0 8
21 83 0 8
22 83 0 8
23 82 0 8
24 81 0 7
25 80 0 7
26 80 1 7
I want to count the population (i.e., the number of scores that lie within each group).
Also count the number of defaults in each group.
I tried the below code:
proc rank data = sortedScore groups = 10 out = Score_sorted_10;
var Scores ;
ranks Scores_group;
run;
data NumCount;
    set Score_sorted_10;
    Retain Popnum 0;
    Retain Badnum 0;
    do i=0 to 9;
        if Scores_group=i
            then Popnum=sum(Popnum,1);
        if Scores_group=i and Def=1
            then Badnum=sum(Def,1);
    end;
But this code gets into an infinite loop.
Please help.
I think it is easier to do it using proc sql.
The following query will do the trick:
proc sql;
    create table want as
    select distinct
        Group,
        count(scores) as Nbr_Scores,
        sum(def) as Nbr_Def
    from have
    group by group;
quit;
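For what it's worth, the same counts can be produced without SQL. Here is a sketch using PROC SUMMARY (the dataset and variable names are assumed to match the question):
proc summary data=have nway;
    class group;
    var def;
    /* _FREQ_ counts the scores per group; SUM= gives the number of defaults */
    output out=want2(drop=_type_ rename=(_freq_=Nbr_Scores)) sum=Nbr_Def;
run;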
I have a data set where the percentage of bads is quite low. Can anyone suggest a way to balance such a data set using SAS so that logistic regression gives a better result? Below is a sample. Thanks in advance!
ID X1 X2 X3 X4 X5 Target
1 87 400 2 0 0 0
2 70 620 1 0 0 0
3 66 410 3 0 0 0
4 85 300 1 0 0 0
5 100 200 4 0 0 0
6 201 110 1 0 0 0
7 132 513 3 0 0 0
8 98 417 4 0 0 0
9 397 620 1 0 0 1
10 98 700 5 0 0 1
You can oversample the bads and then use the PRIOREVENT= option on the SCORE statement of PROC LOGISTIC to correct for the oversampling. There are plenty of examples online that will help you further with this.
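For example, a minimal sketch (OVERSAMPLED and NEW_DATA are hypothetical dataset names, and the 10% prior event rate is only an assumed value for the true population rate):
proc logistic data=oversampled;
    model target(event='1') = x1-x5;
    /* PRIOREVENT= rescales the scored probabilities back to the assumed true event rate */
    score data=new_data out=scored priorevent=0.10;
run;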
I am trying to replace values in a DataFrame with 0. In the first column I need to replace the first 3 values, in the next column the first 6 values, and so on, increasing by 3 each time.
import numpy as np
import pandas as pd

a = np.array([133,124,156,189,132,176,189,192,100,120,130,140,150,50,70,133,124,156,189,132])
b = pd.DataFrame(a.reshape(10,2), columns=['s','t'])
for columns in b:
    yy = 3
    for i in xrange(yy):
        b[columns][i] = 0
    yy += 3
print b
The outcome is the following:
s t
0 0 0
1 0 0
2 0 0
3 189 189
4 132 132
5 176 176
6 189 189
7 192 192
8 100 100
9 120 120
I am clearly missing something really simple. How do I make the loop replace 6 values instead of only 3 in column t? Any ideas?
I would do it this way:
i = 1
for c in b.columns:
    b.loc[0 : 3*i-1, c] = 0   # label-based slice, end-inclusive: rows 0 .. 3*i-1
    i += 1
Demo:
In [86]: b = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
In [87]: %paste
i = 1
for c in b.columns:
    b.loc[0 : 3*i-1, c] = 0
    i += 1
## -- End pasted text --
In [88]: b
Out[88]:
a b c d
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 10 0 0 0
4 8 0 0 0
5 49 0 0 0
6 55 48 0 0
7 99 43 0 0
8 63 29 0 0
9 61 65 74 0
10 15 29 41 0
11 79 88 3 0
12 91 74 11 4
13 56 71 6 79
14 15 65 46 81
15 81 42 60 24
16 71 57 95 18
17 53 4 80 15
18 42 55 84 11
19 26 80 67 59
You need to initialize yy=3 before the loop:
yy = 3
for columns in b:
    for i in xrange(yy):
        b[columns][i] = 0
    yy += 3
print b
Python 3 solution:
yy = 3
for columns in b:
    for i in range(yy):
        b[columns][i] = 0
    yy += 3
print(b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132
Another solution:
yy = 3
for i, col in enumerate(b.columns):
    b.loc[:i*yy+yy-1, col] = 0
print (b)
s t
0 0 0
1 0 0
2 0 0
3 189 0
4 100 0
5 130 0
6 150 50
7 70 133
8 124 156
9 189 132
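If you prefer to avoid the explicit Python loop altogether, here is a possible fully vectorized sketch (using the question's data; the mask shape comes from NumPy broadcasting, and DataFrame.mask replaces the flagged cells):
import numpy as np
import pandas as pd

a = np.array([133,124,156,189,132,176,189,192,100,120,130,140,150,50,70,133,124,156,189,132])
b = pd.DataFrame(a.reshape(10, 2), columns=['s', 't'])

rows = np.arange(len(b))[:, None]           # row positions as a column vector
limits = 3 * (np.arange(b.shape[1]) + 1)    # 3 for 's', 6 for 't', ...
b = b.mask(rows < limits, 0)                # zero out the first 3*(j+1) rows of column j
print(b)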
In R, I'm trying to train a neural network on multiple files. I have run the multinom function on a single dataset, but I cannot find how to train my model on another dataset.
So I want to apply a model from a previous call to new data without re-estimating the model.
So first you build a model, as explained in Sam Thomas's answer.
#load libraries
library(nnet)
library(MASS)
#Define data
example(birthwt)
# Define training and test data
set.seed(321)
index <- sample(seq_len(nrow(bwt)), 130)
bwt_train <- bwt[index, ]
bwt_test <- bwt[-index, ]
# Build model
bwt.mu <- multinom(low ~ ., data=bwt_train)
Then I have another, similar dataset that I want to use to train/update the earlier model. So I want to update the model with new data to improve it.
# New data set (for example resampled bwt)
bwt2=sapply(bwt, sample)
head(bwt2,3)
low age lwt race smoke ptd ht ui ftv
[1,] 1 31 115 3 1 1 0 0 2
[2,] 1 20 95 1 0 1 0 0 3
[3,] 2 25 95 2 0 1 0 1 1
# Define training and test data with new dataset
set.seed(321)
index <- sample(seq_len(nrow(bwt2)), 130)
bwt2_train <- bwt2[index, ]
bwt2_test <- bwt2[-index, ]
Now I want to optimize the model with this new dataset. I cannot simply merge the two datasets, because the model should be updated over time as new data becomes available, and it is not preferable to re-estimate everything every time new data arrives.
Thanks in advance,
Adam
Borrowed from an example in ?nnet::multinom
library(nnet)
library(MASS)
example(birthwt)
head(bwt, 2)
low age lwt race smoke ptd ht ui ftv
1 0 19 182 black FALSE FALSE FALSE TRUE 0
2 0 33 155 other FALSE FALSE FALSE FALSE 2+
set.seed(321)
index <- sample(seq_len(nrow(bwt)), 130)
bwt_train <- bwt[index, ]
bwt_test <- bwt[-index, ]
bwt.mu <- multinom(low ~ ., bwt_train)
(pred <- predict(bwt.mu, newdata=bwt_test))
[1] 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
[39] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
Levels: 0 1
Or, if you want the probabilities:
(pred <- predict(bwt.mu, newdata=bwt_test, type="probs"))
1 5 6 16 19 23 24
0.43672841 0.65881933 0.21958026 0.39061949 0.51970665 0.01627479 0.17210620
26 27 28 29 30 37 40
0.06133368 0.31568117 0.05665126 0.26507476 0.37419673 0.18475433 0.14946268
44 46 47 51 56 58 60
0.09670367 0.72178459 0.06541529 0.37448908 0.31883809 0.09532218 0.27515734
61 64 67 69 72 74 76
0.27515734 0.09456443 0.16829037 0.62285841 0.12026718 0.47417711 0.09603950
78 87 94 99 100 106 114
0.34588019 0.30327432 0.87688323 0.21177276 0.06576210 0.19741587 0.22418653
115 117 118 120 125 126 130
0.14592195 0.19340994 0.14874536 0.30176632 0.09513698 0.08334515 0.03886775
133 134 139 140 145 147 148
0.41216817 0.85046516 0.46344537 0.34219775 0.33673304 0.26894886 0.43778705
152 163 164 165 168 174 180
0.19044485 0.27800125 0.17865143 0.86783149 0.25969355 0.60623964 0.34931986
182 183 185
0.22944657 0.08066599 0.22863967
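If the goal is really to continue training on new data rather than just predict, one possible sketch (not covered in the answer above) is to warm-start a second multinom call with the coefficients of the first fit, passed through the Wts argument that nnet::multinom forwards to nnet. This assumes the new data is a data frame with exactly the same columns and factor levels, so the weight vector lengths match; new_train here is a hypothetical dataset with that structure.
library(nnet)
# warm-start the new fit from the previous fit's weights instead of random starting values
bwt.mu2 <- multinom(low ~ ., data = new_train, Wts = bwt.mu$wts)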