I have a dataset like so
data test;
do i = 1 to 100;
x1 = ceil(ranuni(0) * 100);
x2 = floor(ranuni(0) * 1600);
x3 = ceil(ranuni(0) * 1500);
x4 = ceil(ranuni(0) * 1100);
x5 = floor(ranuni(0) * 10);
output;
end;
run;
data test_2;
set test;
if mod(x1,3) = 0 then x1 = .;
if mod(x2,13) = 0 then x2 = .;
if mod(x3,7) = 0 then x3 = .;
if mod(x4,6) = 0 then x4 = .;
if mod(x5,2) = 0 then x5 = .;
drop i;
run;
I plan to calculate a number of percentiles including two non-standard percentiles (2.5th and 97.5th). I do this using proc stdize as below
PROC STDIZE
DATA=test_2
OUT=_NULL_
NOMISS
PCTLMTD=ORD_STAT
pctldef=3
OUTSTAT=STDLONGPCTLS
pctlpts=(2.5 5 25 50 75 95 97.5);
VAR _NUMERIC_;
RUN;
Comparing to proc means
DATA TEST_MEANS;
SET TEST_2;
IF NOT MISSING(X1);
IF NOT MISSING(X2);
IF NOT MISSING(X3);
IF NOT MISSING(X4);
IF NOT MISSING(X5);
RUN;
PROC MEANS
DATA=TEST_MEANS NOPRINT;
VAR _NUMERIC_;
OUTPUT OUT=MEANSWIDEPCTLS P5= P25= P50= P75= P95= / AUTONAME;
RUN;
However, when I compare the results above with those produced in Excel and by PROC MEANS, they don't match. I suspect it has something to do with how SAS treats missing values as -inf. Can someone confirm which is correct?
You are using PCTLDEF=3 in PROC STDIZE, but PROC MEANS is running with its default definition, which is PCTLDEF=5. When I ran your code with PCTLDEF=3 in PROC MEANS, I got matching results.
A user on sasprofessionals.net was unable to group his dataset by several variables whose values are interchangeable within an observation because they carry the same meaning.
In the example dataset, observations 2, 3 and 7 are the same because each has A14, A14 and A10 as the values of Stat1 to Stat3; only the order differs. These should be grouped and their Count summed. Observations 5 and 6 form another group whose Count should be summed.
Example dataset:
Obs Stat1 Stat2 Stat3 Count
1 A14 A14 A14 53090
2 A14 A14 A10 6744
3 A14 A10 A14 5916
4 A01 A01 A01 4222
5 A10 A10 A10 3085
6 A10 A10 A10 2731
7 A10 A14 A14 2399
Desirable output:
Obs Stat1 Stat2 Stat3 Count
1 A14 A14 A14 53090
4 A01 A01 A01 4222
6 A10 A10 A10 5816
7 A10 A14 A14 15059
The actual dataset is larger and more complex. I am not aware if the user tried any methods to solve the problem.
This question was originally posted on sasprofessionals.net and it was copied to StackOverflow for the benefit of the community. It was changed to meet the StackOverflow Q&A standards.
This was my answer to the user's problem. In short, I loaded Stat1-Stat3 into an array, sorted the values within each observation with the CALL SORTC routine, and then summed Count by a temporary ID built from the sorted Stat1-Stat3 values.
/* Loading the data into SAS dataset */
/* Loading Stat1-Stat3 into an array */
/* Sorting stat1-stat3 creating a new ID */
data have;
input obs stat1 $ stat2 $ stat3 $ count;
array stat{3} stat1-stat3;
call sortc(of stat1-stat3);
ID = CATX("/",stat1,stat2,stat3);
datalines;
1 A14 A14 A14 53090
2 A14 A14 A10 6744
3 A14 A10 A14 5916
4 A01 A01 A01 4222
5 A10 A10 A10 3085
6 A10 A10 A10 2731
7 A10 A14 A14 2399
;
/* sorting the data set in preparation for data step with by statement*/
PROC SORT data=have out=sorted_arrays;
BY ID OBS;
RUN;
/* Summarising the dataset and outputting into the final dataset */
DATA summed (drop=ID count);
set sorted_arrays;
by ID;
retain sum 0;
if first.ID then sum = 0;
sum + count;
if last.ID then output;
RUN;
/* Sorting it back into original order */
PROC SORT data=summed out=want;
BY OBS;
RUN;
Since I've been giving myself hash exercises, I decided to try this via hashing. Paul Dorfman has a couple papers that discuss using hash tables to sort an array, e.g. Black Belt Hashigana.
Below, I use one hash table to sort horizontally, then another hash table to sum the counts by ID. The data only have to be read once, but given the size of the data I'm certainly not claiming an efficiency benefit in this case. I didn't return the data back to original sort order.
Edits/questions/suggestions welcome, as this is part of my hash learning curve. : )
data have;
input stat1 $ stat2 $ stat3 $ count;
datalines;
A14 A14 A14 53090
A14 A14 A10 6744
A14 A10 A14 5916
A01 A01 A01 4222
A10 A10 A10 3085
A10 A10 A10 2731
A10 A14 A14 2399
;
data want;
length _stat $3;
if _n_=1 then do;
declare hash hstat(multidata:"y", ordered:"y");
declare hiter hstatiter ("hstat" ) ;
hstat.definekey('_stat');
hstat.definedata('_stat');
hstat.definedone();
call missing(_stat);
declare hash hsum(suminc: "count", ordered: "y");
declare hiter hsumiter ("hsum" ) ;
hsum.definekey("stat1","stat2","stat3");
hsum.definedone();
end;
set have end=last;
array stat{3};
*load the array values into htable hstat to sort them;
*then iterate over the hash, returning the values to array in sorted order;
do _i=1 to dim(stat);
hstat.add(key:stat{_i},data:stat{_i});
end;
do _i=1 to dim(stat);
hstatiter.next();
stat{_i}=_stat;
end;
_rc=hstatiter.next(); *hack- there is no next, this releases hiter lock so can clear hstat;
hstat.clear();
*now that the stat keys have been sorted, can use them as key in hash table hsum;
*as data are loaded into/checked against the hash table, counts are summed;
*Then if last, iterate over hsum writing it to output dataset;
hsum.ref(); *This sums count as records are loaded/checked;
if last then do;
_rc = hsumiter.next();
do while(_rc = 0);
_rc = hsum.sum(sum: count);
output ;
_rc = hsumiter.next();
end;
end;
drop _: ;
run;
I'm having trouble with a match and paste problem. I have a data frame like
df
# X1 X2 X3 X4 X5 X6
#t1 <NA> <NA> AU 78 <NA> <NA>
#t2 dA AK <NA> <NA> 5 <NA>
#t3 ip <NA> <NA> <NA> <NA> <NA>
#t4 <NA> <NA> <NA> <NA> <NA> BA
I want it to look like this after operations,
newdf
# X1 X2 X3 X4 X5 X6
#v1 <NA> <NA> <NA> <NA> <NA> <NA>
#v2 AU78 <NA> <NA> <NA> <NA> <NA>
#v3 AK5 <NA> <NA> <NA> <NA> <NA>
#v4 <NA> <NA> <NA> <NA> <NA> BA
The process should first search for values that start with 'A': df[1,3] and df[2,2] in this case. It should then paste that value to the number further to the right of it (there will always be one number to the right of it). Also, to help, there will never be stray characters between a target element (like 'AK') and the number to the right of it; only NAs will separate them.
Those combined new values need to be brought to the first column, and one row down from where it was. It does not matter if values existing in the first row are overwritten.
My pattern locator is,
pat.locate <- lapply(df, function(x) grep('^A', x))
un.pat <- unlist(pat.locate)
#X2 X3
# 2 1
This looked like a good start. From there,
df[un.pat, names(un.pat)]
# X2 X3
#t2 AK <NA>
#t1 <NA> AU
So the target values are found with their column and row indexes. But I need the values to the right of those indexes. To subset the entire rows,
full.row <- df[un.pat, ]
# X1 X2 X3 X4 X5 X6
#t2 dA AK <NA> <NA> 5 <NA>
#t1 <NA> <NA> AU 78 <NA> <NA>
I paste the non-NA values, but you can tell what's going to happen,
paste(full.row[!is.na(full.row)], collapse='')
#[1] "dAAKAU785"
To divide it up, an apply over the rows was used:
pasty <- function(x) paste(x[!is.na(x)], collapse='')
pasted.rows <- apply(full.row, 1, pasty)
# t2 t1
#"dAAK5" "AU78"
That still leaves the stray string at the beginning. If I found a good regex to tell it to cast that off I'd have,
good.regex
# t2 t1
# "AK5" "AU78"
I could then subset the whole data frame based on those indices with,
df[names(good.regex), 1] <- good.regex
df
# X1 X2 X3 X4 X5 X6
#t1 AU78 <NA> AU 78 <NA> <NA>
#t2 AK5 AK <NA> <NA> 5 <NA>
#t3 ip <NA> <NA> <NA> <NA> <NA>
#t4 <NA> <NA> <NA> <NA> <NA> BA
But I'm still left with having to move the pasted values down by one.
df[names(good.regex)+1, 1] <- good.regex
#Error in names(good.regex) + 1 : non-numeric argument to binary operator
We obviously can't add a numeric to a named-style subset. I feel like I'm missing some element early on that's leading me down a difficult path to a solution. A regex would have to be a sub out that uses the pattern match and a look-behind that I can't crack. I think I'm working myself into a corner that is unnecessary. Any help is appreciated.
Data
df <- structure(list(X1 = c(NA, "dA", "ip", NA), X2 = c(NA, "AK", NA,
NA), X3 = c("AU", NA, NA, NA), X4 = c("78", NA, NA, NA), X5 = c(NA,
"5", NA, NA), X6 = c(NA, NA, NA, "BA")), .Names = c("X1", "X2",
"X3", "X4", "X5", "X6"), row.names = c("t1", "t2", "t3", "t4"
), class = "data.frame")
newdf <- structure(list(X1 = structure(c(NA, 2L, 1L, NA), .Names = c("v1",
"v2", "v3", "v4"), .Label = c("AK5", "AU78"), class = "factor"),
X2 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"),
X3 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"),
X4 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"),
X5 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Names = c("v1", "v2", "v3", "v4"), .Label = character(0), class = "factor"),
X6 = structure(c(NA, NA, NA, 1L), .Names = c("v1", "v2",
"v3", "v4"), .Label = "BA", class = "factor")), .Names = c("X1",
"X2", "X3", "X4", "X5", "X6"), row.names = c("v1", "v2", "v3",
"v4"), class = "data.frame")
From what I understand of your output example, the point is to collapse an A* string and the number that follows it in the same row, then move this new entity to the first column, one row below. The original line is "erased" (row 1 of newdf is filled with NA), while lines with no match are kept intact if they're not affected by the move (row 4).
Your main problem was collapsing the full row instead of only its end.
## original data
df <- structure(list(X1 = c(NA, "dA", "ip", NA),
X2 = c(NA, "AK", NA, NA),
X3 = c("AU", NA, NA, NA),
X4 = c("78", NA, NA, NA),
X5 = c(NA, "5", NA, NA),
X6 = c(NA, NA, NA, "BA")),
.Names = c("X1", "X2", "X3", "X4", "X5", "X6"),
row.names = c("t1", "t2", "t3", "t4"), class = "data.frame")
df
X1 X2 X3 X4 X5 X6
t1 <NA> <NA> AU 78 <NA> <NA>
t2 dA AK <NA> <NA> 5 <NA>
t3 ip <NA> <NA> <NA> <NA> <NA>
t4 <NA> <NA> <NA> <NA> <NA> BA
The following function grabs the rows that contain the matching pattern but collapses only from the match to the end of the row, ignoring the beginning. This avoids the problem of the stray non-matching string (the dA in your example):
locateAndPaste <- function(x){
if(TRUE %in% grepl('^A', df[x,])){
endRow <- df[x, grep('^A', df[x,]):length(df)]
pasted.rows <- paste(endRow[!is.na(endRow)], collapse='')
}
else{NA}
}
The else element prevents throwing out errors if no match is found.
newEntity <- sapply(1:nrow(df), locateAndPaste)
# [1] "AU78" "AK5" NA NA
Two matches were found, in rows 1 and 2, and none in rows 3 and 4.
As you can see the collapsing part worked perfectly.
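A variant of the same idea that takes the data frame as an argument, rather than reading the global df, may be easier to reuse. This is only a sketch: the name collapseFromMatch and its parameters are made up for illustration, and it assumes the columns are character rather than factor.

```r
# Example data from the question, built with character columns
df <- data.frame(
  X1 = c(NA, "dA", "ip", NA),
  X2 = c(NA, "AK", NA, NA),
  X3 = c("AU", NA, NA, NA),
  X4 = c("78", NA, NA, NA),
  X5 = c(NA, "5", NA, NA),
  X6 = c(NA, NA, NA, "BA"),
  row.names = c("t1", "t2", "t3", "t4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse row i of d from the first match to the end
collapseFromMatch <- function(d, i, pattern = "^A") {
  values <- unlist(d[i, ], use.names = FALSE)   # row as a character vector
  hits <- grep(pattern, values)                 # NA entries never match
  if (length(hits) == 0) return(NA_character_)  # no match in this row
  tailValues <- values[hits[1]:length(values)]  # keep only the end of the row
  paste(tailValues[!is.na(tailValues)], collapse = "")
}

collapsed <- vapply(seq_len(nrow(df)),
                    function(i) collapseFromMatch(df, i),
                    character(1))
collapsed
```

Using vapply with character(1) instead of sapply guarantees a character vector even when no row matches.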
Your second problem was moving the values one row down, which failed because you can't add a number to a character string. As I'm subsetting on the indexes rather than on the names, the problem is easily avoided:
(for completeness, I've added a note at the end of this post on converting those names to numeric)
## the newEntity element is already ordered according to the original row numbers
originalRowNumbers <- grep("^A", newEntity)
# [1] 1 2
From there, it's pretty straightforward:
newdf <- df ## all operations can be done on the original df,
## this copy is made only for the sake of the example.
## as per your example, "erase" the original lines where a matching pattern was found
## this also prevents orphan lines when no match was found in the line above
newdf[originalRowNumbers, ] <- rep(NA, length(df))
## place the new entity in the first column one row below
newdf[originalRowNumbers+1, 1] <- newEntity[originalRowNumbers]
## fill the rest of this row with NA as per your example
newdf[originalRowNumbers+1, 2:length(df)] <- NA
newdf
X1 X2 X3 X4 X5 X6
t1 <NA> <NA> <NA> <NA> <NA> <NA>
t2 AU78 <NA> <NA> <NA> <NA> <NA>
t3 AK5 <NA> <NA> <NA> <NA> <NA>
t4 <NA> <NA> <NA> <NA> <NA> BA
However, if a match is found in the last row, an extra row will be added to newdf. To avoid that, you can shorten the initial selection:
newEntity <- sapply(1:(nrow(df)-1), locateAndPaste)
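Another way to guard against the extra row is to filter the target rows before assigning. The sketch below uses a small made-up data frame with a match in the last row; newEntity is typed in by hand, and the names matchedRows, targetRows and keep are only illustrative.

```r
# Made-up data: matches in row 1 and in the last row (row 4)
df <- data.frame(
  X1 = c(NA, "ip", NA, NA),
  X2 = c("AU", NA, NA, "AZ"),
  X3 = c("78", NA, NA, "9"),
  stringsAsFactors = FALSE
)
newEntity <- c("AU78", NA, NA, "AZ9")   # what the collapsing step would return

matchedRows <- which(!is.na(newEntity)) # rows 1 and 4
targetRows  <- matchedRows + 1          # rows 2 and 5; row 5 does not exist
keep        <- targetRows <= nrow(df)   # drop targets past the last row

newdf <- df
newdf[matchedRows, ] <- NA                               # erase the matched lines
newdf[targetRows[keep], 1] <- newEntity[matchedRows[keep]]
newdf[targetRows[keep], -1] <- NA                        # blank the rest of the target rows
newdf
```

Here the match in the last row is still erased but has nowhere to go, so it is dropped; whether that is the right behaviour depends on the real data.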
For completeness: in your example you can also extract just the number from the names of good.regex and feed it to your subset:
idx.good.regex <- as.numeric(gsub("t","", names(good.regex)))
# [1] 2 1
df[idx.good.regex+1, 1] <- good.regex
Note that this only works because good.regex is of class character. An error would occur if good.regex were a data.frame.
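If the row names did not have the form "t" plus a number, the positions could be looked up with match() instead. A minimal sketch, assuming good.regex holds the values shown in the question (it is typed in by hand here, since the regex itself was never written) and using a cut-down two-column copy of df:

```r
# Cut-down copy of the example data
df <- data.frame(
  X1 = c(NA, "dA", "ip", NA),
  X2 = c(NA, "AK", NA, NA),
  row.names = c("t1", "t2", "t3", "t4"),
  stringsAsFactors = FALSE
)
good.regex <- c(t2 = "AK5", t1 = "AU78")  # assumed result of the pasting step

# Position of each name among the row names: t2 -> 2, t1 -> 1
idx <- match(names(good.regex), rownames(df))

# Numeric positions can be shifted, so the values land one row down
df[idx + 1, 1] <- good.regex
df
```

match() returns integer positions, so this works for any row names as long as they are unique.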
I'm getting tired of working with lists, and with my limited R skills I haven't been able to solve this for a long time.
My list of multiple data frames looks like this:
set.seed(456)
sn1 = paste( "X", c(1:4), sep= "")
onelist <- list (df1 <- data.frame(sn = sn1, var1 = runif(4)),
df2 <- data.frame(sn = sn1, var1 = runif(4)),
df3 <- data.frame(sn = sn1,var1 = runif(4)))
[[1]]
sn var1
1 X1 0.3852362
2 X2 0.3729459
3 X3 0.2179086
4 X4 0.7551050
[[2]]
sn var1
1 X1 0.8216811
2 X2 0.5989182
3 X3 0.6510336
4 X4 0.8431172
[[3]]
sn var1
1 X1 0.4532381
2 X2 0.7167571
3 X3 0.2912222
4 X4 0.1798831
I want to make a subset list in which only rows 2 and 3 are present.
srow <- c(2:3) # just I have many rows in real data
newlist <- lapply(onelist, function(y) subset(y, row(y) == srow))
The newlist is empty....
> newlist
[[1]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[2]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[3]]
[1] sn var1
<0 rows> (or 0-length row.names)
Help please ....
Does this do it?
Note the comma after the row indices: the empty argument that follows it is treated as missing, which results in the extraction of all of the columns:
> lapply(onelist, "[", c(2,3),)
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459
You could have gotten your subset strategy to work with:
lapply(onelist, function(y) subset(y, rownames(y) %in% srow ))
Note that people often use "==" when they really should be using %in%:
?match
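The difference is easy to see on a plain vector of row numbers. In this small sketch (variable names are only illustrative), == recycles srow and compares element by element, while %in% tests membership:

```r
srow <- 2:3
rowNumbers <- 1:4

# srow is recycled to 2, 3, 2, 3, so the comparisons are
# 1 == 2, 2 == 3, 3 == 2, 4 == 3, none of which is TRUE
viaEquals <- rowNumbers == srow

# %in% asks, for each row number, whether it appears anywhere in srow
viaIn <- rowNumbers %in% srow

viaEquals
viaIn
```

This is why the subset() call with == returned zero rows, while the %in% version keeps rows 2 and 3.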
I don't think the row function does what you think it does:
Returns a matrix of integers indicating their row number in a matrix-like object, or a factor indicating the row labels.
Looking at what it returns on the list you have
> row(onelist[[1]])
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
> row(onelist[[1]])==srow
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
You are doing a simple subset of the data.frames, so you can just use
newlist <- lapply(onelist, function(y) y[srow,])
which gives
> newlist
[[1]]
sn var1
2 X2 0.2105123
3 X3 0.7329553
[[2]]
sn var1
2 X2 0.33195997
3 X3 0.08243274
[[3]]
sn var1
2 X2 0.3852362
3 X3 0.3729459
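For a version that can be checked without random numbers, here is a sketch with fixed, made-up values. It also passes drop = FALSE, which makes no difference for these two-column data frames but keeps the result a data frame if a list element ever has a single column.

```r
# Fixed, made-up data in place of runif() so the result is reproducible
onelist <- list(
  data.frame(sn = paste0("X", 1:4), var1 = c(0.1, 0.2, 0.3, 0.4)),
  data.frame(sn = paste0("X", 1:4), var1 = c(0.5, 0.6, 0.7, 0.8))
)
srow <- 2:3

# Plain row indexing; drop = FALSE guards against single-column data frames
newlist <- lapply(onelist, function(y) y[srow, , drop = FALSE])
newlist

# A one-column data frame collapses to a vector without drop = FALSE
oneColumn <- data.frame(var1 = c(0.1, 0.2, 0.3, 0.4))
class(oneColumn[srow, ])
class(oneColumn[srow, , drop = FALSE])
```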