Recoding with and/or condition across multiple columns - if-statement

I want to create a new variable (new_var) conditioned on multiple columns: attend == "yes" & any of score1:score5 > 80. The code below works, but is there a cleaner way to do this? I currently nest multiple ifelse calls; is there a way to write something like score1:score5 > 80 instead?
library(dplyr) # provides %>% and mutate()
age <- sample(18:30, 50, replace = TRUE)
score1 <- sample(1:100, 50, replace = TRUE)
score2 <- sample(1:100, 50, replace = TRUE)
score3 <- sample(1:100, 50, replace = TRUE)
score4 <- sample(1:100, 50, replace = TRUE)
score5 <- sample(1:100, 50, replace = TRUE)
year <- sample(c("first", "second", "third", "fourth"), 50, replace = TRUE)
major <- sample(c("ps", "cs", "ir", "stats"), 50, replace = TRUE)
attend <- sample(c("yes", "no"), 50, replace = TRUE)
class <- data.frame(age, score1, score2, score3, score4, score5, year, major, attend)
class <- class %>% mutate(new_var = ifelse(
attend == "yes" & score1 > 80, "good",
ifelse(attend == "yes" & score2 > 80, "good",
ifelse(attend == "yes" & score3 > 80, "good",
ifelse(attend == "yes" & score4 > 80, "good",
ifelse(attend == "yes" & score5 > 80, "good",
attend))))))

Well, you could OR the conditions together:
class <- class %>% mutate(new_var = ifelse(
attend == "yes" & (score1 > 80 | score2 > 80 | score3 > 80 | score4 > 80 |
score5 > 80), "good",
attend)
)
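If typing out five comparisons is still too repetitive, the same test can be collapsed into one expression over the score columns. The sketch below uses base R only; `score_cols`, `any_high`, and the seed are illustrative names and values, not part of the original code.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 50
class <- data.frame(
  attend = sample(c("yes", "no"), n, replace = TRUE),
  stringsAsFactors = FALSE  # keep attend as character on older R versions
)
score_cols <- paste0("score", 1:5)  # "score1" ... "score5"
for (col in score_cols) {
  class[[col]] <- sample(1:100, n, replace = TRUE)
}

# TRUE for rows where at least one of score1..score5 exceeds 80
any_high <- rowSums(class[score_cols] > 80) > 0

class$new_var <- ifelse(class$attend == "yes" & any_high, "good", class$attend)
table(class$new_var)
```

With dplyr 1.0.4 or later, `if_any(score1:score5, ~ .x > 80)` inside `mutate()` should express the same row-wise test while keeping the `score1:score5` range syntax.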

What is wrong with this SWITCH statement? I'm getting the error: cannot convert value of type Text to type True/False

Season =
SWITCH(TRUE(),
'sales'[Month] = "December" || "January" , "Winter",
'sales'[Month] = "February" || "March" , "Spring",
'sales'[Month] = "April" || "May" || "June" , "Summer",
'sales'[Month] = "July" || "August" || "September" , "Monsoon",
'sales'[Month] = "October" || "November", "Autumn",
"NA")
Cannot convert value 'January' of type Text to type True/False.
You need to repeat the column reference in each comparison, e.g.:
'sales'[Month] = "December" || 'sales'[Month] = "January", "Winter",

If/Else Statement One Line

I'd like to incorporate whether they currently have MMSA or JMMSA. MMSA is a balance over 2,500 and JMMSA is a balance over 100,000.
combined_2 = (
combined
.withColumn('mmsa_eligible',
F.when(
(F.col('min_bal_0_90_days') >= 2500) & (F.col('current_bal') >= 2500), 1
).otherwise(0)
)
.withColumn('jmmsa_eligible' ,
F.when(
(F.col('min_bal_0_90_days') >= 100000) & (F.col('current_bal') >= 100000), 1
).otherwise(0)
)
)
if jmmsa_eligible == 1 and jmmsa_current_flag == 0:
print ('Y')
else:
print ('N')

Error : length of 'dimnames' [2] not equal to array extent - but number of columns IS equal to number of columnnames

Maybe you can help me with a small problem. I picked some data out of a bigger data file and combined them into a table at the end. In the last line I want to assign column names to that table. Before I assign the column names, the length of the table is 115. But when I assign them, the length is suddenly 112 and the error above occurs. I also counted the column names, and there are 115. Do you have any idea what to do?
Thank you in advance and kind regards,
Julian
Code:
setwd("/home/julian/Schreibtisch/Test")
# All data files to read in
dirnames <- dir()
for(i in 1:length(dirnames)){
# Basic information
INSIST_basics_1 <- read.csv(dirnames[i])[ ,c('mouse.x', 'TN.Nr.')]
INSIST_basics_1 <- subset(INSIST_basics_1, mouse.x > 0)
INSIST_basics_1 <- INSIST_basics_1[,c('TN.Nr.')]
INSIST_basics_1 <- c(INSIST_basics_1)
INSIST_basics_1 <- rbind(INSIST_basics_1)
INSIST_basics_2 <- read.csv(dirnames[i])[ ,c('mouse.x', 'Termin')]
INSIST_basics_2 <- subset(INSIST_basics_2, mouse.x > 0)
INSIST_basics_2 <- INSIST_basics_2[,c('Termin')]
INSIST_basics_2 <- c(INSIST_basics_2)
INSIST_basics_2 <- rbind(INSIST_basics_2)
INSIST_basics_3 <- read.csv(dirnames[i])[ ,c('mouse.x', 'Zuweisung')]
INSIST_basics_3 <- subset(INSIST_basics_3, mouse.x > 0)
INSIST_basics_3 <- INSIST_basics_3[,c('Zuweisung')]
INSIST_basics_3 <- c(INSIST_basics_3)
INSIST_basics_3 <- rbind(INSIST_basics_3)
INSIST_basics <- c(INSIST_basics_1,INSIST_basics_2,INSIST_basics_3)
INSIST_basics <- rbind(INSIST_basics)
# VAS
INSIST_VAS <- read.csv(dirnames[i])[ ,c('Word', 'trials_4.thisIndex', 'trials_6.thisIndex', 'trials_10.thisIndex', 'trials_7.thisIndex', 'trials_8.thisIndex', 'slider.response')]
INSIST_VAS <- INSIST_VAS[-which(INSIST_VAS$Word == ""), ]
INSIST_VAS_1 <- subset(INSIST_VAS, trials_4.thisIndex >= 0)
INSIST_VAS_1 <- INSIST_VAS_1[order(INSIST_VAS_1$trials_4.thisIndex),]
INSIST_VAS_1 <- INSIST_VAS_1[,c('slider.response')]
INSIST_VAS_1 <- (INSIST_VAS_1 - 1) *100
INSIST_VAS_2 <- subset(INSIST_VAS, trials_6.thisIndex >= 0)
INSIST_VAS_2 <- INSIST_VAS_2[order(INSIST_VAS_2$trials_6.thisIndex),]
INSIST_VAS_2 <- INSIST_VAS_2[,c('slider.response')]
INSIST_VAS_2 <- (INSIST_VAS_2 - 1) *100
INSIST_VAS_3 <- subset(INSIST_VAS, trials_10.thisIndex >= 0)
INSIST_VAS_3 <- INSIST_VAS_3[order(INSIST_VAS_3$trials_10.thisIndex),]
INSIST_VAS_3 <- INSIST_VAS_3[,c('slider.response')]
INSIST_VAS_3 <- (INSIST_VAS_3 - 1) *100
INSIST_VAS_4 <- subset(INSIST_VAS, trials_7.thisIndex >= 0)
INSIST_VAS_4 <- INSIST_VAS_4[order(INSIST_VAS_4$trials_7.thisIndex),]
INSIST_VAS_4 <- INSIST_VAS_4[,c('slider.response')]
INSIST_VAS_4 <- (INSIST_VAS_4 - 1) *100
INSIST_VAS_5 <- subset(INSIST_VAS, trials_8.thisIndex >= 0)
INSIST_VAS_5 <- INSIST_VAS_5[order(INSIST_VAS_5$trials_8.thisIndex),]
INSIST_VAS_5 <- INSIST_VAS_5[,c('slider.response')]
INSIST_VAS_5 <- (INSIST_VAS_5 - 1) *100
INSIST_VAS_all <- c(INSIST_VAS_1,INSIST_VAS_2,INSIST_VAS_3,INSIST_VAS_4,INSIST_VAS_5)
INSIST_VAS_all <- rbind(INSIST_VAS_all)
# tDCS questionnaire
INSIST_tDCS <- read.csv(dirnames[i])[ ,c('itemIndex','questions', 'ratings')]
INSIST_tDCS <- INSIST_tDCS[-which(INSIST_tDCS$questions == ""), ]
INSIST_tDCS <- INSIST_tDCS[,c('ratings')]
INSIST_tDCS <- c(INSIST_tDCS)
INSIST_tDCS <- rbind(INSIST_tDCS)
# State questionnaire
INSIST_Zustand <- read.csv(dirnames[i])[ ,c('questionText', 'trials_9.thisIndex', 'trials_13.thisIndex', 'trials_15.thisIndex', 'slider_13.response')]
INSIST_Zustand <- INSIST_Zustand[-which(INSIST_Zustand$questionText == ""), ]
INSIST_Zustand_1 <- subset(INSIST_Zustand, trials_9.thisIndex >= 0)
INSIST_Zustand_1 <- INSIST_Zustand_1[order(INSIST_Zustand_1$trials_9.thisIndex),]
INSIST_Zustand_1 <- INSIST_Zustand_1[,c('slider_13.response')]
INSIST_Zustand_1 <- (INSIST_Zustand_1 - 1) *100
INSIST_Zustand_2 <- subset(INSIST_Zustand, trials_13.thisIndex >= 0)
INSIST_Zustand_2 <- INSIST_Zustand_2[order(INSIST_Zustand_2$trials_13.thisIndex),]
INSIST_Zustand_2 <- INSIST_Zustand_2[,c('slider_13.response')]
INSIST_Zustand_2 <- (INSIST_Zustand_2 - 1) *100
INSIST_Zustand_3 <- subset(INSIST_Zustand, trials_15.thisIndex >= 0)
INSIST_Zustand_3 <- INSIST_Zustand_3[order(INSIST_Zustand_3$trials_15.thisIndex),]
INSIST_Zustand_3 <- INSIST_Zustand_3[,c('slider_13.response')]
INSIST_Zustand_3 <- (INSIST_Zustand_3 - 1) *100
INSIST_Zustand_all <- c(INSIST_Zustand_1,INSIST_Zustand_2,INSIST_Zustand_3)
INSIST_Zustand_all <- rbind(INSIST_Zustand_all)
# Scenarios
INSIST_Szenarien <- read.csv(dirnames[i])[ ,c('scenario', 'trials_11.thisIndex', 'rating_2.response')]
INSIST_Szenarien <- INSIST_Szenarien[-which(INSIST_Szenarien$scenario == ""), ]
INSIST_Szenarien <- INSIST_Szenarien[order(INSIST_Szenarien$trials_11.thisIndex),]
INSIST_Szenarien <- INSIST_Szenarien[,c('rating_2.response')]
INSIST_Szenarien <- c(INSIST_Szenarien)
INSIST_Szenarien <- rbind(INSIST_Szenarien)
# Merge all sections
INSIST_tab <- c(INSIST_basics, INSIST_VAS_all, INSIST_tDCS, INSIST_Zustand_all, INSIST_Szenarien)
INSIST_tab <- rbind(INSIST_tab)
colnames(INSIST_tab) <- c( "TN_Nr", "Termin", "Zuweisung", "Hungrig_1", "Satt_1", "Durstig_1", "Aengstlich_1", "Froehlich_1", "Gestresst_1", "Schlaefrig_1", "Konzentriert_1", "Traurig_1", "Essen_generell_1", "Essen_suess_1", "Essen_herzhaft_1", "Hungrig_2", "Satt_2", "Durstig_2", "Aengstlich_2", "Froehlich_2", "Gestresst_2", "Schlaefrig_2", "Konzentriert_2", "Traurig_2", "Essen_generell_2", "Essen_suess_2", "Essen_herzhaft_2", "Hungrig_3", "Satt_3", "Durstig_3", "Aengstlich_3", "Froehlich_3", "Gestresst_3", "Schlaefrig_3", "Konzentriert_3", "Traurig_3", "Essen_generell_3", "Essen_suess_3", "Essen_herzhaft_3", "Hungrig_4", "Satt_4", "Durstig_4", "Aengstlich_4", "Froehlich_4", "Gestresst_4", "Schlaefrig_4", "Konzentriert_4", "Traurig_4", "Essen_generell_4", "Essen_suess_4", "Essen_herzhaft_4", "Hungrig_5", "Satt_5", "Durstig_5", "Aengstlich_5", "Froehlich_5", "Gestresst_5", "Schlaefrig_5", "Konzentriert_5", "Traurig_5", "Essen_generell_5", "Essen_suess_5", "Essen_herzhaft_5", "tDCS_Jucken_1", "tDCS_Schmerzen_1", "tDCS_Brennen_1", "tDCS_Waerme_1", "tDCS_Metall_1", "tDCS_Ermuedung_1", "tDCS_Kopf_1", "tDCS_Jucken_2", "tDCS_Schmerzen_2", "tDCS_Brennen_2", "tDCS_Waerme_2", "tDCS_Metall_2", "tDCS_Ermuedung_2", "tDCS_Kopf_2","tDCS_Jucken_3", "tDCS_Schmerzen_3", "tDCS_Brennen_3", "tDCS_Waerme_3", "tDCS_Metall_3", "tDCS_Ermuedung_3", "tDCS_Kopf_3", 'Zustand_Herr_1', 'Zustand_kontrollieren_1', 'Zustand_Versuchungen_1', 'Zustand_Stimmung_1', 'Zustand_Druck_1', 'Zustand_Kontrolle_1', 'Zustand_willensstark_1', 'Zustand_erschrocken_1', 'Zustand_ueberrascht_1', 'Zustand_Herr_2', 'Zustand_kontrollieren_2', 'Zustand_Versuchungen_2', 'Zustand_Stimmung_2', 'Zustand_Druck_2', 'Zustand_Kontrolle_2', 'Zustand_willensstark_2', 'Zustand_erschrocken_2', 'Zustand_ueberrascht_2','Zustand_Herr_3', 'Zustand_kontrollieren_3', 'Zustand_Versuchungen_3', 'Zustand_Stimmung_3', 'Zustand_Druck_3', 'Zustand_Kontrolle_3', 'Zustand_willensstark_3', 'Zustand_erschrocken_3', 'Zustand_ueberrascht_3','Kino', 
'Pralinen', 'Zuhause', 'Weihnachtsessen' ) # column names
}
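R raises this error whenever the number of names differs from the number of columns at the moment of assignment. One possible explanation, not confirmed by the post, is that one of the files in the loop yields fewer values than the others. The sketch below reproduces the error on a toy one-row matrix and shows a length check that could be run before the assignment; `tab` and `new_names` are made-up stand-ins for INSIST_tab and the 115 names.

```r
# Toy one-row matrix standing in for INSIST_tab (3 values instead of 115)
tab <- rbind(c(10, 20, 30))
new_names <- c("a", "b", "c", "d")  # deliberately one name too many

# Compare the two lengths before assigning
ncol(tab)
length(new_names)

# Attempt the assignment and capture the error message instead of stopping
result <- tryCatch({
  colnames(tab) <- new_names
  "assigned"
}, error = function(e) conditionMessage(e))
result
```

Inside the loop, printing `dirnames[i]` together with `ncol(INSIST_tab)` just before the `colnames()` call would show which file, if any, produces a table that is not 115 columns wide.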

R function for pattern matching

I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two column data frame with the results. I have added a variable, called pair. Here is a sample from the Clinton data frame:
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.
pairs <- function(x){
JJ <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length)(x) {
if(x == J && x+1 == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (x == R && x+1 == J && x+2 != N) {
pair[i] <- "RB|JJ"
} else if (x == J && x+1 == J && x+2 != N) {
pair[i] <- "JJ|JJ"
} else if (x == N && x+1 == J && x+2 != N) {
pair[i] <- "NN|JJ"
} else if (x == R && x+1 == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.
Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.
I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.
EDIT:
Here are the first 20 lines of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
And I've written the following function:
pairs <- function(x, y) {
require(gsubfn)
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length(x))) {
ngram <- c(x[[i]], x[[i+1]])
# the ngram consists of the word on line `i` and the word below line `i`
}
strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
ngrams.df = data.frame(ngrams=ngram)
return(ngrams.df)
}
So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram, and all of the resulting ngrams should populate ngrams.df.
So I've entered the following function call and get an error:
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
c(str_detect(type, patt2), FALSE))
return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
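The question also asks for the pattern each pair matches, not just the pair. A minimal base R sketch of that idea follows; `tag_bigrams`, the `patterns` list, and the anchored regular expressions are hypothetical names and choices made for illustration, and only two-tag patterns are covered (the three-tag "not followed by a noun" rules are left out).

```r
# Hypothetical helper: return each matching bigram together with a pattern label
tag_bigrams <- function(word, type, patterns) {
  word <- as.character(word)  # also accepts factor columns
  type <- as.character(type)
  n <- length(type)
  out <- lapply(names(patterns), function(label) {
    p <- patterns[[label]]
    first  <- grepl(p[1], type[-n])  # tag at position k
    second <- grepl(p[2], type[-1])  # tag at position k + 1
    k <- which(first & second)
    data.frame(bigram = paste(word[k], word[k + 1]),
               pattern = rep(label, length(k)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

# Labels follow the question; "^" anchors each regex at the start of the tag
patterns <- list("JJ|NN" = c("^JJ", "^N[A-Z]"),
                 "RB|VB" = c("^R[A-Z]", "^V[A-Z]"))
words <- c("american", "people", "hate", "a", "big", "bad", "bank")
tags  <- c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN")
res <- tag_bigrams(words, tags, patterns)
res
```

The anchoring is a design choice: an unanchored "R[A-Z]" would also match tags such as PRP, which is probably not what an adverb pattern should do.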
I think the following is the code you wrote, but without throwing errors:
pairs <- function(x) {
  # Anchored patterns, tested with grepl(); `==` would compare against the literal pattern text
  J <- "^JJ" #adjectives
  N <- "^N[A-Z]" #any noun form
  R <- "^R[A-Z]" #any adverb form
  V <- "^V[A-Z]" #any verb form
  pair = rep("FALSE", nrow(x)) # one entry per row of x
  for(i in seq_len(max(nrow(x) - 2, 0))) {
    this.pos = x[i,2]
    next.pos = x[i+1,2]
    next.next.pos = x[i+2,2]
    if(grepl(J, this.pos) && grepl(N, next.pos)) { #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN" #insert this into the 'pair' variable
    } else if (grepl(R, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "RB|JJ"
    } else if (grepl(J, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "JJ|JJ"
    } else if (grepl(N, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "NN|JJ"
    } else if (grepl(R, this.pos) && grepl(V, next.pos)) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
  ## then deal with the last two elements, for which you can't check what's up next
  return(pair)
}
I'm not sure what you mean by this, though:
Also, and more importantly, this function won't exactly get me what I
want, which is to extract the word pairs that match a pattern and then
the pattern that they match. I honestly have no idea how to do that.

Boxplot by groups, plus a user-defined scatter plot (markers for a subset of values)

Working with lab data, I want to overlay a subset of data points on a boxplot grouped by treatment and sequenced by timepoint. Bringing all elements together is not straightforward in SAS, and requires a clever approach that I can't devise or find myself :)
The beauty of the desired plot is that it displays 2 distinct types of outliers:
The boxplots include statistical outliers - square markers (1.5 IQR)
Then overlay markers for "normal range" outliers - a clinical definition, specific to each lab test.
This is difficult when grouping data (e.g., by treatment) and then blocking or categorizing by another variable (e.g., a timepoint). SAS internally determines the spacing of the boxplots, so this spacing is difficult to mimic for the overlaid normal-range data markers. A generic solution in this direction would be an unreliable kludge.
Below I've demoed this approach of manually mimicking the group separation for the overlay markers, just to give an idea of intent. As expected, normal-range outliers do not line up with the boxplot groups. Plus, data points that meet both outlier criteria (statistical and clinical) appear as separate points, rather than single points with overlaid markers. My annotations in green:
[Figure: SGPLOT-overlay-fail]
Is there an easy, robust way to instruct SAS to overlay grouped data points on a boxplot, keeping everything aligned as intended?
Here's the code to reproduce that miss:
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Put normal range outliers on 2nd axis, manually separate groups on 2nd axis *;
select (mygroup);
when ('A') scatternum = myvisitnum - 1;
when ('B') scatternum = myvisitnum + 1;
otherwise;
end;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata;
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
outlierattrs=(symbol=square);
scatter x=scatternum y=scattervalue /
group=mygroup
x2axis
jitter;
x2axis display=none;
keylegend / position=bottom type=marker;
run;
So - I think there is a solution here, but I'm not sure how general it is. Certainly, it only works for a two element boxplot.
The issue you have right now is that the axis type by default for a scatterplot is linear, not discrete, while a boxplot is by default discrete. This is always going to be messy if you have it set up that way, though you could in theory work out the exact difference and plot it. You could also use the annotate facility, though it will have the same problem.
However, if you set the scatterplot to use a discrete axis, you can use the discreteoffset option to make things line up properly - more or less. Unfortunately, there's no way to use the group on scatterplot to tell SAS to place the appropriate marker on the appropriate boxplot, so by default everything ends up in the center of the discrete axis; so you will need to use two separate plots here, one for a and one for b, one with negative offset and one with positive.
The advantage of discreteoffset is it should be a constant value for any two-group boxplot, unless you make some alteration to the box widths; no matter how big the actual plot is, the discreteoffset amount should be the same (as it's a percentage of the total width of the block assigned for that value).
Some things to consider here include having six elements in your boxplot instead of three (so get rid of group and just have six different visnum values, a_1 b_1 etc.); that would guarantee that each boxplot is centered right on the center of the discrete axis (then your scatterplot would have a 0 discrete offset). You could also consider rolling your own boxplot: calculate your own IQR, for example, then use high-low plots to draw the boxes, draw the whiskers via annotation, and scatterplot all of the different outliers (not just your 'normal' ones).
Here's the code that seems to work for your specific example, and hopefully would work for most similar cases (with two bars). For 3 bars it's probably easy as well (1 bar has a 0 offset, the other two are probably around +/- 0.25). Beyond that you start having to do more calculations to figure out where the boxes will be, but overall SAS will be pretty good at spacing them out equally, so it'll usually be fairly straightforward.
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Put normal range outliers on 2nd axis, manually separate groups on 2nd axis *;
select (mygroup);
when ('A') a_scatternum = myvisitnum; /* Note the separate names now, but no added +/- 1 */
when ('B') b_scatternum = myvisitnum;
otherwise;
end;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata noautolegend; /* suppress auto-legend */
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
outlierattrs=(symbol=square) name="boxplot"; /* Name for keylegend */
scatter x=a_scatternum y=scattervalue / /* Now you have two of these - and no need for an x2axis */
group=mygroup discreteoffset=-0.175
jitter
;
scatter x=b_scatternum y=scattervalue /
group=mygroup discreteoffset=0.175
jitter
;
keylegend "boxplot" / position=bottom type=marker; /* Needed to make a custom keylegend or else you have a mess with three plots in it */
run;
Thanks for the insights! I was stuck on the same disconnect between the boxplot's discrete axis and the scatter plot's real axis. It turns out that with SAS 9.4, scatter plots can handle "categories" like the vbox, but SAS refers to this as the x-axis rather than a category. This SAS 9.4 example also helped crack it for me (as soon as I'd given up, naturally :).
This is pretty close, and leaves most processing to SAS (always my preference for a robust solution):
The updated code: the "category" from the VBOX is the "x" for the SCATTER. Note that the default cluster widths for VBOX and SCATTER differ (0.7 and 0.85, respectively), so I have to explicitly set them to the same value:
proc sql;
create table labstruct
( mygroup char(3) label='Treatment Group'
, myvisitnum num label='Visit number'
, myvisitname char(8) label='Visit name'
, labtestname char(8) label='Name of lab test'
, labseed num label='Lab measurement seed'
, lablow num label='Low end of normal range'
, labhigh num label='High end of normal range'
)
;
insert into labstruct
values('A', 1, 'Day 1', 'Test XYZ', 48, 40, 60)
values('A', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('A', 10, 'Week 2', 'Test XYZ', 52, 40, 60)
values('B', 1, 'Day 1', 'Test XYZ', 52, 40, 60)
values('B', 5, 'Week 1', 'Test XYZ', 50, 40, 60)
values('B', 10, 'Week 2', 'Test XYZ', 48, 40, 60)
;
quit;
data labdata;
set labstruct;
* Make more obs from the seeds above *;
label labvalue = 'Lab measurement';
do repeat = 1 to 20;
labvalue = labseed + 6*rannor(3297);
* Scatter plot ONLY normal range outliers *;
if labvalue < lablow or labvalue > labhigh
then scattervalue = labvalue;
else scattervalue = .;
output;
end;
drop repeat labseed;
run;
proc sgplot data=labdata;
block x=myvisitnum block=myvisitname /
nofill
lineattrs=(color=lightgray);
vbox labvalue /
category=myvisitnum
group=mygroup
groupdisplay=cluster
clusterwidth=0.7
outlierattrs=(symbol=square);
scatter x=myvisitnum y=scattervalue /
group=mygroup
groupdisplay=cluster
clusterwidth=0.7
jitter;
keylegend /
position=bottom type=marker;
run;
Thanks, again, for getting me back on track so quickly!