SNPRelate: how to give a specific color to each population in a PCA plot

I am using SNPRelate for PCA analysis. It uses default colors for the different populations, but I want to choose the colors myself. The plotting commands are:
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), cex=1.2, pch=20,
     xlab="eigenvector 2", ylab="eigenvector 1")
legend("topleft", legend=levels(tab$pop), cex=1, pch=20, col=1:nlevels(tab$pop))
The head of the input table looks like this:
sample.id pop EV1 EV2
1 A1 POP_I -0.10172849 0.03619405
2 A2 POP_I -0.15951814 0.08234857
3 A3 POP_I -0.15632495 0.08180843
4 A4 POP_I -0.09679447 0.07981108
5 A5 POP_I 0.11362360 -0.03186038
6 A6 POP_I 0.05594095 -0.05498351

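Define a vector of colors, one per population level, index it by the population factor for the points, and use the same vector for the legend so that the legend colors match the points:
# Extend as needed: one color per level of tab$pop
col.list <- c("gray", "blue", "green", "red", "blue", "yellow")
plot(tab$EV2, tab$EV1, col=col.list[as.integer(tab$pop)], cex=1.2, pch=20, xlab="eigenvector 2", ylab="eigenvector 1")
legend("topleft", legend=levels(tab$pop), cex=1, pch=20, col=col.list[seq_len(nlevels(tab$pop))])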
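If the order of the factor levels is not obvious, a named color vector keyed by population name may be easier to control than positional indexing. The following is only a sketch: the small table and the POP_II label are made up for illustration and are not taken from the question, and pop.cols and point.cols are hypothetical names.

```r
# Hypothetical stand-in for the asker's table (values are illustrative only)
tab <- data.frame(
  sample.id = c("A1", "A2", "B1", "B2"),
  pop = factor(c("POP_I", "POP_I", "POP_II", "POP_II")),
  EV1 = c(-0.10, -0.16, 0.11, 0.06),
  EV2 = c(0.04, 0.08, -0.03, -0.05)
)

# One color per population, keyed by name rather than by level position
pop.cols <- c(POP_I = "red", POP_II = "blue")

# Look up each sample's color by its population name
point.cols <- pop.cols[as.character(tab$pop)]

plot(tab$EV2, tab$EV1, col = point.cols, cex = 1.2, pch = 20,
     xlab = "eigenvector 2", ylab = "eigenvector 1")
# The legend reuses the same named vector, so labels and colors stay in sync
legend("topleft", legend = names(pop.cols), cex = 1, pch = 20, col = pop.cols)
```

Because the lookup goes through the population name, reordering the factor levels does not change which color a population gets.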

How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two predictors, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
data = data.frame(
type = c("A", "A", "A", "B", "B", "B"),
hours = c(60,72,61, 54,68,66),
# pain = c(85,95,69, 73, 29, 30)
pain = c(85,95,69, 85,95,69)
)
library(lsmeans)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
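Try this, which reproduces the lsmeans values as model predictions for each type at the mean of hours (63.5):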
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor.
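The three questions can be checked with base R alone. The sketch below refits the model from the question and asks predict() for standard errors as well; newdat and pred are just illustrative names. The least-squares mean for each type is the fitted value at the overall mean of hours. The standard errors coincide here, most likely because the two groups have the same size and their mean hours (about 64.33 and 62.67) lie equally far from 63.5. The degrees of freedom are the residual degrees of freedom of the model: 6 observations minus 3 estimated coefficients.

```r
# Data and model from the question
data <- data.frame(
  type  = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  pain  = c(85, 95, 69, 85, 95, 69)
)
model <- lm(pain ~ hours + type, data = data)

# Predict for each type at the overall mean of hours (63.5)
newdat <- data.frame(type = c("A", "B"), hours = mean(data$hours))
pred <- predict(model, newdata = newdat, se.fit = TRUE)

pred$fit     # fitted values; compare with the lsmean column
pred$se.fit  # standard errors of the fitted means; compare with SE
pred$df      # residual degrees of freedom; compare with df

# Mean hours per type: both are the same distance from 63.5
tapply(data$hours, data$type, mean)
```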

Combining IF formulas

I have two IF formulas that I would like to combine - please see attached excel doc.
If C2 = "Blue" =IF(E2="","",IF(((((((B2*(C2-2))*1.02)/(E2-1))/1.02)+(-B2))+(B2))/(B2)<0.65,"NO BET",((C2-1)/(E2-F2)*B2)))
If C2 = "Green" =IF(E3="","",IF(((((((B3*(C3-2))*1.02)/(E3-1))/1.02)+(-B3))+(B3))/(B3)<0.65,"NO BET",(C3/(E3-F3)*B3)))
The formulas are the same up until after "NO BET". I would like this to be one formula only so that I can change value C2 and it calculates correctly.
Many thanks
I don't have the rep required to comment for clarifications, so I've got a couple of possible options depending on what you need. I've also imposed a bit of spacing in my answer so it's a bit more readable.
If C2 = "Blue"
=IF(E2="","",IF(((((((B2*(C2-2))*1.02)/(E2-1))/1.02)+(-B2))+(B2))/(B2)<0.65,"NO BET",((C2-1)/(E2-F2)*B2)))
If C2 = "Green"
=IF(E3="","",IF(((((((B3*(C3-2))*1.02)/(E3-1))/1.02)+(-B3))+(B3))/(B3)<0.65,"NO BET",(C3/(E3-F3)*B3)))
In the question you say that the formulas are the same up to NO BET so, assuming that:
you want both formulas to work on both rows 2 and 3, and
the C column is filled in with Green/Blue for all the different rows,
this is how it can work for row 2:
=IF(E2="",
    "",
    IF(
        ((((((B2*(C2-2))*1.02)/(E2-1))/1.02)+(-B2))+(B2))/(B2)<0.65,
        "NO BET",
        IF(C2 = "Blue",
            ((C2-1)/(E2-F2)*B2),
            IF(C2 = "Green",
                (C2/(E2-F2)*B2)
            )
        )
    )
)
If Blue/Green is only ever in C2 and the rest of the C column is irrelevant:
=IF(E2="",
    "",
    IF(
        ((((((B2*(C2-2))*1.02)/(E2-1))/1.02)+(-B2))+(B2))/(B2)<0.65,
        "NO BET",
        IF($C$2 = "Blue",
            ((C2-1)/(E2-F2)*B2),
            IF($C$2 = "Green",
                (C2/(E2-F2)*B2)
            )
        )
    )
)

Adding a new column based on values

I have the following sample data:
data weight_club;
input IdNumber 1-4 Name $ 6-24 Team $ StartWeight EndWeight;
Loss = StartWeight - EndWeight;
datalines;
1023 David Shaw red 189 165
1049 Amelia Serrano yellow 145 124
1219 Alan Nance purple 210 192
1246 Ravi Sinha yellow 194 177
1078 Ashley McKnight green 127 118
;
What I would like to do now is the following:
Create two lists of colours (e.g., list1 = "red" and "yellow", list2 = "purple" and "green")
Classify the records according to whether or not they are in list1 and list2 and add a new column.
So the pseudo code is like this:
'Set new category called class
If item is in list1 then class = 1
Else if item is in list2 then class = 2
Else class = 3
Any thoughts on how I can do this most efficiently?
Your pseudocode is almost exactly it. In a data step, using the Team variable from your data:
if Team in ('red', 'yellow') then class = 1;
else if Team in ('purple', 'green') then class = 2;
else class = 3;
This is really a lookup, so there are many other methods. One I usually recommend as well is PROC FORMAT, though in a simple case like this I'm not sure there are any gains.
Proc format;
Value $colour_cat
'red', 'yellow' = '1'
'purple', 'green' = '2'
Other = '3';
Run;
Then, in a data step or SQL, either of the following can be used (Team is the colour variable in your data).
* actual conversion (put returns a character value);
Category = put(Team, $colour_cat.);
* change display only;
Format Team $colour_cat.;

Identifying rows in data.frames based on complex rules

In two previous questions I have asked how to identify and extract substrings based on complex rules:
Identifying substrings based on complex rules
Extracting capturing groups from a regex
The current question concerns how you would achieve the same end in a data.frame structure. Let's say you have a data.frame as follows:
dat <- data.frame(time = 1:10,
event = c("FA", "EX", "I1", "FA", "FA", "I3", "EX", "EX", "EX", "I3"),
actor = c("John", "Alex", "John", "Alex", "Tim", "Sandra", "Sara", "John", "Eliza", "Alex"))
time event actor
1 FA John
2 EX Alex
3 I1 John
4 FA Alex
5 FA Tim
6 I3 Sandra
7 EX Sara
8 EX John
9 EX Eliza
10 I3 Alex
Now I want to move from time 1 to 10 and group all rows up to and including each I3. This means that I want to return a list of two data.frames (rows 1-6 and rows 7-10 should each form a separate data.frame, placed in a common list). How can I accomplish this?
You can use split:
split(dat, c(0, cumsum(dat$event=="I3"))[-(nrow(dat)+1)])
$`0`
time event actor
1 1 FA John
2 2 EX Alex
3 3 I1 John
4 4 FA Alex
5 5 FA Tim
6 6 I3 Sandra
$`1`
time event actor
7 7 EX Sara
8 8 EX John
9 9 EX Eliza
10 10 I3 Alex
This works too (here data is the data.frame from the question):
i3.index = which(data$event == "I3")
i3.start = c(1, i3.index[-length(i3.index)]+1)
indexMatrix = cbind(from = i3.start, end = i3.index)
apply(indexMatrix, 1, function(x){data[x[1]:x[2],]})
# [[1]]
# time event actor
# 1 1 FA John
# 2 2 EX Alex
# 3 3 I1 John
# 4 4 FA Alex
# 5 5 FA Tim
# 6 6 I3 Sandra
#
# [[2]]
# time event actor
# 7 7 EX Sara
# 8 8 EX John
# 9 9 EX Eliza
# 10 10 I3 Alex
This will also work; note that it returns a grouped data frame rather than a list of data.frames:
library(dplyr)
data %>%
arrange(time %>% desc) %>%
mutate(group = cumsum(event == "I3")) %>%
arrange(time) %>%
group_by(group)
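All three answers rest on the same idea: give every row a group id that increases right after each I3. A minimal base R sketch of that idea, assuming the data.frame from the question is stored as dat (is_end, grp and groups are illustrative names):

```r
dat <- data.frame(
  time  = 1:10,
  event = c("FA", "EX", "I1", "FA", "FA", "I3", "EX", "EX", "EX", "I3"),
  actor = c("John", "Alex", "John", "Alex", "Tim", "Sandra", "Sara",
            "John", "Eliza", "Alex")
)

# TRUE on the rows that close a group
is_end <- dat$event == "I3"

# Shift the flags down by one row so the counter increases on the row
# AFTER each I3; the I3 row itself then stays in the group it closes
grp <- cumsum(c(FALSE, head(is_end, -1)))

# One data.frame per group, collected in a list
groups <- split(dat, grp)
lapply(groups, function(g) g$time)
```

One consequence of this design: any rows that follow the last I3 would form a trailing group of their own that does not end in I3.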

Replace entire strings based on partial match

New to R. Looking to replace the entire string if there is a partial match.
d = c("SDS0G2 Blue", "Blue SSC2CWA3", "Blue SA2M1GC", "SA5 Blue CSQ5")
gsub("Blue", "Red", d, ignore.case = FALSE, fixed = FALSE)
Output: "SDS0G2 Red" "Red SSC2CWA3" "Red SA2M1GC" "SA5 Red CSQ5"
Desired Output: "Red" "Red" "Red" "Red"
Any help in solving this is truly appreciated.
I'd suggest using grepl to find the indices and replace those indices with "Red":
d = c("SDS0G2 Blue", "Blue SSC2CWA3", "Blue SA2M1GC", "SA5 Blue CSQ5", "ABCDE")
d[grepl("Blue", d, ignore.case=FALSE)] <- "Red"
d
# [1] "Red" "Red" "Red" "Red" "ABCDE"
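If you would rather not overwrite d in place, two base R alternatives may be worth considering. This is only a sketch (replaced_ifelse and replaced_sub are illustrative names): ifelse() picks "Red" wherever grepl() finds a match, and sub() with a pattern padded by .* replaces the whole string because the pattern then covers the whole string.

```r
d <- c("SDS0G2 Blue", "Blue SSC2CWA3", "Blue SA2M1GC", "SA5 Blue CSQ5", "ABCDE")

# Option 1: choose "Red" for matching elements, keep the others unchanged
replaced_ifelse <- ifelse(grepl("Blue", d, fixed = TRUE), "Red", d)

# Option 2: let the regex span the entire string so sub() replaces all of it
replaced_sub <- sub(".*Blue.*", "Red", d)

replaced_ifelse
replaced_sub
```

Both leave strings without "Blue" untouched and return a new vector instead of modifying d.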
If you did want to keep the variable as a factor and replace multiple partial matches at once, the following function will work (example from another question).
clrs <- c("blue", "light blue", "red", "rose", "ruby", "yellow", "green", "black", "brown", "royal blue")
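# stringsAsFactors = TRUE is needed in R >= 4.0 so that the columns are factors with levels
dfx <- data.frame(colors1=clrs, colors2 = clrs, Amount=sample(100,10), stringsAsFactors = TRUE)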
# Function to replace levels with regex matching
make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {
  lvls <- levels(.f)
  # Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
  if(is.null(replacement)) {
    if(is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }
  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for(i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)
  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match,
    setNames(as.list(lvl_other), lvl_other)
  )
  return(lvl_all)
}
# Replace levels
levels(dfx$colors2) <- make_levels(.f = dfx$colors2, patterns = c(Blue = "blue", Red = "red|rose|ruby"))
dfx
#> colors1 colors2 Amount
#> 1 blue Blue 75
#> 2 light blue Blue 55
#> 3 red Red 47
#> 4 rose Red 83
#> 5 ruby Red 56
#> 6 yellow yellow 10
#> 7 green green 25
#> 8 black black 29
#> 9 brown brown 23
#> 10 royal blue Blue 24
Created on 2020-04-18 by the reprex package (v0.3.0)