Replace ":" with " x " in `emmeans::joint_tests` output - r-markdown

I bring this question over from TeX StackExchange because it didn't get much attention there. The answer I got doesn't apply to tables longer than three rows. Please see how I could change my code. Thanks for your attention. https://tex.stackexchange.com/questions/594324/how-to-replace-all-with-or-x-in-an-anova-table
library(emmeans)
library(kableExtra)
neuralgia.glm <- glm(Pain ~ Treatment * Sex * Age, family=binomial, data = neuralgia)
joint_tests(neuralgia.glm) gives:
model term         df1 df2  F.ratio  p.value
Treatment            2 Inf        0   1.0000
Sex                  1 Inf        0   0.9964
Age                  1 Inf        0   0.9941
Treatment:Sex        2 Inf        0   1.0000
Treatment:Age        2 Inf        0   1.0000
Sex:Age              1 Inf        0   0.9942
Treatment:Sex:Age    2 Inf        0   1.0000
I want to replace all the colons with an x, with spaces around it for ease of viewing; that's what I tried with gsub(":", " x "). The print(neuralgia.glm, export = T) command keeps the p-value format as <0.0001 instead of 0 when knitted.
This code gave me just an x; using sub() or gsub() does the same.
joint_tests(neuralgia.glm) %>%
  print(export = T) %>%
  gsub(":", " x ") %>%
  kable()
This code replaced the colons with " x " but it broke the table.
gsub( "\\:", " x ", print(neuralgia.glm, export = T)) %>%
kable()

Note that gsub() takes x = as its third argument, not its first, so your piping wouldn't work.
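If you do want to keep a pipe, magrittr's dot placeholder is what routes the piped value into gsub()'s third argument; a minimal sketch of the general pattern:
library(magrittr)
# Without the dot, the pipe drops the incoming value into gsub()'s first
# argument (the pattern); an explicit dot sends it to x instead.
c("Treatment:Sex", "Treatment:Sex:Age") %>%
  gsub(":", " x ", .)
#> [1] "Treatment x Sex"       "Treatment x Sex x Age"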
Solution:
library(emmeans)
neuralgia.glm <- glm(Pain ~ Treatment * Sex * Age, family=binomial, data = neuralgia)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
jt <- joint_tests(neuralgia.glm)
jt$`model term` <- gsub(":", " x ", jt$`model term`)
jt
#> model term            df1 df2 F.ratio p.value
#> Treatment               2 Inf       0  1.0000
#> Sex                     1 Inf       0  0.9970
#> Age                     1 Inf       0  0.9951
#> Treatment x Sex         2 Inf       0  1.0000
#> Treatment x Age         2 Inf       0  1.0000
#> Sex x Age               1 Inf       0  0.9952
#> Treatment x Sex x Age   2 Inf       0  1.0000
Created on 2021-06-07 by the reprex package (v2.0.0)
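Since jt is still a data frame underneath, it should drop straight into the kable() pipeline from your question; a minimal sketch, assuming kableExtra is loaded as before:
library(kableExtra)
jt %>%
  kable() %>%
  kable_styling(full_width = FALSE)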


Create new columns for the duplicate records: Python

I have an input file that is being generated at runtime of this form:
Case 1:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
2,1234567890,A2,150,3
3,0123459876,A3,1000,1
The generated file can also be of this form:
Case 2:
ID,Numbers,P_ID,Cores,Count
1,1234567890,A1,200,3
3,0123459876,A3,1000,1
Expected Output:
Case 1:
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN   None      NaN      NaN
1  1234567890     1     A1      200        3   2.0     A2    150.0      3.0
Case 2:
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN   None      NaN      NaN
1  1234567890     1     A1      200        3   NaN   None      NaN      NaN
In the input file there may be 0, 1, or 2 rows (but never more than 2) with the same Number (1234567890). I'm trying to summarize these 2 rows into 1 single row (as shown in the output file).
I would like to convert my input file into the above structure. How can I do this? I'm really new to pandas. Please be so kind as to help me with this. Thanks in advance.
In Case 2, the structure of the output file must remain the same, i.e., the column names should be the same.
I think you need to:
first create a new column with cumcount for counting Numbers
then reshape by set_index + unstack
finally convert the MultiIndex in the columns to an Index with a list comprehension
df['g'] = df.groupby('Numbers').cumcount()
df = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df.columns]
df = df.reset_index()
print (df)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876   3.0     A3   1000.0      1.0   NaN   None      NaN      NaN
1  1234567890   1.0     A1    200.0      3.0   2.0     A2    150.0      3.0
EDIT:
For converting to int it is possible to use a custom function which converts only if there is no error, so columns with NaNs are not changed:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
df1 = df1.apply(f).reset_index()
print (df1)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN   None      NaN      NaN
1  1234567890     1     A1      200        3   2.0     A2    150.0      3.0
EDIT1:
There are always 1 or 2 rows per group, so the full column set is known in advance and it is possible to use reindex (reindex_axis in older pandas) to force all of the columns:
def f(x):
    try:
        return x.astype(int)
    except (TypeError, ValueError):
        return x
df['g'] = df.groupby('Numbers').cumcount()
df1 = df.set_index(['Numbers', 'g']).unstack().sort_index(axis=1, level=1)
df1.columns = ['_'.join((x[0], str(x[1] + 1))) for x in df1.columns]
cols = ['ID_1','P_ID_1','Cores_1','Count_1','ID_2','P_ID_2','Cores_2','Count_2']
df1 = df1.apply(f).reindex(columns=cols).reset_index()  # reindex_axis(cols, axis=1) in older pandas
print (df1)
      Numbers  ID_1 P_ID_1  Cores_1  Count_1  ID_2 P_ID_2  Cores_2  Count_2
0   123459876     3     A3     1000        1   NaN    NaN      NaN      NaN
1  1234567890     1     A1      200        3   NaN    NaN      NaN      NaN

pandas dataframe update or set column[y] = x where column[z] = 'abc'

I'm new to Python pandas and haven't found an answer to this in the documentation. I have an existing dataframe and I've added a new column Y. I want to set the value of column Y to 'abc' in all rows in which column Z = 'xyz'. In SQL this would be a simple
update table set colY = 'abc' where colZ = 'xyz'
Is there a similar way to do this update in pandas?
Thanks!
You can use loc, or numpy.where if you need to set another value too:
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[1,2,3],
                   'Z':['xyz',5,6],
                   'C':[7,8,9]})
print (df)
   C  X    Z
0  7  1  xyz
1  8  2    5
2  9  3    6
df.loc[df.Z == 'xyz', 'Y'] = 'abc'
print (df)
   C  X    Z    Y
0  7  1  xyz  abc
1  8  2    5  NaN
2  9  3    6  NaN
df['Y1'] = np.where(df.Z == 'xyz', 'abc', 'klm')
print (df)
   C  X    Z    Y   Y1
0  7  1  xyz  abc  abc
1  8  2    5  NaN  klm
2  9  3    6  NaN  klm
You can set the values from another column too:
df['Y2'] = np.where(df.Z == 'xyz', 'abc', df.C)
print (df)
   C  X    Z    Y   Y2
0  7  1  xyz  abc  abc
1  8  2    5  NaN    8
2  9  3    6  NaN    9

Parsing irregular character strings for numbers and put into structured format using regular expressions in R

I have a vector of irregularly-structured character data that I want to find and extract particular numbers from. For example, take this piece of a much larger dataset:
x <- c("2001 Tax # $25.19/Widget, 2002 Est Tax # $10.68/Widget; 2000 Est Int # $55.67/Widget",
"1999 Tax # $81.16/Widget",
"1998 Tax # $52.72/Widget; 2001 Est Int # $62.49/Widget",
"1994 Combined Tax/Int # $68.33/widget; 1993 Est Int # $159.67/Widget",
"1993 Combined Tax/Int # $38.33/widget; 1992 Est Int # $159.67/Widget",
"2006 Tax # $129.21/Widget, 1991 Est Tax # $58.19/Widget; 1991 Est Int # $30.95/Widget")
and so on. Reading the table for a larger vector shows that most of the entries are separated by semi-colons or commas, and that there are only a limited number of terms used -- the year, Tax, Int, Combined, Est -- with occasional variations in entries (like ";" versus ",", or "Widget" versus "widget").
I'd like to extract each of the numbers related to the terms above into a more structured data table, such as:
[id]  [year]  [number]  [cat]  [est]
row1  2001    25.19     Tax
row1  2002    10.68     Tax    Est
row1  2000    55.67     Int    Est
row2  1999    81.16     Tax
row3  1998    52.72     Tax
row3  2001    62.49     Int    Est
....
or else maybe a more compact / sparse representation like:
[id]  [1999tax]  [2001tax]  [2002esttax]  [2000estint]
row1      0        25.19       10.68         55.67
row2    81.16        0           0             0
If that makes sense -- I ultimately need to put this into a regression model.
My first approach has been to write the following pseudocode:
split strings into list using strsplit() on ";" or ","
extract all years
operate on list elements using function that extracts numbers between "$" and "/"
return structured table columns
So far, I've only gotten this far:
pieces.of.x <- strsplit(x, "[;,]"); head(pieces.of.x)
which gives:
[[1]]
[1] "2001 Tax # $25.19/Widget" " 2002 Est Tax # $10.68/Widget" " 2000 Est Int # $55.67/Widget"
[[2]]
[1] "1999 Tax # $81.16/Widget"
[[3]]
[1] "1998 Tax # $52.72/Widget" " 2001 Est Int # $62.49/Widget"
[[4]]
[1] "1994 Combined Tax/Int # $68.33/widget" " 1993 Est Int # $159.67/Widget"
[[5]]
[1] "1993 Combined Tax/Int # $38.33/widget" " 1992 Est Int # $159.67/Widget"
[[6]]
[1] "2006 Tax # $129.21/Widget" " 1991 Est Tax # $58.19/Widget" " 1991 Est Int # $30.95/Widget"
Unfortunately, I don't have enough knowledge of lapply() and regular expressions ("regex") in R to build a procedure robust enough to extract the years, operate on each sub-vector of elements, and then return them.
Thanks in advance for reading.
The stringr package is pretty useful when dealing with strings, and I bet that someone could even make a single matcher with named capture groups to get a similar solution (see the sketch after the output below)...
[edit: missed the combined entries]
library(stringr)
library(data.table)
# Split the row entries
x <- strsplit(x, "[,;]")
# Generate the entry identifiers.
i <- 0
id <- unlist( sapply( x, function(r) rep(i<<-i+1, length(r) ) ) )
# Extract the desired values
x <- unlist( x, recursive = FALSE )
year.re <- "(^\\s?([[:digit:]]{4})\\s)"
value.re <- "[$]([[:digit:]]+[.][[:digit:]]{2})[/]"
object.re <- "[/]([[:alnum:]]+)$"
Cats <- c("Tax", "Int", "Combination")
x <- lapply( x, function(str) {
  c( Year     = str_extract( str, year.re),
     Category = Cats[ grepl( "Tax", str)*1 + grepl( "Int", str)*2 ],
     Estimate = grepl( "Est", str),
     Value    = str_match( str, value.re)[2],
     Object   = str_match( str, object.re)[2] )
})
# Create a data object.
data.table( ID=id, do.call(rbind,x), key=c("Year") )
##     ID Year    Category Estimate  Value Object
##  1:  6 1991         Tax     TRUE  58.19 Widget
##  2:  6 1991         Int     TRUE  30.95 Widget
##  3:  5 1992         Int     TRUE 159.67 Widget
##  4:  4 1993         Int     TRUE 159.67 Widget
##  5:  5 1993 Combination    FALSE  38.33 widget
##  6:  4 1994 Combination    FALSE  68.33 widget
##  7:  3 1998         Tax    FALSE  52.72 Widget
##  8:  2 1999         Tax    FALSE  81.16 Widget
##  9:  1 2000         Int     TRUE  55.67 Widget
## 10:  3 2001         Int     TRUE  62.49 Widget
## 11:  1 2001         Tax    FALSE  25.19 Widget
## 12:  1 2002         Tax     TRUE  10.68 Widget
## 13:  6 2006         Tax    FALSE 129.21 Widget
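Following up on the single-matcher idea above, here is a hedged sketch of one pattern with named capture groups, applied to the original (unsplit) x from the question; it assumes the vocabulary really is limited to Tax / Int / Combined / Est as described:
pat <- paste0("(?<year>\\d{4}) (?<est>Est )?",
              "(?<cat>Combined Tax/Int|Tax|Int) # ",
              "\\$(?<value>[0-9.]+)/(?<object>[Ww]idget)")
# str_match_all() returns one matrix per input string: column 1 holds the
# whole match, the remaining columns hold the capture groups in order.
str_match_all(x, pat)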
This is similar to one of the other answers, but it also keeps track of line numbers (your [id] column).
matches <- regmatches(x,gregexpr("[0-9]{4} [^#]+# \\$[0-9.]+",x))
lengths <- sapply(matches,length)
z <- unlist(matches)
z <- regmatches(z,regexec("([0-9]{4}) ([^#]+) # \\$([0-9.]+)",z))
df <- t(sapply(z,function(x)c(year=x[2], number=x[4], cat=x[3])))
df <- data.frame(id=rep(1:length(x),times=lengths),df, stringsAsFactors=F)
df$est <- ifelse(grepl("Est",df$cat),"Est","")
df$cat <- regmatches(df$cat,regexpr("[^ /]+$",df$cat))
df
#    id year number cat est
# 1   1 2001  25.19 Tax
# 2   1 2002  10.68 Tax Est
# 3   1 2000  55.67 Int Est
# 4   2 1999  81.16 Tax
# 5   3 1998  52.72 Tax
# 6   3 2001  62.49 Int Est
# 7   4 1994  68.33 Int
# 8   4 1993 159.67 Int Est
# 9   5 1993  38.33 Int
# 10  5 1992 159.67 Int Est
# 11  6 2006 129.21 Tax
# 12  6 1991  58.19 Tax Est
# 13  6 1991  30.95 Int Est
To create exactly the dataframe you are asking for, you can use a few tricks like strsplit, regular expressions, and rbind.
x <- unlist(strsplit(x, ',|;'))
bits <- regmatches(x,gregexpr('(\\d|\\.)+|(Tax|Int|Est)', x))
df <- do.call(rbind, lapply(bits, function(info) {
  data.frame(year = info[[1]], number = tail(info, 1)[[1]],
             cat = if ('Tax' %in% info) 'Tax' else 'Int',
             est = if ('Est' %in% info) 'Est' else '')
}))
df$cat <- factor(df$cat); df$est <- factor(df$est)
which gives us
  year number cat est
1 2001  25.19 Tax
2 2002  10.68 Tax Est
3 2000  55.67 Int Est
4 1999  81.16 Tax
5 1998  52.72 Tax
You can extract the numbers out using:
regmatches(x, gregexpr('(\\d|\\.)+', x))
which yields
[[1]]
[1] "2001" "25.19" "2002" "10.68" "2000" "55.67"
[[2]]
[1] "1999" "81.16"
[[3]]
[1] "1998" "52.72" "2001" "62.49"
[[4]]
[1] "1994" "68.33" "1993" "159.67"
[[5]]
[1] "1993" "38.33" "1992" "159.67"
[[6]]
[1] "2006" "129.21" "1991" "58.19" "1991" "30.95"
However, if you can assume every year's info is separated by a , or ;, try this:
x <- unlist(strsplit(x, ',|;'))
nums <- regmatches(x,gregexpr('(\\d|\\.)+', x))
df <- data.frame(matrix(as.numeric(unlist(nums)), ncol = 2, byrow = TRUE))
colnames(df) <- c('Year', 'Number')
which looks like
  Year Number
1 2001  25.19
2 2002  10.68
3 2000  55.67
4 1999  81.16
5 1998  52.72

Splitting single observation into multiple observations

I have the following SAS data set:
data mydata;
  LENGTH id 4.0 num 4.0 A $ 4. B $ 8. C $ 20.;
  input id num A $ B $ C $;
  datalines;
1 1 x yy zzzzz
2 1 x yy zzzzz
3 2 xq yyqq zzzzzqqqqq
4 1 x yy zzzzz
5 3 xqw yyqqww zzzzzqqqqqwwwww
6 1 x yy zzzzz
7 4 xqwe yyqqwwee zzzzzqqqqqwwwwweeeee
;
which looks like
mydata
-------------------
id  num  A     B         C
 1   1   x     yy        zzzzz
 2   1   x     yy        zzzzz
 3   2   xq    yyqq      zzzzzqqqqq
 4   1   x     yy        zzzzz
 5   3   xqw   yyqqww    zzzzzqqqqqwwwww
 6   1   x     yy        zzzzz
 7   4   xqwe  yyqqwwee  zzzzzqqqqqwwwwweeeee
The problem is that each of the observations where num > 1 actually contains data for multiple "observations" and I would like to split it up using some logic in SAS. Here's an example for what I want to get:
mydatawanted
-------------------
id  num  A  B   C
 1   1   x  yy  zzzzz
 2   1   x  yy  zzzzz
 3   1   x  yy  zzzzz
 3   1   q  qq  qqqqq
 4   1   x  yy  zzzzz
 5   1   x  yy  zzzzz
 5   1   q  qq  qqqqq
 5   1   w  ww  wwwww
 6   1   x  yy  zzzzz
 7   1   x  yy  zzzzz
 7   1   q  qq  qqqqq
 7   1   w  ww  wwwww
 7   1   e  ee  eeeee
Basically, if num > 1 I want to take the substring of each variable depending on its length, for each item, and then output those as new observations with num = 1. Here is what I have tried to code so far:
data mydata2(drop=i _:);
  set mydata;        /* use the data from the original data set */
  _temp_id  = id;    /* create temp variables from the currently read observation */
  _temp_num = num;
  _temp_A   = A;
  _temp_B   = B;
  _temp_C   = C;
  if (_temp_num > 1) then  /* if num in current record > 1 then split them up */
    do i = 1 to _temp_num;
      id  = _temp_id;                              /* keep id the same */
      num = 1;                                     /* set num to 1 for each new observation */
      A = substr(_temp_A, i, i);                   /* split the string by 1s */
      B = substr(_temp_B, 1 + 2 * (i - 1), i * 2); /* split the string by 2s */
      C = substr(_temp_C, 1 + 5 * (i - 1), i * 5); /* split the string by 5s */
      output;                                      /* output this new observation with the changes */
    end;
  else output;             /* if num == 1 then output without any changes */
run;
However it doesn't work as I wanted it to (I put in some comments to show what I thought was happening at each step). It actually produces the following result:
mydata2
-------------------
id  num  A   B     C
 1   1   x   yy    zzzzz
 2   1   x   yy    zzzzz
 3   1   x   yy    zzzzz
 3   1   q   qq    qqqqq
 4   1   x   yy    zzzzz
 5   1   x   yy    zzzzz
 5   1   qw  qqww  qqqqqwwwww
 5   1   w   ww    wwwww
 6   1   x   yy    zzzzz
 7   1   x   yy    zzzzz
 7   1   qw  qqww  qqqqqwwwww
 7   1   we  wwee  wwwwweeeee
 7   1   e   ee    eeeee
This mydata2 result isn't the same as mydatawanted. The lines where num = 1 are fine, but when num > 1 the output records are much different from what I want. The total number of records is correct, though. I'm not really sure what is happening, since this is the first time I've tried any complicated SAS logic like this, but I would appreciate any help in either fixing my code or accomplishing what I want using alternate methods. Thank you!
edit: I fixed a problem with my original input mydata data statement and updated the question.
Your substrings are incorrect. substr takes the arguments (original string, start, length), not (original string, start, ending position). So the length argument should be 1, 2, and 5, not i, i*2, and i*5.
A = substr(_temp_A, i, 1);                 /* split the string by 1s */
B = substr(_temp_B, 1 + 2 * (i - 1), 2);   /* split the string by 2s */
C = substr(_temp_C, 1 + 5 * (i - 1), 5);   /* split the string by 5s */

R: fragment a list

I am kind of tired of working with lists... and with my limited R capabilities I could not solve this for a long time...
My list with multiple data frames looks like the following:
set.seed(456)
sn1 <- paste("X", c(1:4), sep = "")
onelist <- list(df1 <- data.frame(sn = sn1, var1 = runif(4)),
                df2 <- data.frame(sn = sn1, var1 = runif(4)),
                df3 <- data.frame(sn = sn1, var1 = runif(4)))
[[1]]
  sn      var1
1 X1 0.3852362
2 X2 0.3729459
3 X3 0.2179086
4 X4 0.7551050

[[2]]
  sn      var1
1 X1 0.8216811
2 X2 0.5989182
3 X3 0.6510336
4 X4 0.8431172

[[3]]
  sn      var1
1 X1 0.4532381
2 X2 0.7167571
3 X3 0.2912222
4 X4 0.1798831
I want to make a subset list in which only rows 2 and 3 are present.
srow <- c(2:3)  # just an example; I have many rows in my real data
newlist <- lapply(onelist, function(y) subset(y, row(y) == srow))
The newlist is empty....
> newlist
[[1]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[2]]
[1] sn var1
<0 rows> (or 0-length row.names)
[[3]]
[1] sn var1
<0 rows> (or 0-length row.names)
Help please ....
Does this do it?
Note the comma after the rows: leaving the column index empty (missing) results in the extraction of all of the columns:
> lapply(onelist, "[", c(2,3),)
[[1]]
  sn      var1
2 X2 0.2105123
3 X3 0.7329553

[[2]]
  sn       var1
2 X2 0.33195997
3 X3 0.08243274

[[3]]
  sn      var1
2 X2 0.3852362
3 X3 0.3729459
You could have gotten your subset strategy to work with:
lapply(onelist, function(y) subset(y, rownames(y) %in% srow ))
Note that many times people use == when they really should be using %in%:
?match
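A quick illustration with the data above: == recycles the shorter vector and compares elementwise, while %in% tests set membership:
y <- onelist[[1]]
rownames(y) == srow    # "1","2","3","4" vs recycled "2","3","2","3": all FALSE
rownames(y) %in% srow  # FALSE  TRUE  TRUE FALSE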
I don't think the row function does what you think it does:
Returns a matrix of integers indicating their row number in a matrix-like object, or a factor indicating the row labels.
Looking at what it returns on the list you have:
> row(onelist[[1]])
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
> row(onelist[[1]])==srow
      [,1]  [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
You are doing a simple subset of the data.frames, so you can just use
newlist <- lapply(onelist, function(y) y[srow,])
which gives
> newlist
[[1]]
  sn      var1
2 X2 0.2105123
3 X3 0.7329553

[[2]]
  sn       var1
2 X2 0.33195997
3 X3 0.08243274

[[3]]
  sn      var1
2 X2 0.3852362
3 X3 0.3729459