Convert list to bullet-points - r-markdown

If I have a list of characters, list("bannana", "apple", "orange"), I would like an R chunk that produces the output:
- bannana
- apple
- orange
Moreover, with a list with the structure list(fruits=list("bannana", "apple", "orange"), meat=list("beef", "chicken")), I would like to get:
- fruits
    - bannana
    - apple
    - orange
- meat
    - beef
    - chicken
Is this possible? I could also consider a table as output, but I think that would be more complicated.

For a simple list:
l <- list("bannana", "apple", "orange")
cat(paste0("- ", l, collapse = "\n"))
- bannana
- apple
- orange
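Note that inside an R Markdown document the chunk needs the results='asis' option, so the output is rendered as Markdown rather than echoed as console output. For example:
```{r, results='asis', echo=FALSE}
l <- list("bannana", "apple", "orange")
cat(paste0("- ", l, collapse = "\n"))
```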
For a 2-level list, maybe something like this:
library(tidyverse)
list2markdown <- function(l) {
  enframe(l) %>%
    group_by(name) %>%
    mutate(items = paste0(
      "- ",
      name,
      paste0(
        "\n\t- ",
        unlist(value),
        collapse = ""))) %>%
    pull(items) %>%
    paste0(collapse = "\n") %>%
    cat()
}
l <- list(fruits = list("bannana", "apple", "orange"),
          meat = list("beef", "chicken"))
list2markdown(l)
- fruits
	- bannana
	- apple
	- orange
- meat
	- beef
	- chicken
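If the tidyverse is not available, a recursive base-R sketch handles arbitrary nesting (list2md is a hypothetical helper name; it assumes a named list whose leaves are character strings):
list2md <- function(l, depth = 0) {
  indent <- strrep("\t", depth)  # one tab per nesting level
  for (i in seq_along(l)) {
    if (is.list(l[[i]])) {
      cat(indent, "- ", names(l)[i], "\n", sep = "")  # group label
      list2md(l[[i]], depth + 1)                      # recurse into the sublist
    } else {
      cat(indent, "- ", l[[i]], "\n", sep = "")       # leaf item
    }
  }
}
list2md(l)  # same nested l as above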

How can I split a table so that it appears side by side in R markdown?

I'm writing a document with R Markdown and I'd like to include a table. The problem is that this table only has two columns and takes a full page, which is not very beautiful. So my question is: is there a way to split this table in two and to place the two "sub-tables" side by side with only one caption?
I use the kable command and I tried this solution (How to split kable over multiple columns?), but I could not get the cbind() command to work.
Here's my code to create the table :
---
title:
author:
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: pdf_document
indent: true
header-includes:
- \usepackage{indentfirst}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo = FALSE}
kable(aerop2, format = "markdown")
```
where aerop2 is my data frame with a list of country names in column 1 and the number of airports in each of these countries in column 2.
I have a long two-column table which is a waste of space. I would like to split this table in two sub-tables and put these sub-tables side by side with a caption that includes both of them.
This doesn't give a lot of flexibility in spacing, but here's one way to do it. I'm using the mtcars dataset as an example because I don't have aerop2.
---
output: pdf_document
indent: true
header-includes:
- \usepackage{indentfirst}
- \usepackage{booktabs}
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
```
The data are in Table \ref{tab:tables}, which will float to the top of the page.
```{r echo = FALSE}
rows <- seq_len(nrow(mtcars) %/% 2)
kable(list(mtcars[rows, 1:2],
           matrix(numeric(), nrow = 0, ncol = 1),
           mtcars[-rows, 1:2]),
      caption = "This is the caption.",
      label = "tables", format = "latex", booktabs = TRUE)
```
This gives the two half-tables side by side under a single caption (rendered screenshot omitted).
Note that without that zero-row matrix, the two parts are closer together. To increase the spacing more, put extra copies of the zero-row matrix into the list, as sketched below.
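For instance, a sketch with two spacer copies (spacer is just a local name for the same zero-row matrix; rows as defined above):
spacer <- matrix(numeric(), nrow = 0, ncol = 1)  # zero-row column used purely for spacing
kable(list(mtcars[rows, 1:2], spacer, spacer, mtcars[-rows, 1:2]),
      caption = "This is the caption.",
      label = "tables", format = "latex", booktabs = TRUE)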
The solution offered by 'user2554330' was very useful.
As I needed to split into more columns and eventually more sections, I developed the idea further.
I also needed to have the tables after the text, not floating to the top. I found a way using kableExtra::kable_styling(latex_options = "hold_position").
I am writing here to share the development and to ask minor questions.
1 - Why did you add the line - \usepackage{indentfirst}?
2 - What is the effect of label = "tables" as kable() input?
(The questions are related to LaTeX. I probably know too little to understand the explanation in the kable() documentation: "label - The table reference label"!)
---
title: "Test-split.print"
header-includes:
- \usepackage{booktabs}
output:
pdf_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
suppressPackageStartupMessages(library(tidyverse))
library(knitr)
library(kableExtra)
split.print <- function(x, cols = 2, sects = 1, spaces = 1, caption = "", label = "") {
  if (cols < 1) stop("cols must be >= 1!")
  if (sects < 1) stop("sects must be >= 1!")
  rims <- nrow(x) %% sects
  nris <- (rep(nrow(x) %/% sects, sects) + c(rep(1, rims), rep(0, sects - rims))) %>%
    cumsum() %>%
    c(0, .)
  for (s in 1:sects) {
    xs <- x[(nris[s] + 1):nris[s + 1], ]
    rimc <- nrow(xs) %% cols
    nric <- (rep(nrow(xs) %/% cols, cols) + c(rep(1, rimc), rep(0, cols - rimc))) %>%
      cumsum() %>%
      c(0, .)
    lst <- NULL
    spc <- NULL
    for (sp in 1:spaces) spc <- c(spc, list(matrix(numeric(), nrow = 0, ncol = 1)))
    for (c in 1:cols) {
      lst <- c(lst, list(xs[(nric[c] + 1):nric[c + 1], ]))
      if (cols > 1 & c < cols) lst <- c(lst, spc)
    }
    kable(lst,
          caption = ifelse(sects == 1, caption, paste0(caption, " (", s, "/", sects, ")")),
          label = "tables", format = "latex", booktabs = TRUE) %>%
      kable_styling(latex_options = "hold_position") %>%
      print()
  }
}
```
```{r, results='asis'}
airquality %>%
  select(1:3) %>%
  split.print(cols = 3, sects = 2, caption = "multi page table")
```

R - extract all strings matching pattern and create relational table

I am looking for a shorter and prettier solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":
  id            string
1  A 1.001 xxx 123.123
2  B 23,45 lorem ipsum
3  C      donald trump
4  D    ssss 134, 1,45
What I wanted to do is to extract all numbers (no matter if the delimiter is "." or "," -> in this case I assume that string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:
  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45
I managed to do this (code below), but the solution is pretty ugly to me and also not very efficient (two for-loops). Could someone suggest a better way to do this (preferably using dplyr)?
# data
data <- data.frame(id = c("A", "B", "C", "D"),
                   string = c("1.001 xxx 123.123",
                              "23,45 lorem ipsum",
                              "donald trump",
                              "ssss 134, 1,45"),
                   stringsAsFactors = FALSE)
# creating empty data.frame
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))
# main solution
start = 0
for (i in 1:dim(data)[1]) {
  tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
  for (j in (start + 1):(start + tmp_len)) {
    output[j, 1] <- data$id[i]
    output[j, 2] <- regmatches(data$string[i],
                               gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j - start]
  }
  start = start + tmp_len
}
# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                   substring(output$string, 1, nchar(output$string) - 1),
                                   output$string))
output
1) Base R: This uses relatively simple regular expressions and no packages.
In the first 2 lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")
In the next 4 lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word, so only words containing only those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these 4 lines are run, the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).
In the last line of code convert the list L to a data frame with the appropriate names.
s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)
L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id
with(stack(L), data.frame(id = ind, string = values))
giving:
  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45
2) magrittr: This variation of (1) writes it as a magrittr pipeline.
library(magrittr)
data %>%
  transform(string = gsub(", ", " ", string)) %>%
  transform(string = gsub(",", ".", string)) %>%
  transform(string = trimws(string)) %>%
  with(setNames(strsplit(string, "\\s+"), id)) %>%
  lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
  replace(lengths(.) == 0, NA) %>%
  stack() %>%
  with(data.frame(id = ind, string = values))
3) dplyr/tidyr: This is an alternative pipeline solution using dplyr and tidyr. unnest converts to long form, id is made a factor so that we can later use complete to recover ids that are removed by subsequent filtering, the filter removes junk rows, and complete inserts NA rows for each id that would otherwise not appear.
library(dplyr)
library(tidyr)
data %>%
  mutate(string = gsub(", ", " ", string)) %>%
  mutate(string = gsub(",", ".", string)) %>%
  mutate(string = trimws(string)) %>%
  mutate(string = strsplit(string, "\\s+")) %>%
  unnest() %>%
  mutate(id = factor(id)) %>%
  filter(grepl("^[0-9.]*$", string)) %>%
  complete(id)
4) data.table
library(data.table)
DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][,
string := gsub(",", ".", string)][,
string := trimws(string)][,
string := setNames(strsplit(string, "\\s+"), id)][,
list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT
Update: Removed the assumption that junk words do not contain a digit or dot. Also added (2), (3) and (4) and made some improvements.
We can replace the , in between the numbers with . (using gsub), extract the numbers with str_extract_all (from stringr) into a list, replace the list elements that have length 0 with NA, set the names of the list with the 'id' column, stack to convert the list to a data.frame, and rename the columns.
library(stringr)
setNames(stack(setNames(lapply(
  str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".", data$string, perl = TRUE),
                  "[0-9.]+"),
  function(x) if (length(x) == 0) NA else as.numeric(x)),
  data$id))[2:1], c("id", "string"))
# id string
#1 A 1.001
#2 A 123.123
#3 B 23.45
#4 C NA
#5 D 134
#6 D 1.45
Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:
sp = setNames(strsplit(data$string, " "), data$id)
spc = lapply(sp, function(x) {
  x = x[grep("[^0-9.,]$", x, invert = TRUE)]
  if (!length(x))
    NA_real_
  else
    mapply(type.convert, x, dec = gsub("[^.,]", "", x), USE.NAMES = FALSE)
})
setNames(rev(stack(spc)), names(data))
  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45
Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
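For reference, type.convert does handle each delimiter correctly when given one at a time, which is exactly what the per-element mapply call above exploits:
type.convert("1,45", dec = ",", as.is = TRUE)  # 1.45
type.convert("1.45", dec = ".", as.is = TRUE)  # 1.45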

merging two data sets on the basis of two columns

There are n files. Each file has multiple columns, and I have to select only the first two. I have to merge these n files on the basis of those two columns, adding an extra column whose value is a string. The length of the string depends on the number of files. For instance, say there are 4 files,
File1:
cat dog
lion ele
mice hello
new lion
ele that
File2:
cat lion
mice hello
cub pet
old lion
File3:
new lion
cub pet
cat dog
hello cat
File4:
ele that
hello cat
new old
I want to generate a new file,
cat dog PAPA
lion ele PAAA
mice hello PPAA
new lion PAPA
ele that PAAP
cat lion APAA
cub pet APPA
old lion APAA
new lion AAPA
hello cat AAPP
new old AAAP
The character at position i is 'A' if the pair is not present in the ith file, and 'P' if it is; for example, "cat dog" appears in files 1 and 3, giving PAPA. This is how the strings have been formed.
If you have a small dataset, you can do this with reshaping:
library(dplyr)
library(tidyr)
list_of_file_names = c(...)
data_frame(file = list_of_file_names) %>%
  group_by(file) %>%
  do(read.csv(.$file)) %>%
  distinct %>%
  mutate(present = "P") %>%
  spread(file, present, fill = "A") %>%
  gather(file, present_absent, first_file_name:last_file_name) %>%
  group_by(column1, column2) %>%
  summarize(present_absent_string =
              present_absent %>%
              paste(collapse = ""))
I am having trouble installing the tidyr package. Is there any other way?
Here's one without an additional library.
#!/usr/bin/Rscript --vanilla
# data input - filenames are to be provided as command line arguments:
t = lapply(commandArgs(T), read.table, col.names=1:2, flush=T) # only 2 columns
t = mapply('[<-', t, 3, value="P", SIMPLIFY=F) # mark the values as "present"
t = Reduce(function(x, y) merge(x, y, 1:2, all=T, suffixes=ncol(x)), t) # merge
t[is.na(t)] = "A" # mark the not present values as "absent"
t[3] = Reduce(function(...) paste(..., sep=''), t[-(1:2)]) # concatenate P&A
# data output - write the desired output format
write.table(format(t[1:3], justify="l"), quote=F, row.names=F, col.names=F)
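For reference, a hypothetical invocation (merge_pa.R and the file names are placeholders; commandArgs(T) picks up whatever filenames are passed on the command line):
Rscript --vanilla merge_pa.R file1.txt file2.txt file3.txt file4.txt > merged.txt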

Replace entire strings based on partial match

New to R. Looking to replace the entire string if there is a partial match.
d = c("SDS0G2 Blue", "Blue SSC2CWA3", "Blue SA2M1GC", "SA5 Blue CSQ5")
gsub("Blue", "Red", d, ignore.case = FALSE, fixed = FALSE)
Output: "SDS0G2 Red" "Red SSC2CWA3" "Red SA2M1GC" "SA5 Red CSQ5"
Desired Output: "Red" "Red" "Red" "Red"
Any help in solving this is truly appreciated.
I'd suggest using grepl to find the indices and replace those indices with "Red":
d = c("SDS0G2 Blue", "Blue SSC2CWA3", "Blue SA2M1GC", "SA5 Blue CSQ5", "ABCDE")
d[grepl("Blue", d, ignore.case=FALSE)] <- "Red"
d
# [1] "Red" "Red" "Red" "Red" "ABCDE"
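A small variation on the same grepl idea, if you'd rather not overwrite d in place, is ifelse:
ifelse(grepl("Blue", d), "Red", d)
# [1] "Red" "Red" "Red" "Red" "ABCDE"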
If you did want to keep the variable as a factor and replace multiple partial matches at once, the following function will work (example from another question).
clrs <- c("blue", "light blue", "red", "rose", "ruby", "yellow", "green", "black", "brown", "royal blue")
dfx <- data.frame(colors1=clrs, colors2 = clrs, Amount=sample(100,10))
# Function to replace levels with regex matching
make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {
  lvls <- levels(.f)
  # Replacements can be listed in the replacement argument, taken as names
  # in patterns, or the patterns themselves.
  if (is.null(replacement)) {
    if (is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }
  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for (i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)
  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match,
    setNames(as.list(lvl_other), lvl_other)
  )
  return(lvl_all)
}
# Replace levels
levels(dfx$colors2) <- make_levels(.f = dfx$colors2, patterns = c(Blue = "blue", Red = "red|rose|ruby"))
dfx
#>       colors1 colors2 Amount
#> 1        blue    Blue     75
#> 2  light blue    Blue     55
#> 3         red     Red     47
#> 4        rose     Red     83
#> 5        ruby     Red     56
#> 6      yellow  yellow     10
#> 7       green   green     25
#> 8       black   black     29
#> 9       brown   brown     23
#> 10 royal blue    Blue     24
Created on 2020-04-18 by the reprex package (v0.3.0)

How to separate the variables of a particular column in a CSV file and write to a CSV file in R?

I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired Output to a CSV file with the first row as the headers
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*', '\\1,\\2:\\3,\\4',
            xx$Market)
How can I modify the above regular expression to get the result in R?
New to R, any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
            [,1]          [,2]
Wells Fargo "Gary"        "IN"
Wells Fargo "Chicago"     "IL"
EMC         "Los Angeles" "CA"
EMC         "Boston"      "MA"
Apple       "Cupertino"   "CA"
This should work even if you have a different number of campaign names in each market. Unfortunately, I think base options are annoying to implement because frustratingly there isn't a gregexec (it was only added later, in R 4.1.0), although I'd be curious if someone comes up with something comparably compact in base; one attempt with the newer gregexec is sketched below.
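On R >= 4.1, a comparably compact base sketch is possible with gregexec (assuming xx is the data frame read from the CSV above):
s <- gsub("Metro", "", xx$CampaignName)
m <- regmatches(s, gregexec("(.+?) ([A-Z]{2}) ?", s))  # per row: a matrix with row 1 = full match, rows 2-3 = captures
do.call(rbind, Map(function(mat, mkt)
  data.frame(Market = mkt, City = mat[2, ], State = mat[3, ]),
  m, xx$Market))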
Here is a solution using base R. Split the CampaignName column on the string Metro, adding sequential numbers as names. stack turns it into a data frame with columns ind and values, which we massage into DF1. Merge that with xx by the sequence numbers of DF1 and the row numbers of xx. Move Market to the front of DF2 and remove ind and CampaignName. Finally, write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
                           City = sub(" ..$", "", values),
                           State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
Revised to handle extra columns after the poster modified the question to include them; minor improvements.