Suppose I have a data.frame like this:
Debts <- data.frame(name = c("Julia Fischer", "Arold Hass", "Michael Pfeifer", "Harry Frank"),
                    value = c(145, 136, 0, 100))
I want to generate PDFs in a loop, instead of printing like this:
library(stringr)  # str_c() comes from stringr

for (i in seq_len(nrow(Debts))) {
  L <- Debts[i, ]
  if (L$value > 0) {
    print(str_c("Hi ", L$name, ", you owe me ", L$value, " dollars."))
  } else {
    print(str_c("Hi ", L$name, ", we are even."))
  }
}
Is it possible to do this using R Markdown? How can I do that? I guess that if it is possible, I can generate the PDFs with a nice template too. If it's not possible with R Markdown, is there any other option?
I was able to do it using parameterized R Markdown reports. It has some limitations, but it can work. Basically, you add a params: field to the YAML header of the .Rmd. The params you want to use are indented under it, and each one can be given a default value (the 3 below is just to show how):
---
title: "Our accounting"
output: pdf_document
params:
  n: 3
---
In our case I'm using just one parameter, n, because I'm going to use it as an index in my loop.
So, in the console, you can create your loop. Inside it, use the function rmarkdown::render() to call the .Rmd file and render the PDFs. You will also need to set the name of each PDF with output_file =. Something like this:
for (i in seq_len(nrow(Debts))) {
  name <- Debts$name[i]
  rmarkdown::render("debts.Rmd",
                    params = list(n = i),
                    output_file = paste0(name, "-debts.pdf"))
}
In your .Rmd file you'll have something like:
Hi `r Debts$name[params$n]`, you owe me `r Debts$value[params$n]` dollars.
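Putting it together, a minimal debts.Rmd that reproduces the question's if/else could look like the sketch below. It assumes Debts is visible in the environment from which rmarkdown::render() is called (which is where render() evaluates chunks by default); the chunk body is my own sketch, not part of the original answer:

---
title: "Our accounting"
output: pdf_document
params:
  n: 3
---

```{r, echo=FALSE, results='asis'}
# Debts is assumed to exist in the calling environment (see the loop above)
L <- Debts[params$n, ]
if (L$value > 0) {
  cat("Hi ", L$name, ", you owe me ", L$value, " dollars.", sep = "")
} else {
  cat("Hi ", L$name, ", we are even.", sep = "")
}
```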
I have a character vector (content) of about 50,000 lines in R. However, some pieces of text that belong together ended up on separate lines when read in from a text file. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for (i in 1:length(content)) {
  if (content[i + 1] != "") {
    content[i + 1] <- c(content[i], content[i + 1])
  }
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are each many thousands of lines long. Any ideas on how to turn these solutions into a function that can be applied to the content of each document?
You don't need a loop to do that. (Incidentally, your loop fails because on the last iteration content[i + 1] is NA, so the if() condition is neither TRUE nor FALSE.)
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
c("\n", sample(letters, 20, replace = TRUE), "\n"),
collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy #replace empty string by split marker
y <- paste(x, collapse = " ") #make one long string
z <- unlist(strsplit(y, dummy)) #cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
# identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars == "")
# split the vector (and filter out "" by using the select vector)
select <- chars != ""
splitted <- split(chars[select], ids[select])
# paste the groups together
res <- sapply(splitted, paste, collapse = " ")
# remove names (if necessary, probably not)
res <- unname(res)  # thanks @Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table, which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, and then collapse the characters together, and then use strsplit(). Here I have used the newline character since if you were to just paste it you could get the different lines on the output, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
Another way to add to the mix:
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
#             1              2
# "hello, world"  "how are you"
Group the non-empty strings by the cumulative count of empty strings seen so far, and paste the elements together within each group.
I searched Stack Overflow a little and all I found was that regexes in R are a bit tricky and not as convenient as in Perl or Python.
My problem is the following. I have long file names with information encoded inside. They look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values; for example, the first part is a date, the second a machine abbreviation, the next an institute abbreviation, then a group abbreviation, sample number(s), etc.
What I do at the moment is construct a regex to make (almost) sure I grab the correct part of the string:
regex <- '([[:digit:]]{8})_([[:alnum:]]{1,4})_([[:upper:]]+)_ etc'
Then I use sub to save each snippet into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works if the filename follows the correct convention. It is overall very hard to read, though, and I am searching for a more convenient way of doing the work. I thought splitting the filename by _ and accessing the pieces by index might be a good solution. But since the filenames are often created by hand, there are sometimes terms missing or additional information in the names, so I am looking for a better approach.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object which has all the information from the filenames extracted and accessible, such as my_object$machine.
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
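For reference, that last call should return a two-column character matrix along these lines (first match per string, since regexpr() rather than gregexpr() was used):

     first     last
[1,] "Ben"     "Franklin"
[2,] "Millard" "Fillmore"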
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can, however, use Perl-style group naming with the argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However, regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw",
               "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw",
               "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw",
               "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"
I have thousands of files in a certain directory:
filenames <- list.files("D:/MessData_Source", pattern="*.DAT", full.names=TRUE)
.....
.....
[9998] "D:/MessData_Source/908-A0F7__01310012567794F.DAT"
[9999] "D:/MessData_Source/908-A0F7__01310015662858F.DAT"
[10000] "D:/MessData_Source/908-A0F7__01310015662859F.DAT"
....
....
Out of those files, I need to extract ONLY the ones whose filenames contain certain strings.
e.g.
filename_extracted <- list()
for (i in 1:length(filenames)) {
  # extract the part of each filename that contains the PartNo and MoNo
  filename_extracted[[i]] <- substr(filenames[i], 31, 43)
}
Above, I am extracting characters 31 to 43 of each filename and storing them in filename_extracted, which looks something like this:
[[9993]]
[1] "1856955908850"
[[9994]]
[1] "1856955933372"
[[9995]]
[1] "1856955933372"
[[9996]]
[1] "1856955954613"
[[9997]]
[1] "1856955954613"
[[9998]]
[1] "1310012567794"
[[9999]]
[1] "1310015662858"
[[10000]]
[1] "1310015662859"
Next, I need to compare filename_extracted to my required list and copy the matched files to another directory.
required_list <- list()
df <- read.csv("PartNo_MoNo.csv")  # full set
for (i in 1:nrow(df)) {
  required_list[[i]] <- paste(df[i, 1], df[i, 2], sep = "")
}
> required_list
[[1]]
[1] "1235235987252"
[[2]]
[1] "1897865985468"
If there are matches between required_list and filename_extracted, how do I copy the matched files to another directory?
Thanks.
Here is the updated code, fully vectorized:
filename_extracted = substr(filenames, start = 31, stop = 43)
prefix = substr(filenames, start = 20, stop = 30)
required_list = paste0(df[, 1], df[, 2])
# logical index of files whose extracted part appears in the required list
# (%in% keeps the pairing with prefix, which intersect() would lose for duplicates)
matched = filename_extracted %in% required_list
storeDir = "D:/MessData_Source"
otherDir = "D:/OrderedData_Source"
if (any(matched)) {
  commonFile = paste0(prefix[matched], filename_extracted[matched], ".DAT")
  sapply(commonFile, function(u) {
    file.copy(file.path(storeDir, u), file.path(otherDir, u))
  })
}
Before executing this, make sure otherDir exists.
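For example, a one-liner can create it up front (showWarnings = FALSE just silences the warning if the directory already exists):

dir.create(otherDir, showWarnings = FALSE)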
# Create data
library(stringr)
lapply(1:10, function(x){
write.csv(head(iris),file=paste0("908-A0F7__",x,".csv"))
write.csv(head(iris),file=paste0("notused__",x,".csv"))
})
# Only get files with the correct pattern
pattern = "908-A0F7__(\\d+)\\.csv"
files = data.frame(name = dir(pattern = pattern, full.names = TRUE),
                   stringsAsFactors = FALSE)
files$num = as.integer(str_match(files$name, pattern)[, 2])
required = c(1, 3, 5)  # You can also read this in from your csv
myFiles = files[files$num %in% required, ]
dir.create("copied")
file.copy(myFiles$name, file.path("copied", str_sub(myFiles$name, 3)))
R 2.13.1 on Mac OS X. I'm trying to import a data file that uses a point as the thousands separator and a comma as the decimal point, as well as a trailing minus for negative values.
Basically, I'm trying to convert from:
"A|324,80|1.324,80|35,80-"
to
V1 V2 V3 V4
1 A 324.80 1324.8 -35.80
Now, interactively both the following works:
gsub("\\.","","1.324,80")
[1] "1324,80"
gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"
and also combining them:
gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"
However, I'm not able to remove the thousands separator with read.table:
setClass("num.with.commas")
setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"
mytable <- read.table(textConnection(mydata), header = FALSE, quote = "",
                      comment.char = "", sep = "|", dec = ",", skip = 0,
                      fill = FALSE, strip.white = TRUE,
                      colClasses = c("character", "num.with.commas",
                                     "num.with.commas", "num.with.commas"))
Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion
mytable
V1 V2 V3 V4
1 A NA NA NA
Note that if I change from "\\." to "," in the function, things look a bit different:
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )
mytable <- read.table(textConnection(mydata), header = FALSE, quote = "",
                      comment.char = "", sep = "|", dec = ",", skip = 0,
                      fill = FALSE, strip.white = TRUE,
                      colClasses = c("character", "num.with.commas",
                                     "num.with.commas", "num.with.commas"))
mytable
V1 V2 V3 V4
1 A 32480 1.3248 -3580
I think the problem is that read.table with dec="," converts the incoming "," to "." BEFORE calling as(from, "num.with.commas"), so that the input string can be e.g. "1.324.80".
I want as("1.123,80-","num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.
How can I make my num.with.commas replace all except the last decimal point in the input string?
Update: First, I added negative lookahead and got as() working in the console:
setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8
However, read.table still had the same problem. Adding some print()s to my function showed that num.with.commas in fact received the comma and not the point.
So my current solution is to then replace from "," to "." in num.with.commas.
setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
V1 V2 V3 V4
1 A 324.8 1324.8 -35.8
You should remove all the periods first and then change the commas to decimal points before coercing with as.numeric(). You can later control how decimal points are printed with options(OutDec = ","). I do not think R uses commas as decimal separators internally, even in locales where they are conventional.
> tst <- c("A","324,80","1.324,80","35,80-")
>
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1] NA 324.8 1324.8 -35.8
Warning message:
NAs introduced by coercion
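If you want to plug this back into the question's read.table() approach, the same three substitutions can go inside the setAs() coercion. A sketch, assuming (as the question's update found) that the raw field, comma and all, reaches the coercion function:

setAs("character", "num.with.commas",
      function(from) as.numeric(sub(",", ".",
        sub("(.+)-$", "-\\1", gsub("\\.", "", from)))))
mytable <- read.table(textConnection(mydata), sep = "|",
                      colClasses = c("character", rep("num.with.commas", 3)))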
Here's a solution with regular expressions and substitutions:
mydata <- "A|324,80|1.324,80|35,80-"
# Split data
mydata2 <- strsplit(mydata, "|", fixed = TRUE)[[1]]
# Remove the thousands separators (periods)
mydata3 <- gsub(".", "", mydata2, fixed = TRUE)
# Turn the decimal commas into decimal points
mydata3 <- gsub(",", ".", mydata3, fixed = TRUE)
# Move negatives to the front of the string
mydata4 <- gsub("^(.+)-$", "-\\1", mydata3)
# Convert to numeric
mydata.cleaned <- c(mydata4[1], as.numeric(mydata4[2:4]))
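One caveat: the final c() coerces everything back to character, because a vector cannot mix types. To keep the numeric columns numeric, a data frame mirroring the desired V1..V4 output might be preferable (a sketch):

mydata.df <- data.frame(V1 = mydata4[1],
                        V2 = as.numeric(mydata4[2]),
                        V3 = as.numeric(mydata4[3]),
                        V4 = as.numeric(mydata4[4]))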