R: Create a file list without a specific string - regex

I'm trying to create a list of files from a directory containing files with the following patterns:
Name_Surname_12345_noe_xy.xls
Name_Surname_12345_xy.xls
xy can be one or two characters.
Now I want a list of all files which do not contain "noe" in the filename.
I can list only the "noe" files using
fl = list.files(pattern = "noe.+xls$", recursive=T, full.names=T)
but found no way to exclude them. Any suggestions?
Many thanks
Markus

Get all the files and then use grep to find the noe ones and subset them out:
> all
[1] "Name_Surname_123425_xy.xls" "Name_Surname_1234445_xy.xls"
[3] "Name_Surname_12345_noe_xy.xls" "Name_Surname_12345_xy.xls"
[5] "Name_Surname_13245_noe_xy.xls"
> all[grep("noe_xy.xls",all,invert=TRUE)]
[1] "Name_Surname_123425_xy.xls" "Name_Surname_1234445_xy.xls"
[3] "Name_Surname_12345_xy.xls"
Always make sure you check the edge cases where all or none of the files match:
> all[grep("xls",all,invert=TRUE)]
character(0)
> all[grep("fnord",all,invert=TRUE)]
[1] "Name_Surname_123425_xy.xls" "Name_Surname_1234445_xy.xls"
[3] "Name_Surname_12345_noe_xy.xls" "Name_Surname_12345_xy.xls"
[5] "Name_Surname_13245_noe_xy.xls"
Using grep with a negative index works except in these edge cases:
> all
[1] "Name_Surname_123425_xy.xls" "Name_Surname_1234445_xy.xls"
[3] "Name_Surname_12345_noe_xy.xls" "Name_Surname_12345_xy.xls"
[5] "Name_Surname_13245_noe_xy.xls"
> all[-grep("noe_xy.xls",all)] # strip out the noe_xy.xls files
[1] "Name_Surname_123425_xy.xls" "Name_Surname_1234445_xy.xls"
[3] "Name_Surname_12345_xy.xls"
# works. Now strip out any xls files (should leave nothing)
> all[-grep("xls",all)]
character(0)
# yup, that works too. Now strip out 'fnord' files, shouldn't remove anything:
> all[-grep("fnord",all)]
character(0)
Epic fail! Reason is left as an exercise to the reader.
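For readers who skip the exercise: grep returns integer(0) when nothing matches, and negating an empty index is still an empty index, so the subset selects nothing. Below is a minimal sketch of one way around this, assuming a logical index built with grepl is acceptable. The helper name keep_without and the variable all_files are hypothetical and do not come from the answer above.

```r
# Sample file names copied from the answer above.
all_files <- c("Name_Surname_123425_xy.xls", "Name_Surname_1234445_xy.xls",
               "Name_Surname_12345_noe_xy.xls", "Name_Surname_12345_xy.xls",
               "Name_Surname_13245_noe_xy.xls")

# Hypothetical helper: keep the entries that do NOT match the pattern.
# grepl() returns one TRUE/FALSE per element, so negating it is safe
# even when nothing (or everything) matches.
keep_without <- function(files, pattern) {
  files[!grepl(pattern, files)]
}

keep_without(all_files, "_noe_")   # the three files without "noe"
keep_without(all_files, "xls")     # character(0)
keep_without(all_files, "fnord")   # all five files

# The root of the negative-index problem: an empty index stays empty.
length(-grep("fnord", all_files))  # 0
```

The same logical index can be applied directly to the result of list.files if a helper function feels like overkill.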

Related

Find all words that have "<-" at the end of the word OR in front of a dot

How do I pull out all words that have the symbol "<-" either at the end of the word or somewhere in between, but in the latter case only if the "<-" is followed by a dot?
To put it into context: Exercise 6.5.3 a. of Hadley Wickham's Advanced R asks the reader to list all replacement functions in the base package.
Replacement functions that have only one method are indicated by the symbol <-
right at the end of the function name. Methods of generic replacement functions, however, have
the class name attached to the name of the replacement form (with a dot), such that the <- is no longer at the end of the function name. Example: split<-.data.frame
EDIT:
objs <- mget(ls("package:base"), inherits = TRUE)
funs <- Filter(is.function, objs)
This is how you pull out all functions in the base package. Now I want to find only the replacement functions.
If you want all base package replacement functions and their respective S3 methods, you can try
ls(envir = as.environment("package:base"), pattern = "<-")
With no packages loaded, this gives the following result:
[1] "<<-" "<-" "[<-"
[4] "[[<-" "@<-" "$<-"
[7] "attr<-" "attributes<-" "body<-"
[10] "class<-" "colnames<-" "comment<-"
[13] "[<-.data.frame" "[[<-.data.frame" "$<-.data.frame"
[16] "[<-.Date" "diag<-" "dim<-"
[19] "dimnames<-" "dimnames<-.data.frame" "Encoding<-"
[22] "environment<-" "[<-.factor" "[[<-.factor"
[25] "formals<-" "is.na<-" "is.na<-.default"
[28] "is.na<-.factor" "is.na<-.numeric_version" "length<-"
[31] "length<-.factor" "levels<-" "levels<-.factor"
[34] "mode<-" "mostattributes<-" "names<-"
[37] "names<-.POSIXlt" "[<-.numeric_version" "[[<-.numeric_version"
[40] "oldClass<-" "parent.env<-" "[<-.POSIXct"
[43] "[<-.POSIXlt" "regmatches<-" "row.names<-"
[46] "rownames<-" "row.names<-.data.frame" "row.names<-.default"
[49] "split<-" "split<-.data.frame" "split<-.default"
[52] "storage.mode<-" "substr<-" "substring<-"
[55] "units<-" "units<-.difftime"
Thanks to @42 for helping me improve this answer.
We can try
library(stringr)
str_extract(v1, "\\w+<-$|\\w*<-\\.\\S+")
#[1] "split<-.data.frame" NA "splitdata<-"
data
v1 <- c("split<-.data.frame", "split<-data", "splitdata<-")
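If only a yes/no filter is needed rather than the extracted text, the same two conditions ("<-" at the end, or "<-" followed by a dot) can be written for base R's grepl. This is a sketch under that assumption; the variable names is_replacement, base_names and replacement_names are made up for illustration.

```r
# Test vector from the answer above.
v1 <- c("split<-.data.frame", "split<-data", "splitdata<-")

# "<-$"   : the name ends in "<-"
# "<-\\." : "<-" is immediately followed by a literal dot
is_replacement <- grepl("<-$|<-\\.", v1)
v1[is_replacement]
# [1] "split<-.data.frame" "splitdata<-"

# The same filter applied to the names in the base package.
base_names <- ls("package:base")
replacement_names <- grep("<-$|<-\\.", base_names, value = TRUE)
head(replacement_names)
```

Note that the assignment operators "<-" and "<<-" also end in "<-" and therefore pass this filter, so they may need to be dropped by hand.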

Extracting pattern from the nested list in R using regex

I have the following sorted list (lst) of time periods, and I want to split the periods into individual dates and then extract the maximum time period without altering the order of the list.
$`1`
[1] "01.12.2015 - 21.12.2015"
$`2`
[1] "22.12.2015 - 05.01.2016"
$`3`
[1] "14.09.2015 - 12.10.2015" "29.09.2015 - 26.10.2015"
Therefore, after adjustment list should look like this:
$`1`
[1] "01.12.2015" "21.12.2015"
$`2`
[1] "22.12.2015" "05.01.2016"
$`3`
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
In order to do so, I began with splitting the list:
lst_split <- str_split(lst, pattern = " - ")
which leads to the following:
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "c(\"14.09.2015" "12.10.2015\", \"29.09.2015" "26.10.2015\")"
Then, I tried to extract the pattern:
lapply(lst_split, function(x) str_extract(pattern = c("\\d+\\.\\d+\\.\\d+"),x))
but my output is missing one date (29.09.2015)
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "26.10.2015"
Does anyone have an idea how I could make it work, and maybe propose a more efficient solution? Thank you in advance.
Thanks to the comments of @WiktorStribiżew and @akrun: it is enough to use str_extract_all.
In this example:
> str_extract_all(lst,"\\d+\\.\\d+\\.\\d+")
[[1]]
[1] "01.12.2015" "21.12.2015"
[[2]]
[1] "22.12.2015" "05.01.2016"
[[3]]
[1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
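For comparison, a base R sketch of the same "extract every match" idea, assuming stringr is not available. It uses gregexpr to locate all matches and regmatches to pull them out; the variable name dates is illustrative.

```r
# Input list as given in the question.
lst <- list(`1` = "01.12.2015 - 21.12.2015",
            `2` = "22.12.2015 - 05.01.2016",
            `3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"))

date_pattern <- "\\d+\\.\\d+\\.\\d+"

# gregexpr() finds every match in each string; regmatches() returns them
# as a list with one element per string, which unlist() flattens.
dates <- lapply(lst, function(x) unlist(regmatches(x, gregexpr(date_pattern, x))))

dates[["3"]]
# [1] "14.09.2015" "12.10.2015" "29.09.2015" "26.10.2015"
```

Because lapply works on each list component separately, the names and the order of the list are kept.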
1) Use strsplit, flatten each component using unlist, convert the dates to "Date" class and then use range to get the maximum time span. No packages are used.
> lapply(lst, function(x) range(as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")))
$`1`
[1] "2015-12-01" "2015-12-21"
$`2`
[1] "2015-12-22" "2016-01-05"
$`3`
[1] "2015-09-14" "2015-10-26"
2) This variation using a magrittr pipeline also works:
library(magrittr)
lapply(lst, function(x)
  x %>%
    strsplit(" - ") %>%
    unlist %>%
    as.Date("%d.%m.%Y") %>%
    range
)
Note: The input lst in reproducible form is:
lst <- structure(list(`1` = "01.12.2015 - 21.12.2015", `2` = "22.12.2015 - 05.01.2016",
`3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"
)), .Names = c("1", "2", "3"))
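If the length of each span is also of interest, the range from approach 1 can be turned into a number of days. This is only a sketch, and it assumes "maximum time period" means the distance between the earliest and the latest date in each component; span_days is a hypothetical name.

```r
lst <- list(`1` = "01.12.2015 - 21.12.2015",
            `2` = "22.12.2015 - 05.01.2016",
            `3` = c("14.09.2015 - 12.10.2015", "29.09.2015 - 26.10.2015"))

# For each component: parse all dates, take earliest and latest,
# and convert the difference to a plain number of days.
span_days <- sapply(lst, function(x) {
  dates <- as.Date(unlist(strsplit(x, " - ")), "%d.%m.%Y")
  as.numeric(diff(range(dates)), units = "days")
})

span_days
#  1  2  3
# 20 14 42
```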

R match expression multiple times in the same line

I am working with a set of Tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # and put them into separate variables. For example:
This is a test tweet using #twitter. @johnsmith @joesmith.
Ideally I would like it to create new variables in the data frame that contain twitter, johnsmith, joesmith, etc.
Currently I am using
data$at <- str_match(data$tweet_text,"\\s@\\w+")
data$hash <- str_match(data$tweet_text,"\\s#\\w+")
Which obviously gives me only the first occurrence of each in a new variable. Any suggestions?
strsplit and grep will work:
x <-strsplit("This is a test tweet using #twitter. @johnsmith @joesmith."," ")
grep("@|#",unlist(x), value=TRUE)
#[1] "#twitter." "@johnsmith" "@joesmith."
If you only want to keep the words, no @, # or .:
out <-grep("@|#",unlist(x), value=TRUE)
gsub("@|#|\\.","",out)
[1] "twitter" "johnsmith" "joesmith"
UPDATE Putting the results in a list:
my_list <-NULL
x <-strsplit("This is a test tweet using #twitter. @johnsmith @joesmith."," ")
my_list$hash <-c(my_list$hash,gsub("@|#|\\.","",grep("#",unlist(x), value=TRUE)))
my_list$at <-c(my_list$at,gsub("@|#|\\.","",grep("@",unlist(x), value=TRUE)))
x <-strsplit("2nd tweet using #second. @jillsmith @joansmith."," ")
my_list$hash <-c(my_list$hash,gsub("@|#|\\.","",grep("#",unlist(x), value=TRUE)))
my_list$at <-c(my_list$at,gsub("@|#|\\.","",grep("@",unlist(x), value=TRUE)))
my_list
$hash
[1] "twitter" "second"
$at
[1] "johnsmith" "joesmith" "jillsmith" "joansmith"
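A possible alternative that handles a whole vector of tweets at once and keeps the results per tweet, sketched in base R under the assumption that mentions are marked with @ and hashtags with #. The helper extract_tags and the sample vector tweets are hypothetical names introduced for this example.

```r
tweets <- c("This is a test tweet using #twitter. @johnsmith @joesmith.",
            "2nd tweet using #second. @jillsmith @joansmith.")

# Hypothetical helper: return, for each tweet, every word that directly
# follows the given prefix character ("#" or "@"), without the prefix.
extract_tags <- function(text, prefix) {
  matches <- regmatches(text, gregexpr(paste0(prefix, "\\w+"), text))
  lapply(matches, function(m) sub(prefix, "", m, fixed = TRUE))
}

hash <- extract_tags(tweets, "#")
at   <- extract_tags(tweets, "@")

hash[[1]]
# [1] "twitter"
at[[2]]
# [1] "jillsmith" "joansmith"
```

Since \\w+ stops at the first non-word character, trailing full stops are dropped without a separate gsub step. The resulting lists have one element per tweet, so they can be stored as list columns or collapsed with paste if a plain character column is preferred.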

How to use separate() properly?

I am having difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach that by:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a digit, so I'm capturing everything from the first digit up to the next _ and then everything after it.
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
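If you want to get the name as well, a simple solution would be the following (which assumes that the name always comes after the second underscore; note that paste joins the parts with spaces, so the underscores in the name are lost):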
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"
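Another option that does not depend on the ID starting with a digit and that keeps the underscores in the name is a pair of sub calls anchored on the first two underscores. This is a sketch using two of the strings from the question; id and name are illustrative variable names.

```r
a <- c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
       "NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")

# [^_]* matches a run of characters that are not underscores, so the
# capture group is exactly the text between the first and second "_".
id <- sub("^[^_]*_([^_]*)_.*$", "\\1", a)

# Dropping the first two underscore-delimited fields leaves the name,
# with its own underscores intact.
name <- sub("^[^_]*_[^_]*_", "", a)

id
# [1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
name
# [1] "THOMAS_MYR"              "Sarah_Lu_Van_Gar/Thomas"
```

Strings with fewer than two underscores do not match the first pattern and are returned unchanged, so the grep("_", ...) pre-filter from the question is still worth keeping.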

Parsing out a line in R to pick different objects

I have this line:
system<-c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
I need to parse this line and pick smt, lcpu, mem, mpsize and ent into different objects.
For example, I am doing this to pick the smt, but it picks the whole line. Any ideas what I am doing wrong here?
smt<-sub('^.* smt=([[:digit:]])', '\\1', system)
smt should be the number 4 in this case.
I would use strsplit a couple times, and type.convert:
parse.config <- function(x) {
  clean <- sub("System configuration: ", "", x)
  pairs <- strsplit(clean, " ")[[1]]
  items <- strsplit(pairs, "=")
  keys <- sapply(items, `[`, 1)
  values <- sapply(items, `[`, 2)
  values <- lapply(values, type.convert, as.is = TRUE)
  setNames(values, keys)
}
config <- parse.config(system)
# $type
# [1] "Shared"
#
# $mode
# [1] "Uncapped"
#
# $smt
# [1] 4
#
# $lcpu
# [1] 96
#
# $mem
# [1] "393216MB"
#
# $psize
# [1] 64
#
# $ent
# [1] 16
The output is a list so you can access any of the parsed items, for example:
config$smt
# [1] 4
Using strapplyc in the gsubfn package, the following creates a list L whose names are the left hand sides such as smt and whose values are the right hand sides.
library(gsubfn)
LHS <- strapplyc( system, "(\\w+)=" )[[1]]
RHS <- strapplyc( system, "=(\\w+)" )[[1]]
L <- setNames( as.list(RHS), LHS )
For example we can now get smt like this (and similarly for the other left hand sides):
> L$smt
[1] "4"
UPDATE: Simplified.
Add .* to the end of your matching expression and you'll get "4".
sub('^.* smt=([[:digit:]]+).*', '\\1', system)
I also added a + after the digit class so that values with more than one digit are captured in full.
You could also approach this by splitting on spaces and then finding the matches:
splits <- unlist(strsplit(system, ' '))
sub('smt=', '', grep('smt=', splits, value=TRUE))
# [1] "4"
or wrapping it in a function:
matchfun <- function(string, to_match, splitter=' ') {
  splits <- unlist(strsplit(string, splitter))
  sub(to_match, '', grep(to_match, splits, value=TRUE))
}
matchfun(system, 'smt=')
# [1] "4"
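The sub approach from the start of this answer can be generalized to any key. sub replaces only the part of the string that the pattern matches, which is why the pattern has to cover the whole line for the result to be just the captured value. This is a sketch; get_value and config_line are hypothetical names (config_line is used to avoid masking the system function).

```r
config_line <- "System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00"

# Hypothetical helper: capture everything after "<key>=" up to the next
# whitespace, and let the leading and trailing ".*" consume the rest.
get_value <- function(line, key) {
  pattern <- paste0("^.*[[:space:]]", key, "=([^[:space:]]+).*$")
  sub(pattern, "\\1", line)
}

get_value(config_line, "smt")               # "4"
get_value(config_line, "mem")               # "393216MB"
as.numeric(get_value(config_line, "ent"))   # 16
```

If the key is not present, the pattern does not match and sub returns the line unchanged, so a check with grepl may be needed before trusting the result.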
Well, I'm voting for @GaborGrothendieck's, but am offering this as a more pedestrian alternative:
inp <- c("System configuration: type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00")
inparsed <- read.table(text=inp, stringsAsFactors=FALSE)
vals <- unlist(inparsed)[grep("\\=", unlist(inparsed))]
vals
# V3 V4 V5 V6 V7 V8 V9
# type=Shared mode=Uncapped smt=4 lcpu=96 mem=393216MB psize=64 ent=16.00
vals[grep("smt|lcpu|mem|mpsize|ent", vals)]
V5 V6 V7 V9
"smt=4" "lcpu=96" "mem=393216MB" "ent=16.00"
I would note that choosing the name 'system' for a variable seems most unwise in light of the system function's existence.