Using R, retrieve files whose filenames contain certain strings - regex

I have thousands of files in a certain directory:
filenames <- list.files("D:/MessData_Source", pattern="\\.DAT$", full.names=TRUE)
.....
.....
[9998] "D:/MessData_Source/908-A0F7__01310012567794F.DAT"
[9999] "D:/MessData_Source/908-A0F7__01310015662858F.DAT"
[10000] "D:/MessData_Source/908-A0F7__01310015662859F.DAT"
....
....
Out of those thousands of files, I need to pick out ONLY the files whose filenames contain certain strings.
e.g.
filename_extracted <- list()
for (i in 1:length(filenames))
{
# search for those filenames that contain the strings with PartNo and MoNo and store in results
filename_extracted[[i]] <- substr(filenames[i],31,43)
}
Above, I extract characters 31 to 43 of each filename and store them in filename_extracted, which looks something like this:
[[9993]]
[1] "1856955908850"
[[9994]]
[1] "1856955933372"
[[9995]]
[1] "1856955933372"
[[9996]]
[1] "1856955954613"
[[9997]]
[1] "1856955954613"
[[9998]]
[1] "1310012567794"
[[9999]]
[1] "1310015662858"
[[10000]]
[1] "1310015662859"
Next, I need to compare the filename_extracted to my required list, and copy those matched files to another directory.
required_list <- list()
df <- read.csv("PartNo_MoNo.csv") # full set
for (i in seq_len(nrow(df))) # loop over the rows; length(df) would count the columns
{
required_list[[i]] <- paste(df[i,1],df[i,2], sep="")
}
> required_list
[[1]]
[1] "1235235987252"
[[2]]
[1] "1897865985468"
If there are matches between required_list and filename_extracted, I want to copy the matched files to another directory. How do I do it?
Thanks.

Here is the updated code, fully vectorized:
filename_extracted = substr(filenames, start=31, stop=43)
required_list = paste0(df[,1], df[,2])
# TRUE for every file whose extracted ID appears in the required list
matched = filename_extracted %in% required_list
storeDir = "D:/MessData_Source"
otherDir = "D:/OrderedData_Source"
if(any(matched))
{
# basename() keeps each file's complete original name
commonFile = basename(filenames[matched])
sapply(commonFile, function(u){
file.copy(file.path(storeDir,u), file.path(otherDir, u))
})
}
Before executing this, make sure that otherDir exists.
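A minimal sketch of the whole flow, including creation of the target directory. It runs inside tempdir() with made-up file names and IDs, so the paths, the ids vector, and the required_list values are placeholders rather than the asker's data:

```r
# Hypothetical source and target directories inside tempdir()
storeDir <- file.path(tempdir(), "MessData_Source")
otherDir <- file.path(tempdir(), "OrderedData_Source")
dir.create(storeDir, showWarnings = FALSE)

# Create the target directory only if it is missing
if (!dir.exists(otherDir)) dir.create(otherDir, recursive = TRUE)

# Made-up files: a 10-character prefix followed by a 13-digit ID
ids <- c("1310012567794", "1310015662858", "1310015662859")
file.create(file.path(storeDir, paste0("908-A0F7__", ids, "F.DAT")))

filenames <- list.files(storeDir, pattern = "\\.DAT$", full.names = TRUE)

# Take the ID from the base name (characters 11 to 23), so the positions
# do not depend on the length of the directory path
extracted <- substr(basename(filenames), 11, 23)
required_list <- c("1310012567794", "1310015662859")

# Copy every matching file into the existing target directory
matched <- filenames[extracted %in% required_list]
copied <- file.copy(matched, otherDir, overwrite = TRUE)
```

file.copy() accepts a vector of source files together with a single existing directory as the destination, so no explicit loop is needed. It returns one logical value per file, which makes it easy to check for failed copies.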

# Create data
library(stringr)
lapply(1:10, function(x){
write.csv(head(iris),file=paste0("908-A0F7__",x,".csv"))
write.csv(head(iris),file=paste0("notused__",x,".csv"))
})
# Only get files with correct pattern
pattern = "908-A0F7__(\\d+)\\.csv"
files = data.frame(name=dir(pattern=pattern,full.names=TRUE), stringsAsFactors=FALSE)
files$num = as.integer(str_match(files$name,pattern)[,2])
required = c(1,3,5) # You can also read this in from your csv
myFiles = files[files$num %in% required,]
dir.create("copied")
file.copy(as.character(myFiles$name),file.path("copied",str_sub(myFiles$name,3)))

Related

Removing rows containing special characters

I am working on filtering a massive dataset that reads in as a list. I need to filter out lines with special markings and am getting stuck on some of them. Here is what I currently have:
library(R.utils)
library(stringr)
gunzip("movies.list.gz") #open file
movies <- readLines("movies.list") #read lines in
movies <- gsub("[\t]", '', movies) #remove tabs (\t)
#movies <- gsub(, '', movies)
a <- movies[!grepl("\\{", movies)] # removed any line that contained special character {
b <- a[!grepl("\\(V)", a)] #remove porn?
c <- b[!grepl("\\(TV)", b)] #remove tv
d <- c[!grepl("\\(VG)", c)] #remove video games
e <- d[!grepl("\\(\\?\\?\\?\\?\\)", d)] #remove anything with an unknown date, e.g. (????)
f <- e[!grepl("\\#)", e)]
g <- e[!grepl("\\!)", f)]
i <- data.frame(g)
i <- i[-c(1:15),]
i <- data.frame(i)
i$Date <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 2)
i$Title <- lapply(strsplit(as.character(i$i), "\\(....\\)"), "[", 1)
I still need to clean it up a bit and remove the original column (i), but from the output you can see that it is not removing the lines with the special characters ! or #:
> head(i)
                                    i      Date               Title
1            "!Next?" (1994)1994-1995 1994-1995            "!Next?"
2         "#1 Single" (2006)2006-???? 2006-????         "#1 Single"
3 "#1MinuteNightmare" (2014)2014-???? 2014-???? "#1MinuteNightmare"
4           "#30Nods" (2014)2014-2015 2014-2015           "#30Nods"
5       "#7DaysLater" (2013)2013-???? 2013-????       "#7DaysLater"
6            "#ATown" (2014)2014-???? 2014-????            "#ATown"
What I actually want to do is remove the entire rows containing those special characters. Everything I have tried has thrown errors. Any suggestions?
You could substitute anything that is not alphanumeric, a "-", or a parenthesis like this:
gsub("[^A-Za-z0-9()-]", "", row)
In order to remove the rows you can try something like the one below:
data[!grepl(pattern = "[#!]", x = data)]
In case you want to remove all the rows with special characters, you can use the pattern suggested by @luke1018 with grepl (the hyphen is placed last in the bracket expression so that it is read as a literal character):
data[!grepl(pattern = "[^A-Za-z0-9()-]", x = data)]
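As a small illustration of the row-removal approach, here is a sketch on made-up lines in the style of the question's data. The movies vector and the movie_df data frame are invented for the example:

```r
# Made-up sample lines in the style of the question's data
movies <- c('"!Next?" (1994)', '"#1 Single" (2006)',
            'Alien (1979)', 'Heat (1995)')

# grepl() returns TRUE for every element that contains "#" or "!"
has_special <- grepl("[#!]", movies)

# Keep only the elements without those characters
clean <- movies[!has_special]

# For a data frame, index the rows instead; drop = FALSE keeps a data frame
movie_df <- data.frame(line = movies, stringsAsFactors = FALSE)
df_clean <- movie_df[!grepl("[#!]", movie_df$line), , drop = FALSE]
```

Inside a bracket expression, # and ! need no escaping, which avoids the unbalanced-parenthesis patterns used in the question.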

Using regular expressions in R to extract information from string

I searched Stack Overflow a little, and all I found was that regexes in R are a bit tricky and less convenient than in Perl or Python.
My problem is the following. I have long file names with information inside. They look like the following:
20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw
20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw
20150416_QEP1_EXT_GR_1237_hs_IP_NON_060
I want to extract the parts from the filename and convert them conveniently into values, for example the first part is a date, the second is machine abbreviation, the next an institute abbreviation, group abbreviation, sample number(s) etc...
What I do at the moment is construct a regex to make (almost) sure that I grab the correct part of the string:
regex <- '([[:digit:]]{8})_([[:alnum:]]{1,4})_([[:upper:]]+)_ etc'
Then I use sub to save each snippet into a variable:
date <- sub(regex, '\\1', filename)
machine <- sub(regex, '\\2', filename)
etc
This works if the filename follows the convention, but overall it is very hard to read, and I am searching for a more convenient way of doing this. I thought that splitting the filename by _ and accessing the parts by index might be a good solution. But since the filenames are often created by hand, terms are sometimes missing or extra information is added, so I am looking for a better solution.
Can anyone suggest a better way of doing so?
EDIT
What I want to create is an object, which has all the information of the filenames extracted and accessible... such as my_object$machine or so....
The help page for ?regex actually gives an example that is exactly equivalent to Python's re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds") (as per your comment):
## named capture
notables <- c(" Ben Franklin and Jefferson Davis",
"\tMillard Fillmore")
#name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
parse.one(notables, parsed)
The normal way (i.e. the R way) to extract from a string is the following:
text <- "Malcolm Reynolds"
x <- gregexpr("\\w+", text) #Don't forget to escape the backslash
regmatches(text, x)
[[1]]
[1] "Malcolm" "Reynolds"
You can, however, use Perl-style group naming with the argument perl=TRUE:
regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
However, regmatches does not support it, hence the need to create your own function to handle that, which is given in the help page:
parse.one <- function(res, result) {
m <- do.call(rbind, lapply(seq_along(res), function(i) {
if(result[i] == -1) return("")
st <- attr(result, "capture.start")[i, ]
substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
}))
colnames(m) <- attr(result, "capture.names")
m
}
Applied to your example:
text <- "Malcolm Reynolds"
x <- regexpr("(?P<first_name>\\w+) (?P<last_name>\\w+)", text, perl=TRUE)
parse.one(text, x)
first_name last_name
[1,] "Malcolm" "Reynolds"
To go back to your initial problem:
filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1234-1235_hs_IP_NON_060.raw", "20150416_QEP1_EXT_GR_1236_hs_IP_NON_060_some_other_info.raw", "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")
regex <- '(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<whatev>[[:upper:]]+)'
x <- regexpr(regex,filenames,perl=TRUE)
parse.one(filenames,x)
date machine whatev
[1,] "20150416" "QEP1" "EXT"
[2,] "20150416" "QEP1" "EXT"
[3,] "20150416" "QEP1" "EXT"
[4,] "20150416" "QEP1" "EXT"
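To get the kind of object asked for in the question's EDIT (something like my_object$machine), one option is to turn the matrix returned by parse.one into a data frame. The sketch below repeats the helper from ?regex so that it is self-contained; the group name institute and the conversion of the date column are assumptions made for illustration:

```r
# Helper from the ?regex help page, repeated so the sketch runs on its own
parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if (result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}

filenames <- c("20150416_QEP1_EXT_GR_1234_hs_IP_NON_060.raw",
               "20150416_QEP1_EXT_GR_1237_hs_IP_NON_060")

# Named groups; "institute" is a hypothetical name for the third field
regex <- "(?P<date>[[:digit:]]{8})_(?P<machine>[[:alnum:]]{1,4})_(?P<institute>[[:upper:]]+)"
x <- regexpr(regex, filenames, perl = TRUE)

# One column per named group, one row per filename
my_object <- as.data.frame(parse.one(filenames, x), stringsAsFactors = FALSE)

# Convert the date string into a proper Date
my_object$date <- as.Date(my_object$date, format = "%Y%m%d")

my_object$machine
```

Each named group becomes a column, so further fields only require extending the regex.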

Selecting the word immediately after a keyword

I'm trying to extract the word immediately after a keyword using R. I don't have a lot of experience with regular expressions, so everything I've found so far doesn't help me much. If I could get the function to return multiple instances, that would be ideal.
For example if my keyword was the and my string was:
The yellow log is in the stream
It would return yellow and stream.
I found this solution for C# and it seems to be exactly what I want, but I'm having trouble implementing it in R.
You can try
library(stringr)
str_extract_all(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
Or using stringi
library(stringi)
stri_extract_all_regex(str1, '(?<=\\b(?i)The )\\w+')[[1]]
#[1] "yellow" "stream"
EDIT: Changed based on @Roland's suggestion in the comments.
Data:
str1 <- 'The yellow log is in the stream'
Assign key to whatever string you want and use:
key <- 'the'
p <- "The yellow log is in the stream"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
Or, as @Roland points out, it would be safer to put a word boundary before your keyword; otherwise a word that merely ends in the keyword (here "absinthe") also matches:
key <- 'the'
p <- "The yellow log is in the stream drinking absinthe and beer"
regmatches(p, gregexpr(sprintf('(?i)(?<=%s\\s)\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream" "and"
regmatches(p, gregexpr(sprintf('(?i)(?<=\\b%s )\\w+', key), p, perl = TRUE))[[1]]
# [1] "yellow" "stream"
Here is a non-regex solution:
mytext <- "The yellow log is in the stream"
mykey <- "the"
x <- unlist(strsplit(mytext," "))
x[which(tolower(x)==mykey)+1]
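Try this. Note that it returns each keyword together with the word that follows it, so the keyword still has to be stripped to get 'yellow' and 'stream':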
x <- "The yellow log is in the stream"
regmatches(x, gregexpr("(?:(?:T|t)he)\\s(\\w+)", x, perl = TRUE))[[1]]
## [1] "The yellow" "the stream"
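One way to finish that last step is to strip the keyword from each match with sub(). This is a sketch only; the variable names pairs and words are made up, and a word boundary is added in front of the keyword:

```r
x <- "The yellow log is in the stream"

# Match the keyword (either case of "T") plus the following word
pairs <- regmatches(x, gregexpr("\\b[Tt]he\\s\\w+", x, perl = TRUE))[[1]]

# Remove the leading keyword and the whitespace after it
words <- sub("^[Tt]he\\s", "", pairs)
words
```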
The qdapRegex package I maintain has a regular expression after_ in the regex_supplement dictionary that is perfect for this. You can use rm_ to make your own after_the function:
library(qdapRegex)
x<- "The yellow log is in the stream"
after_the <- rm_(pattern = S("@after_", "[Tt]he"), extract = TRUE)
after_the(x)
## [[1]]
## [1] "yellow" "stream"
The S function is a wrapper for sprintf that allows you to easily pass elements (like the word "the" in this case) to the base regex, producing:
S("@after_", "the", "The")
## [1] "(?<=\\b(the|The)\\s)(\\w+)"
EDIT
library(qdapRegex)
x<- c("The yellow log is in the stream", "I like the one box for a pack")
after_ <- rm_(extract = TRUE)
after_the(x)
after_ <- rm_(extract = TRUE)
words <- c("the", "a", "one")
setNames(lapply(words, function(y){
after_(x, pattern = S("@after_", y, TC(y)))
}), words)
## $the
## $the[[1]]
## [1] "yellow" "stream"
##
## $the[[2]]
## [1] "one"
##
##
## $a
## $a[[1]]
## [1] NA
##
## $a[[2]]
## [1] "pack"
##
##
## $one
## $one[[1]]
## [1] NA
##
## $one[[2]]
## [1] "box"

Remove all characters before a period in a string

This keeps everything before a period:
gsub("\\..*","", data$column )
How do I keep everything after the period?
To remove all characters up to and including the last period in a string:
gsub("^.*\\.","", data$column )
Example:
> data <- 'foobar.barfoo'
> gsub("^.*\\.","", data)
[1] "barfoo"
To remove all characters up to and including the first period:
> data <- 'foo.bar.barfoo'
> gsub("^.*?\\.","", data)
[1] "bar.barfoo"
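The difference between the two patterns is greedy versus lazy matching. A short side-by-side sketch, with variable names invented for the example and perl = TRUE used for the lazy quantifier:

```r
x <- "foo.bar.barfoo"

# Greedy: ".*" runs to the LAST period, so only the final part remains
after_last <- sub("^.*\\.", "", x)

# Lazy: ".*?" stops at the FIRST period
after_first <- sub("^.*?\\.", "", x, perl = TRUE)

# A string without a period is returned unchanged
no_period <- sub("^.*\\.", "", "foobar")
```

Because both patterns are anchored with ^, at most one match per string is possible, so sub() and gsub() behave the same here.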
You could use stringi with lookbehind regex
library(stringi)
stri_extract_first_regex(data1, "(?<=\\.).*")
#[1] "bar.barfoo"
stri_extract_first_regex(data, "(?<=\\.).*")
#[1] "barfoo"
If the string doesn't contain a ".", this returns NA (the question does not say how that case should be handled):
stri_extract_first_regex(data2, "(?<=\\.).*")
#[1] NA
###data
data <- 'foobar.barfoo'
data1 <- 'foo.bar.barfoo'
data2 <- "foobar"
If you don't want to think about the regex for this, the qdap package has the char2end function, which grabs everything from a particular character to the end of the string.
data <- c("foo.bar", "foo.bar.barfoo")
library(qdap)
char2end(data, ".")
## [1] "bar" "bar.barfoo"
Use this:
gsub(".*\\.","", data$column )
This will keep everything after the last period.
require(stringr)
I run a course on Data Analysis and the students came up with this solution:
get_after_period <- function(my_vector) {
# Return a string vector without the characters
# up to and including the first period
# my_vector, a string vector
str_sub(my_vector, str_locate(my_vector, "\\.")[,1]+1)
}
Now, just call the function:
my_vector <- c('foobar.barfoo', 'amazing.point')
get_after_period(my_vector)
[1] "barfoo" "point"

Extract string between parenthesis in R

I have to extract values delimited by a rather peculiar pattern in R. For example:
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
This is my example string and I wish to extract text between {[0-9]: and } such that my output for the above string looks like
## output should be
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098" "{112:123123214321}" "20:asdasd3214213"
This is a horrible hack and will probably break on your real data. Ideally you could just use a parser, but if you're stuck with regex... well, it's not pretty:
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[length(out)] <- substring(out[n], 1, nchar(out[n])-1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)
which gives
> answer
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
You could wrap something like that in a function if you need to do this for multiple strings.
Using Perl-style regular expressions (perl=TRUE). This way is a bit more robust.
a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"
foohacky = function(str){
#remove opening bracket
pt1 = gsub('\\{+[0-9]:', '##',str)
#remove a closing bracket that is preceded by any alphanumeric character
pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
#split up and hack together the result
pt3 = strsplit(pt2, "##")[[1]][-1]
pt3
}
For example
> foohacky(a)
[1] "0987617820"
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"
[4] "20:asdasd3214213"
It also works with nesting:
> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820" "{112:123123214321}" "{20:asdasd3214213}"
Here's a more general way, which returns any pattern between {[0-9]: and } allowing for a single nest of {} inside the match.
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
a_parse <- regmatches(a, regPattern)
a <- unlist(a_parse)
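A runnable sketch of that pattern on the single-line example string. The result is stored in a separate variable, parsed (a name chosen here), so that the input a is not overwritten:

```r
a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"

# Lookbehind: the match must follow "{<digit>:"
# Body: either one nested {...} block or the shortest run of characters
# Lookahead: the match must be followed by a closing "}"
regPattern <- gregexpr("(?<=\\{[0-9]:)(\\{.*\\}|.*?)(?=\\})", a, perl = TRUE)

parsed <- unlist(regmatches(a, regPattern))
parsed
```

As written, the lookbehind only accepts a single-digit key such as {1: or {4:, so keys with more digits would need a different pattern.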