Split or substitute strings with wildcards in R [duplicate] - regex

I have the following vector:
a <- c("abc_lvl1", "def_lvl2")
I basically want to split into two vectors:
("abc", "def") and ("lvl1", "lvl2"). I know how to substitute with sub:
sub(".*_", "", a)
[1] "lvl1" "lvl2"
I think this translates into "Search for any number of any characters before "_" and replace with nothing." Accordingly, I thought, this should give me the other desired vector:
sub("_*.", "", a), but it removes just the leading character:
[1] "bc_lvl1" "ef_lvl2"
Where do I mess up?
This is essentially the equivalent for the "text-to-columns"-function in excel.

There are several ways to do this. Here are a few, some using packages, and others with base R.
Given:
a <- c("abc_lvl1", "def_lvl2")
Here are some options:
do.call(rbind, strsplit(a, "_", TRUE))
matrix(scan(what = "", text = a, sep = "_"), ncol = 2, byrow = TRUE)
scan(text = a, sep = "_", what = list("", "")) ## a list
library(data.table) # load first so that data.table() is available to cSplit()
library(splitstackshape)
cSplit(data.table(a), "a", "_")
setDT(tstrsplit(a, "_"))[]
library(dplyr)
library(tidyr)
tibble(a) %>% # data_frame() is deprecated in favour of tibble()
  separate(a, into = c("this", "that"))
library(reshape2)
colsplit(a, "_", c("this", "that"))
library(stringi)
t(stri_split_fixed(a, "_", simplify = TRUE))
library(iotools)
mstrsplit(a, "_") # Matrix
dstrsplit(a, col_types = c("character", "character"), "_") # data.frame
library(gsubfn)
read.pattern(text = a, pattern = "(.*)_(.*)")

We can use read.csv/read.table and specify the sep="_". It will split the strings into two columns.
read.csv(text=a, sep="_", header=FALSE)

Just to build on the initial comments
a <- c("abc_lvl1", "def_lvl2")
a1 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][1]}))
a2 <- do.call(c, lapply(a, function(x){strsplit(x, "_")[[1]][2]}))
a1
[1] "abc" "def"
a2
[1] "lvl1" "lvl2"
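As for why sub("_*.", "", a) only strips the first character: "_*." means "zero or more underscores followed by any single character", so the pattern already matches at the start of the string with zero underscores plus the first letter. Swapping the quantifier and the dot gives "_.*" ("an underscore followed by any number of characters"). A minimal sketch in base R, where the variable names prefix and suffix are only illustrative:

```r
a <- c("abc_lvl1", "def_lvl2")

# "_.*" = an underscore, then everything after it: drop that to keep the prefix
prefix <- sub("_.*", "", a)

# ".*_" = everything up to and including the (last) underscore: keep the suffix
suffix <- sub(".*_", "", a)

# "_*." = zero or more underscores, then ONE character: removes only "a" / "d"
first_char_removed <- sub("_*.", "", a)

print(prefix)
print(suffix)
print(first_char_removed)
```

Note that ".*_" is greedy, so with several underscores in a string it strips up to the last one, while "_.*" strips from the first one onwards.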

Related

Split words in R Dataframe column

I have a data frame with words in a column separated by single space. I want to split it into three types as below. Data frame looks as below.
Text
one of the
i want to
I want to split it into as below.
Text        split1  split2  split3
one of the  one     one of  of the
I can produce split1, but I cannot work out the other two.
My code to get split1:
new_data$split1<-sub(" .*","",new_data$Text)
Figured out the split2:
df$split2 <- gsub(" [^ ]*$", "", df$Text)
We can try with gsub. Capture each run of one or more non-whitespace characters (\\S+) as a group (there are three words in this case), then rearrange the backreferences in the replacement and insert a delimiter (,) so that read.table can split the result into separate columns.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
    "\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to
data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
There might be more elegant solutions. Here are two options:
Using ngrams:
library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
  mutate(split1 = lapply(splits, `[`, 1)) %>%
  mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]),
         split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>%
  select(-splits)
        Text split1  split2  split3
1 one of the    one one, of of, the
2  i want to      i i, want want, to
Extract the two grams manually:
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
  mutate(split1 = lapply(splits, `[`, 1)) %>%
  mutate(split2 = lapply(splits, `[`, 1:2),
         split3 = lapply(splits, `[`, 2:3)) %>%
  select(-splits)
        Text split1  split2  split3
1 one of the    one one, of of, the
2  i want to      i i, want want, to
Update:
With regular expressions, we can use backreferences in gsub.
Split2:
gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"
Split3:
gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the" "want to"
This is a somewhat hackish solution.
Assumption: you are not concerned about the number of spaces between two words.
> x<-c('one of the','i want to')
> # the replacement separates the three pieces with TWO spaces, which strsplit then splits on
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2  \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one" "one of" "of the"
#[[2]]
#[1] "i" "i want" "want to"
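The regex answers above assume exactly three words per row. As a rough sketch of how the same pairing could be done for any number of words in base R, the hypothetical helper word_pairs below (not part of any package) splits on whitespace and pastes each word to its successor:

```r
# Hypothetical helper: returns all adjacent word pairs of a single string.
word_pairs <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < 2) {
    return(character(0))  # nothing to pair up
  }
  # pair word i with word i + 1
  paste(head(words, -1), tail(words, -1))
}

texts <- c("one of the", "i want to")
pairs <- lapply(texts, word_pairs)
print(pairs)
```

For the three-word rows in the question, the first and second elements of each result correspond to split2 and split3.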

Replace element in string after first occurrence

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.
This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
replacement = "\\1\\.",
x = my.data$my.string,
perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is a direct modification of an answer to a very similar question, adapted to make it R specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.
This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to match only the first occurrence and change it to # (or some other placeholder character which doesn't show up elsewhere in my.string), then use gsub to replace all remaining 2s, then gsub # back into 2.
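Another option that avoids both the lookbehind and the placeholder character is to locate the first 2 with regexpr, keep everything up to and including it, and replace only in the remainder. This is just a sketch: the function name replace_after_first and its arguments are made up for illustration, and it assumes the target is a single character.

```r
# Hypothetical helper: keep the first `target`, replace all later ones.
# Assumes `target` is a single literal character.
replace_after_first <- function(s, target = "2", replacement = ".") {
  first <- regexpr(target, s, fixed = TRUE)  # position of first match, -1 if none
  kept <- substr(s, 1, first)                # up to and including the first match
  rest <- substring(s, first + 1)            # everything after it
  paste0(kept, gsub(target, replacement, rest, fixed = TRUE))
}

x <- c(".1.222.2.2", "..1..1..2.", "1.1.2.2...", ".222.232..", "..1..1....")
result <- replace_after_first(x)
print(result)
```

When a string contains no 2 at all, regexpr returns -1, so kept is empty and rest is the whole string, which gsub leaves unchanged.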

Combining lines in character vector in R

I have a character vector (content) of about 50,000 lines in R. However, some of the lines when read in from a text file are on separate lines and should not be. Specifically, the lines look something like this:
[1] hello,
[2] world
[3] ""
[4] how
[5] are
[6] you
[7] ""
I would like to combine the lines so that I have something that looks like this:
[1] hello, world
[2] how are you
I have tried to write a for loop:
for(i in 1:length(content)){
if(content[i+1] != ""){
content[i+1] <- c(content[i], content[i+1])
}
}
But when I run the loop, I get an error: missing value where TRUE/FALSE needed.
Can anyone suggest a better way to do this, maybe not even using a loop?
Thanks!
EDIT:
I am actually trying to apply this to a Corpus of documents that are all many thousands lines each. Any ideas on how to translate these solutions into a function that can be applied to the content of each of the documents?
You don't need a loop to do that:
x <- c("hello,", "world", "", "how", "\nare", "you", "")
dummy <- paste(
  c("\n", sample(letters, 20, replace = TRUE), "\n"),
  collapse = ""
) # complex random string as a split marker
x[x == ""] <- dummy # replace empty strings with the split marker
y <- paste(x, collapse = " ") # make one long string
z <- unlist(strsplit(y, dummy)) # cut the string at the split marker
gsub(" $", "", gsub("^ ", "", z)) # remove the space at start and end
I think there are more elegant solutions, but this might be usable for you:
chars <- c("hello,","world","","how","are","you","")
### identify groups that belong together (id increases each time a "" is found)
ids <- cumsum(chars=="")
# split the vector (and filter out "" by using the select vector)
select <- chars!=""
splitted <- split(chars[select], ids[select])
# paste the groups together
res <- sapply(splitted,paste, collapse=" ")
# remove names (if necessary, probably not)
res <- unname(res) # thanks @Roland
> res
[1] "hello, world" "how are you"
Here's a different approach using data.table which is likely to be faster than for or *apply loops:
library(data.table)
dt <- data.table(x)
dt[, .(paste(x, collapse = " ")), rleid(x == "")][V1 != ""]$V1
#[1] "hello, world" "how are you"
Sample data:
x <- c("hello,", "world", "", "how", "are", "you", "")
Replace the "" with something you can later split on, then collapse the elements into one string, and then use strsplit(). Here I have used the newline character as the marker: if you just print the pasted string, each phrase already appears on its own line, e.g. cat(txt3) will output each phrase on a separate line.
txt <- c("hello", "world", "", "how", "are", "you", "", "more", "text", "")
txt2 <- gsub("^$", "\n", txt)
txt3 <- paste(txt2, collapse = " ")
unlist(strsplit(txt3, "\\s\n\\s*"))
## [1] "hello world" "how are you" "more text"
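Another way to add to the mix (the output below is for the sample data with a third group appended):
x <- c("hello,", "world", "", "how", "are", "you", "", "more", "text", "")
tapply(x[x != ''], cumsum(x == '')[x != '']+1, paste, collapse=' ')
#              1              2              3
# "hello, world"  "how are you"    "more text"
Group the non-empty strings by the number of empty strings seen so far, and paste the elements together by group.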
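Regarding the EDIT in the question about applying this to many documents: one possible sketch is to wrap the cumsum grouping idea from the answers above in a function and lapply it over the documents. The function name combine_lines and the list docs are hypothetical, and the sketch assumes each document is available as a plain character vector of lines (how to extract that from a Corpus object is not covered here).

```r
# Hypothetical helper: join consecutive non-empty lines, using "" as separator.
combine_lines <- function(lines) {
  keep <- lines != ""
  groups <- cumsum(lines == "")[keep]   # group id grows at every empty line
  joined <- vapply(split(lines[keep], groups), paste, character(1), collapse = " ")
  unname(joined)
}

# Assumed input: a named list with one character vector of lines per document
docs <- list(
  doc1 = c("hello,", "world", "", "how", "are", "you", ""),
  doc2 = c("more", "text", "")
)
combined <- lapply(docs, combine_lines)
print(combined)
```

Because the work per document is vectorised, the only explicit iteration is the lapply over documents.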

R regex matching for tweet pattern

I am trying to use the regex feature in R to parse some tweet text into its key words. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
However, one of my sentences has the sequence "\ud83d\udc4b". The parsing fails for this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried substituting the regex "\u+", but that did not match. What regex should I use to match this sequence? Thanks.
I think you want something like this,
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
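Below is a function that removes non-ASCII characters from every column of a data frame. It expects a data frame, so wrap a character vector first:
> sentence.df = RemoveNotASCII(data.frame(sentence))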
RemoveNotASCII <- function(x) {
  # Remove all non-ASCII characters, column by column, from the data frame x
  z <- list()
  for (j in seq_len(ncol(x))) {
    y <- as.character(x[, j])
    # treat the bytes as latin1, then drop everything that is not ASCII
    Encoding(y) <- "latin1"
    y <- iconv(y, "latin1", "ASCII", sub = "")
    z[[j]] <- y
  }
  z <- do.call("cbind.data.frame", z)
  names(z) <- names(x)
  z # data frame with non-ASCII characters removed
}
The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
tolower(rm_non_ascii(s))
## [1] "delta"
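Putting the pieces together, one way to avoid the "invalid input" error is to strip non-ASCII characters before any of the gsub calls from the question run. The sketch below assumes the tweets are UTF-8 encoded and spells out the from and to arguments of iconv; the wrapper name clean_tweet is made up for illustration.

```r
# Hypothetical wrapper around the cleaning steps from the question.
# Assumes the input is UTF-8 encoded text.
clean_tweet <- function(sentence) {
  # drop every byte that cannot be represented in ASCII (emoji, accents, ...)
  sentence <- iconv(sentence, from = "UTF-8", to = "ASCII", sub = "")
  sentence <- gsub("[[:punct:]]", "", sentence)
  sentence <- gsub("[[:cntrl:]]", "", sentence)
  sentence <- gsub("\\d+", "", sentence)
  tolower(sentence)
}

cleaned <- clean_tweet("Caf\u00e9 Delta, flight 123!")
print(cleaned)
```

Removing the digits leaves the whitespace around them in place, so a trimws() call may be wanted afterwards.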

Eliminating the characters that are not a date in R

I have some data frame, df with a column with dates that are in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, just leaving the date (e.g. 01/01/13 for the first row). I have tried both using sapply() to apply the strsplit() function, and tried to filter the content using a regex, but don't seem to have quite gotten it right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
Here's a nice way to use sapply, it uses strsplit to split at the space
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a look around assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
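The pattern removes everything from the first double 00 to the end of the string. Note that the space before the time is kept in the result, and that the pattern relies on every time starting with 00 and on no 00 appearing in the date itself.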
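A variant of the lookaround idea that does not depend on the time starting with 00 is to look ahead for the first whitespace character instead. This is only a sketch using the sample vector from above:

```r
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")

# (?=\\s) asserts that the next character is whitespace; .*$ then consumes
# the whitespace and everything after it, so no trailing space is left behind
dates <- sub("(?=\\s).*$", "", vec, perl = TRUE)
print(dates)
```

This behaves like the plain sub(" .+", "", vec) answer; the lookahead only marks where the removal starts.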