Data frame column vector manipulation

Data frame column vector manipulation - regex

I have a dataframe mydf:
Content term
1 Search Term: abc| NA
2 Search Term-xyz NA
3 Search Term-pqr| NA
Made a regex:
\Search Term[:]?.?([a-zA-Z]+)\
to get terms like abc xyz and pqr.
How do I extract these terms in the term column. I tried str_match and gsub, but not getting the correct results.

We can try with sub
sub(".*(\\s+|-)", "", df1$Content)
#[1] "abc" "xyz" "pqr"
Or
library(stringr)
str_extract(df1$Content, "\\w+$")
#[1] "abc" "xyz" "pqr"
Update
If the | is also found in the string at the end
gsub(".*(\\s+|-)|[^a-z]+$", "", df1$Content)
#[1] "abc" "xyz" "pqr"
Or
str_extract(df1$Content, "\\w+(?=(|[|])$)")
#[1] "abc" "xyz" "pqr"

Just to demonstrate the word function of stringr,
library(stringr)
df$term <- gsub('.*-', '', word(df$Content, -1))
gsub('[[:punct:]]', '', df$term)
#[1] "abc" "xyz" "pqr"

'gsub' will help you
content <- c("Search Term: abc|", "Search Term-xyz", "Search Term-pqr|")
term <- c(NA, NA, NA)
test123 <- as.data.frame(cbind(content, term))
test123$term <- as.character(gsub(".*(\\s+|-)|[^a-z]+$", "", test123$content))
test123
content term
1 Search Term: abc| abc
2 Search Term-xyz xyz
3 Search Term-pqr| pqr

Related

Split words in R Dataframe column

I have a data frame with words in a column separated by single space. I want to split it into three types as below. Data frame looks as below.
Text
one of the
i want to
I want to split it into as below.
Text split1 split2 split3
one of the one one of of the
I am able to achieve 1st. Not able to figure out the other two.
my code to get split1:
new_data$split1<-sub(" .*","",new_data$Text)
Figured out the split2:
df$split2 <- gsub(" [^ ]*$", "", df$Text)

We can try with gsub. Capture one or more non-white space (\\S+) as a group (in this case there are 3 words), then in the replacement, we rearrange the backreference and insert a delimiter (,) which we use for converting to different columns with read.table.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
"\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
# Text split1 split2 split3
#1 one of the one one of of the
#2 i want to i i want want to
data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))

There might be more elegant solutions. Here are two options:
Using ngrams:
library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]),
split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
Extract the two grams manually:
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, `[`, 1:2),
split3 = lapply(splits, `[`, 2:3)) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
Update:
With regular expression, we can use back reference of gsub.
Split2:
gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"
Split3:
gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the" "want to"

This is a bit of hackish solution.
Assumption :- you are not concerned about number of spaces between two words.
> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1 \\1 \\2 \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one" "one of" "of the"
#[[2]]
#[1] "i" "i want" "want to"

Replace element in string after first occurrence

I wish to replace all 2's in a string after the first occurrence of a 2, ideally using regex in base R. This seems like it must be a duplicate, but I cannot locate the answer.
Here is an example:
my.data <- read.table(text='
my.string
.1.222.2.2
..1..1..2.
1.1.2.2...
.222.232..
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text='
my.string
.1.2......
..1..1..2.
1.1.2.....
.2....3...
..1..1....
', header=TRUE, stringsAsFactors = FALSE)
desired.result
my.last.2 <- c(4, 9, 5, 2, NA)
my.last.2
Thank you for any assistance.

This appears to match your desired output:
> gsub(pattern = "(?<=2)(.*?)2",
replacement = "\\1\\.",
x = my.data$my.string,
perl = TRUE)
[1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
This is literally a directly modification from this answer to a very similar question to make it R specific. I'll be honest, I don't quite understand this regex, so use (and up-vote) with caution.

This works, but is probably inefficient:
with(my.data, gsub("#", "2", gsub("2", ".", sub("2", "#", my.string))))
# [1] ".1.2......" "..1..1..2." "1.1.2....." ".2....3..." "..1..1...."
Approach: Use sub to only match the first occurrence and change it to # (or some other placeholder character which doesn't show up elsewhere in my.string, then use gsub to replace all remaining 2s, then gsub # back into 2.

regular expression string between two [ ] in R

I am stuck on regular expressions yet again but this time in R.
The problem I am facing is that I a vector I would like to extract a string between two [] for each row in the vector. However, sometimes I have cases where there is more than one series of [ ] in the whole statement and so I am recovering all strings in each row that is in the [ ]. In all cases I just need to recover the first instance of the string in the [ ] not the second or more instances. The example dataframe I have is:
comp541_c0_seq1 gi|356502740|ref|XP_003520174.1| PREDICTED: uncharacterized protein LOC100809655 [Glycine max]
comp5041_c0_seq1 gi|460370622|ref|XP_004231150.1| [Solanum lycopersicum] PREDICTED: uncharacterized protein LOC101250457 [Solanum lycopersicum]
The code i have been using that recovers the string and the index and makes a vector in the new dataframe are:
pattern <- "\\[\\w*\\s\\w*]"
match<- gregexpr(pattern, data$Description)
data$Species <- regmatches(data$Description, match)
the structure of the dataframe that I am using is:
data.frame': 67911 obs. of 6 variables:
$ Column1 : Factor w/ 67911 levels "comp100012_c0_seq1 ",..: 3344 8565 17875 18974 19059 19220 21429 29791 40214 48529 ...
$ Description : Factor w/ 26038 levels "0.0","1.13142e-173",..: NA NA NA NA NA NA NA NA 7970 NA ...
So the problem with my pattern match is that it return a vector (Species) where some of the rows have:
[Glycine max] # this is good
c("[Solanum lycopersicum]", "[Solanum lycopersicum]") # I only need one set returned
What I would like is:
[Glycine max]
[Solanum lycopersicum]
I have been trying every way I can with the regular expression. Would anyone know how to improve what I have to just extract the first instance of the string within [ ]?
Thanks in advance.

I think this example should be illuminating to your problems:
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here")
pattern <- "\\[\\w*\\s\\w*]"
mat <- regexpr(pattern,txt)
#[1] 1 1 -1
#attr(,"match.length")
#[1] 14 15 -1
txt[mat != -1] <- regmatches(txt, mat)
txt
#[1] "[Bracket text]" "[Bracket text1]" "No brackets in here"
Or if you want to do it all in one go and return NA values for non-matches, try:
ifelse(mat != -1, regmatches(txt,mat), NA)
#[1] "[Bracket text]" "[Bracket text1]" NA

Using the base-R facilities for string manipulation is just making life hard for yourself. Use rebus to create the regular expression, and stringi (or stringr) to get the matches.
library(rebus)
library(stringi)
txt <- c("[Bracket text]","[Bracket text1] and [Bracket text2]","No brackets in here") # thanks, thelatemail
pattern <- OPEN_BRACKET %R%
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf) %R%
"]"
stri_extract_first_regex(txt, pattern)
## [1] "[Bracket text]" "[Bracket text1]" NA
I suspect that you probably don't want to keep those square brackets. Try this variant:
pattern <- OPEN_BRACKET %R%
capture(
alnum(1, Inf) %R%
space(1, Inf) %R%
alnum(1, Inf)
) %R%
"]"
stri_match_first_regex(txt, pattern)[, 2]
## [1] "Bracket text" "Bracket text1" NA

How to prevent regmatches drop non matches?

I would like to capture the first match, and return NA if there is no match.
regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1] 1 -1 3 1
# attr(,"match.length")
# [1] 1 -1 1 2
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1] "a" "a" "aa"
So I expected "a", NA, "a", "aa"

Staying with regexpr:
r <- regexpr("a+", x)
out <- rep(NA,length(x))
out[r!=-1] <- regmatches(x, r)
out
#[1] "a" NA "a" "aa"

use regexec instead, since it returns a list which will allow you to catch the character(0)'s before unlisting
R <- regmatches(x, regexec("a+", x))
unlist({R[sapply(R, length)==0] <- NA; R})
# [1] "a" NA "a" "aa"

In R 3.3.0, it is possible to pull out both the matches and the non-matched results using the invert=NA argument. From the help file, it says
if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).
The output is a list, typically, in most cases of interest, (matching a single pattern), regmatches with this argument will return a list with elements of either length 3 or 1. 1 is the case of where no matches are found and 3 is the case with a match.
myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] "" "a" "bc"
[[2]]
[1] "def"
[[3]]
[1] "cb" "a" " a"
[[4]]
[1] "" "aa" ""
So to extract what you want (with "" in place of NA), you can use sapply as follows:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a" "" "a" "aa"
At this point, if you really want NA instead of "", you can use
is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a" NA "a" "aa"
Some revisions:
Note that you can collapse the last two lines into a single line:
myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})
The default data type of NA is logical, so using it will result in additional data conversions. Using the character version NA_character_, avoids this.
An even slicker extraction method for the final line is to use [:
sapply(myMatch, `[`, 2)
[1] "a" NA "a" "aa"
So you can do the whole thing in a fairly readable single line:
sapply(regmatches(x, m, invert=NA), `[`, 2)

Using more or less the same construction as yours -
chars <- c("abc", "def", "cba a", "aa")
chars[
regexpr("a+", chars, perl=TRUE) > 0
][1] #abc
chars[
regexpr("q", chars, perl=TRUE) > 0
][1] #NA
#vector[
# find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]
Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.

You can use stringr::str_extract(string, pattern). It will return NA if there is no matches. It has simpler function interface than regmatches() as well.

R: removing the '$' symbols

I have downloaded some data from a web server, including prices that are formatted for humans, including $ and thousand separators.
> head(m)
[1] $129,900 $139,900 $254,000 $260,000 $290,000 $295,000
I was able to get rid of the commas, using
m <- sub(',','',m)
but
m <- sub('$','',m)
does not remove the dollar sign. If I try mn <- as.numeric(m) or as.integer I get an error message:
Warning message: NAs introduced by coercion
and the result is:
> head(m)
[1] NA NA NA NA NA NA
How can I remove the $ sign? Thanks

dat <- gsub('[$]','',dat)
dat <- as.numeric(gsub(',','',dat))
> dat
[1] 129900 139900 254000 260000 290000 295000
In one step
gsub('[$]([0-9]+)[,]([0-9]+)','\\1\\2',dat)
[1] "129900" "139900" "254000" "260000" "290000" "295000"

Try this. It means replace anything that is not a digit with the empty string:
as.numeric(gsub("\\D", "", dat))
or to remove anything that is neither a digit nor a decimal:
as.numeric(gsub("[^0-9.]", "", dat))
UPDATE: Added a second similar approach in case the data in the question is not representative.

you could also use:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
library(qdap)
as.numeric(mgsub(c("$", ","), "", x))
yielding:
> as.numeric(mgsub(c("$", ","), "", x))
[1] 129900 139900 254000 260000 290000 295000
If you wanted to stay in base use the fixed = TRUE argument to gsub:
x <- c("$129,900", "$139,900", "$254,000", "$260,000", "$290,000", "$295,000")
as.numeric(gsub("$", "", gsub(",", "", x), fixed = TRUE))

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Data frame column vector manipulation - regex

I have a dataframe mydf: Content term 1 Search Term: abc| NA 2 Search Term-xyz NA 3 Search Term-pqr| NA Made a regex: \Search Term[:]?.?([a-zA-Z]+)\ to get terms like abc xyz and pqr. How do I extract these terms in the term column. I tried str_match and gsub, but not getting the correct results.

Just to demonstrate the word function of stringr, library(stringr) df$term <- gsub('.*-', '', word(df$Content, -1)) gsub('[[:punct:]]', '', df$term) #[1] "abc" "xyz" "pqr"

Related

Split words in R Dataframe column

Replace element in string after first occurrence

regular expression string between two [ ] in R

How to prevent regmatches drop non matches?

R: removing the '$' symbols

Categories

Resources