Regex works, but not on strings in my vector

So I am attempting to use a regex to find a pattern and replace values within my single-column data frame. I basically want a regex that says "delete everything after the comma until the end of the string".
I wrote the expression, and it works on my dummy vector:
> library(stringr)
> pretendvector <- c("Hi","Hi,there","Hi there, how are you")
> str_replace(pretendvector, regex(',.*$'),'')
[1] "Hi" "Hi" "Hi there"
However, when I apply the same expression to my vector (since it's for stringr, I converted the data frame column to a vector), it returns every value in the column and does not apply the expression. Does anyone have any idea why this might be?

I guess the OP didn't assign the output from str_replace to a new object or update the original vector. In that case,
newvector <- str_replace(pretendvector, regex(',.*$'),'')
We can also do this using sub from base R
newvector <- sub(",.*", "", pretendvector)
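If the values live in a data frame column, the same idea applies: the result has to be assigned back to the column. A minimal sketch in base R, assuming a toy data frame `df` with a character column `txt` (both names are made up for illustration):

```r
# Toy data frame; `df` and `txt` are hypothetical names
df <- data.frame(txt = c("Hi", "Hi,there", "Hi there, how are you"),
                 stringsAsFactors = FALSE)

# This only prints the cleaned values; df itself is left untouched
sub(",.*", "", df$txt)

# Assigning the result back is what actually updates the column
df$txt <- sub(",.*", "", df$txt)
df$txt
## [1] "Hi"       "Hi"       "Hi there"
```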

Related

Incrementing a number in a string using sub

There's a string with a (single) number somewhere in it. I want to increment the number by one. Simple, right? I wrote the following without giving it a second thought:
sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), string)
... and got an NA.
> sub("([[:digit:]]+)", as.character(as.numeric("\\1")+1), "x is 5")
[1] NA
Warning message:
In sub("([[:digit:]]+)", as.character(as.numeric("\\1") + 1), "x is 5") :
NAs introduced by coercion
Why doesn't it work? I know other ways of doing this, so I don't need a "solution". I want to understand why this method fails.
The point is that the backreference is only evaluated during a match operation, and you cannot pass it to any function before that.
When you write as.numeric("\\1") the as.numeric function accepts a \1 string (a backslash and a 1 char). Thus, the result is expected, NA.
This happens because there is no built-in backreference interpolation in R.
You may use the gsubfn package:
> library(gsubfn)
> s <- "x is 5"
> gsubfn("\\d+", function(x) as.numeric(x) + 1, s)
[1] "x is 6"
It does not work because the arguments of sub are evaluated before they are passed to the regex engine (which gets called by .Internal).
In particular, as.numeric("\\1") evaluates to NA ... after that you're doomed.
It might be easier to think of it differently. You are getting the same error that you would get if you used:
print(as.numeric("\\1")+1)
Remember, the strings are passed to the function, where they are interpreted by the regex engine. The string \\1 is never transformed to be 5, since this calculation is done within the function.
Note that \\1 is not something that works as a number. NA seems to be similar to null in other languages:
NA... is a product of operation when you try to access something that is not there
From mpiktas' answer here.
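To make the evaluation order concrete, here is a small base R sketch. It first shows that the replacement argument is already NA before sub ever sees it, and then shows one possible way to do the arithmetic on the matched text itself with regexpr and regmatches (just one option among several, not the only approach):

```r
# The replacement is computed first, on the literal two-character string "\1"
repl <- suppressWarnings(as.character(as.numeric("\\1") + 1))
repl
## [1] NA

# One alternative: locate the match, then replace it with a computed value
s <- "x is 5"
m <- regexpr("[[:digit:]]+", s)   # position of the first run of digits
regmatches(s, m) <- as.character(as.numeric(regmatches(s, m)) + 1)
s
## [1] "x is 6"
```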

Extract info inside parenthesis in R

I have some rows, some have parenthesis and some don't. Like ABC(DEF) and ABC. I want to extract info from parenthesis:
ABC(DEF) -> DEF
ABC -> NA
I wrote
gsub(".*\\((.*)\\).*", "\\1", X)
It works well for ABC(DEF), but outputs "ABC" when there are no parentheses.
If you do not want to get ABC when using sub with your regex, you need to add an alternative that matches the whole string when there are no parentheses, so that it gets removed. Note that such elements become an empty string rather than NA.
X <- c("ABC(DEF)", "ABC")
sub(".*(?:\\((.*)\\)).*|.*", "\\1",X)
Note you do not have to use gsub, you only need one replacement to be performed, so a sub will do.
Also, a stringr str_match would also be handy for this task:
str_match(X, "\\((.*)\\)")
or
str_match(X, "\\(([^()]*)\\)")
Using str_extract() will work.
library(stringr)
df$`new column` <- str_extract(df$`existing column`, "(?<=\\().+?(?=\\))")
This creates a new column containing the text inside parentheses in an existing column. If a value has no parentheses, the new column is filled in with NA.
The inspiration for my answer comes from this answer on the original question about this topic
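For completeness, a base R sketch that also yields NA when there are no parentheses. It assumes at most one parenthesised group per element and uses perl = TRUE for the lookarounds; `out` is just an illustrative name:

```r
X <- c("ABC(DEF)", "ABC")

# Start with NA everywhere, then fill in the elements that have a match
out <- rep(NA_character_, length(X))
m <- regexpr("(?<=\\()[^()]*(?=\\))", X, perl = TRUE)
hit <- as.integer(m) != -1          # regexpr returns -1 where nothing matched
out[hit] <- regmatches(X, m)        # regmatches returns only the matched pieces
out
## [1] "DEF" NA
```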

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe column names (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset not those that contain : stat.mineBlock.minecraft. and stone like I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, so it gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
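A short sketch of the two return types, using a toy stand-in for `ingame` (the columns below are invented for illustration). The indices are still useful when you want to subset the data frame itself:

```r
# Toy stand-in for the real data frame
ingame <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.water"       = 2,
  "stat.mineBlock.minecraft.AAAstoneAAA" = 3,
  check.names = FALSE
)

idx <- grep("stone", colnames(ingame))                # integer positions
nms <- grep("stone", colnames(ingame), value = TRUE)  # the names themselves

idx
## [1] 1 3

# Either result can be used to select the matching columns
stone_df <- ingame[, idx, drop = FALSE]
```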
Here is a solution in dplyr:
library(dplyr)
your_df %>%
select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
My answer is based on this SO post. As per the regex, you were very close.
Note that [] creates a character class that matches a single character from the defined set, which is the main reason your pattern was not working. Also, perl=T is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
"stat.mineBlock.minecraft.123456stone" = 1,
"stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
"stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
"stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)
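To see why the bracketed version matched every column, here is a small comparison on made-up column names (the third name is hypothetical). The bracketed pattern only asks for one character from each set, so anything starting with "st" already satisfies it:

```r
cols <- c("stat.mineBlock.minecraft.123456stone",
          "stat.mineBlock.minecraft.DFHFFBSBwater2",
          "stat.craftItem.minecraft.stonebrick")

# [ ... ] matches ONE character from the set, not the literal text
bracketed <- grepl("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone]",
                   cols, ignore.case = TRUE, perl = TRUE)
bracketed
## [1] TRUE TRUE TRUE

# Literal text with escaped dots matches only the intended column
literal <- grepl("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*stone",
                 cols, ignore.case = TRUE, perl = TRUE)
literal
## [1]  TRUE FALSE FALSE
```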

Remove square brackets from a string vector

I have a character vector in which each element is enclosed in brackets. I want
to remove the brackets and just have the string.
So I tried:
n = c("[Dave]", "[Tony]", "[Sara]")
paste("", n, "", sep="")
Unfortunately, this doesn't work for some reason.
I've performed the same task before using this same code, and am not sure why it's not working this time.
I want to go from '[Dave]' to 'Dave'.
What am I doing wrong?
You could gsub out the brackets like so:
n = c("[Dave]", "[Tony]", "[Sara]")
gsub("\\[|\\]", "", n)
[1] "Dave" "Tony" "Sara"
A regular expression substitution will do it. Look at the gsub() function.
This gives you what you want (it removes any instance of '[' or ']'):
gsub("\\[|\\]", "", n)
The other answers should be enough to get your desired output. I just wanted to provide a brief explanation of why what you tried didn't work.
paste concatenates character strings. If you paste an empty character string, "", to something with a separator that is also an empty character string, you really haven't altered anything. So paste can't make a character string shorter; the result will either be the same (as in your example) or longer.
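A quick check of that point, plus a regex-free alternative. The substr approach is only a sketch and assumes every element starts and ends with exactly one bracket:

```r
n <- c("[Dave]", "[Tony]", "[Sara]")

# Pasting empty strings with an empty separator changes nothing
pasted <- paste("", n, "", sep = "")
identical(pasted, n)
## [1] TRUE

# Assuming one leading and one trailing bracket, drop the first and last character
stripped <- substr(n, 2, nchar(n) - 1)
stripped
## [1] "Dave" "Tony" "Sara"
```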
If working within tidyverse:
library(tidyverse); library(stringr)
n = c("[Dave]", "[Tony]", "[Sara]")
n %>% str_replace_all("\\[|\\]", "")
[1] "Dave" "Tony" "Sara"

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
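For readers who want the closest analogue of the Excel MID/FIND formula from the question, a possible sketch is to find the position of the first space with regexpr and cut with substr. Note the assumption that every element contains a space; an element without one would come back as an empty string:

```r
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")

# Position of the first space in each element (like FIND(" ", ...))
pos <- regexpr(" ", x, fixed = TRUE)

# Characters 1 .. pos - 1 (like MID(..., 1, pos - 1))
first_word <- substr(x, 1, pos - 1)
first_word
## [1] "USDZAR" "R157"   "SPX"
```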
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
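A rough way to time two of the approaches above with system.time from base R. The vector is repeated to make the difference measurable, and the timings depend on the machine, so none are shown here:

```r
# Repeat the sample data so that timings are not all zero
x <- rep(c("USDZAR Curncy", "R157 Govt", "SPX Index"), times = 1e4)

t_sub <- system.time(
  r1 <- sub(" .*", "", x)
)
t_split <- system.time(
  r2 <- vapply(strsplit(x, " ", fixed = TRUE), `[`, character(1), 1,
               USE.NAMES = FALSE)
)

# Both approaches should give the same answer; compare the elapsed times
identical(r1, r2)
t_sub["elapsed"]
t_split["elapsed"]
```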