Using gsub in R to remove values in Zip Code field - regex

I have a data frame that contains columns of values, one of which is United States Postal Zip codes.
Row_num  Restaurant  Address              City      State  Zip
26698    m           1460 Memorial Drive  Chicopee  MA     01020-3964
For this entry, I want to keep only the 5-digit zip code 01020 and remove the "-3964" after it, and do this for every entry in my data frame. Right now the Zip column is being treated as chr (character) by R.
I have tried the following gsub code:
df$Zip <- gsub(df$Zip, pattern="-[0,9]{0,4}", replacement = "")
However, all that does is remove the "-" and leave the digits after it. That is neither what I want nor what I expected, so any help explaining how gsub behaves and how to get the desired result would be appreciated.
Thank you!
Edit: I have found out through trial and error that this block of code works as well
df$Zip <- gsub(df$Zip, pattern="-.*", replacement = "")

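The character class you defined has only three elements: 0, 9, and ",". Since 3 is not one of them, [0,9]{0,4} matches zero characters after the dash, so only the "-" is removed. Inside character class brackets you need to use a dash as the range operator, so try: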
df$Zip <- gsub(df$Zip, pattern="-[0-9]{0,4}", replacement = "")
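To see why the original pattern removed only the dash, here is a minimal sketch in Python, whose re module handles these two simple character classes the same way. The sample value comes from the question; the variable names are only for illustration.

```python
import re

zip_code = "01020-3964"

# [0,9] is a set of three literal characters: "0", ",", and "9".
# "3" is not in it, so {0,4} matches zero characters and only "-" is removed.
broken = re.sub(r"-[0,9]{0,4}", "", zip_code)

# [0-9] is a range covering every digit, so the whole suffix is removed.
fixed = re.sub(r"-[0-9]{0,4}", "", zip_code)

print(broken)  # 010203964
print(fixed)   # 01020
```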

Splitting up data in Google Spreadsheet for adjacent columns

I would like to split up data from a spreadsheet column (Work ID#, last name, first name) into two adjacent columns, so that I end up with 3 separate columns for this data.
Here is a link to my spreadsheet: https://docs.google.com/spreadsheets/d/1gsLYnrNHEMbZTML1YZG-NhtDYO_FAQaAPwGFvtqn2kg/edit?usp=sharing
I'm working on the basis that column B in your sheet (last_name) contains your merged data, (employee_ID last_name, first_name).
You can do this with three formulas, using regexreplace and regexextract.
employee_ID
Put this in cell A1:
=arrayformula({"employee_ID";iferror(if($B2:$B="","",value(regexextract($B2:$B,"^.*\d"))),"")})
first_name
Put this in cell C1:
=arrayformula({"first_name";if($B2:$B="","",regexreplace($B2:$B,"^.*\,\ ",""))})
last_name
I created a new column D in front of department. This goes in cell D1:
=arrayformula({"last_name";if($B2:$B="","",regexreplace(regexreplace($B2:$B,".*\d\ ",""),"\,.*",""))})
Summary
Each one builds an array using {}. The first part of the array is the column heading in "", like "employee_ID". This lets you keep your formula in row 1 in case you want to add a filter view on the dataset. Then a ; starts a new row, followed by the rest of the cells below: iferror(if($B2:$B="","",value(regexextract($B2:$B,"^.*\d"))),"").
regexextract($B2:$B,"^.*\d") looks for anything at the start ^.*, then a number \d.
value() converts the result to a number. iferror handles where a number can't be found.
On first_name, anything at the beginning ^.* followed by a comma \, then a space \ is replaced with "" (ie. removed).
On last_name, .*\d\ replaces anything .* followed by a number \d followed by a space \ with nothing "" (ie. removed), then from that result another replace removes a comma \, followed by anything .*.
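The three patterns described above can be tried outside the spreadsheet. This is a rough Python sketch, not the Sheets formulas themselves; the cell value "1042 Smith, John" is a made-up example of the "employee_ID last_name, first_name" layout.

```python
import re

# Hypothetical cell value in the "employee_ID last_name, first_name" layout.
cell = "1042 Smith, John"

# regexextract(..., "^.*\d"): everything from the start up to the last digit.
employee_id = int(re.search(r"^.*\d", cell).group())

# regexreplace(..., "^.*\,\ ", ""): drop everything up to the comma and space.
first_name = re.sub(r"^.*, ", "", cell)

# Nested regexreplace: drop the ID and its trailing space,
# then drop the comma and whatever follows it.
last_name = re.sub(r",.*", "", re.sub(r".*\d ", "", cell))

print(employee_id, last_name, first_name)  # 1042 Smith John
```

Because ^.*\d is greedy, it runs up to the last digit in the cell, so a name that itself contains a digit would be swallowed into the ID.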
Formula in one cell
If you want to do the split in one go, in say new columns F,G and H, put this in cell F1:
=arrayformula(iferror(trim(split({"employee_ID,last_name,first_name";if($B2:$B="","",value(regexextract($B2:$B,"^.*\d"))&", "&regexreplace($B2:$B,"^.*\d\ ",""))},",")),""))
However, employee_ID is then formatted as text.
I believe the easiest for you is to use the following logic and single formula
=ArrayFormula(IFERROR(SPLIT(REGEXREPLACE(A2:A,"(\d+) (.+), (.+)","$1#$2#$3"),"#")))
Please write the headings manually. Including them in the formula unnecessarily complicates things. Just make sure that everything below B2, C2 and D2 is cleared.
Please let us know if you need more info on how the formula works.
Functions used:
ArrayFormula
IFERROR
SPLIT
REGEXREPLACE
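As a rough illustration of how the single formula works, the same capture-group trick can be sketched in Python. The function name split_record and the sample cells are assumptions for the example, and the sketch assumes "#" never appears in the data.

```python
import re

def split_record(cell):
    # Rewrite "ID last, first" as "ID#last#first", then split on "#".
    # A cell that does not match the pattern is returned unchanged as one field.
    marked = re.sub(r"(\d+) (.+), (.+)", r"\1#\2#\3", cell)
    return marked.split("#")

print(split_record("1042 Smith, John"))  # ['1042', 'Smith', 'John']
print(split_record("no id here"))        # ['no id here']
```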

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text in this case is first cleaned with REGEXREPLACE(A1,"_+\d*$","") to remove 123421; SPLIT then gives you a column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
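The same two steps can be checked quickly in Python. This is only a sketch of the logic (strip the trailing number, then split), not the Sheets functions themselves.

```python
import re

text = "campaign1_attribute1_whatever_yes_123421"

# Step A2: strip the trailing underscore(s) and digits.
trimmed = re.sub(r"_+\d*$", "", text)

# Step A3: split what is left on "_".
fields = trimmed.split("_")

print(trimmed)  # campaign1_attribute1_whatever_yes
print(fields)   # ['campaign1', 'attribute1', 'whatever', 'yes']
```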
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets was that I need to use it in Google Data Studio (same regex functions as spreadsheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!
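For anyone who wants to test the pattern outside Data Studio, here is a small Python sketch of the same idea. The helper name nth_field is made up for the example, and Python's re module stands in for REGEXP_EXTRACT.

```python
import re

def nth_field(text, n):
    # n is the zero-based field index, i.e. (column number - 1).
    # Skip n "field_" prefixes, then capture the next field, which must
    # itself be followed by "_".
    match = re.search(r"^(?:[^_]*_){%d}([^_]*)_" % n, text)
    return match.group(1) if match else None

campaign = "campaign1_attribute1_whatever_yes_123421"
print([nth_field(campaign, n) for n in range(4)])
# ['campaign1', 'attribute1', 'whatever', 'yes']
```

The trailing "_" in the pattern is what excludes the final number: with n = 4 the captured text would have to be followed by another underscore, so nothing matches.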

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe column names (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone, as I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default in grep is value=FALSE, so it gives you the indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
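The difference between the two return modes can be imitated in a few lines of Python. The grep function below is a hypothetical stand-in written for this example, not R's implementation, and the column names are invented.

```python
import re

def grep(pattern, names, value=False):
    # Hypothetical helper imitating R's grep(): 1-based indices by default,
    # the matching strings themselves when value=True.
    hits = [(i, name) for i, name in enumerate(names, start=1)
            if re.search(pattern, name)]
    return [name for _, name in hits] if value else [i for i, _ in hits]

columns = [
    "player.id",
    "stat.mineBlock.minecraft.stone",
    "stat.useItem.minecraft.stone_axe",
]
print(grep("stone", columns))              # [2, 3]
print(grep("stone", columns, value=True))
```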
Here is a solution in dplyr:
library(dplyr)
your_df %>%
select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
My answer is based on this SO post. As for the regex, you were very close.
The main reason it was not working is that [] creates a character class, which matches a single character from the defined set, so [stone] matches one of s, t, o, n or e rather than the word "stone". Also, perl=T is generally safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
"stat.mineBlock.minecraft.123456stone" = 1,
"stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
"stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
"stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)
See IDEONE demo
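To make the character-class problem concrete, here is a small Python comparison of the original and corrected patterns. The third column name is a made-up example, and Python's re module is used in place of R's grep.

```python
import re

columns = [
    "stat.mineBlock.minecraft.123456stone",
    "stat.mineBlock.minecraft.DFHFFBSBwater2",
    "stat.craftItem.minecraft.planks",  # hypothetical extra column
]

# Original pattern: each [...] matches ONE character from the set, so any
# name starting with "st" already satisfies it ("s" from the first class,
# "t" from [stone]).
broken = r"^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?"

# Corrected pattern: literal prefix with escaped dots, literal "stone".
fixed = r"^stat\.mineBlock\.minecraft\.[a-zA-Z0-9]*?stone"

broken_hits = [c for c in columns if re.search(broken, c, re.IGNORECASE)]
fixed_hits = [c for c in columns if re.search(fixed, c, re.IGNORECASE)]

print(len(broken_hits))  # 3
print(fixed_hits)        # ['stat.mineBlock.minecraft.123456stone']
```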

R: how to convert part of a string to variable name and return its value in the same string?

Suppose I have a string marco <- 'polo'. Is there any way I can embed marco in the middle of another string, e.g. x <- 'John plays water marco.' and have x return 'John plays water polo.'?
EDIT
The solution David kindly offered does work for the hypothetical problem I posted above, but what I was trying to get to was this:
data <- c('kek','koki','ukak','ikka')
V <- c('a|e|i|o|u')
Rather than deleting all vowels, which the solution can manage (gsub(V,'',data)), how do I specify, say, all vowels between two k's? Obviously gsub('kVk','',data) doesn't work. Any help would be greatly appreciated.
If you want all vowels between two "k" letters removed, I propose the following:
V <- '[aeiou]'
data <- c('kek', 'koki', 'ukak', 'ikka', 'keeuiokaeioukaeiousk')
gsub(paste0('(?:\\G(?!^)|[^k]*k(?=[^k]+k))\\K', V), '', data, perl=T)
# [1] "kk" "kki" "ukk" "ikka" "kkksk"
The \G feature is an anchor that can match at one of two positions: the start of the string, or the position at the end of the last match. \K resets the starting point of the reported match, so any previously consumed characters are no longer included, which is similar to a lookbehind.
Regular Expression Explanation
Or, to use the example as given:
V <- 'a|e|i|o|u' ## or equivalently '[aeiou]'
dd <- c('kek','koki','ukak','ikka','kaaaak')
gsub(paste0("k(",V,")+k"),"kk",dd)
## [1] "kk" "kki" "ukk" "ikka" "kk"
I guessed that you might (?) want to delete multiple vowels between ks; I added a + to the regular expression to do this.
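One thing to be aware of with the replace-with-"kk" approach is that each match consumes both k's, so a k shared by two vowel runs is only used once. The Python sketch below compares it with a lookaround variant that checks the surrounding k's without consuming them. The extra word "kekek" is an assumption added to show the difference, and the lookaround pattern is an alternative, not something from the answers above.

```python
import re

words = ["kek", "koki", "ukak", "ikka", "kaaaak", "kekek"]

# Replace-with-"kk" approach: the match includes both k's.
consuming = [re.sub(r"k(a|e|i|o|u)+k", "kk", w) for w in words]

# Lookaround variant: the k's are required but not part of the match,
# so the middle k in "kekek" can close one run and open the next.
lookaround = [re.sub(r"(?<=k)[aeiou]+(?=k)", "", w) for w in words]

print(consuming)   # ['kk', 'kki', 'ukk', 'ikka', 'kk', 'kkek']
print(lookaround)  # ['kk', 'kki', 'ukk', 'ikka', 'kk', 'kkk']
```

Both versions only remove vowel runs that sit directly between two k's; they leave a run like "aeious" alone because another consonant interrupts it.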

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
##       V1     V2
## 1 USDZAR Curncy
## 2   R157   Govt
## 3    SPX  Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect @Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.
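As a sketch of what "just time it" might look like, here is the regex approach next to a plain split, written in Python rather than R. The function names are made up for the example, and the timings depend on the machine and the data, so no numbers are claimed here.

```python
import re
import timeit

x = ["USDZAR Curncy", "R157 Govt", "SPX Index"]

def with_regex(values):
    # Same idea as sub(" .*", "", x): drop the first space and everything after it.
    return [re.sub(r" .*", "", v, count=1) for v in values]

def with_split(values):
    # Same idea as strsplit(x, " ", fixed=TRUE) followed by taking element 1.
    return [v.split(" ", 1)[0] for v in values]

print(with_regex(x))  # ['USDZAR', 'R157', 'SPX']
print(with_split(x))  # ['USDZAR', 'R157', 'SPX']

# Print how long 10,000 runs of each approach take on this machine.
for fn in (with_regex, with_split):
    print(fn.__name__, timeit.timeit(lambda: fn(x), number=10_000))
```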