R: gsub inserting whitespaces between capture groups

I'm desperately trying to insert whitespace between capture groups. My naive approach was
c = c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")
gsub("(.+[a-z])([A-Z].+)","\\1 \\2", c, perl=T)
which only inserts a space between the last two capitalized words. Using
gsub("(?=([a-z][A-Z]))"," ", c, perl = T)
almost works, but the result is shifted by one character:
"Wester nSahar aRegion" "Columbi aState" "On eTw oThre eFou rFiv eSix"
How can I elegantly get
"Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
strsplit() unfortunately doesn't keep the capture group :/

We can either use regex lookarounds
gsub('(?<=[a-z])(?=[A-Z])', ' ', c, perl=TRUE)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
Or use capture groups
gsub('([a-z])([A-Z])', '\\1 \\2', c)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
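The sketch below puts the question's lookahead-only attempt next to the two working patterns, to show where each match sits. The variable name `places` is an illustrative choice (it avoids masking base::c()); the patterns are the ones quoted above.

```r
# Illustrative input; `places` is a name chosen for this sketch.
places <- c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")

# Lookahead only (the question's attempt): the zero-width match sits
# *before* the lowercase letter, so the space lands one character early.
shifted <- gsub("(?=([a-z][A-Z]))", " ", places, perl = TRUE)

# Lookbehind plus lookahead: the zero-width match sits exactly between
# the lowercase and the uppercase letter.
by_lookaround <- gsub("(?<=[a-z])(?=[A-Z])", " ", places, perl = TRUE)

# Capture groups: both letters are consumed and written back with a
# space between them.
by_groups <- gsub("([a-z])([A-Z])", "\\1 \\2", places)

print(shifted)
print(by_lookaround)
print(identical(by_lookaround, by_groups))
```

Neither working pattern splits runs of consecutive capitals (for example an acronym), because both require a lowercase letter on the left side of the boundary.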

Related

Remove multiple periods from character string

I have some similar column names for example:
Eagles.....Brown.Bears.......
Western.Bulls......Great.Lions....
I would like to extract the words. For example from the first:
'Eagles' and 'Brown.Bears'
for the second:
'Western.Bulls' and 'Great.Lions'
There are always more than two periods between team names (the exact number varies), and there is always exactly one period in place of a space within a team name.

We can use str_extract_all
library(stringr)
str_extract_all(str1, "\\w+(\\.\\w+)?")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
Or using strsplit from base R
strsplit(str1, "\\.{2,}")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
data
str1 <- c("Eagles.....Brown.Bears.......", "Western.Bulls......Great.Lions....")
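If stringr is not available, the same extraction can probably be done in base R with gregexpr() and regmatches(). This is a minimal sketch that assumes, as in the sample data, that a team name has at most two words; the names `teams` and `teams_spaced` are illustrative.

```r
str1 <- c("Eagles.....Brown.Bears.......", "Western.Bulls......Great.Lions....")

# gregexpr() finds every match of the pattern; regmatches() extracts them.
# The pattern is the same one passed to str_extract_all above.
teams <- regmatches(str1, gregexpr("\\w+(\\.\\w+)?", str1))

# Optional extra step (an assumption about the desired result): turn the
# single period inside a team name back into a space.
teams_spaced <- lapply(teams, function(x) gsub(".", " ", x, fixed = TRUE))

print(teams)
print(teams_spaced)
```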

Regular expression: matching multiple words

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:
MEDIUM /REGULAR INSEAM
XX LARGE /SHORT INSEAM
SMALL /32" INSM
X LARGE /30" INSM
I have to capture two things: the value before the / as a whole (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.
The regular expression I am using for the first two is ([A-Z]\w+) \/([A-Z]\w+) INSEAM, and for the last two I am using ([A-Z]\w+) \/([0-9][0-9])[" INSM].
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL, but fails for X LARGE, XX LARGE, etc. Is there a way I can modify it to capture two occurrences of a word before the / character? Or is there a better way to do it?
Thanks in advance!

From your description, Wiktor's regex will fail on "XX LARGE/SHORT" due to the extra space. It is safer to capture everything before the forward slash as a group:
sub("^(.*/\\w+).*", "\\1", x)
#[1] "MEDIUM /REGULAR" "XX LARGE /SHORT" "SMALL /32" "X LARGE /30"

It seems you can use
(\w+(?: \w+)?) */ *(\w+)
Pattern details:
(\w+(?: \w+)?) - Group 1 capturing one or more word chars followed with an optional sequence of a space + one or more word chars
*/ * - a / enclosed with 0+ spaces
(\w+) - Group 2 capturing 1 or more word chars
R code with stringr:
> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"
[3,] "SMALL /32"       "SMALL"    "32"
[4,] "X LARGE /30"     "X LARGE"  "30"
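For readers who prefer base R, a rough equivalent of the str_match call can be sketched with regexec() and regmatches(). The column names `size` and `inseam` are hypothetical, and the rbind step assumes every element matches the pattern (a non-matching element would yield an empty vector and break the matrix shape).

```r
v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")

# regexec() records the full match and each capture group;
# perl = TRUE is needed for the non-capturing group (?: ...).
m <- regmatches(v, regexec("(\\w+(?: \\w+)?) */ *(\\w+)", v, perl = TRUE))

# One row per input: column 1 is the full match, columns 2 and 3 the groups.
parts <- do.call(rbind, m)

# Hypothetical column names for the two captured pieces.
sizes <- data.frame(size = parts[, 2], inseam = parts[, 3],
                    stringsAsFactors = FALSE)
print(sizes)
```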

R get first letters of double/triple-barrel surnames in data.frame

I have a dataframe with 2 columns:
> df1
      Surname      Name
1 The Builder       Bob
2  Zeta-Jones Catherine
I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:
      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ
Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.
I have tried:
Step1:
> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The" "Builder"
[[2]]
[1] "Zeta-Jones"
My research suggests I could possibly use strtrim as a Step 2, but all I have found is a number of ways how not to do it.

You can target the space, hyphen, and beginning of the line with lookarounds. For instance, any character (.) not preceded by the beginning of the line, a space, or a hyphen should be replaced with "":
with(df1, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"
or
with(df1, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))
The second gsub replaces with an empty string ("") any character that is preceded by a character other than " " or "-". In both versions the separators themselves are removed too, because each one follows a letter.
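As an alternative to the lookarounds, the same initials can be built by splitting on the separators and keeping the first letter of each piece. This is only a sketch: the data frame is rebuilt here from the question's sample so the block runs on its own, and `words` is an illustrative name.

```r
# Sample data reconstructed from the question.
df1 <- data.frame(Surname = c("The Builder", "Zeta-Jones"),
                  Name = c("Bob", "Catherine"),
                  stringsAsFactors = FALSE)

# Split each surname on runs of spaces or hyphens.
words <- strsplit(df1$Surname, "[ -]+")

# Take the first character of every piece and glue the pieces together.
df1$Shortened_Surname <- vapply(
  words,
  function(w) paste(substr(w, 1, 1), collapse = ""),
  character(1)
)
print(df1)
```

This version extends the Step 1 strsplit attempt from the question: the only changes are the character class for the separator and the substr/paste step.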

You can try this, if the format of the names is as shown in the input data:
library(stringr)
df1$Shortened_Surname <- sapply(str_extract_all(df1$Surname, '[A-Z]'), function(x) paste(x, collapse = ''))
Output is as follows:
      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ
If the format of the names is somewhat inconsistent, you will need to modify the above pattern to capture that. You can use the | (alternation) operator inside the pattern to combine multiple alternatives.

R - Remove dashes from a column with phone numbers

I'd like to create a new column of phone numbers with no dashes. I have data that is a mix of just numbers and some numbers with dashes. The data looks as follows:
Phone
555-555-5555
1234567890
555-3456789
222-222-2222
51318312491

Since you are dealing with a very straightforward substitution, you can easily use gsub to find the character you want to remove and replace it with nothing.
Assuming your dataset is called "mydf" and the column of interest is "Phone", try this:
gsub("-", "", mydf$Phone)

Building on the answer of @Ananda Mahto, it seemed useful to show how to break the numbers up again and put parentheses around the area code.
phone <- c("1234567890", "555-3456789", "222-222-2222", "5131831249")
phone <- gsub("-", "", phone)
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1) \\2 \\3", phone)
[1] "(123) 456 7890" "(555) 345 6789" "(222) 222 2222" "(513) 183 1249"
The second regex creates three capture groups, two with three digits and the final one with four. Then R substitutes them back in with a space between each and ( ) around the first one. You could also put a hyphen between capture group 2 and capture group 3.
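The question's data also contains an 11-digit entry (51318312491), which the three-group pattern does not match, so it would be returned unchanged. A possible way to make that explicit is sketched below; `format_phone` is a hypothetical helper, and returning NA for anything that is not exactly ten digits is an assumption about the desired behaviour.

```r
# Hypothetical helper: strip dashes, then format only 10-digit numbers.
format_phone <- function(phone) {
  digits <- gsub("-", "", phone, fixed = TRUE)

  # Entries that are not exactly ten digits are flagged as NA (assumption).
  ok <- grepl("^[0-9]{10}$", digits)
  out <- rep(NA_character_, length(digits))

  # Same three capture groups as above, with a hyphen between groups 2 and 3.
  out[ok] <- sub("^([0-9]{3})([0-9]{3})([0-9]{4})$", "(\\1) \\2-\\3", digits[ok])
  out
}

phone <- c("555-555-5555", "1234567890", "555-3456789", "51318312491")
print(format_phone(phone))
```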

remove comma from a digits portion string

How can I (preferably as fast as possible) remove commas from the digit parts of a string without affecting the rest of the commas in the string? In the example below I want to remove the commas from the number portions, but the comma after dogs should remain (yes, I know the comma placement in 102,345,5 is wrong; I'm just throwing a corner case out there).
What I have:
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
Desired outcome:
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Stipulation: must be done in base R, with no add-on packages.
Thank you in advance.
EDIT:
Thank you Dason, Greg and Dirk. All of your responses worked very well. I was playing with something close to Dason's response but had the comma inside the parentheses. Looking at it now, that doesn't even make sense. I microbenchmarked the responses, as I need speed here (text data):
Unit: microseconds
         expr     min      lq  median      uq     max
1  Dason_0to9  14.461  15.395  15.861  16.328  25.191
2 Dason_digit  21.926  23.791  24.258  24.725  65.777
3        Dirk 127.354 128.287 128.754 129.686 154.410
4      Greg_1  18.193  19.126  19.127  19.594  27.990
5      Greg_2 125.021 125.954 126.421 127.353 185.666
+1 to all of you.

You could replace every match of the pattern (a comma followed by a digit) with the digit itself.
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
gsub(",([[:digit:]])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
#or
gsub(",([0-9])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"

Using Perl regexp, and focusing on "digit comma digit" we then replace with just the digits:
R> x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
R> gsub("(\\d),(\\d)", "\\1\\2", x, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
R>

Here are a couple of options:
> tmp <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
> gsub('([0-9]),([0-9])','\\1\\2', tmp )
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
> gsub('(?<=\\d),(?=\\d)','',tmp, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
>
They both match a digit followed by a comma followed by a digit. The [0-9] and \d (the extra \ escapes the second one so that it makes it through to the regular expression) both match a single digit.
The first expression captures the digit before the comma and the digit after the comma and uses them in the replacement string. Basically pulling them out and putting them back (but not putting the comma back).
The second version uses zero-length matches, the (?<=\\d) says that there needs to be a single digit before the comma in order for it to match, but the digit itself is not part of the match. The (?=\\d) says that there needs to be a digit after the comma in order for it to match, but it is not included in the match. So basically it matches a comma, but only if preceded and followed by a digit. Since only the comma is matched, the replacement string is empty meaning delete the comma.
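The difference between consuming the digits and merely looking at them shows up when single digits are separated by commas. The input `y` below is a hypothetical corner case, not taken from the question; the patterns are the ones used in the answers.

```r
# Hypothetical corner case: single digits separated by commas.
y <- "1,2,3"

# Capture groups consume the digit on both sides, so the "2" used up by
# the first match is no longer available to the second comma.
by_groups <- gsub("([0-9]),([0-9])", "\\1\\2", y)

# Lookarounds consume only the comma, so every qualifying comma is removed.
by_lookaround <- gsub("(?<=\\d),(?=\\d)", "", y, perl = TRUE)

# Capturing only the following digit also removes both commas here.
by_following <- gsub(",([0-9])", "\\1", y)

print(by_groups)      # "12,3"
print(by_lookaround)  # "123"
print(by_following)   # "123"
```

On the question's own string all three patterns give the same result, because every comma there has at least two digits between it and the next comma. Note that the last pattern checks only the character after the comma, so it would also remove a comma that follows a letter and precedes a digit.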
