Remove multiple periods from character string - regex

I have some similarly structured column names, for example:
Eagles.....Brown.Bears.......
Western.Bulls......Great.Lions....
I would like to extract the words. For example from the first:
'Eagles' and 'Brown.Bears'
for the second:
'Western.Bulls' and 'Great.Lions'
Team names are always separated by a run of periods (more than two, but the number varies, e.g. '....'), and a single period always stands in for a space within a team name.

We can use str_extract_all from the stringr package:
library(stringr)
str_extract_all(str1, "\\w+(\\.\\w+)?")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
Or using strsplit from base R
strsplit(str1, "\\.{2,}")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
data
str1 <- c("Eagles.....Brown.Bears.......", "Western.Bulls......Great.Lions....")
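
If a team name can contain more than two words (an assumption; the question only shows one- and two-word names), the optional group can be made repeatable. Below is a minimal base R sketch that also turns the single periods back into spaces. The names `cols`, `matches` and `teams` are hypothetical, and the second input string is made up for illustration.

```r
# Hypothetical input: same layout as str1, plus a three-word team name
cols <- c("Eagles.....Brown.Bears.......",
          "New.York.Giants....Great.Lions....")

# A word, optionally followed by any number of ".word" parts
matches <- regmatches(cols, gregexpr("\\w+(\\.\\w+)*", cols))

# Replace the single periods inside each name with spaces
teams <- lapply(matches, function(m) gsub(".", " ", m, fixed = TRUE))
teams
```

Using `fixed = TRUE` in the last step treats the period as a literal character, so it does not need escaping.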

Related

Extract just the part of string that matches a regex pattern in R

I built a data frame by automatically scraping a webpage; one of the variables is a date in the text form "May 12".
However, some observations come with extra characters (in some cases weird ones) attached after the date, for example: "May 20 õ", "Dez 1", "Oct 12ABCdáé".
In those cases, I want to keep only the correct characters, e.g. "Dec 24", "Oct 1".
After googling for a solution several times and trying functions like sub, gsub and grep, I could not get any of them to work.
I see that regular expressions have a steep learning curve, but using the tool http://regexr.com/ I managed to define a regular expression that matches the pattern in the problematic observations: ([A-Z]{1}[a-z]{2})\s\d+.*
At this moment, I have the following example:
vector = c("May 20", "Dez 1", "Oct 12ABCdáé")
And the last solution I tried is:
dateformat = gsub(pattern = "([A-Z]{1}[a-z]{2})\\s\\d+.*", replacement = "([A-Z]{1}[a-z]{2})\\s\\d+", x = vector)
But of course this replaces each element with the literal replacement text (with the backslashes dropped):
dateformat
[1] "([A-Z]{1}[a-z]{2})sd+" "([A-Z]{1}[a-z]{2})sd+"
[3] "([A-Z]{1}[a-z]{2})sd+"
I really do not understand what I have to put in the replacement argument to remove the bad characters when they exist.
I added a capture group and a back-reference "\\1":
sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector)
[1] "May 20" "Dez 1" "Oct 12"
The replacement argument accepts back-references like "\\1", but not regex patterns like the one you used. A back-reference refers back to a capture group defined in the pattern. In this case the capture group is the abbreviated month and the day, which we enclosed in parentheses (...). Any text captured within those parentheses is returned when "\\1" is placed in the replacement argument.
This quick-start guide may help
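
To illustrate how back-references map onto capture groups, here is a small sketch with two groups instead of one. The vector `dates` is a hypothetical ASCII-only variant of the example data, and `pattern`, `month`, `day` and `swapped` are names chosen for illustration.

```r
# Hypothetical ASCII-only variant of the example vector
dates <- c("May 20", "Dez 1", "Oct 12ABCd")

# Group 1 captures the month abbreviation, group 2 the day;
# the trailing .* consumes any junk so that it is dropped
pattern <- "^([A-Z][a-z]{2})\\s(\\d+).*"

month   <- sub(pattern, "\\1", dates)      # keep only group 1
day     <- sub(pattern, "\\2", dates)      # keep only group 2
swapped <- sub(pattern, "\\2 \\1", dates)  # reorder the groups

swapped
# [1] "20 May" "1 Dez"  "12 Oct"
```

Each "\\N" in the replacement is substituted with whatever the N-th pair of parentheses matched, which is why the replacement itself contains no regex syntax.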
We could also try
sub("\\s*[^0-9]+$", "", vector)
#[1] "May 20" "Dez 1" "Oct 12"
In case anyone else is interested in the performance of these different approaches, here is a reproducible example comparing Pierre's approach to akrun's approach.
In this run akrun's approach is marginally faster (median of about 168 vs. 173 microseconds):
library(microbenchmark)
set.seed(1234)
# Original poster's data
# vector <- c("May 20", "Dez 1", "Oct 12ABCdáé")
# Increased the size to 200
vector <- sample(c("May 20", "Dez 1", "Oct 12ABCdáé"), 200L, replace = TRUE)
# Comparison of timings with 10000 repetitions
microbenchmark(
pierre_l = sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector),
akrun = sub("\\s*[^0-9]+$", "", vector),
times = 10000L
)
#> Unit: microseconds
#>      expr     min      lq     mean  median       uq     max neval
#>  pierre_l 164.201 169.201 233.5096 173.302 220.2515 17809.1 10000
#>     akrun 159.001 164.202 228.9020 168.200 212.7010 13443.5 10000
Created on 2022-03-24 by the reprex package (v2.0.1)
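
Speed aside, the two patterns are not interchangeable on input that deviates from the "Mon DD" layout. Here is a small sketch with hypothetical inputs (`inputs`, `anchored` and `trailing` are names chosen for illustration):

```r
# Hypothetical inputs: one in the expected layout, one with the order reversed
inputs <- c("Oct 12ABCd", "12 May")

# Anchored capture-group approach: leaves non-matching strings untouched
anchored <- sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", inputs)

# Trailing-junk approach: strips any run of non-digits at the end
trailing <- sub("\\s*[^0-9]+$", "", inputs)

anchored
# [1] "Oct 12" "12 May"
trailing
# [1] "Oct 12" "12"
```

Which behavior is preferable depends on how irregular the scraped values can be, so the timing difference alone may not be the deciding factor.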

R: gsub inserting whitespaces between capture groups

I'm desperately trying to insert whitespace between capture groups. My naive approach was
c = c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")
gsub("(.+[a-z])([A-Z].+)","\\1 \\2", c, perl=TRUE)
which only inserts a space between the last two capitalized words. Using
gsub("(?=([a-z][A-Z]))"," ", c, perl = T)
does not quite work either, because the result is shifted by one character:
"Wester nSahar aRegion" "Columbi aState" "On eTw oThre eFou rFiv eSix"
How can I elegantly obtain
"Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
strsplit() unfortunately doesn't keep the capture group :/
We can either use regex lookarounds
gsub('(?<=[a-z])(?=[A-Z])', ' ', c, perl=TRUE)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
Or use capture groups
gsub('([a-z])([A-Z])', '\\1 \\2', c)
#[1] "Western Sahara Region" "Columbia State" "One Two Three Four Five Six"
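
If the individual words are needed rather than a spaced string (the question mentions strsplit()), one option is to insert the spaces first and split afterwards. This is a minimal sketch; `region_names`, `spaced` and `words` are hypothetical names, with `region_names` used instead of `c` so that the variable does not reuse the name of the base function c().

```r
region_names <- c("WesternSaharaRegion", "ColumbiaState", "OneTwoThreeFourFiveSix")

# Step 1: put a space at every lowercase-to-uppercase transition
spaced <- gsub("([a-z])([A-Z])", "\\1 \\2", region_names)

# Step 2: split on the literal space; no characters are lost,
# because the separator is the space that was just inserted
words <- strsplit(spaced, " ", fixed = TRUE)

words[[1]]
# [1] "Western" "Sahara"  "Region"
```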

R get first letters of double/triple-barrel surnames in data.frame

I have a dataframe with 2 columns:
> df1
      Surname      Name
1 The Builder       Bob
2  Zeta-Jones Catherine
I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:
      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ
Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.
I have tried:
Step1:
> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The" "Builder"
[[2]]
[1] "Zeta-Jones"
My research suggests I could possibly use strtrim as a Step 2, but all I have found is a number of ways how not to do it.
You can target the space, hyphen, and beginning of the line with lookarounds. For instance, any character (.) that is not preceded by the beginning of the line, a space, or a hyphen is replaced with "":
with(df1, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"
or
with(df1, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))
The second gsub substitutes an empty string ("") for any character that is preceded by a character other than " " or "-".
You can try this, if the format of the names is as shown in the input data:
library(stringr)
df1$Shortened_Surname <- sapply(str_extract_all(df1$Surname, '[A-Z]{1}'), function(x) paste(x, collapse = ''))
Output is as follows:
      Surname      Name Shortened_Surname
1 The Builder       Bob                TB
2  Zeta-Jones Catherine                ZJ
If the format of the names is somewhat inconsistent, you will need to modify the above pattern to capture that. You can use the | (alternation) operator inside the pattern to combine multiple patterns.
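
The strsplit() attempt from the question can also be completed in base R. A minimal sketch, assuming every part of a surname is separated by a single space or hyphen; the data frame is rebuilt here so the example is self-contained, and `parts` is a hypothetical helper name:

```r
# Rebuild the example data frame from the question
df1 <- data.frame(
  Surname = c("The Builder", "Zeta-Jones"),
  Name = c("Bob", "Catherine"),
  stringsAsFactors = FALSE
)

# Split each surname on a space or a hyphen
parts <- strsplit(df1$Surname, "[ -]")

# Take the first letter of every part and paste the letters together
df1$Shortened_Surname <- vapply(
  parts,
  function(p) paste(substr(p, 1, 1), collapse = ""),
  character(1)
)

df1$Shortened_Surname
# [1] "TB" "ZJ"
```

Unlike the '[A-Z]' extraction, this takes the first letter of each part regardless of case, so a surname part written in lowercase would still contribute its initial.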

How to extract a part from a string in R

I have a problem extracting the numeric parts of a string in R. The original string is, for example, "buy 1000 shares of Google at 1100 GBP".
I need to extract the number of shares (1000) and the price (1100) separately. I also need to extract the name of the stock, which always appears after "shares of".
I know that sub and gsub can replace parts of a string, but which functions should I use to extract part of a string?
1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order, but the number of shares is always followed by the word "shares" and the price is always followed by GBP, then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
ADDED Modified pattern to allow either digits or a dot. Also added (3) above and the following:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if each string has the same number of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
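
For readers without gsubfn, the same capture-group idea as in (3) can be sketched in base R with regexec() and regmatches(). It assumes the string has the form "X shares of Y at Z GBP"; the names `m`, `shares`, `stock` and `price` are hypothetical.

```r
s <- "buy 1000 shares of Google at 1100 GBP"

# regexec() records the full match plus one entry per capture group
m <- regmatches(s, regexec("(\\d+) shares of (.+) at ([0-9.]+) GBP", s))[[1]]

# m[1] is the full match; m[2], m[3], m[4] are the three groups
shares <- as.numeric(m[2])
stock  <- m[3]
price  <- as.numeric(m[4])

c(shares = shares, price = price)
stock
```

If the string does not match the pattern, `m` is an empty character vector, so a length check before indexing is advisable.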
You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
[1] "1000"
str_match(s, "([0-9]+) GBP")[2]
[1] "1100"
If you want to extract all runs of digits from text, use this function from the stringi package.
"Nd" is the Unicode class of decimal digits.
library(stringi)
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"
[[2]]
[1] "43"
[[3]]
[1] "66" "123"
[[4]]
[1] NA
Please note that here the numbers 66 and 123 are extracted separately.
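
A rough base R counterpart, for comparison, is sketched below. It uses the ASCII class [0-9] rather than the Unicode class Nd, and it yields character(0) instead of NA when nothing matches; `x` and `digits` are hypothetical names.

```r
x <- c("123", "43", "66ala123", "kot")

# gregexpr() finds every run of ASCII digits; regmatches() extracts them
digits <- regmatches(x, gregexpr("[0-9]+", x))

digits[[3]]
# [1] "66"  "123"
digits[[4]]
# character(0)
```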

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but every attempt to invert this and extract the number instead fails.
I want:
[1] 1234
Does anyone have a clue?
PS: I know how to do it with {stringr}, but I am wondering whether it is possible with {base} only:
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
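
One consequence of requiring whitespace on both sides is that a 4-digit number at the very start or end of the string is not matched. If that matters, a variant with negative lookarounds ("not preceded or followed by a non-space character") is one possible sketch; the test string `y` and the name `pat_edges` are hypothetical.

```r
# Hypothetical string with 4-digit numbers at both edges
y <- "1234 abc 5678"

pat       <- "(?<=\\s)(\\d{4})(?=\\s)"   # pattern from the answer above
pat_edges <- "(?<!\\S)\\d{4}(?!\\S)"     # also accepts start/end of string

strict  <- as.numeric(regmatches(y, gregexpr(pat, y, perl = TRUE))[[1]])
relaxed <- as.numeric(regmatches(y, gregexpr(pat_edges, y, perl = TRUE))[[1]])

strict
# numeric(0)
relaxed
# [1] 1234 5678
```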
It's possible to capture a group in a regex using (). Taking the same example:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning
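
The same split-then-filter idea can be written without the coercion warning by testing each token against a pattern instead of converting it to integer. A minimal sketch; `tokens` and `four_digit` are hypothetical names:

```r
str12 <- "coihr 1234 &/()= jngm 34 ljd"

# Split on single spaces
tokens <- strsplit(str12, " ", fixed = TRUE)[[1]]

# Keep tokens that consist of exactly four digits
four_digit <- tokens[grepl("^[0-9]{4}$", tokens)]

four_digit
# [1] "1234"
```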