R get first letters of double/triple-barrel surnames in data.frame - regex

I have a dataframe with 2 columns:
> df1
Surname Name
1 The Builder Bob
2 Zeta-Jones Catherine
I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.
I have tried:
Step1:
> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The" "Builder"
[[2]]
[1] "Zeta-Jones"
My research suggests I could possibly use strtrim as a Step 2, but all I have found is a number of ways how not to do it.

You can target the space, hyphen, and beginning of the line with lookarounds. For instance, any character (.) that is not preceded by the beginning of the line, a space, or a hyphen can be replaced with "":
with(df1, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"
or
with(df1, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))
The second gsub substitutes an empty string ("") for any character that is preceded by a character other than " " or "-".
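The strsplit attempt from the question can also be finished without lookarounds. A minimal sketch, assuming df1 is rebuilt as below (the stringsAsFactors argument and the name parts are my additions, not from the question): split on runs of spaces or hyphens, then keep the first character of each piece.

```r
# Rebuild the example data from the question
df1 <- data.frame(Surname = c("The Builder", "Zeta-Jones"),
                  Name = c("Bob", "Catherine"),
                  stringsAsFactors = FALSE)

# Split each surname on one or more spaces or hyphens
parts <- strsplit(df1$Surname, "[ -]+")

# Take the first character of every piece and glue them together
df1$Shortened_Surname <- sapply(parts, function(p) paste(substr(p, 1, 1), collapse = ""))
df1$Shortened_Surname
```

This gives the same "TB" and "ZJ" as the gsub versions and may be easier to read if you are not comfortable with lookbehinds.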

You can try this, if the format of the names is as shown in the input data:
library(stringr)
# Extract every capital letter from each surname, then paste them together
df1$Shortened_Surname <- sapply(str_extract_all(df1$Surname, '[A-Z]'), paste, collapse = '')
Output is as follows:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
If the format of the names is somewhat inconsistent, you will need to modify the above pattern to capture that. You can use the | (alternation) operator inside the pattern to combine multiple patterns; note that regular expressions have no & operator.
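For instance, if some surnames contain lower-case particles, matching capitals alone would skip them. A hedged base R sketch (the sample names and the helper name initials are hypothetical): take any letter that sits at the start of the string or right after a space or hyphen, then upper-case the result.

```r
initials <- function(x) {
  # A letter that is NOT preceded by a non-separator character,
  # i.e. it is at the start of the string or follows a space/hyphen
  pat <- "(?<![^ -])[[:alpha:]]"
  hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))
  # Collapse the letters found in each surname and upper-case them
  toupper(vapply(hits, paste, character(1), collapse = ""))
}

initials(c("van der Berg-Smith", "The Builder"))
# [1] "VDBS" "TB"
```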

Related

Remove multiple periods from character string

I have some similar column names for example:
Eagles.....Brown.Bears.......
Western.Bulls......Great.Lions....
I would like to extract the words. For example from the first:
'Eagles' and 'Brown.Bears'
for the second:
'Western.Bulls' and 'Great.Lions'
There are always periods between team names (>2 periods but vary in number '....') and there is always one period in place of a space within a team name.
We can use str_extract_all from stringr
library(stringr)
str_extract_all(str1, "\\w+(\\.\\w+)?")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
Or using strsplit from base R
strsplit(str1, "\\.{2,}")
#[[1]]
#[1] "Eagles" "Brown.Bears"
#[[2]]
#[1] "Western.Bulls" "Great.Lions"
data
str1 <- c("Eagles.....Brown.Bears.......", "Western.Bulls......Great.Lions....")
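Since a single period stands in for a space inside a team name, you may want to restore the spaces after splitting. A small sketch building on the strsplit answer (the name teams is my own):

```r
str1 <- c("Eagles.....Brown.Bears.......", "Western.Bulls......Great.Lions....")

# Split on runs of two or more periods; the trailing run leaves no empty element
teams <- strsplit(str1, "\\.{2,}")

# Turn the remaining single periods back into spaces (fixed = TRUE: "." is literal)
teams <- lapply(teams, function(x) gsub(".", " ", x, fixed = TRUE))
teams
```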

R regmatches() and stringr str_extract() dragging whitespaces along

Here's the thing:
test=" 2 15 3 23 12 0 0.18"
#I want to extract the 1st number separately
pattern="^ *(\\d+) +"
d=regmatches(test,gregexpr(pattern,test))
> d
[[1]]
[1] " 2 "
library(stringr)
f=str_extract(test,pattern)
> f
[1] " 2 "
They both bring whitespaces to the result despite usage of ()-brackets. Why? The brackets are for specifying which part of the matched pattern you want, am I wrong? I know I can trim them with trimws() or coerce them directly to numeric, but I wonder if I misunderstand some mechanics of patterns.
Using str_match (or str_match_all)
Since you want to extract a capture group, you can use str_match (or str_match_all). str_extract only extracts whole matches.
From R stringr help:
str_match Extract matched groups from a string.
and
str_extract to extract the complete match
R code:
library(stringr)
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
f=str_match(test,pattern)
f[[2]]
## [1] "2"
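str_match returns a character matrix whose first column is the whole match and whose second column is the first capture group, so f[[2]] (the same element as f[, 2] here) returns the first capture group value.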
Using regmatches
As it is mentioned in the comment above, it is also possible with regmatches and regexec:
test=" 2 15 3 23 12 0 0.18"
pattern="^ *(\\d+) +"
res <- regmatches(test,regexec(pattern,test))
res[[1]][2] # res is a list; each element holds the whole match followed by its submatches
## [1] "2" (item 2 of the first list element is the first capture group)
See regexec help page that says:
regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match).
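To make the quoted help text concrete, here is a small sketch that inspects what regexec returns for the question's pattern (the test string is written out with single spaces, and the names starts, lens and parts are my own):

```r
test <- " 2 15 3 23 12 0 0.18"
pattern <- "^ *(\\d+) +"

m <- regexec(pattern, test)

# Start positions: whole match first, then capture group 1
starts <- as.integer(m[[1]])
# Matching lengths, in the same order
lens <- as.integer(attr(m[[1]], "match.length"))

# regmatches() turns those positions into the actual substrings
parts <- regmatches(test, m)[[1]]
parts
# [1] " 2 " "2"
```

The first element is the whole match, whitespace included; only the second element is the bracketed part.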
OP task specific solution
Since you are only interested in the single integer at the beginning of the string, you could get what you want with a plain gsub:
> gsub("^ *(\\d+) +.*", "\\1", test)
[1] "2"

Extract contents within brackets using R and Regex

I have a data-frame that contains user names in the format
"John Smith (Company Department)"
I want to extract the department from the username to add it to its own separate column.
I have tried the below code but it fails if the user name is something like
"John Smith (Company Department) John Doe)"
Can anyone help? Regex isn't my strong suit, and the code below only works if the user name is non-standard, like my example above with multiple brackets:
strcol <- "John Smith (FPO Sales) John Doe)"
start_loc <- str_locate_all(pattern ='\\(FPO ',strcol)[[1]][2]
end_loc <- str_locate_all(pattern ='\\)',strcol)[[1]][2]
substr(strcol, start_loc + 1, end_loc - 1)
Expected Output:
Sales
I have also tried the post here using non greedy, but got the following error:
Error: '[' is an unrecognized escape in character string starting ""/["
Note: the company will always be the same
You may use sub
> strcol <- "John Smith (FPO Sales) John Doe)"
> sub(".*\\(FPO[^)]*?(\\w+)\\).*", "\\1", strcol)
[1] "Sales"
.*\\(FPO matches all the characters up to and including (FPO.
[^)]*? matches any character except ), zero or more times, as few times as possible.
(\\w+)\\) captures the one or more word characters that come immediately before the closing bracket of the same pair.
.* matches all the remaining characters.
So replacing everything that was matched with the contents of group 1 gives you the desired output.
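One thing to watch when applying this to a whole column: sub returns the input unchanged when the pattern does not match. A hedged sketch (the vector users and the name dept are made up for illustration, and perl = TRUE is my choice) that returns NA for such rows instead:

```r
# Hypothetical user names: standard, non-standard, and one without a department
users <- c("John Smith (FPO Sales)",
           "John Smith (FPO Sales) John Doe)",
           "Jane Roe")

pat <- ".*\\(FPO[^)]*?(\\w+)\\).*"

# Keep the captured department where the pattern matches, NA elsewhere
dept <- ifelse(grepl(pat, users, perl = TRUE),
               sub(pat, "\\1", users, perl = TRUE),
               NA_character_)
dept
# [1] "Sales" "Sales" NA
```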
OR
> library(stringr)
> str_extract(strcol, perl("FPO[^)]*?\\K\\w+(?=\\))"))
[1] "Sales"
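If perl() is not available in your stringr version, the same \\K pattern can be run in base R instead. A sketch under that assumption, using regexpr with perl = TRUE and regmatches (the name dept is my own):

```r
strcol <- "John Smith (FPO Sales) John Doe)"

# \\K discards everything matched so far, so only the word before ")" is returned
m <- regexpr("FPO[^)]*?\\K\\w+(?=\\))", strcol, perl = TRUE)
dept <- regmatches(strcol, m)
dept
# [1] "Sales"
```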
gsub('.*\\s(.*)\\).*\\)$','\\1',strcol)
[1] "Sales"

R - Remove dashes from a column with phone numbers

I'd like to create a new column of phone numbers with no dashes. I have data that is a mix of just numbers and some numbers with dashes. The data looks as follows:
Phone
555-555-5555
1234567890
555-3456789
222-222-2222
51318312491
Since you are dealing with a very straightforward substitution, you can easily use gsub to find the character you want to remove and replace it with nothing.
Assuming your dataset is called "mydf" and the column of interest is "Phone", try this:
gsub("-", "", mydf$Phone)
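To store the result as a new column, as the question asks, something like the following should do. This is a sketch: mydf is rebuilt from the sample data, and the column name Phone_clean is hypothetical.

```r
mydf <- data.frame(Phone = c("555-555-5555", "1234567890", "555-3456789",
                             "222-222-2222", "51318312491"),
                   stringsAsFactors = FALSE)

# fixed = TRUE treats "-" as a literal character rather than a regex
mydf$Phone_clean <- gsub("-", "", mydf$Phone, fixed = TRUE)
mydf$Phone_clean
```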
Building on the answer of @Ananda Mahto, it seemed useful to show how to break the numbers up again and put parentheses around the area code.
phone <- c("1234567890", "555-3456789", "222-222-2222", "5131831249")
phone <- gsub("-", "", phone)
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1) \\2 \\3", phone)
[1] "(123) 456 7890" "(555) 345 6789" "(222) 222 2222" "(513) 183 1249"
The second regex creates three capture groups, two with three digits and the final one with four. Then R substitutes them back in with a space between each and ( ) around the first one. You could also put hyphens between capture group 2 and capture group 3.
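The hyphen variant mentioned above might look like this. It is a sketch only; I have also included the 11-digit number from the question to show that strings which are not exactly ten digits pass through unchanged (the names digits and formatted are my own).

```r
phone <- c("1234567890", "555-3456789", "222-222-2222", "51318312491")

# Strip the dashes first
digits <- gsub("-", "", phone, fixed = TRUE)

# Reformat exactly-ten-digit numbers; anything else is left as it is
formatted <- sub("^(\\d{3})(\\d{3})(\\d{4})$", "(\\1) \\2-\\3", digits)
formatted
# [1] "(123) 456-7890" "(555) 345-6789" "(222) 222-2222" "51318312491"
```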

Regex matching everything that's not a 4 digit number

I match and replace 4-digit numbers preceded and followed by white space with:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
sub("\\s\\d{4}\\s", "", str12)
[1] "coihr&/()= jngm 34 ljd"
but every attempt to invert this and extract the number instead fails.
I want:
[1] 1234
Does anyone have a clue?
PS: I know how to do it with {stringr}, but I am wondering if it's possible with {base} only.
require(stringr)
gsub("\\s", "", str_extract(str12, "\\s\\d{4}\\s"))
[1] "1234"
regmatches(), only available since R-2.14.0, allows you to "extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec"
Here are examples of how you could use regmatches() to extract either the first whitespace-cushioned 4-digit substring in your input character string, or all such substrings.
## Example strings and pattern
x <- "coihr 1234 &/()= jngm 34 ljd" # string with 1 matching substring
xx <- "coihr 1234 &/()= jngm 3444 6789 ljd" # string with >1 matching substring
pat <- "(?<=\\s)(\\d{4})(?=\\s)"
## Use regexpr() to extract *1st* matching substring
as.numeric(regmatches(x, regexpr(pat, x, perl=TRUE)))
# [1] 1234
as.numeric(regmatches(xx, regexpr(pat, xx, perl=TRUE)))
# [1] 1234
## Use gregexpr() to extract *all* matching substrings
as.numeric(regmatches(xx, gregexpr(pat, xx, perl=TRUE))[[1]])
# [1] 1234 3444 6789
(Note that this will return numeric(0) for character strings not containing a substring matching your criteria).
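If you need one result per input string, including the ones with no match, a small wrapper can fill the gaps with NA. This is a sketch; the helper name first_match and the sample vector are hypothetical.

```r
first_match <- function(x, pat) {
  m <- regexpr(pat, x, perl = TRUE)
  hit <- as.integer(m) != -1L          # TRUE where the pattern matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches() returns only the hits
  out
}

pat <- "(?<=\\s)\\d{4}(?=\\s)"
first_match(c("coihr 1234 &/()= jngm 34 ljd", "none", "a 5678 b"), pat)
# [1] "1234" NA     "5678"
```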
It's possible to capture a group in a regex using (). Taking the same example:
str12 <- "coihr 1234 &/()= jngm 34 ljd"
gsub(".*\\s(\\d{4})\\s.*", "\\1", str12)
[1] "1234"
I'm pretty naive about regex in general, but here's an ugly way to do it in base:
# if it's always in the same spot as in your example
unlist(strsplit(str12, split = " "))[2]
# or if it can occur in various places
str13 <- unlist(strsplit(str12, split = " "))
str13[!is.na(as.integer(str13)) & nchar(str13) == 4] # issues warning
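The coercion warning can be avoided by testing the tokens with grepl instead of as.integer. A minimal sketch of that variation (the names tokens and four_digit are my own):

```r
str12 <- "coihr 1234 &/()= jngm 34 ljd"

# Split on single spaces, then keep tokens made of exactly four digits
tokens <- strsplit(str12, " ", fixed = TRUE)[[1]]
four_digit <- tokens[grepl("^[0-9]{4}$", tokens)]
four_digit
# [1] "1234"
```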