Extract e-mail address from string using R regex

These are 5 Twitter user descriptions. The idea is to extract the e-mail address from each string.
This is the code I've tried. It works, but there is probably something better.
I'd rather avoid using unlist() and do it in one go with a regex. I've seen other questions of this kind for Python/Perl/PHP, but not for R.
I know I could use grep(..., perl = TRUE), but that shouldn't be the only way to do it. If it works, of course it helps.
ds <- c("#MillonMusical | #PromotorMusical | #Diseñador | Contacto : ezequielife#gmail.com | #Instagram : Ezeqielgram | 01-11-11 | #_MillonMusical #flowfestar", "LipGLosSTudio by: SAndry RUbio Maquilladora PRofesional estudiande de diseño profesional de maquillaje artistico lipglosstudio#hotmail.com/", "Medico General Barranquillero radicado con su familia en Buenos Aires para iniciar Especialidad Medico Quirurgica. email jaenpavi#hotmail.com", "msn =
rdt031169#hotmail.comskype = ronaldotorres-br", "Aguante piscis / manuarias17#gmail.com buenos aires"
)
ds <- unlist(strsplit(ds, ' '))
ds <- ds[grep("mail.", ds)]
> print(ds)
[1] "\t\tezequielife#gmail.com" "lipglosstudio#hotmail.com/"
[3] "jaenpavi#hotmail.com" "rdt031169#hotmail.comskype"
[5] "/\t\tmanuarias17#gmail.com"
It would be nice to separate this one: "rdt031169@hotmail.comskype".
Perhaps requiring the match to end in .com or .com.ar would make sense for what I'm working on.

Here's one alternative:
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com", ds))
[1] "ezequielife@gmail.com"     "lipglosstudio@hotmail.com" "jaenpavi@hotmail.com"      "rdt031169@hotmail.com"
[5] "manuarias17@gmail.com"
Based on @Frank's comment, if you want to keep a country identifier after .com, as in your example .com.ar, then look at this:
> ds <- c(ds, "fulanito13@somemail.com.ar") # a new e-mail address
> regmatches(ds, regexpr("[[:alnum:]]+\\@[[:alpha:]]+\\.com(\\.[a-z]{2})?", ds))
[1] "ezequielife@gmail.com"      "lipglosstudio@hotmail.com"  "jaenpavi@hotmail.com"       "rdt031169@hotmail.com"
[5] "manuarias17@gmail.com"      "fulanito13@somemail.com.ar"

Related

R match expression multiple times in the same line

I am working with a set of Tweets (very original, I know) in R and would like to extract the text after each @ sign and after each # and put them into separate variables. For example:
This is a test tweet using #twitter. @johnsmith @joesmith.
Ideally I would like it to create new variables in the dataframe containing twitter, johnsmith, joesmith, etc.
Currently I am using
data$at <- str_match(data$tweet_text, "\\s@\\w+")
data$hash <- str_match(data$tweet_text, "\\s#\\w+")
which obviously gives me only the first occurrence of each in a new variable. Any suggestions?
strsplit and grep will work:
x <-strsplit("This is a test tweet using #twitter. #johnsmith #joesmith."," ")
grep("#|#",unlist(x), value=TRUE)
#[1] "#twitter." "#johnsmith" "#joesmith."
If you only want to keep the words, no #,# or .:
out <-grep("#|#",unlist(x), value=TRUE)
gsub("#|#|\\.","",out)
[1] "twitter" "johnsmith" "joesmith"
UPDATE: Putting the results in a list:
my_list <- NULL
x <- strsplit("This is a test tweet using #twitter. @johnsmith @joesmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value = TRUE)))
x <- strsplit("2nd tweet using #second. @jillsmith @joansmith.", " ")
my_list$hash <- c(my_list$hash, gsub("@|#|\\.", "", grep("#", unlist(x), value = TRUE)))
my_list$at <- c(my_list$at, gsub("@|#|\\.", "", grep("@", unlist(x), value = TRUE)))
my_list
$hash
[1] "twitter" "second"
$at
[1] "johnsmith" "joesmith" "jillsmith" "joansmith"

paste lines of string characters in raw text

I am working with raw textual data from a scanned catalog.
Here is an example:
ABADIE-LANDEL (Pierre) — 1920 — né à Paris. — 17, rue Campagne-Première
ABOU (Albert) — 1930 — né à Marseille.
— 41, rue de Seine, 6e.
ANGER (Jacques) — 1925 — né à Paris. — 33, rue Vineuse, 16e.
ANTHONE (Armand) — 1908 — né à Paris. — 4, avenue Victor-Hugo
Rue des Tournelles
ANTRAL (Jean) — 1920
This is a list of names, with occasional lines containing address information.
The data is imported into R with:
readLines("clipboard", encoding = "latin1")
I am able to identify lines containing artist names in capital letters with different regexes:
[A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO][A-ZÁÀÂÄÃÅÇÉÈÊËÍÌÎÏÑÓÒÔÖÕÚÙÛÜÝYÆO |']
or (ICU)
[\p{Uppercase Letter}][\p{Uppercase Letter}|']
I am able to identify lines including artworks with
^[0-9]+[\s][^bis]
I am able to extract artists' names with
".+(?=- [0-9]{4})"
or
(.+)[0-9]{4}.+ # with backreference \1
For more data, here is a sample of data from a 1930 catalog:
https://docs.google.com/document/d/1nF3CQmZbDsCGKMp_OgZymxWIfoOx5xrNdTmDXZANwuc/edit?usp=sharing
I wish I could paste together the pieces of the address substrings, but my final goal is to create a data.frame object structured as follows:
1st column: artist NAME and surname;
2nd column: supplements (address, nationality, ...);
3rd column: 1st work;
4th column: 2nd work, etc.
Thank you in advance for your help.
If I understand your question correctly, you want to extract names and addresses from your records, some of which may span several lines.
One solution may be to exploit the fact that the character — works as a field separator. So, assuming that the structure of your records is regular, you could do the following (data is a variable holding your example text):
## Replace newlines with the separator character
data <- gsub("\\n(\\s*—)?", " — ", data)
## Normalize space
data <- gsub("\\s+", " ", data)
## Now split by the separator character
tokens <- strsplit(data, "\\s—\\s")[[1]]
tokens now contains:
[1] "ABADIE-LANDEL (Pierre)" "1920" "né à Paris." "17, rue Campagne-Première" "ABOU (Albert)"
[6] "1930" "né à Marseille." "41, rue de Seine, 6e." "ANGER (Jacques)" "1925"
[11] "né à Paris." "33, rue Vineuse, 16e." "ANTHONE (Armand)" "1908" "né à Paris."
[16] "4, avenue Victor-Hugo" "Rue des Tournelles" "ANTRAL (Jean)" "1920"
Each complete record should occupy 4 sequential indices in this vector, but since there may be incomplete records we must do a little more work.
We exploit the fact that people's names are in all capitals and follow a strict pattern. We get the indices of the names in tokens and then split tokens at those indices, so that each subvector produced is a complete record:
## Get the indices of names
idx <- which(grepl("^[A-Z-]+\\s\\(", tokens))
## Use the indices to partition tokens to subvectors
records <- list()
for (i in 1:length(idx)) {
  start <- idx[i]
  if (i == length(idx)) {
    stop <- length(tokens)
  } else {
    stop <- idx[i + 1] - 1
  }
  records[[i]] <- tokens[start:stop]
}
Here is the final list of results:
[[1]]
[1] "ABADIE-LANDEL (Pierre)" "1920" "né à Paris." "17, rue Campagne-Première"
[[2]]
[1] "ABOU (Albert)" "1930" "né à Marseille." "41, rue de Seine, 6e."
[[3]]
[1] "ANGER (Jacques)" "1925" "né à Paris." "33, rue Vineuse, 16e."
[[4]]
[1] "ANTHONE (Armand)" "1908" "né à Paris." "4, avenue Victor-Hugo" "Rue des Tournelles"
[[5]]
[1] "ANTRAL (Jean)" "1920"
Hope this helps or leads to better ideas.
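If the final goal is the data.frame described in the question, one possible follow-up (not part of the original answer; the object and column names are only illustrative, and full records are assumed to start with name, year and birthplace) is to pad every record with NA up to the longest one and then row-bind:
## Pad each record to a common length, then bind the rows
max_len <- max(lengths(records))
padded  <- lapply(records, function(r) c(r, rep(NA, max_len - length(r))))
catalog <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(catalog) <- c("name", "year", "origin", paste0("extra", seq_len(max_len - 3)))
catalog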

How to use separate() properly?

I am having some difficulty extracting an ID of the form:
27da12ce-85fe-3f28-92f9-e5235a5cf6ac
from a data frame:
a<-c("NAME_27da12ce-85fe-3f28-92f9-e5235a5cf6ac_THOMAS_MYR",
"NAME_94773a8c-b71d-3be6-b57e-db9d8740bb98_THIMO",
"NAME_1ed571b4-1aef-3fe2-8f85-b757da2436ee_ALEX",
"NAME_9fbeda37-0e4f-37aa-86ef-11f907812397_JOHN_TYA",
"NAME_83ef784f-3128-35a1-8ff9-daab1c5f944b_BISHOP",
"NAME_39de28ca-5eca-3e6c-b5ea-5b82784cc6f4_DUE_TO",
"NAME_0a52a024-9305-3bf1-a0a6-84b009cc5af4_WIS_MICHAL",
"NAME_2520ebbb-7900-32c9-9f2d-178cf04f7efc_Sarah_Lu_Van_Gar/Thomas")
Basically it's the part between the first and the second underscore.
Usually I approach this with:
library(tidyr)
df$a<-as.character(df$a)
df<-df[grep("_", df$a), ]
df<- separate(df, a, c("ID","Name") , sep = "_")
df$a<-as.numeric(df$ID)
However, this time there are too many underscores... and my approach fails. Is there a way to extract that ID?
I think you should use extract instead of separate. You need to specify the patterns you want to capture. I'm assuming here that the ID always starts with a number, so I'm capturing everything from the first number up to the next _, and then everything after it:
df <- data.frame(a)
df <- df[grep("_", df$a),, drop = FALSE]
extract(df, a, c("ID", "NAME"), "[A-Za-z].*?(\\d.*?)_(.*)")
# ID NAME
# 1 27da12ce-85fe-3f28-92f9-e5235a5cf6ac THOMAS_MYR
# 2 94773a8c-b71d-3be6-b57e-db9d8740bb98 THIMO
# 3 1ed571b4-1aef-3fe2-8f85-b757da2436ee ALEX
# 4 9fbeda37-0e4f-37aa-86ef-11f907812397 JOHN_TYA
# 5 83ef784f-3128-35a1-8ff9-daab1c5f944b BISHOP
# 6 39de28ca-5eca-3e6c-b5ea-5b82784cc6f4 DUE_TO
# 7 0a52a024-9305-3bf1-a0a6-84b009cc5af4 WIS_MICHAL
# 8 2520ebbb-7900-32c9-9f2d-178cf04f7efc Sarah_Lu_Van_Gar/Thomas
Try this (which assumes that the ID is always the part after the first underscore):
sapply(strsplit(a, "_"), function(x) x[[2]])
which gives you "the middle part" which is your ID:
[1] "27da12ce-85fe-3f28-92f9-e5235a5cf6ac" "94773a8c-b71d-3be6-b57e-db9d8740bb98"
[3] "1ed571b4-1aef-3fe2-8f85-b757da2436ee" "9fbeda37-0e4f-37aa-86ef-11f907812397"
[5] "83ef784f-3128-35a1-8ff9-daab1c5f944b" "39de28ca-5eca-3e6c-b5ea-5b82784cc6f4"
[7] "0a52a024-9305-3bf1-a0a6-84b009cc5af4" "2520ebbb-7900-32c9-9f2d-178cf04f7efc"
If you want to get the Name as well, a simple solution would be this (which assumes that the Name is always everything after the second underscore):
Names <- sapply(strsplit(a, "_"), function(x) Reduce(paste, x[-c(1,2)]))
which gives you this:
[1] "THOMAS MYR" "THIMO" "ALEX" "JOHN TYA"
[5] "BISHOP" "DUE TO" "WIS MICHAL" "Sarah Lu Van Gar/Thomas"

Extracting text after "?"

I have a string
x <- "Name of the Student? Michael Sneider"
I want to extract "Michael Sneider" out of it.
I have used:
str_extract_all(x,"[a-z]+")
str_extract_all(x, "\\?[a-z]+")
But can't extract the name.
I think this should help (str_locate() comes from stringr, and the ? needs to be escaped in the regex):
library(stringr)
substr(x, str_locate(x, "\\?")[1] + 1, nchar(x))
Try this:
sub('.*\\?(.*)','\\1',x)
x <- "Name of the Student? Michael Sneider"
sub(pattern = ".+?\\?" , x , replacement = '' )
To take advantage of the loose wording of the question, we can go WAY overboard and use natural language processing to extract all names from the string:
library(openNLP)
library(NLP)
# you'll also have to install the models with the next line, if you haven't already
# install.packages('openNLPmodels.en', repos = 'http://datacube.wu.ac.at/', type = 'source')
s <- as.String(x) # convert x to NLP package's String object
# make annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator()
# call sentence and word annotators
s_annotated <- annotate(s, list(sent_token_annotator, word_token_annotator))
# call entity annotator (which defaults to "person") and subset the string
s[entity_annotator(s, s_annotated)]
## Michael Sneider
Overkill? Probably. But interesting, and not actually all that hard to implement, really.
str_match is more helpful in this situation
str_match(x, ".*\\?\\s(.*)")[, 2]
#[1] "Michael Sneider"

R - grab the exact 8 digits number in a string and transform it

I have 2 problems in extracting and transforming data using R.
Here's the dataset:
messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test
The problems are:
1. I want to grab only the numbers that are exactly 8 digits long. In the dataset above, message 3333333333 ($300 has 3 digits) and message 4444444444 (2837169119 has 10 digits) should not be included. Here's my best shot so far:
as.matrix(unlist(apply(df[2], 1, function(x) { regmatches(x, gregexpr('([0-9]){8}', x)) })))
However, with this line of code message 4444444444 is included, because it contains a number with more than 8 digits.
2. Transform the data to another form, like this:
message_id | customer_ID
1111111111 | 18271801
2222222222 | 12901991
2222222222 | 91222911
I don't know how to efficiently transform the data.
The output of dput(df):
structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444
), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed",
"wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id",
"msg"), row.names = c(NA, 4L), class = "data.frame")
Thanks
Use rebus to create your regular expression, and stringr to extract the matches.
You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.
library(rebus)
library(stringr)
# Create regex
rx <- negative_lookbehind(DGT) %R%
  dgt(8) %R%
  negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)
# Extract the IDs
extracted_ids <- str_extract_all(df$msg, perl(rx))
# Stuff the IDs into a data frame.
data.frame(
  messageID = rep(
    df$id,
    vapply(extracted_ids, length, integer(1))
  ),
  extractedID = unlist(extracted_ids, use.names = FALSE)
)
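For completeness, the same look-around idea also works in base R without rebus/stringr, using gregexpr() with perl = TRUE (a sketch under the same assumptions, not part of the original answer):
# exactly 8 digits, not preceded or followed by another digit
ids <- regmatches(df$msg, gregexpr("(?<!\\d)\\d{8}(?!\\d)", df$msg, perl = TRUE))
data.frame(
  messageID   = rep(df$id, lengths(ids)),
  customer_ID = unlist(ids, use.names = FALSE)
)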