How to separate POS from words

How to separate POS from words - regex

Need to create an text sparce matrix (DTM) for classification. To prepare the text, first I need to eliminate (separate) the POS-tags the text. My guess was to do it like below. I'm new to R and don't now how to negate a REGEX (see below NOT!).
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
My guess how it could work:
(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS" "/VVFIN" "./$."
[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."
[[3]]
[1] "-/TRUNC" "/APPR" "/KON"
[[4]]
[1] "/PIS" "/ADJD" "./$."
[[5]]
[1] "/NN" "!!!/NE"
But don't konw how to negate the expression like:
# VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"

One possibility is to eliminate the tags by, searching for POS-tags and replacing them with '' (i.e. empty text):
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
(textlist <- strsplit(paste(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*','', text), sep=' '), " "))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "-RRB" "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
With the friendly help of rawr

Related

R list character removal

lst1 is a list:
lst1 <- list(c("all the apples", "apples in the tables", "fashion prosthetics"), c("meteorological concepts", "effects of climate change", "environmental"))
I want to preserve the list structure and remove the last s from all the words. The desired answer is the list below:
> lst1
[[1]]
[1] "all the apple" "apple in the table" "nature"
[[2]]
[1] "meteorological concept" "effect of climate change"
[3] "environmental"
I tried
gsub("\\'s|s$|s[[:space:]]{0}","",lst1)
but it is not preserving the list structure.
How can it be done?

You can use gsub with lapply to loop over the list elements
lapply(lst1, gsub, pattern= "\\'s|s$|s\\b", replacement='')
#[[1]]
#[1] "all the apple" "apple in the table" "fashion prosthetic"
#[[2]]
#[1] "meteorological concept" "effect of climate change"
#[3] "environmental"

Same solution, different regex, using a non-capturing group to leave whitespace as is:
> lapply(lst1, gsub, pattern="s(?= |$)", replacement="", perl=TRUE)
[[1]]
[1] "all the apple" "apple in the table" "fashion prosthetic"
[[2]]
[1] "meteorological concept" "effect of climate change" "environmental"

Simpler regex:
lapply(lst1, function(x) gsub('s\\b', '', x))
Results in:
[[1]]
[1] "all the apple" "apple in the table" "fashion prosthetic"
[[2]]
[1] "meteorological concept" "effect of climate change"
[3] "environmental"

regex capture repeated phrases

I can capture repeated words pretty easily using:
"(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b" but this regex does not seem to extend to mutipe words (and why should it in its current state). How could I find repeated phrases using regex?
Here I extract repeated terms (regardless of case) but the same regex doesn't word to extract a repeated phrase:
library(qdapRegex)
rm_default("this is a big Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
rm_default("this is a big is a Big deal", pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
I would hope for a regex that would return:
"is a big is a Big"
for:
x <- "this is a big is a Big deal"
To cover corner cases here's a larger desired test and output...
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see how",
"this is a big is a Big deal for those of, those of you who are.",
"I like it. It is cool",
)
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
[[6]]
[1] NA
My current regex only gets me to:
rm_default(y, pattern = "(?i)\\b(\\w+)(((\\.{3}\\s*|,\\s+)*|\\s+)\\1)+\\b", extract=TRUE)
## [[1]]
## [1] NA
##
## [[2]]
## [1] "want want"
##
## [[3]]
## [1] "want, want"
##
## [[4]]
## [1] "want...want" "see see"
##
## [[5]]
## [1] NA

I think this does what you want (note we only allow a single space, ..., or , as separators, but you should be able to tweak that easily):
pattern <- "(?i)\\b(\\w.*)((?:\\s|\\.{3}|,)+\\1)+\\b"
rm_default(x, pattern = pattern, extract=TRUE)
Produces:
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"

Try this:
> regmatches(x, gregexpr("(?i)\\b(\\S.*\\S)[ ,.]*\\b(\\1)", x, perl = TRUE))
[[1]]
[1] "is a big is a Big"
[[2]]
[1] "want want"
[[3]]
[1] "want, want"
[[4]]
[1] "want...want" "see see"
[[5]]
[1] "is a big is a Big" "those of, those of"
Here is a visualization (except there is an error in the visualization - the \S parts should be within the group.
(?i)\b(\S.*\S)[ ,.]*\b(\1)
Debuggex Demo
You might want to replace [ ,.] with [ [:punct:]]. I did not do that since debuggex does not support POSIX character groups.

splitting a string in which upper case follows lower case in stringr

I have a string vector that looks like this and I'd like to split it up:
str <- c("Fruit LoopsJalapeno Sandwich", "Red Bagel", "Basil LeafBarbeque SauceFried Beef")
str_split(str, '[a-z][A-Z]', n = 3)
[[1]]
[1] "Fruit Loop" "alapeno Sandwich"
[[2]]
[1] "Red Bagel"
[[3]]
[1] "Basil Lea" "arbeque Sauc" "ried Beef"
But I need to keep those letters at the end and start of the words.

Here's 2 approaches in base (you can generalize to stringr if you want).
This one subs out this place with a placeholder and then splits on that.
strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
This method uses lookaheads and lookbehinds:
strsplit(str, "(?<=[a-z])(?=[A-Z])", perl=TRUE)
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
EDIT Generalized to stringr so you can grab 3 pieces if you want
stringr::str_split(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE", 3)

You could also match instead of splitting based off your string.
unlist(regmatches(str, gregexpr('[A-Z][a-z]+ [A-Z][a-z]+', str)))
# [1] "Fruit Loops" "Jalapeno Sandwich" "Red Bagel"
# [4] "Basil Leaf" "Barbeque Sauce" "Fried Beef"

Extract phone number regex

How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see php abd general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"

This is the best I've been able to do- you have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, brackets, hyphens or spaces":
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
Update: a bit of searching lead me to this answer, which I slightly modified to allow periods. A bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"

The qdapRegex now has the rm_phone specifically designed for this task:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"

You would need a complex regex to cover all rules for matching phone numbers, but to cover your examples.
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"

R programming: regexpr next occurence

I have the following code and results:
> x <- c("ABCDE CDEFG FGHIJ")
> x
[1] "ABCDE CDEFG FGHIJ"
> regexpr("D", x)
[1] 4
attr(,"match.length")
[1] 1
regexpr only returns the first occurrence of "D", how can I get it to return all occurrences of "D"

You were so close -- just a couple of line down from regexpr in the help file...
gregexpr("D", x)
# [[1]]
# [1] 4 8
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE

You can also make use strsplit like this :
which(unlist(strsplit(x,split=""))=="D")
[1] 4 8
This way you can also have exact match for D.

I'm sure there will be a pure regex approach. However stringr::str_locate_all will suffice
library(stringr)
unique(unlist(str_locate_all(x, 'D')))
## [1] 4 8

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to separate POS from words - regex

Related

R list character removal

regex capture repeated phrases

splitting a string in which upper case follows lower case in stringr

Extract phone number regex

R programming: regexpr next occurence

Categories

Resources