Search a string by a mix of syntactical and regex patterns

I would like to use R to search a text for patterns expressed through a mix of POS tags and actual strings. (I have seen this functionality in a Python library here: http://www.clips.ua.ac.be/pages/pattern-search.)
For instance, a search pattern could be: 'NOUNPHRASE be|is|was ADJECTIVE than NOUNPHRASE', and should return all strings containing structures like: "a cat is faster than a dog".
I know that packages like openNLP and qdap offer convenient POS-tagging. Has anyone used their output for this kind of pattern matching?

As a starter, using koRpus and TreeTagger:
library(koRpus)
library(tm)
mytxt <- c("This is my house.", "A house is better than no house.", "A cat is faster than a dog.")
# POS pattern to find: noun ... comparative adjective ... noun
pattern <- "Noun, singular or mass.*?Adjective, comparative.*?Noun, singular or mass"
# tag the text with TreeTagger (adjust the path to your own installation)
tagged.results <- treetag(file = mytxt, treetagger = "C:/TreeTagger/bin/tag-english.bat", lang = "en", format = "obj", stopwords = stopwords("en"))
tagged.results <- kRp.filter.wclass(tagged.results, "stopword")
# number the sentences: the counter increments at each sentence-ending punctuation
taggedText(tagged.results)$id <- factor(head(cumsum(c(0, taggedText(tagged.results)$desc == "Sentence ending punctuation")) + 1, -1))
# collapse each sentence's POS descriptions into one string and grep it
setNames(mytxt, grepl(pattern, aggregate(desc ~ id, taggedText(tagged.results), FUN = paste0)$desc))
# FALSE TRUE TRUE
# "This is my house." "A house is better than no house." "A cat is faster than a dog."

Related

Performance - How to get the words in a list of words that match a given sentence in R

I am trying to get only those words from a list that are present in a given sentence. The words can include bigrams as well. For example,
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
The result should be:
"really good" "better"
I have 1000 sentences like this against which I need to compare the words, and the word list is also bigger. I tried a brute-force approach using grep, but it took a lot of time (as expected). I am looking for a way to get the matching words with better performance.
require(dplyr)
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into data frame
grams <- data.frame(grams=c(unigrams, bigrams), stringsAsFactors = FALSE)
# dplyr join should be pretty efficient
matches <- inner_join(data.frame(wordList, stringsAsFactors = FALSE),
                      grams,
                      by = c('wordList' = 'grams'))
matches
#      wordList
# 1 really good
# 2      better
I was able to use #epi99's answer with a slight modification.
wordList <- c("really good","better","awesome","true","happy")
sentence <- c("This is a really good program but it can be made better by making it more efficient")
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))
# .. and combine into a single vector
grams=c(unigrams, bigrams)
# use match function to get the matching words
matches <- match(grams, wordList )
matches <- na.omit(matches)
matchingwords <- wordList[matches]
What about
unlist(sapply(wordList, function(x) grep(x, sentence)))
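If performance over the 1000 sentences is the main concern, here is a hedged sketch scaling the tokenize-and-match idea to a whole vector of sentences (the sentences object below is a made-up stand-in corpus):
match_words <- function(s, wordList) {
  unigrams <- unlist(strsplit(s, " ", fixed = TRUE))
  n <- length(unigrams)
  bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
  # intersect() keeps only the list words that occur in this sentence
  intersect(wordList, c(unigrams, bigrams))
}
sentences <- rep(sentence, 1000)  # stand-in for the real corpus
results <- lapply(sentences, match_words, wordList = wordList)
results[[1]]
# [1] "really good" "better"
Tokenizing with fixed = TRUE and matching exact strings avoids running a regex per word per sentence, which is where the brute-force grep approach loses time.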

R ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is 2 fold:
The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (For example, I want "like" to be available for "like toast" after it has already been consumed in "I like".)
I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers word parts, while I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
  x,
  pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. This can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive lookahead assertion, e.g., (?=(my_overlapping_pattern)).
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
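(For what it's worth, more recent quanteda releases moved this into the tokens API; a hedged sketch, since the exact arguments have shifted across versions:)
library(quanteda)
toks <- tokens("I like toast and jam.", remove_punct = TRUE)
tokens_ngrams(toks, n = 2, concatenator = " ")
## "I like" "like toast" "toast and" "and jam"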

Extract a string of words between two specific words in R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
I have the following string: "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY.
This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT.
.* -- match everything except newlines.
(?=OKAY) -- look ahead to match OKAY.
I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, or detecting, the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"
You could use the unglue package:
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

Excel RegEx Functions in R

I regularly work with Excel sheets where some fields (observations) contain large amounts of text in a partly structured form (at least visually).
So the content of a single cell/observation might be something like this:
My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog
In Excel I've created a few functions which I can use to look for a string within the cell. So let's say the data is in "A1"; in "A2" I can use "=GETPOSTCODE(A1)", where the function is:
Function GetPostCode(PostCode As Range) As String
regex.Pattern = "[A-Z]{3}\d{3,}\b\w*"
regex.IgnoreCase = True
regex.MultiLine = True
Set X = regex.Execute(PostCode.Value)
For Each x1 In X
GetPostCode = UCase(x1)
Exit For
Next
End Function
What kind of structures/functions could I use in R to accomplish this?
The cells really contain much more data than that; it's purely an example, and I have a number of different "get" functions with different regexes.
I've had a good look at all the grep-type commands but am struggling with limited/developing R skills.
I've been working around this kind of principle but have pretty much stalled (where textfield is the column with my text in, obviously!). I can get all the rows that contain a post code, but not JUST the post code:
df[grep("[A-Z]{3}\\d{3,}\\b\\w*", df$textfield), ]
Any Help appreciated!
I think you need a combination of regexpr or gregexpr (to find the matches in the string) and regmatches (to extract the matching parts of the strings):
x <- "My name is John Doe
I live at my address
My Post code is ABC123
My Favorite Pet is: A dog"
> regmatches(x, regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE))
[1] "ABC123"
Other options probably include str_extract from stringr or stri_extract from stringi packages.
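To mirror the Excel UDF workflow across a whole column, you could wrap this in a small vectorized function. A hedged sketch (the function name get_postcode and the df/textfield column names are made up to match the question):
get_postcode <- function(x) {
  m <- regexpr("[A-Z]{3}\\d{3,}\\b\\w*", x, ignore.case = TRUE)
  out <- rep(NA_character_, length(x))       # NA where no post code is found
  out[m != -1] <- toupper(regmatches(x, m))  # UCase, like the VBA version
  out
}
df$postcode <- get_postcode(df$textfield)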

gsub every other occurrence of a condition

Sometimes I use R for parsing text from PDFs for quotes when writing an article (I use LaTeX). One thing I'd like to do is change straight left and right quotes to LaTeX-style left and right quotes.
LaTeX would change "dog" to ``dog'' (so two ` for the left and two ' for the right).
Here's an example of what I have and what I'd like to get.
#currently
x <- c('I like "proper" cooking.', 'I heard him say, "I want some too" and "nice".')
[1] "I like \"proper\" cooking." "I heard him say, \"I want some too\" and \"nice\"."
#desired outcome
[1] "I like ``proper'' cooking." "I heard him say, ``I want some too'' and ``nice''."
EDIT: Thought I'd share the actual use for context. Using ttmaccer's solution (works on a Windows machine):
g <- function() {
  require(qdap)
  x <- readClipboard()
  x <- clean(paste2(x, " "))
  # normalize dashes and curly quotes, then convert straight quotes to LaTeX style
  zz <- mgsub(c("- ", "“", "”"), c("", "``", "''"), x)
  zz <- gsub("\"([^\"].*?)\"", "``\\1''", zz)
  writeClipboard(noquote(zz), format = 1)
}
Note: qdap is available from CRAN.
A naive solution would be:
> gsub("\"([^\"].*?)\"","``\\1''",x)
[1] "I like ``proper'' cooking."
[2] "I heard him say, ``I want some too'' and ``nice''."
but I'm not sure how you would handle "some \"text\" with one \""
A two-stage solution:
Stage 1: use "((?:[^\\"]|\\.)*)" to match a whole double-quoted string (the alternation lets escaped characters through).
Stage 2: use \\"([^\\"]*)\\" to replace the \" escapes inside group 1 of stage 1.
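Here is my reading of that two-stage idea as runnable base R (a sketch; the sample string and the exact patterns are my own):
x <- 'He wrote "some \\"text\\" here" twice.'
# Stage 1: match whole quoted strings, treating \" as part of the string
m <- gregexpr('"((?:[^\\\\"]|\\\\.)*)"', x, perl = TRUE)
repl <- vapply(regmatches(x, m)[[1]], function(s) {
  inner <- substr(s, 2, nchar(s) - 1)   # drop the surrounding quotes
  inner <- gsub('\\\\"', '"', inner)    # Stage 2: unescape \" inside
  paste0("``", inner, "''")
}, character(1))
regmatches(x, m) <- list(repl)
x
# [1] "He wrote ``some \"text\" here'' twice."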