R Regular Expression: extracting the speaker in a script

I would like to use R to extract the speaker out of scripts formatted like in the following example:
"Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."
In this example, I would like to extract: ("Second Lord", "First Lord", "Second Lord", "BERTRAM", "Second Lord"). The rule is obvious: it is the group of words situated between the end of a sentence and a colon.
How can I write this in R?

Maybe something like this:
library(stringr)
body <- "Scene 6: Second Lord: Nay, good my lord, put him to't; let him have his way. First Lord: If your lordship find him not a hilding, hold me no more in your respect. Second Lord: On my life, my lord, a bubble. BERTRAM: Do you think I am so far deceived in him? Second Lord: Believe it, my lord, in mine own direct knowledge, without any malice, but to speak of him as my kinsman, he's a most notable coward, an infinite and endless liar, an hourly promise-breaker, the owner of no one good quality worthy your lordship's entertainment."
p <- str_extract_all(body, "[:.?] [A-z ]*:")
# and get rid of extra signs
p <- str_replace_all(p[[1]], "[?:.]", "")
# strip white spaces
p <- str_trim(p)
p
"Second Lord" "First Lord" "Second Lord" "BERTRAM" "Second Lord"
# unique players
unique(p)
[1] "Second Lord" "First Lord" "BERTRAM"
Explanation of the regex (which is not perfect):
In str_extract_all(body, "[:.?] [A-z ]*:"), a match starts with : or . or ? ([:.?]) followed by a whitespace. Letters and spaces are then matched up to the next :. Beware that [A-z] also matches the six ASCII characters that sit between Z and a ([, \, ], ^, _ and the backtick); [A-Za-z] is the safer class.
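A slightly safer variant of the same call, assuming speaker names contain only ASCII letters and spaces, swaps in the explicit class:
p <- str_extract_all(body, "[:.?] [A-Za-z ]*:")
On the example above it behaves identically, but it cannot accidentally match the stray punctuation characters that [A-z] lets through.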
Get position
You can use str_locate_all with the same regex:
str_locate_all(body, "[:.?] [A-z ]*:")
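str_locate_all returns a matrix of start and end positions per input string; a minimal sketch of recovering the matched spans from it with str_sub (reusing body from above):
loc <- str_locate_all(body, "[:.?] [A-z ]*:")[[1]]
str_sub(body, loc[, "start"], loc[, "end"])
# the raw matches, e.g. ": Second Lord:", ". First Lord:", ...
The same clean-up steps as above then apply.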

gsubfn/strapplyc
Try this where x is the input string. Here strapplyc returns the portion of the match within parentheses:
> library(gsubfn)
> strapplyc(x, "[.?:] *([^:]+):", simplify = c)
[1] "Second Lord" "First Lord" "Second Lord" "BERTRAM" "Second Lord"
gregexpr
Here is a second method. It uses no external packages. Here we calculate the starting and ending positions (start.pos and end.pos) and then extract out the strings they define:
> pos <- gregexpr("[.?:] [^:]+:", x)[[1]]
> start.pos <- pos + 2
> end.pos <- start.pos + attr(pos, "match.length") - 4
> substring(x, start.pos, end.pos)
[1] "Second Lord" "First Lord" "Second Lord" "BERTRAM" "Second Lord"

At least in this case, a better solution is to search the text in a more structured form. Mining structured documents is almost always easier than unstructured ones. Since the source is Shakespeare, there are many copies floating around the internet.
script_url <- "http://www.opensourceshakespeare.org/views/plays/play_view.php?WorkID=allswell&Act=3&Scene=6&Scope=scene"
library(XML)
doc <- htmlParse(script_url)
character_links <- xpathApply(doc, '//li[@class="playtext"]/strong/a')
characters <- unique(sapply(character_links, xmlValue))
#[1] "Second Lord" "First Lord" "Bertram" "Parolles"
Note that the version of the text you use makes a big difference. Open Source Shakespeare is very good in that the HTML pages are well structured and include classes. Bartleby's pages, on the other hand, are not. Let's run the analysis again:
script_url2 <- "http://www.bartleby.com/70/2236.html"
doc2 <- htmlParse(script_url2)
tbl <- xpathApply(doc2, '//table[@width="100%"]')[[1]]
italics <- xpathApply(tbl, '//tr/td/i')
characters2 <- unique(sapply(italics, xmlValue))
#[1] "First Lord." "Sec. Lord." "Ber." "Par." "hic jacet." "Exit."
#[7] "Ber" "Exeunt."
In this case you can't programmatically tell the difference between characters, stage directions (without compiling a list of possible stage directions and ignoring them, as sketched below), and emphasised speech. Choose your text source wisely!
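As a rough illustration of that workaround, you could compile a small stop list of non-speaker strings and filter it out, though this stays fragile (the list below is just what appears in the output above):
stage_directions <- c("Exit.", "Exeunt.", "hic jacet.")
setdiff(characters2, stage_directions)
#[1] "First Lord." "Sec. Lord." "Ber." "Par." "Ber"
Even then, nothing tells you programmatically that "Ber" and "Ber." are the same speaker.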

Related

r ngram extraction with regex

Karl Broman's post: https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/ got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers to do this but am interested in the regex logic (i.e., it was a self challenge that I failed to meet).
Below I give a minimal example and the desired output. The problem in my attempt is two-fold:
The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (e.g., I want "like" to be available for "like toast" after it has already been consumed in "I like".)
I couldn't make the space between words non-captured (notice the trailing white space in my output even though I used (?:\\s*)). How can I avoid capturing trailing spaces on the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and numbers word parts, though I do consider ' a word part.
MWE:
library(stringi)
x <- "I like toast and jam."
stringi::stri_extract_all_regex(
  x,
  pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)
## [[1]]
## [1] "I like " "toast and "
Desired Output:
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
Here's one way using base R regex. This can be extended to handle arbitrary n-grams (a sketch of the generalization follows the example). The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern))
x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)
# [[1]]
# [1] "I like" "like toast" "toast and" "and jam"
Actually, there is an app for that: the quanteda package (for the quantitative analysis of textual data). My coauthor Paul Nulty and I are working hard to improve this, but it easily handles the use case you describe.
install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
> ngrams(x, 2)
## [[1]]
## [1] "i_like" "like_toast" "toast_and" "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like" "like toast" "toast and" "and jam"
No painful regexes required!
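(In more recent quanteda releases, ngrams() has been replaced by a tokens-based interface; assuming a current version, the equivalent call is tokens_ngrams(tokens(x), n = 2, concatenator = " ").)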

regular expression in R -- new lines

I'm trying to use regular expressions in R with the regexpr function. I have multiple conditions to match, so my regular expression is actually very long, for example "A\s+(\d+)|(\d+)\s+A". I want to put each alternative on its own line, like
"A\\s+(\\d+)|
(\\d+)\\s+A|"
But it's not working. (The parentheses tell R that I want to extract the digits.) Can anyone give suggestions?
1) paste Try using paste:
paste("A\\s+(\\d+)",
"(\\d+)\\s+A",
sep = "|")
2) rex Another possibility is to use the rex package
library(rex)
rex(group("A", spaces, capture(digits)) %or%
group(capture(digits), spaces, "A"))
which gives:
(?:(?:A[[:space:]]+([[:digit:]]+))|(?:([[:digit:]]+)[[:space:]]+A))
3) rebus The rebus package is similar in intent:
library(rebus)
literal("A") %R% one_or_more(space()) %R% capture(one_or_more(ascii_digit())) %|%
capture(one_or_more(digit())) %R% one_or_more(space()) %R% literal("A")
which emits:
<regex> \QA\E[[:space:]]+([0-9]+)|([[:digit:]]+)[[:space:]]+\QA\E
If you want to break string literal up on to several lines in your script, one solution is to use paste0:
my_expr <- paste0('partone',
                  'parttwo',
                  'partthree')
Then you get the desired result:
> my_expr
[1] "partoneparttwopartthree"
You can't just break it up onto several lines between quotes, because the newline character then becomes part of the expression.
If you are also trying to troubleshoot your regular expression, you'll need to post a sample of the data you are working with and the desired result.
Just use the x modifier with perl = TRUE in whatever function you're using. Place the x modifier ((?x)) at the beginning of the expression and whitespace in the pattern is ignored; additionally, everything from a # to the end of a line is treated as a comment.
pat <- "(?x)\\\\ ## Grab a backslash followed by...
[a-zA-Z0-9]*cite[a-zA-Z0-9]* ## A word that contains ‘cite‘
(\\[([^]]+)\\]){0,2}\\** ## Look for 0-2 square brackets w/ content
\\{([a-zA-Z0-9 ,]+)\\}" ## Look for curly braces with viable bibkey
tex <- c(
  "Many \\parencite*{Ted2005, Moe1999} say graphs \\textcite{Few2010}.",
  "But \\authorcite{Ware2013} said perception good too.",
  "Random words \\pcite[see][p. 22]{Get9999c}.",
  "Still more \\citep[p. 22]{Foo1882c}?"
)
gsub(pat, "", tex, perl=TRUE)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
A second approach... I maintain a package called regexr that attempts to enable maintainers of regular expression libraries to write regular expressions in a way that is similar to the way R code is written.
This may be overkill if you aren't planning long-term maintenance of the expression, but you could do the same thing with regexr as follows (no need for perl = TRUE). Note the minimal comments, as the meaning is carried by the sub-expression names. The %:)% is a comment operator (commented code is happy code), but you need not use the leading names or comments, just construct:
library(regexr)
pat2 <- construct(
  backslash    = "\\\\"                         %:)% "\\",
  cite_command = "[a-zA-Z0-9]*cite[a-zA-Z0-9]*" %:)% "parencite",
  square_brack = "(\\[([^]]+)\\]){0,2}\\**"     %:)% "[e.g.][p. 12]",
  bibkeys      = "\\{([a-zA-Z0-9 ,]+)\\}"       %:)% "{Rinker2014}"
)
gsub(pat2, "", tex)
## [1] "Many say graphs ." "But said perception good too."
## [3] "Random words ." "Still more ?"
The regexr framework requires a bit of upfront time, but the "code" is much easier to maintain, more modular, and easier for others to understand without learning a new "language". This is one approach of many, and I tend to use a combination of standard regex, regexr, and rebus (which works within the regexr framework). So, for example, we can grab any of the sub-expressions from pat2 with the subs function as follows:
subs(pat2)
## $backslash
## [1] "\\\\"
##
## $cite_command
## [1] "[a-zA-Z0-9]*cite[a-zA-Z0-9]*"
##
## $square_brack
## [1] "(\\[([^]]+)\\]){0,2}\\**"
##
## $bibkeys
## [1] "\\{([a-zA-Z0-9 ,]+)\\}"
I also included a simple way to test the main and sub-expressions for perl validity, as follows:
test(pat2)
## $regex
## [1] TRUE
##
## $subexpressions
## backslash cite_command square_brack bibkeys
## TRUE TRUE TRUE TRUE

Extract a string of words between two specific words in R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
I have the following string: "PRODUCT colgate good but not goodOKAY"
I want to extract all the words between PRODUCT and OKAY.
This can be done with sub:
s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)
giving:
[1] "colgate good but not good"
No packages are needed.
x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")
(?<=PRODUCT) -- look behind the match for PRODUCT
.* -- match everything except newlines.
(?=OKAY) -- look ahead to match OKAY.
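Note that the look-arounds keep the spaces adjacent to the anchor words (there is a space after PRODUCT here), so you may want to add a trim step, e.g.:
str_trim(str_extract(x, "(?<=PRODUCT).*(?=OKAY)"))
# [1] "colgate good but not good"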
I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, or detecting, the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.
(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality - so the pattern above would need to be wrapped in perl().)
You can use gsub:
vec <- "PRODUCT colgate good but not goodOKAY"
gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:
x <- "PRODUCT colgate good but not goodOKAY"
library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)
## [[1]]
## [1] "colgate good but not good"
You could use the package unglue:
library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"

How to convert a vector of strings to Title Case

I have a vector of strings in lower case. I'd like to change them to title case, meaning the first letter of every word would be capitalized. I've managed to do it with a double loop, but I'm hoping there's a more efficient and elegant way to do it, perhaps a one-liner with gsub and a regex.
Here's some sample data, along with the double loop that works, followed by other things I tried that didn't work.
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
# For each string in the strings vector, find the position of each
# word's first letter (a word boundary followed by lowercase letters)
matches = gregexpr("\\b[a-z]+", strings)
# For each string in the strings vector, convert the first letter
# of each word to upper case
for (i in 1:length(strings)) {
  # Extract the position of each regex match for the string in row i
  # of the strings vector.
  match.positions = matches[[i]][1:length(matches[[i]])]
  # Convert the letter in each match position to upper case
  for (j in 1:length(match.positions)) {
    substr(strings[i], match.positions[j], match.positions[j]) =
      toupper(substr(strings[i], match.positions[j], match.positions[j]))
  }
}
This worked, but it seems inordinately complicated. I resorted to it only after experimenting unsuccessfully with more straightforward approaches. Here are some of the things I tried, along with the output:
# Google search suggested \\U might work, but evidently not in R
gsub("(\\b[a-z]+)", "\\U\\1" ,strings)
[1] "Ufirst Uphrase" "Uanother Uphrase Uto Uconvert"
[3] "Uand Uhere'Us Uanother Uone" "Ulast-Uone"
# I tried this on a lark, but to no avail
gsub("(\\b[a-z]+)", toupper("\\1"), strings)
[1] "first phrase" "another phrase to convert"
[3] "and here's another one" "last-one"
The regex captures the correct positions in each string as shown by a call to gregexpr, but the replacement string is clearly not working as desired.
If you can't already tell, I'm relatively new to regexes and would appreciate help on how to get the replacement to work correctly. I'd also like to learn how to structure the regex so as to avoid capturing a letter after an apostrophe, since I don't want to change the case of those letters.
The main problem is that you're missing perl=TRUE (and your regex is slightly wrong, although that may be a result of flailing around to try to fix the first problem).
Using [:lower:] instead of [a-z] is slightly safer in case your code ends up being run in some weird (sorry, Estonians) locale where z is not the last letter of the alphabet ...
re_from <- "\\b([[:lower:]])([[:lower:]]+)"
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
gsub(re_from, "\\U\\1\\L\\2" ,strings, perl=TRUE)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-One"
You may prefer to use \\E (stop capitalization) rather than \\L (start lowercase), depending on what rules you want to follow, e.g.:
string2 <- "using AIC for model selection"
gsub(re_from, "\\U\\1\\E\\2" ,string2, perl=TRUE)
## [1] "Using AIC For Model Selection"
Without using regex, the help page for tolower has two example functions that will do this.
The more robust version is
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if (strict) tolower(s) else s},
                           sep = "", collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
capwords(c("using AIC for model selection"))
## -> [1] "Using AIC For Model Selection"
To get your regex approach (almost) working, you need to set perl = TRUE:
gsub("(\\b[a-z]{1})", "\\U\\1" ,strings, perl=TRUE)
[1] "First Phrase" "Another Phrase To Convert"
[3] "And Here'S Another One" "Last-One"
but you will need to deal with apostrophes slightly better, perhaps:
sapply(lapply(strsplit(strings, ' '), gsub,
              pattern = '^([[:alnum:]]{1})', replacement = '\\U\\1', perl = TRUE),
       paste, collapse = ' ')
A quick search of SO found https://stackoverflow.com/a/6365349/1385941
Already excellent answers here. Here's one using a convenience function from the reports package:
strings <- c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
CA(strings)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-one"
Though it doesn't capitalize "one", as it didn't make sense to do so for my purposes.
Update: I maintain the qdapRegex package, which has the TC (title case) function that does true title case:
TC(strings)
## [[1]]
## [1] "First Phrase"
##
## [[2]]
## [1] "Another Phrase to Convert"
##
## [[3]]
## [1] "And Here's Another One"
##
## [[4]]
## [1] "Last-One"
I'll throw one more into the mix for fun:
topropper(strings)
[1] "First Phrase" "Another Phrase To Convert" "And Here's Another One"
[4] "Last-one"
topropper <- function(x) {
  # Makes proper capitalization out of a string or collection of strings.
  sapply(x, function(strn) {
    s <- strsplit(strn, "\\s")[[1]]
    paste0(toupper(substring(s, 1, 1)),
           tolower(substring(s, 2)),
           collapse = " ")
  }, USE.NAMES = FALSE)
}
Here is another one-liner, based on the stringr package:
str_to_title(strings, locale = "en")
where strings is your vector of strings.
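For the sample data above this should give (stringr's ICU-based title casing also handles the apostrophe and hyphen cases):
str_to_title(strings)
## [1] "First Phrase" "Another Phrase To Convert"
## [3] "And Here's Another One" "Last-One"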
For converting any case to any other case, you can use the snakecase package in R. Simply load the package:
library(snakecase)
strings = c("first phrase", "another phrase to convert",
"and here's another one", "last-one")
to_title_case(strings)
## [1] "First Phrase" "Another Phrase to Convert"
## [3] "And Here s Another One" "Last One"
Keep Coding!

gsub every other occurrence of a condition

Sometimes I use R for parsing text from PDFs for quotes when writing an article (I use LaTeX). One thing I'd like to do is change straight left and right quotes to LaTeX-style left and right quotes.
LaTeX would change "dog" to ``dog'' (so two ` for the left and two ' for the right).
Here's an example of what I have and what I'd like to get.
#currently
x <- c('I like "proper" cooking.', 'I heard him say, "I want some too" and "nice".')
[1] "I like \"proper\" cooking." "I heard him say, \"I want some too\" and \"nice\"."
#desired outcome
[1] "I like ``proper'' cooking." "I heard him say, ``I want some too'' and ``nice''."
EDIT: Thought I'd share the actual use for context. Using ttmaccer's solution (works on a Windows machine):
g <- function() {
  require(qdap)
  x <- readClipboard()
  x <- clean(paste2(x, " "))
  zz <- mgsub(c("- ", "“", "”"), c("", "``", "''"), x)
  zz <- gsub("\"([^\"].*?)\"", "``\\1''", zz)
  writeClipboard(noquote(zz), format = 1)
}
Note: qdap is available on CRAN.
A naive solution would be:
> gsub("\"([^\"].*?)\"","``\\1''",x)
[1] "I like ``proper'' cooking."
[2] "I heard him say, ``I want some too'' and ``nice''."
but I'm not sure how you would handle "some \"text\" with one \""
A two-stage solution:
stage 1: use "((?:[^\\"]|\\.)*)" to match a double-quoted string (allowing backslash-escaped characters inside)
stage 2: use \\"([^\\"]*)\\" to replace the \" delimiters around group 1 of stage 1
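A minimal sketch of that two-stage idea in R, applied to the example above (perl = TRUE for the non-capturing group; the pattern tolerates backslash-escaped characters inside the quotes, which addresses the \" caveat raised earlier):
gsub('"((?:[^"\\\\]|\\\\.)*)"', "``\\1''", x, perl = TRUE)
## [1] "I like ``proper'' cooking."
## [2] "I heard him say, ``I want some too'' and ``nice''."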