Extracting sentences using scan() in R - regex

I've been told that I shouldn't use R to scan text (but I have been doing so, anyway, pending the acquisition of other skills) and encountered a problem that has confused me sufficiently to retreat to these fora. Thanks for any help, in advance.
I'm trying to store a large amount of text (e.g., a short story) as a vector of strings, each of which is a separate sentence. I've been doing this using the scan() function, but I am encountering two basic problems: (1) scan() only seems to allow a single separating character, whereas sentences can obviously end in multiple ways. I know how to mark the end of a sentence using regex (e.g. [!?\.]), but I don't know of a function in R that uses regular expressions to split text. (2) scan() seems to automatically regard a new line as a new field, whereas I want it to ignore new lines unless they coincide with the end of a sentence.
download.file("http://www.textfiles.com/stories/3lpigs.txt","threelittlepigs.txt")
threelittlepigs_s<-scan("threelittlepigs.txt",character(0),
sep=".",quote=NULL)
If I don't include the 'quote=NULL' option, scan() throws the warning that an EOF (end of field, I'm guessing) falls within a quoted string. This produces a handful of multi-line elements/fields, but pretty erratically. I can't seem to discern a pattern.
Sorry if this has been asked before. I'm sure there's an easy solution. I would prefer one that helps me make sense of why scan() isn't working the way I would expect, but if there are better tools to read text in R, please do let me know.

R has strong text-mining capabilities and many good packages for it, for example tm, rvest, stringi and others.
But here is a simple example of doing this almost completely in base R. I only use the %>% pipe from magrittr because I think this makes the code a bit more readable.
The specific answer to your question is that you can use regular expressions to split on multiple punctuation marks. In the example below I use "[\\.?!] ", meaning a period, question mark or exclamation mark, followed by a space. You may have to experiment with the pattern for your own text.
Try this:
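library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"
# piping url into paste() prepends the URL to every line, so it is stripped out again below
corpus <- url %>%
  paste(readLines(url), collapse=" ") %>%
  gsub("http://www.textfiles.com/stories/3lpigs.txt", "", .)
head(corpus)
z <- corpus %>%
  gsub(" +", " ", .) %>%          # squeeze runs of spaces into a single space
  strsplit(split = "[\\.?!] ")    # split on ., ? or ! followed by a space
z[[1]]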
The results:
z[[1]]
[1] " THE THREE LITTLE PIGS Once upon a time "
[2] ""
[3] ""
[4] "there were three little pigs, who left their mummy and daddy to see the world"
[5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"
[6] "None were happier than the three little pigs, and they easily made friends with everyone"
[7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"
[8] "Autumn came and it began to rain"
[9] "The three little pigs started to feel they needed a real home"
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"
...etc
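Note that the split above consumes the punctuation that ends each sentence. If you want to keep it, a lookbehind is one option. This is only a minimal sketch: the three lines in text_lines are invented stand-ins for the result of readLines() on the downloaded file, and it assumes perl = TRUE is acceptable.

```r
# Stand-in for readLines("threelittlepigs.txt"); these lines are invented for illustration
text_lines <- c("Once upon a time there were three little pigs.",
                "They left home! Did they",
                "have fun? Yes.")

# Join the physical lines first, so that line breaks no longer act as separators
corpus <- paste(text_lines, collapse = " ")

# Split on whitespace that follows ., ? or !; the lookbehind keeps the punctuation
sentences <- strsplit(corpus, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
print(sentences)
```

Reading with readLines() and joining the lines yourself sidesteps the scan() behaviour described in the question, because the line breaks are gone before any splitting happens.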

Related

how to delete all English words, except special punctuation, in R

I have a data file in R,
data <- "conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :)"
From this I want to remove all the words so that only the smileys remain. The output I am expecting is:
":< :D :> :)"
Is there any library or method in R for doing this task easily?
You can use [[:alnum:]] as a regexp pattern that matches every alphabetic and numeric character in a string:
s <- gsub("[[:alnum:]]*", "", "conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :) ")
gsub(" +", " ", s)
[1] " :< : :> :) "
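As the output shows, removing every alphanumeric character also strips the D from :D. One possible way around that is to work on whole whitespace-separated tokens and drop only those made up entirely of letters and digits. A minimal sketch, assuming smileys are always separated from words by whitespace:

```r
data <- "conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :)"

# Split into whitespace-separated tokens
tokens <- strsplit(data, "\\s+")[[1]]

# Keep only tokens that are NOT purely alphanumeric (so ":D" survives, "ideas" does not)
smileys <- tokens[!grepl("^[[:alnum:]]+$", tokens)]

result <- paste(smileys, collapse = " ")
print(result)
```

Anything that mixes letters and punctuation (a word with an apostrophe, say) would also be kept by this rule, so it may need tightening for real data.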

Time complexity of regex and Allowing jitter in pattern finding

To find patterns in string, I have the following code. In it, find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
An example for the above mentioned code: for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab".
NOW I am trying to get patterns allowing jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb" where "j" can be anything. Can anyone suggest a modification of the above mentioned code or any new code for pattern finding, that could allow such jitters?
Also can anyone throw some light on the TIME COMPLEXITY and INTERNAL ALGORITHM used for the regexpr function ?
Thanks! :)
Not very efficient but tada:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  found <- FALSE
  for(sublen in len:1) {
    for(inlen in 0:sublen) {
      # prefix (\2), one jitter character (\3), suffix (\4), then th-1 repeats of prefix.suffix
      pat <- paste0("((.{", sublen-inlen, "})(.)(.{", inlen, "}))", reps("(\\2.\\4)", th-1))
      r <- regexpr(pat, string, perl = TRUE)
      if (attr(r, "capture.length")[1] > 0) {
        found <- TRUE
        break
      }
    }
    if(found) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")[1] - 1) else ""
}
find.string("a0cc0vaaaabaaadbaaabbaa00bvw") # returns "aaaab"
Without any fuzzy matching tool available, I manually check each possibility. I use an inner loop to try different prefix and suffix lengths on either side of the "jitter" character. The prefix is grouped as \2 and the suffix as \4 (the jitter is \3 but I don't use it). Then, the repeated part tries to match \2.\4 - so the prefix, any new jitter character, and the suffix.
I say not efficient because it's evaluating O(len^2) different patterns, versus O(len) patterns in your code. For large len this might become a problem.
Note that I have multiple groups, and only look at the [1] position. The full r variable has more useful information, for example [1] will be the first part, [5] will be the 2nd part, [6] will be the 3rd part, etc. Also [3] will be the "jitter" character in the 1st part.
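To make the group positions concrete, here is a small sketch that pulls every captured group out of one match. The pattern is written out by hand for a 3-character prefix and a 1-character suffix (the combination that matches the example string), rather than being built by find.string.

```r
s <- "a0cc0vaaaabaaadbaaabbaa00bvw"

# Same shape as the generated pattern: ((prefix)(jitter)(suffix)) then two repeats of prefix.suffix
pat <- "((.{3})(.)(.{1}))(\\2.\\4)(\\2.\\4)"
r <- regexpr(pat, s, perl = TRUE)

# With perl = TRUE, regexpr() attaches the start and length of every capture group
starts <- as.vector(attr(r, "capture.start"))
lens   <- as.vector(attr(r, "capture.length"))

groups <- substring(s, starts, starts + lens - 1)
print(groups)
```

Positions 1, 5 and 6 of groups hold the three repeats, and position 3 holds the jitter character of the first repeat.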
Regarding the complexity of the actual regex: it varies a lot. However, often the construction (setup) of a particular regex is vastly more intensive than the actual matching, which is why a single pattern used repeatedly can produce better results than multiple patterns. In truth, this varies a lot based on the pattern and the engine you're using - see links at the end for more info about complexity.
Regarding how regex works: just a note, this is going to be a very theoretical overview; it's not meant to indicate how any particular regex engine works.
For a more practical overview, there are plenty of sites that cover just enough to know how to use a regex, but not how to build your own engine. - for example http://www.regular-expressions.info/engine.html
Regex is what's known as a state machine, specifically a (non-deterministic) finite state automaton (NFA). A very simple, real world state machine is a lightbulb: it's either on or off, and different inputs can change the state it's in. A regex is much more complex: (generally) each symbol in the pattern forms a state, and different input can send it to different states. So if you have \d\d\d, there are 3 virtual states that can each accept any digit, and any other input goes to a 4th "failure" state. The end result is the end state after all input is 'consumed'.
Perhaps you can imagine: this gets vastly more complicated, with many many states, when you use any ambiguity, such as wildcards or alternation. So our \d\d\d regex will basically be linear, but more complicated ones will not be. Part of the optimization in a regex engine is converting an NFA to a DFA - a deterministic finite state automaton. Here, the ambiguity is removed, generating many more states, and this is the very computationally complex process referenced above (the construction stage).
This is really just a very theoretical overview of an ideal NFA. In practice, modern regex grammars can do a lot more than this; for example, backreferences (like the \1 used above) are not technically possible in a "proper" regular expression.
This might be a bit too high-level, but that's the basic idea. If you're curious, there are plenty of good articles about regex, different flavors, and their complexity. For example: http://swtch.com/~rsc/regexp/regexp1.html
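To make the state-machine picture concrete, here is a hand-written automaton for the \d\d\d example, sketched in R. The function name match_three_digits is made up, it treats the whole input as the thing to match, and it is only an illustration of the idea, not how any of R's regex engines is implemented.

```r
# States 0, 1, 2, 3 count the digits consumed so far; anything unexpected is the "failure" state
match_three_digits <- function(s) {
  digits <- as.character(0:9)
  state <- 0L
  for (ch in strsplit(s, "")[[1]]) {
    if (state < 3L && ch %in% digits) {
      state <- state + 1L   # a digit moves us to the next state
    } else {
      return(FALSE)         # non-digit, or a 4th character: failure state
    }
  }
  state == 3L               # accept only if all input was consumed in the final state
}

print(match_three_digits("123"))
print(match_three_digits("12a"))
print(match_three_digits("1234"))
```

Each character is looked at exactly once, which is why a pattern like this one is linear in the length of the input.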
There are basically two types of regex algorithm: Perl-style (with a lot of complex backtracking) and Thompson NFA.
http://swtch.com/~rsc/regexp/regexp1.html
To determine which engine R uses, you can look in R's svn repo:
root repo:
http://svn.r-project.org/R/
http://svn.r-project.org/R/branches/R-exp-uncmin/src/regex
I poked around in there a bit and found a file called "engine.c" On first glance it doesn't look like a Thompson-NFA but I didn't take long to read it.
At any rate, the first link goes in depth into the complexity question in general and should give you a great idea as to how regex parsing works under the hood to boot.

R dividing texts in tm package - recognizing speakers

I am trying to identify the most frequently used words in congressional speeches, and I have to separate them by congressperson. I am just starting to learn about R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?
Text looks like this:
OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
[....]
STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]
I would like to be able to get these names, or separate text by the people. Hope you can help me. Thanks a lot.
Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.
If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names as in your example?).
Have a look here for more on the use of these functions, and text manipulation in general in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
UPDATE Here's a more complete answer...
#create object containing all text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")
# split object on first two words
y <- unlist(strsplit(x, "STATEMENT OF"))
#load library containing handy function
library(stringr)
# use word() to return words in positions 3 to 4 of each string, which is where the first and last names are
# note that the first element of y has only one word, and word() gives an error
# if there are not enough words in the string, hence y[2:4]
z <- word(y[2:4], 3, 4)
z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
No doubt a regular expressions wizard could come up with something to do it quicker and neater!
Anyway, from here you can run a function to calculate word freqs on each line in the vector y (ie. each speaker's speech) and then make another object that combines the word freq results with the names for further analysis.
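For comparison, here is a rough base-R sketch of the same idea that works line by line: find the header lines, pull the name out with a regular expression, and glue the remaining lines together per speaker. It assumes every header contains "STATEMENT OF " followed by the name and a comma, and that every speaker has at least one line of text; the variable names are made up, and "[....]" stands in for the elided second speech.

```r
transcript <- c(
  "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN",
  "The Chairman. Good afternoon to everybody, and thank you",
  "very much for coming to this hearing this afternoon.",
  "STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER",
  "[....]"
)

# Header lines mark where a new speaker starts
is_header <- grepl("STATEMENT OF ", transcript, fixed = TRUE)

# Capture everything between "STATEMENT OF " and the first comma
speaker <- sub("^.*STATEMENT OF ([^,]+),.*$", "\\1", transcript[is_header])

# Number the blocks, then paste each speaker's lines into one string
block <- cumsum(is_header)
speeches <- vapply(split(transcript[!is_header], block[!is_header]),
                   paste, character(1), collapse = " ")
names(speeches) <- speaker

# Word frequencies for one speaker: drop punctuation, lower-case, split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", speeches[["SENATOR HERB KOHL"]])), "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)
print(speaker)
print(head(freq))
```

Keeping the title (SENATOR) in the name avoids having to guess how many words a name has.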
This is how I'd approach it using Ben's example (use qdap to parse and create a dataframe and then convert to a Corpus with 3 documents; note that qdap was designed for transcript data like this and a Corpus may not be the best data format):
library(qdap)
dat <- unlist(strsplit(x, "\\n"))
locs <- grep("STATEMENT OF ", dat)
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)
dat[locs] <- "SPLIT_HERE"
corp <- with(data.frame(person=nms, dialogue =
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
    df2tm_corpus(dialogue, person))
tm::inspect(corp)
## A corpus with 3 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech
##
## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings.
##
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.

replacing specific characters in between specific elements

I'd like to use a regular expression to replace a space in a string. The space in question is the only space between two elements in the string. The string itself, however, contains many more elements and spaces. So far I've tried
(<-)[\s]*?(->)
But that doesn't work. It is supposed to take
<-word anotherword->
and allow me to replace the space in it.
As \s selects all spaces, and
(<-)[\s\S]*?(->)
selects all characters in between the <- and ->, I tried to re-use the expression, but for the spaces only.
I'm not so good at these expressions, and I can't for the life of me find an answer anywhere.
If anyone could just point me to the answer, that would be great. Thanks.
It's difficult to be sure what you want; please post some before-and-after examples, and specify what language you are using.
But, it looks like (<-\S+)\s*(\S+->) should probably do it (deletes spaces).
If the <- and -> are NOT to be preserved, move them out of the parentheses, like so:
<-(\S+)\s*(\S+)->
Here's what it would look like in JavaScript:
var before = "Ten years ago a crack <-flegan esque-> unit was sent to prison by a military "
    + "court for a crime they didn't commit.\n"
    + "These men promptly escaped from a maximum security stockade to the "
    + "<-flargon 666-> underground.\n"
    + "Today, still wanted by the government, they survive as soldiers of fortune.\n"
    + "If you have a problem and no one else can help, and if you can find them, "
    + "maybe you can hire the <-flugen 9->.\n"
    ;
var after = before.replace(/(<-\S+)\s*(\S+->)/g, "$1$2");
alert(after);
Which yields:
Ten years ago a crack <-fleganesque-> unit was sent to prison by a military court for a crime they didn't commit.
These men promptly escaped from a maximum security stockade to the <-flargon666-> underground.
Today, still wanted by the government, they survive as soldiers of fortune.
If you have a problem and no one else can help, and if you can find them, maybe you can hire the <-flugen9->.
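The pattern above handles exactly two words between the markers. If there can be more than one space inside a <- ... -> span, another option is to first find each span and then strip the whitespace inside it. A minimal sketch, written in R rather than JavaScript and using a made-up input string:

```r
x <- "a crack <-flegan esque one-> unit and the <-flugen 9-> team"

# Locate every <- ... -> span (non-greedy, so each span stops at its own ->)
m <- gregexpr("<-.*?->", x, perl = TRUE)

# Replace each span with a copy of itself that has all whitespace removed
regmatches(x, m) <- lapply(regmatches(x, m), function(spans) gsub("\\s+", "", spans))
print(x)
```

Text outside the markers is left untouched, which is the part a single regex over the whole string makes awkward.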

How to get sentence number from input?

It seems hard to detect a sentence boundary in a text. Punctuation marks like .!? may be used to delimit sentences, but this is not very accurate because there may be ambiguous words and abbreviations such as U.S.A or Prof. or Dr. I am studying the Tperlregex library and Regular Expression Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects sentences.
What may be comparatively accurate expression using Tperlregex in delphi?
Thanks
First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:
He said: "It's OK!"
Is it one sentence or two? A general answer is irrelevant. Decide whether you want it to interpret it as one or two sentences, and proceed accordingly.
Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary - perhaps editable by your users, since each language will have its own set of common abbreviations.
Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.
In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about
Most people called him Professor Jones, but to me he was simply The Prof.
Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be
Most people called him Professor Jones, but to me he was simply Prof. Bill.
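The scanning approach described above (a terminator only counts when it is followed by whitespace or the end of the string, unless the word is a known abbreviation) can be sketched roughly as follows. It is written in R purely for illustration; the function name split_sentences and the abbreviation list are made up, and as already noted it will still get "The Prof." at the end of a sentence wrong.

```r
split_sentences <- function(text, abbreviations = c("Prof.", "Dr.", "Mr.", "Mrs.")) {
  # Working on whitespace-separated tokens means a terminator is only ever
  # checked at the end of a token, i.e. before whitespace or end of string
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  sentences <- character(0)
  current <- character(0)
  for (tok in tokens) {
    current <- c(current, tok)
    # Terminator, optionally followed by a closing quote or parenthesis
    ends_sentence <- grepl("[.!?][\"')]*$", tok) && !(tok %in% abbreviations)
    if (ends_sentence) {
      sentences <- c(sentences, paste(current, collapse = " "))
      current <- character(0)
    }
  }
  # Keep any trailing text that had no terminator
  if (length(current) > 0) sentences <- c(sentences, paste(current, collapse = " "))
  sentences
}

example <- "He said: \"It's OK!\" Most people called him Professor Jones, but to me he was simply Prof. Bill."
result <- split_sentences(example)
print(result)
```

Here the quoted exclamation is treated as the end of a sentence; changing that decision only means changing the ends_sentence rule.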
Check my tutorial here http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can be easily rewritten to regular expressions and some imperative code.
It will be wise to use a NLP processor with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP and it can be used in Visual Studio with SharpNLP.
The advantages of using this method are numerous. For example, consider the input
Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.
If you are using a regex split, for example
string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
Then the above input will be split as
Prof.
Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr.
Jacob Jr.
However the desired output is
Prof. Jessica is a wonderful woman.
She is a native of U.S.A. She is married to Mr. Jacob Jr.
This kind of logical sentence split can be achieved only using trained models from OpenNLP project. The method is as simple as this.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
private string[] SplitSentences(string paragraph)
{
    // load the model lazily, the first time a split is requested
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
where mModelPath is the path of the directory containing the nbin file.
The mSentenceDetector is derived from the OpenNLP dll.
You can get the desired output by
string[] sentences = SplitSentences(text);
Kindly read through this article I have written for integrating SharpNLP with your Application in Visual Studio to make use of the NLP tools