I am trying to identify the most frequently used words in congressional speeches, and I need to separate them by congressperson. I am just starting to learn R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?
Text looks like this:
OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
[....]
STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]
I would like to be able to extract these names, or to separate the text by speaker. I hope you can help me. Thanks a lot.
Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.
If so, say x is your text; then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, and then grep() or str_extract() to return the two or three words after SENATOR (do the speakers always have only two names, as in your example?).
Have a look here for more on the use of these functions, and text manipulation in general in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
UPDATE: Here's a more complete answer...
# create an object containing all of the text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")
# split the object wherever the phrase "STATEMENT OF" occurs
y <- unlist(strsplit(x, "STATEMENT OF"))
# load the stringr package for its word() function
library(stringr)
# use word() to return words in positions 3 to 4 of each string, which is where the first and last names are
z <- word(y[2:4], 3, 4) # note that the first element of the character vector y has only one word, and this function gives an error if there are not enough words in a string
z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
No doubt a regular-expression wizard could come up with something quicker and neater!
Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e., each speaker's speech) and then make another object that combines the word-frequency results with the names for further analysis.
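For instance, here is one rough way to do that next step in base R (a sketch only, using the y and z objects created above; speeches and word_freqs are just names made up for this example):
speeches <- y[2:4]                  # the three speeches (dropping the leading "OPENING " fragment)
names(speeches) <- z                # label each speech with the name extracted above
word_freqs <- lapply(speeches, function(txt) {
  words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))   # split on anything that isn't a letter
  sort(table(words[words != ""]), decreasing = TRUE)         # drop empties, tabulate and sort
})
word_freqs[[1]][1:5]                # the five most frequent words in the first speech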
This is how I'd approach it using Ben's example (use qdap to parse the text and create a data frame, and then convert that to a Corpus with 3 documents; note that qdap was designed for transcript data like this, so a Corpus may not be the best data format):
library(qdap)
dat <- unlist(strsplit(x, "\\n"))                               # split the text into individual lines
locs <- grep("STATEMENT OF ", dat)                              # find the header lines that introduce each speaker
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)   # extract the speaker names
dat[locs] <- "SPLIT_HERE"                                       # mark where each speech starts
# build a data frame of person/dialogue and convert it to a tm Corpus (one document per speaker)
corp <- with(data.frame(person=nms, dialogue =
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
    df2tm_corpus(dialogue, person))
tm::inspect(corp)
## A corpus with 3 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech
##
## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings.
##
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.
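If you do want word frequencies straight from that Corpus, one possible sketch (assuming the corp object built above, and that the document names match the inspect() listing) uses tm's TermDocumentMatrix():
library(tm)
tdm <- TermDocumentMatrix(corp, control = list(tolower = TRUE, removePunctuation = TRUE))
m <- as.matrix(tdm)
# most frequent words for one speaker; the column name is assumed to match the listing above
head(sort(m[, "SENATOR HERB KOHL"], decreasing = TRUE))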
Related
I have one string without any delimiter and I want to parse it. Is this possible in SSIS or C#?
For example, I have address info in a single column, but I want to split/parse it into multiple columns such as house number, road number, road name, road type, locality name, state code, post code, country, etc.
12/38 Meacher Street Mount Druitt NSW 2770 Australia -- in this case: house number 12, road number 38, road name Meacher, road type Street, locality Mount Druitt, state NSW, post code 2770.
All of this info is in a single column, so how do I parse it and split it into multiple columns? I know that splitting on a space delimiter will not work, since some road names contain more than one word, so the information would end up in the wrong columns.
Any suggestion would be appreciated.
Thanks.
Please remember that the country can also have spaces in it, and some countries use alphanumeric post codes.
If all the addresses are in Australia and in the same format of (...), state, postcode, Australia, then you can split them into
StreetAddress, State, PostCode
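As a rough sketch of that split (shown in R like the other examples on this page, and assuming every address ends in "<state abbreviation> <4-digit postcode> Australia"):
addr <- "12/38 Meacher Street Mount Druitt NSW 2770 Australia"
parts <- regmatches(addr, regexec("^(.+) ([A-Z]{2,3}) ([0-9]{4}) Australia$", addr))[[1]]
street_address <- parts[2]   # "12/38 Meacher Street Mount Druitt"
state          <- parts[3]   # "NSW"
post_code      <- parts[4]   # "2770"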
You could also use one of the online APIs to find an address, and then you get the individual elements.
The best solution is to keep it together - why split it?
I have two datasets, one that lives within my agency and another that comes from an external source. Theoretically, all my agency's data should be matchable as a subset of the external data, but the problem is that there's no consistency in how PHN + street addresses are being recorded externally.
Our data = 100 West 10 Street
Their data = 100W 10th St / 100 W. 10 St. / 100west 10TH Street (you get the idea)
We have a lot of data, but they have even more, and both datasets change on a daily basis, so it's infeasible to fix the formats one by one.
So I have two questions, coming from a SAS novice who's learned through work and lots of Googling, so please bear with me.
1 - Is there a way to do a quick non-perfect/fuzzy matching of the two datasets on addresses if they're not totally consistent in format? I understand that I'd have to go through the results, but I wanted a quick way to eliminate most of the non-matches immediately with minimal clean-up beforehand.
2 - If 1 isn't possible, what is the best approach to clean up the external data and to make the addresses more consistent? Should I keep the PHN + Street together, or keep them as separate variables? I started looking into prxchange and while it's definitely useful, it's not perfect. For example:
Address = left(prxchange('s/ ST | ST. / STREET /', -1, cat(' ', address, ' ')));
This works great until it hits an address like St Marks, for example, and converts that St to STREET.
The other problem is that I have to account for all the possible variations in spelling, abbreviations, periods, etc., which I'm doing now the old-fashioned way in Excel, but this leaves room for error.
Also, if some of the addresses have been compressed, such as 10west instead of 10 west, what is the best way to add a space or separate them out entirely? Everything has been read in as text, and again there's no consistency in the number of characters, so a simple substring won't work.
Thanks!
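For the compressed cases specifically, a regex that inserts a space between a digit and a letter is one option. A quick sketch in R (the pattern itself should carry over to prxchange):
gsub("([0-9])([A-Za-z])", "\\1 \\2", "100west 10TH Street")
# [1] "100 west 10 TH Street"   -- note this also splits ordinals like 10TH, which may need extra handling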
I've been told that I shouldn't use R to scan text (but I have been doing so, anyway, pending the acquisition of other skills) and encountered a problem that has confused me sufficiently to retreat to these fora. Thanks for any help, in advance.
I'm trying to store a large amount of text (e.g., a short story) as a vector of strings, each of which is a separate sentence. I've been doing this using the scan() function, but I am encountering two basic problems: (1) scan() only seems to allow a single separating character, whereas sentences can obviously end in multiple ways. I know how to mark the end of a sentence using a regex (e.g., [!?\.]), but I don't know of a function in R that uses regular expressions to split text. (2) scan() seems to automatically regard a new line as a new field, whereas I want it to ignore new lines unless they coincide with the end of a sentence.
download.file("http://www.textfiles.com/stories/3lpigs.txt", "threelittlepigs.txt")
threelittlepigs_s <- scan("threelittlepigs.txt", character(0),
                          sep = ".", quote = NULL)
If I don't include the 'quote=NULL' option, scan() throws the warning that an EOF (end of file) falls within a quoted string. This produces a handful of multi-line elements/fields, but pretty erratically; I can't seem to discern a pattern.
Sorry if this has been asked before. I'm sure there's an easy solution. I would prefer one that helps me make sense of why scan() isn't working the way I would expect, but if there are better tools to read text in R, please do let me know.
R has really strong text-mining capability, with many good packages, for example tm, rvest, stringi and others.
But here is a simple example of doing this almost completely in base R. I only use the %>% pipe from magrittr because I think this makes the code a bit more readable.
The specific answer to your question is that you can use regular expressions to search for multiple punctuation marks. In the example below I use "[\\.?!] ", meaning a period, question mark or exclamation mark, followed by a space. You may have to experiment.
Try this:
library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"
corpus <- url %>%
  paste(readLines(url), collapse = " ") %>%                   # read the file and collapse it into one string
  gsub("http://www.textfiles.com/stories/3lpigs.txt", "", .)  # strip out the url that paste() prepended
head(corpus)
z <- corpus %>%
  gsub(" +", " ", .) %>%            # squeeze runs of spaces down to one
  strsplit(split = "[\\.?!] ")      # split into sentences on . ? or ! followed by a space
z[[1]]
The results:
z[[1]]
[1] " THE THREE LITTLE PIGS Once upon a time "
[2] ""
[3] ""
[4] "there were three little pigs, who left their mummy and daddy to see the world"
[5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"
[6] "None were happier than the three little pigs, and they easily made friends with everyone"
[7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"
[8] "Autumn came and it began to rain"
[9] "The three little pigs started to feel they needed a real home"
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"
...etc
I have two paragraphs. I want to replace ONLY the first occurrence of a specific word, 'acetaminophen', with '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'.
The paragraphs are:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks
Use the optional $limit parameter in PHP's preg_replace function: http://us2.php.net/manual/en/function.preg-replace.php
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
will replace only the first occurrence.
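The same distinction exists in R (used elsewhere on this page): sub() replaces only the first match, while gsub() replaces them all. A minimal sketch, using a shortened stand-in for the paragraphs:
text <- "partly made from acetaminophen ... both oxycodone and acetaminophen as their main ingredients"
sub("\\bacetaminophen\\b", "REPLACEMENT", text)    # replaces only the first occurrence
gsub("\\bacetaminophen\\b", "REPLACEMENT", text)   # would replace both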
For the tool you're using, just use this:
(.*?)\bacetaminophen\b
I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "identifier elements" of the address -- a comparable English example would be words like "state", "road", or "boulevard", IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English) we specify the identifier type after each name:
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough -- a lazy match on the potential candidate identifier names, separated into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all well and good, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city"
2. "crap catelephant city" , where we wanted "crap cat" "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of a provincial city, and one is a county-level city.
Back to the original point (and thank you for reading this far): is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest-"weighted" identifier first: 村委会 instead of simply 村, for example, or "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right through the disjunctive matches. Is this a valid assumption to make? Should I put the most frequently occurring identifiers first in the disjunctive list?
If I have lost anyone with the Chinese-related details, I apologize, and can clarify further if needed. The example really doesn't have to be Chinese -- I think more generally it is a question about the mechanics of the regex disjunctive match: in what order does it prefer the disjunctive alternatives, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly -- that is, it tries the left-most choice first and only tries another alternative if that fails. For example, if you have /(cat|catelephant)/ it will never match catelephant. The solution is to reorder the choices so that the most specific comes first.
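A quick illustration in R (with perl = TRUE for a PCRE-style engine, matching the behaviour described above):
x <- "crap catelephant city"
regmatches(x, regexpr("cat|catelephant", x, perl = TRUE))    # "cat" -- the longer alternative is never tried
regmatches(x, regexpr("catelephant|cat", x, perl = TRUE))    # "catelephant" -- most specific alternative first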