Find all paragraphs of text that are related to a topic - regex

Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing, and not very advanced at regex. Clues about how to get started, what the right terms to google, etc are appreciated.

One basic ideea would be to iterate over your posts and see if any post matches any of the topic.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
Small pseudocode:
for every topicName
for every post
var customRegex = new RegExp('(' + topicName + ')+');
if customRegex.test(post) then
//post matches topicName
else
//post doesn't match topicName
endif
endfor
endfor
Hope it could give you a starting point.

Exact string matching won't take you far, especially with small fragments of text. I suggest you to use semantic similarity for this. A simple web search will give several implementations.

Related

React - Using Regex for highlighting text inside of dangerouslySetInnerHTML. Not working reliably

The goal is to highlight text parts (strings) inside of a dangerouslySetInnerHTML. Therefore I try to match the desired text part inside of the html, and wrap it in a "span" with appropriate styling. I am using the following code that works for certain texts (html) flawlessy, but for certain texts not at all. Please find a working an a not working example below. Trying for hours to understand the difference, or why the regex does not work... but I can't figure it out. Banging my head agains the wall.
My question is: Why is the regex failing in some cases and working in others? Even though in all cases the text ("quote") is there.
Any ideas what I am missing? Thanks so much for your help!!!
Highlighting Component JSX:
import React from "react";
class HighlightQuote extends React.Component {
render = () => {
//zitat is for getting rid of any quotation marks in the beginning or end.
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
if (this.props.quotes.length === 0) {
var highlightedHtml = this.props.newcontent
}
else {
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
const regex = new RegExp(`(${zitat.join('|')})`, 'g');
var highlightedHtml = this.props.content.replace(
regex,
'<span class="hl">$1</span>'
);
console.log ('highlightedHtml:');
console.log (highlightedHtml);
}
return (
<div className="reader" ref="test" dangerouslySetInnerHTML={{ __html: highlightedHtml }} />
);
};
}
export default HighlightQuote;
Working example (console.log ('highlighted html')
<div class="post" id="post-17660">
<p class="postcontents">
<article> <div class="post-inside">
<p>One of the things I have disliked the most about the crypto sector is the idea that people should “hodl” or “hold on for dear life.”</p>
<p>I have written many times here at AVC that one should take profits when they are available and diversify an investment portfolio.</p>
<p><span class="hl">The idea that an investor should hold on no matter what has always seemed ridiculous to me.</span></p>
<p>Now, the crypto markets are in the eighth month of a long and painful bear market and we are starting to see some signs of capitulation, particularly in the assets that went up the most last year.</p>
<p>Whether this is the long-awaited capitulation of the HODL crowd or not, I can’t say.</p>
<p>But capitulation would be a good thing for the crypto markets, releasing assets into the market that until now have been locked up by long-term holders.</p>
<p><span class="hl">Until then it is hard to get excited about buying anything in crypto.</span></p>
</div> </article>
</p> </div>
Quotes that are highlighted as expected:
"The idea that an investor should hold on no matter what has always seemed ridiculous to me."
"Until then it is hard to get excited about buying anything in crypto."
Failing example (console.log ('highlighted html')
<div><article id="story" class="Story-story--2QyGh css-1j0ipd9"><header class="css-1qcpy3f e345g291"><p class="css-1789nl8 etcg8100"><a class="css-1g7m0tk" href="https://www.nytimes.com/column/new-sentences">New Sentences</a></p><div class="css-30n6iy e345g290"><div class="css-acwcvw"></div></div><figure class="ResponsiveMedia-media--32g1o ResponsiveMedia-sizeSmall--3092U ResponsiveMedia-layoutVertical--1pg1o ResponsiveMedia-sizeSmallNoCaption--n--T0 css-1hzd7ei"><figcaption class="css-pplcdj ResponsiveMedia-caption--1dUVu"></figcaption></figure></header><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0"><em class="css-2fg4z9 ehxkw330">— From Keith Gessen’s second novel, “A Terrible Country” (Viking, 2018, Page 4). Gessen is also the author of “All the Sad Young Literary Men” and a founding editor of the journal n+1.</em></p><p class="css-1i0edl6 e2kc3sl0">All authors have signature sentence structures — deep expressive grooves that their minds instinctively find and follow. (That previous sentence is one of mine: a simple declaration that leaps, after the break of a long dash, into an elaborate restatement.)</p><p class="css-1i0edl6 e2kc3sl0">Here is one of Keith Gessen’s:</p><p class="css-1i0edl6 e2kc3sl0">“As for me, I wasn’t really an idiot. But neither was I not an idiot.”</p><p class="css-1i0edl6 e2kc3sl0">“I hadn’t been yelling, I didn’t think. But I hadn’t not been yelling either.”</p><p class="css-1i0edl6 e2kc3sl0">“Cute cafes were not the problem, but they were also not, as I’d once apparently thought, the opposite of the problem.”</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0">Sentence structures are not simply sentence structures, of course — they are miniature philosophies. Hemingway, with his blunt verbal bullets, is making a huge claim about the nature of the world. So is James Joyce, with his collages and frippery. So are Nikki Giovanni and Samuel Delany and Ursula K. Le Guin and John McPhee and Missy Elliott and Dr. Seuss and anyone else who converts thoughts into prose.</p><p class="css-1i0edl6 e2kc3sl0">Likewise, Keith Gessen’s signature sentence structure — “not X, but also not not X” — suggests an entire worldview. It is a universe of in-betweenness, in which the most basic facts of life, the things we absolutely expect to understand, spill and scatter like toast crumbs into the gaps between the floorboards. It is a world of embarrassingly trivial category errors. The sentences above come from Gessen’s new novel, “A Terrible Country,” the story of a 30-something American man who goes to Russia to care for his elderly grandmother. He falls into the gaps between huge concepts: youth and age, purpose and purposelessness, progress and stasis. He is not Russian but also not not Russian, not smart but also not not smart, not heroic but also not not heroic. Such is the way of the world. No matter how much we try, none of us is ever only one thing. None of us is ever pure.</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="bottom-of-article"><div class="css-k8fkhk"><p>Sam Anderson is a staff writer for the magazine.</p> <p><i>Sign up for </i><i>our newsletter</i><i> to get the best of The New York Times Magazine delivered to your inbox every week.</i></p></div><div class="css-3glrhn">A version of this article appears in print on , on Page 11 of the Sunday Magazine with the headline: From Keith Gessen’s ‘A Terrible Country’<span>. Order Reprints | Today’s Paper | Subscribe</span></div></div><span></span></article></div>
The quote that should be highlighted:
"Sentence structures are not simply sentence structures, of course — they are miniature philosophies"
The reason for the failing regex matches were html entities. Some of the parsed texts inside of the dangerouslySetInnerHTML used entity references. In the failing example above the quote includes a "—" character that in the html is decoded as — .
In order to get rid of the html entities I used the "he" library https://github.com/mathiasbynens/he a robust HTML entity encoder/decoder written in JavaScript.
var contentDecoded = he.decode(this.props.content);
var highlightedHtml = contentDecoded.replace(
regex,
'<span class="annotator-hl">$1</span>'
);

Usage of REGEXP with a token from dictionary for creating an annotation in UIMA

I need to mark an annotation with the use of regular expression and a token from a dictionary. Here is my rule
ANY{REGEXP("new"), Book.names.ct == "personal book" -> MARK (NewPersonalBook)};
that has to work with the following input:
new personal book application
open a new personal book
The programm shows no errors in the code but it doesn't mark the annotation "NewPersonalBook" for the input.
How is it possible to fix the problem?
I'm not sure if I understood your case but I tried to replicate what you're trying to do
I created a wordlist personal book, nicebook
Then I have my text example
new personal book application. open a new personal book. my new nicebook is nice.
The script
WORDLIST BooksList = 'books.txt';
DECLARE Book, NewBook;
Document{-> MARKFAST(Book, BooksList)};
W{REGEXP("new")} Book.ct == "personal book" {-> MARK(NewBook, 1, 2)}; //if you want to test a specific text
W{REGEXP("new")} Book {-> MARK(NewBook, 1, 2)}; //this will annotate NewBook for a books with the word new before it
If you dont want the "new" word with the annotation you need to remove the integer parameters (as they indicate the span that you want covered, in this case the first matched text "new" and the second which will be the book text)
Disclaimer: I'm new to UIMA RUTA, hope this helps

Webscraping (potentially) ill-formated HTML in R with xpath or regex

I'm trying to extract the abstract from this link. However, I'm unable to extract only the content of the abstract. Here's what I accomplished so far:
url <- "http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1981-38212013000100001&lng=en&nrm=iso&tlng=en"
textList <- readLines(url)
text <- textList[grep("Abstract[^\\:]", textList)] # get the correct element
text1 <- gsub("\\b(.*?)\\bISSN", "" , text)
Up to this point I got almost what I want, but then I couldn't get rid of the rest of the string that isn't of interest to me.
I even tried another approach, with xpath, but unsuccessfully. I tried something like the code below, but to no effect whatsoever.
library(XML)
arg.xpath <-"//p/#xmlns"
doc <- htmlParse( url) # parseia url
linksAux <- xpathSApply(doc, arg.xpath)
How can I accomplih what I want, either with regex or xpath, or maybe both?
ps.: my general aim is webscraping of several similar pages like the one I provided. I alredy can extract the link. I only need to get the abstract now.
free(doc)
I would strongly recommend the XML approach because regular expressions with HTML can be quite a headache. I think your xpath expression was just a bit off. Try
doc <- htmlParse(url)
xpathSApply(doc, "//p[#xmlns]", xmlValue)
This returns (clipped for length)
[1] "HOLLANDA, Cristina Buarque de. Human rights ..."
[2] "This article is dedicated to recounting the main ..."
[3] "Keywords\n\t\t:\n\t\tHuman rights; transitional ..."
[4] ""
someone better could give you a better answer but this kinda works
reg=regexpr("<p xmlns=\"\">(.*?)</p>",text1)
begin=reg[[1]]+12
end=attr(reg,which = "match.length")+begin-17
substr(text1,begin,end)
Here is another approach, which is klunky as written, but offers the technique of keeping the right parts after splitting at tag tokens:
text2 <- sapply(strsplit(x = text1, ">"), "[", 3)
text2
[1] "This article is dedicated to recounting the main initiative of Nelson Mandela's government to manage the social resentment inherited from the segregationist regime. I conducted interviews with South African intellectuals committed to the theme of transitional justice and with key personalities who played a critical role in this process. The Truth and Reconciliation Commission is presented as the primary institutional mechanism envisioned for the delicate exercise of redefining social relations inherited from the apartheid regime in South Africa. Its founders declared grandiose political intentions to the detriment of localized more palpable objectives. Thus, there was a marked disparity between the ambitious mandate and the political discourse about the commission, and its actual achievements.</p"
text3 <- sapply(strsplit(text2, "<"), "[", 1)

Parsing Exchange Recipients from a string with Regex

I'm about to break this down into two operations since I can't seem to figure out the regular expression to do it in one. However, I thought I would ask the brain trust here to see if anyone can do it (which I'm sure someone can).
Essentially I have a string containing a recipients field from an email in Exchange. I want to parse it out into individual recipients. I don't need to validate emails or anything. Essentially the data is comma separated except if the comma is in between a set of quotes. That's the part that's messing me up.
Right now I'm using: (?"[^"\r\n]*")
Which gives me the quoted names, and ([a-zA-Z0-9_-.]+)#(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})
which gives me the email addresses
Here's what I have..
Data:
"George Washington" <gwashington#government.net>, "Abraham Lincoln" <alincoln#government.net>, "Carter, Jimmy" <jimmy.carter#presidents.com>, "Nixon, Richard M." <tricky.dick#presidents.com>
What I'd like to get back is this:
"George Washington" <gwashington#government.net>
"Abraham Lincoln" <alincoln#government.net>
"Carter, Jimmy" <jimmy.carter#presidents.com>
"Nixon, Richard M." <tricky.dick#presidents.com>
I dont know enough about the exchange to get the pattern that will match for any exchange recipients entries.
But based on information past for you as an example. I give you this:
["][^"]+["][^",]+(?=[,]?)
This match all for entries that you post.
And know a simple example in C# how to use:
var input = "\"George Washington\" <gwashington#government.net>, \"Abraham Lincoln\" <alincoln#government.net>, \"Carter, Jimmy\" <jimmy.carter#presidents.com>, \"Nixon, Richard M.\" <tricky.dick#presidents.com>";
var pattern = "[\"][^\"]+[\"][^\",]+(?=[,]?)";
var items = Regex.Matches(input, pattern)
.Cast<Match>()
.Select(s => s.Value)
.ToList();
If there is a input text that this pattern dont work please post the input here.
Regex.Match(input, #"\"[^\"]*\"\s\<[^>]*>");

How to extract values from HTML using RegEx?

Given the following HTML:
<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares.  This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion.  A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion.  The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>
I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on the <span> elements.
Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).
The above code is a snippet from a larger source HTML file, which fails to pare with an XML parser. So I'm looking for a possible regular expression to help extract the information of interest.
Use this tool (free):
http://www.radsoftware.com.au/regexdesigner/
Use this Regex:
"<span[^>]*>(.*?)</span>"
The values in Group 1 (for each match) will be the text that you need.
In C# it will look like:
Regex regex = new Regex("<span[^>]*>(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string val = m.Groups[1].Value;
//Do something with the value
}
}
Ammended to answer comment:
Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string class = m.Groups[1].Value;
string val = m.Groups[2].Value;
//Do something with the class and value
}
}
Assuming that you have no nested span tags, the following should work:
/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/
I only did some basic testing on it, but it'll match the class of the span tag (if it exists) along with the data until the tag is closed.
I strongly advise you to use a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions--the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it's highly likely to break any simple regex pattern.
Regex like <span[^>]*>(.*?)</span> will work on your example, but there's a LOT of XML-valid code that's difficult or even impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that's going to work on other HTML samples, regex isn't the way to go here.
Since your HTML code isn't XML-valid, consider the HTML Agility Pack, which I've heard is very good.