How to extract values from HTML using RegEx? - regex

Given the following HTML:
<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares.  This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion.  A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion.  The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>
I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on the <span> elements.
Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).
The above code is a snippet from a larger source HTML file, which fails to pare with an XML parser. So I'm looking for a possible regular expression to help extract the information of interest.

Use this tool (free):
http://www.radsoftware.com.au/regexdesigner/
Use this Regex:
"<span[^>]*>(.*?)</span>"
The values in Group 1 (for each match) will be the text that you need.
In C# it will look like:
Regex regex = new Regex("<span[^>]*>(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string val = m.Groups[1].Value;
//Do something with the value
}
}
Ammended to answer comment:
Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string class = m.Groups[1].Value;
string val = m.Groups[2].Value;
//Do something with the class and value
}
}

Assuming that you have no nested span tags, the following should work:
/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/
I only did some basic testing on it, but it'll match the class of the span tag (if it exists) along with the data until the tag is closed.

I strongly advise you to use a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions--the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it's highly likely to break any simple regex pattern.
Regex like <span[^>]*>(.*?)</span> will work on your example, but there's a LOT of XML-valid code that's difficult or even impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that's going to work on other HTML samples, regex isn't the way to go here.
Since your HTML code isn't XML-valid, consider the HTML Agility Pack, which I've heard is very good.

Related

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and help myself learn; I am creating a small program in python that takes data from our exported bank csv.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the description column but I want to see it all tabulated as one "Fast Food " expense.
For instance the Csv is setup like this:
Date Description Debit Credit
1/20/20 POS PIN BLAH BLAH ### 1.75 NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list? I would like to group all my expenses into categories and check those category variable names so that it would only output from that list.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
'Whataburger something','Macdonald something','Macdonald otherthing',],
"debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})',flags=re.I)
print (df.groupby("found").sum())
#
debit
found
Macdonald 5.25
Whataburger 6.50
pizza hut 3.50
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex

React - Using Regex for highlighting text inside of dangerouslySetInnerHTML. Not working reliably

The goal is to highlight text parts (strings) inside of a dangerouslySetInnerHTML. Therefore I try to match the desired text part inside of the html, and wrap it in a "span" with appropriate styling. I am using the following code that works for certain texts (html) flawlessy, but for certain texts not at all. Please find a working an a not working example below. Trying for hours to understand the difference, or why the regex does not work... but I can't figure it out. Banging my head agains the wall.
My question is: Why is the regex failing in some cases and working in others? Even though in all cases the text ("quote") is there.
Any ideas what I am missing? Thanks so much for your help!!!
Highlighting Component JSX:
import React from "react";
class HighlightQuote extends React.Component {
render = () => {
//zitat is for getting rid of any quotation marks in the beginning or end.
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
if (this.props.quotes.length === 0) {
var highlightedHtml = this.props.newcontent
}
else {
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
const regex = new RegExp(`(${zitat.join('|')})`, 'g');
var highlightedHtml = this.props.content.replace(
regex,
'<span class="hl">$1</span>'
);
console.log ('highlightedHtml:');
console.log (highlightedHtml);
}
return (
<div className="reader" ref="test" dangerouslySetInnerHTML={{ __html: highlightedHtml }} />
);
};
}
export default HighlightQuote;
Working example (console.log ('highlighted html')
<div class="post" id="post-17660">
<p class="postcontents">
<article> <div class="post-inside">
<p>One of the things I have disliked the most about the crypto sector is the idea that people should “hodl” or “hold on for dear life.”</p>
<p>I have written many times here at AVC that one should take profits when they are available and diversify an investment portfolio.</p>
<p><span class="hl">The idea that an investor should hold on no matter what has always seemed ridiculous to me.</span></p>
<p>Now, the crypto markets are in the eighth month of a long and painful bear market and we are starting to see some signs of capitulation, particularly in the assets that went up the most last year.</p>
<p>Whether this is the long-awaited capitulation of the HODL crowd or not, I can’t say.</p>
<p>But capitulation would be a good thing for the crypto markets, releasing assets into the market that until now have been locked up by long-term holders.</p>
<p><span class="hl">Until then it is hard to get excited about buying anything in crypto.</span></p>
</div> </article>
</p> </div>
Quotes that are highlighted as expected:
"The idea that an investor should hold on no matter what has always seemed ridiculous to me."
"Until then it is hard to get excited about buying anything in crypto."
Failing example (console.log ('highlighted html')
<div><article id="story" class="Story-story--2QyGh css-1j0ipd9"><header class="css-1qcpy3f e345g291"><p class="css-1789nl8 etcg8100"><a class="css-1g7m0tk" href="https://www.nytimes.com/column/new-sentences">New Sentences</a></p><div class="css-30n6iy e345g290"><div class="css-acwcvw"></div></div><figure class="ResponsiveMedia-media--32g1o ResponsiveMedia-sizeSmall--3092U ResponsiveMedia-layoutVertical--1pg1o ResponsiveMedia-sizeSmallNoCaption--n--T0 css-1hzd7ei"><figcaption class="css-pplcdj ResponsiveMedia-caption--1dUVu"></figcaption></figure></header><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0"><em class="css-2fg4z9 ehxkw330">— From Keith Gessen’s second novel, “A Terrible Country” (Viking, 2018, Page 4). Gessen is also the author of “All the Sad Young Literary Men” and a founding editor of the journal n+1.</em></p><p class="css-1i0edl6 e2kc3sl0">All authors have signature sentence structures — deep expressive grooves that their minds instinctively find and follow. (That previous sentence is one of mine: a simple declaration that leaps, after the break of a long dash, into an elaborate restatement.)</p><p class="css-1i0edl6 e2kc3sl0">Here is one of Keith Gessen’s:</p><p class="css-1i0edl6 e2kc3sl0">“As for me, I wasn’t really an idiot. But neither was I not an idiot.”</p><p class="css-1i0edl6 e2kc3sl0">“I hadn’t been yelling, I didn’t think. But I hadn’t not been yelling either.”</p><p class="css-1i0edl6 e2kc3sl0">“Cute cafes were not the problem, but they were also not, as I’d once apparently thought, the opposite of the problem.”</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0">Sentence structures are not simply sentence structures, of course — they are miniature philosophies. Hemingway, with his blunt verbal bullets, is making a huge claim about the nature of the world. So is James Joyce, with his collages and frippery. So are Nikki Giovanni and Samuel Delany and Ursula K. Le Guin and John McPhee and Missy Elliott and Dr. Seuss and anyone else who converts thoughts into prose.</p><p class="css-1i0edl6 e2kc3sl0">Likewise, Keith Gessen’s signature sentence structure — “not X, but also not not X” — suggests an entire worldview. It is a universe of in-betweenness, in which the most basic facts of life, the things we absolutely expect to understand, spill and scatter like toast crumbs into the gaps between the floorboards. It is a world of embarrassingly trivial category errors. The sentences above come from Gessen’s new novel, “A Terrible Country,” the story of a 30-something American man who goes to Russia to care for his elderly grandmother. He falls into the gaps between huge concepts: youth and age, purpose and purposelessness, progress and stasis. He is not Russian but also not not Russian, not smart but also not not smart, not heroic but also not not heroic. Such is the way of the world. No matter how much we try, none of us is ever only one thing. None of us is ever pure.</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="bottom-of-article"><div class="css-k8fkhk"><p>Sam Anderson is a staff writer for the magazine.</p> <p><i>Sign up for </i><i>our newsletter</i><i> to get the best of The New York Times Magazine delivered to your inbox every week.</i></p></div><div class="css-3glrhn">A version of this article appears in print on , on Page 11 of the Sunday Magazine with the headline: From Keith Gessen’s ‘A Terrible Country’<span>. Order Reprints | Today’s Paper | Subscribe</span></div></div><span></span></article></div>
The quote that should be highlighted:
"Sentence structures are not simply sentence structures, of course — they are miniature philosophies"
The reason for the failing regex matches were html entities. Some of the parsed texts inside of the dangerouslySetInnerHTML used entity references. In the failing example above the quote includes a "—" character that in the html is decoded as — .
In order to get rid of the html entities I used the "he" library https://github.com/mathiasbynens/he a robust HTML entity encoder/decoder written in JavaScript.
var contentDecoded = he.decode(this.props.content);
var highlightedHtml = contentDecoded.replace(
regex,
'<span class="annotator-hl">$1</span>'
);

web browser innertext data how received in textbox?

I have posted my HTML below. In which I want to get the name value from within my textbox area. I've tried several processes and I'm still not getting any valid solution. Please check my HTML and code snippet, and show me a possible solution.
The name prefix will always stay the same when I refresh the page. However, the last name within the "name" area will change, but will always contain the literal "mr." as the first 3 digits. regex as ([mM]r.\ ) - Four digits if you consider the literal space. Below is my table example.
<table>
<tr><td><b>Your Name is </b> mr. kamrul</td></tr>
<tr><td><b>your age </b> 12</td></tr>
<tr><td><b>Email:</b>kennethdasma30#gmail.com</td></tr>
<tr><td><b>job title</b> sales man</td></tr>
</table>
As shown below I am trying this process using listbox but I am not receiving anything.
HtmlElementCollection bColl =
webBrowser1.Document.GetElementsByTagName("table");
foreach (HtmlElement bEl in bColl)
{
if (bEl.GetAttribute("table") != null)
{
listBox1.Items.Add(bEl.GetAttribute("table"));
}
}
If anyone ca give me an idea of how I am able to receive all in the browser window as ("mr. " + text) within my list box I would appreciate it. Also, if you can explain the answer verbosely and with good comments I would appreciate it, as I'd like to understand the answer in greater detail as well.
Here is one simple way using Regex, assuming that the format of your html page doesn't change.
Regex re = new Regex(#"(?<=<tr><td><b>Your\sName\sis\s?</b>\s?)[mM]r\.\s.+?(?=</td></tr>)", RegexOptions.Singleline);
foreach (Match match in re.Matches(webBrowser1.DocumentText))
{
listBox1.Items.Add(match.Value);
}

Find all paragraphs of text that are related to a topic

Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing, and not very advanced at regex. Clues about how to get started, what the right terms to google, etc are appreciated.
One basic ideea would be to iterate over your posts and see if any post matches any of the topic.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
Small pseudocode:
for every topicName
for every post
var customRegex = new RegExp('(' + topicName + ')+');
if customRegex.test(post) then
//post matches topicName
else
//post doesn't match topicName
endif
endfor
endfor
Hope it could give you a starting point.
Exact string matching won't take you far, especially with small fragments of text. I suggest you to use semantic similarity for this. A simple web search will give several implementations.

regex to extract data from wikimedia formatted markup documents

i'm trying to extract some data from the wikipedia/wikimedia markup structure in clojure.
{{Infobox company
...
...
|operating_income = {{Increase}} US$ 26.76&nbsp;billion (2013)<ref name=10K/>
|net_income = {{Increase}} US$ 21.86&nbsp;billion (2013)<ref name=10K/>
|assets = {{Increase}} US$ 142.43&nbsp;billion (2013)<ref name=10K/>
|equity = {{Increase}} US$ 78.94&nbsp;billion (2013)<ref name=10K/>
...
}}
i need the information within the {{infobox company .... }} area.
so i used this regex (re-seq #"\{\{(.*?)}\}" above-txt)
but that gave me some of the regexes but still not all. there is a lot of extra data on this page as well as nested {{ }}
you can see the full text at http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=microsoft&prop=revisions&rvprop=content
i think the problem with my regex is that its not dealing with nested {{ .. }} tags.
If regular expressions get frustrating you could consider using Instaparse to make a small parser that could handle arbitrarily nested expressions. I's a bit heavier weight though it will work on more input types.