regex to extract data from wikimedia formatted markup documents

I'm trying to extract some data from the Wikipedia/Wikimedia markup structure in Clojure.
{{Infobox company
...
...
|operating_income = {{Increase}} US$ 26.76&nbsp;billion (2013)<ref name=10K/>
|net_income = {{Increase}} US$ 21.86&nbsp;billion (2013)<ref name=10K/>
|assets = {{Increase}} US$ 142.43&nbsp;billion (2013)<ref name=10K/>
|equity = {{Increase}} US$ 78.94&nbsp;billion (2013)<ref name=10K/>
...
}}
I need the information within the {{Infobox company ... }} area.
So I used this regex: (re-seq #"\{\{(.*?)}\}" above-txt)
But that gave me only some of the matches, not all of them. There is a lot of extra data on this page, as well as nested {{ }}.
You can see the full text at http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=microsoft&prop=revisions&rvprop=content
I think the problem with my regex is that it's not dealing with nested {{ .. }} tags.

If regular expressions get frustrating, you could consider using Instaparse to make a small parser that can handle arbitrarily nested expressions. It's a bit more heavyweight, though it will work on more input types.
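To illustrate why nesting needs more than a lazy regex, here is a minimal sketch in Python (not Clojure or Instaparse) that tracks brace depth by hand. The function names extract_template and infobox_fields are hypothetical, and the field parser assumes one "|key = value" entry per line, which holds for the sample in the question but not for every infobox.

```python
import re


def extract_template(text, name):
    """Return the whole {{name ...}} block, honouring nested {{ }} pairs."""
    opening = re.search(re.escape("{{" + name), text, re.IGNORECASE)
    if opening is None:
        return None
    start = opening.start()
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # braces never balanced


def infobox_fields(block):
    """Split '|key = value' lines into a dict (assumes one field per line)."""
    fields = {}
    for line in block.splitlines():
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "intro {{Infobox company\n"
    "|net_income = {{Increase}} US$ 21.86&nbsp;billion (2013)\n"
    "|equity = {{Increase}} US$ 78.94&nbsp;billion (2013)\n"
    "}} trailing text {{other}}"
)
infobox = extract_template(sample, "Infobox company")
fields = infobox_fields(infobox)
```

The depth counter is what the original regex lacks: the lazy (.*?) stops at the first }} it sees, which here belongs to the nested {{Increase}} template.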

Related

React - Using Regex for highlighting text inside of dangerouslySetInnerHTML. Not working reliably

The goal is to highlight text parts (strings) inside of a dangerouslySetInnerHTML. To do that, I try to match the desired text part inside the HTML and wrap it in a "span" with appropriate styling. I am using the following code, which works flawlessly for certain texts (HTML) but not at all for others. Please find a working and a non-working example below. I have been trying for hours to understand the difference, or why the regex does not work, but I can't figure it out.
My question is: Why is the regex failing in some cases and working in others? Even though in all cases the text ("quote") is there.
Any ideas what I am missing? Thanks so much for your help!!!
Highlighting Component JSX:
import React from "react";
class HighlightQuote extends React.Component {
render = () => {
//zitat is for getting rid of any quotation marks in the beginning or end.
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
if (this.props.quotes.length === 0) {
var highlightedHtml = this.props.newcontent
}
else {
var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
const regex = new RegExp(`(${zitat.join('|')})`, 'g');
var highlightedHtml = this.props.content.replace(
regex,
'<span class="hl">$1</span>'
);
console.log ('highlightedHtml:');
console.log (highlightedHtml);
}
return (
<div className="reader" ref="test" dangerouslySetInnerHTML={{ __html: highlightedHtml }} />
);
};
}
export default HighlightQuote;
Working example (console.log of the highlighted HTML):
<div class="post" id="post-17660">
<p class="postcontents">
<article> <div class="post-inside">
<p>One of the things I have disliked the most about the crypto sector is the idea that people should “hodl” or “hold on for dear life.”</p>
<p>I have written many times here at AVC that one should take profits when they are available and diversify an investment portfolio.</p>
<p><span class="hl">The idea that an investor should hold on no matter what has always seemed ridiculous to me.</span></p>
<p>Now, the crypto markets are in the eighth month of a long and painful bear market and we are starting to see some signs of capitulation, particularly in the assets that went up the most last year.</p>
<p>Whether this is the long-awaited capitulation of the HODL crowd or not, I can’t say.</p>
<p>But capitulation would be a good thing for the crypto markets, releasing assets into the market that until now have been locked up by long-term holders.</p>
<p><span class="hl">Until then it is hard to get excited about buying anything in crypto.</span></p>
</div> </article>
</p> </div>
Quotes that are highlighted as expected:
"The idea that an investor should hold on no matter what has always seemed ridiculous to me."
"Until then it is hard to get excited about buying anything in crypto."
Failing example (console.log of the highlighted HTML):
<div><article id="story" class="Story-story--2QyGh css-1j0ipd9"><header class="css-1qcpy3f e345g291"><p class="css-1789nl8 etcg8100"><a class="css-1g7m0tk" href="https://www.nytimes.com/column/new-sentences">New Sentences</a></p><div class="css-30n6iy e345g290"><div class="css-acwcvw"></div></div><figure class="ResponsiveMedia-media--32g1o ResponsiveMedia-sizeSmall--3092U ResponsiveMedia-layoutVertical--1pg1o ResponsiveMedia-sizeSmallNoCaption--n--T0 css-1hzd7ei"><figcaption class="css-pplcdj ResponsiveMedia-caption--1dUVu"></figcaption></figure></header><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0"><em class="css-2fg4z9 ehxkw330">— From Keith Gessen’s second novel, “A Terrible Country” (Viking, 2018, Page 4). Gessen is also the author of “All the Sad Young Literary Men” and a founding editor of the journal n+1.</em></p><p class="css-1i0edl6 e2kc3sl0">All authors have signature sentence structures — deep expressive grooves that their minds instinctively find and follow. (That previous sentence is one of mine: a simple declaration that leaps, after the break of a long dash, into an elaborate restatement.)</p><p class="css-1i0edl6 e2kc3sl0">Here is one of Keith Gessen’s:</p><p class="css-1i0edl6 e2kc3sl0">“As for me, I wasn’t really an idiot. But neither was I not an idiot.”</p><p class="css-1i0edl6 e2kc3sl0">“I hadn’t been yelling, I didn’t think. But I hadn’t not been yelling either.”</p><p class="css-1i0edl6 e2kc3sl0">“Cute cafes were not the problem, but they were also not, as I’d once apparently thought, the opposite of the problem.”</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0">Sentence structures are not simply sentence structures, of course — they are miniature philosophies. Hemingway, with his blunt verbal bullets, is making a huge claim about the nature of the world. 
So is James Joyce, with his collages and frippery. So are Nikki Giovanni and Samuel Delany and Ursula K. Le Guin and John McPhee and Missy Elliott and Dr. Seuss and anyone else who converts thoughts into prose.</p><p class="css-1i0edl6 e2kc3sl0">Likewise, Keith Gessen’s signature sentence structure — “not X, but also not not X” — suggests an entire worldview. It is a universe of in-betweenness, in which the most basic facts of life, the things we absolutely expect to understand, spill and scatter like toast crumbs into the gaps between the floorboards. It is a world of embarrassingly trivial category errors. The sentences above come from Gessen’s new novel, “A Terrible Country,” the story of a 30-something American man who goes to Russia to care for his elderly grandmother. He falls into the gaps between huge concepts: youth and age, purpose and purposelessness, progress and stasis. He is not Russian but also not not Russian, not smart but also not not smart, not heroic but also not not heroic. Such is the way of the world. No matter how much we try, none of us is ever only one thing. None of us is ever pure.</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="bottom-of-article"><div class="css-k8fkhk"><p>Sam Anderson is a staff writer for the magazine.</p> <p><i>Sign up for </i><i>our newsletter</i><i> to get the best of The New York Times Magazine delivered to your inbox every week.</i></p></div><div class="css-3glrhn">A version of this article appears in print on , on Page 11 of the Sunday Magazine with the headline: From Keith Gessen’s ‘A Terrible Country’<span>. Order Reprints | Today’s Paper | Subscribe</span></div></div><span></span></article></div>
The quote that should be highlighted:
"Sentence structures are not simply sentence structures, of course — they are miniature philosophies"

The regex failed to match because of HTML entities. Some of the parsed texts inside the dangerouslySetInnerHTML used entity references. In the failing example above, the quote includes a "—" character that appears in the HTML as an entity reference (such as &mdash;) rather than as the literal character.
To get rid of the HTML entities I used the "he" library (https://github.com/mathiasbynens/he), a robust HTML entity encoder/decoder written in JavaScript.
var contentDecoded = he.decode(this.props.content);
var highlightedHtml = contentDecoded.replace(
regex,
'<span class="annotator-hl">$1</span>'
);
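The same decode-then-match idea can be sketched in Python, with the standard library's html.unescape standing in for he.decode. The highlight function is a hypothetical name, and escaping each quote with re.escape is an addition not present in the original component; it guards against quotes that contain regex metacharacters such as parentheses or question marks.

```python
import html
import re


def highlight(content, quotes):
    """Wrap every occurrence of a quote in a highlighting span."""
    decoded = html.unescape(content)  # turns &mdash; into the literal character
    # Strip surrounding quotation marks and parentheses, as the component does.
    cleaned = [q.strip('“”"’()') for q in quotes]
    cleaned = [q for q in cleaned if q]
    if not cleaned:
        return decoded
    # re.escape keeps characters like "(" or "?" from being read as regex syntax.
    pattern = re.compile("(" + "|".join(re.escape(q) for q in cleaned) + ")")
    return pattern.sub(r'<span class="hl">\1</span>', decoded)


content = (
    "<p>Sentence structures are not simply sentence structures, "
    "of course &mdash; they are miniature philosophies.</p>"
)
quote = (
    '"Sentence structures are not simply sentence structures, '
    'of course — they are miniature philosophies"'
)
result = highlight(content, [quote])
```

One caveat: decoding the whole document also turns entities such as &lt; into real markup characters, so this is only reasonable when the content is trusted.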

Find all paragraphs of text that are related to a topic

Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing, and not very advanced at regex. Pointers on how to get started, which terms to search for, etc. are appreciated.

One basic idea would be to iterate over your posts and check whether each post matches any of the topics.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
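Small sketch in JavaScript (topics and posts are assumed to be arrays of strings):
for (const topicName of topics) {
    for (const post of posts) {
        // 'i' makes the match case-insensitive, so "Sports" matches "sports"
        const customRegex = new RegExp('(' + topicName + ')+', 'i');
        if (customRegex.test(post)) {
            // post matches topicName
        } else {
            // post doesn't match topicName
        }
    }
}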
Hope it could give you a starting point.
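As a rough illustration of the loop described above, here is a Python sketch that groups posts by topic. The function name posts_by_topic is hypothetical. It differs from the answer's regex in two assumed refinements: it matches case-insensitively, and it uses word boundaries so that "coding" does not match inside a longer word such as "decoding".

```python
import re


def posts_by_topic(topics, posts):
    """Map each topic to the posts that mention it as a whole word."""
    result = {}
    for topic in topics:
        pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
        result[topic] = [post for post in posts if pattern.search(post)]
    return result


posts = [
    "Dadadad adada college fgdssfgoksh sports hfjkshgkjshg.",
    "Sports dadadad adada fgdssfgoksh.",
    "Coding adskjdsflk lsdjk coding fhksajhdf.",
]
matches = posts_by_topic(["college", "sports", "coding"], posts)
```

This is still exact keyword matching, so a post about "university" will not be found under "college".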

Exact string matching won't take you far, especially with small fragments of text. I suggest using semantic similarity for this. A simple web search will turn up several implementations.

Regular expression to get HTML table contents

I've stumbled on a bit of challenge here: how to get the contents of a table in HTML with the help of a regular expression. Let's say this is our table:
<table someprop=2 id="the_table" otherprop="val">
<tr>
<td>First row, first cell</td>
<td>Second cell</td>
</tr>
<tr>
<td>Second row</td>
<td>...</td>
</tr>
<tr>
<td>Another row, first cell</td>
<td>Last cell</td>
</tr>
</table>
I already found a method that works, but it involves multiple regular expressions executed in steps:
Get the right table and put its rows in back-reference 1 (there may be more than one in the document):
<table[^>]*?id="the_table"[^>]*?>(.*?)</table>
Get the rows of the table and put the cells in back-reference 1:
<tr.*?>(.*?)</tr>
And lastly fetch the cell contents in back-reference 1:
<td.*?>(.*?)</td>
Now this is all good, but it would be infinitely more awesome to do this all using one fancy regular expression... Does someone know if this is possible?

There really isn’t a possible regex solution that works for an arbitrary number of table data and puts each cell into a separate back reference. That’s because with backreferences, you need to have a distinct open paren for each backref you want to create, and you don’t know how many cells you have.
There’s nothing wrong with using looping of one or another sort to pull out the data. For example, on the last one, in Perl it would be this, given that $tr already contains the row you need:
@td = ( $tr =~ m{<td.*?>(.*?)</td>}sg );
Now $td[0] will contain the contents of the first <td>, $td[1] will contain the second one, etc. If you wanted a two-dimensional array, you might wrap that in a loop like this to populate a new @cells variable:
our $table; # assume has full table in it
my @cells;
while ($table =~ m{<tr.*?>(.*?)</tr>}sg) {
    my $tr = $1; # copy the row before the inner match overwrites $1
    push @cells, [ $tr =~ m{<td.*?>(.*?)</td>}sg ];
}
Now you can do two-dimensional addressing, allowing for $cells[0][0], etc. The outer explicit loop processes the table a row at a time, and the inner implicit loop pulls out all the cells.
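For comparison, the same row-then-cell loop can be sketched in Python using the three patterns from the question. The function name table_cells is hypothetical, and re.S plays the role of Perl's /s flag. Like the Perl version, this only suits simple, well-behaved input.

```python
import re

# The three patterns from the question; re.S lets "." match newlines.
TABLE_RE = re.compile(r'<table[^>]*?id="the_table"[^>]*?>(.*?)</table>', re.S)
ROW_RE = re.compile(r"<tr.*?>(.*?)</tr>", re.S)
CELL_RE = re.compile(r"<td.*?>(.*?)</td>", re.S)


def table_cells(html_text):
    """Return the table as a list of rows, each row a list of cell strings."""
    table = TABLE_RE.search(html_text)
    if table is None:
        return []
    # Outer loop: one entry per row. Inner findall: all cells of that row.
    return [CELL_RE.findall(row) for row in ROW_RE.findall(table.group(1))]


sample = """<table someprop=2 id="the_table" otherprop="val">
<tr>
<td>First row, first cell</td>
<td>Second cell</td>
</tr>
<tr>
<td>Second row</td>
<td>...</td>
</tr>
</table>"""
cells = table_cells(sample)
```

With this, cells[0][0] addresses the first cell of the first row.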
That will work on the canned sample data you showed. If that’s good enough for you, then great. Use it and move on.
What Could Ever Be Wrong With That?
However, there are actually quite a few assumptions in your patterns about the contents of your data, ones I don’t know that you’re aware of. For one thing, notice how I’ve used /s so that it doesn’t get stuck on newlines.
But the main problem is that minimal matches aren’t always quite what you want here. At least, not in the general case. Sometimes they aren’t as minimal as you think, matching more than you want, and sometimes they just don’t match enough.
For example, a pattern like <i>(.*?)</i> will get more than you want if the string is:
<i>foo<i>bar</i>ness</i>
Because you will end up matching the string <i>foo<i>bar</i>.
The other common problem (and not counting the uncommon ones) is that a pattern like <tag.*?> may match too little, such as with
<img alt=">more" src="somewhere">
Now if you use a simplistic <img.*?> on that, you would only capture <img alt=">, which is of course wrong.
I think the last major remaining problem is that you have to altogether ignore certain things in parsing. The simplest demo of this is embedded comments (also <script>, <style>, and CDATA), since you could have something like
<i> some <!-- secret</i> --> stuff </i>
which will throw off something like <i>(.*?)</i>.
There are ways around all these, of course. Once you’ve done so, and it is really quite a bit of effort, you’ll find that you have built yourself a real parser, complete with a lot of auxiliary logic, not just one pattern.
Even then you are only processing well-formed input strings. Error recovery and failing softly is an entirely different art.
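The three failure cases described above can be reproduced in a few lines. This is a small Python demonstration using the same patterns and sample strings as the answer; only the variable names are made up.

```python
import re

ITALIC_RE = re.compile(r"<i>(.*?)</i>", re.S)
IMG_RE = re.compile(r"<img.*?>", re.S)

# Nested tags: the match ends at the first </i>, which closes the inner tag.
nested = ITALIC_RE.search("<i>foo<i>bar</i>ness</i>").group(0)

# A ">" inside an attribute value ends the lazy match too early.
attribute = IMG_RE.search('<img alt=">more" src="somewhere">').group(0)

# A closing tag hidden inside a comment is treated as the real one.
comment = ITALIC_RE.search("<i> some <!-- secret</i> --> stuff </i>").group(1)
```

Here nested is '<i>foo<i>bar</i>', attribute is '<img alt=">' and comment is ' some <!-- secret', none of which is the intended element.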

This answer was added before it was known that the OP needed a solution for C++...
Since using regex to parse html is technically wrong, I'll offer a better solution. You could use js to get the data and put it into a two dimensional array. I use jQuery in the example.
var data = [];
$('table tr').each(function(i, n){
    var $tr = $(n);
    data[i] = [];
    // iterate over the cells and read each one's text
    $tr.find('td').each(function(){
        data[i].push($(this).text());
    });
});
jsfiddle of the example: http://jsfiddle.net/gislikonrad/twzM7/
EDIT
If you want a plain javascript way of doing this (not using jQuery), then this might be more for you:
var data = [];
var rows = document.getElementById('the_table').getElementsByTagName('tr');
for(var i = 0; i < rows.length; i++){
var d = rows[i].getElementsByTagName('td');
data[i] = [];
for(var j = 0; j < d.length; j++){
data[i].push(d[j].innerText);
}
}
Both snippets build the same two-dimensional array.

How to extract values from HTML using RegEx?

Given the following HTML:
<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq: <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares.  This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion.  A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion.  The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>
I'd like to get the values inside the <span> elements. I'd also like to get the value of the class attribute on the <span> elements.
Ideally I could just run some HTML through a function and get back a dictionary of extracted entities (based on the <span> parsing defined above).
The above code is a snippet from a larger source HTML file, which fails to parse with an XML parser. So I'm looking for a possible regular expression to help extract the information of interest.

Use this tool (free):
http://www.radsoftware.com.au/regexdesigner/
Use this Regex:
"<span[^>]*>(.*?)</span>"
The values in Group 1 (for each match) will be the text that you need.
In C# it will look like:
Regex regex = new Regex("<span[^>]*>(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string val = m.Groups[1].Value;
//Do something with the value
}
}
Amended to answer comment:
Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
string toMatch = "<span class=\"ajjsjs\">Some text</span>";
if (regex.IsMatch(toMatch))
{
MatchCollection collection = regex.Matches(toMatch);
foreach (Match m in collection)
{
string cls = m.Groups[1].Value; // "class" is a reserved word in C#
string val = m.Groups[2].Value;
//Do something with the class and value
}
}
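Since the question asks for a dictionary of extracted entities, here is a hedged Python sketch of how the class-and-value pattern could feed one. The function name extract_entities is hypothetical, and the pattern assumes that class is the first attribute, is double-quoted, and that spans are not nested, as in the sample.

```python
import re
from collections import defaultdict

# Group 1 is the class attribute, group 2 the text inside the span.
SPAN_RE = re.compile(r'<span\s+class="([^"]*)"[^>]*>(.*?)</span>', re.S)


def extract_entities(html_text):
    """Group span contents by their class attribute."""
    entities = defaultdict(list)
    for css_class, value in SPAN_RE.findall(html_text):
        entities[css_class].append(value.strip())
    return dict(entities)


sample = (
    '<p><span class="xn-location">OAK RIDGE, N.J.</span>, '
    '<span class="xn-chron">March 16, 2011</span> redeemed '
    '<span class="xn-money">$20 million</span> of '
    '<span class="xn-money">$39 million</span></p>'
)
entities = extract_entities(sample)
```

Each class maps to a list because the same class (for example xn-money) appears several times in the source.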

Assuming that you have no nested span tags, the following should work:
/<span(?:[^>]+class=\"(.*?)\"[^>]*)?>(.*?)<\/span>/
I only did some basic testing on it, but it'll match the class of the span tag (if it exists) along with the data until the tag is closed.

I strongly advise you to use a real HTML or XML parser for this instead. You cannot reliably parse HTML or XML with regular expressions--the most you can do is come close, and the closer you get, the more convoluted and time-consuming your regex will be. If you have a large HTML file to parse, it's highly likely to break any simple regex pattern.
Regex like <span[^>]*>(.*?)</span> will work on your example, but there's a LOT of XML-valid code that's difficult or even impossible to parse with regex (for example, <span>foo <span>bar</span></span> will break the above pattern). If you want something that's going to work on other HTML samples, regex isn't the way to go here.
Since your HTML code isn't XML-valid, consider the HTML Agility Pack, which I've heard is very good.
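To show what the parser route looks like, here is a minimal sketch using Python's standard html.parser module rather than the HTML Agility Pack (which is a .NET library). The class name SpanCollector and the function extract_spans are hypothetical. A stack of open spans lets it handle the nested case that breaks the regex.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (class, text) for every <span>, including nested ones."""

    def __init__(self):
        super().__init__()
        self.open_spans = []  # stack of [class, text parts] for unclosed spans
        self.found = []       # (class, text) pairs, in the order spans close

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans.append([dict(attrs).get("class"), []])

    def handle_data(self, data):
        # Text belongs to every span that is currently open.
        for span in self.open_spans:
            span[1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            css_class, parts = self.open_spans.pop()
            self.found.append((css_class, "".join(parts)))


def extract_spans(html_text):
    collector = SpanCollector()
    collector.feed(html_text)
    collector.close()
    return collector.found


spans = extract_spans(
    '<span class="outer">foo <span class="inner">bar</span></span>'
)
```

Inner spans are reported before the spans that contain them, because a span is recorded when its closing tag is seen.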

Matching innermost braces with regex or strpos?

I have a sort of mini parsing syntax I made up to help me streamline my view code in CakePHP. Basically I have created a table helper which, when given a dataset and (optionally) a set of options for how to format the data, will render out a table, as opposed to me looping through the data and editing it manually.
It allows the user to be as complex or as simple as they like, and it can get pretty powerful. However, in order to achieve this I had to make a simple parsing syntax. As a quick example, the user would do something like so:
$this->Table->data = $userData;
$this->Table->elements['td']['data'] = array(
'{:User.username:}',
'{:User.created:}' => array('Time::nice')
);
echo $this->Table->render();
And when rendering the table would then generate:
<table>
<tbody>
<tr><td>rich97</td><td>Sun 21st 02:30pm</td></tr>
</tbody>
</table>
The problem occurs when I try to nest the braces like so:
{:User.levels.iconClasses.{:User.access:}:}
Is there any way I can get only the innermost brackets on the first pass and loop through until there are no matches? Or even do it in one go? Or even better, use strpos?
Here is my regex as it stands:
'/\{\:([^}]+)\:\}/'

Just add the opening brace to your negated character class:
'/\{:([^{}]+):\}/'
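The "loop until there are no matches" idea from the question can be sketched as follows, in Python rather than PHP. The resolve function and the flat data dictionary are hypothetical stand-ins for the helper's real lookup; the pattern is the one given above.

```python
import re

# Same pattern as above: a {: ... :} pair with no braces inside it.
INNERMOST_RE = re.compile(r"\{:([^{}]+):\}")


def resolve(template, data):
    """Replace innermost placeholders repeatedly until nothing changes."""
    previous = None
    while previous != template:
        previous = template
        # Each pass only touches placeholders that contain no other braces.
        template = INNERMOST_RE.sub(lambda m: str(data[m.group(1)]), template)
    return template


# Hypothetical lookup table keyed by the dotted path.
data = {
    "User.access": "admin",
    "User.levels.iconClasses.admin": "icon-admin",
}
resolved = resolve("{:User.levels.iconClasses.{:User.access:}:}", data)
```

The first pass turns the inner placeholder into "admin", and the second pass resolves the outer one, which by then contains no braces. A missing key raises KeyError in this sketch.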

var $validate= array(
'name'=>array(
'notEmpty' =>array(
'rule'=>'notEmpty',
'message'=>'Please Enter The Name'
),
'isUnique' =>array(
'rule'=>'isUnique',
'message'=>'Name Already Exist'
)
),
'address'=>array(
'rule'=>'notEmpty',
'message'=>'Please Enter The Address')
);
