Regular expression cannot match "</p>" correctly - regex

Hi everyone,
I'm having some difficulty using regular expressions to grep text out of HTML that contains </p> tags.
I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I cannot make the match stop at the next </p>.
The command I use is:
egrep "unsung hero.*</p>" test
and in test is a webpage like:
<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n\n <p class="disclaimer">Copyright © 2019 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.</p>\n\n <p class="disclaimer">NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n <ul>\n
I'm expecting the match to end before
</p>\n <p>If you like
but it actually goes much further than that.
I suspect the regular expression I used is the problem, but I don't know what is wrong with it. Any help will be appreciated.
Thanks!
20190523:
Thanks for your suggestions, guys.
I tried
egrep "unsung hero.*?</p>" test
But it didn't give me the result I wanted.
Leo, I feel like this is a useful expression and I'd like to get it right. Could you explain a bit?
The other test I did, with
[^<]*
actually gave the expected result.

With .* the match is greedy: it matches the longest substring possible. (In your case that is everything up to the last </p> on the line, since grep matches line by line.)
What you actually want is a non-greedy match with .*?, which stops at the first </p>.
Your specific command should most likely look like this:
grep -P -o "unsung hero.*?</p>" test
Another solution would be to extend your regex to the end of the string/webpage and then pick out the wanted substring with a capture group.
UPDATE
As Charles Duffy correctly pointed out, .*? will not work with the standard (POSIX ERE) syntax. Therefore the command above uses the -P flag to tell grep that the pattern is a Perl-compatible regular expression.
If your system or application does not support Perl regular expressions and you are OK with matching until the first < (instead of matching until the first </p>), matching every character except < is the way to go.
With this, the complete command should look like this:
grep -o "unsung hero[^<]*</p>" test
Thanks to Charles for pointing that out in the comments.
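For anyone who wants to see the three variants side by side, here is a minimal Python sketch. The page string is a shortened stand-in for the one-line page in the question (an assumption, not the real file), and Python's re module is used only because it supports .*? out of the box.

```python
import re

# Shortened stand-in for the question's page: one line, with literal "\n" text in it.
page = (
    "<p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\\n "
    "<p>If you like today's episode, please share it.</p>\\n "
    "<p>(SOUNDBITE OF MUSIC)</p>"
)

# Greedy: .* runs to the LAST </p> on the line.
greedy = re.search(r"unsung hero.*</p>", page).group(0)
# Non-greedy: .*? stops at the FIRST </p>.
lazy = re.search(r"unsung hero.*?</p>", page).group(0)
# Negated class: [^<]* cannot cross any "<", so it also stops at the first tag.
negated = re.search(r"unsung hero[^<]*</p>", page).group(0)

print(greedy)
print(lazy)
print(negated)
```

The last two give the same result here only because the paragraph contains no other tag; with inline markup such as <b> inside the paragraph, the [^<]* version would fail to match while .*? would still work.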

Related

Regex not returning string if there's another one

Given this regex query string:
(?:<.*>)?(?:.*)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?
I need to be able to select
"minister, a loyal party member with black rimmed glasses told us. He's the best man for the job."
which I'm able to do from the following text:
<p>This is some text before, "minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." www.plainview.io/archives/SysteBvsl</a> and some text after</p>
but not from the following:
<p>This is some text before, "minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." www.plainview.io/archives/SysteBvsl and some text after</p>
Instead, for the latter, I get
\\/si\\/ajax\\/l\\/render_linkshim_log\\/?u=http\\u00253A\\u00252F\\u00252Fwww.plainview.io\\u00252Farchives\\u00252FSysteBvsl&h=ATPBq9DrC_xIokWhmxk7f3nyKGofYnM9zGt3mF-7bfMNNupsX0WSR4TdE6VmX6W9gd_1Rnby1nXfIfq3MzgOS2PKryxKu9z3yci0ZvomiLHvYbVSfuwg29Y1Z_R1LEKRDXO3sAOZ2dsMgQ&enc=AZMnRgfaZaV-J1wtvqulToF-RxOlkhgY6kzmkLuXSv26a0waxI3nHsI1rXkl-ILjrXkcnwajsVFizefc27K5A_WlqpJrNQLKWSTnDSwIGHGHYvWDp1CWeBP8vbzcQZcnJHA-ka3LvpJIYIO7_YwPaEpKsT0I0nNewd0aHZYbPtHghob7_7a_fubIkIy5g3R7ExA&d&
Why is it that when I add more text (that is actually AFTER the string I need), it selects the one that comes after?
You should learn about how regexes run internally.
Your problem here is mainly the overly complex regex combined with greediness. This version:
(?:<.*>)?(?:.*?)?("|quot;)(.*)(\1)(?:.*)?(?:<.*>)?(?:http(s)?:\/\/)?(?:w{3})?plainview.io\/archives\/(\w+)(?:.*)?(?:<.*>)?
will solve your problem. All I did was replace the first (?:.*) with (?:.*?) (adding a ?).
A good resource I just found is "Why Using the Greedy .* in Regular Expressions Is Almost Never What You Actually Want".
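The effect of greediness is easier to see on a stripped-down pattern than on the full one. Below is a small Python sketch; the trailing <a href="/other"> tag is a made-up addition standing in for "more text with quotes after the string I need", and Python is just my choice of engine.

```python
import re

text = """<p>This is some text before, "minister, a loyal party member with black rimmed glasses told us. He's the best man for the job." www.plainview.io/archives/SysteBvsl and some text after</p>"""

# Hypothetical extra markup after the paragraph, containing more double quotes.
longer = text + ' <a href="/other">more</a>'

# Greedy: (.*) grabs as much as it can, so it ends at the LAST quote in the input.
greedy = re.search(r'"(.*)"', longer).group(1)
# Lazy: (.*?) grows one character at a time and ends at the FIRST closing quote.
lazy = re.search(r'"(.*?)"', longer).group(1)

print(greedy)
print(lazy)
```

As soon as later text contains another quote, the greedy version swallows everything in between, which is the same thing that happens when more text is appended in the question.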
A much simpler way to get the same result is this regex:
"(.*?)"

Extracting a paragraph from articles | Regular Expression

I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.
This is a sample of the articles scraped:
By DAVID D. KIRKPATRICK MARCH 18, 2015
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry
that is vital to Tunisia as it struggles to consolidate the only transition to democracy
after the Arab Spring revolts.
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.
What I want to extract for further analysis is the text that runs, in this example, from "CAIRO —" to the first full stop.
This is the regular expression that I came up with:
([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s
With this regular expression I extract only the starting point of the paragraph but I don't extract the rest of it.
Use a non-greedy quantifier:
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)
The ? after a + (or *) makes it non-greedy, meaning it matches as little as possible instead of the default behaviour of matching as much as possible.
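To check the difference quickly, here is a small Python sketch run against the excerpt from the question. Using Python's re module is my assumption, since the question does not say which engine is in use.

```python
import re

article = (
    "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a\n"
    "midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry\n"
    "that is vital to Tunisia as it struggles to consolidate the only transition to democracy\n"
    "after the Arab Spring revolts.\n"
    "Tunisian officials had initially said that the attackers took 10\n"
    "hostages and killed nine people, including seven foreign visitors and two Tunisians.\n"
)

# Greedy [\s\S]+ runs to the LAST "full stop + whitespace" in the text.
greedy = re.search(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)", article).group(1)
# Non-greedy [\s\S]+? stops at the FIRST "full stop + whitespace".
lazy = re.search(r"(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)", article).group(1)

print(lazy.strip())
```

Note that "first full stop followed by whitespace" will also stop at abbreviations such as "Mar. 3", so the result may need a sanity check on other articles.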
EDIT1:
Try the regex as follows:
([A-Z]+\w+\s*—\s*.*?\.)
The problem is the grouping; your regex already matches the text that you want.
Try the following regex (surround the whole regex with parentheses):
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)
Group 1 contains the required string/text.

Why is this line of regex capturing white spaces?

I'm using the following line of regex which I found from this SO answer:
(?:[\w[a-z]-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.??][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’])
I am testing it on the following string:
"Quattro Amici in Concert Mar. 3, 2014. Long-time collaborators Lun Jiang, violin; Roberta Zalkind, viola; Pegsoon Whang, cello; and Karlyn Bond, piano, will perform works by Franz Joseph Haydn, Wolfgang Amadeus Mozart, Ludwig van Beethoven and Gabriel Faure. To purchase tickets visit westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
I'm simply attempting to extract URLs from strings and, based on a bunch of SO answers, I've found that regex is the recommended tool for that job. I'm not a regex expert (or even intermediate in my understanding), so I'm baffled by the empty strings my re.findall() keeps returning. I've stepped through the regex using RegexBuddy and still no luck. Any help would be hugely appreciated.
I'm not sure that a big regex like that is entirely necessary - if you're just looking to get links, you could use a much simpler regex, like this:
/(https?:\/\/[\w\d\$-_\.\+!\*'\(\),\/#]+)/ig
According to RFC 1738, URLs are only allowed to use the characters specified in the class above, so it should cover any valid URL without such a gigantic mess of a regex.
You can also use a tool like regexpal.com to validate regexes, which helps find issues. That said, I pasted your regex in there and it crashed Chrome, so it may not be a great help for a beast like that :)
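On the empty strings specifically: in Python, re.findall() returns the contents of the capturing groups instead of the whole match whenever the pattern contains groups, and a group that took no part in the match comes back as ''. That is a likely cause here, since the big regex is full of (...) groups. The sketch below uses a deliberately simplified URL pattern (my assumption, not the regex from the question) to show the effect and the usual fix, which is to make the groups non-capturing with (?:...).

```python
import re

text = (
    "To purchase tickets visit westminstercollege.edu/culturalevents or call "
    "801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/"
    "quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
)

# One capturing group: findall returns only that group's text.
# The optional group does not participate in the match, so we get ''.
with_group = re.findall(r"https?://[^\s()<>]+(\([^\s()<>]*\))?", text)

# Same pattern with a non-capturing group: findall returns the whole match.
without_group = re.findall(r"https?://[^\s()<>]+(?:\([^\s()<>]*\))?", text)

print(with_group)
print(without_group)
```

If the groups have to stay, re.finditer() with match.group(0) is another way to get at the full match.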

REGEX: best practice to insert before, after or between?

i'm nervous as hell asking this question since there's a LOT of RegEx posts out there. but i'm asking for best method as well, so i'm going to risk it (fully expecting a rep hit if i botch the job...)
i've been given a list to reformat. 120 questions and answers (240 tag sets total). * glark * all i need to do is make the text between the tags a link, like so:
<li>do snails make your feet itch?</li>
has to become
<li><a href="#n">do snails make your feet itch?</a></li>
THIS IS NOT A JAVASCRIPT/PHP RegEx question. it is JUST RegEx that i can drop into the search/replace fields of my IDE. i'll likely try and do a batch replace afterwards with PERL to insert the 'n' variable so the links point properly.
and i know you're going to ask 'if you can use PERL for that, why not the whole shebang?' and that's a valid question, but i want to be using RegEx more for the power it has for big lists like this. plus my PERL skills are sketchy at best... unless you want to tack that on as well... :D heh heh.
if this question can't be answered or is wrong for this part of the forum, please accept my apologies and point me in the right direction.
many thanks!
WR!
You can do it in two steps.
Substitute <li> with <li><a href="#n">
Substitute </li> with </a></li>
Or you can try to be clever and do it in one. Here is a substitute command in Perl syntax ($1 references what was matched in the brackets).
s,<li>(.*)</li>,<li><a href="#n">$1</a></li>,
And while you are there it's easy to replace the second part of the replacement pattern with an expression that will increment n
s,<li>(.*)</li>,q{<li><a href="#}.(++$n).qq{">$1</a></li>},e
See how you can run this from the command line:
echo '<li>do snails make your feet itch?</li>' |
perl -pe 's,<li>(.*)</li>,q{<li><a href="#}.(++$n).qq{">$1</a></li>},e'
<li><a href="#1">do snails make your feet itch?</a></li>
Search
<li>(.*?)</li>
Replace
<li><a href="#n">$1</a></li>
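If a scripting language is an option after all, the search/replace and the numbering can be done in one pass. This is only a sketch in Python rather than Perl: the second list item is invented for illustration, and the "#1", "#2", ... link targets are my assumption about what the 'n' variable should become.

```python
import re
from itertools import count

html = "<li>do snails make your feet itch?</li>\n<li>second question?</li>"

counter = count(1)  # yields 1, 2, 3, ... for the link targets


def add_anchor(match):
    # match.group(1) is the text between <li> and </li>.
    return f'<li><a href="#{next(counter)}">{match.group(1)}</a></li>'


# Non-greedy (.*?) keeps each match inside a single <li>...</li> pair.
linked = re.sub(r"<li>(.*?)</li>", add_anchor, html)
print(linked)
```

The replacement function is called once per match, in order, so each item gets the next number.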

perl regex problem -- $amp in yahoo finance page

I found an old perl hack on the O'Reilly site http://oreilly.com/pub/h/1041 and decided to check it out. After a little fiddling around it started to run but the regex are out of date.
Here is the question: with this
/<a href="\/q\/op\?s=(.*?)\&m=(.*?)">/
as the first line of regex, what needs to be modified to make the regex function again? The following are snippets from
http://finance.yahoo.com/q/op?s=FISV:
<a href="/q/op?s=FISV&k=55.000000">
and
<a href="/q/os?s=FISV&m=2011-04-15">
The original hack is dated 2004 and option symbols looked like this (FQVAH or FQVFF) back then instead of fisv110416c00060000 for a call option and fisv110416p00090000 for a put option. First thing I did to get it going was to modify all instances of $url to $curl because until the name was changed the symbol was not being passed to yahoo for lookup. The &amp is giving me the most trouble. If this is found to run without modification I would be very surprised and would very much like to know what system and perl -V is installed. SLES 10 and perl 5.8.0 is what I am currently using.
Any suggestions would be helpful. It could be a useful script to anyone who is serious about protecting themselves from a falling equity market.
Thanks,
robm
I'm not /100%/ sure what you're asking, but if I'm understanding, you want a regex that will capture "fisv110416c00060000" and tell you the first few letters, whether it's a call or a put, and the amount?
If so, you're looking for something like:
/([a-z]+)(\d+)([cp])(\d+)/
That should capture the following for the first example
$1 = "fisv"
$2 = 110416
$3 = c
$4 = 00060000
The original regex was very specific to that html string. You can include the beginning bits of it if you need to use it to check that the entire string is there as well. Of course, make your regex as tight as possible to avoid over-matches and wasted time pattern matching. I'm just not sure the exact pattern you're trying to match (ie: is it always "fisv"?).
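Here is the same pattern as a quick Python sketch, using the two symbols from the question. The group names (prefix, digits, kind, amount) are labels I made up for readability; they are not anything Yahoo defines.

```python
import re

# Same structure as the regex above, with named groups added for readability.
OPTION_SYMBOL = re.compile(
    r"(?P<prefix>[a-z]+)(?P<digits>\d+)(?P<kind>[cp])(?P<amount>\d+)"
)

call = OPTION_SYMBOL.fullmatch("fisv110416c00060000").groupdict()
put = OPTION_SYMBOL.fullmatch("fisv110416p00090000").groupdict()

print(call)
print(put)
```

fullmatch() is used so that a string with anything before or after the symbol is rejected instead of partially matched.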
You should either first unescape the HTML, which would turn the &amp; into a &, or just change the regex, like this:
/<a href="\/q\/os\?s=(.*?)\&(?:amp;)?m=(.*?)">/
To match both types of urls:
/<a href="\/q\/o[ps]\?s=(.*?)\&(?:amp;)?[mk]=(.*?)">/
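Both approaches can be tried out quickly in Python; the sketch below is an illustration, not a port of the original hack. The first link is written with &amp; on the assumption that this is how it appears in the raw page source, and the second is kept as shown in the question.

```python
import html
import re

links = [
    '<a href="/q/op?s=FISV&amp;k=55.000000">',  # assumed raw form, with &amp;
    '<a href="/q/os?s=FISV&m=2011-04-15">',     # form shown in the question
]

# Approach 1: a pattern that tolerates both "&" and "&amp;", and both URL types.
pattern = re.compile(r'<a href="/q/o[ps]\?s=(.*?)&(?:amp;)?[mk]=(.*?)">')
found = [pattern.search(link).groups() for link in links]
print(found)

# Approach 2: unescape the HTML first, then a plain "&" is enough.
unescaped = html.unescape(links[0])
print(unescaped)
```

Unescaping first keeps the regex simpler, at the cost of one extra pass over the page.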