Extracting a paragraph from articles with a regular expression

I have scraped several articles concerning terrorist attacks. From these articles I would like to extract a specific paragraph.
This is a sample of the articles scraped:
By DAVID D. KIRKPATRICK MARCH 18, 2015
Scenes from Tunisian state television showed confusion outside an art museum and Parliament on Wednesday after gunmen attacked.
CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a
midday attack on a museum in downtown Tunis, dealing a new blow to the tourist industry
that is vital to Tunisia as it struggles to consolidate the only transition to democracy
after the Arab Spring revolts.
Tunisian officials had initially said that the attackers took 10
hostages and killed nine people, including seven foreign visitors and two Tunisians.
What I want to extract for further analysis is the text that, in this example, runs from "CAIRO —" to the first full stop.
This is the regular expression that I came up with:
([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s
With this regular expression I extract only the starting point of the paragraph, but not the rest of it.

Use non-greedy
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)
The ? after a + (or *) makes it non-greedy, meaning it matches as little as possible instead of the default behaviour of matching as much as possible.
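For example, in Python (a minimal sketch; the file name article.txt is an assumption for wherever the scraped text lives):

import re

article = open("article.txt", encoding="utf-8").read()  # hypothetical file holding the scraped article

# The non-greedy +? stops at the first ". " after the dateline instead of running on to the last one.
match = re.search(r'([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s', article)
if match:
    print(match.group(0))  # with the sample above: "CAIRO — Gunmen in military uniforms ... Arab Spring revolts."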

EDIT1:
Try the regex as follows:
([A-Z]+\w+\s*—\s*.*?\.)
This one relies on grouping, and it matches the text that you want.
Or try the following regex (surround your original regex with parentheses):
(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+\.\s)
Group 1 then contains the required text.
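In Python the group can then be read off the match object; a short, self-contained sketch combining the grouping with the non-greedy quantifier from the answer above (the article string here is a shortened stand-in for the real text):

import re

article = "CAIRO — Gunmen in military uniforms killed 19 people on Wednesday in a midday attack on a museum in downtown Tunis. More text follows."
match = re.search(r'(([A-Z]+(?:\W+\w+)?)\s*—[\s\S]+?\.\s)', article)
if match:
    print(match.group(1))  # the whole "CAIRO — ... downtown Tunis." paragraph
    print(match.group(2))  # just "CAIRO"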

Related

RegEx: How to match on the second of 2 or more dates

I am trying to return a match on only the second date where there may be multiple dates present. There is not really a consistent word directly before or after to match on. Is this possible?
Also of note, the dates could be formatted as 01-01-1001, 01/01/1001, or January 01, 2001 (although the last format, with the month spelled out, is uncommon).
Below is an example of the text I would be matching; the second date that appears is an example of what I would want it to return.
Some text fields here
And others here
Exp: 03/31/15
Page:
1
2129364
23675918 INTERNET
05/04/14
12:04 PM MAY
ULTIMATE
42159497 93736662
WEB
04-11-18
Taxed item
June, 14 2018
I tried this code and it seems to work
(?<=(.|\n)*?(\d{2}|January)(\s|\/|-)\d{2}(\s|\/|-)(\d{2}|\d{2})(\s|\/|-))(.|\n)*?(\d{2}|January)(\s|\/|-)\d{2}(\s|\/|-)(\d{2}|\d{2})(\s|\/|-)
This regex basically searches for the pattern you are looking for, preceded by the same pattern and optionally surrounded by other symbols (newlines included).
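Note that the lookbehind above is variable-length, which Python's built-in re module does not support; if you are working in Python, an alternative is to skip the lookbehind, find every date, and take the second hit. A rough sketch (the file name and the exact date alternation are assumptions):

import re

text = open("document.txt", encoding="utf-8").read()  # hypothetical file containing the text shown above

# One alternation per format: 01/01/2001 or 01-01-2001 (2- or 4-digit year), or "January 01, 2001".
date_pattern = (r'\d{2}[/-]\d{2}[/-]\d{2,4}'
                r'|(?:January|February|March|April|May|June|July|August|September|October|November|December)'
                r'[\s,]+\d{1,2},?\s+\d{4}')

dates = re.findall(date_pattern, text)
second_date = dates[1] if len(dates) > 1 else None  # index 1 is the second date found
print(second_date)  # with the sample above: 05/04/14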
One more piece of advice: if you want to test regex patterns, you may want to check out this website.
There you can also switch between languages (I used JavaScript), though the basics stay more or less the same in my opinion.

Regular expression cannot match "</p>" correctly

Hi everyone,
I'm having some difficulty using regular expressions to grep text from HTML that contains
</p>
I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I cannot make it stop at the next </p>
The command I use is:
egrep "unsung hero.*</p>" test
and in test is a webpage like:
<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n\n <p class="disclaimer">Copyright © 2019 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.</p>\n\n <p class="disclaimer">NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n <ul>\n
I'm expecting the match to stop before
</p>\n <p>If you like
but it actually goes way further than that.
I feel like the regular expression I used has an issue, but I don't know what. Any help will be appreciated.
Thanks!
Edit 2019-05-23:
Thanks for your guys' suggestions.
I tried
egrep "unsung hero.*?</p>" test
But it didn't give me the result I want.
Leo, I feel like this is a useful expression and I'd like to get it right. Could you explain a bit?
The other test I did, with [^<]*, actually gave the expected result.
With .* the match will be greedy and match the longest substring possible (which in your case runs all the way to the last closing paragraph tag).
What you actually want is a non-greedy match with .*?
Your specific command should most likely look like this:
grep -P -o "unsung hero.*?</p>" test
Another solution would be to expand your regex until the end of the string/webpage and then pick the selected substring with a group.
UPDATE
As Charles Duffy correctly pointed out, this will not work with the standard (POSIX ERE) syntax. Therefore the command above uses the -P flag to specify a Perl-compatible regular expression.
If your system or application does not support Perl-compatible regular expressions and you are OK with matching until the first < (rather than until the first </p>), matching every character except < is the way to go.
With this, the complete command should look like this:
grep -o "unsung hero[^<]*</p>" test
Thanks to Charles for pointing that out in the comments.
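If the extraction ends up being done in Python rather than grep, both ideas carry over directly; a small sketch, assuming the question's file is named test:

import re

html = open("test", encoding="utf-8").read()  # the same file the grep commands read

# Lazy .*? stops at the first </p>; re.DOTALL lets . match newlines in case the paragraph spans lines.
lazy = re.search(r'unsung hero.*?</p>', html, re.DOTALL)

# The POSIX-friendly variant: [^<]* cannot run past the next tag because it excludes '<' entirely.
no_angle = re.search(r'unsung hero[^<]*</p>', html)

print(lazy.group(0) if lazy else "no match")
print(no_angle.group(0) if no_angle else "no match")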

Regular Expression to match sentences

I'm trying to make a regular expression in Python that matches sentences. The one I have that mostly works is [^\.\?\!].*?[\.\?\!], but with the test sentences below it makes a few errors, as you can see on https://regex101.com/. I'm looking for a regular expression that handles all the problems below, such as ellipses, honorifics, and abbreviations like i.e.
For performing tokenization in languages other than English, we can
load the respective language pickle file found in tokenizers/punkt and
then tokenize the text in another language, which is an argument of
the tokenize() function. For the tokenization of French text, we will
use the french.pickle file as follows: Mr. Smith bought cheapsite.com
for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam
Jones Jr. thinks he didn't. In any case, this isn't true... Well, with
a probability of .9 it isn't.
P.S. If you're wondering, I got the above sentences from a natural language processing book and another Stack Overflow question on the same subject.
The easiest way is to split it into 3 operations:
1. Substitute i.e., ellipses and whatever else you want with markers that contain no dots, like ###ie### and ###ellipsis###.
2. Match the sentences.
3. Rebuild i.e. and the ellipses afterwards.
Update: here is some code showing how to do it. You have to do a substitution for every dotted item you want to exclude from the sentence matcher.
import re

sentences = re.sub(r'i\.e\.', "###ie###", sentences)            # hide the dots in "i.e."
matches = re.findall(r'[^.?!]+[.?!]', sentences)                 # match the individual sentences
matches = [re.sub(r'###ie###', "i.e.", m) for m in matches]      # put "i.e." back into each match
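The same marker trick extends to the other troublesome tokens, such as ellipses and honorifics; a rough sketch (the marker names are arbitrary, and dotted tokens like decimals or domain names would still need markers of their own):

import re

protected = [(r'\.\.\.', '###ellipsis###', '...'),
             (r'\bi\.e\.', '###ie###', 'i.e.'),
             (r'\bMr\.', '###mr###', 'Mr.'),
             (r'\bJr\.', '###jr###', 'Jr.')]

def split_sentences(text):
    for pattern, marker, _ in protected:               # hide every dotted abbreviation
        text = re.sub(pattern, marker, text)
    sentences = re.findall(r'[^.?!]+[.?!]', text)      # a dot now really ends a sentence
    for _, marker, original in protected:              # put the abbreviations back
        sentences = [s.replace(marker, original) for s in sentences]
    return sentences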

Regex to 'clean' main words from suffixes recognized by some repetitive patterns

I have this example list:
Veep - Season 1 BDMux.torrent
Vegas S01e01-21.torrent
Velvet S01e13.torrent
Velvet.e10.torrent
Velvet_e01.torrent
Veronica Mars s01.torrent
Vicious S01e01-06.torrent
Victor Ros S01e01-06.torrent
Video.Game.High.School.S01e01-09.XviD.torrent
Vikings - Season 1 EXT.torrent
Vikings_S04e04.avi.torrent
I want to eliminate similar lines like velvet. or velvet_, consolidate them into one, and finally print this:
Veep
Vegas
Velvet
Veronica Mars
Victor Ros
Video Game High School
Vikings
How can I do this with regex?
Doing all that in one regex is, I'd say, impossible. However, this regex
^(.*?)[ ._-]*(?:s\w*\s*\d+)?(?:e\d\d(?:-\d\d)?)?[\s.]*\w*?\.torrent(?:[\s\S]*\1.*$)*$
handles what you threw at us ;). There's one "but" though: it can't remove the dots in titles like Video.Game.High.School.
And it requires the shows to be grouped together, as in your example (e.g. all the Velvet entries together). That ought to be easily solved with Notepad++'s Edit > Line Operations > Sort Lines in Ascending order, though.
Check it out here at regex101.
What it does is capture everything up to the season and/or episode, allow for an optional format tag, and finally match .torrent. It then optionally matches everything up to a possible repeat of the first capture and whatever follows up to the end of the line. That last step is repeated until no more matches are found. The capture group now holds the name of the show, while the regex matches all of that show's lines. Thus, replacing the whole match with the capture leaves only one clean entry for each show.
This means that it won't handle cases where one show's name starts with the complete name of another show, e.g. American Crime and American Crime Story, since the first would match the second and therefore keep matching until the end of the second. This could be fixed by including the test for season/episode in the second part of the regex, but I opted not to, to keep it simpler and faster.
So, you say in a comment "regex does not need to be perfect". Well, here's one that gets most of the job done for you - but isn't perfect.
Regards
Edit
Made some updates and simplified the regex considerably. Here's the old one if you want the more specific version:
^(.*?)[ ._]?(?:-? season \d+|(?:s\d\d)?(?:e\d\d(?:-\d\d)?)?)[\s.]*(?:bdmux|xvid|ext|avi)?\.torrent(?:[\s\S]*\1.*$)*$
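If a single regex isn't a hard requirement, a short script that cuts each line at the first season/episode marker and de-duplicates the results gives much the same list; a rough Python sketch, assuming the list is saved as torrents.txt:

import re

seen, names = set(), []
with open("torrents.txt", encoding="utf-8") as fh:  # hypothetical file holding the list above
    for line in fh:
        # Cut the title at the first season/episode marker or at the .torrent suffix.
        name = re.split(r'[ ._-]*(?:season \d+|s\d\d|e\d\d)|\.torrent',
                        line.strip(), flags=re.IGNORECASE)[0]
        name = name.replace('.', ' ').replace('_', ' ').strip(' -')
        if name and name not in seen:
            seen.add(name)
            names.append(name)

print('\n'.join(names))  # one cleaned, de-duplicated name per show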

Why is this line of regex capturing white spaces?

I'm using the following regex, which I found in this SO answer:
(?:[\w[a-z]-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.??][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’])
I am testing it on the following string:
"Quattro Amici in Concert Mar. 3, 2014. Long-time collaborators Lun Jiang, violin; Roberta Zalkind, viola; Pegsoon Whang, cello; and Karlyn Bond, piano, will perform works by Franz Joseph Haydn, Wolfgang Amadeus Mozart, Ludwig van Beethoven and Gabriel Faure. To purchase tickets visit westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
I'm simply attempting to extract URLs from strings, and based on a bunch of SO answers I've found that regex is the recommended tool for the job. I'm not a regex expert (or even intermediate in my understanding), so I'm baffled by the empty strings my re.findall() keeps returning. I've stepped through the regex in RegexBuddy and still had no luck. Any help would be hugely appreciated.
I'm not sure that a big regex like that is entirely necessary - if you're just looking to get links, you could use a much simpler regex, like this:
/(https?:\/\/[\w\d\$-_\.\+!\*'\(\),\/#]+)/ig
According to RFC 1738, URLs are only allowed to use the characters specified in the class above, so it should cover any valid URL without such a gigantic mess of a regex.
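Translated to Python, that could look roughly like this (the hyphen is escaped here so that $-_ inside the class isn't read as a character range):

import re

text = ("Quattro Amici in Concert Mar. 3, 2014. ... To purchase tickets visit "
        "westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: "
        "http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf")

# The character class from the regex above, hyphen escaped; only URLs with an explicit http(s) scheme match.
urls = re.findall(r"https?://[\w$\-.+!*'(),/#]+", text, re.IGNORECASE)
print(urls)  # ['http://entertainment.sltrib.com/...#sthash.QRsLXXiA.dpuf']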
You can also use a tool like regexpal.com to validate regexes, which helps with finding issues. That said, I pasted your regex in there and it crashed Chrome, so it may not be a great help for a beast like that :)