Webscraping (potentially) ill-formatted HTML in R with xpath or regex - regex

I'm trying to extract the abstract from this link. However, I'm unable to extract only the content of the abstract. Here's what I accomplished so far:
url <- "http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1981-38212013000100001&lng=en&nrm=iso&tlng=en"
textList <- readLines(url)
text <- textList[grep("Abstract[^\\:]", textList)] # get the correct element
text1 <- gsub("\\b(.*?)\\bISSN", "" , text)
Up to this point I get almost what I want, but I couldn't get rid of the rest of the string, which is of no interest to me.
I also tried another approach, with xpath, but without success. I tried something like the code below, which had no effect at all.
library(XML)
arg.xpath <- "//p/@xmlns"
doc <- htmlParse(url) # parse the url
linksAux <- xpathSApply(doc, arg.xpath)
How can I accomplish what I want, either with regex or xpath, or maybe both?
P.S.: my general aim is to scrape several similar pages like the one I provided. I can already extract the links. I only need to get the abstract now.
free(doc)

I would strongly recommend the XML approach because regular expressions with HTML can be quite a headache. I think your xpath expression was just a bit off. Try
doc <- htmlParse(url)
xpathSApply(doc, "//p[@xmlns]", xmlValue)
This returns (clipped for length)
[1] "HOLLANDA, Cristina Buarque de. Human rights ..."
[2] "This article is dedicated to recounting the main ..."
[3] "Keywords\n\t\t:\n\t\tHuman rights; transitional ..."
[4] ""
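As a rough cross-check of the idea behind the xpath (select the p elements that carry an xmlns attribute instead of pattern-matching raw lines), here is a minimal sketch in Python using only the standard library. The class name XmlnsParagraphs and the sample HTML are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class XmlnsParagraphs(HTMLParser):
    """Collect the text of every <p> element that carries an xmlns attribute."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a matching <p>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value may be empty.
        if tag == "p" and any(name == "xmlns" for name, _ in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


# Hypothetical snippet standing in for the downloaded page.
sample_html = (
    "<div><p>ISSN 1981-3821</p>"
    '<p xmlns="">This article is dedicated to recounting ...</p>'
    '<p xmlns=""><b>Keywords</b>: Human rights</p></div>'
)

parser = XmlnsParagraphs()
parser.feed(sample_html)
parser.close()
print(parser.paragraphs)
```

Only the two paragraphs with the attribute are kept; the ISSN paragraph is skipped, which is the same filtering the xpath predicate performs.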

Someone more experienced could probably give you a better answer, but this more or less works:
reg=regexpr("<p xmlns=\"\">(.*?)</p>",text1)
begin=reg[[1]]+12
end=attr(reg,which = "match.length")+begin-17
substr(text1,begin,end)
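For comparison, here is a minimal Python sketch of the same extraction that lets a capture group return the paragraph body, so no offset arithmetic is needed. The sample string is made up and only assumes that the abstract sits in a <p xmlns=""> element, as in the code above.

```python
import re

# Hypothetical stand-in for the line held in text1.
text1 = '<div><p xmlns="">This article is dedicated to recounting ...</p><p>ISSN</p></div>'

# The non-greedy group stops at the first closing </p>.
match = re.search(r'<p xmlns="">(.*?)</p>', text1)
abstract = match.group(1) if match else None
print(abstract)
```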

Here is another approach, which is clunky as written, but shows the technique of keeping the right parts after splitting at tag tokens:
text2 <- sapply(strsplit(x = text1, ">"), "[", 3)
text2
[1] "This article is dedicated to recounting the main initiative of Nelson Mandela's government to manage the social resentment inherited from the segregationist regime. I conducted interviews with South African intellectuals committed to the theme of transitional justice and with key personalities who played a critical role in this process. The Truth and Reconciliation Commission is presented as the primary institutional mechanism envisioned for the delicate exercise of redefining social relations inherited from the apartheid regime in South Africa. Its founders declared grandiose political intentions to the detriment of localized more palpable objectives. Thus, there was a marked disparity between the ambitious mandate and the political discourse about the commission, and its actual achievements.</p"
text3 <- sapply(strsplit(text2, "<"), "[", 1)


How to make full text search in PostgreSQL to search any order of words in the input search_text and also if there are words between them

I'm trying to use Django full-text search but I'm having a problem:
I've followed this documentation, and it works quite well. But my problem is that I don't want PostgreSQL to take the order of the words I search for into account.
I mean I want the results of searching "good boy" and "boy good" to be the same.
Also, when I search "good boy" I want to see "good bad boy" in the results.
But neither of these happens: I can't find "good bad boy" by typing "boy good" or even "good boy" (because of the missing "bad").
I've tried splitting search_text on spaces and then combining the search queries with & to remove the dependence on word order, but it didn't work.
I changed this code:
search_query = SearchQuery(
    search_text
)
search_rank = SearchRank(search_vectors, search_query)
to this:
s = SearchQuery(search_text.split(' ')[0])
for x in search_text.split(' ')[1:]:
    s = s & SearchQuery(x)
search_query = s
search_rank = SearchRank(search_vectors, search_query)
You can achieve this quite easily in PostgreSQL.
I suggest reading carefully the documentation on Basic Text Matching.
What you need is the <-> (FOLLOWED BY) tsquery operator.
If you use an integer instead of the - you can express the desired proximity:
<N>, where N is an integer giving the exact difference between the positions of the matching lexemes (<-> is the same as <1>).
So you should find a way to make Python execute this PostgreSQL function:
to_tsquery('good <1> boy | boy <1> good')
Maybe calling SearchQuery with the string 'good <1> boy | boy <1> good' just works, but you should check the documentation to find out how to use <-> with SearchQuery.
Edited after comment.
Looking at the Django source code, you can see that the constructor of the SearchQuery class takes a parameter search_type with the default value 'plain'.
You can pass a different value to get a different kind of search.
In the Django project documentation, you can find one example for each accepted string value of the named parameter search_type: 'plain', 'phrase', 'raw', 'websearch'.
If you pass search_type='raw' you can provide a formatted search query with terms and operators, but I'm not sure whether Django supports the <N> operator; as I said before, you should try it.
Have you tried a search with search_type='raw', and a value of "'good <1> boy | boy <1> good'"?
If the FOLLOWED BY operator (<->) is not supported, the closest you can get to what you want is calling SearchQuery with search_type='phrase'.
Keep in mind that a phrase query is order-sensitive in PostgreSQL, so to find both good boy and boy good you would have to combine one SearchQuery per word order with |; the exact behaviour also depends on the stop words defined by the dictionary configured for PostgreSQL.
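A minimal sketch of how the raw query string could be assembled in plain Python. Because <N> in PostgreSQL matches an exact distance, "good bad boy" needs <2> rather than <1>, so the helper below ORs together both word orders for every distance up to a limit. The function name proximity_tsquery and the max_distance parameter are hypothetical, and passing the result to SearchQuery(..., search_type='raw') is an untested assumption here.

```python
def proximity_tsquery(first: str, second: str, max_distance: int = 2) -> str:
    """Build a raw tsquery matching the two words in either order,
    at any distance from 1 up to max_distance."""
    # Assumption: each term is a single plain word, so it cannot
    # smuggle tsquery operators into the query string.
    for term in (first, second):
        if not term.isalnum():
            raise ValueError(f"not a plain word: {term!r}")

    clauses = []
    for n in range(1, max_distance + 1):
        clauses.append(f"({first} <{n}> {second})")
        clauses.append(f"({second} <{n}> {first})")
    return " | ".join(clauses)


raw_query = proximity_tsquery("good", "boy")
print(raw_query)
# Hypothetical Django usage (not executed here):
# search_query = SearchQuery(raw_query, search_type="raw")
```

With max_distance=2 the query covers "good boy", "boy good", and the same pairs with one word in between, such as "good bad boy".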

React - Using Regex for highlighting text inside of dangerouslySetInnerHTML. Not working reliably

The goal is to highlight text parts (strings) inside of a dangerouslySetInnerHTML. To do that, I try to match the desired text part inside the HTML and wrap it in a "span" with appropriate styling. I am using the following code, which works flawlessly for certain texts (HTML) but not at all for others. Please find a working and a non-working example below. I've been trying for hours to understand the difference, or why the regex does not work, but I can't figure it out. Banging my head against the wall.
My question is: Why is the regex failing in some cases and working in others? Even though in all cases the text ("quote") is there.
Any ideas what I am missing? Thanks so much for your help!!!
Highlighting Component JSX:
import React from "react";
class HighlightQuote extends React.Component {
  render = () => {
    // zitat: the quotes with any quotation marks or parentheses stripped from the beginning and end.
    var zitat = this.props.quotes.map(x => x.replace(/^[“”"’()]+|[“”"’()]+$/g, ""));
    if (this.props.quotes.length === 0) {
      var highlightedHtml = this.props.newcontent
    }
    else {
      const regex = new RegExp(`(${zitat.join('|')})`, 'g');
      var highlightedHtml = this.props.content.replace(
        regex,
        '<span class="hl">$1</span>'
      );
      console.log ('highlightedHtml:');
      console.log (highlightedHtml);
    }
    return (
      <div className="reader" ref="test" dangerouslySetInnerHTML={{ __html: highlightedHtml }} />
    );
  };
}
export default HighlightQuote;
Working example (output of console.log(highlightedHtml)):
<div class="post" id="post-17660">
<p class="postcontents">
<article> <div class="post-inside">
<p>One of the things I have disliked the most about the crypto sector is the idea that people should “hodl” or “hold on for dear life.”</p>
<p>I have written many times here at AVC that one should take profits when they are available and diversify an investment portfolio.</p>
<p><span class="hl">The idea that an investor should hold on no matter what has always seemed ridiculous to me.</span></p>
<p>Now, the crypto markets are in the eighth month of a long and painful bear market and we are starting to see some signs of capitulation, particularly in the assets that went up the most last year.</p>
<p>Whether this is the long-awaited capitulation of the HODL crowd or not, I can’t say.</p>
<p>But capitulation would be a good thing for the crypto markets, releasing assets into the market that until now have been locked up by long-term holders.</p>
<p><span class="hl">Until then it is hard to get excited about buying anything in crypto.</span></p>
</div> </article>
</p> </div>
Quotes that are highlighted as expected:
"The idea that an investor should hold on no matter what has always seemed ridiculous to me."
"Until then it is hard to get excited about buying anything in crypto."
Failing example (output of console.log(highlightedHtml)):
<div><article id="story" class="Story-story--2QyGh css-1j0ipd9"><header class="css-1qcpy3f e345g291"><p class="css-1789nl8 etcg8100"><a class="css-1g7m0tk" href="https://www.nytimes.com/column/new-sentences">New Sentences</a></p><div class="css-30n6iy e345g290"><div class="css-acwcvw"></div></div><figure class="ResponsiveMedia-media--32g1o ResponsiveMedia-sizeSmall--3092U ResponsiveMedia-layoutVertical--1pg1o ResponsiveMedia-sizeSmallNoCaption--n--T0 css-1hzd7ei"><figcaption class="css-pplcdj ResponsiveMedia-caption--1dUVu"></figcaption></figure></header><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0"><em class="css-2fg4z9 ehxkw330">— From Keith Gessen’s second novel, “A Terrible Country” (Viking, 2018, Page 4). Gessen is also the author of “All the Sad Young Literary Men” and a founding editor of the journal n+1.</em></p><p class="css-1i0edl6 e2kc3sl0">All authors have signature sentence structures — deep expressive grooves that their minds instinctively find and follow. (That previous sentence is one of mine: a simple declaration that leaps, after the break of a long dash, into an elaborate restatement.)</p><p class="css-1i0edl6 e2kc3sl0">Here is one of Keith Gessen’s:</p><p class="css-1i0edl6 e2kc3sl0">“As for me, I wasn’t really an idiot. But neither was I not an idiot.”</p><p class="css-1i0edl6 e2kc3sl0">“I hadn’t been yelling, I didn’t think. But I hadn’t not been yelling either.”</p><p class="css-1i0edl6 e2kc3sl0">“Cute cafes were not the problem, but they were also not, as I’d once apparently thought, the opposite of the problem.”</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="css-18sbwfn StoryBodyCompanionColumn"><div class="css-1h6whtw"><p class="css-1i0edl6 e2kc3sl0">Sentence structures are not simply sentence structures, of course — they are miniature philosophies. Hemingway, with his blunt verbal bullets, is making a huge claim about the nature of the world. 
So is James Joyce, with his collages and frippery. So are Nikki Giovanni and Samuel Delany and Ursula K. Le Guin and John McPhee and Missy Elliott and Dr. Seuss and anyone else who converts thoughts into prose.</p><p class="css-1i0edl6 e2kc3sl0">Likewise, Keith Gessen’s signature sentence structure — “not X, but also not not X” — suggests an entire worldview. It is a universe of in-betweenness, in which the most basic facts of life, the things we absolutely expect to understand, spill and scatter like toast crumbs into the gaps between the floorboards. It is a world of embarrassingly trivial category errors. The sentences above come from Gessen’s new novel, “A Terrible Country,” the story of a 30-something American man who goes to Russia to care for his elderly grandmother. He falls into the gaps between huge concepts: youth and age, purpose and purposelessness, progress and stasis. He is not Russian but also not not Russian, not smart but also not not smart, not heroic but also not not heroic. Such is the way of the world. No matter how much we try, none of us is ever only one thing. None of us is ever pure.</p></div><aside class="css-14jsv4e"><span></span></aside></div><div class="bottom-of-article"><div class="css-k8fkhk"><p>Sam Anderson is a staff writer for the magazine.</p> <p><i>Sign up for </i><i>our newsletter</i><i> to get the best of The New York Times Magazine delivered to your inbox every week.</i></p></div><div class="css-3glrhn">A version of this article appears in print on , on Page 11 of the Sunday Magazine with the headline: From Keith Gessen’s ‘A Terrible Country’<span>. Order Reprints | Today’s Paper | Subscribe</span></div></div><span></span></article></div>
The quote that should be highlighted:
"Sentence structures are not simply sentence structures, of course — they are miniature philosophies"
The regex matches were failing because of HTML entities. Some of the texts passed to dangerouslySetInnerHTML used entity references. In the failing example above, the quote includes a "—" character that is encoded in the HTML as an entity reference (such as &mdash;), so the literal character from the quote never appears in the markup.
To get rid of the HTML entities I used the "he" library (https://github.com/mathiasbynens/he), a robust HTML entity encoder/decoder written in JavaScript:
var contentDecoded = he.decode(this.props.content);
var highlightedHtml = contentDecoded.replace(
  regex,
  '<span class="annotator-hl">$1</span>'
);
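The same decode-then-match idea can be sketched in Python with the standard library. This is an illustration rather than a port of the component: the function name highlight and the sample markup are made up, and the quotes are additionally passed through re.escape, which the code above does not do, so that characters like "(" or "." in a quote are matched literally.

```python
import html
import re

QUOTE_CHARS = '“”"’()'  # same characters the component strips from quote ends


def highlight(content: str, quotes: list[str], css_class: str = "hl") -> str:
    """Decode HTML entities, then wrap every quote found in a span."""
    # Caveat: decoding the whole document also turns &lt; into <,
    # which can change the markup of pages that show escaped code.
    decoded = html.unescape(content)
    cleaned = [q.strip(QUOTE_CHARS) for q in quotes]
    cleaned = [q for q in cleaned if q]
    if not cleaned:
        return decoded
    pattern = re.compile("(" + "|".join(re.escape(q) for q in cleaned) + ")")
    return pattern.sub(rf'<span class="{css_class}">\1</span>', decoded)


# Hypothetical markup in which the dash is stored as an entity reference.
content = "<p>of course &mdash; they are miniature philosophies.</p>"
quotes = ["“of course — they are miniature philosophies”"]

print(highlight(content, quotes))
```

Without the html.unescape step the pattern would look for a literal "—" in a string that only contains "&mdash;" and would find nothing.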

Using Regex in Pig in hadoop

I have a CSV file of tweets with the fields (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweetid, I do not get the desired output:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
I can't comment yet, but from looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV:
" (straight quote) in the CSV
” (curly quote) in the regex code
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'^([0-9]+),".*',1)) AS (tweetid:long);
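To see why the quote character matters, here is a small Python sketch that splits one of the sample lines into its three fields. It is not Pig code, and the pattern is an assumption about the file layout (numeric id, message in straight double quotes, user id without commas); the same pattern written with curly quotes finds nothing.

```python
import re

# Straight quotes, as they appear in the CSV.
LINE_PATTERN = re.compile(r'(\d+),"(.*)",([^,]+)')
# Same pattern with the curly quote from the original regex.
CURLY_PATTERN = re.compile(r'(\d+),”(.*)”,([^,]+)')


def parse_tweet_line(line: str):
    """Return (tweetid, msg, userid), or None if the line does not fit."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None
    tweet_id, message, user_id = match.groups()
    return int(tweet_id), message, user_id


line = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'
row = parse_tweet_line(line)
curly_match = CURLY_PATTERN.fullmatch(line)
print(row)
print(curly_match)
```

The greedy (.*) in the middle backtracks to the last `",` in the line, so commas and quotes inside the tweet text stay in the message field.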

Parameterize pattern match as function argument in R

I have a directory with about 12k csv files, named in the format YYYY-MM-DD<TICK>.csv. <TICK> is the ticker of a stock, e.g. MSFT, GS, QQQ etc. There are 500 tickers in total, of various lengths.
My aim is to merge all the csv files for a particular ticker and save the result as a zoo object in an individual RData file in a separate directory.
To automate this I've set up the csv manipulation as a function which takes a ticker as input and does all the data modification. But I'm stuck at the file listing stage: I can't make the pattern passed to list.files depend on the ticker being processed.
Below is the function I've tried, which doesn't work:
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern=paste("'.*?",symbol,".csv'",sep=""),full.names=T)
}
This works with a hard-coded ticker, but I can't make it work with the function argument:
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern='.*?ibm.csv',full.names=T)
}
I searched SO and found similar questions, but none that exactly meets my requirement. If I missed something, please point me in the right direction. I'm still fighting with regex.
OS: Win8 64bit, R version-3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
  list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
  list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
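The pattern construction can also be sketched in Python on a plain list of file names, which makes it easy to check without touching the disk. The function name files_for_ticker is made up; unlike the R code above, this version escapes the ticker and the dot and requires the whole name to match, which is a slightly stricter assumption about the naming scheme.

```python
import re


def files_for_ticker(filenames, symbol):
    """Return the names that look like YYYY-MM-DD<symbol>.csv."""
    pattern = re.compile(
        r"\d{4}-\d{2}-\d{2}" + re.escape(symbol) + r"\.csv"
    )
    # fullmatch anchors the pattern at both ends of the file name.
    return [name for name in filenames if pattern.fullmatch(name)]


names = [
    "2001-05-17MSFT.csv",
    "2005-05-18GS.csv",
    "2002-12-19QQQ.csv",
    "2008-01-25QQQ.csv",
]
print(files_for_ticker(names, "QQQ"))
print(files_for_ticker(names, "GS"))
```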

Find all paragraphs of text that are related to a topic

Given a set of words ["college", "sports", "coding"], and a set of paragraphs of text (i.e. facebook posts), how can I see for each word the paragraphs that are related to that topic?
So for college, how can I find all the paragraphs of text that may be about the topic college?
I'm new to natural language processing, and not very advanced at regex. Clues about how to get started, what the right terms to google, etc are appreciated.
One basic idea would be to iterate over your posts and see whether any post matches any of the topics.
Let's say we have the following posts:
Post 1:
Dadadad adada college fgdssfgoksh jkhsfdkjshdkj sports hfjkshgkjshgjhsdgjkhskjgfs.
Post 2:
Sports dadadad adada fgdssfgoksh jkhsfdkjshdkj hfjkshgkjshgjhsdgjkhskjgfs.
Post 3:
Coding adskjdsflkshdflksjlg lsdjk hsjdkh kdsafkj asfjkhsa coding fhksajhdf kjhskfhsfd ssdggsd.
and the following topics:
["college", "sports", "coding"]
The regex could be: (topicName)+
E.g.: (college)+ or (sports)+ or (coding)+
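Small sketch in JavaScript (topics and posts are the arrays described above):
for (const topicName of topics) {
  for (const post of posts) {
    // The 'i' flag makes the match case-insensitive, so "Sports" also matches "sports".
    const customRegex = new RegExp('(' + topicName + ')+', 'i');
    if (customRegex.test(post)) {
      // post matches topicName
    } else {
      // post doesn't match topicName
    }
  }
}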
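The same loop, as a minimal Python sketch that returns, for each topic, the indexes of the posts mentioning it. The function name posts_by_topic is hypothetical, the posts are shortened versions of the examples above, and the word boundaries and case-insensitive flag are additions so that "Sports" matches "sports" but a longer word containing the topic does not.

```python
import re


def posts_by_topic(posts, topics):
    """Map each topic to the indexes of the posts that contain it as a word."""
    result = {}
    for topic in topics:
        pattern = re.compile(r"\b" + re.escape(topic) + r"\b", re.IGNORECASE)
        result[topic] = [
            index for index, post in enumerate(posts) if pattern.search(post)
        ]
    return result


posts = [
    "Dadadad adada college fgdssfgoksh sports hfjksh.",
    "Sports dadadad adada fgdssfgoksh.",
    "Coding adskjd lsdjk coding fhksajhdf.",
]
topics = ["college", "sports", "coding"]

matches = posts_by_topic(posts, topics)
print(matches)
```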
Hope it could give you a starting point.
Exact string matching won't take you far, especially with small fragments of text. I suggest using semantic similarity for this. A simple web search will turn up several implementations.