Finding a pattern and extracting strings - regex

I'm trying to scrape a website and I'm a newbie with regular expressions. I have a long character vector; this is the line I'm targeting:
<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span></h3>\n
I want to extract the number that sits between <span id=\"hitCount.top\"> and </span>, in this case 10,079. This is my approach so far, though it is not really working:
x <- '<h3 class=\"title4\">Results: <span id=\"hitCount.top\">10,079</span>'
m <- gregexpr(pattern="[<span id=\"hitCount.top\">].+[</span>]", x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regmatches(x, m)
Any help will be appreciated.
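The immediate problem is that square brackets define a character class: [<span id=\"hitCount.top\">] matches any single one of those characters, not the literal tag, and .+ then matches greedily. A minimal base-R fix, sketched with a PCRE lookbehind:
x <- '<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>'
# the lookbehind anchors right after the opening span tag; [^<]+ stops at </span>
m <- regexpr('(?<=<span id="hitCount.top">)[^<]+', x, perl = TRUE)
regmatches(x, m)
# [1] "10,079"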

Just to illustrate how easy it can be if you use the XML package:
> library("XML")
> url = "PATH_TO_HTML"
> parsed_doc = htmlParse(file=url, useInternalNodes = TRUE)
> h3title4 <- getNodeSet(doc = parsed_doc, path = "//h3[@class='title4']")
> plain_text <- sapply(h3title4, xmlValue)
> plain_text
[1] "Results: 10,079"
> sub("\\D*", "", plain_text)
[1] "10,079"
The sub("\\D*", "", plain_text) call removes the leading run of zero or more non-digit characters: \D* matches "Results: " and replaces it with an empty string.
The example HTML I used was
<html>
<body>
<h3 class="title4">Results: <span id="hitCount.top">10,079</span></h3>
<img width="10%" height="10%" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Green-Up-Arrow.svg/2000px-Green-Up-Arrow.svg.png"/>
</body>
</html>
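If the hit count is then needed as a number, strip the thousands separator before converting:
as.numeric(gsub(",", "", "10,079"))
# [1] 10079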

Using stringr library
> library(stringr)
> str_extract(x, "(?<=<span id=\"hitCount.top\">)(.*?)(?=</span>)")
[1] "10,079"
Using gsub (sub can also be used here instead of gsub)
> gsub(".*<span id=\"hitCount.top\">(.*?)</span>.*", "\\1", x)
[1] "10,079"

Related

Extracting numbers from a string including decimals and scientific notation

I have some strings that look like
x<-"p = 9.636e-05"
And I would like to extract just the number using gsub. So far I have
gsub("[[:alpha:]](?!-)|=|\\^2", "", x)
But that removes the 'e' from the scientific notation, giving me
" 9.636-05"
Which can't be converted to a number using as.numeric. I know that it would be possible to use a lookahead to match the "-", but I don't know exactly how to go about doing this.
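Incidentally, the lookahead in the question is already correct; it only needs perl = TRUE, because it is not interpreted as a lookahead by R's default regex engine:
# 'e' is followed by '-', so the lookahead protects it from removal
gsub("[[:alpha:]](?!-)|=|\\^2", "", x, perl = TRUE)
# [1] "  9.636e-05"
as.numeric(gsub("[[:alpha:]](?!-)|=|\\^2", "", x, perl = TRUE))
# [1] 9.636e-05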
You could try
sub('.* = ', '', x)
#[1] "9.636e-05"
You can also use the following to remove all non-digit characters at the start of the string:
sub('^\\D+', '', x)
Try
format(as.numeric(gsub("[^0-9e.-]", "", x)), scientific = FALSE)
# [1] "0.00009636"
Through the sub or regmatches functions:
> x<-"p = 9.636e-05"
> sub(".* ", "", x)
[1] "9.636e-05"
> regmatches(x, regexpr("\\S+$", x))
[1] "9.636e-05"
> library(stringi)
> stri_extract(x, regex="\\S+$")
[1] "9.636e-05"

regular expression in R for word of variable length between two characters

How do I extract the word wordofvariablelength from the string below?
<a href=\"http://www.adrive.com/browse/wordofvariablelength\" class=\"next-button\" id=\"explore-gutter\" data-linkid=\"huiazc\"> <strong class=\"text gutter-text \">
I was able to get the first part of the string using the code below, but is there a regular expression I can use to get only the word immediately after "browse/" and before the next \", which here is the word "wordofvariablelength"?
mystring = substr(mystring,nchar("<a href=\"http://www.thesaurus.com/browse/")+1,nchar("<a href=\"http://www.thesaurus.com/browse/")+20)
Note that the word wordofvariablelength could be of any length, so I cannot hardcode a start and end.
Through the regmatches function:
> x <- "<a href=\"http://www.adrive.com/browse/wordofvariablelength\" class=\"next-button\" id=\"explore-gutter\" data-linkid=\"huiazc\"> <strong class=\"text gutter-text \">"
> regmatches(x, regexpr('.*?"[^"]*/\\K[^/"]*(?=")', x, perl=TRUE))
[1] "wordofvariablelength"
OR
> regmatches(x, regexpr('[^/"]*(?="\\s+class=")', x, perl=TRUE))
[1] "wordofvariablelength"
OR
A much simpler one using gsub: the alternation removes everything up to the last /, then everything from the first remaining " onward.
> gsub('.*/|".*', "", x)
[1] "wordofvariablelength"
Try
sub('.*?\\.com/[^/]*\\/([a-z]+).*', '\\1', mystring)
#[1] "wordofvariablelength"
Or
library(stringr)
str_extract(mystring, '(?<=browse/)[A-Za-z]+') # stringr's ICU engine supports the lookbehind directly; the old perl() modifier is defunct
#[1] "wordofvariablelength"
data
mystring <- "<a href=\"http://www.adrive.com/browse/wordofvariablelength\" class=\"next-button\" id=\"explore-gutter\" data-linkid=\"huiazc\"> <strong class=\"text gutter-text \">"
You can use this regex (shown in JavaScript notation):
/browse\/(.*?)\\/g
Demo here: https://regex101.com/r/gX4dC0/1
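Translated to R (a sketch, reusing the x defined earlier in this thread; in the actual string the word is terminated by a double quote, the backslashes above are just string escaping):
regmatches(x, regexpr('(?<=browse/)[^"]+', x, perl = TRUE))
# [1] "wordofvariablelength"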
You can use the following regex: (?<=browse/).*?(?=\\").
The regex means: check that we are right after browse/, then take all the subsequent characters up to (but without consuming) the closing \".
Sample code:
mystr <- "<a href=\"http://www.adrive.com/browse/wordofvariablelength\" class=\"next-button\" id=\"explore-gutter\" data-linkid=\"huiazc\"> <strong class=\"text gutter-text \">"
regmatches(mystr, regexpr('(?<=browse/).*?(?=\\")', mystr, perl=T))
perl=T means we are using the Perl-compatible (PCRE) regex flavor, which allows the fixed-width lookbehind ((?<=browse/)).
Output:
[1] "wordofvariablelength"

Using XPath 1.0, how can I have more than one anonymous function operating on the extracted content?

With R, httr and XML you can scrape this site; the relevant HTML code is below.
doc <- htmlTreeParse("http://www.mehaffyweber.com/Firm/Offices/", useInternal = TRUE)
<div id="content">
<img id="printLogo" style="padding-bottom:30px" src="/images/logo_print.jpg">
<div id="contentTitle">
<div style="height: 30px;">
<h1>Offices</h1>
<h3>Beaumont Location:</h3>
<p>
<p>
<br>
<h3>
<strong>Houston Location:</strong>
</h3>
<p>
<p>
<h3>
<strong>Austin Location:</strong>
</h3>
To extract only the cities where this company has offices, this XPath 1.0 code works:
(string <- xpathSApply(doc, "//h3", function(x) {
gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))}))
I tried to paste the state to the city with a second anonymous function but failed:
> (string <- xpathSApply(doc, "//h3", function(x) {
+ gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE))} &&
+ function(x) {paste0(xmlValue(x), " , TX")}))
Error in { : invalid 'x' type in 'x && y'
So I made a simpler attempt without repeating function(x):
> (string <- xpathSApply(doc, "//h3", function(x) {
+ gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) &&
+ paste0(xmlValue(x), " , TX")}))
Error in gsub("Location:|\\W|Â", "", xmlValue(x, trim = TRUE)) && paste0(xmlValue(x), :
invalid 'x' type in 'x && y'
DESIRED OUTPUT: How might I combine both anonymous functions and create this string?
[1] "Beaumont, TX" "Houston, TX" "Austin, TX"
A couple of things. htmlParse is shorthand for htmlTreeParse(..., useInternalNodes = TRUE).
This document also has encoding issues, so fetching it with the RCurl package and an explicit encoding helps remove the strange characters you are encountering.
library(XML)
library(RCurl)
appHTML <- getURL("http://www.mehaffyweber.com/Firm/Offices/"
, .encoding = "UTF-8")
doc <- htmlParse(appHTML, encoding = "UTF-8")
xpathSApply is shorthand for two operations: it applies the XPath query to the doc to get the relevant nodes, then applies the user-supplied function to each of those nodes. The x passed to your function comes from the output of:
getNodeSet(doc, "//h3")
or in shorthand
doc["//h3"]
Each element of doc["//h3"] is an internal XML node:
> str(doc['//h3'])
List of 3
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
$ :Classes 'XMLInternalElementNode', 'XMLInternalNode', 'XMLAbstractNode' <externalptr>
- attr(*, "class")= chr "XMLNodeSet"
So the x in your function is just like an element of doc["//h3"], and you can experiment with doc['//h3'][[1]]:
x<- doc['//h3'][[1]]
temp <- gsub("\\WLocation:", "", xmlValue(x))
paste0(temp, ", TX")
[1] "Beaumont, TX"
Then you can apply this logic in your function:
xpathSApply(doc, "//h3", function(x){
temp <- gsub("\\WLocation:", "", xmlValue(x))
paste0(temp, ", TX")
})
[1] "Beaumont, TX" "Houston, TX" "Austin, TX"
If you're willing to use rvest and stringr it's a pretty simple solution:
library(rvest)
library(stringr)
pg <- read_html("http://www.mehaffyweber.com/Firm/Offices/") # read_html() replaces the now-defunct html()
found <- pg %>%
html_nodes("#content") %>%
html_text() %>%
str_match_all("([[:alpha:]]+), Texas")
sprintf("%s, TX", found[[1]][,2])
## [1] "Beaumont, TX" "Houston, TX" "Austin, TX"
You can use the following to get your desired result.
string <- xpathSApply(doc, '//h3', function(x) {
paste0(sub('^([A-Z][a-z]+).*', '\\1', xmlValue(x)), ', TX')
})
# [1] "Beaumont, TX" "Houston, TX" "Austin, TX"

Finding last character in R using regexpr function

I am having a problem with finding the last character in a string. I am trying to use the regexpr function to check whether the last character is a forward slash (/).
But unfortunately it does not work. Can anyone help me? Below is my code.
regexpr(pattern = ".$", text = "/home/rexamine/archivist2/ex///") != "/"
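Since the question asks about regexpr specifically: regexpr returns the match position, not the matched text, so extract the match with regmatches before comparing (a sketch):
text <- "/home/rexamine/archivist2/ex///"
regmatches(text, regexpr(".$", text)) == "/"
# [1] TRUE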
You can avoid regular expressions and use substr to do this; the last character runs from position nchar(x) to nchar(x):
> x <- '/home/rexamine/archivist2/ex///'
> substr(x, nchar(x), nchar(x)) == '/'
[1] TRUE
Or use str_sub from the stringr package:
> str_sub(x, -1) == '/'
[1] TRUE
You could use a simple grepl call:
> text = "/home/rexamine/archivist2/ex///"
> grepl("/$", text, perl=TRUE)
[1] TRUE
> text = "/home/rexamine/archivist2/ex"
> grepl("/$", text, perl=TRUE)
[1] FALSE
You can also use the pattern
^.*/$
which will fail to match whenever the last character is not /.
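Since R 3.3.0 there is also endsWith in base R, which skips regex entirely:
endsWith("/home/rexamine/archivist2/ex///", "/")
# [1] TRUE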

R regex matching for tweet pattern

I am trying to use the regex feature in R to parse some tweet text into its key words. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
However, one of my sentences contains the sequence "\ud83d\udc4b". The parsing fails for this sequence (the error is "invalid input in utf8towcs"). I would like to replace such sequences with "". I tried substituting the regex "\u+", but that did not match. What regex should I use to match this sequence? Thanks.
I think you want something like this,
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
A function to remove non-ASCII characters from a data frame, column by column, is below. Note that it expects a data frame, not a bare character vector like sentence.
RemoveNotASCII <- function(x) {
  # Remove non-ASCII characters column by column from a data frame
  n <- ncol(x)
  z <- list()
  for (j in 1:n) {
    y <- as.character(x[, j])
    if (is.character(y)) {
      Encoding(y) <- "latin1"
      y <- iconv(y, "latin1", "ASCII", sub = "")
    }
    z[[j]] <- y
  }
  z <- do.call("cbind.data.frame", z)
  names(z) <- names(x)
  z # data frame with non-ASCII characters removed
}
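A quick usage sketch on a made-up data frame (assumes UTF-8 input, which the Encoding() coercion above then treats byte-wise):
df <- data.frame(a = c("caf\u00e9", "na\u00efve"), stringsAsFactors = FALSE)
RemoveNotASCII(df)
#      a
# 1  caf
# 2 nave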
The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
tolower(rm_non_ascii(s))
## [1] "delta"
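If you only need to detect which elements are problematic before cleaning them, base R (>= 3.3.0) also has validUTF8:
validUTF8(c("Delta", "caf\xe9"))
# [1]  TRUE FALSE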