Matching complex URLs within text blocks (R) - regex

I want to use the regex by John Gruber (http://daringfireball.net/2010/07/improved_regex_for_matching_urls) to match complex URLs in text blocks. The regex is quite complex (as is the task; see regex to find url in a text).
My problem is that I can't get it to work in R:
x <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"(Something like http://foo.com/blah_blah)",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/more_(than)_one_(parens)",
"(Something like http://foo.com/blah_blah_(wikipedia))",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://foo.com/blah_blah.",
"http://foo.com/blah_blah/.",
"<http://foo.com/blah_blah>",
"<http://foo.com/blah_blah/>",
"http://foo.com/blah_blah,",
"http://www.extinguishedscholar.com/wpglob/?p=364.",
"http://✪df.ws/1234",
"rdar://1234",
"rdar:/1234",
"x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
"message://%3c330e7f840905021726r6a4ba78dkf1fd71420c1bf6ff#mail.gmail.com%3e",
"http://➡.ws/䨹",
"www.c.ws/䨹",
"<tag>http://example.com</tag>",
"Just a www.example.com link.",
"http://example.com/something?with,commas,in,url, but not at end",
"What about <mailto:gruber#daringfireball.net?subject=TEST> (including brokets).",
"mailto:name#example.com",
"bit.ly/foo",
"“is.gd/foo/”",
"WWW.EXAMPLE.COM",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
"http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:#field(NUMBER+#band(thc+5a46634))")
t <- regexec("\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'".,<>?«»“”‘’]))", x)
regmatches(x,t)
I appreciate your help.

I ended up using gregexpr, as it supports perl=TRUE. After adapting the regex for R, I came up with the following solution (using the data above).
findURL <- function(x){
  # Gruber's URL pattern, adapted for R; (?xi) turns on free-spacing and case-insensitive matching
  t <- gregexpr("(?xi)\\b(
    (?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)
    (?:[^\\s\\(\\)<>]+|\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\))+
    (?:\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\)|[^\\s`!\\(\\)\\[\\]{};:'\\\"\\.,<>\\?«»“”‘’])
    )", x, perl = TRUE, fixed = FALSE)
  regmatches(x, t)
}
# Find URLs
urls <- findURL(x)
# Count URLs
count.urls.temp <- lapply(urls, length)
count.urls <- sum(unlist(count.urls.temp))
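A small follow-up (plain base R, nothing specific to the regex): lengths() gives the number of matches per input element directly, and unlist() flattens everything into a single character vector, so the counting step can also be written as:
# flatten all matches into one character vector
all.urls <- unlist(urls)
# lengths() (base R >= 3.2.0) counts the matches per input element
count.urls <- sum(lengths(urls))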
I hope this is helpful for others.

Related

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm-package (0.6-2) and would like to subset my corpus according to partial string matches contained within the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context. However, I haven't managed to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).
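For the "US" case, that translates to something along these lines (just a sketch; corpus stands for your actual corpus object, and fixed = TRUE makes grepl treat "US" as a literal string rather than a regular expression):
# document ids, as in the example above
ids <- sapply(meta(corpus, "id"), paste0)
# keep only documents whose id contains the literal string "US"
US <- corpus[grepl("US", ids, fixed = TRUE)]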

How to match a specific string using regular expressions in R

I am trying to extract some financial data using regular expressions in R.
I have used a RegEx tester, http://regexr.com/, to make a regular expression that SHOULD capture the information I need - the problem is just that it doesn't...
I have extracted data from this URL: http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components
I want to match the company names (DANSKE.CO, DSV.CO etc.) and I have created following regular expression which matches it on regexr.com:
.q\?s=(\S*\\)
But it doesn't work in R. Can someone help me figure out how to go about this?
Instead of messing around with regular expressions I would use XPath for something like fetching HTML content:
library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
# [1] "CARL-B.CO" "CHR.CO" "COLO-B.CO" "DANSKE.CO" "DSV.CO"
# [6] "FLS.CO" "GEN.CO" "GN.CO" "ISS.CO" "JYSK.CO"
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO" "NOVO-B.CO" "NZYM-B.CO"
# [16] "PNDORA.CO" "TDC.CO" "TRYG.CO" "VWS.CO" "WDH.CO"
Does this help? If not, post back, and I'll provide another suggestion.
library(XML)
stocks <- c("AXP","BA","CAT","CSCO")
for (s in stocks) {
url <- paste0("http://finviz.com/quote.ashx?t=", s)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
# ASSIGN TO STOCK NAMED DFS
assign(s, readHTMLTable(tableNodes[[9]],
header= c("data1", "data2", "data3", "data4", "data5", "data6",
"data7", "data8", "data9", "data10", "data11", "data12")))
# ADD COLUMN TO IDENTIFY STOCK
df <- get(s)
df['stock'] <- s
assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:(ncol(stockdata) - 1))]
# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)

Extracting text after "?"

I have a string
x <- "Name of the Student? Michael Sneider"
I want to extract "Michael Sneider" out of it.
I have used:
str_extract_all(x,"[a-z]+")
str_extract_all(data,"\\?[a-z]+")
But can't extract the name.
I think this should help (str_locate needs the literal "?", e.g. via fixed(); a bare "?" is a regex quantifier and throws an error):
library(stringr)
substr(x, str_locate(x, fixed("?"))[, 1] + 1, nchar(x))
Try this:
sub('.*\\?(.*)','\\1',x)
x <- "Name of the Student? Michael Sneider"
sub(pattern = ".+?\\?" , x , replacement = '' )
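Both sub() variants keep the leading space of " Michael Sneider"; wrapping the call in trimws() (base R >= 3.2.0) strips it:
trimws(sub('.*\\?(.*)', '\\1', x))
# [1] "Michael Sneider"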
To take advantage of the loose wording of the question, we can go WAY overboard and use natural language processing to extract all names from the string:
library(openNLP)
library(NLP)
# you'll also have to install the models with the next line, if you haven't already
# install.packages('openNLPmodels.en', repos = 'http://datacube.wu.ac.at/', type = 'source')
s <- as.String(x) # convert x to NLP package's String object
# make annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
entity_annotator <- Maxent_Entity_Annotator()
# call sentence and word annotators
s_annotated <- annotate(s, list(sent_token_annotator, word_token_annotator))
# call entity annotator (which defaults to "person") and subset the string
s[entity_annotator(s, s_annotated)]
## Michael Sneider
Overkill? Probably. But interesting, and not actually all that hard to implement, really.
str_match is more helpful in this situation
str_match(x, ".*\\?\\s(.*)")[, 2]
#[1] "Michael Sneider"

Remove characters in text corpus

I'm analyzing a corpus of emails. Some emails contain URLs. When I apply the removePunctuation function from the tm library, I get "httpwww" and lose the information that it was a web address. What I would like to do is replace "://" with " " across the whole corpus. I tried gsub, but then the datatype of the corpus changes and I can't continue to process it with the tm package.
Here is an example:
As you can see, gsub changes the class of the corpus to an array of characters, causing tm_map to fail.
> corpus
# A corpus with 4257 text documents
> corpus1 <- gsub("http://","http ",corpus)
> class(corpus1)
# [1] "character"
> class(corpus)
# [1] "VCorpus" "Corpus" "list"
> cleanSW <- tm_map(corpus1,removeWords, stopwords("english"))
# Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
> cleanSW <- tm_map(corpus,removeWords, stopwords("english"))
> cleanSW
# A corpus with 4257 text documents
How can I work around this? Maybe there's a way to convert the character vector back into a corpus?
Found a solution to this problem here: Removing non-English text from Corpus in R using tm(). Corpus(VectorSource(dat1)) worked for me.
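Following the session above, the combined steps would look roughly like this (a sketch only; note that rebuilding the corpus via Corpus(VectorSource(...)) drops the document-level metadata, so re-attach it afterwards if you need it):
# replace "://" so removePunctuation no longer glues protocol and host together
corpus1 <- gsub("://", " ", corpus)        # this returns a character vector
corpus2 <- Corpus(VectorSource(corpus1))   # convert it back into a corpus
cleanSW <- tm_map(corpus2, removeWords, stopwords("english"))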

XSS concerns when using OpenCPU and knitr to print user-supplied data

I'm using openCPU and knitr to generate custom feedback after surveys. To this end, I basically let survey developers specify rmd files. In this use case, the survey developers are trusted, but the survey takers may not be.
I'm now thinking about XSS. It's not a big worry, since feedback will usually only be displayed to the user who entered the data in the first place, but characters like '<' will also be used for non-malicious reasons, and I'd like to think ahead and explore some of the trials and tribulations of freely mixing R with web apps.
Knitr and R in general were not made with untrusted users and XSS in mind. OpenCPU rectifies many security issues by running AppArmored R as an API, but I wonder whether a maximum-flexibility approach like mine can also be made safe.
Possible points at which one might separate trusted and untrusted markup:
Before knitting, i.e. I pass escaped user data to the rmd-file. Drawback: An oblivious survey dev might unescape it accidentally or because it's annoying in some context.
During knitting. This would be ideal, I guess, but I don't know if it's possible, especially if a survey dev could potentially alter chunk settings.
After knitting. I think it's impossible to separate trusted and untrusted markup post-hoc.
Some code to paste into OpenCPU's knitr app:
```{r}
good_userdata = 'I like brackets [].'
bad_userdata = 'some text should not be
[linked](javascript:location.href=\'http://example.com?secrets\';), <s>struck</s> or __bold__'
escape_html = highr:::escape_html
escape_md <- function(x){
x <- gsub('\\[', '\\\\[', x);
x <- gsub('_', '\\\\_', x);
x
}
good_userdata_escaped = escape_md(escape_html(good_userdata))
bad_userdata_escaped = escape_md(escape_html(bad_userdata))
```
## let's say the survey dev wants to print text like this
```{r}
cat(good_userdata_escaped)
cat(bad_userdata_escaped) # doesn't know about text like this
```
## gets annoyed, does
```{r}
good_userdata_escaped <- gsub('\\\\', '', good_userdata_escaped);
bad_userdata_escaped <- gsub('\\\\', '', bad_userdata_escaped);
```
## so that this looks nice
```{r}
cat(good_userdata_escaped)
```
## later renders the same text inline, so that it is evaluated as markdown
`r good_userdata_escaped # doesn't look dangerous`
`r bad_userdata_escaped`
Edit 2
Sorry, I had provided only some HTML tags, thinking possible XSS attacks were obvious. Michel Fortin had some examples on his page.
I'm not 100% sure I understand your concern. If you're worried about XSS, you're worried about users including a <script> tag or the like in the markdown, right?
```{r}
userdata = '<script>alert("I am evil")</script>'
```
```{r,results='asis'}
cat(userdata)
```
You can prevent this by escaping html characters. I think there's a section on this in the markdown definition. So you would need to escape all user input, either when inserting it in your DB or when extracting it:
escape <- function(x){
  x <- gsub("<", "&lt;", x);
  x <- gsub(">", "&gt;", x);
  x <- gsub("&", "&amp;", x);
  x
}
Try running the following:
```{r output}
escape <- function(x){
  x <- gsub("&", "&amp;", x);
  x <- gsub("<", "&lt;", x);
  x <- gsub(">", "&gt;", x);
  x
}
```
```{r}
userdata = escape('<script>alert("I am evil")</script>')
```
```{r,results='asis'}
cat(userdata)
```
That should take care of any code injection. I'm not quite sure how the __bold__ example is a concern, because afaics this cannot be used for an XSS attack as there is no scripting. But if you want to prevent users from messing with the layout too, then you should escape all markdown characters, I guess.
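For completeness, an extended version of the escape_md() from the question might look like this (a sketch only; exactly which characters need escaping is an assumption and depends on the markdown flavour you render with):
escape_md_all <- function(x){
  # backslash-escape characters that carry meaning in most markdown flavours
  gsub("([\\\\`*_{}\\[\\]()#+.!-])", "\\\\\\1", x, perl = TRUE)
}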