I have a text file containing URLs that I'd like to replace with <a> tags that open in a new tab. I'm converting the .txt file to a .md file and want clickable links.
I have shown below (1) an MWE, (2) the desired output, and (3) my initial attempt at a function (I assume a combination of gsub and sprintf will achieve this):
MWE:
x <- c("content here: http://stackoverflow.com/",
"still more",
"http://www.talkstats.com/ but also http://www.r-bloggers.com/",
"http://htmlpreview.github.io/?https://github.com/h5bp/html5-boilerplate/blob/master/404.html"
)
**Desired output:**
> x
[1] "content here: http://stackoverflow.com/"
[2] "still more"
[3] "http://www.talkstats.com/ but also http://www.r-bloggers.com/"
[4] "http://htmlpreview.github.io/?https://github.com/h5bp/html5-boilerplate/blob/master/404.html"
Initial attempt to solve:
repl <- function(x) sprintf('<a href="%s" target="_blank">%s</a>', x, x)
gsub("http.", repl(), x)
One corner case of using "http.\\s" as the regex: the URL may not be followed by a space, as in x[3], and the URL may contain "http" twice but should only be parsed once (as seen in x[4]).
PLEASE NOTE THAT R's REGEX IS SPECIFIC TO R;
ANSWERS FROM OTHER LANGUAGES ARE NOT LIKELY TO WORK
This works with your sample x, and using your repl method:
gsub("(http://[^ ]*)", repl('\\1'), x)
or without your repl method:
gsub("(http://[^ ]*)", '\\1', x)
Is it possible to download all zip files from a webpage without specifying the individual links one at a time?
I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.
I am using Windows 8.1, R 3.1.1. I do not have wget on the PC, so I can't use a recursive call.
Alternative:
As a workaround I have tried downloading the webpage text itself. I would then like to extract the name of each zip file, which I can then pass to download.file in a loop. However, I am struggling with extracting the names.
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
temp <- tempfile()
download.file(pth,temp)
dat <- readLines(temp)
unlink(temp)
g <- dat[grepl("accounts_monthly", tolower(dat))]
g contains character strings with the file names, amongst other characters.
g
[1] " <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>"
[2] " <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
I would like to extract the names of the files, Accounts_Monthly_Data-September2013.zip and so on, but my regex is quite terrible (see for yourself):
gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)
data
g <- c(" <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>",
" <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
)
Use the XML package:
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)
"//a[contains(text(),'Accounts_Monthly_Data')]" is an XPATH expression. It instructs the XML package to select all nodes that are anchors( a ) containing text "Accounts_Monthly_Data". This results is a list of nodes. The fun = xmlAttrs argument then tells the XML package to pass these nodes to the xmlAttrs function. This function strips the attributes from xml nodes. The anchor only have one attribute in this case the href which is what we are looking for.
I've a directory with csv files, about 12k in number, with the naming format YYYY-MM-DD<TICK>.csv. The <TICK> refers to the ticker of a stock, e.g. MSFT, GS, QQQ etc. There are 500 tickers in total, of various lengths.
My aim is to merge all the csvs for a particular ticker and save them as a zoo object in an individual RData file in a separate directory.
To automate this I've managed to do the csv manipulation, set up as a function which takes a ticker as input and does all the data modification. But I'm stuck at the file-listing stage: I'm unable to make the pattern that list.files matches depend on the ticker being processed.
Below is the function I've tried to make work, but it doesn't:
csvlist2zoo <- function(symbol){
  csvlist = list.files(path = "D:/dataset/", pattern = paste("'.*?", symbol, ".csv'", sep = ""), full.names = T)
}
This works, but I can't make it work in a function:
csvlist2zoo <- function(symbol){
  csvlist = list.files(path = "D:/dataset/", pattern = '.*?ibm.csv', full.names = T)
}
I searched SO; there are similar questions, but none exactly meets my requirement. If I missed something, please point me in the right direction. Still fighting with regex.
OS: Win8 64bit, R version 3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
  list.files(pattern = paste0('\\d{4}-\\d{2}-\\d{2}', symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
  list.files(pattern = paste('\\d{4}-\\d{2}-\\d{2}', symbol, ".csv", sep = ""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
In R I load an environment from a file that contains various time series plus one configuration object/vector.
I want to process all the time series in the environment in a loop, but I want to exclude the configuration object.
At the moment my code is like this:
for(x in ls(myEnv)) {
if(x!="configData") {
# do something, e. g.
View(myEnv[[x]], x)
}
}
Is there a way to use the pattern parameter of the ls function to omit the if clause?
for(x in ls(myEnv, pattern="magic regex picks all but *configData*")) {
# do something, e. g.
View(myEnv[[x]], x)
}
All the examples I could find for pattern were based on a whitelist approach (a positive list), but I'd like to get everything except configData.
Is this possible?
Thanks.
for (x in setdiff(ls(myEnv), "configData"))
and
for(x in grep("configData", ls(myEnv), value=TRUE, invert=TRUE))
both work fine, thanks.
BTW, cool! I wasn't aware of hiding it by using a leading "." ... so the best solution for me is to make sure that configData becomes .configData in the source file so that ls() won't show it.
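A minimal sketch of that leading-dot trick (the object names here are illustrative):
myEnv <- new.env()
assign(".configData", list(cfg = 1), envir = myEnv)  # leading dot hides it
assign("ts1", 1:10, envir = myEnv)
ls(myEnv)                    # "ts1" -- .configData is omitted
ls(myEnv, all.names = TRUE)  # ".configData" "ts1"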
I'll have two strings of the form:
"Initestimate" or "L#estimate", with # being a 1- or 2-digit number
"Nameestimate", with Name being the name of the actual symbol. In the example below, the name of our symbol is "6JU4".
And I have a matrix containing, among other things, columns named "InitSymb" and "L#Symb". I want to return the name of the column whose first row holds the substring that precedes "estimate".
I'm using stringr. Right now I have it coded with a bunch of calls to str_sub, but it's really sloppy and I wanted to clean it up and do it right.
example code:
> examplemat <- matrix(c("RYU4","6JU4","6EU4",1,2,3),ncol=6)
> colnames(examplemat) <- c("InitSymb","L1Symb","L2Symb","RYU4estimate","6JU4estimate","6EU4estimate")
> examplemat
InitSymb L1Symb L2Symb RYU4estimate 6JU4estimate 6EU4estimate
[1,] "RYU4" "6JU4" "6EU4" "1" "2" "3"
> searchStr <- "L1estimate"
So, with answer being the answer I'm looking for, I want to be able to input examplemat[, answer] to extract the data column (in this case, "2").
I don't really know how to do regex, but I think the answer looks something like:
examplemat[,paste0(**some regex function**("[(Init)|(L[:digit:]+)]",searchStr),"estimate")]
What function goes there, and is my regex code right?
Maybe you can try:
library(stringr)
Extr <- str_extract(searchStr, '^[A-Za-z]\\d+')
Extr
[1] "L1"
#If the searchStr is `Initestimate`
#Extr <- str_extract(searchStr, '^[A-Za-z]{4}')
pat1 <- paste0("(?<=",Extr,").*")
indx1 <- examplemat[, str_detect(colnames(examplemat), perl(pat1))]
pat2 <- paste0("(?<=",indx1,").*")
examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#6JU4estimate
# "2"
#For searchStr using Initestimate;
#examplemat[,str_detect(colnames(examplemat), perl(pat2))]
#RYU4estimate
# "1"
The question is a bit confusing, so I am not quite sure whether my interpretation is correct.
First, you extract the part of the string "coolSymb" that precedes "Symb".
Second, you detect whether a column name contains "cool" and return its location (column index) with which().
Finally, you extract the value using simple matrix indexing.
library(stringr)
a = str_split("coolSymb", "Symb")[[1]][1]
b = which(str_detect(colnames(examplemat), a))
examplemat[1, b]
Hope this helps,
won782's use of str_split inspired me to find an answer that works, although I still want to know how to do this by matching the prefix instead of excluding the suffix, so I'll accept an answer that does that.
Here's the step-by-step
> str_split("L1estimate","estimate")[[1]][1]
[1] "L1"
(for bonus points, replace the above step with one that gets {L1} by matching the prefix instead of getting {not estimate})
> paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")
[1] "L1Symb"
> examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")]
L1Symb
[1,] "6JU4"
> paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")
[1] "6JU4estimate"
> examplemat[,paste0(examplemat[1,paste0(str_split("L1estimate","estimate")[[1]][1],"Symb")],"estimate")]
6JU4estimate
[1,] "2"
I have some text documents which contain:
Different types of email addresses: I mean public domains such as gmail, yahoo, etc., and private emails as well, such as abc@mycompany.org...
Different hyperlinks such as abc.com, http://abc.com, www.abc.org, ...
So I wish to know whether I can write a single regex command to remove all such entries from my documents for further processing; if yes, please share some links, documents, or anything useful. I wish to remove any sort of email id or hyperlink from the documents using a regex, and I'll be implementing the code in R. Since I'm a newbie in this area, any detailed explanation will be highly appreciated.
So, if I give input as:
"abc#mycompany.org aasd234bc.com to be retained http://abc.com
www.abc.org org com .com comm in sahgo234#flkja23.in"
Then I should get output as:
"to be retained org com comm in"
You can try something like this:
x <- c("abc#mycompany.org", "abc.com", "http://abc.com", "www.abc.org")
gsub("(#.+$|\\..{1,3}$|(^http://)?(w{3}\\.)?)", "", x, perl=T)
If I understand your question better, and it is the first email address that you need to remove:
gsub("(^\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)", "", x, perl=T)
otherwise:
gsub("(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)", "", x, perl=T)
HTH
I wouldn't call this truly regex and it's likely slower but...
x <- c("abc#mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234#flkja23.in")
y <- unlist(strsplit(x, "\\s+"))
paste(y[!grepl("#|\\.com|\\.org|www\\.|\\.org|\\.in", y)], collapse=" ")
## [1] "to be retained org com comm in"
EDIT: For a multi-row vector wrap it up as a function and lapply it...
x <- c("abc#mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234#flkja23.in",
"abc#mycompany.org aasd234bc.com to be retained abc.com www.abc.org org com .com comm in sahgo234#flkja23.in")
FUN <- function(x) {
y <- unlist(strsplit(x, "\\s+"))
paste(y[!grepl("#|\\.com|\\.org|www\\.|\\.org|\\.in", y)], collapse=" ")
}
unlist(lapply(x, FUN))
## > unlist(lapply(x, FUN))
## [1] "to be retained org com comm in" "to be retained org com comm in"