Is it possible to download all zip files from a webpage without specifying the individual links one at a time?
I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.
I am using Windows 8.1 and R 3.1.1. I do not have wget on the PC, so I can't use a recursive call.
Alternative:
As a workaround I have tried downloading the webpage text itself. I would then like to extract the name of each zip file, which I can then pass to download.file in a loop. However, I am struggling with extracting the names.
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
temp <- tempfile()
download.file(pth,temp)
dat <- readLines(temp)
unlink(temp)
g <- dat[grepl("accounts_monthly", tolower(dat))]
g contains character strings with the file names, amongst other characters.
g
[1] " <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>"
[2] " <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
I would like to extract the names of the files (Accounts_Monthly_Data-September2013.zip and so on), but my regex is quite terrible (see for yourself):
gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)
data
g <- c(" <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>",
" <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
)
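Before reaching for an HTML parser, a minimal regex sketch that pulls the names straight out of g (using regexpr/regmatches rather than gsub):
# extract the first substring in each element that looks like a monthly zip name
regmatches(g, regexpr("Accounts_Monthly_Data-\\w+\\.zip", g))
# [1] "Accounts_Monthly_Data-September2013.zip" "Accounts_Monthly_Data-October2013.zip"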
Use the XML package:
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)
"//a[contains(text(),'Accounts_Monthly_Data')]" is an XPATH expression. It instructs the XML package to select all nodes that are anchors( a ) containing text "Accounts_Monthly_Data". This results is a list of nodes. The fun = xmlAttrs argument then tells the XML package to pass these nodes to the xmlAttrs function. This function strips the attributes from xml nodes. The anchor only have one attribute in this case the href which is what we are looking for.
Related
How can I modify a text file with R at a given position?
It's not a table but an arbitrary file, for example an XML file.
For example, to replace the content of line 122, columns 7 to 10, with the content of a variable, say "3.14", and update the file.
Imagine that line was
<name=0.32>
now should be
<name=3.14>
Or another option, maybe easier.
Look for all occurrences of "Variable=" and change the next 4 characters.
You would have to read in the entire file and then use string manipulation. E.g.,
f <- "file.xml"
x <- readLines(f)
# replace columns 7 to 10: keep columns 1-6, insert the new value,
# then keep everything from column 11 onwards
x[122] <- paste0(substring(x[122], 1, 6), "3.14", substring(x[122], 11))
writeLines(x, f)
I assume there is a command-line tool that would be better suited to this.
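For the second option (changing the 4 characters after every marker), a minimal sketch using gsub with a lookbehind, assuming the marker is literally "name=" and the value is always 4 characters wide:
x <- readLines(f)
# replace the 4 characters that follow every "name=" with "3.14"
x <- gsub("(?<=name=).{4}", "3.14", x, perl = TRUE)
writeLines(x, f)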
I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE)
Naturally, the imported data frame doesn't look good:
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
unique = TRUE,allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is whether there is a way to neatly force the values from the first row into the column names. As I'm processing a lot of data, I'm looking for a solution that is easily reproducible. I can tolerate substantial changes to the actual strings to get syntactically correct names, but ideally I would avoid faffing around with elaborate regular expressions, as I'm often reading files like the one linked here and don't want to be forced to adjust the rules for each single import.
It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument, but a more general way of dealing with this using read.xls is the pattern and header arguments, which force the first line that matches the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
pattern="DATAZONE", header=TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data can be OS dependent. I'm using Windows 8.1. Also, I'm using Strawberry Perl; you appear to be using something else. So there are some possible reasons for the discrepancy in warning messages, but nothing more specific.
On the second question in your comment: to read the entire file and convert a particular row (in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
header=FALSE, stringsAsFactors=FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
unique = TRUE,allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.
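A sketch of that conversion using type.convert (from utils), which guesses each column's type; as.is = TRUE keeps non-numeric columns as character rather than turning them into factors:
# convert every column of the data frame to its apparent type
simd.sg.xls[] <- lapply(simd.sg.xls, type.convert, as.is = TRUE)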
I have a directory with csv files, about 12k in number, with the naming format
YYYY-MM-DD<TICK>.csv
The <TICK> refers to the ticker of a stock, e.g. MSFT, GS, QQQ etc. There are 500 tickers in total, of various lengths.
My aim is to merge all the csvs for a particular ticker and save them as a zoo object in an individual RData file in a separate directory.
To automate this I've managed to do the csv manipulation, set up as a function which takes a ticker as input and does all the data modification. But I'm stuck at the file-listing stage: I'm unable to make the pattern to be matched depend on the ticker being processed.
Below is the function I've tried to make work; it doesn't work:
csvlist2zoo <- function(symbol){
csvlist=list.files(path = "D:/dataset/",pattern=paste("'.*?",symbol,".csv'",sep=""),full.names=T)
}
This works (with the ticker hardcoded), but I can't make it work inside the function:
csvlist2zoo <- function(symbol){
  csvlist <- list.files(path = "D:/dataset/", pattern = '.*?ibm.csv', full.names = TRUE)
}
I searched SO; there are similar questions, but none exactly meets my requirement. If I've missed something, please point me in the right direction. Still fighting with regex.
OS: Win8 64bit, R version 3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (linux)
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}',symbol, ".csv", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
I have a text file containing urls that I'd like to replace with HTML anchor (a) tags that open in a new tab. I'm converting the .txt file to a .md file and want clickable links.
I have shown below (1) a MWE, (2) the desired output, and (3) my initial attempt to create a function (I assume this can be achieved with gsub and sprintf):
MWE:
x <- c("content here: http://stackoverflow.com/",
"still more",
"http://www.talkstats.com/ but also http://www.r-bloggers.com/",
"http://htmlpreview.github.io/?https://github.com/h5bp/html5-boilerplate/blob/master/404.html"
)
Desired output:
> x
[1] "content here: <a href=\"http://stackoverflow.com/\" target=\"_blank\">http://stackoverflow.com/</a>"
[2] "still more"
[3] "<a href=\"http://www.talkstats.com/\" target=\"_blank\">http://www.talkstats.com/</a> but also <a href=\"http://www.r-bloggers.com/\" target=\"_blank\">http://www.r-bloggers.com/</a>"
[4] "<a href=\"http://htmlpreview.github.io/?https://github.com/h5bp/html5-boilerplate/blob/master/404.html\" target=\"_blank\">http://htmlpreview.github.io/?https://github.com/h5bp/html5-boilerplate/blob/master/404.html</a>"
Initial attempt to solve:
repl <- function(x) sprintf('<a href="%s" target="_blank">%s</a>', x, x)
gsub("http.", repl(), x)
One corner case for using "http.\\s" as the regex is that the string may not end in a space, as in x[3], or the url contains a second "http", which should only be parsed once (as seen in x[4]).
PLEASE NOTE THAT R's REGEX IS SPECIFIC TO R;
ANSWERS FROM OTHER LANGUAGES ARE NOT LIKELY TO WORK
This works with your sample x, and using your repl method:
gsub("(http://[^ ]*)", repl('\\1'), x)
or without your repl method:
gsub("(http://[^ ]*)", '\\1', x)
I have lots of files in a directory dir, with the following format
[xyz][sequence of numbers or letters]_[single number]_[more stuff that does not matter].csv
for example, xyz39289_3_8932jda.csv
I would like to write a function that returns all the first portions of all the file names in that directory. By first portion, I mean the [xyz][sequence of numbers] portion. So, in the example above, this would include the xyz39289. As such, the function would ultimately return a list such as
[xyz39289, xyz9382, xyz03319927, etc]
How can I do this in R? In Java, I would do the following:
File[] files = new File(dir).listFiles();
ArrayList<String> output = new ArrayList<String>();
for(int i = 0; i < files.length; i++) {
output.add(files[i].getName().substring(0, files[i].getName().indexOf("_")));
}
Might be easiest to delete everything after the first _.
sub("_.*$", "", files)
After you get your list of files with list.files (and possibly extract just the files you want that begin with xyz), I'd use sub.
files <- list.files(dir)
files <- files[grep("^xyz",files, perl = TRUE)]
filepart <- sub("^(xyz[^_]*)_.*$","\\1",files, perl = TRUE)
There's also a regexpr method that I'm not too certain about. Something like:
files <- list.files(dir)
matchdat <- regexpr("^xyz.*?(?=_)",files, perl = TRUE)
filepart <- regmatches(files, matchdat)
Here's another version. List all files:
myfiles <- list.files(path="./dir")
Split each file name on "_" and keep the first part:
myfiles.pre <- sapply(myfiles, function(x) strsplit(x,"_",fixed=T)[[1]][1])
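For example, with a couple of hypothetical file names in place of the directory listing:
myfiles <- c("xyz39289_3_8932jda.csv", "xyz9382_1_abc.csv")  # hypothetical names
sapply(myfiles, function(x) strsplit(x, "_", fixed = TRUE)[[1]][1])
# xyz39289_3_8932jda.csv      xyz9382_1_abc.csv
#             "xyz39289"              "xyz9382"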