R - grab first portion of file name - regex

I have lots of files in a directory dir, with the following format
[xyz][sequence of numbers or letters]_[single number]_[more stuff that does not matter].csv
for example, xyz39289_3_8932jda.csv
I would like to write a function that returns the first portion of each file name in that directory. By first portion, I mean the [xyz][sequence of numbers] part; in the example above, that would be xyz39289. The function would ultimately return a list such as
[xyz39289, xyz9382, xyz03319927, etc]
How can I do this in R? In Java, I would do the following:
File[] files = new File(dir).listFiles();
ArrayList<String> output = new ArrayList<String>();
for(int i = 0; i < files.length; i++) {
    output.add(files[i].getName().substring(0, files[i].getName().indexOf("_")));
}

Might be easiest to delete everything after the first _.
sub("_.*$", "", files)

After you get your list of files with list.files (and possibly extracting just the files that begin with xyz), I'd use sub.
files <- list.files(dir)
files <- files[grep("^xyz",files, perl = TRUE)]
filepart <- sub("^(xyz[^_]*)_.*$","\\1",files, perl = TRUE)
There's also a regexpr approach that I'm less certain about. Something like:
files <- list.files(dir)
matchdat <- regexpr("^xyz.*?(?=_)",files, perl = TRUE)
filepart <- regmatches(files, matchdat)

Here's another version. List all files:
myfiles <- list.files(path="./dir")
Split each file name on "_" and keep the first part:
myfiles.pre <- sapply(myfiles, function(x) strsplit(x, "_", fixed=TRUE)[[1]][1])
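Since strsplit is already vectorized, a hedged alternative that avoids the anonymous function and pins the return type:
myfiles.pre <- vapply(strsplit(myfiles, "_", fixed=TRUE), `[[`, character(1), 1)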

Related

Grabbing parts of filename with python & boto3

I just started with Python and I am still a newbie. I want to create a function that grabs the parts of filenames corresponding to a certain pattern; these files are stored in an S3 bucket.
So in my case, let's say I have 5 .txt files
Transfarm_DAT_005995_20190911_0300.txt
Transfarm_SupplierDivision_058346_20190911_0234.txt
Transfarm_SupplierDivision_058346_20200702_0245.txt
Transfarm_SupplierDivision_058346_20200703_0242.txt
Transfarm_SupplierDivision_058346_20200704_0241.txt
I want the script to go through these filenames and grab the category string (e.g. "Transfarm_DAT") and the date (e.g. "20190911") that come before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Check out the split and join functions if your filenames are always like this. Otherwise, regex is another avenue.
files_list = ['Transfarm_DAT_005995_20190911_0300.txt', 'Transfarm_SupplierDivision_058346_20190911_0234.txt',
              'Transfarm_SupplierDivision_058346_20200702_0245.txt', 'Transfarm_SupplierDivision_058346_20200703_0242.txt',
              'Transfarm_SupplierDivision_058346_20200704_0241.txt']
category_list = []
date_list = []
for f in files_list:
    date = f.split('.')[0].split('_', 2)[2]
    category = '_'.join([f.split('.')[0].split('_')[0], f.split('.')[0].split('_')[1]])
    # print(category, date)
    category_list.append(category)
    date_list.append(date)
print(category_list, date_list)
Output lists:
['Transfarm_DAT', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision'] ['005995_20190911_0300', '058346_20190911_0234', '058346_20200702_0245', '058346_20200703_0242', '058346_20200704_0241']
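Since the answer mentions regex as another avenue, here is a minimal hedged sketch that captures only the category and the eight-digit date; the pattern is an assumption based on the five sample names above:
import re

# assumed shape: <word>_<word>_<digits>_<YYYYMMDD>_<digits>.txt
pattern = re.compile(r'^([A-Za-z]+_[A-Za-z]+)_\d+_(\d{8})_\d+\.txt$')
for f in files_list:
    m = pattern.match(f.strip())
    if m:
        category, date = m.groups()
        print(category, date)   # e.g. Transfarm_DAT 20190911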

Extract string from within a longer string | Ruby

I'm trying to extract a sentence from a long string using Ruby.
Example:
Hi There the file is in this directory: /Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml Thank you for your inquiry.
It should return: /Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml
The beginning always includes "/Volumes" and it has to end with ".xml".
Thank you
You're probably looking for this -
long_string = <<file
Hi There the file is in this directory: /Volumes/GAW-FS01/08_Video/02_Projects/Plus/S/metadata.xml Thank you for your inquiry.Hi There the file is in this directory: /Volumes/GAW-def/08_Video/02_Projects/Plus/S/metadata.xml Thank you for your inquiry.Hi There the file is in this directory: /Volumes/GAW-FS01/08_Video/02_Projects/hello/S/metadata.xml Thank you for your inquiry.
file
long_string.split.select{ |word| word.start_with?("/Volumes") && word.end_with?(".xml") }
It'll give you an array of paths which match your condition, like the following:
["/Volumes/GAW-FS01/08_Video/02_Projects/Plus/S/metadata.xml", "/Volumes/GAW-def/08_Video/02_Projects/Plus/S/metadata.xml", "/Volumes/GAW-FS01/08_Video/02_Projects/hello/S/metadata.xml"]
There are many ways to do this.
string = "Hi There the file is in this directory: /Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml Thank you for your inquiry."
Funky:
"Volume#{string[/Volume(.*?).xml/, 1]}.xml"
Funkier: find the indices of the start and end strings and use a range to return the desired string
first = "Volume"
last = ".xml"
string[(string.index(first)) .. (string.index(last) + last.length)]
=> "Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml "
Not so funky: split the string by the start and end points and return the desired string
"Volume#{string.split('Volume').last.split('.xml').first}.xml"
=> "Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml"
Lastly, the way you'd probably want to do this
string[/Volume(.*?)\.xml/, 0]
=> "Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml"
Notice that the first and last options use the same regex even though the int passed changes between 0 and 1: 0 returns the entire match, while 1 returns only what the first capture group matched.
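If the string may contain several such paths, a hedged variant using scan returns all of them (dot escaped, as above):
string.scan(/Volume.*?\.xml/)
=> ["Volumes/GAW-FS01/08_Video/02_Projects/ Plus/S/metadata.xml"]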

Automate process for merging CSV files in Python

I am trying to work with 12 different csv files that are stored in datafolder. I have a function that opens each file separately (openFile) and performs a specific calculation on the data. I then want to be able to apply a function to each file. The names of the files are all similar to this: UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv. The code below shows how I was planning to read the files into the openFile function:
import os

for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):
        abs_path = os.path.abspath(DATA_PATH + 'datafolder/' + file)
        print(abs_path)
        data = openFile(abs_path)
        data2 = someFunction(data)
I need to merge specific files that share the same two letters in the file name, so at the end I should have 6 files instead of 12. The files are not stored in datafolder in the order they need to be merged, since this will eventually be used for a larger number of files. The files all have the same header.
Am I able to supply a list of the two letters that are the key words in the file to use in regex? e.g.
list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
Any suggestions on how I can achieve this with or without regex?
You could walk through the file tree and then, based on the two letters in the file name, save each pair of files that needs to be merged.
fileList = {'AC': [], 'FO': [], 'CK': [], 'OR': [], 'RS': [], 'IK': []}
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):  # Ensure we are looking at a csv
        # Add the file to its correct bucket based on the letters in its name
        fileList[extractTwoLettersFromFileName(file)].append(file)

for twoLetters, files in fileList.items():
    mergeFiles(files)
I did not provide implementations for extracting the letters and merging the files, but from your question you seem to already have them.
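For completeness, a hedged sketch of that helper, assuming the UOG_001-AC_TOP-... naming from the question:
def extractTwoLettersFromFileName(file):
    # 'UOG_001-AC_TOP-...' -> 'AC_TOP' -> 'AC'
    return file.split('-')[1].split('_')[0]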
You can first do a simple substring check, and then based on that, classify the filenames into groups:
letters_list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
filename_dict = {}
for letters in letters_list:
    if letters in filename:
        filename_list = filename_dict.get(letters, list())
        filename_list.append(filename)
        filename_dict[letters] = filename_list
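A fuller sketch with the surrounding loop made explicit, assuming filenames holds the csv names; dict.setdefault condenses the get/append/store dance:
filename_dict = {}
for filename in filenames:
    for letters in letters_list:
        if letters in filename:
            filename_dict.setdefault(letters, []).append(filename)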
Here I use the Path object from pathlib to create a list of files whose names end with '.csv'. Then I use a function that examines each file's name for the presence of one of the strings you mentioned, using a regex, so that I can create a list of pairs of these strings with their associated filenames. Notice that the length of this list of pairs is 12, and that the filenames can be recovered from it.
Having made that list, I can use groupby from itertools to create two-element lists of files that share the strings in the file_kinds list. You can merge the items in these lists.
>>> import re
>>> from pathlib import Path
>>> from itertools import groupby
>>> from operator import itemgetter
>>> file_kinds = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
>>> def try_match(filename):
...     m = re.search('(%s)' % '|'.join(file_kinds), filename)
...     if m:
...         return m.group()
...     else:
...         return None
...
>>> all_files_list = [(item, try_match(item.name)) for item in list(Path(r'C:/scratch/datafolder').glob('*.csv')) if try_match(item.name)]
>>> len(all_files_list)
12
Expression for extracting full paths from all_files_list:
[str(_[0]) for _ in all_files_list]
>>> for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
...     print((kind, [str(_[0]) for _ in files_list]))
...
('AC', ['C:\\scratch\\datafolder\\UOG_001-AC_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-AC__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('CK', ['C:\\scratch\\datafolder\\UOG_001-CK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-CK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('FO', ['C:\\scratch\\datafolder\\UOG_001-FO_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-FO__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('IK', ['C:\\scratch\\datafolder\\UOG_001-IK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-IK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('OR', ['C:\\scratch\\datafolder\\UOG_001-OR_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-OR__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('RS', ['C:\\scratch\\datafolder\\UOG_001-RS_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-RS__B_TOP-Accelerometer-2017-07-22T112654.csv'])
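One caveat: itertools.groupby only groups consecutive items, so this relies on glob returning names in an order where matching kinds sit together. Sorting first makes that explicit:
>>> all_files_list.sort(key=itemgetter(1))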

Recursively download zip files from webpage (Windows)

Is it possible to download all zip files from a webpage without specifying the individual links one at a time?
I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.
I am using Windows 8.1, R 3.1.1. I do not have wget on the PC, so I can't use a recursive call.
Alternative:
As a workaround I have tried downloading the webpage text itself. I would then like to extract the name of each zip file, which I can then pass to download.file in a loop. However, I am struggling with extracting the name.
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
temp <- tempfile()
download.file(pth,temp)
dat <- readLines(temp)
unlink(temp)
g <- dat[grepl("accounts_monthly", tolower(dat))]
g contains character strings with the file names, amongst other characters.
g
[1] " <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>"
[2] " <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
I would like to extract the name of the files Accounts_Monthly_Data-September2013.zip and so on, but my regex is quite terrible (see for yourself)
gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)
Data:
g <- c(" <li>Accounts_Monthly_Data-September2013.zip (775Mb)</li>",
" <li>Accounts_Monthly_Data-October2013.zip (622Mb)</li>"
)
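As an aside, the file names can be pulled straight out of g with regmatches; a minimal sketch:
regmatches(g, regexpr("Accounts_Monthly_Data-\\w+\\.zip", g))
# [1] "Accounts_Monthly_Data-September2013.zip" "Accounts_Monthly_Data-October2013.zip"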
Use the XML package:
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)
"//a[contains(text(),'Accounts_Monthly_Data')]" is an XPATH expression. It instructs the XML package to select all nodes that are anchors( a ) containing text "Accounts_Monthly_Data". This results is a list of nodes. The fun = xmlAttrs argument then tells the XML package to pass these nodes to the xmlAttrs function. This function strips the attributes from xml nodes. The anchor only have one attribute in this case the href which is what we are looking for.

Parameterize pattern match as function argument in R

I've a directory with csv files, about 12k in number, with the naming format being YYYY-MM-DD<TICK>.csv. The <TICK> refers to the ticker of a stock, e.g. MSFT, GS, QQQ etc. There are 500 tickers in total, of various lengths.
My aim is to merge all the csv files for a particular ticker and save them as a zoo object in an individual RData file in a separate directory.
To automate this, I've managed to do the csv manipulation, set up as a function that takes a ticker as input and does all the data modification. But I'm stuck at the file-listing stage: I'm unable to make the pattern to be matched depend on the ticker being processed.
Below is the function I've tried to make work; it doesn't work:
csvlist2zoo <- function(symbol){
  csvlist=list.files(path = "D:/dataset/", pattern=paste("'.*?",symbol,".csv'",sep=""), full.names=T)
}
This works, but I can't make it work in a function:
csvlist2zoo <- function(symbol){
  csvlist=list.files(path = "D:/dataset/", pattern='.*?ibm.csv', full.names=T)
}
I searched SO; there are similar questions, but none exactly meet my requirement. If I missed something, please point me in the right direction. Still fighting with regex.
OS: Win8 64-bit, R version 3.1.0 (if needed)
Try:
csvlist2zoo <- function(symbol){
  list.files(pattern=paste0('\\d{4}-\\d{2}-\\d{2}', symbol, ".csv"))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
csvlist2zoo("GS")
#[1] "2005-05-18GS.csv"
I created some files in the working directory (Linux):
v1 <- c("2001-05-17MSFT.csv", "2005-05-18GS.csv", "2002-12-19QQQ.csv", "2008-01-25QQQ.csv")
lapply(v1, function(x) write.csv(1:3, file=x))
Update
Using paste
csvlist2zoo <- function(symbol){
  list.files(pattern=paste('\\d{4}-\\d{2}-\\d{2}', symbol, ".csv", sep=""))
}
csvlist2zoo("QQQ")
#[1] "2002-12-19QQQ.csv" "2008-01-25QQQ.csv"
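For the original use case, a hedged variant that anchors the pattern, escapes the dot, and returns full paths ready for reading (the D:/dataset path is the one from the question):
csvlist2zoo <- function(symbol){
  list.files(path = "D:/dataset",
             pattern = paste0('^\\d{4}-\\d{2}-\\d{2}', symbol, '\\.csv$'),
             full.names = TRUE)
}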