Identifying a list of xlsx files from the cloud (Google Drive)

I am trying to create a loop where I can perform the same set of functions for a bunch of files. However, the files are stored on a shared folder in google drive (xlsx files) and I can't seem to get the code to "find them". I am working on a Mac (if that makes a difference).
Here is an example of what I have tried:
library("googledrive")
library("readxl")
library("curl")
library("googlesheets")
library("xlsx") # and a few more which I have tried!
> setwd("/Users/xxx/Documents/R") #working on a Mac
> WS.URL <- "https://drive.google.com/drive/u/0/folders/xxx" # this is the shared drive folder containing numerous xlsx files
##a - the main one I am trying to do ###
> list.files(path = "WS.URL")
character(0) ## there are about 10 files in this folder which aren't showing up. I can't create a loop if I can't retrieve the files.
#b
> nfiles <-length(WS)
> nfiles
[1] 1 # should be about 10
#c
dest <- ("/Users/xxx/Documents/R")
try(download.file("WS.URL", dest))
I have no idea if I am missing something really obvious; I'm still getting to grips with R. Surely this should be straightforward?
HELP!

I can't help much with R, but there is a parameter on the files.list method called q; it's used for searching:
GET https://www.googleapis.com/drive/v3/files?q=sharedWithMe
By sending q=sharedWithMe it should return all the files that are shared with you. Testing the q parameter is easier using the Google APIs Explorer; you might want to test it there first.
Note: as far as I know, https://drive.google.com/drive/u/0/folders/xxx is not the proper endpoint for the Google Drive API, which may be causing some of your issues.
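Just to illustrate the API call (this is Python rather than R, and purely a sketch: the token.json credentials file is an assumption, not something from the question):
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Placeholder: previously authorised OAuth credentials stored in token.json
creds = Credentials.from_authorized_user_file("token.json")
service = build("drive", "v3", credentials=creds)

# Ask Drive for xlsx files that are shared with you
response = service.files().list(
    q="sharedWithMe and mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'",
    fields="files(id, name)",
).execute()

for f in response.get("files", []):
    print(f["name"], f["id"])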

Related

Is there a way to solve Stata's r(601) while looping the append for imported excel files?

I am trying to append multiple Excel files into a large database by executing the following code:
cls
set more off
clear all
global route = "C:\Users\NICOLE\Desktop\CAR"
cd "$route"
tempfile buildDB
save `buildDB', emptyok
local filenames : dir "$route" files "*.xlsx"
display `filenames'
foreach f of local filenames {
    import excel using `"`f'"', firstrow allstring clear
    gen source = `"`f'"'
    append using `buildDB'
    save `"`buildDB'"', replace
}
save "C:\Users\NICOLE\Desktop\CAR\DB_EG-RAC.dta" ,replace
Stata manages to append all of the files, but it also displays the following error message:
file C:\Users\NICOLE.xlsx not found r(601);
And I do not know how to solve it, because it does not let my code run as it should. Thanks!
We have a deadlock here. On the face of it, the filename in question is not one you write in your code, but could only be part of the result of
local filenames : dir "$route" files "*.xlsx"
But the file named in the error isn't even in the same directory as the one named in your code. Moreover, you are adamant that the file doesn't exist, and Stata, according to your error report, can't find it.
The question still remains: how does Stata get asked to open a file that supposedly doesn't exist?
My only guesses are feeble:
Code you are not showing is responsible.
You are running slightly different versions of this script in different places and getting confused. Can you replicate the error you got, running exactly this script again from the start? Have you searched everywhere remotely possible on the C: drive for this file NICOLE.xlsx?
It is crucial to realise that we can test nothing here. The problem has not been presented reproducibly.

GIMP script-fu, can "file-glob" return only files with particular extensions?

I'm trying to use python-fu in GIMP. I would like pdb.file_glob to return an array of image files in the format I specify. I tried:
myGlob = "*.png|*.PNG|*.jpg|*.JPG|*.gif|*.GIF|*.xcf|*.XCF"
globpath = os.path.join(patternDir, myGlob)
num_files, files = pdb.file_glob(globpath, 0)
But the files array is always empty, I assume because the glob syntax is invalid.
Note that if I use myGlob="*", I get the graphics files I want, but I also get files such as "fake.txt", which I want to exclude.
The documentation for all PDB functions can be found via the Python-fu console: hit the Browse... button and then enter your search in the filter bar at the top of the left pane. This documentation is dynamic: it includes the documentation of any callable plugin/script (as long as the authors have written some, of course).
The PDB functions for Python are a direct mapping of the script-fu API. In this specific case, file_glob() was very recently added to the script-fu API because there is nothing in the base TinyScheme language to do it. In Python, you are better off using the standard Python API: os.walk() or glob.glob()/glob.iglob().
In any case, such functions only do simple pattern matching; if you want several extensions, you want something like this:
sorted([filename for ext in ['XCF','xcf','jpg','JPG','jpeg','gif','GIF'] for filename in glob.glob('*.'+ext)])
Edit: this is a "comprehension", more or less a loop with the inner instruction outside. You can read it as:
files = []
for ext in ['XCF', 'xcf', 'jpg', 'JPG', 'jpeg', 'gif', 'GIF']:
    for filename in glob.glob('*.' + ext):
        files.append(filename)
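If you also need to descend into subdirectories, or prefer a single case-insensitive check instead of listing every case variant, a small sketch with os.walk() could look like this (the extension set below is just an example):
import os

wanted = {'.png', '.jpg', '.jpeg', '.gif', '.xcf'}  # example extensions, lower-case
start_dir = patternDir  # the same directory variable used in the question

files = []
for dirpath, dirnames, filenames in os.walk(start_dir):
    for name in filenames:
        # Compare each file's extension case-insensitively against the wanted set
        if os.path.splitext(name)[1].lower() in wanted:
            files.append(os.path.join(dirpath, name))
files.sort()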

rclone - How do I list which directory has the latest files in an AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks, and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has real-time data being written to it at any time, and that is the one I need to find. How do I determine which one it is? I could run a check to see which folder has the newest data, but how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list the contents of all 999 folders, it's useless, as the next folder will be active by that time.
If you wanted to know the specific folder with the very latest file, you should write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3
s3_resource = boto3.resource('s3')
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')
date_key_list = [(object.last_modified, object.key) for object in objects]
print(len(date_key_list)) # How many objects?
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!
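If all you need is the folder number, a tiny extension of the script above (assuming the KEWX/<folder>/<file> key layout shown in the output) would pull it out of the winning key:
latest_key = date_key_list[0][1]          # e.g. 'KEWX/125/20200912-071306-065-I'
latest_folder = latest_key.split('/')[1]  # -> '125'
print(latest_folder)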

Load data into Power BI from a relative path

I am trying to find a solution to load an external data file from a relative path, so that when someone else opens my PBIX it will still work on their computer.
Many thanks.
Relative paths are *not* currently supported by Power BI.
To ease the pain, you can create a variable that contains the path where the files are located, and use that variable to determine the path of each table. That way, you only have to change a single place (that variable) and all the tables will automatically point to the new location.
Create a Blank Query, give it a name (e.g. dataFolderPath) and type in the path where your files are (e.g. C:\Users\augustoproiete\Desktop)
With the variable created, edit each of your tables in the Advanced Editor and concatenate your variable with the name of the file.
e.g. instead of "C:\Users\augustoproiete\Desktop\data.xlsx", change it to dataFolderPath & "\data.xlsx"
You can also vote/watch this feature request to be notified when it gets implemented:
Support relative path to excel/csv sources
You can also use the "Parameters" function.
1. Create a new Parameter like "PathExcelFiles"
2. Edit your "Source" entry
3. Done!
I don't think this is possible yet.
Please add your support for this idea so the Microsoft Power BI team will be more likely to add this as a new feature.
I couldn't bear the fact that there is no way to use relative paths, but finally I had to...
So I tried to find a half-decent, acceptable workaround.
Using a Python script, it is at least possible to get access to the user's %HOME% directory.
let
PySource = Python.Execute("from pathlib import Path#(lf)import pandas as pd#(lf)dataset = pd.DataFrame([[str(Path.home())]], columns = [1])"),
homeDir = Text.Trim(Lines.ToText(PySource{[Name="dataset"]}[Value][1])),
...
The same should be possible with an R script, but I didn't try it.
Does anybody know a better solution for getting the %HOME% directory inside "Power" Query? I would be glad to have one.
Then I created two scripts inside my working directory. install.bat:
@ECHO OFF
if exist "%HOME%\.pbiTemplatePath\filepath.txt" GOTO :ERROR
:: These are the key commands
mkdir "%HOME%\.pbiTemplatePath"
echo|set /p="%cd%" > "%HOME%\.pbiTemplatePath\filepath.txt"
GOTO :END
:: Just a little message box
:ERROR
SET msgboxTitle=There is already another working directory installed.
SET /p msgboxBody=<"%HOME%\.pbiTemplatePath\filepath.txt"
SET tmpmsgbox=%temp%\~tmpmsgbox.vbs
IF EXIST "%tmpmsgbox%" DEL /F /Q "%tmpmsgbox%"
ECHO msgbox "%msgboxBody%",0,"%msgboxTitle%">"%tmpmsgbox%"
WSCRIPT "%tmpmsgbox%"
:END
and uninstall_all.bat:
@ECHO OFF
if exist "%HOME%\.pbiTemplatePath\filepath.txt" RMDIR /S /Q "%HOME%\.pbiTemplatePath\"
So in "Power" BI I did this:
let
PySource = Python.Execute("from pathlib import Path#(lf)import pandas as pd#(lf)dataset = pd.DataFrame([[str(Path.home())]], columns = [1])"),
homeDir = Text.Trim(Lines.ToText(PySource{[Name="dataset"]}[Value][1])),
workingDirFile = Text.Combine({homeDir, ".PbiTemplatePath\filepath.txt"} , "\"),
workingDir = Text.Trim(Lines.ToText(Csv.Document(File.Contents(workingDirFile),[Delimiter=";", Columns=1, QuoteStyle=QuoteStyle.None])[Column1])),
...
Now my git repository contains a "Power" BI template file, some config files telling the template where to load the data from, and the install/uninstall scripts. The install script has to be executed once, and nobody has to copy and paste any path.
I'd be glad about any suggestions for improvement. It's not the solution Gotham deserves... Gotham deserves a better one.
As mentioned by a few people, you can use a dataset parameter and reference that in your script. What I haven't seen mentioned is that you can change these values using an API call:
https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/update-parameters
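As a rough sketch (not a tested recipe), such a call could be made with Python's requests library; the dataset ID, parameter name, new value and access token below are all placeholders:
import requests

dataset_id = "YOUR_DATASET_ID"          # placeholder
access_token = "YOUR_AAD_ACCESS_TOKEN"  # placeholder

url = f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/Default.UpdateParameters"
body = {"updateDetails": [{"name": "dataFolderPath", "newValue": r"D:\new\data\folder"}]}

# Update the dataset parameter via the Power BI REST API
response = requests.post(url, json=body, headers={"Authorization": f"Bearer {access_token}"})
response.raise_for_status()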

Python in Knime: Downloading files and dynamically pressing them into workflow

I'm using Knime 3.1.2 on OSX and Linux for OPENMS analysis (Mass Spectrometry).
Currently, it uses static filename.mzML files manually put in a directory. It usually has more than one file pressed in at a time ('Input FileS' module not 'Input File' module) using a ZipLoopStart.
I want these files to be downloaded dynamically and then pressed into the workflow...but I'm not sure the best way to do that.
Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).
It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script is run.
I also could have the Python script run as a separate entity (outside of KNIME) and then, once the directory is populated, call KNIME...HOWEVER, there will always be a different number of files (maybe 1, maybe 3)...and I don't know how to make the 'Input Files' KNIME node handle an unknown number of input files.
I hope this makes sense.
Thanks!
Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.
===
Being new to KNIME, I don't know if this is an efficient use of KNIME or a complete kluge...but it does work.
So, part of the problem is some of the KNIME-specific objects, one of which is called URIDataValue.
A Python pandas DataFrame is, apparently, interchangeable with KNIME tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...
First, I wrote a Python script that creates a pandas DataFrame and populates it with one column. Everything is a string, including the column header:
from pandas import DataFrame

# Build a one-column table of URI strings
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
)
T.columns = ['URIDataValue']

#print T
output_table = T
That creates this dataframe:
Note: The column name and values are just strings. But it is (apparently) important that the column header be 'URIDataValue'...even though HERE it's just text. If the column name is not 'URIDataValue' the next node doesn't know what to do.
NEXT, the 'output_table' from the 'Python Source' node is patched to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column...I don't know that for sure).
Finally, the NEW table, with the correct data objects, goes to a 'URI to PORT' node...since apparently 'Port' objects and 'URI' objects are different.
This then matches the needed input to the ZipLoop...which is normally the output from a static (hard-coded) 'Input Files' node.
Now, to actually solve the question above, I just have to add the code to my 'Python Source' to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
I have no idea what I'm doing, but it worked.
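For what it's worth, a rough sketch of that last step might look like this (the download directory is a placeholder; the only real requirement, per the above, is the 'URIDataValue' column name):
import glob
import os
from pandas import DataFrame

download_dir = '/tmp/mzml_downloads'  # placeholder: wherever the unzipped .mzML files land

# Build file:// URIs for every downloaded file and expose them as the output table
uris = ['file://' + os.path.abspath(p)
        for p in sorted(glob.glob(os.path.join(download_dir, '*.mzML')))]
T = DataFrame([[u] for u in uris], columns=['URIDataValue'])
output_table = T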
There are multiple options to make this work:
Convert the files in memory to Binary Object cells using Python; later you can use those in KNIME. (This one I am not sure is supported, but as I remember it was demoed in one of the last KNIME gatherings.)
Save the files to a temporary folder (Create Temp Dir) using Python and connect the Python node, via a flow variable connection, to a file reader node in KNIME (which should work in a loop: List Files, check the Iterate List of Files metanode); a rough sketch follows below.
Maybe there is already S3 Remote File Handling support in KNIME, so you can do the downloading, unzipping within KNIME. (Not that I know of, but it would be nice.)
I would go with option 2, but I am not so familiar with Python, so for you, probably option 1 is the best. (In case option 3 is supported, that is the best in my opinion.)
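For reference, a minimal sketch of option 2 inside a KNIME Python node might look like this (the bucket name, prefix and use of boto3 are assumptions for illustration, not part of the original workflow):
import gzip
import os
import shutil
import tempfile

import boto3

bucket_name = "my-bucket"   # placeholder
prefix = "raw-mzml/"        # placeholder

# Download every matching object into a temp folder, unzipping .gz files on the way
tmp_dir = tempfile.mkdtemp()
bucket = boto3.resource("s3").Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith("/"):  # skip folder placeholder keys
        continue
    local_path = os.path.join(tmp_dir, os.path.basename(obj.key))
    bucket.download_file(obj.key, local_path)
    if local_path.endswith(".gz"):
        with gzip.open(local_path, "rb") as src, open(local_path[:-3], "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(local_path)

print(tmp_dir)  # pass this path downstream, e.g. as a flow variable for a List Files node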