Python in Knime: Downloading files and dynamically pressing them into the workflow - python-2.7

I'm using Knime 3.1.2 on OS X and Linux for OpenMS analysis (mass spectrometry).
Currently, it uses static filename.mzML files manually put in a directory. It usually has more than one file pressed in at a time ('Input FileS' module not 'Input File' module) using a ZipLoopStart.
I want these files to be downloaded dynamically and then pressed into the workflow...but I'm not sure the best way to do that.
Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data??).
It can also download them to a directory...which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script is run.
I also could have the Python script run as a separate entity (outside of Knime) and then, once the directory is populated, call Knime...HOWEVER there will always be a different number of files (maybe one, maybe three)...and I don't know how to make the 'Input Files' Knime node handle an unknown number of input files.
I hope this makes sense.
Thanks!

Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.
===
Being new to Knime, I don't know if this is an efficient use of Knime or a complete kludge...but it does work.
So, part of the problem is some of the Knime-specific objects, one of which is called URIDataValue.
A Python Pandas DataFrame is, apparently, interchangeable with Knime tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...
1. I wrote a Python script that creates a Pandas DataFrame and populates it with one column. Everything is a string, including the column header:
from pandas import DataFrame

# Build a one-column table of file URIs (everything here is a plain string)
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
)
T.columns = ['URIDataValue']
# print T
output_table = T
That creates this dataframe:
Note: The column name and values are just strings. But it is (apparently) important that the column header be 'URIDataValue'...even though HERE it's just text. If the column name is not 'URIDataValue' the next node doesn't know what to do.
NEXT, the 'output_table' from the 'Python Source' node is patched to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column...I don't know that for sure).
Finally, the NEW table, with the correct data objects, goes to a 'URI to PORT' node...since apparently 'Port' objects and 'URI' objects are different.
This, then, matches the needed input to the ZipLoop...which is normally the output from a static (hard-coded) 'Input Files' node.
Now, to actually solve the question above, I just have to add the code to my 'Python Source' to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
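A rough sketch of what that could look like inside the 'Python Source' node — the bucket name, key prefix and target folder below are made-up placeholders, and it assumes AWS credentials are already configured for boto3:
import gzip
import os

import boto3
from pandas import DataFrame

BUCKET = 'my-spectra-bucket'      # placeholder, not the real bucket
PREFIX = 'runs/current/'          # placeholder key prefix
TARGET = '/Users/me/mzml_copies'  # placeholder local folder

if not os.path.isdir(TARGET):
    os.makedirs(TARGET)

s3 = boto3.resource('s3')
uris = []
for obj in s3.Bucket(BUCKET).objects.filter(Prefix=PREFIX):
    if not obj.key.endswith('.mzML.gz'):
        continue
    gz_path = os.path.join(TARGET, os.path.basename(obj.key))
    s3.Bucket(BUCKET).download_file(obj.key, gz_path)
    # Decompress next to the .gz and remember the resulting .mzML location
    with gzip.open(gz_path, 'rb') as src, open(gz_path[:-3], 'wb') as dst:
        dst.write(src.read())
    uris.append('file://' + gz_path[:-3])

# One string column; the header must be 'URIDataValue' for the 'String to URI' node
output_table = DataFrame(uris, columns=['URIDataValue'])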
I have no idea what I'm doing, but it worked.

There are multiple options to make this work:
1. Convert the files in-memory to Binary Object cells using Python; later you can use those in KNIME. (This one, I am not sure is supported, but as I remember it was demoed at one of the last KNIME gatherings.)
2. Save the files to a temporary folder (Create Temp Dir) using Python and connect the Python node using a flow variable connection to a file reader node in KNIME (which should work in a loop: List Files, check the Iterate List of Files metanode). A rough sketch of the Python side follows below.
3. Maybe there is already S3 Remote File Handling support in KNIME, so you could do the downloading and unzipping within KNIME. (Not that I know of, but it would be nice.)
I would go with option 2, but I am not so familiar with Python, so for you, probably option 1 is the best. (In case option 3 is supported, that is the best in my opinion.)
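A minimal sketch of the Python side of option 2, assuming the Create Temp Dir node exports its path as a flow variable named 'temp_dir' (that name and the example file path are placeholders; flow_variables is the dictionary the KNIME Python scripting nodes expose):
import os
import shutil

# Path created upstream by 'Create Temp Dir'; the variable name is an assumption,
# so check what your Create Temp Dir node actually exports.
temp_dir = flow_variables['temp_dir']

# 'already_unzipped' stands in for wherever your own download/unzip code left
# the .mzML files (see the boto3 sketch further up).
already_unzipped = ['/tmp/lfq_spikein_dilution_1.mzML']  # placeholder
for path in already_unzipped:
    shutil.copy(path, os.path.join(temp_dir, os.path.basename(path)))

# A downstream 'List Files' node pointed at the same directory then feeds the loop.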

Related

Read-in df from csv before launching main app | Dash

I am trying to get my first dashboard with python dash running.
The whole thing is very similar to this https://github.com/dkrizman/dash-manufacture-spc-dashboard.
At the beginning, a DataFrame is read in from a CSV. My problem seems to be quite easy to solve but somehow I am not succeeding:
I want to create an initial window that allows the user to select (e.g. from a dropdown) the CSV file (or, accordingly, the path) that is read in. All the .csv files look the same but just have different values.
When using the modal components I run into problems installing bootstrap, and I thought there must be an easier way?
Thanks for your help!
Best,
Nik
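A minimal sketch of one way to do this with a plain dcc.Dropdown and a callback, so no bootstrap/modal is needed; the CSV paths and component ids below are placeholders:
import pandas as pd
from dash import Dash, dcc, html, Input, Output

CSV_FILES = ['data/run_a.csv', 'data/run_b.csv']  # placeholder paths

app = Dash(__name__)
app.layout = html.Div([
    # Let the user pick which CSV drives the dashboard
    dcc.Dropdown(id='csv-choice',
                 options=[{'label': p, 'value': p} for p in CSV_FILES],
                 value=CSV_FILES[0]),
    html.Div(id='summary'),
])

@app.callback(Output('summary', 'children'), Input('csv-choice', 'value'))
def load_csv(path):
    # Re-read the selected file whenever the dropdown changes
    df = pd.read_csv(path)
    return 'Loaded {} rows from {}'.format(len(df), path)

if __name__ == '__main__':
    app.run_server(debug=True)
The rest of the layout can then be built in callbacks that take the same dropdown value as an input, so every figure is derived from the selected file.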

Is there a way to solve Stata's r(601) while looping the append for imported excel files?

I am trying to append multiple Excel files into a large database by executing the following code:
cls
set more off
clear all
global route = "C:\Users\NICOLE\Desktop\CAR"
cd "$route"
tempfile buildDB
save `buildDB', emptyok
local filenames : dir "$route" files "*.xlsx"
display `filenames'
foreach f of local filenames {
    import excel using `"`f'"', firstrow allstring clear
    gen source = `"`f'"'
    append using `buildDB'
    save `"`buildDB'"', replace
}
save "C:\Users\NICOLE\Desktop\CAR\DB_EG-RAC.dta", replace
Stata manages to append all of the files, but it also displays the following message of error:
file C:\Users\NICOLE.xlsx not found r(601);
And I do not know how to solve it, because it does not let my code run as it should. Thanks!
We have a deadlock here. On the face of it, the filename in question is not one you write in your code, but could only be part of the result of
local filenames : dir "$route" files "*.xlsx"
But the file named in the error isn't even in the directory that you name. Moreover, you are adamant that the file doesn't exist, and according to your error report Stata can't find it either.
The question still remains: how does Stata get asked to open a file that supposedly doesn't exist?
My only guesses are feeble:
Code you are not showing is responsible.
You are running slightly different versions of this script in different places and getting confused. Can you reproduce, from scratch, the error you got that one time? Have you searched everywhere remotely possible on the C: drive for this file nicole.xlsx?
It is crucial to realise that we can test nothing here. The problem has not been presented reproducibly.

rclone - How do I list which directory has the latest files in AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has realtime data being written to it at any time and that is the one I need to find. How do I determine which one is the one I need? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp possible in all 999 directories. I am starting a bounty, and the correct answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents from all 999 folders, it's useless, as the next folder will be active by that time.
If you want to know the specific folder with the very latest file, you should write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3

s3_resource = boto3.resource('s3')
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Collect (last_modified, key) pairs for every object under the prefix
date_key_list = [(obj.last_modified, obj.key) for obj in objects]
print(len(date_key_list))  # How many objects?

# Sort newest first; the key of the first entry tells you which folder is active
date_key_list.sort(reverse=True)
print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,700 objects!
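If you have no AWS credentials configured, the same listing should also work anonymously against the public bucket by creating the resource with an unsigned signature, for example:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) access to the public bucket; no credentials required
s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))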

Pick up a particular file from a directory using regex in Talend

My directory contains files named WM_PersonFile_22022018, WM_PersonFile_23022018, WM_PersonFile_24022018, WM_PersonFile_25022018, and these files come in on a daily basis. I am using tFileList to iterate through the files.
What should be my regex in my Filemask to pick up the most recent file? Should the Use Global Expressions as Filemask be unchecked?
I tried "*.txt" which is picking up all the files.
RegEx would help you to filter for the correct files.
Some other logic would get you the newest file. If you use tFileList, you might be able to sort by date and only take the first result.
Alternatively, if you also want to check the date in the filename is correct, you might need to add a little logic with a tMap, tAssert, tJava or tJavaRow.

How do I make NSMetadataQuery see my saved documents' folders as packages?

I'm writing an app which saves and loads documents both locally and on iCloud. Locally is working fine, but I'm having a problem with iCloud.
The documents are saved as a package - the UIDocument reads and writes an NSFileWrapper which contains an image file, a thumbnail file, and an info plist. When I save the document to iCloud and then look at the files under 'Manage Storage', I see the individual files instead of the packages; and more importantly when I search for files using NSMetadataQuery it returns an NSMetadataItem for each of the individual files instead of the packages. As a result, my app doesn't realise there are any packages to load and iCloud is pretty useless.
I thought that if I set up the document type and exported the UTI correctly that the packages would be treated properly. Was that right? If so, what's the checklist for setting up a document type as a package? I have:
Added a document type
set LSTypeIsPackage to YES (I've tried string YES and bool YES)
set CFBundleTypeExtensions to an array containing one string: the file suffix
set LSHandlerRank to Owner
Exported a UTI with the same identifier
set it to conform to com.apple.package
added a UTTypeTagSpecification dictionary, containing an array for the key public.filename-extension, which contains one string: the file suffix
I've also tried adding a matching Imported UTI to match the exported one, but no luck there.
What did I miss?
UPDATE: I notice that the OP in this question is seeing the behaviour I want (even though he doesn't want it) so it must be possible.
Based on this I tried removing the LSItemContentTypes from my plist, and it worked.