Rename files sequentially while keeping the "connected" file name pattern using regular expressions - regex

I'm trying to reorganize a photo library that contains edited files as well as originals. I have already achieved the desired folder structure using Exif Sorter, i.e. %UserProfile%\Photos\%year%\%month%\%day%.
Each %day% folder contains photo files with slightly different name patterns:
IMG_0001.jpg
ZMGM00002.jpg
ZMGM00003 (Edited).jpg
ZMGM00003.jpg
IMG_0002 (Edited).jpg
IMG_0002.jpg
IMG_0004.jpg
I'd like the files to be named sequentially while keeping the relevant " (Edited)" suffix:
DSC_0001.jpg
DSC_0002.jpg
DSC_0002 (Edited).jpg
DSC_0003.jpg
DSC_0004 (Edited).jpg
DSC_0004.jpg
DSC_0005.jpg
So far I have come up with a regular expression to rename "*.jpg" and "* (Edited).jpg" files, preserving the " (Edited)" suffix when it's there (sorry, I use RegexRenamer because I'm a beginner):
match string ^(\D+)(_)?(\d+)(Edited)?
replace string DSC_$#$4
However, I get sequential numbering across all files, and thus the connection between edited files and their originals is lost:
DSC_0001.jpg
DSC_0002.jpg
DSC_0003 (Edited).jpg
DSC_0004.jpg
DSC_0005 (Edited).jpg
DSC_0006.jpg
DSC_0007.jpg
Is there any way I can rename the files and preserve the filename "connection" between them, i.e. so I get DSC_0002 (Edited).jpg & DSC_0002.jpg instead of DSC_0002 (Edited).jpg & DSC_0003.jpg?
Since I've got thousands of folders, the renaming should be recursive, and the sequence should restart with each new folder. I believe this requires PowerShell or batch scripting to work out the required condition, but I'm not sure where to start. I am open to ideas, e.g. maybe I could process the file names via Excel first and then batch-rename from a TXT/CSV file.
P.S. I've got around 80,000 family photos going back to the late '90s; it would take ages to process them by hand. I can run anything on Windows or macOS to solve this (I would prefer Windows, though).
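To make the grouping I'm imagining clearer, here is a very rough Python sketch (purely illustrative: the root path and the DSC_ prefix are placeholders, and I haven't thought through name collisions or the right sort order), where each original and its " (Edited)" twin share the same number:

import os
import re

ROOT = r"C:\Users\Me\Photos"   # hypothetical library root

for folder, _dirs, files in os.walk(ROOT):
    # Base name = filename without extension and without the " (Edited)" marker
    bases = sorted({re.sub(r" \(Edited\)$", "", os.path.splitext(f)[0])
                    for f in files if f.lower().endswith(".jpg")})
    # The sequence restarts in every folder; sorting is alphabetical here,
    # but it could be by EXIF capture time instead
    for seq, base in enumerate(bases, start=1):
        for suffix in ("", " (Edited)"):
            old = os.path.join(folder, base + suffix + ".jpg")
            if os.path.exists(old):
                new = os.path.join(folder, "DSC_%04d%s.jpg" % (seq, suffix))
                os.rename(old, new)

Again, this is only to show the intended logic, not something I would trust to run over 80,000 photos as-is.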

Related

I wonder if I can run a data pipeline on a directory with a specific name in Data Fusion

I'm using Google Cloud Platform Data Fusion.
Assume the bucket's path is as follows:
test_buk/...
The test_buk bucket contains four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g. 20190901, 20190902),
and another prefix-based bundle based on 201910 (e.g. 20191001, 20191002).
I'd like to run the data pipeline for the 201909 and 201910 bundles.
Here's what I've tried: a regex path filter of
gs://test_buk/dir//2019
to run the data pipeline.
When the regex path filter is applied, no Input value is read, and likewise there is no Output value.
When I want to create a data pipeline over a specific directory in a bundle, how do I handle it in Data Fusion?
If you use the raw path directly (gs://test_buk/dir/), you might be running into an error when escaping special characters in the regex. That might be the reason why no input file matching your filter gets into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1)
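If it helps to sanity-check the filter before configuring the plugin, the matching can be reproduced with a plain regex test (the object names below are made up for illustration):

import re

# Hypothetical object paths under gs://test_buk/dir/
names = ["dir/20190901.csv", "dir/20190902.csv",
         "dir/20191001.csv", "dir/20191002.csv"]

september = re.compile(r".*201909.*\.csv")
print([n for n in names if september.fullmatch(n)])
# -> ['dir/20190901.csv', 'dir/20190902.csv']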

Pick up a particular file from a directory using regex in Talend

My directory contains files named WM_PersonFile_22022018, WM_PersonFile_23022018, WM_PersonFile_24022018 and WM_PersonFile_25022018, and these files arrive on a daily basis. I am using tFileList to iterate through the files.
What should the regex in my Filemask be to pick up the most recent file? Should 'Use Global Expressions as Filemask' be unchecked?
I tried "*.txt", which picks up all the files.
A regex would help you filter for the correct files.
Some other logic would get you the newest file. If you use tFileList, you might be able to sort by date and only take the first result.
Alternatively, if you also want to check that the date in the filename is correct, you might need to add a little logic with a tMap, tAssert, tJava or tJavaRow.
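As a language-neutral sketch of that "some other logic" (Python here just for brevity; the folder path is a placeholder and I am assuming the ddMMyyyy suffix from your examples), picking the newest file could look roughly like this:

import os
import re
from datetime import datetime

folder = "/data/incoming"                     # placeholder directory
pattern = re.compile(r"WM_PersonFile_(\d{2})(\d{2})(\d{4})(\.txt)?$")  # ddMMyyyy suffix

dated = []
for name in os.listdir(folder):
    m = pattern.match(name)
    if m:
        day, month, year = m.group(1), m.group(2), m.group(3)
        dated.append((datetime(int(year), int(month), int(day)), name))

newest = max(dated)[1] if dated else None     # name carrying the most recent date
print(newest)

In the job itself, the equivalent parsing and "keep the maximum date" step would go into a tJava or tJavaRow.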

GAMS: Filename cannot be used as valid UEL

I am trying to merge a large data set in GAMS. The merged file should consist of multiple GDX files with different names. The program merges the files just as I would like it to; however, it replaces the names of the files being merged with File_1, File_2, File_3 and so on. I would like to see the name of the GDX file in the merged file (and so far the script I wrote has worked fine otherwise).
I'm receiving the following error for each line:
***Filename cannot be used as a valid UEL
Existing name: ImpactYesPGTNoLDViolation-D1-PG10-LDI-LDE0.001-LB0.0045-PDC0-D10
Replaced with File_1
Why does this happen? Could it be that the existing name is too long? I tried to find out more about this error but so far have not found any information on it. And is there any way to fix it? I need the existing name in order to further process the output.
You are right: this name is too long to be used as a UEL (aka label). You can only use up to 63 characters. You can read more about this and other limitations of UELs here.

Reading dates from filenames

I want to extract dates from the suffixes of files in a particular folder. The contents of such a folder look something like:
Packed_Folder_1_2016.06.10
Packed_Folder_1_2016.08.06
Packed_Folder_1_2015.09.03
packed_Folder_1_2015.01.08
... (so on and so forth, always in the same path, just with different suffixes)
There is no pattern to the dates. I need to make a VS (2013) form to read the names of the files and store the date differences.
Notice how the filenames always follow a pattern? It's always Packed_Folder_1_####.##.##, where the last part is a date.
So what you want to do is list the file names in the folder, and try to find a file that matches the pattern. You could use a regular expression to match the filename (it would be something like R"(Packed_Folder_1_\d{4}\.\d{2}\.\d{2})").
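Sketched in Python purely to show the matching and date-difference logic (the filenames are taken from the question; in the VS project the same pattern works with std::regex or .NET Regex):

import re
from datetime import date

names = [
    "Packed_Folder_1_2016.06.10",
    "Packed_Folder_1_2016.08.06",
    "Packed_Folder_1_2015.09.03",
]
pattern = re.compile(r"Packed_Folder_1_(\d{4})\.(\d{2})\.(\d{2})")

dates = []
for n in names:
    m = pattern.match(n)
    if m:
        y, mo, d = (int(g) for g in m.groups())
        dates.append(date(y, mo, d))

dates.sort()
print((dates[-1] - dates[0]).days)   # days between the oldest and newest suffix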
You are talking about Forms, so I am assuming you are able to use Visual C++. If that is the case, you can check the FileSystemWatcher class.
You instantiate it with a given path (file or directory), and it will trigger events based on changes to the target (simple change, creation, rename - you can select which ones). You could then update your reference in case the change suits your needs.

Python in KNIME: Downloading files and dynamically feeding them into a workflow

I'm using KNIME 3.1.2 on OS X and Linux for OpenMS analysis (mass spectrometry).
Currently, it uses static filename.mzML files manually placed in a directory. It usually has more than one file fed in at a time (the 'Input Files' node, not the 'Input File' node) using a ZipLoopStart.
I want these files to be downloaded dynamically and then fed into the workflow... but I'm not sure of the best way to do that.
Currently, I have a Python script that downloads .gz files (from AWS S3) and then unzips them. I already have variations that can unzip the files into memory using StringIO (and maybe pass them into the workflow from there as data?).
It can also download them to a directory... which maybe can then be used as the source? But I don't know how to tell the ZipLoop to wait and check the directory after the Python script is run.
I could also have the Python script run as a separate entity (outside of KNIME) and then, once the directory is populated, call KNIME... However, there will always be a different number of files (maybe one, maybe three), and I don't know how to make the 'Input Files' KNIME node handle an unknown number of input files.
I hope this makes sense.
Thanks!
Thanks to Gábor for getting me on the right track, although I ended up taking a slightly different route after much experimentation.
===
Being new to KNIME, I don't know if this is an efficient use of KNIME or a complete kludge... but it does work.
So, part of the problem is some of the KNIME-specific objects, one of which is called URIDataValue.
A Python pandas DataFrame is, apparently, interchangeable with KNIME tables. However, I don't know if there's a way to import one of these URIDataValue objects into Python. So here's what I did...
1. I wrote a Python script that creates a pandas DataFrame and populates it with one column. Everything is a string, including the column header:
from pandas import DataFrame

# Create a one-column table of file URIs; everything here is a plain string
T = DataFrame(
    [
        ['file:///Users/.../copy/lfq_spikein_dilution_1.mzML'],
        ['file:///Users/.../copy/lfq_spikein_dilution_2.mzML'],
    ],
)
# The header must be 'URIDataValue' so the downstream node recognises the column
T.columns = ['URIDataValue']
# print T

# 'output_table' is what the KNIME 'Python Source' node passes downstream
output_table = T
That creates this dataframe:
Note: The column name and values are just strings, but it is (apparently) important that the column header be 'URIDataValue'... even though here it's just text. If the column name is not 'URIDataValue', the next node doesn't know what to do.
Next, the 'output_table' from the 'Python Source' node is connected to a 'String to URI' node, which (apparently and magically) knows to change the entire column's string values to URIDataValues (presumably based on the name of the first column... I don't know that for sure).
Finally, the new table, with the correct data objects, goes to a 'URI to PORT' node... since apparently 'Port' objects and 'URI' objects are different.
This then matches the needed input to the ZipLoop... which is normally the output from a static (hard-coded) 'Input Files' node.
Now, to actually solve the question above, I just have to add the code to my 'Python Source' node to download and unzip the S3 files, then annotate the dataframe with their locations, and go.
I have no idea what I'm doing, but it worked.
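For completeness, the download-and-unzip step I'm adding to the 'Python Source' node looks roughly like this (the bucket name, object keys and local folder are placeholders, and I'm using boto3, so treat it as a sketch rather than gospel):

import gzip
import os
import shutil

import boto3
from pandas import DataFrame

BUCKET = "my-ms-data"                       # placeholder bucket name
KEYS = ["run1.mzML.gz", "run2.mzML.gz"]     # placeholder object keys
DEST = "/Users/me/knime_tmp"                # placeholder local folder

s3 = boto3.client("s3")
uris = []
for key in KEYS:
    gz_path = os.path.join(DEST, key)
    mzml_path = gz_path[:-3]                # drop the ".gz"
    s3.download_file(BUCKET, key, gz_path)  # fetch the compressed file
    with gzip.open(gz_path, "rb") as src, open(mzml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)        # unzip to disk
    uris.append("file://" + mzml_path)

# Same one-column layout as above; the 'String to URI' node keys off this header
output_table = DataFrame(uris, columns=["URIDataValue"])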
There are multiple options to make this work:
Convert the files in memory to Binary Object cells using Python; you can later use those in KNIME. (I am not sure this one is supported, but as I remember it was demoed at one of the recent KNIME gatherings.)
Save the files to a temporary folder (Create Temp Dir) using Python and connect the Python node with a flow variable connection to a file reader node in KNIME (which should work in a loop: List Files, check the Iterate List of Files metanode).
Maybe there is already S3 Remote File Handling support in KNIME, so you could do the downloading and unzipping within KNIME. (Not that I know of, but it would be nice.)
I would go with option 2, but I am not so familiar with Python, so for you, probably option 1 is the best. (In case option 3 is supported, that is the best in my opinion.)