PDI - Collecting Files from FTP Older Than N Days - Kettle

I have a job that collects data from FTP using the Get a file with FTP job entry, and I want it to collect only yesterday's files, files older than N days, or files matching a specific date.
How can I do that? Is it possible?
What I know is that Get a file with FTP only copies files directly from the FTP server to the destination folder, so I can't use any field and assign it to a JavaScript variable to build a condition.
My requirement is to move only yesterday's files (or files matching a date condition) from the FTP server to the location I need, not all of them, because I have around 30K-40K files of various sizes and transferring everything would take a lot of time.
Below is a picture of the job I have designed.

There is a Shell job entry (under Scripting) in which you can put any shell script, including:
find . -mindepth 1 -maxdepth 1 -mtime -7 -exec mv -t /destination/path {} +
Note that -mtime -7 matches files modified within the last 7 days; use -mtime +7 if you want files older than 7 days. For an explanation of the shell script, have a look here: https://unix.stackexchange.com/questions/207679/moving-files-modified-after-a-specific-date

By using the 'Get File Names' step in a transformation, you can access your FTP files (via VFS) and their attributes, notably 'lastmodifiedtime'.
With this information you can apply a simple filter on dates and download only the files that are older than N days, or apply any other filter you require. With that in hand you can move, download, or perform any other file-related action you need.
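If it helps to prototype the same filter-by-date logic outside PDI, here is a minimal Python sketch (assuming a plain FTP server that supports the MDTM command, a remote directory containing only files, and placeholder host, credentials, and paths):
from ftplib import FTP
from datetime import datetime, timedelta
import os

FTP_HOST = "ftp.example.com"   # placeholder host
REMOTE_DIR = "/incoming"       # placeholder remote directory
LOCAL_DIR = "downloads"        # placeholder local destination
N_DAYS = 1                     # pull files older than N days

cutoff = datetime.utcnow() - timedelta(days=N_DAYS)

ftp = FTP(FTP_HOST)
ftp.login("user", "password")  # placeholder credentials
ftp.cwd(REMOTE_DIR)
os.makedirs(LOCAL_DIR, exist_ok=True)

for name in ftp.nlst():
    # MDTM returns "213 YYYYMMDDHHMMSS" (UTC) for a file's last-modified time.
    resp = ftp.sendcmd("MDTM " + name)
    mtime = datetime.strptime(resp.split()[1][:14], "%Y%m%d%H%M%S")
    if mtime < cutoff:
        with open(os.path.join(LOCAL_DIR, name), "wb") as f:
            ftp.retrbinary("RETR " + name, f.write)

ftp.quit()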

Related

gsutil rsync between gzip/non-gzip local/cloud locations

Can gsutil's rsync use the gzip'd size for change detection?
Here's the situation:
Uploaded non-gzip'd static site content to a bucket using cp -Z so it's compressed at rest in the cloud.
Modify HTML files locally.
Need to rsync only the locally modified files.
So the upshot is that the content is compressed in the cloud and uncompressed locally. Can rsync be used to figure out what's changed?
From what I've tried, I'm thinking no, because of the way rsync does its change detection:
If -c is used, compare checksums but ONLY IF file sizes are the same.
Otherwise use times.
And it doesn't look like -J/-j affects the file size comparison (the local uncompressed file size is compared against the compressed cloud version, which of course doesn't match), so -c won't kick in. Then the times won't match, and thus everything is uploaded again.
This seems like a fairly common use case. Is there a way of solving this?
Thank you,
Hans
To figure out how rsync identifies what has changed when using gsutil, please check the Change Detection Algorithm section of the documentation.
I am not sure how you want to compare gzip'd against non-gzip'd content, but maybe gsutil compose could be used as an intermediate step to compare the files before they are compressed.
Also take into account gsutil rsync's 4th limitation:
The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because Cloud Storage objects are immutable and no facility exists to read partial object checksums or perform partial replacements.
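If the built-in size/time comparison can't work for this layout, one possible workaround (just a sketch, not a gsutil feature) is to record a hash of the uncompressed content as custom object metadata at upload time and compare against it on later syncs. The bucket name, metadata key, local directory, and use of the google-cloud-storage Python client below are all assumptions:
import hashlib
from pathlib import Path
from google.cloud import storage  # assumed: pip install google-cloud-storage

BUCKET = "my-static-site"        # placeholder bucket name
META_KEY = "uncompressed-md5"    # hypothetical custom metadata key

def local_md5(path):
    # Hash of the uncompressed local file.
    return hashlib.md5(path.read_bytes()).hexdigest()

client = storage.Client()
bucket = client.bucket(BUCKET)

for path in Path("site").rglob("*.html"):    # placeholder local directory
    blob = bucket.blob(path.as_posix())
    remote_md5 = None
    if blob.exists():
        blob.reload()                        # fetch current custom metadata
        remote_md5 = (blob.metadata or {}).get(META_KEY)
    digest = local_md5(path)
    if remote_md5 != digest:
        # Uploads uncompressed here; to keep objects gzip'd at rest you would
        # gzip the content yourself and set content_encoding="gzip".
        blob.upload_from_filename(str(path), content_type="text/html")
        blob.metadata = {META_KEY: digest}
        blob.patch()                         # persist the custom metadata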

Caching for faster recompression of folder given edit/add/delete

Summary
Let's say I have a large number of files in a folder that I want to compress/zip before I send to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are a few questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
Consulting the man page of zip, there are several relevant options:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.
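As an illustration (a minimal sketch with placeholder file names, not necessarily the asker's workflow), Python's zipfile module can also append new entries to an existing archive in place, leaving the already-compressed entries untouched; deleting or replacing an entry still requires rebuilding the archive or using zip -d / zip -u:
import zipfile

# Append a new file without recompressing the entries already in the archive.
# Note: appending under an existing name adds a duplicate entry rather than
# replacing the old one.
with zipfile.ZipFile("stuff.zip", mode="a", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("new_report.txt")  # placeholder file name

# Listing the archive shows the old entries plus the appended one.
with zipfile.ZipFile("stuff.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_size, info.file_size)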

Copying multiple files inside a Google Cloud bucket to different directories based on file name

Suppose I have multiple files in different sub-directories with names like 20060630 AD8,11 +1015.WAV and 20050508_Natoa_Enc1_AD5AK_1.WAV. Now I know that all these files will have a substring like AD (in the first file) and AD, AK (in the second). There are a total of 16 of these classes (AD, AK, AN, etc.) that I've made as empty folders in the top-level directory.
I want to copy all these files, according to the substrings matched, into their respective directories. Using gsutil, the commands may look like:
gsutil cp "gs://bucket/Field/2005/20060630 AD8,11 +1015.WAV" "gs://bucket/AD/20060630 AD8,11 +1015.WAV"
How can this approach work for automating the task for thousands of files in the same bucket?
Is it safe to assume an approach like:
if 'AD' in filename:
    gsutil cp gs://bucket/<filename> gs://bucket/AD/<filename>
elif 'AK' in filename:
    gsutil cp gs://bucket/<filename> gs://bucket/AK/<filename>
You can write a simple Bash script for this. The code is pretty simple, since gsutil supports wildcards and can recursively dive into sub-directories to find your files.
#!/bin/bash
bucket_name=my-example-bucket
substring_list=(
  AD
  AK
  AN
)
for substring in "${substring_list[@]}"; do
  gsutil cp gs://$bucket_name/**/*$substring* gs://$bucket_name/$substring/
done
I also see that you have some Python experience, so you could alternatively leverage the Python Client for Google Cloud Storage along with a similar wildcard strategy.
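For reference, a rough sketch of that alternative using the google-cloud-storage client (the bucket name, prefix, and class list below are placeholders):
from google.cloud import storage  # assumed: pip install google-cloud-storage

BUCKET_NAME = "my-example-bucket"   # placeholder
CLASSES = ["AD", "AK", "AN"]        # extend to all 16 class substrings

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for blob in client.list_blobs(BUCKET_NAME, prefix="Field/"):  # placeholder prefix
    filename = blob.name.rsplit("/", 1)[-1]
    for cls in CLASSES:
        if cls in filename:
            # Copy the object into the matching class "folder", keeping its name.
            bucket.copy_blob(blob, bucket, new_name=cls + "/" + filename)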

Is there a way to merge rsync and tar (compress)?

NOTE: I am using the term tar loosely here. I mean compress whether it be tar.gz, tar.bz2, zip, etc.
Is there a flag for rsync to negotiate the changed files between source/destination, tar the changed source files, send the single tar file to the destination machine and untar the changed files once arrived?
I have millions of files and remotely rsyncing across the internet to AWS seems very slow.
I know that rsync has a compression option (-z), but my understanding is that it compresses changed files on a per-file basis. If there are many small files, the per-file overhead of sending a 1KB file as opposed to a 50KB file is still the bottleneck.
Also, simply tarring the whole directory is not efficient either, as it would take an hour to archive.
You can use the --rsyncable option of gzip or pigz to compress the tar file to .gz format. (You will likely have to find a patch for gzip to add that option; it's already part of pigz.)
The option partitions the resulting gzip file in a way that permits rsync to find only the modified portions, for much more efficient transfers when only some of the files in the .tar.gz have changed.
I was looking for exactly the same thing as you, and I landed on using borg.
tar cf - -C $DIR . | borg create $REPO::$NAME -
tar will still read the entire folder, so you won't avoid the read penalty versus just rsyncing two dirs (since I believe rsync uses tricks to avoid reading every file for changes), but you will avoid the write penalty, because borg will only write blocks it hasn't encountered before. Borg also compresses automatically, so there is no need for xz/gzip. And if borg is installed on both ends, it won't send superfluous data either, since the two borg instances can let each other know what they have and what they don't.
If avoiding that read penalty is crucial for you, you could possibly use rsync's tricks just to tell you which files changed, create a diff tar, and send that to borg, but then getting borg to merge archives is a whole second headache. You'd likely end up creating a filter that removes paths that were deleted from the original archive and then creating a new archive of just the file additions/changes. And then you'd have to do that for each archive recursively. In the end it would recreate the original archive by extracting each version in sequence, but, like I said, a total headache.

How can one extract text using PowerShell from zip files stored on dropbox?

I have been asked by a client to extract text from PDF files stored in zip archives on Dropbox. I want to know how (and whether it is possible) to access these files using PowerShell. (I've read about APIs you can use to access things on Dropbox, but have no idea how this could integrate into a PowerShell script.) I'd ideally like to avoid downloading them, as there are around 7,000 of them. What I want is a script that reads the content of these files online, in Dropbox, and then processes the relevant data (text) into a spreadsheet.
Just to reiterate: (i) Is it possible to access PDF files from Dropbox (and the text in them) which are stored in zip archives, and (ii) how can one go about this using PowerShell - what sort of script/instructions does one need to write?
Note: I am still finding my way around PowerShell, so it is hard for me to elaborate - however, as and when I become more familiar, I will happily update this post.
The only officially supported programmatic interface for Dropbox is the Dropbox API:
https://www.dropbox.com/developers
It does let you access file contents, e.g., using /files (GET):
https://www.dropbox.com/developers/core/docs#files-GET
However, it doesn't offer any ability to interact with the contents of zip files remotely. (Dropbox just considers zip files as blobs of data like any other file.) That being the case, exactly what you want isn't possible, since you can't look inside the zip files without downloading them first. (Likewise, even if the PDF files weren't in zip files, the Dropbox API doesn't currently offer any ability to search the text in the PDF files remotely. You would still need to download them.)
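Not PowerShell, but as a rough illustration of the "download first, then process locally" approach the answer describes, here is a sketch using the current Dropbox Python SDK and PyPDF2; the access token, zip path, and library choices are assumptions, not part of the original answer:
import io
import zipfile

import dropbox                 # assumed: pip install dropbox
from PyPDF2 import PdfReader   # assumed: pip install PyPDF2

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"    # placeholder
ZIP_PATH = "/archives/batch1.zip"     # placeholder Dropbox path

dbx = dropbox.Dropbox(ACCESS_TOKEN)

# The whole zip has to be downloaded; the API cannot look inside it remotely.
_metadata, response = dbx.files_download(ZIP_PATH)

with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for name in zf.namelist():
        if name.lower().endswith(".pdf"):
            reader = PdfReader(io.BytesIO(zf.read(name)))
            text = "".join(page.extract_text() or "" for page in reader.pages)
            print(name, len(text), "characters extracted")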