Limit MQFTE file transfer to one file at a time - websphere-mq-fte

I have an MQFTE setup where we are receiving files from an external vendor. The files get dumped on a server in the DMZ, and we have an MQFTE agent that picks the files up from that server and drops them onto our server.
We receive files in "sets", i.e. each incoming file has an associated XML file that describes it and contains metadata about it, e.g. an applicationform.pdf and an applicationform.xml. The final application stores the PDF file based on the data/metadata in the XML.
Since the trigger is fired for each incoming file, we check in the trigger whether or not we've received the XML file and the content file (e.g. PDF).
However, I don't think this is the best approach, as it adds a lot of bookkeeping code to check for concurrency issues when both files arrive at the same time. Is there a way to:
1. Restrict the trigger so that it only fires when both files have arrived? From my research, this does not appear to be possible.
2. Configure the agent on the receiving server so that it only receives one file at a time? Looking at the documentation, it seems like this can be achieved, but only on the agent initiating the transfer, not on the agent receiving the transfer. The documentation hints at monitorMaxResourcesInPoll and the -bs parameter, but those would apply to the source agent, I guess, and since that agent is shared with multiple systems, this would impact them as well.
Also, I would appreciate any tips, suggestions, or even alternative solutions that would best meet this requirement.

In answer to 1), I don't think there is a way to check for both files existing before the monitor triggers. What some users do is send all of the files they want to transfer, and then finally put a 'marker' file in the directory, which the resource monitor looks for. Because the marker file is only written after all of the other files are ready to be sent, the monitor only transfers the files once they're all there.
In answer to 2), you could set maxDestinationTransfers to 1 on the destination agent to limit it to receiving a single transfer at a time. If a transfer contains multiple files, they will be transferred in sequence, so the destination is really only receiving one file at a time. monitorMaxResourcesInPoll simply limits the monitoring agent to a number of files it picks up from the source directory per monitor poll. You could set that to 1, but if you want to transfer the PDF and the XML file in the same transfer you'd need to set it to 2. It's probably not the setting you want to use.

Related

Apache Airflow: processing files in the order of receipt

I have multiple sources sending incremental data, and there are no metadata columns at the record level. How can I ensure that Airflow processes the data in the order of receipt? I may end up processing the files out of order.
Does Airflow have a built-in method/way to handle files in the order they were received?
Airflow version used: 2.4.3
You can use boto to retrieve the last modified timestamp from files in your S3 bucket within a PythonOperator.
This question has an answer that shows how to pull the last modified timestamp. You can then sort the keys by that timestamp, process the files in that order, and move the files to an archive folder or bucket so only new files are processed with every DAG run.
As a general note, if you have any control over your sources, I would prefer trying to add a timestamp at the record level; that seems like the easier option.
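For illustration, here is a minimal sketch of that approach as a callable you could hand to a PythonOperator, assuming boto3 credentials are configured on the worker; the bucket, prefix, and archive names are placeholders, not anything from the original question:

import boto3

def process_in_arrival_order(bucket="my-bucket", prefix="incoming/"):
    """List the objects, sort them by LastModified, and handle the oldest first."""
    s3 = boto3.client("s3")
    objects = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        objects.extend(page.get("Contents", []))

    for obj in sorted(objects, key=lambda o: o["LastModified"]):
        key = obj["Key"]
        # ... process the file here ...
        # then move it to an archive prefix so the next DAG run only sees new files
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key="archive/" + key.split("/")[-1])
        s3.delete_object(Bucket=bucket, Key=key)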

Putting a TWS file dependency on an AWS S3 stored file

I have an ETL application which is supposed to migrate to AWS infrastructure. The scheduler being used in my application is Tivoli Workload Scheduler, and we want to keep using it in the cloud as well, with its file dependencies.
Now, when we move to AWS, the files to be watched will land in an S3 bucket. Can we put an OPEN dependency on files in S3? If yes, what would be the hostname (HOST#Filepath)?
If not, what services should be used to serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g. a file might get uploaded to S3 at 1 AM. At 3 AM my schedule gets triggered and looks for the file in the S3 bucket. If it is present, execution starts; if not, it should wait as per the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, a job triggered at 3 AM will identify all files uploaded within the last, say, 24 hours.
You can list the S3 objects and filter down to everything uploaded within a specific period of time.
A better solution would be to create an S3 upload trigger that sends a message to SQS, and have your code inspect the queue depth (number of messages) there and start processing the files one by one. An additional benefit is the assurance that all items are processed, without having to worry about time overlaps.
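As a rough sketch of the first option (listing everything uploaded within a window), assuming boto3 and a placeholder bucket name:

from datetime import datetime, timedelta, timezone
import boto3

def keys_uploaded_since(bucket="my-bucket", hours=24):
    """Return the keys of objects whose LastModified falls within the last `hours` hours."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    s3 = boto3.client("s3")
    recent = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                recent.append(obj["Key"])
    return sorted(recent)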

IBM MQ v9 - Managed File Transfer - Initiate MFT once file is placed in folder

How can I initiate a file transfer, once a file is placed in a folder, using IBM MQ v9.0 Managed File Transfer? I can achieve this with the transfer initiation methods below (tried and tested, working fine):
Transfer by an initiation file
Transfer by a schedule
Transfer by a monitor
A file monitor is fine, but the trigger file contains the details of the files to be transferred: when the trigger file is placed in the folder, the files specified in it are transferred.
I need a solution where, once a file is placed in the folder, the file itself is fetched and transferred.
As per the IBM link below, https://www.ibm.com/support/knowledgecenter/en/SSFKSJ_8.0.0/com.ibm.wmqfte.doc/create_monitor_cmd.htm, in the purpose section:
"For example, you can use a resource monitor in the following way: An
external application puts one or more files in a known directory and
when processing is complete, the external application places a trigger
file in a monitored directory. The trigger file is then detected and a
defined file transfer starts, which copies the files from the known
directory to a destination agent".
i.e. we have to place the set of files to be transferred and then place a (second) trigger file to initiate the transfer. My question is: is there a way to initiate the transfer without the second file, as soon as a file is placed in the transfer directory?
Any help is very much appreciated.
Regards
Yasothar

Compose Google Storage Objects without headers via CLI

I was wondering if it would be possible to compose Google Cloud Storage objects (specifically CSV files) without headers (i.e. without the row with column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery, because compose blindly concatenates the files (including each file's header row of column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
Is there any way to do this via the CLI? I could not find anything in the docs.
Reading the comments, I think you are spending effort in the wrong place. I understand that you want to load your files into BigQuery, but the large number of files prevents you from doing this directly (too many API calls), and Dataflow is too slow.
Maybe you can think about it differently. I have 2 solutions to propose:
If you need "near real time" ingestion, and if the file size is below 1.5 GB, the best way is to build a function that reads the file and performs a streaming write to BigQuery. This function is triggered by a Cloud Storage event. If several files arrive at the same time, several functions will be spawned. Be careful: streaming writes to BigQuery are not free.
If you can wait up to 2 minutes after a file arrives, I recommend building a Cloud Function triggered every 2 minutes. This function reads the file names in the bucket, moves them to a subdirectory, and performs a load job of all the files in that subdirectory. You are limited to 1,000 load jobs per day (and per table), and a day contains 1,440 minutes, so batching every 2 minutes keeps you within the limit. Load jobs are free.
Are these acceptable alternatives?
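For the second option, here is a minimal sketch of the periodic load job using the google-cloud-bigquery client library; the staging path and table name are placeholders, and it assumes skip_leading_rows drops the header row of each matched CSV file:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # drop the header row of each CSV
    autodetect=True,       # or supply an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# The wildcard URI picks up every file moved into the staging subdirectory.
load_job = client.load_table_from_uri(
    "gs://my-bucket/staging/*.csv",     # placeholder path
    "my-project.my_dataset.my_table",   # placeholder table id
    job_config=job_config,
)
load_job.result()  # block until the load job completes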

"Realtime" syncing of large numbers of log files to S3

I have a large number of logfiles from a service that I need to regularly run analysis on via EMR/Hive. There are thousands of new files per day, and they can technically come out of order relative to the file name (e.g. a batch of files comes a week after the date in the file name).
I did an initial load of the files via Snowball, then set up a script that syncs the entire directory tree once per day using the 'aws s3 sync' CLI command. This is good enough for now, but I will need a more realtime solution in the near future. The issue with this approach is that it takes a very long time, on the order of 30 minutes per day, and it uses a ton of bandwidth all at once! I assume this is because it needs to scan the entire directory tree to determine which files are new, then sends them all at once.
A realtime solution would be beneficial in 2 ways. One, I can get the analysis I need without waiting up to a day. Two, the network use would be lower and more spread out, instead of spiking once a day.
It's clear that 'aws s3 sync' isn't the right tool here. Has anyone dealt with a similar situation?
One potential solution could be:
Set up a service on the log-file side that continuously syncs (or aws s3 cp) new files based on the modified date. But wouldn't that need to scan the whole directory tree on the log server as well?
For reference, the log-file directory structure is like:
/var/log/files/done/{year}/{month}/{day}/{source}-{hour}.txt
There is also a /var/log/files/processing/ directory for files being written to.
Any advice would be appreciated. Thanks!
You could have a Lambda function triggered automatically as a new object is saved on your S3 bucket. Check Using AWS Lambda with Amazon S3 for details. The event passed to the Lambda function will contain the file name, allowing you to target only the new files in the syncing process.
If you'd like to wait until you have, say, 1,000 files in order to sync in batch, you could use AWS SQS and the following workflow (using 2 Lambda functions, 1 CloudWatch rule and 1 SQS queue):
S3 invokes Lambda whenever there's a new file to sync
Lambda stores the filename in SQS
CloudWatch triggers another Lambda function every X minutes/hours to check how many files are in SQS for syncing. Once there are 1,000 or more, it retrieves those filenames and runs the syncing process.
Keep in mind that Lambda has a hard timeout of 5 minutes. If your sync job takes too long, you'll need to break it into smaller chunks.
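As a minimal sketch of the first two steps (the S3-triggered Lambda that queues each new key), assuming boto3 is available in the Lambda runtime and using a placeholder queue URL:

import json
import urllib.parse
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/files-to-sync"  # placeholder

def handler(event, context):
    """Triggered by S3 ObjectCreated events; queues each new key for later batch syncing."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # keys in S3 event notifications are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )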
You could set the bucket up to log HTTP requests to a separate bucket, then parse the log to look for newly created files and their paths. One trouble spot: as well as PUT requests, you have to look for the multipart upload operations, which are a sequence of POSTs. It's best to log for a few days to see what gets created before putting any effort into this approach.