FTE should not pick up the file until it is ready for transfer - websphere-mq-fte

I came across a scenario where FTE transfers a file before the file is ready. Basically, MQ is still writing the file, FTE picks it up in between, and the transfer finishes before the file is completely written.

I think the MQ/application that is writing the file must have released its lock on the file, which indicates that writing is done, so FTE might have picked the file up for transfer.
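To illustrate the lock reasoning (not FTE's actual behaviour, which depends on the agent configuration), here is a sketch of a writer that holds an exclusive advisory lock for the whole duration of the write, so a transfer agent that respects the lock will not pick the file up early. The file name and the flock() approach are assumptions for the example.

// Illustration only: keep an exclusive advisory lock until the file is complete.
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

void write_complete_file(const char* path, const void* data, size_t len) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    ::flock(fd, LOCK_EX);        // held for the entire write
    ::write(fd, data, len);
    ::fsync(fd);                 // make sure the content is really on disk
    ::flock(fd, LOCK_UN);        // only now is the file "ready for transfer"
    ::close(fd);
}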

Related

AKKA: passing local resource to actor

Let's suppose I want to use AKKA actor model to create a program crunching data coming from files.
Since the model, as far as I understand it, works best when the actors are truly unaware of where they are running, passing the path of the file in the message seems like an error: when the app scales out, some actors will possibly not have access to that path. Conversely, passing the entire file as bytes is not an option either, due to resource issues (what if the file is big, or keeps getting bigger?).
What is the correct strategy to handle this situation? On the same note: would assuming a distributed file system be a good enough reason to accept paths as messages?
I don't think there's a single definitive answer, because it depends on the nature of the data and the "crunching". However, in the typical case where you really are doing data processing of the files, you are going to have to read the files into memory at some point. So, yes, the general answer is to read the entire file as bytes.
In answer to the question of "what if the file is bigger", that's why we have streaming libraries like Akka Streams. For example, a common case might be to use Alpakka to watch for files in a local directory (or FTP), parse them into records, filter/map the records to do some initial cleansing, and then stream those records to distributed actors to process. Because you are using streaming, Akka is not trying to load the whole file into memory at a time, and you get the benefit of backpressure so that you don't overload the actors doing the processing.
That's not to say a distributed file system might not have uses, for example for high availability. If you upload a file to the local filesystem of an Akka node and that node fails, you obviously lose access to your file. But that's really a separate issue from how you do distributed processing.

When is a newly created HDFS file considered available for read?

Creating an HDFS file involves several things: metadata, allocating blocks, replicating blocks. My question is, when is a file considered available for read? Does it need to wait until all blocks are fully replicated?
In my HDFS log, I noticed HDFS first allocated blocks for my mapreduce staging file:
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073743864_3041, replicas=10.206.36.220:9866, 10.206.37.92:9866, 10.206.36.186:9866, 10.206.36.246:9866, 10.206.38.104:9866, 10.206.37.119:9866, 10.206.37.255:9866, 10.206.37.129:9866, 10.206.38.97:9866, 10.206.38.5:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
but later the job failed to find the file:
INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 8020, call Call#1 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 10.206.38.106:46254: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
finally I see
INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073744008_3185, replicas=10.206.37.253:9866, 10.206.36.167:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_
INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_ is closed by DFSClient_NONMAPREDUCE_-1702431035_1
I'm guessing the file has never been fully created.
Data is available for read immediately after the flush operation. If a writer wants to ensure that data it has written does not get lost in the event of a system crash, it has to invoke flush. A successful return from the flush call guarantees that HDFS has persisted the data and the relevant metadata. The writer can invoke flush as many times and as frequently as it needs; it can repeatedly write a few bytes to a file and then invoke flush. A reader that already has the file open might not see the changes that this flush effected, but any re-open of the file will allow the reader to access the newly written data.
HDFS guarantees that once data is written to a file and flushed, it will either be available to new readers or an exception will be generated. New readers will encounter an IO error while reading those portions of the file that are currently unavailable; this can occur if all the replicas of a block are unavailable. HDFS guarantees to generate an exception (i.e. no silent data loss) if a new reader tries to read data that was earlier flushed but is currently unavailable.

Multi threaded application in C++

I'm working on a multi-threaded application programmed in C++. It uses some temporary files to pass data between my threads. One thread writes the data to be processed into files in a directory. Another thread scans the directory for work files, reads them, processes them further, and then deletes them. I have to use these files, because if my app gets killed I have to retain the data that has not been processed yet.
But I hate using multiple files. I just want to use a single file: one thread continuously writing to the file, and the other thread reading the data and deleting what has already been read.
Like a vessel that is filled from the top while I take and remove the data from the bottom. How can I do this efficiently in C++ - first of all, is there a way?
As was suggested in the comments to your question, using a database like SQLite may be a very good solution.
However if you insist on using a file then this is of course possible.
I did it myself once - created a persistent queue on disk using a file.
Here are the guidelines on how to achieve this:
The file should contain a header which points to the next unprocessed record (entry) and to the next available place to write to.
If the records have variable length then each record should contain a header which states the record length.
You may want to add to each record a flag that indicates whether the record was processed.
File locking can be used to ensure no one reads from the portion of the file that is being written to.
Use low-level IO - don't use buffered streams of any kind, use direct write semantics.
And here are the schemes for reading and writing (probably with some small logical bugs, but you should be able to take it from there; a rough code sketch follows the WRITER steps below):
READER
Lock the file header, read it, then unlock it
Go to the position of the next unprocessed record
Read the record header and the record
Write the record header back with the processed flag turned on
If you are not at the end of the file, lock the header and write the new location of the next unprocessed record; otherwise write some marker to indicate there are no more records to process
Make sure that the next record to write points to the correct place
You may also want the reader to compact the file for you once in a while:
Lock the entire file
Copy all unprocessed records to the beginning of the file (you may want to add some logic so as not to overwrite your unprocessed records - maybe compact only if the processed space is larger than the unprocessed space)
Update the header
Unlock the file
WRITER
Lock the header of the file, see where the next record is to be written, then unlock it
Lock the file from the place to be written to the length of the record
Write the record and unlock
Lock the header; if the unprocessed-record marker indicates there are no records to process, point it to the new record; then unlock the header
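Here is a rough, minimal sketch of these guidelines in C++, assuming POSIX pread/pwrite and a coarse whole-file flock() instead of the finer-grained header/record locking described above. The header layout, the field names, and the DiskQueue class are illustrative, not a definitive implementation.

// Single-file persistent queue sketch: fixed file header, per-record headers
// with a processed flag, low-level POSIX I/O, whole-file advisory locking.
#include <cstdint>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

struct FileHeader {          // stored at offset 0
    uint64_t next_read;      // offset of the next unprocessed record
    uint64_t next_write;     // offset where the next record will be written
};

struct RecordHeader {        // precedes every record
    uint32_t length;         // payload length in bytes
    uint32_t processed;      // 0 = pending, 1 = already consumed
};

class DiskQueue {
public:
    explicit DiskQueue(const std::string& path) {
        fd_ = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
        FileHeader h{};
        if (::pread(fd_, &h, sizeof h, 0) != static_cast<ssize_t>(sizeof h)) {
            h = {sizeof(FileHeader), sizeof(FileHeader)};   // brand new file
            ::pwrite(fd_, &h, sizeof h, 0);
        }
    }
    ~DiskQueue() { ::close(fd_); }

    // WRITER: append one record, then advance next_write in the header.
    void push(const std::vector<char>& payload) {
        ::flock(fd_, LOCK_EX);                              // coarse whole-file lock
        FileHeader h; ::pread(fd_, &h, sizeof h, 0);
        RecordHeader rh{static_cast<uint32_t>(payload.size()), 0};
        ::pwrite(fd_, &rh, sizeof rh, h.next_write);
        ::pwrite(fd_, payload.data(), payload.size(), h.next_write + sizeof rh);
        h.next_write += sizeof rh + payload.size();
        ::pwrite(fd_, &h, sizeof h, 0);
        ::fsync(fd_);                                       // make the record durable
        ::flock(fd_, LOCK_UN);
    }

    // READER: read the next unprocessed record, mark it processed,
    // advance next_read. Returns false if there is nothing to process.
    bool pop(std::vector<char>& out) {
        ::flock(fd_, LOCK_EX);
        FileHeader h; ::pread(fd_, &h, sizeof h, 0);
        if (h.next_read >= h.next_write) { ::flock(fd_, LOCK_UN); return false; }
        RecordHeader rh; ::pread(fd_, &rh, sizeof rh, h.next_read);
        out.resize(rh.length);
        ::pread(fd_, out.data(), rh.length, h.next_read + sizeof rh);
        rh.processed = 1;                                   // lets a restart skip it
        ::pwrite(fd_, &rh, sizeof rh, h.next_read);
        h.next_read += sizeof rh + rh.length;
        ::pwrite(fd_, &h, sizeof h, 0);
        ::fsync(fd_);
        ::flock(fd_, LOCK_UN);
        return true;
    }

private:
    int fd_ = -1;
};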
Hope this sets you on the write track
The Win32 API function CreateFileMapping() enables processes to share data: multiple processes can use memory-mapped files backed by the system paging file. A small sketch follows the links below.
A few good links:
http://msdn.microsoft.com/en-us/library/aa366551(VS.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx
http://www.codeproject.com/Articles/34073/Inter-Process-Communication-IPC-Introduction-and-S
http://www.codeproject.com/Articles/19531/A-Wrapped-Class-of-Share-Memory
http://www.bogotobogo.com/cplusplus/multithreaded2C.php
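A minimal sketch of paging-file-backed shared memory via CreateFileMapping, assuming Windows; the mapping name "Local\WorkQueue" and the 64 KB size are made up for the example.

// Create (or open) a named, paging-file-backed shared memory block and map it.
#include <windows.h>

void* open_shared_block(HANDLE& mapping_out) {
    HANDLE mapping = CreateFileMappingA(
        INVALID_HANDLE_VALUE,      // backed by the system paging file, not a real file
        nullptr, PAGE_READWRITE,
        0, 64 * 1024,              // 64 KB region (high/low size DWORDs)
        "Local\\WorkQueue");       // same name in every cooperating process
    if (mapping == nullptr) return nullptr;
    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);  // map the whole region
    mapping_out = mapping;         // caller should UnmapViewOfFile/CloseHandle later
    return view;
}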
You can also write the data line by line, with a delimiter on each line indicating whether that record has been processed or not.

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multi-process system (pre-fork model, similar to Apache). All processes write to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared-memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file,
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on,
deleting logs that are beyond the max allowed number of logs (say, log.bin.10),
opening a new log.bin file.
The problem is that the other processes are unaware of this, and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
Some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
Have processes check the file length via the opened file stream, somehow detect that the file was renamed under them, and reopen the log.bin file.
None of those is very elegant in my opinion.
Thoughts? Recommendations?
Your solution seems fine, but you should store the inode number of the current logging file in shared memory (see stat(2) and its st_ino member).
This way, each process keeps a local variable with the inode of the file it has open.
The shared variable is updated during rotation by only one process, and all other processes become aware of it by comparing their local inode with the shared inode; a difference should trigger a reopen.
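A sketch of that check, assuming POSIX fstat()/open() and a slot in your existing shared-memory segment that holds the inode of the current log.bin; shared_ino and log_path are placeholders for your own names.

// Compare the inode of the fd we hold with the shared "current" inode; reopen on mismatch.
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

// Updated only by the process that performs the rotation, while it holds the
// shared-memory lock; read by everyone else under the same lock.
ino_t* shared_ino = nullptr;   // hypothetical: points into the shared-memory segment

int ensure_current_log(int fd, const char* log_path) {
    struct stat st{};
    ::fstat(fd, &st);                    // inode of the file we currently have open
    if (st.st_ino != *shared_ino) {      // log.bin was rotated under us
        ::close(fd);
        fd = ::open(log_path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        // the rotating process stores the new inode into *shared_ino right
        // after it creates the fresh log.bin
    }
    return fd;
}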
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
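A minimal sketch of those steps, assuming POSIX open/write/close; acquire_log_lock and release_log_lock are placeholders for whatever shared-memory lock you already use.

// Open the log by name for every entry, so a rotated file is picked up automatically.
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

void acquire_log_lock() { /* your existing shared-memory lock here */ }
void release_log_lock() { /* your existing unlock here */ }

void write_log_entry(const char* path, const void* buf, size_t len) {
    acquire_log_lock();                                          // get shared memory lock
    int fd = ::open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);  // open file by name
    ::write(fd, buf, len);                                       // write log entry
    ::close(fd);                                                 // close file
    release_log_lock();                                          // release lock
}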
Or you could create a logging process, which receives log messages from the other processes and handles all the rotating transparently from them.
You don't say what language you're using, but your processes should all log to a log process, and the log process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file. That way the other processes can keep writing through their old file pointers, which now point at the (empty) truncated file; a sketch follows the man page excerpt below.
See also man logrotate:
copytruncate
Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
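A rough sketch of the copy-then-truncate step, assuming C++17 <filesystem> for the copy and POSIX truncate(); as the man page notes, entries written between the copy and the truncate can be lost.

// Done by whichever process notices the size limit, while holding the shared-memory lock.
#include <filesystem>
#include <unistd.h>

void roll_by_copytruncate(const std::filesystem::path& log) {
    namespace fs = std::filesystem;
    fs::copy_file(log, log.string() + ".1", fs::copy_options::overwrite_existing);
    ::truncate(log.c_str(), 0);   // writers keep their fds; with O_APPEND they simply
                                  // continue at the new (zero) end of the file
}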
Since you're using shared memory, and if you know how many processes are using the log file, you can create an array of flags in shared memory, telling each of the processes that the file has been rotated. Each process then resets its own flag so that it doesn't re-open the file continuously.
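A sketch of the flag array, assuming the flags live in the same shared-memory segment as your lock and are only touched while that lock is held; NPROC and my_index (the worker count and each process's slot) are assumptions.

// One "reopen needed" flag per process, set by the rotator, cleared by each worker.
constexpr int NPROC = 16;                  // hypothetical number of worker processes

struct SharedState {
    bool reopen_needed[NPROC];             // one flag per process
};

// Called (under the shared lock) by the process that rotates the log,
// right after it renames the files and opens the new log.bin.
void signal_rotation(SharedState* s) {
    for (int i = 0; i < NPROC; ++i) s->reopen_needed[i] = true;
}

// Called (under the shared lock) by every process before it writes an entry.
// Returns true exactly once per rotation, so the file isn't reopened repeatedly.
bool should_reopen(SharedState* s, int my_index) {
    bool need = s->reopen_needed[my_index];
    s->reopen_needed[my_index] = false;
    return need;
}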

How can I know I am the only person to have a file handle open?

I have a situation where I need to pick up files from a directory and process them as quickly as they appear. The process feeding files into this directory is writing them at a pretty rapid rate (up to a thousand a minute at peak times) and I need to pull them out and process them as they arrive.
One problem I've had is knowing that my C++ code has opened a file that the sending server has finished with -- that is, one that the local FTP server isn't still writing to.
Under Solaris, how can I open a file and know with 100% certainty that no-one else has it open?
I should note that once the file has been written to and closed off, the other server won't open it again, so if I can open it and know I've got exclusive access, I don't need to worry about checking that I'm still the only one with the file.
You can use flock() with operation LOCK_EX to ensure exclusive access. fcntl() is another possible way.
#include <sys/file.h>
int flock(int fd, int operation);
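A minimal usage sketch, assuming the writing side also takes an flock() on the file while it writes; flock locks are advisory, so the check only means something if both sides use them.

// Try to take an exclusive, non-blocking lock before processing the file.
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

bool try_claim(const char* path, int& fd_out) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) return false;
    if (::flock(fd, LOCK_EX | LOCK_NB) != 0) {   // writer still holds its lock
        ::close(fd);
        return false;                            // try again later
    }
    fd_out = fd;                                 // we now have exclusive access
    return true;
}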
EDIT: Two ways to do this: find an FTP server which locks the file while receiving it, or monitor the FTP server process using pfiles or lsof (which is available here: http://www.sunfreeware.com/) to make sure that no one else is accessing the files. I'm afraid the monitoring approach will not make you 100% safe, though.
Maybe you can check the timestamps of the incoming files, and if they haven't changed for a few minutes it would be safe to fetch, process, or otherwise do something with them.
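A sketch of that heuristic, assuming POSIX stat(); the quiet period of 180 seconds is an arbitrary example.

// A file is "probably finished" if its mtime hasn't changed for a while.
#include <ctime>
#include <sys/stat.h>

bool looks_finished(const char* path, std::time_t quiet_seconds = 180) {
    struct stat st{};
    if (::stat(path, &st) != 0) return false;            // can't stat it: skip for now
    return (std::time(nullptr) - st.st_mtime) >= quiet_seconds;
}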
Is the process that feeds files into the directory owned by you? If that is the case, then rename the file's extension to .working so that you don't pick up a file that is still being used.
EDIT: Since you said it is Solaris, write a shell script that uses the pfiles command to check whether the process still has the file you want open. If not, start processing the file.