When is a newly created HDFS file considered available for read? - hdfs

Creating an HDFS file involves several things: metadata, allocating blocks, replicating blocks. My question is: when is a file considered available for read? Does it need to wait until all blocks are fully replicated?
In my HDFS log, I noticed HDFS first allocated blocks for my mapreduce staging file:
org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073743864_3041, replicas=10.206.36.220:9866, 10.206.37.92:9866, 10.206.36.186:9866, 10.206.36.246:9866, 10.206.38.104:9866, 10.206.37.119:9866, 10.206.37.255:9866, 10.206.37.129:9866, 10.206.38.97:9866, 10.206.38.5:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
but later the job failed to find the file:
INFO org.apache.hadoop.ipc.Server: IPC Server handler 80 on 8020, call Call#1 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 10.206.38.106:46254: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split
finally I see
INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073744008_3185, replicas=10.206.37.253:9866, 10.206.36.167:9866 for /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_
INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/xxx/.staging/job_12345678_0567/job.split.1234567890._COPYING_ is closed by DFSClient_NONMAPREDUCE_-1702431035_1
I'm guessing the file has never been fully created.

Data is available for read immediately after the flush operation. If a writer wants to ensure that data it has written does not get lost in the event of a system crash, it has to invoke flush. A successful return from the flush call guarantees that HDFS has persisted the data and relevant metadata. The writer can invoke flush as many times and as frequently as it needs; it can repeatedly write a few bytes to a file and then invoke flush. A reader that already has the file open might not see the changes that this flush effected, but any re-open of the file will allow the reader to access the newly written data. HDFS guarantees that once data is written to a file and flushed, it will either be available to new readers or an exception will be generated. New readers will encounter an IO error while reading those portions of the file that are currently unavailable; this can occur if all the replicas of a block are unavailable. HDFS guarantees to generate an exception (i.e. no silent data loss) if a new reader tries to read data that was earlier flushed but is currently unavailable.

Related

Concurent state file manipulation with multiple process beyond our control for Linux and Windows

The question below may sound a bit long and complex, but actually it's a quite simple, generic and common problem: three processes working on the same file. In the text below I'm trying to decompose the problem into a set of particular requirements with some illustrative examples.
Task preamble
There is a text file, called index, which contains some metadata.
There is an application (APP), which understands the file format and performs meaningful changes on it.
The file is stored under version control system (VCS), which is a source of changes performed on the same file by other users.
We need to design an application (APP) that will work with the file in a reasonable way, preferably without interfering much with the VCS, as it's assumed that the VCS is used to keep a large project with the index file being just a small part of it, and the user may want to update the VCS at any point without considering any ongoing operations within the APP. In that case the APP should gracefully handle the situation in a way that prevents any possible loss of data.
Preamble remarks
Please note that VCS is unspecified, it could be perforce, git, svn, tarballs, flash drives or your favourite WWII Morse-based radio and a text editor.
The file could be binary; that doesn't change things much. But with VCS storage in mind, it's prone to be merged, and therefore a text/human-readable format is most adequate.
Possible examples of such things are: complex configurations (AI behaviour trees, game object descriptions), resource listings, and other things that are not meant to be edited by hand, are related to the project at hand, but whose history matters.
Note that, unless you are keen to implement your own version control system, "outsourcing" most of the configuration into some external, client-server based solution does not solve the problem: you still have to keep a reference file within the version control system, pointing to the matching version of the configuration in the database. This means you still have the same problem, but at a bit smaller scale - a single text line in a file instead of a dozen.
The task itself
A generic APP in a vacuum may process the index in three phases: read, modify, write. The read phase: read and de-serialize the file. Modify: change an in-memory state. Write: serialize the state and write it to the file.
There are three kinds of generic workflows for such an application:
read -> <present an information>
read -> <present an information and await user's input> -> modify -> write
read -> modify -> write
The first workflow is for read-only "users", like a game client, which reads data once and forgets about the file.
The second workflow is for an editing application. With external updates being a rather rare occurrence, and it being improbable that a user will edit the same file in several editing applications at the same time, it's only reasonable to assume that a generic editing application will want to read the state only once (especially if it's a resource-consuming operation) and re-read it only in case of external updates.
The third workflow is for an automated cli usage - build servers, scripts and such.
With that in mind, it's reasonable to treat read and modify + write separately. Let's call an operation that performs only the read phase and prepares some information a read operation. A write operation, then, is an operation that modifies a state from a read operation and writes it to disk.
As workflows one and two may be run at the same time by different application instances, it's also reasonable to allow multiple read operations to run concurrently. Some read operations, like reads for editing applications, may want to wait until any existing write operations are finished, to read the most recent and up-to-date state. Other read operations, like those in a game client, may want to read the current state, whatever it is, without being blocked at all.
On the other hand, it's only reasonable for a write operation to detect any other running write operations and abort. Write operations should also detect any external changes made to the index file and abort. Rationale: there is no point in performing (and waiting for) any work that would be thrown away because it was based on a possibly out-of-date state.
For a robust application, the possibility of a critical failure of galactic scale should be assumed at every single point of the application. Under no circumstances should such a failure leave the index file inconsistent.
Requirements
file reads are consistent - under no circumstances should we read half of a file from before it was changed and the other half from after.
write operations are exclusive - no other write operations are allowed at the same time with the same file.
write operations are robustly waitable - we should be able to wait for a write operation to complete or fail.
write operations are transactional - under no circumstances should the file be left in a partially changed or otherwise inconsistent state, or be based on an out-of-date state. Any change to the index file prior to or during the operation should be detected and the operation aborted as soon as possible.
Linux
A read operation:
Obtain a shared lock, if requested - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_SH) the "lock" file.
open(2) (O_RDONLY) the index file.
Create contents snapshot and parse it.
close(2) the index file.
Unlock - flock(2) (LOCK_UN) and close(2) the "lock" file
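A minimal sketch of those read steps in Python (the fcntl module wraps the same flock(2) call used above; the file names are placeholders, not from the original post):

```python
import fcntl
import os

INDEX = "index.txt"          # hypothetical index file
LOCKFILE = INDEX + ".lock"   # hypothetical "lock" file

def read_index(shared_lock=True):
    """Read the index consistently, optionally under a shared lock."""
    lock_fd = None
    if shared_lock:
        # open(2) O_CREAT|O_RDONLY and flock(2) LOCK_SH on the "lock" file
        lock_fd = os.open(LOCKFILE, os.O_CREAT | os.O_RDONLY)
        fcntl.flock(lock_fd, fcntl.LOCK_SH)
    try:
        with open(INDEX, "rb") as f:   # open(2) O_RDONLY the index file
            snapshot = f.read()        # contents snapshot; parse it afterwards
    finally:
        if lock_fd is not None:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)  # unlock
            os.close(lock_fd)                    # close the "lock" file
    return snapshot
```

Game-client-style readers would pass shared_lock=False to avoid being blocked at all.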
A write operation:
Obtain an exclusive lock - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_EX) the "lock" file.
open(2) (O_RDONLY) the index file.
fcntl(2) (F_SETLEASE, F_RDLCK) the index file - we are only interested in writes, hence the F_RDLCK lease.
Check if the state is up-to-date, do things, change the state, write it to a temporary file nearby.
rename(2) the temporary file to the index - it's atomic, and if we haven't got a lease break so far, we won't at all - this will be a different file, not the one we've got the lease on.
fcntl(2) (F_SETLEASE, F_UNLCK) the index file.
close(2) the index file (the "old" one, with no reference in the filesystem left)
Unlock - close(2) the "lock" file
If a signal from the lease is received - abort and clean up, no rename. rename(2) has no mention that it might be interrupted, and POSIX requires it to be atomic, so once we've got to it - we've made it.
I know there are shared-memory mutexes and named semaphores (instead of advisory locking for cooperation between application instances), but I think we can all agree that they are needlessly complex for the task at hand and have their own problems.
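The write recipe can be sketched in the same way. This Python version keeps the flock / temp-file / rename(2) skeleton but substitutes a simple inode re-check for the F_SETLEASE step, since signal-driven lease breaks don't map cleanly onto a short example; file names are placeholders:

```python
import fcntl
import os
import tempfile

INDEX = "index.txt"          # hypothetical index file
LOCKFILE = INDEX + ".lock"   # hypothetical "lock" file

def write_index(transform):
    """Replace the index atomically under an exclusive lock.

    `transform` maps the old contents (bytes) to the new contents.
    The fcntl(2) F_SETLEASE step from the recipe above is omitted; as a
    weaker stand-in we re-check the inode just before the rename.
    """
    lock_fd = os.open(LOCKFILE, os.O_CREAT | os.O_RDONLY)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)   # exclusive lock on the "lock" file
    try:
        with open(INDEX, "rb") as f:      # open(2) O_RDONLY the index file
            before = os.fstat(f.fileno())
            new_data = transform(f.read())
        # Write the new state to a temporary file nearby...
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(INDEX)))
        with os.fdopen(fd, "wb") as out:
            out.write(new_data)
            out.flush()
            os.fsync(out.fileno())
        # ...and check we are still based on the same file before the atomic
        # rename(2). A lease would make this check race-free; this is not.
        if os.stat(INDEX).st_ino != before.st_ino:
            os.unlink(tmp)
            raise RuntimeError("index changed externally; aborting")
        os.rename(tmp, INDEX)             # atomic replacement
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
```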
Windows
A read operation:
Obtain a shared lock, if requested - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ) the index file
Read file contents
CloseHandle the index
Unlock - CloseHandle the "lock" file
A write operation:
Obtain an exclusive lock - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (LOCKFILE_EXCLUSIVE_LOCK, 1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE ) the index file
ReadDirectoryChanges (FALSE, FILE_NOTIFY_CHANGE_LAST_WRITE) on the index file directory, with OVERLAPPED structure and an event
Check the state is up-to-date. Modify the state. Write it to a temporary file
Replace the index file with a temporary
CloseHandle the index
Unlock - CloseHandle the "lock" file
During the modification part check for the event from the OVERLAPPED structure with WaitForSingleObject (zero timeout). If there are events for the index - abort the operation. Otherwise - fire the watch again, check if we are still up-to-date and if so - continue.
Remarks
The Windows version uses locking instead of the Linux version's notification mechanism, which may interfere with outside processes making writes, but there is seemingly no other way on Windows.
In Linux, you can also use mandatory file locking.
See "Semantics" section:
If a process has locked a region of a file with a mandatory read lock, then
other processes are permitted to read from that region. If any of these
processes attempts to write to the region it will block until the lock is
released, unless the process has opened the file with the O_NONBLOCK
flag in which case the system call will return immediately with the error
status EAGAIN.
and:
If a process has locked a region of a file with a mandatory write lock, all
attempts to read or write to that region block until the lock is released,
unless a process has opened the file with the O_NONBLOCK flag in which case
the system call will return immediately with the error status EAGAIN.
With this approach, the APP may set a read or write lock on the file, and the VCS will be blocked until the lock is released.
Note that neither mandatory locks nor file leases will work well if the VCS can unlink() the index file or replace it using rename():
If you use mandatory locks, VCS won't be blocked.
If you use file leases, APP won't get notification.
You also can't establish locks or leases on a directory. What you can do in this case:
After a read operation, the APP can manually check that the file still exists and has the same i-node.
But it's not enough for write operations. Since the APP can't atomically check the file's i-node and modify the file, it can accidentally overwrite changes made by the VCS without being able to detect it. You probably can detect this situation using inotify(7).
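That post-read i-node check can be sketched as follows (Python; the path names are illustrative):

```python
import os

def snapshot_identity(path):
    """Record the (device, inode) pair identifying the file right now."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def still_same_file(path, identity):
    """True if `path` still exists and refers to the same inode,
    i.e. nobody unlink()ed it or rename()d another file over it."""
    try:
        return snapshot_identity(path) == identity
    except FileNotFoundError:
        return False
```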

Buffering to the hard disk

I am receiving a large quantity of data at a fixed rate. I need to do some processing on this data on a different thread, but this may run slower than the data is coming in, so I need to buffer the data. Due to the quantity of data coming in the available RAM would be quickly exhausted, so it needs to overflow onto the hard disk. What I could do with is something like a filesystem-backed pipe, so the writer could be blocked by the filesystem, but not by the reader running too slowly.
Here's a rough set of requirements:
Writing should not be blocked by the reader running too slowly.
If data is read slow enough that the available RAM is exhausted it should overflow to the filesystem. It's ok for writes to the disk to block.
Reading should block if no data is available unless the stream has been closed by the writer.
If the reader is able to keep up with the data then it should never hit the hard disk as the RAM buffer would be sufficient (nice but not essential).
Disk space should be recovered as the data is consumed (or soon after).
Does such a mechanism exist in Windows?
This looks like a classic message queue. Did you consider MSMQ or similar? MSMQ has all the properties you are asking for. You may want to use direct addressing to avoid Active Directory http://msdn.microsoft.com/en-us/library/ms700996(v=vs.85).aspx and use local or TCP/IP queue address.
Use an actual file. Write to the file as the data is received, and in another process read the data from the file and process it.
You even get the added benefits of no multithreading.
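The "actual file" approach can be sketched as a crude file-backed pipe (Python; the class and file names are invented for illustration, and the reader would live on another thread or process). It covers the never-block-the-writer and blocking-reader requirements, but leaves out the disk-space-recovery point, which would need periodic truncation or rotation:

```python
import threading

class FileBackedPipe:
    """A crude file-backed pipe: the writer appends to a spill file on disk,
    so it is never blocked by a slow reader; the reader blocks until data
    arrives or the writer closes the stream."""

    def __init__(self, path):
        self._w = open(path, "ab")           # writer handle, append-only
        self._r = open(path, "rb", buffering=0)  # unbuffered tailing reader
        self._cond = threading.Condition()
        self._closed = False

    def write(self, data: bytes):
        self._w.write(data)
        self._w.flush()          # let the OS page cache absorb bursts
        with self._cond:
            self._cond.notify_all()

    def close(self):
        self._w.close()
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def read(self, n=65536) -> bytes:
        """Block until some data is available; b'' means writer closed."""
        while True:
            chunk = self._r.read(n)
            if chunk:
                return chunk
            with self._cond:
                if self._closed:
                    return b""
                self._cond.wait(timeout=0.1)  # re-poll for appended data
```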

update a file simultaneously without locking file

Problem: multiple processes want to update a file simultaneously. I do not want to use file-locking functionality, because in a highly loaded environment a process may block for a while, which I don't want. I want something like this: all processes send data to a queue or some shared place or something else, and one master process keeps taking data from there and writing it to the file. That way no process will get blocked.
One possibility is socket programming: all the processes send data to a single port, and the master keeps listening on this port and stores the data to the file. But what if the master goes down for a few seconds? If that happens, I may write to some file based on a timestamp and sync later. But I am putting this on hold and looking for some other solution. (No data loss.)
Another possibility may be taking a lock on the particular segment of the file that the process wants to write to. Basically each process will write a line. I am not sure how good that will be for a highly loaded system.
Please suggest some solution for this problem.
Have a 0mq instance handle the writes (as you initially proposed for the socket) and have the workers connect to it and add their writes to the queue (example in many languages).
Each process can write to its own file (pid.temp) and periodically rename the file (pid-0.data, pid-1.data, ...) so that a master process can grab all these files.
You may not need to construct something like this. If you do not want processes to get blocked, just use the LOCK_NB flag of Perl's flock. Periodically try to flock; if it doesn't succeed, continue the processing and store the values in an array. Once the file is locked, write the data to it from the array.
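The same non-blocking idea in Python rather than Perl (fcntl.flock raises BlockingIOError when LOCK_NB cannot acquire the lock; the names here are illustrative):

```python
import fcntl

def try_flush(path, pending):
    """Try to take a non-blocking exclusive lock on `path`; on success
    append the buffered lines and clear the buffer, otherwise keep
    buffering and return False so the caller can retry later."""
    f = open(path, "a")
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return False          # lock busy: keep data in `pending` for now
    try:
        f.writelines(line + "\n" for line in pending)
        pending.clear()
        return True
    finally:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)
        f.close()
```

Each worker calls try_flush periodically; the log line order across processes then depends on who wins the lock, which matches the question's tolerance.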

Multi threaded application in C++

I'm working on a multi-threaded application programmed in C++. It uses some temporary files to pass data between my threads. One thread writes the data to be processed into files in a directory. Another thread scans the directory for work files, reads the files, processes them further, then deletes those files. I have to use these files because if my app gets killed, I have to retain the data which has not been processed yet.
But I hate using multiple files. I just want to use a single file: one thread continuously writing to the file and the other thread reading the data and deleting the data which has been read.
Like a vessel that is filled from the top while at the bottom I can take and delete the data from the vessel. How can I do this efficiently in C++? First of all, is there a way?
As was suggested in the comments to your question, using a database like SQLite may be a very good solution.
However if you insist on using a file then this is of course possible.
I did it myself once - created a persistent queue on disk using a file.
Here are the guidelines on how to achieve this:
The file should contain a header which points to the next unprocessed record (entry) and to the next available place to write to.
If the records have variable length then each record should contain a header which states the record length.
You may want to add to each record a flag that indicates whether the record was processed.
File locking can be used to ensure no one reads from the portion of the file that is being written to.
Use low-level IO - don't use buffered streams of any kind; use direct write semantics.
And here are the schemes for reading and writing (probably with some small logical bugs, but you should be able to take it from there):
READER
Lock the file header and read it and unlock it back
Go to the last record position
Read the record header and the record
Write the record header back with the processed flag turned on
If you are not at the end of the file, lock the header and write the new location of the next unprocessed record; otherwise, write some marking to indicate there are no more records to process
Make sure that the next record to write points to the correct place
You may also want the reader to compact the file for you once in a while:
Lock the entire file
Copy all unprocessed records to the beginning of the file (You may want to keep some logic as not to overwrite your unprocessed records - maybe compact only if processed space is larger than unprocessed space)
Update the header
Unlock the file
WRITER
Lock the header of the file and see where the next record is to be written then unlock it
Lock the file from the place to be written to the length of the record
Write the record and unlock
Lock the header; if the unprocessed-record mark indicates there are no records to process, let it point to the new record; unlock the header
Hope this sets you on the write track
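The guidelines above can be sketched as a minimal single-process persistent queue (Python rather than C++; the locking and compaction steps are left out, and the on-disk layout shown is just one possible choice, not the answer's exact format):

```python
import os
import struct

HEADER = struct.Struct("<QQ")  # (next_unprocessed_offset, next_write_offset)
RECORD = struct.Struct("<IB")  # (payload_length, processed_flag)

class DiskQueue:
    """Minimal persistent queue in a single file: a file header pointing at
    the next unprocessed record and the next free position, plus a
    per-record header holding the length and a processed flag."""

    def __init__(self, path):
        try:
            self.f = open(path, "r+b")             # existing queue file
        except FileNotFoundError:
            self.f = open(path, "w+b")             # fresh queue: empty header
            self.f.write(HEADER.pack(HEADER.size, HEADER.size))
            self.f.flush()

    def _header(self):
        self.f.seek(0)
        return HEADER.unpack(self.f.read(HEADER.size))

    def _set_header(self, read_off, write_off):
        self.f.seek(0)
        self.f.write(HEADER.pack(read_off, write_off))
        self.f.flush()

    def push(self, payload: bytes):
        read_off, write_off = self._header()
        self.f.seek(write_off)                     # append at the free spot
        self.f.write(RECORD.pack(len(payload), 0) + payload)
        self._set_header(read_off, write_off + RECORD.size + len(payload))

    def pop(self):
        read_off, write_off = self._header()
        if read_off >= write_off:
            return None                            # nothing left to process
        self.f.seek(read_off)
        length, _ = RECORD.unpack(self.f.read(RECORD.size))
        payload = self.f.read(length)
        self.f.seek(read_off + 4)                  # rewrite the processed flag
        self.f.write(b"\x01")
        self._set_header(read_off + RECORD.size + length, write_off)
        return payload
```

Because the header and flags live in the file, a restarted process can resume from the next unprocessed record, which is the crash-retention property the question asks for.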
The Win32 API function CreateFileMapping() enables processes to share data: multiple processes can use memory-mapped files backed by the system paging file.
A few good links:
http://msdn.microsoft.com/en-us/library/aa366551(VS.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx
http://www.codeproject.com/Articles/34073/Inter-Process-Communication-IPC-Introduction-and-S
http://www.codeproject.com/Articles/19531/A-Wrapped-Class-of-Share-Memory
http://www.bogotobogo.com/cplusplus/multithreaded2C.php
You can write the data to be processed line by line, with a delimiter on each line indicating whether the record has been processed or not.

Rotating logs without restart, multiple process problem

Here is the deal:
I have a multi-process system (pre-fork model, similar to Apache). All processes are writing to the same log file (in fact a binary log file recording requests and responses, but no matter).
I protect against concurrent access to the log via a shared-memory lock, and when the file reaches a certain size, the process that notices it first rolls the logs by:
closing the file.
renaming log.bin -> log.bin.1, log.bin.1 -> log.bin.2 and so on.
deleting logs that are beyond the max allowed number of logs. (say, log.bin.10)
opening a new log.bin file
The problem is that the other processes are unaware of this, and in fact continue to write to the old log file (which was renamed to log.bin.1).
I can think of several solutions:
some sort of RPC to notify the other processes to reopen the log (maybe even a signal). I don't particularly like it.
have processes check the file length via the opened file stream, somehow detect that the file was renamed under them, and reopen the log.bin file.
None of those is very elegant in my opinion.
thoughts? recommendations?
Your solution seems fine, but you should store the inode number of the current logging file in shared memory (see stat(2) and the stat.st_ino member).
This way, each process keeps a local variable with the inode of the file it has open.
The shared variable must be updated on rotation by only one process, and all other processes become aware of it by checking for a difference between their local inode and the shared inode. A difference should induce a reopening.
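A sketch of that check (Python; `shared_ino` stands in for the inode value kept in shared memory, and the file name is illustrative):

```python
import os

def reopen_if_rotated(f, path, shared_ino):
    """If the inode this process has open differs from the shared inode
    (updated by whichever process rotated the log), reopen by name so we
    start writing to the freshly created log file."""
    if os.fstat(f.fileno()).st_ino == shared_ino:
        return f                       # still writing to the current log
    f.close()
    return open(path, "ab")            # pick up the new log.bin
```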
What about opening the file by name each time before writing a log entry?
get shared memory lock
open file by name
write log entry
close file
release lock
Or you could create a logging process, which receives log messages from the other processes and handles all the rotating transparently from them.
You don't say what language you're using, but your processes should all log to a log process, and the log process abstracts the file writing.
Logging client1 -> |
Logging client2 -> |
Logging client3 -> | Logging queue (with process lock) -> logging writer -> file roller
Logging client4 -> |
You could copy log.bin to log.bin.1 and then truncate the log.bin file.
That way the processes can still write to the old file pointer, which is empty now.
See also man logrotate:
copytruncate
    Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
Since you're using shared memory, and if you know how many processes are using the log file, you can create an array of flags in shared memory, telling each of the processes that the file has been rotated. Each process then resets its flag so that it doesn't re-open the file continuously.