I have been working on an approach to refresh HDFS files while other consumers/applications are accessing the data. I have an HDFS directory whose files are accessed by users, and I need to replace them with the latest incoming data every day. My refresh process takes only a few seconds or milliseconds, but jobs that are already reading this data for analytics are still affected by the refresh. My approach is: instead of writing the Spark job's output directly into the location users read from, I first write the data to a temporary location and then swap it in with the HDFS rename API. But my problem is still not solved. Please suggest a solution or workaround to replace HDFS files with no impact on downstream consumers.
val conf: Configuration = new Configuration()
val fs: FileSystem = FileSystem.get(conf)

val currentDate = java.time.LocalDate.now
val destPath = outputPath + "/data"
val archivePath = outputPath + "/archive/" + currentDate

// Swap the freshly written "_temp" directory into the live data location
val dataTempPath = new Path(destPath + "_temp")
val dataPath = new Path(destPath)
if (fs.exists(dataPath)) {
  fs.delete(dataPath, true)
}
if (fs.exists(dataTempPath)) {
  fs.rename(dataTempPath, dataPath)
}

// Do the same for the dated archive directory
val archiveTempData = new Path(archivePath + "_temp")
val archive = new Path(archivePath)
if (fs.exists(archive)) {
  fs.delete(archive, true)
}
if (fs.exists(archiveTempData)) {
  fs.rename(archiveTempData, archive)
}
Simpler approach
Use two HDFS locations cyclically per source or target for loading, with table definitions t1_x and t2_x pointing at them respectively, and a view_x that you switch between t1_x and t2_x accordingly (a sketch follows below).
Queries should always use the view_x.
Clean up the no-longer-used HDFS location in a timely manner before the next cycle.
The trick is to leave both the new and the old data around for a while.
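A minimal Spark SQL sketch of the flip, assuming Hive-backed tables; the staging path and the names t1_x, t2_x and view_x are placeholders taken from the description, and which of t1_x/t2_x is currently live must be tracked by your load job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Load today's data into whichever table is currently NOT live, e.g. t2_x
spark.read.parquet("/staging/today")
  .write.mode("overwrite").saveAsTable("t2_x")

// Atomically repoint the view that all queries use
spark.sql("ALTER VIEW view_x AS SELECT * FROM t2_x")

// Readers that already started against t1_x continue to resolve to t1_x;
// clean it up later, before the next cycle, once those queries have finished.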
Comment to make
The only drawback is if a set of queries needs to run against the old version of the data. If the changed data is purely additive ("added to"), there is no issue; if it can overwrite existing data, there is.
More complicated approach
In the latter case (whether or not that is an issue for you), you need to apply a more annoying solution, outlined below.
Version the data (via partitioning) with some value.
Keep a control table holding the current_version, pick this value up, and use it in all related queries until you can switch to the new current_version.
Then do your maintenance.
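A rough Spark sketch of that versioning pattern; the table names data_versioned and version_control, the staging path, and the version number are all placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

val newVersion = 42  // hypothetical next version number

// 1. Write the new data under a fresh version partition
spark.read.parquet("/staging/today")
  .withColumn("version", lit(newVersion))
  .write.mode("append")
  .partitionBy("version")
  .saveAsTable("data_versioned")

// 2. Only after the write succeeds, flip the pointer by overwriting
//    a single-row control table
Seq(newVersion).toDF("current_version")
  .write.mode("overwrite").saveAsTable("version_control")

// 3. Every related query reads current_version first and filters on it
val v = spark.table("version_control").first().getInt(0)
val current = spark.table("data_versioned").where($"version" === v)

Old version partitions can be dropped during maintenance once no query pins them.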
UPDATE: I had each node write to a separate file, and when the separate files were concatenated together the result was correct. I also updated the code to attempt a channel flush and file sync after each write of a single record, but there are still issues between nodes 0 and 1, now. If I make Node 0 sleep for a few seconds before it starts its iteration of the coforall loop, the records come out correct. If not, the last few hundred bytes of Node 0's records seem to be reliably overwritten with NULL bytes, up to the start of Node 1's records. The issues between Node 1 and Node 2, and Node 2 and Node 3, seem to not show up anymore.
Additionally, if I suppress either Node 0 or Node 1 from writing, I see the fully formed records from the unsuppressed node written correctly to the file. In the case that Node 1 is suppressed, I see 9,997 100-byte records (999,700 correct bytes) followed by NULL bytes in the file where Node 1's suppressed records would go. In the case that Node 0 is suppressed, I see exactly 999,700 NULL bytes in the file, after which Node 1's records begin.
Original Post:
I'm trying to troubleshoot an issue with parallel writes from different nodes to a shared NFS-backed file on disk. At the moment, I suspect that something is wrong with the way writes to the disk happen on the NFS server.
I'm working on adapting MPI+C code that uses pwrite to write to coordinated chunks of a file. If I try to have the equivalent locales in Chapel write to the file inside of a coforall loop, I end up with the bits of the file around the node boundaries messed up - usually the final few hundred bytes of each node's data are garbled. However, if I have just one locale iterate through the data on all locales and write it, the data comes out correctly. That is, I use the same data structures to calculate the offsets, but only Locale 0 seeks to that offset and performs the writes.
I've verified that the offsets into the file that each locale writes to do not overlap, and I'm using a single channel per task, defined from within the on loc do block, so that tasks don't share a single channel.
Are there known issues with writing to a file from different locales? A lot of the documentation makes it seem like this is known to be safe, but my unsubstantiated guess is that there are issues with caching of file contents: when examining the incorrect data, the bits that are incorrect appear to be the original data that was in the file at that location at the beginning of the program.
I've included the relevant routine below, in case you easily spot something I missed. To make this serial, I convert the coforall loc in Locales and on loc do block into a for j in 0..numLocales-1 loop, and replace here.id with j. Please let me know what else would help get to the bottom of this. Thanks!
proc write_share_of_data(data_filename: string, ref record_blocks) throws {
  coforall loc in Locales {
    on loc do {
      var data_file: file = open(data_filename, iomode.cwr);
      var data_writer = data_file.writer(kind=ionative, locking=false);
      var line: [1..100] uint(8);
      const indices = record_blocks[here.id].D;
      var local_record_offset = + reduce record_blocks[0..here.id-1].D.size;
      writeln("Loc ", here.id, ": record offset is ", local_record_offset);
      var local_bytes_offset = terarec.terarec_width_disk * local_record_offset;
      data_writer.seek(start=local_bytes_offset);
      for i in indices {
        var write_rec: terarec_t = record_blocks[here.id].records[i];
        line[1..10] = write_rec.key;
        line[11..98] = write_rec.value;
        line[99] = 13; // \r
        line[100] = 10; // \n
        data_writer.write(line);
        lines_written += 1;
      }
      data_file.fsync();
      data_writer.close();
      data_file.close();
    }
  }
  return;
}
Adding an answer here that solved my particular problem, though it doesn't explain the behavior seen. I ended up changing the outer loop from coforall loc in Locales to for loc in Locales. This isn't too big of an issue since it's all writing to one file anyway - I doubt that multiple locales can actually make much headway in all attempting to write concurrently to a single file on an NFS server. As a result, the change still allows nodes to write the data they have locally to NFS, rather than forcing Node 0 to collect and then write the data on behalf of all locales. This amounts to only adding idle time to the write operation commensurate with the time it takes Locale 0 to start the remote task on other nodes when the previous node has finished writing, which for the application at hand is not a concern.
Have you tried specifying start/end in file.writer instead of using seek? Does that change anything? What about specifying the end offset for the channel.seek call? Does it matter if the file is created and has the appropriate size before you start?
Other than that, I wonder if this issue would appear for both NFS and Lustre. If it appears for both it might well be a Chapel bug. It sounds from your description that the C program was using this pattern, which points to it being a bug. But, have you run C code doing this on your setup? If it being a Chapel bug seems most likely after further investigation, we would appreciate a bug report issue with a reproducer.
I know that NFS does not always do what one would like, in terms of data consistency. It's my understanding that it has "close to open" semantics but it's unclear to me what that means in the context of opening a file and writing to a particular region within it, in parallel from different locales.
From Why NFS Sucks by Olaf Kirch:
An NFS client is permitted to cache changes locally and send them to
the server whenever it sees fit. This sort of lazy write-back greatly
helps write performance, but the flip side is that everyone else will
be blissfully unaware of these change before they hit the server. To
make things just a little harder, there is also no requirement for a
client to transmit its cached write in any particular fashion, so
dirty pages can (and often will be) written out in random order.
I read two implications from this paragraph that are relevant to your situation here:
The writes you do on different locales can be observed by the NFS server in an arbitrary order. (However as I understand it, the data should be sent to the server by the time your fsync call returns).
These writes are done at an OS page granularity (usually 4k). (Note that this is more a hypothesis I am making than it is a fact. It should be tested or further investigated).
It would be interesting to check if 2. is a plausible explanation for the behavior you are seeing. For example, you could explore having each locale operate on a multiple of 4096 records (or potentially try writing records of 4096 bytes each) and see if that changes the behavior. If 2 is indeed the explanation, it should be possible to create a C program that demonstrates the behavior as well.
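To test that hypothesis outside of Chapel, here is a rough C sketch of the kind of reproducer suggested above (the file path, offsets, and lengths are placeholders). Run it concurrently from two different NFS client nodes with region boundaries that are not multiples of 4096, then inspect the file; if bytes near the boundary are lost, page-granularity write-back is a plausible culprit:

/* Hypothetical usage, one invocation per NFS client node:
 *   node0$ ./pwrite_test /mnt/nfs/shared.dat 0      999700
 *   node1$ ./pwrite_test /mnt/nfs/shared.dat 999700 999700
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <offset> <length>\n", argv[0]);
        return 1;
    }
    const char *path = argv[1];
    off_t offset = (off_t)strtoll(argv[2], NULL, 10);
    size_t length = (size_t)strtoull(argv[3], NULL, 10);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Fill this node's region with a distinctive byte so the regions can be
       told apart when inspecting the result. */
    char *buf = malloc(length);
    if (buf == NULL) { perror("malloc"); return 1; }
    memset(buf, 'A' + (int)((offset / 100000) % 26), length);

    if (pwrite(fd, buf, length, offset) != (ssize_t)length) { perror("pwrite"); return 1; }
    fsync(fd);  /* push client-cached pages to the NFS server */
    close(fd);
    free(buf);
    return 0;
}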
I have a Lambda (L1) that replaces a file (100 MB) at an S3 location (s3://bucket/folder/abc.json). I have other Lambdas (L2, L3) reading the same file at the same time, one via the Go API and another via an Athena query. The S3 bucket/folder is not versioned.
The question is: do the Lambdas L2 and L3 read the old copy of the file until the new file is fully uploaded? Or do they read the partial file that is being uploaded? If it is the latter, how do you make sure that L2 and L3 read the file only once it is fully uploaded?
Amazon S3 is now strongly consistent. This means once you upload an object, all people that read that object are guaranteed to get the updated version of the object.
On the surface, that sounds like it answers your question with "yes, all clients will get either the old version or the new version of the file". The truth is still a bit fuzzier than that.
Under the covers, many of the S3 APIs upload with a multi-part upload. This is well known, and doesn't change what I've said above, since the upload must be done before the object is available. However, many of the APIs also use multiple byte-range requests during downloads to download larger objects. This is problematic. It means a download might download part of file v1, then when it goes to download another part, it might get v2 if v2 was just uploaded.
With a little bit of effort, we can demonstrate this:
#!/usr/bin/env python3
import boto3
import multiprocessing
import io
import threading

bucket = "a-bucket-to-use"
key = "temp/dummy_key"
size = 104857600

class ProgressWatcher:
    def __init__(self, filesize, downloader):
        self._size = float(filesize)
        self._seen_so_far = 0
        self._lock = threading.Lock()
        self._launch = True
        self.downloader = downloader
    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            if self._launch and (self._seen_so_far / self._size) >= 0.95:
                self._launch = False
                self.downloader.start()

def upload_helper(pattern, name, callback):
    # Upload a file of 100mb of "pattern" bytes
    s3 = boto3.client('s3')
    print(f"Uploading all {name}..")
    temp = io.BytesIO(pattern * size)
    s3.upload_fileobj(temp, bucket, key, Callback=callback)
    print(f"Done uploading all {name}")

def download_helper():
    # Download a file
    s3 = boto3.client('s3')
    print("Starting download...")
    s3.download_file(bucket, key, "temp_local_copy")
    print("Done with download")

def main():
    # See how long an upload takes
    upload_helper(b'0', "zeroes", None)
    # Watch how the next upload progresses, this will start a download when it's nearly done
    watcher = ProgressWatcher(size, multiprocessing.Process(target=download_helper))
    # Start another upload, overwriting the all-zero file with all-ones
    upload_helper(b'1', "ones", watcher)
    # Wait for the downloader to finish
    watcher.downloader.join()
    # See what the resulting file looks like
    print("Loading file..")
    counts = [0, 0]
    with open("temp_local_copy") as f:
        for x in f.read():
            counts[ord(x) - ord(b'0')] += 1
    print("Results")
    print(counts)

if __name__ == "__main__":
    main()
This code uploads an object to S3 that's 100mb of "0". It then starts an upload, using the same key, of 100mb of "1", and when that second upload is 95% done, it starts a download of that S3 object. It then counts how many "0" and "1"s it sees in the downloaded file.
Running this with the latest versions of Python and Boto3, your exact output will no doubt differ from mine due to network conditions, but this is what I saw with a test run:
Uploading all zeroes..
Done uploading all zeroes
Uploading all ones..
Starting download...
Done uploading all ones
Done with download
Loading file..
Results
[83886080, 20971520]
The last line is important. The downloaded file was mostly "0" bytes, but there were 20mb of "1" bytes. Meaning, I got some part of v1 of the file and some part of v2, despite only performing one download call.
Now, in practice, this is unlikely to happen, all the more so if you have better network bandwidth than I do here on a run-of-the-mill home Internet connection.
But it can always potentially happen. If you need to ensure that downloaders never see a partial file like this, you either need to verify something like a hash of the file, or (my preference) upload under a different key each time and have some mechanism for the client to discover the "latest" key, so it can download the whole, unchanged object even if a new upload finishes while it is downloading.
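A minimal boto3 sketch of that key-per-version pattern (the bucket and key names are made up for illustration): the writer uploads each version under a new key and then overwrites a tiny pointer object, which is replaced atomically as a whole; readers resolve the pointer first and then download an object that never changes:

import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "a-bucket-to-use"  # placeholder bucket name

def publish(local_path):
    # Each upload gets its own key, so an object is never overwritten in place
    version_key = "data/abc-%d.json" % int(time.time())
    s3.upload_file(local_path, BUCKET, version_key)
    # The small pointer object is replaced in a single PUT
    s3.put_object(Bucket=BUCKET, Key="data/latest.json",
                  Body=json.dumps({"key": version_key}).encode())

def fetch(local_path):
    # Resolve the pointer, then download an immutable object, so a reader
    # can never see a mix of two versions
    pointer = json.loads(
        s3.get_object(Bucket=BUCKET, Key="data/latest.json")["Body"].read())
    s3.download_file(BUCKET, pointer["key"], local_path)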
The readers see only the old file until the new one is fully uploaded. There is no read of a partial file.
"Amazon S3 never adds partial objects."
SO discussion
Announcement
I have an embedded server which can be unplugged at any time. Is there an elegant way to implement a transactional counter in C++? In the worst case it should return the previous ID.
I have an embedded server which periodically generates report files. The server has no real-time clock or network connection, so I want to number the report files incrementally. However, after the report files are downloaded I would like to delete them while keeping the counter:
report00001.txt
report00002.txt
report00003.txt
report00004.txt
// all the files have been deleted
report00005.txt
...
I would like to use code like this:
int last = read_current_id("counter.txt");
last++;
// transaction begin
write_id("counter.txt", last);
// transaction end
(assuming your server is running some sort of unixy operating system)
You could try using the write-and-rename idiom for this.
What you do is write your new counter value to a different file, say counter.txt~, then rename the temporary file onto the regular counter.txt. rename() guarantees that either the new or the old version of the file exists at any point in time.
You should also mount your filesystem with the sync option so that file contents are not buffered in RAM. Note, however, that this will reduce performance and may shorten the lifespan of flash memory.
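A minimal C++ sketch of the idiom, folding the question's hypothetical read_current_id/write_id helpers into two small functions (error handling kept short; for full durability you would also fsync the containing directory):

#include <cstdio>
#include <fstream>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Read the last committed ID; if the file is missing or unreadable, start at 0.
int read_current_id(const std::string& path) {
    std::ifstream in(path);
    int id = 0;
    in >> id;
    return id;
}

// Write the new ID durably: write a temp file, flush it to disk, then
// atomically rename it over the real counter file.
bool write_id(const std::string& path, int id) {
    const std::string tmp = path + "~";
    {
        std::ofstream out(tmp, std::ios::trunc);
        out << id << '\n';
        if (!out.flush()) return false;
    }
    int fd = ::open(tmp.c_str(), O_RDONLY);   // fsync the temp file's data
    if (fd >= 0) { ::fsync(fd); ::close(fd); }
    // rename() is atomic: after a power cut you see either the old or the new counter.txt
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

int main() {
    int last = read_current_id("counter.txt");
    ++last;
    if (write_id("counter.txt", last))
        std::printf("report%05d.txt\n", last);  // e.g. report00005.txt
    return 0;
}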
I need my program to start a new log file on each execution. I want to use Poco, as I am already using this library in the code.
From my point of view, I have two possibilities, but I do not know how to configure either of them using a channel in Poco:
Just starting a new file each time the program starts.
Having the actual file name (not the rolled one, but the one being written) contain the timestamp of when it was created.
If I am not wrong, neither of these is possible using FileChannel. I guess I could write a new Poco channel but, obviously, I would prefer something that already works.
Does anybody have an idea for this? I tried to figure it out using two channels but I do not see how.
Thank you.
FileChannel has a rotateOnOpen property. If you set it to true, it will create a new file every time the channel is opened; see the FileChannel documentation. If you do not have this property available, you are using an older version of Poco; in that case, you can simply open the FileChannel with a newly generated name every time your application starts:
std::string name = yourCustomNameGenFunc();
AutoPtr<FileChannel> pChannel = new FileChannel(name);
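For the rotateOnOpen route, a minimal sketch (the log file name and message are arbitrary):

#include "Poco/AutoPtr.h"
#include "Poco/FileChannel.h"
#include "Poco/Logger.h"

int main() {
    Poco::AutoPtr<Poco::FileChannel> pChannel(new Poco::FileChannel("app.log"));
    // Start a fresh log file on every program start; older files are kept
    // or purged according to the usual archive/purge properties.
    pChannel->setProperty("rotateOnOpen", "true");
    Poco::Logger::root().setChannel(pChannel);
    Poco::Logger::root().information("New run started");
    return 0;
}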
I have read through the manual and I cannot find the answer. Given a magnet link, I would like to generate a torrent file so that it can be loaded on the next startup to avoid re-downloading the metadata. I have tried the fast resume feature, but I still have to fetch the metadata when I do it, and that can take quite a bit of time. The examples I have seen are for creating torrent files for a new torrent, whereas I would like to create one matching a magnet URI.
Solution found here:
http://code.google.com/p/libtorrent/issues/detail?id=165#c5
See creating torrent:
http://www.rasterbar.com/products/libtorrent/make_torrent.html
Modify first lines:
file_storage fs;
// recursively adds files in directories
add_files(fs, "./my_torrent");
create_torrent t(fs);
To this:
torrent_info ti = handle.get_torrent_info();
create_torrent t(ti);
"handle" is from here:
torrent_handle add_magnet_uri(session& ses, std::string const& uri, add_torrent_params p);
Also, before creating the torrent you have to make sure that the metadata has been downloaded; check this by calling handle.has_metadata().
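Putting those pieces together, a rough C++ sketch using the same (older) add_magnet_uri API quoted above; the magnet URI, save path, and output file name are placeholders:

#include <fstream>
#include <iterator>
#include <unistd.h>

#include <libtorrent/add_torrent_params.hpp>
#include <libtorrent/bencode.hpp>
#include <libtorrent/create_torrent.hpp>
#include <libtorrent/magnet_uri.hpp>
#include <libtorrent/session.hpp>
#include <libtorrent/torrent_handle.hpp>
#include <libtorrent/torrent_info.hpp>

int main() {
    libtorrent::session ses;
    libtorrent::add_torrent_params p;
    p.save_path = ".";
    libtorrent::torrent_handle h =
        libtorrent::add_magnet_uri(ses, "magnet:?xt=urn:btih:...", p);

    // Wait until the metadata has been fetched from peers
    while (!h.has_metadata())
        sleep(1);

    // Build a .torrent from the now-complete torrent_info and bencode it out
    libtorrent::torrent_info const& ti = h.get_torrent_info();
    libtorrent::create_torrent t(ti);

    std::ofstream out("from_magnet.torrent", std::ios::binary);
    libtorrent::bencode(std::ostream_iterator<char>(out), t.generate());
    return 0;
}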
UPDATE
It seems the libtorrent Python API is missing some of the important C++ API required to create a torrent from a magnet: the example above won't work in Python because the create_torrent Python class does not accept torrent_info as a parameter (the C++ API has it).
So I tried it another way, but also hit a brick wall that makes it impossible; here is the code:
if handle.has_metadata():
    torinfo = handle.get_torrent_info()
    fs = libtorrent.file_storage()
    for file in torinfo.files():
        fs.add_file(file)
    torfile = libtorrent.create_torrent(fs)
    torfile.set_comment(torinfo.comment())
    torfile.set_creator(torinfo.creator())
    for i in xrange(0, torinfo.num_pieces()):
        hash = torinfo.hash_for_piece(i)
        torfile.set_hash(i, hash)
    for url_seed in torinfo.url_seeds():
        torfile.add_url_seed(url_seed)
    for http_seed in torinfo.http_seeds():
        torfile.add_http_seed(http_seed)
    for node in torinfo.nodes():
        torfile.add_node(node)
    for tracker in torinfo.trackers():
        torfile.add_tracker(tracker)
    torfile.set_priv(torinfo.priv())
    f = open(magnet_torrent, "wb")
    f.write(libtorrent.bencode(torfile.generate()))
    f.close()
There is an error thrown on this line:
torfile.set_hash(i, hash)
It expects hash to be const char* but torrent_info.hash_for_piece(int) returns class big_number which has no api to convert it back to const char*.
When I find some time I will report this missing api bug to libtorrent developers, as currently it is impossible to create a .torrent file from a magnet uri when using python bindings.
torrent_info.orig_files() is also missing in python bindings, I'm not sure whether torrent_info.files() is sufficient.
UPDATE 2
I've created an issue on this, see it here:
http://code.google.com/p/libtorrent/issues/detail?id=294
Star it so they fix it fast.
UPDATE 3
It is fixed now, there is a 0.16.0 release. Binaries for windows are also available.
Just wanted to provide a quick update using the modern libtorrent Python package: libtorrent now has the parse_magnet_uri method which you can use to generate a torrent handle:
import libtorrent, os, time

def magnet_to_torrent(magnet_uri, dst):
    """
    Args:
        magnet_uri (str): magnet link to convert to torrent file
        dst (str): path to the destination folder where the torrent will be saved
    """
    # Parse magnet URI parameters
    params = libtorrent.parse_magnet_uri(magnet_uri)
    params.save_path = dst  # save_path is the one required field in add_torrent_params

    # Download torrent info
    session = libtorrent.session()
    handle = session.add_torrent(params)
    print("Downloading metadata...")
    while not handle.has_metadata():
        time.sleep(0.1)

    # Create torrent and save to file
    torrent_info = handle.get_torrent_info()
    torrent_file = libtorrent.create_torrent(torrent_info)
    torrent_path = os.path.join(dst, torrent_info.name() + ".torrent")
    with open(torrent_path, "wb") as f:
        f.write(libtorrent.bencode(torrent_file.generate()))
    print("Torrent saved to %s" % torrent_path)
If saving the resume data didn't work for you, you are able to generate a new torrent file using the information from the existing connection.
fs = libtorrent.file_storage()
libtorrent.add_files(fs, "somefiles")

t = libtorrent.create_torrent(fs)
t.add_tracker("http://10.0.0.1:312/announce")
t.set_creator("My Torrent")
t.set_comment("Some comments")
t.set_priv(True)
libtorrent.set_piece_hashes(t, "C:\\", lambda x: 0)

with open("mytorrent.torrent", "wb") as f:
    f.write(libtorrent.bencode(t.generate()))
I doubt that it'll make the resume faster than the function built specifically for this purpose.
Take a look at this code: http://code.google.com/p/libtorrent/issues/attachmentText?id=165&aid=-5595452662388837431&name=java_client.cpp&token=km_XkD5NBdXitTaBwtCir8bN-1U%3A1327784186190
It uses add_magnet_uri, which I think is what you need.