Parallel Writes to NFS-backed File - chapel

UPDATE: I had each node write to a separate file, and when the separate files were concatenated together the result was correct. I also updated the code to attempt a channel flush and file sync after each write of a single record, but there are still issues between nodes 0 and 1, now. If I make Node 0 sleep for a few seconds before it starts its iteration of the coforall loop, the records come out correct. If not, the last few hundred bytes of Node 0's records seem to be reliably overwritten with NULL bytes, up to the start of Node 1's records. The issues between Node 1 and Node 2, and Node 2 and Node 3, seem to not show up anymore.
Additionally, if I suppress either Node 0 or Node 1 from writing, I see the fully-formed records from the un-suppressed node written correctly to the file. In the case that Node 1 is suppressed, I see 9,997 100B records (or 999,700 bytes) written correctly, followed by NULL bytes in the file where Node 1's suppressed records would go. In the case that Node 0 is suppressed, I see exactly 999,700 NULL bytes in the file, after which Node 1's records begin.
Original Post:
I'm trying to troubleshoot an issue with parallel writes from different nodes to a shared NFS-backed file on disk. At the moment, I suspect that something is wrong with the way writes to the disk happen on the NFS server.
I'm working on adapting MPI+C code that uses pwrite to write to coordinated chunks of a file. If I try to have the equivalent locales in Chapel write to the file inside of a coforall loop, I end up with the bits of the file around the node boundaries messed up - usually the final few hundred bytes of each node's data are garbled. However, if I have just one locale iterate through the data on all locales and write it, the data comes out correctly. That is, I use the same data structures to calculate the offsets, but only Locale 0 seeks to that offset and performs the writes.
I've verified that the byte ranges each locale writes do not overlap, and I'm using a single channel per task, defined from within the on loc do block, so that tasks don't share a channel.
Are there known issues with writing to a file from different locales? A lot of the documentation makes it seem like this is known to be safe, but my unsubstantiated guess is that file contents are being cached somewhere: when examining the incorrect data, the bytes that are incorrect seem to be the original data from the file in that location at the beginning of the program.
I've included the relevant routine below, in case you easily spot something I missed. To make this serial, I convert the coforall loc in Locales and on loc do block into a for j in 0..numLocales-1 loop, and replace here.id with j. Please let me know what else would help get to the bottom of this. Thanks!
proc write_share_of_data(data_filename: string, ref record_blocks) throws {
  coforall loc in Locales {
    on loc do {
      var data_file: file = open(data_filename, iomode.cwr);
      var data_writer = data_file.writer(kind=ionative, locking=false);
      var line: [1..100] uint(8);
      const indices = record_blocks[here.id].D;
      var local_record_offset = + reduce record_blocks[0..here.id-1].D.size;
      writeln("Loc ", here.id, ": record offset is ", local_record_offset);
      var local_bytes_offset = terarec.terarec_width_disk * local_record_offset;
      data_writer.seek(start=local_bytes_offset);
      for i in indices {
        var write_rec: terarec_t = record_blocks[here.id].records[i];
        line[1..10] = write_rec.key;
        line[11..98] = write_rec.value;
        line[99] = 13; // \r
        line[100] = 10; // \n
        data_writer.write(line);
        lines_written += 1;
      }
      data_file.fsync();
      data_writer.close();
      data_file.close();
    }
  }
  return;
}

Adding an answer here that solved my particular problem, though it doesn't explain the behavior seen. I ended up changing the outer loop from coforall loc in Locales to for loc in Locales. This isn't too big of an issue since it's all writing to one file anyway - I doubt that multiple locales can actually make much headway in all attempting to write concurrently to a single file on an NFS server. As a result, the change still allows nodes to write the data they have locally to NFS, rather than forcing Node 0 to collect and then write the data on behalf of all locales. This amounts to only adding idle time to the write operation commensurate with the time it takes Locale 0 to start the remote task on other nodes when the previous node has finished writing, which for the application at hand is not a concern.

Have you tried specifying start/end in file.writer instead of using seek? Does that change anything? What about specifying the end offset for the channel.seek call? Does it matter if the file is created and has the appropriate size before you start?
Other than that, I wonder if this issue would appear for both NFS and Lustre. If it appears for both it might well be a Chapel bug. It sounds from your description that the C program was using this pattern, which points to it being a bug. But, have you run C code doing this on your setup? If it being a Chapel bug seems most likely after further investigation, we would appreciate a bug report issue with a reproducer.
I know that NFS does not always do what one would like, in terms of data consistency. It's my understanding that it has "close to open" semantics but it's unclear to me what that means in the context of opening a file and writing to a particular region within it, in parallel from different locales.
From Why NFS Sucks by Olaf Kirch:
An NFS client is permitted to cache changes locally and send them to
the server whenever it sees fit. This sort of lazy write-back greatly
helps write performance, but the flip side is that everyone else will
be blissfully unaware of these change before they hit the server. To
make things just a little harder, there is also no requirement for a
client to transmit its cached write in any particular fashion, so
dirty pages can (and often will be) written out in random order.
I read two implications from this paragraph that are relevant to your situation here:
1. The writes you do on different locales can be observed by the NFS server in an arbitrary order. (However as I understand it, the data should be sent to the server by the time your fsync call returns).
2. These writes are done at an OS page granularity (usually 4k). (Note that this is more a hypothesis I am making than it is a fact. It should be tested or further investigated).
It would be interesting to check if 2. is a plausible explanation for the behavior you are seeing. For example, you could explore having each locale operate on a multiple of 4096 records (or potentially try writing records of 4096 bytes each) and see if that changes the behavior. If 2 is indeed the explanation, it should be possible to create a C program that demonstrates the behavior as well.
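A minimal C++ sketch of such a reproducer follows. It assumes POSIX pwrite and is not taken from the original MPI program; the names write_share and count_nul_bytes are hypothetical. Each rank writes its 100-byte records at its own byte offset, one pwrite per record, and a checker counts NUL bytes afterwards. To exercise NFS, each rank would have to run as a separate process on a different client node; on a local filesystem this only demonstrates the offset pattern.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

constexpr std::size_t kRecordBytes = 100;  // record width used in the question

// Write `count` records for `rank` into its own region of the shared file.
// Rank r owns bytes [r * count * 100, (r + 1) * count * 100).
bool write_share(const std::string& path, int rank, std::size_t count) {
    assert(rank >= 0);
    const int fd = open(path.c_str(), O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;

    std::array<char, kRecordBytes> rec;
    rec.fill(static_cast<char>('A' + rank % 26));  // recognizable per-rank filler
    rec[98] = '\r';
    rec[99] = '\n';

    const std::size_t base = static_cast<std::size_t>(rank) * count * kRecordBytes;
    bool ok = true;
    for (std::size_t i = 0; i < count && ok; ++i) {
        const off_t offset = static_cast<off_t>(base + i * kRecordBytes);
        ok = pwrite(fd, rec.data(), rec.size(), offset) ==
             static_cast<ssize_t>(rec.size());
    }
    ok = (fsync(fd) == 0) && ok;  // ask the OS to flush this descriptor's data
    ok = (close(fd) == 0) && ok;
    return ok;
}

// Count NUL bytes within the first `total` bytes of the file.
// Returns -1 if the file cannot be read or is shorter than `total`.
long count_nul_bytes(const std::string& path, std::size_t total) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return -1;
    long nul = 0;
    std::size_t seen = 0;
    int c;
    while (seen < total && (c = std::fgetc(f)) != EOF) {
        if (c == 0) ++nul;
        ++seen;
    }
    std::fclose(f);
    return seen == total ? nul : -1;
}
```

Running it once with 9,997 records per rank (region boundaries not page-aligned) and once with 4,096 records per rank (boundaries on 4096-byte multiples) would show whether the corruption follows page boundaries. A nonzero result from count_nul_bytes after all ranks have finished would mirror the symptom described in the question.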

Related

c++ Permanent Offline Counter

I have an embedded server, which can be unplugged any time. Is there an elegant way to implement a transactional c++ counter? In the worst case it should return the previous ID.
I have an embedded server which periodically generates report files. The server does not have time or network connection, so I want to generate the report files incrementally. However, after the report files are downloaded I would like to delete the report files, while maintaining the counter:
report00001.txt
report00002.txt
report00003.txt
report00004.txt
// all the files have been deleted
report00005.txt
...
I would like to use a code like this:
int last = read_current_id("counter.txt");
last++;
// transaction begin
write_id("counter.txt", last);
// transaction end
(assuming your server is running some sort of unixy operating system)
You could try using the write-and-rename idiom for this.
What you do is write your new counter value to a different file, say counter.txt~, then rename the temporary file onto the regular counter.txt. rename guarantees that either the new or old version of the file will exist at any time.
You should also mount your filesystem with the sync option so that file contents are not buffered in RAM. Note however that this will reduce performance, and may shorten the lifespan of flash memory.
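The write-and-rename idiom can be sketched in C++ roughly as follows, assuming a POSIX system. The function names read_current_id and write_id come from the question; their bodies are an illustration, not a tested embedded implementation. The fsync call before the rename is an addition that asks the OS to push the temporary file's contents to storage before it replaces the old counter.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Read the stored counter; returns 0 if the file is missing or unparsable.
int read_current_id(const std::string& path) {
    std::ifstream in(path);
    int id = 0;
    if (!(in >> id)) return 0;
    return id;
}

// Write the new value to "<path>~", flush it, then rename it over <path>.
// After a power cut, <path> holds either the old or the new value.
bool write_id(const std::string& path, int id) {
    const std::string tmp = path + "~";
    const std::string text = std::to_string(id) + "\n";

    const int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    bool ok = write(fd, text.data(), text.size()) ==
              static_cast<ssize_t>(text.size());
    ok = (fsync(fd) == 0) && ok;  // data must reach storage before the rename
    ok = (close(fd) == 0) && ok;
    if (!ok) {
        unlink(tmp.c_str());      // leave the old counter untouched
        return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

Usage matches the snippet in the question: read the value, increment it, and call write_id. For stronger durability one could also fsync the containing directory after the rename; that step is omitted here to keep the sketch short.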

Can a process be limited on how much physical memory it uses?

I'm working on an application, which has the tendency to use excessive amounts of memory, so I'd like to reduce this.
I know this is possible for a Java program, by adding a Maximum heap size parameter during startup of the Java program (e.g. java.exe ... -Xmx4g), but here I'm dealing with an executable on a Windows-10 system, so this is not applicable.
The title of this post refers to this URL, which mentions a way to do this, but which also states:
Maximum Working Set. Indicates the maximum amount of working set assigned to the process. However, this number is ignored by Windows unless a hard limit has been configured for the process by a resource management application.
Meanwhile I can confirm that the following lines of code indeed don't have any impact on the memory usage of my program:
HANDLE h_jobObject = CreateJobObject(NULL, L"Jobobject");
if (!AssignProcessToJobObject(h_jobObject, OpenProcess(PROCESS_ALL_ACCESS, FALSE, GetCurrentProcessId())))
{
    throw "COULD NOT ASSIGN SELF TO JOB OBJECT!:";
}
JOBOBJECT_EXTENDED_LIMIT_INFORMATION tagJobBase = { 0 };
tagJobBase.BasicLimitInformation.MaximumWorkingSetSize = 1; // far too small, just to see what happens
BOOL bSuc = SetInformationJobObject(h_jobObject, JobObjectExtendedLimitInformation, (LPVOID)&tagJobBase, sizeof(tagJobBase));
=> bSuc is true, or is there anything else I should expect?
On top of this, the mentioned tools (resource management applications, like Hyper-V) seem not to work on my Windows-10 system.
In addition, there seems to be another post about this subject, "Is there any way to force the WorkingSet of a process to be 1GB in C++?", but there the results seem to be negative too.
To be clear: I'm working in C++, so the solutions proposed at this URL are not applicable.
So now I'm stuck with the simple question: is there a way, implementable in C++, to limit the memory usage of the current process, running on Windows-10?
Does anybody have an idea?
Thanks in advance

Arduino substring doesn't work

I have a static method that searches String msg for a TAG and returns the value enclosed between the opening and closing tags.
This is the function:
static String genericCutterMessage(String TAG, String msg){
  Serial.print("a-----");
  Serial.println(msg);
  Serial.print("b-----");
  Serial.println(TAG);
  if(msg.indexOf(TAG) >= 0){
    Serial.print("msg ");
    Serial.println(msg);
    int startTx = msg.indexOf(TAG)+3;
    int endTx = msg.indexOf(TAG,startTx)-2;
    Serial.print("startTx ");
    Serial.println(startTx);
    Serial.print("endTx ");
    Serial.println(endTx);
    String newMsg = msg.substring(startTx,endTx);
    Serial.print("d-----");
    Serial.println(newMsg);
    Serial.println("END");
    Serial.println(newMsg.length());
    return newMsg;
  } else {
    Serial.println("d-----TAG NOT FOUND");
    return "";
  }
}
and this is output
a-----[HS][TS]5132[/TS][TO]5000[/TO][/HS]
b-----HS
msg [HS][TS]5132[/TS][TO]5000[/TO][/HS]
startTx 4
endTx 30
d-----
END
0
fake -_-'....go on! <-- print out of genericCutterMessage
In this case I want to return the string between the HS tags, so my expected output is
[TS]5132[/TS][TO]5000[/TO]
but I don't know why I receive an empty string.
To understand how substring works, I just followed the tutorial on the official Arduino site
http://www.arduino.cc/en/Tutorial/StringSubstring
I'm not an expert in C++ and Arduino but this looks like a flushing or buffering problem, isn't it?
Any idea?
Your code is correct, this should not happen. Which forces you to consider the unexpected ways that this could possibly fail. There is really only one candidate mishap I can think of, your Arduino is running out of RAM. It has very little, the Uno only has 2 kilobytes for example. It doesn't take a lot of string munching to fill that up.
This is not reported in a smooth way. All I can do is point you to the relevant company page. Quoting:
If you run out of SRAM, your program may fail in unexpected ways; it will appear to upload successfully, but not run, or run strangely. To check if this is happening, you can try commenting out or shortening the strings or other data structures in your sketch (without changing the code). If it then runs successfully, you're probably running out of SRAM. There are a few things you can do to address this problem:
If your sketch talks to a program running on a (desktop/laptop) computer, you can try shifting data or calculations to the computer, reducing the load on the Arduino.
If you have lookup tables or other large arrays, use the smallest data type necessary to store the values you need; for example, an int takes up two bytes, while a byte uses only one (but can store a smaller range of values).
If you don't need to modify the strings or data while your sketch is running, you can store them in flash (program) memory instead of SRAM; to do this, use the PROGMEM keyword.
That's not very helpful in your specific case, you'll have to look at the rest of the program for candidates. Or upgrade your hardware, StackExchange has a dedicated site for Arduino enthusiasts, surely the best place to get advice.
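One way to confirm that the index arithmetic itself is sound, independent of the board's memory, is to replay it on a desktop with std::string. The sketch below is a hypothetical port of genericCutterMessage (renamed generic_cutter_message), not Arduino code. Note one API difference: Arduino's substring takes a start and an end index, while std::string::substr takes a start and a length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Desktop port of the question's function, keeping its hard-coded offsets:
// +3 skips the rest of the opening tag ("HS]"), -2 backs up over "[/".
// Like the original, this only works for two-character tag names.
std::string generic_cutter_message(const std::string& tag,
                                   const std::string& msg) {
    const std::size_t first = msg.find(tag);
    if (first == std::string::npos) return "";

    const std::size_t start = first + 3;
    const std::size_t second = msg.find(tag, start);
    if (second == std::string::npos || second < start + 2) return "";

    const std::size_t end = second - 2;     // index of the closing "[/HS]"
    return msg.substr(start, end - start);  // substr takes a length, not an end
}
```

With the message from the question, start is 4 and end is 30, the same values the Serial output shows, and the function returns the expected inner string. That supports the reading that the logic is fine and the empty result on the board comes from somewhere else, such as the SRAM exhaustion suggested above.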

Need an example of Ypsilon usage

I started to mess with Ypsilon, which is a C++ implementation of Scheme.
It conforms to R6RS, features a fast garbage collector, and supports multi-core CPUs and Unicode, but has a LACK of documentation, C++ code examples and comments in the code!
Authors provide it as a standalone console application.
My goal is to use it as a scripting engine in an image processing application.
The source code is well structured, but the structure is unfamiliar.
I spent two weeks penetrating it, and here's what I've found out:
All communication with outer world is done via C++ structures called ports, they correspond to Scheme ports.
Virtual machine has 3 ports: IN, OUT and ERROR.
Ports can be std-ports (via console), socket-ports, bytevector-ports, named-file-ports and custom-ports.
Each custom port must provide a filled structure called handlers.
Handlers is a vector containing 6 elements: 1st one is a boolean (whether port is textual), and other five are function pointers (onRead, onWrite, onSetPos, onGetPos, onClose).
As far as I understand, I need to implement 3 custom ports (IN, OUT and ERROR).
But for now I can't figure out, what are the input parameters of each function (onRead, onWrite, onSetPos, onGetPos, onClose) in handlers.
Unfortunately, there is neither an example of implementing a custom port nor an example of the following:
C++ to Scheme function bindings (provided examples are a bunch of .scm-files, still unclear what to do on the C++ side).
Compiling and running bytecode (via bytevector-ports? But how to compile text to bytecode?).
Summarizing, if anyone provides a C++ example of any scenario mentioned above, it would significantly save my time.
Thanks in advance!
Okay, from what I can read of the source code, here's how the various handlers get called (this is all unofficial, based purely on source code inspection):
Read handler: (lambda (bv off len)): takes a bytevector (which your handler will put the read data into), an offset (fixnum), and a length (fixnum). You should read in up to len bytes, placing those bytes into bv starting at off. Return the number of bytes actually read in (as a fixnum).
Write handler: (lambda (bv off len)): takes a bytevector (which contains the data to write), an offset (fixnum), and a length (fixnum). Grab up to len bytes from bv, starting at off, and write them out. Return the number of bytes actually written (as a fixnum).
Get position handler: (lambda (pos)) (called in text mode only): Allows you to store some data for pos so that a future call to the set position handler with the same pos value will reset the position back to the current position. Return value ignored.
Set position handler: (lambda (pos)): Move the current position to the value of pos. Return value ignored.
Close handler: (lambda ()): Close the port. Return value ignored.
To answer another question you had, about compiling and running "bytecode":
To compile an expression, use compile. This returns a code object.
There is no publicly-exported approach to run this code object. Internally, the code uses run-vmi, but you can't access this from outside code.
Internally, the only place where compiled code is loaded and used is in its auto-compile-cache system.
Have a look at heap/boot/eval.scm for details. (Again, this is not an official response, but based purely on personal experimentation and source code inspection.)

how to JUDGE other program's result via cpp?

I've got a series of cpp source files and I want to write another program to JUDGE whether they run correctly (give them input and compare their output with the standard, i.e. expected, output). So how do I:
1) call/spawn another program, and give it a file as its standard input
2) limit the time and memory of the child process (maybe a setrlimit thing? are there any examples?)
3) not let the process read/write any file
4) use a file as its standard output
5) compare the output with the expected output.
I think the 2nd and 3rd are the core part of this problem. Is there any way to do this?
P.S. The system is Linux.
To do this right, you probably want to spawn the child program with fork, not system.
This allows you to do a few things. First of all, you can set up some pipes to the parent process so the parent can supply the input to the child, and capture the output from the child to compare to the expected result.
Second, it will let you call seteuid (or one of its close relatives like setreuid) to set the child process to run under a (very) limited user account, to prevent it from writing to files. When fork returns in the child, you'll want to call setrlimit there, before exec, to limit the child's CPU usage; resource limits apply to the calling process, so calling it in the parent would limit the judge itself.
Just to be clear: rather than directing the child's output to a file, then comparing that to the expected output, I'd capture the child's output directly via a pipe to the parent. From there the parent can write the data to a file if desired, but can also compare the output directly to what's expected, without going through a file.
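The fork-based approach might look roughly like the sketch below, assuming Linux. JudgeResult, run_limited and lower_limit are hypothetical names. The child takes its standard input from a file, sends its standard output into a pipe, lowers its own CPU and address-space limits with setrlimit, and then calls execvp. The seteuid step is left out because it requires the judge to run with root privileges.

```cpp
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct JudgeResult {
    int wait_status = -1;  // raw status from waitpid, -1 if nothing was run
    std::string output;    // everything the child wrote to standard output
};

// Lower both soft and hard limit, clamped to the current hard limit.
static bool lower_limit(int resource, rlim_t value) {
    struct rlimit lim;
    if (getrlimit(resource, &lim) != 0) return false;
    if (value > lim.rlim_max) value = lim.rlim_max;
    lim.rlim_cur = value;
    lim.rlim_max = value;
    return setrlimit(resource, &lim) == 0;
}

JudgeResult run_limited(const std::vector<std::string>& args,
                        const std::string& input_path,
                        rlim_t cpu_seconds, rlim_t memory_bytes) {
    JudgeResult result;
    if (args.empty()) return result;

    // Build argv before fork so the child does no allocation.
    std::vector<char*> argv;
    for (const std::string& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    int out_pipe[2];
    if (pipe(out_pipe) != 0) return result;

    const pid_t pid = fork();
    if (pid < 0) {
        close(out_pipe[0]);
        close(out_pipe[1]);
        return result;
    }
    if (pid == 0) {
        // Child: wire up stdin/stdout, apply limits, then exec.
        const int in_fd = open(input_path.c_str(), O_RDONLY);
        if (in_fd < 0) _exit(126);
        if (dup2(in_fd, STDIN_FILENO) < 0) _exit(126);
        if (dup2(out_pipe[1], STDOUT_FILENO) < 0) _exit(126);
        close(in_fd);
        close(out_pipe[0]);
        close(out_pipe[1]);
        if (!lower_limit(RLIMIT_CPU, cpu_seconds)) _exit(125);
        if (!lower_limit(RLIMIT_AS, memory_bytes)) _exit(125);
        execvp(argv[0], argv.data());
        _exit(127);  // only reached if exec failed
    }

    // Parent: collect the child's output until it closes the pipe.
    close(out_pipe[1]);
    char buf[4096];
    ssize_t n;
    while ((n = read(out_pipe[0], buf, sizeof buf)) > 0) {
        result.output.append(buf, static_cast<std::size_t>(n));
    }
    close(out_pipe[0]);
    waitpid(pid, &result.wait_status, 0);
    return result;
}
```

The parent can then inspect wait_status with WIFEXITED and WIFSIGNALED; a child that exceeds its CPU limit is terminated by a signal. RLIMIT_CPU counts CPU time, not wall-clock time, so a child that sleeps forever would still need a separate timeout in the parent.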
std::string command = "/bin/local/app < my_input.txt > my_output_file.txt 2> my_error_file.txt";
int rv = std::system( command.c_str() );
1) The std::system function from the C++ standard library (declared in <cstdlib>, not part of the STL) allows you to execute a program (basically as if invoked from a shell). Note that this approach is inherently insecure, so only use it in a trusted environment.
2) You will need to use threads to be able to achieve this. There are a number of thread libraries available for C++, but I cannot give you a recommendation.
[After edit in OP's post]
3) This one is harder. You either have to write a wrapper that monitors read/write access to files or do some Linux/Unix privilege magic to prevent it from accessing files.
4) You can redirect the output of a program (that it thinks goes to the standard output) by adding > outFile.txt after the way you would normally invoke the program (see 1)) -- e.g. otherapp > out.txt
5) You could run diff on the saved file (from 4)) against the "golden standard"/expected output captured in another file. Or use some other method that better fits your needs (for example, if you don't care about certain formatting as long as the "content" is there). This part is really dependent on your needs. diff does a basic comparing job well.
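As one illustration of a comparison that is more lenient than a plain diff, the sketch below treats two outputs as equal if they match line by line after trailing spaces, tabs and carriage returns are removed and trailing blank lines are dropped. Which differences to tolerate is an assumption here; a real judge would pick its own rules. The names normalized_lines and outputs_match are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split text into lines, strip trailing spaces, tabs and '\r' from each line,
// and drop empty lines at the very end.
std::vector<std::string> normalized_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t last = line.find_last_not_of(" \t\r");
        line.erase(last == std::string::npos ? 0 : last + 1);
        lines.push_back(line);
    }
    while (!lines.empty() && lines.back().empty()) lines.pop_back();
    return lines;
}

// True if the two outputs agree once trailing whitespace is ignored.
bool outputs_match(const std::string& actual, const std::string& expected) {
    return normalized_lines(actual) == normalized_lines(expected);
}
```

Whitespace inside a line is still significant in this version, so "1  2" and "1 2" are reported as different.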