How to use HadoopJarStepConfig.StepProperties? - amazon-web-services

AWS docs state that this property is "A list of Java properties that are set when the job flow step runs. You can use these properties to pass key-value pairs to your main function in the JAR file."
But there is no explanation (at least, I failed to find any) of how exactly they are passed, and how to properly access that collection of key-value pairs on the main-function side.
A quick check showed that they are passed neither via the environment nor via command-line arguments. Could it be some other way?

Okay, it turns out that this map goes into the Java system properties and is accessible on the main-function side via a System.getProperties() call, but there are some non-obvious implications.
The first thing to keep in mind is that internally these properties are set via the environment variable HADOOP_CLIENT_OPTS as -Dkey=value switches. But EMR does not bother to properly escape keys or values according to shell rules.
It also does not report any syntax errors for properties containing non-printable characters; it just silently omits them. It behaves even worse with special shell characters like * ? ( ) \ and such: it will fail the task execution without a proper explanation, and the log records will only vaguely point to obscure syntax errors in some eval() call deep inside EMR's internal shell-script wrappers.
Please be aware of that behaviour.
Properties must be shell-escaped, and in some cases even doubly shell-escaped.
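To illustrate the mechanism described above, here is a minimal, self-contained sketch. The property key "my.step.property" is hypothetical; when launched by EMR the value would arrive via HADOOP_CLIENT_OPTS as a -Dkey=value switch, so here we set it in-process just to make the example runnable:

```java
// Sketch: EMR's StepProperties end up as -Dkey=value JVM switches, so inside
// the step's main() they appear as ordinary Java system properties.
public class Main {
    public static void main(String[] args) {
        // When launched by EMR this would have been set externally, e.g.
        // HADOOP_CLIENT_OPTS="-Dmy.step.property=some-value".
        // Set it in-process here so the example is self-contained.
        System.setProperty("my.step.property", "some-value");

        // Read a single property back (with a default if it was omitted).
        String value = System.getProperty("my.step.property", "<unset>");
        System.out.println("my.step.property=" + value);

        // Or enumerate everything that was passed under a known prefix.
        System.getProperties().forEach((k, v) -> {
            if (k.toString().startsWith("my.step.")) {
                System.out.println(k + " -> " + v);
            }
        });
    }
}
```

Note the default value in getProperty: given the silent-omission behaviour described above, defensive defaults are a good idea.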

Related

Why does GDB not remove convenience variables?

Per the documentation here, gdb states:
Function: gdb.set_convenience_variable (name, value)
[...] If value is None, then the convenience variable is removed.
but when I execute
gdb.set_convenience_variable('foo', 1)
gdb.set_convenience_variable('foo', None)
a show conv in the gdb shell shows $foo = void. The expected behavior is that gdb will remove the variable completely. In a custom command I use uuids as variable names on the gdb-side for holding intermediate expression results (to avoid name clashes) so having these variables stick around is not ideal. I did not see anything about this in the gdb bug tracker and going through the code it does not appear there is a method to actually remove a convenience variable - just set it to void (here).
I concur. The function that creates internal variables shows why this was not implemented right away: it simply prepends to the existing list of internal variables. Removing an element from such a singly-linked list is not completely trivial, since you have to splice the surrounding elements, as anybody who has ever implemented a singly-linked list can testify; but it is not hard either... let's see what the maintainers say. Please consider filing such bug reports yourself if you find problems like this!

How to pass command-line arguments to a Windows application correctly?

Passing command-line arguments to an application in Linux just works fine with the exec* commands where you clearly pass each argument on its own. Doing so on Windows using the same functions is not an option if one wants to control the standard pipes. As those functions are based on CreateProcess() there are some clear rules on how to escape special characters like double quotes.
Sadly, this only works correctly as long as the called application retrieves its command-line arguments via main(), wmain() or CommandLineToArgvW(). However, if the called application gets those arguments via WinMain(), wWinMain(), GetCommandLineA() or GetCommandLineW() it is up to the application how to parse the command-line arguments as it gets the whole command-line rather than argument by argument.
That means a simple application named test using main() as entry point gets "abc" if called as test.exe \"abc\". Calling cmd.exe as cmd.exe /c "echo \"abc\"" will not output "abc" as expected but \"abc\".
This leads to my question: how is it possible to pass command-line arguments to Windows applications in a generic way despite these quirks?
In Windows, you need to think about the command as a whole, not as a list of individual arguments. Applications are not obliged to parse the command into arguments in any particular way, or indeed at all; consider the example of the echo command, which treats the command line as a single string.
This can be a problem for runtime library developers, because it means there is no reliable way to implement a POSIX-like exec function. Some library developers take the straightforward approach, and require the programmer to provide quote marks as necessary, and some attempt to quote the arguments automatically. In the latter case it is essential to provide some method for the programmer to specify the command line as a whole, disabling any automatic quotation, even if that means a Windows-specific extension.
However, in your scenario (as described in the comments) there shouldn't be a problem. All you have to do is make sure you ask the user for a command, not for a list of arguments. Your program simply doesn't need to know how the command will be split up into arguments, if at all; understanding the command's syntax is the user's job. [NB: if you don't think this is true, you need to explain your scenario much more clearly. Provide an example of what the user might enter and how you think your application would need to process it.]
PS: since you mentioned the C library _exec functions, note that they don't work as you might be expecting. The arguments are not passed individually to the child, since that's impossible; in the Microsoft C runtime, if I remember correctly, the arguments are simply combined together into a single string, with a single space as the delimiter, so ("hello there") will be indistinguishable from ("hello", "there").
PPS: note that calling cmd.exe to parse the command introduces an additional (and much more complicated) layer of processing. Generally speaking taking that into account would still be the user's job, but you may want to be aware of it. The escape character for cmd.exe processing is the caret.
It is the C language that makes you need a backslash before a double quote in C code. There is no such rule for shell processing. So if you are writing C code that calls CreateProcess and passes the literal string "abc", then you need backslashes because you are writing in C. But if you are writing a shell script that invokes your app to pass "abc", e.g. the echo example, then you don't use backslashes because there is no C code involved.

Create an executable that calls another executable?

I want to make a small application that runs another application multiple times for different input parameters.
Is this already done?
Is it wrong to use system("myAp param") for each call (of course with a different param value)?
I am using kdevelop on Linux-Ubuntu.
From your comments, I understand that instead of:
system("path/to/just_testing p1 p2");
I shall use:
execl("path/to/just_testing", "path/to/just_testing", "p1", "p2", (char *) 0);
Is that true? Are you saying that execl is safer than system and better to use?
In the non-professional field, using system() is perfectly acceptable, but be warned, people will always tell you that it's "wrong." It's not wrong, it's a way of solving your problem without getting too complicated. It's a bit sloppy, yes, but certainly is still a usable (if a bit less portable) option. The data returned by the system() call will be the return value of the application you're calling. Based on the limited information in your post, I assume that's all you're really wanting to know.
DIFFERENCES BETWEEN SYSTEM AND EXEC
system() will invoke the default command shell, which will execute the command passed as argument.
Your program will stop until the command is executed, then it'll continue.
The value you get back does not reflect the success of the command itself; it indicates whether the command shell could be opened correctly.
A plus of system() is that it's part of the standard library.
With exec(), your process (the calling process) is replaced. Moreover, you cannot directly invoke a script or a shell built-in command. You could follow a commonly used technique (fork, then exec): Differences between fork and exec
So they are quite different (for further details you could see: Difference between "system" and "exec" in Linux?).
A fairer comparison is between POSIX posix_spawn() and system(). posix_spawn() is more complex, but it allows you to read the external command's return code.
SECURITY
system() (or popen()) can be a security risk, since certain environment variables (like $IFS or $PATH) can be modified so that your program executes external programs you never intended it to (i.e. when a command is specified without a path name and the command processor's path-resolution mechanism is accessible to an attacker).
Also the system() function can result in exploitable vulnerabilities:
when passing an unsanitized or improperly sanitized command string originating from a tainted source;
if a relative path to an executable is specified and control over the current working directory is accessible to an attacker;
if the specified executable program can be spoofed by an attacker.
For further details: ENV33-C. Do not call system()
Anyway... I like Somberdon's answer.

A Good Way to Store C++ CLI Arguments? (W/O using libraries)

So, I'm writing a CLI application in C++ which will accept a bunch of arguments.
The syntax is pretty typical, -tag arg1 arg2 -tag2 arg1 ...
Right now, I take the char** argv and parse the arguments into an
std::map<std::string, std::list<std::string>>
The key is the tag, and the list holds each token behind that tag but before the next one. I don't want to store my args as just std::strings, though; I need to make things more interactive.
By interactive, I mean when a user types './myprog -help' a list of all available commands comes up with descriptions.
Currently, my class to facilitate this is:
class Argument
{
public:
    Argument(std::string flag, std::string desc);
    std::string getFlag();
    std::string getDesc();
    std::list<std::string> getArgs();
    void setArgs(std::list<std::string> args);
    virtual bool validSyntax() = 0;
    virtual std::string getSyntaxErrorDesc() = 0;
};
The std::map structure lives in a class ProgramCommands, which handles these Arguments.
Now that the problem description is over, my 4 questions are:
How do I give the rest of the program access to the data in ProgramCommands?
I don't want to make a singleton, at all; and I'd prefer not to have to pass ProgramCommands as an argument to almost every function in the program.
Do you have a better idea for storing these than the way I'm doing it?
How best can I add arguments to the program without hardcoding them into ProgramCommands or main?
std::string only allows for one-line descriptions; does anyone have an elegant solution to this besides using a list of strings or boost?
EDIT
I don't really want to use libraries because this is a school project (sniffing & interpreting packets). I could, if I wanted to, but I'd rather not.
Your choices on storing the command line arguments are either: Make them a global or pass them around to the functions that need them. Which way is best depends on the sorts of options you have.
If MANY places in your program need the options (for instance a 'verbose' option), then I'd just make the structure a global and get on with my life. It doesn't need to be a singleton (you'll still only have one of them, but that's OK).
If you only need the options at startup time (i.e. # of threads to start initially or port # to connect on), then you can keep the parsing local to 'main' and just pass the parameters needed to the appropriate functions.
I tend to just parse options with the venerable getopt library (yes, that's a leftover from C - and it works just fine) and I stuff the option info (flags, values) into a global structure or a series of global variables. I give usage instructions by having a function 'print_usage' that just prints out the basic usage info as a single block of text. I find it works, it's quick, it's simple, and it gets the job done.
I don't understand your objection to using a singleton - this is the sort of thing they are made for. If you want the arguments accessible to every object, but don't want to pass them as arguments or use a singleton, there are only a couple of tricks I can think of:
-put the parsed arguments into shared memory and then read them from every function that needs them
-write the parsed arguments out to a binary file and then read them from every function that needs them
-global variables
None of these solutions are nearly as elegant as a singleton, are MUCH more labor intensive and are well... sort of silly compared to a singleton... why hamstring yourself?

Custom front end and back end with Pantheios logging

Apologies if I'm missing something really obvious, but I'm trying to understand how to write a custom front end and back end with Pantheios. (I'm using it from C++, not C.)
I can follow the purposes of the initialisation functions (I think) but I'm unsure about the others: pantheios_be_logEntry, pantheios_fe_getProcessIdentity and pantheios_fe_isSeverityLogged.
In particular, I'm confused about the relationship between a front end and a back end. How do I make them communicate with each other?
Not sure I understand exactly what you don't understand, but maybe that's part of the problem. ;-) So I'll try my best and you let me know whether it's near or not.
pantheios_fe_getProcessIdentity() is called once, when Pantheios is initializing. You need to return a string that identifies the process. (Actually, it identifies the link-unit; a term defined in Imperfect C++, written by Pantheios' creator, Matthew Wilson, which means the scope of link names, i.e. an executable program module or a dynamic library module.)
pantheios_fe_isSeverityLogged() is called whenever a log statement is executed in application code. It returns non-zero to indicate that the statement should be processed and sent to the output (via the back-end). If it returns zero, no processing occurs. FWIU, this is the main reason why Pantheios is so fast.
pantheios_be_logEntry() is called whenever a log statement is to be sent for output, i.e. when pantheios_fe_isSeverityLogged() has returned non-zero and the Pantheios core has processed the statement (forming all the arguments in your code into a single string). It sends the statement string to wherever it should go. For example, the be.fprintf back-end prints it to the console using fprintf().
Once you grok these aspects, the second part of your question is where it gets interesting. When your front-end and back-end are initialized they get to create some context (e.g. a C++ object) that the Pantheios core holds for them, and gives them back each time it calls a front/back end API function. When you're customizing both, you can have them communicate via some shared context that they both know about, but which the Pantheios core does not (and should not) know about, beyond having an opaque handle (void*) to it.
HTH