Hadoop MapReduce job is not running

I have created a basic MapReduce program and created a jar file out of it. When I try to run it from the console like:
[cloudera@localhost ~]$ hadoop jar /home/cloudera/Desktop/csvjar.jar testpackage.Mapreduce /import/climate /output5
nothing happens: no error, no MapReduce status. It just displays the prompt again:
[cloudera@localhost ~]$
Mapreduce is the class where the map, reduce, and main functions reside. The jar file is kept on the local machine and also on HDFS; I have tried both paths, and nothing happened in either case. The output5 folder does not exist in HDFS.

I also ran into the same issue. In my code, I had missed the closing brace in the argument-checking section of the driver code. I am attaching that part of the code, with the "}" included, for reference.
if (otherArgs.length != 3) {
    System.err.println("Number of argument passed is not 3");
    System.exit(1);
}
I hope this helps.

Related

AWS Lambda download a file using Chromedriver

I have a container that is built to run selenium-chromedriver with Python to download an Excel (.xlsx) file from a website.
I am using SAM to build and deploy this image to run in AWS Lambda.
When I build the container and invoke it locally, the program executes as expected: the download occurs and I can see the file placed in the root directory of the container.
The problem is that when I deploy this image to AWS and invoke my Lambda function, I get no errors; however, my download is never executed. The file never appears in my root directory.
My first thought was that maybe I hadn't allocated enough memory to the Lambda instance. I gave it 512 MB, and the logs said it was using 416 MB; maybe there wasn't enough room to fit another file inside? So I increased the memory to 1024 MB, but still no luck.
My next thought was that maybe the download was just taking a long time, so I also allowed the program to wait for 5 minutes after clicking the download, to ensure the download is given time to complete. Still no luck.
I have also tried setting the following options for chromedriver (full list of chromedriver options posted at bottom):
options.add_argument(f"--user-data-dir={'/tmp'}"),
options.add_argument(f"--data-path={'/tmp'}"),
options.add_argument(f"--disk-cache-dir={'/tmp'}")
and also setting tempfolder = mkdtemp() and passing that into the chrome options as above in place of /tmp. Still no luck.
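For reference, a minimal sketch of that mkdtemp() variant (option names as in the full list below; this is just an illustration of the attempt, not a fix):

from tempfile import mkdtemp
from selenium import webdriver

options = webdriver.ChromeOptions()
# mkdtemp() creates a unique writable directory; in Lambda it lands under /tmp,
# the only writable path in the function's filesystem
tempfolder = mkdtemp()
options.add_argument(f"--user-data-dir={tempfolder}")
options.add_argument(f"--data-path={tempfolder}")
options.add_argument(f"--disk-cache-dir={tempfolder}")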
Since this application is in a container, it should run the same locally as it does on AWS. So I am wondering if it is part of the configuration outside the container that is blocking my ability to download a file? Maybe the request is going out but the response is not being allowed back in?
Please let me know if there is anything I need to clarify -- Any help on this issue is greatly appreciated!
Full list of Chromedriver options
options.binary_location = '/opt/chrome/chrome'
options.headless = True
options.add_argument('--disable-extensions')
options.add_argument('--no-first-run')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-client-side-phishing-detection')
options.add_argument('--allow-running-insecure-content')
options.add_argument('--disable-web-security')
options.add_argument('--lang=' + random.choice(language_list))
options.add_argument('--user-agent=' + fake_user_agent.user_agent())
options.add_argument('--no-sandbox')
options.add_argument("--window-size=1920x1080")
options.add_argument("--single-process")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-dev-tools")
options.add_argument("--no-zygote")
options.add_argument(f"--user-data-dir={'/tmp'}")
options.add_argument(f"--data-path={'/tmp'}")
options.add_argument(f"--disk-cache-dir={'/tmp'}")
options.add_argument("--remote-debugging-port=9222")
options.add_argument("start-maximized")
options.add_argument("enable-automation")
options.add_argument("--headless")
options.add_argument("--disable-browser-side-navigation")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome("/opt/chromedriver", options=options)
Just in case anybody stumbles across this question in the future: adding the following to the Chrome options solved my issue:
prefs = {
    "profile.default_content_settings.popups": 0,
    "download.default_directory": r"/tmp",
    "directory_upgrade": True
}
options.add_experimental_option("prefs", prefs)
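To sketch how that slots into the setup above (a minimal, hedged example; the key points are that the prefs must be registered before the driver is created, and that /tmp is the only writable download target inside Lambda):

from selenium import webdriver

options = webdriver.ChromeOptions()
# ... the arguments from the list above ...
prefs = {
    "profile.default_content_settings.popups": 0,
    "download.default_directory": r"/tmp",  # the only writable path in Lambda
    "directory_upgrade": True
}
options.add_experimental_option("prefs", prefs)  # must precede driver creation
driver = webdriver.Chrome("/opt/chromedriver", options=options)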

AWS Lambda Extension throws exit status 127 (/usr/bin/env: node : No such file or directory)

I am creating a Lambda extension to get secret values from Secrets Manager, using the following as a template:
https://github.com/hariohmprasath/aws-lambda-extensions
I have zipped the files into the following structure:
extension.zip
--> extensions
    --> secret-extension
--> secret-extension
    --> node_modules
    --> extensions-api.js
    --> index.js
    --> package.json
    --> package-lock.json
    --> secrets.js
Error:
{
    "errorMessage": "RequestId: e5c06575-cf7d-46c0-b168-624e8e9cf572 Error: exit status 127",
    "errorType": "Extension.Crash"
}
The error is /usr/bin/env: node : No such file or directory.
At the top of the index.js file is the line #!/usr/bin/env node (so that the file is interpreted with Node).
The runtime environment is Node.js 12, and I have tried 14 as well (the extension documentation says a Node 12 runtime is required).
What could be causing this issue?
The Lambda runtime is a Node runtime, so node should be installed.
I have run ls on the folder, and /usr/bin/env exists.
I know node exists within the runtime, as node -v returns v14.20.0 or v12.22.11.
I am on a Windows machine creating the extension (I don't think the deployment could be causing this, because it was written on a Windows machine).
Any help would be appreciated.
So I found out it has to do with a custom environment they are using for the example provided by AWS. Instead, I went the route of using a runtime-independent solution, which has worked as expected.
Documentation
I suspect the issue you may have been encountering is the same as mine, and that issue was:
The #!/usr/bin/env node line had the whitespace characters \r\n at the end, which obviously cannot be seen unless your editor displays them; this is how Windows handles new lines (*nix systems use just \n). When the Lambda reads the line, it tries to interpret it as #!/usr/bin/env node\r, which obviously doesn't exist, so it can't run the file via node.
The problem with the logs is that they won't render the \r literally; one of two things can happen, depending on where you view them:
They interpret \r as a newline character and just print whitespace, which is not obvious in the log message; or the other situation (which is what happened to me):
They show just : No such file or directory, because the \r is interpreted as a carriage return, which takes the cursor to the beginning of the line and overwrites the text as the new characters print.
I am pretty confident this is your issue. I will admit I didn't solve this 100% on my own: a person on my team had a similar issue with whitespace characters, and only after a lot of head-banging did I think of it; I confirmed the issue using hexdump -C.
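For anyone checking for the same problem, here is a minimal sketch (the file path is an assumption based on the zip layout above) that detects the stray carriage return and strips it:

# read the extension's entrypoint as raw bytes so any \r is visible
with open("extensions/secret-extension", "rb") as f:
    data = f.read()

# b'#!/usr/bin/env node\r\n' in the output betrays the Windows line ending
print(repr(data.splitlines(keepends=True)[0]))

# normalize to Unix line endings and write the file back
with open("extensions/secret-extension", "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))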

C++ execute temp file as bash script

I have a program that needs to run a program we'll call externalProg in parallel on our Linux (CentOS) cluster, or rather, it needs to run many instances of externalProg, each on different cores. Each "thread" creates 3 files based on a few parameters: the inputs to externalProg, a command file to tell externalProg how to execute my file, and a bash script to set up the environment (it calls a setup script provided by the manufacturer) and actually call externalProg with my inputs.
Since this needs to be parallel with an unknown number of concurrent threads and I don't want to risk overwriting another thread's files, I am creating temp files using
mkstemp("PREFIX_XXXXXX")
for these input files. After the external program runs, I extract the relevant data and store it, then close the temp files (thereby deleting them).
We'll call the files created (Which actually have a name based on the template above)
tmpInputs - Inputs to externalProg
tmpCommand - Input that tells externalProg how to execute tmpInputs
tmpBash - bash script to set up and call externalProg with my inputs
The file tmpBash looks something like
source /path/to/setup/script # Sets up environment variables
externalProg < /path/to/tmpCommand
where tmpCommand is just a simple text file.
The problem I'm having is actually executing the bash script. Within my program, I call
ostringstream launchcmd;
launchcmd << "bash " << path_to_tmpBash;
system(launchcmd.str().c_str());
But nothing happens. No error, no warning, no 'file not found' or 'permission denied' or anything. I have verified that the files are being created and have the correct content. The rest of the code after system() executes successfully (though it fails, since externalProg wasn't run).
Strangely, if I go back to the terminal and type
bash /path/to/tmpBash
then externalProg is executed successfully. I have also cout'd the launchcmd string and copied and pasted it into the terminal, which also works. For some reason, this only fails when called from within my program.
After a bit of experimentation, I've determined that system() calls /bin/sh on our cluster. If I change launchcmd to look like
/path/to/tmpBash
(so that the full command is /bin/sh /path/to/tmpBash), I get a permission-denied error, which is no surprise. The problem is that I can't chmod +x the tmpBash file while it's still open, and if I close the file, it gets deleted, so I'm not sure how to address that.
Is there something obviously wrong I'm doing, or does system() have some nuance that I'm missing?
Edit: I wanted to add that I can successfully call things like
system("echo $PATH")
and get the expected results (in this case, my default $PATH).
Two separate ideas:
Change your SHELL environment variable to be /bin/bash, then call system(),
or:
Use execve() directly; note that execve() replaces the calling process, so fork() first, and environ is the global declared as extern char **environ:
char *const argv[] = { (char *)"/bin/bash", (char *)"/path/to/tmpBash", NULL };
execve("/bin/bash", argv, environ);

ZMQ ventilator/worker/sink paradigm not working w/ subprocess

I am trying to replicate the ventilator/workers/sink paradigm described in the ZMQ guide. I have the same Python ventilator, the same C++ worker, and the same Python sink as described in the ZMQ examples. I want to launch the ventilator, workers, and sink from one main Python script, so I created "class" wrappers around the ventilator and sink, and both of those classes subclass the Python module multiprocessing.Process. Since the C++ worker is a binary, I launch it with Python's subprocess.Popen call.
The order of starting all of this up is as follows:
h = subprocess.Popen('test') # test is the name of the binary
time.sleep(1)
s = sinkObj.start()
time.sleep(1)
v = ventObj.start()
What I am finding is that no data gets through the system when I start up the components like this. However, if I start the C++ binary in its own shell, and only start the sinkObj and ventObj from the main Python script, it works fine.
I apologize in advance if this is more of a Python question than a ZMQ question, but I haven't run into issues like this with Python's subprocess before. I have also tried using os.system() instead of subprocess, but I get the same issue. I put all the code on this site: https://github.com/kkarrancsu/zmqtest if anybody is curious to test it out. There is a readme in that repo which explains what the files are.
Any ideas on why this could be happening?
------------------------- UPDATE --------------------
I found that if I create a shell script which simply launches the C++ binary, and call that shell script with os.system('run_the_shell_script'), it works! So this means there is something wrong with the way I am using subprocess.Popen(...), but I can't seem to pinpoint what the issue is. I tried with the shell=True flag, but it still hangs with that...
It's the name of the worker binary file that causes the problem.
There are two solutions:
Change the name of the binary file test to test_new, do the same in your All.py file, and then it will work as you desire.
Substitute subprocess.Popen('./test', shell=True) for subprocess.Popen('test', shell=True).
test is a Linux command. If you type the following in your shell:
$ echo $PATH
you may see that . is at the last position. That means the shell searches the directories listed in $PATH first, and only if it cannot find the binary there does it try the current directory .
When you execute subprocess.Popen('test', shell=True), the shell finds the test command before trying the . directory, and so it won't execute the worker.
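Put concretely, a minimal sketch of the corrected launch (binary name as in the linked repo):

import subprocess

# './' forces the shell to run the local worker binary instead of the
# 'test' command found earlier in $PATH
h = subprocess.Popen('./test', shell=True)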
As I see it, the ventilator and sink bind() to ports 6557 and 6558, and the C++ app connect()s to these ports. In this case, if you start the C++ app first, it will try to connect() to the endpoints, but as nothing is bound there, it will drop silently.
In ZeroMQ the basic principle is "first bind, then connect": you should not connect() before you bind() something on the socket. Imagine bind() is the server and connect() is the client; you cannot connect a client to a non-existent server. Also, in ZeroMQ every socket can be the server, but you should have only one bind()-ing socket per URL, while you can have multiple connect()s.
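To illustrate that ordering, here is a minimal pyzmq sketch (port number taken from the question; socket names are illustrative):

import zmq

ctx = zmq.Context()

# the sink is the stable endpoint, so it binds first
sink = ctx.socket(zmq.PULL)
sink.bind("tcp://*:6558")

# a worker connects only after the bind side exists
worker_out = ctx.socket(zmq.PUSH)
worker_out.connect("tcp://localhost:6558")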

Python not calling an external program part 3

I have been having problems trying to run an external program from a Python function fired by a trigger in a PostgreSQL 9.2 database. The trigger works: it writes to a file. I had tried just running the external program, but the permissions would not allow it to run. I was able to create a folder (using os.system("mkdir")). The owner of the folder is NETWORK SERVICE.
I need to run a program called sdktest. When I try to run it, no response happens, so I think that means the Python program does not have enough permissions (with an owner of NETWORK SERVICE) to run it.
I have been having my program copy the files it needs into that directory so they would have the correct permissions, and that has worked to some degree, but the program I need to run is the last one, and it is not running because it does not have enough permissions.
My Python program runs a C++ program called PG_QB_Connector, which calls sdktest.
Is there any way I can change the owner of the process to a "normal" owner? Is there a better way to do this? Basically I just need this C++ program to have enough permissions to run correctly.
BTW, when I run the C++ program by hand, the line that runs the sdktest program runs correctly; however, when I run it from the postgres/python it does not do anything...
I have Windows 7 and Python 3.2. The other two questions I asked about this are located here and here.
The Python program:
CREATE or replace FUNCTION scalesmyone (thename text)
RETURNS int
AS $$
a=5
f = open('C:\\JUNK\\frompython.txt','w')
f.write(thename)
f.close()
import os
os.system('"mkdir C:\\TEMPWITHOWNER"')
os.system('"mkdir C:\\TEMPWITHOWNER\\addcustomer"')
os.system('"copy C:\\JUNK\\junk.txt C:\\TEMPWITHOWNER\\addcustomer"')
os.system('"copy C:\\BATfiles\\junk6.txt C:\\TEMPWITHOWNER\\addcustomer"')
os.system('"copy C:\\BATfiles\\run_addcust.bat C:\\TEMPWITHOWNER\\addcustomer"')
os.system('"copy C:\\Workfiles\\PG_QB_Connector.exe C:\\TEMPWITHOWNER\\addcustomer"')
os.system('"copy C:\\Workfiles\\sdktest.exe C:\\TEMPWITHOWNER\\addcustomer"')
import subprocess
return_code = subprocess.call(["C:\\TEMPWITHOWNER\\addcustomer\\PG_QB_Connector.exe", '"hello"'])
$$ LANGUAGE plpython3u;
The C++ program that is called from the Python program and calls sdktest.exe is below:
command = "copy C:\\Workfiles\\AddCustomerFROMWEB.xml C:\\TEMPWITHOWNER\\addcustomer\\AddCustomerFROMWEB.xml";
system(command.c_str());
//everything except for the qb file is in my local folder
command = "C:\\TEMPWITHOWNER\\addcustomer\\sdktest.exe \"C:\\Users\\Public\\Documents\\Intuit\\QuickBooks\\Company Files\\Shain Software.qbw\" C:\\TEMPWITHOWNER\\addcustomer\\AddCustomerFROMWEB.xml C:\\TEMPWITHOWNER\\addcustomer\\outputfromsdktestofaddcust.xml";
system(command.c_str());
It sounds like you want to invoke a command-line program from within a PostgreSQL trigger or function.
A usually-better alternative is to have the trigger send a NOTIFY and have a process with a PostgreSQL connection LISTENing for notifications. When a notification comes in, the process can start your program. This is the approach I would recommend; it's a lot cleaner and it means your program doesn't have to run under PostgreSQL's user ID. See NOTIFY and LISTEN.
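As a rough illustration of that approach, here is a sketch of a LISTENing worker process, assuming psycopg2; the channel name run_sdktest and the connection string are illustrative, not from the original answer:

import select
import subprocess

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN run_sdktest;")  # the trigger would issue NOTIFY run_sdktest

while True:
    # block until the connection has something to read, waking every 60 s
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        # runs under this process's own user, not PostgreSQL's service account
        subprocess.check_call([r"C:\Workfiles\PG_QB_Connector.exe", notify.payload])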
If you really need to run commands from inside Pg:
You can use PL/Pythonu with os.system or subprocess.check_call, PL/Perlu with system(), etc. All of these can run commands from inside Pg if you need to. You can't invoke programs directly from PostgreSQL; you need to use one of the 'untrusted' (meaning fully privileged, not sandboxed) procedural languages to invoke external executables. PL/Tcl can probably do it too.
Update:
Your Python code as shown above has several problems:
Using os.system in Python to copy files is just wrong. Use the shutil library (http://docs.python.org/3/library/shutil.html) to copy files, and the simple os.mkdir function to create directories.
The double-layered quoting looks wrong; didn't you mean to quote each argument rather than the whole command? You should be using subprocess.call instead of os.system anyway.
Your final subprocess.call invocation appears OK, but it fails to check the error code, so you'll never know if the command went wrong; use subprocess.check_call instead (a cleaned-up sketch follows this list).
The C++ code also fails to check for errors from the system() invocations, so you'll never know if the commands it runs fail.
Like the Python code, copying files in C++ by shelling out to the copy command is generally wrong. Microsoft Windows provides the CopyFile function for this; equivalents or alternatives exist on other platforms, and you can use portable-but-less-efficient stream copying too.
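For illustration, a minimal sketch of the Python side with those fixes applied (paths taken from the question; this is a hedged sketch of the idea, not a drop-in replacement for the PL/Python function):

import os
import shutil
import subprocess

# os.makedirs replaces the os.system('"mkdir ..."') calls
os.makedirs(r"C:\TEMPWITHOWNER\addcustomer", exist_ok=True)

# shutil.copy replaces the os.system('"copy ..."') calls
for src in (r"C:\JUNK\junk.txt",
            r"C:\BATfiles\junk6.txt",
            r"C:\BATfiles\run_addcust.bat",
            r"C:\Workfiles\PG_QB_Connector.exe",
            r"C:\Workfiles\sdktest.exe"):
    shutil.copy(src, r"C:\TEMPWITHOWNER\addcustomer")

# check_call raises CalledProcessError if the program exits non-zero
subprocess.check_call([r"C:\TEMPWITHOWNER\addcustomer\PG_QB_Connector.exe", "hello"])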