Failed to use MapReduce in Python - python-2.7

I am trying to learn MapReduce programming using the Python mrjob library, and I am getting the following error:
Traceback:
dumping stdin to local file /tmp/pyes_mrjob.testuser.20131004.103251.998597/STDIN
Making directory hdfs:///user/testuser/tmp/mrjob/pyes_mrjob.user.20131004.103251.998597/files/ on HDFS
> /usr/lib/hadoop-mapreduce/bin/hadoop fs -mkdir hdfs:///user/testuser/tmp/mrjob/pyes_mrjob.testuser.20131004.103251.998597/files/
Traceback (most recent call last):
File "pyes_mrjob.py", line 34, in <module>
MRWordFrequencyCount.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 500, in run
mr_job.execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 518, in execute
super(MRJob, self).execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 146, in execute
self.run_job()
File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 207, in run_job
runner.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
self._run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 236, in _run
self._upload_local_files_to_hdfs()
File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 263, in _upload_local_files_to_hdfs
self._mkdir_on_hdfs(self._upload_mgr.prefix)
File "/usr/local/lib/python2.7/dist-packages/mrjob/hadoop.py", line 271, in _mkdir_on_hdfs
self.invoke_hadoop(['fs', '-mkdir', path])
File "/usr/local/lib/python2.7/dist-packages/mrjob/fs/hadoop.py", line 81, in invoke_hadoop
proc = Popen(args, stdout=PIPE, stderr=PIPE)
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
I executed the command manually and it works fine there, but when I try to run it from my program it does not work.
Since I have just started learning, can someone suggest which library I should choose? According to some blogs, some libraries have good documentation, some have better performance, and so on. I came across the post below, which looks dated:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
But many of these libraries have received updates recently, so can someone suggest a library I can start with?

I guess this problem is caused by the way mrjob calls "hadoop fs -mkdir": if the parent directory of the directory you want to make does not exist, -mkdir will fail. That means you have to use "hadoop fs -mkdir -p [path]". Ultimately, you will need to modify the mrjob library manually in [path of mrjob install] (mine is /usr/lib/python2.6/site-packages/mrjob)/hadoop.py at line 271, changing:
self.invoke_hadoop(['fs', '-mkdir', path])
to
self.invoke_hadoop(['fs', '-mkdir', '-p', path])
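If you would rather not patch the library, one workaround (just a sketch, reusing the paths from the log above) is to pre-create the directory yourself with -p before launching the job, so mrjob's plain -mkdir finds the parents already in place:
# Hypothetical workaround: create the HDFS directory (and its parents)
# up front, so mrjob's plain "fs -mkdir" call no longer has to.
from subprocess import check_call

hdfs_tmp = 'hdfs:///user/testuser/tmp/mrjob/'  # path taken from the log above
check_call(['/usr/lib/hadoop-mapreduce/bin/hadoop',
            'fs', '-mkdir', '-p', hdfs_tmp])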
Good Luck!

It looks like you set your HADOOP_HOME to "/usr/lib/hadoop-mapreduce". However, this is wrong; it should be set to "/usr/lib/hadoop".
Also, if you get an error saying that hadoop-streaming.jar could not be found, create a symlink to this jar in "/usr/lib/hadoop" as follows:
sudo ln -s /usr/lib/hadoop-mapreduce/hadoop-streaming.jar /usr/lib/hadoop
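Since OSError: [Errno 2] raised from Popen means the executable itself could not be found, a quick sanity check (just a sketch, not part of mrjob) is to confirm the hadoop binary actually exists where mrjob will look for it:
# Check that $HADOOP_HOME/bin/hadoop exists; OSError [Errno 2] from
# Popen means this executable could not be found at launch time.
import os

hadoop_home = os.environ.get('HADOOP_HOME', '/usr/lib/hadoop')
hadoop_bin = os.path.join(hadoop_home, 'bin', 'hadoop')
print hadoop_bin, '->', 'found' if os.path.exists(hadoop_bin) else 'MISSING'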

Related

conda build failing with need_source_download message

I successfully built a package on the same Ubuntu desktop 2 months ago and am running into an error building the next version of the same package. I've updated the recipe and made sure conda itself was up-to-date before running the build as usual:
(base) pmena@pmena-7080=> cd anaconda_build/
(base) pmena@pmena-7080=> conda build mi-instrument
No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
WARNING:conda_build.metadata:No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
Adding in variants from internal_defaults
INFO:conda_build.variants:Adding in variants from internal_defaults
Attempting to finalize metadata for mi-instrument
INFO:conda_build.metadata:Attempting to finalize metadata for mi-instrument
Traceback (most recent call last):
File "/home/pmena/miniconda2/bin/conda-build", line 11, in <module>
sys.exit(main())
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/cli/main_build.py", line 474, in main
execute(sys.argv[1:])
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/cli/main_build.py", line 465, in execute
verify=args.verify, variants=args.variants)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/api.py", line 209, in build
notest=notest, need_source_download=need_source_download, variants=variants)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/build.py", line 2863, in build_tree
notest=notest,
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/build.py", line 1837, in build
output_metas = expand_outputs([(m, need_source_download, need_reparse_in_env)])
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/render.py", line 757, in expand_outputs
for (output_dict, m) in _m.copy().get_output_metadata_set(permit_unsatisfiable_variants=False):
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/metadata.py", line 2054, in get_output_metadata_set
bypass_env_check=bypass_env_check)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/metadata.py", line 727, in finalize_outputs_pass
permit_unsatisfiable_variants=permit_unsatisfiable_variants)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/render.py", line 538, in finalize_metadata
exclude_pattern)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/render.py", line 390, in add_upstream_pins
permit_unsatisfiable_variants, exclude_pattern)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/render.py", line 378, in _read_upstream_pin_files
permit_unsatisfiable_variants=permit_unsatisfiable_variants)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/render.py", line 154, in get_env_dependencies
channel_urls=tuple(m.config.channel_urls))
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/environ.py", line 749, in get_install_actions
locking=locking, timeout=timeout)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/index.py", line 172, in get_build_index
update_index(output_folder, verbose=debug)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/index.py", line 273, in update_index
current_index_versions=current_index_versions)
File "/home/pmena/miniconda2/lib/python2.7/site-packages/conda_build/index.py", line 776, in index
with tqdm(total=len(subdirs), disable=(verbose or not progress), leave=False) as t:
AttributeError: __exit__
This has always been a relatively straightforward process, so I'm hoping that it's just a simple oversight. Thanks in advance!
I ended up installing a slightly older version of miniconda2, which I was able to find in the anaconda archives. The build completed successfully after that.
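For context, AttributeError: __exit__ means the object handed to the with statement (here, the tqdm progress bar) does not implement the context-manager protocol, which suggests an incompatible tqdm version in that environment. A minimal reproduction of the same error:
# In Python 2.7, entering "with" on an object that lacks __exit__
# raises AttributeError: __exit__ before the block even runs.
class NotAContextManager(object):
    pass

with NotAContextManager() as t:  # AttributeError: __exit__
    pass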

Trying to use cookiecutter-django, getting errors and it does not create anything

I am trying to get a Django project started using cookiecutter-django, but I can't seem to get it to generate anything.
I am using Python 3.6, Django 2.0.5, and cookiecutter 1.6.0 (I then created a virtualenv and entered a new, blank directory).
So I enter this command:
cookiecutter https://github.com/pydanny/cookiecutter-django
and get this error traceback:
Traceback (most recent call last):
File "c:\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\Python\python36\Scripts\cookiecutter.exe\__main__.py", line 9, in <module>
File "c:\python\python36\lib\site-packages\click\core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "c:\python\python36\lib\site-packages\click\core.py", line 697, in main
rv = self.invoke(ctx)
File "c:\python\python36\lib\site-packages\click\core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\python\python36\lib\site-packages\click\core.py", line 535, in invoke
return callback(*args, **kwargs)
File "c:\python\python36\lib\site-packages\cookiecutter\cli.py", line 120, in main
password=os.environ.get('COOKIECUTTER_REPO_PASSWORD')
File "c:\python\python36\lib\site-packages\cookiecutter\main.py", line 63, in cookiecutter
password=password
File "c:\python\python36\lib\site-packages\cookiecutter\repository.py", line 103, in determine_repo_dir
no_input=no_input,
File "c:\python\python36\lib\site-packages\cookiecutter\vcs.py", line 99, in clone
stderr=subprocess.STDOUT,
File "c:\python\python36\lib\subprocess.py", line 336, in check_output
**kwargs).stdout
File "c:\python\python36\lib\subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['git', 'clone', 'https://github.com/pydanny/cookiecutter-django']' returned non-zero exit status 128.
OK, I figured out how to get this to work:
Using GitHub Desktop, right-click the cookiecutter-django repository and choose to open it in Git Shell; this opens a PowerShell window.
cd to the directory the project will be placed in, then run:
cookiecutter https://github.com/pydanny/cookiecutter-django
and it works.
I am not sure exactly why this works when regular CMD and elevated CMD do not, but this was the only way I could get it to work.
This is a permissions issue with GitHub, caused by the need to set up SSH keys. (By the way, I'm using Ubuntu 12.)
https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/ - first create a key on your machine using the instructions in this link. Once you have your SSH key, proceed to step 2. (Step 2 is indicated in the first link as the last step.)
https://help.github.com/articles/adding-a-new-ssh-key-to-your-github-account - add the generated SSH key to your GitHub account.
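To see the underlying git error that cookiecutter folds into "exit status 128", you can run the same clone directly (a sketch, using the URL from the question); git's own stderr will name the actual problem (authentication, proxy, missing git, and so on):
# Re-run the exact command from the CalledProcessError so git can
# print its real error message instead of just an exit status.
import subprocess

subprocess.call(['git', 'clone',
                 'https://github.com/pydanny/cookiecutter-django'])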

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 232, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 228, in main
run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 129, in run_training
data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 212, in read_data_sets
with open(local_file, 'rb') as f:
IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is that, as far as I can tell, the file exists at that URL.
This error usually indicates you are using a multi-region GCS bucket for your output. To avoid this error, use a regional GCS bucket: regional buckets provide the stronger consistency guarantees needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML, please refer to the Cloud ML docs.
Normal Python file I/O does not know how to deal with GCS gs:// paths correctly. You need TensorFlow's file_io module:
from tensorflow.python.lib.io import file_io  # ships with TensorFlow

first_data_file = args.train_files[0]  # e.g. a gs://... path
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
model.run_experiment(file_stream)
But ironically, you can also copy files from the gs:// bucket to your local working directory, which your program CAN then actually see:
# presentation_mplstyle_path holds a gs://... source path
with file_io.FileIO(presentation_mplstyle_path, mode='r') as input_f:
    with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
        output_f.write(input_f.read())
mpl.pyplot.style.use(['./presentation.mplstyle'])
And finally, moving a file from your local directory back to a gs:// bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
    with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
        output_f.write(input_f.read())
Should be easier IMO.
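If you end up doing this copy in several places, a small helper keeps it readable (a hypothetical wrapper around the pattern above, not a library function):
def copy_via_file_io(src, dst):
    # file_io transparently handles both local paths and gs:// URLs.
    with file_io.FileIO(src, mode='r') as input_f:
        with file_io.FileIO(dst, mode='w+') as output_f:
            output_f.write(input_f.read())

copy_via_file_io(report_name, job_dir + '/' + report_name)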

py2app runs fine in alias mode but not when bundled (tkinter)

I am trying to bundle an app on my Mac OS X Snow Leopard 10.6.8 machine. I have python27 and py27-tkinter installed through MacPorts 2.3.4.
My code runs perfectly as a script as well as in py2app's alias mode. However, the bundled app does not behave properly: on every conditional check of the user's input integer, the result should be updated in the window, yet the result is updated only for the first input integer and nothing gets updated for subsequent inputs. When I try to run the build interactively, I get the following error:
Traceback (most recent call last):
File "setup.py", line 18, in
setup_requires=['py2app'],
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/core.py", line 151, in setup
dist.run_commands()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app/build_app.py", line 659, in run
self._run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app/build_app.py", line 865, in _run
self.run_normal()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app/build_app.py", line 939, in run_normal
mf = self.get_modulefinder()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/py2app/build_app.py", line 814, in get_modulefinder
debug=debug,
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/modulegraph/find_modules.py", line 341, in find_modules
find_needed_modules(mf, scripts, includes, packages)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/modulegraph/find_modules.py", line 266, in find_needed_modules
path = m.packagepath[0]
TypeError: 'NoneType' object has no attribute '__getitem__'
>>>
Could anyone shed some light on what is missing/wrong here and how to troubleshoot and ultimately solve the problem?
Many thanks in advance.

Portia (Scrapy/Slybot) errors on Windows

I installed Portia and got it to work; I annotated some websites (it looks really good), but when I try to run the spiders I get some errors and nothing gets crawled.
I'm running Python 2.7.6 on Windows 7.
C:\Python27\Scripts>python portiacrawl C:\portia\slyd\data\projects\new_project
Traceback (most recent call last):
File "portiacrawl", line 7, in <module>
execfile(__file__)
File "C:\portia\slybot\bin\portiacrawl", line 56, in <module>
main()
File "C:\portia\slybot\bin\portiacrawl", line 54, in main
subprocess.call(command_spec)
File "C:\Python27\lib\subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
File "C:\Python27\lib\subprocess.py", line 709, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 957, in _execute_child
startupinfo)
WindowsError: [Error 2] O sistema não conseguiu localizar o ficheiro especificado (the system could not find the specified file)
I was troubleshooting Portia on Windows 8.1 and encountered exactly the same error.
Try running 'python portiacrawl' by itself to determine whether there is a subsequent menu; you should be able to see the help info for 'portiacrawl'. I suspect that you need to name the [spider] and [options], as well as change the terminal directory, to see the output from the crawler. I suggest trying the following, but rename [spider] to the actual name of your spider, without brackets:
Change the terminal to the proper directory: C:\portia\slyd\data\projects
Make sure you are in the directory "C:\portia\slyd\data\projects"; the cmd prompt should look like C:\portia\slyd\data\projects> while waiting for Portia initiation.
Then enter into the terminal:
python portiacrawl C:\portia\slyd\data\projects\new_project [spider] -t csv -o test.csv; or,
python portiacrawl [spider] -t csv -o test.csv
Report back; I am curious about the terminal response. Did it initiate portiacrawl and return "access is denied"?
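For context, WindowsError: [Error 2] raised from Popen means Windows could not find the executable for the command being launched, not your project files. A minimal reproduction of the same failure (a sketch with a deliberately bogus program name):
# subprocess raises WindowsError [Error 2] when the command's executable
# is not found on PATH, before any of its arguments are even considered.
import subprocess

subprocess.call(['no_such_program.exe'])  # WindowsError: [Error 2]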