How to set the Python interpreter used by Spark workers? - python-2.7

How do I set the Python interpreter that Spark workers use?
I have tried several methods, for example:
1) setting the environment variables:
export PYSPARK_DRIVER_PYTHON=/python_path/bin/python
export PYSPARK_PYTHON=/python_path/bin/python
This does not work. I am sure PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON are set successfully; I checked with:
env | grep PYSPARK_PYTHON
I want PySpark to use
/python_path/bin/python
as the starting Python interpreter,
but the workers start with:
python -m daemon
I don't want to symlink the default python to /python_path/bin/python, because that could affect other developers: the default python and /python_path/bin/python are different versions, and both are in production use.
Setting these in spark-env.sh does not work either:
spark.pyspark.driver.python=/python_path/bin/python
spark.pyspark.python=/python_path/bin/python
When the driver starts, there are warning logs like:
conf/spark-env.sh: line 63:
spark.pyspark.driver.python=/python_path/bin/python: No such file or directory
conf/spark-env.sh: line 64:
spark.pyspark.python=/python_path/bin/python: No such file or directory

1) Check the permissions on your Python directory. Maybe Spark doesn't have the correct permissions. Try: sudo chmod -R 777 /python_path/bin/python
2) The Spark documentation says:
The property spark.pyspark.python takes precedence if it is set.
So also try setting spark.pyspark.python in conf/spark-defaults.conf (see the sketch after this list).
3) Also, if you use a cluster with more than one node, you need to check that Python is installed in the correct directory on each node, because you don't know in advance where the workers will be started.
4) Spark will use the first Python interpreter available on your system PATH, so as a workaround you can put the directory of your Python at the front of the PATH variable.
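For reference, a minimal sketch of where these settings normally go (the interpreter path is the one from the question; whether it matches your installation is an assumption). conf/spark-env.sh is sourced as a shell script, which is why lines like spark.pyspark.python=/python_path/bin/python produce the "No such file or directory" warnings above: bash tries to execute them as commands. spark-env.sh takes exported environment variables, while the spark.pyspark.* properties belong in conf/spark-defaults.conf or on the spark-submit command line:
# conf/spark-env.sh -- shell syntax, exported environment variables
export PYSPARK_PYTHON=/python_path/bin/python
export PYSPARK_DRIVER_PYTHON=/python_path/bin/python
# conf/spark-defaults.conf -- property syntax, key and value separated by whitespace
spark.pyspark.python          /python_path/bin/python
spark.pyspark.driver.python   /python_path/bin/python
# or per job on the command line (my_job.py is a placeholder)
spark-submit --conf spark.pyspark.python=/python_path/bin/python my_job.py
Note that spark.pyspark.python and spark.pyspark.driver.python only exist in Spark 2.1.0 and later; on older versions, the PYSPARK_PYTHON environment variable is the supported knob.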

Run Django commands on Elastic Beanstalk SSH -> Missing environment variables

So this has been a long-running problem for me and I'd love to fix it; I also think it will help a lot of others. I'd love to run Django commands after SSHing into my Elastic Beanstalk EC2 instance, e.g.:
python manage.py dumpdata
The reason this is not possible is the missing environment variables. They are present when the server boots up but are unset as soon as the server is running (EB creates a virtualenv within the EC2 instance and deletes the variables from there).
I've recently figured out that there is a prebuilt script to retrieve the env variables on the EC2 instances:
/opt/elasticbeanstalk/bin/get-config environment
This will return a stringified object like this:
{"AWS_STATIC_ASSETS_SECRET_ACCESS_KEY":"xxx-xxx-xxx","DJANGO_KEY":"xxx-xxx-xxx","DJANGO_SETTINGS_MODULE":"xx.xx.xx","PYTHONPATH":"/var/app/venv/staging-LQM1lest/bin","RDS_DB_NAME":"xxxxxxx":"xxxxxx","RDS_PASSWORD":"xxxxxx"}
This is where I'm currently stuck. I think I would need a script that takes this object, parses it, and sets the key/value pairs as environment variables; I would need to be able to run this script from the EC2 instance.
Or a command to execute from .ebextensions that would get the variables and set them.
I am absolutely unsure how to proceed at this point. Am I overlooking something obvious here? Has someone already written a script for this? Is this even the right approach?
I'd love your help!
Your env variables are stored in /opt/elasticbeanstalk/deployment/env
Thus to export them, you can do the following (must be root to access the file):
export $(cat /opt/elasticbeanstalk/deployment/env | xargs)
Once you execute the command you can confirm the presence of your env variables using:
env
To use this in your .ebextensions, you can try:
container_commands:
  10_dumpdata:
    command: |
      export $(cat /opt/elasticbeanstalk/deployment/env | xargs)
      source $PYTHONPATH/activate
      python ./manage.py dumpdata
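As a variant for an interactive SSH session, the get-config helper mentioned in the question can be parsed directly instead of reading the env file. This is only a sketch, not part of the original answer, and it assumes the jq binary is available on the instance and that the values contain no embedded quotes:
# build "export KEY=VALUE" lines from the JSON and evaluate them in the current shell
eval "$(sudo /opt/elasticbeanstalk/bin/get-config environment | jq -r 'to_entries[] | "export \(.key)=\"\(.value)\""')"
# confirm the variables are present
env | grep DJANGO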

Unable to execute a step on a running EMR

I have an EMR 5.28.1 cluster running in AWS, but I forgot to install some Python libraries as part of the bootstrap action. Now that the cluster is running, I am simply attempting to add a step via the EMR console. Here are my settings:
JAR: s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://xxxx/install_python_libraries.sh
Unfortunately, I get the following error.
Cannot run program "s3://xxxxx/install_python_libraries.sh" (in directory "."): error=2, No such file or directory
I am not sure what I am doing wrong. The shell script looks like this:
#!/bin/bash -xe
# Non-standard and non-Amazon Machine Image Python modules:
sudo pip-3.6 install boto3
sudo pip-3.6 install xmltodict
I also tried this by simply using 'command-runner.jar', but I get the same error. Can you please help me figure out the problem so I can do this via the console? I would like to install the libraries on all nodes, master and core.
Thanks
The issue is the .sh file's EOL/carriage-return type.
In other words, if it is the Windows type ("\r\n"), it will not work and returns the "./ file not found" error.
Convert it to the Unix type ("\n") using something like Notepad++ and it will run fine.
(In Notepad++: Edit > EOL Conversion > Unix (LF), hit save, and try again.)
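If you prefer to fix the line endings from a shell instead of Notepad++, a small sketch; whether dos2unix is installed is an assumption, while the sed fallback needs only GNU sed:
# strip the Windows carriage returns before uploading the script to S3
dos2unix install_python_libraries.sh
# or, equivalently
sed -i 's/\r$//' install_python_libraries.sh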

xgboost dmlc-submit eats command quotes (needed to run a Python job via scl enable)

I use RedHat 6.8 for my YARN cluster nodes, and it comes with Python 2.6.6. In order to get Python 2.7 on RedHat, one uses Software Collections. To activate a software collection, one must do scl enable. To print the Python version, for example, one does scl enable python27 'python -V'.
The problem is when I try to submit my Python job to YARN as follows:
dmlc-submit --cluster=yarn scl enable python27 'python -V'
It seems to eat up the quotes and produces this error (instead of the expected Python 2.7.8):
Unable to open /etc/scl/prefixes/python!
That is the same output one gets when doing the following at the bash prompt on any machine:
scl enable python27 python -V
I'm trying to figure out how to fool argparse into letting this through.
I looked into the argparser source code (which dmlc-core and xgboost use), trying to figure out any possible escaping mechanism. I couldn't find one, but I did find out that dmlc-submit picks up the environment of the job-submission node and recreates it on each of the execution nodes. I looked at scl enable and the /opt/rh/python27/enable script and realized that LD_LIBRARY_PATH is the important variable. PATH is too, but for some reason it doesn't carry over, so I now specify the full path to my SCL-provided python27.
The following now happily runs on my YARN 2.7.1 cluster:
cd xgboost
dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=9 --worker-cores=4 /opt/rh/python27/root/usr/bin/python tests/distributed/test_basic.py
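For completeness, a sketch of making the SCL runtime explicit before submitting; the lib64 path is the usual Software Collections location on RHEL 6 and is an assumption, since the answer above only notes that LD_LIBRARY_PATH is the variable that carries over to the execution nodes:
# expose the SCL Python 2.7 shared libraries on the submission node;
# dmlc-submit recreates this environment variable on each execution node
export LD_LIBRARY_PATH=/opt/rh/python27/root/usr/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd xgboost
dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=9 --worker-cores=4 /opt/rh/python27/root/usr/bin/python tests/distributed/test_basic.py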

Why don't my custom recipes run on AWS OpsWorks?

I've created a GitHub repo for my simple custom recipe:
my-cookbook/
|- recipes/
|  |- appsetup.rb
I've added the repo to Custom Chef Recipes as https://github.com/my-github-user/my-github-repo.git
I've added my-cookbook::appsetup to the Setup "cycle".
I know it's executed, because it fails to load if I mess up the syntax.
This is my appsetup.rb:
node[:deploy].each do |app_name, deploy|
  script "install_composer" do
    interpreter "bash"
    user "root"
    cwd "#{deploy[:deploy_to]}/current"
    code "curl -sS https://getcomposer.org/installer | php && php composer.phar install --no-dev"
  end
end
When I log into the instance by SSH with the ubuntu user, composer isn't installed.
I've also tried the following to no avail (a Node.js install):
node[:deploy].each do |app_name, deploy|
  execute "installing node" do
    command "add-apt-repository --yes ppa:chris-lea/node.js && apt-get update && sudo apt-get install python-software-properties python g++ make nodejs"
  end
end
Node doesn't get installed, and there are no errors in the log. The only references to the cookbook in the log are:
[2014-03-31T13:26:04+00:00] INFO: OpsWorks Custom Run List: ["opsworks_initial_setup", "ssh_host_keys", "ssh_users", "mysql::client", "dependencies", "ebs", "opsworks_ganglia::client", "opsworks_stack_state_sync", "mod_php5_apache2", "my-cookbook::appsetup", "deploy::default", "deploy::php", "test_suite", "opsworks_cleanup"]
...
[2014-03-31T13:26:04+00:00] INFO: New Run List expands to ["opsworks_initial_setup", "ssh_host_keys", "ssh_users", "mysql::client", "dependencies", "ebs", "opsworks_ganglia::client", "opsworks_stack_state_sync", "mod_php5_apache2", "my-cookbook::appsetup", "deploy::default", "deploy::php", "test_suite", "opsworks_cleanup"]
...
[2014-03-31T13:26:05+00:00] DEBUG: Loading Recipe my-cookbook::appsetup via include_recipe
[2014-03-31T13:26:05+00:00] DEBUG: Found recipe appsetup in cookbook my-cookbook
Am I missing some critical step somewhere? The recipe is clearly recognized and loaded, but doesn't seem to be executed.
(The following are fictitious names: my-github-user, my-github-repo, my-cookbook)
I know you've abandoned the cookbook, but I'm almost 100% sure it's because you don't have a metadata.rb file in the root of your cookbook.
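A minimal sketch of such a metadata.rb, placed in the cookbook root next to recipes/; the maintainer and version values are placeholders, and note the next answer's caveat about dashes in cookbook names:
# my-cookbook/metadata.rb
name        'my-cookbook'
maintainer  'my-github-user'
version     '0.1.0'
description 'Installs Composer and Node.js during the OpsWorks Setup event'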
Your cookbook name should not contain a dash. I had the same problem; replacing it with '_' solved it for me.
If those commands are failing silently, it could be that your use of && is obscuring a failure.
As for add-apt-repository, that is an interactive command. Try using the "--yes" option to answer yes by default, making it no longer interactive.
If your command does not execute successfully, you will not find the files in the current directory. Check inside the last release folder to see if they were put there.
It may be prudent to check that you have the right directory set up by changing the cwd to /tmp.

sudo /etc/init.d/celeryd start generates an "Unknown command: 'celeryd_multi'"

I'm setting up celery to run daemonized, using the variables from my virtual environment. But when I run $ sudo /etc/init.d/celeryd start, I get: Unknown command: 'celeryd_multi'. Type 'manage.py help' for usage.
I have set the following:
CELERYD_CHDIR="/home/myuser/projects/myproject"
ENV_PYTHON="/home/myuser/.virtualenvs/myproject/bin/python"
CELERYD_MULTI="$ENV_PYTHON $CELERYD_CHDIR/manage.py celeryd_multi"
When I run $ /home/myuser/.virtualenvs/myproject/bin/python /home/myuser/projects/myproject/manage.py celeryd_multi from the command line, it works fine.
Any ideas? I will gladly post any other code you need :)
Thank you!
Maybe you just set a wrong DJANGO_SETTINGS_MODULE:
try switching between DJANGO_SETTINGS_MODULE="settings" and DJANGO_SETTINGS_MODULE="project.settings".
The problem here is that when you run it as your user, the virtualenv already has the proper environment activated for your user "myuser", and it pulls packages from /home/myuser/.virtualenvs/myproject/...
When you do sudo /etc/init.d/celeryd start, you are starting celery as root, which probably doesn't have a virtualenv activated in /root/.virtualenvs/ (if such a thing even exists), and thus it looks for Python packages in /usr/lib/..., where your default Python is and, consequently, where your celery is not installed.
Your options are to either:
1) Replicate the same virtualenv under the root user and start it like you tried, with sudo.
2) Keep the virtualenv where it is and start celery as your user "myuser" (no sudo), without using init scripts.
3) Write a script that runs su - myuser -c '/bin/sh /home/myuser/.virtualenvs/myproject/bin/celeryd' to invoke it from init.d as myuser (see the sketch at the end of this answer).
4) Install supervisor outside of the virtualenv and let it do the dirty work for you.
Thoughts:
1) Avoid using root for anything you don't have to.
2) If you don't need celery to start on boot, then this is fine, possibly wrapped in a script.
3) Plainly hackish to me, but it works if you don't want to invest an additional 30 minutes in something else.
4) Probably the best way to handle ALL of your Python startup needs; highly recommended.
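For option 3), a minimal sketch of such a wrapper, assuming the paths from the question; it reuses the manage.py invocation that already works for "myuser", and the script name is hypothetical:
#!/bin/sh
# e.g. saved as /etc/init.d/celeryd-myproject (hypothetical name):
# run celeryd_multi as myuser, inside that user's virtualenv, instead of as root
su - myuser -c "/home/myuser/.virtualenvs/myproject/bin/python /home/myuser/projects/myproject/manage.py celeryd_multi start"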