Can't add jars to pyspark in Jupyter on Google DataProc - google-cloud-platform

I have a Jupyter notebook on DataProc and I need a jar to run some job. I'm aware of editing spark-defaults.conf and of using --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar to submit the job from the command line - both work well. However, if I want to add the jar directly from the Jupyter notebook, I tried the methods below and they all fail.
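For reference, the command-line submission that works looks roughly like this (the cluster name, region, and job script are placeholders, not my actual values):
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar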
Method 1:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar pyspark-shell'
Method 2:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Shakespeare WordCount')\
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar')\
    .getOrCreate()
They both have the same error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-1-2b7692efb32b> in <module>()
19 # Read BQ data into spark dataframe
20 # This method reads from BQ directly, does not use GCS for intermediate results
---> 21 df = spark.read.format('bigquery').option('table', table).load()
22
23 df.show(5)
/usr/lib/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
171 else:
--> 172 return self._df(self._jreader.load())
173
174 #since(1.4)
/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o81.load.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: bigquery.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 13 more
The task I'm trying to run is very simple:
table = 'publicdata.samples.shakespeare'
df = spark.read.format('bigquery').option('table', table).load()
df.show(5)
I understand there are many similar questions and answers, but they either don't work or don't fit my needs. There are ad-hoc jars I'll need, and I don't want to keep all of them in the default configuration. I'd like to be more flexible and add jars on the go. How can I solve this? Thank you!

Unfortunately there isn't a built-in way to do this dynamically without effectively just editing spark-defaults.conf and restarting the kernel. There's an open feature request in Spark for this.
Zeppelin has some usability features for adding jars through the UI, but even in Zeppelin you have to restart the interpreter afterwards for the Spark context to pick them up in its classloader. Those options also require the jar files to already be staged on the local filesystem; you can't just refer to remote file paths or URLs.
One workaround is to create an init action which sets up a systemd service that regularly polls an HDFS directory and syncs it into one of the existing classpath directories, such as /usr/lib/spark/jars:
#!/bin/bash
# Sets up continuous sync'ing of an HDFS directory into /usr/lib/spark/jars
# Manually copy jars into this HDFS directory to have them sync into
# ${LOCAL_DIR} on all nodes.
HDFS_DROPZONE='hdfs:///usr/lib/jars'
LOCAL_DIR='file:///usr/lib/spark/jars'
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
if [[ "${ROLE}" == 'Master' ]]; then
  hdfs dfs -mkdir -p "${HDFS_DROPZONE}"
fi
SYNC_SCRIPT='/usr/lib/hadoop/libexec/periodic-sync-jars.sh'
cat << EOF > "${SYNC_SCRIPT}"
#!/bin/bash
while true; do
  sleep 5
  hdfs dfs -ls ${HDFS_DROPZONE}/*.jar 2>/dev/null | grep hdfs: | \
      sed 's/.*hdfs:/hdfs:/' | xargs -n 1 basename 2>/dev/null | sort \
      > /tmp/hdfs_files.txt
  hdfs dfs -ls ${LOCAL_DIR}/*.jar 2>/dev/null | grep file: | \
      sed 's/.*file:/file:/' | xargs -n 1 basename 2>/dev/null | sort \
      > /tmp/local_files.txt
  comm -23 /tmp/hdfs_files.txt /tmp/local_files.txt > /tmp/diff_files.txt
  if [ -s /tmp/diff_files.txt ]; then
    for FILE in \$(cat /tmp/diff_files.txt); do
      echo "\$(date): Copying \${FILE} from ${HDFS_DROPZONE} into ${LOCAL_DIR}"
      hdfs dfs -cp "${HDFS_DROPZONE}/\${FILE}" "${LOCAL_DIR}/\${FILE}"
    done
  fi
done
EOF
chmod 755 "${SYNC_SCRIPT}"
SERVICE_CONF='/usr/lib/systemd/system/sync-jars.service'
cat << EOF > "${SERVICE_CONF}"
[Unit]
Description=Periodic Jar Sync
[Service]
Type=simple
ExecStart=/bin/bash -c '${SYNC_SCRIPT} &>> /var/log/periodic-sync-jars.log'
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
chmod a+rw "${SERVICE_CONF}"
systemctl daemon-reload
systemctl enable sync-jars
systemctl restart sync-jars
systemctl status sync-jars
Then, whenever you need a jar file to be available everywhere, you just copy it into hdfs:///usr/lib/jars; the periodic poller will automatically stick it into /usr/lib/spark/jars, and then you simply restart your kernel to pick it up. You can add jars to that HDFS directory either by SSH'ing in and running hdfs dfs -cp directly, or simply by shelling out from your Jupyter notebook with subprocess:
import subprocess
sp = subprocess.Popen(
    ['hdfs', 'dfs', '-cp',
     'gs://spark-lib/bigquery/spark-bigquery-latest.jar',
     'hdfs:///usr/lib/jars/spark-bigquery-latest.jar'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
out, err = sp.communicate()
print(out)
print(err)
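Because the sync loop above only wakes up every few seconds, there's a brief window where the jar exists in HDFS but hasn't landed locally yet. A minimal sketch of a wait, assuming the jar name used above, run on the node hosting your kernel (e.g. from a notebook shell cell) before restarting:
# Wait until the sync service has dropped the jar onto the local classpath;
# once it appears, restart the Jupyter kernel so Spark picks it up.
while [ ! -f /usr/lib/spark/jars/spark-bigquery-latest.jar ]; do
  sleep 2
done
echo "jar is now in /usr/lib/spark/jars"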

Related

awslogs-agent-setup.py not working on Ubuntu 17.10 (artful)

This works fine on Ubuntu 16.04, but not on 17.10:
+ curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 56093  100 56093    0     0  56093      0  0:00:01 --:--:--  0:00:01 98929
+ chmod +x ./awslogs-agent-setup.py
+ ./awslogs-agent-setup.py -n -c /etc/awslogs/awslogs.conf -r us-west-2
Step 1 of 5: Installing pip ... libyaml-dev does not exist in system DONE
Step 2 of 5: Downloading the latest CloudWatch Logs agent bits ... Traceback (most recent call last):
File "./awslogs-agent-setup.py", line 1317, in <module>
main()
File "./awslogs-agent-setup.py", line 1313, in main
setup.setup_artifacts()
File "./awslogs-agent-setup.py", line 858, in setup_artifacts
self.install_awslogs_cli()
File "./awslogs-agent-setup.py", line 570, in install_awslogs_cli
subprocess.call([AWSCLI_CMD, 'configure', 'set', 'plugins.cwlogs', 'cwlogs'], env=DEFAULT_ENV)
File "/usr/lib/python2.7/subprocess.py", line 168, in call
return Popen(*popenargs, **kwargs).wait()
File "/usr/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1025, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
I noticed that earlier in the process, in the AWS boilerplate, it failed to install libyaml-dev, but I'm not sure whether that's the only problem.
Always find the answer right after I post it...
Here's my modified CF template command:
050_install_awslogs:
  command: !Sub
    "/bin/bash -x\n
    exec >>/var/log/cf_050_install_awslogs.log 2>&1 \n
    echo 050_install_awslogs...\n
    set -xe\n
    # Get the CloudWatch Logs agent\n
    mkdir /opt/awslogs\n
    cd /opt/awslogs\n
    # Needed for python3 in 17.10\n
    apt-get install -y libyaml-dev python-dev \n
    pip3 install awscli-cwlogs\n
    # avoid it complaining about not having /var/awslogs/bin/aws binary\n
    if [ ! -d /var/awslogs/bin ] ; then\n
      mkdir -p /var/awslogs/bin\n
      ln -s /usr/local/bin/aws /var/awslogs/bin/aws\n
    fi\n
    curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O\n
    chmod +x ./awslogs-agent-setup.py\n
    # Hack for python 3.6 & old awslogs-agent-setup.py\n
    sed -i 's/3,6/3,7/' awslogs-agent-setup.py\n
    ./awslogs-agent-setup.py -n -c /etc/awslogs/awslogs.conf -r ${AWS::Region}\n
    echo 050_install_awslogs end\n
    "
I'm not entirely sure about the need for the dir creation, but I expect this is a temporary situation that will get resolved soon, since one still needs to fudge the Python 3.6 compatibility check.
It may be installable using Python 2.7 as well, but that felt like going backwards at this point, as my rationale for 17.10 was Python 3.6.
Credit for the yaml package and dir creation idea to https://forums.aws.amazon.com/thread.jspa?threadID=265977 but I prefer to avoid easy_install.
I had a similar issue on Ubuntu 18.04.
The instructions from AWS for a standalone install worked in my case.
To download and run it standalone, use the following commands and follow the prompts:
curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/AgentDependencies.tar.gz -O
tar xvf AgentDependencies.tar.gz -C /tmp/
sudo python ./awslogs-agent-setup.py --region us-east-1 --dependency-path /tmp/AgentDependencies
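If the standalone setup finishes cleanly, the classic agent normally runs as a service; a rough sanity check (the service name and log path are assumptions based on the classic CloudWatch Logs agent, adjust for your install):
sudo service awslogs status            # assumed service name for the classic agent
sudo tail -n 50 /var/log/awslogs.log   # assumed location of the agent's own log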

Error while Connecting PySpark to AWS Redshift

I have been trying to connect Spark 2.2.1 on my EMR 5.11.0 cluster to our Redshift store.
The approach I followed was to use the built-in Redshift JDBC driver:
pyspark --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar
from pyspark.sql import SQLContext
sc
sql_context = SQLContext(sc)
redshift_url = "jdbc:redshift://HOST:PORT/DATABASE?user=USER&password=PASSWORD"
redshift_query = "select * from table;"
redshift_query_tempdir_storage = "s3://personal_warehouse/wip_dumps/"
# Read data from a query
df_users = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("query", redshift_query) \
    .option("tempdir", redshift_query_tempdir_storage) \
    .option("forward_spark_s3_credentials", "true") \
    .load()
This gives me the following error -
Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 165, in load
    return self._df(self._jreader.load())
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:546)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:302)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.redshift.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:530)
... 16 more
Can someone please point out what I've missed or where I've made a mistake?
Thanks!
I had to include 4 jar files in the EMR spark-submit options to get this working.
List of jar files:
1. RedshiftJDBC41-1.2.12.1017.jar
2. spark-redshift_2.10-2.0.0.jar
3. minimal-json-0.9.4.jar
4. spark-avro_2.11-3.0.0.jar
You can download the jar files, store them in an S3 bucket, and point to them in the spark-submit options, like:
--jars s3://<pathToJarFile>/RedshiftJDBC41-1.2.10.1009.jar,s3://<pathToJarFile>/minimal-json-0.9.4.jar,s3://<pathToJarFile>/spark-avro_2.11-3.0.0.jar,s3://<pathToJarFile>/spark-redshift_2.10-2.0.0.jar
And then finally query your Redshift in your Spark code as in this example: spark-redshift-example.
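Putting those options together, the full submit might look roughly like this (the s3 bucket path placeholders are the same as above; the job script name is just an example):
spark-submit \
  --jars s3://<pathToJarFile>/RedshiftJDBC41-1.2.12.1017.jar,s3://<pathToJarFile>/spark-redshift_2.10-2.0.0.jar,s3://<pathToJarFile>/minimal-json-0.9.4.jar,s3://<pathToJarFile>/spark-avro_2.11-3.0.0.jar \
  my_redshift_job.py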
You need to add the Spark Redshift datasource to your pyspark command:
pyspark --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar \
--packages com.databricks:spark-redshift_2.11:2.0.1
The problem is that Spark is not finding the necessary packages at the moment of execution. In the .sh script that launches the Python file, you have to add not only the driver but also the necessary package.
script test.sh
sudo pip install boto3
spark-submit --jars RedshiftJDBC42-1.2.15.1025.jar --packages com.databricks:spark-redshift_2.11:2.0.1 test.py
script test.py
from pyspark import SparkContext
from pyspark.sql import SQLContext
# a script run via spark-submit has to create its own SparkContext
sc = SparkContext()
sql_context = SQLContext(sc)
redshift_url = "jdbc:redshift://HOST:PORT/DATABASE?user=USER&password=PASSWORD"
redshift_query = "select * from table;"
redshift_query_tempdir_storage = "s3://personal_warehouse/wip_dumps/"
# Read data from a query
df_users = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("query", redshift_query) \
    .option("tempdir", redshift_query_tempdir_storage) \
    .option("forward_spark_s3_credentials", "true") \
    .load()
Run the script test.sh
sudo sh test.sh
The problem should now be solved.

Saltstack fails when creating minions from a map file on DigitalOcean

I can't create a minion from the map file, and I have no idea what happened. A month ago my script was working correctly; right now it fails. I tried to do some research about it but couldn't find anything. Could someone have a look at my DEBUG log? The minion is created on DigitalOcean, but the master server can't connect to it at all.
so I run:
salt-cloud -P -m /etc/salt/cloud.maps.d/production.map -l debug
The master is running on Ubuntu 16.04.1 x64, the minion also.
I use the latest SaltStack repository:
echo "deb http://repo.saltstack.com/apt/ubuntu/16.04/amd64/latest xenial main" >> /etc/apt/sources.list.d/saltstack.list
I tested both 2016.3.2 and 2016.3.3. What is interesting is that the same script was working correctly 4 weeks ago, so I assume something must have changed.
ERROR:
Writing /usr/lib/python2.7/dist-packages/salt-2016.3.3.egg-info
* INFO: Running install_ubuntu_git_post()
disabled
Created symlink from /etc/systemd/system/multi-user.target.wants/salt-minion.service to /lib/systemd/system/salt-minion.service.
* INFO: Running install_ubuntu_check_services()
* INFO: Running install_ubuntu_restart_daemons()
Job for salt-minion.service failed because a configured resource limit was exceeded. See "systemctl status salt-minion.service" and "journalctl -xe" for details.
start: Unable to connect to Upstart: Failed to connect to socket /com/ubuntu/upstart: Connection refused
* ERROR: No init.d support for salt-minion was found
* ERROR: Failed to run install_ubuntu_restart_daemons()!!!
[ERROR ] Failed to deploy 'minion-zk-0'. Error: Command 'ssh -t -t -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oControlPath=none -oPasswordAuthentication=no -oChallengeResponseAuthentication=no -oPubkeyAuthentication=yes -oIdentitiesOnly=yes -oKbdInteractiveAuthentication=no -i /etc/salt/keys/cloud/do.pem -p 22 root#REMOVED_IP '/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd/deploy.sh -c '"'"'/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd'"'"' -P git v2016.3.3'' failed. Exit code: 1
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/salt/cloud/__init__.py", line 2293, in create_multiprocessing
local_master=parallel_data['local_master']
File "/usr/lib/python2.7/dist-packages/salt/cloud/__init__.py", line 1281, in create
output = self.clouds[func](vm_)
File "/usr/lib/python2.7/dist-packages/salt/cloud/clouds/digital_ocean.py", line 481, in create
ret = __utils__['cloud.bootstrap'](vm_, __opts__)
File "/usr/lib/python2.7/dist-packages/salt/utils/cloud.py", line 527, in bootstrap
deployed = deploy_script(**deploy_kwargs)
File "/usr/lib/python2.7/dist-packages/salt/utils/cloud.py", line 1516, in deploy_script
if root_cmd(deploy_command, tty, sudo, **ssh_kwargs) != 0:
File "/usr/lib/python2.7/dist-packages/salt/utils/cloud.py", line 2167, in root_cmd
retcode = _exec_ssh_cmd(cmd, allow_failure=allow_failure, **kwargs)
File "/usr/lib/python2.7/dist-packages/salt/utils/cloud.py", line 1784, in _exec_ssh_cmd
cmd, proc.exitstatus
SaltCloudSystemExit: Command 'ssh -t -t -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oControlPath=none -oPasswordAuthentication=no -oChallengeResponseAuthentication=no -oPubkeyAuthentication=yes -oIdentitiesOnly=yes -oKbdInteractiveAuthentication=no -i /etc/salt/keys/cloud/do.pem -p 22 root#REMOVED_ID '/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd/deploy.sh -c '"'"'/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd'"'"' -P git v2016.3.3'' failed. Exit code: 1
[DEBUG ] LazyLoaded nested.output
minion-zk-0:
----------
Error:
Command 'ssh -t -t -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oControlPath=none -oPasswordAuthentication=no -oChallengeResponseAuthentication=no -oPubkeyAuthentication=yes -oIdentitiesOnly=yes -oKbdInteractiveAuthentication=no -i /etc/salt/keys/cloud/do.pem -p 22 root#REMOVED_IP '/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd/deploy.sh -c '"'"'/tmp/.saltcloud-5d18c002-e817-46d5-9fb2-d3bdb2dfe7fd'"'"' -P git v2016.3.3'' failed. Exit code: 1
root@master-zk:/etc/salt/cloud.maps.d# salt '*' test.ping
minion-zk-0:
Minion did not return. [No response]
root@master-zk:/etc/salt/cloud.maps.d#
It is located in your cloud configuration, somewhere in /etc/salt/cloud.profiles.d/, /etc/salt/cloud.providers.d/ or /etc/salt/cloud.d/. Just figure out where, and change the value 'salt' to your master's IP.
I currently do this in my provider settings like this:
hit-vcenter:
  driver: vmware
  user: 'foo'
  password: 'secret'
  url: 'some url'
  protocol: 'https'
  port: 443
  minion:
    master: 10.1.10.1
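Once the minion points at the right master, a quick way to confirm it registered is to check the keys and repeat the ping from the question (standard salt commands, nothing specific to this setup):
salt-key -L           # the new minion's key should show up here (accept it if unaccepted)
salt '*' test.ping    # should now return True for the minion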

java.io.IOException: Broken pipe on increasing number of mappers/reducers, a lot

I am running a MapReduce job on a Hadoop cluster of 6 nodes, with 4 map tasks and 10 reduce tasks configured.
The mappers/reducers fail a lot when I increase the number of map/reduce tasks, as below;
I encounter the following error:
stderr logs
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
and this:
syslog logs
2014-03-01 15:11:30,118 WARN org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2014-03-01 15:11:30,118 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
2014-03-01 15:11:30,121 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-01 15:11:30,146 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-03-01 15:11:30,146 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hduser for UID 1001 from the native implementation
2014-03-01 15:11:30,147 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:130)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2014-03-01 15:11:30,149 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201402281751_0042_r_000004_0
2014-03-01 15:11:31,252 INFO org.apache.hadoop.streaming.PipeMapRed: Records R/W=983976/1957694
Even the simplest program creates this problem:
mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    if line:
        print "%s\n%s"%(line, line)
reducer.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    if line:
        print "%s"%(line)
I am using the following command to run the Hadoop streaming job:
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -D mapred.reduce.tasks=10 -file /home/hduser/code/K1D/code1/mapper2.py -mapper mapper2.py -file /home/hduser/code/K1D/code1/reducer2.py -reducer reducer2.py -input /user/hduser/data-out/part-00000 -output /user/hduser/data-out1 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Can you suggest anything?
I ran into something similar and realized that I didn't have the exact version of Python installed on one of my data nodes. I had
#!/usr/bin/env python
which I changed to
#!/usr/bin/env python2.7
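A quick way to catch this kind of interpreter skew before resubmitting the job is to compare what the shebang resolves to on every node; a rough sketch (the hostnames are placeholders for your data nodes):
# Check that each data node has the interpreter the shebang asks for
for host in datanode1 datanode2 datanode3; do
  echo "== $host =="
  ssh "$host" 'which python2.7 && python2.7 --version'
done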

How to set up Loggly on Elastic Beanstalk?

I'd like to set up Loggly to run on AWS Elastic Beanstalk, but can't find any information on how to do this. Is there any guide anywhere, or some general guidance on how to start?
This is how I do it for papertrailapp.com (which I prefer over Loggly). In your .ebextensions folder (see more info) you create logs.config, where you specify:
container_commands:
  01-set-correct-hostname:
    command: hostname www.example.com
  02-forward-rsyslog-to-papertrail:
    # https://papertrailapp.com/systems/setup
    command: echo "*.* @logs.papertrailapp.com:55555" >> /etc/rsyslog.conf
  03-enable-remote-logging:
    command: echo -e "\$ModLoad imudp\n\$UDPServerRun 514\n\$ModLoad imtcp\n\$InputTCPServerRun 514\n\$EscapeControlCharactersOnReceive off" >> /etc/rsyslog.conf
  04-restart-syslog:
    command: service rsyslog restart
55555 should be replaced with the UDP port number provided by papertrailapp.com. This config will be applied every time a new instance bootstraps. Then, in your log4j.properties:
log4j.rootLogger=WARN, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.facility=local1
log4j.appender.SYSLOG.header=true
log4j.appender.SYSLOG.syslogHost=localhost
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=[%p] %t %c: %m%n
I'm not sure whether it's an optimal solution. Read more about this mechanism in jcabi-beanstalk-maven-plugin
You can also use the installation script from loggly itself.
The setup below follows the instructions for the legacy setup on https://www.loggly.com/docs/configure-syslog-script/ with minor changes (no confirmation prompts, sudo command replaced since no tty is available)
(edit: updated link, seems to be an outdated solution now in loggly docs)
Place the following script in .ebextensions/loggly.config
Replace TOKEN and ACCOUNT with your own.
#
# Install loggly.com on AWS Elastic Beanstalk
# Tested with node.js environment
# Save this file as .ebextensions/loggly.config
# Deploy per normal scripts or aws.push. To help debug the push, ssh & tail /var/log/cfn-init.log
# See Also /var/log/eb-tools.log
#
commands:
  01_loggly_dl:
    command: wget -q -O /tmp/loggly.py https://www.loggly.com/install/configure-syslog.py
  02_loggly_config:
    command: su --session-command="python /tmp/loggly.py setup --auth TOKEN --account ACCOUNT --yes"
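As the header comments above suggest, the quickest way to debug a failed deploy is to SSH in and tail the cfn-init and eb-tools logs:
# On the instance, after an eb deploy / aws.push
tail -n 100 /var/log/cfn-init.log
tail -n 100 /var/log/eb-tools.log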
Here is a link to the Loggly support site for using syslogd with Loggly:
http://wiki.loggly.com/loggingconfiguration
or using the loggly api with your own app:
http://wiki.loggly.com/apidocumention
Here is an Elastic Beanstalk config for Loggly that I've just started using, thanks to pointers from this thread and the logging SaaS vendors' setup instructions. [Loggly Config Mgmt, Papertrail rsyslog]
Save the file as loggly.config in the .ebextensions directory and make sure to check the YAML formatting conventions (no tabs, etc). Substitute your Loggly TCP port number, username, password and domain name into the angle brackets as required.
Note that for AWS Ruby versions of Elastic Beanstalk, there may be differences in the EC2 /etc/rsyslog setup. For example, if /etc/rsyslog.d already exists and there is already an "$IncludeConfig /etc/rsyslog.d/*.conf" directive, then the command "01-forward-rsyslog-to-loggly:" can be removed.
Deploy per normal scripts or aws.push. To help debug the push, ssh & tail /var/log/cfn-init.log
files:
  "/etc/rsyslog.d/90-loggly.conf" :
    mode: "000664"
    owner: root
    group: root
    content: |
      # ### begin forwarding rule ###
      # The statement between the begin ... end define a SINGLE forwarding
      # rule. They belong together, do NOT split them. If you create multiple
      # forwarding rules, duplicate the whole block!
      # Remote Logging (we use TCP for reliable delivery)
      #
      # An on-disk queue is created for this action. If the remote host is
      # down, messages are spooled to disk and sent when it is up again.
      $WorkDirectory /var/lib/rsyslog # where to place spool files
      $ActionQueueFileName fwdRule1 # unique name prefix for spool files
      $ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
      $ActionQueueSaveOnShutdown on # save messages to disk on shutdown
      $ActionQueueType LinkedList   # run asynchronously
      $ActionResumeRetryCount -1    # infinite retries if host is down
      *.* @@logs.loggly.com:<yourportnum> # !!!Loggly supplied port number for each app!!!
      # ### end of the forwarding rule ###
    encoding: plain
"/tmp/loggly.py" :
mode: "000755"
owner: root
group: root
content: |
import json
import sys
import urllib2
'''
Auto-authenticate Syslog TCP inputs.
Usage: python inputs.py -u user -p pass -s subdomain
'''
state = None
params = {}
for i in range(len(sys.argv)):
arg = sys.argv[i]
if state:
params[state] = arg
state = None
if arg == '--username' or arg == '-u':
state = 'username'
if arg == '--password' or arg == '-p':
state = 'password'
if arg == '--subdomain' or arg == '-s':
state = 'subdomain'
url = 'https://%s.loggly.com/api/inputs' % params['subdomain']
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, params['username'], params['password'])
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
opener.open(url)
urllib2.install_opener(opener)
inputs = json.loads(urllib2.urlopen(url).read())
for input in inputs:
if input['service']['name'] == 'syslogtcp':
url = 'https://%s.loggly.com/api/inputs/%d/adddevice' % \
(params['subdomain'], input['id'])
response = urllib2.urlopen(url, {}).read()
print response
encoding: plain
commands:
  01-forward-rsyslog-to-loggly:
    # http://loggly.com/support/sending-data/logging-from/syslog/rsyslog/cd
    command: test "$(grep -s '90-loggly.conf' /etc/rsyslog.conf)" == "" && echo -e "\n# Include the loggly.conf file\n\$IncludeConfig /etc/rsyslog.d/90-loggly.conf" >> /etc/rsyslog.conf
  02-restart-syslog:
    command: service rsyslog restart
  03-inform_loggly:
    command: "python /tmp/loggly.py -u <Yourloginname> -p <Yourpassword> -s <Yourdomainname>"
Typically, /etc/rsyslog.conf will have a "$IncludeConfig /etc/rsyslog.d/*.conf" at the end - so you can simply introduce your own configuration file using the "files:" portion of your .ebextensions file. This works whether you are deploying to fresh servers or not.
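A quick way to confirm that on a running instance before relying on it (plain shell, nothing Beanstalk-specific):
# If this prints an $IncludeConfig line, drop-in files under /etc/rsyslog.d/ are picked up
grep -n 'IncludeConfig' /etc/rsyslog.conf
ls /etc/rsyslog.d/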
For a Ruby production.log, you might have something like this in a .ebextensions/01loggly.config file. Note that this also picks up your Beanstalk environment name as a Loggly tag.
# For docs on eb configs, see http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
# This set of commands sets up loggly forwarding
files:
  "/etc/rsyslog.d/myapp-loggly.conf" :
    mode: "000664"
    owner: root
    group: root
    content: |
      $template LogglyFormat,"<%pri%>%protocol-version% %timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% [yourlogglyid@41058 tag=`{ "Ref" : "AWSEBEnvironmentName" }`] %msg%\n"
      *.* @@logs-01.loggly.com:514;LogglyFormat
      # One time config
      $ModLoad imfile
      $InputFilePollInterval 10
      $PrivDropToGroup adm
      $WorkDirectory /var/spool/rsyslog
      # Add a tag for file events
      # For production.log
      $InputFileName /var/app/support/logs/production.log
      $InputFileTag production-log
      $InputFileStateFile stat-production-log #this must be unique for each file being polled
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      # Send to Loggly then discard
      if $programname == 'myapp-production-log' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'myapp-production-log' then ~
    encoding: plain
commands:
  00-make-work-directory:
    command: mkdir -p /var/spool/rsyslog
  01-restart-syslog:
    command: service rsyslog restart
For Tomcat, you might do something like this in a .ebextensions/01logglyg.config file:
# For docs on eb configs, see http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
# This set of commands sets up loggly forwarding
files:
  "/etc/rsyslog.d/mytomcatapp-loggly.conf" :
    mode: "000664"
    owner: root
    group: root
    content: |
      $template LogglyFormat,"<%pri%>%protocol-version% %timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% [yourlogglygidhere@41058 tag=`{ "Ref" : "AWSEBEnvironmentName" }`] %msg%\n"
      *.* @@logs-01.loggly.com:514;LogglyFormat
      # One time config
      $ModLoad imfile
      $InputFilePollInterval 10
      $PrivDropToGroup adm
      $WorkDirectory /var/spool/rsyslog
      # catalina.log
      $InputFileName /var/log/tomcat7/catalina.log
      $InputFileTag catalina-log
      $InputFileStateFile stat-catalina-log
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'catalina-log' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'catalina-log' then ~
      # catalina.out
      $InputFileName /var/log/tomcat7/catalina.out
      $InputFileTag catalina-out
      $InputFileStateFile stat-catalina-out
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'catalina-out' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'catalina-out' then ~
      # host-manager.log
      $InputFileName /var/log/tomcat7/host-manager.log
      $InputFileTag host-manager
      $InputFileStateFile stat-host-manager
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'host-manager' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'host-manager' then ~
      # initd.log
      $InputFileName /var/log/tomcat7/initd.log
      $InputFileTag initd
      $InputFileStateFile stat-initd
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'initd' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'initd' then ~
      # localhost.log
      $InputFileName /var/log/tomcat7/localhost.log
      $InputFileTag localhost-log
      $InputFileStateFile stat-localhost-log
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'localhost-log' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'localhost-log' then ~
      # manager.log
      $InputFileName /var/log/tomcat7/manager.log
      $InputFileTag manager
      $InputFileStateFile stat-manager
      $InputFileSeverity info
      $InputFilePersistStateInterval 20000
      $InputRunFileMonitor
      if $programname == 'manager' then @@logs-01.loggly.com:514;LogglyFormat
      if $programname == 'manager' then ~
    encoding: plain
commands:
  00-make-work-directory:
    command: mkdir -p /var/spool/rsyslog
  01-restart-syslog:
    command: service rsyslog restart
This config is working for me, though I haven't yet determined how to get multi-line entries to come into Loggly as a single entry.
I know this question is fairly old, but I found that the answers really didn't answer the question or just plain didn't work correctly when implemented. I found that this works (file .ebextensions/02loggly.config):
container_commands:
  01-transform-rsyslog.conf:
    command: sed "s/NODE_ENV/$NODE_ENV/g" scripts/22-loggly.conf.temp > scripts/22-loggly.conf
  02-setup-rsyslog.conf:
    command: cp scripts/22-loggly.conf /etc/rsyslog.d/22-loggly.conf
  03-restart:
    command: /sbin/service rsyslog restart
the "01-transform-rsyslog.conf" step is optional; I use that to set a tag by NODE_ENV in the file. "22-loggly.conf.temp" is a modified version of the "22-loggly.conf" file that gets created at "/etc/rsyslog.d/" when you run the linux source setup script (https://www.loggly.com/install/configure-syslog.py). I just installed it on a ec2 instance and copied the file.
Note I had to prepend '/sbin' to my service command because it was failing for me without it. Also, this restarts syslog on every deploy, which should be fine.
Now you just have to make sure your app logs to syslog. For Java it is going to be log4j or similar. For Node.js (which is what I'm using), rconsole works (https://github.com/tblobaum/rconsole).
None of the things I tried seemed to work, and the loggly documentation is very confusing!
I hope this helps someone; this is how I got it to work.
Paste the following in .ebextensions/loggly.config
files:
  "/etc/rsyslog.conf" :
    mode: "000644"
    owner: root
    group: root
    content: |
      $ModLoad imfile
      $InputFilePollInterval 10
      $PrivDropToGroup adm
      # Input for FILE.LOG
      $InputFileName /var/app/current/PATH_TO_YOUR_LOG_FILE
      $InputFileTag social_php:
      $InputFileStateFile stat-social_php #this must be unique for each file being polled
      $InputFileSeverity info
      $InputRunFileMonitor
      #Add a tag for events from this file
      $template LogglyFormatsocial_php,"<%pri%>%protocol-version% %timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% [TOKEN@41058 tag=\"php_log\"] %msg%\n"
      if $programname == 'social_php' then @@logs.loggly.com:37146;LogglyFormatsocial_php
      if $programname == 'social_php' then ~
      *.* @@logs.loggly.com:37146
commands:
  01-restart-syslog:
    command: service rsyslog restart
Replace all instances of social_php with the tag that makes sense for your application.
Replace /var/app/current/PATH_TO_YOUR_LOG_FILE with your log file location
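If you'd rather not edit by hand, a substitution along these lines does both replacements (the tag "myapp" and the log path are just example values):
# Swap in your own tag and log file path before deploying
sed -i 's/social_php/myapp/g; s|PATH_TO_YOUR_LOG_FILE|logs/my_app.log|g' .ebextensions/loggly.config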
Here is my Loggly configuration for Elastic Beanstalk, for Linux + log4j.
In the .ebextensions file configuration:
container_commands:
  01_configure_sudo_access:
    command: sed -i -- 's/ requiretty/ \!requiretty/g' /etc/sudoers
  02_loggy_configure:
    command: sudo python .ebextensions/scripts/loggly_config.py
  03_restore_sudo_access:
    command: sed -i -- 's/ \!requiretty/ requiretty/g' /etc/sudoers
Loggly script in python for default AMI:
import os

rsyslog_path = '/etc/rsyslog.conf'
loggly_file_path = '/etc/rsyslog.d/22-loggly.conf'

class LogglyConfig:
    def __init__(self):
        self.__linux_log()
        self.__config_loggly_for_log4j()

    def __linux_log(self):
        # not installed on this machine
        if not os.path.exists(loggly_file_path):
            os.system('rm -f configure-linux.sh')
            os.system('wget https://www.loggly.com/install/configure-linux.sh')
            os.system('sudo bash configure-linux.sh -a DOMAIN -t TOKEN -u USER -p PASSWORD -s')

    def __config_loggly_for_log4j(self):
        f = open(rsyslog_path, 'r')
        file_text = f.read()
        f.close()
        file_text = file_text.replace('#$ModLoad imudp', '$ModLoad imudp')
        file_text = file_text.replace('#$UDPServerRun 514', '$UDPServerRun 514')
        f = open(rsyslog_path, 'w')
        f.write(file_text)
        f.close()
        os.system('service rsyslog restart')

LogglyConfig()
In log4j.properties in your Java project:
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=localhost
log4j.appender.SYSLOG.Facility=Local3
log4j.appender.SYSLOG.Header=true
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=java %d{ISO8601} %p %t %c{1}.%M - %m%n