Error while running a Python program in Spark environment - python-2.7

I am using Spark 1.3.0.
I have a problem running a Python program in the Spark Python shell.
This is how I submit the job:
/bin/spark-submit progname.py
The error I get is:
NameError: name 'sc' is not defined
on the line where sc is used.
Any idea?
Thanks in advance.

When a script is run with spark-submit, sc is not created for you (it only exists automatically in the interactive pyspark shell), so you have to create the SparkContext yourself:

## Imports
from pyspark import SparkConf, SparkContext

## CONSTANTS
APP_NAME = "My Spark Application"

## OTHER FUNCTIONS/CLASSES

## Main functionality
def main(sc):
    rdd = sc.parallelize(range(1000), 10)
    print rdd.mean()

if __name__ == "__main__":
    # Configure OPTIONS
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    # in a cluster this will be something like
    # "spark://ec2-0-17-03-078.compute-1.amazonaws.com:7077"
    sc = SparkContext(conf=conf)
    # Execute Main functionality
    main(sc)

conf = pyspark.SparkConf()
This is how you should create the SparkConf object.
You can also use chaining to do things like set the application name:
conf = pyspark.SparkConf().setAppName("My_App_Name")
Then pass this config variable when creating the SparkContext, as in the sketch below.
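For example, a minimal sketch (the application name is just a placeholder):

import pyspark

conf = pyspark.SparkConf().setAppName("My_App_Name")
sc = pyspark.SparkContext(conf=conf)  # the sc the script was missing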

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

Related

How to use SageMaker inside PySpark

I have a simple requirement: I need to run a SageMaker prediction inside a Spark job.
I am trying to run the code below:
ENDPOINT_NAME = "MY-ENDPOINT_NAME"
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark import EndpointCreationPolicy
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer

attachedModel = SageMakerModel(
    existingEndpointName=ENDPOINT_NAME,
    endpointCreationPolicy=EndpointCreationPolicy.DO_NOT_CREATE,
    endpointInstanceType=None,  # Required
    endpointInitialInstanceCount=None,  # Required
    requestRowSerializer=ProtobufRequestRowSerializer(
        featuresColumnName="featureCol"
    ),  # Optional: already default value
    responseRowDeserializer=ProtobufResponseRowDeserializer(schema=ouput_schema),
)

transformedData2 = attachedModel.transform(df)
transformedData2.show()
I get the following error: TypeError: 'JavaPackage' object is not callable
This was solved by ...
import sagemaker_pyspark
from pyspark import SparkConf, SparkContext

classpath = ":".join(sagemaker_pyspark.classpath_jars())
conf = SparkConf() \
    .set("spark.driver.extraClassPath", classpath)
sc = SparkContext(conf=conf)
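If you build a SparkSession instead of a bare SparkContext, a rough equivalent is the sketch below (the application name is just a placeholder):

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put the sagemaker_pyspark jars on the driver classpath before the JVM starts
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (
    SparkSession.builder
    .appName("sagemaker-inference")  # placeholder name
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)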

Failed to find data source: delta in Python environment

Following: https://docs.delta.io/latest/quick-start.html#python
I have installed delta-spark and run:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
However when I run:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
the error states: delta not recognised
and if I run
DeltaTable.isDeltaTable(spark, "packages/tests/streaming/data")
It states: TypeError: 'JavaPackage' object is not callable
I thought I could run these commands locally (e.g. in unit tests) without Maven or a pyspark shell. It would be good to know if I am just missing a dependency.
You can just install the delta-spark PyPI package using pip install delta-spark (it will pull in pyspark as well), and then refer to it.
Or you can add a configuration option that will fetch the Delta package: .config("spark.jars.packages", "io.delta:delta-core_2.12:<delta-version>"). For Spark 3.1 the Delta version is 1.0.0 (see the releases mapping docs for more information).
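A minimal sketch of that second option (assuming Spark 3.1, hence Delta 1.0.0):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)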
I have an example of using Delta tables in unit tests (please note that the import statement is inside the function definition, because the Delta package is loaded dynamically):
import pyspark
import pyspark.sql
import pytest
import shutil
from pyspark.sql import SparkSession

delta_dir_name = "/tmp/delta-table"

@pytest.fixture
def delta_setup(spark_session):
    data = spark_session.range(0, 5)
    data.write.format("delta").save(delta_dir_name)
    yield data
    shutil.rmtree(delta_dir_name, ignore_errors=True)

def test_delta(spark_session, delta_setup):
    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark_session, delta_dir_name)
    hist = deltaTable.history()
    assert hist.count() == 1
The environment is initialized via pytest-spark, which provides the spark_session fixture used above:
[pytest]
filterwarnings =
    ignore::DeprecationWarning
spark_options =
    spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
    spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.jars.packages: io.delta:delta-core_2.12:1.0.0
    spark.sql.catalogImplementation: in-memory
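If you would rather not depend on pytest-spark, a hand-rolled spark_session fixture with the same options could look roughly like this sketch (placed in conftest.py; the local[2] master is just an assumption):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # Build a local SparkSession with the same Delta options as the pytest-spark config
    spark = (
        SparkSession.builder.master("local[2]")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.sql.catalogImplementation", "in-memory")
        .getOrCreate()
    )
    yield spark
    spark.stop()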

Arguments error while calling an AWS Glue Pythonshell job from boto3

Based on the previous post, I have an AWS Glue Pythonshell job that needs to retrieve some information from the arguments that are passed to it through a boto3 call.
My Glue job name is test_metrics
The Glue pythonshell code looks like below
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args['s3_target_path_value'])
The boto3 code that calls this job is below:
import boto3

glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'
    }
)
print(response)
I see a 200 response after I run the boto3 code on my local machine, but the Glue error log tells me:
test_metrics.py: error: the following arguments are required: --test_metrics
What am I missing?
Which job are you trying to launch: a Spark job or a Python shell job?
For a Spark job, JOB_NAME is a mandatory parameter. For a Python shell job, it is not needed at all.
So in your Python shell job, replace
args = getResolvedOptions(sys.argv,
                          ['test_metrics',
                           's3_target_path_key',
                           's3_target_path_value'])
with
args = getResolvedOptions(sys.argv,
                          ['s3_target_path_key',
                           's3_target_path_value'])
It seems like the documentation is somewhat broken.
I had to update the boto3 code as below to make it work:
glue = boto3.client('glue')
response = glue.start_job_run(
    JobName='test_metrics',
    Arguments={
        '--test_metrics': 'test_metrics',
        '--s3_target_path_key': 's3://my_target',
        '--s3_target_path_value': 's3://my_target_value'})
We can get the Glue job name in a Python shell job from sys.argv, as in the sketch below.
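A minimal sketch that just inspects what Glue passes to the Python shell job (the exact argument names depend on how the job run was started):

import sys

# Print everything Glue passed on the command line; the job name and any
# --key value pairs supplied via start_job_run show up here.
print(sys.argv)

# Rough way to turn alternating "--key value" pairs into a dict
# (assumes the arguments really do alternate, which may not always hold).
args = dict(zip(sys.argv[1::2], sys.argv[2::2]))
print(args)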

EMR Spark error: value couchbase is not a member of org.apache.spark.sql.DataFrameReader

I tried to connect my Couchbase server to EMR Spark 1.4.1, but encountered the following error when running:
val airlines = sqlContext.read.couchbase(schemaFilter = org.apache.spark.sql.sources.EqualTo("type", "airline"))
<console>:24: error: value couchbase is not a member of org.apache.spark.sql.DataFrameReader
These are all the commands that executed successfully before the failing one:
spark-shell --packages com.couchbase.client:spark-connector_2.10:1.0.0
import org.apache.spark.{SparkContext, SparkConf}
val sc = new SparkContext(new SparkConf().setAppName("test").set("com.couchbase.bucket.travel-sample", ""))
val cfg = new SparkConf().setAppName("keyValueExample").setMaster("local[*]").set("com.couchbase.bucket.travel-sample", "")
import org.apache.spark.sql.SQLContext
val sql = new SQLContext(sc)
import com.couchbase.spark._
Do I need to configure anything more? Since I'm using AWS EMR, I assumed I don't have to modify the .sbt file. I think I have already imported the package, either by specifying it when connecting to spark-shell or in command 7.
The documentation says you have to import the following:
scala> import com.couchbase.spark._
import com.couchbase.spark._
scala> import com.couchbase.spark.sql._
import com.couchbase.spark.sql._
Full doc is available here: http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-shell.html

PyCharm can't find 'SPARK_HOME' when imported from a different file

I have two files.
test.py
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext

class Connection():
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("Remote_Spark_Program - Leschi Plans")
    conf.set('spark.executor.instances', 1)
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    print('all done.')

con = Connection()
test_test.py
from test import Connection
sparkConnect = Connection()
When I run test.py the connection is made successfully, but with test_test.py it gives:
raise KeyError(key)
KeyError: 'SPARK_HOME'
A KeyError arises if SPARK_HOME is not found or is invalid. So it's better to add it to your .bashrc, and to check and reload it in your code. Add this at the top of your test.py:
import os
import sys
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext

# Create a variable for our root path
SPARK_HOME = os.environ.get('SPARK_HOME', None)

# Add the PySpark/py4j libraries to the Python path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
Also add this at the end of your ~/.bashrc file (COMMAND: vim ~/.bashrc if you are using any Linux-based OS):
# needed for Apache Spark
export SPARK_HOME="/opt/spark"
export IPYTHON="1"
export PYSPARK_PYTHON="/usr/bin/python3"
export PYSPARK_DRIVER_PYTHON="ipython3"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
export CLASSPATH="$CLASSPATH:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar"
Note:
In the above bashrc code I have set SPARK_HOME to /opt/spark; use the location where you keep your Spark folder (the one downloaded from the website).
Also, I'm using python3; you can change it to python in the bashrc if you are using a Python 2.x version.
I was using IPython for easy testing at runtime, e.g. loading the data once and testing the code many times. If you are using a plain old text editor, let me know and I will update the bashrc accordingly.