Failed to find data source: delta in Python environment - unit-testing

Following: https://docs.delta.io/latest/quick-start.html#python
I have installed delta-spark and run:
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = spark = configure_spark_with_delta_pip(builder).getOrCreate()
However when I run:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
the error states: delta not recognised
& if I run
DeltaTable.isDeltaTable(spark, "packages/tests/streaming/data")
It states: TypeError: 'JavaPackage' object is not callable
It seemed that I could run these commands locally (such as unit tests) without Maven or running it in a pyspark shell? It would be good to just see if I am missing a dependency?

You can just install delta-spark PyPi package using pip install delta-spark (it will pull pyspark as well), and then refer to it.
Or you can add a configuration option that will fetch Delta package. It's .config("spark.jars.packages", "io.delta:delta-core_2.12:<delta-version>"). For Spark 3.1 Delta versions is 1.0.0 (see releases mapping docs for more information).
I have an example of using Delta tables in unit tests (please note, that import statement is in the function definition because Delta package is loaded dynamically):
import pyspark
import pyspark.sql
import pytest
import shutil
from pyspark.sql import SparkSession
delta_dir_name = "/tmp/delta-table"
#pytest.fixture
def delta_setup(spark_session):
data = spark_session.range(0, 5)
data.write.format("delta").save(delta_dir_name)
yield data
shutil.rmtree(delta_dir_name, ignore_errors=True)
def test_delta(spark_session, delta_setup):
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark_session, delta_dir_name)
hist = deltaTable.history()
assert hist.count() == 1
environment is initialized via pytest-spark:
[pytest]
filterwarnings =
ignore::DeprecationWarning
spark_options =
spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.jars.packages: io.delta:delta-core_2.12:1.0.0
spark.sql.catalogImplementation: in-memory

Related

how to use sagemaker inside pyspark

I have a simple requirement, I need to run sagemaker prediction inside a spark job
am trying to run the below
ENDPOINT_NAME = "MY-ENDPOINT_NAME"
from sagemaker_pyspark import SageMakerModel
from sagemaker_pyspark import EndpointCreationPolicy
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer
attachedModel = SageMakerModel(
existingEndpointName=ENDPOINT_NAME,
endpointCreationPolicy=EndpointCreationPolicy.DO_NOT_CREATE,
endpointInstanceType=None, # Required
endpointInitialInstanceCount=None, # Required
requestRowSerializer=ProtobufRequestRowSerializer(
featuresColumnName="featureCol"
), # Optional: already default value
responseRowDeserializer= ProtobufResponseRowDeserializer(schema=ouput_schema),
)
transformedData2 = attachedModel.transform(df)
transformedData2.show()
I get the following error TypeError: 'JavaPackage' object is not callable
this was solved by ...
classpath = ":".join(sagemaker_pyspark.classpath_jars())
conf = SparkConf() \
.set("spark.driver.extraClassPath", classpath)
sc = SparkContext(conf=conf)

ImportError with nltk_train

I am trying to get my ways around with nltk-trainer (https://github.com/japerk/nltk-trainer). I managed to train Dutch taggers and chunkers with the commands (directly in Anaconda console):
python train_tagger.py conll2002 --fileids ned.train --classifier IIS --filename ~/nltk_data/taggers/conll2002_ned_IIS.pickle
python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
Then I run a little script to test the tagger and the chunker:
import nltk
from nltk.corpus import conll2002
# Loading training pickles
tokenizer = nltk.data.load('tokenizers/punkt/dutch.pickle')
tagger = nltk.data.load('taggers/conll2002_ned_IIS.pickle')
chunker = nltk.data.load('chunkers/conll2002_ned_NaiveBayes.pickle')
# Testing
test_sents = conll2002.tagged_sents(fileids="ned.testb")[0:1000]
print "tagger accuracy on test-set: " + str(tagger.evaluate(test_sents))
test_sents = conll2002.chunked_sents(fileids="ned.testb")[0:1000]
print "tagger accuracy on test-set: " + str(chunker.evaluate(test_sents))
This works fine from the nltk-trainer-master folder, but when I move the script elsewhere, I get an import error:
ImportError: No module named nltk_trainer.chunking.chunkers
How can I make this work outside the nltk-trainer-master folder, without copying the nltk_trainer folder?
(Python 2.7, nltk 3.2.1)

EMRSpark Erorr:value couchbase is not a member of org.apache.spark.sql.DataFrameReader

I tried to connect my couchBase server to EMR Spark 1.4.1, while encountered the
val airlines = sqlContext.read.couchbase(schemaFilter = org.apache.spark.sql.sources.EqualTo("type", "airline"))
<console>:24: error: value couchbase is not a member of org.apache.spark.sql.DataFrameReader
Those are all commands executed successfully before that error command:
spark-shell --packages com.couchbase.client:spark-connector_2.10:1.0.0
import org.apache.spark.{SparkContext, SparkConf}
val sc = new SparkContext(new SparkConf().setAppName("test").set("com.couchbase.bucket.travel-sample", ""))
val cfg = new SparkConf().setAppName("keyValueExample").setMaster("local[*]").set("com.couchbase.bucket.travel-sample", "")
import org.apache.spark.sql.SQLContext
val sql = new SQLContext(sc)
import com.couchbase.spark._
Do I need to configure anything more? Since I'm using AWS EMR, I assumed that I don't have to modify the .sbt file? I think I have already imported the package either while specifying when connecting to spark-shell, or in line(command) 7?
Documentation says you have to import the following:
scala> import com.couchbase.spark._
import com.couchbase.spark._
scala> import com.couchbase.spark.sql._
import com.couchbase.spark.sql._
Full doc is available here: http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-shell.html

PyCharm can't find 'SPARK_HOME' when imported from a different file

I've two files.
test.py
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext
class Connection():
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("Remote_Spark_Program - Leschi Plans")
conf.set('spark.executor.instances', 1)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print ('all done.')
con = Connection()
test_test.py
from test import Connection
sparkConnect = Connection()
when I run test.py the connection is made successfully but with test_test.py it gives
raise KeyError(key)
KeyError: 'SPARK_HOME'
KEY_ERROR arises if the SPARK_HOME is not found or invalid. So it's better to add it to your bashrc and check and reload in your code. So add this at the top of your test.py
import os
import sys
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext
# Create a variable for our root path
SPARK_HOME = os.environ.get('SPARK_HOME',None)
# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
Also add this at the end of your ~/.bashrc file
COMMAND: vim ~/.bashrc if you are using any Linux based OS
# needed for Apache Spark
export SPARK_HOME="/opt/spark"
export IPYTHON="1"
export PYSPARK_PYTHON="/usr/bin/python3"
export PYSPARK_DRIVER_PYTHON="ipython3"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
export CLASSPATH="$CLASSPATH:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
Note:
In the above bashrc code, I have given my SPARK_HOME value as /opt/spark you can give the location where you keep your spark folder(the downloaded one from the website).
Also I'm using python3 you can change it to python in the bashrc if you are using python 2.+ versions
I was using Ipython, for easy testing during runtime, like load the data once and test your code many times. If you are using plain old text editor, let me know I will update the bashrc accordingly.

Add method imports to shell_plus

In shell_plus, is there a way to automatically import selected helper methods, like the models are?
I often open the shell to type:
proj = Project.objects.get(project_id="asdf")
I want to replace that with:
proj = getproj("asdf")
Found it in the docs. Quoted from there:
Additional Imports
In addition to importing the models you can specify other items to
import by default. These are specified in SHELL_PLUS_PRE_IMPORTS and
SHELL_PLUS_POST_IMPORTS. The former is imported before any other
imports (such as the default models import) and the latter is imported
after any other imports. Both have similar syntax. So in your
settings.py file:
SHELL_PLUS_PRE_IMPORTS = (
('module.submodule1', ('class1', 'function2')),
('module.submodule2', 'function3'),
('module.submodule3', '*'),
'module.submodule4'
)
The above example would directly translate to the following python
code which would be executed before the automatic imports:
from module.submodule1 import class1, function2
from module.submodule2 import function3
from module.submodule3 import *
import module.submodule4
These symbols will be available as soon as the shell starts.
ok, two ways:
1) using PYTHONSTARTUP variable (see this Docs)
#in some file. (here, I'll call it "~/path/to/foo.py"
def getproj(p_od):
#I'm importing here because this script run in any python shell session
from some_app.models import Project
return Project.objects.get(project_id="asdf")
#in your .bashrc
export PYTHONSTARTUP="~/path/to/foo.py"
2) using ipython startup (my favourite) (See this Docs,this issue and this Docs ):
$ pip install ipython
$ ipython profile create
# put the foo.py script in your profile_default/startup directory.
# django run ipython if it's installed.
$ django-admin.py shell_plus