Package GraphFrames with Spark 2.0 - apache-spark-2.0

I have Spark 2.0 with Scala 2.11.8 and I am trying to include the GraphFrames package.
I launched the Scala shell with:
spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
But I still got the following error:
scala> import org.graphframes._
<console>:23: error: object graphframes is not a member of package org
import org.graphframes._
^
scala> import org.graphframes.GrahFrame
<console>:23: error: object graphframes is not a member of package org
import org.graphframes.GrahFrame

You need to use a GraphFrames build that matches your Spark and Scala versions, for example:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
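If you use the Python shell instead, the same --packages coordinate applies. A minimal sketch, assuming the 0.5.0 build for Spark 2.x / Scala 2.11 and some hypothetical toy DataFrames:

# launched with: pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
from graphframes import GraphFrame

# hypothetical toy data: vertices need an "id" column, edges need "src" and "dst"
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
edges = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()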

Related

Why do I get AttributeError: module 'google_cloud_pipeline_components.aiplatform' has no attribute 'EndpointDeleteOp' in GCP?

My code is:
from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip
... (in a pipeline definition:)
delete_endpoint_op = gcc_aip.EndpointDeleteOp(some_condition)
and when I compile the pipeline I get:
AttributeError: module 'google_cloud_pipeline_components.aiplatform' has no attribute 'EndpointDeleteOp'
but the documentation says it exists. Could it be that I am using the wrong version, and, if so, how do I check and fix it? TIA!
Yes, it is a version issue. Check the installed version:
!python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
and update it:
!pip3 install google-cloud-pipeline-components --upgrade
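As a quick sanity check after upgrading, you can confirm that the installed version actually exposes the operator (a minimal sketch using only the names already shown above):

import google_cloud_pipeline_components as gcpc
from google_cloud_pipeline_components import aiplatform as gcc_aip

# print the installed version and whether the operator is now available
print("google_cloud_pipeline_components version:", gcpc.__version__)
print("EndpointDeleteOp available:", hasattr(gcc_aip, "EndpointDeleteOp"))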

Spark job submission error on EMR: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3

I submit Spark jobs to EMR via AWS's managed Airflow service (MWAA). The jobs were running fine with MWAA version 1.10.12. Recently, AWS released a newer MWAA version, 2.0.2. I created a new environment with this version and tried submitting the same job to EMR, but it failed with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at org.apache.hadoop.fs.Path.initialize(Path.java:263)
at org.apache.hadoop.fs.Path.<init>(Path.java:221)
at org.apache.hadoop.fs.Path.<init>(Path.java:129)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2096)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2078)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:192)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$5(SparkSubmit.scala:364)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:902)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at java.net.URI$Parser.fail(URI.java:2847)
at java.net.URI$Parser.failExpecting(URI.java:2853)
at java.net.URI$Parser.parse(URI.java:3056)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:260)
... 27 more
Command exiting with ret '1'
The spark-submit command looks like this:
spark-submit --deploy-mode cluster
--master yarn
--queue low
--jars s3://bucket-name/jars/mysql-connector-java-8.0.20.jar,s3://bucket-name/jars/postgresql-42.1.1.jar,s3://bucket-name/jars/delta-core_2.12-1.0.0.jar
--py-files s3://bucket-name/dependencies/spark.py, s3://bucket-name/dependencies/helper_functions.py
--files s3://bucket-name/configs/spark_config.json
s3://bucket-name/jobs/data_processor.py [command-line-args]
The job submission failed in under 10 seconds. Hence, the YARN application ID did not get created.
What I tried to resolve the error:
I added the Amazon-related packages to requirements.txt:
apache-airflow[mysql]==2.0.2
pycryptodome==3.9.9
apache-airflow-providers-amazon==1.3.0
I changed the import statements from:
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
to
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow import DAG
from airflow.models import Variable
I also tried changing the URI scheme to s3n and s3a.
I checked the official documentation and blogs on MWAA as well as Airflow 2.0.2 and made the changes above, but nothing has worked so far. Any help in resolving this error would be appreciated. Thanks in advance.
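For context, this is a minimal sketch of how such a step is typically wired up with the Airflow 2.x provider operators listed above (the DAG id, task ids and step definition are hypothetical, not the asker's actual DAG); note that EMR expects the spark-submit arguments as a list of individual strings:

from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator

# hypothetical step definition: every spark-submit argument is its own list element
SPARK_STEP = [
    {
        "Name": "data_processor",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                "--py-files",
                "s3://bucket-name/dependencies/spark.py,s3://bucket-name/dependencies/helper_functions.py",
                "s3://bucket-name/jobs/data_processor.py",
            ],
        },
    }
]

with DAG(dag_id="emr_spark_job", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
        aws_conn_id="aws_default",
        steps=SPARK_STEP,
    )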

Failed to find data source: delta in Python environment

Following https://docs.delta.io/latest/quick-start.html#python,
I have installed delta-spark and run:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
However when I run:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
the error states: delta not recognised
and if I run:
DeltaTable.isDeltaTable(spark, "packages/tests/streaming/data")
it states: TypeError: 'JavaPackage' object is not callable
I thought I could run these commands locally (for example in unit tests) without Maven or running them in a pyspark shell. It would be good to know if I am just missing a dependency.
You can just install the delta-spark PyPI package using pip install delta-spark (it will pull in pyspark as well) and then refer to it.
Or you can add a configuration option that will fetch the Delta package: .config("spark.jars.packages", "io.delta:delta-core_2.12:<delta-version>"). For Spark 3.1 the Delta version is 1.0.0 (see the releases mapping in the docs for more information).
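For example, a locally built session using that option might look like this (a minimal sketch, assuming Spark 3.1 and therefore Delta 1.0.0 as noted above):

from pyspark.sql import SparkSession

# fetch the Delta package at session start instead of pip-installing delta-spark
spark = (
    SparkSession.builder
    .appName("delta-local")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.range(0, 5).write.format("delta").save("/tmp/delta-table")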
I have an example of using Delta tables in unit tests (note that the import statement is inside the function definition because the Delta package is loaded dynamically):
import pyspark
import pyspark.sql
import pytest
import shutil
from pyspark.sql import SparkSession

delta_dir_name = "/tmp/delta-table"

@pytest.fixture
def delta_setup(spark_session):
    data = spark_session.range(0, 5)
    data.write.format("delta").save(delta_dir_name)
    yield data
    shutil.rmtree(delta_dir_name, ignore_errors=True)

def test_delta(spark_session, delta_setup):
    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark_session, delta_dir_name)
    hist = deltaTable.history()
    assert hist.count() == 1
The environment is initialized via pytest-spark:
[pytest]
filterwarnings =
    ignore::DeprecationWarning
spark_options =
    spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
    spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.jars.packages: io.delta:delta-core_2.12:1.0.0
    spark.sql.catalogImplementation: in-memory

EMR Spark error: value couchbase is not a member of org.apache.spark.sql.DataFrameReader

I tried to connect my Couchbase server to EMR Spark 1.4.1, but encountered the following error:
val airlines = sqlContext.read.couchbase(schemaFilter = org.apache.spark.sql.sources.EqualTo("type", "airline"))
<console>:24: error: value couchbase is not a member of org.apache.spark.sql.DataFrameReader
These are all the commands that executed successfully before the failing one:
spark-shell --packages com.couchbase.client:spark-connector_2.10:1.0.0
import org.apache.spark.{SparkContext, SparkConf}
val sc = new SparkContext(new SparkConf().setAppName("test").set("com.couchbase.bucket.travel-sample", ""))
val cfg = new SparkConf().setAppName("keyValueExample").setMaster("local[*]").set("com.couchbase.bucket.travel-sample", "")
import org.apache.spark.sql.SQLContext
val sql = new SQLContext(sc)
import com.couchbase.spark._
Do I need to configure anything more? Since I'm using AWS EMR, I assumed I don't have to modify the .sbt file. I think I have already imported the package, both by specifying it when launching spark-shell and in command 7?
The documentation says you have to import the following:
scala> import com.couchbase.spark._
import com.couchbase.spark._
scala> import com.couchbase.spark.sql._
import com.couchbase.spark.sql._
Full doc is available here: http://developer.couchbase.com/documentation/server/current/connectors/spark-1.0/spark-shell.html

How to import pygmaps in Python 2.7

I want to plot my GPS data on Google Maps using Python. I searched and found that pygmaps does the job, but I'm getting an error when I import this module; it says "'module' object has no attribute 'maps'". I've saved the file in the libs folder of Python27 on the C drive. What should I do now?
import pygmaps
mymap = pygmaps.maps(18.458184, 73.850781, 14)
mymap.addpoint(18.458184, 73.850781, '#0000FF')
mymap.draw('./mymap.html')
As I understand it, your code should be like the following:
import pygmaps
mymap = pygmaps.pygmaps(18.458184, 73.850781, 14)
mymap.addpoint(18.458184, 73.850781, '#0000FF')
mymap.draw('./mymap.html')