Run spark unit test on Windows - unit-testing

I'm trying to run some transformation on Spark, it works fine on cluster (YARN, linux machines).
However, when I'm trying to run it on local machine (Windows 7) under unit test, I got errors:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
My code is following:
#Test
def testETL() = {
val conf = new SparkConf()
val sc = new SparkContext("local", "test", conf)
try {
val etl = new IxtoolsDailyAgg() // empty constructor
val data = sc.parallelize(List("in1", "in2", "in3"))
etl.etl(data) // rdd transformation, no access to SparkContext or Hadoop
Assert.assertTrue(true)
} finally {
if(sc != null)
sc.stop()
}
}
Why is it trying to access hadoop at all? and how can I fix it?
Thank you in advance

I've solved this issue on my own http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html

Related

Spark-BigTable - HBase client not closed in Pyspark?

I'm trying to execute a Pyspark statement that writes to BigTable within a Python for loop, which leads to the following error (job submitted using Dataproc). Any client not properly closed (as suggested here) and if yes, any way to do so in Pyspark ?
Note that manually re-executing the script each time with a new Dataproc job works fine, so the job itself is correct.
Thanks for your support !
Pyspark script
from pyspark import SparkContext
from pyspark.sql import SQLContext
import json
sc = SparkContext()
sqlc = SQLContext(sc)
def create_df(n_start,n_stop):
# Data
row_1 = ['a']+['{}'.format(i) for i in range(n_start,n_stop)]
row_2 = ['b']+['{}'.format(i) for i in range(n_start,n_stop)]
# Spark schema
ls = [row_1,row_2]
schema = ['col0'] + ['col{}'.format(i) for i in range(n_start,n_stop)]
# Catalog
first_col = {"col0":{"cf":"rowkey", "col":"key", "type":"string"}}
other_cols = {"col{}".format(i):{"cf":"cf", "col":"col{}".format(i), "type":"string"} for i in range(n_start,n_stop)}
first_col.update(other_cols)
columns = first_col
d_catalogue = {}
d_catalogue["table"] = {"namespace":"default", "name":"testtable"}
d_catalogue["rowkey"] = "key"
d_catalogue["columns"] = columns
catalog = json.dumps(d_catalogue)
# Dataframe
df = sc.parallelize(ls, numSlices=1000).toDF(schema=schema)
return df,catalog
for i in range(0,2):
N_step = 100
N_start = 1
N_stop = N_start+N_step
data_source_format = "org.apache.spark.sql.execution.datasources.hbase"
df,catalog = create_df(N_start,N_stop)
df.write\
.options(catalog=catalog,newTable= "5")\
.format(data_source_format)\
.save()
N_start += N_step
N_stop += N_step
Dataproc job
gcloud dataproc jobs submit pyspark <my_script>.py \
--cluster $SPARK_CLUSTER \
--jars <path_to_jar>/bigtable-dataproc-spark-shc-assembly-0.1.jar \
--region=us-east1
Error
...
ERROR com.google.bigtable.repackaged.io.grpc.internal.ManagedChannelOrphanWrapper: *~*~*~ Channel ManagedChannelImpl{logId=41, target=bigtable.googleapis.com:443} was not shutdown properly!!! ~*~*~*
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
...
If you are not using the latest version, try updating to it. It looks similar to this issue that was fixed recently. I would imagine the error message still showing up, but the job now finishing means that the support team is still working on it and hopefully they will fix it in the next release.

InvalidQueryException: Consistency level LOCAL_ONE is not supported for this operation. Supported consistency levels are: LOCAL_QUORUM

import org.apache.spark._
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("XXXX")
.set("spark.cassandra.connection.host" ,"cassandra.us-east-2.amazonaws.com")
.set("spark.cassandra.connection.port", "9142")
.set("spark.cassandra.auth.username", "XXXXX")
.set("spark.cassandra.auth.password", "XXXXX")
.set("spark.cassandra.connection.ssl.enabled", "true")
.set("spark.cassandra.connection.ssl.trustStore.path", "/home/nihad/.cassandra/cassandra_truststore.jks")
.set("spark.cassandra.connection.ssl.trustStore.password", "XXXXX")
.set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
val connector = CassandraConnector(conf)
val session = connector.openSession()
sesssion.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
session.close()
i amn trying to write data to AWS Cassandra Keyspace using Spark App set in my local system.
Problem is when i execute above code, I get Exception like below:
"com.datastax.oss.driver.api.core.servererrors.InvalidQueryException:
Consistency level LOCAL_ONE is not supported for this operation.
Supported consistency levels are: LOCAL_QUORUM"
As you can see from the above code I have already set cassandra.output.consistency.level as LOCAL_QUORUM in Spark Conf. Also I am using datastax cassandra driver.
But when I read data from AWS Cassandra, it works fine. Also I tried same INSERT command in AWS Keyspace cqlsh. It is working fine there too. So Query is valid.
Can someone help me how to set consistency via datastax.CassandraConnector?
Cracked it.
Instead of setting cassandra consistency via spark config. I created an application.conf file in src/main/resources directory.
datastax-java-driver {
basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142"]
advanced.auth-provider{
class = PlainTextAuthProvider
username = "serviceUserName"
password = "servicePassword"
}
basic.load-balancing-policy {
local-datacenter = "us-east-2"
}
advanced.ssl-engine-factory {
class = DefaultSslEngineFactory
truststore-path = "yourPath/.cassandra/cassandra_truststore.jks"
truststore-password = "trustorePassword"
}
basic.request.consistency = LOCAL_QUORUM
basic.request.timeout = 5 seconds
}
and created cassandra session like below
import com.datastax.oss.driver.api.core.config.DriverConfigLoader
import com.datastax.oss.driver.api.core.CqlSession
val loader = DriverConfigLoader.fromClassPath("application.conf")
val session = CqlSession.builder().withConfigLoader(loader).build()
sesssion.execute("""INSERT INTO "covid19".delta_by_states (state_code, state_value, date ) VALUES ('kl', 5, '2020-03-03');""")
It finally worked. No need to mess with spark config
Doc for Driver Config https://docs.datastax.com/en/drivers/java/4.0/com/datastax/oss/driver/api/core/config/DriverConfigLoader.html#fromClasspath-java.lang.String-
datastax configuration doc https://docs.datastax.com/en/developer/java-driver/4.6/manual/core/configuration/reference/

Querying S3 from Mac

All
I am trying to connect to S3 environment from a spark installed in local mac machine and using the following commands
./bin/spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.11.271,org.apache.hadoop:hadoop-aws:3.1.1,org.apache.hadoop:hadoop-hdfs:2.7.1
This connects to scala and downloads all the libraries
Then I execute the following commands in spark shell
val accessKeyId = System.getenv("AWS_ACCESS_KEY_ID")
val secretAccessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
val hadoopConf=sc.hadoopConfigurationhadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", accessKeyId)
hadoopConf.set("fs.s3.awsSecretAccessKey", secretAccessKey)
hadoopConf.set("fs.s3n.awsAccessKeyId", accessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", secretAccessKey)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("s3a://path/1551467354353.c948f177e1fb.dev.0fd8f5fd-22d4-4523-b6bc-b68c181b4906.gz")
But I get NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities when I use S3a or S3
Any idea what I could be missing here ?

Run local DynamoDB spark job without EMR

I want to run local Dynamodb spark job without using EMR cluster,
that read data from some table and write it to parquet / CSV file.
I didn't found any spark-dynamo connector that supports that, maybe you have any ideas?
My code sample:
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession
object copyDynamoTable extends App {
val spark = SparkSession
.builder()
.appName("test")
.master("local")
.getOrCreate()
val jobConf = new JobConf(spark.sparkContext.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "hen.poc.client") // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
val orders = spark.sparkContext.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
println(orders.count)
I received the following exception:
18/09/05 17:06:41 INFO util.TaskCalculator: Cluster has 1 active nodes.
18/09/05 17:06:41 WARN util.ClusterTopologyNodeCapacityProvider: Exception when trying to determine instance types
java.nio.file.NoSuchFileException: /mnt/var/lib/info/job-flow.json
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.readJobFlowJsonString(ClusterTopologyNodeCapacityProvider.java:103)
at org.apache.hadoop.dynamodb.util.ClusterTopologyNodeCapacityProvider.getCoreNodeMemoryMB(ClusterTopologyNodeCapacityProvider.java:42)
at org.apache.hadoop.dynamodb.util.TaskCalculator.getMaxMapTasks(TaskCalculator.java:54)
at org.apache.hadoop.dynamodb.DynamoDBUtil.calcMaxMapTasks(DynamoDBUtil.java:265)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBInputFormat.getSplits(AbstractDynamoDBInputFormat.java:47)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
at com.data.spark.dynamodb.copyDynamoTable$.delayedEndpoint$com$riskified$data$spark$dynamodb$copyDynamoTable$1(copyDynamoTable.scala:30)
at com.data.spark.dynamodb.copyDynamoTable$delayedInit$body.apply(copyDynamoTable.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.data.spark.dynamodb.copyDynamoTable$.main(copyDynamoTable.scala:9)
at com.data.spark.dynamodb.copyDynamoTable.main(copyDynamoTable.scala)
Exception in thread "main" java.lang.ArithmeticException: / by zero
This is a file that is present on an EMR cluster. This is to try to determine what instance type it is running against to determine some job settings such as memory. Obviously running locally you wouldn't have this file so this is expected.
Please follow the bellow thread :
emr/github.com/issues/50

Reading .conf file from AWS s3 through spark and scala

I was able to load a text file from AWS S3 but facing a problem in reading the ".conf" file. Getting the error
"Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'spark'"
Scala code:
val configFile1 = ConfigFactory.load( "s3n://<bucket_name>/aws.conf" )
configFile1.getString("spark.lineage.key")
Here what I end up doing it, Create a wrapper utility Config.scala
import java.io.File
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.s3.{AmazonS3Client, AmazonS3URI}
import com.typesafe.config.{ConfigFactory, Config => TConfig}
import scala.io.Source
object Config {
private def read(location: String): String = {
val awsCredentials = new DefaultAWSCredentialsProviderChain()
val s3Client = new AmazonS3Client(awsCredentials)
val s3Uri = new AmazonS3URI(location)
val fullObject = s3Client.getObject(s3Uri.getBucket, s3Uri.getKey)
Source.fromInputStream(fullObject.getObjectContent).getLines.mkString("\n")
}
def apply(location: String): TConfig = {
if (location.startsWith("s3")) {
val content = read(location)
ConfigFactory.parseString(content)
} else {
ConfigFactory.parseFile(new File(location))
}
}
}
Use the created wrapper
val conf: TConfig = Config("s3://config/path")
You may use provided scope for aws-java-sdk since it will be available in the EMR cluster.
According to my research, we can only read delimiter files from AWS S3 through spark/scala. As .conf files are of = pair, its not possible.
Only way would be modify the format of data in the file.
Typesafe Config does not support loading .conf files from S3, but you can read s3 file as a string yourself and pass to typesafe config like val conf = ConfigFactory.parseString(... .conf files as string ...)