java.lang.OutOfMemoryError in EMR Spark app - amazon-web-services

Trying to scale a PySpark app on AWS EMR. I was able to get it to work for one day of data (around 8 TB), but I keep running into (what I believe are) OOM errors when trying to test it on one week of data (around 50 TB).
I set my Spark configs based on this article. Originally, I got a
java.lang.OutOfMemoryError: Java heap space in the driver stdout. From browsing online, it seemed like my spark.driver.memory was too low, so I boosted that up quite a bit. Now I am running into a different OOM error. In the driver stderr, I see something like this:
.saveAsTable('hive_tables.apf_intermediate_table')
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o12119.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:174)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:517)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:494)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:473)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:429)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Task not serializable
and in the driver stdout, I see something like this:
21/06/04 21:57:58 INFO CodeGenerator: Code generated in 52.445036 ms
21/06/04 21:57:58 INFO CodeGenerator: Code generated in 10.271344 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 7.076469 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 6.111364 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 9.17915 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 50.537577 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 9.398937 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 44.204289 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 6.294884 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 8.570691 ms
21/06/04 21:58:02 INFO CodeGenerator: Code generated in 55.276023 ms
21/06/04 21:58:02 INFO CodeGenerator: Code generated in 10.988539 ms
21/06/04 21:58:07 INFO CodeGenerator: Code generated in 284.051432 ms
21/06/04 21:58:07 ERROR Utils: uncaught error in thread spark-listener-group-eventLog, stopping SparkContext
java.lang.OutOfMemoryError
at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:351)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2933)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:107)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
at org.apache.spark.scheduler.EventLoggingListener.onOtherEvent(EventLoggingListener.scala:236)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1347)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
21/06/04 21:58:07 ERROR Utils: throw uncaught fatal error in thread spark-listener-group-eventLog
java.lang.OutOfMemoryError
with the physical plan being printed out repeatedly right after this in both logs.
I assume that, since the code has run successfully with less input data, this is still an OOM error and I can safely ignore the Task not serializable error. However, I have tried many different EMR / Spark configs and keep getting some variation of this error.
An example cluster set-up would be something like this:
Driver node: 1 M5.8Xlarge instance
Core nodes: 9 M5.8Xlarge instances (each with 32 GiB RAM, 128 vCores)
spark.dynamicAllocation.enabled=False
spark.executor.cores=5
spark.executor.memory=18g
spark.executor.memoryOverhead=2g
spark.driver.memory=44g
spark.driver.cores=5
spark.executor.instances=53
spark.default.parallelism=530
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'"
spark.driver.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'"
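For reference, a settings block like the one above can also be applied when the session is built; this is a minimal illustrative sketch only (in practice these values are usually passed via spark-submit --conf or the EMR configurations JSON, and the app name below is hypothetical):

# Illustrative only: applying the quoted settings when the SparkSession is built.
# Note that driver-side memory settings generally have to be fixed before the
# driver JVM starts, i.e. at submit time, not here.
from pyspark.sql import SparkSession

conf = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.cores": "5",
    "spark.executor.memory": "18g",
    "spark.executor.memoryOverhead": "2g",
    "spark.executor.instances": "53",
    "spark.default.parallelism": "530",
    "spark.sql.shuffle.partitions": "200",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

builder = SparkSession.builder.appName("apf_intermediate_table_job")  # hypothetical name
for key, value in conf.items():
    builder = builder.config(key, value)
spark = builder.enableHiveSupport().getOrCreate()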
Does anything stand out that could be causing this?
Spark 2.4.7
Relevant Code:
df = spark.read.parquet(SDL_PATH.format(datekeys=datekeys))
df = df.filter(F.col('request') == 1)\
.filter(F.col('pricing_rule_applied') == 0)\
.filter(F.col('deal_id') == 'company')\
.filter(F.col('multivariate_test_data').isNull() | (F.col('multivariate_test_data')=='{"passthrough":true}'))\
.select(F.col('platform_id'),
F.col('demand_id'),
F.col('publisher_id'),
'channel_id',
'market_id',
'deal_id',
F.col('auction_type'),
F.col('price_floor_gross'),
F.col('price_floor_net'),
F.col('ad_slots'),
F.col('country').alias('country_id'),
F.col('device_type'),
F.col('browser_id'),
F.col('request').alias('bid_request'),
F.when(F.col('bid'), 1).otherwise(0).alias('bid'),
F.col('max_bid').alias('max_bid'),
F.when(F.col('win'), 1).otherwise(0).alias('win'),
F.col('spend').alias('spend'),
F.col('above_floor'),
F.col('price_floor_adjusted'),
F.when((F.col('block_type') == "filter") | (F.col('block_type') == "ad_review") | (F.col('block_type') == "quarantine") | F.col('block_type').isNull(), 1).otherwise(0).alias('blocked'),
F.when((F.col('ad_timeout').isNull()) | (F.col('ad_timeout') == 0), 0).otherwise(1).alias('timeout'),
F.when((F.col('ad_opt_out').isNull()) | (F.col('ad_opt_out') == 0), 0).otherwise(1).alias('opt_out'))
df = df.join(deal, on=['deal_id', 'publisher_id'],how='left')
hashes = spark.read.table('market.server_hashes')\
.withColumn('fehash',F.substring(F.col('hash'), 1, 8))\
.select('fehash',F.substring(F.col('hostname'), 13, 5).alias('datacenter'))
df = df.withColumn('fehash',F.lower(F.substring(F.col('market_id'), -12,8)))\
.join(F.broadcast(hashes), on='fehash', how='left')
# dfWriter_sql = df.repartition(128).write.format("orc").option("orc.compress", "snappy")
# dfWriter_sql.mode('overwrite').saveAsTable('datascience.jp_auction_sim_sample_30p')
good_ids = df.filter('(ad_slots = 1) and (bid=1)').groupBy('market_id').agg(F.sum('win').alias('wins')).filter('wins <=1')
modified_floors = df.join(good_ids, on='market_id',how='inner').filter('(blocked=0) and (timeout=0) and (opt_out=0) and (bid=1)').cache()
#modified_floors = modified_floors.filter(F.col('price_floor_gross') == F.col('price_floor_adjusted'))
#print(modified_floors.count())
window = Window.partitionBy('market_id').orderBy(F.col('tier').asc(),F.col('max_bid').desc(),F.col('demand_id').asc())
check = 0
feats = ['country_id','device_type','browser_id']
features = ['publisher_id','channel_id','datacenter','country_id','device_type','browser_id']
for perc_val, val in zip(vals, str_vals):
    # val = str(int(perc_val*100))
    print(val)
    if val == '114':
        val = '115'
    sim_df = modified_floors.withColumn('price_floor_modified', F.col('price_floor_adjusted')*perc_val)\
        .withColumn('coeff', F.lit(perc_val))\
        .withColumn('row_number',F.row_number().over(window))\
        .withColumn('next_highest_temp',F.lead('max_bid', count=1, default=None).over(window))\
        .withColumn('next_highest', F.when(F.col('next_highest_temp').isNull(), F.when(F.col('max_bid') >= F.col('price_floor_modified'), F.col('price_floor_modified')).otherwise(F.lit(0))).otherwise(F.col('next_highest_temp')))\
        .withColumn('adj_win',
                    F.when((F.col('row_number') == 1) & (F.col('max_bid') >= F.col('price_floor_modified')),
                           F.lit(1)).otherwise(F.lit(0)))\
        .withColumn('adj_spend',
                    F.when(
                        F.col('adj_win') == 1,
                        F.when(
                            (F.col('auction_type') == 1) | (F.col('auction_type') == 3),
                            F.col('max_bid')
                        ).otherwise(
                            F.when(F.col('next_highest') >= F.col('price_floor_modified'),
                                   F.col('next_highest')
                            ).otherwise(
                                F.col('price_floor_modified')
                            )
                        )
                    ).otherwise(
                        F.lit(0)
                    ))\
        .withColumn('adj_above_floor', F.when(F.col('max_bid') > F.col('price_floor_modified'), 1).otherwise(0))\
        .withColumn('adj_below_floor', F.when(F.col('max_bid') < F.col('price_floor_modified'), 1).otherwise(0))
    if check == 1:
        agg_df = agg_df.union(get_full_tier(sim_df))
    else:
        agg_df = get_full_tier(sim_df)
    check = 1
agg_df = agg_df.withColumn('datekey',F.lit(int(DATEKEY))).cache()
window = Window.partitionBy('hashkey').orderBy(F.col('estimated_platform_spend').desc(),F.col('coeff').desc())
agg_df = agg_df.withColumn('hashkey',F.md5(
F.concat_ws(
'-',
F.coalesce('channel_id',F.lit('')),
F.coalesce('datacenter',F.lit('')),
F.coalesce('country_id',F.lit('')),
F.coalesce('device_type',F.lit('')),
F.coalesce('browser_id',F.lit(''))
))).withColumn('optimum',(F.row_number().over(window) == 1)
& (F.col('platform_spend_improvement_incremental') > 0)
& (F.col('coeff') != 1.0))
#agg_df.write.format("parquet").mode('overwrite').saveAsTable('datascience.apf_intermediate_table')
affiliate = spark.read.table('tableau_cpc3.affiliate')
browsers = spark.read.table('tableau_cpc3.device_browser')\
.select(F.col('device_browser_name').alias('browser_name'),F.col('device_browser_value').cast('integer').alias('browser_id'))
regions = spark.sql('''SELECT region_id as country_id,
CASE region_name WHEN 'Croatia (Hrvatska)' THEN 'Croatia'
WHEN 'Great Britain (UK)' THEN 'United Kingdom'
WHEN 'New Zealand (Aotearoa)' THEN 'New Zealand'
WHEN 'Vatican City State (Holy See)' THEN 'Vatican City'
WHEN 'Congo' THEN 'Congo (Brazzaville)'
WHEN 'Palestinian Territory' THEN 'Palestinian Territories'
WHEN 'S. Georgia and S. Sandwich Islands' THEN 'South Georgia and the South Sandwich Islands'
ELSE region_name
END AS country
FROM tableau_cpc3.region''')
tier_check = 0
to_write = agg_df.join(F.broadcast(affiliate).select(F.col('name').alias('publisher_name'), F.col('affiliate_id').alias('publisher_id')), on='publisher_id', how='left')
to_write = to_write.join(F.broadcast(affiliate).select(F.col('name').alias('channel_name'), F.col('affiliate_id').alias('channel_id')), on='channel_id', how='left')
to_write = to_write.join(F.broadcast(regions), on='country_id', how='left').withColumn('date',F.col('datekey').cast('string'))
to_write = to_write.join(F.broadcast(browsers), on='browser_id', how='left')
to_write = to_write.select('publisher_id',
'publisher_name',
'channel_id',
'channel_name',
'datacenter',
'country_id',
'country',
'device_type',
F.expr('''CASE device_type WHEN 0 THEN 'Unknown'
WHEN 1 THEN 'Mobile'
WHEN 2 THEN 'Desktop'
WHEN 3 THEN 'CTV'
WHEN 4 THEN 'Phone'
WHEN 5 THEN 'Tablet'
WHEN 6 THEN 'Connected Device'
WHEN 7 THEN 'STB' ELSE 'Unknown' END''').alias('device_type_name'),
F.col('browser_id').cast('integer').alias('browser_id'),
'browser_name',
'coeff',
'auctions',
'bids_returned',
'channel_floor',
'unadjusted_platform_spend',
'unadjusted_count_win',
'unadjusted_count_above_floor',
'unadjusted_count_below_floor',
'estimated_platform_spend',
'estimated_count_win',
'estimated_count_above_floor',
'estimated_count_below_floor',
'competitive_rate',
'datekey',
'optimum',
'platform_spend_improvement_percent',
'platform_spend_improvement_incremental',
'date',
F.md5(
F.concat_ws(
'-',
F.coalesce('channel_id',F.lit('')),
F.coalesce('datacenter',F.lit('')),
F.coalesce('country_id',F.lit('')),
F.coalesce('device_type',F.lit('')),
F.coalesce('browser_id',F.lit(''))
)
).alias('hashkey')).distinct()
to_write.write\
.format("parquet")\
.partitionBy('coeff','datacenter')\
.mode('overwrite')\
.option("path", APF_INTERMEDIATE_TABLE_PATH)\
.saveAsTable('hive_tables.apf_intermediate_table')

You are using a machine type that doesn't fit your configuration. You give each executor:
spark.executor.cores=5
spark.executor.memory=18g
spark.executor.memoryOverhead=2g
spark.executor.instances=53
In effect you are saying "for each executor I want 20G of memory", while your machines have 32G x 0.85 = 27.2G available to YARN, so you can place only one executor on a single machine (with your current memory settings). That executor will use 5 cores, and 1 core is required for Spark, so per machine you leave 128 - 1 - 5 = 122 cores and 27.2G - 20G = 7.2G of RAM unused.
Effectively, with your settings, you have 9 nodes but you can run only 9 executors (1 executor per node). I would recommend using machines with more memory and fewer cores.
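To make the arithmetic in this answer concrete, here is a small sketch that reproduces it with the numbers quoted above (the 0.85 factor stands for the share of node memory assumed to be available to YARN; all figures are the answer's stated assumptions, not measured values):

# Back-of-the-envelope executor packing, using the numbers from the answer above.
node_memory_gb = 32          # memory per core node as stated in the question
yarn_fraction = 0.85         # assumed fraction of node memory YARN can hand out
executor_memory_gb = 18 + 2  # spark.executor.memory + spark.executor.memoryOverhead
node_cores = 128             # vCores per node as stated in the question
executor_cores = 5           # spark.executor.cores
nodes = 9

yarn_memory_gb = node_memory_gb * yarn_fraction                   # 27.2 GB usable per node
executors_per_node = int(yarn_memory_gb // executor_memory_gb)    # only 1 executor fits
print(executors_per_node * nodes)                                 # 9 executors total, not 53
print(node_cores - 1 - executors_per_node * executor_cores)       # ~122 cores idle per node
print(yarn_memory_gb - executors_per_node * executor_memory_gb)   # ~7.2 GB idle per node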

Related

Apache Spark Random Forest Classification on AWS EMR fails and causes errors: Allocation Failure, HeartbeatReceiver, YarnClusterScheduler, etc

I recently got into Apache Spark on AWS. I have a dataset with 10 columns and 7 million rows, but I cannot use the whole set because Spark cannot handle it. When I take more than 1.5 million lines it crashes on a single r4.16xlarge instance with 488 GB RAM with insufficient-memory errors (which I can confirm by observing it with top; the memory consumption rises up to 100%). But when I try to run it on a whole cluster with significantly more memory (4 x 488 = 1952 GB) it also fails. I am using the following parameters for starting the EMR step:
spark-submit --deploy-mode cluster --class randomforest --driver-memory 400G --num-executors 4 --executor-cores 8 --executor-memory 400g s3://spark-cluster-demo/randomforest_2.11-1.0.jar
This is the Scala script inside the jar I am executing:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
object randomforest {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Spark Cluster Test")
val sc = new SparkContext(sparkConf)
val spark = SparkSession.builder().appName("Spark Cluster Test").getOrCreate
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("s3://spark-cluster-demo/2004_4000000.txt")
val maxCategories = 512
val numTrees = 64
val featureSubsetStrategy = "auto" // supported featureSubsetStrategy settings: auto, all, onethird, sqrt, log2
val impurity = "gini"
val maxDepth = 30
val maxBins = 2048
val maxMemoryInMB = 409600
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > X distinct values are treated as continuous.
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(maxCategories).fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setNumTrees(numTrees).setFeatureSubsetStrategy(featureSubsetStrategy).setImpurity(impurity).setMaxDepth(maxDepth).setMaxBins(maxBins).setMaxMemoryInMB(maxMemoryInMB)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val t0 = System.nanoTime()
val model = pipeline.fit(trainingData)
val t1 = System.nanoTime()
println("Training time: " + (t1 - t0) + " ns")
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
}
}
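Since the main question on this page is PySpark, here is a rough Python sketch of the same pipeline (Spark 2.x ML API, mirroring the parameter values from the Scala code above; illustrative only, not a tested or recommended configuration):

# Approximate PySpark equivalent of the Scala RandomForest pipeline above.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer

spark = SparkSession.builder.appName("Spark Cluster Test").getOrCreate()
data = spark.read.format("libsvm").load("s3://spark-cluster-demo/2004_4000000.txt")

# Index labels and categorical features, as in the Scala version.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                                maxCategories=512).fit(data)
train, test = data.randomSplit([0.7, 0.3])

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures",
                            numTrees=64, featureSubsetStrategy="auto", impurity="gini",
                            maxDepth=30, maxBins=2048, maxMemoryInMB=409600)
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                                labels=label_indexer.labels)

model = Pipeline(stages=[label_indexer, feature_indexer, rf, label_converter]).fit(train)
model.transform(test).select("predictedLabel", "label", "features").show(5)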
The whole log is too big to post here. I get various different errors and failures, which I will list here. All of them occur dozens or hundreds of times all over the place:
2018-11-26T09:38:35.392+0000: [GC (Allocation Failure) 2018-11-26T09:38:35.392+0000: [ParNew: 559232K->36334K(629120K), 0.0217574 secs] 559232K->36334K(2027264K), 0.0218269 secs] [Times: user=0.28 sys=0.01, real=0.02 secs]
18/11/26 10:24:37 WARN TransportChannelHandler: Exception in connection from ip-172-31-17-60.ec2.internal/172.31.17.60:45589
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
18/11/26 10:24:37 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-17-60.ec2.internal/172.31.17.60:45589 is closed
18/11/26 10:24:37 ERROR OneForOneBlockFetcher: Failed while starting block fetches
18/11/26 10:24:42 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
18/11/26 10:22:29 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 141885 ms exceeds timeout 120000 ms
18/11/26 10:22:29 ERROR YarnClusterScheduler: Lost executor 2 on ip-172-31-26-87.ec2.internal: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 16.0 in stage 47.0 (TID 1079, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 4.0 in stage 47.0 (TID 1067, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 12.0 in stage 47.0 (TID 1075, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 0.0 in stage 47.0 (TID 1063, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 20.0 in stage 47.0 (TID 1083, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 8.0 in stage 47.0 (TID 1071, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 INFO TaskSetManager: Starting task 8.1 in stage 47.0 (TID 1084, ip-172-31-26-87.ec2.internal, executor 2, partition 8, PROCESS_LOCAL, 8550 bytes)
18/11/26 10:22:29 INFO TaskSetManager: Starting task 20.1 in stage 47.0 (TID 1085, ip-172-31-26-87.ec2.internal, executor 2, partition 20, PROCESS_LOCAL, 8550 bytes)
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_2 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_10 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_6 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_18 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_22 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_1 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_14 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_5 !
18/11/26 10:40:29 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, ip-172-31-26-192.ec2.internal, 43287, None)
18/11/26 10:40:29 INFO BlockManagerMaster: Removed 8 successfully in removeExecutor
18/11/26 10:40:29 INFO DAGScheduler: Executor 8 added was in lost list earlier.
18/11/26 10:40:29 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.0 failed 4 times, most recent failure: Lost task 5.3 in stage 47.0 (TID 1117, ip-172-31-26-192.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170206 ms
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.0 failed 4 times, most recent failure: Lost task 5.3 in stage 47.0 (TID 1117, ip-172-31-26-192.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170206 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1803)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1791)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1790)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1790)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2024)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1962)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:743)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:742)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:742)
at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:563)
at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:198)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
at randomforest$.main(randomforest.scala:48)
at randomforest.main(randomforest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Is something wrong with my configuration or is it impossible to use such a big dataset for RF?

Create frozen graph from pretrained model

Hi, I am a newbie to TensorFlow. My aim is to convert a .pb file to .tflite from a pretrained model, for my own understanding. I have downloaded the mobilenet_v1_1.0_224 model. Below is the structure of the model:
mobilenet_v1_1.0_224.ckpt.data-00000-of-00001 - 66312kb
mobilenet_v1_1.0_224.ckpt.index - 20kb
mobilenet_v1_1.0_224.ckpt.meta - 3308kb
mobilenet_v1_1.0_224.tflite - 16505kb
mobilenet_v1_1.0_224_eval.pbtxt - 520kb
mobilenet_v1_1.0_224_frozen.pb - 16685kb
I know the model already has a .tflite file, but for my own understanding I am trying to convert it myself.
My First Step: Creating the frozen graph file
import tensorflow as tf
from tensorflow.python.framework import graph_util  # needed below for convert_variables_to_constants

imported_meta = tf.train.import_meta_graph(base_dir + model_folder_name + meta_file, clear_devices=True)
graph_ = tf.get_default_graph()
with tf.Session() as sess:
    #saver = tf.train.import_meta_graph(base_dir + model_folder_name + meta_file, clear_devices=True)
    imported_meta.restore(sess, base_dir + model_folder_name + checkpoint)
    graph_def = sess.graph.as_graph_def()
    output_graph_def = graph_util.convert_variables_to_constants(sess, graph_def, ['MobilenetV1/Predictions/Reshape_1'])
    with tf.gfile.GFile(base_dir + model_folder_name + './my_frozen.pb', "wb") as f:
        f.write(output_graph_def.SerializeToString())
I have successfully created my_frozen.pb (16,590 kb), but the original file size is 16,685 kb, which is clearly visible in the folder structure above. So this is my first question: why is the file size different? Am I following the wrong path?
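One way to see what actually ended up in each .pb is to count the nodes in the serialized graphs; a minimal sketch under the same TF 1.x setup (the paths below are placeholders for the files mentioned above), since the set of nodes kept by convert_variables_to_constants is usually what accounts for such size differences:

import tensorflow as tf  # TF 1.x API

def count_nodes(pb_path):
    """Parse a frozen GraphDef and return how many nodes it contains."""
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    return len(graph_def.node)

# hypothetical paths; substitute the actual locations used above
print(count_nodes('my_frozen.pb'))
print(count_nodes('mobilenet_v1_1.0_224_frozen.pb'))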
My Second Step: Creating the tflite file using a bazel command
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- --input_file=/path_to_folder/my_frozen.pb --output_file=/path_to_folder/model.tflite --inference_type=FLOAT --input_shape=1,224,224,3 --input_array=input --output_array=MobilenetV1/Predictions/Reshape_1
This command gives me model.tflite with a size of 0 kb.
Traceback for the bazel command:
INFO: Analysed target //tensorflow/contrib/lite/toco:toco (0 packages loaded).
INFO: Found 1 target...
Target //tensorflow/contrib/lite/toco:toco up-to-date:
bazel-bin/tensorflow/contrib/lite/toco/toco
INFO: Elapsed time: 0.369s, Critical Path: 0.01s
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/tensorflow/contrib/lite/toco/toco '--input_file=/home/ubuntu/DEEP_LEARNING/Prashant/TensorflowBasic/mobilenet_v1_1.0_224/frozengraph.pb' '--output_file=/home/ubuntu/DEEP_LEARNING/Prashant/TensorflowBasic/mobilenet_v1_1.0_224/float_model.tflite' '--inference_type=FLOAT' '--input_shape=1,224,224,3' '--input_array=input' '--output_array=MobilenetV1/Predictions/Reshape_1'
2018-04-12 16:36:16.190375: I tensorflow/contrib/lite/toco/import_tensorflow.cc:1265] Converting unsupported operation: FIFOQueueV2
2018-04-12 16:36:16.190707: I tensorflow/contrib/lite/toco/import_tensorflow.cc:1265] Converting unsupported operation: QueueDequeueManyV2
2018-04-12 16:36:16.202293: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 290 operators, 462 arrays (0 quantized)
2018-04-12 16:36:16.211322: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 290 operators, 462 arrays (0 quantized)
2018-04-12 16:36:16.211756: F tensorflow/contrib/lite/toco/graph_transformations/resolve_batch_normalization.cc:86] Check failed: mean_shape.dims() == multiplier_shape.dims()
Python Version - 2.7.6
Tensorflow Version - 1.5.0
Thanks In advance :)
The error Check failed: mean_shape.dims() == multiplier_shape.dims()
was an issue with the resolution of batch norm and has been resolved in:
https://github.com/tensorflow/tensorflow/commit/460a8b6a5df176412c0d261d91eccdc32e9d39f1#diff-49ed2a40acc30ff6d11b7b326fbe56bc
In my case the error occurred using TensorFlow v1.7.
The solution was to use TensorFlow v1.15 (nightly):
toco --graph_def_file=/path_to_folder/my_frozen.pb \
--input_format=TENSORFLOW_GRAPHDEF \
--output_file=/path_to_folder/my_output_model.tflite \
--input_shape=1,224,224,3 \
--input_arrays=input \
--output_format=TFLITE \
--output_arrays=MobilenetV1/Predictions/Reshape_1 \
--inference_type=FLOAT
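The same conversion can also be done from Python instead of the toco binary; a minimal sketch, assuming TensorFlow 1.15 (as suggested above) and the frozen graph path and tensor names used in this answer:

import tensorflow as tf  # assumes TensorFlow 1.15

# Convert the frozen MobilenetV1 graph to a TFLite flatbuffer via the Python API.
converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='/path_to_folder/my_frozen.pb',
    input_arrays=['input'],
    output_arrays=['MobilenetV1/Predictions/Reshape_1'],
    input_shapes={'input': [1, 224, 224, 3]})
tflite_model = converter.convert()

with open('/path_to_folder/my_output_model.tflite', 'wb') as f:
    f.write(tflite_model)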

Causes of "java.lang.NoSuchMethodError: org.eclipse.paho.client.mqttv3.MqttConnectOptions.setAutomaticReconnect(Z)V"

I am trying to run Spark Structured Streaming with MQTT using Apache Bahir, by modifying the sample wordcount example provided.
Spark version: spark-2.2.0-bin-hadoop2.7.
I am using this command to run the program: bin\spark-submit --packages org.apache.bahir:spark-sql-streaming-mqtt_2.11:2.2.0 mqtt.py
Below is my code:
# mqtt.py
from __future__ import print_function
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("StructuredNetworkWordCount")\
        .getOrCreate()
    broker_uri = 'xyz.com:1883'
    lines = (spark\
        .readStream\
        .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")\
        .option("topic","xyz")\
        .load("tcp://{}".format(broker_uri)))
    # Split the lines into words
    words = lines.select(
        # explode turns each item in an array into a separate row
        explode(
            split(lines.value, ' ')
        ).alias('word')
    )
    # Generate running word count
    wordCounts = words.groupBy('word').count()
    # Start running the query that prints the running counts to the console
    query = wordCounts\
        .writeStream\
        .outputMode('complete')\
        .format('console')\
        .start()
    query.awaitTermination()
But I get the following error when running the query:
17/11/09 19:48:14 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/11/09 19:48:16 INFO StreamExecution: Starting [id = e0335f31-f3e0-4ee1-a774-52582268845c, runId = f6a87268-164c-4eab-82db-1ac0bacd2bad]. Use C:\Users\xyz\AppData\Local\Temp\temporary-42cbc22f-7c1d-413c-b81c-3d4496f8e297 to store the query checkpoint.
17/11/09 19:48:16 WARN MQTTStreamSourceProvider: If `clientId` is not set, a random value is picked up.
Recovering from failure is not supported in such a case.
17/11/09 19:48:16 ERROR StreamExecution: Query [id = e0335f31-f3e0-4ee1-a774-52582268845c, runId = f6a87268-164c-4eab-82db-1ac0bacd2bad] terminated with error
java.lang.NoSuchMethodError: org.eclipse.paho.client.mqttv3.MqttConnectOptions.setAutomaticReconnect(Z)V
at org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider.createSource(MQTTStreamSource.scala:219)
at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:243)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:158)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:155)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:155)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:153)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:153)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:147)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:276)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
Exception in thread "stream execution thread for [id = e0335f31-f3e0-4ee1-a774-52582268845c, runId = f6a87268-164c-4eab-82db-1ac0bacd2bad]" java.lang.NoSuchMethodError: org.eclipse.paho.client.mqttv3.MqttConnectOptions.setAutomaticReconnect(Z)V
at org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider.createSource(MQTTStreamSource.scala:219)
at org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:243)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:158)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2$$anonfun$applyOrElse$1.apply(StreamExecution.scala:155)Traceback (most recent call last):
File "C:/Users/xyz/Documents/Fall-17/Transportation/spark/mqtt.py", line 84, in <module>
query.awaitTermination()
File "C:\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\streaming.py", line 106, in awaitTermination
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194) File "C:\spark\spark-2.2.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\spark\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 75, in deco
pyspark.sql.utils.StreamingQueryException: u'org.eclipse.paho.client.mqttv3.MqttConnectOptions.setAutomaticReconnect(Z)V\n=== Streaming Query ===\nIdentifier: [id = e0335f31-f3e0-4ee1-a774-52582268845c, runId = f6a87268-164c-4eab-82db-1ac0bacd2bad]\nCurrent Committed Offsets: {}\nCurrent Available Offsets: {}\n\nCurrent State: INITIALIZING\nThread State: RUNNABLE'
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:155)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$2.applyOrElse(StreamExecution.scala:153)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan$lzycompute(StreamExecution.scala:153)
at org.apache.spark.sql.execution.streaming.StreamExecution.logicalPlan(StreamExecution.scala:147)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:276)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:206)
17/11/09 19:48:16 INFO SparkContext: Invoking stop() from shutdown hook
Can anyone help me see where I am going wrong?
I had the same problem, and I solved it. Your situation might be different from mine. (I know I should just comment instead of answering, but my reputation is too low to comment.)
In my case, my Maven project included dependencies on both org.apache.bahir:spark-streaming-mqtt_2.11:2.2.0 and org.apache.bahir:spark-sql-streaming-mqtt_2.11:2.2.0.
spark-streaming-mqtt_2.11 depends on org.eclipse.paho.client.mqttv3:1.0.2
spark-sql-streaming-mqtt_2.11 depends on org.eclipse.paho.client.mqttv3:1.1.0
1.1.0 has MqttConnectOptions.setAutomaticReconnect, but 1.0.2 doesn't.
I removed the spark-streaming-mqtt_2.11 dependency and then it worked.
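If you are unsure which Paho version actually ends up on the driver classpath, a quick check can be done from the PySpark side; a sketch only, relying on py4j internals via SparkContext._jvm and assuming the spark session from the script above:

# Print the jar that the running JVM loaded MqttConnectOptions from,
# to confirm whether the 1.0.2 or 1.1.0 Paho client won the classpath race.
jvm = spark.sparkContext._jvm
clazz = jvm.java.lang.Class.forName(
    "org.eclipse.paho.client.mqttv3.MqttConnectOptions")
print(clazz.getProtectionDomain().getCodeSource().getLocation())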

Stack trace: ExitCodeException exitCode=1 when starting MapReduce job on Bigtable

We are using Google Cloud Bigtable for our big data.
When I run a MapReduce job I assemble a jar and run it, and now I'm getting this error:
Application application_1451577928704_0050 failed 2 times due to AM Container for appattempt_1451577928704_0050_000002 exited with exitCode: 1
For more detailed output, check application tracking page: http://censored:8088/cluster/app/application_1451577928704_0050 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e02_1451577928704_0050_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
When I logged in to look at the worker node's logs, I saw this error:
2016-02-15 02:59:54,106 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1451577928704_0050_000001
2016-02-15 02:59:54,294 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-02-15 02:59:54,319 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2016-02-15 02:59:54,319 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (appAttemptId { application_id { id: 50 cluster_timestamp: 1451577928704 } attemptId: 1 } keyId: -******)
2016-02-15 02:59:54,424 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
2016-02-15 02:59:54,755 WARN [main] org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2016-02-15 02:59:54,855 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config null
2016-02-15 02:59:54,911 INFO [main] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.ClassCastException: org.apache.xerces.dom.DeferredElementNSImpl cannot be cast to org.w3c.dom.Text
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.ClassCastException: org.apache.xerces.dom.DeferredElementNSImpl cannot be cast to org.w3c.dom.Text
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:478)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1560)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:458)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:377)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1518)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
Caused by: java.lang.ClassCastException: org.apache.xerces.dom.DeferredElementNSImpl cannot be cast to org.w3c.dom.Text
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2603)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2502)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1432)
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:67)
at org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:81)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:96)
at org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:105)
at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.setConf(TableOutputFormat.java:184)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:474)
... 11 more
I tried an older jar and it runs perfectly fine, and I'm not sure why the new jar won't work - I didn't change anything.
Please advise?
Thanks!
Update 1: Here are some more details:
I set up the cluster with Dataproc.
We are using the newest versions; here are the library dependencies:
val BigtableHbase = "com.google.cloud.bigtable" % "bigtable-hbase-1.1"
% "0.2.2" val BigtableHbaseMapreduce = "com.google.cloud.bigtable" %
"bigtable-hbase-mapreduce" % "0.2.2" val CommonsCli = "commons-cli" %
"commons-cli" % "1.2" val HadoopCommon = "org.apache.hadoop" %
"hadoop-common" % "2.7.1" val HadoopMapreduceClientApp =
"org.apache.hadoop" % "hadoop-mapreduce-client-app" % "2.7.1" val
HbaseCommon = "org.apache.hbase" % "hbase-common" % "1.1.2" val
HbaseProtocol = "org.apache.hbase" % "hbase-protocol" % "1.1.2" val
HbaseClient = "org.apache.hbase" % "hbase-client" % "1.1.2" val
HbaseServer = "org.apache.hbase" % "hbase-server" % "1.1.2" val
HbaseAnnotations = "org.apache.hbase" % "hbase-annotations" % "1.1.2"
libraryDependencies += BigtableHbase libraryDependencies +=
BigtableHbaseMapreduce libraryDependencies += CommonsCli
libraryDependencies += HadoopCommon libraryDependencies +=
HadoopMapreduceClientApp libraryDependencies += HbaseCommon
libraryDependencies += HbaseProtocol libraryDependencies +=
HbaseClient libraryDependencies += HbaseServer libraryDependencies +=
HbaseAnnotations
Java version:
openjdk version "1.8.0_66-internal" OpenJDK Runtime Environment (build
1.8.0_66-internal-b17) OpenJDK 64-Bit Server VM (build 25.66-b17, mixed mode)
Alpn version: alpn-boot-8.1.3.v20150130
HBase version:
2016-02-15 20:45:42,050 INFO [main] util.VersionInfo: HBase 1.1.2
2016-02-15 20:45:42,051 INFO [main] util.VersionInfo: Source code repository file:///mnt/ram/bigtop/bigtop/output/hbase/hbase-1.1.2 revision=Unknown
2016-02-15 20:45:42,051 INFO [main] util.VersionInfo: Compiled by bigtop on Tue Nov 10 19:09:17 UTC 2015
2016-02-15 20:45:42,051 INFO [main] util.VersionInfo: From source with checksum 42e8a1890c700d37485c69a44a3
Hadoop version:
Hadoop 2.7.1
Subversion https://bigdataoss-internal.googlesource.com/third_party/apache/bigtop -r 2a194d4d838b79460c3ceb892f3c9444218ba970
Compiled by bigtop on 2015-11-10T18:38Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/lib/hadoop/hadoop-common-2.7.1.jar
I found the problem in my case!
The hbase-site.xml was slightly different in the hbase.client.connection.impl property.
<property>
<name>hbase.client.connection.impl</name>
<value>com.google.cloud.bigtable.hbase1_1.BigtableConnection</value>
</property>
I got to this after extracting and comparing the two jars.
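For anyone who needs to do the same comparison, here is a small sketch of one way to diff the hbase-site.xml packed into two assembly jars (the jar file names and the in-jar path are placeholders, not from the original post):

# Extract and diff the hbase-site.xml bundled in two assembly jars.
import difflib
import zipfile

def read_hbase_site(jar_path):
    with zipfile.ZipFile(jar_path) as jar:
        return jar.read('hbase-site.xml').decode('utf-8').splitlines()

old_xml = read_hbase_site('old-assembly.jar')  # placeholder jar names
new_xml = read_hbase_site('new-assembly.jar')
print('\n'.join(difflib.unified_diff(old_xml, new_xml, lineterm='')))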
The newer versions of the Bigtable client jar include newer versions of the gRPC jar. Newer versions of the gRPC jar depend on newer versions of alpn-boot or OpenSSL. In addition to a new version of the Bigtable jar, you may need a new version of the alpn-boot jar. Unfortunately, the Jetty team isn't making new alpn-boot jars for Java 7, which bdutil depends on.
We are actively working on moving away from bdutil to Dataproc, which is the newer way of managing Hadoop on Google Cloud. Dataproc uses Java 8 and doesn't have the same problems as bdutil. There are still kinks we need to work out.
More information can be found at:
https://cloud.google.com/dataproc/examples/cloud-bigtable-example
and
https://github.com/grpc/grpc-java/blob/master/SECURITY.md

Hive query over HBase is very slow

I want to use Hive to query data that is stored in an HBase table, like:
100 column=cf1:val, timestamp=1379536809907, value=val_100
And I type the following command in hive:
select * from hbase_table where value = 'val_100';
I believe everything has been set up. When I type the above command, it shows me this:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201309031344_0024, Tracking URL = http://namenode:50030/jobdetails.jsp?jobid=job_201309031344_0024
Kill Command = /usr/libexec/../bin/hadoop job -kill job_201309031344_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-09-18 14:45:53,815 Stage-1 map = 0%, reduce = 0%
2013-09-18 14:46:54,654 Stage-1 map = 0%, reduce = 0%
2013-09-18 14:47:55,449 Stage-1 map = 0%, reduce = 0%
2013-09-18 14:48:56,193 Stage-1 map = 0%, reduce = 0%
2013-09-18 14:49:56,990 Stage-1 map = 0%, reduce = 0%
Why is the map task always at 0%? What's wrong with it? Is there some place that I have not configured?
I found this error in the datanode log file:
2013-09-18 15:18:27,415 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server datanode2/192.168.1.97:2181. Will not attempt to authenticate using SASL (unknown error)
2013-09-18 15:18:27,417 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
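The log above shows the ZooKeeper client being refused on datanode2:2181. A quick reachability check can confirm whether anything is listening there; this is only a sketch, using ZooKeeper's four-letter "ruok" command, with the host and port taken from the log:

# Check whether a ZooKeeper server is actually listening on the address from the log.
import socket

sock = socket.create_connection(('datanode2', 2181), timeout=5)  # raises if connection is refused
sock.sendall(b'ruok')
print(sock.recv(16))  # a healthy ZooKeeper replies b'imok'
sock.close()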