I have the following build.sbt file:
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.13.7"
val akkaVersion = "2.6.18"
lazy val root = (project in file("."))
  .settings(
    name := "akka-sbt-multijvm-issue"
  )

libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor" % akkaVersion,
  "com.typesafe.akka" %% "akka-stream" % akkaVersion,
  "com.typesafe.akka" %% "akka-cluster" % akkaVersion
)
I have the following main code:
package com.example

object MaterializerApp extends App {
  import akka.stream.Materializer
}
When I compile the code, I get the error below:
sbt clean compile
[info] welcome to sbt 1.6.1 (Azul Systems, Inc. Java 11.0.12)
[info] loading global plugins from /Users/rajkumar.natarajan/.sbt/1.0/plugins
[info] loading settings for project akka-sbt-multijvm-issue-build from plugins.sbt ...
[info] loading project definition from /Users/rajkumar.natarajan/Documents/Coding/akka-sbt-multijvm-issue/project
[info] loading settings for project root from build.sbt ...
[info] set current project to akka-sbt-multijvm-issue (in build file:/Users/rajkumar.natarajan/Documents/Coding/akka-sbt-multijvm-issue/)
[info] Executing in batch mode. For better performance use sbt's shell
[success] Total time: 0 s, completed Jan 5, 2022, 9:13:16 PM
[info] compiling 1 Scala source to /Users/rajkumar.natarajan/Documents/Coding/akka-sbt-multijvm-issue/target/scala-2.13/classes ...
[error] /Users/rajkumar.natarajan/Documents/Coding/akka-sbt-multijvm-issue/src/main/scala/com/example/MaterializerApp.scala:5:15: object stream is not a member of package akka
[error] import akka.stream.Materializer
^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 4 s, completed Jan 5, 2022, 9:13:21 PM
Note: When I change the akkaVersion to 2.6.17, the compilation succeeds.
How can I fix this error?
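For reference, here is a stripped-down build.sbt sketch (the project name is arbitrary) that keeps only the akka-stream dependency. If akka.stream still cannot be found with this minimal build, the problem lies in resolving the 2.6.18 artifact itself rather than in anything from the multi-JVM setup:
ThisBuild / scalaVersion := "2.13.7"

lazy val root = (project in file("."))
  .settings(
    name := "akka-stream-2618-repro", // arbitrary name, anything works
    libraryDependencies += "com.typesafe.akka" %% "akka-stream" % "2.6.18"
  )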
I'm trying to scale a PySpark app on AWS EMR. I was able to get it to work for one day of data (around 8 TB), but I keep running into (what I believe are) OOM errors when trying to test it on one week of data (around 50 TB).
I set my Spark configs based on this article. Originally, I got a java.lang.OutOfMemoryError: Java heap space in the driver stdout. From browsing online, it seemed like my spark.driver.memory was too low, so I boosted it up quite a bit. Now I am running into a different OOM error. In the driver stderr, I see something like this:
.saveAsTable('hive_tables.apf_intermediate_table')
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt1/yarn/usercache/hadoop/appcache/application_1622843305817_0001/container_1622843305817_0001_02_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o12119.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:174)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:517)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:217)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:494)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:473)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:429)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Task not serializable
and in the driver stdout, I see something like this:
21/06/04 21:57:58 INFO CodeGenerator: Code generated in 52.445036 ms
21/06/04 21:57:58 INFO CodeGenerator: Code generated in 10.271344 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 7.076469 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 6.111364 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 9.17915 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 50.537577 ms
21/06/04 21:57:59 INFO CodeGenerator: Code generated in 9.398937 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 44.204289 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 6.294884 ms
21/06/04 21:58:00 INFO CodeGenerator: Code generated in 8.570691 ms
21/06/04 21:58:02 INFO CodeGenerator: Code generated in 55.276023 ms
21/06/04 21:58:02 INFO CodeGenerator: Code generated in 10.988539 ms
21/06/04 21:58:07 INFO CodeGenerator: Code generated in 284.051432 ms
21/06/04 21:58:07 ERROR Utils: uncaught error in thread spark-listener-group-eventLog, stopping SparkContext
java.lang.OutOfMemoryError
at java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:161)
at java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:155)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:125)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:351)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2933)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:107)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
at org.apache.spark.scheduler.EventLoggingListener.onOtherEvent(EventLoggingListener.scala:236)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:80)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1347)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
21/06/04 21:58:07 ERROR Utils: throw uncaught fatal error in thread spark-listener-group-eventLog
java.lang.OutOfMemoryError
with the physical plan being spit out repeatedly right after in both logs.
I assume that, since the code has run successfully with less input data, this is still an OOM error and that I can safely ignore the Task not serializable error. However, I have tried many different EMR / Spark configs and keep getting some variation of this error.
An example cluster set-up would be something like this:
Driver node: 1 M5.8Xlarge instance
Core nodes: 9 M5.8Xlarge instances (each with 32 GiB RAM, 128 vCores)
spark.dynamicAllocation.enabled=False
spark.executors.cores=5
spark.executor.memory=18g
spark.executor.memoryOverhead=2g
spark.driver.memory=44g
spark.driver.cores=5
spark.executor.instances=53
spark.default.parallelism=530
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'"
spark.driver.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'"
Does anything stand out that could be causing this?
Spark 2.4.7
Relevant Code:
df = spark.read.parquet(SDL_PATH.format(datekeys=datekeys))
df = df.filter(F.col('request') == 1)\
.filter(F.col('pricing_rule_applied') == 0)\
.filter(F.col('deal_id') == 'company')\
.filter(F.col('multivariate_test_data').isNull() | (F.col('multivariate_test_data')=='{"passthrough":true}'))\
.select(F.col('platform_id'),
F.col('demand_id'),
F.col('publisher_id'),
'channel_id',
'market_id',
'deal_id',
F.col('auction_type'),
F.col('price_floor_gross'),
F.col('price_floor_net'),
F.col('ad_slots'),
F.col('country').alias('country_id'),
F.col('device_type'),
F.col('browser_id'),
F.col('request').alias('bid_request'),
F.when(F.col('bid'), 1).otherwise(0).alias('bid'),
F.col('max_bid').alias('max_bid'),
F.when(F.col('win'), 1).otherwise(0).alias('win'),
F.col('spend').alias('spend'),
F.col('above_floor'),
F.col('price_floor_adjusted'),
F.when((F.col('block_type') == "filter") | (F.col('block_type') == "ad_review") | (F.col('block_type') == "quarantine") | F.col('block_type').isNull(), 1).otherwise(0).alias('blocked'),
F.when((F.col('ad_timeout').isNull()) | (F.col('ad_timeout') == 0), 0).otherwise(1).alias('timeout'),
F.when((F.col('ad_opt_out').isNull()) | (F.col('ad_opt_out') == 0), 0).otherwise(1).alias('opt_out'))
df = df.join(deal, on=['deal_id', 'publisher_id'],how='left')
hashes = spark.read.table('market.server_hashes')\
.withColumn('fehash',F.substring(F.col('hash'), 1, 8))\
.select('fehash',F.substring(F.col('hostname'), 13, 5).alias('datacenter'))
df = df.withColumn('fehash',F.lower(F.substring(F.col('market_id'), -12,8)))\
.join(F.broadcast(hashes), on='fehash', how='left')
# dfWriter_sql = df.repartition(128).write.format("orc").option("orc.compress", "snappy")
# dfWriter_sql.mode('overwrite').saveAsTable('datascience.jp_auction_sim_sample_30p')
good_ids = df.filter('(ad_slots = 1) and (bid=1)').groupBy('market_id').agg(F.sum('win').alias('wins')).filter('wins <=1')
modified_floors = df.join(good_ids, on='market_id',how='inner').filter('(blocked=0) and (timeout=0) and (opt_out=0) and (bid=1)').cache()
#modified_floors = modified_floors.filter(F.col('price_floor_gross') == F.col('price_floor_adjusted'))
#print(modified_floors.count())
window = Window.partitionBy('market_id').orderBy(F.col('tier').asc(),F.col('max_bid').desc(),F.col('demand_id').asc())
check = 0
feats = ['country_id','device_type','browser_id']
features = ['publisher_id','channel_id','datacenter','country_id','device_type','browser_id']
for perc_val, val in zip(vals, str_vals):
    # val = str(int(perc_val*100))
    print(val)
    if val == '114':
        val = '115'
    sim_df = modified_floors.withColumn('price_floor_modified', F.col('price_floor_adjusted')*perc_val)\
        .withColumn('coeff', F.lit(perc_val))\
        .withColumn('row_number',F.row_number().over(window))\
        .withColumn('next_highest_temp',F.lead('max_bid', count=1, default=None).over(window))\
        .withColumn('next_highest', F.when(F.col('next_highest_temp').isNull(), F.when(F.col('max_bid') >= F.col('price_floor_modified'), F.col('price_floor_modified')).otherwise(F.lit(0))).otherwise(F.col('next_highest_temp')))\
        .withColumn('adj_win',
            F.when((F.col('row_number') == 1) & (F.col('max_bid') >= F.col('price_floor_modified')),
                   F.lit(1)).otherwise(F.lit(0)))\
        .withColumn('adj_spend',
            F.when(
                F.col('adj_win') == 1,
                F.when(
                    (F.col('auction_type') == 1) | (F.col('auction_type') == 3),
                    F.col('max_bid')
                ).otherwise(
                    F.when(F.col('next_highest') >= F.col('price_floor_modified'),
                           F.col('next_highest')
                    ).otherwise(
                        F.col('price_floor_modified')
                    )
                )
            ).otherwise(
                F.lit(0)
            ))\
        .withColumn('adj_above_floor', F.when(F.col('max_bid') > F.col('price_floor_modified'), 1).otherwise(0))\
        .withColumn('adj_below_floor', F.when(F.col('max_bid') < F.col('price_floor_modified'), 1).otherwise(0))
    if check == 1:
        agg_df = agg_df.union(get_full_tier(sim_df))
    else:
        agg_df = get_full_tier(sim_df)
    check = 1
agg_df = agg_df.withColumn('datekey',F.lit(int(DATEKEY))).cache()
window = Window.partitionBy('hashkey').orderBy(F.col('estimated_platform_spend').desc(),F.col('coeff').desc())
agg_df = agg_df.withColumn('hashkey',F.md5(
F.concat_ws(
'-',
F.coalesce('channel_id',F.lit('')),
F.coalesce('datacenter',F.lit('')),
F.coalesce('country_id',F.lit('')),
F.coalesce('device_type',F.lit('')),
F.coalesce('browser_id',F.lit(''))
))).withColumn('optimum',(F.row_number().over(window) == 1)
& (F.col('platform_spend_improvement_incremental') > 0)
& (F.col('coeff') != 1.0))
#agg_df.write.format("parquet").mode('overwrite').saveAsTable('datascience.apf_intermediate_table')
affiliate = spark.read.table('tableau_cpc3.affiliate')
browsers = spark.read.table('tableau_cpc3.device_browser')\
.select(F.col('device_browser_name').alias('browser_name'),F.col('device_browser_value').cast('integer').alias('browser_id'))
regions = spark.sql('''SELECT region_id as country_id,
CASE region_name WHEN 'Croatia (Hrvatska)' THEN 'Croatia'
WHEN 'Great Britain (UK)' THEN 'United Kingdom'
WHEN 'New Zealand (Aotearoa)' THEN 'New Zealand'
WHEN 'Vatican City State (Holy See)' THEN 'Vatican City'
WHEN 'Congo' THEN 'Congo (Brazzaville)'
WHEN 'Palestinian Territory' THEN 'Palestinian Territories'
WHEN 'S. Georgia and S. Sandwich Islands' THEN 'South Georgia and the South Sandwich Islands'
ELSE region_name
END AS country
FROM tableau_cpc3.region''')
tier_check = 0
to_write = agg_df.join(F.broadcast(affiliate).select(F.col('name').alias('publisher_name'), F.col('affiliate_id').alias('publisher_id')), on='publisher_id', how='left')
to_write = to_write.join(F.broadcast(affiliate).select(F.col('name').alias('channel_name'), F.col('affiliate_id').alias('channel_id')), on='channel_id', how='left')
to_write = to_write.join(F.broadcast(regions), on='country_id', how='left').withColumn('date',F.col('datekey').cast('string'))
to_write = to_write.join(F.broadcast(browsers), on='browser_id', how='left')
to_write = to_write.select('publisher_id',
'publisher_name',
'channel_id',
'channel_name',
'datacenter',
'country_id',
'country',
'device_type',
F.expr('''CASE device_type WHEN 0 THEN 'Unknown'
WHEN 1 THEN 'Mobile'
WHEN 2 THEN 'Desktop'
WHEN 3 THEN 'CTV'
WHEN 4 THEN 'Phone'
WHEN 5 THEN 'Tablet'
WHEN 6 THEN 'Connected Device'
WHEN 7 THEN 'STB' ELSE 'Unknown' END''').alias('device_type_name'),
F.col('browser_id').cast('integer').alias('browser_id'),
'browser_name',
'coeff',
'auctions',
'bids_returned',
'channel_floor',
'unadjusted_platform_spend',
'unadjusted_count_win',
'unadjusted_count_above_floor',
'unadjusted_count_below_floor',
'estimated_platform_spend',
'estimated_count_win',
'estimated_count_above_floor',
'estimated_count_below_floor',
'competitive_rate',
'datekey',
'optimum',
'platform_spend_improvement_percent',
'platform_spend_improvement_incremental',
'date',
F.md5(
F.concat_ws(
'-',
F.coalesce('channel_id',F.lit('')),
F.coalesce('datacenter',F.lit('')),
F.coalesce('country_id',F.lit('')),
F.coalesce('device_type',F.lit('')),
F.coalesce('browser_id',F.lit(''))
)
).alias('hashkey')).distinct()
to_write.write\
.format("parquet")\
.partitionBy('coeff','datacenter')\
.mode('overwrite')\
.option("path", APF_INTERMEDIATE_TABLE_PATH)\
.saveAsTable('hive_tables.apf_intermediate_table')
You are using a machine type that doesn't fit your configuration. You give each executor:
spark.executors.cores=5
spark.executor.memory=18g
spark.executor.memoryOverhead=2g
spark.executor.instances=53
In effect you are saying "for my executors I want 20 G of memory", while your machines have 32 G x 0.85 = 27.2 G usable, so you can place only one executor on a single machine (with your current memory settings). That executor will use 5 cores, and 1 core is required for Spark itself, so per machine 128 - 1 - 5 = 122 cores and 27.2 G - 20 G = 7.2 G of RAM go unused.
Effectively, with your settings, you have 9 nodes and can run only 9 executors (1 executor per node). I would recommend using machines with more memory and fewer cores.
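To make that arithmetic concrete, here is a minimal Scala sketch (object and variable names are arbitrary) that recomputes how many executors fit per node, using the node size quoted in the question and the 0.85 usable-memory factor assumed above:
object ExecutorFit extends App {
  // Numbers taken from the question/answer above; 0.85 is the assumed
  // fraction of node RAM that YARN can actually hand out to containers.
  val nodeRamGb        = 32.0
  val usableGb         = nodeRamGb * 0.85                  // 27.2 GB
  val perExecutorGb    = 18.0 + 2.0                        // executor memory + overhead = 20 GB
  val executorsPerNode = (usableGb / perExecutorGb).toInt  // 1
  val clusterTotal     = executorsPerNode * 9              // 9, far below the 53 requested

  println(s"usable/node = $usableGb GB, executors/node = $executorsPerNode, cluster total = $clusterTotal")
}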
I recently got into Apache Spark on AWS. I have a dataset with 10 columns and 7 million rows, but I cannot use the whole set because Spark cannot handle it. When I take more than 1.5 million lines it crashes on a single r4.16xlarge instance with 488 GB RAM with insufficient-memory errors (which I can confirm by observing it with top: memory consumption rises up to 100%). But when I try to run it on a whole cluster with significantly more memory (4 x 488 = 1952 GB) it also fails. I am using the following parameters for starting the EMR step:
spark-submit --deploy-mode cluster --class randomforest --driver-memory 400G --num-executors 4 --executor-cores 8 --executor-memory 400g s3://spark-cluster-demo/randomforest_2.11-1.0.jar
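For context on what those flags request per container, here is a rough sketch of the sizing (assuming Spark's default memory overhead of max(384 MiB, 10% of executor memory); the object name is arbitrary):
object ContainerSizing extends App {
  // Rough arithmetic for the spark-submit flags above.
  val executorMemoryGb = 400.0
  val overheadGb       = math.max(0.384, 0.10 * executorMemoryGb) // 40 GB with the assumed default
  val containerGb      = executorMemoryGb + overheadGb            // ~440 GB per executor
  val nodeRamGb        = 488.0                                    // r4.16xlarge

  println(s"each executor container asks for ~$containerGb GB out of $nodeRamGb GB per node")
}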
This is the Scala script inside the jar I am executing:
import org.apache.spark.SparkContext, org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
object randomforest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark Cluster Test")
    val sc = new SparkContext(sparkConf)
    val spark = SparkSession.builder().appName("Spark Cluster Test").getOrCreate
    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("s3://spark-cluster-demo/2004_4000000.txt")
    val maxCategories = 512
    val numTrees = 64
    val featureSubsetStrategy = "auto" // supported featureSubsetStrategy settings: auto, all, onethird, sqrt, log2
    val impurity = "gini"
    val maxDepth = 30
    val maxBins = 2048
    val maxMemoryInMB = 409600
    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > X distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(maxCategories).fit(data)
    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
    // Train a RandomForest model.
    val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setNumTrees(numTrees).setFeatureSubsetStrategy(featureSubsetStrategy).setImpurity(impurity).setMaxDepth(maxDepth).setMaxBins(maxBins).setMaxMemoryInMB(maxMemoryInMB)
    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
    // Chain indexers and forest in a Pipeline.
    val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
    // Train model. This also runs the indexers.
    val t0 = System.nanoTime()
    val model = pipeline.fit(trainingData)
    val t1 = System.nanoTime()
    println("Training time: " + (t1 - t0) + " ns")
    // Make predictions.
    val predictions = model.transform(testData)
    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)
  }
}
The whole log is too big to post here. I get various different errors and failures, which I will list here. All of them occur dozens or hundreds of times all over the place:
2018-11-26T09:38:35.392+0000: [GC (Allocation Failure) 2018-11-26T09:38:35.392+0000: [ParNew: 559232K->36334K(629120K), 0.0217574 secs] 559232K->36334K(2027264K), 0.0218269 secs] [Times: user=0.28 sys=0.01, real=0.02 secs]
18/11/26 10:24:37 WARN TransportChannelHandler: Exception in connection from ip-172-31-17-60.ec2.internal/172.31.17.60:45589
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
18/11/26 10:24:37 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-17-60.ec2.internal/172.31.17.60:45589 is closed
18/11/26 10:24:37 ERROR OneForOneBlockFetcher: Failed while starting block fetches
18/11/26 10:24:42 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
18/11/26 10:22:29 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 141885 ms exceeds timeout 120000 ms
18/11/26 10:22:29 ERROR YarnClusterScheduler: Lost executor 2 on ip-172-31-26-87.ec2.internal: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 16.0 in stage 47.0 (TID 1079, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 4.0 in stage 47.0 (TID 1067, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 12.0 in stage 47.0 (TID 1075, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 0.0 in stage 47.0 (TID 1063, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 20.0 in stage 47.0 (TID 1083, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 WARN TaskSetManager: Lost task 8.0 in stage 47.0 (TID 1071, ip-172-31-26-87.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 141885 ms
18/11/26 10:22:29 INFO TaskSetManager: Starting task 8.1 in stage 47.0 (TID 1084, ip-172-31-26-87.ec2.internal, executor 2, partition 8, PROCESS_LOCAL, 8550 bytes)
18/11/26 10:22:29 INFO TaskSetManager: Starting task 20.1 in stage 47.0 (TID 1085, ip-172-31-26-87.ec2.internal, executor 2, partition 20, PROCESS_LOCAL, 8550 bytes)
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_2 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_10 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_6 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_18 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_22 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_1 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_14 !
18/11/26 10:40:29 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_36_5 !
18/11/26 10:40:29 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, ip-172-31-26-192.ec2.internal, 43287, None)
18/11/26 10:40:29 INFO BlockManagerMaster: Removed 8 successfully in removeExecutor
18/11/26 10:40:29 INFO DAGScheduler: Executor 8 added was in lost list earlier.
18/11/26 10:40:29 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.0 failed 4 times, most recent failure: Lost task 5.3 in stage 47.0 (TID 1117, ip-172-31-26-192.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170206 ms
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 47.0 failed 4 times, most recent failure: Lost task 5.3 in stage 47.0 (TID 1117, ip-172-31-26-192.ec2.internal, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170206 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1803)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1791)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1790)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1790)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2024)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1962)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:743)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:742)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:742)
at org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:563)
at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:198)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:139)
at org.apache.spark.ml.classification.RandomForestClassifier.train(RandomForestClassifier.scala:45)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
at randomforest$.main(randomforest.scala:48)
at randomforest.main(randomforest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Is something wrong with my configuration, or is it impossible to use such a big dataset for RF?
Hi, I am a newbie to TensorFlow. My aim is to convert a .pb file to .tflite from a pretrained model, for my own understanding. I have downloaded the mobilenet_v1_1.0_224 model. Below is the structure of the model folder:
mobilenet_v1_1.0_224.ckpt.data-00000-of-00001 - 66312kb
mobilenet_v1_1.0_224.ckpt.index - 20kb
mobilenet_v1_1.0_224.ckpt.meta - 3308kb
mobilenet_v1_1.0_224.tflite - 16505kb
mobilenet_v1_1.0_224_eval.pbtxt - 520kb
mobilenet_v1_1.0_224_frozen.pb - 16685kb
I know the model already has a .tflite file, but for my understanding I am trying to convert it myself.
My first step: creating the frozen graph file
import tensorflow as tf
from tensorflow.python.framework import graph_util  # needed for convert_variables_to_constants below

imported_meta = tf.train.import_meta_graph(base_dir + model_folder_name + meta_file, clear_devices=True)
graph_ = tf.get_default_graph()

with tf.Session() as sess:
    #saver = tf.train.import_meta_graph(base_dir + model_folder_name + meta_file, clear_devices=True)
    imported_meta.restore(sess, base_dir + model_folder_name + checkpoint)
    graph_def = sess.graph.as_graph_def()
    output_graph_def = graph_util.convert_variables_to_constants(sess, graph_def, ['MobilenetV1/Predictions/Reshape_1'])
    with tf.gfile.GFile(base_dir + model_folder_name + './my_frozen.pb', "wb") as f:
        f.write(output_graph_def.SerializeToString())
I have successfully created my_frozen.pb (16,590 kb), but the original file size is 16,685 kb, as is clearly visible in the folder structure above. So this is my first question: why is the file size different? Am I following the wrong path?
My second step: creating the tflite file using a bazel command
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- --input_file=/path_to_folder/my_frozen.pb --output_file=/path_to_folder/model.tflite --inference_type=FLOAT --input_shape=1,224,224,3 --input_array=input --output_array=MobilenetV1/Predictions/Reshape_1
This command gives me model.tflite with a size of 0 kb.
Traceback for the bazel command:
INFO: Analysed target //tensorflow/contrib/lite/toco:toco (0 packages loaded).
INFO: Found 1 target...
Target //tensorflow/contrib/lite/toco:toco up-to-date:
bazel-bin/tensorflow/contrib/lite/toco/toco
INFO: Elapsed time: 0.369s, Critical Path: 0.01s
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/tensorflow/contrib/lite/toco/toco '--input_file=/home/ubuntu/DEEP_LEARNING/Prashant/TensorflowBasic/mobilenet_v1_1.0_224/frozengraph.pb' '--output_file=/home/ubuntu/DEEP_LEARNING/Prashant/TensorflowBasic/mobilenet_v1_1.0_224/float_model.tflite' '--inference_type=FLOAT' '--input_shape=1,224,224,3' '--input_array=input' '--output_array=MobilenetV1/Predictions/Reshape_1'
2018-04-12 16:36:16.190375: I tensorflow/contrib/lite/toco/import_tensorflow.cc:1265] Converting unsupported operation: FIFOQueueV2
2018-04-12 16:36:16.190707: I tensorflow/contrib/lite/toco/import_tensorflow.cc:1265] Converting unsupported operation: QueueDequeueManyV2
2018-04-12 16:36:16.202293: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 290 operators, 462 arrays (0 quantized)
2018-04-12 16:36:16.211322: I tensorflow/contrib/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 290 operators, 462 arrays (0 quantized)
2018-04-12 16:36:16.211756: F tensorflow/contrib/lite/toco/graph_transformations/resolve_batch_normalization.cc:86] Check failed: mean_shape.dims() == multiplier_shape.dims()
Python version: 2.7.6
TensorFlow version: 1.5.0
Thanks in advance :)
The error Check failed: mean_shape.dims() == multiplier_shape.dims() was an issue with the resolution of batch norm and has been resolved in:
https://github.com/tensorflow/tensorflow/commit/460a8b6a5df176412c0d261d91eccdc32e9d39f1#diff-49ed2a40acc30ff6d11b7b326fbe56bc
In my case the error occurred using TensorFlow v1.7. The solution was to use TensorFlow v1.15 (nightly):
toco --graph_def_file=/path_to_folder/my_frozen.pb \
--input_format=TENSORFLOW_GRAPHDEF \
--output_file=/path_to_folder/my_output_model.tflite \
--input_shape=1,224,224,3 \
--input_arrays=input \
--output_format=TFLITE \
--output_arrays=MobilenetV1/Predictions/Reshape_1 \
--inference_type=FLOAT
When I run the unit tests of my Flink application through the IntelliJ IDE, they pass without any issue. When I run them through sbt, though, a few exceptions are thrown (see below). What might the cause of these exceptions be? I've been unable to track them down.
Edit: It's worth noting that the project in IntelliJ was created as an "sbt project", so the IDE also learns about the project dependencies through the build.sbt file. Why is this file not enough when running sbt from the command line?
$ sbt clean test
[info] Loading global plugins from /Users/myuser/.sbt/0.13/plugins
[info] Loading project definition from /Users/myuser/projects/anonymizer/project
[info] Set current project to anonymizer (in build file:/Users/myuser/projects/anonymizer/)
[success] Total time: 8 s, completed Jan 20, 2016 6:15:20 PM
[info] Updating {file:/Users/myuser/projects/anonymizer/}anonymizer...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[info] Compiling 8 Scala sources to /Users/myuser/projects/anonymizer/target/scala-2.11/classes...
[info] Compiling 2 Scala sources to /Users/myuser/projects/anonymizer/target/scala-2.11/test-classes...
[error] Test myorg.mypackage.TestMyClass failed: java.nio.file.FileAlreadyExistsException: /var/folders/n4/_bl8xyqs15xbgy37k889plm80000gn/T/TestBaseUtils-logdir9030651763276830933.tmp/jobmanager.out, took 0.0 sec
[error] at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
[error] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
[error] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
[error] at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
[error] at java.nio.file.Files.newByteChannel(Files.java:361)
[error] at java.nio.file.Files.createFile(Files.java:632)
[error] at org.apache.flink.test.util.TestBaseUtils.startCluster(TestBaseUtils.java:136)
[error] at org.apache.flink.test.util.TestBaseUtils.startCluster(TestBaseUtils.java:124)
[error] at org.apache.flink.streaming.util.StreamingMultipleProgramsTestBase.setup(StreamingMultipleProgramsTestBase.java:72)
[error] ...
2016-01-20 18:15:38 INFO FlinkMiniCluster:230 - Starting FlinkMiniCluster.
2016-01-20 18:15:38 INFO Slf4jLogger:80 - Slf4jLogger started
2016-01-20 18:15:38 INFO BlobServer:94 - Created BLOB server storage directory /var/folders/n4/_bl8xyqs15xbgy37k889plm80000gn/T/blobStore-07aa1e12-785d-4a18-bb16-6c2664f6b4f2
[...]
2016-01-20 18:15:39 INFO Task:470 - Loading JAR files for task Source: Collection Source -> Flat Map -> Sink: Unnamed (1/1)
2016-01-20 18:15:39 INFO Task:858 - Source: Collection Source -> Flat Map -> Sink: Unnamed (1/1) switched to FAILED with exception.
java.lang.Exception: Could not load the task's invokable class.
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:729)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:474)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.streaming.runtime.tasks.SourceStreamTask
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:725)
... 2 more
2016-01-20 18:15:39 INFO Task:672 - Freeing task resources for Source: Collection Source -> Flat Map -> Sink: Unnamed (1/1)
2016-01-20 18:15:39 INFO TestingTaskManager:128 - Unregistering task and sending final execution state FAILED to JobManager for task Source: Collection Source -> Flat Map -> Sink: Unnamed (717fa0d09f9592d62db7b9e52f08de6e)
2016-01-20 18:15:39 INFO ExecutionGraph:934 - Source: Collection Source -> Flat Map -> Sink: Unnamed (1/1) (717fa0d09f9592d62db7b9e52f08de6e) switched from DEPLOYING to FAILED
2016-01-20 18:15:39 INFO TestingJobManager:137 - Status of job f8d59cfb76aaaeb016a14120125338fd (Flink Streaming Job) changed to FAILING.
java.lang.Exception: Could not load the task's invokable class.
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:729)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:474)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.streaming.runtime.tasks.SourceStreamTask
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:725)
... 2 more
2016-01-20 18:15:39 INFO JobClientActor:280 - 01/20/2016 18:15:39 Source: Collection Source -> Flat Map -> Sink: Unnamed(1/1) switched to FAILED
java.lang.Exception: Could not load the task's invokable class.
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:729)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:474)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.streaming.runtime.tasks.SourceStreamTask
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.flink.runtime.taskmanager.Task.loadAndInstantiateInvokable(Task.java:725)
... 2 more
2016-01-20 18:15:39 INFO JobClientActor:280 - 01/20/2016 18:15:39 Job execution switched to status FAILING.
[...]
This is my build.sbt file:
name := "anonymizer"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-streaming-scala_2.11" % "0.10.1",
  "org.apache.flink" % "flink-clients_2.11" % "0.10.1",
  "org.apache.flink" % "flink-connector-kafka_2.11" % "0.10.1",
  "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.6.3"
)

// Dependencies needed by the unit tests when using junit
libraryDependencies ++= Seq(
  "com.novocode" % "junit-interface" % "0.11" % "test",
  "org.apache.flink" % "flink-streaming-contrib_2.11" % "0.10.1",
  "org.apache.flink" % "flink-streaming-java_2.11" % "0.10.1" % "test" classifier "tests",
  "org.apache.flink" % "flink-core_2.11" % "0.10.1" % "test" classifier "tests",
  "org.apache.flink" % "flink-runtime_2.11" % "0.10.1" % "test" classifier "tests",
  "org.apache.flink" % "flink-test-utils_2.11" % "0.10.1" % "test"
)