So I have Adobe ColdFusion 11 Update 5 installed on a Windows 2012 server. I added a Firebird datasource using the Jaybird 2.2.5 JDBC driver. I created a query using that datasource like so:
<cfquery name="rawdata" datasource="RSReport">
select * from rptpatientstudy
where statusorder >= 200
and issuer = 'RWHG'
and studydatetime >= '2015-01-01'
and studydatetime < '2015-04-01';
</cfquery>
but when I run the page, I get a Dynamic SQL error about an unknown token (RETURNING).
Here's the stack trace:
org.firebirdsql.gds.GDSException: Dynamic SQL Error
SQL error code = -104 Token unknown - line 2, column 1 RETURNING
at org.firebirdsql.gds.impl.wire.AbstractJavaGDSImpl.readStatusVector(AbstractJavaGDSImpl.java:2092)
at org.firebirdsql.gds.impl.wire.AbstractJavaGDSImpl.receiveResponse(AbstractJavaGDSImpl.java:2042)
at org.firebirdsql.gds.impl.wire.AbstractJavaGDSImpl.iscDsqlPrepare(AbstractJavaGDSImpl.java:1465)
at org.firebirdsql.gds.impl.GDSHelper.prepareStatement(GDSHelper.java:190)
at org.firebirdsql.jdbc.AbstractStatement.prepareFixedStatement(AbstractStatement.java:1441)
at org.firebirdsql.jdbc.AbstractStatement.internalExecute(AbstractStatement.java:1423)
at org.firebirdsql.jdbc.AbstractStatement.execute(AbstractStatement.java:867)
at org.firebirdsql.jdbc.AbstractStatement.execute(AbstractStatement.java:441)
at coldfusion.server.j2ee.sql.JRunStatement.execute(JRunStatement.java:359)
at coldfusion.sql.Executive.executeQuery(Executive.java:1479)
at coldfusion.sql.Executive.executeQuery(Executive.java:1229)
at coldfusion.sql.Executive.executeQuery(Executive.java:1159)
at coldfusion.sql.SqlImpl.execute(SqlImpl.java:406)
at coldfusion.tagext.sql.QueryTag.executeQuery(QueryTag.java:1185)
at coldfusion.tagext.sql.QueryTag.startQueryExecution(QueryTag.java:814)
at coldfusion.tagext.sql.QueryTag.doEndTag(QueryTag.java:767)
at cfmarla2ecfm1504149200.runPage(C:\inetpub\WebSites\intranet\marla.cfm:1)
at coldfusion.runtime.CfJspPage.invoke(CfJspPage.java:246)
at coldfusion.tagext.lang.IncludeTag.handlePageInvoke(IncludeTag.java:736)
at coldfusion.tagext.lang.IncludeTag.doStartTag(IncludeTag.java:572)
at coldfusion.filter.CfincludeFilter.invoke(CfincludeFilter.java:65)
at coldfusion.filter.IpFilter.invoke(IpFilter.java:45)
at coldfusion.filter.ApplicationFilter.invoke(ApplicationFilter.java:466)
at coldfusion.filter.MonitoringFilter.invoke(MonitoringFilter.java:40)
at coldfusion.filter.PathFilter.invoke(PathFilter.java:142)
at coldfusion.filter.ExceptionFilter.invoke(ExceptionFilter.java:94)
at coldfusion.filter.ClientScopePersistenceFilter.invoke(ClientScopePersistenceFilter.java:28)
at coldfusion.filter.BrowserFilter.invoke(BrowserFilter.java:38)
at coldfusion.filter.NoCacheFilter.invoke(NoCacheFilter.java:58)
at coldfusion.filter.GlobalsFilter.invoke(GlobalsFilter.java:38)
at coldfusion.filter.DatasourceFilter.invoke(DatasourceFilter.java:22)
at coldfusion.filter.CachingFilter.invoke(CachingFilter.java:62)
at coldfusion.CfmServlet.service(CfmServlet.java:219)
at coldfusion.bootstrap.BootstrapServlet.service(BootstrapServlet.java:89)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at coldfusion.monitor.event.MonitoringServletFilter.doFilter(MonitoringServletFilter.java:42)
at coldfusion.bootstrap.BootstrapFilter.doFilter(BootstrapFilter.java:46)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:422)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:199)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:745)
Here's a list of the fields in the table:
STUDYINSTANCEUID
PATIENTID
PATIENTNAME
STUDYDATETIME
STUDYDESCRIPTION
PRIORITY
READINGPHYSICIAN
REFERINGPHYSICIAN
PERFORMINGPHYSICIAN
TECHNOLOGIST
TRANSCRIPTIONIST
ACCESSIONNUMBER
INSTITUTIONNAME
STATUSORDER
SCHEDULEDBODYPARTEXAMINED
SCHEDULEDMODALITY
UTCOFFSET
HL7BILLED
SCHEDULEDEXAMROOM
PATIENTBIRTHDATE
MODALITIES
BODYPARTS
IMAGES
AGE
SEX
CONFLICT
STUDYID
FILESETID
ISSUER
SCHEDULEDLATERALITY
LOCATION
PROCEDURECODE
DATETIMEREAD
DURATION
FILM
BURNCD
MAIL
COURIER
AUTOFAXLIST
AUTOEMAILLIST
SCHEDULEDRESOURCE
ACCOUNTSTATUS
LANGUAGE
FINANCIALTYPE
STATE
INSURANCEPAYERS
INSURANCEEXPIRIES
COMMENTS
ALLERGIES
CANCELLATIONREASON
CELLPHONE
HOMEPHONE
AUTHORIZATIONNUMBER
DATEOFSURGERY
RECEIVEDTIME
INTERNALVISITID
SMOKINGSTATUS
I'm really not sure where this might have come from, and I can't find anything directly applicable on Google; most of the results seem to deal with derived tables or SELECT * in the query.
TL;DR: It is a bug in Jaybird. I have created bug JDBC-391 and this has been fixed in Jaybird 2.2.8.
The execute method called in the stack trace is execute(String, int), which is usually only called for INSERT/UPDATE statements that should return generated keys. It looks like ColdFusion always calls it this way (with the value Statement.RETURN_GENERATED_KEYS).
The implementation that handles generated keys in Jaybird assumes that its internal statement parser throws an exception when confronted with a SELECT statement; if that happens, the query is supposed to be processed unmodified. Unfortunately there is no test for this assumption, and the parser actually doesn't throw an exception when errors occur during parsing (they are signaled in a different way). As a result the SELECT query is modified by appending a RETURNING clause, which is a syntax error when the query is then sent to the server.
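To make the code path concrete, here is a minimal JDBC sketch (connection URL, credentials and query are placeholders) showing the two execute overloads; only the generated-keys overload goes through the affected code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class GeneratedKeysPath {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/credentials for a Firebird database via Jaybird.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:firebirdsql://localhost:3050/rsreport", "sysdba", "masterkey");
             Statement stmt = conn.createStatement()) {

            String sql = "select * from rptpatientstudy where statusorder >= 200";

            try {
                // The overload ColdFusion appears to use for every query. On the affected
                // Jaybird 2.2.x versions the SELECT is rewritten with a RETURNING clause
                // before it is prepared, so the server rejects it with error -104.
                stmt.execute(sql, Statement.RETURN_GENERATED_KEYS);
            } catch (SQLException e) {
                System.out.println("Generated-keys path failed: " + e.getMessage());
            }

            // The plain overload does not go through the generated-keys handling.
            stmt.execute(sql);
        }
    }
}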
I am not sure if there is a workaround, except maybe reverting to Jaybird 2.1.6, as this problem has been there since the initial implementation in Jaybird 2.2.0. You could also check whether ColdFusion has any configuration that would change the way it executes queries.
This has been fixed in Jaybird 2.2.8.
Disclosure: I am one of the developers of Jaybird.
Related
I'm developing a pipeline that reads a file from Cloud Storage, applies some transformations in Wrangler and writes the resulting rows into a BigQuery table.
The execution returns this error:
org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["string","null"]: false
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308) ~[org.apache.avro.avro-1.8.2.jar:1.8.2]
at io.cdap.plugin.gcp.bigquery.sink.AvroRecordWriter.write(AvroRecordWriter.java:90) ~[%20artifact7616726366619772164.jar:na]
at io.cdap.plugin.gcp.bigquery.sink.AvroRecordWriter.write(AvroRecordWriter.java:37) ~[%20artifact7616726366619772164.jar:na]
at io.cdap.plugin.gcp.bigquery.sink.BigQueryRecordWriter.write(BigQueryRecordWriter.java:58) ~[%20artifact7616726366619772164.jar:na]
at io.cdap.plugin.gcp.bigquery.sink.BigQueryRecordWriter.write(BigQueryRecordWriter.java:32) ~[%20artifact7616726366619772164.jar:na]
at io.cdap.cdap.etl.spark.io.TrackingRecordWriter.write(TrackingRecordWriter.java:41) ~[hydrator-spark-core3_2.12-6.7.1.jar:na]
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:367) ~[spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:137) [spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473) ~[spark-core_2.12-3.1.3.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:134) [spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88) [spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.scheduler.Task.run(Task.scala:131) ~[spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498) ~[spark-core_2.12-3.1.3.jar:3.1.3]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) ~[spark-core_2.12-3.1.3.jar:na]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) ~[spark-core_2.12-3.1.3.jar:3.1.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_332]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_332]
at java.lang.Thread.run(Thread.java:750) ~[na:1.8.0_332]
What could be the cause? Could it be a problem in the schema definition?
I can't see any specific field mentioned in the log. The schema is defined in JSON; here is a small fragment as a sample:
"schema": "{
"type":"record",
"name":"etlSchemaBody",
"fields":[
{
"name":"Id",
"type":["string","null"]
},
{
"name":"IsDeleted",
"type":["boolean","null"]},
...
Thanks and regards
Edit:
No error is returned by the execution of the flow preview.
I also tried replacing the final "BigQuery" node with another GCS sink node writing to a JSON file, and that works fine, so I guess the problem is in the BigQuery sink node.
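For reference, my understanding is that this error comes from Avro's union resolution: the writer is handed a value whose type is not a member of the declared union for that field (here, a boolean where the output schema says ["string","null"]). The log doesn't name the field, so the one below is chosen purely for illustration; this is a minimal plain-Avro sketch, not the pipeline code, that raises the same exception:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;

public class UnionMismatchDemo {
    public static void main(String[] args) throws Exception {
        // Field declared as ["string","null"], mirroring the union in the error message.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":["
            + "{\"name\":\"IsDeleted\",\"type\":[\"string\",\"null\"]}]}");

        GenericData.Record record = new GenericData.Record(schema);
        record.put("IsDeleted", false); // boolean value, not a member of ["string","null"]

        DataFileWriter<GenericData.Record> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericData.Record>(schema));
        writer.create(schema, new ByteArrayOutputStream());
        // Throws AppendWriteException caused by UnresolvedUnionException:
        // Not in union ["string","null"]: false
        writer.append(record);
        writer.close();
    }
}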
I'm trying to use a multi-character delimiter in a table insert for a Hive job on Amazon EMR, as explained in this link. The delimiter for the file is "|".
https://cwiki.apache.org/confluence/display/Hive/MultiDelimitSerDe
However, I ended up having to use...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
Instead of the documented...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
so that it would not give me this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hadoop.hive.serde2.MultiDelimitSerDe
OK. So when I avoid that error by adding .contrib, I get this error instead:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1548264520414_0027_1_00, diagnostics=[Task failed, taskId=task_1548264520414_0027_1_00_000021, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1548264520414_0027_1_00_000021_0:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:354)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:184)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:328)
at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:420)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:286)
... 15 more
So I've been reading that you have to add the .jar file.
https://community.hortonworks.com/questions/82189/hive-cannot-see-jar.html
And so I've tried all kinds of things to get this to work. It says that it is adding it to the class path.
hive> add jar /usr/lib/hive/lib/hive-contrib-2.3.3-amzn-1.jar
> ;
Added [/usr/lib/hive/lib/hive-contrib-2.3.3-amzn-1.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib-2.3.3-amzn-1.jar]
hive> add jar /usr/lib/hive/lib/hive-contrib.jar
> ;
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
hive> exit;
So I'm not sure what to do. It's acting as if the .jar file for hive-contrib isn't in the class path despite me adding it. I've also tried running...
export HADOOP_USER_CLASSPATH_FIRST=true
which is found here...
How to include jars in Hive (Amazon Hadoop env)
And that doesn't fix it either.
How can I use a multi-character delimiter SerDe for a Hive job on AWS?
Thank you.
I could not get MultiDelimitSerDe to work. Instead, I was lucky in that the delimiter had quotation marks on either side of the pipe, so it looks like "|". This turns the values between the quotes into strings, so the additional pipes in those column values don't act as delimiters.
"Test | Test2 "|" Test3 | Test 4 | Test 5 "|" Test 6 "
You can see an explanation in the link below. The part that talks about it is in the comments, not the article.
https://www.ericlin.me/2015/07/how-to-create-a-hive-multi-character-delimitered-table/
If I didn't have those quotation marks around the delimiter, I'm not sure how I would have been able to work with a multi-character delimiter, especially if I had quotation marks in any of my fields; but after checking, out of the billions of rows, there is not a single quote.
For small S3 input files (~10 GB) the Glue ETL job works fine, but for the larger dataset (~200 GB) the job fails.
Here is part of the ETL code:
# Converting Dynamic frame to dataframe
df = dropnullfields3.toDF()
# create new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
# store the data in parquet format on s3
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")
The job ran for 4 hours and then threw this error:
File "script_2017-11-23-15-07-32.py", line 49, in
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir,
mode="append") File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py",
line 550, in save File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco File
"/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o172.save. : org.apache.spark.SparkException:
Job aborted. at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:280) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:214) at
java.lang.Thread.run(Thread.java:748) Caused by:
org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 3385 tasks (1024.1 MB) is bigger
than spark.driver.maxResultSize (1024.0 MB) at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at
org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 30 more
End of LogType:stdout
I would appreciate it if you could provide any guidance to resolve this issue.
You can only set configurable options like spark.driver.maxResultSize when the context is instantiated, and Glue provides you with a context (from memory, you can't instantiate a new one), so I don't think you will be able to change the value of this property.
You'll normally get this error if you collect results to the driver that exceed the specified size. You aren't doing that in this case, so the error is confusing.
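For comparison, outside Glue you would normally raise this limit when the context/session is created; a minimal sketch (Java Spark API, the value and local master are just for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class MaxResultSizeExample {
    public static void main(String[] args) {
        // spark.driver.maxResultSize must be set before the context exists;
        // in Glue the context is handed to you, so this option is effectively fixed.
        SparkConf conf = new SparkConf()
                .setAppName("example")
                .setMaster("local[*]") // local master only so the sketch runs standalone
                .set("spark.driver.maxResultSize", "4g");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        spark.stop();
    }
}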
It seems like you are spawning 3385 tasks, which are presumably related to the dates in your input file (3385 dates, ~9 years?). You might try writing this file in batches, e.g.
from pyspark.sql.functions import year

partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
for yr in range(2000, 2018):
    # write one year at a time so each job stays small
    batch = partitioned_dataframe.where(year('part_date') == yr)
    batch.write.partitionBy('part_date') \
        .format("parquet") \
        .save(output_lg_partitioned_dir, mode="append")
I haven't checked this code end to end; note that it relies on pyspark.sql.functions.year, hence the import at the top.
When I've done data processing with Glue, I've simply found that batching the work was more effective than trying to get large datasets to complete in one go. The system is good but hard to debug; stability on large data doesn't come easily.
I am trying to import JSON data from S3 and, after running some queries, export the output as JSON to S3 again. However, I get the "org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected" error at the Hive step on the EMR cluster. To understand what the problem is, I simplified the Hive script and the JSON data, but it keeps giving the same error. How can I solve this problem?
Cluster configuration:
Release: emr-5.3.1
Hive version: 2.1.1
Hadoop distribution: Amazon 2.7.3
Service Role: EMR_DefaultRole
MasterInstanceType: m4.large
The content of the simplified JSON data:
[{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
Hive script:
DROP TABLE IF EXISTS SOURCE;
DROP TABLE IF EXISTS DESTINATION;
CREATE EXTERNAL TABLE SOURCE(MyID STRING, MyField STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://myPath/subPath/';
CREATE EXTERNAL TABLE DESTINATION(MyID STRING, MyField STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://anotherPath/subPath/';
INSERT OVERWRITE TABLE DESTINATION SELECT MyID, MyField FROM SOURCE;
And here is the stack trace:
Vertex failed, vertexName=Map 4, vertexId=vertex_1278452616863_0001_1_00, diagnostics=[Task failed, taskId=task_1278452616863, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1278452616863:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:95)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:70)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:383)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
... 14 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable [{"MyID":"FOO123","MyField":"FOO"},{"MyID":"BAR123","MyField":"BAR"}]
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:86)
... 17 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected
at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:183)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.readRow(MapOperator.java:128)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.access$200(MapOperator.java:92)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:488)
... 18 more
Caused by: java.io.IOException: Start token not found where expected
at org.apache.hive.hcatalog.data.JsonSerDe.deserialize(JsonSerDe.java:169)
... 21 more
Thanks.
The JSON should start with { and not with an array ([).
I tried this approach and updated my JSON file with the structure as
{"MyID":"FOO123","MyField":"FOO"},
{"MyID":"BAR123","MyField":"BAR"}
but after doing so, I noticed that only the first object is being inserted into the table.
How do I troubleshoot the following OpenJPA error?
"Attempt to commit a null javax.transaction.Transaction. Some application servers set the transaction to null if a rollback occurs."
I see a similar error here, but there weren't any comments.
Full stack trace:
[2012-03-12 02:32:29,476] ERROR {org.apache.ode.bpel.engine.BpelEngineImpl} - Scheduled job failed; jobDetail=JobDetails( instanceId: null mexId: hqejbhcnphr7333k8shi7i processId: {http://ode/bpel/sampleprocess2}sampleProcess-1 type: INVOKE_INTERNAL channel: null correlatorId: null correlationKeySet: null retryCount: null inMem: false detailsExt: {enqueue=false}) {org.apache.ode.bpel.engine.BpelEngineImpl}
<openjpa-2.0.0-wso2v1-r52033:64539M nonfatal user error> org.apache.openjpa.persistence.InvalidStateException: Attempt to commit a null javax.transaction.Transaction. Some application servers set the transaction to null if a rollback occurs.
at org.apache.openjpa.kernel.BrokerImpl.setRollbackOnlyInternal(BrokerImpl.java:1595)
at org.apache.openjpa.kernel.BrokerImpl.setRollbackOnly(BrokerImpl.java:1581)
at org.apache.openjpa.kernel.DelegatingBroker.setRollbackOnly(DelegatingBroker.java:951)
at org.apache.openjpa.persistence.EntityManagerImpl.setRollbackOnly(EntityManagerImpl.java:605)
at org.apache.openjpa.persistence.PersistenceExceptions$2.translate(PersistenceExceptions.java:77)
at org.apache.openjpa.kernel.DelegatingQuery.translate(DelegatingQuery.java:99)
at org.apache.openjpa.kernel.DelegatingQuery.execute(DelegatingQuery.java:536)
at org.apache.openjpa.persistence.QueryImpl.execute(QueryImpl.java:288)
at org.apache.openjpa.persistence.QueryImpl.getResultList(QueryImpl.java:300)
at org.apache.ode.dao.jpa.ProcessDAOImpl.getCorrelator(ProcessDAOImpl.java:95)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.select(BpelRuntimeContextImpl.java:306)
at org.apache.ode.bpel.runtime.PICK.run(PICK.java:105)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.ode.jacob.vpu.JacobVPU$JacobThreadImpl.run(JacobVPU.java:451)
at org.apache.ode.jacob.vpu.JacobVPU.execute(JacobVPU.java:139)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.execute(BpelRuntimeContextImpl.java:879)
at org.apache.ode.bpel.engine.PartnerLinkMyRoleImpl.invokeNewInstance(PartnerLinkMyRoleImpl.java:205)
at org.apache.ode.bpel.engine.BpelProcess$1.invoke(BpelProcess.java:309)
at org.apache.ode.bpel.engine.BpelProcess.invokeProcess(BpelProcess.java:250)
at org.apache.ode.bpel.engine.BpelProcess.invokeProcess(BpelProcess.java:305)
at org.apache.ode.bpel.engine.BpelProcess.handleJobDetails(BpelProcess.java:458)
at org.apache.ode.bpel.engine.BpelEngineImpl.onScheduledJob(BpelEngineImpl.java:553)
at org.apache.ode.bpel.engine.BpelServerImpl.onScheduledJob(BpelServerImpl.java:445)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:537)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:531)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:284)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:239)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:531)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:515)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-03-12 02:32:29,477] ERROR {org.apache.ode.scheduler.simple.SimpleScheduler} - Error while processing a persisted job: Job hqejbhcnphr7333k8shi7j time: 2012-03-12 02:32:29 PDT transacted: true persisted: true details: JobDetails( instanceId: null mexId: hqejbhcnphr7333k8shi7i processId: {http://ode/bpel/sampleprocess2}sampleProcess-1 type: INVOKE_INTERNAL channel: null correlatorId: null correlationKeySet: null retryCount: null inMem: false detailsExt: {enqueue=false}) {org.apache.ode.scheduler.simple.SimpleScheduler}
java.lang.IllegalStateException: No transaction associated with current thread
at org.apache.geronimo.transaction.manager.TransactionManagerImpl.rollback(TransactionManagerImpl.java:247)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:297)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:239)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:531)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:515)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-03-12 02:32:29,683] ERROR {org.apache.ode.scheduler.simple.SimpleScheduler} - Error while processing job, retrying in 5s {org.apache.ode.scheduler.simple.SimpleScheduler}
OpenJPA properties
"openjpa.TransactionMode", "managed"
"openjpa.Log", "commons"
"openjpa.ManagedRuntime", new JpaTxMgrProvider(_tm)
"openjpa.ConnectionFactory", _ds
"openjpa.ConnectionFactoryMode", "managed"
"openjpa.jdbc.TransactionIsolation", "read-committed"
"openjpa.FlushBeforeQueries", "true"
I'm running a standalone server which uses Embedded Tomcat.
I'm not expecting a solution, but some pointers to troubleshoot the issue.
Thanks,
Waruna
I have the same case:
Caused by: <openjpa-2.2.2-r422266:1468616 nonfatal user error> org.apache.openjpa.persistence.InvalidStateException: Attempt to commit a null javax.transaction.Transaction. Some application servers set the transaction to null if a rollback occurs.
at org.apache.openjpa.kernel.BrokerImpl.setRollbackOnlyInternal(BrokerImpl.java:1664)
at org.apache.openjpa.kernel.BrokerImpl.setRollbackOnly(BrokerImpl.java:1650)
... 26 more
If your persistence.xml has a property like this one:
<properties>
<property name="openjpa.jdbc.SynchronizeMappings" value="buildSchema(foreignKeys=true)"/>
</properties>
then OpenJPA generates all tables from your JPA classes, including the table OPENJPA_SEQUENCE_TABLE (very important!).
Without the "openjpa.jdbc.SynchronizeMappings" property the OPENJPA_SEQUENCE_TABLE table will not be generated, and you get exceptions in the OpenJPA log like this one:
fatal store error> org.apache.openjpa.util.StoreException: Table "OPENJPA_SEQUENCE_TABLE" not found; SQL statement:
SELECT SEQUENCE_VALUE FROM PUBLIC.OPENJPA_SEQUENCE_TABLE WHERE ID = ? FOR UPDATE [42102-174] {SELECT SEQUENCE_VALUE FROM PUBLIC.OPENJPA_SEQUENCE_TABLE WHERE ID = ? FOR UPDATE} [code=42102, state=42S02]
at org.apache.openjpa.jdbc.sql.DBDictionary.narrow(DBDictionary.java:4962)
at org.apache.openjpa.jdbc.sql.DBDictionary.newStoreException(DBDictionary.java:4922)
at org.apache.openjpa.jdbc.sql.SQLExceptions.getStore(SQLExceptions.java:136)
at org.apache.openjpa.jdbc.sql.SQLExceptions.getStore(SQLExceptions.java:110)
at org.apache.openjpa.jdbc.sql.SQLExceptions.getStore(SQLExceptions.java:62)
at org.apache.openjpa.jdbc.kernel.AbstractJDBCSeq.next(AbstractJDBCSeq.java:66)
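If you would rather not touch persistence.xml, the same mapping property can, as far as I know, also be passed programmatically when the EntityManagerFactory is created. A minimal sketch, with an assumed persistence-unit name:

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class OpenJpaBootstrap {
    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<String, Object>();
        // Same effect as the persistence.xml property above: let OpenJPA build the
        // schema, including OPENJPA_SEQUENCE_TABLE, from the mapped classes.
        props.put("openjpa.jdbc.SynchronizeMappings", "buildSchema(foreignKeys=true)");

        // "myPU" is a placeholder persistence-unit name.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("myPU", props);
        emf.close();
    }
}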
Edit: try to add this table manually:
CREATE TABLE OPENJPA_SEQUENCE_TABLE
(
ID TINYINT PRIMARY KEY NOT NULL,
SEQUENCE_VALUE BIGINT
);