Error when running Pytest with Delta Lake Tables - unit-testing

I am working in the VDI of a company and they use their own artifactory for security reasons.
Currently I am writing unit tests for a function that deletes entries from a Delta table. When I started, I received an error about unresolved dependencies, because my Spark session was configured to load its jars from Maven. I was able to solve this issue by loading the jars locally from /opt/spark/jars. Now my code looks like this:
import os
import unittest

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

class TestTransformation(unittest.TestCase):
    #classmethod
    def test_ksu_deletion(self):
        self.spark = SparkSession.builder\
            .appName('SPARK_DELETION')\
            .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")\
            .config("spark.jars", "/opt/spark/jars/delta-core_2.12-0.7.0.jar, /opt/spark/jars/hadoop-aws-3.2.0.jar")\
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")\
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")\
            .getOrCreate()
        os.environ["KSU_DELETION_OBJECT"] = "UNITTEST/"
        deltatable = DeltaTable.forPath(self.spark, "/projects/some/path/snappy.parquet")
        deltatable.delete(col("DATE") < get_current())  # get_current() is a helper defined elsewhere
However, I am getting the error message:
E py4j.protocol.Py4JJavaError: An error occurred while calling z:io.delta.tables.DeltaTable.forPath.
E : java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
Do you have any idea what is causing this? I am assuming it has to do with the way I am configuring spark.sql.extensions and/or spark.sql.catalog, but to be honest, I am quite a newbie in Spark.
I would greatly appreciate any hint.
Thanks a lot in advance!
Edit:
We are using Spark 3.0.2 (Scala 2.12.10). According to https://docs.delta.io/latest/releases.html, this should be compatible. Apart from the SparkSession, I trimmed the subsequent code down to:
df = spark.read.parquet("Path/to/file.snappy.parquet")
and now I am getting the error message:
java.lang.IncompatibleClassChangeError: class org.apache.spark.sql.catalyst.plans.logical.DeltaDelete has interface org.apache.spark.sql.catalyst.plans.logical.UnaryNode as super class
As I said, I am quite new to (Py)Spark, so please don't hesitate to mention things you consider completely obvious.
Edit 2: I checked the Python path I am exporting in the Shell before running the code and I can see the following:
Could this cause any problem? I don't understand why I do not get this error when running the code within pipenv (with spark-submit).

It looks like you're using an incompatible version of the Delta Lake library. Version 0.7.0 was built for Spark 3.0, but you're running another Spark version, either lower or higher. Consult the Delta releases page to find the mapping between Delta versions and the required Spark versions.
If you're using Spark 3.1 or 3.2, consider using the delta-spark Python package, which installs all necessary dependencies, so you can just import the DeltaTable class.
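For reference, a minimal sketch of that approach (assuming Spark 3.1+ and a matching delta-spark release installed via pip; the table path is the one from the question):

from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("SPARK_DELETION")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# configure_spark_with_delta_pip pulls in the Delta jars that match the
# installed delta-spark version, so no manual spark.jars entry is needed
spark = configure_spark_with_delta_pip(builder).getOrCreate()

deltatable = DeltaTable.forPath(spark, "/projects/some/path/snappy.parquet")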
Update: yes, this happens because of conflicting versions - you need to remove the delta-spark and pyspark Python packages and install pyspark==3.0.2 explicitly.
P.S. Also, look at the pytest-spark package, which can simplify the Spark configuration for all tests. You can find examples of it together with Delta here.
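If you'd rather not add another dependency, a plain session-scoped pytest fixture gives a similar effect. A rough sketch, reusing the jar paths and configs from the question (adjust them to versions that actually match your Spark build):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # one configured session shared by all tests in the run
    session = (
        SparkSession.builder
        .appName("unit-tests")
        .config("spark.jars", "/opt/spark/jars/delta-core_2.12-0.7.0.jar,/opt/spark/jars/hadoop-aws-3.2.0.jar")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_read_parquet(spark):
    # every test simply declares the fixture as an argument
    df = spark.read.parquet("/projects/some/path/snappy.parquet")
    assert df.count() >= 0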

Related

Amazon Deequ (Spark + Scala) - java.lang.NoSuchMethodError: 'scala.Option org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression$default$2()'

Spark Version - 3.0.1
Amazon Deequ version - deequ-2.0.0-spark-3.1.jar
I'm running the code below in spark-shell on my local machine:
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{ApproxCountDistinct, Completeness, Compliance, Correlation, Entropy, Maximum, Mean, Minimum, Size}

val analysisResult: AnalyzerContext = AnalysisRunner.onData(datasourcedf)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("customerNumber"))
  .addAnalyzer(ApproxCountDistinct("customerNumber"))
  .addAnalyzer(Minimum("creditLimit"))
  .addAnalyzer(Mean("creditLimit"))
  .addAnalyzer(Maximum("creditLimit"))
  .addAnalyzer(Entropy("creditLimit"))
  .run()
ERROR:
java.lang.NoSuchMethodError: 'scala.Option
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression$default$2()'
at org.apache.spark.sql.DeequFunctions$.withAggregateFunction(DeequFunctions.scala:31)
at org.apache.spark.sql.DeequFunctions$.stateful_approx_count_distinct(DeequFunctions.scala:60)
at com.amazon.deequ.analyzers.ApproxCountDistinct.aggregationFunctions(ApproxCountDistinct.scala:52)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$3(AnalysisRunner.scala:319)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.immutable.List.flatMap(List.scala:355)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:319)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
... 63 elided
Can someone please let me know how to resolve this issue?
You can't use Deequ version 2.0.0 (built for Spark 3.1) with Spark 3.0, because it's binary incompatible due to changes in Spark's internals. With Spark 3.0 you need to use version 1.2.2-spark-3.0.

MassTransit.RabbitMqTransport.RabbitMqAddressException: 'The invalid scheme was specified: amqps'

Trying to configure a .NET Core service to talk to an Amazon MQ (RabbitMQ) instance. This is the config I'm using:
https://masstransit-project.com/usage/transports/rabbitmq.html#amazonmq-rabbitmq
cfg.Host(new Uri("amqps://b-mdefg-ff37-4e33-855c-577a0c749659.mq.us-westeast-2.amazonaws.com:5432/"), h =>
{
    h.Username("MyUsername");
    h.Password("XXXXXXXXX");
});
Nuget packages are:
MassTransit.AspNetCore v5.5.6
MassTransit.Autofac v5.3.2
MassTransit.RabbitMQ v5.3.2
MassTransit.SerilogIntegration v5.1.5
The exception being thrown is:
MassTransit.RabbitMqTransport.RabbitMqAddressException: 'The invalid scheme was specified: amqps'
Upgrading to the latest MassTransit caused a lot of unrelated issues. Any clues as to a workaround?
OK, the solution was fairly straightforward, but Google gave me nothing, so perhaps this will save someone some time: I simply upgraded MassTransit.RabbitMQ to v5.5.6 (matching MassTransit.AspNetCore) and it all works now.

How to get Instance name in CommandBox CF 2018?

I recently started using CommandBox to run ColdFusion in my local environment. After playing around for a while, one issue I ran into was related to the adminapi. Here is the code that I use in one of my projects:
adminObj = createObject("component","cfide.adminapi.runtime");
instance = adminObj.getInstanceName();
This code is pretty straightforward and works just fine if I install the traditional ColdFusion developer version on my machine. I tried running it on CommandBox with this engine setting:
"app":{ "cfengine":"adobe#2018.0.7" }
After I run the code above this is the error message I got:
Object Instantiation Exception.
Class not found: com.adobe.coldfusion.entman.ProcessServer
The first debugging step was to check whether the component exists. I simply checked that like this:
adminObj = createObject("component","cfide.adminapi.runtime");
writeDump(adminObj);
The result I got on the screen was this:
component CFIDE.adminapi.runtime
extends CFIDE.adminapi.base
METHODS
Then I tried this to make sure method exists in the scope:
adminObj = createObject("component","cfide.adminapi.runtime");
writeDump(adminObj.getInstanceName);
The output looked like this, which confirmed that the method getInstanceName exists:
function getInstanceName
Arguments: none
ReturnType: any
Roles:
Access: public
Output: false
DisplayName:
Hint: returns the current instance name
Description:
The error occurs only when I call the function getInstanceName(). Does anyone know what could be the reason for this error? Is there any solution for this particular problem? As I already mentioned, this method works in the traditional ColdFusion 2018 developer environment. Thank you.
This is a bug in Adobe ColdFusion. The CFC you're creating is trying to instantiate a specific Java class. I recognize the class name com.adobe.coldfusion.entman.ProcessServer as being related to their enterprise manager, which controls features only available in certain versions of CF, as well as features only available on their "standard" Tomcat installation (as opposed to a J2EE deployment like CommandBox).
Please report this to Adobe in the Adobe bug tracker as they appear to be incorrectly detecting the servlet installation. I worked with them a couple years ago to improve their servlet detection on CommandBox, but I guess they still have some issues.
As a workaround, you could try and find out what jar that class is from on a non-CommandBox installation of Adobe ColdFusion and add it to the path, but I can't promise that it will work and that it won't have negative consequences.

Dataflow breaks using TaggedOutputs, "can't pickle WeakDictionary"

We are trying to deploy a streaming pipeline to Dataflow in which we split the data into a few different "routes" that we then process differently.
We did the complete development with the DirectRunner, and it worked smoothly in our tests, but now that we have deployed it to Dataflow, it does not work.
The code fails when yielding in the following DoFn:
import logging

import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class SplitByRoute(beam.DoFn):
    OUTPUT_TAG_ROUTE_ONE = "route_one"
    OUTPUT_TAG_ROUTE_TWO = "route_two"
    OUTPUT_NOT_SUPPORTED = "not_supported"

    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, elem):
        try:
            route = self.define_route(elem["param"])  # Just tag it depending on param
        except Exception:
            route = None
        logging.info(f"Routed to {route}")
        if route == self.OUTPUT_TAG_ROUTE_ONE:
            yield TaggedOutput(self.OUTPUT_TAG_ROUTE_ONE, elem)
        elif route == self.OUTPUT_TAG_ROUTE_TWO:
            logging.info(f"Element: {elem}")
            yield TaggedOutput(self.OUTPUT_TAG_ROUTE_TWO, elem)
        else:
            yield TaggedOutput(self.OUTPUT_NOT_SUPPORTED, elem)
It does log the element and yield the output, but then fails with the following error:
AttributeError: Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove' [while running 'generatedPtransform-3196']
Another consideration is that we use TaggedOutputs earlier in the pipeline, and those work on Dataflow, but this DoFn in particular fails with the error mentioned. Could it be the memory cache, or something related to it, where weakrefs are used?
As far as I know, this error happens when you have a class defined inside another one. Maybe not(?)
Any suggestions on how we could manage this? It's been a very frustrating error.
Thank you!!! :)
We found the error
As you might know, apache-beam uses the dill package to serialize data between modules. This lets us pickle an instance of an object and send it through the pipeline.
The problem was that in self.define_route(elem["param"]) we used that instance of the class and modified one of its attributes. As the answer from Samuel Romero says, you can pickle a class, but I didn't really know (and probably few people do) that if you modify the class instance it cannot be pickled again. That's strange behaviour, I know, so I opened an issue on BEAM (https://issues.apache.org/jira/browse/BEAM-10384) if you want to check it out.
I will probably dig into it (to understand the problem better) sooner or later, but if someone hits the same error, the workaround, as I mentioned, is not to modify the instance of a class that is being serialized; a short sketch follows below.
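A minimal sketch of that workaround, using the SplitByRoute DoFn from the question (the routing logic here is simplified and purely hypothetical):

import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class SplitByRoute(beam.DoFn):
    OUTPUT_TAG_ROUTE_ONE = "route_one"
    OUTPUT_NOT_SUPPORTED = "not_supported"

    def process(self, elem):
        # keep everything in local variables; writing to self.* here is what
        # can leave the instance in a state that no longer serializes
        route = self.define_route(elem["param"])
        if route == self.OUTPUT_TAG_ROUTE_ONE:
            yield TaggedOutput(self.OUTPUT_TAG_ROUTE_ONE, elem)
        else:
            yield TaggedOutput(self.OUTPUT_NOT_SUPPORTED, elem)

    def define_route(self, param):
        # hypothetical, side-effect-free routing: no instance attributes touched
        return self.OUTPUT_TAG_ROUTE_ONE if param == "one" else None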
Thanks to anyone who tried to help!
As you can read here, Python uses the pickle library for data serialization, and it is subject to its limitations. Data serialization is the way processes transfer data between one another, since they do not share memory space.
Here I found a suggestion about using a fork of the multiprocessing module that uses the dill package instead of pickle. This fork is part of the pathos framework (as is the dill package) and is now called pathos.multiprocess, not pathos.multiprocessing as stated in the reference I mentioned previously.
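To make the limitation concrete, the snippet below (plain Python, no Beam involved) reproduces the same kind of message as in the question: the standard pickle module cannot serialize an object that ends up holding a WeakValueDictionary, because that dictionary keeps a callback function defined locally inside its __init__.

import pickle
import weakref

class Holder:
    def __init__(self):
        # WeakValueDictionary stores an internal callback defined inside
        # its __init__, which plain pickle cannot serialize
        self.cache = weakref.WeakValueDictionary()

try:
    pickle.dumps(Holder())
except Exception as exc:
    # expected: AttributeError: Can't pickle local object
    #           'WeakValueDictionary.__init__.<locals>.remove'
    print(type(exc).__name__, exc)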

Redmine_RE plugin is not saving the data

I am using Redmine 3.1.0 and have installed the redmine_re plugin. But when I try to save a requirement using the redmine_re plugin, I get the following error:
NameError (undefined local variable or method `connection' for #<ReArtifactRelationship:0x800ddb0>):
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:220:in `bottom_item'
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:214:in `bottom_position_in_list'
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:205:in `add_to_list_bottom'
lib/redmine/sudo_mode.rb:63:in `sudo_mode'
Please suggest how to resolve this error.
#ste26054
I did not develop this plugin, but I think the support for Redmine 3.1.0 is only partial at the moment. (And you may get other errors even after fixing this.)
I believe you are getting an error because of this: Deprecate #connection in favour of accessing it via the class
And your error is related to this file:
In this method:
def scope_condition()
  "#{connection.quote_column_name("source_id")} = #{quote_value(self.source_id)}
   AND
   #{connection.quote_column_name("relation_type")} = #{quote_value(self.relation_type)}"
end
Try adding self.class. in front of connection.
You may have to repeat this for other files in the code.
If your changes work, I would suggest submitting a pull request on the plugin's GitHub page :)