Pipeline will fail on GCP when writing tensorflow transform metadata - google-cloud-platform

I hope somebody here can help. I've been googling this error like crazy but haven't found anything.
I have a pipeline that works perfectly when executed locally but it fails when executed on GCP. The following are the error messages that I get.
Workflow failed. Causes: S03:Write transform fn/WriteMetadata/ResolveBeamFutures/CreateSingleton/Read+Write transform fn/WriteMetadata/ResolveBeamFutures/ResolveFutures/Do+Write transform fn/WriteMetadata/WriteMetadata failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
Traceback (most recent call last):
  File "preprocess.py", line 491, in <module>
    main()
  File "preprocess.py", line 487, in main
    transform_data(args,pipeline_options,runner)
  File "preprocess.py", line 451, in transform_data
    eval_data |= 'Identity eval' >> beam.ParDo(Identity())
  File "/Library/Python/2.7/site-packages/apache_beam/pipeline.py", line 335, in __exit__
    self.run().wait_until_finish()
  File "/Library/Python/2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 897, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10607)
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10501)
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 300, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:9702)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: __new__() takes exactly 4 arguments (1 given)
Any ideas??
Thanks,
Pedro

If the pipeline works locally but fails on GCP it's possible that you're running into a version mismatch.
What TF, tf.Transform, beam versions are you running locally and on GCP?
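One way to answer that question is to print the installed versions in both environments and compare them. This is a minimal sketch, assuming Python 3.8+ (for importlib.metadata; the traceback above is from Python 2.7, where pip freeze would be the closest equivalent) and the PyPI distribution names tensorflow, tensorflow-transform and apache-beam. The helper name report_versions is made up for illustration.

```python
from importlib import metadata


def report_versions(packages=("tensorflow", "tensorflow-transform", "apache-beam")):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this locally and comparing the result with the versions pinned in whatever setup.py or requirements file is shipped with the job should show whether the two sides disagree.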

Related

error when trying to run lambda locally using docker with wsl2

I'm trying to get the AWS SAM CLI working so I can test Lambda functions locally. I've installed the helloworld Python function, which I can build and validate successfully until I add the --use-container flag, at which point I get the errors below.
I have Docker Desktop installed and running. I'm using WSL2 with Ubuntu 20.04 on a Windows 11 machine.
mylaptop:~/projects/lambda/lambda-python3.8$ sam build --use-container
Starting Build inside a container
Your template contains a resource with logical ID "ServerlessRestApi", which is a reserved logical ID in AWS SAM. It could result in unexpected behaviors and is not recommended.
Building codeuri: /home/projects/lambda/lambda-python3.8/hello_world runtime: python3.8 metadata: {} architecture: x86_64 functions: HelloWorldFunction
<3>init: (15570) ERROR: UtilConnectUnix:467: connect failed 111
Traceback (most recent call last):
File "docker/credentials/store.py", line 80, in _execute
File "subprocess.py", line 411, in check_output
File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-desktop.exe', 'get']' returned non-zero exit status 255.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker/auth.py", line 264, in _resolve_authconfig_credstore
File "docker/credentials/store.py", line 35, in get
File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-desktop.exe exited with "".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "samcli/__main__.py", line 12, in <module>
File "click/core.py", line 829, in __call__
File "click/core.py", line 782, in main
File "click/core.py", line 1259, in invoke
File "click/core.py", line 1066, in invoke
File "click/core.py", line 610, in invoke
File "click/decorators.py", line 73, in new_func
File "click/core.py", line 610, in invoke
File "samcli/lib/telemetry/metric.py", line 166, in wrapped
File "samcli/lib/telemetry/metric.py", line 124, in wrapped
File "samcli/lib/utils/version_checker.py", line 41, in wrapped
File "samcli/cli/main.py", line 87, in wrapper
File "samcli/commands/build/command.py", line 201, in cli
File "samcli/commands/build/command.py", line 262, in do_cli
File "samcli/commands/build/build_context.py", line 248, in run
File "samcli/lib/build/app_builder.py", line 221, in build
File "samcli/lib/build/build_strategy.py", line 79, in build
File "samcli/lib/build/build_strategy.py", line 89, in _build_functions
File "samcli/lib/build/build_strategy.py", line 171, in build_single_function_definition
File "samcli/lib/build/app_builder.py", line 654, in _build_function
File "samcli/lib/build/app_builder.py", line 819, in _build_function_on_container
File "samcli/local/docker/manager.py", line 115, in run
File "samcli/local/docker/manager.py", line 85, in create
File "samcli/local/docker/manager.py", line 160, in pull_image
File "docker/api/image.py", line 396, in pull
File "docker/auth.py", line 48, in get_config_header
File "docker/auth.py", line 324, in resolve_authconfig
File "docker/auth.py", line 235, in resolve_authconfig
File "docker/auth.py", line 281, in _resolve_authconfig_credstore
docker.errors.DockerException: Credentials store error: StoreError('Credentials store docker-credential-desktop.exe exited with "".')
[15567] Failed to execute script __main__
I ran docker-credential-desktop.exe version, which resulted in the same 111 error message, so I was able to isolate the issue to docker-credential-desktop.exe. After googling around and trying lots of different suggestions, the following finally worked for me, with no restart required:
mv ~/.docker ~/.docker_old
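Moving ~/.docker aside presumably works because it takes ~/.docker/config.json out of play. On WSL2 setups with Docker Desktop that file often contains a credsStore entry naming the desktop.exe helper, which is the binary the traceback shows being invoked. That is an assumption about this setup, not something confirmed above. A small sketch to check it before renaming the whole directory (the function name find_creds_store is hypothetical):

```python
import json
from pathlib import Path


def find_creds_store(config_path=Path.home() / ".docker" / "config.json"):
    """Return the credential store named in a Docker CLI config, or None."""
    path = Path(config_path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("credsStore")


if __name__ == "__main__":
    store = find_creds_store()
    if store:
        # Docker looks for a helper binary named docker-credential-<store>.
        print(f"credsStore is {store!r} (helper: docker-credential-{store})")
    else:
        print("No credsStore configured")
```

If the script reports a store whose helper cannot run inside WSL, removing only that key from config.json may be a narrower alternative to moving the directory.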

Problem with Dataflow runner and jaydebeapi (one-time problem)

Some information on the Dataflow pipeline involved in this incident:
the pipeline moves data from an Oracle source to BigQuery;
it is written in Python 3.6;
it uses ojdbc, a JDK and jaydebeapi;
our code ensures that all required libraries are installed on every Dataflow worker before execution.
Problem description:
On 21 October 2020 we experienced a problem with a Dataflow worker (in the europe-west3 region); see the log below. It seems the worker couldn't load or use the jaydebeapi library.
2020-10-21 17:28:42.792 CEST Error message from worker:
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 997, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
  File "apache_beam/runners/common.py", line 490, in apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
  File "apache_beam/runners/common.py", line 496, in apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
  File "/usr/local/lib/python3.7/site-packages/libs/dataflow/common.py", line 269, in start_bundle
    jars=[f"/tmp/{self.ojdbc_lib}"]
  File "/usr/local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 412, in connect
    jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs)
  File "/usr/local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 199, in _jdbc_connect_jpype
    convertStrings=True)
  File "/usr/local/lib/python3.7/site-packages/jpype/_core.py", line 216, in startJVM
    ignoreUnrecognized, convertStrings, interrupt)
SystemError: java.lang.ClassNotFoundException: org.jpype.classloader.DynamicClassLoader
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 638, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 662, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 664, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 666, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/common.py", line 1014, in apache_beam.runners.common.DoFnRunner.start
  File "apache_beam/runners/common.py", line 999, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
  File "apache_beam/runners/common.py", line 1045, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/usr/local/lib/python3.7/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 997, in apache_beam.runners.common.DoFnRunner._invoke_bundle_method
  File "apache_beam/runners/common.py", line 490, in apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
  File "apache_beam/runners/common.py", line 496, in apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
  File "/usr/local/lib/python3.7/site-packages/libs/dataflow/common.py", line 269, in start_bundle
    jars=[f"/tmp/{self.ojdbc_lib}"]
  File "/usr/local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 412, in connect
    jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs)
  File "/usr/local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 199, in _jdbc_connect_jpype
    convertStrings=True)
  File "/usr/local/lib/python3.7/site-packages/jpype/_core.py", line 216, in startJVM
    ignoreUnrecognized, convertStrings, interrupt)
SystemError: java.lang.ClassNotFoundException: org.jpype.classloader.DynamicClassLoader [while running 'Read from Oracle source/Read from database']
The problem occurred several more times when we re-ran exactly the same code, then disappeared, and everything has worked well since with the same code. Do you have any idea what could have happened? It looks to us like an issue with the infrastructure or worker provisioning.

Sonos Python Self Test error: No handlers could be found for logger "smapi"

I am trying to run the Sonos self-test for a music service on Sonos.
After getting the dependencies and filling out the config file, I try to run the Python Sonos self-test, but it fails with the error below, and I have no clue what the underlying issue might be:
No handlers could be found for logger "smapi"
Traceback (most recent call last):
File "suite_selftest.py", line 226, in <module>
nightly_mode(parser.config_file)
File "suite_selftest.py", line 51, in nightly_mode
development_mode(config_file)
File "suite_selftest.py", line 186, in development_mode
fixtures.append(getlastupdate.PollingIntervalTest(suite.client, suite.smapiservice))
File "/Users/thomas/Desktop/PythonSelfTest/smapi/content_workflow/getlastupdate.py", line 20, in __init__
self.poll_interval = self.smapiservice.get_polling_interval()
File "../../sonos-1.1.0.dev_r300235-py2.7.egg/sonos/smapi/smapiservice.py", line 465, in get_polling_interval
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 362, in getfloat
return self._get(section, float, option)
File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ConfigParser.py", line 356, in _get
return conv(self.get(section, option))
ValueError: could not convert string to float:
Found the fix already: I had forgotten to set the polling interval in the config file.
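The ValueError is what getfloat raises when the option exists but its value is empty. Here is a minimal reproduction, assuming Python 3's configparser (the traceback above comes from Python 2's ConfigParser, which fails the same way on an empty value). The section and option names are invented for the example and are not the real Sonos config keys.

```python
import configparser

# Option present but left blank, as in the failing setup.
EMPTY = """
[Polling]
interval =
"""

# Same option with a numeric value filled in.
FILLED = """
[Polling]
interval = 30
"""


def read_interval(text):
    """Parse an INI string and return the polling interval as a float."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getfloat("Polling", "interval")


if __name__ == "__main__":
    print(read_interval(FILLED))
    try:
        read_interval(EMPTY)
    except ValueError as exc:
        print(f"ValueError: {exc}")
```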

Program Fails for Spark_Home at foreachPartition with change in location of database load utility from one module to another

I am working on a Python Spark project. I initially wrote a script to load a DataFrame into Postgres for a particular client; it includes a utility function that loads the data into Postgres.
df.rdd.repartition(self.__max_conn).foreachPartition(
    lambda iterator: load_utils.load_tab_postgres(conn_prop=conn_prop,
                                                  tab_name=<tablename>,
                                                  iterator=iterator))
Initially, all of the code lived in a single module, including the snippet above and load_utils, and it worked perfectly.
Later I had to extract the common code, including load_utils, into a base module that could be shared by different client modules. That is when the same code started failing with the error below:
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 764, in foreachPartition
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1004, in count
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 995, in sum
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 869, in fold
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 126.0 failed 4 times, most recent failure: Lost task 18.3 in stage 126.0 (TID 24028, tbsatad6r15g24.company.co.us, executor 242): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "build/bdist.linux-x86_64/egg/basemodule/__init__.py", line 12, in <module>
    import basemodule.entitymodule.base
  File "build/bdist.linux-x86_64/egg/basemodule/entitymodule/base.py", line 12, in <module>
  File "build/bdist.linux-x86_64/egg/basemodule/contexts.py", line 17, in <module>
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 48, in launch_gateway
    SPARK_HOME = os.environ["SPARK_HOME"]
  File "/dhcommon/dhpython/python/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'SPARK_HOME'
Below is the spark-submit command used to run the code:
spark-submit --master yarn --deploy-mode cluster --driver-class-path
postgresql-42.2.4.jre6.jar --jars
spark-csv_2.10-1.4.0.jar,commons-csv-1.4.jar,postgresql-42.2.4.jre6.jar
--py-files project.egg driver_file.py
In both scenarios, the load_utils file containing the load_tab_postgres function is bundled in project.egg.
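Reading the traceback, the executor fails while unpickling the lambda: importing basemodule runs basemodule/contexts.py, which appears to construct a SparkConf at import time, and the traceback shows SPARK_HOME missing from the environment of that process. If that reading is right, one possible fix is to defer the construction until the driver actually asks for it. A minimal sketch of that pattern follows, written for Python 3; LazyValue, _build_spark_conf and spark_conf are hypothetical names, not part of the original project.

```python
import threading


class LazyValue:
    """Build a value on first access instead of at import time."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._built = False
        self._value = None

    def get(self):
        # The lock keeps two threads from running the factory twice.
        with self._lock:
            if not self._built:
                self._value = self._factory()
                self._built = True
            return self._value


def _build_spark_conf():
    # In the traceback above, constructing SparkConf is the call that
    # launches the gateway, so it happens only when this function runs.
    from pyspark import SparkConf
    return SparkConf()


# Module level: nothing Spark-related runs until spark_conf.get() is called.
spark_conf = LazyValue(_build_spark_conf)
```

With this layout, importing the module on an executor (which is what unpickling the lambda triggers) has no Spark side effects; only driver code that calls spark_conf.get() builds the configuration.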

TypeError when using botocore to read from AWS SQS queue

I'm using a Tornado server with tornado-botocore to connect to Amazon SQS services.
When running stress tests we sometimes get the following exception:
Traceback (most recent call last):
File "/home/app/handlers/WebSocketsHandler.py", line 95, in listen_outgoing_queue
message = yield tornado.gen.Task(self.outgoing_queue.read)
File "/home/local/lib/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/home/local/lib/python2.7/site-packages/tornado/concurrent.py", line 215, in result
raise_exc_info(self._exc_info)
File "/home/local/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
ret = fn(*args, **kwargs)
File "/home/local/lib/python2.7/site-packages/tornado_botocore/base.py", line 70, in prepare_response
response_dict, operation_model.output_shape)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 155, in parse
return self._do_error_parse(response, shape)
File "/home/.env/local/lib/python2.7/site-packages/botocore/parsers.py", line 314, in _do_error_parse
root = self._parse_xml_string_to_dom(xml_contents)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 274, in _parse_xml_string_to_dom
parser.feed(xml_string)
TypeError: must be string or read-only buffer, not None
Could it be caused by the concurrency?
Has anyone encountered such behavior?
We are using tornado 4.2.1, botocore 0.65.0 and tornado-botocore 0.1.6.
The problem was solved once I removed the @tornado.gen.engine decorator from the method.