MATLAB parallel toolbox, remoteParallelFunction : RUNTIME_ERROR during function evaluation - c++

I'm using the parallel computing toolbox (PCT) in combination with the Simbiology toolbox in MATLAB 2012b. I’m receiving an intermittent error message when I run my script with a remote pool of workers, but not with a local pool of workers:
Caught std::exception Exception message is:
vector::_M_range_check
Error using parallel_function (line 589)
Error in remote execution of remoteParallelFunction : RUNTIME_ERROR
Error in PSOFit (line 486)
parfor ns = 1:r.NumSwp
Error in PSOopt_driver (line 209)
PSOFit(ObjFuncName,LB,UB,PSOopts);
The error does not occur when I comment out the call to the function sbiosimulate (a SimBiology function for model evaluation).
I have a couple of ideas:
I’ve introduced some sort of race condition, that causes a problem in accessing the model variables (is this possible in MATLAB?)
Model compilation in simbiology is sometimes but not always compatible with the PCT, and I’ve hit some sort of edge case
Since sbiosimulate evaluates compiled C++ code, for some inputs there might be a bug in the source that generates the exception
I am aware of this.

I'm a developer of SimBiology. I believe this is a bug that was introduced into SimBiology's C++ code in the R2012a release. The bug is triggered when a simulation ends without producing any simulation results. This can sometimes occur when the model is configured to report only particular times (using the OutputTimes options) AND the simulation is configured to end after a particular amount of real time (using the MaximumWallClock option). Basically, the simulation "times out" before it ever gets a chance to log the first output time.
One way to work around this problem is to always include time 0 in the OutputTimes. This time will always get logged before evaluating the MaximumWallClock criterion, preventing the bug from getting triggered. I am also contacting this user directly and will work on fixing the bug in a future release.

Related

Internal error when using MPI Intel library with reduction operation on communicators

I am having some issues when using reduction operations on MPI communicators.
I have a lots of different communicators created using the algorithm this way :
MPI_ERR_SONDAGE(MPI_Group_incl(world_group, comm_size, &(on_going_communicator[0]), &local_group));
MPI_ERR_SONDAGE(MPI_Comm_create_group(MPI_COMM_WORLD, local_group, tag, &communicator)); tag++;
When I call a reduction operation like so :
MPI_ERR_SONDAGE(MPI_Allreduce(&(temporary[0]), &(temporary_glo[0]), (int)lignes.size(), MPI_DOUBLE, MPI_MAX, communicator));
I get
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c
at line 2266: comm->shm_numa_layout[my_numa_node].base_addr
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x2ace34033c8c]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(MPIR_Assert_fail+0x21)
[0x2ace33aaffe1]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x24f609)
[0x2ace337c6609]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x19b518)
[0x2ace33712518]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x1686aa)
[0x2ace336df6aa]
/.../oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(+0x251ac7)
[0x2ace337c8ac7]
/Cci/Admin/oneapi/2021.4.0/mpi/2021.4.0/lib/release/libmpi.so.12(PMPI_Allreduce+0x562)
[0x2ace33685712]
I only have this problem on big test cases. Meaning lots of communicators with a reasonnable amount of data to reduce. So I cannot create a MCVE, sorry.
When I set the environment variables I_MPI_COLL_DIRECT=off and I_MPI_COLL_INTRANODE=pt2pt, the code works fine. Since I guess the problem is induced by the use of NUMA and I guess forcing point to point communication will inhibit the use of NUMA.
But my fear is that these options will lead to degraded performance, so I really would like to know the bottom problem.
I have tried with :
intel/2020.1.217
intel/2020.2.254
intel/2021.4.0
And they basically show the same error.
Could you tell me or give me a hint of what is going on ?
Thank you.

Poco::Data::MySQL 'Got packets out of order' error

I got an ER_NET_PACKETS_OUT_OF_ORDER error when running a multithreaded C++ app using Poco::Data::MySQL and Poco::Data::SessionPool. The error message looks like this:
MySQL: [MySQL]: [Comment]: mysql_stmt_prepare error [mysql_stmt_error]: Got packets out of order [mysql_stmt_errno]: 1156 [mysql_stmt_sqlstate]: 08S01 [statemnt]: ...
The app is making queries from multiple threads every 100ms. The connections are provided by a common SessionPool.
I got around this problem by adding reset=true to the connection string. However, as stated in the official docs, adding this option may result in problems with encoding.

TPUEstimator does not work with use_tpu=False

I’m trying to run a model using TPUEstimator locally on a CPU first to validate that it works by setting use_tpu=False on the estimator initialization. When running train I get this error.
InternalError: failed to synchronously memcpy host-to-device: host 0x7fcc7e4d4000 to device 0x1deffc002 size 4096: Failed precondition: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
[[Node: optimizer/gradients/neural_network/fully_connected_2/BiasAdd_grad/BiasAddGrad_G14 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-7832507818616568453, tensor_name="edge_42_op...iasAddGrad", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:0"]()]]
It looks like it’s still trying to use the TPU, as it says recv_device="/job:worker/replica:0/task:0/device:TPU:0". Why is it trying to use the TPU when use_tpu is set to False?
What optimizer are you using? This type of error can happen if you use a tf.contrib.tpu.CrossShardOptimizer and use_tpu is set to False. The optimizer is trying to shard the work across TPU cores but can’t because you’re running on your CPU.
It’s common practice to have a command line flag that sets whether the TPU is being used. This flag is used to toggle things like CrossShardOptimizer and use_tpu. For example, in the MNIST reference model:
if FLAGS.use_tpu:
optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
https://github.com/tensorflow/models/blob/ad3526a98e7d5e9e57c029b8857ef7b15c903ca2/official/mnist/mnist_tpu.py#L102

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into spanner with SpannerIO.
The goal is to write new MuationGroups every 10 seconds, as we will use spanner to query near-time KPI's.
When I don't use any windows, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1585)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1470)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:868)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:823)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:52)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:20)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:388)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:372)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline.main(EntityBuilderPipeline.java:122)
:entityBuilder FAILED
Because of the error above I assume the input collection needs to be windowed and triggered, as SpannerIO uses a GroupByKey (this is also what I need for my use case):
...
.apply("1-minute windows", Window.<MutationGroup>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply(SpannerIO.write()
.withProjectId(entityConfig.getSpannerProject())
.withInstanceId(entityConfig.getSpannerInstance())
.withDatabaseId(entityConfig.getSpannerDb())
.grouped());
When I do this, I get the following exceptions during runtime:
java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.issueSideInputFetch(StreamingModeExecutionContext.java:631)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$UserStepContext.issueSideInputFetch(StreamingModeExecutionContext.java:683)
com.google.cloud.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:182)
com.google.cloud.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:71)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.values.ValueWithRecordId$StripIdsDoFn.processElement(ValueWithRecordId.java:145)
After investigating further it appears to be due to the .apply(Wait.on(input)) in SpannerIO: It has a global side input which does not seem to work with my fixed windows, as the docs of Wait.java state:
If signal is globally windowed, main input must also be. This typically would be useful
* only in a batch pipeline, because the global window of an infinite PCollection never
* closes, so the wait signal will never be ready.
As a temporary workaround I tried the following:
add a GlobalWindow with triggers instead of fixed windows:
.apply("globalwindow", Window.<MutationGroup>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
This results in writes to spanner only when I drain my pipeline. I have the impression the Wait.on() signal is only triggered when the Global windows closes, and doesn't work with triggers.
Disable the .apply(Wait.on(input)) in SpannerIO:
This results in the pipeline getting stuck on the view creation which
is described in this SO post:
SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView.
When I check the worker logs for clues, I do get the following warnings:
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type SpannerSchema have well defined equals method. This may produce incorrect results on some PipelineRunner
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner"
Note that everything works with the DirectRunner and that I'm trying to use the DataflowRunner.
Does anyone have any other suggestions for things I can try to get this running? I can hardly imagine that I'm the only one trying to stream MutationGroups into spanner.
Thanks in advance!
Currently, SpannerIO connector is not supported with Beam Streaming. Please follow this Pull Request which adds streaming support for spanner IO connector.

Debugging livelock in Django/Postgresql

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.
I did an strace -ppid on the apache2 process, and found that it was continually repeating these system calls:
sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143
This exact fragment repeats continually in the trace, and was running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows full the full string contents rather than truncated.)
My interpretation of the above is that django is attempting to do an existence check on my table account_profile, but at some earlier point (before I started the trace) something went wrong (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now Postgresql is returning the error "current transaction is aborted". For some reason, instead of raising an Exception and giving up, it just keeps retrying.
One possibility is that this is being triggered in a call to Profile.objects.get_or_create. This is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that is designed to catch too broad a set of exceptions and retry? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.
This condition has occurred a couple of times over the past few days on the live site, and results in a significant slowdown until I intervene, so pretty much anything other than infinite deadlock would be an improvement! :)
This turned out to be entirely my fault. I found the spot where the select (1) as 'a' statement seemed to originate (in django/models/base.py) and hacked it to log a traceback, which pointed clearly at my code.
I had some code that makes up a unique email "key" for each Profile. These keys are randomly generated, so because there is some possibility of overlap, I run it in a try/except within a while loop. My assumption was that the database's unique constraint would cause the save to fail if the key was not unique, and I'd be able to try again.
Unfortunately, in Postgresql you cannot simply try again after an integrity error. You have to issue a COMMIT or ROLLBACK command (even if you're in autocommit mode, apparently) before you can try again. So I had an infinite loop of failing save attempts where I was ignoring the error message.
Now I look for a more specific exception (django.db.IntegrityError) and run a limited number of attempts so that the loop is not infinite.
Thanks to everyone for viewing/answering.
Your analysis sounds pretty good. Clearly it's not picking up the fact that the transaction is aborted. I suggest you report this as a bug to the django project...