TPUEstimator does not work with use_tpu=False

I’m trying to run a model using TPUEstimator locally on a CPU first, to validate that it works, by setting use_tpu=False when initializing the estimator. When running train, I get this error:
InternalError: failed to synchronously memcpy host-to-device: host 0x7fcc7e4d4000 to device 0x1deffc002 size 4096: Failed precondition: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
[[Node: optimizer/gradients/neural_network/fully_connected_2/BiasAdd_grad/BiasAddGrad_G14 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:TPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_incarnation=-7832507818616568453, tensor_name="edge_42_op...iasAddGrad", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:0/device:TPU:0"]()]]
It looks like it’s still trying to use the TPU, as it says recv_device="/job:worker/replica:0/task:0/device:TPU:0". Why is it trying to use the TPU when use_tpu is set to False?

What optimizer are you using? This type of error can happen if you use a tf.contrib.tpu.CrossShardOptimizer and use_tpu is set to False. The optimizer is trying to shard the work across TPU cores but can’t because you’re running on your CPU.
It’s common practice to have a command line flag that sets whether the TPU is being used. This flag is used to toggle things like CrossShardOptimizer and use_tpu. For example, in the MNIST reference model:
if FLAGS.use_tpu:
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
https://github.com/tensorflow/models/blob/ad3526a98e7d5e9e57c029b8857ef7b15c903ca2/official/mnist/mnist_tpu.py#L102
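For illustration, here is a minimal sketch of gating both the optimizer wrapping and the estimator on that flag. The toy model, learning rate, and batch size below are made up, and this follows the TF 1.x tf.contrib.tpu API discussed above rather than the exact reference code:

import tensorflow as tf

tf.flags.DEFINE_bool("use_tpu", False, "Use the TPU if True, otherwise run on CPU/GPU.")
FLAGS = tf.flags.FLAGS

def model_fn(features, labels, mode, params):
    # Toy model: a single dense layer, just to keep the sketch self-contained.
    logits = tf.layers.dense(features["x"], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    if FLAGS.use_tpu:
        # Only shard the gradient computation when a TPU is actually in use.
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    use_tpu=FLAGS.use_tpu,  # False keeps the graph on CPU/GPU
    train_batch_size=128,
    config=tf.contrib.tpu.RunConfig(),
)

With the flag off, the same model_fn runs on your CPU without CrossShardOptimizer ever being applied.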

Related

Cardano-cli query utxo fails

I am trying to create a staking pool for Cardano. I got the node up and running, but cardano-cli is giving me a hard time. I have it installed; when I type cardano-cli version it returns the cardano-cli version info.
However when i enter cardano-cli query utxo --mainnet --address $RECEIVER i get this error:
cardano-cli: Network.Socket.connect: <socket: 11>: does not exist (No such file or directory)root#vmi803461:~#
Could it be because the blockchain isn't fully synched?
I am running Windows 10 with VS Code.
Yes, ivan p is correct. However, it's also important to mention that some Cardano node example scripts involving stake pools, such as creating a Shelley blockchain from scratch, had a bug for quite some time that essentially caused multiple nodes to point to the same socket. If you're attempting to run multiple nodes, which you should be while running a stake pool, double-check the precise path and location of the socket.
cardano-cli requires both the CARDANO_NODE_SOCKET_PATH variable to be set in the shell and the node server to be running, so that it can be reached through the socket.
export CARDANO_NODE_SOCKET_PATH="/root/db/node.socket"
To query the balance of an address, we need a running node and the environment variable CARDANO_NODE_SOCKET_PATH set to the path of the node.socket.

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas, assuming that the number of instances handed to the model would be (1000 / number of replicas), but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4, like this, in the Python code that creates the batch job:
from google.cloud import aiplatform


def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    # Smaller per-request batch size passed through to the model.
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,  # replica_count is defined elsewhere
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
"Is that timeout for a single instance request or a batch request? Also, is it in seconds?"
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we end up here and eventually in GAPIC, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
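If it helps, here is a minimal sketch of passing that RPC timeout when creating the job through the lower-level aiplatform_v1 GAPIC client. The project, model, and bucket paths are placeholders, not values from your setup:

from google.cloud import aiplatform_v1

client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)
job_spec = aiplatform_v1.BatchPredictionJob(
    display_name="my-batch-job",  # placeholder
    model="projects/my-project/locations/us-central1/models/1234567890",  # placeholder
    input_config=aiplatform_v1.BatchPredictionJob.InputConfig(
        instances_format="jsonl",
        gcs_source=aiplatform_v1.GcsSource(uris=["gs://my-bucket/input.jsonl"]),
    ),
    output_config=aiplatform_v1.BatchPredictionJob.OutputConfig(
        predictions_format="jsonl",
        gcs_destination=aiplatform_v1.GcsDestination(
            output_uri_prefix="gs://my-bucket/output/"
        ),
    ),
)
job = client.create_batch_prediction_job(
    parent="projects/my-project/locations/us-central1",  # placeholder
    batch_prediction_job=job_spec,
    timeout=600.0,  # RPC timeout in seconds; with retry, it applies to each attempt
)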
What I can suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, you might as well build on it, along with your initial solution of using a machine with a higher spec. You can also try using a machine with more memory, such as the n1-highmem-* family.

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into Spanner with SpannerIO.
The goal is to write new MutationGroups every 10 seconds, as we will use Spanner to query near-time KPIs.
When I don't use any windows, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1585)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1470)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:868)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:823)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:52)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:20)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:388)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:372)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline.main(EntityBuilderPipeline.java:122)
:entityBuilder FAILED
Because of the error above, I assume the input collection needs to be windowed and triggered, as SpannerIO uses a GroupByKey (this is also what I need for my use case):
...
.apply("1-minute windows", Window.<MutationGroup>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply(SpannerIO.write()
.withProjectId(entityConfig.getSpannerProject())
.withInstanceId(entityConfig.getSpannerInstance())
.withDatabaseId(entityConfig.getSpannerDb())
.grouped());
When I do this, I get the following exceptions during runtime:
java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.issueSideInputFetch(StreamingModeExecutionContext.java:631)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$UserStepContext.issueSideInputFetch(StreamingModeExecutionContext.java:683)
com.google.cloud.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:182)
com.google.cloud.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:71)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.values.ValueWithRecordId$StripIdsDoFn.processElement(ValueWithRecordId.java:145)
After investigating further, it appears to be due to the .apply(Wait.on(input)) in SpannerIO: it has a global side input, which does not seem to work with my fixed windows, as the docs of Wait.java state:
If signal is globally windowed, main input must also be. This typically would be useful only in a batch pipeline, because the global window of an infinite PCollection never closes, so the wait signal will never be ready.
As a temporary workaround, I tried the following:
1. Add a GlobalWindow with triggers instead of fixed windows:
.apply("globalwindow", Window.<MutationGroup>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
This results in writes to Spanner only when I drain my pipeline. I have the impression the Wait.on() signal is only triggered when the global window closes, and doesn't work with triggers.
2. Disable the .apply(Wait.on(input)) in SpannerIO:
This results in the pipeline getting stuck on the view creation, which is described in this SO post: SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView.
When I check the worker logs for clues, I do get the following warnings:
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type SpannerSchema have well defined equals method. This may produce incorrect results on some PipelineRunner"
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner"
Note that everything works with the DirectRunner and that I'm trying to use the DataflowRunner.
Does anyone have any other suggestions for things I can try to get this running? I can hardly imagine that I'm the only one trying to stream MutationGroups into Spanner.
Thanks in advance!
Currently, the SpannerIO connector is not supported with Beam streaming. Please follow this pull request, which adds streaming support for the SpannerIO connector.

MATLAB parallel toolbox, remoteParallelFunction : RUNTIME_ERROR during function evaluation

I'm using the Parallel Computing Toolbox (PCT) in combination with the SimBiology toolbox in MATLAB 2012b. I'm receiving an intermittent error message when I run my script with a remote pool of workers, but not with a local pool of workers:
Caught std::exception Exception message is:
vector::_M_range_check
Error using parallel_function (line 589)
Error in remote execution of remoteParallelFunction : RUNTIME_ERROR
Error in PSOFit (line 486)
parfor ns = 1:r.NumSwp
Error in PSOopt_driver (line 209)
PSOFit(ObjFuncName,LB,UB,PSOopts);
The error does not occur when I comment out the call to the function sbiosimulate (a SimBiology function for model evaluation).
I have a couple of ideas:
I’ve introduced some sort of race condition that causes a problem in accessing the model variables (is this possible in MATLAB?)
Model compilation in SimBiology is sometimes, but not always, compatible with the PCT, and I’ve hit some sort of edge case
Since sbiosimulate evaluates compiled C++ code, for some inputs there might be a bug in the source that generates the exception
I am aware of this.
I'm a developer of SimBiology. I believe this is a bug that was introduced into SimBiology's C++ code in the R2012a release. The bug is triggered when a simulation ends without producing any simulation results. This can sometimes occur when the model is configured to report only particular times (using the OutputTimes options) AND the simulation is configured to end after a particular amount of real time (using the MaximumWallClock option). Basically, the simulation "times out" before it ever gets a chance to log the first output time.
One way to work around this problem is to always include time 0 in the OutputTimes. This time will always get logged before evaluating the MaximumWallClock criterion, preventing the bug from getting triggered. I am also contacting this user directly and will work on fixing the bug in a future release.

Debugging livelock in Django/Postgresql

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.
I ran strace -ppid on the apache2 process and found that it was continually repeating these system calls:
sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143
This exact fragment repeats continually in the trace and had been running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows the full string contents rather than truncated ones.)
My interpretation of the above is that Django is attempting to do an existence check on my table account_profile, but at some earlier point (before I started the trace) something went wrong (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now PostgreSQL is returning the error "current transaction is aborted". For some reason, instead of raising an exception and giving up, it just keeps retrying.
One possibility is that this is being triggered in a call to Profile.objects.get_or_create. This is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that is designed to catch too broad a set of exceptions and retry? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.
This condition has occurred a couple of times over the past few days on the live site, and results in a significant slowdown until I intervene, so pretty much anything other than infinite deadlock would be an improvement! :)
This turned out to be entirely my fault. I found the spot where the select (1) as 'a' statement seemed to originate (in django/models/base.py) and hacked it to log a traceback, which pointed clearly at my code.
I had some code that makes up a unique email "key" for each Profile. These keys are randomly generated, so because there is some possibility of overlap, I run it in a try/except within a while loop. My assumption was that the database's unique constraint would cause the save to fail if the key was not unique, and I'd be able to try again.
Unfortunately, in PostgreSQL you cannot simply try again after an integrity error. You have to issue a COMMIT or ROLLBACK command (even if you're in autocommit mode, apparently) before you can try again. So I had an infinite loop of failing save attempts in which I was ignoring the error message.
Now I look for a more specific exception (django.db.IntegrityError) and run a limited number of attempts so that the loop is not infinite.
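For reference, here is a minimal sketch of what that fix can look like. The field names, key generator, and attempt limit are illustrative; Profile is the model from the post, and transaction.atomic is used as an assumed stand-in for whatever rollback mechanism fits your Django version:

import secrets

from django.db import IntegrityError, transaction

MAX_ATTEMPTS = 5  # illustrative cap so the loop cannot spin forever

def create_profile_with_unique_key(user):
    for _ in range(MAX_ATTEMPTS):
        key = secrets.token_hex(16)  # hypothetical random email key
        try:
            # Each attempt gets its own transaction, so a failed INSERT is
            # rolled back and the connection can be used again.
            with transaction.atomic():
                return Profile.objects.create(user=user, email_key=key)
        except IntegrityError:
            continue  # the key collided; the rollback has happened, try a new one
    raise RuntimeError("could not generate a unique key in %d attempts" % MAX_ATTEMPTS)

The important parts are catching the narrow IntegrityError rather than a broad exception, letting the transaction roll back before retrying, and bounding the number of attempts.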
Thanks to everyone for viewing/answering.
Your analysis sounds pretty good. Clearly it's not picking up the fact that the transaction is aborted. I suggest you report this as a bug to the django project...