Debugging livelock in Django/Postgresql - django

I run a moderately popular web app on Django with Apache2, mod_python, and PostgreSQL 8.3 with the postgresql_psycopg2 database backend. I'm experiencing occasional livelock, identifiable when an apache2 process continually consumes 99% of CPU for several minutes or more.
I did an strace -ppid on the apache2 process, and found that it was continually repeating these system calls:
sendto(25, "Q\0\0\0SSELECT (1) AS \"a\" FROM \"account_profile\" WHERE \"account_profile\".\"id\" = 66201 \0", 84, 0, NULL, 0) = 84
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
poll([{fd=25, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recvfrom(25, "E\0\0\0\210SERROR\0C25P02\0Mcurrent transaction is aborted, commands ignored until end of transaction block\0Fpostgres.c\0L906\0Rexec_simple_query\0\0Z\0\0\0\5E", 16384, 0, NULL, NULL) = 143
This exact fragment repeats continually in the trace, and was running for over 10 minutes before I finally killed the apache2 process. (Note: I edited this to replace my previous strace fragment with a new one that shows full the full string contents rather than truncated.)
My interpretation of the above is that django is attempting to do an existence check on my table account_profile, but at some earlier point (before I started the trace) something went wrong (SQL parse error? referential integrity or uniqueness constraint violation? who knows?), and now Postgresql is returning the error "current transaction is aborted". For some reason, instead of raising an Exception and giving up, it just keeps retrying.
One possibility is that this is being triggered in a call to Profile.objects.get_or_create. This is the model class that maps to the account_profile table. Perhaps there is something in get_or_create that is designed to catch too broad a set of exceptions and retry? From the web server logs, it appears that this livelock might have occurred as a result of a double-click on the POST button in my site's registration form.
This condition has occurred a couple of times over the past few days on the live site, and results in a significant slowdown until I intervene, so pretty much anything other than infinite deadlock would be an improvement! :)

This turned out to be entirely my fault. I found the spot where the select (1) as 'a' statement seemed to originate (in django/models/base.py) and hacked it to log a traceback, which pointed clearly at my code.
I had some code that makes up a unique email "key" for each Profile. These keys are randomly generated, so because there is some possibility of overlap, I run it in a try/except within a while loop. My assumption was that the database's unique constraint would cause the save to fail if the key was not unique, and I'd be able to try again.
Unfortunately, in Postgresql you cannot simply try again after an integrity error. You have to issue a COMMIT or ROLLBACK command (even if you're in autocommit mode, apparently) before you can try again. So I had an infinite loop of failing save attempts where I was ignoring the error message.
Now I look for a more specific exception (django.db.IntegrityError) and run a limited number of attempts so that the loop is not infinite.
Thanks to everyone for viewing/answering.

Your analysis sounds pretty good. Clearly it's not picking up the fact that the transaction is aborted. I suggest you report this as a bug to the django project...

Related

How to solve stability problems in Google Dataflow

I have a Dataflow job that has been running stable for several months.
The last 3 days or so, I've problems with the job, it's getting stuck after a certain amount of time and the only thing I can do is stop the job and start a new one. This happened after 2, 6 and 24 hours of processing. Here is the latest exception:
java.lang.ExceptionInInitializerError
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:183)
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:169)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy54.getWindmillServerStub (Unknown Source)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.<init> (StreamingDataflowWorker.java:677)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromDataflowWorkerHarnessOptions (StreamingDataflowWorker.java:562)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main (StreamingDataflowWorker.java:274)
Caused by: java.lang.RuntimeException: Loading windmill_service failed:
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:42)
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0 (Native Method)
at sun.nio.ch.FileDispatcherImpl.write (FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer (IOUtil.java:93)
at sun.nio.ch.IOUtil.write (IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write (FileChannelImpl.java:211)
at java.nio.channels.Channels.writeFullyImpl (Channels.java:78)
at java.nio.channels.Channels.writeFully (Channels.java:101)
at java.nio.channels.Channels.access$000 (Channels.java:61)
at java.nio.channels.Channels$1.write (Channels.java:174)
at java.nio.file.Files.copy (Files.java:2909)
at java.nio.file.Files.copy (Files.java:3027)
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:39)
Seems like there is no space left on a device, but shouldn't this be managed by Google? Or is this an error in my job somehow?
UPDATE:
The workflow is as follows:
Reading mass data from PubSub (up to 1500/s)
Filter some messages
Keeping session window on key and grouping by it
Sort the data and do calculations
Output the data to another PubSub
You can increase the storage capacity in the parameter of your pipelise. Look at this one diskSizeGb in this page
In addition, more you keep data in memory, more you need memory. It's the case for the windows, if you never close them, or if you allow late data for too long time, you need a lot of memory to keep all these data up.
Tune either your pipeline, or your machine type. Or both!

Streaming MutationGroups into Spanner

I'm trying to stream MutationGroups into spanner with SpannerIO.
The goal is to write new MuationGroups every 10 seconds, as we will use spanner to query near-time KPI's.
When I don't use any windows, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at org.apache.beam.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:173)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:204)
at org.apache.beam.sdk.transforms.GroupByKey.expand(GroupByKey.java:120)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1585)
at org.apache.beam.sdk.transforms.Combine$PerKey.expand(Combine.java:1470)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:868)
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$WriteGrouped.expand(SpannerIO.java:823)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:472)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:286)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:52)
at quantum.base.transform.entity.spanner.SpannerProtoWrite.expand(SpannerProtoWrite.java:20)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:388)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline$Write$SpannerWrite.expand(EntityBuilderPipeline.java:372)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:491)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:299)
at quantum.entitybuilder.pipeline.EntityBuilderPipeline.main(EntityBuilderPipeline.java:122)
:entityBuilder FAILED
Because of the error above I assume the input collection needs to be windowed and triggered, as SpannerIO uses a GroupByKey (this is also what I need for my use case):
...
.apply("1-minute windows", Window.<MutationGroup>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
.apply(SpannerIO.write()
.withProjectId(entityConfig.getSpannerProject())
.withInstanceId(entityConfig.getSpannerInstance())
.withDatabaseId(entityConfig.getSpannerDb())
.grouped());
When I do this, I get the following exceptions during runtime:
java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn
org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn$1.getSideInputWindow(PartitioningWindowFn.java:49)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$StepContext.issueSideInputFetch(StreamingModeExecutionContext.java:631)
com.google.cloud.dataflow.worker.StreamingModeExecutionContext$UserStepContext.issueSideInputFetch(StreamingModeExecutionContext.java:683)
com.google.cloud.dataflow.worker.StreamingSideInputFetcher.storeIfBlocked(StreamingSideInputFetcher.java:182)
com.google.cloud.dataflow.worker.StreamingSideInputDoFnRunner.processElement(StreamingSideInputDoFnRunner.java:71)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:323)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:43)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:48)
com.google.cloud.dataflow.worker.SimpleParDoFn$1.output(SimpleParDoFn.java:271)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:219)
org.apache.beam.runners.core.SimpleDoFnRunner.access$700(SimpleDoFnRunner.java:69)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:517)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:505)
org.apache.beam.sdk.values.ValueWithRecordId$StripIdsDoFn.processElement(ValueWithRecordId.java:145)
After investigating further it appears to be due to the .apply(Wait.on(input)) in SpannerIO: It has a global side input which does not seem to work with my fixed windows, as the docs of Wait.java state:
If signal is globally windowed, main input must also be. This typically would be useful
* only in a batch pipeline, because the global window of an infinite PCollection never
* closes, so the wait signal will never be ready.
As a temporary workaround I tried the following:
add a GlobalWindow with triggers instead of fixed windows:
.apply("globalwindow", Window.<MutationGroup>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))
).orFinally(AfterWatermark.pastEndOfWindow()))
.discardingFiredPanes()
.withAllowedLateness(Duration.ZERO))
This results in writes to spanner only when I drain my pipeline. I have the impression the Wait.on() signal is only triggered when the Global windows closes, and doesn't work with triggers.
Disable the .apply(Wait.on(input)) in SpannerIO:
This results in the pipeline getting stuck on the view creation which
is described in this SO post:
SpannerIO Dataflow 2.3.0 stuck in CreateDataflowView.
When I check the worker logs for clues, I do get the following warnings:
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type SpannerSchema have well defined equals method. This may produce incorrect results on some PipelineRunner
logger: "org.apache.beam.sdk.coders.SerializableCoder"
message: "Can't verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner"
Note that everything works with the DirectRunner and that I'm trying to use the DataflowRunner.
Does anyone have any other suggestions for things I can try to get this running? I can hardly imagine that I'm the only one trying to stream MutationGroups into spanner.
Thanks in advance!
Currently, SpannerIO connector is not supported with Beam Streaming. Please follow this Pull Request which adds streaming support for spanner IO connector.

While loop implementation in Pentaho Kettle

I need guidence on implementing WHILE loop with Kettle/PDI. The scenario is
(1) I have some (may be thousand or thousands of thousand) data in a table, to be validated with a remote server.
(2) Read them and loopup to the remote server; I use Modified Java Script for this as remote server lookup validation is defined in external Java JAR file (I can use "Change number of copies to start... option on Modified java script and set to 5 or 10)
(3) Update the result on database table. There will be 50 to 60% connection failure cases each session.
(4) Repeat Step 1 to step 3 till all gets updated to success
(5) Stop looping on Nth cycle; this is to avoid very long or infinite looping, N value may be 5 or 10.
How to design such a WhILE loop in Pentaho Kettle?
Have you seen this link? It gives a pretty well detailed explanation of how to implement a while loop.
You need a parent job with a sub-transformation for doing a check on the condition which will return a variable to the job on whether to abort or to continue.

How to debug "could not receive data from client: Connection reset by peer"

I'm running a django-celery application on Ubuntu-12.04.
When I run a celery task from my web interface, I get the following error, taken form postgresql-9.3 logfile (maximum level of log):
2013-11-12 13:57:01 GMT tss_usr 8113 LOG: could not receive data from client: Connection reset by peer
tss_usr is the postgresql user of the django application database and (in this example) 8113 is the pid of the process who killed the connection, I guess.
Have you got any idea on why this happens or at least how to debug this issue?
To make things work again I need to restart postgresql which is extremely uncomfortable.
I know this is an older post, but I just found it because I had the same error today in my postgres logs. I narrowed it down to a PDO select statement. I'm using Zend Framework 1.10.3 on Ubuntu Precise.
The following pdo statement generated an error if $opinion is a long text string. The column opinion is type Text in my postgres table. The query succeeds if $opinion is under a certain number of characters. 1000 characters works fine. 2000 characters fails with "could not receive data from client: Connection reset by peer".
$select = $this->db->select()
->from( 'datauserstopics' )
->where("opinion = ?",trim($opinion))
->where("datatopicsid = ?",trim($tid))
->where("datausersid= ?",$datausersid);
$stmt = $this->db->query($select);
I circumvented the problem by using:
->where("substr(opinion,1,100) = ?",trim(substr($opinion,1,100)))
This is not a perfect solution, but for my purposes, the select statement using substr() suffices.
Note that I have no problem inserting long strings into the same table/column. The disconnect problem only appears for me on the PDO select with relatively long text strings.
I'm getting it in 2017 with 9.4, I have no text fields, don't know what a PDO is. My select statement is about 50 bytes long, I'm trying to fetch an int4 and a double precision. I suspect the error message can mean multiple things.
I've since found https://dba.stackexchange.com/questions/142350/postgres-could-not-receive-data-from-client-connection-reset-by-peer which indicates it could be a problem with the client configuration. My client is libpg and PQconnectdb() is giving me a CONNECTION_OK return. It works at least partly.
For me, restarting the hypervisor where both the Postgres and the application using it helped. I've seen stack traces in dmesg before, though.

MATLAB parallel toolbox, remoteParallelFunction : RUNTIME_ERROR during function evaluation

I'm using the parallel computing toolbox (PCT) in combination with the Simbiology toolbox in MATLAB 2012b. I’m receiving an intermittent error message when I run my script with a remote pool of workers, but not with a local pool of workers:
Caught std::exception Exception message is:
vector::_M_range_check
Error using parallel_function (line 589)
Error in remote execution of remoteParallelFunction : RUNTIME_ERROR
Error in PSOFit (line 486)
parfor ns = 1:r.NumSwp
Error in PSOopt_driver (line 209)
PSOFit(ObjFuncName,LB,UB,PSOopts);
The error does not occur when I comment out the call to the function sbiosimulate (a SimBiology function for model evaluation).
I have a couple of ideas:
I’ve introduced some sort of race condition, that causes a problem in accessing the model variables (is this possible in MATLAB?)
Model compilation in simbiology is sometimes but not always compatible with the PCT, and I’ve hit some sort of edge case
Since sbiosimulate evaluates compiled C++ code, for some inputs there might be a bug in the source that generates the exception
I am aware of this.
I'm a developer of SimBiology. I believe this is a bug that was introduced into SimBiology's C++ code in the R2012a release. The bug is triggered when a simulation ends without producing any simulation results. This can sometimes occur when the model is configured to report only particular times (using the OutputTimes options) AND the simulation is configured to end after a particular amount of real time (using the MaximumWallClock option). Basically, the simulation "times out" before it ever gets a chance to log the first output time.
One way to work around this problem is to always include time 0 in the OutputTimes. This time will always get logged before evaluating the MaximumWallClock criterion, preventing the bug from getting triggered. I am also contacting this user directly and will work on fixing the bug in a future release.