GCP IoT stops processing messages - google-cloud-platform

We have multiple devices (rpi2) connected to IoT Core with our self-engineered "firmware" that uses Python and the Paho MQTT client. Our steps to reproduce the error:
1. Log into Google Cloud.
2. Set the device to DEBUG logging level.
3. Attempt to send a command using Google's IoT website (e.g., a remote firmware update or anything else).
4. Observe immediate error responses (two cases below) in the Google control panel.
5. Observe the logs on the device; there is no record of contact from GCP.
And the two cases are:
Message couldn’t be sent because the device is not connected
This is both true and not true: we are seeing our devices get disconnected from time to time for no apparent reason. We do have a JWT refresh mechanism and it works fine, but sometimes GCP just... disconnects the device!
The command couldn't be sent because the device is not subscribed to the MQTT wildcard topic.
This is not true; all devices are subscribed (although, if they are not connected, then maybe they are not subscribed either...).
Here are the things that we have already checked (a simplified sketch of our connection setup follows this list):
1. The JWTs are refreshed 10 minutes before the one-hour expiry (the expiry is set to 3600 seconds, but we refresh every 3000 seconds).
2. Explicitly set MQTTv311 as the protocol in the Python Paho MQTT client.
3. Implemented an on_log() handler.
4. Implemented PINGREQ (keepalive) as the IoT Core documentation states.
5. Checked the devices' internet connection, and it is fine.
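Here is a simplified sketch of our connect/subscribe flow (not the full firmware; it assumes paho-mqtt 1.x-style callbacks, PyJWT for token signing, and uses placeholder IDs and key paths):

import datetime
import ssl

import jwt  # PyJWT
import paho.mqtt.client as mqtt

PROJECT_ID = "XXX"                    # placeholder
REGION = "us-central1"
REGISTRY_ID = "iot-registry"
DEVICE_ID = "XXX-000003"
PRIVATE_KEY_FILE = "rsa_private.pem"  # placeholder: the RS256 key registered for this device
CA_CERTS = "roots.pem"                # Google root CA bundle

def create_jwt(project_id, private_key_file, ttl_seconds=3600):
    # The JWT is used as the MQTT password; we recreate it every 3000 seconds.
    now = datetime.datetime.utcnow()
    claims = {
        "iat": now,
        "exp": now + datetime.timedelta(seconds=ttl_seconds),
        "aud": project_id,
    }
    with open(private_key_file, "r") as f:
        return jwt.encode(claims, f.read(), algorithm="RS256")

client_id = ("projects/{}/locations/{}/registries/{}/devices/{}"
             .format(PROJECT_ID, REGION, REGISTRY_ID, DEVICE_ID))

client = mqtt.Client(client_id=client_id, protocol=mqtt.MQTTv311)
client.username_pw_set(username="unused",  # IoT Core ignores the username field
                       password=create_jwt(PROJECT_ID, PRIVATE_KEY_FILE))
client.tls_set(ca_certs=CA_CERTS, tls_version=ssl.PROTOCOL_TLSv1_2)

def on_connect(client, userdata, flags, rc):
    print("on_connect:", mqtt.connack_string(rc))
    # Commands arrive on the wildcard topic; QoS 1 so delivery is acknowledged.
    client.subscribe("/devices/{}/commands/#".format(DEVICE_ID), qos=1)

def on_message(client, userdata, msg):
    print("command on", msg.topic, ":", msg.payload)

def on_disconnect(client, userdata, rc):
    print("on_disconnect:", rc)  # this is where we see the unexplained drops

client.on_connect = on_connect
client.on_message = on_message
client.on_disconnect = on_disconnect
client.connect("mqtt.googleapis.com", 8883, keepalive=60)
client.loop_start()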
Here are some logs:
From device:
[2020-04-05 21:20:58,624] root - INFO - Trying to set date and time to: 2020-04-05 21:20:58.624278
[2020-04-05 21:20:59,239] root - INFO - Starting collecting device state
[2020-04-05 21:20:59,256] root - INFO - Connecting to the cloud
[2020-04-05 21:20:59,262] root - INFO - Device client_id is 'projects/XXX/locations/us-central1/registries/iot-registry/devices/XXX-000003'
[2020-04-05 21:20:59,549] root - INFO - Starting pusher service
[2020-04-05 21:21:00,050] root - INFO - on_connect: Connection Accepted.
[2020-04-05 21:21:00,051] root - INFO - Subscribing to /devices/XXX-000003/commands/#
[2020-04-05 21:22:04,827] root - INFO - Since 2020-04-05T21:20:59.549934, pusher sent: 469 messages, received confirmation for 469 messages, recorded 0 errors
[2020-04-05 21:23:07,426] root - INFO - Since 2020-04-05T21:22:04.828064, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:24:09,815] root - INFO - Since 2020-04-05T21:23:07.427720, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:25:12,036] root - INFO - Since 2020-04-05T21:24:09.816221, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:26:14,099] root - INFO - Since 2020-04-05T21:25:12.037507, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:27:16,052] root - INFO - Since 2020-04-05T21:26:14.100430, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:28:17,807] root - INFO - Since 2020-04-05T21:27:16.053253, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
[2020-04-05 21:29:19,416] root - INFO - Since 2020-04-05T21:28:17.808075, pusher sent: 6 messages, received confirmation for 6 messages, recorded 0 errors
From GCP:
[screenshot of the GCP console logs]

Related

Target Connection is Stale ERROR in WSO2 EI 6.4.0

I'm trying to push an incoming JSON payload to an AWS SQS queue in WSO2 EI 6.4.0, and I'm intermittently facing an exception like java.io.IOException: Target Connection is stale.
We're unable to push the payload to the queue.
Log:
[2022-08-27 03:08:49,801] [-1] [] [HTTPS-Sender I/O dispatcher-5] WARN {org.apache.synapse.transport.passthru.TargetHandler} - Connection closed by target host while sending the request Remote Address : proxy.abc.com/10.0.x.x:3090
[2022-08-27 03:08:49,801] [-1234] [] [PassThroughMessageProcessor-29] ERROR {org.apache.synapse.transport.passthru.PassThroughHttpSSLSender} - IO while building message
java.io.IOException: Target Connection is stale..
As per this WSO2 link, do I need to disable this http.connection.stalecheck by setting its value to 1 in the <ESB_Home>/repository/conf/nhttp.properties file?
Please suggest how to resolve this issue.
The error java.io.IOException: Target Connection is stale is not the cause of your issue; it's just a by-product of the actual issue. With the information provided, I believe the actual issue is the following.
[2022-08-27 03:08:49,801] [-1] [] [HTTPS-Sender I/O dispatcher-5] WARN {org.apache.synapse.transport.passthru.TargetHandler} - Connection closed by target host while sending the request Remote Address : proxy.abc.com/10.0.x.x:3090
Since the connection was closed by the target (assume this is SQS), the connections become stale after some time. This is expected, so try to find out why the target is closing the connection. If you are going through a corporate proxy, check the proxy first, then check the SQS side to see whether there is any information useful for debugging the issue.

How to send multiple HTTP/2 requests over the same connection with libcurl

I'm using https://curl.haxx.se/libcurl/c/http2-download.html to send multiple HTTP/2 requests to a demo HTTP server. This server is based on Spring WebFlux. To verify whether libcurl can send HTTP/2 requests concurrently, the server delays 10 seconds before returning a response. In this way, I hope to observe the server receiving multiple HTTP/2 requests at almost the same time over the same connection, and the client receiving the responses after 10 seconds.
However, I noticed that the server received the requests sequentially. It seems that the client doesn't send the next request before getting the response to the previous request.
Here is the server's log; the requests arrived every 10 seconds.
2021-05-07 17:14:57.514 INFO 31352 --- [ctor-http-nio-2] i.g.h.mongo.controller.PostController : Call get 609343a24b79c21c4431a2b1
2021-05-07 17:15:07.532 INFO 31352 --- [ctor-http-nio-2] i.g.h.mongo.controller.PostController : Call get 609343a24b79c21c4431a2b1
2021-05-07 17:15:17.541 INFO 31352 --- [ctor-http-nio-2] i.g.h.mongo.controller.PostController : Call get 609343a24b79c21c4431a2b1
Can anyone help me figure out my mistake? Thank you.
For me,
curl -v --http2 --parallel --config urls.txt
did exactly what you need, where urls.txt was like
url = "localhost:8080/health"
url = "localhost:8080/health"
The result was that curl first sent the first request via HTTP/1.1, received a 101 upgrade to HTTP/2, immediately sent the second request without waiting for a response, and then received two 200 responses in succession.
Note: -v is added for verbosity to validate it works as expected. You don't need it other than for printing the underlying protocol conversation.
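If you want to double-check the same multiplexing behaviour from Python rather than the curl CLI, here is a rough sketch using httpx instead of libcurl (an alternative client, not from the original setup; the URL is a placeholder). Note that httpx only negotiates HTTP/2 over TLS via ALPN, so this assumes an HTTPS endpoint rather than the cleartext h2c upgrade that curl performed:

import asyncio

import httpx  # pip install "httpx[http2]"

async def main():
    async with httpx.AsyncClient(http2=True) as client:
        # Both requests are issued at once; over HTTP/2 they are multiplexed
        # on the same connection instead of waiting for each other.
        responses = await asyncio.gather(
            client.get("https://your-demo-server.example.com/health"),
            client.get("https://your-demo-server.example.com/health"),
        )
        for r in responses:
            print(r.http_version, r.status_code)

asyncio.run(main())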

Orderer disconnections in a Hyperledger Fabric application

We have a Hyperledger application. The main application is hosted on AWS VMs, whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable; availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The Zookeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and comes back online a few seconds later. This disconnect/reconnect occurs regularly, with a gap of 12 hours and 10 minutes.
We have noticed two things:
In the log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
Another thing we noticed is that in the logs prior to the VM failure event there were 3 Kafka brokers, but we can see only 2 Kafka brokers in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
Additional information: we have been through the Kafka logs from the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
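As an aside: the suspicious size 1195725856 in the exception above decodes to the ASCII bytes "GET ", which usually indicates that a plain HTTP (or otherwise non-Kafka) client connected to the Kafka listener port and its first four bytes were read as a message length. A quick check:

# 1195725856 == 0x47455420, i.e. the bytes b"GET " read as a 4-byte big-endian length
size = 1195725856
print(hex(size))                # 0x47455420
print(size.to_bytes(4, "big"))  # b'GET '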
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.

OpenShift HAProxy issues

I am using openshift-django17 to bootstrap my application on OpenShift. Before I moved to Django 1.7, I was using the author's previous repository, openshift-django16, and I did not have the problem I will describe next. After running successfully for approximately 6 hours, I get the following error:
Service Temporarily Unavailable The server is temporarily unable to
service your request due to maintenance downtime or capacity problems.
Please try again later.
After I restart the application it works without any problem for some hours, then I get this error again. Now gears should never enter idle mode, as I am posting some data every 5 minutes through RESTful POST API from outside of the app. I have run rhc tail command and I think the error lies in HAproxy:
==> app-root/logs/haproxy.log <==
[WARNING] 081/155915 (497777) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/155915 (497777) : Server express/local-gear is DOWN, reason: Layer 4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 081/155915 (497777) : proxy 'express' has no server available!
[WARNING] 081/155948 (497777) : Server express/local-gear is UP, reason: Layer7 check passed, code: 200, info: "HTTP status check returned code 200", check duration: 11ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'stats' since it has no log address.
[WARNING] 081/170359 (127633) : config : log format ignored for proxy 'express' since it has no log address.
[WARNING] 081/170359 (497777) : Stopping proxy stats in 0 ms.
[WARNING] 081/170359 (497777) : Stopping proxy express in 0 ms.
[WARNING] 081/170359 (497777) : Proxy stats stopped (FE: 1 conns, BE: 0 conns).
[WARNING] 081/170359 (497777) : Proxy express stopped (FE: 206 conns, BE: 312 co
I also run a CRON job once a day, but I am 99% sure it does not have anything to do with this. It looks like a problem on the OpenShift side, right? I have posted this issue on the GitHub repository of the author, who suggested I try Stack Overflow.
It turned out this was due to a bug in openshift-django17 that set DEBUG in settings.py to True even though it was specified in the environment variables as False (pull request with the fix here). The reason 503 Service Temporarily Unavailable appeared was OpenShift memory limit violations caused by DEBUG being turned on, as stated in the Django settings documentation for DEBUG:
It is also important to remember that when running with DEBUG turned on, Django will remember every SQL query it executes. This is useful when you’re debugging, but it’ll rapidly consume memory on a production server.
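The underlying pitfall is that environment variables are strings, so any non-empty value, including "False", is truthy. A minimal sketch of the safer parsing in settings.py (illustrative only, not the exact openshift-django17 code):

import os

# Buggy: bool("False") is True, so DEBUG ends up enabled anyway.
# DEBUG = bool(os.environ.get("DEBUG", "False"))

# Safer: only an explicit "true"/"1"/"yes" enables DEBUG.
DEBUG = os.environ.get("DEBUG", "False").lower() in ("true", "1", "yes")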

Spark - Remote Akka Client Disassociated

I am setting up Spark 0.9 on AWS and am finding that when launching the interactive PySpark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor#ip-xx-xx-xxx-xxx.ec2.internal:54110/user/
Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected,
so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same.
It started to work for me once the workers were connecting to the master using the very same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker#idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster#vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster#vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker#192.168.121.127:7078] -> [akka.tcp://sparkMaster#idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster#idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
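From the PySpark side, the same idea is to point the driver at exactly the name the master advertises in its own log, and to make sure the driver advertises an address the workers can reach back to. A rough sketch (the hostnames are placeholders, and spark.driver.host is an assumption about the relevant setting, not taken from the original answer):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://name-from-the-master-log:7077")    # the exact name the master prints
        .set("spark.driver.host", "reachable-driver-address"))  # an address the workers can connect back to
sc = SparkContext(conf=conf)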