I'm wondering how to estimate storage usage or sizing (database, log files, and anything else I might be missing) for a WSO2 EI clustered deployment (load balancer + Node 1 + Node 2).
Which hardware environment should we set up?
Our traffic is very high; we receive almost 10,000 requests per day.
1. Should we use the hardware environment that WSO2 recommends?
You can follow the production deployment guide [1] for the hardware requirements recommended by WSO2. In particular, go through the installation prerequisites [2] and common guidelines [3] sections of that document. Best practices for managing log growth can be found in [4]; a minimal log-rotation sketch is shown after the links below.
[1] - https://docs.wso2.com/display/CLUSTER44x/Production+Deployment+Guidelines
[2] - https://docs.wso2.com/display/CLUSTER44x/Production+Deployment+Guidelines#ProductionDeploymentGuidelines-Installationprerequisites
[3] - https://docs.wso2.com/display/CLUSTER44x/Production+Deployment+Guidelines#ProductionDeploymentGuidelines-Commonguidelinesandchecklist
[4] - https://docs.wso2.com/display/ADMIN44x/Monitoring+Logs#MonitoringLogs-Managingloggrowth
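One common way to keep log growth bounded (along the lines of [4]) is to switch the main carbon log appender to a size-based rolling appender in repository/conf/log4j.properties. A minimal sketch, assuming the default CARBON_LOGFILE appender name; the file path and the limits are illustrative and should be adjusted to your environment:
log4j.appender.CARBON_LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.CARBON_LOGFILE.File=${carbon.home}/repository/logs/wso2carbon.log
# cap each file at 10MB and keep at most 20 rotated files per node (illustrative values)
log4j.appender.CARBON_LOGFILE.MaxFileSize=10MB
log4j.appender.CARBON_LOGFILE.MaxBackupIndex=20
With limits like these, the log contribution per node can be budgeted as roughly MaxFileSize * (MaxBackupIndex + 1), plus whatever you allocate for audit and HTTP access logs.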
I deployed 2 instances of Eureka Server and a total of 12 microservice instances.
Renews (last min) is as expected 24. But Renew Threshold is always 0. Is this how it is supposed to be when self-preservation is turned on? I am also seeing this warning - THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS. What's the expected behavior in this case and how do I resolve this if it is a problem?
As mentioned above, I deployed 2 instances of Eureka Server, but after running for a while (around 19-20 hours) one instance of Eureka Server always goes down. Why could that possibly be happening? I checked the running processes using the top command and found that Eureka Server is taking a lot of memory. What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
Below is the configuration in the application.properties file of Eureka Server:
spring.application.name=eureka-server
eureka.instance.appname=eureka-server
eureka.instance.instance-id=${spring.application.name}:${spring.application.instance_id:${random.int[1,999999]}}
eureka.server.enable-self-preservation=false
eureka.datacenter=AWS
eureka.environment=STAGE
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=false
Below is the command that I am using to start the Eureka Server instances.
#!/bin/bash
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &
Is this approach to create multiple instances of Eureka Server correct?
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Spring Boot version: 2.3.4.RELEASE
I am new to all of this; any help or direction would be greatly appreciated.
Let me try to answer your questions one by one.
Renews (last min) is as expected 24. But Renew Threshold is always 0. Is this how it is supposed to be when self-preservation is turned on?
What's the expected behaviour in this case and how to resolve this if this is a problem?
I can see eureka.server.enable-self-preservation=false in your configuration. Disabling self-preservation is only needed if you want an already registered application to be removed from the registry as soon as it fails to renew its lease.
The self-preservation feature exists to prevent exactly that kind of removal when it is caused by a network hiccup rather than a real failure. Say you have two services, A and B, both registered with Eureka, and B suddenly fails to renew its lease because of a temporary network hiccup. Without self-preservation, B would be removed from the registry and A would no longer be able to reach B, even though B is still available.
So we can say that self-preservation is a resiliency feature of Eureka.
The Renews threshold is the expected number of renewals per minute. The Eureka server enters self-preservation mode when the actual number of heartbeats received in the last minute (Renews) is less than the expected number of renewals per minute (Renews threshold).
You can control the Renews threshold by configuring eureka.server.renewal-percent-threshold (0.85 by default).
So in your case,
Total number of application instances = 12
You don't have eureka.instance.leaseRenewalIntervalInSeconds set, so the default of 30s applies,
and eureka.client.registerWithEureka=false, so the Eureka servers themselves don't add to the count,
so Renews (last min) will be 12 * 2 = 24
You don't have renewal-percent-threshold configured, so the default value is 0.85
Number of renewals per application instance per minute = 2 (30s each)
So if self-preservation were enabled, the Renews threshold would be calculated as 2 * 12 * 0.85 = 20.4, which Eureka truncates to 20.
In your case self-preservation is turned off, so Eureka doesn't calculate the Renews threshold at all, which is why it stays at 0.
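For completeness, a minimal application.properties sketch of what would give you a non-zero threshold (property names as used by Spring Cloud's Eureka server; the comment just restates the calculation above):
# re-enable self-preservation so the server computes the threshold
eureka.server.enable-self-preservation=true
# default is 0.85; with 12 instances renewing twice a minute: 12 * 2 * 0.85 = 20.4, truncated to 20
eureka.server.renewal-percent-threshold=0.85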
One instance of Eureka Server always goes down. Why could that possibly be happening?
I'm not able to answer this question for the time being; it could be happening for multiple reasons.
You can usually find the reason in the logs; it would be great if you could post them here.
What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
From the information you have provided, I cannot tell much about your memory issue. You have already capped the heap with -Xmx256m, and I have not faced any memory issues with Eureka servers so far.
But I can say that top is not the right tool for checking the memory consumed by your Java process. When the JVM starts, it reserves memory from the operating system, and that reserved amount is what tools like ps and top show. It is better to use jstat or jvmtop, for example as sketched below.
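For example, a quick way to watch the heap and GC behaviour of the running Eureka JVM (assuming the JDK tools are on the PATH; replace <pid> with the actual process id):
# print heap occupancy and GC statistics every 5 seconds
jstat -gcutil <pid> 5000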
Is this approach to create multiple instances of Eureka Server correct?
It seems you are using the same hostname (eureka.instance.hostname) for both instances. Peer replication won't work if both peers use the same hostname.
Also make sure that both instances use the same application name; an illustrative peer configuration follows.
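Purely as an illustration (the hostnames below are placeholders, and the ports are taken from your start script), a peer-aware pair would look something like this, each instance having its own hostname and pointing defaultZone at the other peer:
# instance 1, e.g. started with -Dserver.port=9011
eureka.instance.hostname=eureka1.example.co.za
eureka.client.serviceUrl.defaultZone=http://eureka2.example.co.za:9012/eureka/
# instance 2, e.g. started with -Dserver.port=9012
eureka.instance.hostname=eureka2.example.co.za
eureka.client.serviceUrl.defaultZone=http://eureka1.example.co.za:9011/eureka/
Note that the Spring Cloud peer-awareness examples leave registerWithEureka and fetchRegistry at their default of true on the servers, unlike your current configuration.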
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Nothing specifically for AWS as per my knowledge, other than making sure that the instances can communicate with each other.
I am using Istio in an AWS EKS cluster. I am using the pre-installed Prometheus and Grafana to monitor pods, the Istio mesh, and Istio service workloads.
I have three services running in three different namespaces:
Service 1:- service1.namespace1.svc.cluster.local
Service 2:- service2.namespace2.svc.cluster.local
Service 3:- service3.namespace3.svc.cluster.local
I can find the latency of each service endpoint from the Istio Service Dashboard in Grafana. But it only shows latency per service endpoint, not per path prefix. The overall endpoint latency is fine, but I want to check which path is taking time within a service endpoint.
Let's say the P50 latency of service1.namespace1.svc.cluster.local is 2.91ms, but I also want to check the latency of each path. It has four paths:
service1.namespace1.svc.cluster.local/login => Login path, P50 Latency = ?
service1.namespace1.svc.cluster.local/signup => Signup path, P50 Latency = ?
service1.namespace1.svc.cluster.local/auth => Auth path , P50 Latency = ?
service1.namespace1.svc.cluster.local/list => List path , P50 Latency = ?
I am not sure if this is possible with the Prometheus and Grafana stack. What is the recommended way to achieve it?
istioctl version --remote
client version: 1.5.1
internal-popcraftio-ingressgateway version:
citadel version: 1.4.3
galley version: 1.4.3
ingressgateway version: 1.5.1
pilot version: 1.4.3
policy version: 1.4.3
sidecar-injector version: 1.4.3
telemetry version: 1.4.3
pilot version: 1.5.1
office-popcraftio-ingressgateway version:
data plane version: 1.4.3 (83 proxies), 1.5.1 (4 proxies)
To my knowledge this is not something that the Istio metrics can provide. However, you should take a look at the metrics that your server framework exposes, if any; this part is application (framework) dependent. See, for instance, Spring Boot ( https://docs.spring.io/spring-metrics/docs/current/public/prometheus ) or Vert.x ( https://vertx.io/docs/vertx-micrometer-metrics/java/#_http_server ).
One thing to be aware of with HTTP path-based metrics is that they can make the metrics cardinality explode if not used with care. Imagine some of your paths contain unbounded dynamic values (e.g. /object/123456, with 123456 being an ID): if that path is stored as a Prometheus label, Prometheus will under the hood create one time series per ID, which is likely to cause performance issues on Prometheus and risks out-of-memory on your app.
I think this is a good reason for Istio NOT to provide path-based metrics. Frameworks, on the other hand, have enough knowledge to provide metrics based on the path template instead of the actual path (e.g. /object/$ID instead of /object/123456), which solves the cardinality problem.
PS: Kiali has some documentation about runtimes monitoring, that may help: https://kiali.io/documentation/runtimes-monitoring/
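As a concrete illustration: if service1 happens to be a Spring Boot app exposing Micrometer's Prometheus metrics (an assumption; adjust to your framework), each request is recorded under http_server_requests_seconds with a uri label carrying the path template, and a per-path P50 can be graphed in Grafana with a query along these lines (the app label is hypothetical; use whatever label identifies the service in your setup):
# P50 latency of the /login handler of service1 over the last 5 minutes
histogram_quantile(0.5, sum(rate(http_server_requests_seconds_bucket{app="service1", uri="/login"}[5m])) by (le))
The _bucket series are only published if percentile histograms are enabled for the timer (in Spring Boot: management.metrics.distribution.percentiles-histogram.http.server.requests=true).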
We have a Hyperledger Fabric application. The main deployment is hosted on AWS VMs, whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable and availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The ZooKeeper VM was also redeployed at the same time.
Since the day after this event, we have been noticing that an orderer goes offline and comes back online after a few seconds. This disconnect/reconnect recurs regularly, with a gap of 12 hours and 10 minutes between occurrences.
We have noticed two things.
In the logs we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
The other thing we noticed is that the logs prior to the VM failure event show 3 Kafka brokers, but only 2 Kafka brokers appear in the logs after the event.
Can someone guide me on this? How do I resolve this problem?
Additional information - We went through the Kafka logs from the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
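A sketch of how the surviving brokers and the retained offset range for the channel topic can be checked (assuming the stock Kafka/ZooKeeper CLI tools; the host names are placeholders):
# broker ids currently registered in ZooKeeper (should list all 3)
zookeeper-shell.sh zookeeper0:2181 ls /brokers/ids
# earliest (-2) and latest (-1) offsets still retained for the orderer's channel topic
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka0:9092 --topic testchainid --time -2
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka0:9092 --topic testchainid --time -1
If the offset the orderer last consumed falls outside that retained range, the "requested offset is outside the range" panic above is the expected symptom.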
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.
We are doing an evaluation for metering purposes using WSO2 API Manager and DAS (latest versions).
Environment:
WSO2 API Manager runs as a 2-node active-active deployment using Hazelcast (4 cores, 8 GB RAM).
DAS runs as a single node.
Both connect to MySQL as the backend RDBMS.
DAS and MySQL share the same server (12 cores, 24 GB RAM); we dedicated 12 GB to MySQL.
We started the test at a rate of 750 reads/sec and everything went well for 27 hours, until the metering count reached 72 million, after which we got the errors below.
At API Manager: [PassThroughMessageProcessor-130] WARN DataPublisher Event queue is full, unable to process the event for endpoint Group.
At DAS (after 10 mins), we got: INFO {com.leansoft.bigqueue.page.MappedPageFactoryImpl} - Page file /$DAS_HOME/repository/data/index_staging_queues/4P/index/page-12.dat was just deleted. {com.leansoft.bigqueue.page.MappedPageFactoryImpl}
Does this mean we have reached a limit with respect to the infrastructure setup, or is this a performance issue with DAS? Can you please help us?
You need to tune the server performance of both DAS and API-M.
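The "Event queue is full" warning on the API Manager side means the data agent's in-memory publishing queue is filling faster than events can be shipped to DAS. A hedged starting point (file location and element names as in API-M 2.x; verify against your version, and the value below is only illustrative since a bigger queue costs more heap) is to raise the Thrift agent's QueueSize in repository/conf/data-bridge/data-agent-config.xml on the API-M nodes:
<DataAgentsConfiguration>
    <Agent>
        <Name>Thrift</Name>
        <!-- default is 32768; illustrative bump to absorb bursts at 750 req/sec -->
        <QueueSize>65536</QueueSize>
    </Agent>
</DataAgentsConfiguration>
On the DAS side, the usual suspects are the JVM heap, the receiver/indexing thread pools, and MySQL I/O, especially since DAS and MySQL share the same machine in your setup.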
We have deployed API-M 2.1 in a distributed setup (each component, i.e. GW, TM, and KM, running in its own Docker image) on top of DC/OS 1.9 (Mesos).
We are having issues getting the gateway to enforce throttling policies (whether subscription tiers or application-level policies). Here is what we have been able to establish so far:
The Traffic Manager itself does its job: it receives the event streams, analyzes them on the fly, and pushes an event onto the JMS topic throttledata.
The Gateway reads the message properly.
So basically we have ruled out a communication issue.
However, we found two potential issues:
In the event pushed to the TM component, the value of appTenant is null (instead of carbon.super). We have a single tenant defined.
When the gateway receives the throttling message, it lets the request through, thinking "stopOnQuotaReach" is set to false, when it is actually set to true (we checked the value in the database).
Digging into the source code, we traced both issues back to a single source: both values above are read from the AuthContext and are apparently set incorrectly. We are stuck and running out of ideas of things to try, and would need some pointers to potential sources of the problem and things to check.
Can somebody help, please?
Thanks- Isabelle.
Are there two TMs with HA enabled in the system?
If the TM is HA-enabled, how do the gateways publish data to the TMs? Is it load-balanced data publishing or failover data publishing to the TMs?
Did you follow the articles below to configure the environment for your deployment?
http://wso2.com/library/articles/2016/10/article-scalable-traffic-manager-deployment-patterns-for-wso2-api-manager-part-1/
http://wso2.com/library/articles/2016/10/article-scalable-traffic-manager-deployment-patterns-for-wso2-api-manager-part-2/
Is throttling completely not working in your environment?
Have you noticed any JMS connection-related logs in the gateway nodes?
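Also, to see exactly what the Key Manager returns to the gateway (and therefore what ends up in the Authentication Context), you can temporarily enable wire logs on a gateway node. A sketch, assuming the standard repository/conf/log4j.properties (wire logs are very verbose, so revert afterwards):
# log the full request/response payloads passing through the gateway transports
log4j.logger.org.apache.synapse.transport.http.wire=DEBUG
# more detail from the API-M gateway handlers, including throttling decisions
log4j.logger.org.wso2.carbon.apimgt.gateway=DEBUG
That should show whether stopOnQuotaReach and the tenant already arrive with the wrong values in the key validation response, or get lost on the way into the AuthContext.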
In these tests, we have disabled HA to avoid possible complications. Neither subscription nor application throttling policies are working, in both cases because parameters that should have values do not carry the expected ones (appTenant, stopOnQuotaReach).
Our scenario is far more basic: if we go with one instance of each component, it fails as Isabelle described. The only thing we know is that both parameters come from the Authentication Context.
Thank you!