AppFabric Cache Crashes - appfabric

The Scenario:
Have written a back end service that periodically checks for new rows in database and inserts it into AppFabric Cache. We have been using this approach since 2 odd months.
We are using a server machine to store data into cache in different regions. We are not using any Cluster of machines. Its only a single machine. The "default" cache region has been divided into 3 regions for 3 different environments. Each environment access this machine for cached data from its specified cache region.
It was working fine until the following is happening since few days.
We are getting the following exception.
ErrorCode: ERRCA0016 :SubStatus ES0001 : The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown.
The subsequent access to cache throws the following exception:
ErrorCode ERRCA0017 : SubStatus ES0001: There is a temporary failure. Please retry later.
After 4 to 5 try, we get the following exception.
ErrorCode ERRCA0018 : SubStatus ES0001 : The request timed out.
After all this, access to cache throws following exception:
ErrorCode ERRCA0005 : SubStatus ES0001 : Region referred to does not exist. Use CreateRegion API to fix the error.
Looking into the first error ErrorCode: ERRCA0016 :SubStatus ES0001 : Checked if the serialized object to be stored is of greater size than max buffer size. But prior to this larger size object were kept into the cache. Seems this can not be the problem.
What can be the exact problem that can be occurring ?
EDITED:
Did view the logs of event logger for Windows AppFabirc Cache. This is what we found upon our diggings.
These are some of the frequent error logs obtained.
Source :
AppFabricCachingService.Failfast
Param :
Lease with external store expired: Microsoft.Fabric.Federation.ExternalRingStateStoreException: Lease already expired at Microsoft.Fabric.Data.ExternalStoreAuthority.UpdateNode(NodeInfo nodeInfo, TimeSpan timeout) at Microsoft.Fabric.Federation.SiteNode.PerformExternalRingStateStoreOperations(Boolean& canFormRing, Boolean isInsert, Boolean isJoining)
General :
AppFabric Caching service crashed.
Lease with external store expired: Microsoft.Fabric.Federation.ExternalRingStateStoreException:
Lease already expired
   at Microsoft.Fabric.Data.ExternalStoreAuthority.UpdateNode(NodeInfo nodeInfo, TimeSpan timeout)
   at Microsoft.Fabric.Federation.SiteNode.PerformExternalRingStateStoreOperations(Boolean& canFormRing, Boolean isInsert, Boolean isJoining)}
Source:
AppFabricCachingService.Crash
Param :
System.Runtime.CallbackException: Async Callback threw an exception. ---> System.IdentityModel.Tokens.SecurityTokenValidationException: The service does not allow you to log on anonymously. at ......
Probable Causes
Upon Searching for the above event log errors, found that this could be caused due to the following problem. The Cache server is a different server and SQL configuration was used for the same whose database was on different server. So while getting cache configurations from SQL database there would be some failure in creating connection between the cache server and database server. So, we moved the cache configurations from SQL to XML. But still we got the error.
ErrorCode ERRCA0017 : SubStatus ES0001: There is a temporary failure. Please retry later.
ErrorCode ERRCA0005 : SubStatus ES0001 : Region referred to does not exist. Use CreateRegion API to fix the error.
Upon some more digging up, we are guessing that the problem could be this.
Whenever a machine that does not has grant access permissions to access AppFabric Cache, then it tries for some number of attempts and then appfabric stops working. After granting access to cache using Grant powershell commands, the machine is now able to access the cache. Will have to monitor for a few days.
Could this a valid reason ?

Related

AWS S3 Client throws "Operator called default onErrorDropped\njava.util.concurrent.CompletionException"

I am using the below AWS S3 AsyncClient from my Spring webflux application.As part of performance testing I had to introduce eventLoopGroup in S3 Client Config below to enable the S3 operations when the request rate was really high with 50 concurrent users triggering 20 request each having 5 documents to upload to S3.Before this I was getting the below error
reactor.core.publisher.Operators : Operator called default onErrorDropped\njava.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process
Introducing eventLoopGroup with numberOfThreads as 15 drastically reduced the error count to 6 from around a 1000. However once it fails, it didn't allow the next requests and failed with the same error.I was assuming that after sometime the used threads will be returned to the thread pool and would be available for use.But I have to restart the spring webflux application to get rid of the errors.Kindly let me know that the changes I have made below are correct.
#Bean
public S3AsyncClient s3client() {
SdkAsyncHttpClient httpClient =
NettyNioAsyncHttpClient.builder()
.readTimeout(Duration.ofSeconds(30))
.writeTimeout(Duration.ZERO)
.maxConcurrency(300)
.eventLoopGroup(SdkEventLoopGroup.builder().numberOfThreads(15).build())
.connectionAcquisitionTimeout(Duration.ofSeconds(30))
.build();
S3Configuration serviceConfiguration =
S3Configuration.builder()
.checksumValidationEnabled(false)
.chunkedEncodingEnabled(true)
.build();
S3AsyncClientBuilder s3AsyncClientBuilder =
S3AsyncClient.builder()
.httpClient(httpClient)
.region(Region.AP_SOUTHEAST_2)
.serviceConfiguration(serviceConfiguration);
return s3AsyncClientBuilder.build();
}

Informatica powercenter power exchange PWX-00267 DBAPI error

I am executing a workflow in informatica which is supposed to inset values in a target file.
Some of the records are getting inserted but i get an error after a few insertions saying:
[Informatica][ODBC PWX Driver] PWX-00267 DBAPI error for file……… Write error on record 119775 Requested 370 SQLSTATE [08S01]
Is this because of file constraints of how the record can be or due to some other reasons?
I'm not sure if this is exactly the case, but looking for the error code 08S01 I've found this site that lists Data Provider Error Codes. Under SQLCODE 370 (assuming this is what your error message indicates) I've found:
Message: There are insufficient resources on the target system to
complete the command. Contact your server administrator.
Reason: The resource limits reached reply message indicates that the
server could not be completed due to insufficient server resources
(e.g. memory, lock, buffer).
Action: Verify the connection and command parameters, and then
re-attempt the connection and command request. Review a client network
trace to determine if the server returned a SQL communications area
reply data (SQLCARD) with an optional reason code or other optional
diagnostic information.

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Due to a recent unplanned extended outage (due to an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue, support will have full access to the DMS logs and be able to see exactly what is going on. this behavior is not expected.
While I agree a support case would be reasonable I think you should also try scaling up to say DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400 as described here. Note it gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'

In Azure Eventhub receiver giving "Encountered error while fetching the list of EventHub PartitionIds" error

I am trying to implement the receiver part as per the tutorial
https://azure.microsoft.com/en-us/documentation/articles/event-hubs-java-ephjava-getstarted/
Failure while registering: com.microsoft.azure.eventprocessorhost.EPHConfigurationException:
Encountered error while fetching the list of EventHub PartitionIds
I have 16 partitions in my eventhub. But when I send the data I don't specify any partition. How do I know in which partition my data is sent to? Am I getting the above error because of all the partitions?
Make sure that your consumer sas policies does includes "manage" and not only "listen".
I guess it has to have manage rights to be able to list the partitions.
Ideally - EPH should work with "Listen"-only Claims. Right now we have a bug in EventProcessorHost client-code as a result of which - it needs "Manage" claims. We are working on it.
The error "Encountered error while fetching the list of EventHub PartitionIds" is generic and Thrown at PartitionManager, while querying for Partitions. You ran into one of the exceptions in the below catch block. Please indicate inner-exception for completeness (SEO) & faster resolution.
catch(XPathExpressionException|ParserConfigurationException|IOException|InvalidKeyException|NoSuchAlgorithmException|URISyntaxException|SAXException exception)
{
throw new EPHConfigurationException("Encountered error while fetching the list of EventHub PartitionIds", exception);
}
EDIT:
this issue is fixed in version 0.7.7.
I was working with integrating eventhub with the ELK stack and came across the same error.
To solve this I found that in the settings for the Event Hub Namespace that my event hub was in I had to allow access to the VNET my ELK stack was in.
This can be done by going to the following page: Your EventHubNamespace > Settings - Firewalls and virtual networks.
Either allow access from all networks or add your specific VNET, Subnet, or IP range (for specific machines).
This in addition to setting consumer sas policies to Manage should resolve the "Encountered error while fetching the list of EventHub PartitionIds" error.

Troubleshooting AppFabric Scaling Issues (Intermittent ErrorCode<ERRCA0017>:SubStatus<ES0006> Errors)

We've implemented AppFabric Windows Server Cache for our web application. Initially, we were able to use the cache without any issues. We then increased traffic roughly 100 fold, and began experiencing intermittent exceptions. The exceptions occur roughly once per 2 days, for about a minute at a time.
Our configuration:
9 web servers inserting/retrieving objects in cache:
Mostly temporal 500 byte operational type objects
Using 1 named region
Objects stored with tags
Retrieved in bulk for a given tag
Cache Cluster:
1 host (lead) AppFabric 1.1 (version reported by get-cachehost is 3)
SQL configuration provider
96GB of RAM on host, the default 50% (48GB) allocated to AppFabric
Cache Host Config
Cache Client Config
The errors in the order that they occur (the exceptions are occur for each of the nine webservers during the 1 minute period):
System.Net.Sockets.SocketException : An existing connection was forcibly closed by the remote host
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:15:00'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
--- End of inner exception stack trace ---
at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
at System.ServiceModel.Channels.FramingDuplexSessionChannel.EndReceive(IAsyncResult result)
at Microsoft.ApplicationServer.Caching.WcfClientChannel.CompleteProcessing(IAsyncResult result)
--- End of inner exception stack trace ---
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)
Microsoft.ApplicationServer.Caching.DataCacheException:
ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.)
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)
Microsoft.ApplicationServer.Caching.DataCacheException:
ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)
We have also created a tracelog session on the caching server to capture more information to diagnose the issue - any suggestions on how to analyze this would be appreciated (I can make this available if need be).
We also monitored various AppFabric, CLR, and Network performance counters, below is a screenshot of the event as it occurs:
Thanks in advance for any thoughts or advice you can share on resolving this issue.
UPDATE 1
The following are the exceptions occurring continuously on the AppFabric Caching Server during the intermittent errors (abstracted from tracelogs) :
System.ServiceModel.CommunicationException: The socket connection was aborted because an asynchronous send to the socket did not complete within the allotted timeout of 00:00:00.0082078. The time allotted to this operation may have been a portion of a longer timeout. ---> System.ObjectDisposedException: The socket connection has been disposed. Object name: 'System.ServiceModel.Channels.SocketConnection'. --- End of inner exception stack trace --- at System.ServiceModel.Channels.SocketConnection.ThrowIfNotOpen() at System.ServiceModel.Channels.SocketConnection.BeginRead(Int32 offset, Int32 size, TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SessionConnectionReader.BeginReceive(TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SynchronizedMessageSource.ReceiveAsyncResult.PerformOperation(TimeSpan timeout) at System.ServiceModel.Channels.SynchronizedMessageSource.SynchronizedAsyncResult'1..ctor(SynchronizedMessageSource syncSource, TimeSpan timeout, AsyncCallback callback, Object state) at System.ServiceModel.Channels.FramingDuplexSessionChannel.BeginReceive(TimeSpan timeout, AsyncCallback callback, Object state) at Microsoft.ApplicationServer.Caching.WcfServerChannel.CompleteProcessing(IAsyncResult result)
System.ServiceModel.CommunicationObjectAbortedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it has been Aborted. at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnEndSend(IAsyncResult result) at Microsoft.ApplicationServer.Caching.ReplyContext.EndSend(IAsyncResult result)
System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state. at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen() at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)
System.TimeoutException: Sending to via http://www.w3.org/2005/08/addressing/anonymous timed out after 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: Cannot claim lock within the allotted timeout of 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. --- End of inner exception stack trace --- at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnSend(Message message, TimeSpan timeout) at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)
UPDATE 2
After another day of troubleshooting we took the following actions which produced some improvement:
Based on this and this we increased maxConnectionsToServer to 3. As a result we gained 50% more client requests/sec as recorded by the AppFabric Caching:Cache perf counter, but the intermittent errors did not stop occuring
We increased the maxBufferSize and maxBufferPoolSize to 2147483647 (int32.max) on the Cache Server configuration. So far we are able to handle 300x traffic volume w/o errors. We will continue to increase traffic volume and monitor. More updates to follow
UPDATE 3
We added two more hosts with 16GB each to the cluster and enabled HighAvailability mode (via Secondaries=1). Currently the original host remains in the cluster with 96GB ram - all hosts have cacheSize = 12GB. On the cache clients we increase the MaxConnectionToServer to 12 (1 per core). Below are our findings:
Occasionally we get (once or twice every 10 minutes):
ErrorCode<ERRCA0017>:SubStatus<ES0005>:There is a temporary failure. Please retry later. (There was a contention on the store.)
ErrorCode<ERRCA0017>:SubStatus<ES0004>:There is a temporary failure. Please retry later. (Replication queue was full. This may happen during reconfiguration of cluster hosts.)
The original 96GB cache hosts still experiences 1 minute outages as described above. The new cache hosts haven't experienced the outage
We plan to remove 80GB ram from the original cache host. More updates to follow.
UPDATE 4
The problem seems to have been solved by reducing the amount of RAM in the cache hosts to 16GB. We no longer see the intermittent errors with traffic increased to 400x. Seems to be cased closed. Now on to the next issue: High Availability
Have you installed http://support.microsoft.com/kb/983182 and http://support.microsoft.com/kb/2527387 ?
In your code do you check for the exception and the retrylater bool?
catch (DataCacheException ex2)
{
if (ex2.ErrorCode == DataCacheErrorCode.RetryLater)
{
The use of a named region forces the server to push the values of that named region to a single server instead of spreading out the hashes across all of your cache servers. ("To provide this added search functionality, objects in a region are limited to a single cache host." http://msdn.microsoft.com/en-us/library/ee790985(v=azure.10).aspx )
What I would recommend is that you shard your named region across 2 more servers and put them in a cluster. This way you are limiting the exceptions to a smaller server when it is running the GC and trying to find more ram to put and store objects and tags into.
Reposting an answer given by Jeff-ITGuy on social.msdn.microsoft.com:
You appear to be encountering an issue nearly identical to one I'm working with Microsoft at the moment. If it's the same issue, it is probably caused by GC taking a long time and causing delays in the response time for AppFabric. From your perf counters it looked like GC time shot up when you started getting the problem so it probably is the same issue.
This issue is being investigated actively by Microsoft. In the meantime, in order to alleviate the problem (at least from our findings) you can run more servers with less memory (shrink the size of the memory space that GC is working against) and you can increase the RequestTimeout on your client. By default that is set to 15000 (15 secs) but we have tried raising it to 30000 which helped eliminate some of the issues. This is NOT a good long term solution in my opinion, just passing on information. I've seen the issue with servers having only 24gb memory (with 12gb for cache) and it only really got noticeably better when we tried 8gb servers with 4gb set to cache - that caused GC to do MUCH better.
Hope that helps, but if this is the issue I think it is there's no solution yet.
It did help, the intermittent errors stopped after we reduced the cache host RAM to 16GB.
The fix for this issue is currently available here:
http://support.microsoft.com/kb/2787717