What could cause the "Too many changes at once in directory:D:\home\site\wwwroot\App_Data\jobs\continuous" Exception? - azure-webjobs

We have different continuous Azure WebJobs (using WebJobs SDK) listening to Azure Service Bus topics which are running across multiple instances on an App Service Environment. Everyday we get multiple log files in the log folder D:\home\LogFiles\kudu\trace containing an exception as below:
<step title="Error occurred" date="2016-12-14T18:26:38.860" instance="qwerty" type="error" text="System.IO.InternalBufferOverflowException: Too many changes at once in directory:D:\home\site\wwwroot\App_Data\jobs\continuous." >
<step title="Cleanup Xml Logs" date="2016-12-14T18:26:38.907" /><!-- duration: 63ms -->
</step><!-- duration: 156ms -->
Is this something to worry about? Is there a way to avoid this exception to happen?

No this isn't something you need to worry about. Our WebJobs infrastructure uses FileSystemWatchers to watch your files for changes (e.g. when job bits/sources are updated, etc.). From time to time due to Azure Storage issues or other network gremlins the file shares that the FileWatcher is connected to might experience transient issues, causing the FileWatcher to throw errors.
We have logic to handle such transient errors (source code here) and we trace them to the log file you pointed out. We log this primarily for our own use to help us monitor/diagnose issues. The logs you mentioned above are coming from this codepath. We have an issue here in our repo to review the verbosity of these traces.

Related

How can I filter out errors on sentry to avoid consuming my quota?

I'm using Sentry to log my errors, but there are errors I'm not able to fix (or could not be fixed by me) like
OSError (write error)
Or error that come from RQ (each time I deploy my app)
Or client errors (which are client.errors)
I can't just ignore them because I consume all my quota. How I can filter out this errors?
Here some references for interested people.
uwsgi: OSError: write error during GET request
Fixing broken pipe error in uWSGI with Python
https://github.com/unbit/uwsgi/issues/1623
I created a Gist for rate limiting the amount of events that are being send to Sentry:
https://gist.github.com/jurrian/e22f8e724b8499a29c5537e956f0dc7f
It uses ratelimitingfilter which can be configured to set a rate per minute, and additionally add a burst to start rate limiting after a number of events.
I get the same errors, but i never had any problems with my quota. But if you really want to filter it, you can just do it in your sdk:
https://docs.sentry.io/error-reporting/configuration/filtering/?platform=python
But beware, this could hide other errors as mentioned here:
https://github.com/pypa/warehouse/issues/679
To safe yourself some quota, you have two options:
Avoid forwarding events client side, thus preventing events being send to sentry at all. Have a look at the docs for available client-side filters. The drawback with this approach is of course that you need a new code deployment for any adjustment of client-side filters and some clients may not instantly reflect your code changes.
Avoid forwarding events on sentry's side, via inbound filters ([Project] > Project Settings > Inbound Filters). According to the sentry documentation on quota usage, events filtered via inbound filters are not affecting your quota.
Inbound filters include:
Common browser extension errors
Events coming from localhost
Known legacy browsers errors
Known web crawlers
By their error message
From specific release versions of your code
From certain IP addresses
Business plans and above also allow to filter events by error messages.

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Due to a recent unplanned extended outage (due to an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before but had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue, support will have full access to the DMS logs and be able to see exactly what is going on. this behavior is not expected.
While I agree a support case would be reasonable I think you should also try scaling up to say DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400 as described here. Note it gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'

AWS Lambda: Service error

What does this error mean?
I have 5 Lambda Functions deployed using Java that worked perfectly but since this afternoon all of them started displaying the same message when I execute each:
Service error.
No output, no logs, only that message in a red box.
In http://status.aws.amazon.com/ they say:
6:05 PM PDT We are investigating increased error rates and elevated
latencies for AWS Lambda requests in the US-EAST-1 Region. Newly
created functions and console editing are also affected.
Why does it happen and is there a way to prevent it?
From time to time, parts of Amazon's AWS service fail. Sometimes the failure is very small and short-lived, and in other cases there are larger distributed failures.
Your system design needs to take into account the possibility that the piece of AWS that you are counting on will not work at the moment, and try to route around the damage. For instance, you can run Lambda in multiple regions. (It already runs in multiple availability zones inside a single region, so you don't have to worry about that). This gives you some isolation against failures in any one region.
Getting distributed systems to work at small scale can be hard because the failures that you need to protect against don't happen very often. At large scale, you get systematic efforts like Netflix's "Chaos Monkey", which deliberately introduces failures so that automated processes can detect and correct those issues.
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport
"When a Fail-Safe system fails, it fails by failing to fail safe." -- John Gall

One or more services have started or stopped unexpectedly SPTimerService (SPTimerV4)

I have stop and restart services(Sharepoint Administration & Sharepoint Timer Service)
I cleaned the Configuration Cache by using mentioned steps.
Summary of the steps to clear the timer job:
Stop SharePoint Timer service on all servers in the farm.
Browse to C:\ProgramData\Microsoft\SharePoint\Config{GUID} where the {GUID} folder contains a bunch of XML files and NOT the files with a “.PERSITEDFILE” extension.
Delete all the XML files
Update the contents of the Cache.ini file to just say “1” (without quotes).
Restart the SharePoint Timer service on each server
Reanalyze the issue in Health Analyzer
Does anyone know why this keeps occurring and how I can stop it?
First of all try and check your ULS Logs and see if there is any error that arise.
Secondly try and maybe check the event viewer on your SharePoint server to see if any errors are shown and make sure you have enough disk space available.
and also you might want to check this :Clearing Timer Services
Let me know if you see any error post it here.
hope it helps.
Yotam.

Spring Integration Pop3MailReceiver stops polling silently without logging why

Problem
I have a very basic configuration for a Spring integration mail adapter setup (below is the relevant sample):
<int:channel id="emailChannel">
<int:interceptors>
<int:wire-tap channel="logger"/>
</int:interceptors>
</int:channel>
<mail:inbound-channel-adapter id="popChannel"
store-uri="pop3://user:password#domain.net/INBOX"
channel="emailChannel"
should-delete-messages="true"
auto-startup="true">
<int:poller max-messages-per-poll="1" fixed-rate="30000"/>
</mail:inbound-channel-adapter>
<int:logging-channel-adapter id="logger" level="DEBUG"/>
<int:service-activator input-channel="emailChannel" ref="mailResultsProcessor" method="onMessage" />
This is working fine the majority of the time and I can see the logs showing the polling (and it works fine hooking into my mailResultsProcessor when a mail is there):
2013-08-13 08:19:29,748 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - opening folder [pop3://user:password#fomain.net/INBOX]
2013-08-13 08:19:29,796 [task-scheduler-3] INFO org.springframework.integration.mail.Pop3MailReceiver - attempting to receive mail from folder [INBOX]
2013-08-13 08:19:29,796 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - found 0 new messages
2013-08-13 08:19:29,796 [task-scheduler-3] DEBUG org.springframework.integration.mail.Pop3MailReceiver - Received 0 messages
2013-08-13 08:19:29,893 [task-scheduler-3] DEBUG org.springframework.integration.endpoint.SourcePollingChannelAdapter - Received no Message during the poll, returning 'false'
The problem I have is that the polling stops during the day, with no indication in the logs why it has stopped working. The only reason I can tell is the debug above is not present in the logs and E-Mails build up on the E-Mail account.
Questions
Has anyone seen this before and know how to resolve it?
Is there a change that I can make in my configuration to capture the issue into the log? I thought the logging channel adapter set to debug would have this covered.
Using version 2.2.3.RELEASE of Spring Integration on Tomcat 7, logs output default to catalina.out. Deployed on AWS standard tomcat 7 instance.
Most likely the poller thread is hung someplace upstream. With your configuration, the next poll won't happen until the current poll completes.
You can use jstack or VisualVM to get a thread dump to find out what the thread is doing.
Another possibility is you are suffering from poller thread starvation - if you have a lot of other polled elements in your application, and depending on their configuration. The default taskScheduler bean has only 10 threads.
You can add a task executor to the <poller/> so each poll is handed off to another thread, but be aware that that can result in concurrent polls if a polled task takes longer to execute than the polling rate.
To resolve this problem specifically I used the configuration below:
<mail:inbound-channel-adapter id="popChannel"
store-uri="pop3://***/INBOX"
channel="emailChannel"
should-delete-messages="true"
auto-startup="true">
<int:poller max-messages-per-poll="5" fixed-rate="60000" task-executor="pool"/>
</mail:inbound-channel-adapter>
<task:executor id="pool" pool-size="10" keep-alive="50"/>
Once moving to this approach we saw no further problems, and is with any use of pool the advantage is any Threads that become a problem are cleaned up and recreated.