Sync Framework - Batching

I synchronize two remote databases (SQL Express and SQL Compact) using Sync Framework 2.1 over WCF (N-tier), with batching.
Recently I received this log file. The error appears quite rarely, but when it does it creates a lot of problems (it seems the tables included in the sync scope that fails are deleted).
I am positive nobody is messing with the BatchingDirectory, so it should be there and contain all the data. Could the error below be related to the fact that I have
CleanupBatchingDirectory = true
and the directory is deleted before the changes are applied?
11/06/2012 14:16:49 Error ** :PosPosSync:ThreadId=7: **:
SyncScope ErpProduct failed
Message: An unexpected error occurred when applying batch file C:\Documents and Settings\kasse6\Application Data\POSSyncDataClient\PosSync_5b009e9008c14d0ba6a9e47726d8d620\4e77ef8c-3045-4c55-809f-014ae2b96155.batch. See the inner exception for more details.
Type : Microsoft.Synchronization.Data.DbSyncException
Stack : at Microsoft.Synchronization.Data.DbSyncBatchConsumer.ApplyBatches(DbSyncScopeMetadata scopeMetadata, DbSyncSession syncSession, SyncSessionStatistics sessionStatistics)
at Microsoft.Synchronization.Data.RelationalSyncProvider.ProcessChangeBatch(ConflictResolutionPolicy resolutionPolicy, ChangeBatch sourceChanges, Object changeDataRetriever, SyncCallbacks syncCallbacks, SyncSessionStatistics sessionStatistics)
at Microsoft.Synchronization.KnowledgeProviderProxy.ProcessChangeBatch(CONFLICT_RESOLUTION_POLICY resolutionPolicy, ISyncChangeBatch pSourceChangeManager, Object pUnkDataRetriever, ISyncCallback pCallback, _SYNC_SESSION_STATISTICS& pSyncSessionStatistics)
at Microsoft.Synchronization.CoreInterop.ISyncSession.Start(CONFLICT_RESOLUTION_POLICY resolutionPolicy, _SYNC_SESSION_STATISTICS& pSyncSessionStatistics)
at Microsoft.Synchronization.KnowledgeSyncOrchestrator.DoOneWaySyncHelper(SyncIdFormatGroup sourceIdFormats, SyncIdFormatGroup destinationIdFormats, KnowledgeSyncProviderConfiguration destinationConfiguration, SyncCallbacks DestinationCallbacks, ISyncProvider sourceProxy, ISyncProvider destinationProxy, ChangeDataAdapter callbackChangeDataAdapter, SyncDataConverter conflictDataConverter, Int32& changesApplied, Int32& changesFailed)
at Microsoft.Synchronization.KnowledgeSyncOrchestrator.DoOneWayKnowledgeSync(SyncDataConverter sourceConverter, SyncDataConverter destinationConverter, SyncProvider sourceProvider, SyncProvider destinationProvider, Int32& changesApplied, Int32& changesFailed)
at Microsoft.Synchronization.KnowledgeSyncOrchestrator.Synchronize()
at Microsoft.Synchronization.SyncOrchestrator.Synchronize()
at PosPosSync.Local.PosPosSyncService.SynchronizeProviders(KnowledgeSyncProvider localProvider, KnowledgeSyncProvider remoteProvider, SyncDirectionOrder syncDirectionOrder)
at PosPosSync.Local.PosPosSyncService.SyncronizeData(String scopeName, SyncDirectionOrder syncDirectionOrder)
Source : Microsoft.Synchronization
Target : Void Start(CONFLICT_RESOLUTION_POLICY, _SYNC_SESSION_STATISTICS ByRef)
------- Inner Exception ------
Message: Could not find a part of the path 'C:\Documents and Settings\kasse6\Application Data\POSSyncDataClient\PosSync_5b009e9008c14d0ba6a9e47726d8d620\4e77ef8c-3045-4c55-809f-014ae2b96155.batch'.
Type : System.IO.DirectoryNotFoundException
Stack : at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, String msgPath, Boolean bFromProxy)
at System.IO.FileStream..ctor(String path, FileMode mode)
at Microsoft.Synchronization.Data.DbSyncBatchInfoFactory.Deserialize(String batchFileName, Boolean deserializeData)
at Microsoft.Synchronization.Data.DbSyncBatchConsumer.ReadBatchFile(UInt32 lookupLocation, UInt32 expectedNumber)
at Microsoft.Synchronization.Data.DbSyncBatchConsumer.ReadBatchFile(UInt32 expectedNumber, String& batchFileName)
at Microsoft.Synchronization.Data.DbSyncBatchConsumer.ApplyBatches(DbSyncScopeMetadata scopeMetadata, DbSyncSession syncSession, SyncSessionStatistics sessionStatistics)
Source : mscorlib
Target : Void WinIOError(Int32, System.String)
The thing is that after some time it tries to synchronize all the data again, and based on the log information I have, it seems it downloads everything from the server again:
11/06/2012 14:26:02 Info ** :PosPosSync:ThreadId=7: **:
EndSync: ScopeName: ErpProduct
DownloadChanges: Applied - Failed: 122363 - 0
UploadChanges: Applied - Failed: 0 - 0
FinishedSync: ElapsedTime, sec: 545,0086488

Try changing the SqlSyncProvider.MemoryDataCacheSize (batch size) on your client.
My client's Synchronize was throwing the DirectoryNotFoundException when I set the batch size to 100 KB; normally I run 500 KB. I saw this only on large syncs (e.g. the initial sync of a large database); subsequent syncs worked fine, as they were smaller.
UPDATE
According to the MS documentation, the issue could be caused by a database row exceeding 110% of the MemoryDataCacheSize:
The application specifies the memory data cache size for each provider that is participating in the synchronization session.
If both providers specify a cache size, Sync Framework uses the smaller value for both providers. The actual cache size will be no more than 110% of the smallest specified size. During a synchronization session, if a single row is greater than 110% of the size the session terminates with an exception.
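For reference, a minimal sketch of where this is set (assuming Sync Framework 2.1's SqlSyncProvider; the connection and scope name are placeholders, so adapt them to your setup):

using Microsoft.Synchronization.Data.SqlServer;

// Hypothetical setup: MemoryDataCacheSize is specified in KB, so 500 means
// roughly 500 KB batches. If both providers set a value the smaller one wins,
// and a single row larger than 110% of it aborts the session.
var provider = new SqlSyncProvider("ErpProduct", sqlConnection)
{
    MemoryDataCacheSize = 500,
    BatchingDirectory = @"C:\Temp\SyncBatches" // where batch files are spooled
};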

After fighting with a similar issue for a few years, I think I have finally found a solution.
This "could not find a part of the path" issue can also happen when multiple scopes are defined (and the scopes are large enough to use batching), because there is a minor bug in the example code from MS. The issue is in Dispose in SqlWebSyncService: the batching folder for the session is deleted, but the directory-info variable is not set to null (which is the test the next scope uses to decide whether to create the folder). Setting the batching directory variable to null fixes the issue: when CheckAndCreateBatchingDirectory runs for the next scope, it finds the variable null and recreates the folder.
private void Dispose(bool disposing)
{
    try
    {
        if (!this.m_disposed)
        {
            if (disposing)
            {
                if (this.m_ServerProvider != null)
                {
                    this.m_ServerProvider.Dispose();
                    this.m_ServerProvider = null;
                }
                if (this.m_SessionBatchingDirectory != null)
                {
                    this.m_SessionBatchingDirectory.Refresh();
                    if (this.m_SessionBatchingDirectory.Exists)
                    {
                        try
                        {
                            this.m_SessionBatchingDirectory.Delete(true);
                        }
                        catch
                        {
                        }
                    }
                    // The fix: reset the variable so the next scope's
                    // CheckAndCreateBatchingDirectory recreates the folder.
                    this.m_SessionBatchingDirectory = null;
                }
            }
            this.m_disposed = true;
        }
    }
    catch (Exception exception)
    {
        string message = "SqlWebSyncService Cleanup Exception: " + exception;
        LogWriter.TraceError(message, new object[0]);
        throw new FaultException<WebSyncFaultException>(new WebSyncFaultException(message, exception));
    }
}
Similarly, on the client in SqlSyncProviderProxy.EndSession:
...
if (this.m_LocalBatchingDirectory != null)
{
    this.m_LocalBatchingDirectory.Refresh();
    if (this.m_LocalBatchingDirectory.Exists)
    {
        this.m_LocalBatchingDirectory.Delete(true);
    }
    // Same fix: reset so the next scope recreates its batching folder.
    this.m_LocalBatchingDirectory = null;
}
...
This behavior shows itself when changing the batch size. We set our batch size to a large value (70000) and found the problem went away. In hindsight this is because the first scope then fit in a single batch, so it was never broken apart and batching never kicked in. When we set the size smaller, our first scope (1 of 3) would use batching, and we'd see this error when scope 2 fired up.

Related

Crashing when calling QTcpSocket::setSocketDescriptor()

My project uses QTcpSocket and the function setSocketDescriptor(). The code is very ordinary:
QTcpSocket *socket = new QTcpSocket();
socket->setSocketDescriptor(this->m_socketDescriptor);
This code worked fine most of the time, until I ran a performance test on Windows Server 2016 and the crash occurred. I debugged with the crash dump; here is the log:
0000004f`ad1ff4e0 : ucrtbase!abort+0x4e
00000000`6ed19790 : Qt5Core!qt_logging_to_console+0x15a
000001b7`79015508 : Qt5Core!QMessageLogger::fatal+0x6d
0000004f`ad1ff0f0 : Qt5Core!QEventDispatcherWin32::installMessageHook+0xc0
00000000`00000000 : Qt5Core!QEventDispatcherWin32::createInternalHwnd+0xf3
000001b7`785b0000 : Qt5Core!QEventDispatcherWin32::registerSocketNotifier+0x13e
000001b7`7ad57580 : Qt5Core!QSocketNotifier::QSocketNotifier+0xf9
00000000`00000001 : Qt5Network!QLocalSocket::socketDescriptor+0x4cf7
00000000`00000000 : Qt5Network!QAbstractSocket::setSocketDescriptor+0x256
In the stderr log, I see these messages:
CreateWindow() for QEventDispatcherWin32 internal window failed (Not enough storage is available to process this command.)
Qt: INTERNAL ERROR: failed to install GetMessage hook: 8, Not enough storage is available to process this command.
Here is the function in the Qt codebase where the code stopped:
void QEventDispatcherWin32::installMessageHook()
{
    Q_D(QEventDispatcherWin32);
    if (d->getMessageHook)
        return;

    // setup GetMessage hook needed to drive our posted events
    d->getMessageHook = SetWindowsHookEx(WH_GETMESSAGE, (HOOKPROC) qt_GetMessageHook, NULL, GetCurrentThreadId());
    if (Q_UNLIKELY(!d->getMessageHook)) {
        int errorCode = GetLastError();
        qFatal("Qt: INTERNAL ERROR: failed to install GetMessage hook: %d, %s",
               errorCode, qPrintable(qt_error_string(errorCode)));
    }
}
I did some research: the error "Not enough storage is available to process this command." may mean the OS (Windows) did not have enough resources to process this function (SetWindowsHookEx) and failed to create the hook; Qt then raises a fatal error and my app is killed.
I tested this on Windows Server 2019 and the app works fine; no crashes appear.
I just want to understand the meaning of the error message (stderr), because I don't really know what "not enough storage" refers to. Is it perhaps a limit or a bug in Windows Server 2016? If so, is there any way to overcome this issue on Windows Server 2016?
The error "Not enough storage is available to process this command" usually occurs on Windows servers when a registry value is set incorrectly, or when, after a recent reset or reinstallation, the configuration is not set correctly.
Below is a verified procedure for this issue:
Click on Start > Run > regedit and press Enter.
Find the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\LanmanServer\Parameters.
Locate IRPStackSize.
If this value does not exist, right-click the Parameters key, click New > DWORD Value, and type IRPStackSize as the name.
The name of the value must match exactly (including the combination of uppercase and lowercase letters) what I have above.
Right-click IRPStackSize and click Modify.
Select Decimal, enter a value higher than 15 (the maximum value is 50 decimal), and click OK.
Close the registry editor and restart your computer.
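For reference, the same change can be scripted from an elevated command prompt (a sketch; the data value 32 here is a placeholder, pick one between 16 and 50 for your environment):

rem Create or overwrite IRPStackSize with a decimal value of 32
reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" /v IRPStackSize /t REG_DWORD /d 32 /f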
After researching for a few days, I finally managed to configure the Windows Server 2016 setting (registry) that prevents the crash.
Basically it is a limitation of the OS itself, called the desktop heap limitation:
https://learn.microsoft.com/en-us/troubleshoot/windows-server/performance/desktop-heap-limitation-out-of-memory
(The funny thing is that the error message is "Not enough storage is available to process this command", but the real problem comes down to the desktop heap limitation.)
For the solution, follow the steps in this link: https://learn.microsoft.com/en-us/troubleshoot/system-center/orchestrator/increase-maximum-number-concurrent-policy-instances
I increased the third parameter of SharedSection to 2048 and it fixed the issue.
Summary steps:
Desktop Heap for the non-interactive desktops is identified by the third parameter of the SharedSection= segment of the following registry value:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
The default data for this registry value will look something like the following:
%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16
The value to be entered into the Third Parameter of the SharedSection= segment should be based on the calculation of:
(number of desired concurrent policies) * 10 = (third parameter value)
Example: if it's desired to have 200 concurrent policy instances, then 200 * 10 = 2000; rounding up to a nice memory number gives you 2048 as the third parameter, resulting in the following update to the registry value:
SharedSection=1024,3072,2048

BigQuery Storage Write / managedwriter api return error server_shutting_down

Because of the advantages of the BigQuery Storage Write API, one month ago we replaced insertAll with the managedwriter API on our server. It seemed to work well for a month; however, we recently encountered the following errors:
rpc error: code = Unavailable desc = closing transport due to: connection error:
desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR,
debug data: "server_shutting_down"
The versions of the managedwriter API are:
cloud.google.com/go/bigquery v1.25.0
google.golang.org/protobuf v1.27.1
There is retry logic for the Storage Write API that detects error messages on our server side. We noticed that the response time of the Storage Write API becomes longer after retrying and, as a result, OOM occurs on our server. We also tried increasing the request timeout to 30 seconds, and most of those requests still could not be completed within it.
How do we handle the server_shutting_down error correctly?
Update 02/08/2022
We use the default stream of the managedwriter API on our server, and the server_shutting_down error comes up periodically. This issue happened on 02/04/2022 12:00 PM UTC, after the default stream of the managedwriter API had worked well for over one month.
Here is a wrapper function around AppendRows; we log the elapsed time of this function:
func (cl *GBOutput) appendRows(ctx context.Context, datas [][]byte, schema *gbSchema) error {
    var result *managedwriter.AppendResult
    var err error
    if cl.schema != schema {
        cl.schema = schema
        result, err = cl.managedStream.AppendRows(ctx, datas, managedwriter.UpdateSchemaDescriptor(schema.descriptorProto))
    } else {
        result, err = cl.managedStream.AppendRows(ctx, datas)
    }
    if err != nil {
        return err
    }
    _, err = result.GetResult(ctx)
    return err
}
When the server_shutting_down error comes up, the elapsed time of this function can be several hundred seconds. It is strange, and there seems to be no way to handle a timeout of AppendRows.
Are you using the "raw" v1 storage API, or the managedwriter? I ask because managedwriter should handle stream reconnection automatically. Are you simply observing connection closes periodically, or does something about your retry traffic induce the closes?
The interesting question is how to deal with in-flight appends for which you haven't yet received an acknowledgement back (or the ack ended in failure). If you're using offsets, you should be able to re-send the append without risk of duplication.
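For illustration, a minimal Go sketch of that idea (a hypothetical helper, not production code): re-send an append at a fixed offset with exponential backoff when the gRPC code is Unavailable. Note that offsets require an explicitly created stream; the default stream does not support them.

import (
    "context"
    "fmt"
    "time"

    "cloud.google.com/go/bigquery/storage/managedwriter"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// appendWithRetry re-sends an append at a fixed offset after transient
// Unavailable errors (such as goaway/server_shutting_down). Because the
// offset is pinned, a re-send cannot introduce duplicate rows.
func appendWithRetry(ctx context.Context, ms *managedwriter.ManagedStream, rows [][]byte, offset int64) error {
    backoff := time.Second
    for attempt := 0; attempt < 5; attempt++ {
        res, err := ms.AppendRows(ctx, rows, managedwriter.WithOffset(offset))
        if err == nil {
            if _, err = res.GetResult(ctx); err == nil {
                return nil // acknowledged; safe to advance the offset
            }
        }
        if status.Code(err) != codes.Unavailable {
            return err // not transient; do not retry blindly
        }
        time.Sleep(backoff)
        backoff *= 2 // exponential backoff: 1s, 2s, 4s, ...
    }
    return fmt.Errorf("append at offset %d: retries exhausted", offset)
}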
Per the GCP support guy,
The issue is hit once 10MB has been sent over the connection, regardless of how long it takes or how much is inflight at that time. The BigQuery Engineering team has identified the root cause and the fix would be rolled out by Friday, Feb 11th, 2022.

Mitigate "Throttling: Rate exceeded" errors on terraform 0.11.7 apply to AWS

Does anyone know how to mitigate throttling when using terraform 0.11.7?
terraform-0.11.7 plan -out proposed.plan -no-color
Error: Error refreshing state: 1 error(s) occurred:
* module.ecs_alb.aws_alb_target_group.backend_internal_alb_target_group: 1 error(s) occurred:
* module.ecs_alb.aws_alb_target_group.backend_internal_alb_target_group[5]:
aws_alb_target_group.backend_internal_alb_target_group.5:
Error retrieving Target Group Attributes:
Throttling: Rate exceeded
status code: 400, request id: ...
make: *** [plan] Error 1
I run these from Jenkins, so I loop with a try/catch like so (we run terraform via make, and the tf commands would be plan, apply, output). I'm waiting 10s between retries; I'll probably bump that up to something longer.
while (retry < retries) {
    try {
        makeError = null
        sh "make ${targets.join(' ')}"
        break
    } catch (Exception ex) {
        fileOperations([fileCopyOperation(excludes: '', flattenFiles: false, includes: '**log',
                                          renameFiles: false, sourceCaptureExpression: '',
                                          targetLocation: outputDir, targetNameExpression: '')])
        makeError = ex
        errorHandling.addResult('runMake', "path: ${path}, targets: ${targets}, retry: ${retry} of ${retries} failed with ${makeError}. retrying")
        sleep time: waitSecs, unit: 'SECONDS'
    } finally {
        retry++
    }
}
if (makeError) {
    throw new Exception("Max retries reached (${retries})", makeError)
}
It seems there are a few options (these come from commenters and coworkers):
change the retry logic from waiting N sec per retry to: wait N sec, retry, double the wait time, repeat (simple, and has worked so far; see the sketch after this list)
change -parallelism from the default of 10 to something lower (simple to do)
split the plan's module, which creates so many target groups in a single module, into multiple modules and plans (this becomes an issue because shifting state from one plan to another would force removal and recreation, or state surgery to avoid that problem)
Given that the first option seems to be working, there is no need to move to the others. If I later find it fails, I'll experiment with tuning down parallelism just for this plan/module pair.
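A minimal sketch of the first option, using the same names as the snippet above (each failed attempt doubles the wait):

int wait = waitSecs
while (retry < retries) {
    try {
        makeError = null
        sh "make ${targets.join(' ')}"
        break
    } catch (Exception ex) {
        makeError = ex
        sleep time: wait, unit: 'SECONDS'
        wait *= 2 // exponential backoff: e.g. 10s, 20s, 40s, ...
    } finally {
        retry++
    }
}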

operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full

I have a machine running multiple applications which constantly perform UNC access (\\server-ip\share), like so:
std::ifstream src(fileName, std::ios::binary);
std::ofstream dst(newFileName, std::ios::binary);
CopyFromRemote(src, dst);
dst.flush();
dst.close();
src.close();

// Streams must be passed by reference: fstream objects are not copyable.
void CopyFromRemote(std::ifstream& src, std::ofstream& dst)
{
    char buffer[8192]; // read 8KB each chunk
    while (src.read(buffer, sizeof(buffer)))
    {
        dst.write(buffer, sizeof(buffer));
        // Here there is code that checks that some timer !> max read time
        // so as to not be stuck if there is a network issue with this src.
    }
    if (src.eof() && src.gcount() > 0)
    {
        dst.write(buffer, src.gcount()); // few bytes left
    }
}
As can be seen, the network is heavily strained by traversing it for every 8 KB (files are several MB large). The benefit here is the ability to abort a file copy if it takes too long from a specific source, as sketched below.
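For illustration, a sketch of that abort mechanism (a hypothetical helper implementing the timer check mentioned in the code comment above):

#include <chrono>
#include <fstream>

// Copy in 8 KB chunks, but give up if the whole file takes longer than
// maxCopyTime (e.g. because of a network issue with this source).
bool CopyFromRemoteTimed(std::ifstream& src, std::ofstream& dst,
                         std::chrono::seconds maxCopyTime)
{
    const auto start = std::chrono::steady_clock::now();
    char buffer[8192];
    while (src.read(buffer, sizeof(buffer)))
    {
        dst.write(buffer, sizeof(buffer));
        if (std::chrono::steady_clock::now() - start > maxCopyTime)
            return false; // abort: source too slow or stuck
    }
    if (src.eof() && src.gcount() > 0)
        dst.write(buffer, src.gcount()); // few bytes left
    return true;
}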
The problem I'm facing is that after several days, all UNC paths become inaccessible from this machine with the error above. I'm not sure what the source of the problem is; it's sporadic and hard to nail down. When the problem happens, the first line fails (std::ifstream src...). telnet also stops working.
Also: when killing the applications, the UNC is accessible again. When restarting the processes, the UNC is immediately inaccessible again. Restarting the machine solves the problem for several days.
Initially I thought it was port exhaustion, but netstat does not reveal too many or hanging connections, and the Task Manager performance tab does not show abnormal figures. TcpQry shows normal TCP/UDP mapping numbers.
Also: a packet capture shows there is no request when the problem happens (the request is not reaching the network). Event Viewer does not reveal anything. I made the following registry changes, although these would probably just delay the problem rather than eliminate it; in any case they didn't help:
Find the autodisconnect value in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters. If it's not there, create a new REG_DWORD called autodisconnect. Edit the value as Hexadecimal and set it to ffffffff.
Find KeepConn in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters. If it doesn't exist create it as a REG_DWORD value and assign it the value 65534.
Find HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and create a new DWORD value named MaxUserPort. Set the value to 65534.
Eventually this turned out to be due to a Microsoft OS bug. Since the machine is offline, it does not get regular updates automatically. Installing all OS updates solved the problem.

DirectShow function is blocked at system bootup

I have a DirectShow-based media player application. It works very well without any issues during normal playback, but occasionally I face an issue when the media player is started just after system boot.
HRESULT CSDirectShow::RenderOutputPins (IBaseFilter* pFilter)
{
    const char* funcName = "CSDirectShow::RenderOutputPins()";
    HRESULT hr = S_OK;

    // Enumerate all pins on the source filter,
    // looking for the output pins so that I can call Render() on them
    CComPtr< IEnumPins > pEnumPin;
    if (!FAILED (pFilter->EnumPins (&pEnumPin)))
    {
        while (true)
        {
            // get the next pin
            CComPtr< IPin > pPin;
            if (pEnumPin->Next (1L, &pPin, NULL) != S_OK) break;

            // I'm not interested in connected pins;
            // if this pin is an unconnected output pin, then render it.
            CComPtr< IPin > pConnectedPin;
            if (pPin->ConnectedTo (&pConnectedPin) == VFW_E_NOT_CONNECTED)
            {
                PIN_DIRECTION pinDirection;
                PIN_INFO pinInfo;
                pinInfo.pFilter = NULL; // keep SafeRelease safe if QueryPinInfo is never reached

                // Get the information of the pin
                if (pPin->QueryDirection (&pinDirection) == S_OK
                    && pinDirection == PINDIR_OUTPUT
                    && pPin->QueryPinInfo(&pinInfo) == S_OK
                    && strstr((char*)pinInfo.achName,"~")==NULL)
                {
                    if (FAILED (hr = m_pGB->Render (pPin)))
                    {
                        SafeRelease(&pinInfo.pFilter);
                        return hr;
                    }
                }
                SafeRelease(&pinInfo.pFilter);
            }
        }
    }

    TraceMsg ("%s: exit",funcName);
    return S_OK;
}
When m_pGB->Render(pPin) is called, the function never returns; it is blocked inside (I confirmed this using logs). The issue happens only when I start my application immediately after bootup. When the issue occurs, if I close and restart my application it works like a charm. Since the application is designed to start automatically after system bootup, this behaviour has become a bigger concern. Kindly help.
The IGraphBuilder.Render call does a lot internally; specifically, it enumerates potentially suitable filters, which in turn attempts to load additional DLLs registered with the DirectShow environment. Such a file could have missing dependencies, or dependencies on remote or temporarily inaccessible drives (just one example).
If you experience a deadlock, you can troubleshoot it further (debug it) and get details on the locked state and on the activity during the Render call.
If the problem is caused by third-party filters (especially codec packs registering a collection of filters at once without thinking too much about compatibility) registered with the system in a not-so-good way, perhaps you could identify and uninstall them.
If you want to improve the player on your side, you should avoid the Render call and build your filter graph in smaller increments, adding specific filters and connecting pins, without leaving big tasks at the mercy of Intelligent Connect, which works well in general but is sensitive to the compatibility problems mentioned above.
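For illustration, a hedged sketch of that incremental approach (a hypothetical helper; it assumes you have already created the filters and located the two pins you want to connect):

// Add two known filters and connect them pin-to-pin. ConnectDirect bypasses
// Intelligent Connect, so no extra filters are loaded behind your back.
HRESULT ConnectKnownFilters(IGraphBuilder* pGraph,
                            IBaseFilter* pSource, IPin* pOut,
                            IBaseFilter* pDecoder, IPin* pIn)
{
    HRESULT hr = pGraph->AddFilter(pSource, L"Source");
    if (FAILED(hr)) return hr;

    hr = pGraph->AddFilter(pDecoder, L"Decoder");
    if (FAILED(hr)) return hr;

    // NULL media type lets the two pins negotiate the format themselves.
    return pGraph->ConnectDirect(pOut, pIn, NULL);
}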