I have an AWS Step Function with 12 states and out of thousands of executions I've only seen this error occur once which makes me think that this isn't a configuration issue and instead this is an AWS error.
When my Step Function transfers from one state to the next it fails to pass the output from one state as the input to next. A concrete example, Copy Asset passes some data to Verify Asset but somewhere down the line this data disappears. As mentioned before, I've only seen one instance of this error through thousands of executions.
Here's an image describing the issue:
To try to narrow where the execution fails I will write the various states step-by-step:
LambdaFunctionScheduled Copy Asset. This succeeds, and in the UI the input to the execution is visible so everything is working as expected.
LambdaFunctionStarted Copy Asset. The UI shows the empty set {}, strangely this is also expected since successful executions also show the same state.
LambdaFunctionSucceeded Copy Asset. The UI shows the following output:
{ "output": null, "outputDetails": { "truncated": false } }
Why is the output of Copy Asset empty even though it succeeded. All other successful executions correctly pass the output to the next step.
The Copy Asset Lambda takes the input, does some executions and then passes the input as the output so if there are no errors thrown, and the input isn't empty then the output shouldn't be empty either.
My question is how can I retry on this error if it happens and how can I avoid it?
Related
BACKGROUND
We have testers for our embedded GUI product and when a tester declares "test failed", sometimes it's hard for us developers to reproduce the exact problem because we don't have the exact trace of what happened.
We do currently have a logging framework but us devs have to manually input those logging statements in the code which is fine . . . except when a hard-to-reproduce bug occurs and we didn't have a log statement at the 'right' location and then when we re-build, re-run the test with the same steps, we get a different result.
ISSUE
We would like a solution wherein the compiler produces extra instrumentation code that allows us to see the exact sequence of events including, at the minimum:
function enter/exit (already provided by -finstrument-functions)
control-flow statement enter i.e. entering if/else, which case statement we jumped to
The log would look like this:
int main() entered
if-line 5 entered
else-line 10 entered
void EventLoop() entered
. . .
Some additional nice-to-haves are
Parameter values on function entry AND exit (for pass-by-reference types)
Function return value
QUESTION
Are there any gcc tools or even paid tools that can do this instrumentation automatically?
You can either use gdb for that, and you can automate that (I've got a tool for that in the works, you find it here or you can try to use gcov.
The gcov implementation is so, that it loads the latest gcov data when you start the program. But: you can manually dump and load the data. If you call __gcov_flush it will dump the current data and reset the current state. However: if you do that several times it will always merge the data, so you'd also need to rename the gcov data file then.
So I am building a redhawk module and trying to just pass data through it as a test. After putting their example of how to work with input and output ports into the serviceFunction() I am able to build the module with no errors (I changed variable names to match my ports). When I put the module on the white board and link it up it's fine but as soon as I start the module it crashes. I added a line to write the incoming stream id to the console and that will hit the console 10 to 20 times before the crash (it correctly writes the id of the signal generator that is providing the signal). If I plot the output port nothing is plotted before the crash (when I say crash I mean that the module just disappears from the white board, the ide is still up and running).
The service function is:
int freqModFrTest_i::serviceFunction()
{
bulkio::InFloatPort::dataTransfer *tmp = dataFloatIn->getPacket(bulkio::Const::BLOCKING);
if (not tmp) { // No data is available
return NOOP;
}
else
{
std::cout<<tmp->streamID<<std::endl;
std::vector<float> outputData;
outputData.resize(tmp->dataBuffer.size());
for (unsigned int i=0; i<tmp->dataBuffer.size(); i++) {
outputData[i] = (float)tmp->dataBuffer[i];
}
// NOTE: You must make at least one valid pushSRI call
if (tmp->sriChanged) {
ComplexOut->pushSRI(tmp->SRI);
}
ComplexOut->pushPacket(outputData, tmp->T, tmp->EOS, tmp->streamID);
delete tmp; // IMPORTANT: MUST RELEASE THE RECEIVED DATA BLOCK
return NORMAL;
}
}
Has anyone had a similar issue or any ideas on what would be causing this?
Additional Info:
Following the sugestion by pwolfram I built a sig generator and this component into a waveform. When launching it from a domain I got the error:
2016-01-14 07:41:50,430 ERROR DCE:aa1a189e-0b5b-4968-9150-5fc3d501dadc{1}:1030 -
Child process 3772 terminated with signal 11
when trying to restart the component (as it just stoped rather then disapering) I get the following error:
Error while executing callable. Caused by org.omg.CORBA.TRANSIENT:
Retries exceeded, couldn't reconnect to 10.62.7.21:56857
Retries exceeded, couldn't reconnect to 10.62.7.21:56857
In REDHAWK 2.0.0 I created a component with the same name (freqModFrTest) and port names (dataFloatIn and ComplexOut) and used your service function verbatim. I did not however get any issues.
Here are a few things to try:
Clean and rebuild the component. The Sandbox (what you referred to as the whiteboard) will run the binary that has been built. It is possible that you've modified the code and have an older version of the binary on disk. Right click on the project and select "clean project". Then right click and select "Build Project" this will make sure that the binary matches your source code.
Run the component in debug mode. If you double click on the SPD file, under the "overview" tab there is "Debug a component in the sandbox". This will launch the component in the chalkboard within a debugging context. You can set breakpoints and walk through the code line by line. If you set no breakpoints though the IDE will stop execution when a fatal error occurs. If there is an issue (like invalid memory access) the IDE will prompt you to enter debug mode and it should point out the line in code where the issue is.
If those options fail, you can enable core dumps and use GDB to see where in the code the issue is occurring. There are lots of tutorials online for GDB but the gist is that before launching the IDE, you'll want to type "ulimit -c unlimited" then from the same terminal, launch the IDE. Now when your component dies, it will produce a core file.
Hopefully one of these gets you going down the right path.
I'm stuck on problem were I would like to ask for some help:
I have the task to print some files of different types using ShellExecuteEx with the "print" verb and need to guarantee print order of all files. Therefore I use FindFirstPrinterChangeNotification and FindNextPrinterChangeNotification to monitor the events PRINTER_CHANGE_ADD_JOB and PRINTER_CHANGE_DELETE_JOB using two different threads in the background which I start before calling ShellExecuteEx as I don't know anything about the application which will print the files etc. The only thing I know is that I'm the only one printing and which file I print. My solution seems to work well, my program successfully recognizes the event PRINTER_CHANGE_ADD_JOB for my file, I even verify that this event is issued for my file by checking what is give to me as additional info by specifying JOB_NOTIFY_FIELD_DOCUMENT.
The problem now is with the event PRINTER_CHANGE_DELETE_JOB, where I don't get any addition info about the print job, though my logic is exactly the same for both events: I've written one generic thread function which simply gets executed with the event it is used for. My thread is recognizing the PRINTER_CHANGE_DELETE_JOB event, but on each call to FindNextPrinterChangeNotification whenever this event occured I don't get any addition data in ppPrinterNotifyInfo. This works for the start event, though, I verified using my logs and the debugger. But with PRINTER_CHANGE_DELETE_JOB the only thing I get is NULL.
I already searched the web and there are some similar questions, but most of the time related to VB or simply unanswered. I'm using a C++ project and as my code works for the ADD_JOB-event I don't think I'm doing something completely wrong. But even MSDN doesn't mention this behavior and I would really like to make sure that the DELETE_JOB event is the one for my document, which I can't without any information about the print job. After I get the DELETE_JOB event my code doesn't even recognize other events, which is OK because the print job is done afterwards.
The following is what I think is the relevant notification code:
WORD jobNotifyFields[1] = {JOB_NOTIFY_FIELD_DOCUMENT};
PRINTER_NOTIFY_OPTIONS_TYPE pnot[1] = {JOB_NOTIFY_TYPE, 0, 0, 0, 1, jobNotifyFields};
PRINTER_NOTIFY_OPTIONS pno = {2, 0, 1, pnot};
HANDLE defaultPrinter = PrintWaiter::openDefaultPrinter();
HANDLE changeNotification = FindFirstPrinterChangeNotification( defaultPrinter,
threadArgs->event,
0, &pno);
[...]
DWORD waitResult = WAIT_FAILED;
while ((waitResult = WaitForSingleObject(changeNotification, threadArgs->wfsoTimeout)) == WAIT_OBJECT_0)
{
LOG4CXX_DEBUG(logger, L"Irgendein Druckereignis im Thread zum Warten auf Ereignis " << LogStringConv(threadArgs->event) << L" erkannt.");
[...]
PPRINTER_NOTIFY_INFO notifyInfo = NULL;
DWORD events = 0;
FindNextPrinterChangeNotification(changeNotification, &events, NULL, (LPVOID*) ¬ifyInfo);
if (!(events & threadArgs->event) || !notifyInfo || !notifyInfo->Count)
{
LOG4CXX_DEBUG(logger, L"unpassendes Ereignis " << LogStringConv(events) << L" ignoriert");
FreePrinterNotifyInfo(notifyInfo);
continue;
}
[...]
I would really appreciate if anyone could give some hints on why I don't get any data regarding the print job. Thanks!
https://forums.embarcadero.com/thread.jspa?threadID=86657&stqc=true
Here's what I think is going on:
I observe two events in two different threads for the start and end of each print job. With some debugging and logging I recognized that FindNextPrinterChangeNotification doesn't always return only the two distinct events I've notified for, but some 0-events in general. In those cases FindNextPrinterChangeNotification returns 0 as the events in pdwChange. If I print a simple text file using notepad.exe I only get one event for creation of the print job with value 256 for pdwChange and the data I need in notifyInfo to compare my printed file name against and comparing both succeeds. If I print a pdf file using current Acrobat Reader 11 I get two events, one has pdwChange as 256, but gives something like "local printdatafile" as the name of the print job started, which is obviously not the file I printed. The second event has a pdwChange of 0, but the name of the print job provided in notifyInfo is the file name I used to print. As I use FreePDF for testing pruproses, I think the first printer event is something internal to my special setup.
The notifications for the deletion of a print job create 0 events, too. This time those are sent before FindNextPrinterChangeNotification returns 1024 in pdwChange, and timely very close after the start of the print job. In this case the exactly one generated 0 event contains notifyInfo with a document name which equals the file name I started printing. After the 0 event there's exactly one additional event with pdwChange of 1024, but without any data for notifyInfo.
I think Windows is using some mechanism which provides additional notifications for the same event as 0 events after the initial event has been fired with it's real value the user notified with, e.g. 256 for PRINTER_CHANGE_ADD_JOB. On the other hand it seems that some 0 events are simply fired to provide data for an upcoming event which then gets the real value of e.g. 1024 for PRINTER_CHANGE_DELETE_JOB, but without anymore data because that has already been delivered to the event consumer with a very early 0 event. Something like "Look, there's more for the last events." and "Look, something is going to happen with the data I already provide now." Implementing such an approach my prints now seem to work as expected.
Of course what I wrote doesn't fit to what is documented for FindNextPrinterChangeNotification, but it makes a bit of sense to me. ;-)
You're not checking for overflows or errors.
The documentation for FindNextPrinterChangeNotification says this:
If the PRINTER_NOTIFY_INFO_DISCARDED bit is set in the Flags member of
the PRINTER_NOTIFY_INFO structure, an overflow or error occurred, and
notifications may have been lost. In this case, no additional
notifications will be sent until you make a second
FindNextPrinterChangeNotification call that specifies
PRINTER_NOTIFY_OPTIONS_REFRESH.
You need to check for that flag and do as described above, and you should also be checking the return code from FindNextPrinterChangeNotification.
I'm having a problem where sometimes my code will function correctly, but other times it will fail.
This is the first bit of PDH related code that I run:
const std::wstring pidWildcardPath = L"\\Process(*)\\ID Process";
DWORD bufferSize = 0;
LPTSTR paths = NULL;
PDH_STATUS status = PdhExpandCounterPath(
pidWildcardPath.c_str(),
paths,
&bufferSize);
checkPDHStatus(status, PDH_MORE_DATA, L"Expected request for more data.");
The result of the PdhExpandCounterPath function call is 0x800007D0 (PDH_CSTATUS_NO_MACHINE). The checkPDHStatus function is a simple function that I wrote that asserts that the status is equal to the second parameter. In this case, I expect the result to be PDH_MORE_DATA because paths is NULL and bufferSize is 0. The goal of this call is to determine the size of the buffer I must allocate to store all of the results for a subsequent call to PdhExpandCounterPath. This is described in the PDH documentation under the Remarks section.
The list of PDH error codes describes PDH_MORE_DATA as "Unable to connect to the specified computer, or the computer is offline." As you can see by the performance counter path in the code above, I am not even trying to connect to a different computer than my own.
It is interesting the way that this code fails. Sometimes it works fine and then other times, it will fail on multiple back-to-back executions of my application. I have #include <pdh.h> in my header file and I have a section in my property sheet for this DLL that looks like this:
<Tool
Name="VCLinkerTool"
AdditionalDependencies="pdh.lib"
/>
I'm not sure if it matters, but this program is built by Visual Studio 2005 and run on Windows XP. Am I doing something incorrectly?
I'm a co-worker of Dave's and have discovered the following during my investigation:
the code above runs fine when run from a logged-in interactive session
the code runs fine when initiated as a Scheduled Task AND the user is logged in at the time the scheduled task is fired off
the code FAILS only when run as a Scheduled Task AND the user is NOT logged in at the time the task starts
the code continues to fail if the user logs in after the failing task has started but while it is still running (because it is looping "endlessly" until it gets a PDH_MORE_DATA status back).
In the failing instances, the following environment variables have not been established/set for the program: APPDATA, HOMEDRIVE and HOMEPATH ... I don't think this is a problem. However, the failing program also lacks the SeCreateGlobalPrivilege from its token; the passing programs all have this privilege in the token and PERFMON shows it as "Default Enabled". The other difference is that failing program has the NT_AUTH\BATCH user group in the token, while the passing program has NT_AUTH\INTERACTIVE instead ... all other user groups and privileges are the same for both cases. I think the global privilege is coming from the interactive login, but don't know if it has any bearing on PDH operation.
I cannot find anything in the Performance Counter/PDH documentation that talks about needing any special permissions or privileges for this functionality to succeed. Is the global privilege required to use Performance Counters ?
Or is there some other context/environment difference between running Scheduled Tasks (as a specific user) when that user is/isn't logged in at the time the task starts, that would account for the PDH call succeeding/failing respectively ?
Try this format, indicating the local computer:
const std::wstring pidWildcardPath = L"\\.\Process(*)\ID Process";
I am using QSettings to try and figure out if an INI is valid.(using status() to check) I made a purposefully invalid INI file and loaded it in. The first time the code is called, it returns invalid, but every time after that, it returns valid. Is this a bug in my code?
It's a Qt bug caused by some global state. Note that the difference in results happens whether or not you call delete on your QSettings object, which you should. Here's a brief summary of what happens on the first run:
The result code is set to NoError.
A global cache is checked to see if your file is present
Your file isn't present the first time, so it's parsed on qsettings.cpp line 1530 (Qt-4.6.2)
Parsing results in an error and the result code is set (see qsettings.cpp line 1552).
The error result code is returned.
And the second run is different:
The result code is set to NoError.
A global cache is checked, your file is present.
The file size and timestamp are checked to see if the file has changed (see qsettings.cpp line 1424).
The result code is returned, which happens to be NoError -- the file was assumed to have been parsed correctly.
Checked your code, you need to delete the file object before returning.
Apart from that, your code uses the QSettings::QSettings(fileName, format) c'tor to open an ini-file. That call ends in the function QConfFile::fromName (implemented in qsettings.cpp). As I read it (there are a few macros and such that I decided not to follow), the file is not re-opened if the file already is open (i.e. you have not deleted the object since the last time). Thus the status will be ok the second time around.