Why do I receive no WT_PACKETEXT window messages after initializing Wintab extensions? - c++

I'm currently trying to process the absolute value of a drawing tablet's touch ring through the Wintab API. However, despite following the instructions as described in the documentation, the WTOpen call doesn't seem to apply any extension settings. Using the touch ring after initializing Wintab still triggers the default events, while the default events for pen input are suppressed and all pen input is relayed to my application instead.
Here are the relevant segments of code:
...
#include "wintab.h"
#define PACKETDATA (PK_X | PK_Y | PK_Z | PK_NORMAL_PRESSURE | PK_ORIENTATION | PK_TIME | PK_BUTTONS)
#define PACKETMODE 0
#define PACKETTOUCHSTRIP PKEXT_ABSOLUTE
#define PACKETTOUCHRING PKEXT_ABSOLUTE
#include "pktdef.h"
...
internal b32
InitWinTab(HWND Window, window_mapping *Map)
{
    if(!LoadWintabFunctions())
        return false;

    LOGCONTEXT Tablet;
    AXIS TabletX, TabletY, TabletZ, Pressure;
    if(!gpWTInfoA(WTI_DEFCONTEXT, 0, &Tablet))
        return false;

    gpWTInfoA(WTI_DEVICES, DVC_X, &TabletX);
    gpWTInfoA(WTI_DEVICES, DVC_Y, &TabletY);
    gpWTInfoA(WTI_DEVICES, DVC_Z, &TabletZ);
    gpWTInfoA(WTI_DEVICES, DVC_NPRESSURE, &Pressure);

    UINT TouchStripOffset = 0xFFFF;
    UINT TouchRingOffset = 0xFFFF;
    for(UINT i = 0, ScanTag = 0; gpWTInfoA(WTI_EXTENSIONS + i, EXT_TAG, &ScanTag); i++)
    {
        if (ScanTag == WTX_TOUCHSTRIP)
            TouchStripOffset = i;
        if (ScanTag == WTX_TOUCHRING)
            TouchRingOffset = i;
    }

    Tablet.lcOptions |= CXO_MESSAGES;
    Tablet.lcPktData = PACKETDATA;
    Tablet.lcPktMode = PACKETMODE;
    Tablet.lcMoveMask = PACKETDATA;
    Tablet.lcBtnUpMask = Tablet.lcBtnDnMask;
    Tablet.lcInOrgX = 0;
    Tablet.lcInOrgY = 0;
    Tablet.lcInExtX = TabletX.axMax;
    Tablet.lcInExtY = TabletY.axMax;

    if(TouchStripOffset != 0xFFFF)
    {
        WTPKT DataMask;
        gpWTInfoA(WTI_EXTENSIONS + TouchStripOffset, EXT_MASK, &DataMask);
        Tablet.lcPktData |= DataMask;
    }
    if(TouchRingOffset != 0xFFFF)
    {
        WTPKT DataMask;
        gpWTInfoA(WTI_EXTENSIONS + TouchRingOffset, EXT_MASK, &DataMask);
        Tablet.lcPktData |= DataMask;
    }

    Map->AxisMax.x = (r32)TabletX.axMax;
    Map->AxisMax.y = (r32)TabletY.axMax;
    Map->AxisMax.z = (r32)TabletZ.axMax;
    Map->PressureMax = (r32)Pressure.axMax;

    if(!gpWTOpenA(Window, &Tablet, TRUE))
        return false;
    return(TabletX.axMax && TabletY.axMax && TabletZ.axMax && Pressure.axMax);
}
...
case WT_PACKET:
{
    PACKET Packet;
    if(gpWTPacket((HCTX)LParam, (UINT)WParam, &Packet))
    {
        ...
    }
} break;
case WT_PACKETEXT:
{
    PACKETEXT Packet;
    if(gpWTPacket((HCTX)LParam, (UINT)WParam, &Packet))
    {
        ...
    }
} break;
...
The bitmasks for the packet data in the initialization have sensible bits set for both extensions and don't overlap with the existing bitmask. No stage of the initialization fails. WT_PACKET gets sent only with valid packet data, while WT_PACKETEXT never gets sent. Furthermore, calling WTPacketsGet with a PACKETEXT pointer on the HCTX returned by WTOpen fills the packet with garbage from the regular packet queue. This leaves me with the conclusion that WTOpen somehow wasn't notified that the extensions should be loaded, and I'm unable to find what else I should set in the LOGCONTEXT data structure to change that.
Is there a mistake in my initialization? Or is there a way to get a better readout to why the extensions didn't load?

It turned out that a driver setting prevented the extension packets from being sent, in favor of using the touch ring for a different function. Changing this setting resolved the issue. The code itself didn't contain any errors.
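For completeness, this is roughly how the ring data is consumed once the driver forwards it (a sketch; EXT_AXES and the pkTouchRing field names follow Wacom's extended wintab.h/pktdef.h headers, so treat them as assumptions to verify against your SDK):
// Query the ring's axis range once, using the TouchRingOffset found above:
AXIS RingAxis = {0};
gpWTInfoA(WTI_EXTENSIONS + TouchRingOffset, EXT_AXES, &RingAxis);
...
case WT_PACKETEXT:
{
    PACKETEXT Packet;
    if(gpWTPacket((HCTX)LParam, (UINT)WParam, &Packet))
    {
        // Absolute ring position, normalized to [0, 1]:
        r32 Ring01 = (r32)Packet.pkTouchRing.nPosition / (r32)RingAxis.axMax;
    }
} break;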

Intel OneAPI Video decoding memory leak when using C++ CLI

I am trying to use Intel OneAPI/OneVPL to decode a stream I receive from an RTSP camera in C#. But when I run the code I get an enormous memory leak, around 1-200 MB per run, and a run happens around once every second.
When I've collected a GoP from the camera, where I know the first data is a keyframe, I pass it as a byte array to my CLI and C++ code.
Here I expect it to decode all the frames and return the decoded images. It receives 30 frames and returns 16 decoded images, but has a memory leak.
I've tried to use the Visual Studio memory profiler, and all I can tell from it is that it's unmanaged memory that's my problem. I've tried to override the new and delete operators inside videoHandler.cpp to track and compare all allocations and deallocations, and as far as I can tell everything is handled correctly in there. I cannot see any classes that get instantiated that do not get cleaned up. I think my issue is in the CLI class videoHandlerWrapper.cpp. Am I missing something obvious?
videoHandlerWrapper.cpp
array<imgFrameWrapper^>^ videoHandlerWrapper::decode(array<System::Byte>^ byteArray)
{
    array<imgFrameWrapper^>^ returnFrames = gcnew array<imgFrameWrapper^>(30);
    {
        std::vector<imgFrame> frames(30); //Output from decoding process. imgFrame implements a destructor that frees the data when exiting scope
        std::vector<unsigned char> bytes(byteArray->Length); //Input for decoding process
        Marshal::Copy(byteArray, 0, IntPtr((unsigned char*)(&((bytes)[0]))), byteArray->Length); //Copy from managed (C#) to unmanaged (C++)
        int status = _pVideoHandler->decode(bytes, frames); //Decode
        for (size_t i = 0; i < frames.size(); i++)
        {
            if (frames[i].size > 0)
                returnFrames[i] = gcnew imgFrameWrapper(frames[i].size, frames[i].bytes);
        }
    }
    //PrintMemoryUsage();
    return returnFrames;
}
videoHandler.cpp
#define BITSTREAM_BUFFER_SIZE 2000000 //TODO Maybe higher or lower bitstream buffer. Thorough testing has been done at 2000000
int videoHandler::decode(std::vector<unsigned char> bytes, std::vector<imgFrame> &frameData)
{
    int result = -1;
    bool isStillGoing = true;
    mfxBitstream bitstream = { 0 };
    mfxSession session = NULL;
    mfxStatus sts = MFX_ERR_NONE;
    mfxSurfaceArray* outSurfaces = nullptr;
    mfxU32 framenum = 0;
    mfxU32 numVPPCh = 0;
    mfxVideoChannelParam* mfxVPPChParams = nullptr;
    void* accelHandle = NULL;
    mfxVideoParam mfxDecParams = {};
    mfxVersion version = { 0, 1 };

    //variables used only in 2.x version
    mfxConfig cfg = NULL;
    mfxLoader loader = NULL;
    mfxVariant inCodec = {};
    std::vector<mfxU8> input_buffer;

    // Initialize VPL session for any implementation of HEVC/H265 decode
    loader = MFXLoad();
    VERIFY(NULL != loader, "MFXLoad failed -- is implementation in path?");

    cfg = MFXCreateConfig(loader);
    VERIFY(NULL != cfg, "MFXCreateConfig failed");

    inCodec.Type = MFX_VARIANT_TYPE_U32;
    inCodec.Data.U32 = MFX_CODEC_AVC;
    sts = MFXSetConfigFilterProperty(
        cfg,
        (mfxU8*)"mfxImplDescription.mfxDecoderDescription.decoder.CodecID",
        inCodec);
    VERIFY(MFX_ERR_NONE == sts, "MFXSetConfigFilterProperty failed for decoder CodecID");

    sts = MFXCreateSession(loader, 0, &session);
    VERIFY(MFX_ERR_NONE == sts, "Not able to create VPL session");

    // Print info about implementation loaded
    version = ShowImplInfo(session);
    //VERIFY(version.Major > 1, "Sample requires 2.x API implementation, exiting");
    if (version.Major == 1) {
        mfxVariant ImplValueSW;
        ImplValueSW.Type = MFX_VARIANT_TYPE_U32;
        ImplValueSW.Data.U32 = MFX_IMPL_TYPE_SOFTWARE;
        MFXSetConfigFilterProperty(cfg, (mfxU8*)"mfxImplDescription.Impl", ImplValueSW);
        sts = MFXCreateSession(loader, 0, &session);
        VERIFY(MFX_ERR_NONE == sts, "Not able to create VPL session");
    }

    // Convenience function to initialize available accelerator(s)
    accelHandle = InitAcceleratorHandle(session);

    bitstream.MaxLength = BITSTREAM_BUFFER_SIZE;
    bitstream.Data = (mfxU8*)calloc(bytes.size(), sizeof(mfxU8));
    VERIFY(bitstream.Data, "Not able to allocate input buffer");
    bitstream.CodecId = MFX_CODEC_AVC;
    std::copy(bytes.begin(), bytes.end(), bitstream.Data);
    bitstream.DataLength = static_cast<mfxU32>(bytes.size());

    memset(&mfxDecParams, 0, sizeof(mfxDecParams));
    mfxDecParams.mfx.CodecId = MFX_CODEC_AVC;
    mfxDecParams.IOPattern = MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
    sts = MFXVideoDECODE_DecodeHeader(session, &bitstream, &mfxDecParams);
    VERIFY(MFX_ERR_NONE == sts, "Error decoding header\n");

    numVPPCh = 1;
    mfxVPPChParams = new mfxVideoChannelParam[numVPPCh];
    for (mfxU32 i = 0; i < numVPPCh; i++) {
        mfxVPPChParams[i] = {};
    }

    //mfxVPPChParams[0].VPP.FourCC = mfxDecParams.mfx.FrameInfo.FourCC;
    mfxVPPChParams[0].VPP.FourCC = MFX_FOURCC_BGRA;
    mfxVPPChParams[0].VPP.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
    mfxVPPChParams[0].VPP.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
    mfxVPPChParams[0].VPP.FrameRateExtN = 30;
    mfxVPPChParams[0].VPP.FrameRateExtD = 1;
    mfxVPPChParams[0].VPP.CropW = 1920;
    mfxVPPChParams[0].VPP.CropH = 1080;
    //Set value directly if input and output is the same.
    mfxVPPChParams[0].VPP.Width = 1920;
    mfxVPPChParams[0].VPP.Height = 1080;
    //// USED TO RESIZE. IF INPUT IS THE SAME AS OUTPUT THIS WILL MAKE IT SHIFT A BIT. 1920x1080 becomes 1920x1088.
    //mfxVPPChParams[0].VPP.Width = ALIGN16(mfxVPPChParams[0].VPP.CropW);
    //mfxVPPChParams[0].VPP.Height = ALIGN16(mfxVPPChParams[0].VPP.CropH);
    mfxVPPChParams[0].VPP.ChannelId = 1;
    mfxVPPChParams[0].Protected = 0;
    mfxVPPChParams[0].IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY | MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
    mfxVPPChParams[0].ExtParam = NULL;
    mfxVPPChParams[0].NumExtParam = 0;

    sts = MFXVideoDECODE_VPP_Init(session, &mfxDecParams, &mfxVPPChParams, numVPPCh); //This causes a MINOR memory leak!

    outSurfaces = new mfxSurfaceArray;

    while (isStillGoing == true) {
        sts = MFXVideoDECODE_VPP_DecodeFrameAsync(session,
                                                  &bitstream,
                                                  NULL,
                                                  0,
                                                  &outSurfaces); //Big memory leak. 100MB per run in the while loop.
        switch (sts) {
        case MFX_ERR_NONE:
            // decode output
            if (framenum >= 30)
            {
                isStillGoing = false;
                break;
            }
            sts = WriteRawFrameToByte(outSurfaces->Surfaces[1], &frameData[framenum]);
            VERIFY(MFX_ERR_NONE == sts, "Could not write 1st vpp output");
            framenum++;
            break;
        case MFX_ERR_MORE_DATA:
            // The function requires more bitstream at input before decoding can proceed
            isStillGoing = false;
            break;
        case MFX_ERR_MORE_SURFACE:
            // The function requires more frame surface at output before decoding can proceed.
            // This applies to external memory allocations and should not be expected for
            // a simple internal allocation case like this
            break;
        case MFX_ERR_DEVICE_LOST:
            // For non-CPU implementations,
            // Cleanup if device is lost
            break;
        case MFX_WRN_DEVICE_BUSY:
            // For non-CPU implementations,
            // Wait a few milliseconds then try again
            break;
        case MFX_WRN_VIDEO_PARAM_CHANGED:
            // The decoder detected a new sequence header in the bitstream.
            // Video parameters may have changed.
            // In external memory allocation case, might need to reallocate the output surface
            break;
        case MFX_ERR_INCOMPATIBLE_VIDEO_PARAM:
            // The function detected that video parameters provided by the application
            // are incompatible with initialization parameters.
            // The application should close the component and then reinitialize it
            break;
        case MFX_ERR_REALLOC_SURFACE:
            // Bigger surface_work required. May be returned only if
            // mfxInfoMFX::EnableReallocRequest was set to ON during initialization.
            // This applies to external memory allocations and should not be expected for
            // a simple internal allocation case like this
            break;
        default:
            printf("unknown status %d\n", sts);
            isStillGoing = false;
            break;
        }
    }

    sts = MFXVideoDECODE_VPP_Close(session); // Helps massively! Halves the memory leak speed. Closes internal structures and tables.
    VERIFY(MFX_ERR_NONE == sts, "Error closing VPP session\n");

    result = 0;

end:
    printf("Decode and VPP processed %d frames\n", framenum);
    // Clean up resources - It is recommended to close components first, before
    // releasing allocated surfaces, since some surfaces may still be locked by
    // internal resources.
    if (mfxVPPChParams)
        delete[] mfxVPPChParams;
    if (outSurfaces)
        delete outSurfaces;
    if (bitstream.Data)
        free(bitstream.Data);
    if (accelHandle)
        FreeAcceleratorHandle(accelHandle);
    if (loader)
        MFXUnload(loader);
    return result;
}
imgFrameWrapper.h
public ref class imgFrameWrapper
{
private:
    size_t size;
    array<System::Byte>^ bytes;
public:
    imgFrameWrapper(size_t u_size, unsigned char* u_bytes);
    ~imgFrameWrapper();
    !imgFrameWrapper();
    size_t get_size();
    array<System::Byte>^ get_bytes();
};
imgFrameWrapper.cpp
imgFrameWrapper::imgFrameWrapper(size_t u_size, unsigned char* u_bytes)
{
    size = u_size;
    bytes = gcnew array<System::Byte>(size);
    Marshal::Copy((IntPtr)u_bytes, bytes, 0, size);
}
imgFrameWrapper::~imgFrameWrapper()
{
}
imgFrameWrapper::!imgFrameWrapper()
{
}
size_t imgFrameWrapper::get_size()
{
    return size;
}
array<System::Byte>^ imgFrameWrapper::get_bytes()
{
    return bytes;
}
imgFrame.h
struct imgFrame
{
    int size;
    unsigned char* bytes;
    ~imgFrame()
    {
        if (bytes)
            delete[] bytes;
    }
};
MFXVideoDECODE_VPP_DecodeFrameAsync() creates internal memory surfaces for the processing.
You should release those surfaces.
Please check this link; it mentions this:
https://spec.oneapi.com/onevpl/latest/API_ref/VPL_structs_decode_vpp.html#_CPPv415mfxSurfaceArray
mfxStatus (*Release)(struct mfxSurfaceArray *surface_array)
Decrements the internal reference counter of the surface. (*Release) should be called after using the (*AddRef) function to add a surface or when allocation logic requires it.
And please check this sample:
https://github.com/oneapi-src/oneVPL/blob/master/examples/hello-decvpp/src/hello-decvpp.cpp
Especially the WriteRawFrame_InternalMem() function in https://github.com/oneapi-src/oneVPL/blob/17968d8d2299352f5a9e09388d24e81064c81c87/examples/util/util/util.h
It shows how to release surfaces.
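For illustration, the success case of the loop could release what the runtime hands back roughly like this (a sketch based on the hello-decvpp sample; double-check the member names against your oneVPL headers):
case MFX_ERR_NONE:
    sts = WriteRawFrameToByte(outSurfaces->Surfaces[1], &frameData[framenum]);
    VERIFY(MFX_ERR_NONE == sts, "Could not write 1st vpp output");
    framenum++;
    // Release each surface the runtime allocated for this iteration...
    for (mfxU32 i = 0; i < outSurfaces->NumSurfaces; i++) {
        mfxFrameSurface1* surf = outSurfaces->Surfaces[i];
        surf->FrameInterface->Release(surf);
    }
    // ...then release the array wrapper itself via its Release member.
    outSurfaces->Release(outSurfaces);
    break;
With this pattern, outSurfaces starts as a plain nullptr that MFXVideoDECODE_VPP_DecodeFrameAsync fills in, so the new mfxSurfaceArray / delete outSurfaces pair in the question code would go away.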

DPDK 17.11.1 - drops seen when doing destination-based rate limiting

Editing the problem statement to highlight the core logic more.
We are seeing performance issues when doing destination-based rate limiting.
We maintain state for every {destination, source} pair (a max of 100 destinations and 2^16 sources). We have an array of 100 nodes, and at each node we have an rte_hash*. This hash table maintains the state of every source IP seen by that destination. We have a mapping for every destination seen (0 to 100), and this is used to index into the array. If a particular source exceeds a threshold defined for this destination within a second, we block the source; otherwise we allow it. At runtime, when we see traffic for only 2 or 3 destinations, there are no issues, but when we go beyond 5, we are seeing a lot of drops. Our function has to do a lookup and identify the flow matching the dest_ip and src_ip, process the flow, and decide whether it needs dropping. If the flow is not found, it is added to the hash.
struct flow_state {
    struct rte_hash* hash;
};
struct flow_state flow_state_arr[100];
// I am going to create these hash tables using rte_hash_create at pipeline_init and free them during pipeline_free.
I'm outlining what we do in pseudocode; a sketch of the table creation follows after it.
run()
{
    1) do rx
    2) from the pkt, get the index into flow_state_arr and retrieve the rte_hash* handle
    3) rte_hash_lookup_data(hash, src_ip, flow_data)
    4) if an entry is found, take a decision on the flow (the decision is simply rate limiting the flow)
    5) else rte_hash_add_data(hash, src_ip, new_flow_data) to add the flow to the table and forward
}
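For concreteness, the per-destination table creation at pipeline_init looks roughly like this (a sketch with illustrative parameters, not our exact production values):
/* Requires <rte_hash.h> and <rte_jhash.h>. */
struct rte_hash_parameters params = {0};
char name[RTE_HASH_NAMESIZE];
for (int i = 0; i < 100; i++) {
    snprintf(name, sizeof(name), "flow_table_%d", i);
    params.name = name;
    params.entries = 65536;              /* one slot per possible source */
    params.key_len = sizeof(uint32_t);   /* key is the source IP */
    params.hash_func = rte_jhash;
    params.socket_id = rte_socket_id();
    flow_state_arr[i].hash = rte_hash_create(&params);
}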
Please advise whether we can have these multiple hash table objects in the data path, or what the best way is if we need to handle state for every destination separately.
Edit
Thanks for answering. I will be glad to share the code snippets and our gathered results. I don't have comparison results for other DPDK versions, but below are some of the results for our tests using 17.11.1.
Test Setup
I'm using an IXIA traffic generator (two 10G links generating 12 Mpps) for 3 destinations, 14.143.156.x (in this case x = 101, 102, 103). Each destination's traffic comes from 2^16 different sources. This is the traffic-gen setup.
Code Snippet
struct flow_state_t {
    struct rte_hash* hash;
    uint32_t size;
    uint64_t threshold;
};
struct flow_data_t {
    uint8_t curr_state; // 0 if blocked, 1 if allowed
    uint64_t pps_count;
    uint64_t src_first_seen;
};
struct pipeline_ratelimit {
    struct pipeline p;
    struct pipeline_ratelimit_params params;
    rte_table_hash_op_hash f_hash;
    uint32_t swap_field0_offset[SWAP_DIM];
    uint32_t swap_field1_offset[SWAP_DIM];
    uint64_t swap_field_mask[SWAP_DIM];
    uint32_t swap_n_fields;
    pipeline_msg_req_handler custom_handlers[2]; // handlers for add and del
    struct flow_state_t flow_state_arr[100];
    struct flow_data_t flows[100][65536];
} __rte_cache_aligned;
/*
 * add_handler(pipeline, msg) -- msg includes index and threshold
 * In the add handler:
 *   a rule/threshold is added for a destination
 *   rte_hash_create and store the rte_hash* in flow_state_arr[index]
 *   a max of 100 destinations or rules is allowed
 *   previous pipelines add the ID (index) to the packet to look into the
 *   flow_state_arr for the rule
 */
/*
 * del_handler(pipeline, msg) -- msg includes index
 * In the del handler:
 *   the rule/threshold #index is deleted
 *   the associated rte_hash* is also freed
 *   the slot is made free
 */
#define ALLOWED 1
#define BLOCKED 0
#define TABLE_MAX_CAPACITY 65536
int do_rate_limit(struct pipeline_ratelimit* ps, uint32_t id, unsigned char* pkt)
{
    uint64_t curr_time_stamp = rte_get_timer_cycles();
    struct iphdr* iph = (struct iphdr*)pkt;
    uint32_t src_ip = rte_be_to_cpu_32(iph->saddr);
    struct flow_state_t* node = &ps->flow_state_arr[id];
    struct flow_data_t* flow = NULL;
    rte_hash_lookup_data(node->hash, &src_ip, (void**)&flow);
    if (flow != NULL)
    {
        if (flow->curr_state == ALLOWED)
        {
            if (flow->pps_count++ > node->threshold)
            {
                uint64_t seconds_elapsed = (curr_time_stamp - flow->src_first_seen) / CYCLES_IN_1_SEC;
                if (seconds_elapsed)
                {
                    flow->src_first_seen += seconds_elapsed * CYCLES_IN_1_SEC;
                    flow->pps_count = 1;
                    return ALLOWED;
                }
                else
                {
                    flow->pps_count = 0;
                    flow->curr_state = BLOCKED;
                    return BLOCKED;
                }
            }
            return ALLOWED;
        }
        else
        {
            uint64_t seconds_elapsed = (curr_time_stamp - flow->src_first_seen) / CYCLES_IN_1_SEC;
            if (seconds_elapsed > 120)
            {
                flow->curr_state = ALLOWED;
                flow->pps_count = 0;
                flow->src_first_seen += seconds_elapsed * CYCLES_IN_1_SEC;
                return ALLOWED;
            }
            return BLOCKED;
        }
    }
    int index = node->size;
    // If the entry is not found and we have reached capacity,
    // reset the table and reuse slot 0 for the new node
    if (node->size == TABLE_MAX_CAPACITY)
    {
        rte_hash_reset(node->hash);
        index = node->size = 0;
    }
    // Add the new element at flows[id][index]
    struct flow_data_t* flow_data = &ps->flows[id][index];
    *flow_data = { ALLOWED, 1, curr_time_stamp };
    node->size++;
    // Add the new key to the hash
    rte_hash_add_key_data(node->hash, (void*)&src_ip, (void*)flow_data);
    return ALLOWED;
}
static int pipeline_ratelimit_run(void* pipeline)
{
    struct pipeline_ratelimit* ps = (struct pipeline_ratelimit*)pipeline;
    struct pipeline* p = &ps->p;
    struct rte_port_in* port_in = p->port_in_next;
    struct rte_port_out* port_out = &p->ports_out[0];
    struct rte_port_out* port_drop = &p->ports_out[2];
    uint8_t valid_pkt_cnt = 0, invalid_pkt_cnt = 0;
    struct rte_mbuf* valid_pkts[RTE_PORT_IN_BURST_SIZE_MAX];
    struct rte_mbuf* invalid_pkts[RTE_PORT_IN_BURST_SIZE_MAX];
    memset(valid_pkts, 0, sizeof(valid_pkts));
    memset(invalid_pkts, 0, sizeof(invalid_pkts));
    uint64_t n_pkts;
    if (unlikely(port_in == NULL)) {
        return 0;
    }
    /* Input port RX */
    n_pkts = port_in->ops.f_rx(port_in->h_port, p->pkts,
        port_in->burst_size);
    if (n_pkts == 0)
    {
        p->port_in_next = port_in->next;
        return 0;
    }
    uint32_t rc = 0;
    char* rx_pkt = NULL;
    for (uint32_t j = 0; j < n_pkts; j++) {
        struct rte_mbuf* m = p->pkts[j];
        rx_pkt = rte_pktmbuf_mtod(m, char*);
        uint32_t id = rte_be_to_cpu_32(*(uint32_t*)(rx_pkt - sizeof(uint32_t)));
        unsigned short packet_len = rte_be_to_cpu_16(*((unsigned short*)(rx_pkt + 16)));
        struct flow_state_t* node = &(ps->flow_state_arr[id]);
        if (node->hash && node->threshold != 0)
        {
            // Decide whether to allow or drop the packet
            // returns allow - 1, drop - 0
            if (do_rate_limit(ps, id, (unsigned char*)(rx_pkt + 14)))
                valid_pkts[valid_pkt_cnt++] = m;
            else
                invalid_pkts[invalid_pkt_cnt++] = m;
        }
        else
            valid_pkts[valid_pkt_cnt++] = m;
    }
    if (invalid_pkt_cnt) {
        p->pkts_mask = 0;
        rte_memcpy(p->pkts, invalid_pkts, sizeof(invalid_pkts));
        p->pkts_mask = RTE_LEN2MASK(invalid_pkt_cnt, uint64_t);
        rte_pipeline_action_handler_port_bulk_mod(p, p->pkts_mask, port_drop);
    }
    p->pkts_mask = 0;
    memset(p->pkts, 0, sizeof(p->pkts));
    if (valid_pkt_cnt != 0)
    {
        rte_memcpy(p->pkts, valid_pkts, sizeof(valid_pkts));
        p->pkts_mask = RTE_LEN2MASK(valid_pkt_cnt, uint64_t);
    }
    rte_pipeline_action_handler_port_bulk_mod(p, p->pkts_mask, port_out);
    /* Pick candidate for next port IN to serve */
    p->port_in_next = port_in->next;
    return (int)n_pkts;
}
RESULTS
When traffic was generated for only one destination from 60000 sources with a threshold of 14 Mpps, there were no drops. We were able to send 12 Mpps from IXIA and receive 12 Mpps.
Drops were observed after adding 3 or more destinations (each configured to receive traffic from 60000 sources). The throughput was only 8-9 Mpps. When sent to 100 destinations (60000 sources each), only 6.4 Mpps were handled; a 50% drop was seen.
On running it through VTune Profiler, it reported rte_hash_lookup_data as the hotspot and mostly memory bound (DRAM bound). I will attach the VTune report soon.
Based on the update from internal testing, the rte_hash library is not causing the performance drops. Hence, as suggested in the comments, it is more likely the current access pattern and algorithm design, which may be leading to cache misses and fewer instructions per cycle.
To identify whether it is a frontend stall, a backend pipeline stall, or a memory stall, please use either perf or VTune. Also try to minimize branching, use likely/unlikely hints, and prefetch too.
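As one illustration of hiding the DRAM latency, the burst could be looked up in one shot with rte_hash_lookup_bulk_data so the hash probes overlap instead of serializing (a sketch; it assumes the whole burst targets one destination's table and that src_ips[] was filled while parsing the headers):
/* Batch the lookups for a burst so memory accesses overlap. */
const void* keys[RTE_PORT_IN_BURST_SIZE_MAX];
void* data[RTE_PORT_IN_BURST_SIZE_MAX];
uint64_t hit_mask = 0;
for (uint32_t j = 0; j < n_pkts; j++)
    keys[j] = &src_ips[j];
rte_hash_lookup_bulk_data(node->hash, keys, n_pkts, &hit_mask, data);
for (uint32_t j = 0; j < n_pkts; j++) {
    if (hit_mask & (1ULL << j)) {
        struct flow_data_t* flow = (struct flow_data_t*)data[j];
        /* run the allow/block state machine on the already-fetched entry */
    } else {
        /* flow not found: add it as in do_rate_limit() */
    }
}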

More efficient way of reading CAN data in a while loop

I have 3 devices which send 8 bytes of data over a CAN interface. To read the buffer from CAN, I am using a while loop which looks something like this:
void CanServer::ReadFromCAN() {
    data_from_buffer_.clear();
    can_frame frame;
    read_can_port_ = read(soc_, &frame, sizeof(struct can_frame));
    if (read_can_port_ < 0) return;
    id_ = frame.can_id & 0x1FFFFFFF;
    dlc_ = frame.can_dlc;
    for (const auto& byte : frame.data)
        data_from_buffer_.push_back(byte);
}
while (ros::ok()) {
    std_msgs::Int32MultiArray tachometer_array;
    std::vector<__u8> data_from_can;
    /***
     * Read for the Radar1
     */
    this->ReadFromCAN();
    if (read_can_port_ < 0) continue;
    //ROS_INFO("Read from CAN");
    if (id_ == can_id::RadarFrame1) {
        for (int i = 0; i < dlc_; i++) {
            radar1_bytes_[i] = data_from_buffer_[i];
            radar1_buffer_.push_back(data_from_buffer_[i]);
        }
        if (IsMagicWord(radar1_bytes_, 0)) {
            frame_id = "radar1_link";
            this->PulbishRadarPCL(frame_id, radar1_pub_, radar1_buffer_, 0);
            radar1_buffer_.clear();
            canFrame_.can_dlc = 0;
        }
    }
    if (id_ == can_id::RadarFrame2) {
        for (int i = 0; i < dlc_; i++) {
            radar2_bytes_[i] = data_from_buffer_[i];
            radar2_buffer_.push_back(data_from_buffer_[i]);
        }
        if (IsMagicWord(radar2_bytes_, 1)) {
            frame_id = "radar2_link";
            this->PulbishRadarPCL(frame_id, radar2_pub_, radar2_buffer_, 1);
            radar2_buffer_.clear();
            canFrame_.can_dlc = 0;
        }
    }
    if (id_ == can_id::RadarFrame3) {
        for (int i = 0; i < dlc_; i++) {
            radar3_bytes_[i] = data_from_buffer_[i];
            radar3_buffer_.push_back(data_from_buffer_[i]);
        }
        if (IsMagicWord(radar3_bytes_, 2)) {
            frame_id = "radar3_link";
            this->PulbishRadarPCL(frame_id, radar3_pub_, radar3_buffer_, 2);
            radar3_buffer_.clear();
            canFrame_.can_dlc = 0;
        }
    }
    rate.sleep();
}
Here rate.sleep() is similar to the sleep() function in C++.
Right now, I am running this while loop at 5 MHz; however, I think this is overkill, and I am getting almost 100% CPU usage on one core.
I tried to play around with the delay time, but I think this is highly inefficient, and I wonder: is there any other way to handle this?
It turns out that poll is what you need. Here is my example.
First, create a pollfd structure from the <poll.h> header in Linux. I have decided to make it a class member, but you can create it however you like:
pollfd poll_;
poll_.fd = soc_;
poll_.events = POLLIN;
poll_.revents = 0;
Here, soc_ is a socket and POLLIN means that you want to read from the socket.
Then, in my while loop, instead of delaying I just use this function at the beginning:
poll_int = poll(&poll_, 1, 100);
if (poll_int <= 0) continue;
poll() returns a positive value when the socket has data ready to read, and I set a timeout of 100 ms (just a number I picked; I know the data comes in at a much higher rate).
With that, you will only read data from the socket whenever poll returns a value greater than 0.
The result? 3% CPU usage, and if you add more data to your socket flow, poll keeps up, so this is a scalable way of reading something like a CAN bus.
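Putting the pieces together, a minimal self-contained version of the read path might look like this (a sketch; soc is assumed to be an already-opened SocketCAN socket):
#include <poll.h>
#include <unistd.h>
#include <linux/can.h>

// Wait up to 100 ms for a frame, then read it. Returns false on
// timeout, error, or short read.
bool WaitAndReadFrame(int soc, can_frame& frame) {
    pollfd pfd{};
    pfd.fd = soc;
    pfd.events = POLLIN;
    int rc = poll(&pfd, 1, 100);  // blocks here instead of busy-waiting
    if (rc <= 0 || !(pfd.revents & POLLIN))
        return false;
    return read(soc, &frame, sizeof(frame)) == sizeof(frame);
}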

Setting a hardware breakpoint in a multithreaded application doesn't fire

I wrote a little debugger for analysing and logging certain problems. Now I've implemented a hardware breakpoint to detect when a memory address is overwritten. When I run my debugger with a test process, everything works fine: when I access the address, the breakpoint fires and the callstack is logged. The problem arises when I run the same against an application running multiple threads. I'm replicating the breakpoint into every thread that gets created, as well as the main thread. None of the functions report an error and everything looks fine, but when the address is accessed, the breakpoint never fires.
So I wonder if there is documentation describing this, or if there are additional things I have to do in the case of a multithreaded application.
The function to set the breakpoint is this:
#ifndef _HARDWARE_BREAKPOINT_H
#define _HARDWARE_BREAKPOINT_H

#include "breakpoint.h"

#define MAX_HARDWARE_BREAKPOINT 4

#define REG_DR0_BIT 1
#define REG_DR1_BIT 4
#define REG_DR2_BIT 16
#define REG_DR3_BIT 64

class HardwareBreakpoint : public Breakpoint
{
public:
    typedef enum
    {
        REG_INVALID = -1,
        REG_DR0 = 0,
        REG_DR1 = 1,
        REG_DR2 = 2,
        REG_DR3 = 3
    } Register;

    typedef enum
    {
        CODE,
        READWRITE,
        WRITE,
    } Type;

    typedef enum
    {
        SIZE_1,
        SIZE_2,
        SIZE_4,
        SIZE_8,
    } Size;

    typedef struct
    {
        void *pAddress;
        bool bBusy;
        Type nType;
        Size nSize;
        Register nRegister;
    } Info;

public:
    HardwareBreakpoint(HANDLE hThread);
    virtual ~HardwareBreakpoint(void);

    /**
     * Sets a hardware breakpoint. If no register is free or an error occurred,
     * REG_INVALID is returned, otherwise the hardware register for the given breakpoint.
     */
    Register set(void *pAddress, Type nType, Size nSize);
    void remove(void *pAddress);
    void remove(Register nRegister);

    inline Info const *getInfo(Register nRegister) const { return &mBreakpoint[nRegister]; }

private:
    typedef Breakpoint super;

private:
    Info mBreakpoint[MAX_HARDWARE_BREAKPOINT];
    size_t mRegBit[MAX_HARDWARE_BREAKPOINT];
    size_t mRegOffset[MAX_HARDWARE_BREAKPOINT];
};
#endif // _HARDWARE_BREAKPOINT_H
void SetBits(DWORD_PTR &dw, size_t lowBit, size_t bits, size_t newValue)
{
    DWORD_PTR mask = (1 << bits) - 1;
    dw = (dw & ~(mask << lowBit)) | (newValue << lowBit);
}

HardwareBreakpoint::HardwareBreakpoint(HANDLE hThread)
: super(hThread)
{
    mRegBit[REG_DR0] = REG_DR0_BIT;
    mRegBit[REG_DR1] = REG_DR1_BIT;
    mRegBit[REG_DR2] = REG_DR2_BIT;
    mRegBit[REG_DR3] = REG_DR3_BIT;

    CONTEXT ct;
    mRegOffset[REG_DR0] = reinterpret_cast<size_t>(&ct.Dr0) - reinterpret_cast<size_t>(&ct);
    mRegOffset[REG_DR1] = reinterpret_cast<size_t>(&ct.Dr1) - reinterpret_cast<size_t>(&ct);
    mRegOffset[REG_DR2] = reinterpret_cast<size_t>(&ct.Dr2) - reinterpret_cast<size_t>(&ct);
    mRegOffset[REG_DR3] = reinterpret_cast<size_t>(&ct.Dr3) - reinterpret_cast<size_t>(&ct);

    memset(&mBreakpoint[0], 0, sizeof(mBreakpoint));
    for(int i = 0; i < MAX_HARDWARE_BREAKPOINT; i++)
        mBreakpoint[i].nRegister = (Register)i;
}

HardwareBreakpoint::Register HardwareBreakpoint::set(void *pAddress, Type nType, Size nSize)
{
    CONTEXT ct = {0};

    super::setAddress(pAddress);

    ct.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    if(!GetThreadContext(getThread(), &ct))
        return HardwareBreakpoint::REG_INVALID;

    size_t iReg = 0;
    for(int i = 0; i < MAX_HARDWARE_BREAKPOINT; i++)
    {
        if (ct.Dr7 & mRegBit[i])
            mBreakpoint[i].bBusy = true;
        else
            mBreakpoint[i].bBusy = false;
    }

    Info *reg = NULL;

    // Address already used?
    for(int i = 0; i < MAX_HARDWARE_BREAKPOINT; i++)
    {
        if(mBreakpoint[i].pAddress == pAddress)
        {
            iReg = i;
            reg = &mBreakpoint[i];
            break;
        }
    }

    if(reg == NULL)
    {
        for(int i = 0; i < MAX_HARDWARE_BREAKPOINT; i++)
        {
            if(!mBreakpoint[i].bBusy)
            {
                iReg = i;
                reg = &mBreakpoint[i];
                break;
            }
        }
    }

    // No free register available
    if(!reg)
        return HardwareBreakpoint::REG_INVALID;

    *(void **)(((char *)&ct)+mRegOffset[iReg]) = pAddress;
    reg->bBusy = true;

    ct.Dr6 = 0;

    int st = 0;
    if (nType == CODE)
        st = 0;
    if (nType == READWRITE)
        st = 3;
    if (nType == WRITE)
        st = 1;

    int le = 0;
    if (nSize == SIZE_1)
        le = 0;
    else if (nSize == SIZE_2)
        le = 1;
    else if (nSize == SIZE_4)
        le = 3;
    else if (nSize == SIZE_8)
        le = 2;

    SetBits(ct.Dr7, 16 + iReg*4, 2, st);
    SetBits(ct.Dr7, 18 + iReg*4, 2, le);
    SetBits(ct.Dr7, iReg*2, 1, 1);

    ct.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    if(!SetThreadContext(getThread(), &ct))
        return REG_INVALID;

    return reg->nRegister;
}
I'm setting the breakpoint in the main debugger loop whenever a new thread is created (CREATE_THREAD_DEBUG_EVENT), but looking at the source code of GDB, it seems not to be done there, so maybe that is too early?
So I finally found the answer to this problem.
In the debug event loop, I'm monitoring the events that Windows sends me. One of those events is CREATE_THREAD_DEBUG_EVENT, which I used to set the hardware breakpoint whenever a new thread was created.
The problem is that the notification of this event arrives before the thread has actually started. Windows sets the context for the first time AFTER this event is sent, which of course overwrites any context data that I set before.
The solution I implemented now is: when a CREATE_THREAD_DEBUG_EVENT arrives, I put a software breakpoint at the start address of the thread, so that the first instruction executed is my breakpoint. When I receive the breakpoint event, I restore the original code and install the hardware breakpoint, which now fires fine.
If there is a better solution, I'm all ears. :)
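In code, the workaround looks roughly like this (a sketch with error handling omitted; hProcess, hThread, dbgEvent, and pWatchedAddress are assumed to come from the surrounding debug loop):
// On CREATE_THREAD_DEBUG_EVENT: plant an INT3 at the thread's start address.
BYTE original = 0, int3 = 0xCC;
LPVOID start = (LPVOID)dbgEvent.u.CreateThread.lpStartAddress;
ReadProcessMemory(hProcess, start, &original, 1, NULL);
WriteProcessMemory(hProcess, start, &int3, 1, NULL);
FlushInstructionCache(hProcess, start, 1);

// On the matching EXCEPTION_BREAKPOINT: restore the byte, rewind the
// instruction pointer to the start address, then install the hardware breakpoint.
WriteProcessMemory(hProcess, start, &original, 1, NULL);
FlushInstructionCache(hProcess, start, 1);
CONTEXT ct = {0};
ct.ContextFlags = CONTEXT_CONTROL;
GetThreadContext(hThread, &ct);
ct.Rip = (DWORD64)start; // ct.Eip on 32-bit targets
SetThreadContext(hThread, &ct);
HardwareBreakpoint bp(hThread);
bp.set(pWatchedAddress, HardwareBreakpoint::WRITE, HardwareBreakpoint::SIZE_4);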

Sending Key Presses with Interception

I have tried all the normal methods of faking keyboard actions (SendInput/SendKeys/etc.), but none of them seemed to work for games that use DirectInput. After a lot of reading and searching I stumbled across Interception, a C++ library that lets you hook into your devices.
It has been a very long time since I worked with C++ (nothing existed for C#), so I am having some trouble with this. I have pasted the sample code below.
Does it look like there would be any way to initiate key actions from code using this? The samples all just hook into the devices and rewrite actions (the x key prints y, inverts the mouse axis, etc.).
enum ScanCode
{
    SCANCODE_X = 0x2D,
    SCANCODE_Y = 0x15,
    SCANCODE_ESC = 0x01
};

int main()
{
    InterceptionContext context;
    InterceptionDevice device;
    InterceptionKeyStroke stroke;

    raise_process_priority();

    context = interception_create_context();
    interception_set_filter(context, interception_is_keyboard, INTERCEPTION_FILTER_KEY_DOWN | INTERCEPTION_FILTER_KEY_UP);

    /*
    for (int i = 0; i < 10; i++)
    {
        Sleep(1000);
        stroke.code = SCANCODE_Y;
        interception_send(context, device, (const InterceptionStroke *)&stroke, 1);
    }
    */

    while(interception_receive(context, device = interception_wait(context), (InterceptionStroke *)&stroke, 1) > 0)
    {
        if(stroke.code == SCANCODE_X) stroke.code = SCANCODE_Y;
        interception_send(context, device, (const InterceptionStroke *)&stroke, 1);
        if(stroke.code == SCANCODE_ESC) break;
    }
}
The code I commented out was something I tried that didn't work.
You need to toggle the key state between DOWN and UP to fake key presses. Pay attention to the while loop: the device variable is returned by interception_wait, so your commented-out code would send events to what device? device is not initialized! Forget that code and try something more basic. Look at the line inside the loop with the interception_send call and make two more calls after it, but don't forget to change stroke.state before each call, using INTERCEPTION_KEY_DOWN and INTERCEPTION_KEY_UP, so that you fake down and up events. You'll get extra keys at each keyboard event.
Also, you may try INTERCEPTION_FILTER_KEY_ALL instead of INTERCEPTION_FILTER_KEY_DOWN | INTERCEPTION_FILTER_KEY_UP. The arrow keys may be special ones, as mentioned on the website.
void ThreadMethod()
{
    while (true)
    {
        if (turn)
        {
            for (int i = 0; i < 10; i++)
            {
                Sleep(1000);
                InterceptionKeyStroke stroke;
                stroke.code = SCANCODE_Y;
                stroke.state = 0;
                interception_send(context, device, (const InterceptionStroke *)&stroke, 1);
                Sleep(1);
                stroke.state = 1;
                interception_send(context, device, (const InterceptionStroke *)&stroke, 1);
                turn = false;
            }
        }
        else Sleep(1);
    }
}

CreateThread(NULL, NULL, (LPTHREAD_START_ROUTINE)ThreadMethod, NULL, NULL, NULL);
while (interception_receive(context, device = interception_wait(context), (InterceptionStroke*)&stroke, 1) > 0)
{
    if (stroke.code == SCANCODE_F5) turn = true;
    interception_send(context, device, (InterceptionStroke*)&stroke, 1);
    if (stroke.code == SCANCODE_ESC) break;
}