I have this C++ code that shows how to extend an application by compiling a plugin to a DLL and putting it in the application folder:
#include <windows.h>
#include <string.h> /* strlen, strcmp */
#include <stdlib.h> /* itoa (non-standard, but provided by the Windows CRTs) */
#include <DemoPlugin.h>

/** A helper function to convert a char array into an
    LPBYTE array. */
LPBYTE message(const char* message, long* pLen)
{
    size_t length = strlen(message);
    LPBYTE mem = (LPBYTE) GlobalAlloc(GPTR, length + 1);
    for (size_t i = 0; i < length; i++)
    {
        mem[i] = message[i];
    }
    *pLen = (long) (length + 1);
    return mem;
}

long __stdcall Execute(char* pMethodName, char* pParams,
                       char** ppBuffer, long* pBuffSize, long* pBuffType)
{
    *pBuffType = 1;
    if (strcmp(pMethodName, "") == 0)
    {
        *ppBuffer = (char*) message("Hello, World!", pBuffSize);
    }
    else if (strcmp(pMethodName, "Count") == 0)
    {
        char buffer[1024];
        int length = (int) strlen(pParams);
        *ppBuffer = (char*) message(itoa(length, buffer, 10), pBuffSize);
    }
    else
    {
        *ppBuffer = (char*) message("Incorrect usage.", pBuffSize);
    }
    return 0;
}
Is it possible to make a plugin this way using Cython? Or even py2exe? The DLL just has to have an entry point, right?
Or should I just compile it natively and embed Python using elmer?
I think the solution is to use both. Let me explain.
Cython makes it convenient to write a fast plugin in Python, but inconvenient (if possible at all) to produce the right "kind" of DLL. You would probably have to use the standalone mode so that the necessary Python runtime is included, and then mess with the generated C code so that an appropriate DLL gets compiled.
Conversely, elmer makes it convenient to build the DLL, but it runs "pure" Python code, which might not be fast enough. I assume speed is an issue, since you are considering Cython as opposed to simple embedding.
My suggestion is this: have the pure Python code that elmer executes import a standard Cython extension module and call into it. That way you don't have to hack anything ugly, and you get the best of both worlds.
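For reference, the native side that elmer (or any other embedding approach) has to produce is conceptually quite small. Here is a rough, untested sketch of an Execute entry point that just forwards the call to an embedded Python function; it uses the generic CPython embedding API, and the module name "demo_plugin" and its "execute" function are made up for illustration (that Python module could in turn import a compiled Cython extension):

#include <windows.h>
#include <string.h>
#include <Python.h> /* CPython embedding API */

long __stdcall Execute(char* pMethodName, char* pParams,
                       char** ppBuffer, long* pBuffSize, long* pBuffType)
{
    *pBuffType = 1;
    if (!Py_IsInitialized())
        Py_Initialize();

    /* Import the (hypothetical) pure-Python glue module. */
    PyObject* module = PyImport_ImportModule("demo_plugin");
    if (!module)
        return 1;

    /* Call demo_plugin.execute(method_name, params) -> str */
    PyObject* result = PyObject_CallMethod(module, "execute", "ss",
                                           pMethodName, pParams);
    Py_DECREF(module);
    if (!result)
        return 1;

    /* Copy the returned string into a buffer the host application owns. */
    const char* text = PyUnicode_AsUTF8(result);
    size_t len = text ? strlen(text) + 1 : 1;
    char* mem = (char*) GlobalAlloc(GPTR, len);
    if (text)
        memcpy(mem, text, len);
    *ppBuffer = mem;
    *pBuffSize = (long) len;
    Py_DECREF(result);
    return 0;
}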
One more solution to consider is Shed Skin, because that way you can get C++ code from your Python code that is independent of the Python runtime.
I have a very serious problem with Embarcadero C++Builder 10.2.2 when compiling a mid-sized in-house application with the "classic" BCC32 compiler.
There's a class that wraps some DLL functions responsible for reading the number of available hardware USB interface boxes (used as a connection interface) and their serial numbers as strings. The code below compiles without a flaw with BCC32 as well as with BCC32c, but at runtime the BCC32 version crashes horribly, while everything works fine with BCC32c.
The number of connected devices is successfully detected in both cases (it is 2 in my setup), but instead of returning the correct UsbDeviceSerialNo in
FDllPointers.GetUSBDeviceSN()
(for example 126739701012387721219679016621) there's only garbage in there:
��\x18.
In the second iteration of
for (unsigned int UsbDeviceNo = 0; UsbDeviceNo < 32; ++UsbDeviceNo)
there's an 'Access violation at address 00000000...' in the line
memset(UsbDeviceSerialNo, 0, SerNoBufferSize);
std::size_t TComUsbRoot::RefreshInterfaces()
{
    // Note: This function does not work correctly when compiled
    // with Borland BCC32 due to offset errors and AVs!!
    // I don't have a clue, why this is the case at the moment.
    const unsigned int SerNoBufferSize = 40;
    unsigned long UsbDeviceMask = 0;
    char UsbDeviceSerialNo[SerNoBufferSize];

    // Clear the current interfaces
    FInterfaces.clear(); // <-- std::vector<TComUsbInterface>

    // Get the currently available interface count
    int InterfaceCount = FDllPointers.GetAvailableUSBDevices(&UsbDeviceMask); // <-- int WINAPI CC_GetAvailableUSBDevices(unsigned long * DeviceMask); (from DLL)

    // We need at least one interface to proceed
    if (!InterfaceCount)
        return 0;

    // Check all USB device numbers
    for (unsigned int UsbDeviceNo = 0; UsbDeviceNo < 32; ++UsbDeviceNo)
    {
        // Check the USB device mask
        if (((1 << UsbDeviceNo) & UsbDeviceMask) != 0)
        {
            // Set the serial no to zeroes
            memset(UsbDeviceSerialNo, 0, SerNoBufferSize);
            // Get the USB device's serial number
            int Result = FDllPointers.GetUSBDeviceSN(
                UsbDeviceNo,
                UsbDeviceSerialNo,
                SerNoBufferSize
            );
            if (Result != 0)
            {
                throw Except(
                    std::string("ComUSB: Error while reading the USB device's serial number")
                );
            }
            // Convert the serial number to wstring
            std::wstring SerialNo(CnvToWStr(UsbDeviceSerialNo));
            // Create a new interface object with the USB device's parameters
            TComUsbInterface Interface(
                *this,
                UsbDeviceNo,
                SerialNo,
                FDllPointers,
                FLanguage
            );
            // Add it to the vector
            FInterfaces.push_back(Interface);
        }
    }
    // Return the size of the interfaces vector
    return FInterfaces.size();
}
I simply don't know what's wrong here, as everything is all right with the new, Clang-based compiler. Unfortunately I cannot use the new compiler for this project, as the code completion (which is crappy and compiler-based in C++Builder, and feels like using an early beta of something) makes the IDE crash five times an hour, and the overall compilation time is simply unbearable, even with precompiled headers.
I'm experimenting with http://www.capstone-engine.org on macOS, with a macOS x86_64 binary. It more or less works; however, I have two concerns.
I'm loading a test dylib:
[self custom_logging:[NSString stringWithFormat:@"Module Path:%@", clientPath]];
NSMutableData *ModuleNSDATA = [NSMutableData dataWithContentsOfFile:clientPath];
[self custom_logging:[NSString stringWithFormat:@"Client Module Size: %lu MB", (ModuleNSDATA.length/1024/1024)]];
[ModuleNSDATA replaceBytesInRange:NSMakeRange(0, 20752) withBytes:NULL length:0];
uint8_t *bytes = (uint8_t*)[ModuleNSDATA bytes];
long size = [ModuleNSDATA length]/sizeof(uint8_t);
[self custom_logging:[NSString stringWithFormat:@"UInt8_t array size: %lu", size]];
ModuleASM = [NSString stringWithCString:disassembly(bytes, size, 0x5110).c_str() encoding:[NSString defaultCStringEncoding]];
As far as I can tell from my research, it seems I need to trim the "first" bytes from the binary to skip the header and metadata until real instructions start. However, I'm not really sure whether Capstone provides any API for this, or whether I need to scan for byte patterns and locate the first instruction address myself.
In fact, I've applied a simple workaround: I found a safe address that is certain to contain instructions in most of the modules I will load. However, I would like to apply a proper solution.
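A proper solution would presumably parse the Mach-O load commands and start disassembling at the __TEXT,__text section instead of a hard-coded offset. Here is an untested sketch of that idea; it assumes a thin (non-fat) 64-bit image, and findTextSection is just a name made up for illustration:

#include <mach-o/loader.h>
#include <stdint.h>
#include <string.h>

// Walk the load commands and return the file offset, size and virtual
// address of the __TEXT,__text section, i.e. where real instructions start.
static bool findTextSection(const uint8_t *base, uint64_t *offset,
                            uint64_t *size, uint64_t *vmaddr)
{
    const struct mach_header_64 *hdr = (const struct mach_header_64 *)base;
    if (hdr->magic != MH_MAGIC_64)
        return false;
    const uint8_t *cmd = base + sizeof(struct mach_header_64);
    for (uint32_t i = 0; i < hdr->ncmds; ++i) {
        const struct load_command *lc = (const struct load_command *)cmd;
        if (lc->cmd == LC_SEGMENT_64) {
            const struct segment_command_64 *seg = (const struct segment_command_64 *)lc;
            const struct section_64 *sec = (const struct section_64 *)(seg + 1);
            for (uint32_t s = 0; s < seg->nsects; ++s, ++sec) {
                if (strncmp(sec->segname, "__TEXT", 16) == 0 &&
                    strncmp(sec->sectname, "__text", 16) == 0) {
                    *offset = sec->offset;
                    *size   = sec->size;
                    *vmaddr = sec->addr;
                    return true;
                }
            }
        }
        cmd += lc->cmdsize;
    }
    return false;
}

With that, the bytes, size, and start address passed to disassembly() could come straight from the section instead of the hard-coded 20752 / 0x5110 constants.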
I've successfully loaded and disassembled part of the module's code using the workaround described above. Sadly, however, cs_disasm mostly returns no more than 5000-6000 instructions, which is confusing; it seems to stop on ordinary instructions that it shouldn't choke on. I'm not really sure what I'm doing wrong. The module contains more than 15 MB of code, so there are far more than 5k instructions to disassemble.
Below is the function I use, based on the example in the Capstone docs:
#include <capstone/capstone.h>
#include <inttypes.h>
#include <cstdio>
#include <string>
using std::string;

string disassembly(uint8_t *bytearray, long size, uint64_t startAddress) {
    csh handle;
    cs_insn *insn;
    size_t count;
    string output;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) == CS_ERR_OK) {
        count = cs_disasm(handle, bytearray, size, startAddress, 0, &insn);
        printf("\nCOUNT:%zu", count);
        if (count > 0) {
            for (size_t j = 0; j < count; j++) {
                char buffer[512];
                snprintf(buffer, sizeof(buffer), "0x%" PRIx64 ":\t%s\t\t%s\n",
                         insn[j].address, insn[j].mnemonic, insn[j].op_str);
                output += buffer;
            }
            cs_free(insn, count);
        } else {
            output = "ERROR: Failed to disassemble given code!\n";
        }
        cs_close(&handle); // only close a handle that was successfully opened
    }
    return output;
}
I would really appreciate any help on this.
Warmly,
David
The answer is to simply use SKIPDATA mode. Capstone is great, but its docs are very bad.
Working example below. This mode is still quite buggy, so preferably the detection of data sections should be custom code; for me it works reliably only with small chunks of code. However, it does indeed disassemble all the way to the end of the file.
string disassembly(uint8_t *bytearray, long size, uint64_t startAddress) {
    csh handle;
    cs_insn *insn;
    size_t count;
    string output;
    // Pseudo-mnemonic used for the bytes Capstone skips over as data
    cs_opt_skipdata skipdata = {
        .mnemonic = "db",
    };
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) == CS_ERR_OK) {
        cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);
        cs_option(handle, CS_OPT_SKIPDATA, CS_OPT_ON);
        cs_option(handle, CS_OPT_SKIPDATA_SETUP, (size_t)&skipdata);
        count = cs_disasm(handle, bytearray, size, startAddress, 0, &insn);
        if (count > 0) {
            for (size_t j = 0; j < count; j++) {
                char buffer[512];
                snprintf(buffer, sizeof(buffer), "0x%" PRIx64 ":\t%s\t\t%s\n",
                         insn[j].address, insn[j].mnemonic, insn[j].op_str);
                output += buffer;
            }
            cs_free(insn, count);
        } else {
            output = "ERROR: Failed to disassemble given code!\n";
        }
        cs_close(&handle); // only close a handle that was successfully opened
    }
    return output;
}
Shame to those trolls who down-voted this question.
I have the following problem.
I use the following function to receive a string from a socket until a newline occurs.
#include <winsock2.h> // recv
#include <cstring>    // strchr, strstr
#include <string>
using std::string;

string get_all_buf(int sock) {
    int n = 1, total = 0, found = 0;
    char temp[1024*1024];
    string antw = "";
    while (!found) {
        n = recv(sock, &temp[total], sizeof(temp) - total - 1, 0);
        if (n == -1) {
            break;
        }
        total += n;
        temp[total] = '\0';
        found = (strchr(temp, '\n') != 0);
        if (found == 0) {
            found = (strstr(temp, "\r\n") != 0);
        }
    }
    antw = temp;
    size_t foundIndex = antw.find("\r\n");
    if (foundIndex != antw.npos)
        antw.erase(foundIndex, 2);
    foundIndex = antw.find("\n");
    if (foundIndex != antw.npos)
        antw.erase(foundIndex, 1);
    return antw;
}
So I use it like this:
string an = get_all_buf(sClient);
If I create an exe file everything works perfectly.
But if I create a DLL and run it using rundll32, the application closes at "string an = get_all_buf(sClient);" without any error message...
I have been trying to fix this for hours now, and I am getting a bit desperate...
P.S. sorry for obvious errors or bad coding style, I just started learning C++.
char temp[1024*1024];
declares a 1 MB buffer on the stack. This may be too large and overflow the available stack memory. You could instead give it static storage duration
static char temp[1024*1024];
or allocate it dynamically
char* temp = (char*)malloc(1024*1024);
// function body
free(temp);
Alternatively, assuming the mention of rundll32 means you're working on Windows, you could investigate keeping it on the stack by using the /STACK linker option. This probably isn't the best approach - you've already found that it causes problems when you change build settings or try to target other platforms.
Instead of creating the temp variable on the stack, I'd create it dynamically (on the heap) - not using raw malloc and free as shown in the previous answer, but using modern C++ and std::vector:
#include <vector>
std::vector<char> temp(1024*1024);
This is exception-safe, and you don't have to remember to release the allocated memory: std::vector's destructor will do that automatically, even if an exception is thrown.
Instead of sizeof(temp), you can use temp.size() in your code (it returns the number of elements in the vector; since this is a vector of chars, that is simply the total size in chars, i.e. in bytes).
You can still use operator[] for std::vector, as you do for raw C arrays.
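For example, a minimal, untested sketch of the question's function rewritten around std::vector (same logic otherwise) could look like this:

#include <winsock2.h> // recv
#include <cstring>    // strchr
#include <string>
#include <vector>

std::string get_all_buf(int sock) {
    std::vector<char> temp(1024 * 1024); // 1 MB buffer, now on the heap
    int total = 0;
    bool found = false;
    while (!found) {
        int n = recv(sock, &temp[total], (int)temp.size() - total - 1, 0);
        if (n <= 0) // error or connection closed
            break;
        total += n;
        temp[total] = '\0';
        found = (std::strchr(temp.data(), '\n') != nullptr);
    }
    std::string antw(temp.data());
    size_t pos = antw.find("\r\n");
    if (pos != std::string::npos)
        antw.erase(pos, 2);
    else if ((pos = antw.find('\n')) != std::string::npos)
        antw.erase(pos, 1);
    return antw;
}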
Note also that if you are building a DLL and the above function is exposed at the DLL interface, then because the function has a C++ interface with an STL class (std::string) at the boundary, you must make sure that both your DLL and its clients are built with dynamic linking to the same CRT, and with the same compiler and compiler settings (e.g. you can't mix a DLL built with VS2008/VC9 with a .EXE built with VS2010/VC10, or a release-build DLL with a debug-build EXE built with the same compiler).
I recently inherited a program that mixes C++ and C++/CLI (.NET). It interfaces to other components over the network, and to a driver DLL for some special hardware. I am trying to figure out the best way to send the data over the network, as the current approach seems suboptimal.
The data is stored in a C++ Defined Structure, something like:
struct myCppStructure {
    unsigned int field1;
    unsigned int field2;
    unsigned int dataArray[512];
};
The program works fine when accessing the structure itself from C++/CLI. The problem is that to send it over the network the current code does something like the following:
struct myCppStructure* data; // points at the record to send (filled in elsewhere)
IntPtr dataPtr(data);

// myNetworkSocket is a NetworkStream cast as a System::IO::Stream^
System::IO::BinaryWriter^ myBinWriter = gcnew BinaryWriter(myNetworkSocket);

__int64 length = sizeof(struct myCppStructure) / sizeof(unsigned __int64);
unsigned __int64* ptr = static_cast<unsigned __int64*>(dataPtr.ToPointer());
for (unsigned int i = 0; i < length; i++)
    myBinWriter->Write(*ptr++);
In normal C++ it'd usually be a call like:
myBinWriter->Write(ptr,length);
But I can't find anything equivalent in C++/CLI. System::IO::BinaryWriter only has basic types and some array<>^ versions of a few of them. Is there nothing more efficient?
P.S. These records are generated many times a second, so doing additional copying (e.g. marshaling) is out of the question.
Note: The original question asked about C#. I failed to realize that what I was thinking of as C# was really "Managed C++" (aka C++/CLI) under .NET. The above has been edited to replace 'C#' references with 'C++/CLI' references - which I am using for any version of Managed C++, though I am notably using .NET 3.5.
Your structure consists of "basic types" and arrays of them. Why can't you just write them sequentially using BinaryWriter? Something like:
binWriter.Write(data.field1);
binWriter.Write(data.field2);
for(var i = 0; i < 512; i++)
    binWriter.Write(data.dataArray[i]);
What you want to do is to find out how the C++ struct is packed, and define a struct with the correct StructLayout attribute.
To define the fixed-length int[], you can define a fixed-size buffer inside it. Note that to use this you will have to compile your project with /unsafe.
Now you're ready to convert that struct to a byte[] in a few steps:
Pin the array of structs in memory using a GCHandle.Alloc - this is fast and shouldn't be a performance bottleneck.
Now use Marshal.Copy (don't worry, this is as fast as a memcpy) with the source IntPtr = handle.AddrOfPinnedObject.
Now dispose the GCHandle and you're ready to write the bytes using the "Write" overload mentioned by Serg Rogovtsev.
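If the struct being sent is the native C++ struct from the question (rather than a managed copy), the pinning step isn't even needed; a rough, untested C++/CLI sketch of the same copy-then-write idea, reusing the question's data and myBinWriter variables, could be:

// Copy the native struct into a managed byte array in one go,
// then write it with a single call to the array<Byte>^ overload.
array<System::Byte>^ buffer = gcnew array<System::Byte>(sizeof(myCppStructure));
System::Runtime::InteropServices::Marshal::Copy(
    System::IntPtr((void*)data), buffer, 0, buffer->Length);
myBinWriter->Write(buffer);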
Hope this helps!
In C# you could do the following. Start by defining the structure.
[StructLayout ( LayoutKind.Sequential )]
internal unsafe struct hisCppStruct
{
    public uint field1;
    public uint field2;
    public fixed uint dataArray [ 512 ];
}
And then write it using the binary writer as follows.
hisCppStruct @struct = new hisCppStruct ();
@struct.field1 = 1;
@struct.field2 = 2;
@struct.dataArray [ 0 ] = 3;
@struct.dataArray [ 511 ] = 4;
using ( BinaryWriter bw = new BinaryWriter ( File.OpenWrite ( @"C:\temp\test.bin" ) ) )
{
    int structSize = Marshal.SizeOf ( @struct );
    int limit = structSize / sizeof ( uint );
    uint* uintPtr = (uint*) &@struct;
    for ( int i = 0 ; i < limit ; i++ )
        bw.Write ( uintPtr [ i ] );
}
I'm pretty sure you can do exactly the same in managed C++.
Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;

    Vector<int> vec(150);
    for (int i = 0; i < vec.size(); ++i)
        vec[i] = 0;

    // main loop
    for (int t = 0; t < tmax; ++t)
        for (int j = 0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }

    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I have replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point is to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;
    int idum;
    int next;
    int *shuffle_table;

    void propagate() {
        const int int_quo = 1277731;
        int k = idum/int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }

public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }
    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }
    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
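One quick way to test the floor() suspicion (this is a suggested check, not part of the original program): since rng.frand() * top is never negative here, plain integer truncation gives the same result as std::floor(), so timing this variant should isolate the cost of floor():

int irand(int top) {
    // truncation instead of std::floor(); identical for non-negative values
    return int(rng.frand() * top);
}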
edit: It turned out that the culprit was floor() and not rand() as I suspected - see the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
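A minimal sketch of such a test could look like this (my own example, not taken from the original post):

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
    std::clock_t start = std::clock();
    volatile long sink = 0; // keeps the loop from being optimized away
    for (long i = 0; i < 100000000L; ++i)
        sink += std::rand();
    std::printf("1e8 calls to rand(): %.2f s (sink=%ld)\n",
                double(std::clock() - start) / CLOCKS_PER_SEC, (long)sink);
    return 0;
}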
edit: I've had a look at the Wine source code, and here is its implementation of rand():
/*********************************************************************
 *              rand (MSVCRT.@)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions
of a Windows computer by providing alternative implementations of the
DLLs that Windows programs call,[citation needed] and a process to
substitute for the Windows NT kernel. This method of duplication
differs from other methods that might also be considered emulation,
where Windows programs run in a virtual machine.[2] Wine is
predominantly written using black-box testing reverse-engineering, to
avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under Wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.