Description
For a hybrid C/C++ application (C++ wrapped in extern "C") that I'm writing, I am trying to decide whether to include the static definitions my program needs as global variables, or to keep them in an external datastore (e.g. a .csv file or SQL database) that has to be read in every time the program runs.
My static definitions, when laid out in a CSV file, take approximately 15 columns each, with a maximum of 40 definitions (fewer than that to begin with, but up to 40 due to feature scaling).
Problem
I'm torn because it feels wrong to me to include so much data as global variables that are compiled into the program. However, the overhead of reading from a datastore every time I run the program seems unnecessary.
Question
What are the best practices here? My code needs to be portable enough for someone else to understand and I do not want to obfuscate it.
It might be appropriate to generate a separate C file from the CSV using a high-level language, e.g. Python. Then either #include the generated file (only if the data is used in a single module and declared static) or build it as a separate compilation unit.
That way you can easily change the values in the CSV/spreadsheet program of your choice, while still having all data available. The code generating program can be called by the build system, so no manual fiddling.
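As a rough sketch of what such a generated file could contain: the struct fields, names, and values below are invented for illustration, and the real ones would come from the CSV columns. It is written as C++ here, but the same layout works as plain C.

```cpp
// definitions_generated.hpp: hypothetical output of a CSV-to-source generator.
// Not meant to be edited by hand; regenerate it from the CSV instead.
#include <cassert>
#include <cstddef>
#include <cstring>

struct Definition {          // hypothetical fields, one member per CSV column
    const char* name;
    int         id;
    double      scale;
};

// One initializer per CSV row, emitted by the generator script.
static const Definition kDefinitions[] = {
    {"alpha", 1, 0.5},
    {"beta",  2, 1.0},
    {"gamma", 3, 2.5},
};

static const std::size_t kDefinitionCount =
    sizeof(kDefinitions) / sizeof(kDefinitions[0]);

// Linear search is fine for a table of at most a few dozen rows.
inline const Definition* find_definition(const char* name) {
    assert(name != nullptr);
    for (std::size_t i = 0; i < kDefinitionCount; ++i) {
        if (std::strcmp(kDefinitions[i].name, name) == 0) {
            return &kDefinitions[i];
        }
    }
    return nullptr;
}
```

The table is const, so it lives in read-only data and costs nothing to "load" at startup, while the CSV remains the single place where values are edited.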
For speed and code clarity, define these variables statically. Lay out your definitions nicely and comment generously to help future readers of your code. In a data file, you can't add comments to tell future editors what everything is, and reading it at startup is just slower.
The very best practice is to implement both options with flexibility to switch between implementations based on memory, computation speeds, and other load conditions.
If the application will be run server-side with generous allocation for memory/cpu, then design to those conditions. Why "feel wrong" as you put it?
Your end goal is not clearly defined. So obfuscation is not an issue yet. Obfuscation comes when you willfully redirect your coding to hide your tracks. But making a complete solution, if that is what is needed, is not obfuscation.
Related
Imagine a project with the development stretched over 10+ years timespan. Some parts are in C, some are in C++ and all of the code uses global functions and global variables. The architecture was designed inherently single threaded and kept growing that way. But now we consider utilizing many-core architectures.
Now one idea being evaluated is to refactor a part of the code into a library, to make it possible to create more than one instance, so that they can run in separate threads and don’t interfere with each other.
The proposal that gains the most traction at this point is to wrap all the library files into namespaces with macro defines, like:
namespace VARIANT {
// all the code
}
Then define the VARIANT in a header or on project level. This will make it possible to have different contexts within different namespaces. And the selling point is that this approach will require minimal code change and has low risk of introducing any regression.
But if at some point we need to make the behavior of Variant1 different from Variant2, things will get tricky, since there’s no way to compare the value of a macro define with a string in a preprocessor macro.
Is there a more elegant way to achieve this?
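To make the proposal concrete, here is a minimal single-file sketch. LEGACY_LIBRARY_BODY is a hypothetical stand-in for the existing sources (in a real project these would be the library files, included once per namespace), and variant_id is an assumed addition: giving each variant an integer ID sidesteps the string-comparison problem, because integers can be compared at compile time.

```cpp
#include <cassert>

// Hypothetical stand-in for the legacy code: one former global and one
// function that uses it. ID is an integer tag for the variant.
#define LEGACY_LIBRARY_BODY(ID)                         \
    constexpr int variant_id = ID;                      \
    int call_count = 0;                                 \
    int record_call() {                                 \
        call_count += (variant_id == 2) ? 10 : 1;       \
        return call_count;                              \
    }

namespace variant1 { LEGACY_LIBRARY_BODY(1) }
namespace variant2 { LEGACY_LIBRARY_BODY(2) }

// Exercises both variants once and returns variant1's counter.
int exercise_variants() {
    variant1::record_call();
    variant2::record_call();
    // Each namespace has its own copy of the former global.
    assert(variant1::call_count != variant2::call_count);
    return variant1::call_count;
}
```

Note that this only yields a fixed number of copies chosen at compile time. Each copy is still a single set of globals, so two threads using the same variant would still interfere.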
Another variant might be spotting all global variables and making them thread_local. This requires either C++11 or a compiler extension providing the same (e.g. __thread in older GCC).
If I read this question right, you don't even need to convert your C files into C++ files (which your approach requires, as C does not support namespaces), but you do need C11 for that (where the keyword is _Thread_local).
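A minimal sketch of the thread_local idea, assuming C++11. The variable and function names are made up for illustration; the only change to the "legacy" code is the storage specifier on the former global.

```cpp
#include <cassert>
#include <thread>

// Was: int request_count = 0;
// Now every thread gets its own independent copy.
thread_local int request_count = 0;

// Legacy-style function that touches the "global" directly.
void handle_request() { ++request_count; }

// Runs n requests on a fresh thread and returns that thread's final count.
int run_worker(int n) {
    assert(n >= 0);
    int result = 0;
    std::thread worker([&result, n] {
        for (int i = 0; i < n; ++i) {
            handle_request();
        }
        result = request_count;   // this thread's copy only
    });
    worker.join();
    return result;
}
```

Each worker starts from a fresh copy initialized to 0, and the calling thread's copy is never touched. The catch is that state is tied to a thread rather than to an instance, so one instance cannot migrate between threads.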
Refactoring an old project to make it multithreaded is not so simple. First of all, you have a mixture of C and C++ code, so you cannot blindly follow a C++ approach here. Instead of namespaces, you need to think about the following areas:
Find all the code blocks and containers (lists, large arrays of objects, etc.) that need synchronization.
Find the interdependencies between threads and decide how you will control them. For example, one thread generates a report and inserts it into a table, and a second one needs that information to generate its final report. You need to find this kind of dependency among the threads in your code base and work out a control mechanism for each.
The old style of multithreading in C++ was very tricky, so you should migrate your code to C++11, where multithreading is much easier to implement.
As you said, your current project has lots of global variables, so you need to think carefully about how you are going to share these variables among different threads and how you will synchronize access to them.
These are only some hints. You need to consider many areas in advance before starting the refactoring, or else all your efforts will go up in smoke.
GOOD LUCK for your plan.
Just do it in steps, testing each time:
1) typedef a struct with all the globals in it. malloc one, and edit the existing code to reference it. Test - should work exactly the same as with the globals.
2) Create one thread to run one instance of the code. Test - should work exactly the same as with the globals.
3) Try multiple threads.
One step at a time...
Please try very hard to not attempt any bodges!
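The steps above can be sketched roughly as follows. Context, accumulate, and run_instances are hypothetical names; in the real code the struct would hold whatever the existing globals are, and it could equally be a malloc'ed C struct.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Step 1: everything that used to be a global goes into one struct.
struct Context {
    int total = 0;   // was: int total;
    int calls = 0;   // was: int calls;
};

// Existing code is edited to take the context instead of touching globals.
void accumulate(Context& ctx, int value) {
    ctx.total += value;
    ++ctx.calls;
}

// Steps 2 and 3: one thread per instance, each owning its own Context.
std::vector<Context> run_instances(int instance_count, int values_per_instance) {
    assert(instance_count >= 0);
    std::vector<Context> contexts(instance_count);   // sized before any thread starts
    std::vector<std::thread> threads;
    for (int i = 0; i < instance_count; ++i) {
        threads.emplace_back([&contexts, i, values_per_instance] {
            for (int v = 1; v <= values_per_instance; ++v) {
                accumulate(contexts[i], v);          // touches only its own element
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return contexts;
}
```

Because no thread ever touches another thread's Context, no locking is needed for this state. Anything that must remain shared has to be identified and synchronized separately.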
I need to provide my users the ability to write mathematical computations into the program. I plan to have a simple text interface with a few buttons including those to validate the script grammar, save etc.
Here's where it gets interesting. These functions the user is writing need to execute at multi-megabyte line speeds in a communications application. So I need the speed of a compiled language, but the usage of a script. A fully interpreted language just won't cut it.
My idea is to precompile the saved user modules into objects at initialization of the C++ application. I could then use these objects to execute the code when called upon. Here are the workflows I have in mind:
1) Testing(initial writing) of script: Write code in editor, save, compile into object (testing grammar), run with test I/O, Edit Code
2) Use of Code (Normal operation of application): Load script from file, compile script into object, Run object code, Run object code, Run object code, etc.
I've looked into several off-the-shelf interpreters, but can't find what I'm looking for. I considered Java, as it is pretty fast, but I would need to load the Java virtual machine, which means passing objects between C and the virtual machine... The interface is the bottleneck here. I really need to create a native C++ object running C++ code if possible. I also need to be able to run the code on multiple processors effectively in a controlled manner.
I'm not looking for the whole explanation on how to pull this off, as I can do my own research. I've been stalled for a couple days here now, however, and I really need a place to start looking.
As a last resort, I will create my own scripting language to fulfill the need, but that seems a waste with all the great interpreters out there. I've also considered taking an existing open source compiler and slicing it up for the functionality I need... just not saving the compiled results to disk... I don't know. I would prefer to use a mainline language if possible... but that's not required.
Any help would be appreciated. I know this is not your run of the mill idea I have here, but someone has to have done it before.
Thanks!
P.S.
One thought that just occurred to me while writing this was this: what about using a true C compiler to create object code, save it to disk as a dll library, then reload and run it inside "my" code? Can you do that with MS Visual Studio? I need to look at the licensing of the compiler... how to reload the library dynamically while the main application continues to run... hmmmmm I could then just group the "functions" created by the user into library groups. Ok that's enough of this particular brain dump...
A possible solution could be to use gcc (MinGW, since you are on Windows) and build a DLL out of your user-defined code. The DLL should export just one function. You can use the Win32 API to handle the DLL (LoadLibrary/GetProcAddress etc.). At the end of this job you have a C-style function pointer. The problem now is the arguments. If your computation has just one parameter you can do a cast to double (*funct)(double), but if you have many parameters you need to match them.
I think I've found a way to do this using standard C.
1) Standard C needs to be used because when it is compiled into a dll, the resulting interface is cross compatible with multiple compilers. I plan to do my primary development with MS Visual Studio and compile objects in my application using gcc (windows version)
2) I will expose certain variables to the user (inputs and outputs) and standardize them across units. This allows multiple units to be developed with the same interface.
3) The user will only create the inside of the function using standard C syntax and grammar. I will then wrap that function with text to fully define the function and its environment (remember those variables I intend to expose?). I can also group multiple functions under a single executable unit (dll) using name parameters.
4) When the user wishes to test their function, I dump the dll from memory, compile their code with my wrappers in gcc, and then reload the dll into memory and run it. I would let them define inputs and outputs for testing.
5) Once the test/create step was complete, I have a compiled library created which can be loaded at run time and handled via pointers. The inputs and outputs would be standardized, so I would always know what my I/O was.
6) The only problem with standardized I/O is that some of the inputs and outputs are likely to not be used. I need to see if I can put default values in or something.
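One way the standardized interface from points 2, 3 and 6 might look, as a sketch only. UnitIO, UnitFn, wrap_user_body and run_unit are made-up names, and sample_unit stands in for a function that would really come out of the compiled dll via GetProcAddress. Zero-initializing the struct is one simple answer to the unused-inputs question.

```cpp
#include <cassert>
#include <string>

// Plain-old-data layout so the same declaration can also be compiled as C.
typedef struct UnitIO {
    double A, B, C;   // inputs exposed to the user
    double X, Y, Z;   // outputs exposed to the user
} UnitIO;

// Every unit has the same signature, whatever the user wrote inside.
typedef void (*UnitFn)(UnitIO*);

// Builds the full C source for one unit by wrapping the user's function body.
std::string wrap_user_body(const std::string& name, const std::string& body) {
    return "void " + name + "(UnitIO* io) {\n"
           "    double A = io->A, B = io->B, C = io->C;\n"
           "    double X = 0, Y = 0, Z = 0;\n"
           + body + "\n"
           "    io->X = X; io->Y = Y; io->Z = Z;\n"
           "}\n";
}

// Stand-in for a unit that would normally be compiled from wrapped user code.
void sample_unit(UnitIO* io) { io->X = io->A + io->B; }

// Calls a unit with the given inputs; everything not set defaults to 0.
UnitIO run_unit(UnitFn fn, double a, double b, double c) {
    assert(fn != nullptr);
    UnitIO io = {};   // zero-initializes all inputs and outputs
    io.A = a;
    io.B = b;
    io.C = c;
    fn(&io);
    return io;
}
```

Passing a single struct pointer keeps the function pointer type fixed, so adding a new input later means recompiling the units but not changing how they are called.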
So, to sum up:
Think of an app with a text box and a few buttons. You are told that your inputs are named A, B, and C and that your outputs are X, Y, and Z of specified types. You then write a function using standard C code, and with functions from the specified libraries (I'm thinking math etc.)
So now you're done... you see a few boxes below to define your input. You fill them in and hit the TEST button. This would wrap your code in a function context, dump the existing dll from memory (if it exists) and compile your code along with any other functions in the same group (another parameter you could define, basically just a name to the user). It then runs the function using a function pointer, using the inputs defined in the UI. The outputs are sent to the user so they can determine if their function works. If there are any compilation errors, those would also be shown to the user.
Now it's time to run for real. Of course I kept track of what functions are where, so I dynamically open the dll, and load all the functions into memory with function pointers. I start shoving data into one side and the functions give me the answers I need. There would be some overhead to track I/O and to make sure the functions are called in the right order, but the execution would be at compiled machine code speeds... which is my primary requirement.
Now... I have explained what I think will work in two different ways. Can you think of anything that would keep this from working, or perhaps any advice/gotchas/lessons learned that would help me out? Anything from the type of interface to tips on dynamically loading dll's in this manner to using the gcc compiler this way... etc would be most helpful.
Thanks!
This may be kind of basic but... here goes.
If I decide to embed some kind of scripting language like Lua or Ruby into a C++ program by linking its interpreter, what does that allow me to do in C++ then?
Would I be able to write Ruby or Lua code right into the cpp file or simply call scripts from the program?
If the latter is true, how would I do that?
Because they're scripting languages, the code is always going to be "interpreted." In reality, you aren't "calling" the script code inside your program, but rather when you reach that point, you're executing the interpreter in the context of that thread (the thread that reaches the scripting portion), which then reads the scripting language and executes the applicable machine code after interpreting it (JIT compiling kind of, but not really, there's no compiling involved).
Because of this, it's basically the same thing as forking the interpreter and running the script, unless you want to access variables in your compiled program from the script, or variables in your script from the compiled program. To pass values back and forth, because you're using the thread that has your compiled program's context, you should be able to store script variables on the stack as well and access them when your thread stops running the interpreter (assuming you stored the variables on the stack).
Edit: response:
You would have to write it yourself. Think about it this way: if you want to use assembly in C++, you use the asm keyword. The C++ compiler then needs to parse the source file, get to the asm keyword, and switch to the assembler. The assembler then needs to continue until the closing bracket of the asm region and assemble this code.
If you want to do this, it will be a bit different, since assembly gets compiled, not interpreted (which is what you want to do). What you'll need to do is change the compiler you're using (let's say C++), so that it recognizes your own user-defined keyword. Let's say this keyword is scriptX{}. You need to change the C++ parser so that when it sees scriptX{}, it stores everything between the brackets in the read-only data section of your compiled program. You then need to add a hook in the compiled assembly file to switch the context of the thread to your script interpreter, and start the program counter at the beginning of your script section (which you put in the read-only data section of the object file).
Good luck with that...
A common reason to embed a scripting language into a program is to provide for the ability to control the program with scripts provided by the end user.
Probably the simplest example of such a script is a configuration file. Assume that your program has options, and needs to remember the options from run to run. You could write them out to a file as a binary image of your options structure, but that would be fragile, not easy to inspect or edit, and likely not portable across systems. Writing the options out in plain text with some sort of labels for which is which addresses most of those complaints, but now you need to parse that text and recover the options. Then some users want different options on Tuesdays, want to do simple arithmetic to compute one option from another, or to write one configuration file that they can use on both Windows and Linux, and pretty soon you find yourself inventing a little language to express all of those ideas and mechanisms with. At this point, there's a better way.
The languages Lua and TCL both grew out of essentially that scenario. Larger systems needed to be configured and controlled by end users. End users wanted to edit a simple text file and get immediate satisfaction, even (especially) when working with large systems that might have required hours to compile successfully.
One advantage here is that rather than inventing a programming language one feature at a time as users' needs change, you start with a complete language along with its documentation. The language designer has already made a number of tough decisions for you (how do I represent strings and numbers, what about lists, what about named values, what does if look like, etc.) and has generally also brought a carefully designed and debugged implementation to the table.
Lua is particularly easy to integrate. Reading a simple configuration file and extracting the settings from the Lua state can be done using a small subset of its C API. Once you have Lua available, it is attractive to use it for other purposes. In many cases, you will find that it is more productive to write only the innermost loops in C, and use Lua to glue those functions together and provide all the "business logic" of the application. This is how Adobe Lightroom is implemented, as well as many games on platforms ranging from simple set-top-boxes to iOS devices and even PCs.
Possible Duplicate:
Should C++ eliminate header files?
In languages like C# and Java there is no need to declare (for example) a class before using it. If I understand it correctly this is because the compiler does two passes on the code. In the first it just "collects the information available" and in the second one it checks that the code is correct.
In C and C++ the compiler does only one pass so everything needs to be available at that time.
So my question basically is: why isn't it done this way in C and C++? Wouldn't it eliminate the need for header files?
The short answer is that computing power and resources advanced exponentially between the time that C was defined and the time that Java came along more than 20 years later.
The longer answer...
The maximum size of a compilation unit -- the block of code that a compiler processes in a single chunk -- is going to be limited by the amount of memory that the compiling computer has. In order to process the symbols that you type into machine code, the compiler needs to hold all the symbols in a lookup table and reference them as it comes across them in your code.
When C was created in 1972, computing resources were much more scarce and at a high premium -- the memory required to store a complex program's entire symbolic table at once simply wasn't available in most systems. Fixed storage was also expensive, and extremely slow, so ideas like virtual memory or storing parts of the symbolic table on disk simply wouldn't have allowed compilation in a reasonable timeframe.
The best solution to the problem was to chunk the code into smaller pieces by having a human sort out which portions of the symbol table would be needed in which compilation units ahead of time. Imposing a fairly small task on the programmer of declaring what he would use saved the tremendous effort of having the computer search the entire program for anything the programmer could use.
It also saved the compiler from having to make two passes on every source file: the first one to index all the symbols inside, and the second to parse the references and look them up. When you're dealing with magnetic tape where seek times were measured in seconds and read throughput was measured in bytes per second (not kilobytes or megabytes), that was pretty meaningful.
C++, although created roughly a decade later, was designed to stay largely source-compatible with C, and therefore had to use the same mechanism.
By the time Java rolled around in 1995, average computers had enough memory that holding a symbolic table, even for a complex project, was no longer a substantial burden. And Java wasn't designed to be backwards-compatible with C, so it had no need to adopt a legacy mechanism. C# was similarly unencumbered.
As a result, their designers chose to shift the burden of compartmentalizing symbolic declaration back off the programmer and put it on the computer again, since its cost in proportion to the total effort of compilation was minimal.
Bottom line: there have been advances in compiler technology that make forward declarations unnecessary. Plus computers are thousands of times faster, and so can make the extra calculations necessary to handle the lack of forward declarations.
C and C++ are older and were standardized at a time when it was necessary to save every CPU cycle.
No, it would not obviate header files. It would eliminate the requirement to use a header to declare classes/functions in the same file. The major reason for headers is not to declare things in the same file though. The primary reason for headers is to declare things that are defined in other files.
For better or worse, the rules for the semantics of C (and C++) mandate the "single pass" style behavior. Just for example, consider code like this:
    int i;
    int f() {
        i = 1;
        int i = 2;
        return i;
    }
The i=1 assigns to the global, not the one defined inside of f(). This is because at the point of the assignment, the local definition of i hasn't been seen yet so it isn't taken into account. You could still follow these rules with a two-pass compiler, but doing so could be non-trivial. I haven't checked their specs to know with certainty, but my immediate guess would be that Java and C# differ from C and C++ in this respect.
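For anyone who wants to check the behavior, here is a small compilable version. As a contrast, it also shows the one place where C++ itself relaxes declare-before-use: inside a class definition, member function bodies are compiled as if they appeared after the whole class, so they can refer to members declared further down. Counter and check_scoping are illustrative names only.

```cpp
#include <cassert>

int i = 0;                 // the global from the example

int f() {
    i = 1;                 // local i is not in scope yet: assigns the global
    int i = 2;             // from here on, i names the local
    return i;
}

struct Counter {
    // value_ is declared below, yet this body may use it.
    int next() { return ++value_; }
    int value_ = 0;
};

// Checks both rules and returns the value of the global afterwards.
int check_scoping() {
    int local_result = f();
    assert(local_result == 2);   // f returned its local
    Counter c;
    assert(c.next() == 1);
    return i;                    // the global was set to 1 inside f
}
```

So the block-scope rule is order-dependent, while class scope already needs the compiler to defer function bodies until the whole class has been seen.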
Edit: Since a comment said my guess was incorrect, I did a bit of checking. According to the Java Language Reference, §14.4.2, Java seems to follow pretty close to the same rules as C++ (a little different, but not a whole lot).
At least as I read the C# language specification, (warning: Word file) however, it is different. It (§3.7.1) says: "The scope of a local variable declared in a local-variable-declaration (§8.5.1) is the block in which the declaration occurs."
This appears to say that in C#, the local variable should be visible throughout the entire block in which it is declared, so with code similar to the example I gave, the assignment would be to the local variable, not the global.
So, my guess was half right: Java follows (pretty much) the same rule as C++ in this respect, but C# does not.
This is because of smaller compilation modules in C/C++. In C/C++, each .c/.cpp file is compiled separately, creating an .obj module. Thus the compiler needs the information about types and variables, declared in other compilation modules. This information is supplied in form of forward declarations, usually in header files.
C#, on the other side, compiles several .cs files into one big compilation module at once.
In fact, when referencing different compiled modules from a C# program, the compiler needs to know the declarations (type names etc.) the same way as C++ compiler does. This information is obtained from the compiled module directly. In C++ the same information is explicitly separated (that's why you cannot find out the variable names from C++-compiled DLL, but can determine it from .NET assembly).
The forward declarations in C++ are a way to provide metadata about the other pieces of code that might be used by the currently compiled source to the compiler, so it can generate the correct code.
That metadata can come from the author of the linked library/component. However, it can also be automatically generated (for example there are tools that generate C++ header files for COM objects). In any case, the C++ way of expressing that metadata is through the header files you need to include in your source code.
C#/.NET also consumes similar metadata at compile time. However, that metadata is automatically generated when the assembly it applies to is built, and is usually embedded into it. Thus, when you reference an assembly in your C# project, you are essentially telling the compiler "look for the metadata you need in this assembly as well, please".
In other words, the metadata generation and consumption in C# is more transparent to the developers, allowing them to focus on what really matters - writing their own code.
There are also other benefits to having the metadata about the code bundled with the assembly as well. Reflection, code emitting, on-the-fly serialization - they all depend on the metadata to be able to generate the proper code at run-time.
The C++ analogue to this would be RTTI, although it's not widely adopted due to incompatible implementations.
From Eric Lippert, blogger of all things internal to C#: http://blogs.msdn.com/ericlippert/archive/2010/02/04/how-many-passes.aspx:
"The C# language does not require that declarations occur before usages, which has two impacts, again, on the user and on the compiler writer. [...] The impact on the compiler writer is that we have to have a “two pass” compiler. In the first pass, we look for declarations and ignore bodies. Once we have gleaned all the information from the declarations that we would have got from the headers in C++, we take a second pass over the code and generate the IL for the bodies."
To sum up, using something does not require declaring it first in C#, whereas it does in C++. That means that in C++, you need to explicitly declare things, and it's more convenient and safe to do that with header files so you don't violate the One Definition Rule.
I need a tool that analyzes C++ sources and reports which code isn't used. The size of the sources is ~500 MB.
PC-Lint is good. If it needs to be free/open source your choices dwindle. Cppcheck is free, and will check for unused private functions. I don't think that it looks for things like uninstantiated classes the way PC-Lint does.
Once again, I'll throw AQTime into the discussion. Has static code analysis for most, if not all, of the supported languages. I didn't really go into that part though, I mainly used the dynamic profilers (memory, performance and so on).
You could use a code coverage tool (dynamic analysis) to get an idea of what code isn't being executed, and then hand analyze to see if that code is really useless.
If you want a static analysis, you need a tool that can read the entire 500 MB of source code (est. 20 million lines? Wow!) and compute a conservative estimate of what is used. This requires doing a points-to analysis over the entire system.
Here's why: If you leave out any module Z, and decide that FOO is unused, you might find out later that Z happened to be the one that used FOO, or more subtly, Z copied a pointer value that happened to have &FOO in it to a third module M that in turn called the "unused" function through the pointer.
What this means is that no static analysis tool that reads just single modules (compilation units) can answer this question safely. And at your scale, you can't afford to make dumb mistakes.
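A tiny illustration of the FOO / Z / M situation, squeezed into one file. The names are placeholders; in a real system the three parts would live in separate compilation units.

```cpp
#include <cassert>

// The function that looks unused: nothing ever calls foo by name.
int foo(int x) { return x * 2; }

using Callback = int (*)(int);
Callback g_callback = nullptr;   // shared slot that Z writes and M reads

// "Module Z": takes the address of foo but never calls it.
void register_callback() { g_callback = &foo; }

// "Module M": calls whatever is in the slot, with no mention of foo.
int invoke(int x) {
    assert(g_callback != nullptr);
    return g_callback(x);
}
```

A tool that sees only M finds no reference to foo at all, and one that sees only Z finds an address being taken but no call. Only an analysis that follows the pointer across both can tell that foo is live.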
My company, Semantic Designs, has done points-to analysis for 35 million line systems of C code using our DMS Software Reengineering Toolkit. DMS can read very large systems of source code. It required a custom tool, not so much because the source code was in an odd (archaic) dialect of C++ (systems in extremely modern dialects can't be this big, not enough time to code them!), but rather because in very large systems there are other peculiar factors at play. For the C system we did, there was a custom dynamic linker, and that affected the points-to analysis, which in turn had to be customized.
Because systems of the scale you are discussing always have surprises like this (BIBSEH: "Because In Big Systems, Everything Happens"), you will likely need a custom tool to answer the question. DMS is designed to be customized.
See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
and http://www.semanticdesigns.com/Products/FrontEnds/CppFrontEnd.html
A code coverage tool is what you need, but you will have to run your program through all of its functionality and see what is reported as unused. Since the code could include DLL-exported functions, you will have to make sure nothing uses them externally. Some tools that may have code coverage functionality, if I remember right: Purify, CTC++, BoundsChecker, and a bunch of others.
Be very careful about removing any function that may have been exported without knowing what external program may be linking/using it.