Normalizing variable names in C/C++

I am currently working on a tool that will compare two files and report their differences. I want to implement a feature that will compare two methods and report whether they are identical (while ignoring variable name changes). What I have thought of doing is to normalize all the variable names to (x0, x1, ...) or something similar, then sort the methods (alphabetically?) so that the order is the same, grab their checksums, and compare the two.
My question:
How do I normalize variable names in a C/C++ file?
or
Do you have any other ideas as to how I could implement the feature?
Regards

You can map 'tokens' (variable names) to an 'interned form', as described above, if you can come up with a repeatable & stable ordering.
This doesn't attempt to understand how the tokens resolve, merely that they are present in the same pattern in two source files. "Tokens" would be everything other than C/C++ reserved words; no serious parsing/lexing necessary.
Once you have done that you can convert comments & whitespace to a canonical form.
This wouldn't be of much use to me personally, but I believe it would get the problem 99.9% or more right -- it's conceivable that it could be fooled, but in practice that's not very likely.
Of course, if there are macros, those have to be handled too. Maybe you can run the C preprocessor on the files to take care of that, if that's a requirement?
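To make the interning idea concrete, here is a minimal sketch of it. The function name normalize_identifiers and the short keyword list are my own assumptions for illustration; comments, string literals and macros are not handled at all, so treat it as a starting point rather than a finished tool.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical helper: rewrites every identifier that is not in the keyword
// list to x0, x1, ... in order of first appearance, and collapses runs of
// whitespace to a single space. Comments, strings and macros are NOT handled.
std::string normalize_identifiers(const std::string& src) {
    static const std::unordered_set<std::string> keywords = {
        "int", "double", "char", "void", "return", "for", "while", "if", "else"
        // deliberately incomplete; a real tool needs the full reserved-word list
    };
    std::unordered_map<std::string, std::string> renamed;
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isalpha(c) || c == '_') {
            // Identifier or keyword: read the whole word.
            const std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            const std::string word = src.substr(start, i - start);
            if (keywords.count(word)) {
                out += word;
            } else {
                auto it = renamed.find(word);
                if (it == renamed.end()) {
                    it = renamed.emplace(word, "x" + std::to_string(renamed.size())).first;
                }
                out += it->second;
            }
        } else if (std::isdigit(c)) {
            // Numeric literal: copy unchanged so that 0x1F is not mistaken for a name.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '.')) {
                out += src[i++];
            }
        } else if (std::isspace(c)) {
            if (!out.empty() && out.back() != ' ') out += ' ';
            ++i;
        } else {
            out += src[i++];
        }
    }
    return out;
}
```

Two functions that differ only in names and spacing then normalize to the same string, which is what you would feed to the checksum.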
Hope this helps.

Surely this is not about normalizing the names, but about figuring out if the two methods do the same thing to the same things within a class. That means parsing the source code and building some sort of data structure (probably a "tree") from it. Once you have the tree, the names as such become meaningless. You may need to track, for example, which OFFSET into a class the member variables refer to, and which virtual functions within a class.
I don't believe this is at all trivial (unless you restrict the code to a small subset of C++), since there are so many different ways to make something do the same thing, and just a subtle difference will throw off anything but the most sophisticated of tools. E.g.
class A
{
private:
    int arr[10];
    ...
public:
    int sum()
    {
        int r = 0;
        for(int i = 0; i < 10; i++)
            r += arr[i];
        return r;
    }
};
class B
{
private:
    int arr[10];
    ...
public:
    int sum()
    {
        int r = 0;
        int *p = arr;
        for(int i = 0; i < 10; i++)
            r += *p++;
        return r;
    }
    ....
};
These two functions do the same thing.

What about using the intermediate files gcc generates during compilation? gcc has a command-line switch to preserve temporary files:
gcc -save-temps <file>
This code is somewhat simplified and the names are unified. The problem is to identify the differences in the original file.
Do not enable optimization!

Related

Helper functions: lambdas vs normal functions

I have a function which internally uses some helper functions to keep its body organized and clean. They're very simple (though not always short), there are more than two of them, and they could easily be inlined into the function's body, but I don't want to do that myself because, as I said, I want to keep that function's body organized.
All those functions need to be passed some arguments by reference and modify them, and I can write them in two ways (just a silly example):
With normal functions:
void helperf1(int &count, int &count2) {
    count += 1;
    count2 += 2;
}
int helperf2 (int &count, int &count2) {
    return (count++) * (count2--);
}
//actual, important function
void myfunc(...) {
    int count = 0, count2 = 0;
    while (...) {
        helperf1(count, count2);
        printf("%d\n", helperf2(count, count2));
    }
}
Or with lambda functions that capture those arguments I explicitly pass in the example above:
void myfunc(...) {
    int count = 0, count2 = 0;
    auto helperf1 = [&count, &count2] () -> void {
        count += 1;
        count2 += 2;
    };
    auto helperf2 = [&count, &count2] () -> int {
        return (count++) * (count2--);
    };
    while (...) {
        helperf1();
        printf("%d\n", helperf2());
    }
}
However, I am not sure which method I should use. With the first one, there is the "overhead" of passing the arguments (I think), while with the second those arguments could already be (are they?) included in there, so that "overhead" is removed. But they're still lambda functions, which (again, I think) should not be as fast as normal functions.
So what should I do? Use the first method? Use the second one? Or sacrifice readability and just inline them in the main function's body?
Your first and foremost concern should be readability (and maintainability)!
Which of regular or lambda functions is more readable strongly depends on the given problem (and a bit on the taste of the reader/maintainer).
Don't be concerned about performance until you find that performance actually is an issue! If performance is an issue, start by benchmarking, not by guessing which implementation you think is faster (in many situations compilers are pretty good at optimizing).
Performance wise, there is no real issue here. Nothing to decide, choose whatever.
But lambda expressions won't do you any good for the purpose you have in mind.
They won't make the code any cleaner.
As a matter of fact I believe they will make the code a bit harder to read compared to a nice calculator object having these helper functions as member functions properly named with clean semantics and interface.
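To illustrate what such a calculator object might look like, here is a rough sketch. The names Counters, advance, product_and_step and run_iterations are made up for this example, and the fixed iteration count stands in for the loop condition that the question leaves out.

```cpp
// Hypothetical "calculator object": the two counters become private state
// and the helpers become named member functions.
class Counters {
public:
    // Same effect as helperf1 in the question.
    void advance() {
        count_ += 1;
        count2_ += 2;
    }

    // Same effect as helperf2: multiply the current values, then
    // post-increment count and post-decrement count2.
    int product_and_step() {
        return (count_++) * (count2_--);
    }

private:
    int count_ = 0;
    int count2_ = 0;
};

// Stand-in for the question's myfunc: runs a fixed number of iterations
// (an assumption, since the original loop condition is elided) and returns
// the last product instead of printing it.
int run_iterations(int iterations) {
    Counters counters;
    int last = 0;
    for (int i = 0; i < iterations; ++i) {
        counters.advance();
        last = counters.product_and_step();
    }
    return last;
}
```

Whether this reads better than two lambdas is a matter of taste; the point is only that the state and the helpers travel together under one name.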
Using lambdas is more readable, but they actually exist for more serious reasons. Lambda expressions are also known as "anonymous functions" and are very useful in certain programming paradigms, particularly functional programming, which builds on lambda calculus ( http://en.wikipedia.org/wiki/Lambda_calculus ).
Here you can find the goals of using lambdas :
https://dzone.com/articles/why-we-need-lambda-expressions
If you won't need the two helper functions anywhere else in your code, then use your lambda method. But if you will call one of them again somewhere else in your project, avoid rewriting them as lambdas each time: make a header file called "helpers.(h/hpp)" and a source file called "helpers.(c/cpp)" and put all the helper functions there. That way you gain readability in both the helper file and the caller file.
You can avoid this unskilled habit and challenge yourself by writing complex code that you have to read more than once each time you want to edit it. That improves your programming skills, and if you are working in a team it won't be a problem: use comments, and that will earn you more respect for your programming skills (if your complex code behaves as expected and gives the expected output).
And don't be concerned about performance until you find yourself writing a performance-critical algorithm. If not, the difference will be a few milliseconds and the user won't notice it, so you will be losing your time on an optimization that the compiler can do by itself most of the time if you ask it to optimize your code.

How to implement opIndex for compile time indices?

ref auto opIndex(size_t i){
return t[i];
}
Here t is a tuple and i needs to be read at compile time. How would I express this in D?
There isn't any clean way to do this with opIndex currently, for two reasons. The first is simple: it isn't implemented. That would be relatively easy to fix on its own, but there is a second reason: it adds serious context sensitivity to the language grammar.
Consider this struct definition:
struct S
{
// imagine this works, syntax is not important
static int opIndex (size_t i) { return 42; }
}
Now what does the code S[10] mean? Is it a static array type of ten S elements? Or static opIndex call which returns 42? It is impossible to tell without knowing quite a lot of context and in certain cases impossible to tell at all (like typeof(S[10])).
Somewhat relevant (unapproved!) idea: http://wiki.dlang.org/DIP63

Write a C++ function that accepts a 1-D array and calculates the sum of the elements, and displays it

I wanted to create a function that would define a 1-D array, calculate the sum of its elements, and display that sum. I wrote the following code; however, I'm not familiar with pointers and other advanced coding techniques.
#include <iostream>
using namespace std;
int main()
{
int size;
int A[];
cout << "Enter an array: \n";
cin << A[size];
int sum;
int sumofarrays(A[size]);
sum = sumofarrays(A[size]);
cout << "The sum of the array values is: \n" << sum << "\n";
}
int sumofarrays(int A[size])
{
int i;
int j = 0;
int sum;
int B;
for (i=0; i<size; i++)
{
B = j + A[i];
j = B;
}
sum = B;
return(sum);
}
When attempting to compile this code, I get the following error:
SumOfArrays.cpp:19:18: error: called object type 'int' is not a
function or function pointer sum = sumofarrays(size)
If only you had used a container like std::vector<int> A for your data. Then your sum would drop out as:
int sum = std::accumulate(A.begin(), A.end(), 0);
Every professional programmer will then understand in a flash what you're trying to do. That helps make your code readable and maintainable.
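For completeness, a minimal sketch of that approach wrapped in a function. The name sum_of_elements is just a placeholder of mine; std::accumulate lives in the <numeric> header.

```cpp
#include <numeric>
#include <vector>

// Sums the elements of a vector; an empty vector sums to 0.
// The third argument of std::accumulate is the starting value and also
// fixes the type of the result (int here).
int sum_of_elements(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}
```

The vector knows its own size, so there is no separate size variable to keep in sync with the data.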
Start using the C++ standard library. Read a good book like Stroustrup.
Please choose Bathsheba's answer - it is the correct one. That said, in addition to my comment above, I wanted to give some more tips:
1) You need to learn the difference between an array on the stack (such as "int A[3]") and one on the heap (such as a pointer allocated by malloc or new). There's some degree of nuance here, so I'm not going to go into it all, but it's very important that you learn this if you want to program in C or C++ - even though best practice is to avoid pointers as much as possible and just use STL containers! ;)
2) I'm not going to tell you to use a particular indentation style. But please pick one and be consistent. You'll drive other programmers crazy with that sort of haphazard approach ;) Also, the same applies to capitalization.
3) Variable names should always be meaningful (with the possible exception of otherwise meaningless loop counters, for which "i" seems to be standard). Nobody is going to look at your code and know immediately what "j" or "B" are supposed to mean.
4) Your algorithm, as implemented, only requires half of those variables. There is no point to using all of those temporaries. Just declare sum as "int sum = 0;" and then inside the loop do "sum += A[i];"
5) Best practice is - unlike the old days, where it wasn't possible - to declare variables only where you need to use them, not beforehand. So for example, you don't need to declare B or j (which, as mentioned, really aren't actually needed) before the loop, you can just declare them inside the loop, as "int B = j + A[i];" and "int j = B;". Or better, "const int", since nothing alters them. But best, as mentioned in #4, don't use them at all, just use sum - the only variable you actually care about ;)
The same applies to your for-loop - you should declare i inside the loop ("for (int i = ....") rather than outside it, unless you have some sort of need to see where the loop broke out after it's done (not possible in your example).
6) While it really makes no difference whatsoever here, you should probably get in the habit of using "++i" in your for-loops rather than "i++". It really only matters on classes, not base types like integers, but the algorithms for prefix-increment are usually a tad faster than postfix-increment.
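7) You do realize that you wrote sumofarrays twice here, right? (The second of these lines does not call the function at all: it declares an int variable named sumofarrays, which is why the compiler then complains that the called object is not a function.)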
int sum;
int sumofarrays(A[size]);
sum = sumofarrays(A[size]);
What you really meant was:
const int sum = sumofarrays(A);
Or you could have skipped assigning it to a variable at all and just simply called it inside your cout. The goal is to use as little code as possible without being confusing. Because excess unneeded code just increases the odds of throwing someone off or containing an undetected error.
Just don't take this too far and make a giant mishmash, or try to be too "clever" with one-liner "tricks" that nobody is going to understand when they first look at them! ;)
8) I personally recommend - at this stage - avoiding "using" directives like the plague. It's important for you to learn what's part of the STL by having to explicitly write "std::...." each time. Also, if you ever write .h files that someone else might use, you don't want to (by force of habit) contaminate them with "using" directives that will have an effect on other people's code.
You're a beginner, that's okay - you'll learn! :)
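Putting tips 4 to 6 together, the summing function could look roughly like this. This is only a sketch: the name sum_of_array and the explicit size parameter are my choices, not something from the original code.

```cpp
#include <cstddef>

// Sums the first `size` elements of `values`. The size is passed explicitly
// because a raw array decays to a pointer when handed to a function, so the
// function cannot find out the length on its own.
int sum_of_array(const int values[], std::size_t size) {
    int sum = 0;  // the only accumulator needed; no j or B temporaries
    for (std::size_t i = 0; i < size; ++i) {  // loop counter declared in the loop
        sum += values[i];
    }
    return sum;
}
```

With "int data[] = {1, 2, 3};" the call would be "sum_of_array(data, 3)".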

C++ fixed size arrays vs multiple objects of same type

I was wondering whether (apart from the obvious syntax differences) there would be any efficiency difference between having a class containing multiple instances of an object (of the same type) or a fixed size array of objects of that type.
In code:
struct A {
double x;
double y;
double z;
};
struct B {
double xvec[3];
};
In reality I would be using boost::arrays which are a better C++ alternative to C-style arrays.
I am mainly concerned with construction/destruction and reading/writing such doubles, because these classes will often be constructed just to invoke one of their member functions once.
Thank you for your help/suggestions.
Typically the representation of those two structs would be exactly the same. It is, however, possible to have poor performance if you pick the wrong one for your use case.
For example, if you need to access each element in a loop, with an array you could do:
for (int i = 0; i < 3; i++)
dosomething(xvec[i]);
However, without an array, you'd either need to duplicate code:
dosomething(x);
dosomething(y);
dosomething(z);
This means code duplication - which can go either way. On the one hand there's less loop code; on the other hand very tight loops can be quite fast on modern processors, and code duplication can blow away the I-cache.
The other option is a switch:
for (int i = 0; i < 3; i++) {
    double *r;
    switch(i) {
    case 0: r = &x; break;
    case 1: r = &y; break;
    case 2: r = &z; break;
    }
    dosomething(*r); // assume this is some big inlined code
}
This avoids the possibly-large i-cache footprint, but has a huge negative performance impact. Don't do this.
On the other hand, it is, in principle, possible for array accesses to be slower, if your compiler isn't very smart:
xvec[0] = xvec[1] + 1;
dosomething(xvec[1]);
Since xvec[0] and xvec[1] are distinct, in principle, the compiler ought to be able to keep the value of xvec[1] in a register, so it doesn't have to reload the value at the next line. However, it's possible some compilers might not be smart enough to notice that xvec[0] and xvec[1] don't alias. In this case, using separate fields might be a very tiny bit faster.
In short, it's not about one or the other being fast in all cases. It's about matching the representation to how you use it.
Personally, I would suggest going with whatever makes the code working on xvec most natural. It's not worth spending a lot of human time worrying about something that, at best, will probably only produce such a small performance difference that you'll only catch it in micro-benchmarks.
MSVC++ 2010 generated exactly the same code for reading/writing from two POD structs like in your example. Since the offsets to read/write to are computable at compile time, this is not surprising. Same goes for construction and destruction.
As for the actual performance, the general rule applies: profile it if it matters, if it doesn't - why care?
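If you want to convince yourself that the offsets really are fixed at compile time, a sketch along these lines can help. The static_asserts hold on the mainstream compilers I know of, but the standard does not strictly promise the absence of padding, so take them as a check rather than a guarantee. The helper names sum_fields and sum_array are made up.

```cpp
#include <cstddef>

struct A {
    double x;
    double y;
    double z;
};

struct B {
    double xvec[3];
};

// Checked at compile time: both structs have the same size, and the fields
// of A sit at the same offsets as the elements of B::xvec.
static_assert(sizeof(A) == sizeof(B), "A and B are expected to have the same size");
static_assert(offsetof(A, y) == 1 * sizeof(double), "y is expected right after x");
static_assert(offsetof(A, z) == 2 * sizeof(double), "z is expected right after y");

// Reading the named fields one by one.
double sum_fields(const A& a) {
    return a.x + a.y + a.z;
}

// Reading the array elements in a loop.
double sum_array(const B& b) {
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        sum += b.xvec[i];
    }
    return sum;
}
```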
Indexing into an array member is perhaps a bit more work for the user of your struct, but then again, he can more easily iterate over the elements.
In case you can't decide and want to keep your options open, you can use an anonymous union:
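#include <iostream>

struct Foo
{
    union
    {
        struct
        {
            double x;
            double y;
            double z;
        } xyz;
        double arr[3];
    };
};
int main()
{
    Foo a;
    a.xyz.x = 42;
    // Reading arr after writing xyz relies on union type punning, which
    // compilers commonly support but standard C++ does not guarantee.
    std::cout << a.arr[0] << std::endl;
}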
Some compilers also support anonymous structs; in that case you can leave the xyz part out.
It depends. For instance, the example you gave is a classic one in favor of 'old-school' arrays: a math point/vector (or matrix)
- has a fixed number of elements
- the data itself is usually kept private in an object
- since (if?) it has a class as an interface, you can properly initialize them in the constructor (otherwise, classic array initialization is something I don't really like, syntax-wise)
In such cases (going with the math vector/matrix examples), I always ended up using C-style arrays internally, as you can loop over them instead of writing copy/pasted code for each component.
But this is a special case -- for me, in C++ nowadays arrays == STL vector, it's fast and I don't have to worry about nuthin' :)
The difference can be in how the variables are stored in memory. In the first example the compiler can add padding to align the data. But in your particular case it doesn't matter.
Raw arrays offer better cache locality than C++ arrays. As presented, however, the array example's only advantage over the multiple objects is the ability to iterate over the elements.
The real answer is of course, create a test case and measure.

C structure pointer dereferencing speed

I have a question regarding the speed of pointer dereferencing. I have a structure like so:
typedef struct _TD_RECT TD_RECT;
struct _TD_RECT {
double left;
double top;
double right;
double bottom;
};
My question is, which of these would be faster and why?
CASE 1:
TD_RECT *pRect;
...
for(i = 0; i < m; i++)
{
if(p[i].x < pRect->left) ...
if(p[i].x > pRect->right) ...
if(p[i].y < pRect->top) ...
if(p[i].y > pRect->bottom) ...
}
CASE 2:
TD_RECT *pRect;
double left = pRect->left;
double top = pRect->top;
double right = pRect->right;
double bottom = pRect->bottom;
...
for(i = 0; i < m; i++)
{
if(p[i].x < left) ...
if(p[i].x > right) ...
if(p[i].y < top) ...
if(p[i].y > bottom) ...
}
So in case 1, the loop is directly dereferencing the pRect pointer to obtain the comparison values. In case 2, new values were made on the function's local space (on the stack) and the values were copied from the pRect to the local variables. Through a loop there will be many comparisons.
In my mind, they would be equally slow, because the local variable is also a memory reference on the stack, but I'm not sure...
Also, would it be better to keep referencing p[] by index, or increment p by one element and dereference it directly without an index.
Any ideas? Thanks :)
You'll probably find it won't make a difference with modern compilers. Most of them would probably perform common subexpression elimination on the expressions that don't change within the loop. It's not wise to assume that there's a simple one-to-one mapping between your C statements and assembly code. I've seen gcc pump out code that would put my assembler skills to shame.
But this is neither a C nor C++ question since the ISO standard doesn't mandate how it's done. The best way to check for sure is to generate the assembler code with something like gcc -S and examine the two cases in detail.
You'll also get more return on your investment if you steer away from this sort of micro-optimisation and concentrate more on the macro level, such as algorithm selection and such.
And, as with all optimisation questions, measure, don't guess! There are too many variables which can affect it, so you should be benchmarking different approaches in the target environment, and with realistic data.
It is not likely to be a hugely performance critical difference. You could profile doing each option multiple times and see. Ensure you have your compiler optimisations set in the test.
With regards to storing the doubles, you might get some performance hit by using const. How big is your array?
With regards to using pointer arithmetic, this can be faster, yes.
You can instantly optimise if you know left < right in your rect (surely it must be). If x < left it can't also be > right so you can put in an "else".
Your big optimisation, if there is one, would come from not having to loop through all the items in your array and not have to perform 4 checks on all of them.
For example, if you indexed or sorted your array on x and y, you would be able, using binary search, to find all values that have x < left and loop through just those.
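As a rough sketch of the sorted-array idea: if the points are kept sorted by x, std::lower_bound finds the end of the "x < left" range in logarithmic time. The Point type and the function name count_left_of are assumptions of mine; the question only shows that p[i] has x and y members.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical point type standing in for the elements of p[].
struct Point {
    double x;
    double y;
};

// Assumes `points` is already sorted by ascending x. Returns how many points
// lie strictly to the left of `left`; those points occupy the index range
// [0, result), so only that range needs to be visited afterwards.
std::size_t count_left_of(const std::vector<Point>& points, double left) {
    const auto first_not_left = std::lower_bound(
        points.begin(), points.end(), left,
        [](const Point& pt, double value) { return pt.x < value; });
    return static_cast<std::size_t>(first_not_left - points.begin());
}
```

Whether the sort pays for itself depends on how often the array changes compared to how often it is queried.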
I think the second case is likely to be faster because you are not dereferencing the pointer to pRect on every loop iteration.
Practically, a compiler doing optimisation may notice this and there might be no difference in the code that is generated, but the possibility of pRect being an alias of an item in p[] could prevent this.
An optimizing compiler will see that the structure accesses are loop invariant and so do a Loop-invariant code motion, making your two cases look the same.
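Here is a small sketch of the two cases side by side, so the generated assembly can be compared (for example with gcc -S). The loop bodies are elided in the question, so counting the points outside the rectangle is my assumption, as are the TD_POINT type and both function names.

```cpp
#include <cstddef>

struct TD_RECT {
    double left;
    double top;
    double right;
    double bottom;
};

// Hypothetical point type; the question only shows p[i].x and p[i].y.
struct TD_POINT {
    double x;
    double y;
};

// Case 1: dereference pRect inside the loop.
std::size_t count_outside_deref(const TD_POINT* p, std::size_t m, const TD_RECT* pRect) {
    std::size_t outside = 0;
    for (std::size_t i = 0; i < m; ++i) {
        if (p[i].x < pRect->left || p[i].x > pRect->right ||
            p[i].y < pRect->top || p[i].y > pRect->bottom) {
            ++outside;
        }
    }
    return outside;
}

// Case 2: copy the loop-invariant fields into locals before the loop,
// which is what loop-invariant code motion would do automatically.
std::size_t count_outside_hoisted(const TD_POINT* p, std::size_t m, const TD_RECT* pRect) {
    const double left = pRect->left;
    const double top = pRect->top;
    const double right = pRect->right;
    const double bottom = pRect->bottom;
    std::size_t outside = 0;
    for (std::size_t i = 0; i < m; ++i) {
        if (p[i].x < left || p[i].x > right || p[i].y < top || p[i].y > bottom) {
            ++outside;
        }
    }
    return outside;
}
```

Both functions return the same count for the same input; whether they compile to the same instructions is something to check on your own compiler and flags.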
I would be surprised if even a totally non-optimized compile (-O0) produced different code for the two cases presented. In order to perform any operation on a modern processor, the data needs to be loaded into registers. So even when you declare automatic variables, these variables will not exist in main memory but rather in one of the processor's floating-point registers. This will be true even when you do not declare the variables yourself, and therefore I expect no difference in the generated machine code even when you declare the temporary variables in your C++ code.
But as others have said, compile the code into assembler and see for yourself.