Diff function on two arrays, in c++/mfc/stl?

Diff function on two arrays (or how to turn Old into New)
Example
One[]={2,3,4,5,6,7}
Two[]={1,2,3,5,5,5,9}
Example Result
Diff: insert 1 into One[0], One[]={1,2,3,4,5,6,7}
Diff: delete 4 from One[3], One[]={1,2,3,5,6,7}
Diff: modify 6 into 5 in One[4], One[]={1,2,3,5,5,7}
Diff: modify 7 into 5 in One[5], One[]={1,2,3,5,5,5}
Diff: append 9 into One[6], One[]={1,2,3,5,5,5,9}
Need code in c++/mfc/stl/c, Thanks.

What you need is a string-matching (edit distance) algorithm, usually implemented using dynamic programming (see here).
I'd highly suggest using a library that performs the diff instead of implementing it yourself.

Though it's normally done with letters instead of integers, the usual algorithm for computing the Levenshtein distance should work just as well here as in its usual applications.
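For illustration, here is a minimal sketch of that dynamic program over two int sequences, with a backtrack that prints an insert/delete/modify edit script. Note that it may find a shorter script than the hand-worked example in the question (four edits instead of five for this data), since the DP is optimal:

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> one = {2, 3, 4, 5, 6, 7};
    std::vector<int> two = {1, 2, 3, 5, 5, 5, 9};
    const size_t n = one.size(), m = two.size();

    // cost[i][j] = edit distance between one[0..i) and two[0..j)
    std::vector<std::vector<int> > cost(n + 1, std::vector<int>(m + 1));
    for (size_t i = 0; i <= n; ++i) cost[i][0] = (int)i;
    for (size_t j = 0; j <= m; ++j) cost[0][j] = (int)j;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int sub = cost[i-1][j-1] + (one[i-1] == two[j-1] ? 0 : 1);
            cost[i][j] = std::min(sub,
                         std::min(cost[i-1][j] + 1,    // delete one[i-1]
                                  cost[i][j-1] + 1));  // insert two[j-1]
        }

    // Walk back from the corner; edits are reported from the ends of the
    // sequences toward the front. Delete/modify positions refer to the
    // original One; inserts name the position in One before which the new
    // element goes. Applying the edits in the printed order keeps every
    // position valid, since later (earlier-index) edits sit before it.
    size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && one[i-1] == two[j-1]
                 && cost[i][j] == cost[i-1][j-1]) {
            --i; --j;                                   // match, no edit
        } else if (i > 0 && j > 0 && cost[i][j] == cost[i-1][j-1] + 1) {
            std::cout << "modify " << one[i-1] << " into " << two[j-1]
                      << " at One[" << i-1 << "]\n";
            --i; --j;
        } else if (i > 0 && cost[i][j] == cost[i-1][j] + 1) {
            std::cout << "delete " << one[i-1] << " from One[" << i-1 << "]\n";
            --i;
        } else {
            std::cout << "insert " << two[j-1] << " into One[" << i << "]\n";
            --j;
        }
    }
    return 0;
}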

I'm the developer of a diff library for C++.
http://code.google.com/p/dtl-cpp/
Using my diff library, it is possible to calculate the difference between two sequences.
Please see examples/intdiff.cpp for how to use it.

Related

Univac Math pack subroutines in old-school FORTRAN (pre-77)

I have been looking at an engineering paper here which describes an old FORTRAN code for solving pipe flow equations (it's dated 1974, before FORTRAN was standardised as Fortran 77). On page 42 of this document the old code calls the following subroutine:
C SYSTEM SUBROUTINE FROM UNIVAC MATH-PACK TO
C SOLVE LINEAR SYSTEM OF EQ.
CALL GJR(A,51,50,NP,NPP,$98,JC,V)
It's a bit of a long shot, but do any veterans or ancient-code buffs recall this system subroutine and its input arguments? I'm having trouble finding any information about it.
If I can adapt the old code to my current application I may rewrite this in C++ or VBA, and will be looking for an equivalent function in those languages.
I'll add to this answer if I find anything more detailed, but I have a place to start looking for the arguments to GJR.
This function is part of the Sperry UNIVAC MATH-PACK library - a full list of functions in the library can be found at http://www.dtic.mil/dtic/tr/fulltext/u2/a170611.pdf. GJR is described as "determinant; inverse; solution of simultaneous equations". Marginally helpful.
A better description comes from http://nvlpubs.nist.gov/nistpubs/jres/74B/jresv74Bn4p251_A1b.pdf
A FORTRAN subroutine, one of the Univac 1108 Math Pack programs,
available on the library tapes at the University of Maryland computing
center. It solves simultaneous equations, computes a determinant, or
inverts a matrix or any combination of the three above by using a
Gauss-Jordan elimination technique with column pivoting.
This is slightly more useful, but what we really want is "MATH-PACK, Programmer Reference", UP-7542 Rev. 1 from Sperry-UNIVAC (Unisys). I find a lot of references to this document but no full-text PDF of the document itself.
I'd take a look at the arguments in the function call, how they are set up and how the results are used, then look for equivalent routines in LAPACK or BLAS. See http://www.netlib.org/lapack/
I have a few books on piping networks including "Analysis of Flow in Pipe Networks" by Jeppson (same author as in the original PDF hosted by USU) https://books.google.com/books/about/Analysis_of_flow_in_pipe_networks.html?id=peZSAAAAMAAJ - I'll see if I can dig that up. The book may have a more portable matrix solver than the proprietary Sperry-UNIVAC library.
Update:
From p. 41 of http://ngds.egi.utah.edu/files/GL04099/GL04099_1.pdf I found documentation for the CGJR function, the complex version of GJR from the same library. It is likely the only difference in the arguments is variable type (COMPLEX instead of REAL):
CGJR is a subroutine which solves simultaneous equations, computes a determinant, inverts a matrix, or does any combination of these three operations, by using a Gauss-Jordan elimination technique with column pivoting.
The procedure for using CGJR is as follows:
Calling statement: CALL CGJR(A,NC,NR,N,MC,$K,JC,V)
where
A is the matrix whose inverse or determinant is to be determined. If simultaneous equations are solved, the last MC-N columns of the matrix are the constant vectors of the equations to be solved. On output, if the inverse is computed, it is stored in the first N columns of A. If simultaneous equations are solved, the last MC-N columns contain the solution vectors. A is a complex array.
NC is an integer representing the maximum number of columns of the array A.
NR is an integer representing the maximum number of rows of the array A.
N is an integer representing the number of rows of the array A to be operated on.
MC is the number of columns of the array A, representing the coefficient matrix if simultaneous equations are being solved; otherwise it is a dummy variable.
K is a statement number in the calling program to which control is returned if an overflow or singularity is detected.
1) If an overflow is detected, JC(1) is set to the negative of the last correctly completed row of the reduction and control is then returned to statement number K in the calling program.
2) If a singularity is detected, JC(1) is set to the number of the last correctly completed row, and V is set to (0.,0.) if the determinant was to be computed. Control is then returned to statement number K in the calling program.
JC is a one-dimensional permutation array of N elements which is used for permuting the rows and columns of A if an inverse is being computed. If an inverse is not computed, this array must have at least one cell for the error return identification. On output, JC(1) is N if control is returned normally.
V is a complex variable. On input, REAL(V) is the option indicator, set as follows:
1. invert matrix
2. compute determinant
3. do 1. and 2.
4. solve system of equations
5. do 1. and 4.
6. do 2. and 4.
7. do 1., 2. and 4.
Notes on usage of row dimension arguments N and NR:
The arguments N and NR refer to the row dimensions of the A matrix.
N gives the number of rows operated on by the subroutine, while NR
refers to the total number of rows in the matrix as dimensioned by the
calling program. NR is used only in the dimension statement of the
subroutine. Through proper use of these parameters, the user may specify that only a submatrix, instead of the entire matrix, be operated on by the subroutine.
In your application (pipe flow), look at how matrix A and vector V are populated before the call to GJR and how they are used after the call.
You may be able to replace the call to GJR with a call to LAPACK's SGESV or DGESV without much difficulty.
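As a rough sketch of what the replacement might look like in C++ (assuming the LAPACKE C interface is available; the matrix values below are made up for illustration):

#include <iostream>
#include <vector>
#include <lapacke.h>  // C interface to LAPACK; link with -llapacke -llapack

int main() {
    lapack_int n = 3, nrhs = 1;
    // Row-major 3x3 coefficient matrix and right-hand side (made-up values).
    std::vector<double> A = { 2, 1, 1,
                              1, 3, 2,
                              1, 0, 0 };
    std::vector<double> b = { 4, 5, 6 };
    std::vector<lapack_int> ipiv(n);  // pivot indices, cf. GJR's JC array

    // Solve A*x = b; this covers GJR's "solve simultaneous equations" option.
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs,
                                    A.data(), n, ipiv.data(),
                                    b.data(), nrhs);
    if (info != 0) {  // info > 0 flags a singular matrix, cf. GJR's $K error return
        std::cerr << "dgesv failed, info = " << info << "\n";
        return 1;
    }
    for (double x : b) std::cout << x << "\n";  // the solution overwrites b
    return 0;
}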
Aside: The Fortran community really needs a drop-in 'Rosetta library' that wraps LAPACK, etc. for replacing legacy/proprietary IBM, UNIVAC, and Numerical Recipes math functions. The perfect case would be that maintainers would replace legacy functions with de facto standard math functions but in the real world, many of these older programs are un(der)maintained and there simply isn't the will (or, as in this case, the ability) to update them.
Update 2:
I started work on a compatibility library for the Sperry MATH-PACK and STAT-PACK routines as well as a few other legacy libraries, posted at https://bitbucket.org/apthorpe/alfc
Further, I located my copy of Jeppson's Analysis of Flow in Pipe Networks (a slightly more legible version of the PDF of Steady Flow Analysis of Pipe Networks: An Instructional Manual) and modernized the codes listed in the text. I have posted those at https://bitbucket.org/apthorpe/jeppson_pipeflow
Note that I found a number of errors in both the code listings and in the example problems given for many of the codes. If you're trying to learn how to write a pipe flow solver based on Jeppson's paper or text, I'd strongly suggest reviewing my updated codes and test cases because they will save you hours of effort trying to understand why the code doesn't work and why you can't replicate the example cases. This took a fair amount of forensic computing to sort out.
Update 3:
The source to CGJR and DGJR can be found in http://www.dtic.mil/dtic/tr/fulltext/u2/a110089.pdf. DGJR is the closest to what you want, though it references more routines that aren't available (proprietary UNIVAC error-handling routines). It should be easy to convert DGJR to single precision and skip the proprietary calls. Otherwise, use the compatibility library mentioned above.

Is there a set-like data structure optimized for searches where it is known ahead of time there will be a high percent of matches?

I have a use case where a set of strings will be searched for a particular string, s. The percent of hits or positive matches for these searches will be very high. Let's say 99%+ of the time, s will be in the set.
I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 40 ms on good hardware (around 600 ms on a VM) to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.
So, is there any sort of data structure optimized for a high percent of hits? I cannot precompute the hashes for the strings coming in, so I think I'm looking at a complexity of O(average string length) for a hash set like boost::unordered_set. I looked at tries; these would probably perform well in the opposite case, where hits are rare, but not really any better than hash sets.
edit: some other details about my particular use case:
the number of strings in the set is around 5,000. The longest string is probably no more than 200 chars. Search gets called again and again with the same strings, but they are coming in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.
edit2: I did some of my own benchmarking
I collected 5,000 of the strings that occur in the real system. I created two scenarios.
1) I loop over the list of known strings and do a search for them in the container. I do this for 500,000 searches ("hits").
2) I loop through a set of strings known not to be in the container, for 500,000 searches ("misses").
(Note - I'm interested in hashing the data in reverse because eyeballing my data, I noticed that there are a lot of common prefixes and the suffixes differ - at least that is what it looks like.)
Tests done on a virtualbox CentOS 5.6 VM running on a macbook host.
                                                                           hits (ms)   misses (ms)
boost::unordered_set with default hash and no reserved size:                  591.15        441.39
tr1::unordered_set with default hash:                                         191.09        143.80
boost::unordered_set with a reserve size set:                                 579.31        431.54
boost::unordered_set w/custom hash (hash on the last 15 chars + str size):    357.34        812.13
boost::unordered_set w/custom hash (hash on the last 25 chars + str size):    362.60        795.33
trie:                                                                        1809.34         58.11
trie with reversed insertion/search:                                         2806.26        311.14
In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the Trie wins big.
My test loop looks like this, where function_set is the container being tested, loaded with 5,000 strings, and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container.
while (searched < kTotalSearches) {
    for (std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
        function_set.count(*i);
        searched++;
        if (searched == kTotalSearches)
            break;
    }
}
std::cout << searched << " searches." << std::endl;
I'm pretty sure that tries are what you are looking for. You are guaranteed not to go down a number of nodes greater than the length of your string. Once you've reached a leaf, there might be some linear search if there are collisions at this particular node. It depends on how you build it. Since you're using a set, I would assume that this is not a problem.
The unordered_set will have a complexity of at worst O(n), but n in this case is the number of nodes that you have (500k) and not the number of characters you are searching for (probably less than 500k).
After edit:
Maybe what you really need is a cache of the results after your search algo succeeded.
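A minimal sketch of that caching idea, assuming lookups repeat often enough to pay for the extra map. For a plain hash set the cache probe costs about as much as the lookup itself, so this mainly helps if the underlying search is something slower, such as a trie:

#include <string>
#include <unordered_map>
#include <unordered_set>

// Remembers the result of each membership test so repeated queries
// skip the underlying search entirely.
class CachedLookup {
public:
    explicit CachedLookup(const std::unordered_set<std::string>& s)
        : set_(s) {}
    bool contains(const std::string& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cached hit or miss
        bool found = set_.count(key) != 0;
        cache_[key] = found;
        return found;
    }
private:
    const std::unordered_set<std::string>& set_;
    std::unordered_map<std::string, bool> cache_;
};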
This question piqued my curiosity so I did a few tests to satisfy myself with the following results. A few general notes:
The usual caveats about benchmarking apply (don't trust my numbers, do your own benchmarks with your specific use case and data, etc...).
Tests were done using MSVS C++ 2010 (speed optimized, release build).
Benchmarks were run using 10 million loops to improve timing accuracy.
Names were generated by randomly concatenating 20 different string fragments into strings ranging from 4 to 65 characters in length.
Names included only letters and some tests (trie) were case-insensitive for simplicity, though there's no reason the methods can't be extended to include other characters.
Tests try to match the 99.975% hit rate given in the question.
Test Descriptions
Basic description of the tests run with the relevant details:
String Iteration -- Simply iterates through the function name for a baseline time comparison.
Map -- std::unordered_map<std::string, int>
Set -- std::unordered_set<std::string>
BoostSet -- boost::unordered_set<std::string>, v1.47.0
CharMap -- std::unordered_map<const char*, int>
CharSet -- std::unordered_set<const char*>
FastMap -- Simply a std::unordered_map<> using a custom FNV-1a hash algorithm.
FastSet -- Simply a std::unordered_set<> using a custom FNV-1a hash algorithm.
CustomMap -- A basic hash map I wrote myself years ago.
Trie -- A standard trie downloaded from Google code.
CustomTrie -- A bare-bones trie I wrote myself.
BinarySearch -- Using std::binary_search() on a sorted std::vector<std::string>.
SortArrayMap -- An attempt to use a size_t VectorIndex[26][26][26][26][26] array to index into a sorted array.
PerfectMap -- A std::unordered_map<> using a perfect hash from gperf.
PerfectWordSet -- Using the gperf in_word_set() function directly.
PerfectWordSetFunc -- Same as PerfectWordSet but called in a function instead of inline.
PerfectWordSetThread -- Same as PerfectWordSet but the work is split into N threads (standard Windows threads). No synchronization is used except for waiting for the threads to finish.
Test Results (Mostly Hits)
Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):
Trie -- 9100 ms
SortArrayMap -- 6600 ms
PerfectWordSetFunc -- 4050 ms
CustomTrie -- 3470 ms
BinarySearch -- 3420 ms
CustomMap -- 2700 ms
CharSet -- 1300 ms
CharMap -- 1300 ms
BoostSet -- 1200 ms
FastSet -- 970 ms
FastMap -- 930 ms
Original Poster -- 800 ms (estimated)
Set -- 730 ms
Map -- 690 ms
PerfectMap -- 650 ms
PerfectWordSet -- 500 ms
PerfectWordSetThread(1) -- 500 ms
StringIteration -- 350 ms
PerfectWordSetThread(2) -- 260 ms
PerfectWordSetThread(4) -- 150 ms
PerfectWordSetThread(32) -- 125 ms
PerfectWordSetThread(8) -- 120 ms
PerfectWordSetThread(16) -- 110 ms
Test Results (Mostly Misses)
Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):
BinarySearch -- ? (took too long)
SortArrayMap -- 8050 ms
Trie -- 3200 ms
CustomMap -- 1700 ms
BoostSet -- 920 ms
CustomTrie -- 850 ms
FastMap -- 590 ms
FastSet -- 580 ms
CharSet -- 550 ms
CharMap -- 550 ms
StringIteration -- 350 ms
Set -- 330 ms
Map -- 330 ms
PerfectMap -- 280 ms
PerfectWordSet -- 140 ms
PerfectWordSetThread(1) -- 130 ms
PerfectWordSetThread(2) -- 75 ms
PerfectWordSetThread(4) -- 45 ms
PerfectWordSetThread(32) -- 45 ms
PerfectWordSetThread(8) -- 40 ms
PerfectWordSetThread(16) -- 35 ms
Discussion
My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense, and it follows the same reasoning as not using a linked list.
I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40ms, or 80ns/lookup. At that scale you easily lose if you have to access anything not already in the L1/L2 cache. A trie is really bad for this as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case I couldn't figure any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in L2 cache you would lose with all the indirection required.
The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials).
Same thing with trying to switch from std::string to const char * (about twice as slow).
The boost::unordered_set<> was twice as slow as the std::unordered_set<> and I don't know if that is because I just used the built-in hash function, was using a slightly old version of boost, or something else. Have you tried std::unordered_set<> yourself?
By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
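For reference, a hedged sketch of what the gperf route looks like. The exact flags and generated names vary by gperf version, and keywords.gperf / keywords.hpp are hypothetical file names:

// Generate the hash from a word list (one string per line between %% markers):
//     gperf -L C++ -N in_word_set keywords.gperf > keywords.hpp
// With -L C++, gperf emits a Perfect_Hash class whose in_word_set()
// returns a non-null pointer iff the string is in the set.
#include <cstring>
#include "keywords.hpp"  // hypothetical generated header

bool IsKnownFunction(const char* name) {
    return Perfect_Hash::in_word_set(name, std::strlen(name)) != 0;
}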
The PerfectWordSetThread tests simply use the perfect hash and split the work up into 1-32 threads. This problem is perfectly parallel (at least the benchmark is) so you get almost a 5x performance boost in the 16-thread case. This works out to only 6.3 ms/500k lookups, or 13 ns/lookup... a mere 50 cycles on a 4 GHz processor.
The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
Edit: Another thing that illustrates how tight things are is the difference between the PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one is called in a function and one is called inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:
bool IsInPerfectWordSet(string Match)
{
    return in_word_set(Match);
}

// Inline benchmark: PerfectWordSet
for i = 1 to 10,000,000
{
    if (in_word_set(SomeString)) ++MatchCount;
}

// Function call benchmark: PerfectWordSetFunc
for i = 1 to 10,000,000
{
    if (IsInPerfectWordSet(SomeString)) ++MatchCount;
}
This really highlights the difference in performance that inline code/functions can make. You also have to be careful in making sure what you are measuring in a benchmark. Sometimes you would want to include the function call overhead, and sometimes not.
Can You Get Faster?
I've learned to never say "no" to this question, but at some point the effort may not be worth it. If you can split the lookups into threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).
A couple ideas I don't have the knowledge to attempt:
1. Assembly optimization using SSE
2. Use the GPU for additional throughput
3. Change your design so you don't need fast lookups
Take a moment to consider #3... the fastest code is code that never needs to run. If you can reduce the number of lookups, or reduce the need for an extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.
If the set of strings is fixed at compile time (e.g. it is a dictionary of known human words), you could perhaps use a perfect hash algorithm, and use the gperf generator.
Otherwise, you might perhaps use an array of 26 hash tables, indexed by the first letter of the word to hash.
BTW, perhaps using a sorted array of these strings, with dichotomic (binary-search) access, might be faster (since log2 5000 is about 13), or a std::map or a std::set.
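As a sketch, the sorted-array variant is only a few lines with the standard library (assuming the 5,000 names are collected and sorted once up front):

#include <algorithm>
#include <string>
#include <vector>

// sorted_names must be sorted once (std::sort) before any lookups;
// each query then costs about log2(5000), roughly 13 string comparisons.
bool contains(const std::vector<std::string>& sorted_names,
              const std::string& s) {
    return std::binary_search(sorted_names.begin(), sorted_names.end(), s);
}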
Lastly, you might define your own hashing function: perhaps in your particular case, hashing only the first 16 bytes could be enough!
If the set of strings is fixed, you could consider generating a dichotomic search over it (e.g. code a script to generate a function with 5,000 tests, of which only about log2 5000 execute).
Also, even if the set of strings is slightly variable (e.g. change from one program run to the next, but stays constant during a single run), you might even consider generating the function (by emitting C++ code, then compiling it) on the fly and dlopen-ing it.
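A rough sketch of that generate-and-dlopen idea on POSIX (file and symbol names are made up; the generated function must be declared extern "C" so dlsym can find it by its unmangled name, and you link the host program with -ldl):

#include <cstdlib>
#include <dlfcn.h>    // POSIX dynamic loading: dlopen/dlsym/dlclose
#include <iostream>

typedef bool (*LookupFn)(const char*);

int main() {
    // 1) After emitting lookup.cpp (containing extern "C" bool is_known(const char*)),
    //    compile it into a shared object. Error handling omitted for brevity.
    std::system("g++ -O2 -shared -fPIC lookup.cpp -o lookup.so");

    // 2) Load the library and look up the generated function.
    void* lib = dlopen("./lookup.so", RTLD_NOW);
    if (!lib) { std::cerr << dlerror() << "\n"; return 1; }
    LookupFn is_known = (LookupFn)dlsym(lib, "is_known");

    if (is_known && is_known("some_string"))
        std::cout << "hit\n";
    dlclose(lib);
    return 0;
}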
You really should benchmark and try several solutions! It probably is more an engineering issue than an algorithmic one.

n-dimensional interpolation c++ algorithm

How can I implement n-dimensional interpolation in C++? In the ideal case I would like to have it generic over the actual kernel, so that I can switch between e.g. linear and polynomial interpolation (perhaps as a start: linear interpolation). This article ( http://pimiddy.wordpress.com/2011/01/20/n-dimensional-interpolation/ ) discusses this stuff, but I have two problems:
1) I could not understand how to implement the "interpolate" method shown in the article in C++
2) More importantly I want to use it in a scenario where you have "multiple independent variables (X)" and "1 dependent variable (Y)" and somehow interpolate on both (?)
For example, if n=3 (i.e. 3-dimensional) and I have the following data:
#X1 X2 X3 Y
10 10 10 3.45
10 10 20 4.52
10 20 15 5.75
20 10 15 5.13
....
How could I find the value of Y (the dependent variable) for a particular combination of the X (independent) variables, e.g. 17 17 17?
I know there exist other approaches such as decision trees and SVMs, but here I am interested in interpolation.
You can take a look at a set of interpolation algorithms (including a C++ implementation) at alglib.
Also it should be noted that neural networks (backpropagation nets, for example) are considered good interpolators.
If your question is about the specific article, it's out of my knowledge.
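For a concrete starting point, here is a minimal sketch of one very simple scattered-data scheme, inverse-distance weighting (Shepard's method). It is not from the article above, but it fits the asker's layout of several independent variables and one dependent variable:

#include <cmath>
#include <iostream>
#include <vector>

struct Sample { std::vector<double> x; double y; };

// Weighted average of the samples, with weights 1/distance^power;
// closer points dominate, and an exact hit returns its Y directly.
double idw(const std::vector<Sample>& data,
           const std::vector<double>& q, double power = 2.0) {
    double num = 0.0, den = 0.0;
    for (const Sample& s : data) {
        double d2 = 0.0;
        for (size_t i = 0; i < q.size(); ++i)
            d2 += (s.x[i] - q[i]) * (s.x[i] - q[i]);
        if (d2 == 0.0) return s.y;         // query coincides with a data point
        double w = 1.0 / std::pow(d2, power / 2.0);
        num += w * s.y;
        den += w;
    }
    return num / den;
}

int main() {
    // The data points from the question (X1 X2 X3 -> Y).
    std::vector<Sample> data = {
        {{10, 10, 10}, 3.45}, {{10, 10, 20}, 4.52},
        {{10, 20, 15}, 5.75}, {{20, 10, 15}, 5.13},
    };
    std::cout << idw(data, {17, 17, 17}) << "\n";  // interpolated Y at (17,17,17)
    return 0;
}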

What hash function used in dictionary (hash_table)?

I'm writing an interpreter for a language.
Here is the problem: I want to create a dictionary type where you can store a value of any type under an index - any type meaning simple (int, float, string) or complex (list, array, dictionary) of simple types, or complex of complex types, and so on. That is the same as in the Python language.
What hash-function algorithm should I use?
For strings there are many example hashes - the simplest: the sum of all characters multiplied by 31, taken modulo HASH_SIZE.
But for DIFFERENT TYPES, I think, it must be a more complicated algorithm.
I found SHA-256, but I don't know how to use its unsigned char[32] result for addressing into a hash table - that range is much larger than the RAM of a computer.
Thank you.
There are hash tables in C++11, the newest C++ standard: std::unordered_map and std::unordered_set.
EDIT:
Since every type has a different distribution, usually every type has its own hash function. This is how it's done in Java (the .hashCode() method inherited from Object), C#, C++11 and many other implementations.
EDIT2:
A typical hash function does two things:
1) Create a representation of the object as a natural number. (This is what .hashCode() in Java does.)
For example - string "CAT" can be transformed to:
67 * 256^2 + 65 * 256^1 + 84 = 4407636
2) Map this number to a position in an array.
One way to do this is:
integer_part(fractional_part(k*4407636)*m)
Where k is a constant (Donald Knuth, in his book The Art of Computer Programming, recommends (sqrt(5)+1)/2), m is the size of your hash table, and fractional_part and integer_part (obviously) compute the fractional and integer parts of a real number.
In your hash table implementation, you need to handle collisions, especially when there are many more possible keys than slots in your hash table.
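Putting the two steps together, a minimal sketch (note it shares the overflow weakness discussed in EDIT3 below, since step 1 wraps around for long strings):

#include <cmath>
#include <cstddef>
#include <iostream>
#include <string>

// Step 1: represent the string as a natural number, base 256.
std::size_t hash_code(const std::string& s) {
    std::size_t k = 0;
    for (unsigned char c : s) k = k * 256 + c;  // "CAT" -> 4407636 (wraps for long strings)
    return k;
}

// Step 2: Knuth's multiplicative method, integer_part(fractional_part(k*phi)*m).
std::size_t bucket(std::size_t k, std::size_t m) {
    const double phi = (std::sqrt(5.0) + 1.0) / 2.0;
    double frac = std::fmod(k * phi, 1.0);      // fractional part of k*phi
    return (std::size_t)(frac * m);             // scale to table size m
}

int main() {
    std::cout << bucket(hash_code("CAT"), 1024) << "\n";
    return 0;
}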
EDIT3:
I read more on the subject, and it looks like
67 * 256^2 + 65 * 256^1 + 84 = 4407636
is a really bad way to compute a hash code. This is because, with fixed-width integer arithmetic, the high-order characters overflow away, so for example "somethingAAAAAABC" and "AAAAAABC" give exactly the same hash code.
Well, a common approach is to define the hash function as a method belonging to the type.
That way you can call different algorithms for different types through a common API.
That, of course, entails that you define wrapper classes for every basic "C type" that you want to use in your interpreter.
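A bare-bones sketch of that common-API idea (names are made up; only two wrapper types shown):

#include <cstddef>
#include <functional>
#include <string>

// Every value the interpreter handles derives from a common base and
// supplies its own hash(), so the dictionary can hash any type through
// one interface.
struct Value {
    virtual ~Value() {}
    virtual std::size_t hash() const = 0;
};

struct IntValue : Value {
    long v;
    std::size_t hash() const { return std::hash<long>()(v); }
};

struct StringValue : Value {
    std::string v;
    std::size_t hash() const { return std::hash<std::string>()(v); }
};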

library for matrices in c++

I have a lot of elements in a matrix and when I access them manually it takes a pretty long time to eliminate all the bugs arising from wrong indexing... Is there a suitable library that can keep track of e.g. the neighbors, the numbering, whether an element is on the outer edge or not, and so on?
e.g.
VA=
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
Now what I would like to do is write a function that says something like
for every neighbor of the element at index 12 (which would be 41)
do something
I would like this to recognize only the elements at index 8 (31) and index 13 (42).
Right now I'm using vectors (vector<vector<int>> V;) but the code gets pretty difficult and clumsy both to write and to read, since I have these annoying if statements in every single function.
example:
for (int i = 0; i < MatrixSIZE; i++) {
    if ((i + 1) % rowSize != 0) {  // check that it's not on the outer edge
        // Do something
    }
}
What approach would you suggest?
Can boost::MultiArray help me here in some way? Are there any other similar libraries?
UPDATE:
So I'm looking more for a template that can easily access the elements than a template that can do matrix arithmetic.
Try LAPACK, a linear algebra package.
There is this: http://osl.iu.edu/research/mtl/
or this: http://www.robertnz.net/nm_intro.htm
If you Google it a bit, there's quite a few matrix libraries out there for C++.
This might inspire you:
Matrix classes in c++
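In that spirit, here is a bare-bones sketch of a wrapper that centralizes the indexing and edge logic (names made up; 4-connected neighbours, matching the index-8/index-13 example in the question):

#include <vector>

class Grid {
public:
    Grid(int rows, int cols) : data_(rows * cols), rows_(rows), cols_(cols) {}
    int& at(int r, int c) { return data_[r * cols_ + c]; }
    bool onEdge(int r, int c) const {
        return r == 0 || c == 0 || r == rows_ - 1 || c == cols_ - 1;
    }
    // Visit the 4-connected neighbours of (r, c); out-of-range ones are
    // skipped, so callers never write edge checks themselves.
    template <class F>
    void forEachNeighbour(int r, int c, F f) {
        static const int dr[4] = {-1, 1, 0, 0};
        static const int dc[4] = {0, 0, -1, 1};
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr >= 0 && nr < rows_ && nc >= 0 && nc < cols_)
                f(at(nr, nc));
        }
    }
private:
    std::vector<int> data_;
    int rows_, cols_;
};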
Is it used in a larger program? If not, it might be better suited to use R to deal with matrices.
If it's in a larger program, you can use a lib such as MTL.