Parsing version numbers to real numbers - coldfusion

I would like to determine if one version number is greater than another. The version number could be any of the following:
4
4.2
4.22.2
4.2.2.233
...as the version number is beyond my control, so I couldn't say how many dots could actually exist in the number.
Since the number is not really a real number, I can't simply say,
Is 4.7 > 4.2.2
How can I go about converting a number, such as 4.2.2 into a real number that could be checked against another version number?
I would preferably like a ColdFusion solution, but the basic concept would also be fine.

This is ripped from the plugin update code in Mango Blog, and updated a little bit. It should do exactly what you want. It returns 1 when argument 1 is greater, -1 when argument 2 is greater, and 0 when they are exact matches. (Note that 4.0.1 will be an exact match to 4.0.1.0)
It uses the CF list functions, instead of arrays, so you might see a small performance increase if you switched to arrays instead... but hey, it works!
function versionCompare( version1, version2 ){
var len1 = listLen(arguments.version1, '.');
var len2 = listLen(arguments.version2, '.');
var i = 0;
var piece1 = '';
var piece2 = '';
if (len1 gt len2){
arguments.version2 = arguments.version2 & repeatString('.0', len1-len2);
}else if (len2 gt len1){
arguments.version1 = arguments.version1 & repeatString('.0', len2-len1);
}
for (i=1; i lte listLen(arguments.version1, '.'); i=i+1){
piece1 = listGetAt(arguments.version1, i, '.');
piece2 = listGetAt(arguments.version2, i, '.');
if (piece1 neq piece2){
if (piece1 gt piece2){
return 1;
}else{
return -1;
}
}
}
//equal
return 0;
}
Running your example test:
<cfoutput>#versionCompare('4.7', '4.2.2')#</cfoutput>
prints:
1

If version 4 actually means 4.0.0, and version 4.2 actually means 4.2.0, you could easily convert the version to a simple integer.
suppose that every part of the version is between 0 and 99, then you could calculate an 'integer version' from X.Y.Z like this:
Version = X*100*100 + Y*100 + Z
If the ranges are bigger or smaller you could use factors higher or lower than 100.
Comparing the version then becomes easy.

Parse each number separately and compare them iteratively.
if (majorVersion > 4 &&
minorVersion > 2 &&
revision > 2)
{
// do something useful
}
// fail here
That's obviously not CF code, but you get the idea.

A version number is basically a period delimited array of numbers, so you can parse both versions into number arrays, and then compare each element in the first array to the corresponding element in the second array.
To get the array, do:
<cfset theArrayofNumbers = listToArray(yourVersionString, ".")>
and then you can do your comparisons.

You can split the string containing the version by periods, then start at the first index and compare down until one is greater than the other (or if they are equal, one contains a value the other does not).
I'm afraid I've never written in coldfusion but that would be the basic logic I'd follow.
This is a rough unoptimized example:
bool IsGreater(string one, string two)
{
int count;
string[] v1;
string[] v2;
v1 = one.Split(".");
v2 = two.Split(".");
count = (one.Length > two.Length) ? one.Length : two.Length;
for (int x=0;x<count;x++)
{
if (Convert.ToInt32(v1[x]) < Convert.ToInt32(v2[x]))
return false;
else if (Convert.ToInt32(v1[x]) > Convert.ToInt32(v2[x])
return true;
} // If they are the same it'll go to the next block.
// If you're here, they both were equal for the shortest version's digit count.
if (v1.Length > v2.Length)
return true; // The first one has additional subversions so it's greater.
}

There is no general way to convert multiple-part version numbers into real numbers, if there is no restriction on the size of each part (e.g. is 4.702.0 > 4.7.2?).
Normally you would define a custom comparison function by creating a sequence or array of version-number parts or components, so 4.7.2 is represented as [4, 7, 2] and 4.702.0 is [4, 702, 0]. Then you compare each element of the two arrays until they don't match:
left = [4, 7, 2]
right = [4, 702, 0]
# check index 0
# left[0] == 4, right[0] == 4
left[0] == right[0]
# equal so far
# check index 1
# left[1] == 7, right[1] == 702
left[1] < right[1]
# so left < right
I don't know about ColdFusion, but in some languages you can do a direct comparison with arrays or sequences. For example, in Python:
>>> left = [4, 7, 2]
>>> right = [4, 702, 0]
>>> left < right
True

Related

How to sort non-numeric strings by converting them to integers? Is there a way to convert strings to unique integers while being ordered?

I am trying to convert strings to integers and sort them based on the integer value. These values should be unique to the string, no other string should be able to produce the same value. And if a string1 is bigger than string2, its integer value should be greater. Ex: since "orange" > "apple", "orange" should have a greater integer value. How can I do this?
I know there are an infinite number of possibilities between just 'a' and 'b' but I am not trying to fit every single possibility into a number. I am just trying to possibly sort, let say 1 million values, not an infinite amount.
I was able to get the values to be unique using the following:
long int order = 0;
for (auto letter : word)
order = order * 26 + letter - 'a' + 1;
return order;
but this obviously does not work since the value for "apple" will be greater than the value for "z".
This is not a homework assignment or a puzzle, this is something I thought of myself. Your help is appreciated, thank you!
You are almost there ... just a minor tweaks are needed:
you are multiplying by 26
however you have letters (a..z) and empty space so you should multiply by 27 instead !!!
Add zeropading
in order to make starting letter the most significant digit you should zeropad/align the strings to common length... if you are using 32bit integers then max size of string is:
floor(log27(2^32)) = 6
floor(32/log2(27)) = 6
Here small example:
int lexhash(char *s)
{
int i,h;
for (h=0,i=0;i<6;i++) // process string
{
if (s[i]==0) break;
h*=27;
h+=s[i]-'a'+1;
}
for (;i<6;i++) h*=27; // zeropad missing letters
return h;
}
returning these:
14348907 a
28697814 b
43046721 c
373071582 z
15470838 abc
358171551 xyz
23175774 apple
224829626 orange
ordered by hash:
14348907 a
15470838 abc
23175774 apple
28697814 b
43046721 c
224829626 orange
358171551 xyz
373071582 z
This will handle all lowercase a..z strings up to 6 characters length which is:
26^6 + 26^5 +26^4 + 26^3 + 26^2 + 26^1 = 321272406 possibilities
For more just use bigger bitwidth for the hash. Do not forget to use unsigned type if you use the highest bit of it too (not the case for 32bit)
You can use position of char:
std::string s("apple");
int result = 0;
for (size_t i = 0; i < s.size(); ++i)
result += (s[i] - 'a') * static_cast<int>(i + 1);
return result;
By the way, you are trying to get something very similar to hash function.

code debugging: 'take a list of ints between 0-9, return largest number divisible by 3'

I'm trying to understand what is wrong with my current solution.
The problem is as follows:
using python 2.7.6"
You have L, a list containing some digits (0 to 9). Write a function answer(L) which finds the largest number that can be made from some or all of these digits and is divisible by 3. if it is not possible to make such a number, return 0 as the answer. L will contain anywhere from 1 to 9 digits. The same digit may appear multiple times in the list, but each element in the list may only be used once.
input: (int list) l = [3, 1, 4, 1]
output: (int) 4311
input (int list) l = [3 ,1 ,4 ,1 ,5, 9]
output: (int) = 94311
This is my code to tackle the problem:
import itertools
def answer(l):
'#remove the zeros to speed combinatorial analysis:'
zero_count = l.count(0)
for i in range(l.count(0)):
l.pop(l.index(0))
' # to check if a number is divisible by three, check if the sum '
' # of the individual integers that make up the number is divisible '
' # by three. (e.g. 431: 4+3+1 = 8, 8 % 3 != 0, thus 431 % 3 != 0)'
b = len(l)
while b > 0:
combo = itertools.combinations(l, b)
for thing in combo:
'# if number is divisible by 3, reverse sort it and tack on zeros left behind'
if sum(thing) % 3 == 0:
thing = sorted(thing, reverse = True)
max_div_3 = ''
for digit in thing:
max_div_3 += str(digit)
max_div_3 += '0'* zero_count
return int(max_div_3)
b -= 1
return int(0)
I have tested this assignment many times in my own sandbox and it always works.
However when I have submitted it against my instructor, I end up always failing 1 case.. with no explanation of why. I cannot interrogate the instructor's tests, they are blindly pitched against the code.
Does anyone have an idea of a condition under which my code fails to either return the largest integer divisible by 3 or, if none exists, 0?
The list always has at least one number in it.
It turns out that the problem was with the order of itertools.combinations(l, b)
and sorted(thing, reverse = True). The original code was finding the first match of n%3 == 0 but not necessarily the largest match. Performing sort BEFORE itertools.combinations allowed itertools to find the largest n%3 == 0.

How to store output of very large Fibonacci number?

I am making a program for nth Fibonacci number. I made the following program using recursion and memoization.
The main problem is that the value of n can go up to 10000 which means that the Fibonacci number of 10000 would be more than 2000 digit long.
With a little bit of googling, I found that i could use arrays and store every digit of the solution in an element of the array but I am still not able to figure out how to implement this approach with my program.
#include<iostream>
using namespace std;
long long int memo[101000];
long long int n;
long long int fib(long long int n)
{
if(n==1 || n==2)
return 1;
if(memo[n]!=0)
return memo[n];
return memo[n] = fib(n-1) + fib(n-2);
}
int main()
{
cin>>n;
long long int ans = fib(n);
cout<<ans;
}
How do I implement that approach or if there is another method that can be used to achieve such large values?
One thing that I think should be pointed out is there's other ways to implement fib that are much easier for something like C++ to compute
consider the following pseudo code
function fib (n) {
let a = 0, b = 1, _;
while (n > 0) {
_ = a;
a = b;
b = b + _;
n = n - 1;
}
return a;
}
This doesn't require memoisation and you don't have to be concerned about blowing up your stack with too many recursive calls. Recursion is a really powerful looping construct but it's one of those fubu things that's best left to langs like Lisp, Scheme, Kotlin, Lua (and a few others) that support it so elegantly.
That's not to say tail call elimination is impossible in C++, but unless you're doing something to optimise/compile for it explicitly, I'm doubtful that whatever compiler you're using would support it by default.
As for computing the exceptionally large numbers, you'll have to either get creative doing adding The Hard Way or rely upon an arbitrary precision arithmetic library like GMP. I'm sure there's other libs for this too.
Adding The Hard Way™
Remember how you used to add big numbers when you were a little tater tot, fresh off the aluminum foil?
5-year-old math
1259601512351095520986368
+ 50695640938240596831104
---------------------------
?
Well you gotta add each column, right to left. And when a column overflows into the double digits, remember to carry that 1 over to the next column.
... <-001
1259601512351095520986368
+ 50695640938240596831104
---------------------------
... <-472
The 10,000th fibonacci number is thousands of digits long, so there's no way that's going to fit in any integer C++ provides out of the box. So without relying upon a library, you could use a string or an array of single-digit numbers. To output the final number, you'll have to convert it to a string tho.
(woflram alpha: fibonacci 10000)
Doing it this way, you'll perform a couple million single-digit additions; it might take a while, but it should be a breeze for any modern computer to handle. Time to get to work !
Here's an example in of a Bignum module in JavaScript
const Bignum =
{ fromInt: (n = 0) =>
n < 10
? [ n ]
: [ n % 10, ...Bignum.fromInt (n / 10 >> 0) ]
, fromString: (s = "0") =>
Array.from (s, Number) .reverse ()
, toString: (b) =>
b .reverse () .join ("")
, add: (b1, b2) =>
{
const len = Math.max (b1.length, b2.length)
let answer = []
let carry = 0
for (let i = 0; i < len; i = i + 1) {
const x = b1[i] || 0
const y = b2[i] || 0
const sum = x + y + carry
answer.push (sum % 10)
carry = sum / 10 >> 0
}
if (carry > 0) answer.push (carry)
return answer
}
}
We can verify that the Wolfram Alpha answer above is correct
const { fromInt, toString, add } =
Bignum
const bigfib = (n = 0) =>
{
let a = fromInt (0)
let b = fromInt (1)
let _
while (n > 0) {
_ = a
a = b
b = add (b, _)
n = n - 1
}
return toString (a)
}
bigfib (10000)
// "336447 ... 366875"
Expand the program below to run it in your browser
const Bignum =
{ fromInt: (n = 0) =>
n < 10
? [ n ]
: [ n % 10, ...Bignum.fromInt (n / 10 >> 0) ]
, fromString: (s = "0") =>
Array.from (s) .reverse ()
, toString: (b) =>
b .reverse () .join ("")
, add: (b1, b2) =>
{
const len = Math.max (b1.length, b2.length)
let answer = []
let carry = 0
for (let i = 0; i < len; i = i + 1) {
const x = b1[i] || 0
const y = b2[i] || 0
const sum = x + y + carry
answer.push (sum % 10)
carry = sum / 10 >> 0
}
if (carry > 0) answer.push (carry)
return answer
}
}
const { fromInt, toString, add } =
Bignum
const bigfib = (n = 0) =>
{
let a = fromInt (0)
let b = fromInt (1)
let _
while (n > 0) {
_ = a
a = b
b = add (b, _)
n = n - 1
}
return toString (a)
}
console.log (bigfib (10000))
Try not to use recursion for a simple problem like fibonacci. And if you'll only use it once, don't use an array to store all results. An array of 2 elements containing the 2 previous fibonacci numbers will be enough. In each step, you then only have to sum up those 2 numbers. How can you save 2 consecutive fibonacci numbers? Well, you know that when you have 2 consecutive integers one is even and one is odd. So you can use that property to know where to get/place a fibonacci number: for fib(i), if i is even (i%2 is 0) place it in the first element of the array (index 0), else (i%2 is then 1) place it in the second element(index 1). Why can you just place it there? Well when you're calculating fib(i), the value that is on the place fib(i) should go is fib(i-2) (because (i-2)%2 is the same as i%2). But you won't need fib(i-2) any more: fib(i+1) only needs fib(i-1)(that's still in the array) and fib(i)(that just got inserted in the array).
So you could replace the recursion calls with a for loop like this:
int fibonacci(int n){
if( n <= 0){
return 0;
}
int previous[] = {0, 1}; // start with fib(0) and fib(1)
for(int i = 2; i <= n; ++i){
// modulo can be implemented with bit operations(much faster): i % 2 = i & 1
previous[i&1] += previous[(i-1)&1]; //shorter way to say: previous[i&1] = previous[i&1] + previous[(i-1)&1]
}
//Result is in previous[n&1]
return previous[n&1];
}
Recursion is actually discommanded while programming because of the time(function calls) and ressources(stack) it consumes. So each time you use recursion, try to replace it with a loop and a stack with simple pop/push operations if needed to save the "current position" (in c++ one can use a vector). In the case of the fibonacci, the stack isn't even needed but if you are iterating over a tree datastructure for example you'll need a stack (depends on the implementation though). As I was looking for my solution, I saw #naomik provided a solution with the while loop. That one is fine too, but I prefer the array with the modulo operation (a bit shorter).
Now concerning the problem of the size long long int has, it can be solved by using external libraries that implement operations for big numbers (like the GMP library or Boost.multiprecision). But you could also create your own version of a BigInteger-like class from Java and implement the basic operations like the one I have. I've only implemented the addition in my example (try to implement the others they are quite similar).
The main idea is simple, a BigInt represents a big decimal number by cutting its little endian representation into pieces (I'll explain why little endian at the end). The length of those pieces depends on the base you choose. If you want to work with decimal representations, it will only work if your base is a power of 10: if you choose 10 as base each piece will represent one digit, if you choose 100 (= 10^2) as base each piece will represent two consecutive digits starting from the end(see little endian), if you choose 1000 as base (10^3) each piece will represent three consecutive digits, ... and so on. Let's say that you have base 100, 12765 will then be [65, 27, 1], 1789 will be [89, 17], 505 will be [5, 5] (= [05,5]), ... with base 1000: 12765 would be [765, 12], 1789 would be [789, 1], 505 would be [505]. It's not the most efficient, but it is the most intuitive (I think ...)
The addition is then a bit like the addition on paper we learned at school:
begin with the lowest piece of the BigInt
add it with the corresponding piece of the other one
the lowest piece of that sum(= the sum modulus the base) becomes the corresponding piece of the final result
the "bigger" pieces of that sum will be added ("carried") to the sum of the following pieces
go to step 2 with next piece
if no piece left, add the carry and the remaining bigger pieces of the other BigInt (if it has pieces left)
For example:
9542 + 1097855 = [42, 95] + [55, 78, 09, 1]
lowest piece = 42 and 55 --> 42 + 55 = 97 = [97]
---> lowest piece of result = 97 (no carry, carry = 0)
2nd piece = 95 and 78 --> (95+78) + 0 = 173 = [73, 1]
---> 2nd piece of final result = 73
---> remaining: [1] = 1 = carry (will be added to sum of following pieces)
no piece left in first `BigInt`!
--> add carry ( [1] ) and remaining pieces from second `BigInt`( [9, 1] ) to final result
--> first additional piece: 9 + 1 = 10 = [10] (no carry)
--> second additional piece: 1 + 0 = 1 = [1] (no carry)
==> 9542 + 1 097 855 = [42, 95] + [55, 78, 09, 1] = [97, 73, 10, 1] = 1 107 397
Here is a demo where I used the class above to calculate the fibonacci of 10000 (result is too big to copy here)
Good luck!
PS: Why little endian? For the ease of the implementation: it allows to use push_back when adding digits and iteration while implementing the operations will start from the first piece instead of the last piece in the array.

Scala: decrement for-loop

I noticed that the following two for-loop cases behave differently sometimes while most of the time they are the same. I couldn't figure out the pattern, does anyone have any idea? Thanks!
case 1:
for (i <- myList.length - 1 to 0 by -1) { ... }
case 2:
for (i <- myList.length - 1 to 0) { ...}
Well, they definitely don't do the same things. n to 0 by -1 means "start at n and go to 0, counting backwards by 1. So:
5 to 0 by -1
// res0: scala.collection.immutable.Range = Range(5, 4, 3, 2, 1, 0)
Whereas n to 0 means "start at n and got to 0 counting forward by 1". But you'll notice that if n > 0, then there will be nothing in that list, since there is no way to count forward to 0 from anything greater than zero.
5 to 0
// res1: scala.collection.immutable.Range.Inclusive = Range()
The only way that they would produce the same result is if n=0 since counting from 0 to 0 is the same forwards and backwards:
0 to 0 by -1 // Range(0)
0 to 0 // Range(0)
In your case, since you're starting at myList.length - 1, they will produce the same result when the length of myList is 1.
In summary, the first version makes sense, because you want to count down to 0 by counting backward (by -1). And the second version doesn't make sense because you're not going to want to count forward to 0 from a length (which is necessarily non-negative).
First, we need to learn more about how value members to and by works.
to - Click Here for API documentation
to is a value member that appears in classes like int, double etc.
scala> 1 to 3
res35: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3)
Honestly, you don't have to use start to end by step and start.to(end, step) will also work if you are more comfortable working with in this world. Basically, to will return you a Range.Inclusive object if we are talking about integer inputs.
by - Click Here for API documentation
Create a new range with the start and end values of this range and a new step
scala> Range(1,8) by 3
res54: scala.collection.immutable.Range = Range(1, 4, 7)
scala> Range(1,8).by(3)
res55: scala.collection.immutable.Range = Range(1, 4, 7)
In the end, lets spend some time looking at what happens when the step is on a different direction from start to end. Like 1 to 3 by -1
Here is the source code of the Range class and it is actually pretty straightforward to read:
def by(step: Int): Range = copy(start, end, step)
So by is actually calling a function copy, so what is copy?
protected def copy(start: Int, end: Int, step: Int): Range = new Range(start, end, step)
So copy is literally recreate a new range with different step, then lets look at the constructor or Range itself.
Reading this paragraph of code
override final val isEmpty = (
(start > end && step > 0)
|| (start < end && step < 0)
|| (start == end && !isInclusive)
)
These cases will trigger the exception and your result will be a empty Range in cases like 1 to 3 by -1..etc.
Sorry the length of my post is getting out of control since I am also learning Scala now.
Why don't you just read the source code of Range, it is written by Martin Odersky and it is only 500 lines including comments :)

Wildcard String Search Algorithm

In my program I need to search in a quite big string (~1 mb) for a relatively small substring (< 1 kb).
The problem is the string contains simple wildcards in the sense of "a?c" which means I want to search for strings like "abc" or also "apc",... (I am only interested in the first occurence).
Until now I use the trivial approach (here in pseudocode)
algorithm "search", input: haystack(string), needle(string)
for(i = 0, i < length(haystack), ++i)
if(!CompareMemory(haystack+i,needle,length(needle))
return i;
return -1; (Not found)
Where "CompareMemory" returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.
My question is now if there is a fast algorithm for this (you don't have to give it, but if you do I would prefer c++, c or pseudocode). I started here
but I think most of the fast algorithms don't allow wildcards (by the way they exploit the nature of strings).
I hope the format of the question is ok because I am new here, thank you in advance!
A fast way, which is kind of the same thing as using a regexp, (which I would recommend anyway), is to find something that is fixed in needle, "a", but not "?", and search for it, then see if you've got a complete match.
j = firstNonWildcardPos(needle)
for(i = j, i < length(haystack)-length(needle)+j, ++i)
if(haystack[i] == needle[j])
if(!CompareMemory(haystack+i-j,needle,length(needle))
return i;
return -1; (Not found)
A regexp would generate code similar to this (I believe).
Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.
The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.
Step 2D does steps 2E to 2i until var c points to after end of S, at which point it is impossible to make any more matches.
Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
# 17, de?ga and de3ga match
# 24, de?ga and defg4 differ
# 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
/* jiw
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
Re: Concept-code for searching a long string for short targets,
where targets may contain wildcard characters.
The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].
Eg, for tests with *hay = jay use command like
./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
or with *hay = kay,
./stringsearch bc?e? jih? pa?j ?av??j
to exercise program.
Copyright 2011 James Waldby. Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
"etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
"etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
"ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
"ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
int L[L2], H[chars], T[chars], g, k, par;
// Step 1. Make arrays L, H, T.
for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
for (g=0; hay[g]; ++g) { // Make linked character lists for hay.
k = hay[g]; // In same loop, could count char freqs.
if (T[k]==L1) H[k] = T[k] = g;
T[k] = L[T[k]] = g;
}
// Step 2. Read and process target strings.
for (par=1; par<argc; ++par) {
int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
char * targ = argv[par];
enum { wild = '?' };
int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
printf ("Target %d=<%s>\n", par, targ);
// Step 2A. Choose 3 non-wild characters to follow.
// As is, chooses first 3 non-wilds for a,b,c.
// Could instead choose 3 rarest characters.
for (j=0; j<3; ++j) alpha[j] = -j;
for (i=j=0; targ[i] && j<3; ++i)
if (targ[i] != wild) {
r = alpha[j] = targ[i];
if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
|| alpha[0]==alpha[2]) continue;
at[j] = i;
L[T[r]] = L[g+i] = g+i;
++j;
}
if (j != 3) {
printf (" Too few target chars\n");
continue;
}
// Step 2B. Set a,b,c to head-of-list locations, set deltas.
da = at[0];
a = H[alpha[0]]; dab = at[1]-at[0];
b = H[alpha[1]]; dbc = at[2]-at[1];
c = H[alpha[2]]; dac = at[2]-at[0];
// Step 2C. See if key characters appear in haystack
if (a >= g || b >= g || c >= g) {
printf (" No match on some character\n");
continue;
}
for (g=0; hay[g]; ++g) printf ("%d", g%10);
printf ("\n%s\n", hay); // Show haystack, for user aid
// Step 2D. Search for match
while (c < g) {
// Step 2E. Enforce delta distances
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
// Step 2F. See if delta distances were met
if (a+dab==b && b+dbc==c && c<g) {
// Step 2G. Yes, so we have 3-letter-match and need to test whole match.
r = a-da;
for (k=0; targ[k]; ++k)
if ((hay[r+k] != targ[k]) && (targ[k] != wild))
break;
printf ("# %3d, %s and ", r, targ);
for (i=0; targ[i]; ++i) putchar(hay[r++]);
// Step 2H. Report match, if found
puts (targ[k]? " differ" : " match");
// Step 2I. Advance all of a,b,c, to go on looking
a = L[a]; ++ta;
b = L[b]; ++tb;
c = L[c]; ++tc;
}
}
printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
}
return 0;
}
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)
Regular expressions usually use a finite state automation-based search, I think. Try implementing that.