Subsequence search - regex

I have a large number of lists (35 MB in total) which I would like to search for subsequences: each term must appear in order but not necessarily consecutively. So 1, 2, 3 matches each of
1, 2, 3, 4, 5, 6
1, 2, 2, 3, 3, 3
but not
6, 5, 4, 3, 2, 1
123, 4, 5, 6, 7
(, is a delimiter, not characters to match.)
Short of running a regex (/1, ([^,]+, )*2, ([^,]+, )*3/ for the example) on tens or hundreds of thousands of sequences, how can I determine which sequences are a match? I can preprocess the sequences, though memory usage needs to stay reasonable (within a constant factor of the existing sequence size, say). The longest sequence is short, less than a kilobyte, so you can assume queries are short as well.

This reminds me of sequence alignment from bioinformatics, where you try to match a small snippet of DNA against a large database. The differences are your presumably larger alphabet, and your increased tolerance for arbitrarily long gaps.
You may find some inspiration looking at the existing tools and algorithms, notably Smith-Waterman and BLAST.

If the individual numbers are spread out over the file and not occurring on the majority of lines then a simple indexing of the line number where they occur could give you a speed up. This will however be slower if your data are lines of the same numbers repeated in different orders.
To build the index would only require a single pass of the data along these lines:
Hash<int, List<int>> index
line_number = 0
foreach(line in filereader)
{
line_number += 1
foreach(parsed_number in line)
index[parsed_number].append(line_number)
}
That index could be stored and reused for the dataset. To search on it would only need something like this. Please excuse the mixed pseudocode, I've tried to make it as clear as possible. It "return"s when it's out of possible matches and "yield"s a line number when all of the elements of the substring occur on that line.
// prefilled hash linking number searched for to a list of line numbers
// the lines should be in ascending order
Hash<int, List<int>> index
// The subsequence we're looking for
List<int> subsequence = {1, 2, 3}
int len = subsequence.length()
// Take all the lists from the index that match the numbers we're looking for
List<List<int>> lines = index[number] for number in subsequence
// holder for our current search row
// has the current lowest line number each element occurs on
int[] search = new int[len]
for(i = 0; i < len; i++)
search[i] = lines[i].pop()
while(true)
{
// minimum line number, substring position and whether they're equal
min, pos, eq = search[0], 0, true
// find the lowest line number and whether they all match
for(i = 0; i < len; i++)
{
if(search[i] < min)
min, pos, eq = search[i], i, false
else if (search[i] > min)
eq = false
}
// if they do all match every one of the numbers occurs on that row
if(eq)
{
yield min; // line has all the elements
foreach(list in lines)
if(list.empty()) // one of the numbers isn't in any more lines
return
// update the search to the next lowest line number for every substring element
for(i = 0; i < len; i++)
search[i] = lines[i].pop()
}
else
{
// the lowest line number for each element is not the same, so discard the lowest one
if(lines[pos].empty()) // there are no more lines for the element we'd be updating
return
search[pos] = lines[pos].pop();
}
}
Notes:
This could trivially be extended to store the position in the line as well as the line number and then only a little extra logic at the "yield" point would be able to determine an actual match instead of just that all the items are present.
I've used "pop" to show how it's moving through the line numbers but you don't actually want to be destroying your index every search.
I've assumed the numbers all fit into ints here. Extend it to longs or even map the string representation of each number to an int if you have really huge numbers.
There are some speedups to be had, especially in skipping lines at the "pop" stages, but I went for the clearer explanation.
Whether using this or another method you could also chop down the computation depending on the data. A single pass to work out whether each line is ascending, descending, all odd, all even, or what the highest and lowest numbers are could be used to cut down the search space for each substring. Whether these would be useful depends entirely on your dataset.
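A minimal Python sketch of this indexing idea (my own names; it assumes each sequence is one comma-separated line). Like the pseudocode above, it only reports lines on which every query term occurs; the in-order check would still need the per-line positions mentioned in the notes:
from collections import defaultdict

def build_index(lines):
    # number -> ascending list of line numbers on which it occurs
    index = defaultdict(list)
    for line_number, line in enumerate(lines):
        for token in line.split(','):
            number = int(token)
            if not index[number] or index[number][-1] != line_number:
                index[number].append(line_number)
    return index

def lines_containing_all(index, subsequence):
    # walk the posting lists in parallel, yielding lines that contain every query term
    postings = [index.get(number, []) for number in subsequence]
    if any(not p for p in postings):
        return
    pointers = [0] * len(postings)
    while True:
        current = [postings[k][pointers[k]] for k in range(len(postings))]
        lowest = min(current)
        if all(value == lowest for value in current):
            yield lowest                          # every query term occurs on this line
            pointers = [p + 1 for p in pointers]
        else:
            pointers[current.index(lowest)] += 1
        if any(pointers[k] >= len(postings[k]) for k in range(len(postings))):
            return

lines = ["1, 2, 3, 4, 5, 6", "1, 2, 2, 3, 3, 3", "6, 5, 4, 3, 2, 1", "123, 4, 5, 6, 7"]
index = build_index(lines)
print(list(lines_containing_all(index, [1, 2, 3])))   # [0, 1, 2]; line 2 still needs the order check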

Maybe I misunderstood, but isn't this straightforward, like this?
search = [1, 2, 3]
for sequence in sequences:
    sidx = 0
    for item in sequence:
        if item == search[sidx]:
            sidx += 1
            if sidx >= len(search): break
    if sidx >= len(search):
        print(sequence, "matches")
It seems to be O(N) for N sequences
and O(M) for a subsequence of length M.
Not sure if this would be that much faster than a regex, though?

Related

Perfect sum problem with fixed subset size

I am looking for the least time-complex algorithm that would solve a variant of the perfect sum problem (originally: finding all variable-size subset combinations from an array [*] of n integers that sum to a specific number x), where the subset size is fixed at k, and that returns the possible combinations without direct duplicates and also without indirect duplicates (where a combination contains exactly the same elements as another one, just in another order).
I'm aware this problem is NP-hard, so I am not expecting a perfect general solution but something that could at least run in a reasonable time in my case, with n close to 1000 and k around 10
Things I have tried so far:
Finding a combination, then doing successive modifications on it and its modifications
Let's assume I have an array such as:
s = [1,2,3,3,4,5,6,9]
So I have n = 8, and I'd like x = 10 for k = 3
I found thanks to some obscure method (bruteforce?) a subset [3,3,4]
From this subset I'm finding other possible combinations by taking two elements out of it and replacing them with other elements that sum the same, i.e. (3, 3) can be replaced by (1, 5) since both got the same sum and the replacing numbers are not already in use. So I obtain another subset [1,5,4], then I repeat the process for all the obtained subsets... indefinitely?
The main issue as suggested here is that it's hard to determine when it's done and this method is rather chaotic. I imagined some variants of this method but they really are work in progress
Iterating through the set to list all k long combinations that sum to x
Pretty self-explanatory. This is a naive method that does not work well in my case, since I have a pretty large n and a k that is not small enough to avoid a catastrophically big number of combinations (the number of combinations is on the order of 10^27!)
I experimented with several mechanisms for restricting the area of research instead of stupidly iterating through all possibilities, but it's rather complicated and still a work in progress
What would you suggest? (Snippets can be in any language, but I prefer C++)
[*] To clear the doubt about whether or not the base collection can contain duplicates, I used the term "array" instead of "set" to be more precise. The collection can contain duplicate integers in my case, and quite a lot of them, with about 70 distinct integers for 1000 elements (rounded counts), for example
With a reasonable sum limit this problem might be solved using an extension of the dynamic programming approach for the subset sum problem, or the coin change problem with a predetermined number of coins. Note that we can count all variants in pseudopolynomial time O(x*n), but the output size might grow exponentially, so generating all variants might be a problem.
Make a 3D array, list or vector with outer dimension x+1 (indices 0 through x), for example A[][][]. Every element A[p] of this list contains a list of the possible subsets with sum p.
We can walk through all elements (call the current element item) of the initial "set" (I noticed repeating elements in your example, so it is not a true set).
Now scan the A[] list from the last entry to the beginning. (This trick helps to avoid reusing the same item.)
If A[i - item] contains subsets of size < k, we can add all these subsets to A[i], appending item to each.
After a full scan, A[x] will contain the subsets of size k and less that sum to x, and we can filter out only those of size k.
Example output of my quickly-made Delphi program for the following data:
Lst := [1,2,3,3,4,5,6,7];
k := 3;
sum := 10;
3 3 4
2 3 5 //distinct 3's
2 3 5
1 4 5
1 3 6
1 3 6 //distinct 3's
1 2 7
To exclude variants with distinct repeated elements (if needed), we can use a non-first occurrence only for subsets already containing the first occurrence of item (so 3 3 4 will be valid while the second 2 3 5 won't be generated).
I literally translated my Delphi code into C++ (it looks weird, I think :)
#include <iostream>
#include <vector>
using namespace std;

int main()
{
vector<vector<vector<int>>> A;
vector<int> Lst = { 1, 2, 3, 3, 4, 5, 6, 7 };
int k = 3;
int sum = 10;
A.push_back({ {0} }); //fictive array to make non-empty variant
for (int i = 0; i < sum; i++)
A.push_back({{}});
for (int item : Lst) {
for (int i = sum; i >= item; i--) {
for (int j = 0; j < A[i - item].size(); j++)
if (A[i - item][j].size() < k + 1 &&
A[i - item][j].size() > 0) {
vector<int> t = A[i - item][j];
t.push_back(item);
A[i].push_back(t); //add new variant including current item
}
}
}
//output needed variants
for (int i = 0; i < A[sum].size(); i++)
if (A[sum][i].size() == k + 1) {
for (int j = 1; j < A[sum][i].size(); j++) //excluding fictive 0
cout << A[sum][i][j] << " ";
cout << endl;
}
}
Here is a complete solution in Python. Translation to C++ is left to the reader.
Like the usual subset sum, generation of the doubly linked summary of the solutions is pseudo-polynomial. It is O(count_values * distinct_sums * depths_of_sums). However actually iterating through them can be exponential. But using generators the way I did avoids using a lot of memory to generate that list, even if it can take a long time to run.
from collections import namedtuple
# This is a doubly linked list.
# (value, tail) will be one group of solutions. (next_answer) is another.
SumPath = namedtuple('SumPath', 'value tail next_answer')
def fixed_sum_paths (array, target, count):
    # First find counts of values to handle duplications.
    value_repeats = {}
    for value in array:
        if value in value_repeats:
            value_repeats[value] += 1
        else:
            value_repeats[value] = 1
    # paths[depth][x] will be all subsets of size depth that sum to x.
    paths = [{} for i in range(count+1)]
    # First we add the empty set.
    paths[0][0] = SumPath(value=None, tail=None, next_answer=None)
    # Now we start adding values to it.
    for value, repeats in value_repeats.items():
        # Reversed depth avoids seeing paths we will find using this value.
        for depth in reversed(range(len(paths))):
            for result, path in paths[depth].items():
                for i in range(1, repeats+1):
                    if count < i + depth:
                        # Do not fill in too deep.
                        break
                    result += value
                    if result in paths[depth+i]:
                        path = SumPath(
                            value=value,
                            tail=path,
                            next_answer=paths[depth+i][result]
                            )
                    else:
                        path = SumPath(
                            value=value,
                            tail=path,
                            next_answer=None
                            )
                    paths[depth+i][result] = path
                    # Subtle bug fix, a path for value, value
                    # should not lead to value, other_value because
                    # we already inserted that first.
                    path = SumPath(
                        value=value,
                        tail=path.tail,
                        next_answer=None
                        )
    return paths[count][target]

def path_iter(paths):
    if paths.value is None:
        # We are the tail
        yield []
    else:
        while paths is not None:
            value = paths.value
            for answer in path_iter(paths.tail):
                answer.append(value)
                yield answer
            paths = paths.next_answer

def fixed_sums (array, target, count):
    paths = fixed_sum_paths(array, target, count)
    return path_iter(paths)

for path in fixed_sums([1,2,3,3,4,5,6,9], 10, 3):
    print(path)
Incidentally for your example, here are the solutions:
[1, 3, 6]
[1, 4, 5]
[2, 3, 5]
[3, 3, 4]
You should first sort the so-called array. Second, you should determine whether the problem is actually solvable, to save time: take the largest k elements and see if their sum is larger than or equal to the value x. If it is smaller, you are done, it is not possible to do something like that. If it is exactly equal, you are also done, and there are no other combinations. O(n) feels nice, doesn't it? If it is larger, then you have a lot of work to do. You need to store the combinations in a separate array. You replace the smallest of the k numbers with the smallest element in the array; if the sum is still larger than x you do it for the second and third and so on, until you get something smaller than x. Once you reach a point where the sum is smaller than x, you can go ahead and start increasing the value at the last position you stopped at until you hit x; once you hit x, that is your combination. Then you can take the previous element: so if you had 1, 1, 5, 6 in your combination, you can grab the 1 as well, add it to your smallest element 5 to get 6, and next you check whether you can write this 6 as a combination of two values, stopping once you hit the value. Then you repeat for the others as well. The problem can be solved in O(n!) time in the worst case. As for the 10^27 combinations: do you even have that much space? At roughly 3 bits of header and 8 bits per integer you would need about 9.8765*10^25 terabytes just to store that colossal array, more memory than a supercomputer, so you should worry about whether your computer can even store the output rather than about whether you can solve the problem. With that many combinations, even a quadratic solution would crash your computer, and quadratic is a long way off from O(n!).
A brute force method using recursion might look like this...
For example, given variables set, x, k, the following pseudo code might work:
setSumStructure find(int[] set, int x, int k, int setIdx)
{
int sz = set.length - setIdx;
if (sz < k) return null;
if (sz == k) check whether the sum of set[setIdx] -> set[set.size] == x. if it does, return the set together with the sum, else return null;
for (int i = setIdx; i < set.size - (k - 1); i++)
filter(find (set, x - set[i], k - 1, i + 1));
return filteredSets;
}
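For illustration, here is a minimal Python version of that recursive brute force, with the obvious pruning on a sorted array (names are mine; it recovers the four combinations from the example above, but plain recursion like this will not scale to n = 1000 and k = 10):
def fixed_size_subset_sums(values, x, k):
    values = sorted(values)
    results = []

    def recurse(start, remaining_sum, remaining_count, chosen):
        if remaining_count == 0:
            if remaining_sum == 0:
                results.append(list(chosen))
            return
        previous = None
        for i in range(start, len(values) - remaining_count + 1):
            v = values[i]
            if v == previous:          # skip duplicate branches at the same depth
                continue
            if v > remaining_sum:      # values are sorted, so no later value can fit either
                break
            previous = v
            chosen.append(v)
            recurse(i + 1, remaining_sum - v, remaining_count - 1, chosen)
            chosen.pop()

    recurse(0, x, k, [])
    return results

print(fixed_size_subset_sums([1, 2, 3, 3, 4, 5, 6, 9], 10, 3))
# [[1, 3, 6], [1, 4, 5], [2, 3, 5], [3, 3, 4]]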

C++: Recursive function for variations with repetitions, ordered by amount of different letters

I have a function that generates variations like this: 111, 112, ..., 133, 211, 212, ..., 233, 311, ..., 333. Length of generated sequences always matches length of dictionary; with 4 symbols it'd be 1111 to 4444.
This is done in a brute force algorithm for graph coloring. We're trying to find the right sequence that has as few different colors as possible, i.e. if both 12343 and 12321 are solutions, we'd prefer the latter.
Right now I go and check each and every sequence to see if it's right, and then store the best result in the process. It's not really good code.
So professor asked me to write a function that generates variations in specific order. These sequences should come ordered by their amount of different numbers, like this: 111, 222, 333; 112, 113, 121, …, 323; 123, 213. In this case, if we found out that, say, 121 is right, we just stop, because we already know that it’s the best solution.
The idea is to skip as much sequence checks as possible so the code would run faster. Please help :)
Right now I use this code:
init function
std::vector<int> res; //contains the "alphabet"
res.reserve(V);
for (int i = V - 1; i >= 0; i--) {
res.push_back(i);
}
std::vector<int> index(res.size());
std::vector<int> bestresult; //here goes the best answer if it's found
bestresult.reserve(V);
for (int i = V - 1; i >= 0; i--) {
bestresult.push_back(i);
}
int bestcolors = V;
permutate(res, index, 0, bestresult, bestcolors);
result = bestresult;
permutate:
void Graph::permutate(const std::vector<int>& s, std::vector<int>& index, std::size_t depth, std::vector<int>& bestres, int &lowestAmountOfColors)
{
if (depth == s.size()) {
//doing all needed checks and saving bestresult here;
return;
}
for (std::size_t i = 0; i < s.size(); ++i) {
index[depth] = i;
permutate(s, index, depth + 1, bestres, lowestAmountOfColors);
}
}
How can I alter these functions?
The challenge is to find all permutations of colors so that you can test if they are a valid graph coloring. Unfortunately, it is exponential. So we need to search the permutations in a way that we check the smallest solutions first, and we need to prune the solution space dramatically.
To find the smallest solutions first, we must limit the number of colors available, and exhaust those permutations before we grow the number of colors. Pretty simple. We just need a function that considers n colors for N vertices. The number of vertices remains fixed, but we consider n=1, then n=2, etc.
Within the function, we know that we need various combinations of 1, 2, ... n with enough repetition to get a total of N different values. So I made a vector of counts. This vector has n entries, and the values sum up to N.
For example, if we are considering three-color solutions for a graph with 8 vertices, one possible count array would be {4, 3, 1}, which would be used to generate the candidate {1, 1, 1, 1, 2, 2, 2, 3}. Color 1 appears 4 times. Color 2 appears 3 times. Color 3 appears 1 time.
The cool thing about this counts array is that as long as it is sorted greatest to least, then its combinations cannot duplicate any other combination we have considered, because colors are interchangeable. (Okay, not entirely accurate, there are some duplications when colors have the same count, but we eliminated a lot of permutations from ever being looked at, which is the whole point).
Once you reduce the counts array to an actual candidate solution, you can find all orderings using combinations, not permutations. This will generate fewer candidates. Google next_combination to find some good code showing how to do this.
When we generate the counts array, I initialized all values to 1, then added all the remaining counts to the first color. I search ALL combinations which meet the counts array. Then I get the next candidate by shifting the counts to the right in such a way that it remains sorted.
So to sum up, find_minimum_graph_coloring has a for loop which calls solve_for_n. That function generates all the possible counts-arrays for that value of n, and calls another function. That function checks all combinations for that counts-array.
The first for loop checks smaller numbers of colors first, so we can return immediately upon finding a solution. The counts-array notation eliminates many equivalent colorations so if we consider {1, 1, 2} then we will never try {2, 2, 1}
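To make that enumeration order concrete, here is a rough Python sketch (my own names and code, not the poster's): it generates the descending counts arrays for n = 1, 2, ... colors and expands each one into its distinct candidate colorings. For brevity it deduplicates permutations of the multiset instead of using next_combination as suggested above.
from itertools import permutations

def counts_arrays(n_colors, n_vertices, maximum=None):
    # every length-n_colors list, sorted greatest to least, entries >= 1, summing to n_vertices
    if maximum is None:
        maximum = n_vertices
    if n_colors == 1:
        if 1 <= n_vertices <= maximum:
            yield [n_vertices]
        return
    for first in range(min(maximum, n_vertices - (n_colors - 1)), 0, -1):
        for rest in counts_arrays(n_colors - 1, n_vertices - first, first):
            yield [first] + rest

def candidates(n_vertices, max_colors):
    for n_colors in range(1, max_colors + 1):          # fewest colors first
        for counts in counts_arrays(n_colors, n_vertices):
            multiset = []
            for color, count in enumerate(counts, start=1):
                multiset += [color] * count
            for coloring in sorted(set(permutations(multiset))):
                yield coloring

for c in candidates(3, 3):
    print(c)
# (1, 1, 1) first, then the two-color candidates (1, 1, 2), (1, 2, 1), (2, 1, 1), then (1, 2, 3) etc.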

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily effective) that, given a string, finds and prints two identical subsequences (by print I mean color them, for example). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing what I am looking for is called "tight twins", if it helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
#include <windows.h>
using namespace std;
HANDLE hConsole;
hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
var n = s.length,
char_idxs = {};
for (var i=0; i<n; i++){
if (char_idxs[s[i]] == undefined){
char_idxs[s[i]] = [i];
} else {
char_idxs[s[i]].push(i);
}
}
var duplicates = new Set();
for (var i in char_idxs){
// character with odd count
if (char_idxs[i].length & 1){
return false;
}
if (char_idxs[i].length > 2){
for (let j of char_idxs[i]){
duplicates.add(j);
}
}
}
function f(i,j,js){
// base case positive
if (js.size == n/2 && j == n){
return true;
}
// base case negative
if (j > n || (n - j < n/2 - js.size)){
return false;
}
// i is not less than j
if (i >= j) {
return f(i,j + 1,js);
}
// this i is in the list of js
if (js.has(i)){
return f(i + 1,j,js);
// yet to find twin, no match
} else if (s[i] != s[j]){
return f(i,j + 1,js);
} else {
// maybe it's a twin and maybe it's a duplicate
if (duplicates.has(j)) {
var _js = new Set(js);
_js.add(j);
return f(i,j + 1,js) | f(i + 1,j + 1,_js);
// it's a twin
} else {
js.add(j);
return f(i + 1,j + 1,js);
}
}
}
return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Seek tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and every character appears in it an even number of times; then, its length being L, check whether it contains two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is to check all of them. There are more interesting ways to seek for tight twins, though.
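A tiny brute-force rendering of Approach 2 in Python (my own sketch, exponential and only workable for very short strings); instead of all 2**L divisions it only tries the balanced ones, C(L, L/2) of them:
from itertools import combinations

def find_tight_twins(s):
    n = len(s)
    for start in range(n):
        for end in range(start + 2, n + 1, 2):            # even-length substrings only
            sub = s[start:end]
            L = len(sub)
            if any(sub.count(c) % 2 for c in set(sub)):   # every character count must be even
                continue
            for a_positions in combinations(range(L), L // 2):
                b_positions = [i for i in range(L) if i not in a_positions]
                a = ''.join(sub[i] for i in a_positions)
                b = ''.join(sub[i] for i in b_positions)
                if a == b:
                    return (start, [start + i for i in a_positions],
                                   [start + i for i in b_positions])
    return None

print(find_tight_twins("231213231"))
# (2, [2, 3, 5], [4, 6, 7]) -- the twin "123" taken twice, covering indexes 2..7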
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think a NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here, but I notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Update and check if the substring containing brackets is correct

The problem is:
Given a string of length n, and m queries.
Each query is one of two cases:
Change the i-th character to the opposite bracket
Check if the substring from the u-th character to the v-th character is a correct bracket expression or not. If yes, print 1, else print 0.
Time limit: 0.2s
In these cases, a correct bracket expression is defined as:
a string of length 0
a string containing only the characters '(' and ')'
if A is a correct bracket expression, then (A) is also a correct bracket expression
if A and B are correct bracket expressions, then AB is also a correct bracket expression
My main idea is similar to the problem 380C on CodeForces, http://codeforces.com/blog/entry/10363
Then I check if the longest valid subsequence in the given range is equal to the length of the range, which gives me the answer. But I got a time limit error.
I have been searching for this on the internet all day but I haven't found the answer. I will be grateful if you help me. :)
Here is my code: https://github.com/hoangvanthien/GH_CppFiles/blob/master/SPOJ/NKBRACKE.cpp
The conditions for a valid bracket sequence are:
Length of the substring is even.
The numbers of open and close brackets are equal.
At no point in the sequence does the number of close brackets exceed the number of open brackets.
So we can convert the original string of open and close brackets into a sequence of numbers, each representing the difference between the number of open and close brackets from the beginning of the string: for each open bracket we add one, and for each close bracket we subtract one.
For example, for ((())))) -> we have the sequence { 1, 2 , 3, 2 , 1, 0, -1, -2 }
So, to test whether a substring is valid, for example the (0-based) substring (2, 5), which is ())), we need to see whether at any point the difference between open and close brackets goes negative. From the above sequence we have {3, 2, 1, 0}, and we need to subtract 2 from each element, to account for the brackets at the beginning of the string which are not in the substring -> we have {1, 0, -1, -2} -> so the substring is invalid.
Once you understand the above idea, we can build our solution for the problem.
What we need is a data structure which can quickly update a range. For example, if we change from ( to ) at index 3, we need to subtract 2 from every element from index 3 onward.
And we need the data structure to quickly return the minimum value of a range (we only need to care about the minimum value).
And from all of those requirements, we can use a segment tree, which gives you O(log n) update and O(log n) retrieval.
Pseudo code
SegmentTree tree;
Initialize the tree with original sequence
for each query
if( query type is update)
if(change from ')' to '(')
increase all value by 2 from range index to n
else if(change from '(' to ')')
decrease all value by 2 from range index to n
else
int min = tree.getMinimumValueInRange(u, v)
int notInSubstring = tree.getMinimumValueInRange(u - 1, u - 1)
if(min - notInSubstring < 0)
print Invalid
else if( length of substring is not even)
print Invalid
else if( tree.getMinimumValueInRange(v, v) != notInSubstring) //number of open and close brackets are not equal
print Invalid
else
print Valid
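For completeness, here is an illustrative Python sketch of that idea (my own code and names, not from the linked submission; a real solution under the 0.2 s limit would need the same structure in C++): a segment tree over the prefix-balance array supporting lazy range-add updates and range-minimum queries, with 0-based indices.
class SegmentTree:
    def __init__(self, values):
        self.n = len(values)
        self.mins = [0] * (4 * self.n)
        self.lazy = [0] * (4 * self.n)
        self._build(1, 0, self.n - 1, values)

    def _build(self, node, lo, hi, values):
        if lo == hi:
            self.mins[node] = values[lo]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, values)
        self._build(2 * node + 1, mid + 1, hi, values)
        self.mins[node] = min(self.mins[2 * node], self.mins[2 * node + 1])

    def _push(self, node):
        # push the pending range-add down to the children
        for child in (2 * node, 2 * node + 1):
            self.mins[child] += self.lazy[node]
            self.lazy[child] += self.lazy[node]
        self.lazy[node] = 0

    def add(self, left, right, delta, node=1, lo=0, hi=None):
        # add delta to every element in [left, right]
        if hi is None:
            hi = self.n - 1
        if right < lo or hi < left:
            return
        if left <= lo and hi <= right:
            self.mins[node] += delta
            self.lazy[node] += delta
            return
        self._push(node)
        mid = (lo + hi) // 2
        self.add(left, right, delta, 2 * node, lo, mid)
        self.add(left, right, delta, 2 * node + 1, mid + 1, hi)
        self.mins[node] = min(self.mins[2 * node], self.mins[2 * node + 1])

    def min_query(self, left, right, node=1, lo=0, hi=None):
        # minimum value in [left, right]
        if hi is None:
            hi = self.n - 1
        if right < lo or hi < left:
            return float('inf')
        if left <= lo and hi <= right:
            return self.mins[node]
        self._push(node)
        mid = (lo + hi) // 2
        return min(self.min_query(left, right, 2 * node, lo, mid),
                   self.min_query(left, right, 2 * node + 1, mid + 1, hi))

s = list("(()))(")
prefix, balance = [], 0
for ch in s:                       # prefix balance: +1 for '(' and -1 for ')'
    balance += 1 if ch == '(' else -1
    prefix.append(balance)
tree = SegmentTree(prefix)

def is_valid(u, v):                # query type 2, 0-based inclusive substring [u, v]
    if (v - u + 1) % 2:
        return False
    before = tree.min_query(u - 1, u - 1) if u > 0 else 0
    if tree.min_query(v, v) != before:          # open and close counts differ
        return False
    return tree.min_query(u, v) - before >= 0   # balance never dips below zero

print(is_valid(0, 3))              # "(())" -> True
print(is_valid(2, 5))              # ")))(" -> False
tree.add(4, len(s) - 1, 2)         # query type 1: flip s[4] from ')' to '('
print(is_valid(2, 5))              # now "))((" -> still False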

Finding missing number using binary search

I am reading book on programming pearls.
Question: Given a sequential file that contains at most four billion
32 bit integers in random order, find a 32-bit integer that isn't in
the file (and there must be at least one missing). The problem has to
be solved with only a few hundred bytes of main memory and several
sequential files.
Solution: To set this up as a binary search we have to define a range,
a representation for the elements within the range, and a probing
method to determine which half of a range holds the missing integer.
How do we do this?
We'll use as the range a sequence of integers known to contain at least
one missing element, and we'll represent the range by a file
containing all the integers in it. The insight is that we can probe a
range by counting the elements above and below its midpoint: either
the upper or the lower range has at most half the elements of the total
range. Because the total range has a missing element, the smaller half
must also have a missing element. These are most of the ingredients of a
binary search algorithm for the above problem.
The above text is copyright of Jon Bentley, from the Programming Pearls book.
Some info is provided at the following link:
"Programming Pearls" binary search help
How do we search by making passes over the file with binary search? I didn't follow the example given in the above link. Please help me understand the logic with just 5 integers rather than a million.
Why don't you re-read the answer in the post "Programming Pearls" binary search help? It explains the process on 5 integers, as you ask.
The idea is that you parse each list and break it into 2 separate lists (this is where the "binary" part comes from) based on the value of the first bit.
I.e. showing binary representation of actual numbers
Original List "": 001, 010, 110, 000, 100, 011, 101 => (broken into)
(we remove the first bit and append it to the "name" of the new list)
To form each of the below lists we took values starting with [0 or 1] from the list above
List "0": 01, 10, 00, 11 (is formed from subset 001, 010, 000, 011 of List "" by removing the first bit and appending it to the "name" of the new list)
List "1": 10, 00, 01 (is formed from subset 110, 100, 101 of List "" by removing the first bit and appending it to the "name" of the new list)
Now take one of the resulting lists in turn and repeat the process:
List "0" becomes your original list and you break it into
List "0***0**" and
List "0***1**" (the bold numbers are again the 1 [remaining] bit of the numbers in the list being broken)
Carry on until you end up with the empty list(s).
EDIT
Process step by step:
List "": 001, 010, 110, 000, 100, 011, 101 =>
List "0": 01, 10, 00, 11 (from subset 001, 010, 000, 011 of the List "") =>
List "00": 1, 0 (from subset 01, 00 of the List "0") =>
List "000": 0 [final result] (from subset 0 of the List "00")
List "001": 1 [final result] (from subset 1 of the List "00")
List "01": 0, 1 (from subset 10, 11 of the List "0") =>
List "010": 0 [final result] (from subset 0 of the List "01")
List "011": 1 [final result] (from subset 1 of the List "01")
List "1": 10, 00, 01 (from subset 110, 100, 101 of the List "") =>
List "10": 0, 1 (from subset 00, 01 of the List "1") =>
List "100": 0 [final result] (from subset 0 of the List "10")
List "101": 1 [final result] (from subset 1 of the List "10")
List "11": 0 (from subset 10 of the List "1") =>
List "110": 0 [final result] (from subset 0 of the List "11")
List "111": absent [final result] (from subset EMPTY of the List "11")
The positive of this method is that it will allow you to find ANY number of missing numbers in the set - i.e. if more than one is missing.
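Here is a compact Python sketch of that partition-by-leading-bit idea (my own code, assuming distinct inputs as in the example), which finds every missing value among b-bit numbers:
def find_missing(numbers, bits):
    missing = []

    def recurse(values, prefix, bit):
        if bit < 0:                          # all bits consumed: prefix names one value
            if not values:
                missing.append(prefix)
            return
        if len(values) == 1 << (bit + 1):    # subrange complete (inputs assumed distinct)
            return
        zeros = [v for v in values if not (v >> bit) & 1]
        ones = [v for v in values if (v >> bit) & 1]
        recurse(zeros, prefix, bit - 1)
        recurse(ones, prefix | (1 << bit), bit - 1)

    recurse(numbers, 0, bits - 1)
    return missing

# the example List "": 001, 010, 110, 000, 100, 011, 101 (111 is absent)
print(find_missing([0b001, 0b010, 0b110, 0b000, 0b100, 0b011, 0b101], 3))   # [7]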
P.S. AFAIR, for a single missing number out of the complete range there is an even more elegant solution: XOR all the numbers.
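That XOR remark as a short sketch (assuming exactly one number is missing from the complete range 0 .. 2**bits - 1):
from functools import reduce
from operator import xor

def missing_by_xor(numbers, bits):
    full = reduce(xor, range(1 << bits))     # XOR of the complete range 0 .. 2**bits - 1
    return full ^ reduce(xor, numbers)       # everything present cancels out

print(missing_by_xor([0b001, 0b010, 0b110, 0b000, 0b100, 0b011, 0b101], 3))  # 7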
The idea is to solve easier problem:
Is the missing value in the range [minVal, X] or in (X, maxVal]?
If you know this, you can move X and check again.
For example, you have 3, 4, 1, 5 (2 is missing).
You know that minVal = 1, maxVal = 5.
Range = [1, 5], X = 3, there should be 3 integers in range [1, 3] and 2 in range [4, 5]. There are only 2 in range [1, 3], so you are looking in range [1, 3]
Range = [1, 3], X = 2. There is only 1 value in range [1, 2] (2 were expected), so you are looking in range [1, 2]
Range = [1, 2], X = 1. There is 1 value in range [1, 1], as expected, and no values in range [2, 2], so 2 is your answer.
EDIT: Some pseudo-C++ code:
minVal = 1, maxVal = 5; //choose correct values
while(minVal < maxVal){
int X = (minVal + maxVal) / 2
int leftNumber = how many in range [minVal, X]
int rightNumber = how many in range [X + 1, maxVal]
if(leftNumber < (X - minVal + 1))maxVal = X
else minVal = X + 1
}
Here's a simple C solution which should illustrate the technique. To abstract away any tedious file I/O details, I'm assuming the existence of the following three functions:
unsigned long next_number (void) reads a number from the file and returns it. When called again, the next number in the file is returned, and so on. Behavior when the end of file is encountered is undefined.
int numbers_left (void) returns a true value if there are more numbers available to be read using next_number(), false if the end of the file has been reached.
void return_to_start (void) rewinds the reading position to the start of the file, so that the next call to next_number() returns the first number in the file.
I'm also assuming that unsigned long is at least 32 bits wide, as required for conforming ANSI C implementations; modern C programmers may prefer to use uint32_t from stdint.h instead.
Given these assumptions, here's the solution:
unsigned long count_numbers_in_range (unsigned long min, unsigned long max) {
unsigned long count = 0;
return_to_start();
while ( numbers_left() ) {
unsigned long num = next_number();
if ( num >= min && num <= max ) {
count++;
}
}
return count;
}
unsigned long find_missing_number (void) {
unsigned long min = 0, max = 0xFFFFFFFF;
while ( min < max ) {
unsigned long midpoint = min + (max - min) / 2;
unsigned long count = count_numbers_in_range( min, midpoint );
if ( count < midpoint - min + 1 ) {
max = midpoint; // at least one missing number below midpoint
} else {
min = midpoint + 1; // no missing numbers below midpoint, must be above
}
}
return min;
}
One detail to note is that min + (max - min) / 2 is the safe way to calculate the average of min and max; it won't produce bogus results due to overflowing intermediate values like the seemingly simpler (min + max) / 2 might.
Also, even though it would be tempting to solve this problem using recursion, I chose an iterative solution instead for two reasons: first, because it (arguably) shows more clearly what's actually being done, and second, because the task was to minimize memory use, which presumably includes the stack too.
Finally, it would be easy to optimize this code, e.g. by returning as soon as count equals zero, by counting the numbers in both halves of the range in one pass and choosing the one with more missing numbers, or even by extending the binary search to n-ary search for some n > 2 to reduce the number of passes. However, to keep the example code as simple as possible, I've left such optimizations unmade. If you like, you may want to, say, try modifying the code so that it requires at most eight passes over the file instead of the current 32. (Hint: use a 16-element array.)
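As a hedged sketch of that suggested modification (my own Python, with an in-memory list standing in for the file): each pass counts the numbers of the current candidate range into a 16-element array keyed by their next 4 bits, then descends into a deficient bucket, resolving 32 bits in 8 passes.
def find_missing_16ary(numbers, bits=32):
    prefix = 0
    for shift in range(bits - 4, -1, -4):
        counts = [0] * 16
        for num in numbers:                     # one sequential "pass over the file"
            if num >> (shift + 4) == prefix:    # only numbers still in the candidate range
                counts[(num >> shift) & 0xF] += 1
        # each bucket covers 2**shift values; a deficient bucket must hold a missing number
        bucket = next(b for b in range(16) if counts[b] < (1 << shift))
        prefix = (prefix << 4) | bucket
    return prefix

# toy check with 8-bit numbers (2 passes): every value except 0xA5 is present
numbers = [v for v in range(256) if v != 0xA5]
print(hex(find_missing_16ary(numbers, bits=8)))   # 0xa5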
Actually, suppose we have a range of integers from a to b, i.e. [a..b],
and our sequence contains b-a of those integers. That means exactly one is missing.
And if only one is missing, we can calculate the result using only a single loop.
First we calculate the sum of all integers in the range [a..b], which equals:
sum = (a + b) * (b - a + 1) / 2
Then we calculate the sum of all integers in our sequence:
long sum1 = 0;
for (int i = 0; i < b - a; i++)
sum1 += arr[i];
Then we can find the missing element as the difference of those two sums:
long result = sum - sum1;
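A quick runnable check of that idea (a toy Python example of my own, with a = 1, b = 5 and 4 missing):
a, b = 1, 5
sequence = [3, 1, 5, 2]                    # 4 is missing
expected = (a + b) * (b - a + 1) // 2      # sum of the complete range [a..b]
print(expected - sum(sequence))            # prints 4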
When you've seen 2^31 zeros or ones in the ith digit place, then your answer has a one or zero (respectively) in the ith place. (Ex: 2^31 ones in the 5th binary position means the answer has a zero in the 5th binary position.)
First draft of c code:
uint32_t binaryHistogram[32], *list4BILLION, answer, placesChecked[32];
uint64_t limit = 4294967296;
uint32_t halfLimit = 4294967296/2;
uint64_t i;
int j, done;
//General method to point to list since this detail is not important to the question.
list4BILLION = 0000000000h;
//Initialize answer and the arrays to zero. binaryHistogram represents the number of 1s seen in each bit position as you parse through the list
answer = 0;
for(i=0;i<32;i++)
{
binaryHistogram[i] = 0;
placesChecked[i] = 0;
}
//Only sum up for first half of the 4 billion numbers
for(i=0;i<halfLimit;i++)
{
for(j=0;j<32;j++)
{
binaryHistogram[j] += ((*list4BILLION) >> j) & 1;
}
list4BILLION++;
}
//Check each ith digit to see if all halfLimit values have been parsed
for(i=halfLimit;i<limit;i++)
{
done = 1; //assume every bit position has already been decided until we find one that hasn't
for(j=0;j<32;j++)
{
if(placesChecked[j] == 0) //only look at bit positions that are not yet decided
{
done = 0;
binaryHistogram[j] += ((*list4BILLION) >> j) & 1;
if(binaryHistogram[j] >= halfLimit) //seen 2^31 ones: the answer has a zero in this place
{
placesChecked[j] = 1;
}
else if((i + 1) - binaryHistogram[j] >= halfLimit) //seen 2^31 zeros: the answer has a one in this place
{
answer += (1 << j);
placesChecked[j] = 1;
}
}
}
list4BILLION++;
if(done) break; //all 32 bit positions are decided, no need to read further
}