What are the benefits of lazy directory iteration? - d

I am reading the Phobos docs and found the function dirEntries, which "lazily iterates a given directory". But I can't understand the real benefit of it.
As I understand it, a lazy function is one that computes its result only when it is needed.
Let's look at the following code:
auto files = dirEntries(...);
auto cnt = files.count;
foreach( file; files ) { }
How many times would dirEntries be called? Once or twice? Please explain the logic to me.
Or take splitter, for example.
For me, it makes the code much harder to understand.

Lazy evaluation can be much more efficient if used correctly.
Say you have a somewhat expensive function that does something and you apply it to a whole range:
auto arr = iota(0, 100000); // a range of numbers from 0 to 99999
arr.map!(number => expensiveFunc(number))
   .take(5)
   .writeln;
If map wasn't lazy, it would execute expensiveFunc for all 100000 items in the range, and then pop off the first 5 of them.
But because map is lazy, expensiveFunc will only be called for the 5 items actually popped from the range.
Similarly with splitter, say you've a csv string with some data in it and you want to continue summing values until you meet a negative value.
string csvStr = "100,50,-1,1000,10,24,51";
int sum;
foreach (val; csvStr.splitter(",")) {
    immutable asNumber = val.to!int;
    if (asNumber < 0) break;
    sum += asNumber;
}
writeln(sum);
The above will only do the expensive 'splitting' work three times, as splitter is lazy and we only had to read three items, saving us from splitting csvStr all the way to the end even though we don't need the remaining values.
So, in summary, the benefit of lazy evaluation is that only the work that NEEDS to be done is actually done.

Related

print all the triplets (3 Sum) leetcode, https://leetcode.com/problems/3sum/ [closed]

class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        vector<int> v;
        vector<vector<int>> ans;
        int n = nums.size();
        sort(nums.begin(), nums.end());
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                for (int k = j + 1; k < n; k++) {
                    if (nums[i] + nums[j] + nums[k] == 0 && i != j && i != k && j != k) {
                        v.push_back(nums[i]);
                        v.push_back(nums[j]);
                        v.push_back(nums[k]);
                        ans.push_back(v);
                    }
                }
            }
        }
        return ans;
    }
};
It is not showing an error, but it displays the wrong answer, as shown below:
Input: [-1, 0, 1, 2, -1, 4]
Your output: [[-1, -1, 2], [-1, -1, 2, -1, 0, 1], [-1, -1, 2, -1, 0, 1, -1, 0, 1]]
Expected output: [[-1, -1, 2], [-1, 0, 1]]
I can understand the problem with pushing back more and more values into my vector v. OK.
But maybe, somebody could give me a hint on how to tackle the problem with the duplicates?
Any help for me as a new user is highly welcome and appreciated.
Of course, we will help you here on SO.
Starting with a new language is never that easy, and there may be some things that are not immediately clear in the beginning. Additionally, I do apologize for any rude comments that you may see, but you can be assured that the vast majority of the members of SO are very supportive.
I want to first give you some information on pages like LeetCode and Codeforces and the like, often also referred to as "competitive programming" pages. Sometimes people misunderstand this and think that you have only a limited time to submit the code. That is not the case. There are such competitions, but usually not on the mentioned pages. The bad thing is that the coding style used in real competition events is also used on these online pages, and that is really bad, because this coding style is so horrible that no serious developer would survive one day in a real company that needs to earn money with software and is liable for it.
So, these pages will never teach you or guide you how to write good C++ code. And even worse, if newbies start learning the language and see this bad code, then they learn bad habits.
But what is then the purpose of such pages?
The purpose is to find a good algorithm, mostly optimized for runtime execution speed and often also for low memory consumption.
So, they are aiming at a good design. The language or coding style does not matter to them. You can even submit completely obfuscated code or "code golf" solutions; as long as it is fast, nothing else matters.
So, never start to code immediately as a first step. First, think for 3 days. Then, take some design tool, for example a piece of paper, and sketch a design. Then refactor your design, and refactor it again, and again, and so on. This may take a week.
Next, search for an appropriate programming language that you know and that can handle your design.
And finally, start coding. Because you did a good design beforehand, you can use long and meaningful variable names and write many, many comments, so that other people (and you, after one month) can understand your code AND your design.
OK, maybe understood.
Now, let's analyze your code. You selected a brute-force solution with a triple nested loop. That could work for a low number of elements, but will in most cases result in a so-called TLE (Time Limit Exceeded) error. Nearly all problems on those pages cannot be solved with brute force. Brute-force solutions are always an indicator that you did not do the above design steps. And this leads to additional bugs.
Your code has two major semantic bugs.
You define a std::vector with the name "v" at the beginning. Then, in the loop, after you have found a triplet meeting the given condition, you push_back the results into that std::vector. This means you add 3 values to the std::vector "v", and now there are 3 elements in it. In the next loop run, after finding the next fit, you again push_back 3 additional values into your std::vector "v", and now there are 6 elements in it. In the next round 9 elements, and so on.
How to solve that?
You could use the std::vector's clear function to delete the old elements from the std::vector at the beginning of the innermost loop body, or right after the if statement. But that is basically not that good and, additionally, time consuming. It is better to follow the general guideline of defining variables as late as possible, at the point where they are needed. So, if you defined your std::vector "v" inside the if statement, the problem would be gone. But then you would additionally notice that it is only used there and nowhere else. And hence, you do not need it at all.
You may have seen that you can add values to a std::vector by using an initializer list. Something like:
std::vector<int> v {1,2,3};
With that know-how, you can delete your std::vector “v” and all related code and directly write:
ans.push_back( { nums[i], nums[j], nums[k] } );
Then you would have saved 3 unnecessary push_back (and a clear) operations, and, more importantly, you would not get result sets with more than 3 elements.
Next problem: duplicates. You try to prevent the storage of duplicates by writing && i!=j && i!=k && j!=k. But this will not work in general, because you compare indices and not values, and because the comparison itself is wrong: the Boolean expression is a tautology; it is always true. You initialize your variable j with i+1, and therefore "i" can never be equal to "j". So, the condition i != j is always true. The same is valid for the other variables.
But how do we prevent duplicate entries? You could do some logical comparisons, or first store all the triplets and later use std::unique (or other functions) to eliminate duplicates, or use a container that only stores unique elements, like a std::set. For the given design, which has a time complexity of O(n^3) and is therefore already extremely slow, adding a std::set will not make things noticeably worse; I checked that in a small benchmark. Still, the only real solution is a completely different design. We will come to that later. Let us first fix the code, still using the brute-force approach.
Please look at the somewhat short and elegant solution below.
vector<vector<int>> threeSum(vector<int>& nums) {
    std::set<vector<int>> ans;
    int n = nums.size();
    sort(nums.begin(), nums.end());
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            for (int k = j + 1; k < n; k++)
                if (nums[i] + nums[j] + nums[k] == 0)
                    ans.insert({ nums[i], nums[j], nums[k] });
    return { ans.begin(), ans.end() };
}
But, because of the unfortunate design decision, it is 20,000 times slower for big input than a better design. And, because the online test programs work with big input vectors, the program will not pass the runtime constraints.
How do we come to a better solution? We need to carefully analyze the requirements, and we can also use some existing know-how for similar kinds of problems.
If you read some books or internet articles, you will often get the hint that the so-called "sliding window" is the proper approach to a reasonable solution.
You will find useful information here. But you can, of course, also search here on SO for answers.
For this problem, we would use a typical two-pointer approach, but modified for its specific requirements: basically a fixed start value and a moving, closing window . . .
The analysis of the requirements leads to the following idea.
If all evaluated numbers are > 0, then we can never have a sum of 0.
It would be easy to identify duplicate numbers if they were next to each other.
--> Sorting the input values will be very beneficial.
This will eliminate the test for half of the values with randomly distributed input vectors. See:
std::vector<int> nums { 5, -1, 4, -2, 3, -3, -1, 2, 1, -1 };
std::sort(nums.begin(), nums.end());
// Will result in
// -3, -2, -1, -1, -1, 1, 2, 3, 4, 5
And with that we see that if we shift our window to the right, we can stop the evaluation as soon as the start of the window hits a positive number. Additionally, we can immediately identify duplicate numbers.
Next: if we start at the beginning of the sorted vector, this value will most likely be very small. And if we open the window one element after the current start value, then we will have "very" negative numbers. To get a sum of 0 from two "very" negative numbers, we need a very positive number, and that is at the end of the std::vector.
Start with
startPointerIndex 0, value -3
Window start = startPointerIndex + 1 --> value -2
Window end = lastIndexInVector --> 5
And yes, we have already found a solution. Now we need to check for duplicates. If there were an additional 5 at the second-to-last position, we could skip it; it would not add a different solution. So, we can decrement the end window pointer in such a case. The same is valid if there were an additional -2 at the beginning of the window: then we would need to increment the start window pointer, to avoid a duplicate finding from that end.
The same is valid for the start pointer index. Example: startPointerIndex = 3 (start counting indices with 0); the value will be -1. But the value before, at index 2, is also -1. So there is no need to evaluate it again, because we evaluated it already.
The above methods will prevent the creation of duplicate entries.
But how do we continue the search? If we cannot find a solution, we will narrow down the window. This we will also do in a smart way: if the sum is too big, then obviously the right window value was too big, and we should better use the next smaller value for the next comparison.
The same applies to the starting side of the window: if the sum was too small, then we obviously need a bigger value, so we increment the start window pointer. And we keep doing this (making the window smaller) until we find a solution or until the window is closed, meaning the start window pointer is no longer smaller than the end window pointer.
Now we have developed a reasonably good design and can start coding.
We additionally try to implement a good coding style and refactor the code for some faster constructs.
Please see:
class Solution {
public:
    // Define some type aliases for easier typing and understanding later
    using DataType = int;
    using Triplet = std::vector<DataType>;
    using Triplets = std::vector<Triplet>;
    using TestData = std::vector<DataType>;

    // Function to identify all unique triplets (3 elements) in a given test input
    Triplets threeSum(TestData& testData) {

        // In order to save the function-call overhead of repeatedly getting the size of the test data,
        // we store the size of the input data in a const temporary variable
        const size_t numberOfTestDataElements{ testData.size() };

        // If the given input test vector is empty, we immediately return an empty result vector
        if (!numberOfTestDataElements) return {};

        // In later code we often need the last valid element of the input test data.
        // Since indices in C++ start with 0, the value will be size - 1.
        // With that we later avoid unnecessary subtractions in the loop
        const size_t numberOfTestDataElementsMinus1{ numberOfTestDataElements - 1u };

        // Here we will store all the found, valid and unique triplets
        Triplets result{};

        // In order to save the time for later memory reallocations and copying tons of data, we reserve
        // memory to hold all results only one time. This will speed up operations by 5 to 10%
        result.reserve(numberOfTestDataElementsMinus1);

        // Now sort the input test data to be able to find an end condition, if all elements are
        // greater than 0, and to identify duplicates more easily
        std::sort(testData.begin(), testData.end());

        // These variables will define the size of the sliding window
        size_t leftStartPositionOfSlidingWindow, rightEndPositionOfSlidingWindow;

        // Now, we will evaluate all values of the input test data from left to right.
        // As an optimization, we additionally define a 2nd running variable k,
        // to avoid later additions in the loop, where i+1 would need to be calculated.
        // This can be better done with a running variable that will just be incremented
        for (size_t i = 0, k = 1; i < numberOfTestDataElements; ++i, ++k) {

            // If the current value from the input test data is greater than 0,
            // a sum with the result of 0 will no longer be possible. We can stop now
            if (testData[i] > 0) break;

            // Prevent evaluation of duplicates based on the current input test data
            if (i and (testData[i] == testData[i - 1])) continue;

            // Open the window and determine start and end index.
            // The start index is always one past the currently evaluated index of the input test data.
            // The end index is always the last element
            leftStartPositionOfSlidingWindow = k;
            rightEndPositionOfSlidingWindow = numberOfTestDataElementsMinus1;

            // Now, as long as the window is not closed, meaning not fully narrowed, we will evaluate
            while (leftStartPositionOfSlidingWindow < rightEndPositionOfSlidingWindow) {

                // Calculate the sum of the currently addressed values
                const int sum = testData[i] + testData[leftStartPositionOfSlidingWindow] + testData[rightEndPositionOfSlidingWindow];

                // If the sum is too small, then the value on the left side of the sorted window is too small.
                // Therefore take the next value on the left side and try again. So, make the window smaller
                if (sum < 0) {
                    ++leftStartPositionOfSlidingWindow;
                }
                // Else, if the sum is too big, then the value on the right side of the window was too big.
                // Use the next smaller value, one to the left of the current closing position of the window.
                // So, make the window smaller
                else if (sum > 0) {
                    --rightEndPositionOfSlidingWindow;
                }
                else {
                    // According to the above conditions, we have now found a triplet fulfilling the requirements.
                    // So store this triplet as a result
                    result.push_back({ testData[i], testData[leftStartPositionOfSlidingWindow], testData[rightEndPositionOfSlidingWindow] });

                    // We now need to handle duplicates at the edges of the window, so at the left and right edge.
                    // For this, we remember the current values at both edges
                    const DataType lastLeftValue = testData[leftStartPositionOfSlidingWindow];
                    const DataType lastRightValue = testData[rightEndPositionOfSlidingWindow];

                    // Check the left edge. As long as we have duplicates here, we shift the opening position of the window to the right.
                    // Because of Boolean short-circuit evaluation we do the comparison for duplicates first. This gives us 5% more speed
                    while (testData[leftStartPositionOfSlidingWindow] == lastLeftValue && leftStartPositionOfSlidingWindow < rightEndPositionOfSlidingWindow)
                        ++leftStartPositionOfSlidingWindow;

                    // Check the right edge. As long as we have duplicates here, we shift the closing position of the window to the left.
                    // Because of Boolean short-circuit evaluation we do the comparison for duplicates first. This gives us 5% more speed
                    while (testData[rightEndPositionOfSlidingWindow] == lastRightValue && leftStartPositionOfSlidingWindow < rightEndPositionOfSlidingWindow)
                        --rightEndPositionOfSlidingWindow;
                }
            }
        }
        return result;
    }
};
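For completeness, here is a minimal, hypothetical driver (not part of the original answer) showing how the class above could be exercised with the example input from the question. It assumes the class and the headers <vector>, <algorithm> and <iostream> are already included in the same file.
// Hypothetical driver: assumes the Solution class above is in the same file,
// together with #include <vector>, <algorithm> and <iostream>.
int main() {
    Solution solution;
    Solution::TestData input{ -1, 0, 1, 2, -1, 4 };            // example input from the question
    const Solution::Triplets triplets = solution.threeSum(input);
    for (const Solution::Triplet& triplet : triplets)          // expected: -1 -1 2 and -1 0 1
        std::cout << triplet[0] << ' ' << triplet[1] << ' ' << triplet[2] << '\n';
}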
The above solution will outperform 99% of other solutions. I made many benchmarks to prove that.
It additionally contains tons of comments to explain what is going on there. And I have selected "speaking", meaningful variable names for better understanding.
I hope that I could help you a little.
And finally: I dedicate this answer to Sam Varshavchik and PaulMcKenzie.

efficiently mask-out exactly 30% of array with 1M entries

My question's title is similar to this link; however, that one wasn't answered to my expectations.
I have an array of integers (1 000 000 entries), and need to mask exactly 30% of elements.
My approach is to loop over elements and roll a dice for each one. Doing it in a non-interrupted manner is good for cache coherency.
As soon as I notice that exactly 300 000 of elements were indeed masked, I need to stop. However, I might reach the end of an array and have only 200 000 elements masked, forcing me to loop a second time, maybe even a third, etc.
What's the most efficient way to ensure I won't have to loop a second time, while not being biased towards picking some elements?
Edit:
I need to preserve the order of elements. For instance, I might have:
[12, 14, 1, 24, 5, 8]
Masking away 30% might give me:
[0, 14, 1, 24, 0, 8]
The result of masking must be the original array, with some elements set to zero.
Just do a Fisher-Yates shuffle but stop after only 300000 iterations. The last 300000 elements will be the randomly chosen ones.
std::size_t size = 1000000;
for (std::size_t i = 0; i < 300000; ++i)
{
    std::size_t r = std::rand() % size;
    std::swap(array[r], array[size - 1]);
    --size;
}
I'm using std::rand for brevity. Obviously you want to use something better.
The other way is this:
for (std::size_t i = 0; i < 300000;)
{
    std::size_t r = rand() % 1000000;
    if (array[r] != 0)
    {
        array[r] = 0;
        ++i;
    }
}
This has no bias and does not reorder elements, but it is inferior to Fisher-Yates, especially for high percentages.
When I see a massive list, my mind always goes first to divide-and-conquer.
I won't be writing out a fully-fleshed algorithm here, just a skeleton. You seem like you have enough of a clue to take a decent idea and run with it. I think I only need to point you in the right direction. With that said...
We'd need an RNG that can return a suitably-distributed value for how many masked values could potentially be below a given cut point in the list. I'll use the halfway point of the list for said cut. Some statistician can probably set you up with the right RNG function. (Anyone?) I don't want to assume it's just uniformly random [0..mask_count), but it might be.
Given that, you might do something like this:
// the magic RNG your stats homework will provide
int random_split_sub_count_lo( int count, int sub_count, int split_point );

void mask_random_sublist( int *list, int list_count, int sub_count )
{
    if (list_count > SOME_SMALL_THRESHOLD)
    {
        int list_count_lo = list_count / 2; // arbitrary
        int list_count_hi = list_count - list_count_lo;
        int sub_count_lo = random_split_sub_count_lo( list_count, sub_count, list_count_lo );
        int sub_count_hi = sub_count - sub_count_lo;
        mask_random_sublist( list, list_count_lo, sub_count_lo );
        mask_random_sublist( list + list_count_lo, list_count_hi, sub_count_hi );
    }
    else
    {
        // insert here some simple/obvious/naive implementation that
        // would be ludicrous to use on a massive list due to complexity,
        // but which works great on very small lists. I'm assuming you
        // can do this part yourself.
    }
}
Assuming you can find someone more informed on statistical distributions than I to provide you with a lead on the randomizer you need to split the sublist count, this should give you O(n) performance, with 'n' being the number of masked entries. Also, since the recursion is set up to traverse the actual physical array in constantly-ascending-index order, cache usage should be as optimal as it's gonna get.
Caveat: There may be minor distribution issues due to the discrete nature of the list versus the 30% fraction as you recurse down and down to smaller list sizes. In practice, I suspect this may not matter much, but whatever person this solution is meant for may not be satisfied that the random distribution is truly uniform when viewed under the microscope. YMMV, I guess.
Here's one suggestion. One million bits is only 128 KB, which is not an onerous amount.
So create a bit array with all items initialised to zero. Then randomly select 300,000 of them (accounting for duplicates, of course) and mark those bits as one.
Then you can run through the bit array and, for any that are set to one (or zero, if your idea of masking means you want to process the other 700,000), do whatever action you wish to the corresponding entry in the original array.
If you want to ensure there's no possibility of duplicates when randomly selecting them, just trade off space for time by using a Fisher-Yates shuffle.
Construct a collection of all the indices and, for each of the 300,000 you want selected (or 700,000 if, as mentioned, masking means you want to process the other ones):
pick one at random from the remaining set.
copy the final element over the one selected.
reduce the set size.
This will leave you with a random subset of indices that you can use to process the integers in the main array.
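To make this concrete, here is a hypothetical sketch along the lines of the suggestions above (the function name maskExactly and the choice of std::mt19937 are illustrative, not from the answer): it picks exactly `toMask` unique indices with a partial Fisher-Yates pass over an index array and zeroes those positions, so the order of the surviving elements is preserved as the question requires.
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch: mask exactly `toMask` elements, chosen uniformly and without bias.
void maskExactly(std::vector<int>& data, std::size_t toMask)
{
    std::vector<std::size_t> indices(data.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        indices[i] = i;                                    // pool of candidate indices

    std::mt19937 rng{ std::random_device{}() };
    std::size_t remaining = indices.size();
    for (std::size_t picked = 0; picked < toMask; ++picked)
    {
        std::uniform_int_distribution<std::size_t> pick(0, remaining - 1);
        const std::size_t r = pick(rng);
        data[indices[r]] = 0;                              // mask the chosen element in place
        std::swap(indices[r], indices[remaining - 1]);     // drop that index from the pool
        --remaining;
    }
}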
You want reservoir sampling. Sample code courtesy of Wikipedia:
(*
S has items to sample, R will contain the result
*)
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i) // important: inclusive range
if j <= k
R[j] := S[i]
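For this question one would typically sample indices rather than values, so the chosen positions can then be zeroed in place. A minimal C++ sketch of the same algorithm over indices (names are illustrative, not from the pseudocode) might look like this:
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sketch: reservoir-sample k indices out of [0, n) and return them.
std::vector<std::size_t> reservoirSampleIndices(std::size_t n, std::size_t k, std::mt19937& rng)
{
    std::vector<std::size_t> reservoir(k);
    for (std::size_t i = 0; i < k; ++i)                       // fill the reservoir
        reservoir[i] = i;
    for (std::size_t i = k; i < n; ++i)                       // replace with gradually decreasing probability
    {
        std::uniform_int_distribution<std::size_t> dist(0, i); // inclusive range, as in the pseudocode
        const std::size_t j = dist(rng);
        if (j < k)
            reservoir[j] = i;
    }
    return reservoir;
}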

How do I print out vectors in different order every time

I'm trying to make two vectors, where vector1 (total1) contains some strings and vector2 (total2) contains some random unique numbers (each between 0 and total1.size() - 1).
I want to make a program that prints out total1's strings, but in a different order every turn. I don't want to use iterators or anything like that, because I want to improve my problem-solving ability.
Here is the specific function that crashes the program.
for (unsigned i = 0; i < total1.size();)
{
    v1 = rand() % total1.size();
    for (unsigned s = 0; s < total1.size(); ++s)
    {
        if (v1 == total2[s])
            ;
        else
        {
            total2.push_back(v1);
            ++i;
        }
    }
}
I'm very grateful for any help that I can get!
Can I suggest a change of algorithm? Because, even if your current one were correctly implemented ("s", in your code, must go from 0 to total2.size, not total1.size, and if the element is found, you should break and generate a new random number), it has the following drawback: assume vectors of 1,000,000 elements and you are trying the last random number. You have a probability of one in 1,000,000 of finding a random number not previously used. That is a very small chance. The last-but-one number has a probability of 2 in 1,000,000, also small. In conclusion, your program will loop and spend lots of CPU resources.
Your best alternative is to follow @NathanOliver's suggestion and look at the function std::shuffle. The manual page shows the implementation algorithm, which is what you are looking for.
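For reference, a minimal sketch of that suggestion (the sample strings are made up, not from the question):
#include <algorithm>   // std::shuffle
#include <iostream>
#include <random>      // std::mt19937, std::random_device
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> total1{ "one", "two", "three", "four" };  // placeholder data
    std::mt19937 rng{ std::random_device{}() };
    std::shuffle(total1.begin(), total1.end(), rng);                   // different order on every run
    for (const std::string& s : total1)
        std::cout << s << '\n';
}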
Another simple algorithm, with some pros and cons, is:
Initialize total2 with the sequence 0, 1, 2, ..., n, where n is the size of total1 - 1.
Choose two random numbers, i1 and i2, in the range [0, n].
Swap elements i1 and i2 in total2.
Repeat from (2) a fixed number of times "R".
This method lets you know a priori the number of steps needed and lets you control the level of "randomness" of the final vector (a bigger R is more random). However, it is far from good in its randomness quality.
Another method, better in its probabilistic distribution (a sketch follows the steps below):
Fill a list L with the numbers 0, 1, 2, ..., size of total1 - 1.
Choose a random number i between 0 and the size of list L - 1.
Store in total2 the i-th element of list L.
Remove this element from L.
Repeat from (2) until L is empty.
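A hypothetical sketch of this second method (function and variable names are mine, not from the answer): total2 ends up holding every index exactly once, in random order.
#include <cstddef>
#include <cstdlib>   // std::rand
#include <vector>

// Hypothetical sketch: build a random permutation of 0..n-1 by drawing from a shrinking pool.
std::vector<int> makeRandomOrder(std::size_t n)            // n would be total1.size()
{
    std::vector<int> pool;                                 // the "list L": 0, 1, 2, ..., n-1
    for (std::size_t v = 0; v < n; ++v)
        pool.push_back(static_cast<int>(v));

    std::vector<int> total2;
    while (!pool.empty())
    {
        const std::size_t i = std::rand() % pool.size();   // std::rand only for brevity
        total2.push_back(pool[i]);                         // store the i-th remaining element
        pool.erase(pool.begin() + static_cast<std::ptrdiff_t>(i)); // remove it from L
    }
    return total2;
}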
If you just want to shuffle vector<string> total1, you can do this without using helping vector<int> total2. Here is an implementation based on Fisher–Yates shuffle.
int n = total1.size();   // number of strings to shuffle
for (int i = n - 1; i >= 1; i--) {
    int j = rand() % (i + 1);
    swap(total1[j], total1[i]); // your prof might not allow use of swap :)
}
If you must use vector<int> total2 then shuffle it using above algorithm. Next you can use it to create a new vector<string> result from total1 where result[i]=total1[total2[i]].

How to calc percentage of coverage in an array of 1-100 using C++?

This is for an assignment so I would appreciate no direct answers; rather, any logic help with my algorithms (or pointing out any logic flaws) would be incredibly helpful and appreciated!
I have a program that receives "n" number of elements from the user to put into a single-dimensional array.
The array uses random generated numbers.
IE: If the user inputs 88, a list of 88 random numbers (each between 1 and 100) is generated.
"n" has a max of 100.
I must write 2 functions.
Function #1:
Determine the percentage of numbers that appear in the array of "n" elements.
So any duplicates would decrease the percentage.
And any missing numbers would decrease the percentage.
Thus if n = 75, then you have a maximum possible %age of 0.75
(this max %age decreases if there are duplicates)
This function basically calls upon function #2.
FUNCTION HEADER(GIVEN) = "double coverage (int array[], int n)"
Function #2:
Using a linear search, search for the key (key being the current # in the list of 1 to 100, which should be from the loop in function #1), in the array.
Return the position if that key is found in the array
(IE: if this is the loop's 40th run, it will be at the variable "39",
and it will go through every element in the array,
and if any element is equal to 39, all of those positions will be returned?
I believe that is what our prof is asking.)
Return -1 if the key is not found.
Given notes = "Only function #1 calls function #2,
and does so to find out if a certain value (key) is found among the first n elements of the array."
FUNCTION HEADER(GIVEN) = "int search (int array[], int n, int key)"
What I really need help with is the logic for the algorithm.
I would appreciate any help with this, as I would approach this problem completely differently than our professor wants us to.
My first thoughts would be to loop through function #1 for all variable keys of 1 through 100.
And in that loop, go to the search function (function #2), in which a loop would go through every number in the array and add to a counter if a number was (1) a duplicate or (2) non-existent in the array. Then I would subtract that counter from 100. Thus if all numbers were included in the array except for #40 and #41, and #77 was a duplicate, the total percentage of coverage would be 100 - 3 = 97%.
Although, as I type this, I think that may in and of itself be flawed? ^ Because with a max of 100 elements in the array, if the only number missing was 99, then you would subtract 1 for having that number missing, and then if there was a duplicate you would subtract another 1, and thus your percentage of coverage would be (100 - 2) = 98, when clearly it ought to be 99.
And this ^ is exactly why I would REALLY appreciate any logic help. :)
I know I am having some problems approaching this logically.
I think I can figure out the coding with relative ease; what I am struggling with the most is the steps to take. So any pseudocode ideas would be amazing!
(I can post my entire program code so far if necessary for anyone, just ask, but it is rather long as of now as I have many other functions performing other tasks in the program)
I may be mistaken, but as I read it all you need to do is:
write a function that loops through the array of n elements to find a given number in it. It would return the index of the first occurrence, or a negative value in case the number cannot be found in the array.
write a loop to call the function for all numbers 1 to 100 and count the finds. Then divide the result by 100.
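A minimal sketch of that idea, written against the two function headers given in the assignment (it assumes search() returns the index of the first occurrence of key in the first n elements, or -1 if absent):
int search(int array[], int n, int key);   // function #2, as declared in the assignment

double coverage(int array[], int n)        // function #1
{
    int found = 0;
    for (int key = 1; key <= 100; ++key)   // every value that could appear
    {
        if (search(array, n, key) != -1)
            ++found;                       // each value counts at most once, so duplicates don't inflate it
    }
    return found / 100.0;                  // e.g. 0.75 if 75 distinct values are present
}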
I'm not sure if I understand this whole thing right, but you can do it in one function if you don't care about speed. It's better to put the array into a vector, loop through 1..100 and use the Boost find_nth function: http://www.boost.org/doc/libs/1_41_0/doc/html/boost/algorithm/find_nth.html. There you can check whether a second occurrence of the current value exists in the vector; if it does, you decrease the percentage, and if not, you don't. If you just want to find whether a number is in the array at all, use http://www.cplusplus.com/reference/algorithm/find/. I don't understand exactly how the percentage decreases, so that part is up to you, and I don't really understand the second function either, but if it's a linear search, use find again.
P.S. Vector description http://www.cplusplus.com/reference/vector/vector/begin/.
You want to know how many numbers in the range [1, 100] appear in your given array. You can search for each number in turn:
size_t count_unique(int array[], size_t n)
{
    size_t result = 0;
    for (int i = 1; i <= 100; ++i)
    {
        if (contains(array, n, i))
        {
            ++result;
        }
    }
    return result;
}
All you still need is an implementation of the containment check contains(array, n, i), and to transform the unique count into a percentage (by using division).
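One possible shape for that helper (just a sketch, one of many ways to write it): a plain linear search matching the int search(int array[], int n, int key) header from the question, plus a thin wrapper giving the contains() form used above.
// Linear search: return the position of the first occurrence of key, or -1 if not found.
int search(int array[], int n, int key)
{
    for (int pos = 0; pos < n; ++pos)
    {
        if (array[pos] == key)
            return pos;
    }
    return -1;
}

// Wrapper used by count_unique() above.
bool contains(int array[], int n, int key)
{
    return search(array, n, key) != -1;
}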

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a below-threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
// part that compares them all to each other
for (int j = 0; j < counter; j++)          // counter holds # of DNA
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        // boring stuff
    }

// part that compares strings. Definitely very inefficient,
// although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    // basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        while (str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} // this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long matches, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing whether it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example):
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you have found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one; otherwise enter it into the table.
e) Repeat 3d) with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random / you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
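To make steps 1) and 2) concrete, here is a hypothetical sketch (function and container choices are mine, not the answerer's): it splits each read at the 'X' wildcards, keeps the readable pieces, and counts how often each piece occurs, grouped by length so the longest substrings can be processed first.
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of steps 1) and 2): length -> (substring -> occurrence count),
// with the longest lengths first.
std::map<std::size_t, std::map<std::string, int>, std::greater<std::size_t>>
buildSubstringTable(const std::vector<std::string>& reads)
{
    std::map<std::size_t, std::map<std::string, int>, std::greater<std::size_t>> table;
    for (const std::string& read : reads)
    {
        std::size_t start = 0;
        while (start < read.size())
        {
            const std::size_t firstBase = read.find_first_not_of('X', start); // skip wildcards
            if (firstBase == std::string::npos)
                break;
            std::size_t end = read.find('X', firstBase);                      // end of the readable part
            if (end == std::string::npos)
                end = read.size();
            const std::string part = read.substr(firstBase, end - firstBase);
            ++table[part.size()][part];                                       // count this readable piece
            start = end;
        }
    }
    return table;
}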
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them, and that is by itself extremely slow; a computation cluster seems the best approach.
The only thing I see how to improve is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item on 4 bits, i.e. two per byte (this might not be a good idea though, but still a possible option to investigate), and then compare them quickly with a AND operation, so that you 'just' have to count how many consecutive non zero values you have. That's just a way to process the wildcard, sorry I don't have a better idea to reduce the complexity of the overall comparison.