Naive suffix array optimisation c++

Do you have an idea how to optimize the following function while still using std::sort()?
It sorts the suffixes of a text to create a suffix array. I think the problem is in the compare function, as not much more can be done for the rest; compare is basically the lexicographical < operator for a string.
I added the function I use to measure the time in main(), so the snippet is now reproducible. I hope the comments make the logic easier to follow. The function works fine but is too slow because of the multiple ifs. Construction time is currently around 101 ms (on my CPU), but we aim for around 70 ms.
// compile with: g++ -std=c++17 -Wall -pedantic -fsanitize=address -O3 test.cpp -o test
#include <string>
#include <vector>
#include <iostream>
#include <chrono>
#include <algorithm>
#include <functional>
/// Build suffix array from text.
/// Input: the text (might be empty)
/// Output: a suffix array (sorted). The variable 'sa' is cleared before it's filled and returned.
void construct(std::vector<uint32_t>& sa, const std::string& text) {
    sa.clear();
    unsigned tsize = text.size();
    for (unsigned i = 0; i < tsize; ++i) {
        sa.push_back(i);
    }
    // with this function we compare the suffix at position 'a' to the suffix at position 'b' in the text.
    // we do this letter by letter, calling compare recursively
    std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {
        if (a >= tsize) return true; // if we run off the end of the text on a, suffix a is the shorter one, so a < b
        else if (b >= tsize) return false; // if we run off the end on b, it means a > b
        else if (text[a] < text[b]) return true; // case a < b
        else if (text[a] == text[b]) return compare(a + 1, b + 1); // same letter: compare the next letters of both suffixes
        else return false; // only the case a > b is left
    };
    std::sort(sa.begin(), sa.end(), compare);
}
// main tests the construction speed
int main(){
    std::vector<uint32_t> sa;
    srand(0);
    std::string big(100000, ' '); // 'large' random text
    for (auto& c : big) c = rand() % 128;
    // time the construction
    auto begin = std::chrono::steady_clock::now();
    construct(sa, big);
    auto end = std::chrono::steady_clock::now();
    size_t time = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    std::cout << "Construction time: " << time << "\n";
}

I haven't profiled it, but - depending on how clever your compiler is - I think a non-recursive formulation might be more efficient:
std::function<bool(uint32_t, uint32_t)> compare = [&](uint32_t a, uint32_t b) {
    do
    {
        if (a >= tsize) return true;
        else if (b >= tsize) return false;
        else if (text[a] < text[b]) return true;
        else if (text[a] > text[b]) return false;
        a += 1;
        b += 1;
    } while (true);
};

The code in compare compares strings according to their characters, in lexicographical order. This is very similar to how numbers are compared - most significant bits/bytes, if different, force the result of the comparison, and if equal, defer the result to less significant bits/bytes. So, if your strings are usually long, you can compare 8 bytes at once, using uint64_t.
To do that, you must pad your strings (with zero-valued bytes), to prevent out-of-bounds reads (alternatively, do the 8-byte comparison only when far from the string's end, otherwise do the original 1-byte comparison). Also, if your system is little-endian (most likely), you have to reverse bytes in your uint64_t numbers.
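A minimal sketch of that idea, assuming the text has been padded with at least 8 trailing zero bytes, contains no NUL characters of its own, and runs on a little-endian machine with a compiler that provides `__builtin_bswap64` (GCC/Clang); the helper name `less8` is mine:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Compare the suffixes starting at 'a' and 'b', 8 bytes at a time.
// Assumes 'text' is padded with >= 8 trailing zero bytes and contains
// no NUL characters itself, and that the machine is little-endian.
bool less8(const std::string& text, uint32_t a, uint32_t b) {
    if (a == b) return false; // a suffix is never less than itself
    const char* pa = text.data() + a;
    const char* pb = text.data() + b;
    for (;;) {
        uint64_t wa, wb;
        std::memcpy(&wa, pa, 8); // memcpy avoids alignment issues
        std::memcpy(&wb, pb, 8);
        if (wa != wb) {
            // On little-endian machines the first differing character sits
            // in the least significant byte, so byte-swap before comparing.
            return __builtin_bswap64(wa) < __builtin_bswap64(wb);
        }
        pa += 8;
        pb += 8;
    }
}
```

Because two distinct suffixes have different lengths, the shorter one hits the zero padding before the loop can run past the buffer, so the loop always terminates for a != b.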

Examining a given K element in an array

This assignment is about a ship race on a lake.
I have an array of N elements, where I input wind speeds.
I am given a number K, which determines how many consecutive days must have a wind speed between 10 and 100.
If I find K consecutive elements in that range, I have to print out the index of the first element of this sequence.
The goal is to find the day on which the "race" can be started.
For example:
S[10] = {50,40,0,5,0,80,70,90,100,120}
K=3
The output has to be 6, because the sequence starts at the 6th element of the array.
I don't know how to implement this examination.
I tried this:
for (int i = 0; i < N-2; i++){
    if (((10 <= S[i]) && (S[i] <= 100)) && ((10 <= S[i+1]) && (S[i+1] <= 100)) && ((10 <= S[i+2]) && (S[i+2] <= 100))){
        canBeStarted = true;
        whichDayItCanBeStarted = i;
    }
}
cout << whichDayItCanBeStarted << endl;
But I realised that K can be any number, so I have to examine K elements at once.
Making use of the algorithms standard library
(Restriction: the following answer provides an approach valid for C++17 and beyond)
For a problem such as this one, rather than re-inventing the wheel, you might want to consider turning to the algorithms library in the standard library, making use of std::transform and std::search_n to
produce an integer -> bool transform over wind speeds to validity of said wind speeds, followed by
searching over the result of the transform for a number (K) of subsequent true (valid wind speed) elements,
respectively.
E.g.:
#include <algorithm> // std::search_n, std::transform
#include <cstdint>   // uint8_t (for wind speeds)
#include <iostream>  // std::cout
#include <iterator>  // std::back_inserter, std::distance
#include <vector>    // std::vector

int main() {
    // Wind data and wind restrictions.
    const std::vector<uint8_t> wind_speed{50U, 40U, 0U,  5U,   0U,
                                          80U, 70U, 90U, 100U, 120U};
    const uint8_t minimum_wind_speed = 10U;
    const uint8_t maximum_wind_speed = 100U;
    const std::size_t minimum_consecutive_days = 3;

    // Map wind speeds -> wind speed within limits.
    std::vector<bool> wind_within_limits;
    std::transform(wind_speed.begin(), wind_speed.end(),
                   std::back_inserter(wind_within_limits),
                   [](uint8_t wind_speed) -> bool {
                       return (wind_speed >= minimum_wind_speed) &&
                              (wind_speed <= maximum_wind_speed);
                   });

    // Find the first K (minimum_consecutive_days) consecutive days with
    // wind speed within limits.
    const auto starting_day =
        std::search_n(wind_within_limits.begin(), wind_within_limits.end(),
                      minimum_consecutive_days, true);

    if (starting_day != wind_within_limits.end()) {
        std::cout << "Race may start at day "
                  << std::distance(wind_within_limits.begin(), starting_day) + 1
                  << ".";
    } else {
        std::cout
            << "Wind speeds during the specified days exceed race conditions.";
    }
}
Alternatively, we can integrate the transform into a binary predicate in the std::search_n invocation. This yields a more compact solution, but with, in my opinion, somewhat worse semantics and readability.
#include <algorithm> // std::search_n
#include <cstdint>   // uint8_t (for wind speeds)
#include <iostream>  // std::cout
#include <iterator>  // std::distance
#include <vector>    // std::vector

int main() {
    // Wind data and wind restrictions.
    const std::vector<uint8_t> wind_speed{50U, 40U, 0U,  5U,   0U,
                                          80U, 70U, 90U, 100U, 120U};
    const uint8_t minimum_wind_speed = 10U;
    const uint8_t maximum_wind_speed = 100U;
    const std::size_t minimum_consecutive_days = 3;

    // Find any K (minimum_consecutive_days) consecutive days with wind speed
    // within limits.
    const auto starting_day = std::search_n(
        wind_speed.begin(), wind_speed.end(), minimum_consecutive_days, true,
        [](uint8_t wind_speed, bool) -> bool {
            return (wind_speed >= minimum_wind_speed) &&
                   (wind_speed <= maximum_wind_speed);
        });

    if (starting_day != wind_speed.end()) {
        std::cout << "Race may start at day "
                  << std::distance(wind_speed.begin(), starting_day) + 1 << ".";
    } else {
        std::cout
            << "Wind speeds during the specified days exceed race conditions.";
    }
}
Both of the programs above, given the particular (hard-coded) wind data and restrictions that you've provided, result in:
Race may start at day 6.
You'd need a counter variable that's initially set to 0, and another variable to store the index where the sequence begins, and then iterate through the array one element at a time. If you find an element between 10 and 100, check whether the counter is equal to 0; if it is, store the index in the other variable. Then increment the counter by one. If the counter is equal to K, you're done, so break from the loop. Otherwise, if the element isn't between 10 and 100, set the counter to 0.
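A minimal sketch of that approach (the function name and signature are mine, not from the question):

```cpp
#include <cstddef>
#include <vector>

// Returns the 0-based index of the first element of the first run of K
// consecutive speeds in [10, 100], or -1 if there is no such run.
int first_race_day(const std::vector<int>& S, int K) {
    int count = 0;
    int start = -1;
    for (std::size_t i = 0; i < S.size(); ++i) {
        if (S[i] >= 10 && S[i] <= 100) {
            if (count == 0) start = static_cast<int>(i); // a new run begins here
            if (++count == K) return start;              // found K in a row
        } else {
            count = 0; // run broken, reset the counter
        }
    }
    return -1;
}
```

For the example data, first_race_day({50,40,0,5,0,80,70,90,100,120}, 3) yields index 5, i.e. day 6 in 1-based counting.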

How do I count how often a string of letters occurs in a .txt file? (in C++)

I searched for ways to count how often a string of letters appears in a .txt file and found (among others) this thread: Count the number of times each word occurs in a file
which deals with the problem by counting words (which are separated by spaces).
However, I need to do something slightly different:
I have a .txt file containing billions of letters without any formatting (no spaces, no punctuation, no line breaks, no hard returns, etc.), just a loooooong line of the letters a, g, t and c (i.e.: a DNA sequence ;)).
Now I want to write a program that goes through the entire sequence and counts how often each possible continuous sequence of 9 letters appears in that file.
Yes, there are 4^9 possible combinations of 9-letter 'words' made up of the characters A, G, T and C, but I only want to output the top 1000.
Since there are no spaces or anything, I would have to go through the file letter-by-letter and examine all the 9-letter 'words' that appear, i.e.:
ATAGAGCTAGATCCCTAGCTAGTGACTA
contains the sequences:
ATAGAGCTA, TAGAGCTAG, AGAGCTAGA, etc.
I hope you know what I mean, it's hard for me to describe the same in English, since it is not my native language.
Best regards and thank you all in advance!
Compared to billions, 2^18, or 256k seems suddenly small. The good news is that it means your histogram can be stored in about 1 MB of data. A simple approach would be to convert each letter to a 2-bit representation, assuming your file only contains AGCT, and none of the RYMK... shorthands and wildcards.
This is what this sketch ('esquisse') does. It packs 9 characters of text into an 18-bit value and increments the corresponding histogram bin. To speed up conversion a bit, it reads 4 bytes at a time and uses a lookup table to convert 4 glyphs at once.
I don't know how fast this will run, but it should be reasonable. I haven't tested it, but I know it compiles, at least under gcc. There is no printout, but there is a helper function to unpack sequence packed binary format back to text.
It should give you at least a good starting point
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdint>
#include <exception>

namespace dna {
    // helpers to convert nucleotides to packed binary form
    enum Nucleotide : uint8_t { A, G, C, T };

    uint8_t simple_lut[4][256] = {};

    void init_simple_lut()
    {
        for (size_t i = 0; i < 4; ++i)
        {
            simple_lut[i]['A'] = A << (i * 2);
            simple_lut[i]['C'] = C << (i * 2);
            simple_lut[i]['G'] = G << (i * 2);
            simple_lut[i]['T'] = T << (i * 2);
        }
    }

    uint32_t pack4(const char(&seq)[4])
    {
        return simple_lut[0][seq[0]]
             + simple_lut[1][seq[1]]
             + simple_lut[2][seq[2]]
             + simple_lut[3][seq[3]];
    }

    // you can use this to convert the histogram
    // index back to text.
    std::string hist_index2string(uint32_t n)
    {
        std::string result;
        result.reserve(9);
        for (size_t i = 0; i < 9; ++i, n >>= 2)
        {
            switch (n & 0x03)
            {
            case A: result.insert(result.begin(), 'A'); break;
            case C: result.insert(result.begin(), 'C'); break;
            case G: result.insert(result.begin(), 'G'); break;
            case T: result.insert(result.begin(), 'T'); break;
            default:
                throw std::runtime_error{ "totally unexpected error while unpacking index !!" };
            }
        }
        return result;
    }
}

int main(int argc, const char** argv)
{
    if (argc < 2)
    {
        std::cerr << "Usage: prog_name <input_file> <output_file>\n";
        return 3;
    }
    using dna::pack4;
    dna::init_simple_lut();
    std::vector<uint32_t> result;
    try
    {
        result.resize(1 << 18);
        std::ifstream ifs(argv[1]);
        // read 4 bytes at a time, convert to packed bits representation,
        // then rotate the bits in, 2 by 2, through our 18-bit buffer.
        // increment the corresponding bin by 1
        const uint32_t MASK{ (1 << 18) - 1 }; // 18 bits = 9 nucleotides
        const std::streamsize RB{ 4 };
        uint32_t accu{};
        uint32_t packed{};
        // we need to load at least 9 bytes to 'prime' the engine
        char buffer[4];
        ifs.read(buffer, RB);
        accu = pack4(buffer) << 8;
        ifs.read(buffer, RB);
        accu |= pack4(buffer);
        if (ifs.gcount() != RB)
        {
            throw std::runtime_error{ " input file is too short " };
        }
        ifs.read(buffer, RB);
        while (ifs.gcount() != 0)
        {
            packed = pack4(buffer);
            for (size_t i = 0; i < (size_t)ifs.gcount(); ++i)
            {
                accu <<= 2;
                accu |= packed & 0x03;
                packed >>= 2;
                accu &= MASK;
                ++result[accu];
            }
            ifs.read(buffer, RB);
        }
        ifs.close();
        // histogram is compiled. store data to another file...
        // you can create a secondary table of { index, count };
        // it's only 5 MB long, and use partial_sort to extract the
        // first 1000.
    }
    catch (std::exception& e)
    {
        std::cerr << "Error \'" << e.what() << "\' while reading file.\n";
        return 3;
    }
    return 0;
}
This algorithm can be adapted to run on multiple threads, by opening the file in multiple streams with the proper share configuration and running the loop on separate chunks of the file. Care must be taken with the 16-byte seams between chunks at the end of the process.
If running in parallel, the inner loop is so short that it may be a good idea to provide each thread with its own histogram and merge the results at the end, otherwise, the locking overhead would slow things quite a bit.
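A rough sketch of that per-thread histogram idea, assuming the text has already been converted to one 2-bit code per character in memory (the function and all names are mine, and the packing order is simplified compared to the code above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Count all 9-grams of 2-bit codes with one histogram per thread,
// merged at the end, so the hot loop needs no locking. Each slice is
// extended by 8 codes so 9-grams spanning a seam are counted exactly once.
std::vector<uint32_t> count_9grams(const std::vector<uint8_t>& codes,
                                   unsigned num_threads) {
    const uint32_t MASK = (1u << 18) - 1;
    std::vector<std::vector<uint32_t>> partial(
        num_threads, std::vector<uint32_t>(1u << 18, 0));
    std::vector<std::thread> workers;
    const std::size_t n = codes.size();
    const std::size_t chunk = n / num_threads + 1;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            if (begin >= n) return;
            std::size_t end = std::min(n, begin + chunk + 8); // overlap the seam
            if (end - begin < 9) return; // no full 9-gram in this slice
            uint32_t accu = 0;
            for (std::size_t i = begin; i < end; ++i) {
                accu = ((accu << 2) | codes[i]) & MASK;
                if (i - begin >= 8) ++partial[t][accu]; // 9 codes accumulated
            }
        });
    }
    for (auto& w : workers) w.join();
    // merge the per-thread histograms
    std::vector<uint32_t> result(1u << 18, 0);
    for (const auto& p : partial)
        for (std::size_t i = 0; i < result.size(); ++i) result[i] += p[i];
    return result;
}
```

Thread t counts exactly the 9-grams that start inside its slice, so the partial histograms sum to the single-threaded answer.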
[EDIT] Silly me I had the packed binary lookup wrong.
[EDIT2] replaced the packed lut with a faster version.
This works for you:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    string line;
    int sum = 0;
    ifstream inData;
    inData.open("countletters.txt");
    while (getline(inData, line)) // loop on the read itself, not on eof()
    {
        int numofChars = line.length();
        for (unsigned int n = 0; n < line.length(); n++)
        {
            if (line.at(n) == ' ')
            {
                numofChars--;
            }
        }
        sum = numofChars + sum;
    }
    cout << "Number of characters: " << sum << endl;
    return 0;
}

Call recursive function in std::function

I just wrote a thread pool using C++11/14 std::thread objects, with tasks in a worker queue. I encountered some weird behaviour when calling recursive functions from lambda expressions. The following code crashes if you implement fac() in a recursive fashion (with both clang 3.5 and gcc 4.9):
#include <functional>
#include <vector>

std::size_t fac(std::size_t x) {
    // This will crash (segfault).
    // if (x == 1) return 1;
    // else return fac(x-1)*x;
    // This, however, works fine.
    auto res = 1;
    for (auto i = 2; i < x; ++i) {
        res *= x;
    }
    return res;
}

int main() {
    std::vector<std::function<void()> > functions;
    for (auto i = 0; i < 10; ++i) {
        functions.emplace_back([i]() { fac(i); });
    }
    for (auto& fn : functions) {
        fn();
    }
    return 0;
}
It does, however, work fine with the iterative version above. What am I missing?
for (auto i = 0; i < 10; ++i) {
    functions.emplace_back([i]() { fac(i); });
The first time through that loop, i is going to be set to zero, so you're executing:
fac(0);
Doing so with the recursive definition:
if (x == 1) return 1;
else return fac(x-1)*x;
means that the else block will execute, and hence x will wrap around to whatever the maximum size_t value is (as it's unsigned).
Then it's going to run from there down to 1, consuming one stack frame each time. At a minimum, that's going to consume 65,000 or so stack frames (based on the minimum allowed value of size_t from the standards), but probably more, much more.
That's what's causing your crash. The fix is relatively simple. Since 0! is defined as 1, you can simply change your statement to:
if (x <= 1)
    return 1;
return fac (x-1) * x;
But you should keep in mind that recursive functions are best suited to those cases where the solution space reduces quickly, a classic example being the binary search, where the solution space is halved every time you recur.
Functions that don't reduce the solution space quickly are usually prone to stack overflow problems (unless the optimiser can optimise away the recursion). You may still run into problems if you pass in a big enough number; it's not really different from adding together two unsigned numbers in the bizarre way below (though I actually saw it put forward as a recursive example many moons ago):
def addu (unsigned a, b):
    if b == 0:
        return a
    return addu (a + 1, b - 1)
So, in your case, I'd stick with the iterative solution, albeit making it bug-free:
auto res = 1;
for (auto i = 2; i <= x; ++i) // include the limit with <=.
    res *= i;                 // multiply by i, not x.
The two definitions have different behavior for x = 0. The loop version will be fine, as it uses the less-than operator:
auto res = 1;
for (auto i = 2; i < x; ++i) {
    res *= x;
}
However,
if (x == 1) return 1;
else return fac(x-1)*x;
results in a quasi-infinite recursion, as x == 1 is false for x = 0, and x-1 then yields the largest possible value of std::size_t (typically 2^64 - 1).
The recursive version does not take care of the case for x == 0.
You need:
std::size_t fac(std::size_t x) {
    if (x == 1 || x == 0) return 1;
    return fac(x-1)*x;
}
or
std::size_t fac(std::size_t x) {
    if (x == 0) return 1;
    return fac(x-1)*x;
}

How to increment a 64 bit number when the OS only supports 32?

I have a number stored as a string up to 16 chars (0 to F) in length. This needs to be incremented and the result stored as a string.
Simplest thing would have been to convert the string to an int, increment it by one, then convert back to a string. However, the OS doesn't support 64-bit numbers. What's the alternative?
I presume a hand crafted solution using two 32 bit integers is possible, in which case this must be a common scenario, but I couldn't find any boilerplate template code for doing such a thing after a bit of googling.
UPDATE: Sorry - should have mentioned earlier - this is for Brew MP C++ - their conversion libraries' APIs are limited to 32 bits.
And experimenting with long long seems to run into fundamental problems executing on the hardware, making it unusable.
You can increment one digit at a time. If the digit is between '0' and '8', just add one; likewise if it's between 'A' and 'E'. If it's '9', set it to 'A'. If it's 'F', set it to '0' and increment the next digit to the left.
Your compiler doesn't support a 64-bit long?
You can use a variety of libraries to support arbitrary width integers such as http://gmplib.org/
Include <cstdint> and use int64_t or uint64_t. You can use them even on 32-bit machines. All you need is a modern compiler.
Here is the solution suggested by @Mark in C++:
template<typename Iter>
void increment(Iter begin, Iter end)
{
    for (; begin != end; ++begin)
    {
        ++*begin;
        if (*begin == 'G')
        {
            *begin = '0';
            continue; // carry into the next digit to the left
        }
        if (*begin == ':') // ':' is the character after '9'
        {
            *begin = 'A';
        }
        break;
    }
}
Remember to call it with reverse iterators, so the string gets incremented from right to left:
int main()
{
    std::string x = "123456789ABCDEFF";
    increment(x.rbegin(), x.rend());
    std::cout << x << std::endl;
    x = "FFFFFFFFFFFFFFFF";
    increment(x.rbegin(), x.rend());
    std::cout << x << std::endl;
}
This is hand-ported from some C# I wrote (without a C++ compiler) - so you may need to tweak it. This works against a low/high type struct.
struct VirtInt64
{
    unsigned int Low;
    unsigned int High;
};

void Increment(VirtInt64* a)
{
    unsigned int newLow = a->Low + 1; // 'var' in the original C#
    if (newLow < a->Low) // the addition wrapped around: carry into High
    {
        a->Low = 0;
        a->High = a->High + 1;
    }
    else
    {
        a->Low = newLow;
    }
}

How to convert an int to a binary string representation in C++

I have an int that I want to store as a binary string representation. How can this be done?
Try this:
#include <bitset>
#include <iostream>
int main()
{
    std::bitset<32> x(23456);
    std::cout << x << "\n";
    // If you don't want a variable just create a temporary.
    std::cout << std::bitset<32>(23456) << "\n";
}
I have an int that I want to first convert to a binary number.
What exactly does that mean? There is no type "binary number". Well, an int is already represented in binary form internally unless you're using a very strange computer, but that's an implementation detail -- conceptually, it is just an integral number.
Each time you print a number to the screen, it must be converted to a string of characters. It just so happens that most I/O systems chose a decimal representation for this process so that humans have an easier time. But there is nothing inherently decimal about int.
Anyway, to generate a base b representation of an integral number x, simply follow this algorithm:
1. Initialize s with the empty string.
2. m = x % b
3. x = x / b
4. Convert m into a digit, d.
5. Append d on s.
6. If x is not zero, go to step 2.
7. Reverse s.
Step 4 is easy if b <= 10 and your computer uses a character encoding where the digits 0-9 are contiguous, because then it's simply d = '0' + m. Otherwise, you need a lookup table.
Steps 5 and 7 can be simplified to append d on the left of s if you know ahead of time how much space you will need and start from the right end in the string.
In the case of b == 2 (e.g. binary representation), step 2 can be simplified to m = x & 1, and step 3 can be simplified to x = x >> 1.
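Putting these steps together, here is a sketch of the generic algorithm for bases 2 through 16 (the function name is mine; the digit table is the lookup table mentioned for step 4):

```cpp
#include <algorithm>
#include <string>

std::string to_base(unsigned x, unsigned b) {
    const char* digits = "0123456789ABCDEF"; // lookup table for step 4
    std::string s;                           // step 1
    do {
        s.push_back(digits[x % b]);          // steps 2, 4, 5
        x /= b;                              // step 3
    } while (x != 0);                        // step 6
    std::reverse(s.begin(), s.end());        // step 7
    return s;
}
```

Note the do/while: it makes x = 0 produce "0" instead of an empty string.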
Solution with reverse:
#include <string>
#include <algorithm>
std::string binary(unsigned x)
{
    std::string s;
    do
    {
        s.push_back('0' + (x & 1));
    } while (x >>= 1);
    std::reverse(s.begin(), s.end());
    return s;
}
Solution without reverse:
#include <string>
std::string binary(unsigned x)
{
    // Warning: this breaks for numbers with more than 64 bits
    char buffer[64];
    char* p = buffer + 64;
    do
    {
        *--p = '0' + (x & 1);
    } while (x >>= 1);
    return std::string(p, buffer + 64);
}
AND the number with 100000..., then 010000..., 0010000..., etc. Each time, if the result is 0, put a '0' in a char array, otherwise put a '1'.
const int numberOfBits = sizeof(int) * 8;
char binary[numberOfBits + 1];
int decimal = 29;

for (int i = 0; i < numberOfBits; ++i) {
    if ((decimal & (0x80000000 >> i)) == 0) {
        binary[i] = '0';
    } else {
        binary[i] = '1';
    }
}
binary[numberOfBits] = '\0';
std::string binaryString(binary);
http://www.phanderson.com/printer/bin_disp.html is a good example.
The basic principle of a simple approach:
1. Loop until the # is 0.
2. & (bitwise and) the # with 1. Print the result (1 or 0) to the end of the string buffer.
3. Shift the # by 1 bit using >>=.
4. Repeat the loop.
5. Print the reversed string buffer.
To avoid reversing the string or needing to limit yourself to #s fitting the buffer string length, you can:
1. Compute ceiling(log2(N)) - say L.
2. Compute mask = 2^L.
3. Loop until mask == 0:
   - & (bitwise and) the mask with the #. Print the result (1 or 0).
   - number &= (mask - 1)
   - mask >>= 1 (divide by 2)
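A sketch of that second variant; I locate the top bit by shifting rather than calling log2, to stay in integer arithmetic (the function name is mine):

```cpp
#include <string>

std::string binary_msb_first(unsigned n) {
    if (n == 0) return "0";
    unsigned mask = 1;
    while (mask <= n / 2) mask <<= 1; // park the mask on the highest set bit
    std::string s;
    while (mask != 0) {
        s.push_back((n & mask) ? '1' : '0'); // emit most significant bit first
        mask >>= 1;                          // divide the mask by 2
    }
    return s;
}
```

Because the mask walks from the most significant set bit down to bit 0, the string comes out in the right order with no reversal and no leading zeros.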
I assume this is related to your other question on extensible hashing.
First define some mnemonics for your bits:
const int FIRST_BIT = 0x1;
const int SECOND_BIT = 0x2;
const int THIRD_BIT = 0x4;
Then you have your number you want to convert to a bit string:
int x = someValue;
You can check if a bit is set by using the bitwise & operator.
if(x & FIRST_BIT)
{
// The first bit is set.
}
And you can keep a std::string, appending '1' to it if a bit is set and '0' if it is not. Depending on what order you want the string in, you can start with the last bit and move to the first, or just go first to last.
You can refactor this into a loop and using it for arbitrarily sized numbers by calculating the mnemonic bits above using current_bit_value<<=1 after each iteration.
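That refactoring might look like the following sketch (names are mine); the mask current_bit_value doubles on each iteration, so the named mnemonics are no longer needed:

```cpp
#include <string>

std::string bits_to_string(int x, int num_bits) {
    std::string s;
    int current_bit_value = 1; // starts at FIRST_BIT
    for (int i = 0; i < num_bits; ++i) {
        // prepend so the most significant bit ends up first in the string
        s.insert(s.begin(), (x & current_bit_value) ? '1' : '0');
        current_bit_value <<= 1; // move to the next bit
    }
    return s;
}
```

Prepending each character keeps the most significant bit on the left even though the loop tests bits from least to most significant.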
There isn't a direct function; you can just walk along the bits of the int (hint: see >>) and insert a '1' or '0' into the string.
Sounds like a standard interview / homework type question
Use the sprintf function to store the formatted output in a string variable, instead of printf for direct printing. Note, however, that these functions only work with C strings, not C++ strings.
There's a small header-only library you can use for this here.
Example:
std::cout << ConvertInteger<Uint32>::ToBinaryString(21);
// Displays "10101"
auto x = ConvertInteger<Int8>::ToBinaryString(21, true);
std::cout << x << "\n"; // displays "00010101"
auto y = ConvertInteger<Uint8>::ToBinaryString(21, true, "0b");
std::cout << y << "\n"; // displays "0b00010101"
Solution without reverse, no additional copy, and with 0-padding:
#include <iostream>
#include <string>
template <short WIDTH>
std::string binary( unsigned x )
{
    std::string buffer( WIDTH, '0' );
    char *p = &buffer[ WIDTH ];
    do {
        --p;
        if (x & 1) *p = '1';
    } while (x >>= 1);
    return buffer;
}

int main()
{
    std::cout << "'" << binary<32>(0xf0f0f0f0) << "'" << std::endl;
    return 0;
}
This is my best implementation for converting integers (of any type) to a std::string. You can remove the template if you are only going to use it for a single integer type. To the best of my knowledge, I think there is a good balance here between the safety of C++ and the cryptic nature of C. Make sure to include the needed headers.
template<typename T>
std::string bstring(T n){
    std::string s;
    for(int m = sizeof(n) * 8; m--;){
        s.push_back('0' + ((n >> m) & 1));
    }
    return s;
}
Use it like so,
std::cout << bstring<size_t>(371) << '\n';
This is the output on my computer (it may differ on other computers):
0000000000000000000000000000000000000000000000000000000101110011
Note that the entire binary string is produced, padding zeros included, which helps to show the bit size. So the length of the string is the size of size_t in bits.
Let's try a signed integer (a negative number):
std::cout << bstring<signed int>(-1) << '\n';
This is the output on my computer (as stated, it may differ on other computers):
11111111111111111111111111111111
Note that the string is now smaller; this shows that signed int consumes less space than size_t on my machine. As you can see, my computer uses the two's complement method to represent signed integers (negative numbers). You can now see why unsigned short(-1) > signed int(1).
Here is a version made just for signed integers, without templates; use this if you only intend to convert signed integers to strings.
std::string bstring(int n){
    std::string s;
    for(int m = sizeof(n) * 8; m--;){
        s.push_back('0' + ((n >> m) & 1));
    }
    return s;
}