Code Instructions
Hey guys. Above is a coding project I have been assigned. I'm reading the instructions and am completely lost because I've never learned how to code an undirected graph. Not sure how my professor expects us to do this, but I was hoping I could get some help from experts. Are there any readings (or tips) you'd suggest I look at to familiarize myself with how to get started on the program? Appreciate it, thanks!
The problem to solve is called "Word Morph". Your instructor gave some restrictions, namely to use an undirected graph where a neighbour node differs by only one character from the origin. Unfortunately the requirements are not clear enough. "Differs by one character" is ambiguous. If we use the replace-insert-delete idiom, then we need a different function than simply comparing 2 equal-size strings. I assume the full approach.
And, at the end, you need to find the shortest path through the graph.
I will present you one possible solution: a complete working code example.
By the way, the graph is unweighted, because the cost of travelling from one node to the next is always 1. So actually we are talking about an undirected, unweighted graph.
The main algorithms we need here are:
Levenshtein, to calculate the distance between 2 strings,
and Breadth First Search, to find the shortest path through a graph.
Please note: if the words always have the same length, then no Levenshtein is needed. Just compare character by character and count the differences. That's rather simple. (But as said: the instructions are a little bit unclear.)
Both algorithms can be modified. For example: you do not need a Levenshtein distance greater than 1, so you can terminate the distance calculation as soon as you know the distance exceeds one. And, in the breadth-first search, you could record the path through which you are going.
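If your instructor really means the simpler same-length interpretation, a minimal sketch of such an early-terminating check could look like this (the helper name is just illustrative and not part of the solution below):
#include <cstddef>
#include <string>

// True if two equally long words differ in at most one position.
// Stops as soon as a second difference is found (early termination).
bool differsByAtMostOne(const std::string& a, const std::string& b)
{
    if (a.size() != b.size()) return false;
    std::size_t differences = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i] && ++differences > 1)
            return false;
    return true;
}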
OK, now, how to implement an undirected graph. There are 2 possibilities:
A Matrix (I will not explain)
A list or a vector
I would recommend the vector approach for this case. A matrix would be rather sparse, so a vector is the better fit.
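Just to illustrate why the matrix would be sparse, here is a minimal sketch of that variant (not used in the solution below; the size is only an example):
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 12;   // number of words in the list
    // n x n adjacency matrix; for a word list like the one below, most entries stay false
    std::vector<std::vector<bool>> adjacency(n, std::vector<bool>(n, false));
    // mark words 0 and 1 as neighbours (set both directions, because the graph is undirected)
    adjacency[0][1] = true;
    adjacency[1][0] = true;
}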
The basic data structure that you need is a node containing a vertex and its neighbours. So you have the word (a std::string) as the vertex and the "neighbours", which are a std::vector of index positions of other nodes in the graph.
The graph is a vector of nodes, and a node's neighbours point to other nodes in this vector. We use the index into the vector to denote neighbours. All this we pack into a structure and call it "UndirectedGraph". We add a "build" function that checks for adjacencies. In this function we compare each string with every other and check if the difference is < 2, so 0 or 1: 0 means equal and 1 is the given constraint. If we find such a difference, we add it as a neighbour in the corresponding node.
Additionally we add a breadth-first search algorithm. It is described on Wikipedia.
To ease the implementation of that algorithm we add a "visited" flag to the node.
Please see the code below:
#include <fstream>   // needed once you switch from the istringstream to the real std::ifstream below
#include <sstream>
#include <iostream>
#include <vector>
#include <string>
#include <iterator>
#include <iomanip>
#include <numeric>
#include <algorithm>
#include <queue>
std::istringstream textFileStream(R"#(peach
peace
place
plane
plans
plays
slays
stays
stars
sears
years
yearn
)#");
using Vertex = std::string;
using Edge = std::vector<size_t>;
// One node in a graph
struct Node
{
// The Vertex is a std::string
Vertex word{};
// The edges are the index of the neighbour nodes
Edge neighbour{};
// For Breadth First Search
bool visited{ false };
// Easy input and output
friend std::istream& operator >> (std::istream& is, Node& n) {
n.neighbour.clear();
std::getline(is, n.word);
return is;
}
friend std::ostream& operator << (std::ostream& os, const Node& n) {
os << std::left << std::setw(25) << n.word << " --> ";
std::copy(n.neighbour.begin(), n.neighbour.end(), std::ostream_iterator<int>(os, " "));
return os;
}
};
// The graph
struct UndirectedGraph
{
// Contains a vector of nodes
std::vector<Node> graph;
// build adjacencies
void build();
// Find Path
bool checkForPathFromStartToEnd(size_t start, size_t end);
bool checkForPath() {
bool result = false;
if (graph.size() > 1) {
size_t s = graph.size() - 2;
size_t e = s + 1;
result = checkForPathFromStartToEnd(s, e);
}
return result;
}
// Easy input and output
friend std::istream& operator >> (std::istream& is, UndirectedGraph& ug) {
ug.graph.clear();
std::copy(std::istream_iterator<Node>(is), std::istream_iterator<Node>(), std::back_inserter(ug.graph));
return is;
}
friend std::ostream& operator << (std::ostream& os, const UndirectedGraph& ug) {
size_t i{ 0 };
for (const Node& n : ug.graph)
os << std::right << std::setw(4) << i++ << ' ' << n << '\n';
return os;
}
};
// Distance between 2 strings
size_t levenshtein(const std::string& string1, const std::string& string2)
{
const size_t lengthString1(string1.size());
const size_t lengthString2(string2.size());
if (lengthString1 == 0) return lengthString2;
if (lengthString2 == 0) return lengthString1;
std::vector<size_t> costs(lengthString2 + 1);
std::iota(costs.begin(), costs.end(), 0);
for (size_t indexString1 = 0; indexString1 < lengthString1; ++indexString1) {
costs[0] = indexString1 + 1;
size_t corner = indexString1;
for (size_t indexString2 = 0; indexString2 < lengthString2; ++indexString2) {
size_t upper = costs[indexString2 + 1];
if (string1[indexString1] == string2[indexString2]) {
costs[indexString2 + 1] = corner;
}
else {
const size_t temp = std::min(upper, corner);
costs[indexString2 + 1] = std::min(costs[indexString2], temp) + 1;
}
corner = upper;
}
}
size_t result = costs[lengthString2];
return result;
}
// Build the adjacencies
void UndirectedGraph::build()
{
// Iterate over all words in the graph
for (size_t i = 0; i < graph.size(); ++i)
// Compare everything with everything (because of symmetry, omit half of the comparisons)
for (size_t j = i + 1; j < graph.size(); ++j)
// Check the distance of the 2 words to compare
if (levenshtein(graph[i].word, graph[j].word) < 2U) {
// And store the adjacencies
graph[i].neighbour.push_back(j);
graph[j].neighbour.push_back(i);
}
}
bool UndirectedGraph::checkForPathFromStartToEnd(size_t start, size_t end)
{
// Assume that it will not work
bool result = false;
// Store intermediate tries in queue
std::queue<size_t> check{};
// Set initial values
graph[start].visited = true;
check.push(start);
// As long as we have not visited all possible nodes
while (!check.empty()) {
// Get the next node to check
size_t currentNode = check.front(); check.pop();
// If we found the solution . . .
if (currentNode == end) {
// Then set the resulting value and stop searching
result = true;
break;
}
// Go through all neighbours of the current node
for (const size_t next : graph[currentNode].neighbour) {
// If the neighbour node has not yet been visited
if (!graph[next].visited) {
// Then visit it
graph[next].visited = true;
// And check following elements next time
check.push(next);
}
}
}
return result;
}
int main()
{
// Get the filename from the user
std::cout << "Enter Filename for file with words:\n";
std::string filename{};
//std::cin >> filename;
// Open the file
//std::ifstream textFileStream(filename);
// If the file could be opened . . .
if (textFileStream) {
// Create an empty graph
UndirectedGraph undirectedGraph{};
// Read the complete file into the graph
textFileStream >> undirectedGraph;
Node startWord{}, targetWord{};
std::cout << "Enter start word and enter target word\n"; // teach --> learn
std::cin >> startWord >> targetWord;
// Add the 2 words at the end of our graph
undirectedGraph.graph.push_back(startWord);
undirectedGraph.graph.push_back(targetWord);
// Build adjacency graph, including the just added words
undirectedGraph.build();
// For debug purposes: Show the graph
std::cout << undirectedGraph;
std::cout << "\n\nMorph possible? --> " << std::boolalpha << undirectedGraph.checkForPath() << '\n';
}
else {
// File could not be found or opened
std::cerr << "Error: Could not open file : " << filename;
}
return 0;
}
Please note: although I have implemented asking for a file name, I do not use it in this example. I read from an istringstream. You need to delete the istringstream later and comment the existing statements back in.
Regarding the requirements from the instructor: I did not use any STL/library/Boost searching algorithm. (What for, in this example?) But I do of course use other C++ STL containers. I will not reinvent the wheel and come up with a new "vector" or queue. And I will definitely not use "new", C-style arrays or pointer arithmetic.
Have fun!
And to all others: sorry, I could not resist writing the code . . .
Related
Let's say I have this struct containing an integer.
struct Element
{
int number;
Element(int number)
{
this->number = number;
}
};
And I'm gonna create a vector containing many Element structs.
std::vector<Element> array;
Pretend that all the Element structs inside array have been initialized and have their number variable set.
My question is how can I instantly get an element based on the variable number?
It is very possible to do it with a for loop, but I'm currently focusing on optimization and trying to avoid as many for loops as possible.
I want it to be as instant as getting by index:
Element wanted_element = array[wanted_number]
There must be some kind of overloading stuff, but I don't really know what operators or stuff to overload.
Any help is appreciated :)
With an equality operator overloaded, std::find is available to help:
#include <iostream>
#include <vector>
#include <algorithm>
#include <ctime>     // for std::clock()
struct Element
{
int number;
Element(int number)
{
this->number = number;
}
bool operator == (const Element& el) const
{
return number == el.number;
}
};
int main()
{
std::vector<Element> array;
std::vector<int> test;
for(int i=0;i<100;i++)
{
auto t = std::clock();
test.push_back(t);
array.push_back(Element(t));
}
auto valToFind = test[test.size()/2];
std::cout << "value to find: "<<valToFind<<std::endl;
Element toFind(valToFind);
auto it = std::find(array.begin(),array.end(),toFind);
if(it != array.end())
std::cout<<"found:" << it->number <<std::endl;
return 0;
}
The performance of the above method depends on the position of the searched value in the array. Non-existing values and last elements will take the longest, while the first element will be found quickest.
If you need to optimize search time, you can use another data structure instead of a vector. For example, std::map is simple to use here and fast on average (compared to finding the last elements in the vector version):
#include <iostream>
#include <vector>
#include <algorithm>
#include <map>
#include <ctime>     // for std::clock()
struct Element
{
int number;
Element(){ number = -1; }
Element(int number)
{
this->number = number;
}
};
int main()
{
std::map<int,Element> mp;
std::vector<int> test;
for(int i=0;i<100;i++)
{
auto t = std::clock();
test.push_back(t);
mp[t]=Element(t);
}
auto valToFind = test[test.size()/2];
std::cout << "value to find: "<<valToFind<<std::endl;
auto it = mp.find(valToFind);
if(it != mp.end())
std::cout<<"found:" << it->second.number <<std::endl;
return 0;
}
If you have to use a vector, you can still keep a map next to the vector to track its elements the same way as above, just with extra memory and extra deletions/updates on the map whenever the vector is altered.
Anything you successfully invent would look like hashing or a tree in the end. std::unordered_map uses hashing while std::map uses a red-black tree.
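For example, a minimal sketch of keeping an std::unordered_map from number to vector index next to the vector (assuming the numbers are unique) could look like this:
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Element { int number; };

int main() {
    std::vector<Element> array { {42}, {7}, {13} };
    std::unordered_map<int, std::size_t> indexByNumber;   // number -> position in array
    for (std::size_t i = 0; i < array.size(); ++i)
        indexByNumber[array[i].number] = i;

    auto it = indexByNumber.find(13);                     // O(1) on average
    if (it != indexByNumber.end())
        std::cout << "found: " << array[it->second].number << '\n';
}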
If the range of values is very limited, like 0 to 1000 only, then simply saving each element's index in a second vector would be enough:
vec[number] = indexOfVector;
Element found = array[vec[number]];
If the range is unbounded and you don't want to use a map or an unordered_map, you can still put a direct-mapped cache in front of the std::find call. On average, simple caching should decrease the total time taken on repeated searches (how often do you search for the same item?).
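A rough sketch of such a direct-mapped cache in front of std::find could look like the following (the cache size and the names are only illustrative):
#include <algorithm>
#include <cstddef>
#include <vector>

struct Element { int number; };

struct CachedFinder {
    static constexpr std::size_t CacheSize = 256;
    struct Slot { int number = 0; std::size_t index = 0; bool valid = false; };
    Slot cache[CacheSize]{};

    // Returns the index of the element with the given number, or array.size() if it is absent.
    std::size_t find(const std::vector<Element>& array, int number) {
        Slot& slot = cache[static_cast<std::size_t>(number) % CacheSize];
        if (slot.valid && slot.number == number
            && slot.index < array.size() && array[slot.index].number == number)
            return slot.index;                                  // cache hit: no linear scan needed
        auto it = std::find_if(array.begin(), array.end(),
                               [number](const Element& e) { return e.number == number; });
        if (it == array.end())
            return array.size();                                // not found
        const std::size_t idx = static_cast<std::size_t>(it - array.begin());
        slot = { number, idx, true };                           // remember it for the next lookup
        return idx;
    }
};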
I have a file in the following fixed format:
<unique record identifier> <white_space> <numeric value>
e.g.
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.
I want to write a program that reads from 'stdin' the contents of a file, and optionally accepts
the absolute path of a file from the command line. The file/stdin stream is expected
to be in the above format. The output should be a list of the unique ids associated
with the X-largest values in the rightmost column, where X is specified by an input
parameter.
For example, given the input data above and X=3, the following would be
valid output:
1426828028
1426828066
1426828056
Note that the output does not need to be in any particular order. Multiple instances
of the same numeric value count as distinct records of the total X. So if we have 4
records with values: 200, 200, 115, 110 and X=2 then the result must consist of the two
IDs that point to 200 and 200 and no more.
Notice: take into account extremely large files.
My idea and brief implementation:
Sorting by k-largest values
1st way: I want to read file content into multimap then iterate k elements to output
2nd way: Read file data into a vector<pair<int, int>> then use heap sort (priority queue).
I'm wondering which way has better time complexity & higher performance? Time complexity of 2nd way should be nlog(n). Is time complexity of 1st way log(n)? Please tell me both time & space complexity of the above methods and suggest any other better methods.
Besides, the input file is huge, so I'm thinking of using an external sort. But I haven't done it before. I'd appreciate it if someone could instruct me or write sample code for my better understanding.
Anyway, it's not required to sort the output. We only need to print the X largest values in any order. So I'm wondering whether I need any sorting algorithm at all. The requirement to print the X largest values in any order seems odd, because we must sort in descending order before printing anyway. So I don't even know why it says "in any order", as if that makes the problem easier.
My brief code:
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <map>       // std::map / std::multimap
#include <vector>    // std::vector
#include <cassert>   // assert
//#include "stdafx.h"
using namespace std;
std::multimap<int, int> mp;
typedef std::pair<std::string, int> mypair;
struct IntCmp {
bool operator()(const mypair &lhs, const mypair &rhs) {
return lhs.second > rhs.second; // sort descending, so the first k elements are the k largest
}
};
void printK(const std::map<std::string,int> &mymap, int k) {
std::vector<mypair> myvec(mymap.begin(), mymap.end());
assert(myvec.size() >= k);
std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());
for (int i = 0; i < k; ++i) {
std::cout << i << ": " << myvec[i].first
<< "-> " << myvec[i].second << "\n";
}
}
void readinfo(std::istream& in)
{
std::string s, ID, Value;
//while(getline(in, s))
while(in >> ID >> Value)
std::cout << ID << ' ' << Value << '\n';
}
int main (int argc, char **argv) {
if (argc > 1) { /* read from file if given as argument */
std::ifstream fin (argv[1]);
if (fin.is_open())
readinfo(fin);
else
{
std::cerr << "error: Unable to open file " << argv[1] << '\n';
return 1;
}
}
else
// No input file has been passed in the command line.
// Read the data from stdin (std::cin).
{
readinfo(std::cin);
}
return 0;
}
But I don't know how to split the huge file to sort and combine back together. Please tell me how to fix my code for this problem.
Maybe you could use a min-heap via std::priority_queue:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <queue>
#include <vector>
struct IdAndValue {
std::string id;
int value;
};
struct ValueCmp {
bool operator()(const IdAndValue &lhs, const IdAndValue &rhs) {
return lhs.value > rhs.value;
}
};
void PrintTopK(std::istream &in, long k) {
std::priority_queue<IdAndValue, std::vector<IdAndValue>, ValueCmp> largest_k;
std::string id;
int value;
while (in >> id >> value) {
if (largest_k.size() < k) {
largest_k.push({.id = id, .value = value});
} else {
if (value > largest_k.top().value) {
largest_k.pop();
largest_k.push({.id = id, .value = value});
}
}
}
std::cout << "Top " << k << " IDs with largest values:\n";
while (!largest_k.empty()) {
IdAndValue id_and_value = largest_k.top();
largest_k.pop();
std::cout << id_and_value.id << '\n';
}
}
int main(int argc, char **argv) {
if (argc > 2) { // Read from file if given as argument.
std::ifstream fin(argv[1]);
if (fin.is_open())
PrintTopK(fin, std::strtol(argv[2], nullptr, 10));
else {
std::cerr << "Error: Unable to open file " << argv[1] << '\n';
return 1;
}
} else { // Read the data from stdin (std::cin).
PrintTopK(std::cin, std::strtol(argv[1], nullptr, 10));
}
return 0;
}
Usage from stdin (Ctrl + D to send EOF on unix):
./PrintTopK 3
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
Top 3 IDs with largest values:
1426828066
1426828056
1426828028
Usage when passed in a file:
$ ./PrintTopK /Users/shash/CLionProjects/PrintTopK/data.txt 3
Top 3 IDs with largest values:
1426828066
1426828056
1426828028
With data.txt:
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
I think we can come up with a better approach that has a lower space and time complexity.
One requirement is to get the X largest values. Then we only need to store X values, not more. The others are of no interest. All values will be read and, if not larger than the already collected values, discarded. With that, we save tons of memory.
Next, how to store?
If we have an already sorted container, then the smallest element is always at the beginning. So, if we read a new value, then we just need to compare this new value with the first element in the container. Because, if the new value is smaller than the smallest existing value, we can discard it. But if it is bigger, then we need to add it to our container and eliminate the previously smallest element.
If we use a function like std::lower_bound, then it will give us the exact position where we need to insert the new element, without the need for any re-sorting. It can be inserted at exactly the correct position. Then we have a new smallest value.
To select the type of container, we think of the operations that we need to do.
We want to eliminate the first element (without shifting all other data).
We want to add an element at a given position, without the need to shift all following values to the right.
This leads us to a std::list, which will fulfil our criteria in an optimal way.
So, how will we implement the solution?
Define a struct to hold the data, the unique id and the associated value
Add extraction >> and insertion << operators for easier IO
Add 2 sort operator overloads for std::list::sort and std::lower_bound
In main, get a std::istream, either from a given source file or from std::cin
Read the first X values and store them in the list as-is. If there are only X values or fewer, then we already have the solution
Sort the values in the list. The smallest value is now at the front of the list
If the std::istream still contains data, then continue to read values
If the new value is greater than the smallest value in the list, then insert it into the list at its sorted position
Delete the smallest value at the front.
After the initial sorting, each further operation takes time that depends only on X, not on the number of input values (the list never holds more than X elements). Anyway, most time will be spent reading data from the disk if the file is huge. For small files with 100'000 values or so, any other algorithm would also be fine.
Please see one of many potential solutions below.
#include <iostream>
#include <fstream>
#include <iterator>
#include <memory>
#include <random>
#include <string>
#include <list>
#include <limits>
#include <algorithm>
const std::string fileName{ "r:\\big.txt" };
// ----------------------------------------------------------------------------
// Create a big file for Test purposes
void createBigFile() {
if (std::ofstream ofs(fileName); ofs) {
constexpr size_t uniqueIDStart = 1'426'828'028;
constexpr size_t numberOfRecords = 10'000'000;
constexpr size_t endRecords = uniqueIDStart + numberOfRecords;
std::random_device randomDevice;
std::mt19937 randomEngine(randomDevice());
std::uniform_int_distribution<int> uniformDistribution(1, 10'000'000);
for (size_t k{ uniqueIDStart }; k < endRecords; ++k) {
ofs << k << ' ' << uniformDistribution(randomEngine) << '\n';
}
}
}
// ----------------------------------------------------------------------------
// Here we will store our data
struct Data {
unsigned int ID{};
int value{ std::numeric_limits<int>::max() };
// Sorting operators
bool operator < (const int& i) const { return value < i; } // For lower_bound
bool operator < (const Data& other) const { return value < other.value; } // For sort
// Simple extractor and inserter
friend std::istream& operator >> (std::istream& is, Data& d) { return is >> d.ID >> d.value; }
friend std::ostream& operator << (std::ostream& os, const Data& d) { return os << d.ID << ' ' << d.value; }
};
// Whatever number of X you need for the X-largest values
constexpr size_t Rank = 50;
// We will use a list to store the X-largest data
using DList = std::list<Data>;
using ListIter = DList::iterator;
// For faster reading we will increase the input buffer size
constexpr size_t ifStreamBufferSize = 500'000u;
static char buffer[ifStreamBufferSize];
int main(int argc, char* argv[]) {
// If you want to create test data, then uncomment the following line
//createBigFile();
//We will either read from std::cin or from a file
std::shared_ptr<std::istream> input{};
if (argc == 2) {
// Try to open the source file, given by command line arguments
input.reset(new std::ifstream(argv[1]));
input->rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
}
else {
// Use std::cin for input. Handover a NoOp custom deleter. We do not want to close std::cin
input.reset(&std::cin, [](...) {});
}
// If file could be opened or if std::cin is OK
if (input) {
// Here we will store all data
DList dList;
// Read the first X values as is
size_t numberOfElementsInArray{};
Data data;
while (*input >> data) {
if (numberOfElementsInArray < Rank) {
dList.push_front(std::move(data));
++numberOfElementsInArray;
}
if (numberOfElementsInArray >= Rank) break;
}
// And do a first-time (and the only) sort
dList.sort();
// For comparison
int smallest{ dList.front().value };
// Read all data from file
while (*input >> data) {
// If the latest read value is bigger than the smallest in the list, then we need to add it now
if (data.value > smallest) {
// Find the position where to insert the new element
ListIter dIter = std::lower_bound(dList.begin(), dList.end(), data.value);
// Insert the new value where it should be (end() is fine: the new value is then the largest so far). This keeps the list sorted
dList.insert(dIter, std::move(data));
// We now have one value more than needed. Get rid of the smallest one and remember the new smallest value
dList.pop_front();
smallest = dList.front().value;
}
}
std::copy(dList.rbegin(), dList.rend(), std::ostream_iterator<Data>(std::cout, "\n"));
}
else std::cerr << "*** Error with input file (" << fileName << ") or with std::cin\n\n";
}
Given a string containing a number of characters interspersed with dashes, for example string s = "A--TG-DF----GR--";, I wish to randomly select a block of dashes (could be of size 1, 2, …, max number of consecutive dashes in string), and copy them over to another part of the string chosen at random.
For example, A--TG-DF(---)-GR-- may become A--T(---)G-DF-GR--, while
another iteration may turn A--TG-DF----GR(--) into A--TG-(--)DF----GR.
I'm generating random indices of the string through int i = rand() % (int) s.length();. Insertion happens through s.insert(rand() % (int) s.length(), substr);, where substr is the block of dashes.
My main problem lies in finding randomly a continuous group of dashes. I thought of using s.find("-"), but that'd only return the first instance of a single dash, and not a random position of a collection of dashes.
I know this problem is likely steeped in XY problems, but I found it a nice challenge nonetheless, so I thought about implementing it with the Boost Interval Container library.
The beauty of the library is that you can forget about a lot of details, while you don't sacrifice a lot of performance.
I've taken the liberty to generalize it, so that it is capable of moving multiple blocks of dashes (uniform randomly selected) simultaneously.
The solution runs Live On Coliru and generates 1,000,000 random transpositions of the given sample with randomly varied numbers of moved blocks (1..3) in about 2673 ms (1156 ms on my machine):
Generator gen(test_case);
std::string result;
std::map<std::string, size_t> histo;
for(int i = 0; i < 1000000; ++i) {
auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes
result.clear();
gen.apply_relocations(mobility, std::back_inserter(result));
histo[result]++;
}
Note: the benchmark times include the time taken to build the histogram of unique results generated
Let's do a code walk-through here to explain things:
I tried to use "readable" types:
namespace icl = boost::icl;
using Position = size_t;
using Map = icl::interval_set<Position>;
using Region = Map::value_type;
E.g. the function that builds the Map of dashes is simply:
template <typename It> Map region_map(It f, It l) {
Map dashes;
for (Position pos = 0; f!=l; ++f, ++pos)
if ('-' == *f)
dashes += pos;
return dashes;
}
Note how I didn't particularly optimize this. I let the interval_set combine adjacent dashes. We might use hinted insertion, or a parser that adds consecutive dashes as a block. I opted for KISS here.
Later on, we generate relocations, which map a Region onto a non-moving Position in the rest of the text.
using Relocs = boost::container::flat_multimap<Position, Region>;
By using the flat multimap, the caller gets the entries already sorted by ascending insertion point. Because we reserve() the flat multimap up front, we avoid the overhead of a node-based map implementation here.
We start by picking the dash-blocks to be moved:
Map pick_dashes(int n) {
Map map;
if (!_dashes.empty())
for (int i = 0; i < n; ++i)
map += *std::next(_dashes.begin(), _select(_rng));
return map;
}
The random distributions have been dimensioned at construction, e.g.:
_dashes(region_map(_input.begin(), _input.end())),
_rng(std::random_device {}()),
_select (0, _dashes.iterative_size() - 1),
_randpos(0, _input.size() - 1),
Next, we assign insertion-positions to each. The trick is to assign positions inside non-mobile (inert) regions of the source.
this includes other blocks of dashes that stay in their place
there is the degenerate case where everything is a block of dashes; we detect this in the constructor:
_is_degenerate(cardinality(_dashes) == _input.size())
So the code reads as follows:
Relocs gen_relocations(int n=1) {
Map const moving = pick_dashes(n);
Relocs relocs;
relocs.reserve(moving.iterative_size());
if (_is_degenerate)
{
// degenerate case (everything is a dash); no non-moving positions
// exist, just pick 0
for(auto& region : moving)
relocs.emplace(0, region);
} else {
auto inertia = [&] {
Position inert_point;
while (contains(moving, inert_point = _randpos(_rng)))
; // discard insertion points that are moving
return inert_point;
};
for(auto& region : moving)
relocs.emplace(inertia(), region);
}
return relocs;
}
Now all we need to do is apply the relocations.
The generic implementation of this is pretty straightforward. Again, it's not particularly optimized in order to keep it simple (KISS):
template <typename F>
void do_apply_relocations(Relocs const& mobility, F const& apply) const {
icl::split_interval_set<Position> steps {{0, _input.size()}};
for (auto& m : mobility) {
steps += m.first; // force a split on insertion point
steps -= m.second; // remove the source of the move
//std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
}
auto next_mover = mobility.begin();
for(auto& step : steps) {
while (next_mover!=mobility.end() && contains(step, next_mover->first))
apply((next_mover++)->second, true);
apply(step, false);
}
}
Note The trick here is that we "abuse" the split_interval_set combining strategy to break the processing into sub-ranges that "stop" at the randomly generated insertion points: these artificial regions are the "steps" in our generation loop.
The apply function there is what we implement to get what we want. In our case we want a string like A--TG-DFGR(----)-- so we write an overload that appends to a container (e.g. std::string) using any output iterator:
template <typename Out>
Out apply_relocations(Relocs const& mobility, Out out) const {
if (_is_degenerate)
return std::copy(_input.begin(), _input.end(), out);
auto to_iter = [this](Position pos) { return _input.begin() + pos; };
do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
if (relocated) *out++ = '(';
out = std::copy(
to_iter(first(r)),
to_iter(last(r)+1),
out
);
if (relocated) *out++ = ')';
});
return out;
}
Note The "complicated" part here are mapping the Position to input iterators (to_iter) and the code to optionally add () around moved blocks.
And with that, we have seen all the code.
Full Listing
#include <boost/container/flat_map.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/split_interval_set.hpp>
#include <boost/icl/separate_interval_set.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/range/algorithm.hpp>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <random>
#include <map>
#include <chrono>
namespace icl = boost::icl;
using Position = size_t;
using Map = icl::interval_set<Position>;
using Region = Map::value_type;
using Relocs = boost::container::flat_multimap<Position, Region>;
struct Generator {
Generator(std::string const& input)
: _input(input),
_dashes(region_map(_input.begin(), _input.end())),
_rng(std::random_device {}()),
_select (0, _dashes.iterative_size() - 1),
_randpos(0, _input.size() - 1),
_is_degenerate(cardinality(_dashes) == _input.size())
{
}
unsigned random(unsigned below) {
return _rng() % below; // q&d, only here to make the tests deterministic for a fixed seed
}
Map full() const {
return Map { { 0, _input.size() } };
}
Relocs gen_relocations(int n=1) {
Map const moving = pick_dashes(n);
Relocs relocs;
relocs.reserve(moving.iterative_size());
if (_is_degenerate)
{
// degenerate case (everything is a dash); no non-moving positions
// exist, just pick 0
for(auto& region : moving)
relocs.emplace(0, region);
} else {
auto inertia = [&] {
Position inert_point;
while (contains(moving, inert_point = _randpos(_rng)))
; // discard insertion points that are moving
return inert_point;
};
for(auto& region : moving)
relocs.emplace(inertia(), region);
}
return relocs;
}
template <typename Out>
Out apply_relocations(Relocs const& mobility, Out out) const {
if (_is_degenerate)
return std::copy(_input.begin(), _input.end(), out);
auto to_iter = [this](Position pos) { return _input.begin() + pos; };
do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
if (relocated) *out++ = '(';
out = std::copy(
to_iter(first(r)),
to_iter(last(r)+1),
out
);
if (relocated) *out++ = ')';
});
return out;
}
template <typename F>
void do_apply_relocations(Relocs const& mobility, F const& apply) const {
icl::split_interval_set<Position> steps {{0, _input.size()}};
for (auto& m : mobility) {
steps += m.first; // force a split on insertion point
steps -= m.second; // remove the source of the move
//std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
}
auto next_mover = mobility.begin();
for(auto& step : steps) {
while (next_mover!=mobility.end() && contains(step, next_mover->first))
apply((next_mover++)->second, true);
apply(step, false);
}
}
private:
std::string _input;
Map _dashes;
std::mt19937 _rng;
std::uniform_int_distribution<Position> _select;
std::uniform_int_distribution<Position> _randpos;
bool _is_degenerate;
Map pick_dashes(int n) {
Map map;
if (!_dashes.empty())
for (int i = 0; i < n; ++i)
map += *std::next(_dashes.begin(), _select(_rng));
return map;
}
template <typename It> Map region_map(It f, It l) {
Map dashes;
for (Position pos = 0; f!=l; ++f, ++pos)
if ('-' == *f)
dashes += pos;
return dashes;
}
};
int main() {
for (std::string test_case : {
"----",
"A--TG-DF----GR--",
"",
"ATGDFGR",
})
{
auto start = std::chrono::high_resolution_clock::now();
Generator gen(test_case);
std::string result;
std::map<std::string, size_t> histo;
for(int i = 0; i < 1000000; ++i) {
auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes
result.clear();
gen.apply_relocations(mobility, std::back_inserter(result));
histo[result]++;
}
std::cout << histo.size() << " unique results for '" << test_case << "'"
<< " in " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()-start).count() << "ms\n";
std::multimap<size_t, std::string, std::greater<size_t> > ranked;
for (auto& entry : histo)
ranked.emplace(entry.second, entry.first);
int topN = 10;
for (auto& rank : ranked)
{
std::cout << std::setw(8) << std::right << rank.first << ": " << rank.second << "\n";
if (0 == --topN)
break;
}
}
}
Prints e.g.
1 unique results for '----' in 186ms
1000000: ----
3430 unique results for 'A--TG-DF----GR--' in 1156ms
9251: A(----)--TG-DFGR--
9226: (----)A--TG-DFGR--
9191: A--T(----)G-DFGR--
9150: A--TG-DFGR-(----)-
9132: A--(----)TG-DFGR--
9128: A--TG(----)-DFGR--
9109: A--TG-D(----)FGR--
9098: A--TG-DFG(----)R--
9079: A--TG-DFGR(----)--
9047: A-(----)-TG-DFGR--
1 unique results for '' in 25ms
1000000:
1 unique results for 'ATGDFGR' in 77ms
1000000: ATGDFGR
You can pre-process the string to get a list of iterators that point to the beginnings of consecutive dashes in the string, and then uniformly pick a random element from that list.
I will use the following standard library headers in this example (which is complete and working if you concatenate the following code blocks):
#include <cstddef>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>
First, we define a function that finds said list of iterators for us. To do so, we make use of std::string::find_first_of and std::string::find_first_not_of to find the index of the first character in, and of the first character after, the next sequence. Both functions work with indices rather than with iterators, so we have to add them to cbegin(). The function works with any character, not just dashes.
std::vector<std::string::const_iterator>
find_consecutive_sequences(const std::string& text, const char c)
{
std::vector<std::string::const_iterator> positions {};
std::size_t idx = 0UL;
while (idx != std::string::npos && idx < text.length())
{
const auto first = text.find_first_of(c, idx);
if (first == std::string::npos)
break;
positions.push_back(text.cbegin() + first);
idx = text.find_first_not_of(c, first);
}
return positions;
}
Next, we define a function that uses the result of the above function to return an iterator to the beginning of a randomly selected sequence of dashes.
We pass in the random engine as a parameter so it can be seeded once and used over and over again. The random standard library introduced in C++11 is so great that it should be preferred whenever possible over the legacy rand function.
If given an empty vector of positions, we have to fail because there is no sequence we could possibly select.
std::string::const_iterator
get_random_consecutive_sequence(const std::vector<std::string::const_iterator>& positions,
std::default_random_engine& prng)
{
if (positions.empty())
throw std::invalid_argument {"string does not contain any sequence"};
std::uniform_int_distribution<std::size_t> rnddist {0UL, positions.size() - 1UL};
const auto idx = rnddist(prng);
return positions.at(idx);
}
Finally, I define this little helper function to mark the selected sequence. Your code would do the copy / move / shift here.
std::string
mark_sequence(const std::string& text,
const std::string::const_iterator start)
{
const auto c = *start;
const std::size_t first = start - text.cbegin();
std::size_t last = text.find_first_not_of(c, first);
if (last == std::string::npos)
last = text.length();
std::string marked {};
marked.reserve(text.length() + 2UL);
marked += text.substr(0UL, first);
marked += '(';
marked += text.substr(first, last - first);
marked += ')';
marked += text.substr(last, text.length() - last);
return marked;
}
It can be used like this.
int
main()
{
const std::string example {"--A--B-CD----E-F---"};
std::random_device rnddev {};
std::default_random_engine rndengine {rnddev()};
const auto positions = find_consecutive_sequences(example, '-');
for (int i = 0; i < 10; ++i)
{
const auto pos = get_random_consecutive_sequence(positions, rndengine);
std::cout << mark_sequence(example, pos) << std::endl;
}
}
Possible output:
--A--B-CD(----)E-F---
--A--B(-)CD----E-F---
--A(--)B-CD----E-F---
--A(--)B-CD----E-F---
--A--B-CD(----)E-F---
--A--B-CD----E-F(---)
--A--B-CD----E-F(---)
(--)A--B-CD----E-F---
--A--B(-)CD----E-F---
(--)A--B-CD----E-F---
string::find() has an optional second parameter: a position to start the search from. So something like s.find("-", rand() % L) may do the trick for you, where L is (position of the last dash + 1).
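A minimal sketch of that idea (the function name is only illustrative):
#include <cstddef>
#include <cstdlib>
#include <string>

// Index of some dash at or after a random start position, or npos if the string has no dash at all.
std::size_t randomDashPosition(const std::string& s)
{
    const std::size_t lastDash = s.find_last_of('-');
    if (lastDash == std::string::npos)
        return std::string::npos;               // no dash anywhere
    const std::size_t L = lastDash + 1;         // position of the last dash + 1, as described above
    return s.find('-', std::rand() % L);        // always hits a dash, because one exists at lastDash
}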
As I understand the problem, all dash blocks should have the same probability of being selected. Therefore we must first find the positions where these blocks start and then pick one of those positions at random.
If I'm allowed to use Smalltalk for pseudo code, then I would first find the indexes where every dash block starts:
dashPositionsOn: aString
| indexes i n |
indexes := OrderedCollection new.
i := 1.
n := aString size.
[i <= n] whileTrue: [| char |
char := aString at: i.
char = $-
ifTrue: [
indexes add: i.
[
i := i + 1.
i <= n and: [
char := aString at: i.
char = $-]] whileTrue]
ifFalse: [i := i + 1]].
^indexes
Now we can pick one of these indexes at random: indexes atRandom.
Please note that there are (much) better ways to implement this algorithm in Smalltalk (as well as in other languages).
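For reference, a rough C++ rendering of the same idea could look like this (names are only illustrative):
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::size_t> dashPositions(const std::string& s)
{
    std::vector<std::size_t> indexes;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '-') {
            indexes.push_back(i);                        // start of a dash block
            while (i < s.size() && s[i] == '-') ++i;     // skip the rest of the block
        } else {
            ++i;
        }
    }
    return indexes;
}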
I have a requirement where I have to update the color of a graphical frontend based on some attribute value. The attribute value has different ranges, say -30 to -45, -60 to -80 and so on. So I need a data structure where I can store (prefill) these ranges. And when I determine the point, I would like to know the range in which this point falls, in either O(1) or O(log N) time. So my query would consist of a single point, and the output should be the unique range containing that point.
I am confused between range trees and segment trees. I want to build the tree on top of the C++ STL map.
What you need is called an interval tree. http://en.wikipedia.org/wiki/Interval_tree.
Unfortunately you can't use std::set<> to get O(log N) insert, remove and query, because the tree nodes need to contain additional data. You can read about them here http://syedwaqarahmad.webs.com/documents/t.cormen-_introduction_to_algorithms_3rd_edition.pdf
chapter 14.3.
Instead you can use Boost. It has an interval container library.
http://www.boost.org/doc/libs/1_46_1/libs/icl/doc/html/index.html
Maybe this library can help you:
https://github.com/ekg/intervaltree
If I understand you correctly, you can do this quite easily with std::set:
#include <iostream>
#include <set>
struct Interval {
int min;
int max;
};
struct ComInt {
bool operator()(const Interval& lhs, const Interval& rhs){
return lhs.max < rhs.min;
}
};
std::set<Interval, ComInt> intervals = { { -10, -5 }, { -4, 4 }, { 5, 10 } };
int main() {
int point = 3;
Interval tmp = { point, point };
auto result=intervals.find(tmp);
if (result != intervals.end()) {
std::cout << "Min:" << result->min << " - Max:" << result->max << std::endl;
} else {
std::cout << "No matching Interval found" << std::endl;
}
}
Of course you should build a wrapper class around it.
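One possible shape of such a wrapper, as a sketch only (it repeats the Interval and ComInt types from above so that it stands alone):
#include <optional>
#include <set>

struct Interval { int min; int max; };

struct ComInt {
    bool operator()(const Interval& lhs, const Interval& rhs) const {
        return lhs.max < rhs.min;   // disjoint intervals are ordered by their position on the number line
    }
};

class IntervalLookup {
public:
    void add(int min, int max) { intervals.insert({ min, max }); }

    // Returns the interval containing the point, if any; O(log N)
    std::optional<Interval> find(int point) const {
        auto it = intervals.find({ point, point });
        if (it == intervals.end()) return std::nullopt;
        return *it;
    }

private:
    std::set<Interval, ComInt> intervals;
};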
I am new to maps so am a little unsure of the best way to do this. This task is in relation to compression with Huffman coding. Here's what I have.
#include <map>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
typedef map<char,int> huffmanMap;
void getFreq(string file, map<char, int> map)
{
map.clear();
for (string::iterator i = file.begin(); i != file.end(); ++i) {
++map[*i];
}
}
Above is one method I found online, but I was unable to print anything.
int main()
{
map<char, int> huffmanMap;
string fileline;
ifstream myfile;
myfile.open("text.txt",ios::out);
while(!myfile.eof()) {
getline(myfile, fileline); //get the line and put it in the fileline string
}
myfile.close();
I read in from a text file to populate the string fileline.
for (int i=0; i<fileline.length(); i++) {
char t = fileline[i];
huffmanMap[i]? huffmanMap[i]++ : huffmanMap[i]=1;
}
Here is a second method I tried for populating the map; the char values come out incorrect: symbols and smiley faces.
getFreq(fileline,huffmanMap);
huffmanMap::iterator position;
for (position = huffmanMap.begin(); position != huffmanMap.end(); position++) {
cout << "key: \"" << position->first << endl;
cout << "value: " << position->second << endl;
}
This is how I tried to print map
system("pause");
return 0;
}
When I run my getFreq method the program crashes. I don't get any compile errors with either. With the second method the char values are nonsense. Note I have not had both methods running at the same time; I just included them both to show what I have tried.
Any insight would be appreciated. Thanks. Be lenient, I'm a beginner ;)
Your code is all over the place; it's not very coherent, so it's difficult to follow the flow.
Here are some low-lights:
This is wrong: myfile.open("text.txt",ios::out); - why would you open an input stream with the out flag? It should simply be:
string fileline;
ifstream myfile("text.txt");
while(getline(myfile, fileline)) {
// now use fileline.
}
In the while loop, what you want to do is to iterate over the content and add it to your map? So now the code looks like:
string fileline;
ifstream myfile("text.txt");
while(getline(myfile, fileline)) {
getFreq(fileline, huffmanMap);
}
Next fix, this is wrong: you have a typedef and a variable of the same name!
typedef map<char,int> huffmanMap;
map<char, int> huffmanMap;
Use sensible naming
typedef map<char,int> huffmanMap_Type;
huffmanMap_Type huffmanMap;
Next fix, your getFreq method signature is wrong: you are passing the map by value (i.e. a copy) rather than by reference, hence your modification in the function is to a copy, not the original!
wrong: void getFreq(string file, map<char, int> map)
correct: void getFreq(string file, huffmanMap_Type& map)
Next: why clear() in the above method? What if there is more than one line? No need for that surely?
That's enough for now, clean up your code and update your question if there are more issues.
One fix and one improvement.
Fix: make the second parameter in getFreq a reference:
void getFreq(string file, map<char, int> & map); //notice `&`
Improvement: just write
huffmanMap[i]++;
instead of
huffmanMap[i]? huffmanMap[i]++ : huffmanMap[i]=1;
After all, by writing huffmanMap[i]? you're checking whether it's zero or not. If zero, then you make it one, which is the same as huffmanMap[i]++.
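Putting both points together, a minimal self-contained sketch (illustrative only, not the original code) could look like this:
#include <iostream>
#include <map>
#include <string>

void getFreq(const std::string& text, std::map<char, int>& freq)   // map passed by reference
{
    for (const char c : text)
        ++freq[c];   // operator[] creates the entry with value 0 on first sight, then we increment
}

int main()
{
    std::map<char, int> freq;
    getFreq("hello huffman", freq);
    for (const auto& kv : freq)
        std::cout << "key: \"" << kv.first << "\" value: " << kv.second << '\n';
}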
An answer using C++ language features from C++20.
But first, you were asking about getting the count (frequency) of letters in a text.
There is a nearly universal solution approach for this: we can use std::unordered_map. It is described in the C++ reference here.
It is std::unordered_map's very convenient index operator [] that makes counting very simple. This operator returns a reference to the value that is mapped to a key. So, it searches for the key and then returns the value. If the key does not exist, it inserts a new key/value pair and returns a reference to the value.
So, either way, a reference to the value is returned. And this can be incremented. Example:
With a "std::unordered_map<char, int> mymap{};" and a text "aba", the follwoing can be done with the index operator:
mymap['a'] will search for an 'a' in the map. It is not found, so a new entry for 'a' with a corresponding value of 0 is created, and a reference to that value is returned. If we now increment that, we increment the counter value. So mymap['a']++ will insert a new key/value pair 'a'/0 and then increment it, resulting in 'a'/1.
For 'b' the same as above will happen.
For the next 'a', an entry will be found in the map, and so a reference to the value (1) is returned. This is incremented and will then be 2.
And so on and so on.
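A tiny standalone illustration of exactly this counting, using the "aba" example from above:
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<char, int> mymap{};
    for (const char c : std::string{ "aba" })
        mymap[c]++;                 // first sight inserts the key with value 0, then increments it
    for (const auto& [letter, count] : mymap)
        std::cout << '\'' << letter << "': " << count << '\n';    // 'a': 2 and 'b': 1
}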
By using some modern language elements, a whole file can be read and its characters counted, with one simple for-loop:
for (const char c : rng::istream_view<char>(ifs)) counter[c]++;
Additional information:
For building a Huffman tree, we can use a min-heap, which can easily be implemented with the existing std::priority_queue. Please read about it here.
With 4 lines of code, the complete Huffman tree can be built.
At the end, we put the result in a code book (again a std::unordered_map) and show the result to the user.
This could then, for example, be implemented like below:
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <algorithm>
#include <queue>
#include <ranges>
#include <vector>
#include <utility>
namespace rng = std::ranges; // Abbreviation for the ranges namespace
using namespace std::string_literals; // And we want to use string literals
// The Node of the Huffman tree
struct Node {
char letter{ '\0' }; // The letter that we want to encode
std::size_t frequency{}; // The letters frequency in the source text
Node* left{}, *right{}; // The pointers to the children of this node
};
// Some abbreviations to reduce typing work and make code more readable
using Counter = std::unordered_map<char, std::size_t>;
struct Comp { bool operator ()(const Node* n1, const Node* n2) { return n1->frequency > n2->frequency; } };
using MinHeap = std::priority_queue<Node*, std::vector<Node*>, Comp>;
using CodeBook = std::unordered_map<char, std::string>;
// Traverse the Huffman tree and build the code book
void buildCodeBook(Node* root, std::string code, CodeBook& cb) {
if (root == nullptr) return;
if (root->letter != '\0') cb[root->letter] = code;
buildCodeBook(root->left, code + "0"s, cb);
buildCodeBook(root->right, code + "1"s, cb);
}
// Get the topmost two elements from the min-heap
std::pair<Node*, Node*> getFrom(MinHeap& mh) {
Node* left{ mh.top() }; mh.pop();
Node* right{ mh.top() }; mh.pop();
return { left, right };
}
// Demo function
int main() {
if (std::ifstream ifs{ "r:\\lorem.txt" }; ifs) {
// Define the most important resulting work products
Counter counter{};
MinHeap minHeap{};
CodeBook codeBook{};
// Read complete text from source file and count all characters ---------
for (const char c : rng::istream_view<char>(ifs)) counter[c]++;
// Build the Huffman tree ----------------------------------------------
// First, create a min-heap, based on the letters' frequency
for (const auto& p : counter) minHeap.push(new Node{p.first, p.second});
// Compress the nodes
while (minHeap.size() > 1u) {
auto [left, right] = getFrom(minHeap);
minHeap.push(new Node{ '\0', left->frequency + right->frequency, left, right });
}
// And last but not least, generate the code book -----------------------
buildCodeBook(minHeap.top(), {}, codeBook);
// And, as debug output, show the code book -----------------------------
for (const auto& [letter, code] : codeBook) std::cout << '\'' << letter << "': " << code << '\n';
}
else std::cerr << "\n\n***Error: Could not open source text file\n\n";
}
You may notice that we use new to allocate memory, but we do not delete it afterwards.
We could now add the delete statements at the appropriate positions, but instead I will show you a modified solution using smart pointers.
Please see here:
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <map>
#include <algorithm>
#include <queue>
#include <ranges>
#include <vector>
#include <utility>
#include <memory>
namespace rng = std::ranges; // Abbreviation for the ranges namespace
using namespace std::string_literals; // And we want to use string literals
struct Node; // Forward declaration
using UPtrNode = std::unique_ptr<Node>; // Using smart pointer for memory management
// The Node of the Huffman tree
struct Node {
char letter{ '\0' }; // The letter that we want to encode
std::size_t frequency{}; // The letters frequency in the source text
UPtrNode left{}, right{}; // The pointers to the children of this node
};
// Some abbreviations to reduce typing work and make code more readable
using Counter = std::unordered_map<char, std::size_t>;
struct CompareNode { bool operator ()(const UPtrNode& n1, const UPtrNode& n2) { return n1->frequency > n2->frequency; } };
using MinHeap = std::priority_queue<UPtrNode, std::vector<UPtrNode>, CompareNode>;
using CodeBook = std::map<Counter::key_type, std::string>;
// Traverse the Huffman tree and build the code book
void buildCodeBook(UPtrNode&& root, std::string code, CodeBook& cb) {
if (root == nullptr) return;
if (root->letter != '\0') cb[root->letter] = code;
buildCodeBook(std::move(root->left), code + "0"s, cb);
buildCodeBook(std::move(root->right), code + "1"s, cb);
}
// Get the topmost two elements from the min-heap
std::pair<UPtrNode, UPtrNode> getFrom(MinHeap& mh) {
UPtrNode left = std::move(const_cast<UPtrNode&>(mh.top()));mh.pop();
UPtrNode right = std::move(const_cast<UPtrNode&>(mh.top()));mh.pop();
return { std::move(left), std::move(right) };
}
// Demo function
int main() {
if (std::ifstream ifs{ "r:\\lorem.txt" }; ifs) {
// Define the most important resulting work products
Counter counter{};
MinHeap minHeap{};
CodeBook codeBook{};
// Read complete text from source file and count all characters ---------
for (const char c : rng::istream_view<char>(ifs)) counter[c]++;
// Build the Huffman tree ----------------------------------------------
// First, create a min-heap, based on the letters' frequency
for (const auto& p : counter) minHeap.push(std::make_unique<Node>(Node{ p.first, p.second }));
// Compress the nodes
while (minHeap.size() > 1u) {
auto [left, right] = getFrom(minHeap);
minHeap.push(std::make_unique<Node>(Node{ '\0', left->frequency + right->frequency, std::move(left), std::move(right) }));
}
// And last but not least, generate the code book -----------------------
buildCodeBook(std::move(const_cast<UPtrNode&>(minHeap.top())), {}, codeBook);
// And, as debug output, show the code book -----------------------------
for (std::size_t k{}; const auto & [letter, code] : codeBook) std::cout << ++k << "\t'" << letter << "': " << code << '\n';
}
else std::cerr << "\n\n***Error: Could not open source text file\n\n";
}