Predicting next char in 'random' text generation based on some input file - c++

I am writing a program that generates random text based on a Markov model. I am running into a problem: with some files that have a lot of spaces between words, the initial seed turns out to be a space. The problem is that every following character is then seen as a space as well, so the generated text is just a blank document, because nextChosenChar is always a space.
Can someone suggest a solution to this problem?
I tried to come up with a fix, as seen in the latter part of the code below, but to no avail.
char ChooseNextChar(string seed, int order, string fileName){
    Map<string, Vector<char> > nextCharMap;
    ifstream inputStream;
    inputStream.open(fileName.c_str());
    int offset = 0;
    Vector<char> charsFollingSeedVector;
    inputStream.clear();
    char* buffer = new char [order + 1];
    char charFollowingSeed;
    static int consecutiveSpaces = 0;
    while (!inputStream.eof()) {
        inputStream.seekg(offset);
        inputStream.read(buffer, order + 1);
        string key(buffer, order);
        if (equalsIgnoreCase(key, seed)) {
            //only insert key if not present otherwise overwriting old info
            if (!nextCharMap.containsKey(seed)) {
                nextCharMap.put(seed, charsFollingSeedVector);
            }
            //read the char directly following seed
            charFollowingSeed = buffer[order];
            nextCharMap[seed].push_back(charFollowingSeed);
        }
        offset++;
    }
    //case where no chars following seed
    if (nextCharMap[seed].isEmpty()) {
        return EOF;
    }
    //determine which is the most frequent following char
    char nextChosenChar = MostFequentCharInVector(seed, nextCharMap);
    //TRYING TO FIX PROBLEM OF ONLY OUTPUTTING SPACES**********
    if (nextChosenChar == ' ') {
        consecutiveSpaces++;
        if (consecutiveSpaces >= 1) {
            nextChosenChar = nextCharMap[seed].get(randomInteger(0, nextCharMap[seed].size()-1));
            consecutiveSpaces = 0;
        }
    }
    return nextChosenChar;
}

If you really want a character-based model, you won't get very natural looking text as output, but it is definitely possible, and that model will fundamentally be able to deal with sequences of space characters as well. There is no need to remove them from the input if you consider them a natural part of the text.
What is important is that a Markov model does not always fall back to predicting the one character that has the highest probability at any given stage. Instead, it must look at the entire probability distribution of possible next characters and choose one randomly.
Here, randomly means it picks a character that is not pre-determined by the programmer. Still, the random distribution is not the uniform distribution, i.e. not all characters are equally likely: it has to take into account the relative probabilities of the various possible characters. One way to do this is to generate a cumulative probability distribution of the characters. For example, if the probabilities are
p('a') == 0.2
p('b') == 0.4
p('c') == 0.4
we represent them as
p('a') == 0.2
p('b') == p('a') + 0.4 == 0.6
p('c') == p('a') + p('b') == 1.0
Then to generate a random character, we first generate a uniformly distributed random number N between 0 and 1, and then choose the first character whose cumulative probability is no less than N.
I have implemented this in the example code below. The train() procedure generates a cumulative probability distribution of the following-characters, for every character in the training input. The 'predict()' procedure applies this to generate random text.
For a full implementation, this still lacks:
A representation of the probability distribution for the initial character. As you see in the 'main()' function, my output simply always starts with 't'. (One possible approach is sketched right after this list.)
A representation of the length of the output string, or the final character. 'main()' simply always generates a string of length 100.
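For that first point, a minimal sketch of one possible approach (not part of the code below; the names train_initial and pick_initial are made up for illustration): count how often each character starts a word in the training text, build the same kind of cumulative distribution, and sample the initial character from it.
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>
/* Cumulative distribution over the characters that start a word in the input. */
using InitialDistribution = std::vector<std::pair<char, float>>;
InitialDistribution train_initial(const std::string &text)
{
    std::map<char, unsigned> freq;
    bool at_word_start = true;
    for (char c : text) {
        if (at_word_start && c != ' ')
            freq[c] += 1;
        at_word_start = (c == ' ');
    }
    unsigned total = 0;
    for (const auto &p : freq)
        total += p.second;
    InitialDistribution dist;
    float cumulative = 0.0f;
    for (const auto &p : freq) {
        cumulative += static_cast<float>(p.second) / total; // cumulative, as above
        dist.push_back({p.first, cumulative});
    }
    return dist;
}
template <typename RandomNumberGenerator>
char pick_initial(RandomNumberGenerator &gen, const InitialDistribution &dist)
{
    if (dist.empty())
        return '\0';                       // no training data
    std::uniform_real_distribution<float> d { 0, 1 };
    float r = d(gen);
    for (const auto &p : dist)
        if (p.second >= r)                 // first char whose cumulative probability >= r
            return p.first;
    return dist.back().first;              // guard against rounding
}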
The code was tested with GCC 4.7.0 (C++11 option) on Linux. Example output below.
#include <iostream>
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <numeric>
#include <algorithm>
#include <random>
template <typename Char>
class Markov
{
public:
/* Data type used to count the frequencies (integer!) of
characters. */
typedef std::map<Char,unsigned> CharDistributionMap;
/* Data type used to represent a cumulative probability (float!)
distribution. */
typedef std::vector<std::pair<Char,float>> CharDistribution;
/* Data type used to represent the Markov model. Each character is
mapped to a probability distribution of the characters that follow
it. */
typedef std::map<Char,CharDistribution> MarkovModel;
/* The model. */
MarkovModel _model;
/* Training procedure. */
template <typename Iterator>
void train(Iterator from, Iterator to)
{
_model = {};
if (from == to)
return;
std::map<Char,CharDistributionMap> proto_model {};
/* Count frequencies. */
Char current = *from;
while (true) {
++from;
if (from == to)
break;
Char next = *from;
proto_model[current][next] += 1;
current = next;
}
/* Transform into probability distribution. */
for (const auto &entry : proto_model) {
const Char current = entry.first;
const CharDistributionMap &freq = entry.second;
/* Calculate total frequency of current character. */
unsigned total =
std::accumulate(std::begin(freq),std::end(freq),0,
[](unsigned res,const std::pair<Char,unsigned> &p){
return res += p.second;
});
/* Determine the probability distribution of characters that
follow the current character. This is calculated as a cumulative
probability. */
CharDistribution dist {};
float probability { 0.0 };
std::for_each(std::begin(freq),std::end(freq),
[total,&probability,&dist](const std::pair<Char,unsigned> &p){
// using '+=' to get cumulative probability:
probability += static_cast<float>(p.second) / total;
dist.push_back(std::make_pair(p.first,probability));
});
/* Add probability distribution for current character to the model. */
_model[current] = dist;
}
}
/* Predict the next character, assuming that training has been
performed. */
template <typename RandomNumberGenerator>
Char predict(RandomNumberGenerator &gen, const Char current)
{
static std::uniform_real_distribution<float> generator_dist { 0, 1 };
/* Assume that the current character is known to the model. Otherwise,
an std::out_of_range exception will be thrown. */
const CharDistribution &dist { _model.at(current) };
/* Generate random number between 0 and 1. */
float random { generator_dist(gen) };
/* Find the first character whose cumulative probability is not less
than the random number generated. */
auto res =
std::lower_bound(std::begin(dist),std::end(dist),
std::make_pair(Char(),random),
[](const std::pair<Char,float> &p1, const std::pair<Char,float> &p2) {
return (p1.second < p2.second);
});
if (res == std::end(dist))
throw "Empty probability distribution. This should not happen.";
return res->first;
}
};
int main()
{
/* Initialize random-number generator. */
std::random_device rd;
std::mt19937 gen(rd());
std::string input { "this is some input text with many spaces." };
if (input.empty())
return 1;
/* We append the first character to the end, to ensure that even the
last character of the text gets a non-empty probability
distribution. A more proper way of dealing with characters that
have empty distributions would be _smoothing_. */
input += input[0];
Markov<char> markov {};
markov.train(std::begin(input),std::end(input));
/* We set the initial character. In a real stochastic model, there
would have to be a separate probability distribution for the initial
character, and we would choose the initial character randomly,
too. */
char current_char { 't' };
for (unsigned i = 0 ; i < 100 ; ++i) {
std::cout << current_char;
current_char = markov.predict(gen,current_char);
}
std::cout << current_char << std::endl;
}
Some example output generated by this program:
t mext s.t th winy iny somaces sputhis inpacexthispace te iny me mext mexthis
tes is manputhis.th is wis.th with it is is.t s t winy it mext is ispany
this maces somany t s it this winy sputhisomacext manput somanputes macexte iso
t wispanpaces maces tesomacexte s s mes.th isput t wit t somanputes s withit sput ma
As you can see, the distribution of space characters follows, sort of naturally, the distribution found in the input text.
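Since the question trains from a file, here is a minimal sketch of how main() could feed file contents into the class above instead of the hard-coded string (the file name is just an example):
#include <fstream>
#include <iostream>
#include <iterator>
#include <random>
#include <string>
int main()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    /* Read the whole training file into a string (the file name is an example). */
    std::ifstream in("training.txt");
    std::string input((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    if (input.empty())
        return 1;
    input += input[0];   /* same wrap-around trick as above */
    Markov<char> markov {};
    markov.train(std::begin(input), std::end(input));
    /* ... then generate output exactly as in the main() above ... */
}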

One solution would be to stream the characters one by one from the file so that your reading loop would look more like this:
char buffer[order];                      // note: variable-length arrays are a compiler extension
inputStream.read(buffer, order);         // read the first 'order' characters as the initial window
char next_char;
while ( inputStream.get(next_char) )
{
    string key(buffer, order);
    if (equalsIgnoreCase(key, seed)) {
        // only insert key if not present, otherwise we'd overwrite old info
        if (!nextCharMap.containsKey(seed)) {
            nextCharMap.put(seed, Vector<char>());
        }
        nextCharMap[seed].push_back(next_char);
    }
    // Update the sliding window with the character just read.
    for (unsigned int i = 1; i < order; ++i) buffer[i-1] = buffer[i];
    buffer[order-1] = next_char;
}
Then you can discard extra spaces like this:
....
while ( inputStream.get(next_char) )
{
    // Remove multiple spaces from the input.
    if (next_char == ' ' && buffer[order-1] == ' ') continue;
    string key(buffer, order);
....
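As the first answer points out, always taking the most frequent follower is what flattens everything into spaces. Since nextCharMap[seed] stores one entry per occurrence, drawing a uniformly random element from it already gives a frequency-weighted choice. A minimal sketch, reusing the Stanford-style Map, Vector and randomInteger helpers from the question:
// Sketch: replaces MostFequentCharInVector with a weighted random draw.
// Because every occurrence was pushed into the vector separately, a uniform
// pick over the vector is automatically weighted by frequency.
char PickNextChar(string seed, Map<string, Vector<char> > &nextCharMap) {
    Vector<char> followers = nextCharMap[seed];
    if (followers.isEmpty()) {
        return EOF;   // no character ever followed this seed
    }
    return followers.get(randomInteger(0, followers.size() - 1));
}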

Related

I need to create MultiMap using hash-table but I get time-limit exceeded error (C++)

I'm trying to solve an algorithm task: I need to create a MultiMap (key, (values)) using a hash table. I can't use the Set and Map libraries. I submit the code to the testing system, but I get a time-limit-exceeded error on test 20. I don't know what exactly that test contains. The code must handle the following commands:
put x y - add the pair (x,y). If the pair already exists, do nothing.
delete x y - delete the pair (x,y). If the pair doesn't exist, do nothing.
deleteall x - delete all pairs with first element x.
get x - print the number of pairs with first element x, followed by their second elements.
The number of operations <= 100000
Time limit - 2s
Example:
multimap.in:
put a a
put a b
put a c
get a
delete a b
get a
deleteall a
get a
multimap.out:
3 b c a
2 c a
0
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;
inline long long h1(const string& key) {
long long number = 0;
const int p = 31;
int pow = 1;
for(auto& x : key){
number += (x - 'a' + 1 ) * pow;
pow *= p;
}
return abs(number) % 1000003;
}
inline void Put(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
int checker = 0;
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
checker = 1;
break;
}
}
if(checker == 0){
pair <string,string> key_value = make_pair(key,value);
Hash_table[hash].push_back(key_value);
}
}
inline void Delete(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
Hash_table[hash].erase(Hash_table[hash].begin() + i);
break;
}
}
}
inline void Delete_All(vector<vector<pair<string,string>>>& Hash_table,const long long& hash,const string& key) {
for(int i = Hash_table[hash].size() - 1;i >= 0;i--){
if(Hash_table[hash][i].first == key){
Hash_table[hash].erase(Hash_table[hash].begin() + i);
}
}
}
inline string Get(const vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key) {
string result="";
int counter = 0;
for(int i = 0; i < Hash_table[hash].size();i++){
if(Hash_table[hash][i].first == key){
counter++;
result += Hash_table[hash][i].second + " ";
}
}
if(counter != 0)
return to_string(counter) + " " + result + "\n";
else
return "0\n";
}
int main() {
vector<vector<pair<string,string>>> Hash_table;
Hash_table.resize(1000003);
ifstream input("multimap.in");
ofstream output("multimap.out");
string command;
string key;
int k = 0;
string value;
while(true) {
input >> command;
if(input.eof())
break;
if(command == "put") {
input >> key;
long long hash = h1(key);
input >> value;
Put(Hash_table,hash,key,value);
}
if(command == "delete") {
input >> key;
input >> value;
long long hash = h1(key);
Delete(Hash_table,hash,key,value);
}
if(command == "get") {
input >> key;
long long hash = h1(key);
output << Get(Hash_table,hash,key);
}
if(command == "deleteall"){
input >> key;
long long hash = h1(key);
Delete_All(Hash_table,hash,key);
}
}
}
How can I make my code run faster?
First of all, a matter of design: normally one would pass only the key to the function and calculate the hash inside it. Your variant allows a user to place elements anywhere within the hash table (by passing bad hash values), so a user could easily break it.
So, e.g., put:
using HashTable = std::vector<std::vector<std::pair<std::string, std::string>>>;
void put(HashTable& table, std::string& key, std::string const& value)
{
auto hash = h1(key);
// ...
}
If at all, the hash function could be parametrised, but then you'd write a separate class for it (wrapping the vector of vectors) and provide the hash function in the constructor, so that a user cannot exchange it arbitrarily (and again break the hash table). A class would come with additional benefits, most importantly better encapsulation (hiding the vector away, so the user cannot change it via the vector's own interface):
class HashTable
{
public:
// IF you want to provide hash function:
template <typename Hash>
HashTable(Hash hash) : hash(hash) { }
void put(std::string const& key, std::string const& value);
void remove(std::string const& key, std::string const& value); //(delete is keyword!)
// ...
private:
std::vector<std::vector<std::pair<std::string, std::string>>> data;
// if hash function parametrized:
std::function<size_t(std::string)> hash; // needs #include <functional>
};
I'm not 100% sure how efficient std::function really is, so for high-performance code you'd preferably use your hash function h1 directly (i.e. not implementing the constructor illustrated above).
Coming to optimisations:
For the hash key I would prefer an unsigned value: negative indices are meaningless anyway, so why allow them at all? long long (signed or unsigned) might be a bad choice if the testing system is a 32-bit system (which might be unlikely, but still...). size_t covers both issues at once: it is unsigned, and its size is selected appropriately for the given system (if you are interested in the details: it is actually adjusted to the address-bus width, but on modern systems this equals the register size as well, which is what we need). Select the type of pow to be the same.
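A minimal sketch of what h1 could look like with size_t; the table size is pulled out into a constant so the same value can be reused for Hash_table.resize(...) (1000003 is simply the size from the question):
#include <string>
const size_t TABLE_SIZE = 1000003;   // use the same constant for resize()
inline size_t h1(const std::string& key) {
    size_t number = 0;
    const size_t p = 31;
    size_t pow = 1;
    for (auto& x : key) {
        number += (x - 'a' + 1) * pow;
        pow *= p;
    }
    // unsigned arithmetic simply wraps around, so no abs() is needed
    return number % TABLE_SIZE;
}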
deleteAll is implemented inefficiently: with each element you erase, you move all the subsequent elements one position towards the front. If you delete multiple elements, you do this repeatedly, so a single element can get moved multiple times. Better:
auto pos = vector.begin();
for(auto& pair : vector)
{
    if(pair.first != keyToDelete)
        *pos++ = std::move(pair); // move semantics: faster than copying!
}
vector.erase(pos, vector.end());
This will move each element at most once, erasing all surplus elements in one single go. Apart from the final erase (which you then have to do explicitly), this is more or less what std::remove and std::remove_if from the algorithm library do as well. Are you allowed to use them? Then your code might look like this:
auto condition = [&keyToDelete](std::pair<std::string, std::string> const& p)
{ return p.first == keyToDelete; };
vector.erase(std::remove_if(vector.begin(), vector.end(), condition), vector.end());
and you profit from already highly optimised algorithm.
Just a minor performance gain, but still: you can spare the variable initialisation, the assignment and a conditional branch (the latter can be a relatively expensive operation on some systems) within put if you simply return as soon as an element is found:
//int checker = 0;
for(auto& pair : hashTable[hash]) // just a little more comfortable to write...
{
if(pair.first == key && pair.second == value)
return;
}
auto key_value = std::make_pair(key, value);
hashTable[hash].push_back(key_value);
Again, with algorithm library:
auto key_value = std::make_pair(key, value);
// same condition as above!
if(std::find_if(vector.begin(), vector.end(), condition) == vector.end())
{
vector.push_back(key_value);
}
Then, fewer than 100000 operations does not mean that each operation introduces a separate key/value pair. We can expect that keys are added, removed, re-added, ..., so you most likely don't have to cope with 100000 different keys. I'd assume your table is much too large (be aware that it requires the initialisation of 1000003 vectors as well); a much smaller one should already suffice (possibly 1009 or 10007? You might have to experiment a little...).
Keeping the inner vectors sorted might give you some performance boost as well (a sketch follows this list):
put: use a binary search to find the two elements between which the new one is to be inserted (if one of them equals the given one, no insertion, of course).
delete: use a binary search to find the element to delete.
deleteAll: find the lower and upper bounds of the elements to be deleted and erase the whole range at once.
get: find the lower and upper bounds as for deleteAll; the distance in between (the number of elements) is a simple subtraction, and you could print the texts directly (instead of first building one long string). Whether printing directly or building a string is really more efficient has to be measured, though, as printing directly involves multiple system calls, which in the end might cost the previously gained performance again...
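A minimal sketch of the sorted-vector idea, assuming each inner vector is kept sorted by (key, value); std::lower_bound and std::upper_bound do the binary searches (the function names are only for illustration):
#include <algorithm>
#include <string>
#include <utility>
#include <vector>
using Pair   = std::pair<std::string, std::string>;
using Bucket = std::vector<Pair>;
// put: binary search for the insertion point, insert only if not already present.
void put_sorted(Bucket& bucket, const Pair& kv) {
    auto it = std::lower_bound(bucket.begin(), bucket.end(), kv);
    if (it == bucket.end() || *it != kv)
        bucket.insert(it, kv);
}
// deleteAll: erase the whole range of pairs with the given key in one go.
void delete_all_sorted(Bucket& bucket, const std::string& key) {
    auto lo = std::lower_bound(bucket.begin(), bucket.end(), key,
        [](const Pair& p, const std::string& k) { return p.first < k; });
    auto hi = std::upper_bound(lo, bucket.end(), key,
        [](const std::string& k, const Pair& p) { return k < p.first; });
    bucket.erase(lo, hi);
}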
Considering your input loop:
Checking for eof() (only) is critical! If there is an error in the file, the fail bit gets set, operator>> won't actually read anything any more, and you will never reach the end of the file, so you end up in an endless loop. This might even be the reason for your test 20 failing.
Additionally: you have line-based input (each command on a separate line), so reading a whole line at once and only parsing it afterwards will spare you some system calls. If an argument is missing, you will detect it correctly instead of (illegally) reading the next command (e.g. put) as the argument; similarly, you won't interpret a surplus argument as the next command. If a line is invalid for whatever reason (wrong number of arguments as above, or an unknown command), you can then decide individually what you want to do (just ignore the line or abort processing entirely). So:
std::string line;
while(std::getline(std::cin, line))
{
// parse the string; if line is invalid, appropriate error handling
// (ignoring the line, exiting from loop, ...)
}
if(!std::cin.eof())
{
// some error occurred, print an error message!
}
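A minimal sketch of how the parsing inside that loop might look, using std::istringstream (the handling of invalid lines is just one possible choice, and the snippet reads from std::cin like the loop above; the same works with your ifstream):
#include <iostream>
#include <sstream>
#include <string>
std::string line;
while(std::getline(std::cin, line))
{
    std::istringstream parser(line);
    std::string command, key, value;
    if(!(parser >> command))
        continue;                        // blank line: ignore
    if(command == "put" || command == "delete") {
        if(!(parser >> key >> value))
            continue;                    // missing argument: ignore (or report)
        // ... call Put / Delete with key and value ...
    } else if(command == "get" || command == "deleteall") {
        if(!(parser >> key))
            continue;
        // ... call Get / Delete_All with key ...
    } else {
        // unknown command: ignore the line or abort, as you prefer
    }
}
if(!std::cin.eof())
{
    // some error occurred, print an error message!
}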

How to input a multi-digit integer into an Arduino using a 4x4 keypad?

I am trying to make a combination lock using an Arduino, a keypad and a Servo but I have come across an obstacle.
I can't find a way to store a 4-digit value in a variable, since keypad.getKey only gives me one key press at a time.
After some browsing on the internet I came upon a solution to my problem on a forum, but the answer didn't include a code sample, and I couldn't find anything else about it on the internet.
The answer said to either use a time limit for the user to input the number, or a terminating character (which would be the better option according to them).
I would like to know more about these terminating characters and how to implement them, or if anybody could suggest a better solution, that would be much appreciated as well.
Thank you in advance.
To store 4-digit values, the easiest (and most naive) way is probably to use an array of size 4. Assuming keypad.getKey returns an int, you could do something like this: int input[4] = {0};.
You will need a cursor variable to know into which slot of the array you need to write when the next key is pressed so you can do some kind of loop like this:
int input[4] = {0};
for (unsigned cursor = 0; cursor < 4; ++cursor) {
input[cursor] = keypad.getKey();
}
If you want to use a terminating character (let's say your keypad has 0-9 and A-F keys; we could say F is the terminating key), the code changes to something like:
bool checkPassword() {
static const int expected[4] = {4,8,6,7}; // our password
int input[4] = {0};
// Get the next 4 key presses
for (unsigned cursor = 0; cursor < 4; ++cursor) {
int key = keypad.getKey();
// if F is pressed too early, then it fails
if (key == 15) {
return false;
}
// store the keypress value in our input array
input[cursor] = key;
}
// If the key pressed here isn't F (terminating key), it fails
if (keypad.getKey() != 15)
return false;
// Check if input equals expected
for (unsigned i = 0; i < 4; ++i) {
// If it doesn't, it fails
if (expected[i] != input[i]) {
return false;
}
}
// If we manage to get here the password is right :)
return true;
}
Now you can use the checkPassword function in your main function like this:
int main() {
while (true) {
if (checkPassword()) {
// unlock the thing
}
}
return 0;
}
NB: Using a timer sounds possible too (and can be combined with the terminating-character option; they are not exclusive). The way to do this is to set a timer to a duration of your choice, and when it expires you reset the cursor variable to 0 (see the sketch below).
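A rough sketch of that timeout idea using Arduino's millis(); I'm assuming here that keypad.getKey() returns 0 when no key is pressed, so adapt it to whatever your keypad library actually reports:
const unsigned long TIMEOUT_MS = 5000;   // example: reset after 5 s of inactivity
int input[4] = {0};
unsigned cursor = 0;
unsigned long lastKeyTime = millis();
void readDigits() {
    cursor = 0;
    lastKeyTime = millis();
    while (cursor < 4) {
        int key = keypad.getKey();
        if (key != 0) {                  // assumption: 0 means "no key pressed"
            input[cursor++] = key;
            lastKeyTime = millis();
        }
        if (millis() - lastKeyTime > TIMEOUT_MS) {
            cursor = 0;                  // took too long: start over
            lastKeyTime = millis();
        }
    }
}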
(I have never programmed on Arduino and don't know its keypad library, but the logic is here; it's up to you now.)
In a comment the OP says a single number is wanted. The typical algorithm is: for each digit entered, multiply an accumulator by 10 and add the digit. This assumes the key value is ASCII, hence subtracting '0' from it to get a digit 0..9 instead of '0'..'9'.
#define MAXVAL 9999
int value = 0; // the number accumulator
int keyval; // the key press
int isnum; // set if a digit was entered
do {
keyval = getkey(); // input the key
isnum = (keyval >= '0' && keyval <= '9'); // is it a digit?
if(isnum) { // if so...
value = value * 10 + keyval - '0'; // accumulate the input number
}
} while(isnum && value <= MAXVAL); // until not a digit
If you have a backspace key, you simply divide the accumulator value by 10.
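For example (BACKSPACE_KEY is just a placeholder for whatever code your keypad reports for that key):
// inside the do { ... } while loop above, before the digit handling:
if (keyval == BACKSPACE_KEY) {   // placeholder constant for your backspace key
    value = value / 10;          // drop the last digit entered
}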

Working with big text files

I have a file in the following format:
[1]
Parameter1=Value1
.
.
.
End
[2]
.
.
The number between the brackets is the id of the entity. There are around 4500 entities. I need to parse through all entities and pick the ones matching my parameters and values. The size of the file is around 20 MB. My first approach was to read the file line by line and store the contents in a struct array like:
struct Component{
std::string parameter;
std::string value;
};
struct Entity{
std::string id;
std::list<Component> components;
};
std::list<Entity> g_entities;
But this approach took an enormous amount of memory and was very slow. I've also tried storing only the entities that match my parameters/values, but that was also really slow and took quite some memory. Ideally I would like to keep all the data in memory, so that I won't have to load the file every time I need to filter by parameters/values, if that's possible with a reasonable amount of memory usage.
Edit 1:
I read the file line by line:
std::ifstream readTemp(filePath);
std::stringstream dataStream;
dataStream << readTemp.rdbuf();
readTemp.close();
while (std::getline(dataStream, line)){
if (line.find('[') != std::string::npos){
// Create Entity
Entity entity;
// Set entity id
entity.id = line.substr(line.find('[') + 1, line.find(']') - 1);
// Read all lines until EnumEnd=0
while (1){
std::getline(dataStream, line);
// Break loop if end of entity
if (line.find("EnumEnd=0") != std::string::npos){
if (CheckMatch(entity))
entities.push_back(entity);
entity.components.clear();
break;
}
Component comp;
int pos_eq = line.find('=');
comp.parameter = line.substr(0, pos_eq);
comp.value = line.substr(pos_eq + 1);
entity.components.push_back(comp);
}
}
}
PS: after your edit and the comment concerning memory consumption:
500 MB / 20 MB = 25.
If each line is about 25 chars long, that memory consumption looks plausible.
OK, you could use a look-up table for mapping parameter names to numbers.
If the set of names is small, this can cut the consumption by up to a factor of 2.
Your data structure could look like this:
std::map<int, std::map<int, std::string> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
(provided the parameter names within the sections - entities, as you call them - are not unique)
Putting the data is:
std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
param_to_idx[param] = param_to_idx.size();
my_ini_file_data[entity_id][ param_to_idx[param] ] = value;
getting the data is:
value = my_ini_file_data[entity_id][ param_to_idx[param] ];
If the values-set is also considerably smaller than the number of entries,
you could even map values to numbers:
std::map<int, std::map<int, int> > my_ini_file_data;
std::map<std::string, int> param_to_idx;
std::map<std::string, int> value_to_idx;
std::map<int, std::string> idx_to_value;
Putting the data is:
std::string param = "Param";
std::string value = "Val";
int entity_id = 0;
if ( param_to_idx.find(param) == param_to_idx.end() )
param_to_idx[param] = param_to_idx.size();
if ( value_to_idx.find(value) == value_to_idx.end() )
{
int idx = value_to_idx.size();
value_to_idx[value] = idx;
idx_to_value[idx] = value;
}
my_ini_file_data[entity_id][ param_to_idx[param] ] = value_to_idx[value];
getting the data is:
value = idx_to_value[my_ini_file_data[entity_id][ param_to_idx[param] ] ];
Hope, this helps.
Initial answer
Concerning memory, I wouldn't worry unless you have some kind of embedded system with very little memory.
Concerning the speed, I could give you some suggestions:
Find out what the bottleneck is.
Use std::list! With std::vector you re-allocate (and copy) the memory each time the vector grows. If for some reason you need a vector at the end, create it reserving the required number of entries, which you get by calling list::size().
Write a while loop in which you only call getline. If this alone is already slow, read the entire file at once, create a reader stream out of the char* block and read line by line from that stream.
If the speed of the plain reading is OK, optimize your parsing code. You can reduce the number of find calls by storing the positions, e.g.:
int pos_begin = line.find('[');
if (pos_begin != std::string::npos){
    int pos_end = line.find(']');
    if (pos_end != std::string::npos){
        entity.id = line.substr(pos_begin + 1, pos_end - pos_begin - 1);
        // Read all lines until EnumEnd=0
        while (1){
            std::getline(readTemp, line);
            // Break loop if end of entity
            if (line.find("EnumEnd=0") != std::string::npos){
                if (CheckMatch(entity))
                    entities.push_back(entity);
                break;
            }
            Component comp;
            int pos_eq = line.find('=');
            comp.parameter = line.substr(0, pos_eq);
            comp.value = line.substr(pos_eq + 1);
            entity.components.push_back(comp);
        }
    }
}
Depending on how big your entities are, check whether CheckMatch is slow. The smaller the entities, the more often it gets called, and the slower the code - in this case.
You can use less memory by interning your params and values, so as not to store multiple copies of them.
You could have a map of strings to unique numeric IDs, that you create when loading the file, and then just use the IDs when querying your data structure. At the expense of possibly slower parsing initially, working with these structures afterwards should be faster, as you'd only be matching 32-bit integers rather than comparing strings.
Sketchy proof of concept for storing each string once:
#include <unordered_map>
#include <string>
#include <iostream>
using namespace std;
int string_id(const string& s) {
static unordered_map<string, int> m;
static int id = 0;
auto it = m.find(s);
if (it == m.end()) {
m[s] = ++id;
return id;
} else {
return it->second;
}
}
int main() {
// prints 1 2 2 1
cout << string_id("hello") << " ";
cout << string_id("world") << " ";
cout << string_id("world") << " ";
cout << string_id("hello") << endl;
}
The unordered_map will end up storing each string once, so you're set for memory. Depending on your matching function, you can define
struct Component {
int parameter;
int value;
};
and then your matching can be something like myComponent.parameter == string_id("some_key") or even myComponent.parameter == some_stored_string_id. If you want your strings back, you'll need a reverse mapping as well.
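A sketch of one way to add that reverse mapping; this variant starts the ids at 0 so they can index directly into a vector (it is a separate pair of functions, not the string_id above):
#include <string>
#include <unordered_map>
#include <vector>
// Interning with reverse lookup: ids start at 0 and index into a vector.
static std::unordered_map<std::string, int> string_to_id;
static std::vector<std::string> id_to_string;
int intern(const std::string& s) {
    auto it = string_to_id.find(s);
    if (it != string_to_id.end())
        return it->second;
    int id = static_cast<int>(id_to_string.size());
    string_to_id[s] = id;
    id_to_string.push_back(s);   // reverse mapping: id_to_string[id] == s
    return id;
}
const std::string& string_for(int id) {
    return id_to_string.at(id);  // throws std::out_of_range for unknown ids
}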

What's time complexity of this algorithm for getting all word ladders?

Word Ladder
Given two words (start and end), and a dictionary, find all
shortest transformation sequence(s) from start to end,
such that: only one letter can be changed at a time, and each intermediate word must exist in the dictionary.
For example, Given: start = "hit" end = "cog" dict = ["hot","dot","dog","lot","log"] Return
[
["hit","hot","dot","dog","cog"],
["hit","hot","lot","log","cog"]
]
Note: All words have the same length. All words contain only lowercase alphabetic characters.
Personally I think the time complexity of this algorithm depends on the input (start, end, dict) and cannot be written down as a single time complexity = O(?).
Thank you AbcAeffchen. The tight time complexity is O(len*N*(26^(N/2))), where len is the length of the given start string (or end string) and N is the number of elements of the dict (assuming the C++ unordered_set is implemented as a hash set). Please check the details below.
Idea of this solution: BFS(Map) + DFS.[C++]
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include <deque>
#include <string>
using namespace std;
struct Node {
string val;
int level;
vector<Node *> prevs;
Node (string val, int level): val(val), level(level) {};
};
class Solution {
public:
vector<vector<string>> findLadders(string start, string end, unordered_set<string> &dict) {
vector<vector<string>> list;
// Input validation.
if (start.compare(end) == 0) {
vector<string> subList = {start, end};
list.push_back(subList);
return list;
}
deque<string> queue;
unordered_map<string, Node *> map;
queue.push_back(start);
Node *start_node = new Node(start, 0);
map.emplace(start, start_node);
while (!queue.empty()) {
// Dequeue.
string curr_string = queue.front();
queue.pop_front();
Node *curr_node = map.find(curr_string)->second;
int curr_level = curr_node->level;
int len = curr_string.length();
if (curr_string.compare(end) == 0) {
// Find the end.
vector<string> subList;
subList.push_back(curr_node->val);
getAllPathes(curr_node, list, subList);
return list;
}
// Iterate all children.
for (int i = 0; i < len; i ++) {
char curr_original_char = curr_string[i];
// Have a try.
for (char c = 'a'; c <= 'z'; c ++) {
if (c == curr_original_char) continue;
curr_string[i] = c;
if (dict.find(curr_string) != dict.end()) {
if (map.find(curr_string) == map.end()) {
// The new string has not been visited.
Node *child = new Node(curr_string, curr_level + 1);
// Add the parents of the current into prevs.
child->prevs.push_back(curr_node);
// Enqueue.
queue.push_back(curr_string);
map.emplace(curr_string, child);
} else {
// The new string has been visited.
Node *child = map.find(curr_string)->second;
if (child->level == curr_level + 1) {
child->prevs.push_back(curr_node);
}
}
}
}
// Roll back.
curr_string[i] = curr_original_char;
}
}
return list;
}
void getAllPathes(Node *end, vector<vector<string>> &list, vector<string> &subList) {
// Base case.
if (end == NULL) {
// Has been get to the top level, no topper one.
vector<string> one_rest(subList);
list.push_back(one_rest);
return;
}
vector<Node *> prevs = end->prevs;
if (prevs.size() > 0) {
for (vector<Node *>::iterator it = prevs.begin();
it != prevs.end(); it ++) {
// Have a try.
subList.insert(subList.begin(), (*it)->val);
// Do recursion.
getAllPathes((*it), list, subList);
// Roll back.
subList.erase(subList.begin());
}
} else {
// Do recursion.
getAllPathes(NULL, list, subList);
}
}
};
Split-up
Let's split the complexity into three parts:
Find a next word in transformation sequence
The length of a shortest transformation sequence
The number of transformation sequences
Assumptions
Let n be the length of the given words and N the number of words in the dictionary. Let's also assume that the dictionary is sorted.
1. Part
Then you can find a next word in O(n ⋅ 26 ⋅ log(N)) = O(n log N) steps.
n characters in your word, that can be changed.
26 possible changes per character.
log(N) to look up, if this word exists in the dictionary.
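A small sketch of this step, assuming the dictionary is a sorted std::vector<std::string>, so the lookup is the log(N) binary search mentioned above (the function name is only for illustration):
#include <algorithm>
#include <string>
#include <vector>
// Collect all dictionary words reachable from 'word' by changing one letter:
// n positions * 26 candidate letters * O(log N) binary search per candidate.
std::vector<std::string> neighbours(std::string word,
                                    const std::vector<std::string>& sorted_dict)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i < word.size(); ++i) {
        const char original = word[i];
        for (char c = 'a'; c <= 'z'; ++c) {
            if (c == original) continue;
            word[i] = c;
            if (std::binary_search(sorted_dict.begin(), sorted_dict.end(), word))
                result.push_back(word);
        }
        word[i] = original;   // restore before trying the next position
    }
    return result;
}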
2. Part
How long can a shortest transformation sequence be?
Example: Let the start word be "aa", the end word "zz" and the dictionary
["ab", "bb", "bc", "cc", ..].
This example needs 26 transformations. I think you can build worst-case inputs that need something like 26^(n-1) transformations.
But this depends on the words in the dictionary, so in the worst case it will be N, i.e. all words in the dictionary are used.
3. Part
How many different sequences exist?
Every time you look for the next word in the sequence, it is possible to find up to 26 different next steps. But this holds only for the first half of the length of the shortest sequence, because the same argument applies if you switch the start and end word. So there can be up to O(26^(N/2)) different sequences, as long as the worst-case length of a shortest sequence is O(N).
Summary
O(n log N) finding the next transformation in a sequence.
O(N) transformations per sequence
O(26^(N/2)) different sequences.
In total you get O(26^(N/2) ⋅ N ⋅ log N).
Notice
This holds only if your dictionary can contain any sequence of characters as "words". If you only allow words that exist in a real language, you can use statistics to prove a better complexity.
The length of a shortest sequence is correlated with the number of different sequences. If you have a lot of words in your dictionary, the sequences can become very long; but if you have too many, you may get more different sequences, yet they also become shorter. Maybe one can use some statistics here as well to prove a better complexity.
I hope this helps

how to auto generate eclipse-indigo method block comment?

I want to generate a block comment using Eclipse Indigo like this (I'm a C++ programmer):
/**
*
* @param bar
* @return
*/
int foo(int bar);
How can I do this?
If your input is pretty much static, you can write a simplified lexer that will work; it only requires some simple string munging. string has lots of nice editing capabilities with .substr() and .find(); all you have to do is figure out where the parens are. You can optionally process this as a stringstream, which makes it FAR easier (don't forget to use std::skipws to skip whitespace).
http://www.cplusplus.com/reference/string/string/substr/
http://www.cplusplus.com/reference/string/string/find/
#include <vector>
#include <string>
#include <fstream>
using namespace std;
// (sketch: assumes it lives inside a function returning bool, with the streams
//  filein and fileout - your header input and commented output - already open)
struct ARG {
    string sVarArgDataType, sVarArg;
};
ARG a;
vector<ARG> va;
char line[65000];
filein.getline(line, 65000);
line[65000-1] = '\0';                 // force null termination if it hasn't happened
string sline(line);                   // get the line and store it in string sline
size_t firstSpacePos = sline.find(' ');
size_t nextSpacePos  = sline.find(' ', firstSpacePos + 1);
size_t nextCommaPos  = sline.find(',');
size_t openPerenPos  = sline.find('(');
size_t closePerenPos = sline.find(");");
size_t semicolonPos  = sline.find(';');
size_t prevCommaPos  = openPerenPos;  // the "comma" before the first argument is the '('
string sReturnDataType, sFuncName;
if (
    string::npos == firstSpacePos ||
    string::npos == semicolonPos ||
    string::npos == openPerenPos ||
    string::npos == closePerenPos) {
    return false; // failure
}
sReturnDataType = sline.substr(0, firstSpacePos);
sFuncName = sline.substr(firstSpacePos + 1, openPerenPos - (firstSpacePos + 1));
while (string::npos != nextSpacePos && nextSpacePos < closePerenPos) {
    // start of this argument: first non-space after the previous ',' (or the '(')
    size_t argStart = sline.find_first_not_of(' ', prevCommaPos + 1);
    // end of this argument: the next ',' or, for the last argument, the ')'
    size_t argEnd = (nextCommaPos != string::npos) ? nextCommaPos : closePerenPos;
    // assume all keywords are globs of text: "type name"
    a.sVarArgDataType = sline.substr(argStart, nextSpacePos - argStart);
    a.sVarArg = sline.substr(nextSpacePos + 1, argEnd - (nextSpacePos + 1));
    va.push_back(a);                  // add structure to list
    if (nextCommaPos == string::npos) // that was the last argument
        break;
    // move indices to the next argument
    prevCommaPos = nextCommaPos;
    nextCommaPos = sline.find(',', prevCommaPos + 1);
    nextSpacePos = sline.find(' ', prevCommaPos + 2);
}
size_t i;
fileout << "/**\n *\n";
for (i = 0; i < va.size(); i++) {
    fileout << " * @param " << va[i].sVarArg << "\n";
}
fileout << " * @return\n */\n" << sReturnDataType << " " << sFuncName << '(';
for (i = 0; i < va.size(); i++) {
    fileout << va[i].sVarArgDataType << " " << va[i].sVarArg;
    if (i != va.size() - 1) {
        fileout << ", ";              // don't show a comma-space after the last item
    }
}
fileout << ");" << std::endl;
This will handle any number of arguments EXCEPT the variable-argument type (...), but you can put in your own detection code for that, plus the if statement that switches between ... and the 2-keyword argument types. Here I am only supporting 2 keywords per argument in my struct. You can support more by using a while loop to search for all the spaces before the next ',' comma or ')' right paren; inside that while loop, add your variable number of strings to a vector<string> inside the struct you are going to replace - nah, just make a vector<vector<string> >. Or just one vector, and do a va.clear() after every function is done.
I just noticed the eclipse tag. I don't know much about eclipse. I can't even get it to work. some program.