I built a word generator: it picks a length and then randomly picks letters of the alphabet to make up a word.
The program works, but 99% of the output is rubbish because it ignores the constructs of the English language; I get as many words containing x and z as I do e.
What are my options for biasing the RNG so that it uses common letters more often?
I am using rand() from the C standard library, seeded with the time.
The output will still be rubbish, because biasing the random number generator is not enough to construct proper English words. But one approach to biasing the RNG is:
Make a histogram of the occurrences of letters in a large English text (the corpus). You'll get something like 500 'e', 3 'x', 1 'q', 450 'a', 200 'b' and so on.
Divide an interval into ranges where each letter gets a slice, with the length of the slice being the number of occurrences of that letter. a gets [0, 450), b gets [450, 650), ..., q gets [3500, 3501).
Generate a random number between 0 and the total length of the interval and check where it lands. Any number within [450, 650) gives you a 'b', but only 3500 gives you a 'q'.
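A minimal sketch of that histogram-and-intervals idea (the corpus string here is just a stand-in; you would feed it a large text, and all the names are only illustrative):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Build a histogram of 'a'..'z' from a corpus, then sample letters so that
// each letter's probability is proportional to its count.
int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    std::string corpus = "the quick brown fox jumps over the lazy dog";  // stand-in corpus
    std::vector<int> counts(26, 0);
    for (char c : corpus)
        if (c >= 'a' && c <= 'z') ++counts[c - 'a'];

    int total = 0;
    for (int n : counts) total += n;

    // Pick a number in [0, total) and walk the cumulative intervals.
    for (int i = 0; i < 20; ++i) {
        int r = std::rand() % total;
        int letter = 0;
        while (r >= counts[letter]) r -= counts[letter++];
        std::cout << static_cast<char>('a' + letter);
    }
    std::cout << '\n';
}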
One method would be to use the letter frequencies. For each letter define a range: a = [0, 2] (if the letter 'a' has a 2% chance of being used), b = [2, 5] (3% chance), and so on. Then generate a random number between 0 and 100 and choose the letter whose range it falls into.
Another method is to use a nondeterministic finite automaton where you define the transition probabilities (you could parse the Bible and build them from it). So you have a lot of transitions like this: e.g. the transition from 'a' to 'b' has probability 5%. Then you walk through the automaton and generate some words.
I just saw that the proper term is Markov chain, which is probably a better fit than an NFA.
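A rough sketch of such a letter-level Markov chain (the training text and the function names are placeholders; a real run would train on something much larger, like the Bible example above):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>

// transitions[a][b] counts how often letter b follows letter a in the corpus.
int transitions[26][26] = {};

void train(const std::string& corpus) {
    for (std::size_t i = 0; i + 1 < corpus.size(); ++i) {
        char a = corpus[i], b = corpus[i + 1];
        if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
            ++transitions[a - 'a'][b - 'a'];
    }
}

// Pick the next letter with probability proportional to the transition counts.
char nextLetter(char current) {
    int total = 0;
    for (int b = 0; b < 26; ++b) total += transitions[current - 'a'][b];
    if (total == 0) return static_cast<char>('a' + std::rand() % 26);  // unseen letter: fall back to uniform
    int r = std::rand() % total;
    int b = 0;
    while (r >= transitions[current - 'a'][b]) r -= transitions[current - 'a'][b++];
    return static_cast<char>('a' + b);
}

int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    train("some large lowercase english text goes here");   // placeholder corpus
    char c = 't';                                            // arbitrary starting letter
    for (int i = 0; i < 8; ++i) { std::cout << c; c = nextLetter(c); }
    std::cout << '\n';
}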
You can do an n-gram analysis of some body of text and use that as a base for the bias. You can do this either by letters or by syllables. Doing the analysis by syllables is probably more complicated.
To do it by letters, it's easy. You iterate through each character in the source text and keep track of the last n-1 characters you came across. Then, for each new character, you add the last n-1 characters plus this new one (an n-gram) to your table of frequencies.
What does this table of frequencies look like? You could use a map from n-grams to their frequencies, but that approach is not very good for the algorithm I suggest below. For that, it's better to map each (n-1)-gram to a map from the last letter of an n-gram to its frequency. Something like: std::map<std::string, std::map<char, int>>.
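A minimal sketch of building that table for letter trigrams (n = 3); buildTable and the tiny corpus are only illustrative, and for simplicity spaces are treated like any other character:

#include <iostream>
#include <map>
#include <string>

// Map each (n-1)-gram to the frequencies of the letters that follow it.
using FrequencyTable = std::map<std::string, std::map<char, int>>;

FrequencyTable buildTable(const std::string& corpus, std::size_t n) {
    FrequencyTable table;
    for (std::size_t i = 0; i + n <= corpus.size(); ++i) {
        std::string prefix = corpus.substr(i, n - 1);  // the first n-1 characters
        char last = corpus[i + n - 1];                 // the letter that follows them
        ++table[prefix][last];
    }
    return table;
}

int main() {
    FrequencyTable table = buildTable("the theory of the thing", 3);
    for (const auto& entry : table)
        for (const auto& follower : entry.second)
            std::cout << '"' << entry.first << "\" -> '" << follower.first
                      << "' x" << follower.second << '\n';
}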
Having made the analysis and collected the statistics, the algorithm would go like this:
pick a random starting n-gram. Your previous analysis may contain weighted data for which letters usually start words;
from all the n-grams that start with the previous n-1 letters, pick a random last letter (considering the weights from the analysis);
repeat until you reach the end of a word (either using a predefined length or from data about word ending frequencies);
To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number less than the sum of the frequencies and see in which interval it falls.
For example:
A happens 10 times;
B happens 7 times;
C happens 9 times;
You build the following table: { A: 10, B: 17, C: 26 }. You pick a number between 1 and 26. If it is at most 10, it's A; if it's greater than 10 but at most 17, it's B; otherwise it's C.
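A small sketch of that cumulative-weight pick, applied to one of the std::map<char, int> entries from the table above (pickWeighted is an illustrative name; the weights are the A/B/C example):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <map>

// Pick a key with probability proportional to its weight: choose a number
// below the total weight and find the interval it falls into.
char pickWeighted(const std::map<char, int>& weights) {
    int total = 0;
    for (const auto& entry : weights) total += entry.second;
    int r = std::rand() % total;                 // assumes total > 0
    for (const auto& entry : weights) {
        if (r < entry.second) return entry.first;
        r -= entry.second;
    }
    return weights.rbegin()->first;              // not reached when total > 0
}

int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    std::map<char, int> weights = {{'A', 10}, {'B', 7}, {'C', 9}};   // the example above
    for (int i = 0; i < 20; ++i) std::cout << pickWeighted(weights);
    std::cout << '\n';
}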
You may want to use the English language's letter frequencies to get more realistic output: http://en.wikipedia.org/wiki/Letter_frequency.
But if you want pronounceable words, you should probably generate them from syllables. You can find more information online, e.g. here: http://spell.psychology.wustl.edu/SyllStructDistPhon/CVC.html
You could derive a Markov model by reading a source text and then generate words which are "like" the source.
This also works for generating sentences from words. Well, sort of works.
If you want to change just the letter frequencies in the words, without further lexical analysis (like the qu pair), get a list of English-language letter frequencies.
Then create a weighted random generator that has more chance of outputting an e (about 1 in 7) than an x (around 1 in 1000).
To build a weighted random generator (rand() generates integers, IIRC):
1. Normalize the letter frequencies so that they are all integers (for the Wikipedia figures, multiply the fractional frequencies by 100000, so 0.08167 becomes roughly 8167).
2. Make some sort of lookup table where each letter is assigned a range, like the table below:
letter | weight | start |   end
a      |  8.17% |     0 |  8167
b      |  1.49% |  8168 |  9659
c      |  2.78% |  9660 | 12441
d      |  4.25% | 12442 | 16694
e      | 12.70% | 16695 | 29396
f      |  2.23% | 29397 | 31624
g      |  2.02% | 31625 | 33639
...
z      |  0.07% | 99926 | 99999
3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.
First, you need a table with the letters and their weights, something like:
#include <cstdlib>
#include <iterator>

struct WeightedLetter
{
    char letter;
    int  weight;
};

// Rough relative weights (roughly per mille) from an English letter-frequency table.
static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    // ...
};

char getLetter()
{
    // Sum up all of the weights.
    int totalWeight = 0;
    for ( WeightedLetter const* iter = std::begin( letters );
          iter != std::end( letters );
          ++ iter ) {
        totalWeight += iter->weight;
    }

    int choice = rand() % totalWeight;
    // but you probably want a better generator

    // Walk the table until choice falls inside a letter's weight range.
    WeightedLetter const* result = std::begin( letters );
    while ( choice >= result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}
This is just off the top of my head, so it's likely to contain errors; at the very least, the second loop requires some verification. But it should give you the basic idea.
Of course, this still isn't going to result in English-like words. The sequence "uq" is just as likely as "qu", and there's nothing to prevent a word without a vowel, or a ten-letter word with nothing but vowels. The Wikipedia page on English phonology has some good information as to which combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)
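As a sketch of that last idea (the onset, nucleus and coda lists below are tiny illustrative samples, not a real inventory of English syllables):

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Build a Jabberwocky-style word from random syllables, each made of an
// onset + nucleus + coda (onset and coda may be empty).
int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    const std::vector<std::string> onsets = {"", "b", "st", "gr", "pl", "th"};
    const std::vector<std::string> nuclei = {"a", "e", "i", "o", "u", "ou", "ea"};
    const std::vector<std::string> codas  = {"", "n", "rt", "sh", "mp"};

    auto pick = [](const std::vector<std::string>& v) { return v[std::rand() % v.size()]; };

    int syllables = 1 + std::rand() % 3;     // 1 to 3 syllables
    std::string word;
    for (int i = 0; i < syllables; ++i)
        word += pick(onsets) + pick(nuclei) + pick(codas);
    std::cout << word << '\n';
}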
If you want to create pronounceable words, do not try to join letters together.
Join sounds. Make a list of sounds to select from: "abe", "ape", "gre", etc.
I am not a huge math nerd so I may easily be missing something, but let's take the algorithm from https://cp-algorithms.com/string/z-function.html and try to apply it to, say, the string baz. This string definitely has the substring set 'b', 'a', 'z', 'ba', 'az', 'baz'.
Let's see how the Z-function works (at least how I understand it):
we take an empty string and add 'b' to it. By definition of the algorithm, z[0] = 0, since it's undefined for size 1;
we take 'b' and add 'a' to it, invert the string, and we have 'ab'... now we calculate the Z-function... and it produces {0, 0}. The first element is "undefined" as it is supposed to be; the second element should be defined as:
i-th element is equal to the greatest number of characters starting from the position i that coincide with the first characters of s.
So, at i = 1 we have 'b', our string starts with 'a', and 'b' doesn't coincide with 'a', so of course z[1] = 0. And this will be repeated for the whole word. In the end we are left with a Z-array of all zeroes that doesn't tell us anything, despite the string having 6 substrings.
Am I missing something? There are tons of websites recommending the Z-function for counting distinct substrings, but it... doesn't work? Am I misunderstanding the meaning of distinct here?
See test case: https://pastebin.com/mFDrSvtm
When you add a character x to the beginning of a string S, all the substrings of S are still substrings of xS, but how many new substrings do you get?
The new substrings are all prefixes of xS. There are length(xS) of them, but max(Z(xS)) of them are already substrings of S, so you get length(xS) - max(Z(xS)) new ones.
So, given a string S, just add up all the length(P) - max(Z(P)) for every suffix P of S.
Your test case baz has 3 suffixes: z, az, and baz. All the letters are distinct, so their Z functions are zero everywhere. The result is that the number of distinct substrings is just the sum of the suffix lengths: 3 + 2 + 1 = 6.
Try baa: the only non-zero entry in the Z-functions is Z('aa')[1] = 1, so the number of unique substrings is 3 + 2 - 1 + 1 = 5.
Note that the article you linked to mentions that this is an O(n^2) algorithm. That is correct, although its overhead is low. It's possible to do this in O(n) time by building a suffix tree, but that is quite complicated.
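A sketch of that suffix-by-suffix counting in code; zFunction is the standard O(n) Z-function from the linked article, and countDistinctSubstrings is just an illustrative name:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Standard Z-function: z[i] = length of the longest prefix of s that also
// starts at position i (z[0] is left at 0 by convention).
std::vector<int> zFunction(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// For every suffix P of s, add length(P) - max(Z(P)).  O(n^2) overall.
long long countDistinctSubstrings(const std::string& s) {
    long long count = 0;
    for (std::size_t start = 0; start < s.size(); ++start) {
        std::string suffix = s.substr(start);
        std::vector<int> z = zFunction(suffix);
        int maxZ = z.empty() ? 0 : *std::max_element(z.begin(), z.end());
        count += static_cast<long long>(suffix.size()) - maxZ;
    }
    return count;
}

int main() {
    std::cout << countDistinctSubstrings("baz") << '\n';  // prints 6
    std::cout << countDistinctSubstrings("baa") << '\n';  // prints 5
}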
I'm trying to figure out how to calculate the number of strings of length n such that in every substring of length 4 of the string w, all three letters a, b, c occur. For example, abbcaabca should be counted when n = 9, but aabbcabac should not be included.
I was trying to make a math formula like
3^N - 3 * 2^N + 3 or (3^(N-3))*N!
Can it work this way, or do I have to generate the strings and count them? I'm working with large values of n like 100, and I don't think I can generate them all just to count them.
You should be able to work your way up: start with, say, all allowed words of length 4, then append one letter at a time and keep only the allowed resulting words. That way you can iteratively go up to high lengths without having to explore all 3^N possibilities.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    const unsigned w = 4;   // window length
    unsigned n = 10;        // target word length (assumed >= w)

    // Obtain all valid words of length 4: the permutations of
    // "aabc", "abbc" and "abcc" (each window must contain a, b and c).
    vector<string> before, current;
    for (string base : {"aabc", "abbc", "abcc"})
    {
        do before.emplace_back(base);
        while (next_permutation(base.begin(), base.end()));
    }

    // Iteratively append single letters and keep a word only if its last
    // window of length w still contains all three letters.
    for (unsigned k = 1; k <= n - w; ++k)
    {
        current.clear();
        for (const auto& it : before)
        {
            // After appending, the last window is it[k..] plus the new letter,
            // so the other two letters must occur at or after position k.
            size_t posa = it.find('a', k);
            size_t posb = it.find('b', k);
            size_t posc = it.find('c', k);
            if (posb != string::npos && posc != string::npos) current.emplace_back(it + "a");
            if (posa != string::npos && posc != string::npos) current.emplace_back(it + "b");
            if (posa != string::npos && posb != string::npos) current.emplace_back(it + "c");
        }
        before = current;
    }

    for (const auto& it : before) cout << it << endl;
    cout << before.size() << " valid words of length " << n << endl;
}
Note that with this you will still run into the exponential wall pretty quickly... In a more efficient implementation I would represent words as integers (not vectors of integers, but rather integers in a base-3 representation), but the exponential scaling would still be there. If you are just interested in the count, @Jeffrey's approach is surely better.
The trick is to break down the problem. Consider:
Would knowing how many such strings of length 50, ending in each pair of letters, help?
Number of 50-strings ending in AA, times number of 50-strings starting with B or C
+ number of 50-strings ending in AB, times number of 50-strings starting with C
+ all the other combinations
gives you the number of 100-long strings.
Continue breaking it down, recursively.
Look up dynamic programming.
Also look up large number libraries.
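One way to realize that idea as a left-to-right dynamic program is to track only the last three letters as the state, so the work grows linearly in n rather than exponentially. The sketch below assumes n >= 4 and uses Boost.Multiprecision for the huge counts (any big-integer type would do); countValid is just an illustrative name:

#include <boost/multiprecision/cpp_int.hpp>
#include <iostream>
#include <map>
#include <string>

using BigInt = boost::multiprecision::cpp_int;

// Count strings of length n over {a,b,c} in which every window of length 4
// contains all three letters.  The DP state is the last three letters.
BigInt countValid(unsigned n)
{
    const std::string alphabet = "abc";

    // Seed: every string of length 3 is a possible tail (tails that cannot be
    // extended simply die out once the window constraint starts applying).
    std::map<std::string, BigInt> counts;
    for (char x : alphabet)
        for (char y : alphabet)
            for (char z : alphabet)
                counts[std::string{x, y, z}] = 1;

    for (unsigned len = 4; len <= n; ++len)
    {
        std::map<std::string, BigInt> next;
        for (const auto& state : counts)
            for (char c : alphabet)
            {
                std::string window = state.first + c;     // the newest length-4 window
                if (window.find('a') != std::string::npos &&
                    window.find('b') != std::string::npos &&
                    window.find('c') != std::string::npos)
                    next[window.substr(1)] += state.second;   // keep only the last 3 letters
            }
        counts = std::move(next);
    }

    BigInt total = 0;
    for (const auto& state : counts) total += state.second;
    return total;
}

int main()
{
    std::cout << countValid(9) << '\n';     // includes words such as "abbcaabca"
    std::cout << countValid(100) << '\n';   // feasible even for large n
}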
I understand this has been asked before, and I somewhat have a grasp on how to compare frequency tables between a ciphertext and English (the language I'm assuming the plaintext is in for my program), but I'm unsure how to get this into code.
#include <cctype>
#include <string>
#include <vector>

// Upper-cases the text in place and tallies how often each letter occurs.
// freqArg must have 26 entries: index 0 counts 'A', index 25 counts 'Z'.
void frequencyUpdate(std::vector<std::vector<std::string>> &file,
                     std::vector<int> &freqArg) {
    for (auto &line : file) {
        for (auto &word : line) {
            for (auto &ch : word) {
                ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
                if (ch >= 'A' && ch <= 'Z') {
                    freqArg.at(ch - 'A') += 1;
                }
            }
        }
    }
}
This is how I get the frequency of a given file that has its contents split into lines and then into words (hence the double vector of strings), using the ASCII value of each char minus 65 ('A') as the index. The resulting vector of ints holding the frequencies is saved.
Now is where I don't know how to proceed. Should I hardcode a const std::vector<int> with the English letter frequencies and then somehow do a comparison? How would I compare the two efficiently, rather than naively comparing the vectors element by element, which is probably not efficient?
This comparison is for getting an appropriate shift value for decrypting a Caesar-cipher text. I don't want to use brute force and shift one letter at a time until the text is readable. Any advice on how to approach this? Thanks.
Take your frequency vector and the frequency vector for "typical" English text, and find the cross-correlation.
The highest values of the cross-correlation correspond to the most likely shift values. At that point you'll need to use each one to decrypt, and see whether the output is sensible (i.e. forms real words and coherent sentences).
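A minimal sketch of that cross-correlation, assuming the 26-entry counts vector produced by frequencyUpdate above; kEnglishFreq holds the commonly quoted Wikipedia percentages for 'A'..'Z', and likelyShift is only an illustrative name:

#include <array>
#include <vector>

// Typical relative frequencies for 'A'..'Z' in English text (percent).
const std::array<double, 26> kEnglishFreq = {
    8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
    0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
    6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074};

// Returns the most likely Caesar shift: the one whose cross-correlation
// between the observed counts and the English frequencies is highest.
int likelyShift(const std::vector<int> &cipherCounts) {
    int bestShift = 0;
    double bestScore = -1.0;
    for (int shift = 0; shift < 26; ++shift) {
        double score = 0.0;
        for (int i = 0; i < 26; ++i) {
            // Plaintext letter i would appear as ciphertext letter (i + shift) % 26.
            score += kEnglishFreq[i] * cipherCounts[(i + shift) % 26];
        }
        if (score > bestScore) {
            bestScore = score;
            bestShift = shift;
        }
    }
    return bestShift;
}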
In English, 'e' has the highest frequency. So whatever the most frequent letter in your ciphertext is, it most likely maps to 'e'.
Since e --> X, the key should be the difference between 'e' and your most frequent letter X.
If this is not the right key (because a ciphertext that is too short distorts the statistics), try matching your most frequent ciphertext letter with the second most frequent letter in English, i.e. 't'.
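A small sketch of that shortcut, again assuming the counts vector from frequencyUpdate (index 0 for 'A' through 25 for 'Z'); guessShift is just an illustrative name:

#include <algorithm>
#include <vector>

// Guess the Caesar shift by assuming the most frequent ciphertext letter is 'E'.
int guessShift(const std::vector<int> &cipherCounts) {
    int mostFrequent = static_cast<int>(
        std::max_element(cipherCounts.begin(), cipherCounts.end()) - cipherCounts.begin());
    return (mostFrequent - ('E' - 'A') + 26) % 26;   // the shift that maps 'E' to that letter
}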
I would suggest a graph traversal algorithm. Your starting node has no substitutions assigned and has 26 connected nodes, one for each possible letter substitution for the most frequently occurring ciphertext letter. The next node has another 25 connected nodes for the possible letters for the second most frequent ciphertext letter (one less, since you've already used one possible letter). Which destination node you choose should be based on which letters are most likely given a normal frequency distribution for the target language.
At each node, you can test for success by doing your substitutions into the ciphertext, and finding all the resulting words that now match entries in a dictionary file. The more matches you've found, the more likely you've got the correct substitution key.
I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
// part that compares them all to each other
for (int j = 0; j < counter; j++)          // counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        // boring stuff
    }

// part that compares strings. Definitely very inefficient,
// although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    // basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        while (offset < str1.length() && j + offset < str2.length() &&
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} // this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you really are interested in is long overlaps, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example):
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't stored the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurence counter by one, else enter it into the table.
e) Repeat step 3d), but with the last instead of the first character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them, and that is by itself extremely long; a computation cluster seems the best approach.
The only thing I see that can be improved is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though it is still a possible option to investigate), and then compare them quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
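A minimal sketch of that encoding and the AND-based compatibility test (encodeBase, matches and overlapAtOffset are illustrative names; packing two bases per byte is left out for clarity):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map each base to a 4-bit pattern; 'X' (the wild card) gets all four bits.
std::uint8_t encodeBase(char base) {
    switch (base) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;   // 'X' or anything unexpected
    }
}

std::vector<std::uint8_t> encode(const std::string &dna) {
    std::vector<std::uint8_t> out;
    out.reserve(dna.size());
    for (char c : dna) out.push_back(encodeBase(c));
    return out;
}

// Two positions are compatible if their patterns share at least one bit,
// so the wild card 0x0F is compatible with every base.
inline bool matches(std::uint8_t a, std::uint8_t b) { return (a & b) != 0; }

// Number of leading positions that stay compatible when b is shifted by
// `offset` relative to a: the analogue of the inner while loop above,
// except that 'X' now acts as a wild card instead of breaking the run.
int overlapAtOffset(const std::vector<std::uint8_t> &a,
                    const std::vector<std::uint8_t> &b, std::size_t offset) {
    int run = 0;
    while (static_cast<std::size_t>(run) < a.size() &&
           offset + run < b.size() &&
           matches(a[run], b[offset + run]))
        ++run;
    return run;
}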
I want to select a number of random words from an array to make a total of 36 letters.
At first I tried to select a random word and add it after checking that it's not longer than the amount of free space left. That was not efficient, since the list would fill up and there would only be room left for a 2-3 letter word, and it takes a long time to find such a short word.
So I decided to only choose six 6-letter words, and I'm doing that by generating a random number and then incrementing it by 1 until we find a 6-letter word. It's pretty fast, but the words aren't really that random; often I get words that start with the same letter, or only words that start with letters in sequence like a, b, c or x, y, z.
srand(time(NULL));
for (int i = 0; i < 6; i++)
{
    randNumb = rand() % dictionary.size();
    while (dictionary.at(randNumb).length() != 6)
    {
        randNumb++;
    }
    a << "/" << dictionary.at(randNumb) << "/";
}
I would like to choose words of different lengths, but in favor of performance I'll settle for just the 6-letter words; then I would at least want them to be more randomly selected.
You should get a new random number instead of increasing the index. The way you do it, all the strings not matching your criterion "attract" more random numbers, and the matching string that follows such a run has a higher probability of being chosen.
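For example, something along these lines (a sketch; pickSixLetterWord is only an illustrative name, and it assumes the dictionary contains at least one 6-letter word, otherwise it loops forever):

#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Re-roll until we hit a 6-letter word, instead of scanning forward from a
// random start; every 6-letter word then has the same chance of being picked.
std::size_t pickSixLetterWord(const std::vector<std::string> &dictionary)
{
    std::size_t randNumb;
    do
    {
        randNumb = std::rand() % dictionary.size();
    } while (dictionary.at(randNumb).length() != 6);
    return randNumb;
}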
The rand() function generates a number between 0 and RAND_MAX.
If RAND_MAX is defined as 32767, then you will not access elements in your dictionary (array?) with indices greater than that.
If you need to generate a random number greater than RAND_MAX, then think about summing the result of n calls of rand(), such that n * RAND_MAX >= dictionary.size(). The modulus of this result is then guaranteed to give an index that falls somewhere in the bounds of the entire dictionary.
Even if RAND_MAX is greater than dictionary.size(), using the % operator to select the index leads to a non-uniform distribution. The modulus will cause the early words to be selected more often than the later words (unless RAND_MAX + 1 is an integer multiple of dictionary.size()).
Consider a simple example: Assume your dictionary has 10 words, and RAND_MAX is 14. When rand() returns a value from 0 to 9, the corresponding word is chosen directly. But when rand() is 10 through 14, then one of the first five words will be chosen. So the first five words have twice the chance of being selected than the last five words.
A better way to map [0..RAND_MAX] to [0..dictionary.size()) is to use division:
assert(RAND_MAX + 1 >= dictionary.size());
randNumb = rand() * dictionary.size() / (RAND_MAX + 1);
But you have to be careful of integer overflow. If RAND_MAX * dictionary.size() is larger than you can represent in an integer, you'll need to use a larger data type. Some systems have a function like MulDiv for just this purpose. If you don't have something like MulDiv, you can convert to a floating point type and then truncate the result back to an integer:
double temp = static_cast<double>(rand()) * dictionary.size() / (RAND_MAX + 1);
randNumb = static_cast<int>(temp);
This is still an imperfect distribution, but the "hot" words will now be evenly distributed across the dictionary instead of clumping at the beginning.
The closer RAND_MAX + 1 is to an integer multiple of dictionary.size(), the better off you'll be. And if you can't be sure that it's close to an integer multiple, then you want RAND_MAX to be as large as possible relative to dictionary.size().
Since you don't have much control over RAND_MAX, you could consider tweaking dictionary.size(). For example, if you only want six-letter words, then why not strip all the others out of the dictionary?
std::vector<std::string> six_letter_words;
std::copy_if(dictionary.begin(), dictionary.end(),
             std::back_inserter(six_letter_words),
             [](const std::string &word){ return word.size() == 6; });
With the reduced set, we can use a more generic algorithm to select the words:
typedef std::vector<std::string> WordList;

// Returns true with the given probability, which should be 0.0 to 1.0.
bool Probably(double probability) {
    return (static_cast<double>(std::rand()) / RAND_MAX) < probability;
}

// Selects n words from the dictionary, each with equal probability, and
// copies them to target (selection sampling: no word is chosen twice).
template <typename OutputIt>
OutputIt Select(int n, const WordList &dictionary, OutputIt target) {
    double count = static_cast<double>(n);
    for (std::size_t i = 0; count > 0.0 && i < dictionary.size(); ++i) {
        if (Probably(count / (dictionary.size() - i))) {
            *target++ = dictionary[i];
            count -= 1.0;
        }
    }
    return target;
}
The idea is to step through each word in the dictionary and select it with a probability of the number of words you need to pick divided by the number of words left to pick from. This works well, even if RAND_MAX is relatively small. Overall, though, it's much more computation than trying to randomly select indexes. Also note that this technique will never choose the same word more than once, where the index mapping technique could.
You call Select like this:
// Select six words from six_letter_words, each with equal probability.
WordList selected;
Select(6, six_letter_words, std::back_inserter(selected));
Also note that most implementations of rand() are pretty simplistic and may not give a particularly good (uniform) distribution to begin with.