Generate (not so) random string with particular string occurrences

I have a requirement where I have the alphabet 'ACGT' and I need to create a string of around 20,000 characters. This string should contain 100+ occurrences of the pattern "CCGT". Most of the time, the generated string contains only around 20-30 instances.
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
str += alphabet[rand() % (alphabet.length())];
}
How do I tweak the code so that the pattern appears more often?
Edit - Is there a way of changing the alphabet, i.e. treating 'A', 'C', 'G', 'T' and 'CCGT' as the characters of the alphabet?
Thank you.

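Generate an array of ints containing 100 0s and 4,900 each of 1s, 2s, 3s and 4s
[000000....111111....2222 etc], making 19,700 entries.
Then randomly shuffle it (std::random_shuffle, or std::shuffle in C++11 and later, since std::random_shuffle was removed in C++17).
Then write a string where each 0 translates to 'CCGT', each 1 translates to 'A', each 2 .... etc. With these counts the output is exactly 20,000 characters long (100 x 4 + 19,600).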
I think that gives you what you want, and by tweaking the original array of ints you could change the number of 'A' characters in the output too.
Edit: If that isn't random enough, do 100 0s at the start and then random 1-4 for the rest.

The only solution I can think of that would meet the "100+" criterion is:
create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
pick random start position
write CCGT
}
Of course, you'd need to ensure the overwritten characters weren't part of a "CCGT" already.
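The pseudocode above could be fleshed out roughly as follows. This is only a sketch under some assumptions: the helper name make_seeded_string and the taken bookkeeping vector are mine, not part of the answer, and I use std::mt19937 instead of rand(). It keeps the planted patterns from overlapping each other, so at least count copies survive; it does not try to protect patterns that happened to appear by chance in the random background.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): fills a string with random
// letters, then overwrites `count` non-overlapping random spots with `pattern`.
std::string make_seeded_string(std::size_t n, std::size_t count, std::mt19937& rng,
                               const std::string& alphabet = "ACGT",
                               const std::string& pattern = "CCGT") {
    const std::size_t len = pattern.size();
    assert(!alphabet.empty() && len > 0 && len <= n);
    // Each planted pattern blocks at most 2 * len - 1 start positions, so this
    // guarantees the retry loop below can always find a free spot.
    assert(count * (2 * len - 1) <= n - len + 1);

    std::uniform_int_distribution<std::size_t> pick_char(0, alphabet.size() - 1);
    std::string s(n, alphabet[0]);
    for (char& c : s) c = alphabet[pick_char(rng)];

    std::vector<bool> taken(n, false);  // marks characters owned by a planted pattern
    std::uniform_int_distribution<std::size_t> pick_start(0, n - len);
    std::size_t planted = 0;
    while (planted < count) {
        const std::size_t start = pick_start(rng);
        bool is_free = true;
        for (std::size_t k = 0; k < len; ++k) {
            if (taken[start + k]) { is_free = false; break; }
        }
        if (!is_free) continue;  // would damage an earlier planted pattern: retry
        for (std::size_t k = 0; k < len; ++k) {
            s[start + k] = pattern[k];
            taken[start + k] = true;
        }
        ++planted;
    }
    return s;
}
```

Calling make_seeded_string(20000, 100, rng) with a seeded std::mt19937 rng should give a 20,000-character string with at least 100 occurrences of "CCGT", plus whatever the random background contributes.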

My first thought would be to generate a list of 100 indexes where you will definitely insert the special string. Then as you generate the random string, insert the special string at each of these indexes as you reach them.
I've left out the check that the intervals are spaced appropriately (no two can be within 4 of each other, and none may be repeated), which would be necessary for this to work.
int N = 20000;
std::string alphabet("ACGT");
std::string pattern("CCGT"); // the special string to insert
int intervals[100];
for (int index = 0; index < 100; index++)
{
    intervals[index] = rand() % (N - 3); // leave room for the 4-character pattern
    // Do some sort of check to make sure each element of intervals is not
    // within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
std::sort(intervals, intervals + 100); // needs <algorithm>
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    if (current_interval_index < 100 && index == intervals[current_interval_index])
    {
        str += pattern;
        current_interval_index++;
        index += 3; // the pattern fills 4 positions; the loop increment covers the 4th
    }
    else
    {
        str += alphabet[rand() % (alphabet.length())];
    }
}

The solution I came up with uses a std::vector to contain all the random sets of 4 chars including the 100 special sequence. I then shuffle that vector to distribute the 100 special sequence randomly throughout the string.
To make the distribution of letters even I create an alternative alphabet string called weighted that contains a relative abundance of alphabet characters according to what has already been included from the 100 special sequence.
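#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>
int main()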
{
std::srand(std::time(0));
using std::size_t;
const size_t N = 20000;
std::string alphabet("ACGT");
// stuff the ballot
std::vector<std::string> v(100, "CCGT");
// build a properly weighted alphabet string
// to give each letter equal chance of appearing
// in the final string
std::string weighted;
// This could be scaled down to make the weighted string much smaller
for(size_t i = 0; i < (N - 200) / 4; ++i) // already have 200 Cs
weighted += "C";
for(size_t i = 0; i < (N - 100) / 4; ++i) // already have 100 Gs & Ts
weighted += "GT";
for(size_t i = 0; i < N / 4; ++i)
weighted += "A";
size_t remaining = N - (v.size() * 4);
// add the remaining characters to the weighted string
std::string s;
for(size_t i = 0; i < remaining; ++i)
s += weighted[std::rand() % weighted.size()];
// add the random "4 char" sequences to the vector so
// we can randomly distribute the pre-loaded special "4 char"
// sequence among them.
for(std::size_t i = 0; i < s.size(); i += 4)
v.push_back(s.substr(i, 4));
// distribute the "4 char" sequences randomly
std::random_shuffle(v.begin(), v.end());
// rebuild string s from vector
s.clear();
for(auto&& seq: v)
s += seq;
std::cout << s.size() << '\n'; // should be N
}

I like the answer by @Andy Newman and think that is probably the best way - the code below is a compilable example of what they suggested.
#include <string>
#include <algorithm>
#include <iostream>
#include <cstdlib> // srand, rand
#include <ctime> // time
int main()
{
srand(time(0));
int N = 20000;
std::string alphabet("ACGT");
std::string str;
str.reserve(N);
int int_array[19700];
// Populate int array
for (int index = 0; index < 19700; index++)
{
if (index < 100)
{
int_array[index] = 0;
}
else
{
int_array[index] = (rand() % 4) + 1;
}
}
// Want the array in a random order
std::random_shuffle(&int_array[0], &int_array[19700]);
// Now populate string from the int array
for (int index = 0; index < 19700; index++)
{
switch(int_array[index])
{
case 0:
str += "CCGT"; // the special pattern, not the whole alphabet
break;
case 1:
str += 'A';
break;
case 2:
str += 'C';
break;
case 3:
str += 'G';
break;
case 4:
str += 'T';
break;
default:
break;
}
}
// Print out to check what it looks like
std::cout << str << std::endl;
}

You should make N larger.
I take this liberty because you say 'create a string of around 20,000 characters'; but there's more to it than that.
If you're only finding around 20-30 instances in a string of 20000 characters then something is wrong. A ballpark estimate is to say that there are 20000 character positions to test, and at each of these there will be a four-letter string from an alphabet of four letters, giving a 1/256 chance of it being a specific string. The average should be (approximately; because I've oversimplified) 20000/256, or 78 hits.
It could be that your string isn't properly randomised (likely due to the use of the modulo idiom), or perhaps you're testing only every fourth character position -- as if the string were a list of non-overlapping four-letter words.
If you can bring your average hit rate back up to 78, then you can reach a little further to your 100 requirement simply by increasing N proportionally.
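To check the hit rate you are actually getting, something like the following could be used. It is a sketch, and both function names are my own. count_occurrences tests every start position rather than every fourth one, and expected_occurrences is the ballpark above with the small correction that only N - 3 start positions exist, i.e. 19997 / 256, which is about 78.1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>

// Counts occurrences of `pattern` in `text`, testing every start position
// (so overlapping matches are counted too).
std::size_t count_occurrences(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    std::size_t hits = 0;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        ++hits;
    }
    return hits;
}

// Expected number of hits in a uniformly random string of length n:
// (n - m + 1) start positions, each matching with probability k^-m.
double expected_occurrences(std::size_t n, std::size_t alphabet_size,
                            std::size_t pattern_len) {
    assert(pattern_len <= n);
    return static_cast<double>(n - pattern_len + 1) /
           std::pow(static_cast<double>(alphabet_size),
                    static_cast<double>(pattern_len));
}
```

If count_occurrences(str, "CCGT") keeps coming out far below expected_occurrences(20000, 4, 4), the generator or the counting is the first thing to look at.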

Related

Max Length Substring without any repeating character

Given a string, find the length of the longest substring without any repeating character.
Example 1:
Input: s = "abcabcbb"
Output: 3
Explanation: The answer is abc with a length of 3.
Example 2:
Input: s = "bbbbb"
Output: 1
Explanation: The answer is b with a length of 1.
My solution works, but it isn't optimised. How can this be done in O(n) time?
#include<bits/stdc++.h>
using namespace std;
int solve(string str) {
    if (str.size() == 0)
        return 0;
    int maxans = 0;
    for (int i = 0; i < (int)str.length(); i++) // outer loop: every start position
    {
        unordered_set<char> seen;
        for (int j = i; j < (int)str.length(); j++) // extend the substring starting at str[i]
        {
            if (seen.find(str[j]) != seen.end()) // repeated character: stop extending
                break;
            seen.insert(str[j]);
            maxans = max(maxans, j - i + 1); // also counts substrings that run to the end
        }
    }
    return maxans;
}
int main() {
string str = "abcsabcds";
cout << "The length of the longest substring without repeating characters is " <<
solve(str);
return 0;
}

Use a two pointer approach along with a hashmap here.
Initialise two pointers i = 0, j = 0 (i and j denote the left and right boundary of the current substring)
If the j-th character is not in the map, we can extend the substring. Add the j-th char to the map and increment j.
If the j-th character is in the map, we cannot extend the substring without removing the earlier occurrence of the character. Remove the i-th char from the map and increment i.
Repeat this while j < length of string
This will have a time and space complexity of O(n).
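A minimal sketch of the steps above, assuming a std::unordered_set is good enough as the "map" (only membership is needed) and with a function name of my own choosing:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Length of the longest substring of s with no repeated character.
std::size_t longest_unique_substring(const std::string& s) {
    std::unordered_set<char> window;  // characters currently in s[i, j)
    std::size_t i = 0, j = 0, best = 0;
    while (j < s.size()) {
        if (window.count(s[j]) == 0) {
            // s[j] is new: extend the window to the right
            window.insert(s[j]);
            ++j;
            best = std::max(best, j - i);
        } else {
            // s[j] is already inside: shrink the window from the left
            window.erase(s[i]);
            ++i;
        }
    }
    assert(best <= s.size());  // sanity check
    return best;
}
```

Each character is inserted once and erased at most once, which is where the O(n) bound comes from.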

#include <string>
#include <iostream>
#include <vector>
int main() {
// 1
std::string s;
std::cin >> s;
// 2
std::vector<int> lut(127, -1);
int i, beg{ 0 }, len_curr{ 0 }, len_ans{ 0 };
for (i = 0; i != s.size(); ++i) {
if (lut[s[i]] == -1 || lut[s[i]] < beg) {
++len_curr;
}
else {
if (len_curr > len_ans) {
len_ans = len_curr;
}
beg = lut[s[i]] + 1;
len_curr = i - lut[s[i]];
}
lut[s[i]] = i;
}
if (len_curr > len_ans) {
len_ans = len_curr;
}
// 3
std::cout << len_ans << '\n';
return 0;
}
In // 1 you:
Define and read your string s.
In // 2 you:
Define your lookup table lut, which is a vector of int consisting of 127 buckets, each initialized with -1. As per this article there are 95 printable ASCII characters numbered 32 to 126, hence we allocate 127 buckets (indices 0 to 126). lut[ch] is the position in s where you last found the character ch.
Define i (index variable for s), beg (the position in s where your current substring begins), len_curr (the length of your current substring), len_ans (the length you are looking for).
Loop through s. If you have never found the character s[i] before OR you have found it but at a position BEFORE beg (it belonged to some previous substring in s), you increment len_curr. Otherwise you have a repeating character! You compare len_curr against len_ans and assign if len_curr is larger. Your new beg will be the position AFTER the one where you last found the repeating character. Your new len_curr will be the difference between your current position in s and the position where you last found the repeating character.
You assign i to lut[s[i]], which means that you last found the character s[i] at position i.
You repeat the if clause after the loop ends because your longest substring can be AT the end of s.
In // 3 you:
Print len_ans.

Given two strings S and T, determine a substring of S that has minimum difference with T

I have two strings S and T, where the length of S >= the length of T. I have to determine a substring of S which has the same length as T and has minimum difference with T. Here, the difference between two strings of the same length means the number of indexes where they differ. For example, "ABCD" and "ABCE" differ at the 3rd index, so their difference is 1.
I know I can use the KMP (Knuth-Morris-Pratt) pattern-searching algorithm to search for T within S. But what if S doesn't contain T as a substring? So I have coded a brute-force approach to solve this:
#include <climits>
#include <iostream>
#include <string>
using namespace std;
int main() {
string S, T;
cin >> S >> T;
int SZ_S = S.size(), SZ_T = T.size(), MinDifference = INT_MAX;
string ans;
for (int i = 0; i + SZ_T <= SZ_S; i++) { // I generate all the substring of S
int CurrentDifference = 0; // and check their difference with T
for (int j = 0; j < SZ_T; j++) { // and store the substring with minimum difference
if (S[i + j] != T[j])
CurrentDifference++;
}
if (CurrentDifference < MinDifference) {
ans = S.substr (i, SZ_T);
MinDifference = CurrentDifference;
}
}
cout << ans << endl;
}
But my approach only works when S and T are short. The problem is that S and T can have lengths as large as 2 * 10^5. How can I approach this?

Let's maximize the number of characters that match. We can solve the problem for each character of the alphabet separately, and then sum up the results for each substring position. To solve the problem for a particular character, represent S and T as sequences of 0s and 1s (1 where that character occurs, with T reversed so that the product gives a correlation) and multiply them using the FFT: https://en.wikipedia.org/wiki/Fast_Fourier_transform.
The complexity is O(|A| * N log N), where |A| is the size of the alphabet (26 for uppercase letters).
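Here is a rough sketch of how that could look. The function names are mine, the FFT is a plain textbook radix-2 implementation rather than anything tuned, and I only loop over letters that actually occur in T. It assumes 1 <= |T| <= |S|. The reversal of T is what turns the polynomial product into a sliding match count: coefficient p + |T| - 1 of the product is the number of matches for the window starting at p.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <string>
#include <vector>

using cd = std::complex<double>;

// In-place iterative radix-2 FFT; a.size() must be a power of two.
void fft(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {  // bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = 2 * pi / len * (invert ? -1 : 1);
        const cd wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            cd w(1);
            for (std::size_t j = 0; j < len / 2; ++j) {
                const cd u = a[i + j], v = a[i + j + len / 2] * w;
                a[i + j] = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert) for (cd& x : a) x /= static_cast<double>(n);
}

// Start index in s of the first length-|t| window with the fewest mismatches.
std::size_t best_window_start(const std::string& s, const std::string& t) {
    const std::size_t n = s.size(), m = t.size();
    assert(m >= 1 && n >= m);
    std::size_t size = 1;
    while (size < n + m) size <<= 1;

    std::string letters = t;  // only letters that occur in T can ever match
    std::sort(letters.begin(), letters.end());
    letters.erase(std::unique(letters.begin(), letters.end()), letters.end());

    std::vector<long long> matches(n - m + 1, 0);
    for (char c : letters) {
        std::vector<cd> a(size), b(size);
        for (std::size_t i = 0; i < n; ++i) a[i] = (s[i] == c) ? 1.0 : 0.0;
        for (std::size_t j = 0; j < m; ++j) b[m - 1 - j] = (t[j] == c) ? 1.0 : 0.0;
        fft(a, false);
        fft(b, false);
        for (std::size_t k = 0; k < size; ++k) a[k] *= b[k];
        fft(a, true);
        for (std::size_t p = 0; p + m <= n; ++p)
            matches[p] += std::llround(a[p + m - 1].real());
    }
    return static_cast<std::size_t>(
        std::max_element(matches.begin(), matches.end()) - matches.begin());
}
```

The answer is then S.substr(best_window_start(S, T), T.size()), and the minimum difference is |T| minus the match count at that position.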

Similar Char(TCS CodeVita) Reduce Time Complexity

You have to take the length of the string from the user, then the string itself, and then the number of queries. Suppose the user gives 3 queries: 4, 5, 7. Then 4, 5 and 7 are the positions at which you have to check how many times the character at that position occurs before that position.
Inputs:
9 (Length of the string to input)
abcabcabc
3 (No. of queries to check)
4 (Check at position 4)
5 (Check at position 5)
7 (Check at position 7)
Output:
1
1
2
Code which I made:
#include <iostream>
#include <vector>
int main()
{
int n; // Length of String
int a[10000]; // Input for Query
int q; // NO. of queries
std::cin >> n;
std::vector<char> ch(n + 1); // To store the InputString (1-based, hence n + 1 slots)
for (int i = 1; i <= n; i++) {
std::cin >> ch[i];
}
std::cin >> q;
for (int j = 1; j <= q; j++) {
std::cin >> a[j];
}
for (int i = 1; i <= q; i++) {
int count = 0;
for (int j = 1; j < a[i]; j++) {
if (ch[j] == ch[a[i]]) {
count = count + 1;
}
}
std::cout << count << "\n";
}
return 0;
}
But the problem is that the time complexity of the program is too high. In the worst case it is O(n*q), where n = length of the string and q = number of queries. How can I improve the time complexity?

First, your Output does not meet the assignment. For example, at the position [1], there is character 'b' (0-based indexing), which has zero repetitions before that position. At the position [2], there is character 'a', which has one repetition before that position. So do you mean, before, including?
Second, is this homework? (Just curious. Can hardly imagine any real-life case here... :))
Finally, to your question:
1) make an array of <query position, count> pairs
2) go through the InputString only once up to the highest query position and at every position in the InputString, increase your count if relevant for the query position in the <query position, count> pair.
2.b) Slight further optimization: sort your array of <query position, count> pairs by query position first. In your InputString, go up to the lowest query position, so that you have to check every <query position, count> pair in this first pass. Then continue scanning your InputString from the [lowest query position + 1] position up to the second lowest query position. You do not need to care about the lowest <query position, count> pair in this second pass. Etc. ...
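A slight variation on the single pass described above, as a sketch only: instead of tracking just the queried positions, it records for every position how often its character has already appeared, so each query becomes a lookup and the total is O(n + q). The function names are mine, and I assume query positions are 1-based as in the sample input.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// before[i] = how many times s[i] already occurred in s[0..i-1] (0-based).
std::vector<int> earlier_occurrences(const std::string& s) {
    std::array<int, 256> seen{};  // running count per byte value
    std::vector<int> before(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        before[i] = seen[c];
        ++seen[c];
    }
    return before;
}

// Answers 1-based query positions, as in the question's sample input.
std::vector<int> answer_queries(const std::string& s,
                                const std::vector<std::size_t>& positions) {
    const std::vector<int> before = earlier_occurrences(s);
    std::vector<int> out;
    out.reserve(positions.size());
    for (std::size_t pos : positions) {
        assert(pos >= 1 && pos <= s.size());
        out.push_back(before[pos - 1]);
    }
    return out;
}
```

For the sample input, answer_queries("abcabcabc", {4, 5, 7}) gives 1, 1, 2, matching the expected output in the question.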

Splitting a string into maximum number of equal substrings

Given a string, what's the most optimized solution to find the maximum number of equal substrings? For example, "aaaa" is composed of four equal substrings "a", and "abab" is composed of two "ab"s. But for something like "abcd" there is no substring other than "abcd" itself that, when concatenated with itself, would make up "abcd".
Checking all the possible substrings isn't a solution since the input can be a string of length 1 million.

Since there is no given condition for the substrings, an optimized solution to find the maximum number of equal substrings is to count the shortest possible strings, letters. Create a map and count the letters of the string. Find the letter with the maximum number. That is your solution.
EDIT:
If the string must only consist of the substrings then the following code computes a solution
#include <iostream>
#include <string>
using ull = unsigned long long;
int main() {
std::string str = "abab";
ull length = str.length();
for (ull i = 1; (2 * i) <= str.length(); ++i) {
if (str.length() % i != 0) {
continue; // the head length must divide the string length
}
bool found = true;
for (ull j = 1; (j * i) < str.length(); ++j) {
for (ull k = 0; k < i; ++k) {
if(str[k] != str[k + j * i]) {
found = false;
}
}
}
if(found) {
length = i;
break;
}
}
std::cout << "Maximal number: " << str.length() / length << std::endl;
return 0;
}
This algorithm checks if the head of the string is repeated and if the string only consists of repetitions of the head.
i-loop iterates over the length of the head,
j-loop iterates over each repetition,
k-loop iterates over each character in the substring
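For inputs around a million characters, a different technique than the loops above may be worth considering: the KMP prefix function gives the smallest period of the string in O(n). This is a sketch of that alternative, not the method of this answer, and the function name is my own. If pi is the prefix function and n the length, the candidate period is n - pi[n - 1]; the string splits into equal parts only if that period divides n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Maximum number of equal substrings that s can be split into (1 if none).
std::size_t max_equal_parts(const std::string& s) {
    const std::size_t n = s.size();
    assert(n > 0);
    // pi[i] = length of the longest proper prefix of s[0..i] that is also its suffix
    std::vector<std::size_t> pi(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        std::size_t k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) k = pi[k - 1];
        if (s[i] == s[k]) ++k;
        pi[i] = k;
    }
    const std::size_t period = n - pi[n - 1];
    return (n % period == 0) ? n / period : 1;
}
```

For example, "abab" has pi = [0, 0, 1, 2], so the period is 4 - 2 = 2 and the result is 2.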

Longest prefix string length for all the suffixes

Find the length of the longest prefix string for all the suffixes of the string.
For example suffixes of the string ababaa are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string "ababaa" are 6,0,3,0,1,1 respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
I wrote the following code:
#include <iostream>
#include <string.h>
#include <stdio.h>
#include <time.h>
int main ( int argc, char **argv) {
size_t T;
std::cin >> T;
char input[100000];
for ( register size_t i = 0; i < T; ++i) {
std::cin >> input;
double t = clock();
size_t len = strlen(input);
char *left = input;
char *right = input + len - 1;
long long sol = 0;
int end_count = 1;
while ( left < right ) {
if ( *right != '\0') {
if ( *left++ == *right++ ) {
sol++;
continue;
}
}
end_count++;
left = input; // reset the left pointer
right = input + len - end_count; // set right to one left.
}
std::cout << sol + len << std::endl;
printf("time= %.3fs\n", (clock() - t) / (double)(CLOCKS_PER_SEC));
}
}
It works fine, but for a string that is 100000 characters long and consists of a single repeated character, i.e. aaaaaaaaaa.......a, it takes a long time. How can I optimize this further?

You can use Suffix Array: http://en.wikipedia.org/wiki/Suffix_array

Let's say your ababaa is a pattern P.
I think you could use the following algorithm:
Create a suffix automaton for all possible suffixes of P.
Walk the automaton using P as input, counting the edges traversed so far. For each accepting state of the automaton, add the current edge count to the total sum. Walk the automaton until you either reach the end of the input or there are no more edges to go through.
The total sum is the result.

Use the Z algorithm to compute, for every suffix, the length of its longest common prefix with the whole string in O(n), then scan the resulting array and sum its values.
Reference: https://www.geeksforgeeks.org/sum-of-similarities-of-string-with-all-of-its-suffixes/
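A minimal sketch of that idea, with function names of my own choosing. z[i] is the length of the longest common prefix of the string and its suffix starting at i; the whole string contributes its full length, which is added separately because z[0] is left at 0 here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Standard Z-function: z[i] = longest common prefix of s and s[i..]; z[0] = 0.
std::vector<std::size_t> z_function(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> z(n, 0);
    std::size_t l = 0, r = 0;  // [l, r) is the rightmost matched segment found so far
    for (std::size_t i = 1; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Sum of similarities of s with all of its suffixes.
long long similarity_sum(const std::string& s) {
    const std::vector<std::size_t> z = z_function(s);
    assert(z.size() == s.size());
    long long total = static_cast<long long>(s.size());  // s matches itself fully
    for (std::size_t i = 1; i < z.size(); ++i) total += static_cast<long long>(z[i]);
    return total;
}
```

For "ababaa" the Z values at positions 1 to 5 are 0, 3, 0, 1, 1, so the sum is 6 + 5 = 11 as in the question. For a string of 100000 identical characters the total is about 5 * 10^9, which is why the sum is kept in a long long.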

From what I see, you are using a plain array to evaluate the suffixes, and though it may turn out to be efficient for some data sets, it fails to be efficient in some cases, such as the one you mentioned.
You would need to implement a prefix tree (trie) or a similar data structure. The code for those isn't straightforward, so if you are not familiar with them, I would suggest you read a little bit about them.

I'm not sure whether a trie gives you much performance gain, but I would certainly think about it.
The other idea I had is to try to compress your string. I didn't really think about it, just a crazy idea...
if you have a string like this: ababaa compress it maybe to: abab2a. Then you have to come up with a technique where you can use your algorithm with those strings. The advantage is you can then compare long strings 100000a efficiently with each other. Or more importantly: you can calculate your sum very fast.
But again, I didn't think it through, maybe this is a very bad idea ;)

Here is a Java implementation:
// sprefix
String s = "abababa";
Vector<Integer>[] v = new Vector[s.length()];
int sPrefix = s.length();
v[0] = new Vector<Integer>();
v[0].add(new Integer(0));
for(int j = 1; j < s.length(); j++)
{
v[j] = new Vector<Integer>();
v[j].add(new Integer(0));
for(int k = 0; k < v[j - 1].size(); k++)
if(s.charAt(j) == s.charAt(v[j - 1].get(k)))
{
v[j].add(v[j - 1].get(k) + 1);
v[j - 1].set(k, 0);
}
}
for(int j = 0; j < v.length; j++)
for(int k = 0; k < v[j].size(); k++)
sPrefix += v[j].get(k);
System.out.println("Result = " + sPrefix);
