How to find the number of ways of permuting a string satisfying the conditions below?

I am given a string, let's say "abcd".
Now I have to find all the strings that can be generated by permuting its characters such that:
There are exactly four mismatches between the original and the generated string, and
The mismatches exist in pairs, e.g.:
The string "abcd" has three such permutations:
"badc", "cdab", "dcba".
Explanation:
Consider "abcd" and "badc". There are exactly four mismatches between them, namely (a,b), (b,a), (c,d), (d,c), and these mismatches exist in pairs.
Note that "abcde" has fifteen such permutations:
acbed,adebc,aedcb,baced,badce,baedc,cbaed,cdabe,ceadb,dbeac,dcbae,decab,ebdca,ecbda,edcba
Where am I failing?
I am just finding the strings manually, but this becomes really time-consuming for long strings. Hence, I need an efficient solution.

If you have a string of length n consisting of n different letters, then the number you are looking for is n * (n - 1) * (n - 2) * (n - 3) / 8.
Reason: there are C(n, 4) ways to choose the 4 mismatch positions among the n letters, and for every such quadruple there are three ways to split it into two swapped pairs.
Hence the result is C(n, 4) * 3 = n * (n - 1) * (n - 2) * (n - 3) / 8.
This assumes that all letters in the string are different. It is unclear from your description of the problem whether letters can repeat. Please comment on this answer if that is your case, and I will update the answer.
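For small inputs, the closed form can be sanity-checked against brute force. The following is a minimal Python sketch, assuming all letters are distinct and reading "mismatches exist in pairs" as: the (original, new) letter pairs at the mismatched positions form a symmetric multiset. The function names are illustrative and not part of the question.

```python
from itertools import permutations
from math import comb


def count_by_formula(n: int) -> int:
    """Closed form from the answer: C(n, 4) * 3 = n(n-1)(n-2)(n-3)/8."""
    return 3 * comb(n, 4)


def count_by_brute_force(s: str) -> int:
    """Count permutations of s with exactly four mismatches that pair up.

    Assumption: all letters of s are distinct, so every permutation
    is a different string.
    """
    count = 0
    for p in permutations(s):
        # (original letter, new letter) at every position that changed
        mismatches = [(a, b) for a, b in zip(s, p) if a != b]
        if len(mismatches) != 4:
            continue
        # "in pairs": for every (x, y) there is a matching (y, x)
        if sorted(mismatches) == sorted((b, a) for a, b in mismatches):
            count += 1
    return count
```

The brute force is factorial in the string length, so it is only useful for checking the formula on short strings such as "abcd" or "abcde".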
Edit: Now suppose that the letters can repeat. The situation is more complicated, so I will just give a sketch here.
Let's say the string contains m different letters, occurring a_1, ..., a_m times, respectively. Also, write b_i for the number C(a_i, 2).
There are three cases:
Case 1: the four mismatches form two identical pairs, e.g. two a's are swapped with two b's.
In this case, we have sum{b_i * b_j : 1 <= i < j <= m} different strings. This is equal to ((sum{b_i})^2 - sum{b_i^2}) / 2, an expression that can be evaluated in O(m) time.
Case 2: the four mismatches contain three different letters, e.g. one a is paired with one b, and another a is paired with one c.
In this case, first choose the letter that is common to the two pairs, say the i-th letter; then choose the other two letters.
Write s for the sum of all a_i, and t for the sum of all b_i. If we choose two arbitrary letters from the remaining s - a_i letters, we have to exclude the cases where the two letters are identical, of which there are t - b_i. Thus there are C(s - a_i, 2) - (t - b_i) ways to choose the other two letters, hence b_i * (C(s - a_i, 2) - (t - b_i)) ways to choose the four positions. Summing over all i gives the number of position quadruples in this case, which can still be evaluated in O(m) time. Each quadruple yields 2 different strings, since the two copies of the common letter can be matched with the other two letters in two ways, so the number of different strings is twice that sum.
Case 3: four different letters, e.g. one a and one b form a pair, and one c and one d form another pair.
The idea is the same. First, consider the number of ways to choose 4 arbitrary letters from all s letters. Then we have to exclude several cases:
i. All four letters are identical; this has sum{C(a_i, 4)} cases;
ii. Three letters are identical and the fourth is different; this has sum{C(a_i, 3) * (s - a_i)} cases;
iii. Two letters are identical and the other two are also identical; this is exactly the number calculated in Case 1;
iv. Two letters are identical and the other two differ from them and from each other; this is exactly the number of quadruples calculated in Case 2 (before doubling).
So, the number of ways to choose four different letters from all s letters is C(s, 4) - [i] - [ii] - [iii] - [iv]. As before, this number should be multiplied by 3 to yield the number of different strings, because each quadruple gives 3 different strings.
All in all, the time complexity is O(m), which is optimal since every a_i has to be read.
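The three cases can be put together in a short Python sketch. It assumes the letter counts are obtained with collections.Counter, and it weights the Case 2 quadruples by 2 (the shared letter can be matched with the other two letters in two ways) and the Case 3 quadruples by 3. That weighting is my reading of the case analysis, checked only by brute force on small strings; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def count_paired_mismatch_strings(text: str) -> int:
    """Number of distinct strings that differ from text by two disjoint
    swaps of unequal letters (four mismatches that exist in pairs)."""
    counts = list(Counter(text).values())      # a_1, ..., a_m
    s = sum(counts)                            # total number of letters
    pairs = [comb(a, 2) for a in counts]       # b_i = C(a_i, 2)
    t = sum(pairs)

    # Case 1: quadruples of the form {x, x, y, y}; one string each.
    case1 = (t * t - sum(b * b for b in pairs)) // 2

    # Case 2: quadruples {x, x, y, z}; two strings each.
    case2 = sum(b * (comb(s - a, 2) - (t - b)) for a, b in zip(counts, pairs))

    # Case 3: four different letters; three strings each.
    all_same = sum(comb(a, 4) for a in counts)
    three_same = sum(comb(a, 3) * (s - a) for a in counts)
    case3 = comb(s, 4) - all_same - three_same - case1 - case2

    return case1 + 2 * case2 + 3 * case3
```

For a string of distinct letters every b_i is 0, so only Case 3 contributes and the result reduces to 3 * C(n, 4).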

Regular Expression to check if part of a string is greater than a specific number

I have a string of type "CCUV2-20151223.1.122", this string contains three parts separated by a dot (.)
Is there a way to check if the third part (say 122 in this example) is a number greater than a specific number (say 90) using regular expression?
Generally speaking, it is better to just take that part of the string and cast it to an actual number using whatever language you are using. However, here is a general algorithm:
Let's say you want to check whether a string is a number greater than N, where N can be written with the n digits d1 d2 ... dn. You just have to look at the following cases:
[1-9]\d{n,} - the number has more than n digits and doesn't start with 0
[(d1+1)-9]\d{n-1} - the number starts with a digit greater than d1 and continues with n-1 digits
d1[(d2+1)-9]\d{n-2} - the number starts with d1, followed by a digit greater than d2, and continues with n-2 digits
...
d1d2...d(n-1)[(dn+1)-9] - all digits but the last are the same and the last digit is greater than dn
Now just use | to combine these cases.
Applying this for 122 we get:
[1-9]\d{3,}|[2-9]\d{2}|1[3-9]\d|12[3-9]
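The case list can be generated mechanically. Below is a minimal Python sketch, assuming a non-negative integer threshold and numbers written without leading zeros; a digit position whose threshold digit is 9 simply contributes no alternative. The helper names and the `[^.]*` layout for the first two parts are assumptions for illustration.

```python
import re


def greater_than_pattern(threshold: int) -> str:
    """Build a regex that matches decimal integers (no leading zeros)
    strictly greater than threshold. Assumes threshold >= 0."""
    digits = str(threshold)
    n = len(digits)
    # More digits than the threshold: always greater.
    alternatives = [r"[1-9]\d{%d,}" % n]
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 9:
            continue  # no digit is greater than 9 at this position
        digit_class = "9" if d == 8 else "[%d-9]" % (d + 1)
        rest = n - i - 1  # digits that may be anything after position i
        tail = r"\d{%d}" % rest if rest else ""
        alternatives.append(digits[:i] + digit_class + tail)
    return "|".join(alternatives)


def third_part_greater_than(text: str, threshold: int) -> bool:
    """Check the third dot-separated part of text against threshold."""
    pattern = r"[^.]*\.[^.]*\.(?:" + greater_than_pattern(threshold) + ")"
    return re.fullmatch(pattern, text) is not None
```

With the question's example, `third_part_greater_than("CCUV2-20151223.1.122", 90)` checks the part 122 against 90 using only a regular expression. Converting the part to an integer in the host language remains the simpler option.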

how to map a specialized string into specified integer

I am doing some financial trading work. I have a set of stock symbols, and they have a very clear pattern:
each is composed of two characters (AB, AC, AD, ...) and the current month as a four-digit number (1503, 1504, 1505, ...). Some examples are:
AB1504
AB1505
AC1504
AC1505
AD1504
AD1505
....
Since these strings follow such a regular pattern, I want to map (hash) each string to a unique integer so that I can use the integer as an array index for fast access. I have a lot of retrievals inside my system, and std::unordered_map or any other hash map is not fast enough: my tests show that general hash maps have latencies at the hundred-nanosecond level, while array indexing is always under 100 nanoseconds.
My ideal case would be, for example, that AB1504 maps to integer 1, AB1505 maps to 2, and so on; then I can create an array to access the information related to these symbols much faster.
I'm trying to figure out a hash algorithm or other method that can achieve my goal, but couldn't find one.
Do you have any suggestions for this problem?
You can regard the string as a variable-base number representation, and convert that to an integer. For example:
AC1504:
A (range: A-Z)
C (range: A-Z)
15 (range: 0-99)
04 (range: 1-12)
Extract the parts; then a hash function could be
int part1, part2, part3, part4;
...
part1 -= 'A';
part2 -= 'A';
part4 -= 1;
return ((part1 * 26 + part2) * 100 + part3) * 12 + part4;
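The same mixed-base idea, written out end to end as a Python sketch for illustration. It assumes the symbol always has the form of two uppercase letters followed by a two-digit year and a two-digit month (01 to 12); the function name and the validation are not from the original answer.

```python
def symbol_index(symbol: str) -> int:
    """Map a symbol such as 'AC1504' to a unique integer by reading it
    as a mixed-base number: base 26, base 26, base 100, base 12."""
    if len(symbol) != 6 or not symbol[2:].isdigit():
        raise ValueError("expected two letters followed by four digits")
    part1 = ord(symbol[0]) - ord("A")   # 0..25
    part2 = ord(symbol[1]) - ord("A")   # 0..25
    part3 = int(symbol[2:4])            # year, 0..99
    part4 = int(symbol[4:6]) - 1        # month, shifted to 0..11
    if not (0 <= part1 < 26 and 0 <= part2 < 26 and 0 <= part4 < 12):
        raise ValueError("letters must be A-Z and month must be 01-12")
    return ((part1 * 26 + part2) * 100 + part3) * 12 + part4
```

Every valid symbol lands in the range 0 to 26 * 26 * 100 * 12 - 1 = 811199, so a flat array of 811200 entries covers all of them.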
The following values should be representable by a 32-bit integer:
XYnnnn => (26 * X + Y) * 10000 + nnnn
Here X and Y take values in the range [0, 26), and n takes values in the range [0, 10).
You have a total of 6,760,000 representable values, so if you only want to associate a small amount of data with it (e.g. a count or a pointer), you can just make a flat array, where each symbol occupies one array entry.
If you parse the string as a mixed-base number, first 2 base-26 digits and then 4 base-10 digits, you will quickly get a unique index for each string. The only issue is that you might get a sparsely populated array.
You can always reorder the digits when calculating the index to minimize the issue mentioned above.
As the numbers are actually months, I would calculate the number of months since the first entry and combine that (by multiplying and adding, as in any mixed-base number) with the 2-digit base-26 number from the prefix.
Hope you can make some sense from this, typing on my tablet at the moment. :D
I assume the format is 'AAyymm', where A is an uppercase character, yy a two-digit year and mm the two-digit month.
Hence you can map it to 10 (AA) + Y (yy) + 4 (mm) bits, where Y = 32 - 10 - 4 = 18 bits for a 32-bit representation (or 262144 years).
Having that, you can represent the format as an integer by shifting the characters to their places, and shifting the year and month pairs to their places after converting them to integers.
Note: There will always be gaps in the binary representation, here in the 5+5 bit representation of the characters (6 + 6 unused values) and in the 4-bit month representation (4 unused values).
To avoid the gaps, change the representation to ABmmmm, where the pair AB is represented by the number 26*A+B and mmmm is the month relative to some zero month in some year (which covers 2^32/1024/12 = 349525 years with 32 bits).
However, you might consider splitting stock symbols and time. Combining two values in one field is usually troublesome (it might be a good storage format, but not a good 'program data format').
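A gap-free variant along the lines of the ABmmmm suggestion could look like the Python sketch below. The base month (January of year 15) and the 120-month horizon are purely hypothetical parameters chosen for illustration, as is the function name.

```python
def dense_symbol_index(symbol: str, base_year: int = 15, base_month: int = 1,
                       horizon_months: int = 120) -> int:
    """Map 'AAyymm' to prefix * horizon_months + month offset.

    Assumption: all symbols fall within horizon_months months of the
    (hypothetical) base month, so the index range has no unused slots
    apart from unused letter pairs.
    """
    prefix = (ord(symbol[0]) - ord("A")) * 26 + (ord(symbol[1]) - ord("A"))
    year = int(symbol[2:4])
    month = int(symbol[4:6])
    if not (0 <= prefix < 26 * 26 and 1 <= month <= 12):
        raise ValueError("expected two uppercase letters and month 01-12")
    # Months elapsed since the base month.
    offset = (year - base_year) * 12 + (month - base_month)
    if not 0 <= offset < horizon_months:
        raise ValueError("month outside the configured horizon")
    return prefix * horizon_months + offset
```

The array then needs 26 * 26 * horizon_months entries, which with the defaults above is 81120 instead of several hundred thousand.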

Linear time construction of a data structure to answer in constant time if two strings have 2 common letters

Problem statement:
Given a string which contains words separated by blanks (spaces), it is required to construct a data structure in linear time (O(n), where n is the number of characters in the string) that can answer in constant time (O(1)) whether two words in that string have two common characters.
Any ideas on what that data structure would look like?
Thanks.
Your question is ambiguous.
What is the definition of a word?
Exactly two characters in common or at least two characters in common?
Do you just need to answer whether a) there exist two words in the string which have two common characters, or b) you will be given some pairs of words and need to answer, for each pair, whether those words have two common characters?
In case of (b) of point 3, what is the format of the input? Will you be given whole words as input or just the indices of the words in the string?
I'm assuming a word consists of only alphabetical characters (a-zA-Z) and you will be given arbitrary pairs of words by their index in the input string.
Define array f such that (f[a],...,f[z]) = (0,...,25) and (f[A],...,f[Z]) = (26,...,51).
Let the input string S consist of words W[0], W[1], ..., W[n-1]. Define a map from a word to an integer by F(W) = OR(1 << f[c]) for all characters c in word W.
If we have to find out whether two words W1, W2 have (assuming at least) two characters in common, we just need to find out whether B = F(W1) & F(W2) has at least two set bits. You can either loop over all bits of B to check this, which is still O(1), or you can check whether B && (B & (B-1)) is true. (Explanation: B & (B-1) unsets the lowest set bit of B, so if the result is non-zero, B must have at least two set bits.)
Now you can precompute F(w) for each word in O(|S|) and then output for each query in O(1) by comparing the F-values.
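A minimal Python sketch of this bitmask approach, under the same assumptions as above (words are split on whitespace, only a-z and A-Z count, upper and lower case are distinct letters, and queries are given as word indices). The function names are illustrative. In a language with fixed-width integers, the 52 possible bits need a 64-bit type.

```python
def letter_mask(word: str) -> int:
    """F(W): OR together one bit per distinct letter in the word."""
    mask = 0
    for c in word:
        if "a" <= c <= "z":
            mask |= 1 << (ord(c) - ord("a"))        # bits 0..25
        elif "A" <= c <= "Z":
            mask |= 1 << (26 + ord(c) - ord("A"))   # bits 26..51
    return mask


def build_masks(text: str) -> list:
    """Precompute F(w) for every word; linear in the length of text."""
    return [letter_mask(word) for word in text.split()]


def share_two_letters(masks: list, i: int, j: int) -> bool:
    """O(1) query: do words i and j have at least two letters in common?"""
    common = masks[i] & masks[j]
    # common & (common - 1) clears the lowest set bit of common.
    return common != 0 and (common & (common - 1)) != 0
```

For example, after `masks = build_masks("hello world help")`, the query `share_two_letters(masks, 0, 1)` compares "hello" and "world", which share l and o.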

beginner regex expressions for bit strings

I have included what I need to express - my attempt at the solution.
for strings of 0's and 1's:
the strings that contain exactly one 1 - (0*)(1)(0*)
the strings with two or more 0's or 1's followed by two or more 0's - (0|1){2}(0|1)* (0){2}(0*)
the strings that contain 01 - (0|1)* (01) (0|1)*
I'm not sure how to express "contain", because what I'm doing seems kind of redundant. Am I somewhat on the right track with these?
Let's go through your answers:
the strings that contain exactly one 1 = 1{1}
{N} is a quantifier, which can be
{n} exact
{n,} from n to infinite
{n,m} from n to m
the strings with two or more 0's or 1's followed by two or more 0's
(0{2,}0{2,})|(1{2,}0{2,})
I did not use [01]{2,} because the expression would then validate 1000, which I think is wrong.
the strings that contain 01 - 01
Always keep it simple.
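Whether "contain" needs the surrounding (0|1)* depends on whether the pattern must describe the whole string or is only searched for inside it. The Python sketch below illustrates that difference under the assumption that inputs consist only of 0's and 1's: re.fullmatch corresponds to the whole-string reading used in the question, re.search to the substring reading used in the answer. The function names are illustrative.

```python
import re


def has_exactly_one_1(bits: str) -> bool:
    """Whole-string match of the asker's pattern 0*10*."""
    return re.fullmatch(r"0*10*", bits) is not None


def contains_01_search(bits: str) -> bool:
    """Substring search: the bare pattern 01 is enough."""
    return re.search(r"01", bits) is not None


def contains_01_fullmatch(bits: str) -> bool:
    """Whole-string match: the pattern must also cover everything
    before and after the 01, hence the (0|1)* on both sides."""
    return re.fullmatch(r"(0|1)*01(0|1)*", bits) is not None
```

Both "contains 01" variants accept the same bit strings; they differ only in which matching mode they rely on.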

Constructing Strings using Regular Expressions and Boolean logic ||

How do I construct strings with exactly one occurrence of 111 from a set E* consisting of all possible combinations of elements in the set {0,1}?
You can generate the set of strings with the following steps:
Some chunks of digits and their legal positions are enumerated:
Start: 110
Must have one, anywhere: 111
Anywhere: 0, 010, 0110
End: 011
The rest depends on the length of the target string (the length should be at least 3):
Condition 1: Length = 3 : {111}
Condition 2: 6 > Length > 3 : (Length-3) = 1x + 3y + 4z
For example, if length is 5: answer is (2,1,0) and (1,0,1)
(2,1,0) -> two '0' and one '010' -> ^0^010^ or ^010^0^ (111 can be placed in any one place marked as ^)
(1,0,1) -> one '0' and one '0110' ...
Condition 3: If 9 > Length > 6, you should consider the solutions of two formulas:
Comments:
Length - 3: the length excluding 111
x: the number of times 0 occurs
y: the number of times 010 occurs
z: the number of times 0110 occurs
Find all solutions {(x,y,z) | 1x + 3y + 4z = (Length - 3)} ----(1)
For each solution, you can generate one or more qualified strings. For example, suppose you want to generate strings of length 10. One solution (x,y,z) is (0,2,1), which means '010' should occur twice and '0110' should occur once. Based on this solution, the following strings can be generated:
0: 0 times
010: 2 times
0110: 1 time
111: 1 time (must have)
Find the permutations of the elements above:
010-0110-010-111 or 111-010-010-0110 ...
(Length - 6) = 1x + 3y + 4z ---(2)
As in the case above, find all permutations to form an intermediate string.
Finally, for each intermediate string Istr, both Istr + 011 and 110 + Istr are qualified.
For example, (10-6) = 1*0 + 3*0 + 4*1 or = 1*1 + 3*1 + 4*0
The intermediate string can be composed of one '0110' for solution (0,0,1):
Then ^0110^011 and 110^0110^ are qualified strings (111 can be placed in any one place marked as ^)
Or the intermediate string can be composed of one '0' and one '010' for solution (1,1,0)
The intermediate string can be 0 010 or 010 0
Then ^0010^011 and 110^0100^ are qualified strings (111 can be placed in any one place marked as ^)
Condition 4: If Length > 9, an additional formula should be considered:
(Length - 9) = 1x + 3y + 4z
As in the case above, find all permutations to form an intermediate string.
Finally, for each intermediate string Istr, 110 + Istr + 011 is qualified.
Explanation:
The logic I use is based on combinatorics. A target string is viewed as a concatenation of one or more substrings. To fulfill the constraint ('111' appears exactly once in the target string), we should set criteria on the substrings. '111' is definitely one substring, and it can only be used once. The other substrings must not violate the '111'-exactly-once constraint and must also be general enough to generate all possible target strings.
Apart from the single '111', the other substrings should not have more than two adjacent '1's, because a substring such as '111', '1111' or '11111' would contain an unwanted '111'. However, the substrings '1' and '11' alone cannot ensure the constraint: for example, '1'+'11', '11'+'11' or '1'+'1'+'1' all contain an unwanted '111'. To prevent this, we add '0' on both sides to stop runs of adjacent '1's. That results in three qualified substrings: '0', '010' and '0110'. Any string combined from these three qualified substrings contains no '111' at all.
The three qualified substrings above can be placed anywhere in the target string, since they guarantee no additional '111' in the target string.
If a substring is at the start or the end of the target string, it needs only one '0' to prevent '111'.
In start:
10xxxxxxxxxxxxxxxxxxxxxxxxxxxx
110xxxxxxxxxxxxxxxxxxxxxxxxxxx
In end:
xxxxxxxxxxxxxxxxxxxxxxxxxxx011
xxxxxxxxxxxxxxxxxxxxxxxxxxxx01
These two cases also ensure no additional '111'.
Based on the logic above, we can generate target strings of any length with exactly one '111'.
Your question could be clearer.
For one thing, does "1111" contain one occurrence of "111" or two?
If it counts as two, you want all strings that contain "111" but do not contain either "1111" or "111.*111". If it counts as one, omit the test for "1111".
If I understand you correctly, you're trying to construct an infinite subset of the infinite set of sequences of 0s and 1s. How you do that is probably going to depend on the language you're using (most languages don't have a way of representing infinite sets).
My best guess is that you want to generate a sequence of all sequences of 0s and 1s (which shouldn't be too hard) and select the ones that meet your criteria.
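A generate-and-filter approach for one fixed length could be sketched in Python as follows. It assumes overlapping occurrences are counted separately, so "1111" counts as two occurrences of "111" and is rejected; the function names are illustrative. Since the full set is infinite, the sketch enumerates one length at a time.

```python
from itertools import product


def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of sub in text, overlaps included."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def strings_with_exactly_one_111(length: int) -> list:
    """All bit strings of the given length containing '111' exactly
    once (overlapping occurrences counted), in lexicographic order."""
    result = []
    for bits in product("01", repeat=length):
        candidate = "".join(bits)
        if count_overlapping(candidate, "111") == 1:
            result.append(candidate)
    return result
```

This examines all 2^length candidates, so it is only practical for short strings, but it gives a reference against which a constructive method can be checked.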