Finding Lexicographically smallest arrangement of some string

The title might make this look like a very common question, but please bear with me.
Suppose that for each index of the string you know which letters may appear at that index, and you want to find the lexicographically smallest arrangement.
For example:
Index | Options
-------|----------
1 | 'b'
2 | 'c', 'a'
3 | 'd', 'c'
4 | 'c', 'a'
So the output should be badc. Note that characters cannot repeat, so a simple greedy algorithm does not work.
I think we could use some sort of breadth-first search: keep a queue of partial strings, and whenever a partial string cannot be extended any further, pop it from the queue. I doubt this is optimal, though; there must be something in O(N). Any ideas?
And no, I don't think C is bad, but I would prefer code snippets in C/C++ :p
Thanks!

This can be solved with a matching algorithm, for example via network flow. The problem reduces to a bipartite graph problem.
To be precise, it is a maximum-weight assignment problem, so a maximum-cost maximum matching gives the solution.
Below are the two vertex sets of the bipartite graph:
Level | Alphabet
------|---------
1     | a
2     | b
3     | c
4     | d
      | e
      | .
      | .
      | .
      | z
Now add an edge from the Level set to the Alphabet set if and only if that letter is an option for that level. For the example, the edges are {1,b}, {2,a}, {2,c}, {3,c}, {3,d}, {4,a}, {4,c}.
To get the lexicographically smallest result, assign a weight to each edge as follows:
Edge Wt. = 26^(N-Level) * ('z' - Alphabet)
For example, with the letters numbered a=1 ... z=26, the weight of edge {2,c} is
26^(4-2) * (26-3) = 26^2*23
Now you can use a standard maximum-cost maximum matching algorithm, which runs in polynomial time. This is the best approach I can think of right now. The naive solution is exponential (26^N), so a polynomial solution should be a welcome improvement.
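One practical caveat with the weights above: 26^(N-Level) no longer fits in a 64-bit integer once N grows past roughly 13. The following is a minimal sketch of a swapped-in variant that needs no weights at all. It is not the weighted formulation described above: it fixes positions from left to right and, for each one, picks the smallest unused letter for which the remaining positions can still be matched, checking that with Kuhn's augmenting-path algorithm. The names tryAssign, canComplete and smallestArrangement are hypothetical, and the input is assumed to hold only lowercase letters.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Kuhn's augmenting-path step: try to give position `pos` a free letter,
// re-routing previously matched positions if necessary.
bool tryAssign(int pos, const std::vector<std::string>& opts,
               const std::vector<bool>& taken, std::vector<int>& owner,
               std::vector<bool>& seen) {
    for (char ch : opts[pos]) {
        int c = ch - 'a';
        if (taken[c] || seen[c]) continue;
        seen[c] = true;
        if (owner[c] == -1 || tryAssign(owner[c], opts, taken, owner, seen)) {
            owner[c] = pos;
            return true;
        }
    }
    return false;
}

// True if positions first..n-1 can all receive distinct letters
// that are not already marked in `taken`.
bool canComplete(std::size_t first, const std::vector<std::string>& opts,
                 const std::vector<bool>& taken) {
    std::vector<int> owner(26, -1);
    for (std::size_t p = first; p < opts.size(); ++p) {
        std::vector<bool> seen(26, false);
        if (!tryAssign(static_cast<int>(p), opts, taken, owner, seen))
            return false;
    }
    return true;
}

// opts[i] holds the letters allowed at index i.
// Returns "" if no arrangement without repeated letters exists.
std::string smallestArrangement(std::vector<std::string> opts) {
    std::vector<bool> taken(26, false);
    std::string result;
    for (std::size_t i = 0; i < opts.size(); ++i) {
        std::sort(opts[i].begin(), opts[i].end());  // try small letters first
        bool placed = false;
        for (char ch : opts[i]) {
            int c = ch - 'a';
            if (taken[c]) continue;
            taken[c] = true;
            if (canComplete(i + 1, opts, taken)) {
                result += ch;
                placed = true;
                break;
            }
            taken[c] = false;  // this choice blocks a later position
        }
        if (!placed) return "";
    }
    return result;
}
```

Since letters cannot repeat, there are at most 26 positions, so the repeated matching checks stay cheap. For the example in the question, smallestArrangement({"b", "ca", "dc", "ca"}) yields badc.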

The naive approach is to use backtracking and try every possible solution, but that is not efficient enough (26!). You can improve the backtracking solution with dynamic programming over a bitmask, where the bitmask records which characters have been used so far.
Write a recursive function that takes two inputs: the index to which a character should be assigned next, and a bitmask indicating which characters have been used so far. Initially the bitmask contains 26 zeros, meaning no characters have been used. After assigning a character to an index, update the bitmask accordingly; for example, if we use the character a, we set the first bit of the bitmask to 1. Memoizing on (index, bitmask) means overlapping sub-problems are solved only once.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
using namespace std;
// options[i] lists the characters allowed at index i.
// (Named "options" rather than "data" so it cannot clash with std::data.)
vector<vector<char> > options;
// Memo: (index, used-letter bitmask) -> smallest valid suffix, "" if none.
map< pair<int,int>, string > dp;
string func( int index, int bitmask ){
    pair<int,int> p = make_pair(index,bitmask);
    if ( dp.count( p ) )
        return dp[p];
    string min_str = "";
    for ( size_t i=0; i<options[index].size(); ++i ){
        char ch = options[index][i];
        if ( (bitmask&(1<<(ch-'a'))) == 0 ){   // ch has not been used yet
            string cur_str(1, ch);
            if ( index+1 != (int)options.size() ){
                string sub = func(index+1, bitmask | (1<<(ch-'a')));
                if (sub == "")
                    continue;                  // dead end: suffix cannot be completed
                cur_str += sub;
            }
            if ( min_str == "" || cur_str < min_str )
                min_str = cur_str;
        }
    }
    dp[p] = min_str;
    return min_str;
}
int main()
{
    options.resize(4);
    options[0].push_back('b');
    options[1].push_back('c');
    options[1].push_back('a');
    options[2].push_back('d');
    options[2].push_back('c');
    options[3].push_back('c');
    options[3].push_back('a');
    cout << func(0,0) << endl;   // prints badc
}

Related

How to sort non-numeric strings by converting them to integers? Is there a way to convert strings to unique integers while being ordered?

I am trying to convert strings to integers and sort them based on the integer value. These values should be unique to the string, no other string should be able to produce the same value. And if a string1 is bigger than string2, its integer value should be greater. Ex: since "orange" > "apple", "orange" should have a greater integer value. How can I do this?
I know there are an infinite number of possibilities between just 'a' and 'b', but I am not trying to fit every single possibility into a number. I am just trying to sort, let's say, 1 million values, not an infinite amount.
I was able to get the values to be unique using the following:
long int order = 0;
for (auto letter : word)
    order = order * 26 + letter - 'a' + 1;
return order;
but this obviously does not work since the value for "apple" will be greater than the value for "z".
This is not a homework assignment or a puzzle, this is something I thought of myself. Your help is appreciated, thank you!

You are almost there; only minor tweaks are needed:
1. Multiply by 27 instead of 26.
You have 26 letters (a..z) plus the empty position, so the base should be 27.
2. Add zero padding.
To make the first letter the most significant digit, zero-pad (align) the strings to a common length. If you are using 32-bit integers, the maximum string length is:
floor(log27(2^32)) = floor(32/log2(27)) = 6
Here is a small example:
int lexhash(char *s)
{
    int i,h;
    for (h=0,i=0;i<6;i++) // process string
    {
        if (s[i]==0) break;
        h*=27;
        h+=s[i]-'a'+1;
    }
    for (;i<6;i++) h*=27; // zeropad missing letters
    return h;
}
returning these:
14348907 a
28697814 b
43046721 c
373071582 z
15470838 abc
358171551 xyz
23175774 apple
224829626 orange
ordered by hash:
14348907 a
15470838 abc
23175774 apple
28697814 b
43046721 c
224829626 orange
358171551 xyz
373071582 z
This handles all lowercase a..z strings of up to 6 characters, which is:
26^6 + 26^5 + 26^4 + 26^3 + 26^2 + 26^1 = 321272406 possibilities
For longer strings, use a wider integer type for the hash. Remember to use an unsigned type if the hash also needs the highest bit (not the case for 32 bits, since 27^6 is below 2^31).
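As an illustration of the wider-type suggestion, here is a minimal sketch of a 64-bit variant. The name lexhash64 and the use of std::string are assumptions made for this example. Since floor(64/log2(27)) = 13, up to 13 lowercase letters fit into an unsigned 64-bit value; in this sketch, longer strings are simply truncated, so they no longer get unique values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Order-preserving code for lowercase a..z strings of up to 13 characters.
// Each position is a base-27 digit: 0 = padding, 1..26 = 'a'..'z'.
std::uint64_t lexhash64(const std::string& s) {
    const std::size_t maxLen = 13;  // 27^13 < 2^64 <= 27^14
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < maxLen; ++i) {
        h *= 27;
        if (i < s.size())
            h += static_cast<std::uint64_t>(s[i] - 'a' + 1);
        // positions past the end of s contribute the padding digit 0
    }
    return h;
}
```

Comparing two such values with < then gives the same result as comparing the original strings, for inputs within the stated limits.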

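You can use the position of each character:
std::string s("apple");
int result = 0;
for (size_t i = 0; i < s.size(); ++i)
    result += (s[i] - 'a') * static_cast<int>(i + 1);
return result;
By the way, what you are trying to get is very similar to a hash function. Note that this weighted sum is neither unique nor order-preserving (for example, "ab" and "c" both give 2), so on its own it does not meet the requirements stated in the question.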

How to convert one string to another by successive substitutions of characters?

I'm trying to design an algorithm that does the following.
I have two strings A and B consisting of lowercase characters 'a'-'z',
and I can modify string A using the following operation:
1. Select two characters 'c1' and 'c2' from the character set ['a'-'z'].
2. Replace all characters 'c1' in string A with character 'c2'.
I need to find the minimum number of operations needed to convert string A to string B, when that is possible.
I have two ideas that didn't work:
1. A simple range-based for loop that changes string B and compares it with A.
2. The same idea using a map<char, int>.
Right now I'm stuck on unit tests like these: 'ab' can be transformed into 'ba' in 3 operations, and 'abc' into 'bca' in 4 operations.
My algorithm is wrong, and I need some fresh ideas or a working solution.
Can anyone help with this?
Here is a minimal reproducible example:
int Transform(string& A, string& B)
{
    int count = 0;
    if(A.size() != B.size()){
        return -1;
    }
    for(int i = A.size() - 1; i >= 0; i--){
        if(A[i]!=B[i]){
            char rep_elem = A[i];
            ++count;
            replace(A.begin(),A.end(),rep_elem,B[i]);
        }
    }
    if(A != B){
        return -1;
    }
    return count;
}
How can I improve this, or should I look for a different approach?

First of all, don't worry about string operations. Your problem is algorithmic, not textual. You should somehow analyze your data, and only afterwards print your solution.
Start with building a data structure which tells, for each letter, which letter it should be replaced with. Use an array (or std::map<char, char> — it should conceptually be similar, but have different syntax).
If you discover that you should convert a letter to two different letters — error, conversion impossible. Otherwise, count the number of non-trivial cycles in the conversion graph.
The length of your solution will be the number of letters which shouldn't be replaced by themselves plus the number of cycles.
I think the code to implement this would be too long to be helpful.
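For readers who want to see the counting step anyway, here is a compact sketch. It goes slightly beyond the description above in two places, and both are my assumptions rather than part of the answer: a cycle is counted only when no letter outside the cycle maps into it (otherwise two letters with the same target can be merged first, which breaks the cycle without an extra step), and the conversion is reported as impossible when B uses all 26 letters and differs from A, because no spare letter is left to break a cycle. The function name minReplacements is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Minimum number of "replace every c1 with c2" operations that turn a into b,
// or -1 if the conversion is impossible. Assumes lowercase 'a'..'z' input.
int minReplacements(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;

    // to[x] = letter that x must become, or -1 if x does not occur in a.
    std::array<int, 26> to;
    to.fill(-1);
    for (std::size_t i = 0; i < a.size(); ++i) {
        int x = a[i] - 'a', y = b[i] - 'a';
        if (to[x] != -1 && to[x] != y) return -1;  // one letter, two targets
        to[x] = y;
    }

    int changes = 0, distinctTargets = 0;
    std::array<int, 26> indeg{};      // incoming edges from other letters
    std::array<bool, 26> isTarget{};
    for (int x = 0; x < 26; ++x) {
        if (to[x] == -1) continue;
        if (!isTarget[to[x]]) { isTarget[to[x]] = true; ++distinctTargets; }
        if (to[x] != x) { ++changes; ++indeg[to[x]]; }
    }
    if (changes == 0) return 0;
    if (distinctTargets == 26) return -1;  // assumption: no spare letter left

    // Count cycles that no outside letter feeds into.
    int cycles = 0;
    std::array<int, 26> seen{};  // 0 = unvisited, otherwise id of the walk
    for (int s = 0; s < 26; ++s) {
        int cur = s;
        while (to[cur] != -1 && to[cur] != cur && seen[cur] == 0) {
            seen[cur] = s + 1;
            cur = to[cur];
        }
        if (seen[cur] != s + 1) continue;  // this walk closed no new cycle
        bool pure = true;
        int v = cur;
        do {
            if (indeg[v] != 1) pure = false;
            v = to[v];
        } while (v != cur);
        if (pure) ++cycles;
    }
    return changes + cycles;
}
```

With this sketch, 'ab' to 'ba' gives 3 and 'abc' to 'bca' gives 4, matching the unit tests mentioned in the question.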

Implementing Longest Common Substring using Suffix Array

I am using this program for computing the suffix array and the Longest Common Prefix.
I am required to calculate the longest common substring between two strings.
For that, I concatenate strings, A#B and then use this algorithm.
I have Suffix Array sa[] and the LCP[] array.
The length of the longest common substring is the max value of the LCP[] array.
To pick the substring, the only condition is that among common substrings of the same length, the one occurring first in string B should be the answer.
For that, I maintain the max of LCP[]. If LCP[curr_index] == max, I make sure that the left_index of the substring in B is smaller than the previous value of left_index.
However, this approach does not give the right answer. Where is the fault?
max=-1;
for(int i=1;i<strlen(S)-1;++i)
{
    //checking that sa[i+1] occurs after sa[i] or not
    if(lcp[i] >= max && sa[i] < l1 && sa[i+1] >= l1+1 )
    {
        if( max == lcp[i] && sa[i+1] < left_index ) left_index=sa[i+1];
        else if (lcp[i] > max )
        {
            left_index=sa[i+1];
            max=lcp[i];
        }
    }
    //checking that sa[i+1] occurs after sa[i] or not
    else if (lcp[i] >= max && sa[i] >= l1+1 && sa[i+1] < l1 )
    {
        if( max == lcp[i] && sa[i] < left_index) left_index=sa[i];
        else if (lcp[i] > max)
        {
            left_index=sa[i];
            max=lcp[i];
        }
    }
}

AFAIK, this problem is from a programming contest, and discussing problems from an ongoing contest before the editorials have been released shouldn't be .... Still, here are some insights, since I got Wrong Answer with a suffix array. I then used a suffix automaton, which got Accepted.
A suffix array works in O(n log^2 n), whereas a suffix automaton works in O(n). So my advice is to go with a suffix automaton, and you will surely get Accepted.
And if you can code a solution for that problem, you can surely code this one.
I also found this on the CodeChef forum:
Try this case
babaazzzzyy
badyybac
The suffix array will contain baa... (from the first string), baba... (from the first string), bac (from the second string), bad (from the second string).
So if you examine only consecutive entries of the SA, you will find a match at "baba" and "bac" and report the index of "ba" as 7 in the second string, even though it also occurs at index 1.
It is likely that you will output "yy" instead of "ba".
Also, handling the constraint "...the first longest common substring to be found on the second string, should be written to output..." is very easy with a suffix automaton. Best of luck!
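When debugging cases like the one above, a slow but simple reference implementation can help to confirm the expected output. The sketch below is neither a suffix array nor a suffix automaton; it is the classic O(|A|*|B|) dynamic-programming table, swapped in purely as a checker for small inputs. The function name firstLongestCommonSubstring is hypothetical, and "first" is taken to mean the smallest starting index in B.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns the longest common substring of a and b; among candidates of equal
// length, the one that starts earliest in b. Returns "" if nothing is shared.
std::string firstLongestCommonSubstring(const std::string& a,
                                        const std::string& b) {
    // prev[i] / cur[i]: length of the common suffix of a[0..i) and b[0..j).
    std::vector<std::size_t> prev(a.size() + 1, 0), cur(a.size() + 1, 0);
    std::size_t best = 0, bestStart = 0;
    for (std::size_t j = 1; j <= b.size(); ++j) {
        for (std::size_t i = 1; i <= a.size(); ++i) {
            cur[i] = (a[i - 1] == b[j - 1]) ? prev[i - 1] + 1 : 0;
            // Strict '>' keeps the earliest end position in b, and for a
            // fixed length that is also the earliest start position.
            if (cur[i] > best) {
                best = cur[i];
                bestStart = j - best;
            }
        }
        std::swap(prev, cur);
    }
    return b.substr(bestStart, best);
}
```

For the test case quoted above (babaazzzzyy and badyybac) this returns "ba", which can be compared against the output of the suffix-array code.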

Wildcard String Search Algorithm

In my program I need to search a fairly big string (~1 MB) for a relatively small substring (< 1 KB).
The problem is that the substring contains simple wildcards in the sense of "a?c", which means I want to find strings like "abc" or "apc", ... (I am only interested in the first occurrence).
Until now I have used the trivial approach (here in pseudocode):
algorithm "search", input: haystack(string), needle(string)
for(i = 0, i < length(haystack), ++i)
    if(!CompareMemory(haystack+i,needle,length(needle)))
        return i;
return -1; (Not found)
Here "CompareMemory" returns 0 iff the first and second arguments are identical (taking wildcards into account), looking only at the number of bytes given by the third argument.
My question is whether there is a fast algorithm for this (you don't have to give it, but if you do I would prefer C++, C or pseudocode). I started here
but I think most of the fast algorithms don't allow wildcards (by the way, they exploit the nature of strings).
I hope the format of the question is OK because I am new here. Thank you in advance!

A fast way, which is much the same thing as using a regexp (which I would recommend anyway), is to find a position in needle that is fixed, such as "a" but not "?", search for that character, and then check whether you have a complete match.
j = firstNonWildcardPos(needle)
for(i = j, i <= length(haystack)-length(needle)+j, ++i)
    if(haystack[i] == needle[j])
        if(!CompareMemory(haystack+i-j,needle,length(needle)))
            return i-j;
return -1; (Not found)
A regexp would generate code similar to this (I believe).
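The pseudocode above might look as follows in C++. This is a minimal sketch under stated assumptions: the name findWithWildcards is hypothetical, '?' matches exactly one arbitrary character, and the wildcard-aware comparison is written inline instead of calling the question's CompareMemory.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first match of needle in hay, where `wild` in
// needle matches any single character, or std::string::npos if none exists.
std::size_t findWithWildcards(const std::string& hay,
                              const std::string& needle, char wild = '?') {
    if (needle.size() > hay.size()) return std::string::npos;
    // Anchor: position of the first non-wildcard character in needle.
    const std::size_t j = needle.find_first_not_of(wild);
    if (j == std::string::npos) return 0;  // empty or all-wildcard needle
    for (std::size_t start = 0; start + needle.size() <= hay.size(); ++start) {
        if (hay[start + j] != needle[j]) continue;  // cheap one-char filter
        std::size_t k = 0;
        while (k < needle.size() &&
               (needle[k] == wild || needle[k] == hay[start + k]))
            ++k;
        if (k == needle.size()) return start;  // every position matched
    }
    return std::string::npos;
}
```

The one-character filter rejects most positions before the full comparison runs. Choosing a rare character of the needle as the anchor, rather than the first non-wildcard one, would reject even more of them.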

Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.
The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.
Step 2D repeats steps 2E to 2I until var c points past the end of S, at which point no further matches are possible.
Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
# 17, de?ga and de3ga match
# 24, de?ga and defg4 differ
# 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
/* jiw
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
Re: Concept-code for searching a long string for short targets,
where targets may contain wildcard characters.
The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].
Eg, for tests with *hay = jay use command like
./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
or with *hay = kay,
./stringsearch bc?e? jih? pa?j ?av??j
to exercise program.
Copyright 2011 James Waldby. Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
  char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
  char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
    "etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
    "etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
    "ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
    "ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
  char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
  enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
  int L[L2], H[chars], T[chars], g, k, par;
  // Step 1. Make arrays L, H, T.
  for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
  for (g=0; hay[g]; ++g) { // Make linked character lists for hay.
    k = hay[g]; // In same loop, could count char freqs.
    if (T[k]==L1) H[k] = T[k] = g;
    T[k] = L[T[k]] = g;
  }
  // Step 2. Read and process target strings.
  for (par=1; par<argc; ++par) {
    int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
    char * targ = argv[par];
    enum { wild = '?' };
    int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
    printf ("Target %d=<%s>\n", par, targ);
    // Step 2A. Choose 3 non-wild characters to follow.
    // As is, chooses first 3 non-wilds for a,b,c.
    // Could instead choose 3 rarest characters.
    for (j=0; j<3; ++j) alpha[j] = -j;
    for (i=j=0; targ[i] && j<3; ++i)
      if (targ[i] != wild) {
        r = alpha[j] = targ[i];
        if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
            || alpha[0]==alpha[2]) continue;
        at[j] = i;
        L[T[r]] = L[g+i] = g+i;
        ++j;
      }
    if (j != 3) {
      printf (" Too few target chars\n");
      continue;
    }
    // Step 2B. Set a,b,c to head-of-list locations, set deltas.
    da = at[0];
    a = H[alpha[0]]; dab = at[1]-at[0];
    b = H[alpha[1]]; dbc = at[2]-at[1];
    c = H[alpha[2]]; dac = at[2]-at[0];
    // Step 2C. See if key characters appear in haystack
    if (a >= g || b >= g || c >= g) {
      printf (" No match on some character\n");
      continue;
    }
    for (g=0; hay[g]; ++g) printf ("%d", g%10);
    printf ("\n%s\n", hay); // Show haystack, for user aid
    // Step 2D. Search for match
    while (c < g) {
      // Step 2E. Enforce delta distances
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
      while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
      while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
      // Step 2F. See if delta distances were met
      if (a+dab==b && b+dbc==c && c<g) {
        // Step 2G. Yes, so we have 3-letter-match and need to test whole match.
        r = a-da;
        for (k=0; targ[k]; ++k)
          if ((hay[r+k] != targ[k]) && (targ[k] != wild))
            break;
        printf ("# %3d, %s and ", r, targ);
        for (i=0; targ[i]; ++i) putchar(hay[r++]);
        // Step 2H. Report match, if found
        puts (targ[k]? " differ" : " match");
        // Step 2I. Advance all of a,b,c, to go on looking
        a = L[a]; ++ta;
        b = L[b]; ++tb;
        c = L[c]; ++tc;
      }
    }
    printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
            alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
            sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
  }
  return 0;
}
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)

Regular expressions usually use a search based on a finite-state automaton, I think. Try implementing that.

comparing bits (one position at a time)

Initially I have user input decimal numbers (0 - 15), and I will turn that into binary numbers.
Say these numbers are written into a text file, as shown in the picture. These numbers are arranged by the numbers of 1's. The dash - is used to separate different groups of 1.
I have to read this file, and compare strings of one group with the all the strings in the group below, i.e., Group 1 with all the strings in group 2, and group 2 - group 3.
The deal is that, only one column of 0 / 1 difference is allowed, and that column is replaced by letter t. If more than one column of difference is encountered, write none.
So say group 2, 0001 with group 3, 0011, only the second column is different. however, 0010 and 0101 are two columns of difference.
The result will be written into another file.....
At the moment, I read these strings into a vector of strings. I also came across bitset. What matters is that I have to access the characters one at a time, meaning I have to break the vector of strings into a vector of chars. But it seems there could be an easier way to do it.
I even thought about a hash table - linked-list. Having group 1 assigned to H[0]. Each comparison is done as H[current-group] with H[current_group+1]. But beyond the first comparison (comparing 1's and 0's), the comparison beyond that will not work under this hash-linked way. So I gave up on that.
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
using namespace std;
int main() {
    ifstream inFile("a.txt");
    vector<string> svec;
    copy(istream_iterator<string>(inFile), istream_iterator<string>(), back_inserter(svec));
    copy(svec.begin(), svec.end(), ostream_iterator<string>(cout,"\n"));
    for(int i = 0; i < svec.size(); i++)
    {
        cout << svec[i] << " ";
    }
    inFile.close();
    return 0;
}
This is the sample code of writing it into a file....but like I said, the whole deal of vector seems impractical in my case....
Any help is appreciated. thanks

I don't understand your code snippet -- it looks like all it does is read in the input file into a vector of strings, which will then contain each whitespace-delimited word in a separate string, then write it back out in 2 different ways (once with words separated by \n, once with them separated by spaces).
It seems the main problem you're having is with reading and interpreting the file itself, as opposed to doing the necessary calculations -- right? That's what I hope this answer will help you with.
I think the line structure of the file is important -- right? In that case you would be better off using the global getline() function in the <string> header, which reads an entire line (rather than a whitespace-delimited word) into a string. (Admittedly that function is pretty well-hidden!) Also you don't actually need to read all the lines into a vector, and then process them -- it's more efficient and actually easier to distill them down to numbers or bitsets as you go:
vector<unsigned> last, curr; // An unsigned can comfortably hold 0-15
ifstream inf("a.txt");
while (true) {
    string line;
    getline(inf, line); // This is the group header: ignore it
    while (getline(inf, line)) {
        if (line == "-") {
            break;
        }
        // This line contains a binary string: turn it into a number
        // We ignore all characters that are not binary digits
        unsigned val = 0;
        for (int i = 0; i < line.size(); ++i) {
            if (line[i] == '0' || line[i] == '1') {
                val = (val << 1) + line[i] - '0';
            }
        }
        curr.push_back(val);
    }
    // Either we reached EOF, or we saw a "-". Either way, compare
    // the last 2 groups.
    compare_them_somehow(curr, last); // Not doing everything for you ;)
    last = curr; // Using swap() would be more efficient, but who cares
    curr.clear();
    if (!inf) {
        break; // Either the disk exploded, or we reached EOF, so we're done.
    }
}

Perhaps I've misunderstood your goal, but strings are amenable to array member comparison:
string first = "001111";
string next = "110111";
int sizeFromTesting = 5;
int columnsOfDifference = 0;
for ( int UU = sizeFromTesting; UU >=0; UU-- )
{
    if ( first[ UU ] != next[ UU ] )
        columnsOfDifference++;
}
cout << columnsOfDifference;
cin.ignore( 99, '\n' );
return 0;
Substitute file streams and bounds checking where appropriate.
Not applicable here, but to compare variables literally bit by bit, AND (&) both values with a mask for each digit (000010 for the second digit).
If the OR of the two masked values is 0, they match: both digits are 0. If both the OR and the AND are non-zero, that digit is 1 in both. Otherwise they differ. Repeat for all the bits and all the numbers in the group.

In VB.NET:
'group_0 with group_1
If (group_0_count > 0 AndAlso group_1_count > 0) Then
    Dim result = ""
    Dim index As Integer = 0
    Dim g As Integer = 0
    Dim h As Integer = 0
    Dim i As Integer = 0
    For g = 0 To group_0_count - 1
        For h = 0 To group_1_count - 1
            result = ""
            index = 0
            For i = 0 To 3
                If group_1_0.Items(g).ToString.Chars(i) <> group_1_1.Items(h).ToString.Chars(i) Then
                    result &= "-"
                    index = index + 1
                Else
                    result &= group_1_0.Items(g).ToString.Chars(i)
                End If
            Next
        Next
    Next
End If

Read it in as an integer, then all you should need is comparisons with bitshifts and bit masks.
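A minimal sketch of that idea, assuming the 4-bit values from the question have already been parsed into unsigned integers. The function name mergeIfOneBitDiffers is hypothetical, and treating two identical values as "none" is my assumption, since the question does not cover that case. XOR marks the differing columns, and the test (diff & (diff - 1)) == 0 checks that exactly one bit is set.

```cpp
#include <string>

// Compares two values column by column. If exactly one column differs,
// returns the bit pattern with 't' in that column; otherwise returns "none".
std::string mergeIfOneBitDiffers(unsigned a, unsigned b, int width = 4) {
    unsigned diff = a ^ b;  // bits set where the two values differ
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return "none";      // no difference, or more than one column
    std::string out;
    for (int bit = width - 1; bit >= 0; --bit) {
        unsigned mask = 1u << bit;
        if (diff & mask)
            out += 't';
        else
            out += (a & mask) ? '1' : '0';
    }
    return out;
}
```

For example, 0001 against 0011 gives 00t1, while 0010 against 0101 gives none because three columns differ.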
