Finding if two strings are anagrams in O(n) - solution using XOR - c++

I'm working on a problem from HackerEarth.
The goal is to find if the input strings are anagrams in O(n) time.
Input format:
The first line contains an integer 'T' denoting the number of test cases.
Each test consists of a single line, containing two space separated
strings S1 and S2 of equal length.
My code:
#include <iostream>
#include <string>
int main()
{
int T;
std::cin >> T;
std::cin.ignore();
for(int i = 0; i < T; ++i)
{
std::string testString;
std::getline(std::cin, testString);
char test = ' ';
for (auto& token : testString)
{
if(token != ' ')
test ^= token;
}
if (test == ' ')
std::cout << "YES\n";
else
std::cout << "NO\n";
}
}
The code above fails 5/6 hackerearth tests.
Where is my mistake? Is this a good approach to the problem?

Note: Your question title says that the second word must be an anagram of the first. But the linked problem on HackerEarth uses the term rearranged, which is more restrictive than an anagram, and also says:
Two strings S1 and S2 are said to be identical, if any of the permutation of string S1 is equal to the string S2
One algorithm is to maintain a histogram of the incoming chars.
This is done with two loops, one for the first word and another for the second word.
For the first word, proceed char-by-char and increment the histogram value. Calculate the length of the first word by maintaining a running count.
When the space is reached, do the other loop which decrements the histogram. Maintain a count of the number of histogram cells that reach zero. In the end, this must match the length of the first word (i.e. success).
In the second loop, if a histogram cell goes negative, this is a mismatch because either the second word has a char not in the first word or has too many of a char in the first word.
Caveat: I apologize for this being a C-like solution, but it can easily be adapted to use more STL components
Also, char-at-a-time input may be faster than reading in the entire line into a buffer string
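As for why the XOR approach in the question fails: folding all chars together with XOR throws away the counts, so different sets of letters can fold to the same value (e.g. "aa" and "bb" both fold to 0). Below is a minimal C++ sketch of that collision next to an STL-flavored histogram check. The names xorFold and sameLetters are made up for illustration, and the sketch assumes the input contains only lowercase 'a'..'z', as the C code does.

```cpp
#include <array>
#include <string>

// XOR of all characters. Anagrams always give equal values, but so do
// many non-anagrams, which is why the XOR test reports false positives.
unsigned char xorFold(const std::string& s) {
    unsigned char acc = 0;
    for (unsigned char c : s)
        acc ^= c;
    return acc;
}

// Histogram comparison. Assumption: both strings contain only 'a'..'z'.
bool sameLetters(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        return false;
    std::array<int, 26> histo{};  // zero-initialized counts
    for (unsigned char c : a)
        ++histo[c - 'a'];
    for (unsigned char c : b) {
        // going negative means b has more of this letter than a
        if (--histo[c - 'a'] < 0)
            return false;
    }
    // equal lengths and no negative cell imply every cell is zero
    return true;
}
```

With this, xorFold("aa") == xorFold("bb") holds even though sameLetters("aa", "bb") is false.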
Edit: I've added annotation/comments to the code example to try to make things more clear
#include <stdio.h>
#include <stdlib.h>
char buf[(200 * 1024) + 100];
void
dotest(FILE *xf)
{
int histo[26] = { 0 };
int len = 0;
int chr;
int match = 0;
int fail = 0;
int cnt;
// scan first word
while (1) {
chr = fgetc(xf);
// stop on delimiter between first and second words (or on a premature EOF)
if (chr == ' ' || chr == EOF)
break;
// convert char to histogram index
chr -= 'a';
// increment the histogram cell
cnt = ++histo[chr];
// calculate number of non-zero histogram cells
if (cnt == 1)
++len;
}
// scan second word
while (1) {
chr = fgetc(xf);
// stop on end-of-line or EOF
if (chr == '\n')
break;
if (chr == EOF)
break;
// convert char to histogram index
chr -= 'a';
// decrement the histogram cell
cnt = --histo[chr];
// if the cell reaches zero, we [seemingly] have a match (i.e. the
// number of instances of this char in the second word match the
// number of instances in the first word)
if (cnt == 0)
match += 1;
// however, if we go negative, the second word has too many instances
// of this char to match the first word
if (cnt < 0)
fail = 1;
}
do {
// too many letters in second word that are _not_ in the first word
if (fail)
break;
// the number of histogram cells that returned to exactly zero in the
// second loop must match the number of distinct chars in the first
// word (len), i.e. every char of the first word was used up by the
// second word
fail = (match != len);
} while (0);
if (fail)
printf("NO\n");
else
printf("YES\n");
}
// main -- main program
int
main(int argc,char **argv)
{
char *file;
FILE *xf;
--argc;
++argv;
file = *argv;
if (file != NULL)
xf = fopen(file,"r");
else
xf = stdin;
fgets(buf,sizeof(buf),xf);
int tstcnt = atoi(buf);
for (int tstno = 1; tstno <= tstcnt; ++tstno)
dotest(xf);
if (file != NULL)
fclose(xf);
return 0;
}
UPDATE:
I've only had a glance at the code, but it seems that len goes up for every char found (string length), and match goes up only when a unique char (histogram element) is exhausted, so the check match == len will not be good?
len is only incremented in the first loop. (i.e.) It is the length of the first word only (as mentioned in the algorithm description above).
In the first loop, there is a check for the char being a space [which is guaranteed by the problem definition of the input to delimit the end of the first word] and the loop is broken out of at that point [before len is incremented], so len is correct.
The use of len, match, and fail speed things up. Otherwise, at the end, we'd have to scan the entire histogram and ensure all elements are zero to determine success/failure (i.e. any non-zero element means mismatch/failure).
Note: When doing such timed coding challenges before, I've noted that they can be pretty strict on elapsed time/speed and space. It's best to try to optimize as much as possible because, even if the algorithm is technically correct, it can fail the test for using too much memory or taking too much time.
That's why I suggested not using a string buffer because the maximum size as defined by the problem can be 100,000 bytes. Also, doing the [unnecessary] scan of the histogram at the end would also add time.
UPDATE #2:
It may be faster to read a full line at a time and then use a char pointer to traverse the buffer. Here's a version that does that. Which method is faster would need to be tried/benchmarked to see.
#include <stdio.h>
#include <stdlib.h>
char buf[(200 * 1024) + 100];
void
dotest(FILE *xf)
{
char *cp;
int histo[26] = { 0 };
int len = 0;
int chr;
int match = 0;
int fail = 0;
int cnt;
cp = buf;
fgets(cp,sizeof(buf),xf);
// scan first word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on delimiter between first and second words
if (chr == ' ')
break;
// convert char to histogram index
chr -= 'a';
// increment the histogram cell
cnt = ++histo[chr];
// calculate number of non-zero histogram cells
if (cnt == 1)
++len;
}
// scan second word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on end-of-line
if (chr == '\n')
break;
// convert char to histogram index
chr -= 'a';
// decrement the histogram cell
cnt = --histo[chr];
// if the cell reaches zero, we [seemingly] have a match (i.e. the
// number of instances of this char in the second word match the
// number of instances in the first word)
if (cnt == 0)
match += 1;
// however, if we go negative, the second word has too many instances
// of this char to match the first word
if (cnt < 0) {
fail = 1;
break;
}
}
do {
// too many letters in second word that are _not_ in the first word
if (fail)
break;
// the number of histogram cells that returned to exactly zero in the
// second loop must match the number of distinct chars in the first
// word (len), i.e. every char of the first word was used up by the
// second word
fail = (match != len);
} while (0);
if (fail)
printf("NO\n");
else
printf("YES\n");
}
// main -- main program
int
main(int argc,char **argv)
{
char *file;
FILE *xf;
--argc;
++argv;
file = *argv;
if (file != NULL)
xf = fopen(file,"r");
else
xf = stdin;
fgets(buf,sizeof(buf),xf);
int tstcnt = atoi(buf);
for (int tstno = 1; tstno <= tstcnt; ++tstno)
dotest(xf);
if (file != NULL)
fclose(xf);
return 0;
}
UPDATE #3:
The above two examples had a slight bug. It would report a false negative on an input line of (e.g.) aaa aaa.
The increment of len was always done in the first loop. This was incorrect. I've edited the above two examples to do the increment of len conditionally (i.e. only if the histogram cell was zero before the increment). Now, len is "the number of non-zero histogram cells in the first string". This takes into account duplicates in the string (e.g. aa).
As I mentioned, the use of len, match, and fail was to obviate the need to scan all histogram cells at the end, looking for a non-zero cell which means mismatch/failure.
This would [possibly] run faster for short input lines, where the post scan of the histogram took longer than the input line loops.
However, given that input lines can be 200k in length, the probability is that [almost] all of the histogram cells will be incremented/decremented. Also, the post scan of the histogram (e.g. check 26 integer array values for non-zero) is now a negligible part of the overall time.
Thus, the simple implementation [below] that eliminates len/match calculations inside the first two loops may be the fastest/best choice. This is because the two loops are slightly faster.
#include <stdio.h>
#include <stdlib.h>
char buf[(200 * 1024) + 100];
void
dotest(FILE *xf)
{
char *cp; // points into the global line buffer buf declared above
int histo[26] = { 0 };
int chr;
int fail = 0;
cp = buf;
fgets(cp,sizeof(buf),xf);
// scan first word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on delimiter between first and second words
if (chr == ' ')
break;
// convert char to histogram index
chr -= 'a';
// increment the histogram cell
++histo[chr];
}
// scan second word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on end-of-line
if (chr == '\n')
break;
// convert char to histogram index
chr -= 'a';
// decrement the histogram cell
--histo[chr];
}
// scan histogram
for (int idx = 0; idx < 26; ++idx) {
if (histo[idx]) {
fail = 1;
break;
}
}
if (fail)
printf("NO\n");
else
printf("YES\n");
}
// main -- main program
int
main(int argc,char **argv)
{
char *file;
FILE *xf;
--argc;
++argv;
file = *argv;
if (file != NULL)
xf = fopen(file,"r");
else
xf = stdin;
fgets(buf,sizeof(buf),xf);
int tstcnt = atoi(buf);
for (int tstno = 1; tstno <= tstcnt; ++tstno)
dotest(xf);
if (file != NULL)
fclose(xf);
return 0;
}
The downside is that there is no "early escape" from the second loop. We would have to finish the scan of the second string even though we might be able to tell early that the second string can't match (e.g.):
aaaaaaaaaa baaaaaaaaa
baaaaaaaaa bbaaaaaaaa
With the simple version we couldn't terminate the second loop early even though we know the second string can never match when we see the b (i.e. the histogram cell goes negative) and skip over the scan of the multiple a in the second word.
So, here's a version that has a simple first loop as above. It adds back the on-the-fly check for a cell going negative in the second loop.
Once again, which version [of the four I've presented] is the best needs some experimentation/benchmarking.
#include <stdio.h>
#include <stdlib.h>
char buf[(200 * 1024) + 100];
void
dotest(FILE *xf)
{
char *cp;
int histo[26] = { 0 };
int chr;
int fail = 0;
int cnt;
cp = buf;
fgets(cp,sizeof(buf),xf);
// scan first word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on delimiter between first and second words
if (chr == ' ')
break;
// convert char to histogram index
chr -= 'a';
// increment the histogram cell
++histo[chr];
}
// scan second word
for (chr = *cp++; chr != 0; chr = *cp++) {
// stop on end-of-line
if (chr == '\n')
break;
// convert char to histogram index
chr -= 'a';
// decrement the histogram cell
cnt = --histo[chr];
// however, if we go negative, the second word has too many instances
// of this char to match the first word
if (cnt < 0) {
fail = 1;
break;
}
}
do {
// too many letters in second word that are _not_ in the first word
if (fail)
break;
// scan histogram
for (int idx = 0; idx < 26; ++idx) {
if (histo[idx]) {
fail = 1;
break;
}
}
} while (0);
if (fail)
printf("NO\n");
else
printf("YES\n");
}
// main -- main program
int
main(int argc,char **argv)
{
char *file;
FILE *xf;
char buf[100];
--argc;
++argv;
file = *argv;
if (file != NULL)
xf = fopen(file,"r");
else
xf = stdin;
fgets(buf,sizeof(buf),xf);
int tstcnt = atoi(buf);
for (int tstno = 1; tstno <= tstcnt; ++tstno)
dotest(xf);
if (file != NULL)
fclose(xf);
return 0;
}

public static final int ASC = 97; // ASCII code of 'a'
// Assumes both strings contain only the letters a-z (any case).
static boolean isAnagram(String a, String b) {
int len = a.length();
if (len != b.length()) {
return false;
}
a = a.toLowerCase();
b = b.toLowerCase();
// counts[k] = occurrences of letter k in a, minus those already seen in b
int[] counts = new int[26];
for (int i = 0; i < len; i++) {
counts[a.charAt(i) - ASC]++;
}
for (int i = 0; i < len; i++) {
int idx = b.charAt(i) - ASC;
// b has more of this letter than a, so it cannot be an anagram
if (--counts[idx] < 0) {
return false;
}
}
// equal lengths and no cell went negative, so every cell is zero
return true;
}

Related

Is there a way that I can pass a file (FASTA) to an array in C++

I'm using the KMP algorithm using C++ for pattern searching in a FASTA file.
KMP algorithm:
#include <bits/stdc++.h>
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int lps[M];
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
int j = 0; // index for pat[]
while (i < N) {
if (pat[j] == txt[i]) {
j++;
i++;
}
if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
}
// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
// length of the previous longest prefix suffix
int len = 0;
lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i < M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = lps[len - 1];
// Also, note that we do not increment
// i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char txt[] = "ABABDABACDABABCABAB";
char pat[] = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
So I want to modify the above code so that I can pass an input file in place of
char txt[] = "ABABDABACDABABCABAB";
Similarly, for
char pat[] = "ABABCABAB";
Input file - 1. some_data_file.fasta
>3a073269-a0b6-436a-a219-4130fcd3b9dc
TGTTGTACTTCGTTCAGTTACGTATTGCTGTTTTCCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCCGCTTCTGGGACTATCGCTGTTCTCCATACTATTACCCTCCATCTTTAATATTCATTCCTCTAGAACCTCCTGACCAAAATCTGTATTCGTCAGGGTTCTCTAGAGGATAGAACTAATAGGATAGATGTAGATAGAAAGGGAAGTTTATCAAGGGAGTACTGACTCACACGATCATAAGGTGAGGTCCCACTTTTGAGTAGGCCATCTGCAAGCACTGAGGAGCAAGGGTCCAGTAGGAGTCTCACAGCTGAAGAAGTTGGGTTTGATGTTCGAGGGCAGGAAGCATCCAGCATGGGAGAAATATGTAGGCCACAAAGATTAAACCAGTCTAGTCTTTCCATGTGTTTCTTCTCCTGCTTTTGTGGAGTCCTTGCTGGCAGCTGATTGAGGGTAGGTCTCGTTTCCAGTCCCACTGACTCAAATGTTAATCTCCCGGCAACACCCTCGCAGACACACTAAAGAACAATACCTTGCATCCTTCAATCCAATGAAGGTGACACTCAATATTAACCATCACAATAACTAATACGTTTTTATAGGGAATAAAGCACATATTTCCCATGATACCTGTAGAATTGTGTTTCTCTGGCCTGAATATAGGTTGGATTGGTTTAAATGTGAATTTTGTTTTACAATATTTATATGTCAATTGTAAATTCTGAGCACTTTCGAGTCAGAGCATACCTTTTTTTGAGATGGAGTCTCGCTCTGTCACCTAGGCTGGAGTGCAGTGGTGTGATCTCAGCTAGCTGCAACCTCCACCTCCTCAGGTTCAAGCGATTCTCCTGCCTGAGCCCCTGCCGAGCAGTTGGGATTACAGGTGCCCACCACCACATCTGGCTAATTTTGAAATTTTGAGACGGGGTTTCACCACGTGGGCCAGGCTGGTTTGAACTCCTGACCCCAGGTGATCCACCCGCCTCAGCTTCCCAAAGTGCTGGGATTACAGTGTGAGCCGCCGCCTGGCCAAATCATTACTTTTGAAGAAATAGTTAACAATGATTATTTCTTTTTGAATGACAATAAATTTTATTAATAAGTTAAACATATTTATATGTAATGTAAATTTTTGTATCCGGGTGCAGTAGTTCTTGCCCGTTATCCTAGCACTTTGGGAGGCCAAGGTGTTAATATTGCTTGAGCAGGAGTTTGAGACCAGCCTGGGAAACATGGTGAAACCTCATATCTACAAAAAATACAGCCTGGTGTT
>318cae4a-c764-4fe0-97b3-720b49f2bd80
TCGTGCGCTGCGTTCGTTCGTATTGCTGTTTTCTGACTTTACATCTTCGTAACGCTCGCGTTCGTGCGCCGCTAAGGCCAATAACAGGCTGAAATTGAGCCAATAATTAATAGCTTGCAACCAAAAAAGTCTGGGATTAAGCACATTACAATAGCCAACTACACAGGCTGAGAGGGAGAAGCTGGTACCACTCGCTAAAACTATTCTAATCAGTAGAAAAAGGGGAATACTCCTAGCTCATTTATGGGGCGGCATCATCTGATACCAACGCCTGGCAGAGACACAACAAAAGAGAAATTTTAGACCAATATCTGATGAAGAGACATGGTGCAAAAATCCTCAATAAATGCTGTGACCAAGATCAGCATCAAGCATCCACCATGAGTCAAGTGGGCTTCATCCTGGGATGCAAGGCTATTTCAACTATGCAAATCAATAAACAGTAATCAACATAAACAGGGCAAAGACAAAGAAACACACATGATTATCTCAATAGATGCAGAAAGGCCTTAAGACAAAATTCAAAGCAACCGCTTCATACTAAAAACTCTCAATAAATTAGGTATGATGATGGGACCTTATCTCAAAGTAATGGAGCTATTTATGACAAACCCACGTATCATACTGAATGAGCAAAAAACTGGAGAGAGTGTTCCCTTTAGCTTGGCACAAAGAACAAGGATGCCCTCTCTCACACACTCCCTATTCAGCCTTAGTGTTGGAAGTTCTGGCCAGGGCCATCGAGCCGAAGAAGAAATAAAGATTATTCAATTAGGAAAAAGAAAAGTCAAATTCTGTTTGAATGAGCAGCAGTCATATCTGAAAACCCCATGTGATCTCATTCCCCAAAATATCCTTAGCTGATAGCTTAACTAAAGTCTCAGGATACAAAATCAATGTTTGCAAAAATCTAGGGCAGATAACATACACTAATAGCAGAAGCAGGAGGCAAATCATGAGTGAACTCCATTCTGAGAATTGCTTTCAAAGAGAATCAAATACGGGAATCCAACTTACAGGGATGTGGGACCTCTTCAAGGAGAACAGAAACCACTAACTCATATAATAAAGAGGATGCAAACAAATGGAAGAGCATTCCCATGCTCATGGATCAGGAAGAATCAATATCAGTGATAATCCGCCCTACTGCCCATATTAATTACCAGATTCAATACATCCCGTCAAGCTACCAGTGACTTTCTTCTGGAATTGGAAAAACACTTTAAAGGAGGAAGTTCTGTGGAATAAAAGAGCTACCATGCGCCAGAGTCAATCTAAGCCAAAGAACAAAGCCAGAGGCATCATTTACTGACTTTGAAACATACACAGTCTGATACTGGTGCACCAGAACAGGGTTATAGTAATGGAACAGGAGAACTAGGCCCTCAGAAACCACCACACATCTACAACCATCTGATCTTTGACAAACACAAACAAAAACAAGAAATGGGGAAAGGATTCCCTAATATTTA
Input file - 2. some_pattern_file.fasta
>telomere_tract
TTAGGGTTAGGGTTAGGGTTAGGG
Conditions:
The read must contain the "TTAGGGTTAGGGTTAGGGTTAGGG" pattern at least twice.
The last occurrence of the pattern must be located near the read's end (less than 20,000 bases from the end of the read).
************** To summarize: **************
Input file to the code: subset_na12878dataset.fasta
Pattern to search for: telomere_tract_pattern.fasta
Conditions for the location of the pattern in each read: (i) A read must contain the pattern at least twice, (ii) The last occurrence of the pattern must be located near the read's end (less than 20,000 bases from the end of the read).
Code to use for pattern searching: KMP ( https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/ )
Output file of the code: subset_na12878dataset_telom.fasta (The order of the reads in the output file does not matter, as long as they are all located in the file.)
Thanks in advance!
It is possible with std::ifstream::rdbuf.
char *const buffer = new char[BUFFER_SIZE];
std::filebuf* pFileBuffer = inputFile.rdbuf();
pFileBuffer->sgetn(buffer,size);
Don't forget that this operation copies the content to the array you've provided. The char *const doesn't directly point to the content of the file. Actually, I don't know if there is a sane way to access the file's address directly.
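Another route, sketched here under some assumptions, is to read the FASTA file line by line into std::string objects and search those. The names FastaRecord, readFasta, findAll and keepRead are hypothetical, not part of any library. std::string::find stands in for KMP only to keep the sketch short. "Less than 20,000 bases from the end" is interpreted as the distance between the end of the last match and the end of the read, which is an assumption about the intended condition.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FastaRecord {
    std::string header;    // text after '>'
    std::string sequence;  // all sequence lines of the record, concatenated
};

// Parses FASTA text from any input stream (e.g. a std::ifstream).
std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();  // tolerate Windows line endings
        if (line.empty())
            continue;
        if (line[0] == '>')
            records.push_back({line.substr(1), ""});
        else if (!records.empty())
            records.back().sequence += line;
    }
    return records;
}

// Start positions of every (possibly overlapping) occurrence of pat in txt.
std::vector<std::size_t> findAll(const std::string& txt, const std::string& pat) {
    std::vector<std::size_t> hits;
    if (pat.empty())
        return hits;
    for (std::size_t pos = txt.find(pat); pos != std::string::npos;
         pos = txt.find(pat, pos + 1))
        hits.push_back(pos);
    return hits;
}

// True if the read has at least two matches and the last one ends
// fewer than maxTail bases before the end of the read.
bool keepRead(const std::string& read, const std::string& pat,
              std::size_t maxTail = 20000) {
    const std::vector<std::size_t> hits = findAll(read, pat);
    if (hits.size() < 2)
        return false;
    const std::size_t tail = read.size() - (hits.back() + pat.size());
    return tail < maxTail;
}
```

To use it on a file, open a std::ifstream on the data file and pass it to readFasta. To stay with the KMP code above, record.sequence.c_str() could be handed to KMPSearch once its parameters are changed to const char*.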

Smallest Binary String not Contained in Another String

The question description is relatively simple, an example is given
input: 10100011
output: 110
I have tried using BFS but I don't think this is an efficient enough solution (maybe some sort of bitmap + sliding window solution?)
string IntToString(int a)
{
ostringstream temp;
temp << a;
return temp.str();
}
bool is_subsequence(string& s, string& sub) {
if(sub.length() > s.length()) return false;
int pos = 0;
for(char c : sub)
{
pos = s.find(c, pos);
if(pos == string::npos) return false;
++pos;
}
return true;
}
string shortestNotSubsequence(string& s) {
Queue q(16777216);
q.push(0);
q.push(1);
while(!q.empty())
{
string str;
int num = q.front; q.pop();
str = IntToString(num);
if(!is_subsequence(s, str)) return str;
string z = str + '0';
string o = str + '1';
q.push(stoi(str+'0'));
q.push(stoi(str+'1'));
}
return "";
}
int main() {
string N;
cin >> N;
cout << shortestNotSubsequence(N) << endl;
return 0;
}
You can do this pretty easily in O(N) time.
Let W = ceiling(log2(N+1)), where N is the length of the input string S.
There are 2^W possible strings of length W. S has at most N of them as substrings, and that's less than 2^W, so at least one string of length W must not be present in S.
W is also less than the number of bits in a size_t, and it only takes O(N) space to store a mask of all possible strings of length W. Initialize such a mask to 0s, and then iterate through S using the lowest W bits in a size_t as a sliding window of the substrings you encounter. Set the mask bit for each substring you encounter to 1.
When you're done, scan the mask to find the first 0, and that will be a string of length W that's missing.
There may also be shorter missing strings, though, so merge the mask bits in pairs to make a mask for the strings of length W-1, and then also set the mask bit for the last W-1 bits in S, since those might not be included in any W-length string. Then scan the mask for 0s to see if you can find a shorter missing string.
As long as you keep finding shorter strings, keep merging the mask for smaller strings until you get to length 1. Since each such operation divides the mask size in 2, that doesn't affect the overall O(N) time for the whole algorithm.
Here's an implementation in C++
#include <string>
#include <vector>
#include <algorithm>
std::string shortestMissingBinaryString(const std::string instr) {
const size_t len = instr.size();
if (len < 2) {
if (!len || instr[0] != '0') {
return std::string("0");
}
return std::string("1");
}
// Find a string size guaranteed to be missing
size_t W_mask = 0x3;
unsigned W = 2;
while(W_mask < len) {
W_mask |= W_mask<<1;
W+=1;
}
// Make a mask of all the W-length substrings that are present
std::vector<bool> mask(W_mask+1, false);
size_t lastSubstr=0;
for (size_t i=0; i<len; ++i) {
lastSubstr = (lastSubstr<<1) & W_mask;
if (instr[i] != '0') {
lastSubstr |= 1;
}
if (i+1 >= W) {
mask[lastSubstr] = true;
}
}
//Find missing substring of length W
size_t found = std::find(mask.begin(), mask.end(), false) - mask.begin();
// try to find a shorter missing substring
while(W > 1) {
unsigned testW = W - 1;
W_mask >>= 1;
// calculate masks for length testW
for (size_t i=0; i<=W_mask; i++) {
mask[i] = mask[i*2] || mask[i*2+1];
}
mask.resize(W_mask+1);
// don't forget the missing substring at the end
mask[lastSubstr & W_mask] = true;
size_t newFound = std::find(mask.begin(), mask.end(), false) - mask.begin();
if (newFound > W_mask) {
// no shorter string
break;
}
W = testW;
found = newFound;
}
// build the output string
std::string ret;
for (size_t bit = ((size_t)1) << (W-1); bit; bit>>=1) {
ret.push_back((found & bit) ? '1': '0');
}
return ret;
}
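For checking the function above on small inputs, a brute-force reference can help. The sketch below is not part of the O(N) algorithm; it tries candidate strings in order of increasing length (and increasing binary value within a length) and returns the first one that is not a substring. The name shortestMissingBruteForce is made up for illustration.

```cpp
#include <string>

// Brute-force reference: slow, but simple enough to trust on small inputs.
// The outer loop always terminates, because a string of length N has at
// most N distinct substrings of any given length.
std::string shortestMissingBruteForce(const std::string& s) {
    for (unsigned len = 1;; ++len) {
        for (unsigned long value = 0; value < (1UL << len); ++value) {
            // write value as a binary string of exactly len digits
            std::string candidate(len, '0');
            for (unsigned bit = 0; bit < len; ++bit) {
                if (value & (1UL << (len - 1 - bit)))
                    candidate[bit] = '1';
            }
            if (s.find(candidate) == std::string::npos)
                return candidate;
        }
    }
}
```

Comparing its result with shortestMissingBinaryString over all binary strings up to some small length is a cheap way to gain confidence in the faster version.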

Run-length decompression using C++

I have a text file with a string which I encoded.
Let's say it is: aaahhhhiii kkkjjhh ikl wwwwwweeeett
Here is the code for encoding, which works perfectly fine:
void Encode(std::string &inputstring, std::string &outputstring)
{
for (int i = 0; i < inputstring.length(); i++) {
int count = 1;
while (inputstring[i] == inputstring[i+1]) {
count++;
i++;
}
if(count <= 1) {
outputstring += inputstring[i];
} else {
outputstring += std::to_string(count);
outputstring += inputstring[i];
}
}
}
Output is as expected: 3a4h3i 3k2j2h ikl 6w4e2t
Now, I'd like to decompress the output back to the original.
I have been struggling with this for a couple of days now.
My idea so far:
void Decompress(std::string &compressed, std::string &original)
{
char currentChar = 0;
auto n = compressed.length();
for(int i = 0; i < n; i++) {
currentChar = compressed[i++];
if(compressed[i] <= 1) {
original += compressed[i];
} else if (isalpha(currentChar)) {
//
} else {
//
int number = isnumber(currentChar).....
original += number;
}
}
}
I know my Decompress function seems a bit messy, but I am pretty lost with this one.
Sorry for that.
Maybe there is someone out there at stackoverflow who would like to help a lost and beginner soul.
Thanks for any help, I appreciate it.
Assuming input strings cannot contain digits (your encoding cannot cover them, as e.g. both the strings "3a" and "aaa" would result in the encoded string "3a", so how would you ever decode that again?), you can decompress as follows:
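unsigned int num = 0;
for(auto c : compressed)
{
if(std::isdigit(static_cast<unsigned char>(c)))
{
num = num * 10 + (c - '0');
}
else
{
num += num == 0; // no digit read yet: the character occurs once
original.append(num, c);
// reset for the next run (a while(num--) loop would leave num
// wrapped around to a huge value here)
num = 0;
}
}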
Untested code, though...
Characters in a string actually are only numerical values, though. You can consider char (or signed char, unsigned char) as ordinary 8-bit integers as well. And you can store a numerical value in such a byte, too. Usually, you do run length encoding exactly that way: Count up to 255 equal characters, store the count in a single byte and the character in another byte. One single "a" would then be encoded as 0x01 0x61 (the latter being the ASCII value of a), "aa" would get 0x02 0x61, and so on. If you have to store more than 255 equal characters you store two pairs: 0xff 0x61, 0x07 0x61 for a string containing 262 times the character a... Decoding then gets trivial: you read characters pairwise, first byte you interpret as number, second one as character – rest being trivial. And you nicely cover digits that way as well.
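A minimal sketch of that byte-pair scheme, assuming 8-bit chars; the function names rleEncodeBytes and rleDecodeBytes are made up for illustration. Each run is stored as a count byte (1..255) followed by the character, and longer runs are split into several pairs.

```cpp
#include <cstddef>
#include <string>

// Encode as (count, character) byte pairs; a run longer than 255 is
// split into several pairs.
std::string rleEncodeBytes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        out.push_back(static_cast<char>(static_cast<unsigned char>(run)));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

// Decode by reading pairs: first byte is the count, second the character.
std::string rleDecodeBytes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        const std::size_t count = static_cast<unsigned char>(in[i]);
        out.append(count, in[i + 1]);
    }
    return out;
}
```

Because the count is a raw byte rather than a printed digit, input such as "3a aaa 0011" round-trips without ambiguity. The encoded string is binary data, though, so it may contain unprintable bytes.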
#include <cctype>
#include <iostream>
#include <string>
void Encode(std::string& inputstring, std::string& outputstring)
{
for (unsigned int i = 0; i < inputstring.length(); i++) {
int count = 1;
while (inputstring[i] == inputstring[i + 1]) {
count++;
i++;
}
if (count <= 1) {
outputstring += inputstring[i];
}
else {
outputstring += std::to_string(count);
outputstring += inputstring[i];
}
}
}
bool alpha_or_space(const char c)
{
return isalpha(c) || c == ' ';
}
void Decompress(std::string& compressed, std::string& original)
{
size_t i = 0;
size_t repeat;
while (i < compressed.length())
{
// normal alpha characters (copied as-is)
while (alpha_or_space(compressed[i]))
original.push_back(compressed[i++]);
// repeat number
repeat = 0;
while (isdigit(compressed[i]))
repeat = 10 * repeat + (compressed[i++] - '0');
// unroll repeated characters
auto char_to_unroll = compressed[i++];
while (repeat--)
original.push_back(char_to_unroll);
}
}
int main()
{
std::string deco, outp, inp = "aaahhhhiii kkkjjhh ikl wwwwwweeeett";
Encode(inp, outp);
Decompress(outp, deco);
std::cout << inp << std::endl << outp << std::endl<< deco;
return 0;
}
The decompression can't possibly work in an unambiguous way because you didn't define a sentinel character; i.e. given the compressed stream it's impossible to determine whether a number is an original single number or it represents the repeat RLE command. I would suggest using '0' as the sentinel char. While encoding, if you see '0' you just output 010. Any other char X will translate to 0NX where N is the repeat byte counter. If you go over 255, just output a new RLE repeat command

I Am Able To Go Outside Array Bounds

Given two strings, write a method to decide if one is an anagram/permutation of the other. This is my approach:
I wrote this function to check if 2 strings are anagrams (such as dog and god).
In ascii, a to z is 97 - 122.
Basically I have an array of bools that are all initially false. Every time I encounter a char in string1, it marks it as true.
To check if it's an anagram, I check if any chars of string2 are false (should be true if encountered in string1).
I'm not sure how, but this works too: arr[num] = true; (it shouldn't work, because I don't take into account that ascii starts at 97, and thus it goes out of bounds).
(Side note: is there a better approach than mine?)
EDIT: Thanks for all the responses! Will be carefully reading each one. By the way: not an assignment. This is a problem from a coding interview practice book
bool permutation(const string &str1, const string &str2)
{
// Cannot be anagrams if sizes are different
if (str1.size() != str2.size())
return false;
bool arr[25] = { false };
for (int i = 0; i < str1.size(); i++) // string 1
{
char ch = (char)tolower(str1[i]); // convert each char to lower
int num = ch; // get ascii
arr[num-97] = true;
}
for (int i = 0; i < str2.size(); i++) // string 2
{
char ch = (char)tolower(str2[i]); // convert char to lower
int num = ch; // get ascii
if (arr[num-97] == false)
return false;
}
return true;
}
There is nothing inherent in C++ arrays that prevents you from writing beyond the end of them. But, in doing so, you violate the contract you have with the compiler and it is therefore free to do what it wishes (undefined behaviour).
You can get bounds checking on "arrays" by using the vector class and its at() member function (operator[] on a vector is unchecked, just like a raw array), if that's what you need.
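A small sketch of what that looks like: std::vector::at() throws std::out_of_range for a bad index instead of silently touching memory outside the array. The helper name incrementChecked is made up for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Increments counts[index] if the index is valid. Returns false instead
// of invoking undefined behaviour when the index is out of range.
bool incrementChecked(std::vector<int>& counts, std::size_t index) {
    try {
        ++counts.at(index);  // at() checks the index, operator[] does not
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With a 26-element vector, the index 'z' - 'a' is accepted, while the raw character code 'z' (122) is rejected.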
As for a better approach, it's probably better if your array is big enough to cover every possible character (so you don't have to worry about bounds checking) and it shouldn't so much be a truth value as a count, so as to handle duplicate characters within the strings. If it's just a truth value, then here and herr would be considered anagrams (same length, and every letter of herr appears somewhere in here).
Even though you state it's not an assignment, you'll still learn more if you implement it yourself, so it's pseudo-code only from me. The basic idea would be:
def isAnagram (str1, str2):
    # Different lengths means no anagram.
    if len(str1) not equal to len(str2):
        return false
    # Initialise character counts to zero.
    create count[0..255] (assumes 8-bit char)
    for each index 0..255:
        set count[index] to zero
    # Add 1 for all characters in string 1.
    for each char in str1:
        increment count[char]
    # Subtract 1 for all characters in string 2.
    for each char in str2:
        decrement count[char]
    # Counts will be all zero for an anagram.
    for each index 0..255:
        if count[index] not equal to 0:
            return false
    return true
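For reference, one possible C++ rendering of that pseudo-code, as a sketch and not the only way to write it. Like the pseudo-code, it assumes 8-bit chars and does no case folding.

```cpp
#include <array>
#include <string>

bool isAnagram(const std::string& str1, const std::string& str2) {
    // Different lengths means no anagram.
    if (str1.size() != str2.size())
        return false;
    // One counter per possible 8-bit char value, all starting at zero.
    std::array<int, 256> count{};
    // unsigned char keeps the index in 0..255 even for chars above 127
    for (unsigned char ch : str1)
        ++count[ch];
    for (unsigned char ch : str2)
        --count[ch];
    // Counts will be all zero for an anagram.
    for (int c : count) {
        if (c != 0)
            return false;
    }
    return true;
}
```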
Working approach : with zero additional cost.
bool permutation(const std::string &str1, const std::string &str2)
{
// Cannot be anagrams if sizes are different
if (str1.size() != str2.size())
return false;
int arr[26] = { 0 }; // one counter per letter 'a'..'z'
for (int i = 0; i < str1.size(); i++) // string 1
{
char ch = (char)tolower(str1[i]); // convert each char to lower
int num = ch; // get ascii
arr[num-97] = arr[num-97] + 1 ;
}
for (int i = 0; i < str2.size(); i++) // string 2
{
char ch = (char)tolower(str2[i]); // convert char to lower
int num = ch; // get ascii
arr[num-97] = arr[num-97] - 1 ;
}
for (int i = 0; i < 26; i++) {
if (arr[i] != 0) {
return false;
}
}
return true;
}
Yes, neither C nor C++ performs index-out-of-bounds checking.
It is the duty of the programmer to make sure that the program logic doesn't cross the legitimate limits. It is the programmer who needs to check for violations.
Improved Code:
bool permutation(const string &str1, const string &str2)
{
// Cannot be anagrams if sizes are different
if (str1.size() != str2.size())
return false;
int arr[26] = { 0 }; //<-------- Changed (one counter per letter 'a'..'z')
for (int i = 0; i < str1.size(); i++) // string 1
{
char ch = (char)tolower(str1[i]); // convert each char to lower
int num = ch; // get ascii
arr[num-97] += 1; //<-------- Changed
}
for (int i = 0; i < str2.size(); i++) // string 2
{
char ch = (char)tolower(str2[i]); // convert char to lower
int num = ch; // get ascii
arr[num-97] = arr[num-97] - 1 ; //<-------- Changed
}
for (int i = 0; i < 26; i++) { //<-------- Changed
if (arr[i] != 0) { //<-------- Changed
return false; //<-------- Changed
}
}
return true;
}

How to find string in a string

I need to find the longest run of repetitions of one string inside another string, so if string1 is "Alibaba" and string2 is "ba", the longest string will be "baba". I have the lengths of the strings, but what next?
char* fun(char* a, char& b)
{
int length1=0;
int length2=0;
int longer;
int shorter;
char end='\0';
while(a[i] != tmp)
{
i++;
length1++;
}
int i=0;
while(b[i] != tmp)
{
i++;
length++;
}
if(dlug1 > dlug2){
longer = length1;
shorter = length2;
}
else{
longer = length2;
shorter = length1;
}
//logics here
}
int main()
{
char name1[] = "Alibaba";
char name2[] = "ba";
char &oname = *name2;
cout << fun(name1, oname) << endl;
system("PAUSE");
return 0;
}
Wow lots of bad answers to this question. Here's what your code should do:
Find the first instance of "ba" using the standard string searching functions.
In a loop look past this "ba" to see how many of the next N characters are also "ba".
If this sequence is longer than the previously recorded longest sequence, save its length and position.
Find the next instance of "ba" after the last one.
Here's the code (not tested):
string FindLongestRepeatedSubstring(string longString, string shortString)
{
// The number of repetitions in our longest string.
int maxRepetitions = 0;
int n = shortString.length(); // For brevity.
// Where we are currently looking.
int pos = 0;
while ((pos = longString.find(shortString, pos)) != string::npos)
{
// Ok we found the start of a repeated substring. See how many repetitions there are.
int repetitions = 1;
// This is a little bit complicated.
// First go past the "ba" we have already found (pos += n)
// Then see if there is still enough space in the string for there to be another "ba"
// Finally see if it *is* "ba"
for (pos += n; pos+n <= longString.length() && longString.substr(pos, n) == shortString; pos += n)
++repetitions;
// See if this sequence is longer than our previous best.
if (repetitions > maxRepetitions)
maxRepetitions = repetitions;
}
// Construct the string to return. You really probably want to return its position, or maybe
// just maxRepetitions.
string ret;
while (maxRepetitions--)
ret += shortString;
return ret;
}
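One caveat with the loop above: after a run it resumes searching past the characters it has already consumed, so for a pattern that overlaps itself it can skip a better starting point (e.g. "aba" in "ababaaba", where the run "abaaba" starts at index 2). A sketch of a variant that tries every occurrence as a starting point follows; the name maxConsecutiveRepeats is made up for illustration. It trades speed for simplicity, and its worst case is quadratic.

```cpp
#include <cstddef>
#include <string>

// Largest number of back-to-back copies of unit found anywhere in text.
std::size_t maxConsecutiveRepeats(const std::string& text, const std::string& unit) {
    if (unit.empty())
        return 0;
    const std::size_t n = unit.size();
    std::size_t best = 0;
    // Try every occurrence as the start of a run, including overlapping ones.
    for (std::size_t start = text.find(unit); start != std::string::npos;
         start = text.find(unit, start + 1)) {
        std::size_t reps = 1;
        std::size_t pos = start + n;
        // keep going while another full copy of unit follows immediately
        while (pos + n <= text.size() && text.compare(pos, n, unit) == 0) {
            ++reps;
            pos += n;
        }
        if (reps > best)
            best = reps;
    }
    return best;
}
```

The resulting string is then unit repeated that many times, e.g. "baba" for "Alibaba" and "ba".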
What you want should look like this pseudo-code:
i = j = count = max = 0
while (i < length1 && c = name1[i++]) do
if (j < length2 && name2[j] == c) then
j++
else
max = (count > max) ? count : max
count = 0
j = 0
end
if (j == length2) then
count++
j = 0
end
done
max = (count > max) ? count : max
for i = 0 to max-1 do
print name2
done
The idea is here but I feel that there could be some cases in which this algorithm won't work (cases with complicated overlap that would require going back in name1). You may want to have a look at the Boyer-Moore algorithm and mix the two to have what you want.
The Algorithms Implementation Wikibook has an implementation of what you want in C++.
http://www.cplusplus.com/reference/string/string/find/
Maybe you did it on purpose, but you should use the std::string class and forget archaic things like the char* string representation.
It lets you use lots of optimized methods, such as string searching, etc.
Why don't you use the strstr function provided by C?
const char * strstr ( const char * str1, const char * str2 );
char * strstr ( char * str1, const char * str2 );
Locate substring
Returns a pointer to the first occurrence of str2 in str1,
or a null pointer if str2 is not part of str1.
The matching process does not include the terminating null-characters.
Now use the lengths: create a loop that walks through the original string with strstr and find the longest string inside.