Python Iterating through a string to look for a Palindrome

Python Iterating through a string to look for a Palindrome - python-2.7

So I have looked around this site and others for information on how to iterate through a string on Python, find a specific substring, reverse it and check if the two equaled in order to get a Palindrome. This is the problem though since some of the test cases are challenging to get and have confused me on how to find them through indexing.
This is my code that works for all, but two test cases:
def countPalindromes(s):
count = 0
firstindex = 0
lastindex = len(str)-1
while firstindex != lastindex and firstindex <= lastindex:
ch1 = s[firstindex:lastindex]
ch2 = s[lastindex:firstindex:-1]
if ch1 == ch2:
count +=1
firstindex +=1
lastindex -=1
return count
This code works for the following Palindromes: "racecar", " ", and "abqc".
It does not work for these Palindromes "aaaa" and "abacccaba".
For "aaaa" there are 6 palindromes and for "abacccaba" there are 8 palindromes. This is where my problem occurs, and I simply can't figure it out. For the 6 palindromes for "aaaa" I get aaaa, aaa, aa, twice for each. For "abacccaba" the 8 palindromes I have no idea as I get abacccaba, bacccab, accca, ccc, aba, aba.
I understand this is a confusing question, but I am lost how to approach the problem since I only get 2 for the "aaaa" and 4 for "abacccaba". Any ideas how I would cut out the substrings and get these values?
Thanks in advance!

while firstindex != lastindex and firstindex <= lastindex: misses the case of a single character palindrome.
You're also missing the case where aa contains three palindromes, 0:1, 0:2 and 1:2.
I think you're missing some palindromes for aaaa; there are 10:
aaaa
a
a
a
a
aa
aa
aa
aaa
aaa
If single-character palindromes do not count, then we have 6.
Either way, you need to consider all substrings as possible palindromes; not only the ones in the middle. Comparing a string against its reversed self is very easy to do in Python: s == s[::-1].
Getting all the substrings is easy too:
def get_all_substrings(input_string):
length = len(input_string)
return [input_string[i:j+1] for i in range(length) for j in range(i,length)]
and filtering out strings of length < 2 is also easy:
substrings = [a for a in get_all_substrings(string) if len(a) > 1]
Combining these should be fairly straight forward:
len([a for a in get_all_substrings(string) if len(a) > 1 and a == a[::-1]])

I think you should write a function(f) individually to check if a string is a palindrome.
Then make a function(g) that selects sub-strings of letters.
Eg: in string abcd, g will select a, b, c, d, ab, bc, cd, abc, bcd, abcd. Then apply f on each of these strings individually to get the number of palindromes.

Related

regex: Match strings of numbers up to permutation of ciphers

I am trying to find a regex query, such that, for instance, the following strings match the same expression
"1116.67711..44."
"2224.43322..88."
"9993.35599..22."
"7779.91177..55."
I.e. formally "x1x1x1x2.x2x3x3x1x1..x4x4." where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
Or (another example), the following strings match the same expression, but not the same expression as before:
"94..44.773399.4"
"25..55.886622.5"
"73..33.992277.3"
I.e. formally "x1x2..x2x2.x3x3x4x4x1x1.x2" where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
That is two strings should be equal if they have the same form, but with the numbers internally permuted so that they are pairwise distinct.
The dots should mean a space in the sequence, this could be any value that is not a single digit number, and two "equal" strings, should have spaces the same places. If it helps, the strings all have the same length of 81 (above they all have a length of 15, as to not write too long strings).
That is, if I have some string as above, e.g. "3566.235.225..45" i want to have some reqular expression that i can apply to some database to find out if such a string already exists
Is it possible to do this?

The answer is fairly straightforward:
import re
pattern = re.compile(r'^(\d)\1{3}$')
print(pattern.match('1234'))
print(pattern.match('333'))
print(pattern.match('3333'))
print(pattern.match('33333'))
You capture what you need once, then tell the regex engine how often you need to repeat it. You can refer back to it as often as you like, for example for a pattern that would match 11.222.1 you'd use ^(\d)\1{1}\.(\d)\2{2}\.(\1){1}$.
Note that the {1} in there is superfluous, but it shows that the pattern can be very regular. So much so, that it's actually easy to write a function that solves the problem for you:
def make_pattern(grouping, separators='.'):
regex_chars = '.\\*+[](){}^$?!:'
groups = {}
i = 0
j = 0
last_group = 0
result = '^'
while i < len(grouping):
if grouping[i] in separators:
if grouping[i] in regex_chars:
result += '\\'
result += grouping[i]
i += 1
else:
while i < len(grouping) and grouping[i] == grouping[j]:
i += 1
if grouping[j] in groups:
group = groups[grouping[j]]
else:
last_group += 1
groups[grouping[j]] = last_group
group = last_group
result += '(.)'
j += 1
result += f'\\{group}{{{i-j}}}'
j = i
return re.compile(result+'$')
print(make_pattern('111.222.11').match('aaa.bbb.aa'))
So, you can give make_pattern a good example of the pattern and it will return the compiled regex for you. If you'd like other separators than '.', you can just pass those in as well:
my_pattern = make_pattern('11,222,11', separators=',')
print(my_pattern.match('aa,bbb,aa'))

Is there a pythonic way to count the number of leading matching characters in two strings?

For two given strings, is there a pythonic way to count how many consecutive characters of both strings (starting at postion 0 of the strings) are identical?
For example in aaa_Hello and aa_World the "leading matching characters" are aa, having a length of 2. In another and example there are no leading matching characters, which would give a length of 0.
I have written a function to achive this, which uses a for loop and thus seems very unpythonic to me:
def matchlen(string0, string1): # Note: does not work if a string is ''
for counter in range(min(len(string0), len(string1))):
# run until there is a mismatch between the characters in the strings
if string0[counter] != string1[counter]:
# in this case the function terminates
return(counter)
return(counter+1)
matchlen(string0='aaa_Hello', string1='aa_World') # returns 2
matchlen(string0='another', string1='example') # returns 0

You could use zip and enumerate:
def matchlen(str1, str2):
i = -1 # needed if you don't enter the loop (an empty string)
for i, (char1, char2) in enumerate(zip(str1, str2)):
if char1 != char2:
return i
return i+1

An unexpected function in os.path, commonprefix, can help (because it is not limited to file paths, any strings work). It can also take in more than 2 input strings.
Return the longest path prefix (taken character-by-character) that is a prefix of all paths in list. If list is empty, return the empty string ('').
from os.path import commonprefix
print(len(commonprefix(["aaa_Hello","aa_World"])))
output:
2

from itertools import takewhile
common_prefix_length = sum(
1 for _ in takewhile(lambda x: x[0]==x[1], zip(string0, string1)))
zip will pair up letters from the two strings; takewhile will yield them as long as they're equal; and sum will see how many there are.
As bobble bubble says, this indeed does exactly the same thing as your loopy thing. Its sole pro (and also its sole con) is that it is a one-liner. Take it as you will.

Find a Repeated Substring Pattern in a given string

Given a non-empty string check if it can be constructed by taking a substring of it and appending multiple copies of the substring together. You may assume the given string consists of lowercase English letters only and its length will not exceed 10000.
Example 1:
Input: "abab"
Output: True
Explanation: It's the substring "ab" twice.
Example 2:
Input: "aba"
Output: False
Example 3:
Input: "abcabcabcabc"
Output: True
Explanation: It's the substring "abc" four times. (And the substring "abcabc" twice.)
I found the above question on a online programming site here. I submitted the following answer which is working for the custom test cases, but is getting time exceed exception on submission. I tried other way of regex pattern matching, but as expected, that should be taking more time than this way, and fails too.
public class Solution {
public boolean repeatedSubstringPattern(String str) {
int substringEndIndex = -1;
int i = 0;
char startOfString = str.charAt(0);
i++;
char ch;
while(i < str.length()){
if((ch=str.charAt(i)) != startOfString){
//create a substring until the char at start of string is encountered
i++;
}else{
if(str.split(str.substring(0,i)).length == 0){
return true;
}else{
//false alarm. continue matching.
i++;
}
}
}
return false;
}
}
Any idea on where I am taking too much time.

Ther's a literally one-line solution to the problem.
Repeat the given string twice and remove the first and last character of the newly created string, check if a given string is a substring of the newly created string.
def repeatedSubstringPattern(self, s: str) -> bool:
return s in (s + s )[1: -1]
Eg:
str:
abab.
Repeat str: abababab.
Remove the first and last characters: bababa.
Check if abab is a substring of bababa.
str: aba.
Repeat str: abaaba
Remove first and last characters: baab.
Check if aba is a substring of baab.
Mathematical Proof:
Let P be the pattern that is repeated K times in a string S.
S = P*K.
Let N be the newly created string by repeating string S
N = S+S.
Let F be the first character of string N and L be the last character of string N
N = ( F+ P*(K-1) )+ (P*(K-1) + L)
N = F+ P(2K-2)+ L
If K = 1. i.e a string repeated only once
N = F+L. //as N != S So False
If K ≥ 2.
N = F+k'+ N
Where k'≥K. As our S=P*K.
So, S must be in N.
We can further use KMP algorithm to check if S is a sub-string of N. Which will give us time complexity of O(n)

You can use Z-algorithm
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S, i.e. the maximum k such that
S[j] = S[i + j] for all 0 ≤ j < k. Note that Z[i] = 0 means that
S[0] ≠ S[i]. For easier terminology, we will refer to substrings which
are also a prefix as prefix-substrings.
Build Z-array for your string and find whether such position i exists for that i+ Z[i] = n and i is divisor of n (string length)

A short and easily understandable logic would be:
def repeatedSubstringPattern(s: str):
for i in range(1,int(len(s)/2)+1):
if set(s.split(s[0:i])) == {''}:
return True
return False
You can also write return i for return the number after which the pattern repeats itself.

Split a word with regexp in matlab; startIndex for 'split'?

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
I have come so far:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
[match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
%Now I split the 'aa', 'but' and 'er' into single characters:
for i = 1:length(split)
SingleLetters{i} = regexp(split{1,i},'.','match');
end
end
My problem is: How do I put the cells together, such that they are formatted like the desired output? I only have the starting indexes for the match parts ('ch') but not for the split parts ('aa', 'but','er').
Any ideas?

You don't need to work with the indices or length. Simple logic: Process first element from match, then first from split, then second from match etc....
[match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end

So, you know the length of 'ch', it's 2. You know where you found it from regex, as those indices are stored in startIndex. I'm assuming (Please, correct me if I'm wrong) that you want to split all other letters of the word into single-letter cells, like in your output above. So, you can just use the startIndex data to construct your output, using conditionals, like this:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
[match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
%Now I split the 'aa', 'but' and 'er' into single characters:
for i = 1:length(split)
SingleLetters{i} = regexp(split{1,i},'.','match');
end
end
j = 0;
for i = 1 : length('aachbutcher')
if (i ~= startIndex(1)) && (i ~= startIndex(2))
j = j +1;
output{end+1} = SingleLetters{j};
else
i = i + 1;
output{end+1} = 'ch';
end
end
I don't have MATLAB right now, so I can't test it. I hope it works for you! If not, let me know and I'll take anther shot at it.

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!

There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE

If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX and the comma can be omitted. Likewise, if there is neither maximum nor minimum everything within the {} including the braces themselves can be replaced with a * which signifies the preceding item will be matched zero or more times; this is equivalent to {0}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm ENSURES a O(n) performance rate, as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.
Hopefully this clarifies why you were having performance issues.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Python Iterating through a string to look for a Palindrome - python-2.7

Related

regex: Match strings of numbers up to permutation of ciphers

Is there a pythonic way to count the number of leading matching characters in two strings?

Find a Repeated Substring Pattern in a given string

Split a word with regexp in matlab; startIndex for 'split'?

R code to check if word matches pattern

Categories

Resources