Find a Repeated Substring Pattern in a given string - regex

Given a non-empty string check if it can be constructed by taking a substring of it and appending multiple copies of the substring together. You may assume the given string consists of lowercase English letters only and its length will not exceed 10000.
Example 1:
Input: "abab"
Output: True
Explanation: It's the substring "ab" twice.
Example 2:
Input: "aba"
Output: False
Example 3:
Input: "abcabcabcabc"
Output: True
Explanation: It's the substring "abc" four times. (And the substring "abcabc" twice.)
I found the above question on an online programming site. I submitted the following answer, which works for the custom test cases but exceeds the time limit on submission. I also tried a regex pattern-matching approach, but, as expected, it takes even longer than this one and fails too.
public class Solution {
    public boolean repeatedSubstringPattern(String str) {
        int substringEndIndex = -1;
        int i = 0;
        char startOfString = str.charAt(0);
        i++;
        char ch;
        while(i < str.length()){
            if((ch=str.charAt(i)) != startOfString){
                //create a substring until the char at start of string is encountered
                i++;
            }else{
                if(str.split(str.substring(0,i)).length == 0){
                    return true;
                }else{
                    //false alarm. continue matching.
                    i++;
                }
            }
        }
        return false;
    }
}
Any idea where I am spending too much time?

There's literally a one-line solution to the problem.
Concatenate the given string with itself, remove the first and last characters of the resulting string, and check whether the given string is a substring of what remains.
def repeatedSubstringPattern(self, s: str) -> bool:
    return s in (s + s)[1:-1]
Example:
str: abab
Repeat str: abababab
Remove the first and last characters: bababa
Check whether abab is a substring of bababa: it is, so the answer is True.
str: aba
Repeat str: abaaba
Remove the first and last characters: baab
Check whether aba is a substring of baab: it is not, so the answer is False.
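Proof sketch:
Let P be the pattern that is repeated K times in a string S, so
S = P*K.
Let N be the string obtained by concatenating S with itself:
N = S + S = P*(2K).
Removing the first and last characters of N damages only the first and the last copy of P. Writing F for what is left of the first copy and L for what is left of the last copy, the trimmed string N' is
N' = F + P*(2K-2) + L.
If K ≥ 2, then 2K-2 ≥ K, so the middle part P*(2K-2) contains P*K = S, and S must occur in N'.
Conversely, if S occurs in N', then S occurs in S + S at some offset i with 0 < i < len(S), which means S equals its own rotation by i characters. A string that equals a non-trivial rotation of itself is a repetition of its prefix of length gcd(i, len(S)), which is shorter than S.
So if S is not a repetition of a shorter pattern (only K = 1 is possible), S cannot occur in N' and the answer is False.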
We can further use the KMP algorithm to check whether S is a substring of the trimmed string, which gives a time complexity of O(n).
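As a rough illustration of the KMP variant mentioned above, here is a minimal Python sketch. The helper names (build_failure, kmp_contains, repeated_substring_pattern) are made up for this example and do not come from the original answer; the built-in `in` operator would give the same result, and the point here is only to show the linear-time search explicitly.

```python
def build_failure(pattern: str) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(text: str, pattern: str) -> bool:
    """Return True if pattern occurs in text, scanning text once."""
    if not pattern:
        return True
    fail = build_failure(pattern)
    k = 0  # number of pattern characters currently matched
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False


def repeated_substring_pattern(s: str) -> bool:
    # Search for s inside (s + s) with the first and last characters removed.
    return kmp_contains((s + s)[1:-1], s)
```

Both the failure table and the scan are linear, so the whole check is O(n) in the length of the input.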

You can use Z-algorithm
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S, i.e. the maximum k such that
S[j] = S[i + j] for all 0 ≤ j < k. Note that Z[i] = 0 means that
S[0] ≠ S[i]. For easier terminology, we will refer to substrings which
are also a prefix as prefix-substrings.
Build the Z-array for your string and check whether there is a position i such that i + Z[i] = n and i is a divisor of n (the string length).
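The check described above could be sketched in Python as follows. This is only an illustration under the definition quoted above; the function names are hypothetical, and Z[0] is left at 0 by convention because the check never reads it.

```python
def z_array(s: str) -> list:
    """z[i] = length of the longest substring starting at i that is also a prefix of s."""
    n = len(s)
    z = [0] * n
    left, right = 0, 0  # s[left:right] is the rightmost known prefix-substring
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def repeated_substring_pattern_z(s: str) -> bool:
    n = len(s)
    z = z_array(s)
    # A period i works if it divides n and the suffix starting at i matches the prefix.
    return any(n % i == 0 and i + z[i] == n for i in range(1, n))
```

For "abab" the Z-array is [0, 0, 2, 0], and position i = 2 satisfies both conditions.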

A short and easily understandable logic would be:
def repeatedSubstringPattern(s: str):
    for i in range(1, int(len(s)/2) + 1):
        if set(s.split(s[0:i])) == {''}:
            return True
    return False
You can also write return i instead of return True to get the length after which the pattern repeats itself.


How can i find the index of the first alphabet of second last word in a string [duplicate]

I want to find an index in a string: how can I find the index of the first letter of the second-to-last word?
val index = "Hey! How are you men? How you doing"
I want to search for "you doing" in the above string, but I want the index of the y in the word you. I wrote some code to find the index, but I am unable to get it.
fun main(vararg args: String) {
    val inputString = "Hey! How are you men? How you doing"
    val regex = "you doing".toRegex()
    val match = regex.find(inputString)!!
    println(match.value)
    println(match.range)
}
This regex finds the last two words in your sentence and calculates the index by subtracting the length of the two words from the length of the string.
val result = Regex("^(?:.*?\\s+)?([^\\s]+\\s+[^\\s]+)$").matchEntire(inputString)
if (result != null) {
    println(inputString.length - result.groupValues[1].length)
} else {
    println("not supported")
}
Supports inputs like
Hey! How are you men? How you doing
Hey! How are you men? How you doing?
Hey! How are you, John?
Hello there!
Split the string, then take the first character of the second-to-last element of the resulting array.
If you are looking for the index of the y in you doing related to the entire string (Hey! How are you men? How you doing), you can use indexOf.
val inputString = "Hey! How are you men? How you doing"
val matchString = "you doing"
val matchIndex = inputString.indexOf(matchString)
More info on indexOf is available in the Kotlin documentation.
If you don't want to use a regex (which you probably shouldn't unless you need the efficiency), the simplest option is probably what @samuei says:
index.split(' ').takeLast(2).first().first()
(take the last two words, take the first of those, and then the first character of that)
If you want to mess with indices instead you could do this kind of thing:
val lastSpaceIndex = index.lastIndexOf(' ')
val secondToLastSpace = index.lastIndexOf(' ', startIndex = lastSpaceIndex -1)
println(index.get(secondToLastSpace + 1))
where you're finding the index of the last space, then the index of the last space before that one, and then grabbing the character after that. But this is already getting a lot less readable, and is it worth the extra complexity? Your call!
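Since the question asks for the index rather than the character, the two-lastIndexOf idea above can be sketched like this in Python. The function name is hypothetical, and the sketch assumes words are separated by single spaces with no trailing whitespace.

```python
def second_to_last_word_index(text: str) -> int:
    """Index of the first character of the second-to-last word, or -1 if there is only one word."""
    last_space = text.rfind(" ")
    if last_space == -1:
        return -1  # fewer than two words
    # Last space before the final word's separator; -1 means the word starts at index 0.
    prev_space = text.rfind(" ", 0, last_space)
    return prev_space + 1


sentence = "Hey! How are you men? How you doing"
print(second_to_last_word_index(sentence))
```

For the sentence from the question this returns the position of the y in the final "you doing".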

Finding the shortest repetitive pattern in a string

I was wondering if there is a way to do pattern matching in Octave / MATLAB? I know Maple 10 has commands to do this, but I am not sure what I need to do in Octave / MATLAB. So if the number were 12341234123412341234, the pattern match would be 1234. I'm trying to find the shortest pattern that upon repetition generates the whole string.
Please note: the numbers (only numbers will be used) won't be this simple. Also, I won't know the pattern ahead of time (that's what I'm trying to find). Please see the Maple 10 example below which shows that the pattern isn't known ahead of time but the command finds the pattern.
Example of Maple 10 pattern matching:
ns:=convert(12341234123412341234,string);
ns := "12341234123412341234"
StringTools:-PrimitiveRoot(ns);
"1234"
How can I do this in Octave / Matlab?
Ps: I'm using Octave 3.8.1
To find the shortest pattern that upon repetition generates the whole string, you can use regular expressions as follows:
result = regexp(str, '^(.+?)(?=\1*$)', 'match');
Some examples:
>> str = '12341234123412341234';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234'
>> str = '1234123412341234123';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'1234123412341234123'
>> str = 'lullabylullaby';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby'
>> str = 'lullaby1lullaby2lullaby1lullaby2';
>> result = regexp(str, '^(.+?)(?=\1*$)', 'match')
result =
'lullaby1lullaby2'
I'm not sure if this can be accomplished with regular expressions. Here is a script that will do what you need in the case of a repeated word called pattern.
It loops through the characters of a string called str, trying to match against another string called pattern. If matching fails, the pattern string is extended as needed.
EDIT: I made the code more compact.
str = 'lullabylullabylullaby';
pattern = str(1);
matchingState = false;
sPtr = 1;
pPtr = 1;
while sPtr <= length(str)
if str(sPtr) == pattern(pPtr) %// if match succeeds, keep looping through pattern string
matchingState = true;
pPtr = pPtr + 1;
pPtr = mod(pPtr-1,length(pattern)) + 1;
else %// if match fails, extend pattern string and start again
if matchingState
sPtr = sPtr - 1; %// don't change str index when transitioning out of matching state
end
matchingState = false;
pattern = str(1:sPtr);
pPtr = 1;
end
sPtr = sPtr + 1;
end
display(pattern);
The output is:
pattern =
lullaby
Note:
This doesn't allow arbitrary delimiters between occurrences of the pattern string. For example, if str = 'lullaby1lullaby2lullaby1lullaby2';, then
pattern =
lullaby1lullaby2
This also allows the pattern to end mid-way through a cycle without changing the result. For example, str = 'lullaby1lullaby2lullaby1'; would still result in
pattern =
lullaby1lullaby2
To fix this you could add the lines
if pPtr ~= length(pattern)
pattern = str;
end
Another approach is as follows:
* determine the length of the string, and find all possible factors of that length
* for each possible factor length, reshape the string and check for a repeated substring
To find all possible factors, see this solution on SO. The next step can be performed in many ways, but I implement it in a simple loop, starting with the smallest factor length.
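function repeat = repeats_in_string(str)
ns = numel(str);
nf = find(rem(ns, 1:ns) == 0); % all factors of the string length
for ii=1:numel(nf)
    repeat = str(1:nf(ii)); % candidate pattern
    % one chunk per row; 'rows' compares whole chunks rather than single characters
    if all(ismember(reshape(str,nf(ii),[])',repeat,'rows'))
        break;
    end
end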
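For readers who want to cross-check the factor-based idea outside Octave / MATLAB, here is a small Python sketch of the same approach. The function name is made up for this example; it tries every length that divides the string length, shortest first, and compares the repeated candidate against the whole string.

```python
def shortest_repeating_unit(s: str) -> str:
    """Return the shortest prefix whose repetition reproduces s exactly."""
    n = len(s)
    for length in range(1, n + 1):
        # Only lengths that divide n can tile the string without a remainder.
        if n % length == 0 and s[:length] * (n // length) == s:
            return s[:length]
    return s  # only reached for the empty string


print(shortest_repeating_unit("12341234123412341234"))
```

A string such as '1221', whose halves use the same characters in a different order, is returned unchanged, which is why the comparison has to be on whole chunks rather than on individual characters.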
This problem is a great Rorschach test for your approach to problem solving. I'll add a signal engineering solution, which should be simple since the signal is expected to be perfectly repetitive, assuming this holds: find the shortest pattern that upon repetition generates the whole string.
In the following str fed to the function is actually a column vector of floats, not a string, the original string having been converted with str2num(str2mat(str)'):
function res=findshortestrepel(str);
[~,ii] = max(fft(str-mean(str)));
res = str(1:round(numel(str)/(ii-1)));
I performed a small test, comparing this to the regexp solution and found it to be faster overall (blue squares), although somewhat inconsistently, and only if you don't consider the time required to convert the string into a vector of floats (green squares). However I did not pursue this further (not breaking records with this):
[Figure omitted: timing comparison of the two solutions; times in seconds.]

Java Regex: how to find the total number of occurrences of matching strings from a txt file?

The instructor gave us a text file that contains a book, and we're supposed to find the number of times words that do not contain the letters "aeio" are used.
Here's the full question for clarification's sake:
create a Junit test that tests the total number of occurrences of
words that do not contain the letters a, e, i, or o. Note that this
test differs from the others in that it finds the total number of
occurrences of matching strings, not just the number of matching
strings. Assert that the number of occurrences is 1347.
Here's a copied test from the code that he gave us, but I think it's very close to what the answer should be...I just can't figure this one out.
@Test
public void testCapuletOrCapulets() {
    //count number of times a word doesn't contain a,e,i or o
    String matchString = "^aeio" ;
    File f = new File("romeojuliet.txt");
    WordFrequency wf = new WordFrequency(f);
    wf.buildTree();
    Map<String, Integer> map = wf.getFrequencies();
    int numMatches = 0;
    for(String s: map.keySet()) if(s.matches(matchString.toLowerCase())) numMatches++;
    assertEquals(numMatches, 1347);
}
I would approach it like this:
@Test
public void testCapuletOrCapulets() {
    // count the words that do not contain a, e, i or o
    String matchString = "aeio";
    File f = new File("romeojuliet.txt");
    WordFrequency wf = new WordFrequency(f);
    wf.buildTree();
    Map<String, Integer> map = wf.getFrequencies();
    int numMatches = 0; // words containing at least one of the letters
    int numAll = 0;     // all words
    for (String s : map.keySet()) {
        for (char c : s.toCharArray()) {
            // String.contains() takes a CharSequence, so test a single char with indexOf
            if (matchString.indexOf(c) >= 0) {
                numMatches++;
                break;
            }
        }
        numAll++;
    }
    assertEquals(1347, numAll - numMatches);
}
Not entirely sure if it's plug and play, because i couldn't test it right now.
What it does is split the string into a char array and matches the char against the matchString. If the matchString contains the char then numMatches goes up and we'll move on to the next word (hence break the inner for-loop).
Because you need to count words which don't contain given letters then you would need to also count the total number of words and then subtract the matches from total word count.
I am sure there are better solutions, but this should also work.
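One thing to watch: iterating over the key set counts each distinct word once, whereas the assignment asks for the total number of occurrences. A minimal Python sketch of the occurrence-based count follows. It assumes the frequency map holds per-word occurrence counts, which is a guess about what getFrequencies() returns; the function name and the sample sentence are made up for illustration.

```python
from collections import Counter


def count_occurrences_without(frequencies, forbidden="aeio"):
    """Sum the occurrence counts of every word that contains none of the forbidden letters."""
    return sum(
        count
        for word, count in frequencies.items()
        if not any(ch in forbidden for ch in word.lower())
    )


# Hypothetical stand-in for the word-frequency map built from the book.
frequencies = Counter("the sky runs by the sky but".split())
print(count_occurrences_without(frequencies))
```

In this sample, four distinct words avoid the letters (sky, runs, by, but), but sky occurs twice, so the occurrence count is one higher than the distinct-word count.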

How to use regular expressions to extract 3-tuple values from a string

I am trying to extract n 3-tuples (Si, Pi, Vi) from a string.
The string contains at least one such 3-tuple.
Pi and Vi are not mandatory.
SomeTextxyz#S1((property(P1)val(V1))#S2((property(P2)val(V2))#S3
|----------1-------------|----------2-------------|-- n
The desired output would be:
Si,Pi,Vi.
So for n occurrences in the string the output should look like this:
[S1,P1,V1] [S2,P2,V2] ... [Sn-1,Pn-1,Vn-1] (without the brackets)
Example
The input string could be something like this:
MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425)).
Once processed the output should be:
Mustang,PS,500 Porsche,PS,425
Is there an efficient way to extract those 3-tuples using a regular expression
(e.g. using C++ and std::regex) and what would it look like?
#(.*?)\(\(property\((.*?)\)val\((.*?)\)\) should do the trick.
example at http://regex101.com/r/bD1rY2
# # Matches the # symbol
(.*?) # Captures everything until it encounters the next part (ungreedy wildcard)
\(\(property\( # Matches the string "((property(" the backslashes escape the parenthesis
(.*?) # Same as the one above
\)val\( # Matches the string ")val("
(.*?) # Same as the one above
\)\) # Matches the string "))"
I don't know how you should implement this in C++, but that is the easy part :)
http://ideone.com/S7UQpA
I used C's <regex.h> instead of std::regex because std::regex wasn't implemented in the g++ version that IDEONE used at the time. The regular expression I used:
" In C(++)? regexes are strings.
# Literal match
([^(#]+) As many non-#, non-( characters as possible. This is group 1
( Start another group (group 2)
\\(\\(property\\( Yet more literal matching
([^)]+) As many non-) characters as possible. Group 3.
\\)val\\( Literal again
([^)]+) As many non-) characters as possible. Group 4.
\\)\\) Literal parentheses
) Close group 2
? Group 2 optional
" Close Regex
And some c++:
int getMatches(char* haystack, item** items){
first, calculate the length of the string (we'll use that later) and the number of # found in the string (the maximum number of matches)
int l = -1, ats = 0;
while (haystack[++l])
if (haystack[l] == '#')
ats++;
malloc a large enough array.
*items = (item*) malloc(ats * sizeof(item));
item* arr = *items;
Make a regex needle to find. REGEX is #defined elsewhere.
regex_t needle;
regcomp(&needle, REGEX, REG_ICASE|REG_EXTENDED);
regmatch_t match[5];
ret will hold the return value (0 for "found a match", but there are other errors you may want to be catching here). x will be used to count the found matches.
int ret;
int x = -1;
Loop over matches (ret will be zero if a match is found).
while (!(ret = regexec(&needle, haystack, 5, match,0))){
++x;
Get the name from match[1].
int bufsize = match[1].rm_eo-match[1].rm_so + 1;
arr[x].name = (char *) malloc(bufsize);
strncpy(arr[x].name, &(haystack[match[1].rm_so]), bufsize - 1);
arr[x].name[bufsize-1]=0x0;
Check to make sure the property (match[3]) and the value (match[4]) were found.
if (!(match[3].rm_so > l || match[3].rm_so < 0 || match[3].rm_eo > l || match[3].rm_eo < 0
|| match[4].rm_so > l || match[4].rm_so < 0 || match[4].rm_eo > l || match[4].rm_eo < 0)){
Get the property from match[3].
bufsize = match[3].rm_eo-match[3].rm_so + 1;
arr[x].property = (char *) malloc(bufsize);
strncpy(arr[x].property, &(haystack[match[3].rm_so]), bufsize - 1);
arr[x].property[bufsize-1]=0x0;
Get the value from match[4].
bufsize = match[4].rm_eo-match[4].rm_so + 1;
arr[x].value = (char *) malloc(bufsize);
strncpy(arr[x].value, &(haystack[match[4].rm_so]), bufsize - 1);
arr[x].value[bufsize-1]=0x0;
} else {
Otherwise, set both property and value to NULL.
arr[x].property = NULL;
arr[x].value = NULL;
}
Move the haystack to past the match and decrement the known length.
haystack = &(haystack[match[0].rm_eo]);
l -= match[0].rm_eo;
}
Return the number of matches.
return x+1;
}
Hope this helps. Though it occurs to me now that you never answered kind of a vital question: What have you tried?
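To try the pattern with the optional property/value part without compiling anything, a short Python sketch can help. It is only an illustration and not a replacement for a C++ implementation; the names TUPLE_RE and extract_tuples are invented for this example, and the outer group is made non-capturing so the groups line up with name, property and value.

```python
import re

# '#', then the name, then an optional "((property(P)val(V))" part.
TUPLE_RE = re.compile(r"#([^(#]+)(?:\(\(property\(([^)]+)\)val\(([^)]+)\)\))?")


def extract_tuples(text):
    """Return (name, property, value) tuples; property and value are None when absent."""
    return [m.groups() for m in TUPLE_RE.finditer(text)]


sample = "MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425))"
for name, prop, value in extract_tuples(sample):
    print(name, prop, value)
```

An entry without a property and value, such as a trailing #S3, yields None for both fields, mirroring the NULL assignments in the C code.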

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW, I guess... Is there a faster way to do this? A regex, maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation: I was 'appending' instead of pre-allocating the final vector, so the code was taking 30 minutes to execute instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing things character by character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, MAX can be omitted but the comma must stay ({MIN,}); without the comma, {MIN} matches exactly MIN repetitions. Likewise, if there is neither a maximum nor a minimum, everything within the {}, including the braces themselves, can be replaced with a *, which signifies that the preceding item will be matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character after any leading and before any trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a period
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm always does at least O(n) work, because the split generates n strings. This means that even FAILING strings cost at least n operations before they are rejected, whereas a character-by-character check can stop at the first invalid character.
Hopefully this clarifies why you were having performance issues.
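The {MIN,MAX} quantifier discussed above behaves the same way in most regex engines, so it can be tried quickly in Python. The sketch below is an illustration only: make_checker is a made-up helper, and it mirrors the single-regex idea (whitespace handling folded into the pattern) rather than any R API.

```python
import re


def make_checker(min_size=2, max_size=None):
    """Build a validator for strings of allowed characters with optional surrounding whitespace."""
    upper = "" if max_size is None else str(max_size)  # empty upper bound means "no maximum"
    pattern = re.compile(r"\s*[a-z!.?]{%d,%s}\s*" % (min_size, upper))
    return lambda word: pattern.fullmatch(word) is not None


check = make_checker(min_size=2)
words = [" d ", "!hello ", "nA!", " asdf.!", " d d "]
print([check(w) for w in words])
```

Passing max_size adds the upper bound, for example make_checker(2, 3) accepts "abc" but rejects "abcd".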