Regular expression: match an incomplete search string - regex

Is there a regular expression for matching a string that is not necessarily complete?
Example:
some other supercalifragilisticexpialidocious random things
and maybe supercalifragilistic meaningless padding
lorem superca ipsum dolor
I would like to match whichever left part of supercalifragilisticexpialidocious there is each time. There are not necessarily spaces around words.
The expected result would be to find:
supercalifragilisticexpialidocious
supercalifragilistic
superca
This is similar to matching the same character any number of times, but more universal.
Thank you!

I know this isn't a regex, but I think your objective can be accomplished better using code. Here's an example of a JavaScript function that matches as much of an input string as it can:
function matchMost(find, string){
    for(var i = 0 ; i < find.length ; i++){
        for(var j = find.length ; j > i ; j--){
            if(string.indexOf(find.substring(i, j)) !== -1){
                return find.substring(i, j);
            }
        }
    }
    return false;
}
For example, if you call matchMost("supercalifragilisticexpialidocious", "lorem superca ipsum dolor"), it will return the string "superca". If string doesn't contain a single character from find, the function will return false.
Here's a JS Fiddle where you can test this code: http://jsfiddle.net/n252eyw1/
UPDATE
This function matches as much of the left side of an input string as it can:
function matchMostLeft(find, string){
    for(var j = find.length ; j > 0 ; j--){
        if(string.indexOf(find.substring(0, j)) !== -1){
            return find.substring(0, j);
        }
    }
    return false;
}
JS Fiddle: http://jsfiddle.net/sjy312ae/

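There is, but it's not tidy at all (and probably not very performant either). As written, this regex matches at least the first 3 characters of the word and at most supercal; the way to extend it should be fairly plain. Note that the first three characters are not wrapped in an optional group, otherwise the pattern would also match the empty string at the start of any input.
sup(?:e(?:r(?:c(?:a(?:l)?)?)?)?)?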
Paul's answer is likely far more useful in the general case.
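As a rough sketch of how that extension could be automated, a pattern of this nested shape can be generated from the search word instead of being typed by hand. The names buildPrefixRegex and escapeRegex and the minLength parameter are made up for this illustration and do not come from the answers here. The regex returns the leftmost match, which is not necessarily the longest one in the text.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a regex matching a prefix of `word`
// that is at least `minLength` characters long (default 1).
function buildPrefixRegex(word, minLength) {
    var min = Math.max(1, Math.min(minLength || 1, word.length));
    var optional = "";
    // Wrap the remaining characters from the inside out: ...(?:a(?:l)?)?
    for (var i = word.length - 1; i >= min; i--) {
        optional = "(?:" + escapeRegex(word.charAt(i)) + optional + ")?";
    }
    return new RegExp(escapeRegex(word.substring(0, min)) + optional);
}

var re = buildPrefixRegex("supercalifragilisticexpialidocious", 3);
var lines = [
    "some other supercalifragilisticexpialidocious random things",
    "and maybe supercalifragilistic meaningless padding",
    "lorem superca ipsum dolor"
];
var found = lines.map(function (line) {
    var m = line.match(re);
    return m ? m[0] : null;
});
console.log(found);
```

Because every group is greedy, the match extends as far as the text agrees with the word and stops at the first character that differs.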

Below is an alternative solution. It is similar to Paul's in that it uses indexOf rather than regular expressions. This also makes it equally case-sensitive. My approach should perform better in exceptional cases where Paul's solution would cause excessive calls to indexOf; typically when:
the needle is very long (worse than supercalifragilisticexpialidocious), and
you have a lot of separate texts to scan, and
the majority of texts either do not match, or contain only short matches.
If this is not the case with you, then please use Paul's solution, as it is clean, simple and readable.
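function getLongestMatchingPrefix(needle, haystack) {
    var len = 0; // length of the longest prefix found so far
    var i = 0;   // absolute position in haystack where the search continues
    // Look for an occurrence of a prefix one character longer than the best so far.
    // indexOf with a start position keeps i an absolute index into haystack.
    while (len < needle.length && (i = haystack.indexOf(needle.substring(0, len + 1), i)) >= 0) {
        // Extend the match at position i for as long as the characters agree
        while (++len < needle.length && haystack.substring(i, i + len + 1) === needle.substring(0, len + 1)) {}
    }
    return needle.substring(0, len);
}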
Fiddle: http://jsfiddle.net/ov26msj5/

Related

How to split 1 long paragraph to 2 shorter paragraphs? Google Document

I want paragraphs to be up to 3 sentences only.
For that, my strategy is to loop on all paragraphs and find the 3rd sentence ending (see note). And then, to add a "\r" char after it.
This is the code I have:
for (var i = 1; i < paragraphs.length; i++) {
    ...
    sentEnds = paragraphs[i].getText().match(/[a-zA-Z0-9_\u0590-\u05fe][.?!](\s|$)|[.?!][.?!](\s|$)/g);
    //this array is used to count sentences in Hebrew/English/digits that end with 1 or more of either ".","?" or "!"
    ...
    if ((sentEnds != null) && (sentEnds.length > 3)) {
        lineBreakAnchor = paragraphs[i].getText().match(/.{10}[.?!](\s)/g);
        paragraphs[i].replaceText(lineBreakAnchor[2],lineBreakAnchor[2] + "\r");
    }
}
This works fine for round 1. But if I run the code again, the text after the inserted "\r" character is not recognized as a new paragraph. Hence, more "\r" characters (new lines) are inserted each time the script runs.
How can I make the script "understand" that "\r" means new, separate paragraph?
OR
Is there another character/approach that will do the trick?
Thank you.
Note: I use the last 10 characters of the sentence assuming the match will be unique enough to make only 1 replacement.
You can achieve this without modifying your own regular expression.
Try this approach to split the paragraphs:
Grab the whole content of the document and create an array of sentences.
Insert paragraphs with up to 3 sentences after original paragraphs.
Remove original paragraphs from hell.
function sentenceMe() {
    var doc = DocumentApp.getActiveDocument();
    var paragraphs = doc.getBody().getParagraphs();
    var sentences = [];
    // Split paragraphs into sentences
    for (var i = 0; i < paragraphs.length; i++) {
        var parText = paragraphs[i].getText();
        // Count sentences in Hebrew/English/digits that end with 1 or more of either ".","?" or "!"
        var sentEnds = parText.match(/[a-zA-Z0-9_\u0590-\u05fe][.?!](\s|$)|[.?!][.?!](\s|$)/g);
        if (sentEnds) {
            for (var j = 0; j < sentEnds.length; j++) {
                var initIdx = 0;
                var sentence = parText.substring(initIdx, parText.indexOf(sentEnds[j]) + 3);
                var parInitIdx = initIdx;
                initIdx = parText.indexOf(sentEnds[j]) + 3;
                parText = parText.substring(initIdx - parInitIdx);
                sentences.push(sentence);
            }
        }
        // console.log(sentences);
    }
    inThrees(doc, paragraphs, sentences);
}
function inThrees(doc, paragraphs, sentences) {
    // define offset
    var offset = paragraphs.length;
    // Create paragraphs with up to 3 sentences
    var k = 0;
    do {
        var parText = sentences.splice(0, 3).join(' ');
        doc.getBody().insertParagraph(k + offset, parText.concat('\n'));
        k++;
    }
    while (sentences.length > 0);
    // Remove paragraphs from hell
    for (var i = 0; i < offset; i++) {
        doc.getBody().removeChild(paragraphs[i]);
    }
}
In case you are wondering about the custom menu, here it is:
function onOpen() {
    var ui = DocumentApp.getUi();
    ui.createMenu('Custom Menu')
        .addItem("3's the magic number", 'sentenceMe')
        .addToUi();
}
References:
DocumentApp.Body.insertParagraph
Actually the detection of sentences is not an easy task.
A sentence does not always end with a dot, a question mark or an exclamation mark. If the sentence ends with a quote then punctuation rules in some countries force you to put the end of the sentence mark inside the quote:
John asked: "Who's there?"
Not every dot marks the end of a sentence. Usually a dot after an uppercase letter does not end the sentence, because it follows an initial. The sentence does not end after J. here:
The latest Star Wars movie has been directed by J.J. Abrams.
However, sometimes the sentence does end after a capital letter followed by a dot:
This project has been sponsored by NASA.
And abbreviations can make it very hard:
For more information check the article in Phys. Rev. Letters 66, 2697, 2013.
With these difficulties in mind, let's still try to find an expression that works in "usual" cases.
Make a global match and substitution. Match
((?:[^.?!]+[.?!] +){3})
and substitute it with
\1\r
This looks for 3 sentences (a sentence is a sequence of not-dot, not-?, not-! characters followed by a dot, a ? or a ! and some spaces) and puts a \r after them.
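To see what this pattern does outside of Google Docs, here is a small sketch on a plain JavaScript string. The function name splitEveryThreeSentences is made up for the illustration, and it assumes the text contains no "\r" characters of its own. In JavaScript the backreference in the replacement is written $1 rather than \1.

```javascript
// Hypothetical helper: insert "\r" after every run of three sentences,
// then split on it to get one string per paragraph.
function splitEveryThreeSentences(text) {
    return text
        .replace(/((?:[^.?!]+[.?!] +){3})/g, "$1\r")
        .split("\r")
        .map(function (part) { return part.trim(); })
        .filter(function (part) { return part.length > 0; });
}

console.log(splitEveryThreeSentences("One. Two? Three! Four. Five."));
```

The pattern requires at least one space after each closing mark, so a final run of three sentences with nothing after it is left in one piece, which is what you want here.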
UPDATED 2020-03-04
Try this:
var regex = new RegExp('((?:[a-zA-Z0-9_\\u0590-\\u05fe\\s]+[.?!]+\\s+){3})', 'gi');
for (var i = 1; i < paragraphs.length; i++) {
    paragraphs[i].replaceText(regex, '$1\\r');
}

RegExp JS regarding sequential pattern matching

P.S.: I know there is an easy solution to my needs, and I could do it that way, but I am looking for a different solution for the sake of learning and the challenge. So this is just about solving an algorithm in a less traditional way.
I am working on an algorithm and thought I had everything working well, but one use case is failing. That is because I am building a regexp dynamically. My issue is this:
I need to match letters sequentially up until one doesn't match, and then keep only what did match sequentially.
So, let's say I was matching this:
"zaazizz"
with this: /\bz[a]?[z]?/
"zizzi".match(/\bz[z]?[i]?/)
Currently, that matches [zi], but the match should only be [z].
zzi only matches "z" from the front of "zizzi", in that order (z, z, i). I know I am using [z]? etc., so each letter is optional, but what I really need is to match sequentially. I'd only get "zi" if, from the front, it matched zzi per my regex. So I need some sort of lookahead or similar. I tried ?= and != with no luck.
I still think a non-regex-approach is best here. Have a look at the following JS-Code:
var match = "abcdef";
var input = "abcxdef";
var mArray = match.split("");
var inArray = input.split("");
// Compare up to the length of the shorter string (no "- 1", or the last character is never checked)
var max = Math.min(mArray.length, inArray.length);
for (var i = 0; i < max; i++) {
    if (mArray[i] != inArray[i]) { break; }
}
var result = input.substring(0, i); // "abc"
Where match is the string to be partially matched, input is the input and input.substring(0, i) is the result of the matching part. And you can change match as often as you like.

Regex for parenthesis (JavaScript)

This is the regexp I created so far:
\((.+?)\)
This is my test string: (2+2) + (2+3*(2+3))
The matches I get are:
(2+2)
And
(2+3*(2+3)
I want my matches to be:
(2+2)
And
(2+3*(2+3))
How should I modify my regular expression?
You cannot parse arbitrarily nested parenthesized expressions with a regular expression.
There is a mathematical proof that regular expressions can't do this.
Parenthesized expressions form a context-free language, and can thus be recognized by pushdown automata (stack machines).
You can, however, define a regular expression that works on any expression with nesting depth up to N, for an arbitrary finite N (even though the expression gets complex).
You just need to acknowledge that your parentheses might themselves contain an arbitrary number of further parenthesized groups.
\(([^()]+(\([^)]+\)[^)]*)*)\)
It works like this:
(1) \(([^()]+ matches an open parenthesis, followed by whatever is not a parenthesis;
(2) (\([^)]+\)[^)]*)* optionally, there may be another group, formed by an open parenthesis, with something inside it, followed by a matching closing parenthesis. Some other non-parenthesis characters may follow. This can be repeated an arbitrary number of times. Anyway, at last, there must be
(3) )\) another closing parenthesis, which matches the first one.
This should work for nesting depth 2. If you want nesting depth 3, you have to recurse further, allowing each of the groups described at point (2) to contain a nested parenthesized group.
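The "recurse further" step can be automated. The sketch below generates a pattern for a chosen maximum nesting depth. It is a variation on the idea, not the exact pattern given above, and the names buildNestedParenRegex and maxDepth are made up for the illustration.

```javascript
// Hypothetical helper: build a global regex that matches parenthesized groups
// nested at most `maxDepth` levels deep.
function buildNestedParenRegex(maxDepth) {
    // Depth 1: the content may not contain any parentheses at all.
    var inner = "[^()]*";
    // Each further level allows either a non-parenthesis character
    // or a complete group of the previous depth.
    for (var depth = 2; depth <= maxDepth; depth++) {
        inner = "(?:[^()]|\\(" + inner + "\\))*";
    }
    return new RegExp("\\(" + inner + "\\)", "g");
}

var expression = "(2+2) + (2+3*(2+3))";
console.log(expression.match(buildNestedParenRegex(2)));
```

With maxDepth set to 1 the same input only yields the two innermost groups, which shows why the depth has to be fixed in advance.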
Things will get much easier if you use a stack. Such as:
// input is the string to scan
var foundMatches = [];
var mStack = []; // indices of the "(" characters that are still open
for (var idx = 0; idx < input.length; idx++) {
    var ch = input.charAt(idx);
    if (ch === "(") {
        mStack.push(idx);
    } else if (ch === ")" && mStack.length > 0) {
        // The top of the stack is the "(" that this ")" closes
        var startIdx = mStack.pop();
        if (mStack.length === 0) {
            // The stack is empty again: an outermost group has just been closed
            foundMatches.push(input.substring(startIdx, idx + 1));
        }
    }
}
How about you parse it yourself using a loop without the help of regex?
Here is one simple way:
You would have to have a variable, say "level", which keeps track of how many open parentheses you have come across so far (initialize it with a 0).
You would also need a string buffer to contain each of your matches (e.g. (2+2) or (2+3*(2+3))).
Finally, you would need somewhere you can dump the contents of your buffer into whenever you finish reading a match.
As you read the string character by character, you would increment level by 1 when you come across "(", and decrement by 1 when you come across ")". You would then put the character into the buffer.
When you come across ")" AND the level happens to hit 0, that is when you know you have a match. This is when you would dump the contents of the buffer and continue.
This method assumes that whenever you have a "(" there will always be a corresponding ")" in the input string. This method will handle arbitrary number of parentheses.
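The steps above can be sketched as follows. The function name extractTopLevelGroups is made up for the illustration, and one detail is an assumption on my part: characters are only put into the buffer while level is above 0, so that text between groups (such as " + ") does not end up in a match.

```javascript
// Hypothetical helper: collect every outermost parenthesized group of `input`.
function extractTopLevelGroups(input) {
    var matches = []; // where finished buffers are dumped
    var buffer = "";  // characters of the group currently being read
    var level = 0;    // number of "(" that are still open
    for (var i = 0; i < input.length; i++) {
        var ch = input.charAt(i);
        if (ch === "(") {
            level++;
        }
        if (level > 0) {
            buffer += ch; // assumption: only buffer while inside a group
        }
        if (ch === ")" && level > 0) {
            level--;
            if (level === 0) {
                // The outermost group has just been closed
                matches.push(buffer);
                buffer = "";
            }
        }
    }
    return matches;
}

console.log(extractTopLevelGroups("(2+2) + (2+3*(2+3))"));
```

A stray ")" at level 0 is ignored here, and an unclosed "(" simply never produces a match.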

Is it possible to generate a (compact) regular expression for an anagram of an arbitrary string?

Problem: write a program in any language which, given a string of characters, generates a regex that matches any anagram of the input string. For all input strings longer than some length N, the regex must be shorter than the "brute force" solution listing all possible anagrams separated by "|", and the length of the regex should grow "slowly" as the input string grows (ideally linearly, but possibly n ln n).
Can you do it? I've tried, but my attempts are so far from succeeding, that I'm beginning to doubt it's possible. The only reason I ask is I thought I had seen a solution on another site, but much pointless googling failed to uncover it a second time.
I think this JavaScript code will work according to your specifications. The regex length increases linearly with the length of the input. It generates a regex which uses positive lookaheads to match any anagram of the input string. The lookahead part of the regex makes sure that every character occurs in the test string at least as often as in the input, regardless of order, and the matching part ensures that the test string has the same length as the input string (for which the regex is constructed). Together, the two conditions force the character counts to be identical.
function anagramRegexGenerator(input) {
    var lookaheadPart = '';
    var matchingPart = '^';
    var positiveLookaheadPrefix = '(?=';
    var positiveLookaheadSuffix = ')';
    var inputCharacterFrequencyMap = {};
    // Count how often each character occurs in the input
    for (var i = 0; i < input.length; i++) {
        if (!inputCharacterFrequencyMap[input[i]]) {
            inputCharacterFrequencyMap[input[i]] = 1;
        } else {
            ++inputCharacterFrequencyMap[input[i]];
        }
    }
    // One lookahead per distinct character, repeated once per occurrence
    for (var j in inputCharacterFrequencyMap) {
        lookaheadPart += positiveLookaheadPrefix;
        for (var k = 0; k < inputCharacterFrequencyMap[j]; k++) {
            lookaheadPart += '.*';
            if (j == ' ') {
                lookaheadPart += '\\s';
            } else {
                // Escape regex metacharacters so that e.g. "." or "*" is matched literally
                lookaheadPart += j.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
            }
            matchingPart += '.';
        }
        lookaheadPart += positiveLookaheadSuffix;
    }
    matchingPart += '$';
    return lookaheadPart + matchingPart;
}
Sample input and output is the following
anagramRegexGenerator('aaadaaccc')
//generates the following string.
"(?=.*a.*a.*a.*a.*a)(?=.*d)(?=.*c.*c.*c)^.........$"
anagramRegexGenerator('abcdef ghij');
//generates the following string.
"(?=.*a)(?=.*b)(?=.*c)(?=.*d)(?=.*e)(?=.*f)(?=.*\s)(?=.*g)(?=.*h)(?=.*i)(?=.*j)^...........$"
//test run returns true
/(?=.*a)(?=.*b)(?=.*c)(?=.*d)(?=.*e)(?=.*f)(?=.*\s)(?=.*g)(?=.*h)(?=.*i)(?=.*j)^...........$/.test('acdbefghij ')
//or using the RegExp object
//this returns true
new RegExp(anagramRegexGenerator('abcdef ghij')).test('acdbefghij ')
//this returns false
new RegExp(anagramRegexGenerator('abcdef ghij')).test('acdbefghijj')

Regex Split after 20 characters

I have a fixed width text file where each field is given 20 characters total. Usually only 5 characters are used and then there is trailing whitespace. I'd like to use the Split function to extract the data, rather than the Match function. Can someone help me with a regex for this? Thanks in advance.
I would do this with string manipulation, rather than regex. If you're using JavaScript:
var results = [];
for (var i = 0; i < input.length; i += 20) {
    results.push(input.substring(i, i + 20));
}
Or to trim the whitespace:
var results = [];
for (var i = 0; i < input.length; i += 20) {
    results.push(input.substring(i, i + 20).replace(/^\s+|\s+$/g, ''));
}
If you must use regex, it should just be something like .{20}.
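As a sketch of how such a pattern might be used in JavaScript, the example below matches the fields rather than splitting on them. The function name splitFixedWidth and its width parameter are made up for the illustration. It uses .{1,20} instead of .{20} so that a shorter final field is not lost, and it assumes the input is a single line, since "." does not match line breaks.

```javascript
// Hypothetical helper: cut `input` into chunks of `width` characters
// and strip the padding from each chunk.
function splitFixedWidth(input, width) {
    var pattern = new RegExp(".{1," + width + "}", "g");
    var chunks = input.match(pattern) || []; // match returns null for an empty string
    return chunks.map(function (chunk) { return chunk.trim(); });
}

var record = "abcde".padEnd(20) + "fghij".padEnd(20) + "klm";
console.log(splitFixedWidth(record, 20));
```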
Split on whitespace and take the first returned element. This assumes that there is no whitespace within the actual data.
cheers
If you must:
^(.{20})(.{20})(.{20})$ // repeat the part in parentheses for each field
You still need to trim each field to remove trailing whitespace.
It seems simpler to use substr() or your language's equivalent. Or in PHP you could use str_split($string, 20).