String matching with multiple prefixes

String matching with multiple prefixes - regex

This is an algorithmic question; any pseudocode/verbal explanation will do (although a Python solution outline would be ideal).
Let's have a query word A, for example pity. And let's have a set of other strings, Bs, each of which is composed of one or more space-delimited words: pious teddy, piston tank yard, pesky industrial strength tylenol, oh pity is me! etc.
The goal is to identify those strings B from which A could be constructed. Here "constructed" means we could take prefixes of one or more words in B, in order, and concatenate them together to get A.
Example:
pity = piston tank yard
pity = pesky industrial strength tylenol
pity = oh pity is me!
On the other hand, pious teddy should not be identified, because there's no way we can take prefixes of the words pious and teddy and concatenate them into pity.
The checking should be fast (ideally some regexp), because the set of strings B is potentially large.

You can use a pattern of \bp(?:i|\w*(?>\h+\w+)*?\h+i)(?:t|\w*(?>\h+\w+)*?\h+t)(?:y|\w*(?>\h+\w+)*?\h+y) to match those words. It assumes spaces to be used as word separators. This is rather easy to construct, just take the first letter of your word to be matched, then loop over the others and construct (?:[letter]|\w*(?>\h+\w+)*?\h+[letter]) from them.
This pattern is basically an unrolled version of \bp(?:i|.*?\bi)(?:t|.*?\bt)(?:y|.*?\by), which matters the second to last letters either as exactly next letter or first letter (because of the word boundary) of a next word.
You can see it in action here: https://regex101.com/r/r3ZVNE/2
I have added the last sample as non-matching one for some tests I did with atomic groups.
In Delphi I would do it like this:
program ProjectTest;
uses
System.SysUtils,
RegularExpressions;
procedure CheckSpecialMatches(const Matchword: string; const CheckList: array of string);
var
I: Integer;
Pat, C: string;
RegEx: TRegEx;
begin
assert(Matchword.Length > 0);
Pat := '\b' + Matchword[1];
for I := Low(Matchword) + 1 to High(Matchword) do
Pat := Pat + Format('(?:%0:s|\w*(?>\h+\w+)*?\h+%0:s)', [Matchword[I]]);
RegEx := TRegEx.Create(Pat, [roCompiled, roIgnoreCase]);
for C in CheckList do
if Regex.IsMatch(C) then
WriteLn(C);
end;
const
CheckList: array[0..3] of string = (
'pious teddy',
'piston tank yard',
'pesky industrial strength tylenol',
'prison is ity');
Matchword = 'pity';
begin
CheckSpecialMatches(Matchword, CheckList);
ReadLn;
end
.

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space where ever there is any of the characters
-:\*_/;, present for example JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should beJET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some words exceptions like a/c w/d etc. \D conditions given to avoid date/time values getting separated, but this created an issue, the numbers followed by the above mentioned characters never get split.
My requirement is
1. Insert a space after characters -:\*_/;,
2. but date and time should not get split which may have / :
3. need exception on words like a/c w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function

I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need to use lookahead and lookbehind assertions, the latter of which Microsoft has finally added support for.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?

Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'

In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.

As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'

Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Matlab Extracting sub string from cell array

I have a '3 x 1' cell array the contents of which appear like the following:
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'
I want to be able to elegantly extract and create another '3x1' cell array with contents as:
'NEWYORK'
'WASHINGTON'
'BERLIN'
If you notice in above the NAME's are after the last underscore and before the first SPACE or '_ML'. How do I write such code in a concise manner.
Thanks
Edit:
Sorry guys I should have used a better example. I have it corrected now.

You can use lookbehind for _ and lookahead for space:
names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once');
Where A is the cell array containing the strings:
A = {...
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'};
>> names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once')
names =
'NEWYORK'
'WASHINGTON'
'BERLIN'

NOTE: The question was changed, so the answer is no longer complete, but hopefully the regexp example is still useful.
Try regexp like this:
names = regexp(fullNamesCell,'_(NAME\d?)\s','tokens');
names = cellfun(#(x)(x{1}),names)
In the pattern _(NAME\d?)\s, the parenthesis define a subexpression, which will be returned as a token (a portion of matched text). The \d? specifies zero or one digits, but you could use \d{1} for exactly one digit or \d{1,3} if you expect between 1 and 3 digits. The \s specified whitespace.
The reorganization of names is a little convoluted, but when you use regexp with a cell input and tokens you get a cell of cells that needs some reformatting for your purposes.

How to efficiently implement regular expression like .a.b.*?

I want to matching of filenames like Colibri does. I tried to solve it by regular expressions.
Searching in Colibri works that You type characters which are in order inside of a filename and it finds all files with these characters in order in the filename. For example for "ab" it finds "cabal", "ab", and "achab".
Simple insertion of .* between letters works (so searched string "ab" becomes regular expression .*a.*b.*), but I want to make it on large amount of files.
So far I have O(N*???), where N is amount of filenames and ??? is at best linear complexity (I assume my language uses NFA). I don't care about space complexity that much. What data structures or algorithms should I choose to make it more efficient (in time complexity)?

If you just want to check whether the characters of a search string search are contained in another string str in the same order, you could use this simple algorithm:
pos := -1
for each character in search do
pos := indexOf(str, character, pos+1)
if pos is -1 then
break
endif
endfor
return pos
This algorithm returns the offset of the last character of search in str and -1 otherwise. Its runtime is in O(n) (you could replace indexOf by a simple while loop that compares the characters in str from pos to Length(str)-1 and returns either the offset or -1).

It will greatly improve your efficiency if you replace the . with a character negation. i.e.
[^a]*a[^b]*b.*
This way you have much less backtracking. See This Reference
Edit* #yi_H you are right, this regex would probably serve just as well:
a[^b]*b

Your . is unnecessary. You'll get better performance if you simply transform "abc"
into ^[^a]*a[^b]*b[^c]*c.
string exp = "^";
foreach (char c in inputString)
{
string s = Regex.Escape (c.ToString()); // escape `.` as `\.`
exp += "[^" + s + "]*" + s; // replace `a` with `[^a]*a`
}
Regex regex = new Regex (exp, RegexOptions.IgnoreCase);
foreach (string fileName in fileNames)
{
if (regex.IsMatch (fileName))
yield return fileName;
}

For a limited character-set it might make sense to create lookup table which contains an array or linked list of matching filenames.
If your ABC contains X characters then the "1 length" lookup table will contain X table entries, if it is a "2 length" table it will contain X^2 entries and so on. The 2 length table will contain for each entry ("ab", "qx") all the files which which have those letters in that order. When searching for longer input "string" look up the appropriate entry and do the search on those entries.
Note: calculate the needed extra memory and measure the speed improvement (compared to full table scan), the benefits depend on the data set.

Regular expressions, what a trouble!

I need your kind help to resolve this question.
I state that I am not able to use regolar expressions with Oracle PL/SQL, but I promise that I'll study them ASAP!
Please suppose you have a table with a column called MY_COLUMN of type VARCHAR2(4000).
This colums is populated as follows:
Description of first no.;00123457;Description of 2nd number;91399399119;Third Descr.;13456
You can see that the strings are composed by couple of numbers (which may begin with zero), and strings (containing all alphanumeric characters, and also dot, ', /, \, and so on):
Description1;Number1;Description2;Number2;Description3;Number3;......;DescriptionN;NumberN
Of course, N is not known, this means that the number of couples for every record can vary from record to record.
In every couple the first element is always the number (which may begin with zero, I repeat), and the second element is the string.
The field separator is ALWAYS semicolon (;).
I would like to transform the numbers as follows:
00123457 ===> 001-23457
91399399119 ===> 913-99399119
13456 ===> 134-56
This means, after the first three digits of the number, I need to put a dash "-"
How can I achieve this using regular expressions?
Thank you in advance for your kind cooperation!

I don't know Oracle/PL/SQL, but I can provide a regex:
([[:digit:]]{3})([[:digit:]]+)
matches a number of at least four digits and remembers the first three separately from the rest.
RegexBuddy constructs the following code snippet from this:
DECLARE
result VARCHAR2(255);
BEGIN
result := REGEXP_REPLACE(subject, '([[:digit:]]{3})([[:digit:]]+)', '\1-\2', 1, 0, 'c');
END;
If you need to make sure that those numbers are always directly surrounded by ;, you can alter this slightly:
(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)
However, this will not work if two numbers can directly follow each other (12345;67890 will only match the first number). If that's not a problem, use
result := REGEXP_REPLACE(subject, '(^|;)([[:digit:]]{3})([[:digit:]]+)(;|$)', '\1\2-\3\4', 1, 0, 'c');

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

String matching with multiple prefixes - regex

Related

Regex to insert space with certain characters but avoid date and time

Find group of strings starting and ending by a character using regular expression

Matlab Extracting sub string from cell array

How to efficiently implement regular expression like .a.b.*?

Regular expressions, what a trouble!

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

String matching with multiple prefixes - regex

Related

Regex to insert space with certain characters but avoid date and time

Find group of strings starting and ending by a character using regular expression

Matlab Extracting sub string from cell array

How to efficiently implement regular expression like .*a.*b.*?

Regular expressions, what a trouble!

Categories

Resources

How to efficiently implement regular expression like .a.b.*?