Generalized regex from a set of strings

I need to automatically generate a regex that matches every string in a given set.
For example, given the input set:
S = ["Casino Royale (1928)", "Mission Goldfinger", "A view to a kill"]
start by creating a regex that matches the first string:
regex1 = "\w{6}\s\w{6}\s\(\d{4}\)"
then generalize regex1 so that it also matches the second string:
regex2 = "\w{6,7}\s\w{6,10}(\s\(\d{4}\))?"
and then do the same with the last string, so the final output is:
regex_output = "\w{1,7}\s\w{4,10}(\s\w{2}\s\w\s\w{4}|\s\(\d{4}\))?"
I would like to know whether this is feasible. Maybe it is a complexity-theory problem.
Thanks in advance.

Use an alternation of literals:
^(?:\QCasino Royale (1928)\E|\QMission Goldfinger\E|\QA view to a kill\E)$
\Q...\E means that the enclosed characters are matched literally. The non-capturing group makes the ^ and $ anchors apply to every alternative, not just the first and the last.
This approach can of course handle an arbitrarily large list of strings.
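The alternation-of-literals idea can be sketched in Python, whose re module does not support \Q...\E; re.escape plays the same role here. The function name literal_alternation is hypothetical, and this is only a minimal sketch of the answer's approach, not a generalizing regex learner.

```python
import re

def literal_alternation(strings):
    """Build an anchored regex that matches exactly the given strings."""
    # re.escape neutralises metacharacters such as ( and ) in each literal.
    body = "|".join(re.escape(s) for s in strings)
    # The non-capturing group keeps ^ and $ bound to every alternative.
    return re.compile(r"^(?:" + body + r")$")

titles = ["Casino Royale (1928)", "Mission Goldfinger", "A view to a kill"]
pattern = literal_alternation(titles)

print(all(pattern.match(t) for t in titles))   # every input string matches
print(pattern.match("Casino Royale") is None)  # a partial title does not
```

Note that such a pattern only recognises the strings it was built from; it does not generalize to unseen titles the way the \w{...} patterns in the question would.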

How to get all sub-strings of a specific format from a string

I have a large string and I want to get all sub-strings of format [[someword]] from it.
Meaning, get all words (list) which are wrapped in opening and closing square brackets.
One way to do this is to split the string on spaces and then filter the list, but the problem is that sometimes [[someword]] does not stand alone as a word: it might have a comma, a space, or a period right before or after it.
What is the best way to do this?
I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.
This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English letters between the brackets.
You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.
Here is a Python solution,
import re
s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))
Prints,
['someword', 'some other word']
I have never worked in Scala, but here is a solution in Java; since Scala runs on the JVM and can call Java classes, this may help.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}
Prints,
someword
some other word
Let me know if this is what you were looking for.
Scala solution:
val text = "[[someword1]] test [[someword2]] test 1231"
// Capture everything between the double brackets; [^\[\]]+ also accepts digits, spaces and non-English letters
val pattern = "\\[\\[([^\\[\\]]+)\\]\\]".r
val values = pattern
  .findAllIn(text)
  .matchData
  .map(_.group(1)) // take the first capture group
  .toList
println(values) // List(someword1, someword2)

RegEx Parse Tool to extract digits from string

Using Alteryx, I have a field called Address which contains values like A32C, GH2X, ABC19E; basically, the digits are sandwiched between sets of letters. I am trying to use the RegEx tool to extract the digits into a new column called ADDRESS1.
I have Address set to Field to Parse. Output method Parse.
My regular expression is typed in as:
(?:[[alpha]]+)(/d+)(?:[[alpha]]+)
And then I have (/d+) outputting to ADDRESS1. However, when I run this it parses 0 records. What am I doing wrong?
To match a digit, use [0-9] or \d. To match a letter, use [[:alpha:]].
Use
[[:alpha:]]+(\d+)[[:alpha:]]+
You can try this:
// Match digits that are preceded and followed by a letter
let regex = /(?<=[A-Z])\d+(?=[A-Z])/g;
let values = 'A32CZ, GH2X, ABC19E';
let result = values.match(regex);
console.log(result); // ["32", "2", "19"]
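For comparison, here is a minimal Python sketch of the same extraction. Python's re module does not support POSIX classes such as [[:alpha:]], so [A-Za-z] is used instead; the helper name extract_digits and the sample list are assumptions for illustration only.

```python
import re

# Hypothetical sample values taken from the question text.
addresses = ["A32C", "GH2X", "ABC19E"]

# Letters, then captured digits, then letters.
pattern = re.compile(r"[A-Za-z]+(\d+)[A-Za-z]+")

def extract_digits(value):
    """Return the digits between the letter runs, or None if the value does not fit."""
    m = pattern.fullmatch(value)
    return m.group(1) if m else None

print([extract_digits(a) for a in addresses])  # ['32', '2', '19']
```

Using fullmatch means a value with no surrounding letters yields None instead of a partial match.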

Regex match a string and allow specific character to appear randomly

I want to extract a portion of a string, allowing for the dash character to appear randomly throughout. In my match, I want the dash character occurrences to be included.
Let's say I have a scenario like so:
haystack = "arandomse-que-nce"
needle = "sequence"
and I want to come out on the other end with a string like se-que-nce. In this case, what would the regex pattern look like?
I would split the string and then join by -*; for example, in JavaScript:
var needle = "sequence"
var regex = new RegExp(needle.split('').join('-*'))
var result = "arandomse-que-nce".match(regex) // ["se-que-nce"]
var result2 = "a-bad-sequ_ence".match(regex) // null
You could also use a regex to insert -* between each character:
var regex = new RegExp(needle.replace(/(?!$|^)/g, '-*'))
Both the split/join method and the replace method return 's-*e-*q-*u-*e-*n-*c-*e' for the regex.
If your string contains characters such as * that have a special meaning in regular expressions, escape each character before joining, so that the inserted -* separators are not escaped as well:
var regex = new RegExp(needle.split('')
    .map(function (c) { return c.replace(/[-\\^$*+?.()|[\]{}]/, '\\$&'); })
    .join('-*'))
Then, if needle was 1+1, for example, it would give you 1-*\+-*1 for the regex.
s-*e-*q-*u-*e-*n-*c-*e-*
This assumes that multiple hyphens in a row are okay.
EDIT: Doorknob's split/join solution is good, but be aware that it only works for characters that aren't special characters (*, +, etc.).
I don't know what the specifications are, but if there are special characters, make sure to escape them (and only them, since a backslash before a letter such as s or n changes its meaning):
new RegExp(needle.split('').map(function(c) { return c.replace(/[-\\^$*+?.()|[\]{}]/, '\\$&'); }).join('-*'))
You could try to use:
s-?e-?q-?u-?e-?n-?c-?e
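The split-and-join approach from the answers above can be sketched in Python as follows. The function name dashed_pattern is hypothetical; re.escape is applied to each character separately so that the -* separators themselves are never escaped.

```python
import re

def dashed_pattern(needle):
    """Allow any number of dashes between consecutive characters of needle."""
    # Escape each character on its own so the "-*" separators stay unescaped.
    return re.compile("-*".join(re.escape(c) for c in needle))

pattern = dashed_pattern("sequence")

m = pattern.search("arandomse-que-nce")
print(m.group() if m else None)           # se-que-nce
print(pattern.search("a-bad-sequ_ence"))  # None
print(dashed_pattern("1+1").pattern)      # 1-*\+-*1
```

Replacing "-*" with "-?" would give the stricter variant that allows at most one dash between characters.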

Regex to remove characters up to a certain point in a string

How do I use regex to convert
11111aA$xx1111xxdj$%%
to
aA$xx1111xxdj$%%
So, in other words, I want to remove (or match) the FIRST grouping of 1's.
Depending on the language, you should have a way to replace a string by regex. In Java, you can do it like this:
String s = "11111aA$xx1111xxdj$%%";
String res = s.replaceAll("^1+", "");
The ^ "anchor" indicates that the beginning of the input must be matched. The 1+ means a sequence of one or more 1 characters.
The same program in C#:
var rx = new Regex("^1+");
var s = "11111aA$xx1111xxdj$%%";
var res = rx.Replace(s, "");
Console.WriteLine(res);
In general, if you would like to match something only at the beginning of a string, add a ^ prefix to your expression; similarly, adding a $ at the end makes the expression match only at the end of your input.
If this is the beginning, you can use this:
^[1]*
As far as replacing, it depends on the language. In PowerShell, I would do this (single quotes keep $xx1111xxdj from being expanded as a variable):
[regex]::Replace('11111aA$xx1111xxdj$%%', '^[1]*', '')
This will return:
aA$xx1111xxdj$%%
If you only want to replace consecutive "1"s at the beginning of the string, replace the following with an empty string:
^1+
If the consecutive "1"s won't necessarily be the first characters in the string (but you still only want to replace one group), replace the following with the contents of the first capture group (usually \1 or $1):
1+(.*)
Note that this is only necessary if you only have a "replace all" capability available to you, but most regex implementations also provide a way to replace only one instance of a match, in which case you could just replace 1+ with an empty string.
I'm not sure, but you can try this:
[^1](\w*\d*\W)* - for the sample input, matches everything after the leading 1s as a single match
In JavaScript:
var str = '11111aA$xx1111xxdj$%%';
var patt = /^1+/g;
str = str.replace(patt,"");
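The two replacement strategies discussed above (anchored, and "first group only") can be sketched in Python. The second input string is a made-up example, and count=1 is Python's way of limiting re.sub to a single replacement.

```python
import re

s = "11111aA$xx1111xxdj$%%"

# ^1+ is anchored, so only the leading run of 1s is removed.
leading_removed = re.sub(r"^1+", "", s)
print(leading_removed)  # aA$xx1111xxdj$%%

# Unanchored variant: count=1 limits the substitution to the first run of 1s,
# wherever it occurs ("ab111cd111" is a hypothetical input).
first_run_removed = re.sub(r"1+", "", "ab111cd111", count=1)
print(first_run_removed)  # abcd111
```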

Regular expression to match if all given words are in a string

Say I have a query like this: "one two three". If I replace the spaces with | (pipe character), I can match a string if it contains one or more of those words. This is like a logical OR.
Is there something similar that does a logical AND? It should match regardless of word ordering as long as all the words are present in the string.
Unfortunately I'm away from my Mastering Regular Expressions book :(
Edit: I'm using JavaScript and the query can contain any number of words.
Try look-ahead assertions:
(?=.*one)(?=.*two)(?=.*three)
But it would be better if you use three separate regular expressions or simpler string searching operations.
There's nothing really good for that. You could fairly easily match on three occurrences of any of the words involved:
(?:\b(?:one|two|three)\b.*){3}
but that matches "one one one" as easily as "one two three".
You can use lookahead assertions like Gumbo describes. Or you can write out the permutations, like so:
(?:\bone\b.*\btwo\b.*\bthree\b|\btwo\b.*\bone\b.*\bthree\b|\bone\b.*\bthree\b.*\btwo\b|\bthree\b.*\bone\b.*\btwo\b|\bthree\b.*\btwo\b.*\bone\b|\btwo\b.*\bthree\b.*\bone\b)
which is obviously horrible.
Long story short, it's a lot better to do three separate matches.
Do three separate matches.
The only reason to do it in one is if you need to find them in a specific order.
Use this if the string needs to contain at least one of the words in your list:
let string = 'One red, two green and 4 orange';
var words = 'one two three';
words = words.split(' ').join('|');
let pattern = new RegExp(`(?=(.)*?\\b(${words})\\b)((.)+)`, 'g');
if (string.match(pattern) !== null) {
    // string matches at least one of the words
} else {
    // string does not match any of the words
}
Use this if the string needs to contain all the words in your list:
let string = 'One red, two green and 4 orange';
var words = 'one two three';
words = words.split(' ').map(function (value, index) { return '(?=(.)*?\\b(' + value + ')\\b)'; }).join('');
let pattern = new RegExp(`${words}((.)+)`, 'g');
if (string.match(pattern) !== null) {
    // string matches all words
} else {
    // string does not match all words
}
What does it do? It splits the query into an array of words, wraps each word in a lookahead, joins them into the final pattern, and then inserts that pattern into the regex with a JavaScript template literal.
This lets you reuse the approach even if the list of words changes dynamically.
*If you want to test or adjust it, you can do so here: doregex.com
*Note that you cannot replace the backticks ( ` ) with single (') or double (") quotes, because ${...} interpolation only works in template literals.
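The all-words check built from lookaheads can also be sketched in Python. The function name all_words_pattern is hypothetical, and the case-insensitive flag is an assumption about the desired behaviour rather than something the question specifies.

```python
import re

def all_words_pattern(query):
    """One lookahead per word: every word must occur somewhere, in any order."""
    lookaheads = "".join(
        r"(?=.*\b" + re.escape(word) + r"\b)" for word in query.split()
    )
    # re.DOTALL lets .* cross line breaks; re.IGNORECASE is an assumption.
    return re.compile("^" + lookaheads, re.IGNORECASE | re.DOTALL)

pattern = all_words_pattern("one two three")

print(bool(pattern.search("three, then two, then one")))        # True
print(bool(pattern.search("One red, two green and 4 orange")))  # False: "three" is missing
```

Because each lookahead starts from the same position, the order of the words in the input does not matter, and the pattern grows linearly with the number of words instead of requiring every permutation.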