generate all combinations of strings based on template - regex

How can I generate all combinations of strings based on a template?
For example:
- Template string of
"{I|We} want {|2|3|4} {apples|pears}"
The curly braces "{...}" identify a group of words, each word separated by "|".
The class should generate strings with every combination of words within each word group.
I know this relates to finite automata and regex. How can I generate the combinations efficiently?
For example
G[0][j] [want] G[1][j] G[2][j]
G[0] = {I, We}
G[1] = {2, 3, 4}
G[2] = {apples, pears}
First, generate all possible index combinations c = [0..1][0..2][0..1]:
000
001
010
011
020
021
100
101
110
111
120
121
and then for each c replace G[i][j] by G[i][c[i]]
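The index enumeration described above is the Cartesian product of the word groups. Here is a minimal Python sketch of that idea, assuming the template contains only non-nested {a|b|c} groups; the helper name expand_template is made up for illustration.

```python
import itertools
import re


def expand_template(template):
    """Yield every string obtained by picking one word from each {a|b|c} group."""
    # re.split with a capturing group alternates literal text and group bodies:
    # "{I|We} want {2|3}" -> ['', 'I|We', ' want ', '2|3', '']
    parts = re.split(r"\{([^{}]*)\}", template)
    # Odd positions are group bodies (split on "|"); even positions are literals,
    # treated as one-element choice lists.
    choices = [part.split("|") if i % 2 else [part] for i, part in enumerate(parts)]
    for combo in itertools.product(*choices):
        yield "".join(combo)


for sentence in expand_template("{I|We} want {2|3|4} {apples|pears}"):
    print(sentence)
```

An empty alternative such as {|2|3|4} works as well, but it leaves two adjacent spaces in the result unless the surrounding whitespace is normalised afterwards.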

Shell brace expansion (bash)
$ for q in {I,We}\ want\ {2,3,4}\ {apples,pears}; do echo "$q" ; done
I want 2 apples
I want 2 pears
I want 3 apples
I want 3 pears
I want 4 apples
I want 4 pears
We want 2 apples
We want 2 pears
We want 3 apples
We want 3 pears
We want 4 apples
We want 4 pears

The most functional solution to this problem I found so far is the Python module sre_yield.
The goal of sre_yield is to efficiently generate all values that can
match a given regular expression, or count possible matches
efficiently.
Emphasis added by me.
To apply it to your stated problem: Formulate your template as regex pattern and use it in sre_yield to get all possible combinations or count possible matches like this:
import sre_yield
result = set(sre_yield.AllStrings("(I|We) want (|2|3|4) (apples|pears)"))
len(result)
result
Output:
16
{'I want apples',
'I want pears',
'I want 2 apples',
'I want 2 pears',
'I want 3 apples',
'I want 3 pears',
'I want 4 apples',
'I want 4 pears',
'We want apples',
'We want pears',
'We want 2 apples',
'We want 2 pears',
'We want 3 apples',
'We want 3 pears',
'We want 4 apples',
'We want 4 pears'}
PS: Instead of a list as shown on the project page I use a set to avoid duplicates. If this is not what you want go with a list.

The principle is:
Regex -> NFA
NFA -> minimal DFA
DFS-walk through the DFA (collecting all characters)
This principle is implemented, e.g. in RexLex:
DeterministicAutomaton dfa = Pattern.compileGenericAutomaton("(I|We) want (2|3|4)? (apples|pears)")
.toAutomaton(new FromGenericAutomaton.ToMinimalDeterministicAutomaton());
if (dfa.getProperty().isAcyclic()) {
for (String s : dfa.getSamples(1000)) {
System.out.println(s);
}
}
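To illustrate only the last step (the depth-first walk), here is a small Python sketch. It assumes the DFA has already been built and is acyclic; the transition table below is written by hand for the toy pattern (a|b)c? and is not produced by RexLex or any other library.

```python
def enumerate_strings(transitions, start, accepting):
    """Depth-first walk over an acyclic DFA, yielding every accepted string."""
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in accepting:
            yield prefix
        # Push in reverse order so that symbols are explored in sorted order.
        for symbol, target in sorted(transitions.get(state, {}).items(), reverse=True):
            stack.append((target, prefix + symbol))


# Hand-built DFA for the toy pattern (a|b)c? ; the state numbers are arbitrary.
dfa = {
    0: {"a": 1, "b": 1},
    1: {"c": 2},
}
print(list(enumerate_strings(dfa, start=0, accepting={1, 2})))
```

If the automaton had a cycle, this walk would never terminate, which is why the Java snippet checks isAcyclic() first.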

Convert each set of strings {...} into a string array so you have n arrays.
So for "{I|We} want {|2|3|4} {apples|pears}" we would have 4 arrays (the literal "want" counts as a one-element array).
Place each of those arrays into another array. In my example I will call the collection wordSet.
This is Java code, but it's simple enough that you should be able to convert it to any language. I didn't test it, but it should work.
void makeStrings(String[][] wordSet, ArrayList<String> collection) {
makeStrings(wordSet, collection, "", 0, 0);
}
void makeStrings(String[][] wordSet, ArrayList<String> collection, String currString, int x_pos, int y_pos) {
//If there are no more wordsets in the whole set add the string (this means 1 combination is completed)
if (x_pos >= wordSet.length) {
collection.add(currString);
return;
}
//Else if y_pos is out of bounds (meaning no more words within the smaller set {...}), return
else if (y_pos >= wordSet[x_pos].length) {
return;
}
else {
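//Generate 2 new strings, one to send "vertically" and one "horizontally"
//This string accepts the current word at x.y and then moves to the next word subset
//(no separator is added before the first word, to avoid a leading space)
String combo_x = currString.isEmpty() ? wordSet[x_pos][y_pos] : currString + " " + wordSet[x_pos][y_pos];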
makeStrings(wordSet, collection, combo_x, x_pos + 1, 0);
//Create a copy of the string and move to the next string within the same subset
String combo_y = currString;
makeStrings(wordSet, collection, combo_y, x_pos , y_pos + 1);
}
}

Related

Removing Measurement Units from Cell Array

I am trying to remove the units out of a column of cell array data i.e.:
cArray =
time temp
2022-05-10 20:19:43 '167 °F'
2022-05-10 20:19:53 '173 °F'
2022-05-10 20:20:03 '177 °F'
...
2022-06-09 20:18:10 '161 °F'
I have tried str2double but get all NaN.
I have found some info on regexp but don't follow exactly as the example is not the same.
Can anyone help me get the temp column to only read the value i.e.:
cArray =
time temp
2022-05-10 20:19:43 167
2022-05-10 20:19:53 173
2022-05-10 20:20:03 177
...
2022-06-09 20:18:10 161
For some cell array of data
cArray = { ...
1, '123 °F'
2, '234 °F'
3, '345 °F'
};
The easiest option is if we can safely assume the temperature data always starts with numeric values, and you want all of the numeric values. Then we can use regex to match only numbers
temps = regexp( cArray(:,2), '\d+', 'match', 'once' );
The match option causes regexp to return the matching string rather than the index of the match, and once means "stop at the first match" so that we ignore everything after the first non-numeric character.
The pattern '\d+' means "one or more digits". You could expand it to match numbers with a decimal part using '\d+(\.\d+)?' instead if that's a requirement.
Then if you want to actually output numbers, you should use str2double. You could do this in a loop, or use cellfun which is a compact way of achieving the same thing.
temps = cellfun( @str2double, temps, 'uni', 0 ); % 'uni'=0 to retain cell array
Finally you can override the column in cArray
cArray(:,2) = temps;
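For readers who want to compare with another language, the same two steps (extract the leading number with a regex, then convert it) look like this in Python. This is only a sketch: the helper name leading_number and the sample values are made up, and it assumes each value is a string such as '167 °F'.

```python
import re


def leading_number(text):
    """Return the first run of digits (optionally with a decimal part) as a float."""
    match = re.search(r"\d+(\.\d+)?", text)
    # Values without any digits yield None rather than raising an error.
    return float(match.group(0)) if match else None


temps = ["167 °F", "173 °F", "177.5 °F"]
print([leading_number(t) for t in temps])
```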

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space " [0-9]+" which will only replace if the pattern occurs in the middle, not at the start of a string. If it's easier to do this without regular expressions, that's fine; I was just trying to become more familiar with them.
Following up on the loop suggested in the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
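The filtering rule used in the loop (keep only the space-separated words that contain at least one letter) can be sketched in Python as follows. It makes the same assumption as above about which words should survive; the function name is hypothetical.

```python
import re


def keep_words_with_letters(text):
    """Keep only the whitespace-separated words that contain at least one letter."""
    return " ".join(word for word in text.split() if re.search(r"[A-Za-z]", word))


for row in ["9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty"]:
    print(keep_words_with_letters(row))
```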

Gsub Data frame Replace only exact Cell value matches, no substring

The problem I am facing is that I have a dataframe called uniqindex which looks like the following.
S5 1 Below 25
S5 2 25-30
S5 3 31-35
S5 4 36-40
S5 5 41-45
S5 6 46-50
A sample line of the file where I intend to replace the numeric codes with the age ranges looks like -
S5 4 3 5 3 7 4 3 4 4 7
Following is the code that I run
range<-c('S1','S2a','S2b','S4','S5','S5a','S6','S8','S9','Q8')
FinalOut<-NULL
AddColName<-NULL
for (y in range)
{
df<-copytrans1[copytrans1[,1]==as.character(y),]
uniqindex<-index1[index1[,1]==y,]
looptime<-nrow(uniqindex)
for (k in 1:looptime)
{
df <- as.data.frame(lapply(df, FUN = function(x) gsub(uniqindex[k,2],uniqindex[k,3], x)))
}
FinalOut<-rbind(FinalOut,df)
AddColName<-rbind(AddColName,cbind(as.data.frame(y),df))
}
The problem that I face is that as the substitutions run sequentially, this is the output that I get
S5a S5a ageage41_501_40 ageage41_501_40 age41_50 ageage41_501_40 age41_50 ageage41_501_40 ageage41_501_40 ageage41_501_40 ageage41_501_40 age41_50 age41_50
I want to know how can I change my code to only change exact matches. Currently, 1 would be changed to 25-30 and in the second iteration 2 of 25-30 is changed to 25-305-30
To match the one-digit index only as an isolated word rather than within a two-digit age, you can put the symbols \< and \> at the beginning and end of the pattern:
gsub(paste('\\<', uniqindex[k,2], '\\>', sep=''), uniqindex[k,3], x)
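Another way to keep one substitution from touching the result of an earlier one is to do all replacements in a single pass with a lookup table. The sketch below shows this in Python rather than R; the mapping copies the uniqindex rows from the question, and the function name recode is made up.

```python
import re

# Code -> label mapping, copied from the uniqindex rows shown in the question.
codes = {
    "1": "Below 25", "2": "25-30", "3": "31-35",
    "4": "36-40", "5": "41-45", "6": "46-50",
}


def recode(line, mapping):
    """Replace each whole-word code in one pass, so replacements are never rescanned."""
    # \b ... \b matches a code only when it stands alone, e.g. not the 5 in "S5".
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], line)


print(recode("S5 4 3 5 3 7", codes))
```

Codes that are not in the table (the 7 here) are left unchanged.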

Scala Regular Expression Oddity

I have this regular expression:
^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$
To give it a bit of organization, there's really 3 parts:
// Part 1
^(10)(1|0)(.)(.)(.)(.{18})
// Part 2
// Optional Elements that begin with two characters and is terminated by a |
// May appear at most once
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
// Part 3
AY([0-9]*)AZ(.*)$
Part 2 is the part that I'm having trouble with but I believe the current regular expression says any of these given elements will appear one or more times. I could have done something like: (AB.*?|) but I don't need the pipe in my group and wasn't quite sure how to express it.
This is my sample input - it's SIP2 if you've seen it before (please disregard checksum, I know it's not valid):
101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
This is my snippet of Scala code:
val regex = """^(10)(1|0)(.)(.)(.)(.{18})((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+AY([0-9]*)AZ(.*)$""".r
val msg = "101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71"
regex.findFirstMatchIn(msg) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
This is my output:
0 101YNY201406120000091911AOa|ABb|AQc|AJd|CKe|AFf|CSg|CRh|CTi|CVj|CYk|DAl|AY1AZAA71
1 10
2 1
3 Y
4 N
5 Y
6 201406120000091911
7 DAl|
8 ABb
9 AQc
10 AJd
11 AFf
12 CSg
13 CRh
14 CTi
15 CKe
16 CVj
17 CYk
18 DAl
19 AOa
20 1
21 AA71
Note the entry that starts with 7. Can anyone explain why that's there?
I'm using Scala 2.10.4 but I believe regular expressions in Scala simply use Java's regular expressions. I'm certainly open to other suggestions for parsing strings.
EDIT: Based on wingedsubmariner's response, I was able to fix my regular expression:
^(10)(1|0)(.)(.)(.)(.{18})(?:AB([^|]*)\||AQ([^|]*)\||AJ([^|]*)\||AF([^|]*)\||CS([^|]*)\||CR([^|]*)\||CT([^|]*)\||CK([^|]*)\||CV([^|]*)\||CY([^|]*)\||DA([^|]*)\||AO([^|]*)\|)+AY([0-9]*)AZ(.*)$
Basically adding ?: to indicate I was not interested in the group!
You get a matched group for each set of parentheses, the order being the order of the opening parentheses in the regex. Matched group 7 corresponds to the opening parenthesis that begins your "Part 2":
((AB[^|]*)\||(AQ[^|]*)\||(AJ[^|]*)\||(AF[^|]*)\||(CS[^|]*)\||(CR[^|]*)\||(CT[^|]*)\||(CK[^|]*)\||(CV[^|]*)\||(CY[^|]*)\||(DA[^|]*)\||(AO[^|]*)\|)+
^
|
This parenthesis
Each matched group takes on the value of the last part of the text that matched, which in this case is DAl| because it was the last piece of text to match the "Part 2" expression.
Here is a simpler example that demonstrates the behavior:
val regex = """((A)\||(B)\|)+""".r
val msg = "A|B|A|B|"
regex.findFirstMatchIn(msg) match {
case None => println("No match")
case Some(x) =>
for (i <- 0 to x.groupCount) {
println(i + " " + x.group(i))
}
}
Which produces:
0 A|B|A|B|
1 B|
2 A
3 B
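The same "last iteration wins" behaviour can be reproduced in Python's re module, shown here as a small sketch with the same toy pattern. If every repetition is needed, one option is to match the element pattern repeatedly instead of relying on the repeated group.

```python
import re

msg = "A|B|A|B|"

# Group 1 is repeated by "+", so it only keeps the text of its last iteration.
m = re.match(r"((A)\||(B)\|)+", msg)
print(m.group(0))
print(m.groups())

# To collect every repetition, scan for the element pattern on its own.
print(re.findall(r"([AB])\|", msg))
```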

Generating the shortest regex to match an arbitrary word list

I'm hoping someone might know of a script that can take an arbitrary word list and generate the shortest regex that could match that list exactly (and nothing else).
For example, suppose my list is
1231
1233
1234
1236
1238
1247
1256
1258
1259
Then the output should be:
12(3[13468]|47|5[689])
This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm
It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.
For example, this Perl command-line:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"
generates this output:
(?^:(?^:12(?:3[13468]|5[689]|47)))
(assuming you have installed Regexp::Optimizer), which matches the OP's expectation quite well.
Here's another example:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"
And the output:
(?^:(?^:3(?:[1238]|57)4))
For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:
3([1238]|57)4
You are probably better off saving the entire list, or if you want to get fancy, create a Trie:
1231
1234
1247
1
|
2
/ \
3 4
/ \ \
1 4 7
Now when you take a string, check whether it reaches a leaf node. If it does, it's valid.
If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.
You can also use the trie to generate the regex if you really like the regex idea:
Nodes from the root to the first branching are fixed (eg: 12)
Branches create |: (eg: 12(3|4))
Leaf nodes generate a character class (or single character) that follows the parent node: (eg 12(3[14]|47))
This might not generate the shortest regex; to do that you might need some extra work:
"Compact" ranges if you find them (eg: [12345] becomes [1-5])
Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2})
???
I really don't think it's worth it to generate the regex.
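For anyone who wants to try the trie approach anyway, here is a rough Python sketch of the steps listed above. It is not guaranteed to give the shortest regex; the function names are made up, and the empty-string key is just one possible way to mark nodes where a word may end.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the "" key marks a node where a word may end."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie


def trie_to_regex(node):
    """Turn a trie node into a regex fragment matching every suffix below it."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]  # fixed prefix, no grouping needed
    if len(branches) > 1 and all(len(b) == 1 for b in branches):
        body = "[" + "".join(branches) + "]"  # sibling leaves become a character class
    else:
        body = "(?:" + "|".join(branches) + ")"
    # A word may also end at this node, so everything below it is optional.
    return body + "?" if ends_here else body


words = ["1231", "1233", "1234", "1236", "1238", "1247", "1256", "1258", "1259"]
pattern = trie_to_regex(build_trie(words))
print(pattern)
```

The result only shares common prefixes. Merging common suffixes, as Regexp::Optimizer does in the 3(?:[1238]|57)4 example, would need extra work.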
This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy
It does almost the same as the JavaScript solution in this thread, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".
There's room for improvement when there's a row of single characters:
Instead of e.g. (?:3|8|1|6|4) it could generate [38164].
The generated regexp could easily be adapted to other regexp dialects.
Sample usage:
java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))
Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.
I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.
var list = new listChar("");
function listChar(s, p) {
this.char = s;
this.depth = 0;
this.parent = p;
this.add = function(n) {
if (!this.subList) {
this.subList = {};
this.increaseDepth();
}
if (!this.subList[n]) {
this.subList[n] = new listChar(n, this);
}
return this.subList[n];
}
this.toString = function() {
var ret = "";
var subVals = [];
if (this.depth >=1) {
for (var i in this.subList) {
subVals[subVals.length] = this.subList[i].toString();
}
}
if (this.depth === 1 && subVals.length > 1) {
ret = "[" + subVals.join("") + "]";
} else if (this.depth === 1 && subVals.length === 1) {
ret = subVals[0];
} else if (this.depth > 1) {
ret = "(" + subVals.join("|") + ")";
}
return this.char + ret;
}
this.increaseDepth = function() {
this.depth++;
if (this.parent) {
this.parent.increaseDepth();
}
}
}
function wordList(input) {
var listStep = list;
while (input.length > 0) {
var c = input.charAt(0);
listStep = listStep.add(c);
input = input.substring(1);
}
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
wordList(words[i]);
}
document.write(list.toString());
Using
words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];
Here's the output:
(1(2(3[13468]|47|5[689])))