Finding permutations using regular expressions - regex

I need to create a regular expression (for a program in Haskell) that matches strings made up of "X" and ".", containing exactly four "X" and exactly one ".". It must not match strings with any other ratio of X to dot.
I have thought about something like
[X\.]{5}
But it also matches "XXXXX" or ".....", so it isn't what I need.

That's called permutation parsing, and while "pure" regular expressions can't parse permutations it's possible if your regex engine supports lookahead. (See this answer for an example.)
However I find the regex in the linked answer difficult to understand. It's cleaner in my opinion to use a library designed for permutation parsing, such as megaparsec.
You use the Text.Megaparsec.Perm module by building a PermParser in a quasi-Applicative style using the <||> operator, then converting it into a regular MonadParsec action using makePermParser.
So here's a parser which recognises any combination of four Xs and one .:
import Control.Applicative
import Data.Ord
import Data.List
import Text.Megaparsec
import Text.Megaparsec.Perm

fourXoneDot :: Parsec Dec String String
fourXoneDot = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
  where
    mkFive a b c d e = [a, b, c, d, e]
    x = char 'X'
    dot = char '.'
I'm applying the mkFive function, which just stuffs its arguments into a five-element list, to four instances of the x parser and one dot, combined with <||>.
ghci> parse fourXoneDot "" "XXXX."
Right "XXXX."
ghci> parse fourXoneDot "" "XX.XX"
Right "XXXX."
ghci> parse fourXoneDot "" "XX.X"
Left {- ... -}
This parser always returns "XXXX." because that's the order I combined the parsers in: I'm mapping mkFive over the five parsers and it doesn't reorder its arguments. If you want the permutation parser to return its input string exactly, the trick is to track the current position within the component parsers, and then sort the output.
fourXoneDotSorted :: Parsec Dec String String
fourXoneDotSorted = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
  where
    mkFive a b c d e = map snd $ sortBy (comparing fst) [a, b, c, d, e]
    x = withPos (char 'X')
    dot = withPos (char '.')
    withPos = liftA2 (,) getPosition
ghci> parse fourXoneDotSorted "" "XX.XX"
Right "XX.XX"
As the megaparsec docs note, the implementation of the Text.Megaparsec.Perm module is based on Parsing Permutation Phrases; the idea is described in detail in the paper and the accompanying slides.

The other answers look quite complicated to me, given that there are only five strings in this language. Here's a perfectly fine and very readable regex for this:
\.XXXX|X\.XXX|XX\.XX|XXX\.X|XXXX\.
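If the number of X's changes, the alternation does not have to be typed by hand. A minimal sketch in Python (used here only for illustration; the helper name permutation_regex is made up for this example) that builds the same alternation from the distinct orderings of the characters:

```python
import re
from itertools import permutations

def permutation_regex(chars):
    """Build an alternation matching every distinct ordering of chars."""
    # set() removes the duplicate orderings caused by the repeated 'X'
    words = sorted(set("".join(p) for p in permutations(chars)))
    # re.escape turns '.' into a literal dot
    return "|".join(re.escape(w) for w in words)

pattern = re.compile(permutation_regex("XXXX."))
print(pattern.pattern)
print(bool(pattern.fullmatch("XX.XX")), bool(pattern.fullmatch("XXXXX")))
```

This only stays readable while the number of distinct orderings is small, which is the case here.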

Are you attached to regex, or did you just end up at regex because this was a question you didn't want to try answering with applicative parsers?
Here's the simplest possible attoparsec implementation I can think of:
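-- Assumes Text input; swap in Data.Attoparsec.ByteString.Char8 for ByteString.
import Data.Attoparsec.Text (Parser, count, inClass, satisfy)
import Data.List (sort)

parseDotXs :: Parser ()
parseDotXs = do
  -- Read exactly five characters, each of which must be '.' or 'X'
  dotXs <- count 5 (satisfy (inClass ".X"))
  -- '.' sorts before 'X', so after sorting the dots form the prefix
  let (dots, xs) = span (== '.') (sort dotXs)
  if length dots == 1 && length xs == 4
    then return ()
    else fail "Mismatch between dots and Xs"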
You may need to adjust slightly depending on your input type.
There are tons of fancy ways to do stuff in applicative parsing land, but there is no rule saying you can't just do things the rock-stupid simple way.

Try the following regex:
(?<=^| )(?=[^. ]*\.)(?=(?:[^X ]*X){4}).{5}(?=$| )
Demo here
If you have one word per string, you can simplify the regex to this one:
^(?=[^. \n]*\.)(?=(?:[^X \n]*X){4}).{5}$
Demo here
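A quick way to gain confidence in the one-word-per-string version is to compare it with a plain counting check on every short string over the two characters. The sketch below uses Python's re module for illustration (an assumption: the target engine may treat lookaheads slightly differently), and the helper name is_four_x_one_dot is hypothetical:

```python
import re
from itertools import product

LOOKAHEAD = re.compile(r"^(?=[^. \n]*\.)(?=(?:[^X \n]*X){4}).{5}$")

def is_four_x_one_dot(s):
    """Reference check by plain counting."""
    return len(s) == 5 and s.count("X") == 4 and s.count(".") == 1

# Every string of length 4, 5 and 6 over the alphabet {X, .}
candidates = ["".join(p) for n in (4, 5, 6) for p in product("X.", repeat=n)]
mismatches = [s for s in candidates
              if bool(LOOKAHEAD.match(s)) != is_four_x_one_dot(s)]
print(mismatches)  # an empty list means the regex and the count agree
```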

Related

Splitting a string at every 2 newline characters in haskell [duplicate]

My input looks like
abc

a
b
c

abc
abc
abc
abc
I need a function that would split it into something like
[ "abc"
, "a\nb\nc"
, "abc\nabc\nabc\nabc"]
I've tried using regexes, but I can't import Text.Regex itself, and the module Text.Regex.Base does not export splitStr.
It's generally a bad idea to use regex in such cases, since it's less readable than the pure and concise code that can be used here.
For example, using foldr, the only case where we should start a new string in the list of strings is when the last seen element and the current element are both newlines:
split :: FilePath -> IO [String]
split path = do
  text <- readFile path
  return $ foldr build [[]] (init text)
  where
    build c (s:ls)
      | s == [] || head s /= '\n' || c /= '\n' = (c:s) : ls
      | otherwise = [] : tail s : ls
This code produces the aforementioned result when it is given file with aforementioned content.
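For comparison, the same split can be sketched in a few lines of Python (used here only for illustration; the function name split_on_blank_lines and the sample string are assumptions, with the sample reconstructed from the expected output in the question):

```python
def split_on_blank_lines(text):
    """Split text into chunks separated by exactly two newline characters."""
    # Drop leading/trailing newlines so no empty chunk appears at either end
    return text.strip("\n").split("\n\n")

sample = "abc\n\na\nb\nc\n\nabc\nabc\nabc\nabc\n"
print(split_on_blank_lines(sample))
```

Runs of three or more newlines are not handled specially here; they would leave a stray newline or an empty chunk, so the input is assumed to use exactly one blank line between groups.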

Find a regular expression that describes those words that don't contain two consecutive a's over the alphabet {a,b}

I've tried to write a grammar for the language. Here is my grammar:
S -> aS | bS | λ
I also wanted to generate the word "bbababb" which does not have two consecutive a's.
I started with,
bS => bbS => bbaS => bbabS => bbabaS => bbababS => bbababbS => bbababbλ => bbababb.
And finally I tried the following regular expression,
(a+b*)a*(a+b*)
I really appreciate your help.
Let's try to write some rules that describe all strings that don't have two consecutive a's:
the empty string is in the language
if x is a string in the language ending in a, you can add b to the end to get another string in the language
if x is a string in the language ending in b, you can add an a or a b to it to get another string in the language
This lets us write down a grammar:
S -> e | aB | bS
B -> e | bS
That grammar should work for us. Consider your string bbababb:
S -> bS -> bbS -> bbaB -> bbabS
-> bbabaB -> bbababS -> bbababbS
-> bbababb
To turn a regular grammar such as this into a regular expression, we can write equations and solve for S:
S = e + aB + bS
B = e + bS
Replace for B:
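S = e + a(e + bS) + bS
  = e + a + abS + bS
  = e + a + (ab + b)S
Now we can eliminate the recursion to solve for S. By Arden's rule, an equation of the form X = AX + B has the solution X = A*B; here A = (ab + b) and B = (e + a):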
S = (ab + b)*(e + a)
This gives us a regular expression: (ab + b)*(e + a)
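The result can be checked by brute force. The sketch below (Python, for illustration only; the names NO_DOUBLE_A and matches_language are made up here) translates (ab + b)*(e + a) into Python regex syntax and compares it with the direct definition "does not contain aa" on every word over {a, b} up to length 8:

```python
import re
from itertools import product

# (ab + b)*(e + a): the "e + a" part becomes an optional trailing a
NO_DOUBLE_A = re.compile(r"(?:ab|b)*a?")

def matches_language(word):
    """True if the whole word is matched by the derived expression."""
    return NO_DOUBLE_A.fullmatch(word) is not None

# All words over {a, b} of length 0 through 8
words = ["".join(p) for n in range(9) for p in product("ab", repeat=n)]
disagreements = [w for w in words if matches_language(w) != ("aa" not in w)]
print(disagreements)  # an empty list means both definitions agree
```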
a must always be followed by b, except the last char, so you can express it as "b or ab, with an optional trailing a":
\b(b|ab)+a?\b
See live demo.
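\b (word boundaries) might be able to be removed depending on your usage and regex engine. Note that, because of the +, this pattern does not match the empty string or the single word a, both of which are in the language; if those should be accepted, anchoring the whole string with ^(b|ab)*a?$ covers them.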

Remove characters from a string in all elements of a list

I'm trying to remove certain characters from every string in a list that contains a given substring.
I've tried it by using the map function:
cleanUpChars = map(\w -> if isInfixOf "**" w then map(\c -> if c == "*" then ""; else c); else w)
To me this reads as: map elements in a list, such that if a character of a word contains * replace it with nothing
To Haskell: "Couldn't match expected type [[Char]] -> [[Char]] with actual type [Char] in the expression: w" (and the last w is underlined)
Any help is appreciated
To answer the revised question (when isInfixOf has been imported correctly):
cleanUpChars = map(\w -> if isInfixOf "**" w then map(\c -> if c == "*" then ""; else c); else w)
The most obvious thing wrong here is that c in the inner parentheses will be a Char (since it's the input to a function which is mapped over a String) - and characters use single quotes, not double quotes. This isn't just a case of a typo or wrong syntax, however - "" works fine as an empty string (and is equivalent to [] since Strings are just lists), but there is no such thing as an "empty character".
If, as it seems, your aim is to remove all *s from each string in the list that contains **, then the right tool is filter rather than map:
Prelude Data.List> cleanUpChars = map(\w -> if isInfixOf "**" w then filter (/= '*') w; else w)
Prelude Data.List> cleanUpChars ["th**is", "is", "a*", "t**es*t"]
["this","is","a*","test"]
(Note that in the example I made up, it removes all asterisks from t**es*t, even the single one. This may not be what you actually wanted, but it's what your logic in the faulty version implied - you'll have to be a little more sophisticated to only remove pairs of consecutive *'s.)
PS I would certainly never write the function like that, with the semicolon - it really doesn't gain you anything. I would also use the infix form of isInfixOf, which makes it much clearer which string you are looking for inside the other:
cleanUpChars :: [String] -> [String]
cleanUpChars = map (\w -> if "**" `isInfixOf` w then filter (/= '*') w else w)
I'm still not particularly happy with that for readability - there's probably some nice way to tidy it up that I'm overlooking for now. But even if not, it helps readability imo to give the function a local name (hopefully you can come up with a more concise name than my version!):
cleanUpChars :: [String] -> [String]
cleanUpChars = map possiblyRemoveAsterisks
  where possiblyRemoveAsterisks w = if "**" `isInfixOf` w then filter (/= '*') w else w
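To make the difference between "remove every asterisk from strings containing **" and "remove only the doubled asterisks" concrete, here is a small sketch in Python (for illustration only; remove_double_asterisks is a hypothetical name for the variant mentioned in the note above):

```python
def clean_up_chars(words):
    """Strip every '*' from words that contain '**'; leave other words alone."""
    return [w.replace("*", "") if "**" in w else w for w in words]

def remove_double_asterisks(words):
    """Variant: delete only non-overlapping '**' pairs, keeping single asterisks."""
    return [w.replace("**", "") for w in words]

words = ["th**is", "is", "a*", "t**es*t"]
print(clean_up_chars(words))
print(remove_double_asterisks(words))
```

The two functions differ only on t**es*t, where the variant keeps the single asterisk.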

Sequentially replace multiple places matching single pattern in a string with different replacements

Using stringr package, it is easy to perform regex replacement in a vectorized manner.
Question: How can I do the following:
Replace every word in
hello,world??your,make|[]world,hello,pos
with a different replacement for each, e.g. increasing numbers
1,2??3,4|[]5,6,7
Note that simple separators cannot be assumed, the practical use case is more complicated.
stringr::str_replace_all does not seem to work, because either
str_replace_all(x, "(\\w+)", 1:7)
produces a vector with each replacement applied to all words, or the input has
uncertain and/or duplicate entries, so that
str_replace_all(x, c("hello" = "1", "world" = "2", ...))
will not work for the purpose.
Here's another idea using gsubfn. The pre function is run before the substitutions and the fun function is run for each substitution:
library(gsubfn)
x <- "hello,world??your,make|[]world,hello,pos"
p <- proto(pre = function(t) t$v <- 0,        # initialise the counter before any substitution
           fun = function(t, x) t$v <- v + 1) # increment the counter and use it as the replacement
gsubfn("\\w+", p, x)
Which gives:
[1] "1,2??3,4|[]5,6,7"
This variation would give the same answer since gsubfn maintains a count variable for use in proto functions:
pp <- proto(fun = function(...) count)
gsubfn("\\w+", pp, x)
See the gsubfn vignette for examples of using count.
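The underlying idea, a replacement function that carries a counter across matches, is not specific to R. As a hedged illustration in Python (not part of the original answers; number_words is a hypothetical name), re.sub accepts a callable that is invoked once per match:

```python
import re
from itertools import count

def number_words(text):
    """Replace each run of word characters with 1, 2, 3, ... in order of appearance."""
    counter = count(1)  # fresh counter for every call
    return re.sub(r"\w+", lambda match: str(next(counter)), text)

print(number_words("hello,world??your,make|[]world,hello,pos"))
```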
I would suggest the "ore" package for something like this. Of particular note would be ore.search and ore.subst, the latter of which can accept a function as the replacement value.
Examples:
library(ore)
x <- "hello,world??your,make|[]world,hello,pos"
## Match all and replace with the sequence in which they are found
ore.subst("(\\w+)", function(i) seq_along(i), x, all = TRUE)
# [1] "1,2??3,4|[]5,6,7"
## Create a cool ore object with details about what was extracted
ore.search("(\\w+)", x, all = TRUE)
#   match: hello world  your make   world hello pos
# context:      ,     ??    ,    |[]     ,     ,
#  number: 1==== 2====  3=== 4===   5==== 6==== 7==
Here is a base R solution. It should also be vectorized.
x="hello,world??your,make|[]world,hello,pos"
#split x into single chars
x_split=strsplit(x,"")[[1]]
#find all char positions and replace them with "a"
x_split[gregexpr("\\w", x)[[1]]]="a"
#find all runs of "a"
rle_res=rle(x_split)
#replace run lengths by 1
rle_res$lengths[rle_res$values=="a"]=1
#replace run values by increasing number
rle_res$values[rle_res$values=="a"]=1:sum(rle_res$values=="a")
#use inverse.rle on the modified rle object and collapse string
paste0(inverse.rle(rle_res),collapse="")
#[1] "1,2??3,4|[]5,6,7"

Find all possible permutations of a lowercase chemical formula

I'm trying to resolve ambiguity in a lowercase chemical formula. Since some element names are substrings of other element names, and they're all run together, there can be multiple global matches for the same pattern.
Consider the regex /^((h)|(s)|(hg)|(ga)|(as))+$/ against the string hgas. There are two possible matches: hg, as and h, s, ga (out of order compared to the input, but that is not an issue). Obviously a regex for all possible symbols would be longer, but this example is kept short for simplicity.
Regex's powerful lookahead and lookbehind allow it to conclusively determine whether even a very long string does match this pattern or there are no possible permutations of letters. It will diligently try all possible permutations of matches, and, for example, if it hits the end of the string with a leftover g, go back through and retry a different combination.
I'm looking for a regular expression, or a language with some kind of extension, that adds on the ability to keep looking for matches after one is found, in this case, finding h, s, ga as well as hg, as.
Rebuilding the complex lookahead and lookbehind functionality of regex for this problem does not seem like a reasonable solution, especially considering the final regex also includes a \d* after each symbol.
I thought about reversing the order of the regexp, /^((as)|(ga)|(hg)|(s)|(h))+$/, to find additional mappings, but at most this will only find one additional match, and I don't have the theoretical background in regex to know if it's even reasonable to try.
I've created a sample page using my existing regex which finds 1 or 0 matches for a given lowercase string and returns it properly capitalized (and out of order). It uses the first 100 chemical symbols in its matching.
http://www.ptable.com/Script/lowercase_formula.php?formula=hgas
tl;dr: I have a regex to match 0 or 1 possible chemical formula permutations in a string. How do I find more than 1 match?
I'm well-aware this answer might be off-topic (as in the approach), but I think it is quite interesting, and it solves the OP's problem.
If you don't mind learning a new language (Prolog), then it might help you generate all possible combinations:
name(X) :- member(X, ['h', 's', 'hg', 'ga', 'as']).

parse_([], []).
parse_(InList, [HeadAtom | OutTail]) :-
    atom_chars(InAtom, InList),
    name(HeadAtom),
    atom_concat(HeadAtom, TailAtom, InAtom),
    atom_chars(TailAtom, TailList),
    parse_(TailList, OutTail).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgas', Out).
Out = [h, ga, s] ;
Out = [hg, as] ;
false.
The improved version, which also handles numbers, is a tad longer:
isName(X) :- member(X, ['h', 's', 'hg', 'ga', 'as', 'o', 'c']).

% Collect all numbers, since it will not be part of element name.
collect([],[],[]).
collect([H|T], [], [H|T]) :-
    \+ char_type(H, digit), !.
collect([H|T], [H|OT], L) :-
    char_type(H, digit), !, collect(T, OT, L).

parse_([], []).
parse_(InputChars, [Token | RestTokens]) :-
    atom_chars(InputAtom, InputChars),
    isName(Token),
    atom_concat(Token, TailAtom, InputAtom),
    atom_chars(TailAtom, TailChars),
    parse_(TailChars, RestTokens).
parse_(InputChars, [Token | RestTokens]) :-
    InputChars = [H|_], char_type(H, digit),
    collect(InputChars, NumberChars, TailChars),
    atom_chars(Token, NumberChars),
    parse_(TailChars, RestTokens).

parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgassc20h245o', X).
X = [h, ga, s, s, c, '20', h, '245', o] ;
X = [hg, as, s, c, '20', h, '245', o] ;
false.
?- parse('h2so4', X).
X = [h, '2', s, o, '4'] ;
false.
?- parse('hgas', X).
X = [h, ga, s] ;
X = [hg, as] ;
false.
The reason you haven't found a generalized regex library that does this is because it's not possible with all regular expressions to do this. There are regular expressions that will not terminate.
Imagine with your example that you just added empty string to the list of terms, then
'hgas' could be:
['hg', 'as']
['hg', '', 'as']
['hg', '', '', 'as']
You'll probably just have to roll your own.
Roughly, in Python:
def findall(term, possible):
    out = []
    # for all the terms
    for pos in possible:
        # if it is a candidate
        if term.startswith(pos):
            # combine it with all possible endings of the remaining text
            for combo in findall(term[len(pos):], possible):
                out.append([pos] + combo)
    # nothing matched (or nothing is left): one parse with no further symbols
    return out or [[]]
findall('hgas', ['h', 'as', ...])
The above runs in exponential time in the worst case, so dynamic programming is the way to keep it tractable: memoize the results for each suffix so they are not recomputed. Memoization for the win.
The last thing worth noting is that the above code doesn't check that it matches the whole string,
i.e. 'hga' might return ['hg'] as a possibility.
I'll leave the actual coding, memoization, and this last hiccup as, as my profs lovingly say, 'an exercise for the reader'.
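One possible take on that exercise, offered as a sketch rather than a definitive implementation: the Python below memoizes on the start position with functools.lru_cache and only accepts splits that consume the whole string. The name find_all_parses and the short symbol list are assumptions for illustration; empty symbols are skipped so the recursion always makes progress.

```python
from functools import lru_cache

def find_all_parses(term, symbols):
    """Return every way to split term into a sequence of the given symbols."""
    symbols = tuple(symbols)

    @lru_cache(maxsize=None)
    def parses_from(start):
        # Reaching the end of the string counts as one complete (empty) parse
        if start == len(term):
            return ((),)
        results = []
        for sym in symbols:
            # Skip empty symbols; they would never consume any input
            if sym and term.startswith(sym, start):
                for rest in parses_from(start + len(sym)):
                    results.append((sym,) + rest)
        return tuple(results)

    return [list(p) for p in parses_from(0)]

print(find_all_parses("hgas", ["h", "s", "hg", "ga", "as"]))
```

Memoization avoids re-deriving the parses of a suffix that is reached by several different prefixes, but the number of results itself can still grow quickly for long inputs.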
This is not a job for regexp, you need something more like a state machine.
You'd need to parse the string popping out all known symbols, stopping if there is none, and continuing. If the whole string gets consumed on one branch, you have found a possibility.
In PHP, something like:
<?php
$Elements = array('h','he','li','s','ga','hg','as',
    // "...and there may be many others, but they haven't been discovered".
);
function check($string, $found = array(), $path = '')
{
    GLOBAL $Elements, $calls;
    $calls++;
    if ('' == $string)
    {
        if (!empty($path))
            $found[] = $path;
        return $found;
    }
    $es = array_unique(array(
        substr($string, 0, 1), // Element of 1 letter ("H")
        substr($string, 0, 2), // Element of 2 letter ("He")
    ));
    foreach($es as $e)
        if (in_array($e, $Elements))
            $found = check(substr($string, strlen($e)), $found, $path.$e.',');
    return $found;
}
print_r(check('hgas'));
print "in $calls calls.\n";
?>
Don't use regex. A regex finds only one match, as you say; instead you need to find all the possible "meanings" of your string. Given that each element's length is 1-2 chars, I'd go with this algorithm (forgive the pseudocode):
string[][] results;
void formulas(string[] elements, string formula){
    // a copy for the two-letter branch (arrays are assumed to be copied
    // on assignment and when passed to a function)
    string[] elements2=elements;
    if(formula.length >= 1 && checkSymbol(formula.substring(0,1))){
        elements.append(formula.substring(0,1));
        if(formula.length-1 ==0){
            results.append(elements);
        } else {
            formulas(elements,formula.substring(1,formula.length));
        }
    }
    if(formula.length >= 2 && checkSymbol(formula.substring(0,2))){
        elements2.append(formula.substring(0,2));
        if(formula.length-2 ==0){
            results.append(elements2);
        } else {
            formulas(elements2,formula.substring(2,formula.length));
        }
    }
}
bool checkSymbol(string symbol){
    // verifies if symbol is the name of an element
}
input "hgas" (let's go depth first)
first step:
checkSymbol(formula.substring(0,1)) is true for "h"
elements2 = [h]
recursive call, if(checkSymbol(formula.substring(0,1))) false
then it tests ga => true
elements2 = [h, ga]
third recursive call
test s, checksymbol returns true, elements is then [h, ga, s]. Length of the substring is 0 so it appends to results the first array: [h, ga, s]
--
let's go back to the second "branch" of first step
The test checkSymbol(formula.substring(0,2)) finds that "hg" is an element as well
elements2 = [hg]
Then we call formulas([hg],"as")
The test for "a" fails (it is not an element) and the test for "as" works, the length is totally consumed, the result [hg,as] is appended to results[]
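Each call branches at most twice (a one-letter and a two-letter symbol), so in the worst case this algorithm takes time exponential in n, the length of the string, because the number of valid splits can itself grow like the Fibonacci numbers. For real formulas most branches fail immediately, so in practice it stays fast.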