Telling if regular expression contains a single invariable segment

I am writing a search tool that is optimized to first look for fixed phrases of characters in sentences; think of it as a simple "does the sentence contain this particular sequence of characters" test. The result is a set of matching sentences, which can then be searched further in a second stage.
For this second stage I would like to offer regex search for convenience. But I have to pre-select the items in the first stage, and I cannot simply fetch all sentences: the API I have to use for the first stage requires a search phrase of at least one character. There is no way around this.
The user will enter only one regex, and my software first needs to determine whether it can perform the first-stage search with it. If the user enters something ambiguous, I will tell the user to change the regex.
I need an algorithm that determines all substrings I can use for the first-stage search.
Here are some examples of expected results:
a.b – Yes (searches for "a" or "b" first)
a|b – No (there'd be two distinct first level searches necessary)
[ab] – No (same problem: Not a clear target for the first stage search)
[ab]c – Yes (searches for "c" first)
These are simple examples. But since regex can get quite complicated I wonder if I can construct a regex or other test that will tell me if I have a usable outcome.
I could also live with limiting the regex syntax to the more common cases if that makes the test simpler, e.g. no recursion or whatever could help.

Here is an example, using Haskell. The algorithm should be easily transferred to another language.
Just some boilerplate imports:
import Data.Maybe
import Data.List
import Data.Function
Here is a datatype to represent a Regex. You'll have to parse it yourself or use a library:
data Regex
  = Concat Regex Regex          -- e.g. /ab/
  | Alt Regex Regex             -- e.g. /a|b/
  | Single Char                 -- e.g. /a/
  | Star Regex                  -- e.g. /a*/
  | CharClass [(Char,Char)]     -- a list of ranges; for a non-range (e.g. [a]) just use the same char twice
Here is the algorithm:
regexMustMatch (Single x) = [Just x] -- has to match the character
regexMustMatch (Alt _ _) = [Nothing] -- doesn't need to match one thing (you could actually check for equality here, so something like /a|a/ would work)
regexMustMatch (Star _) = [Nothing] -- doesn't need to match one thing
regexMustMatch (CharClass ((a,b):[])) | a == b = [Just a] -- char class must match if it only has one character
regexMustMatch (CharClass _) = [Nothing] -- otherwise doesn't need to match one thing
regexMustMatch (Concat x y) = (regexMustMatch x) ++ (regexMustMatch y) -- must match both parts in sequence
Some methods to make the results usable:
selectAll = map (concatMap (return . fromJust)) .
            filter (isJust . head) .
            groupBy ((==) `on` isJust)
selectLongest x = case selectAll x of
  [] -> ""
  xs -> maximumBy (compare `on` length) xs
And some examples:
main = do
  -- your tests
  -- /ab/
  print . selectAll . regexMustMatch $ (Single 'a' `Concat` Single 'b')
  -- /a|b/
  print . selectAll . regexMustMatch $ (Single 'a' `Alt` Single 'b')
  -- /[ab]/
  print . selectAll . regexMustMatch $ (CharClass [('a','a'),('b','b')])
  -- /[ab]c/ (written here as the equivalent /(a|b)c/)
  print . selectAll . regexMustMatch $ ((Single 'a' `Alt` Single 'b') `Concat` Single 'c')
  -- a few more
  -- /[a]/
  print . selectAll . regexMustMatch $ (CharClass [('a','a')])
  -- /ab*c/
  print . selectAll . regexMustMatch $ (Single 'a' `Concat` Star (Single 'b') `Concat` Single 'c')
  -- /s(ab*)(cd)/ - these aren't capturing parens, just grouping to test associativity
  print . selectAll . regexMustMatch $ (Single 's' `Concat` (Single 'a' `Concat` Star (Single 'b')) `Concat` (Single 'c' `Concat` Single 'd'))
Output:
["ab"] -- /ab/
[] -- /a|b/
[] -- /[ab]/
["c"] -- /[ab]c/
["a"] -- /[a]/
["a","c"] -- /ab*c/
["sa","cd"] -- /s(ab*)(cd)/
The main area where this could be improved is in the algorithm for alternation.
If we have the regex /a*bc*|d*be*/ then b needs to be matched, but this won't pick that up.
Edit: here's an improved algorithm for alternation (it replaces the Alt equation above):
regexMustMatch (Alt x y)
  | x' == y' = x'
  | otherwise = start ++ [Nothing] ++ common ++ [Nothing] ++ end
  where
    x' = regexMustMatch x
    y' = regexMustMatch y
    start = map fst $ takeWhile (uncurry (==)) (zip x' y')
    end = map fst $ reverse $ takeWhile (uncurry (==)) (zip (reverse (drop (length start) x')) (reverse (drop (length start) y')))
    dropEnds = drop (length start) . reverse . drop (length end) . reverse
    common = intercalate [Nothing] $ map (map Just) (selectAll (dropEnds x') `intersect` selectAll (dropEnds y'))
Some more tests with the improved alternation:
/a*bc*|d*be*/ == b
/s(abc*|abe*)/ == sab
/s(a*bc*|d*be*)/ == s, b
/sa*b|b*/ == s
/(abc*|abe*)s/ == ab, s
/(a*bc*|d*be*)s/ == b, s
/(a*b|b*)s/ == s
/s(ab|b)e/ == s, be
/s(ba|b)e/ == sb, e
/s(b|b)e/ == sbe
/s(ac*b|ac*b)e/ == sa, be
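The question's first example, a.b, uses the dot, which the Regex type above does not model. As a minimal sketch of how the first-stage decision could be wired up, assuming a trimmed-down type with an extra Any constructor for the dot: the names Any, mustMatch, fixedRuns and firstStage are hypothetical and not part of the answer's code, and the improved alternation rule is left out for brevity.

```haskell
import Data.Function (on)
import Data.List (groupBy)
import Data.Maybe (catMaybes, isJust)

-- Hypothetical, trimmed-down regex type. `Any` stands for `.` and is an
-- assumption of this sketch; character classes are omitted.
data Regex
  = Concat Regex Regex
  | Alt Regex Regex
  | Single Char
  | Star Regex
  | Any

-- Just c: this position must be the character c.
-- Nothing: a gap whose content is not fixed.
mustMatch :: Regex -> [Maybe Char]
mustMatch (Single c)   = [Just c]
mustMatch (Concat x y) = mustMatch x ++ mustMatch y
mustMatch _            = [Nothing] -- Alt, Star and Any fix no character

-- Maximal runs of consecutive fixed characters.
fixedRuns :: [Maybe Char] -> [String]
fixedRuns = map catMaybes . filter (any isJust) . groupBy ((==) `on` isJust)

-- Left: a message for the user; Right: phrases usable in the first stage.
firstStage :: Regex -> Either String [String]
firstStage re = case fixedRuns (mustMatch re) of
  [] -> Left "no fixed substring; please change the regex"
  ss -> Right ss
```

With this sketch, /a.b/ yields Right ["a","b"] and /a|b/ yields the Left message, which matches the Yes/No expectations listed in the question.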

Related

Ocaml: Get a list of characters that are between two characters

to clarify my dilemma I'll explain the problem I'm faced with...
Basically, I am being passed a string that can contain single characters or ranges of characters and am trying to return back a list of characters represented by the string I was passed.
Ex. "b" would just give a list ['b'], "a-z" would give ['a' ; 'b' ; 'c' ; ... ; 'z'], and something like "ad-g2-6" would be ['a' ; 'd' ; 'e' ; 'f' ; 'g' ; '2' ; '3' ; '4' ; '5' ; '6'] since there is the character a and the ranges d-g and 2-6. (Also worth noting that something like "a-" would just be ['a' ; '-'] since the range wasn't completed.)
My ideas for solving this have come to exploding the string into a list of characters (lst) then pattern matching and building onto an accumulator like
let mainfunc str = let lst = (explode str) in
  let rec func lst acc = match lst with
    | [] -> acc
    | a::'-'::b::t -> func t (acc @ **SOMETHING TO GET THIS RANGE**)
    | a::t -> func t (acc @ [a])
  in func lst []
Anything that could help me get a range between the characters would be great and I'm open to ideas if someone has a better way to go about this problem than what I have set up.
(Also note that my explode function works as intended and converts a string into a char list)

Since you wrote a successful explode function I'll assume that you have no trouble with recursion etc. So the problem might just be a way to talk about characters as values (so you can get the next character after a given one).
For this you can use Char.code and Char.chr (from the OCaml standard library).
Here's a function that takes a character and returns a list consisting of the character and the next one in order:
let char_pair c =
[ c; Char.chr (Char.code c + 1) ]
Here's how it looks when you run it:
# char_pair 'x';;
- : char list = ['x'; 'y']
(I leave as an exercise the problem of dealing with the character with code 255.)
As a side comment, your approach looks pretty good to me. It looks like it will work.

Haskell split string on last occurence

Is there any way I can split String in Haskell on the last occurrence of given character into 2 lists?
For example I want to split list "a b c d e" on space into ("a b c d", "e").
Thank you for answers.

I'm not sure why the solutions suggested are so complicated. Only two traversals are needed:
splitLast :: Eq a => a -> [a] -> Either [a] ([a],[a])
splitLast c' = foldr go (Left [])
  where
    go c (Right (f,b)) = Right (c:f,b)
    go c (Left s) | c' == c = Right ([],s)
                  | otherwise = Left (c:s)
Note this is total and clearly signifies its failure. When a split is not possible (because the character specified wasn't in the string) it returns a Left with the original list. Otherwise, it returns a Right with the two components.
ghci> splitLast ' ' "hello beautiful world"
Right ("hello beautiful","world")
ghci> splitLast ' ' "nospaceshere!"
Left "nospaceshere!"
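If a plain pair is wanted, as in the question's ("a b c d", "e") example, the Either result can be collapsed with either. The following is only a sketch: splitLast is repeated so the block is self-contained, and splitLastPair is a hypothetical helper name. Putting an unsplittable input into the first component is an arbitrary choice of this sketch.

```haskell
splitLast :: Eq a => a -> [a] -> Either [a] ([a], [a])
splitLast c' = foldr go (Left [])
  where
    -- Once the last separator has been found, keep prepending to the front part.
    go c (Right (f, b)) = Right (c : f, b)
    -- Scanning from the right: the first separator met is the last occurrence.
    go c (Left s)
      | c' == c   = Right ([], s)
      | otherwise = Left (c : s)

-- Hypothetical helper: collapse the Either into a pair. When there is no
-- separator, the whole input goes into the first component.
splitLastPair :: Eq a => a -> [a] -> ([a], [a])
splitLastPair c = either (\whole -> (whole, [])) id . splitLast c
```

The Either version is still the more informative one, since the pair cannot distinguish "no separator" from "separator at the very end".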

It's not beautiful, but it works:
import Data.List
f :: Char -> String -> (String, String)
f char str = let n = findIndex (==char) (reverse str) in
             case n of
               Nothing -> (str, [])
               Just n -> splitAt (length str - n -1) str
I mean f 'e' "a b c d e" = ("a b c d ", "e"), but I myself wouldn't crop that trailing space.

I would go with more pattern matching.
import Data.List
splitLast = contract . words
  where contract [] = ("", "")
        contract [x] = (x, "")
        contract [x,y] = (x, y)
        contract (x:y:rest) = contract $ intercalate " " [x,y] : rest
For long lists, we just join the first two strings with a space and try the shorter list again. Once the length is reduced to 2, we just return the pair of strings.
(x, "") seemed like a reasonable choice for strings with no whitespace, but I suppose you could return ("", x) instead.
It's not clear that ("", "") is the best choice for empty strings, but it seems like a reasonable alternative to raising an error or changing the return type to something like Maybe (String, String).

I can propose the following solution:
splitLast list elem = (reverse $ snd reversedSplit, reverse $ fst reversedSplit)
  where
    reversedSplit = span (/= elem) $ reverse list
It is probably not the fastest one (two needless reverses), but I like its simplicity. Note that the separator stays at the end of the first component.
If you insist on removing the space we're splitting on from the first component (it then leads the second one), you can go for:
import qualified Data.List as List
splitLast list elem = splitAt (last $ List.elemIndices elem list) list
however, this version assumes that there will be at least one element matching the pattern. If you don't like this assumption, the code gets slightly longer (but no double-reversals here):
import qualified Data.List as List
splitLast list elem = splitAt index list
  where
    index = if null indices then 0 else last indices
    indices = List.elemIndices elem list
Of course, the choice of splitting at the beginning is arbitrary, and splitting at the end would probably be more intuitive for you; in that case simply replace 0 with length list.

My idea is to split at every occurrence and then separate the initial parts from the last part.
Pointed:
import Control.Arrow -- (&&&)
import Data.List -- intercalate
import Data.List.Split -- splitOn
-- the separator is itself a list, as splitOn and intercalate require
breakOnLast :: Eq a => [a] -> [a] -> ([a], [a])
breakOnLast x = (intercalate x . init &&& last) . splitOn x
Point-free:
liftA2 (.) ((&&& last) . (. init) . intercalate) splitOn
(.) <$> ((&&&) <$> ((.) <$> pure init <*> intercalate) <*> pure last) <*> splitOn

Frequency table in Haskell with list comprehension only, find frequency of characters in a String

I am new to Haskell, trying to learn some stuff and pass the task that I was given. I would like to find the number of characters in a String but without importing Haskell modules.
I need to implement a frequency table and I would like to understand more about programming in Haskell and how I can do it.
I have my FreqTable as a tuple with the character and the number of occurrences of the 'char' in a String.
type FreqTable = [(Char, Int)]
I have been searching for a solution for a couple of days and long hours to find some working examples.
My function, or rather the function in the task, is declared as follows:
fTable :: String -> FreqTable
I know that the correct answer can be:
map (\x -> (head x, length x)) $ group $ sort
or
map (head &&& length) . group . sort
or
[ (x,c) | x <- ['A'..'z'], let c = (length . filter (==x)), c>0 ]
I can't get this to work with my list; I only found these as possible solutions. With the list comprehension above I am getting an error which I can't solve at the moment:
Couldn't match expected type ‘String -> FreqTable’
with actual type ‘[(Char, [Char] -> Int)]’
In the expression:
[(x, c) |
x <- ['A' .. 'z'], let c = (length . filter (== x)), c > 0]
In an equation for ‘fTable’:
fTable
= [(x, c) |
x <- ['A' .. 'z'], let c = (length . filter (== x)), c > 0]
Can someone please show me and explain a nice and simple way of checking the frequency of characters without importing Data.List or Map?

You haven't included what you should be filtering and taking the length of
[ (x,c) | x <- ['A'..'z'], let c = (length . filter (==x)), c>0 ]
-- ^_____________________^
-- this is a function from a String -> Int
-- you want the count, an Int
-- The function needs to be applied to a String
The string to apply it to is the argument to fTable
fTable :: String -> FreqTable
fTable text = [ (x,c) | x <- ['A'..'z'], let c = (length . filter (==x)) text, c>0 ]
-- ^--------------------------------------------------------------------^

The list: ['A'..'z'] is this string:
"ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz"
so you are iterating over both upper and lower case letters (and some symbols.) That's why you have a tuple, e.g., for both 'A' and 'a'.
If you want to perform a case-insensitive count, you have to perform a case-insensitive comparison instead of straight equality.
import Data.Char
ciEquals :: Char -> Char -> Bool
ciEquals a b = toLower a == toLower b
Then:
fTable text = [ (x,c) | x <- ['A'..'Z']
              , let c = (length . filter (ciEquals x)) text
              , c > 0 ]
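Since the question asks for a solution without any imports, here is a Prelude-only sketch that also avoids the fixed ['A'..'z'] range by iterating over the characters that actually occur. The helper name distinct is hypothetical, and the count is case-sensitive, unlike the ciEquals variant above.

```haskell
type FreqTable = [(Char, Int)]

-- Distinct characters in order of first appearance (Prelude only).
-- `distinct` is a hypothetical helper name used for this sketch.
distinct :: String -> String
distinct []       = []
distinct (c : cs) = c : distinct [x | x <- cs, x /= c]

-- One entry per distinct character, counted with a list comprehension.
fTable :: String -> FreqTable
fTable text = [(c, length [x | x <- text, x == c]) | c <- distinct text]
```

For example, fTable "hello" gives [('h',1),('e',1),('l',2),('o',1)]. The approach is quadratic in the length of the input, which is usually fine for an exercise of this size.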

string to list of lists of rhyming words

My goal is a function which takes a sentence and returns a list of lists with the words rhyming (rhyming = last 3 chars are equal).
Example: "Six sick hicks nick six slick bricks with picks and sticks." ->
[[Six,six],[sick,nick,slick],[hicks,bricks,picks,sticks],[with]]
This is my code so far (bsort is bubblesort):
rhymeWords:: String -> [[String]]
rhymeWords "" = []
rhymeWords xs = bsort (words (reverse xs))
I do not know how to translate it into code but I would like to take the first three chars of the first string and put them into a list. Then take the next String and test if it is equal to the first. If true put the second string into the first list otherwise create a second list. Then move on to the third string, each time testing with previous lists.
Can anyone please help me?

The following code groups rhymes as requested, although it converts all characters to lower case.
import Data.List (sort)
import Data.Char (toLower)
rhymeWords :: String -> [[String]]
rhymeWords "" = []
rhymeWords xs = [map reverse g | g <- groupRhymes (sortRhymes xs) []]
  where sortRhymes xs = sort $ map reverse (words [toLower x | x <- xs])
groupRhymes :: [String] -> [[String]] -> [[String]]
groupRhymes [] acc = acc
groupRhymes (x:xs) acc = case acc of
  [] -> groupRhymes xs [[x]]
  _  -> if take 3 x == take 3 (head (last acc))
          then groupRhymes xs ((init acc) ++ [(last acc) ++ [x]])
          else groupRhymes xs (acc ++ [[x]])
Example result:
rhymeWords "Six sick hicks nick six slick bricks with picks and sticks"
[["and"],["with"],["slick","nick","sick"],["hicks","picks","bricks","sticks"],["six","six"]]
Note that the example input doesn't have a period at the end of the sentence, because the last word would include it and break the sorting. You'll need to fiddle a bit with presented code if you need to pass sentences with a period.
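One possible way to cope with the trailing period is to drop every non-letter before grouping. This is a sketch under that assumption, not the answer's code: rhymeKey is a hypothetical helper, the original capitalisation is kept, and the rhyme comparison is case-insensitive.

```haskell
import Data.Char (isAlpha, toLower)
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Rhyme key: the last three letters, lower-cased and reversed.
-- `rhymeKey` is a hypothetical helper name used for this sketch.
rhymeKey :: String -> String
rhymeKey = take 3 . reverse . map toLower

-- Strip punctuation from each word, sort by rhyme key so that rhyming
-- words become adjacent, then group the neighbours sharing a key.
rhymeWords :: String -> [[String]]
rhymeWords =
  groupBy ((==) `on` rhymeKey)
    . sortOn rhymeKey
    . filter (not . null)
    . map (filter isAlpha)
    . words
```

For the question's sentence, including the final period, this gives [["and"],["with"],["sick","nick","slick"],["hicks","bricks","picks","sticks"],["Six","six"]]. Filtering with isAlpha also removes apostrophes and digits, which may or may not be wanted.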

When you have to group items together, you can use Data.List's grouping higher order functions. With groupBy you can easily solve your problem just by writing your grouping function. In your case, you want to group words that rhyme together. You just have to write the function rhyming:
rhyming :: String -> String -> Bool
rhyming word1 word2 = last3 (lower word1) == last3 (lower word2)
  where
    last3 = take 3 . reverse -- if you wanted `last3` to return the last three characters in order, you'd just have to apply `reverse` to the result, but that's unnecessary here
    lower = map toLower
So your rhymeWords function can be written like so:
import Data.List (groupBy, sort)
import Data.Char (toLower)
rhyming :: String -> String -> Bool
rhyming word1 word2 = last3 (lowercase word1) == last3 (lowercase word2)
  where
    last3 = take 3 . reverse
    lowercase = map toLower
rhymeWords :: String -> [[String]]
rhymeWords = groupBy rhyming . map reverse . sort . map reverse . words
The map reverse . sort . map reverse step is needed because groupBy only groups elements that are adjacent. Sorting the reversed words puts words with the same ending, which are likely to rhyme, next to each other.

Simplifying regex in Haskell with trees

I have this data structure for regular expressions (RE), and so far I do not have any functions modifying REs:
data Regex a = Letter a | Emptyword | Concat (Regex a) (Regex a) | Emptyset | Or (Regex a) (Regex a) | Star (Regex a)
  deriving (Show, Eq)
I would like to implement a simplification algorithm for my REs. For this I thought I should first represent the RE as tree, update the tree according to some equivalences and then convert it back to a RE. My reasoning was that with trees I would have functions to find, extract and attach subtrees, update values etc.
However, I have difficulties finding a tree module giving these functionalities and being simple enough for a beginner to learn.
I found this avl-tree package; however, it seems very large.
I'd like to have alternative suggestions to my approach with trees and suggestions on easy tree modules supporting mentioned functions.
Note that I'm a beginner in Haskell and I do not understand monads yet and that I'm not interested in an implementation to simplify REs.
Edit 1: We know that the following two REs are equivalent, where L b stands for Letter b and C for Concat:
   Or                  Or
  /  \                /  \
L b   C       =     L b  L a
     / \
   L a  Emptyword
So given the left RE I'd like to replace the subtree with its root labeled by C with a node labeled by L a. As was pointed out my data structure is a tree structure. However, currently I do not have functions to, e.g. replace a subtree with a node, or find a subtree of a structure that I can replace.

As noted in the comments, you already have a tree. You can simplify right away:
simplify :: Eq a => Regex a -> Regex a
simplify (Star Emptyset) = Emptyword
simplify (Star (Star x)) = Star (simplify x)
simplify (Concat x Emptyword) = simplify x
simplify (Concat Emptyword y) = simplify y
simplify (Or x y) | x == y = x
-- or rather simplify (Or x y) | simplify x == simplify y = simplify x
-- more sophisticated rules here
-- ...
-- otherwise just push down (no outer simplify call here, or an already simplified Or would recurse forever)
simplify (Or x y) = Or (simplify x) (simplify y)
-- ...
simplify x@(Letter _) = x
This is just superficial, e.g. the first rule should be simplify (Star x) | simplify x == Emptyset = Emptyword.
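One way to avoid the ordering problems hinted at above is to simplify the children first and then apply the local rules once at the root. The following is a sketch of that bottom-up scheme, with a small and certainly incomplete rule set; rewrite is a hypothetical helper name, and the Regex type is repeated from the question so the block is self-contained.

```haskell
data Regex a
  = Letter a
  | Emptyword
  | Emptyset
  | Concat (Regex a) (Regex a)
  | Or (Regex a) (Regex a)
  | Star (Regex a)
  deriving (Show, Eq)

-- Simplify the children first, then rewrite the node itself once.
simplify :: Eq a => Regex a -> Regex a
simplify (Concat x y) = rewrite (Concat (simplify x) (simplify y))
simplify (Or x y)     = rewrite (Or (simplify x) (simplify y))
simplify (Star x)     = rewrite (Star (simplify x))
simplify r            = r

-- Local rules for a node whose children are already simplified.
-- `rewrite` is a hypothetical helper; the rule list is not complete.
rewrite :: Eq a => Regex a -> Regex a
rewrite (Concat Emptyword y) = y
rewrite (Concat x Emptyword) = x
rewrite (Concat Emptyset _)  = Emptyset
rewrite (Concat _ Emptyset)  = Emptyset
rewrite (Or Emptyset y)      = y
rewrite (Or x Emptyset)      = x
rewrite (Or x y) | x == y    = x
rewrite (Star Emptyset)      = Emptyword
rewrite (Star Emptyword)     = Emptyword
rewrite (Star (Star x))      = Star x
rewrite r                    = r
```

Applied to the tree from the question's edit, simplify (Or (Letter 'b') (Concat (Letter 'a') Emptyword)) returns Or (Letter 'b') (Letter 'a'). Each node is visited once, so the function terminates on every finite input.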
AVL Trees
AVL trees are for balance, not really applicable here. The only place where balance makes sense is for the associative operations
Or x (Or y z) == Or (Or x y) z
I suggest using lists for those operations:
data Regex' a = Letter' a | Concat' [Regex' a] | Or' [Regex' a] | Star' (Regex' a)
  deriving (Show, Eq)
(No Emptyword' because it is Concat' []; same with Emptyset' and Or' [].)
Converting between Regex and Regex' is the usual exercise for the reader.
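For readers who want a starting point for that exercise, here is a sketch of the direction from Regex to Regex'. The names flatten, concatParts and orParts are hypothetical. Nested Concat and Or nodes are collected into one list, so both bracketings of an associative chain end up as the same value.

```haskell
data Regex a
  = Letter a
  | Emptyword
  | Emptyset
  | Concat (Regex a) (Regex a)
  | Or (Regex a) (Regex a)
  | Star (Regex a)
  deriving (Show, Eq)

data Regex' a
  = Letter' a
  | Concat' [Regex' a]
  | Or' [Regex' a]
  | Star' (Regex' a)
  deriving (Show, Eq)

-- Convert the binary tree into the list-based form.
flatten :: Regex a -> Regex' a
flatten (Letter a)     = Letter' a
flatten Emptyword      = Concat' [] -- empty concatenation
flatten Emptyset       = Or' []     -- empty alternation
flatten (Star x)       = Star' (flatten x)
flatten r@(Concat _ _) = Concat' (map flatten (concatParts r))
flatten r@(Or _ _)     = Or' (map flatten (orParts r))

-- Operands of a chain of nested Concat nodes, left to right.
concatParts :: Regex a -> [Regex a]
concatParts (Concat x y) = concatParts x ++ concatParts y
concatParts r            = [r]

-- Operands of a chain of nested Or nodes, left to right.
orParts :: Regex a -> [Regex a]
orParts (Or x y) = orParts x ++ orParts y
orParts r        = [r]
```

This sketch does not remove Concat' [] or Or' [] operands from the lists; that would be a separate simplification step.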
General Hardness
Note that Regex equivalence is not easy:
(a|b)* = (a*b)*a*
Optimizing Or "(a|b)*" "(a*b)*a*" is hard...
