Replace regex match by arbitrary function of match itself - regex

I'm trying to write a function of type Text -> (Text -> Text) -> Text that replaces occurrences of a regular expression in a piece of text by something else that is a function of what the regular expression has matched. There is subRegex from Text.Regex, but it only allows replacing a match with a fixed replacement string, whereas I would like the replacement to be an arbitrary function of the match. Is there a package that already implements something like that?

You can use matchRegexAll from Text.Regex:
matchRegexAll
:: Regex -- ^ The regular expression
-> String -- ^ The string to match against
-> Maybe ( String, String, String, [String] )
-- ^ Returns: 'Nothing' if the match failed, or:
--
-- > Just ( everything before match,
-- > portion matched,
-- > everything after the match,
-- > subexpression matches )
For example:
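import Text.Regex (Regex, matchRegexAll)

-- Replace only the first match with the result of applying f to it.
subFirst :: Regex -> String -> (String -> String) -> String
subFirst rx input f = case matchRegexAll rx input of
  Nothing -> input
  Just (pre, match, post, _) -> pre <> f match <> post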
If you want to do this for all matches rather than just the first, you can call this function recursively on the remainder post (left as an exercise).
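The "call it again on post" idea can be sketched as follows. This is a Python illustration rather than Haskell, and the name sub_all is made up for the example. It loops instead of recursing and stops on an empty match so that it cannot run forever. Because each search restarts on the remainder, anchors such as ^ are evaluated against the remainder, a caveat a recursive version built on subFirst would presumably share.

```python
import re

def sub_all(rx, text, f):
    """Replace every match of rx in text with f(matched substring)."""
    pieces = []
    rest = text
    while True:
        m = rx.search(rest)
        # Stop when nothing matches, or on an empty match (which would loop forever).
        if m is None or m.end() == m.start():
            pieces.append(rest)
            return "".join(pieces)
        pieces.append(rest[:m.start()])  # everything before the match
        pieces.append(f(m.group(0)))     # the match, transformed by f
        rest = rest[m.end():]            # continue on everything after the match

# Double every number found in the input.
print(sub_all(re.compile(r"[0-9]+"), "a1b22c", lambda s: str(int(s) * 2)))
# prints a2b44c
```

In Python itself, re.sub already accepts a function as the replacement, so the loop is only there to mirror the pre/match/post structure returned by matchRegexAll.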
For a different approach, the text-regex-replace package might be of use to you. It works directly on Text rather than String, and it appears to support arbitrary replacement functions (although its usage seems a bit obscure).

If you’re willing to write your pattern matching function as a parser instead of a regular expression, then the function Replace.Megaparsec.streamEdit with the match combinator has the signature you’re looking for.
Here’s a usage example in the README

Related

Regex to allow special only between alpha numeric

I need a regex that accepts a string only if it starts and ends with a letter or a digit, and that allows the special characters shown below in between:
/*
hello -> pass
what -> pass
how ##are you -> pass
how are you? -> pass
hi5kjjv -> pass
8ask -> pass
yyyy. -> fail
! dff -> fail
NoSpeci###alcharacters -> pass
Q54445566.00 -> pass
q!Q1 -> pass
!Q1 -> fail
q!! -> fail
#NO -> fail
0.2Version -> pass
-0.2Version -> fail
*/
The regex below works for the cases above, but it requires at least two valid characters:
^[A-Za-z0-9]+[ A-Za-z0-9_#./#&+$%_=;?\'\,\!-]*[A-Za-z0-9]+
So it fails for single-character inputs:
a -> failed but valid
1 -> failed but valid
I tried replacing + with * (that is, [A-Za-z0-9]+ with [A-Za-z0-9]*), but then it accepts a special character at the start (#john), which is wrong.
You may use this regex:
^[A-Za-z0-9](?:[! #$%;&'+,./=?#\w-]*[A-Za-z0-9?])?$
Or, if your regex flavor supports atomic groups, this version is slightly more efficient:
^[A-Za-z0-9](?>[! #$%;&'+,./=?#\w-]*[A-Za-z0-9?])?$
RegEx Demo
RegEx Details:
^: Start
[A-Za-z0-9]: Match an alphanumeric character
(?>[! #$%;&'+,./=?#\w-]*[A-Za-z0-9?])?: Optional atomic group that matches 0 or more of the characters listed in [...], followed by an alphanumeric character or ?
$: End
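A quick way to check the pattern against the examples from the question is a small table-driven test. The sketch below is in Python and uses the non-atomic variant, since atomic groups are only available in newer Python versions; the names PATTERN, CASES and is_valid are made up for the illustration.

```python
import re

# Non-atomic variant of the pattern from the answer above.
PATTERN = re.compile(r"^[A-Za-z0-9](?:[! #$%;&'+,./=?#\w-]*[A-Za-z0-9?])?$")

# Expected results, copied from the question (True = pass, False = fail).
CASES = {
    "hello": True, "what": True, "how ##are you": True, "how are you?": True,
    "hi5kjjv": True, "8ask": True, "yyyy.": False, "! dff": False,
    "NoSpeci###alcharacters": True, "Q54445566.00": True, "q!Q1": True,
    "!Q1": False, "q!!": False, "#NO": False, "0.2Version": True,
    "-0.2Version": False, "a": True, "1": True,
}

def is_valid(s):
    """True if s starts with an alphanumeric and ends as the pattern requires."""
    return PATTERN.search(s) is not None

# Collect every input whose result differs from the expected one.
mismatches = [s for s, expected in CASES.items() if is_valid(s) != expected]
print(mismatches)  # prints []
```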

What is the best way to ensure a regex in OCaml matches the entire input string?

In OCaml, I'm trying to check whether a regex matches the entire input string, not just a prefix, a suffix, or the portion of the input string before the first newline.
For example, I want to avoid a regex of [0-9]+ matching against strings like these:
let negative_matches = [
" 123"; (* leading whitespace *)
"123 "; (* trailing whitespace *)
"123\n"; (* trailing newline *)
]
I see that Str.string_match still returns true when trailing characters do not match the pattern:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+") s 0) negative_matches;;
- : bool list = [false; true; true]
Adding $ to the pattern helps in the second example, but $ is documented to only "match at the end of the line", so the third example still matches:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+$") s 0) negative_matches;;
- : bool list = [false; false; true]
I don't see a true "end of string" matcher (like \z in Java and Ruby) documented, so the best answer I've found is to additionally check the length of the input string against the length of the match using Str.match_end:
# List.map (fun s -> Str.string_match (Str.regexp "[0-9]+") s 0 && Str.match_end () = String.length s) negative_matches;;
- : bool list = [false; false; false]
Please tell me I'm missing something obvious and there is an easier way.
Edit: note that I'm not always looking to match against a simple regex like [0-9]+. I'd like a way to match an arbitrary regex against the entire input string.
You are missing something obvious. There is an easier way. If
[^0-9]
is matched in the input string you will know it contains a non-digit character.
Unfortunately, I don't think Str offers a better way to ensure the whole string has been matched than your own solution, or the similar, slightly clearer alternative:
Str.string_match (Str.regexp "[0-9]+") s 0 && Str.matched_string s = s
Or you could just check for the presence of a newline character as that is the fly in the ointment as you show.
And, of course, there are other regular expression libraries available that do not have this problem.
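For comparison, here is how the same three inputs behave in a library that does offer a whole-string match. The sketch is in Python rather than OCaml, and matches_whole is a made-up name that mirrors the length check from the question. The last line shows a caveat of the length check in Python's engine: with an alternation such as a|ab, the first alternative wins and the check fails even though the pattern can cover the whole string. Whether Str behaves the same way is worth checking rather than assuming.

```python
import re

def matches_whole(pattern, s):
    """Length check from the question: match at position 0, then compare the end."""
    m = re.match(pattern, s)
    return m is not None and m.end() == len(s)

negative_matches = [" 123", "123 ", "123\n"]

# In Python, $ also matches just before a trailing newline.
print([re.match(r"[0-9]+$", s) is not None for s in negative_matches])
# prints [False, False, True]

# The length check rejects all three inputs.
print([matches_whole(r"[0-9]+", s) for s in negative_matches])
# prints [False, False, False]

# A dedicated whole-string match rejects them as well.
print([re.fullmatch(r"[0-9]+", s) is not None for s in negative_matches])
# prints [False, False, False]

# Caveat: the length check can reject a string the pattern is able to cover.
print(matches_whole(r"a|ab", "ab"), re.fullmatch(r"a|ab", "ab") is not None)
# prints False True
```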
Try this for your example:
(?<![^A-z]|\w)[0-9]+(?![^A-z]|\w)
You can test it here.
To build other patterns, start from these two constructs:
(?<!X)   X must not appear immediately before the part you want to match
(?!X)    X must not appear immediately after the part you want to match

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest way to filter the list of strings would be to use Js.Re.test (https://rescript-lang.org/docs/manual/latest/api/js/re#test_), and it is easy.
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
  let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
  let break = ref(false);
  while (!break.contents) {
    switch (Js.Re.exec_ (re, searchStr)) {
    | Some(result) => {
        let match = Js.Re.captures(result)[0];
        Js.log2("Matching: ", match)
      }
    | None => {
        break := true;
      }
    }
  }
};
input -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the code above prints:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with a leading and a trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using a regex replace. The special characters in regex are \ ^ $ . | ? * + ( ) [ ] { }. Quoting all of them except * for the first replace:
str.replace(/[\[\]\\^$.|?+(){}]/g, '\\$&')
The character class is just all those special characters. The $& in the replacement says to insert whatever matched, and the \\ in front of it puts a literal backslash before the match.
Then pass that result to a second replace for the * to .*? transformation.
str.replace(/\*+/g, '.*?')
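The two steps can also be combined by splitting on the wildcard first and escaping only the literal pieces. The following is a minimal sketch in Python, not ReasonML; glob_to_regex and glob_contains are made-up names, and re.escape plays the role of the first replace.

```python
import re

def glob_to_regex(glob):
    """Turn a glob whose only wildcard is * into a regex source string."""
    # Split on runs of *, escape each literal piece, then join with a lazy wildcard.
    literal_parts = re.split(r"\*+", glob)
    return ".*?".join(re.escape(part) for part in literal_parts)

def glob_contains(glob, text):
    """True if the glob matches somewhere inside text."""
    return re.search(glob_to_regex(glob), text) is not None

print(glob_to_regex("test^ing?123[foo"))
# prints test\^ing\?123\[foo
print(glob_contains("test*[foo", "test^ing?123[foo"))
# prints True
print(glob_contains("t.st", "test"))
# prints False
```

Because the search is unanchored, the leading and trailing .* from the question are not needed here; they would only be required with a whole-string match.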

How to stop matching if the condition is satisfied?

The goal is to remove path segments (separated by '/') that consist of a single letter and, once such a segment appears, to remove everything to its right as well.
For example:
/modadisi/v/list -> /modadisi
/i/m/videos/tnt -> null
New examples:
/abcd/abcd/abcd/a/abcd -> /abcd/abcd/abcd
/abcd -> /abcd
/abcd/abcd/abcd -> /abcd/abcd/abcd
The current regex I use is
\/[a-zA-Z]{2,}
This matches every segment of two or more letters wherever it occurs, so /modadisi/v/list becomes /modadisi/list. Is it possible to modify the regex so that it scans from left to right and stops as soon as the condition is met?
Based on your new examples, just anchor the pattern to the start of the string using ^, and put the pattern inside a group that repeats. The full pattern would be ^(\/[a-zA-Z]{2,})*.
For the inputs:
/modadisi/v/list
/i/m/videos/tnt
/abcd/abcd/abcd/a/abcd
/abcd
/abcd/abcd/abcd
it produces:
/modadisi
{nothing}
/abcd/abcd/abcd
/abcd
/abcd/abcd/abcd
If any of this isn't right, let me know and I will adjust the pattern.
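The anchored, repeated group can be tried out with a few lines of Python. This is only a sketch: keep_prefix is a made-up name, and re.match already anchors at the start of the string, so the ^ is left out.

```python
import re

# Repeated group of "/" followed by two or more letters, matched from the start.
PREFIX = re.compile(r"(?:/[a-zA-Z]{2,})*")

def keep_prefix(path):
    """Return the leading run of segments that have two or more letters."""
    # The pattern can match the empty string, so match() never returns None.
    return PREFIX.match(path).group(0)

for path in ["/modadisi/v/list", "/i/m/videos/tnt",
             "/abcd/abcd/abcd/a/abcd", "/abcd", "/abcd/abcd/abcd"]:
    print(repr(keep_prefix(path)))
# prints '/modadisi', '', '/abcd/abcd/abcd', '/abcd', '/abcd/abcd/abcd' (one per line)
```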

Trying to match a string in the format of domain\username using Lua and then mask the pattern with '#'

I am trying to match a string in the format of domain\username using Lua and then mask the pattern with #.
So if the input is sample.com\admin; the output should be ######.###\#####;. The string can end with either a ;, ,, . or whitespace.
More examples:
sample.net\user1,hello -> ######.###\#####,hello
test.org\testuser. Next -> ####.###\########. Next
I tried ([a-zA-Z][a-zA-Z0-9.-]+)\.?([a-zA-Z0-9]+)\\([a-zA-Z0-9 ]+)\b which works perfectly on http://regexr.com/. But it doesn't work in the Lua demo. What is wrong with the pattern?
Below is the code I used to check in Lua:
test_text="I have the 123 name as domain.com\admin as 172.19.202.52 the credentials"
pattern="([a-zA-Z][a-zA-Z0-9.-]+).?([a-zA-Z0-9]+)\\([a-zA-Z0-9 ]+)\b"
res=string.match(test_text,pattern)
print (res)
It is printing nil.
Lua patterns aren't regular expressions, which is why your regex doesn't work:
\b isn't supported; you can use the more powerful %f frontier pattern if needed.
In the string test_text, the backslash isn't escaped, so \a is interpreted as an escape sequence.
. is a magic character in patterns; to match a literal dot it must be escaped as %.
This code isn't exactly equivalent to your pattern, but you can tweak it if needed:
test_text = "I have the 123 name as domain.com\\admin as 172.19.202.52 the credentials"
pattern = "(%a%w+)%.?(%w+)\\([%w]+)"
print(string.match(test_text,pattern))
Output: domain com admin
After fixing the pattern, replacing the matches with # is easy; you might need string.sub or string.gsub.
As already mentioned, pure Lua does not have regex, only patterns.
Your regex, however, can be expressed with the following code and pattern:
--[[
sample.net\user1,hello -> ######.###\#####,hello
test.org\testuser. Next -> ####.###\########. Next
]]
s1 = [[sample.net\user1,hello]]
s2 = [[test.org\testuser. Next]]
s3 = [[abc.domain.org\user1]]
function mask_domain(s)
  s = s:gsub('(%a[%a%d%.%-]-)%.?([%a%d]+)\\([%a%d]+)([%;%,%.%s]?)',
    function(a,b,c,d)
      return ('#'):rep(#a)..'.'..('#'):rep(#b)..'\\'..('#'):rep(#c)..d
    end)
  return s
end
print(s1,'=>',mask_domain(s1))
print(s2,'=>',mask_domain(s2))
print(s3,'=>',mask_domain(s3))
The last example does not end with ; , . or whitespace. If it must follow this, then simply remove the final ? from pattern.
UPDATE: If in the domain (e.g. abc.domain.org) you need to also reveal any dots before that last one you can replace the above function with this one:
function mask_domain(s)
  s = s:gsub('(%a[%a%d%.%-]-)%.?([%a%d]+)\\([%a%d]+)([%;%,%.%s]?)',
    function(a,b,c,d)
      a = a:gsub('[^%.]','#')
      return a..'.'..('#'):rep(#b)..'\\'..('#'):rep(#c)..d
    end)
  return s
end
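For readers who do have a full regex engine available, the same masking can be sketched with a replacement function. The example below is in Python, not Lua; ACCOUNT and mask_accounts are made-up names. It follows the UPDATE variant in that every dot stays visible, it requires at least one dot in the domain, and it does not check which character follows the username.

```python
import re

# domain\username: a letter, more domain characters, a dot, a final label,
# a backslash, then the username.
ACCOUNT = re.compile(r"[A-Za-z][A-Za-z0-9.-]*\.[A-Za-z0-9]+\\[A-Za-z0-9]+")

def mask_accounts(text):
    """Replace every character of each match with '#', except dots and the backslash."""
    def mask(m):
        return re.sub(r"[^.\\]", "#", m.group(0))
    return ACCOUNT.sub(mask, text)

# Raw strings keep the backslash literal, avoiding the \a problem from the question.
print(mask_accounts(r"sample.net\user1,hello"))
# prints ######.###\#####,hello
print(mask_accounts(r"test.org\testuser. Next"))
# prints ####.###\########. Next
print(mask_accounts(r"abc.domain.org\user1"))
# prints ###.######.###\#####
```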