How to reduce this regex to fit in Lua? - regex

I'm trying to do a regex that captures the longest string before a possibly ending letter "c", in Lua.
For example,
Given abc, match ab
Given acc, match ac
Given abcd, match abcd
Given abd, match abd
The solution I came up is ^(.+(?=c$)|.+(?!c$)). However, Lua does not have lookahead, so I'm thinking if there is a way to reduce this to something that Lua natively supports.

You can use (string.match(str:reverse(), "^c(.+)") or str:reverse()):reverse(). If nothing is matched, then the original string is returned.
[Updated to fix the logic]

You can use a capture group to capture 1 more occurrences of any char except c followed by optional c chars.
Then match a c char at the end of the string outside of the group.
([^c]+c*)c$
The pattern matches
( Capture group 1
[^c]+c* Match 1+ times any char except c followed by matching optional c chars
) Close group 1
c$ Match a c chat at the end of the string
Regex demo
In the code, you can print the value of group 1 if it exists, else print the whole string if there is not match.
I have no experience in Lua, but looking at the code (and borrowing it for an example) posted by #Nifim this would match:
strs = {
"abc",
"acc",
"abd",
"abcd",
"c"
}
for _,s in ipairs(strs) do
print(s:match("([^c]+c*)c$") or s)
end
Output
ab
ac
abd
abcd
c

You can use "s:match("([^c]+c?[^c]*)c$") or s, this will get all non-c characters then potentially a c followed by any remaining non-c characters and the final c if present, in the event of no match it returns the whole string.
Here is an example:
strs = {
"abc",
"acc",
"abd",
"abcd",
}
for _,s in ipairs(strs) do
print(s, s:match("([^c]+c?[^c]*)c$") or s)
end
Output:
abc ab
acc ac
abd abd
abcd abcd

Egor Skriptunoff provided the best answer in my opinion in the comment here. Since he didn't leave it as dedicated answer, I'll close this question for him. I do not hold any credit.
Original answer:
s:match("(.-)c?$")

Related

Need a Regex that includes all char after expression

I am trying to figure out a regex. That includes all characters after it but if another patterns occurs it does not overlap
This is my current regex
[a-zA-Z]{2}\d{1}\s?\w?
The pattern is always 2 letter followed by a number like AE1 or BE3 but I need all the characters following the pattern.
So AE1 A E F but if another pattern occurs in the string like
AE1 A D BE1 A D C it cannot overlap with and be two separate matches.
So to clarify
AB3 D T B should be one match on the regex
ABC D A F DE3 D CD A
should have 2 matches with all the char following it because of the the two letter word and number.
How do I achieve this
I'm not quite following the logic here, yet my guess would be that we might want something similar to this:
([A-Z]{2}\d\s([A-Z]+\s)+)|([A-Z]{3}\s([A-Z]+\s)+)
which allows two letters followed by a digit, or three letters, both followed by ([A-Z]+\s)+.
Demo
Look, you have to consider where your pattern will start. I mean, you know, what is different between AE1 A E F and BE1 A D C in AE1 A D BE1 A D C? You don't want to treat both similarly. So you have to separate them. Separation of these two texts is possible only determining which one is placed in text start.
Altogether, only adding ^ to start your pattern will solve problem.
So your regex should be like this:
^[a-zA-Z]{2}\d{1}\s?\w?
Demo
What you want to do is to split a string with your pattern having the current pattern match as the start of the extracted substrings.
You may use
(?!^)(?=[a-zA-Z]{2}\d)
to split the string. Details
(?!^) - not at the start of the string
(?=[a-zA-Z]{2}\d) - a location in the string that is immediately followed with 2 ASCII letters and any digit.
See the Scala demo:
val s = "ABC D A F DE3 D CD A"
val rx = """(?!^)(?=[a-zA-Z]{2}\d)"""
val results = s.split(rx).map(_.trim)
println(results.mkString(", "))
// => ABC D A F, DE3 D CD A
You can just use this regex:
(?i)\b[a-z]{2}\d\b(?:(?:(?!\b[a-z]{2}\d\b).)+\s?)?
Demo and explanations: https://regex101.com/r/DtFU8j/1/
It uses a negative lookahead (?!\b[a-z]{2}\d\b) to add the constraint that the character matched after the initial pattern (?i)\b[a-z]{2}\d\b should not contain this exact pattern.

Regex absolute begginer: filter alphanumeric

I'm playing codewars in Ruby and I'm stuck on a Kata. The goal is to validate if a user input string is alphanumeric. (yes, this is quite advanced Regex)
The instructions:
At least one character ("" is not valid)
Allowed characters are uppercase / lowercase latin letters and digits from 0 to 9
No whitespaces/underscore
What I've tried :
^[a-zA-Z0-9]+$
^(?! !)[a-zA-Z0-9]+$
^((?! !)[a-zA-Z0-9]+)$
It passes all the test except one, here's the error message:
Value is not what was expected
I though the Regex I'm using would satisfy all the conditions, what am I missing ?
SOLUTION: \A[a-zA-Z0-9]+\z (and better Ruby :^) )
$ => end of a line
\z => end of a string
(same for beginning: ^ (line) and \A (string), but wasn't needed for the test)
Favourite answer from another player:
/\A[A-z\d]+\z/
My guess is that maybe, we would start with an expression similar to:
^(?=[A-Za-z0-9])[A-Za-z0-9]+$
and test to see if it might cover our desired rules.
In this demo, the expression is explained, if you might be interested.
Test
re = /^(?=[A-Za-z0-9])[A-Za-z0-9]+$/m
str = '
ab
c
def
abc*
def^
'
# Print the match result
str.scan(re) do |match|
puts match.to_s
end
str !~ /[^A-Za-z\d]/
The string contains alphanumeric characters only if and only if it does not contain a character other than an alphnumeric character.

Make sure regex does not match empty string - but with a few caveats

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is, that your current pattern matches empty strings. To prevent this you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
return [word
for word in string.split(",")
for c in [word.count(char)]
if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^ # BOS
(?: # Cluster choice
[bc]* a [bc]* # only 1 [a] allowed, arbitrary [bc]'s
| # or,
[bc]+ # no [a]'s only [bc]'s ( so must be some )
) # End cluster
$ # EOS

Check if string is subset of a bunch of characters? (RegEx)?

I have a little problem, I have 8 characters, for example "a b c d a e f g", and a list of words, for example:
mom, dad, bad, fag, abac
How can I check if I can or cannot compose these words with the letters I have?
In my example, I can compose bad, abac and fag, but I cannot compose dad (I have not two D) and mom (I have not M or O).
I'm pretty sure it can be done using a RegEx but would be helpful even using some functions in Perl..
Thanks in advance guys! :)
This is done most simply by forming a regular expression from the word that is to be tested.
This sorts the list of available characters and forms a string by concatenating them. Then each candidate word is split into characters, sorted, and rejoined with the regex term .* as separator. So, for instance, abac will be converted to a.*a.*b.*c.
Then the validity of the word is determined by testing the string of available characters against the derived regex.
use strict;
use warnings;
my #chars = qw/ a b c d a e f g /;
my $chars = join '', sort #chars;
for my $word (qw/ mom dad bad fag abac /) {
my $re = join '.*', sort $word =~ /./g;
print "$word is ", $chars =~ /$re/ ? 'valid' : 'NOT valid', "\n";
}
output
mom is NOT valid
dad is NOT valid
bad is valid
fag is valid
abac is valid
This is to demonstrate the possibility rather than endorsing the regex method. Please consider other saner solution.
First step, you need to count the number of characters available.
Then construct your regex as such (this is not Perl code!):
Start with start of input anchor, this matches the start of the string (a single word from the list):
^
Append as many of these as the number of unique characters:
(?!(?:[^<char>]*+<char>){<count + 1>})
Example: (?!(?:[^a]*+a){3}) if the number of a is 2.
I used an advanced regex construct here called zero-width negative look-ahead (?!pattern). It will not consume text, and it will try its best to check that nothing ahead in the string matches the pattern specified (?:[^a]*+a){3}. Basically, the idea is that I check that I cannot find 3 'a' ahead in the string. If I really can't find 3 instances of 'a', it means that the string can only contain 2 or less 'a'.
Note that I use *+, which is 0 or more quantifier, possessively. This is to avoid unnecessary backtracking.
Put the characters that can appear within []:
[<unique_chars_in_list>]+
Example: For a b c d a e f g, this will become [abcdefg]+. This part will actually consume the string, and make sure the string only contains characters in the list.
End with end of input anchor, which matches the end of the string:
$
So for your example, the regex will be:
^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})[abcdefg]+$
You must also specify i flag for case-insensitive matching.
Note that this only consider the case of English alphabet (a-z) in the list of words to match. Space and hyphen are not (yet) considered here.
How about sorting both strings into alphabetical order then for the string you want to check insert .*
between each letter like so:
'aabcdefg' =~ m/a.*b.*d.*/
True
'aabcdefg' =~ m/m.*m.*u.*/
False
'aabcdefg' =~ m/a.*d.*d.*/
False
Some pseudocode:
Sort the available characters into alphabetical order
for each word:
Sort the characters of the word into alphabetical order
For each character of the word search forwards through the available characters to find a matching character. Note the this
search will never go back to the start of the available chars,
matched chars are consumed.
Or even better, use frequency counts of characters.
For your available characters, construct a map from character to occurence count of that character.
Do the same for each candidate word and compare against the available map, if the word map contains a mapping for a character where the available map does not, or the mapped value is larger in the word map than the available map, then the word cannot be constructed using the available characters.
Here's a really simple script that would be rather easy to generalize:
#!/usr/bin/env perl
use strict;
use warnings;
sub check_word {
my $word = shift;
my %chars;
$chars{$_}++ for #_;
$chars{$_}-- or return for split //, $word;
return 1;
}
print check_word( 'cab', qw/a b c/ ) ? "Good" : "Bad";
And of course the performance of this function could be greatly enhanced if the letters list is going to be the same every time. Actually for eight characters, copying the hash vs building a new one each time is probably the same speed.
pseudocode:
bool possible=true
string[] chars= { "a", "b", "c"}
foreach word in words
{
foreach char in word.chars
{
possible=possible && chars.contains(char)
}
}

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in sudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which does not have single-line mode):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, #"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
// ...
}