Check if string is subset of a bunch of characters? (RegEx)? - regex

I have a little problem, I have 8 characters, for example "a b c d a e f g", and a list of words, for example:
mom, dad, bad, fag, abac
How can I check if I can or cannot compose these words with the letters I have?
In my example, I can compose bad, abac and fag, but I cannot compose dad (I have not two D) and mom (I have not M or O).
I'm pretty sure it can be done using a RegEx but would be helpful even using some functions in Perl..
Thanks in advance guys! :)

This is done most simply by forming a regular expression from the word that is to be tested.
This sorts the list of available characters and forms a string by concatenating them. Then each candidate word is split into characters, sorted, and rejoined with the regex term .* as separator. So, for instance, abac will be converted to a.*a.*b.*c.
Then the validity of the word is determined by testing the string of available characters against the derived regex.
use strict;
use warnings;
my #chars = qw/ a b c d a e f g /;
my $chars = join '', sort #chars;
for my $word (qw/ mom dad bad fag abac /) {
my $re = join '.*', sort $word =~ /./g;
print "$word is ", $chars =~ /$re/ ? 'valid' : 'NOT valid', "\n";
}
output
mom is NOT valid
dad is NOT valid
bad is valid
fag is valid
abac is valid

This is to demonstrate the possibility rather than endorsing the regex method. Please consider other saner solution.
First step, you need to count the number of characters available.
Then construct your regex as such (this is not Perl code!):
Start with start of input anchor, this matches the start of the string (a single word from the list):
^
Append as many of these as the number of unique characters:
(?!(?:[^<char>]*+<char>){<count + 1>})
Example: (?!(?:[^a]*+a){3}) if the number of a is 2.
I used an advanced regex construct here called zero-width negative look-ahead (?!pattern). It will not consume text, and it will try its best to check that nothing ahead in the string matches the pattern specified (?:[^a]*+a){3}. Basically, the idea is that I check that I cannot find 3 'a' ahead in the string. If I really can't find 3 instances of 'a', it means that the string can only contain 2 or less 'a'.
Note that I use *+, which is 0 or more quantifier, possessively. This is to avoid unnecessary backtracking.
Put the characters that can appear within []:
[<unique_chars_in_list>]+
Example: For a b c d a e f g, this will become [abcdefg]+. This part will actually consume the string, and make sure the string only contains characters in the list.
End with end of input anchor, which matches the end of the string:
$
So for your example, the regex will be:
^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})[abcdefg]+$
You must also specify i flag for case-insensitive matching.
Note that this only consider the case of English alphabet (a-z) in the list of words to match. Space and hyphen are not (yet) considered here.

How about sorting both strings into alphabetical order then for the string you want to check insert .*
between each letter like so:
'aabcdefg' =~ m/a.*b.*d.*/
True
'aabcdefg' =~ m/m.*m.*u.*/
False
'aabcdefg' =~ m/a.*d.*d.*/
False

Some pseudocode:
Sort the available characters into alphabetical order
for each word:
Sort the characters of the word into alphabetical order
For each character of the word search forwards through the available characters to find a matching character. Note the this
search will never go back to the start of the available chars,
matched chars are consumed.
Or even better, use frequency counts of characters.
For your available characters, construct a map from character to occurence count of that character.
Do the same for each candidate word and compare against the available map, if the word map contains a mapping for a character where the available map does not, or the mapped value is larger in the word map than the available map, then the word cannot be constructed using the available characters.

Here's a really simple script that would be rather easy to generalize:
#!/usr/bin/env perl
use strict;
use warnings;
sub check_word {
my $word = shift;
my %chars;
$chars{$_}++ for #_;
$chars{$_}-- or return for split //, $word;
return 1;
}
print check_word( 'cab', qw/a b c/ ) ? "Good" : "Bad";
And of course the performance of this function could be greatly enhanced if the letters list is going to be the same every time. Actually for eight characters, copying the hash vs building a new one each time is probably the same speed.

pseudocode:
bool possible=true
string[] chars= { "a", "b", "c"}
foreach word in words
{
foreach char in word.chars
{
possible=possible && chars.contains(char)
}
}

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my #nums = (1,2,12,24,48,120,360);
I want to check if there is an element that is not an integer inside that array without using loop. It goes like this:
if(grep(!/[^0-9]|\^$/,#nums)){
die "Numbers are not in correct format.";
}else{
#Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way by using look like a number. But I can't use that for some reason (outside of my control/setup reasons). That's why I used the regex method.
My question is, even though the numbers are in not correct format (A60 for example), the condition always return False. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } #items;
Instead of that, I'd go back to the non-negated character class and match string that only has zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } #items;
But, this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } #items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
chomp;
my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } #nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with a ^$: ^ matches beginning of the string and $ is for the end of it. So ^$ match a string without anything -- an empty string! We need to account for that as [^0-9] doesn't while an array element can be an empty string. (It can also be a undef but from my understanding that is not possible with actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-integers: ', join ' ', map { "|$_|" } #non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } #nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of #nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
chomp;
my #nums = split /\s*,\s*/;
say "Data: #nums";
if ( my #non_ints = grep { /[^0-9] | ^$/x } #nums ) {
say 'Non-ints: ', join ' ', map { "|$_|" } #non_ints;
}
say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

Regular expression puzzler

I have been doing regular expression for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback to this, as well as to consuming the whole such substring, is the literal spelling out of all three subpatterns; what when the number need be decided dynamically? Or when it's twelve? The pattern can be built at runtime of course. In Perl, one way
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
 
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
my ($s, $pla) = #_;
return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
Rate greedy non_greedy
greedy 56.3/s -- -100%
non_greedy 90169/s 159926% --
So ... some 1600 times better non-greedy. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in greedy case); so there's a lot going on there. With default 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
Rate greedy non_greedy
greedy 56.5/s -- -100%
non_greedy 171949/s 304117% --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 33 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} atom means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the script regexp, If I specify "number n1"... Its should print "numbers are : numbers 2,3,4,5,6"
If I specify "number n2"... Its should print "numbers are : numbers 56,4,5,5:"
Now always its prints the last (final line - numbers 56,4,5,5:) as output. How to resolve this issue.
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark behind the first star makes the match non-greedy.
There is a newline character behind the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
Documentation: regular expression syntax, regexp
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will usr 1 2 ... m n etc for your as)
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all the above match, your string is a strict subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of charactersm drawn fromthe set, except a particular one
perhaps a particular one
any number of charactersm drawn fromthe set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
hit[a1] = false
foreach char c in string
if c not in {a1, ..., an} => fail
if hit[c] => fail
hit[c] = true
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for sillyness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}

How to use a REGEX pattern to remove a specific word "THE" only if at beginning of text string?

I have a text input field for titles of various things and to help minimize false negatives on search results(internal search is not the best), I need to have a REGEX pattern which looks at the first four characters of the input string and removes the word(and space after the word) _the _ if it is there at the beginning only.
For example if we are talking about the names of bands, and someone enters The Rolling Stones , what i need is for the entry to say only Rolling Stones
Can a regex be used to automatically strip these 4characters?
Applying the regex
^(?:\s*the\s*)?(.*)$
will match any string, and capture it in backreference no. 1, unless it starts with the (optionally surrounded by whitespace), in which case backref no. 1 will contain whatever follows.
You need to set the case-insensitive option in your regex engine for this to work.
You can use the ^ identifier to match a pattern at the beginning of a line, however for what you are using this for, it can be considered overkill.
A lot of languages support string manipulations, which is a more suitable choice. I can provide an example to demonstrate in Python,
>>> def func(n):
n = n[4:len(n)] if n[0:4] == "The " else n
return n
>>> func("The Rolling Stones")
'Rolling Stones'
>>> func("They Might Be Giants")
'They Might Be Giants'
As you don't clarify with language, here is a solution in Perl :
my $str = "The Rolling Stones";
$str =~ s/^the //i;
say $str; # Rolling Stones