How to accept numbers and specific words? - regex

i have validating a clothes size field, and want it to accept only numbers and specific "words" like S, M, XL, XXL etc. But i am unsure how to add the words to the pattern. For example, i want it to match something like "2, 5, 23, S, XXXL" which are valid sizes, but not random combinations of letters like "2X3, SLX"
Ok since people are not suggesting regexp solutions i guess i should say that this is part of a larger method of validation which uses regexp. For convenience and code consistency i want to do this with regexp.
Thanks

If they're a known set of values, I am not sure a regex is the best way to do it. But here is one regex that is basically a brute-force match of your values, each with a \b (word boundary) anchor
\b2\b|\b5\b|\b23\b|\bXXXL\b|\bXL\b|\bM\b|\bS\b

Sorry for not giving you a straight answer. regexp might be overkill in your case. A solution without it could, depending on your needs, be more maintainable.
I don't know which language you use so I will just pick one randomly. You could treat it as a piece of pseudo code.
In PHP:
function isValidSize($size) {
$validSizeTags = array("S", "M", "XL", "XXL");
$minimumSize = 2;
$maximumSize = 23;
if(ctype_digit(strval($size))) { // explanation for not using is_numeric/is_int below
if($size >= $minimumSize && $size <= $maxiumSize) {
return true;
}
} else if(in_array($size, $validSizeTags)) {
return true;
} else {
return false;
}
}
$size = "XS";
$isValid = isValidSize($size); // true
$size = 23;
$isValid = isValidSize($size); // true
$size = "23";
$isValid = isValidSize($size); // true, is_int would return false here
$size = 50;
$isValid = isValidSize($size); // false
$size = 15.5;
$isValid = isValidSize($size); // false, is_numeric would return true here
$size = "diadjsa";
$isValid = isValidSize($size); // false
(The reason for using ctype_digit(strval($size)) instead of is_int or is_numeric is that the first one will only return true for real integers, not strings like "15". And the second one will return true for all numeric values not just integers. ctype_digit will however return true for strings containing numeric characters, but return false for integers. So we convert the value to a string using strval before sending it to ctype_digits. Welcome to the world of PHP.)
With this logic in place you can easily inject validSizeTags, maximumSize and minimumSize from a configuration file or a database where you store all valid sizes for this specific product. That would get much messier using regular expressions.

Here is an example in JavaScript:
var patt = /^(?:\d{1,2}|X{0,3}[SML])$/i;
patt.test("2"); // true
patt.test("23"); // true
patt.test("XXXL"); // true
patt.test("S"); // true
patt.test("SLX"); // false

Use Array Membership Instead of Regular Expressions
Some problems are easier to deal with by using a different approach to representing your data. While regular expressions can be powerful, you might be better off with an array membership test if you are primarily interested in well-defined fixed values. For example, using Ruby:
sizes = %w[2 5 23 S XXXL].map(&:upcase)
size = 'XXXL'
sizes.include? size.to_s.upcase # => true
size = 'XL'
sizes.include? size.to_s.upcase # => false

seeing as it is being harder than i had thought, i am thinking to store the individual matched values in an array and match those individually against accepted values. i will use something like
[0-9]+|s|m|l|xl|xxl
and store the matches in the array
then i will check each array element against [0-9]+ and s|m|l|xl|xxl and if it matches any of these, it's valid. maybe there is a better way but i can't dwell on this for too long
thanks for your help

This will accept the alternatives one or more times, separated by whitespace or punctuation. It should be easy enough to expand the separator character class if you think you need to.
^([Xx]{0,3}[SsMmLl]|[0-9]+)([ ,:;-]+([Xx]{0,3}[SsMmLl]))*$
If you can interpolate the accepted pattern into a string before using it as a regex, you can reduce the code duplication.
This is a regular egrep pattern. Regex dialects differ between languages, so you might need to tweak something in order to adapt it to your language of choice (PHP? It's good form to include this information in the question).

Related

Regex permutations without repetition [duplicate]

This question already has answers here:
How to find all permutations of a given word in a given text?
(6 answers)
Closed 7 years ago.
I need a RegEx to check if I can find a expression in a string.
For the string "abc" I would like to match the first appearance of any of the permutations without repetition, in this case 6: abc, acb, bac, bca, cab, cba.
For example, in this string "adesfecabefgswaswabdcbaes" it'd find a coincidence in the position 7.
Also I'd need the same for permutations without repetition like this "abbc". The cases for this are 12: acbb, abcb, abbc, cabb, cbab, cbba, bacb, babc, bcab, bcba, bbac, bbca
For example, in this string "adbbcacssesfecabefgswaswabdcbaes" it'd find a coincidence in the position 3.
Also, I would like to know how would that be for similar cases.
EDIT
I'm not looking for the combinations of the permutations, no. I already have those. WHat I'm looking for is a way to check if any of those permutations is in a given string.
EDIT 2
This regex I think covers my first question
([abc])(?!\1)([abc])(?!\2|\1)[abc]
Can find all permutations(6) of "abc" in any secuence of characters.
Now I need to do the same when I have a repeated character like abbc (12 combinations).
([abc])(?!\1)([abc])(?!\2|\1)[abc]
You can use this without g flag to get the position.See demo.The position of first group is what you want.
https://regex101.com/r/nS2lT4/41
https://regex101.com/r/nS2lT4/42
The only reason you might "need a regex" is if you are working with a library or tool which only permits specifying certain kinds of rules with a regex. For instance, some editors can be customized to color certain syntactic constructs in a particular way, and they only allow those constructs to be specified as regular expressions.
Otherwise, you don't "need a regex", you "need a program". Here's one:
// are two arrays equal?
function array_equal(a1, a2) {
return a1.every(function(chr, i) { return chr === a2[i]; });
}
// are two strings permutations of each other?
function is_permutation(s1, s2) {
return array_equal(s1.split('').sort(), s2.split('').sort());
}
// make a function which finds permutations in a string
function make_permutation_finder(chars) {
var len = chars.length;
return function(str) {
for (i = 0; i < str.length - len; i++) {
if (is_permutation(chars, str.slice(i, i+len))) return i;
}
return -1;
};
}
> finder = make_permutation_finder("abc");
> console.log(finder("adesfecabefgswaswabdcbaes"));
< 6
Regexps are far from being powerful enough to do this kind of thing.
However, there is an alternative, which is precompute the permutations and build a dynamic regexp to find them. You did not provide a language tag, but here's an example in JS. Assuming you have the permutations and don't have to worry about escaping special regexp characters, that's just
regexp = new RegExp(permuations.join('|'));

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol - make sure this is fine with the way you use this type of thing
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an Example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy
For the space problem it's easy if your language supports zero-width-look-behind
var result = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support them
var result2 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that support the replace using the $1...$n .
But for the lower-to-upper case you can't do it directly in Regex. You can get the first character through a regex like: ^[a-z] but you can't convet it.
For example in C# you could do
var result4 = Regex.Replace(result, "^([a-z])", m =>
{
return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(#"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
if (m.Groups[1].Success)
{
return m.ToString().ToUpperInvariant();
}
else
{
return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
}
});
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
//these two may be passed in as parameters:
string tosearch;//the string you are searching through
string regex;//your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
cc = gc[i].Captures;
counter = cc.Count;
int grpnum = i + 1;
Debug.WriteLine("Scanning group: " + grpnum.ToString() );
// Print number of captures in this group.
Debug.WriteLine(" Captures count = " + counter.ToString());
if (cc.Count >= 1)
{
foreach (Capture cap in cc)
{
Debug.WriteLine(string.format(" Capture found: {0}", cap.ToString()));
}
}
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
' Functions
Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
' Variables
pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
Return scriptString
End Function
Function processFunction(ByVal match As Match)
Dim nameString As String = match.Groups(2).Value
Dim paramString As String = match.Groups(3).Value
paramString = processString(paramString)
Select Case nameString
Case "time"
Return getLocalValueTime(paramString)
Case "math"
Return getLocalValueMath(paramString)
End Select
Return ""
End Function
Function processVariable(ByVal match As Match)
Try
Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
Catch ex As Exception
End Try
End Function

Regular Expression to find numbers with same digits in different order

I have been looking for a regular expression with Google for an hour or so now and can't seem to work this one out :(
If I have a number, say:
2345
and I want to find any other number with the same digits but in a different order, like this:
2345
For example, I match
3245 or 5432 (same digits but different order)
How would I write a regular expression for this?
There is an "elegant" way to do it with a single regex:
^(?:2()|3()|4()|5()){4}\1\2\3\4$
will match the digits 2, 3, 4 and 5 in any order. All four are required.
Explanation:
(?:2()|3()|4()|5()) matches one of the numbers 2, 3, 4, or 5. The trick is now that the capturing parentheses match an empty string after matching a number (which always succeeds).
{4} requires that this happens four times.
\1\2\3\4 then requires that all four backreferences have participated in the match - which they do if and only if each number has occurred once. Since \1\2\3\4 matches an empty string, it will always match as long as the previous condition is true.
For five digits, you'd need
^(?:2()|3()|4()|5()|6()){5}\1\2\3\4\5$
etc...
This will work in nearly any regex flavor except JavaScript.
I don't think a regex is appropriate. So here is an idea that is faster than a regex for this situation:
check string lengths, if they are different, return false
make a hash from the character (digits in your case) to integers for counting
loop through the characters of your first string:
increment the counter for that character: hash[character]++
loop through the characters of the second string:
decrement the counter for that character: hash[character]--
break if any count is negative (or nonexistent)
loop through the entries, making sure each is 0:
if all are 0, return true
else return false
EDIT: Java Code (I'm using Character for this example, not exactly Unicode friendly, but it's the idea that matters now):
import java.util.*;
public class Test
{
public boolean isSimilar(String first, String second)
{
if(first.length() != second.length())
return false;
HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
for(char c : first.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count++;
hash.put(c, count);
}
else
{
hash.put(c, 1);
}
}
for(char c : second.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count--;
if(count < 0)
return false;
hash.put(c, count);
}
else
{
return false;
}
}
for(Integer i : hash.values())
{
if(i.intValue()!=0)
return false;
}
return true;
}
public static void main(String ... args)
{
//tested to print false
System.out.println(new Test().isSimilar("23445", "5432"));
//tested to print true
System.out.println(new Test().isSimilar("2345", "5432"));
}
}
This will also work for comparing letters or other character sequences, like "god" and "dog".
Put the digits of each number in two arrays, sort the arrays, find out if they hold the same digits at the same indices.
RegExes are not the right tool for this task.
You could do something like this to ensure the right characters and length
[2345]{4}
Ensuring they only exist once is trickier and why this is not suited to regexes
(?=.*2.*)(?=.*3.*)(?=.*4.*)(?=.*5.*)[2345]{4}
The simplest regular expression is just all 24 permutations added up via the or operator:
/2345|3245|5432|.../;
That said, you don't want to solve this with a regex if you can get away with it. A single pass through the two numbers as strings is probably better:
1. Check the string length of both strings - if they're different you're done.
2. Build a hash of all the digits from the number you're matching against.
3. Run through the digits in the number you're checking. If you hit a match in the hash, mark it as used. Keep going until you don't get an unused match in the hash or run out of items.
I think it's very simple to achieve if you're OK with matching a number that doesn't use all of the digits. E.g. if you have a number 1234 and you accept a match with the number of 1111 to return TRUE;
Let me use PHP for an example as you haven't specified what language you use.
$my_num = 1245;
$my_pattern = '/[' . $my_num . ']{4}/'; // this resolves to pattern: /[1245]{4}/
$my_pattern2 = '/[' . $my_num . ']+/'; // as above but numbers can by of any length
$number1 = 4521;
$match = preg_match($my_pattern, $number1); // will return TRUE
$number2 = 2222444111;
$match2 = preg_match($my_pattern2, $number2); // will return TRUE
$number3 = 888;
$match3 = preg_match($my_pattern, $number3); // will return FALSE
$match4 = preg_match($my_pattern2, $number3); // will return FALSE
Something similar will work in Perl as well.
Regular expressions are not appropriate for this purpose. Here is a Perl script:
#/usr/bin/perl
use strict;
use warnings;
my $src = '2345';
my #test = qw( 3245 5432 5542 1234 12345 );
my $canonical = canonicalize( $src );
for my $candidate ( #test ) {
next unless $canonical eq canonicalize( $candidate );
print "$src and $candidate consist of the same digits\n";
}
sub canonicalize { join '', sort split //, $_[0] }
Output:
C:\Temp> ks
2345 and 3245 consist of the same digits
2345 and 5432 consist of the same digits

Regex default value if not found

I would like to supply my regular expression with a 'default' value, so if the thing I was looking for is not found, it will return the default value as if it had found it.
Is this possible to do using regex?
It sounds like you want some sort of regex syntax that says "if the regexp does not match any part of the given string pretend that it matched the following substring: 'foobar'". Such a feature does not exist in any regexp syntax I've seen.
You'll probably need to something like this:
matched_string = string.find_regex_match(regex);
if(matched_string == null) {
string = "default";
}
(This will of course need to be adjusted to the language you're using)
It's hard to answer this without a specific language, but in Perl at least, something like this works:
$string='hello';
$default = 1234;
($match) = ($string =~ m/(\d+)/ or $default);
print "$match\n";
1234
Not strictly part of the regex, but avoids the extra conditional block.
As far as I know, you can't do that with RegExp`s, at least with Perl Compatible Regular Expressions.
You can see by your self here.
Here's what I did in javascript...
function match(regx, str, dflt, index = 0) {
if (!str) return dflt
let x = str.match(regx)
return x ? x[index] || dflt : dflt
}