Extract all allowed characters from a regular expression - regex

I need to extract a list of all allowed characters from a given regular expression.
So for example, if the regex looks like this (some random example):
[A-Z]*\s+(4|5)+
the output should be
ABCDEFGHIJKLMNOPQRSTUVWXYZ45
(omitting the whitespace)
One obvious solution would be to define a complete set of allowed characters, and use a find method, to return the corresponding subsequence for each character. This seems to be a bit of a dull solution though.
Can anyone think of a (possibly simple) algorithm on how to implement this?

One thing you can do is:
split the regex by subgroup
test the char panel against the subgroup
See the following example (not perfect yet) c#:
static void Main(String[] args)
{
Console.WriteLine($"-->{TestRegex(#"[A-Z]*\s+(4|5)+")}<--");
}
public static string TestRegex(string pattern)
{
string result = "";
foreach (var subPattern in Regex.Split(pattern, #"[*+]"))
{
if(string.IsNullOrWhiteSpace(subPattern))
continue;
result += GetAllCharCoveredByRegex(subPattern);
}
return result;
}
public static string GetAllCharCoveredByRegex(string pattern)
{
Console.WriteLine($"Testing {pattern}");
var regex = new Regex(pattern);
var matches = new List<char>();
for (var c = char.MinValue; c < char.MaxValue; c++)
{
if (regex.IsMatch(c.ToString()))
{
matches.Add(c);
}
}
return string.Join("", matches);
}
Which outputs:
Testing [A-Z]
Testing \s
Testing (4|5)
-->ABCDEFGHIJKLMNOPQRSTUVWXYZ
? ? ???????? 45<--

Related

Password Validate With Regex

Below is the regex which validates my passwords:
(((?=.*\\d{2,20})(?=.*[a-z]{2,20})(?=.*[A-Z]{3,20})(?=.*[##$%!##%&*?_~,-]{2,20})).{9,20}[^<>'\"])
Basically what I want is that it contains all above given characters in password. But it needs those characters in sequence e.g
it validates 23aaAAA#!, but it does not validatea2#!AAAa.
Simply add nested capturing groups to keep the chars validation not strictly as a sequence. The regex can be also simplified as follow (without extra groups):
(?=(?:.*[0-9]){2,20})
(?=(?:.*[a-z]){2,20})
(?=(?:.*[A-Z]){3,20})
(?=(?:.*[##$%!&*?_~,-]){2,20})
.{9,20}
[^<>'\"] # This matches also the newline char, i don't think you really want this...
In java use it as follows to match the :
String regex = "(?=(?:.*[0-9]){2,20})(?=(?:.*[a-z]){2,20})(?=(?:.*[A-Z]){3,20})(?=(?:.*[##$%!&*?_~,-]){2,20}).{9,20}[^<>'\"]";
String password = "23aaA#AA!#X"; // This have to be 10 chars long at least, no newline
if (password.matches(regex))
System.out.println("Success");
else
System.out.println("Failure");
The regex requires a password with (all not strictly in sequence):
(?=(?:.*[0-9]){2,20}): 2 numbers
(?=(?:.*[a-z]){2,20}): 3 lowercase lettes
(?=(?:.*[A-Z]){3,20}): 3 uppercase letters
(?=(?:.*[##$%!&*?_~,-]){2,20}): 2 of the symbols in the chars group
.{9,20}: Min length of 9 and max of 20
[^<>'\"]: One char that is not in (<,>,',") (NOTE: this matches also the newline)
So the min/max is actually 10/21 but the last statemente matches also the newline, so in the online regex demo, the visible chars will be between 9 and 20.
Regex online demo here
I would not try to build a single regexp, because nobody will ever be able to read nor change this expression ever again. It's to hard to understand.
"a2#!AAAa" will never validate because it needs 2 digits.
If you want a password, that contains a minmuum and maximum number of chars from a group. simply count them.
public class Constraint {
private Pattern pattern;
public String regexp = "";
public int mincount=2;
public int maxcount=6;
public Constraint(String regexp, int mincount, int maxcount) {
this.mincount=mincount;
this.maxcount=maxcount;
pattern = Pattern.compile(regexp);
}
public boolean fit(String str)
{
int count = str.length() - pattern.matcher(str).replaceAll("").length();
return count >= mincount && count <= maxcount;
}
#Override
public String toString() {
return pattern.toString();
}
}
public class Test {
static Constraint[] constraints = new Constraint[] {
new Constraint("[a-z]",2,20),
new Constraint("[A-Z]",2,20),
new Constraint("\\d",2,20),
new Constraint("["+Pattern.quote("##$%!##%&*?_~,-")+"]",2,20),
new Constraint("[<>'\\\"]",0,0)
};
public static boolean checkit(String pwd)
{
boolean ok=true;
for (Constraint constraint:constraints)
{
if (constraint.fit(pwd)==false)
{
System.out.println("match failed for constraint "+constraint);
ok=false;
}
}
return ok;
}
public static void main(String[] args) {
checkit("23aaAAA#!");
checkit("a2#!AAAa");
}
}

SSN masking using the regular expression

I am trying to mask the SSN which is in "123-12-1234" to "XXX-XX-1234". I am able achieve using the below code.
string input = " 123-12-1234 123-11-1235 ";
Match m = Regex.Match(input, #"((?:\d{3})-(?:\d{2})-(?<token>\d{4}))");
while (m.Success)
{
if (m.Groups["token"].Length > 0)
{
input = input.Replace(m.Groups[0].Value,"XXX-XX-"+ m.Groups["token"].Value);
}
m = m.NextMatch();
}
Is there a better way to do it in one line using the Regex.Replace method.
You can try the following:
string input = " 123-12-1234 123-11-1235";
string pattern = #"(?:\d{3})-(?:\d{2})-(\d{4})";
string result = Regex.Replace(input, pattern, "XXX-XX-$1");
Console.WriteLine(result); // XXX-XX-1234 XXX-XX-1235
If your are going to be doing a lot of masking you should consider a few whether to use compiled regular expression or not.
Using them will cause a slight delay when the application is first run, but they will run faster subsequently.
Also the choice of static vs instances of the Regex should be considered.
I found the following to be the most efficient
public class SSNFormatter
{
private const string IncomingFormat = #"^(\d{3})-(\d{2})-(\d{4})$";
private const string OutgoingFormat = "xxxx-xx-$3";
readonly Regex regexCompiled = new Regex(IncomingFormat, RegexOptions.Compiled);
public string SSNMask(string ssnInput)
{
var result = regexCompiled.Replace(ssnInput, OutgoingFormat);
return result;
}
}
There is a comparison of six methods for regex checking/masking here.

Code to parse capture groups in regular expressions into a tree

I need to identify (potentially nested) capture groups within regular expressions and create a tree. The particular target is Java-1.6 and I'd ideally like Java code. A simple example is:
"(a(b|c)d(e(f*g))h)"
which would be parsed to
"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
... "f*g"
The solution should ideally account for count expressions, quantifiers, etc and levels of escaping. However if this is not easy to find a simpler approach might suffice as we can limit the syntax used.
EDIT. To clarify. I want to parse the regular expression string itself. To do so I need to know the BNF or equivalent for Java 1.6 regexes. I am hoping someone has already done this.
A byproduct of a result would be that the process would test for validity of the regex.
Consider stepping up to an actual parser/lexer:
http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started
It looks complicated, but if your language is fairly simple, it's fairly straightforward. And if it's not, doing it in regexes will probably make your life hell :)
I came up with a partial solution using an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. First the escaped characters (\ , ( and ) ) are de-escaped (here I use BS, LB and RB), then remaining brackets are translated to XML tags, then the XML is parsed and the characters re-escaped. What is needed further is a BNF for Java 1.6 regexes doe quantifiers such as ?:, {d,d} and so on.
public static Element parseRegex(String regex) throws Exception {
regex = regex.replaceAll("\\\\", "BS");
regex.replaceAll("BS\\(", "LB");
regex.replaceAll("BS\\)", "RB");
regex = regex.replaceAll("\\(", "<bracket>");
regex.replaceAll("\\)", "</bracket>");
Element regexX = new Builder().build(new StringReader(
"<regex>"+regex+"</regex>")).getRootElement();
extractCaptureGroupContent(regexX);
return regexX;
}
private static String extractCaptureGroupContent(Element regexX) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < regexX.getChildCount(); i++) {
Node childNode = regexX.getChild(i);
if (childNode instanceof Text) {
Text t = (Text)childNode;
String s = t.getValue();
s = s.replaceAll("BS", "\\\\").replaceAll("LB",
"\\(").replaceAll("RB", "\\)");
t.setValue(s);
sb.append(s);
} else {
sb.append("("+extractCaptureGroupContent((Element)childNode)+")");
}
}
String capture = sb.toString();
regexX.addAttribute(new Attribute("capture", capture));
return capture;
}
example:
#Test
public void testParseRegex2() throws Exception {
String regex = "(.*(\\(b\\))c(d(e)))";
Element regexElement = ParserUtil.parseRegex(regex);
CMLUtil.debug(regexElement, "x");
}
gives:
<regex capture="(.*((b))c(d(e)))">
<bracket capture=".*((b))c(d(e))">.*
<bracket capture="(b)">(b)</bracket>c
<bracket capture="d(e)">d
<bracket capture="e">e</bracket>
</bracket>
</bracket>
</regex>

Want to Encode text during Regex.Replace call

I have a regex call that I need help with.
I haven't posted my regex, because it is not relevant here.
What I want to be able to do is, during the Replace, I also want to modify the ${test} portion by doing a Html.Encode on the entire text that is effecting the regex.
Basically, wrap the entire text that is within the range of the regex with the bold tag, but also Html.Encode the text inbetween the bold tag.
RegexOptions regexOptions = RegexOptions.Compiled | RegexOptions.IgnoreCase;
text = Regex.Replace(text, regexBold, #"<b>${text}</b>", regexOptions);
There is an incredibly easy way of doing this (in .net). Its called a MatchEvaluator and it lets you do all sorts of cool find and replace. Essentially you just feed the Regex.Replace method the method name of a method that returns a string and takes in a Match object as its only parameter. Do whatever makes sense for your particular match (html encode) and the string you return will replace the entire text of the match in the input string.
Example: Lets say you wanted to find all the places where there are two numbers being added (in text) and you want to replace the expression with the actual number. You can't do that with a strict regex approach, but you can when you throw in a MatchEvaluator it becomes easy.
public void Stuff()
{
string pattern = #"(?<firstNumber>\d+)\s*(?<operator>[*+-/])\s*(?<secondNumber>\d+)";
string input = "something something 123 + 456 blah blah 100 - 55";
string output = Regex.Replace(input, pattern, MatchMath);
//output will be "something something 579 blah blah 45"
}
private static string MatchMath(Match match)
{
try
{
double first = double.Parse(match.Groups["firstNumber"].Value);
double second = double.Parse(match.Groups["secondNumber"].Value);
switch (match.Groups["operator"].Value)
{
case "*":
return (first * second).ToString();
case "+":
return (first + second).ToString();
case "-":
return (first - second).ToString();
case "/":
return (first / second).ToString();
}
}
catch { }
return "NaN";
}
Find out more at http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.matchevaluator.aspx
Don't use Regex.Replace in this case... use..
foreach(Match in Regex.Matches(...))
{
//do your stuff here
}
Heres an implementation of this I've used to pick out special replace strings from content and localize them.
protected string FindAndTranslateIn(string content)
{
return Regex.Replace(content, #"\{\^(.+?);(.+?)?}", new MatchEvaluator(TranslateHandler), RegexOptions.IgnoreCase);
}
public string TranslateHandler(Match m)
{
if (m.Success)
{
string key = m.Groups[1].Value;
key = FindAndTranslateIn(key);
string def = string.Empty;
if (m.Groups.Count > 2)
{
def = m.Groups[2].Value;
if(def.Length > 1)
{
def = FindAndTranslateIn(def);
}
}
if (group == null)
{
return Translate(key, def);
}
else
{
return Translate(key, group, def);
}
}
return string.Empty;
}
From the match evaluator delegate you return everything you want replaced, so where I have returns you would have bold tags and an encode call, mine also supports recursion, so a little over complicated for your needs, but you can just pare down the example for your needs.
This is equivalent to doing an iteration over the collection of matches and doing parts of the replace methods job. It just saves you some code, and you get to use a fancy shmancy delegate.
If you do a Regex.Match, the resulting match objects group at the 0th index, is the subset of the intput that matched the regex.
you can use this to stitch in the bold tags and encode it there.
Can you fill in the code inside {} to add the bold tag, and encode the text?
I'm confused as to how to apply the changes to the entire text block AND replace the section in the text variable at the end.

Regex Rejecting matches because of Instr

What's the easiest way to do an "instring" type function with a regex? For example, how could I reject a whole string because of the presence of a single character such as :? For example:
this - okay
there:is - not okay because of :
More practically, how can I match the following string:
//foo/bar/baz[1]/ns:foo2/#attr/text()
For any node test on the xpath that doesn't include a namespace?
(/)?(/)([^:/]+)
Will match the node tests but includes the namespace prefix which makes it faulty.
I'm still not sure whether you just wanted to detect if the Xpath contains a namespace, or whether you want to remove the references to the namespace. So here's some sample code (in C#) that does both.
class Program
{
static void Main(string[] args)
{
string withNamespace = #"//foo/ns2:bar/baz[1]/ns:foo2/#attr/text()";
string withoutNamespace = #"//foo/bar/baz[1]/foo2/#attr/text()";
ShowStuff(withNamespace);
ShowStuff(withoutNamespace);
}
static void ShowStuff(string input)
{
Console.WriteLine("'{0}' does {1}contain namespaces", input, ContainsNamespace(input) ? "" : "not ");
Console.WriteLine("'{0}' without namespaces is '{1}'", input, StripNamespaces(input));
}
static bool ContainsNamespace(string input)
{
// a namspace must start with a character, but can have characters and numbers
// from that point on.
return Regex.IsMatch(input, #"/?\w[\w\d]+:\w[\w\d]+/?");
}
static string StripNamespaces(string input)
{
return Regex.Replace(input, #"(/?)\w[\w\d]+:(\w[\w\d]+)(/?)", "$1$2$3");
}
}
Hope that helps! Good luck.
Match on :? I think the question isn't clear enough, because the answer is so obvious:
if(Regex.Match(":", input)) // reject
You might want \w which is a "word" character. From javadocs, it is defined as [a-zA-Z_0-9], so if you don't want underscores either, that may not work....
I dont know regex syntax very well but could you not do:
[any alpha numeric]\*:[any alphanumeric]\*
I think something like that should work no?
Yeah, my question was not very clear. Here's a solution but rather than a single pass with a regex, I use a split and perform iteration. It works as well but isn't as elegant:
string xpath = "//foo/bar/baz[1]/ns:foo2/#attr/text()";
string[] nodetests = xpath.Split( new char[] { '/' } );
for (int i = 0; i < nodetests.Length; i++)
{
if (nodetests[i].Length > 0 && Regex.IsMatch( nodetests[i], #"^(\w|\[|\])+$" ))
{
// does not have a ":", we can manipulate it.
}
}
xpath = String.Join( "/", nodetests );