Generate regex expression from series of input - regex

Is it somehow possible to generate a regex from a series of inputs?
I am not sure whether this is even possible, hence this question.
Is there any tool or website that does this?
Update:
Say I enter inputs like
www.google.com
google.com
http://www.google.com
It should somehow give me a regex that accepts this type of input. Is this possible?

For your URL example, here's something that I just threw together in C#. I think it'll help you out.
// Input "pattern" should consist of a string with ONLY the following tags:
// <protocol> <web> <website> <domainextension> <restofpath>
// Ex) GenerateRegexFor("<protocol><web><website><domainextension>") will match http://www.google.com
// (ArgumentException below requires "using System;")
public string GenerateRegexFor(string pattern)
{
    string regex = ProcessNextPart(pattern, "");
    return regex;
}

public string ProcessNextPart(string pattern, string regex)
{
    // Tags are matched case-insensitively.
    pattern = pattern.ToLower();
    if (pattern.StartsWith("<protocol>"))
    {
        regex += @"[a-zA-Z]+://";
        // Strip only the leading tag, so a repeated tag is processed again later.
        pattern = pattern.Substring("<protocol>".Length);
    }
    else if (pattern.StartsWith("<web>"))
    {
        regex += @"www\d?"; // \d? in case of www2
        pattern = pattern.Substring("<web>".Length);
    }
    else if (pattern.StartsWith("<website>"))
    {
        regex += @"([a-zA-Z0-9\-]*\.)+";
        pattern = pattern.Substring("<website>".Length);
    }
    else if (pattern.StartsWith("<domainextension>"))
    {
        regex += "[a-zA-Z]{2,}";
        pattern = pattern.Substring("<domainextension>".Length);
    }
    else if (pattern.StartsWith("<restofpath>"))
    {
        regex += @"(/[a-zA-Z0-9\-]*)*(\.[a-zA-Z]*/?)?";
        pattern = pattern.Substring("<restofpath>".Length);
    }
    else if (pattern.Length > 0)
    {
        // Unknown tag: fail loudly instead of recursing forever.
        throw new ArgumentException("Unrecognized tag at: " + pattern);
    }
    if (pattern.Length > 0)
        return ProcessNextPart(pattern, regex);
    return regex;
}
Depending on the style of URL you'd like to match, I think this should match just about anything and everything. You may want to make it a little more picky if there will be text that is similar to URLs but not URLs.
You'd use it like this:
// to match something like "www.google.com/images/whatever":
//   <web>             -> www
//   <website>         -> .google.
//   <domainextension> -> com
//   <restofpath>      -> /images/whatever
string regex = GenerateRegexFor("<web><website><domainextension><restofpath>");
// to match something like "http://www.google.com/images/whatever"
string regexWithProtocol = GenerateRegexFor("<protocol><web><website><domainextension><restofpath>");
You can use any of those tags, in any order (though some of them wouldn't make much sense). Feel free to build on this, too. You could add as many tags as you wanted for it to represent any number of patterns.
Oh, and +1 for giving me something to do at work.
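The tag idea from the answer above can also be sketched as a lookup table. This is a hypothetical JavaScript port, not the author's code: `TAG_FRAGMENTS` and `generateRegexFor` are illustrative names, the fragments are copied from the C# version, and the `^`/`$` anchors are an added assumption so that `test` only accepts whole-string matches.

```javascript
// Hypothetical table-driven variant; the regex fragments come from the C# answer.
const TAG_FRAGMENTS = {
  "<protocol>": "[a-zA-Z]+://",
  "<web>": "www\\d?", // \d? in case of www2
  "<website>": "([a-zA-Z0-9\\-]*\\.)+",
  "<domainextension>": "[a-zA-Z]{2,}",
  "<restofpath>": "(/[a-zA-Z0-9\\-]*)*(\\.[a-zA-Z]*/?)?",
};

function generateRegexFor(pattern) {
  // Swap every <tag> for its fragment; an unknown tag throws instead of being ignored.
  const source = pattern.toLowerCase().replace(/<[a-z]+>/g, (tag) => {
    if (!(tag in TAG_FRAGMENTS)) {
      throw new Error(`Unknown tag: ${tag}`);
    }
    return TAG_FRAGMENTS[tag];
  });
  // Anchoring is an assumption added here, not part of the original answer.
  return new RegExp(`^${source}$`);
}

// The three inputs from the question, each with a tag pattern describing its shape.
console.log(generateRegexFor("<web><website><domainextension>").test("www.google.com")); // true
console.log(generateRegexFor("<website><domainextension>").test("google.com")); // true
console.log(generateRegexFor("<protocol><web><website><domainextension>").test("http://www.google.com")); // true
```

Adding a new tag then only means adding one entry to the table.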

Related

I want to exact match characters using regex in JS? [duplicate]

What is the regular expression (in JavaScript if it matters) to only match if the text is an exact match? That is, there should be no extra characters at other end of the string.
For example, if I'm trying to match for abc, then 1abc1, 1abc, and abc1 would not match.
Use the start and end delimiters: ^abc$
It depends. You could use
string.match(/^abc$/)
But that would not match the following string: 'the first 3 letters of the alphabet are abc. not abc123'
I think you would want to use \b (word boundaries):
var str = 'the first 3 letters of the alphabet are abc. not abc123';
var pat = /\b(abc)\b/g;
console.log(str.match(pat));
Live example: http://jsfiddle.net/uu5VJ/
If the former solution works for you, I would advise against using it: it means you are really comparing whole strings.
In that case you may have something like the following:
var strs = ['abc', 'abc1', 'abc2'];
for (var i = 0; i < strs.length; i++) {
    if (strs[i] == 'abc') {
        //do something
    }
    else {
        //do something else
    }
}
While you could use
if (strs[i].match(/^abc$/g)) {
    //do something
}
It would be more resource-intensive than a plain comparison. For me, a general rule of thumb is: for a simple string comparison use a conditional expression; for a more dynamic pattern use a regular expression.
More on JavaScript regexes: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions
"^" marks the beginning of the input and "$" the end of it (of each line, with the m flag). E.g.:
var re = /^abc$/;
Would match "abc" but not "1abc" or "abc1". You can learn more at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
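When the text to match exactly is not a fixed literal like abc, it has to be escaped before it is wrapped in the anchors. A minimal sketch, assuming hypothetical helper names `escapeRegExp` and `exactMatcher` (neither appears in the answers above):

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex that matches only when the whole input equals `literal`.
function exactMatcher(literal) {
  return new RegExp(`^${escapeRegExp(literal)}$`);
}

const isAbc = exactMatcher("abc");
const results = ["abc", "1abc1", "1abc", "abc1"].map((s) => isAbc.test(s));
console.log(results); // [ true, false, false, false ]

// Without escaping, "." would match any character; here it only matches a literal dot.
console.log(exactMatcher("a.c").test("abc")); // false
console.log(exactMatcher("a.c").test("a.c")); // true
```

For a fixed literal, the plain `==` comparison recommended above remains the simpler choice.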

Match longest substring with regex [duplicate]

I tried looking for an answer to this question but couldn't find anything, and I hope there's an easy solution. I am using the following code in C#:
String pattern = ("(hello|hello world)");
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
var matches = regex.Matches("hello world");
The question is: is there a way for the Matches method to return the longest pattern first? In this case, I want to get "hello world" as my match as opposed to just "hello". This is just an example; my pattern list consists of a decent number of words.
If you already know the lengths of the words beforehand, then put the longest first. For example:
String pattern = ("(hello world|hello)");
The longest will be matched first. If you don't know the lengths beforehand, the pattern alone can't guarantee this.
An alternative approach would be to store all the matches in an array/hash/list and pick the longest one manually, using the language's built-in functions.
A regular expression engine tries the alternatives from left to right. If you want to make sure you get the longest possible match first, you'll need to change the order of your patterns. The leftmost alternative is tried first. If it matches, the engine goes on to match the rest of the pattern against the rest of the string; the next alternative is tried only if no match can be found.
String pattern = ("(hello world|hello wor|hello)");
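If the word list is only known at run time, the longest-first ordering can be produced by sorting before the alternation is built. A minimal sketch in JavaScript, whose alternation is also tried left to right; `escapeForRegex` and `longestFirstPattern` are hypothetical names, and the same idea should carry over to C# by sorting the list before joining it.

```javascript
// Escape regex metacharacters so each word is matched literally.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive alternation with the longest words first.
function longestFirstPattern(words) {
  const alternatives = [...words]
    .sort((a, b) => b.length - a.length) // longest alternative goes leftmost
    .map(escapeForRegex);
  return new RegExp(`(${alternatives.join("|")})`, "i");
}

const naive = /(hello|hello world)/i;
const sorted = longestFirstPattern(["hello", "hello world"]);

console.log("hello world".match(naive)[0]); // hello
console.log("hello world".match(sorted)[0]); // hello world
```

Sorting a copy (`[...words]`) leaves the caller's list in its original order.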
Make two different regex matches. The first will match your longer option, and if that does not work, the second will match your shorter option.
string input = "hello world";
string patternFull = "hello world";
Regex regexFull = new Regex(patternFull, RegexOptions.IgnoreCase);
var matches = regexFull.Matches(input);
if (matches.Count == 0)
{
    string patternShort = "hello";
    Regex regexShort = new Regex(patternShort, RegexOptions.IgnoreCase);
    matches = regexShort.Matches(input);
}
At the end, matches will be the output of "full" or "short", but "full" will be checked first and will short-circuit if it is true.
You can wrap the logic in a function if you plan on calling it many times. This is something I came up with (but there are plenty of other ways you can do this).
public bool HasRegexMatchInOrder(string input, params string[] patterns)
{
    // Try each pattern in the order given and stop at the first one that matches.
    foreach (var pattern in patterns)
    {
        Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
        if (regex.IsMatch(input))
        {
            return true;
        }
    }
    return false;
}
string input = "hello world";
bool hasAMatch = HasRegexMatchInOrder(input, "hello world", "hello", ...);

C++ regex to search file paths in a string

I'm trying to parse strings that can contain file paths.
I'm using C++ with the regex library. I'm not that good with regex; here it uses the ECMAScript grammar.
I don't know why the string:
"C:\Windows\explorer.exe C:\titi\toto.exe"
doesn't fully match the pattern (actually it only finds the first path):
(?:[a-zA-Z]\:|\\)(?:\\[a-z_\-\s0-9]+)+
Do you have a better idea for finding every match?
Thanks!
Here's my code:
wsmatch matches;
regex_constants::match_flag_type fl = regex_constants::match_default;
regex_constants::syntax_option_type st = regex_constants::icase // Case insensitive
                                       | regex_constants::ECMAScript
                                       | regex_constants::optimize;
wregex pattern(L"(?:[a-zA-Z]\\:|\\\\)(?:\\\\[a-z_\\-\\s0-9]+)+", st);
// Look if matches pattern
printf("--> %ws\n", path.c_str());
if (regex_search(path, matches, pattern, fl)
    && matches.size() > 0)
{
    for (u_int i = 0; i < matches.size(); i++)
    {
        wssub_match sub_match = matches[i];
        wstring sub_match_str = sub_match.str();
        printf("%ws\n", sub_match_str.c_str());
    }
}
You could use something like this:
.?:(\\[a-zA-Z 0-9]*)*.[a-zA-Z]*
I tested it with http://regexpal.com/ and it extracts all file paths.
Although the regex provided by @mspoerr satisfies the example in the question, it wasn't great for me in more complex scenarios, so I wrote my own.
Regex:
(\w:)?([\\\w\s0-9_]*)\.\w+
Advanced test string:
C:\Wi ndows\explorer.exe asdasds
: ad C:\titi\toto.Heexe
HELLOO : qwefqwfqwf c:\aa.
(it matches only two valid file paths)
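Since the C++ code in the question selects the ECMAScript grammar, the regex above can be prototyped in JavaScript. The sketch below is only an illustration and not a fix for the C++ code: it runs that regex against the string from the question and contrasts a single search, which stops at the first path, with iterating over all matches via the `g` flag and `matchAll`. In C++ the analogous step would presumably be to iterate with `std::wsregex_iterator` rather than call `regex_search` once.

```javascript
// The input from the question; String.raw keeps the backslashes literal.
const input = String.raw`C:\Windows\explorer.exe C:\titi\toto.exe`;

// A single search without the g flag returns only the first path.
const firstOnly = input.match(/(\w:)?([\\\w\s0-9_]*)\.\w+/)[0];

// With the g flag, matchAll yields one match object per path.
const pathPattern = /(\w:)?([\\\w\s0-9_]*)\.\w+/g;
const allPaths = Array.from(input.matchAll(pathPattern), (m) => m[0]);

console.log(firstOnly);
console.log(allPaths);
```

Here `firstOnly` holds the explorer.exe path, while `allPaths` holds both paths in the order they appear.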