I' use a small validation script that tells me when a given url is blocked by robots.txt.
For example there is a given url like http://www.example.com/dir/test.html
My current script tells me if the url is blocked, when there is a line in robots.txt like:
Disallow: /test1.html
But it also says that the url is blocked when there are lines like:
Disallow: /tes
Thats wrong.
I googled something like "regex exact string" and found lots of solutions for the problem above.
But this leads to another problem. When I check exact string in an url http://www.example.com/dir/test1/page.html and in robots.txt is a line like
Disallow: /test1/page.html
My script doesn't get it because it looks for
Disallow: /dir/test1/page.html
And says: That the target page.html is not blocked - but it is!
How can I match an exact string with variable text before and behind the string?
Here is the short-version of the script:
/* example for $rules */
$rules = array("/tes", "/test", "/test1", "/test/page.html", "/test1/page.html", "/dir/test1/page.html")
/*example for $parsed['path']:*/
"dir/test.html"
"dir/test1/page.html"
"test1/page.html"
foreach ($rules as $rule) {
// check if page is disallowed to us
if (preg_match("/^$rule/", $parsed['path']))
return false;
}
EDIT:
This is the whole function:
function robots_allowed($url, $useragent = false) {
// parse url to retrieve host and path
$parsed = parse_url($url);
$agents = array(preg_quote('*'));
if ($useragent)
$agents[] = preg_quote($useragent);
$agents = implode('|', $agents);
// location of robots.txt file
$robotstxt = !empty($parsed['host']) ? #file($parsed['scheme'] . "://" . $parsed['host'] . "/robots.txt") : "";
// if there isn't a robots, then we're allowed in
if (empty($robotstxt))
return true;
$rules = array();
$ruleApplies = false;
foreach ($robotstxt as $line) {
// skip blank lines
if (!$line = trim($line))
continue;
// following rules only apply if User-agent matches $useragent or '*'
if (preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
$ruleApplies = preg_match("/($agents)/i", $match[1]);
}
if ($ruleApplies && preg_match('/^\s*Disallow:(.*)/i', $line, $regs)) {
// an empty rule implies full access - no further tests required
if (!$regs[1])
return true;
// add rules that apply to array for testing
$rules[] = preg_quote(trim($regs[1]), '/');
}
}
foreach ($rules as $rule) {
// check if page is disallowed to us
if (preg_match("/^$rule/", $parsed['path']))
return false;
}
// page is not disallowed
return true;
}
The URL comes from user input.
Try everything at once, avoid the array.
/(?:\/?dir\/)?\/?tes(?:(?:t(?:1)?)?(?:\.html|(?:\/page\.html)?))/
https://regex101.com/r/VxL30W/1
(?: /?dir / )?
/?tes
(?:
(?:
t
(?: 1 )?
)?
(?:
\.html
|
(?: /page \. html )?
)
)
I've found a solution to match /test or /test/hello or /test/ but not to match /testosterone or /hellotest:
(?:\/test$|\/test\/)
With PHP-Variables:
if (preg_match("/(?:" . $rule . "$|" . $rule . "\/)/", $parsed['path']))
Based on the funktion above.
https://regex101.com/r/DFVR5T/3
Can I use (?:\/ ...) or is that wrong?
EDIT: Please feel free to add additional validations that would be useful for others, using this simple directive.
--
I'm trying to create an Angular Directive that limits the characters input into a text box. I've been successful with a couple common use cases (alphbetical, alphanumeric and numeric) but using popular methods for validating email addresses, dates and currency I can't get the directive to work since I need it negate the regex. At least that's what I think it needs to do.
Any assistance for currency (optional thousand separator and cents), date (mm/dd/yyyy) and email is greatly appreciated. I'm not strong with regular expressions at all.
Here's what I have currently:
http://jsfiddle.net/corydorning/bs05ys69/
HTML
<div ng-app="example">
<h1>Validate Directive</h1>
<p>The Validate directive allow us to restrict the characters an input can accept.</p>
<h3><code>alphabetical</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphabetical (A-Z, a-z) characters only.</p>
<label><input type="text" validate="alphabetical" ng-model="validate.alphabetical"/></label>
<h3><code>alphanumeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphanumeric (A-Z, a-z, 0-9) characters only.</p>
<label><input type="text" validate="alphanumeric" ng-model="validate.alphanumeric" /></label>
<h3><code>currency</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to US currency characters with comma for thousand separator (optional) and cents (optional).</p>
<label><input type="text" validate="currency.us" ng-model="validate.currency" /></label>
<h3><code>date</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to the mm/dd/yyyy date format only.</p>
<label><input type="text" validate="date" ng-model="validate.date" /></label>
<h3><code>email</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to email format only.</p>
<label><input type="text" validate="email" ng-model="validate.email" /></label>
<h3><code>numeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to numeric (0-9) characters only.</p>
<label><input type="text" validate="numeric" ng-model="validate.numeric" /></label>
JavaScript
angular.module('example', [])
.directive('validate', function () {
var validations = {
// works
alphabetical: /[^a-zA-Z]*$/,
// works
alphanumeric: /[^a-zA-Z0-9]*$/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/354044/what-is-the-best-u-s-currency-regex
currency: /^[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?$/,
// doesn't work - need to negate?
// taken from here: http://stackoverflow.com/questions/15196451/regular-expression-to-validate-datetime-format-mm-dd-yyyy
date: /(?:0[1-9]|1[0-2])\/(?:0[1-9]|[12][0-9]|3[01])\/(?:19|20)[0-9]{2}/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
email: /^([\w-]+(?:\.[\w-]+)*)#((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i,
// works
numeric: /[^0-9]*$/
};
return {
require: 'ngModel',
scope: {
validate: '#'
},
link: function (scope, element, attrs, modelCtrl) {
var pattern = validations[scope.validate] || scope.validate
;
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue.replace(pattern, '')
;
if (transformedInput != inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
}
};
});
I am pretty sure, there is better way, probably regex is also not best tool for that, but here is mine proposition.
This way you can only restrict which characters are allowed for input and to force user to use proper format, but you will need to also validate final input after user will finish typing, but this is another story.
The alphabetic, numeric and alphanumeric are quite simple, for input and validating input, as it is clear what you can type, and what is a proper final input. But with dates, mails, currency, you cannot validate input with regex for full valid input, as user need to type it in first, and in a meanwhile the input need to by invalid in terms of final valid input. So, this is one thing to for example restrict user to type just digits and / for a date format, like: 12/12/1988, but in the end you need to check if he typed proper date or just 12/12/126 for example. This need to be checked when answer is submited by user, or when text field lost focus, etc.
To just validate typed character, you can try with this:
JSFiddle DEMO
First change:
var transformedInput = inputValue.replace(pattern, '')
to
var transformedInput = inputValue.replace(pattern, '$1')
then use regular expressions:
/^([a-zA-Z]*(?=[^a-zA-Z]))./ - alphabetic
/^([a-zA-Z0-9]*(?=[^a-zA-Z0-9]))./ - alphanumeric
/(\.((?=[^\d])|\d{2}(?![^,\d.]))|,((?=[^\d])|\d{3}(?=[^,.$])|(?=\d{1,2}[^\d]))|\$(?=.)|\d{4,}(?=,)).|[^\d,.$]|^\$/- currency (allow string like: 343243.34, 1,123,345.34, .05 with or without $)
^(((0[1-9]|1[012])|(\d{2}\/\d{2}))(?=[^\/])|((\d)|(\d{2}\/\d{2}\/\d{1,3})|(.+\/))(?=[^\d])|\d{2}\/\d{2}\/\d{4}(?=.)).|^(1[3-9]|[2-9]\d)|((?!^)(3[2-9]|[4-9]\d)\/)|[3-9]\d{3}|2[1-9]\d{2}|(?!^)\/\d\/|^\/|[^\d/] - date (00-12/00-31/0000-2099)
/^(\d*(?=[^\d]))./ - numeric
/^([\w.$-]+\#[\w.]+(?=[^\w.])|[\w.$-]+\#(?=[^\w.-])|[\w.#-]+(?=[^\w.$#-])).$|\.(?=[^\w-#]).|[^\w.$#-]|^[^\w]|\.(?=#).|#(?=\.)./i - email
Generally, it use this pattern:
([valid characters or structure] captured in group $1)(?= positive lookahead for not allowed characters) any character
in effect it will capture all valid character in group $1, and if user type in an invalid character, whole string is replaced with already captured valid characters from group $1. It is complemented by part which shall exclude some obvious invalid character(s), like ## in a mail, or 34...2 in currency.
With understanding how these regular expression works, despite that it looks quite complex, I think it easy to extend it, by adding additional allowed/not allowed characters.
Regular expression for validating currency, dates and mails are easy to find, so I find it redundant to post them here.
OffTopic. Whats more the currency part in your demo is not working, it is bacause of: validate="currency.us" instead of validate="currency", or at least it works after this modification.
In my opinion it is impossible to create regular expressions that will work for matching things like dates or emails with the
parser you use. This is mainly because you would need non-capturing groups in your
regular expressions (which is possible), which are not replaced by the
inputValue.replace(pattern, '') call you have in your parser function. And this is the
part that is not possible in JavaScript. JavaScript replaces what you put in non-capturing
groups as well.
So... you'll need to go for a different approach. I would suggest to go for positive
regular expressions, which will yield a match when the input is valid.
Then you need of course to change the code of your parser. You could for instance
decide to chop off characters from the end of the input text until what remains passes
the regular expression test. This you could code as follows:
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue;
while (transformedInput && !pattern.exec(transformedInput)) {
// validation fails: chop off last character and try again
transformedInput = transformedInput.slice(0, -1);
}
if (transformedInput !== inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
Now life has become a bit easier. Just pay attention that you make your regular
expressions in such a way that they do not reject partial input. So "01/" should be
considered valid for a date, otherwise the user can never get to type in a date. On
the other hand, as soon as it becomes clear that adding characters will no longer
allow for valid input, the regular expression should reject it. So "101" should be
rejected as a date, as you can never add characters at the end to make it a valid date.
Also, all of these regular expressions should check the whole input, so as a consequence
they need to make use of the ^ and $ symbols.
Here is what the regular expression for a (partial) date could look like:
^([0-9]{0,2}|[0-9]{2}[\/]([0-9]{0,2}|[0-9]{2}[\/][0-9]{0,4}))$
This means: an input of 0 to 2 digits is valid, or exactly 2 digits followed by a slash, followed by either:
0 to 2 digits, or
exactly 2 digits followed by a slash, followed by 0 to 4 digits
Admittedly, not as smart as the one you had found, but that one would need a lot of editing to allow for partially entered dates. It is possible, but
it represents a very long expression with a lot of brackets and |.
Once you have all the regular expressions set up, you could think to further improve
the parser. One idea would be to not let it chop off characters from the end, but to
let it test all strings with one character removed somewhere compared to the original,
and see which one passes the test. If there is no way found to remove one character and have
success, then remove two consecutive characters in any place of the input value,
then three, ... etc, until you find a value that passes the test or arrive at an empty value.
This will work better for cases where the user inserts characters half way their input.
Just an idea...
import { Directive, ElementRef, EventEmitter, HostListener, Input, Output, Renderer2 } from '#angular/core';
import { ControlValueAccessor, NG_VALUE_ACCESSOR } from '#angular/forms';
import { CurrencyPipe, DecimalPipe } from '#angular/common';
import { ValueChangeEvent } from '#goomTool/goom-elements/events/value-change-event.model';
const noOperation = () => {
};
#Directive({
selector: '[formattedNumber]',
providers: [{
provide: NG_VALUE_ACCESSOR,
useExisting: FormattedNumberDirective,
multi: true
}]
})
export class FormattedNumberDirective implements ControlValueAccessor {
#Input() public configuration;
#Output() public valueChange: EventEmitter<ValueChangeEvent> = new EventEmitter();
public locale: string = process.env.LOCALE;
private el: HTMLInputElement;
// Keeps track of the value without formatting
private innerInputValue: any;
private specialKeys: string[] =
['Backspace', 'Tab', 'End', 'Home', 'Enter', 'Shift', 'ArrowRight', 'ArrowLeft', 'Delete'];
private onTouchedCallback: () => void = noOperation;
private onChangeCallback: (a: any) => void = noOperation;
constructor(private elementRef: ElementRef,
private decimalPipe: DecimalPipe,
private currencyPipe: CurrencyPipe,
private renderer: Renderer2) {
this.el = elementRef.nativeElement;
}
public writeValue(value: any) {
if (value !== this.innerInputValue) {
if (!!value) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', this.getFormattedValue(value));
}
this.innerInputValue = value;
}
}
public registerOnChange(fn: any) {
this.onChangeCallback = fn;
}
public registerOnTouched(fn: any) {
this.onTouchedCallback = fn;
}
// On Focus remove all non-digit ,display actual value
#HostListener('focus', ['$event.target.value'])
public onfocus(value) {
if (!!this.innerInputValue) {
this.el.value = this.innerInputValue;
}
}
// On Blur set values to pipe format
#HostListener('blur', ['$event.target.value'])
public onBlur(value) {
this.innerInputValue = value;
if (!!value) {
this.el.value = this.getFormattedValue(value);
}
}
/**
* Allows special key, Unit Interval, value based on regular expression
*
* #param event
*/
#HostListener('keydown', ['$event'])
public onKeyDown(event) {
// Allow Backspace, tab, end, and home keys . .
if (this.specialKeys.indexOf(event.key) !== -1) {
if (event.key === 'Backspace') {
this.updateValue(this.getBackSpaceValue(this.el.value, event));
}
if (event.key === 'Delete') {
this.updateValue(this.getDeleteValue(this.el.value, event));
}
return;
}
const next: string = this.concatAtIndex(this.el.value, event);
if (this.configuration.angularPipe && this.configuration.angularPipe.length > 0) {
if (!this.el.value.includes('.')
&& (this.configuration.min == null || this.configuration.min < 1)) {
if (next.startsWith('0') || next.startsWith('0.') || next.startsWith('.')) {
if (next.length > 1) {
this.updateValue(next);
}
return;
}
}
}
/* pass your pattern in component regex e.g.
* regex = new RegExp(RegexPattern.WHOLE_NUMBER_PATTERN)
*/
if (next && !String(next).match(this.configuration.regex)) {
event.preventDefault();
return;
}
if (!!this.configuration.minFractionDigits && !!this.configuration.maxFractionDigits) {
if (!!next.split('\.')[1] && next.split('\.')[1].length > this.configuration.minFractionDigits) {
return this.validateFractionDigits(next, event);
}
}
this.innerInputValue = next;
this.updateValue(next);
}
private updateValue(newValue) {
this.onTouchedCallback();
this.onChangeCallback(newValue);
if (newValue) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', newValue);
}
}
private validateFractionDigits(next, event) {
// create real-time pattern to validate min & max fraction digits
const regex = `^[-]?\\d+([\\.,]\\d{${this.configuration.minFractionDigits},${this.configuration.maxFractionDigits}})?$`;
if (!String(next).match(regex)) {
event.preventDefault();
return;
}
this.updateValue(next);
}
private concatAtIndex(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) + event.key +
current.slice(event.currentTarget.selectionEnd);
}
private getBackSpaceValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart - 1) +
current.slice(event.currentTarget.selectionEnd);
}
private getDeleteValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) +
current.slice(event.currentTarget.selectionEnd + 1);
}
private transformCurrency(value) {
return this.currencyPipe.transform(value, this.configuration.currencyCode, this.configuration.display,
this.configuration.digitsInfo, this.locale);
}
private transformDecimal(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale);
}
private transformPercent(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale) + ' %';
}
private getFormattedValue(value) {
switch (this.configuration.angularPipe) {
case ('decimal'): {
return this.transformDecimal(value);
}
case ('currency'): {
return this.transformCurrency(value);
}
case ('percent'): {
return this.transformPercent(value);
}
default: {
return value;
}
}
}
}
----------------------------------
export const RegexPattern = Object.freeze({
PERCENTAGE_PATTERN: '^([1-9]\\d*(\\.)\\d*|0?(\\.)\\d*[1-9]\\d*|[1-9]\\d*)$', // e.g. '.12% ' or 12%
DECIMAL_PATTERN: '^(([-]+)?([1-9]\\d*(\\.|\\,)\\d*|0?(\\.|\\,)\\d*[1-9]\\d*|[1-9]\\d*))$', // e.g. '123.12'
CURRENCY_PATTERN: '\\$?[-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?$', // e.g. '$123.12'
KEY_PATTERN: '^[a-zA-Z\\-]+-[0-9]+', // e.g. ABC-1234
WHOLE_NUMBER_PATTERN: '^([-]?([1-9][0-9]*)|([0]+)$)$' // e.g 1234
});
I need a regex for email that allows uppercase and lowercase and I want to validate with jquery .match(). Right now I have this working for lowercase only.
^[_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$
I'd like one for UC and LC or a combo. Thanks for your help.
function validateEmail($email) {
var emailRegex = /^[a-zA-Z]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$/;
if (!emailRegex.test( $email )) {
return false;
} else {
return true;
}
}
You can use this like:
if( !validateEmail("LowUp#gmail.com")) { /* Code for an invalid emailaddress */ }
Resource: https://stackoverflow.com/a/9082446/4168014
Ran across this when looking for a regex for uppercase and lowercase letters in an email, so wanted to update this question with an answer that worked for me since the others had some limitations.
^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*(\+[a-zA-Z0-9-]+)?#[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$
This will allow for .'s, numbers, lowercase, and uppercase letters in an email with no limit of characters after the '#' sign.
regex
"^([\\w-]+(?:\\.[\\w-]+)*)#((?:[\\w-]+\\.)*\\w[\\w-]{0,66})\\.([A-Za-z]{2,6}(?:\\.[A-Za-z]{2,6})?)$"
Reg expression also works when do we work with capital letters.
public class EmailValidator {
public static final String EMAIL_REGEX_PATTERN = "^([\\w-]+(?:\\.[\\w-]+)*)#((?:[\\w-]+\\.)*\\w[\\w-]{0,66})\\.([A-Za-z]{2,6}(?:\\.[A-Za-z]{2,6})?)$";
// static String emailId = "GAVIN.HOUSTON#TRANSPORT.NSW.au";
// static String emailId = "GAVIN.HOUSTON#transport.gov.nsw.au";
static String emailId = "ap.invoices#sydney.edu.au";
public static void main(String[] args) {
System.out.println("is valid email " + isValidEmail(emailId));
}
public static boolean isValidEmail(final String email) {
boolean result = false;
if (email != null) {
Pattern pattern = Pattern.compile(EMAIL_REGEX_PATTERN);
Matcher matcher = pattern.matcher(email);
result = matcher.matches();
}
return result;
}
}
I want to extract the http link from inside the anchor tags? The extension that should be extracted should be WMV files only.
Because HTML's syntactic rules are so loose, it's pretty difficult to do with any reliability (unless, say, you know for absolute certain that all your tags will use double quotes around their attribute values). Here's some fairly general regex-based code for the purpose:
function extract_urls($html) {
$html = preg_replace('<!--.*?-->', '', $html);
preg_match_all('/<a\s+[^>]*href="([^"]+)"[^>]*>/is', $html, $matches);
foreach($matches[1] as $url) {
$url = str_replace('&', '&', trim($url));
if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
$urls[] = $url;
}
preg_match_all('/<a\s+[^>]*href=\'([^\']+)\'[^>]*>/is', $html, $matches);
foreach($matches[1] as $url) {
$url = str_replace('&', '&', trim($url));
if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
$urls[] = $url;
}
preg_match_all('/<a\s+[^>]*href=([^"\'][^> ]*)[^>]*>/is', $html, $matches);
foreach($matches[1] as $url) {
$url = str_replace('&', '&', trim($url));
if(preg_match('/\.wmv\b/i', $url) && !in_array($url, $urls))
$urls[] = $url;
}
return $urls;
}
Regex:
<a\\s*href\\s*=\\s*(?:(\"|\')(?<link>[^\"]*.wmv)(\"|\'))\\s*>(?<name>.*)\\s*</a>
[Note: \s* is used in several places to match the extra white space characters that can occur in the html.]
Sample C# code:
/// <summary>
/// Assigns proper values to link and name, if the htmlId matches the pattern
/// Matches only for .wmv files
/// </summary>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetHrefDetailsWMV(string htmlATag, out string wmvLink, out string name)
{
wmvLink = null;
name = null;
string pattern = "<a\\s*href\\s*=\\s*(?:(\"|\')(?<link>[^\"]*.wmv)(\"|\'))\\s*>(?<name>.*)\\s*</a>";
if (Regex.IsMatch(htmlATag, pattern))
{
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
wmvLink = r.Match(htmlATag).Result("${link}");
name = r.Match(htmlATag).Result("${name}");
return true;
}
else
return false;
}
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file'>Name of File</a></td>",
out wmvLink, out name); // No match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv'>Name of File</a></td>",
out wmvLink, out name); // Match
MyRegEx.TryGetHrefDetailsWMV("<td><a href='/path/to/file.wmv' >Name of File</a></td>", out wmvLink, out name); // Match
I wouldn't do this with regex - I would probably use jQuery:
jQuery('a[href$=.wmv]').attr('href')
Compare this to chaos's simplified regex example, which (as stated) doesn't deal with fussy/complex markup, and you'll hopefully understand why a DOM parser is better than a regex for this type of problem.