Beginning regex

I have absolutely no idea about regex at all. I am trying to perform email validation in a PHP signup script (to check if a valid email has been entered). I have got a script from the internet, but I don't know how it works and I am completely unfamiliar with the functions used. I found out that the function they used (eregi_replace()) is deprecated in favor of preg_replace(). Firstly, can someone guide me through the steps of the function below, and secondly, can you explain the preg_replace() function and how it works?
The validation script:
$regex = "([a-z0-9_.-]+)". # name
"#". # at
"([a-z0-9.-]+){2,255}". # domain & possibly subdomains
".". # period
"([a-z]+){2,10}"; # domain extension
$eregi = eregi_replace($regex, '', $email);
$valid_email = empty($eregi) ? true : false;
The script was sourced from here

A regex is not needed here:
filter_var('bob@example.com', FILTER_VALIDATE_EMAIL);
If you want training, I suggest you visit http://www.regular-expressions.info and http://php.net
To quickly test a pattern you can use this online tool: http://regex.larsolavtorvik.com/
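To answer the preg_replace() part: it works much like eregi_replace(), except the pattern must be wrapped in delimiters (e.g. /.../) and case-insensitivity comes from the i modifier rather than the function name. Here is a minimal sketch of the same strip-and-check logic ported to preg_replace(); the anchored, simplified pattern and the sample $email are illustrative assumptions, not the original script:
$email = 'bob@example.com'; // sample input
$regex = '/^[a-z0-9_.-]+@[a-z0-9.-]+\.[a-z]{2,10}$/i';
// strip everything the pattern matches; the address is "valid" if nothing is left
$stripped = preg_replace($regex, '', $email);
$valid_email = ($stripped === '');
// the simpler built-in alternative suggested above:
$valid_email = (filter_var($email, FILTER_VALIDATE_EMAIL) !== false);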

Related

Fail2Ban regex for Drupal log not matching

I am trying to match and ban certain patterns in my drupal logs (drupal 9).
I have taken the base drupal-auth regex, created a new conf, and tried to amend it to my requirements, but I seem to be failing at the first hurdle. The code below will give me anything that has the type 'user', which is filtered by the user\| just before the <HOST> block:
failregex = ^%(__prefix_line)s(https?:\/\/)([\da-z\.-]+)\.([a-z\.]{2,6})(\/[\w\.-]+)*\|\d{10}\|user\|<HOST>\|.+\|.+\|\d\|.*\|.+\.$
If I want to search exactly the same pattern, but with say 'page not found' or 'access denied' instead of 'user' what do I need? I cannot seem to get it to match the moment the type has a space in it. It seems such a simple thing to do!
I am using fail2ban-regex --print-all-matched to test.
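Spaces are literal characters in a regex, so the type can be swapped in directly, or several types can be matched at once with a non-capturing alternation. A hedged sketch, assuming the rest of the log line format stays exactly the same:
failregex = ^%(__prefix_line)s(https?:\/\/)([\da-z\.-]+)\.([a-z\.]{2,6})(\/[\w\.-]+)*\|\d{10}\|(?:page not found|access denied)\|<HOST>\|.+\|.+\|\d\|.*\|.+\.$
If that still does not match, fail2ban-regex --print-all-matched should reveal whether the space appears in the log verbatim or in some encoded form (e.g. %20).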

Regex capture group in Varnish VCL

I have a URL in the form of:
http://some-site.com/api/v2/portal-name/some/webservice/call
The data I want to fetch needs to come from:
http://portal-name.com/webservices/v2/some/webservice/call
(Yes, I could rewrite the application so it uses other URLs, but we are testing Varnish at the moment, so for now it cannot be intrusive.)
But I'm having trouble getting at the right parts of the URL in Varnish VCL. Replacing the api part with an empty string was no problem, but now I need the portal-name.
Things I've tried:
if (req.url ~ ".*/(.*)/") {
    set req.http.portalhostname = re.group.0;
    set req.http.portalhostname = $1;
}
From https://docs.fastly.com/guides/vcl/vcl-regular-expression-cheat-sheet and Extracting capturing group contents in Varnish regex
And yes, std is imported.
But this gives me either a
Syntax error at
('/etc/varnish/default.vcl' Line 36 Pos 35)
set req.http.portalhostname = $1;
or a
Symbol not found: 're.group.0' (expected type STRING_LIST):
So: how can I do this? When I have extracted the portalhostname I should be able to simply do a regsub to replace that value with an empty string and then prepend "webservices" and my URL is complete.
The Varnish version I'm using: varnish-4.1.8 revision d266ac5c6
Sadly, re.group seems to have been removed at some point. Similar functionality appears to be accessible via one of several vmods; see https://varnish-cache.org/vmods/
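Plain VCL can also do it with regsub(), whose replacement string supports backreferences. A minimal sketch, assuming the URL always has the /api/v2/<portal-name>/ shape (the header name is the one the question already uses):
sub vcl_recv {
    if (req.url ~ "^/api/v2/[^/]+/") {
        # \1 refers to the first capture group of the pattern
        set req.http.portalhostname = regsub(req.url, "^/api/v2/([^/]+)/.*$", "\1");
        # swap the /api/v2/<portal-name> prefix for /webservices/v2
        set req.url = regsub(req.url, "^/api/v2/[^/]+", "/webservices/v2");
    }
}
(Routing the request to the portal-name.com backend would still have to use the saved header.)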

How to obtain a regular expression for validating an email address for one domain ONLY?

I am struggling with writing a regex for validating email address for only one domain.
I have this expression
[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}
But the issue is that, for example, hello@gmail.com.net is valid, and I want it to be valid for only one domain. Hence I do not want hello@gmail.com.net to be valid.
Help is needed. Thank you!
Try this: [A-Z0-9a-z._%+-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,64}
In your regex, a dot is among the allowed characters after the @.
You can use something like:
\b[A-Z0-9a-z._%+-]+@gmail\.com\.net\b
I found this regex for Swift:
[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}
It has an extra backslash.
I found it here: http://emailregex.com/
Regards,
Melle
I know you already accepted an answer, but this idea just crossed my mind. You can use URLComponents to split the email address into user and host and validate each component separately:
func validate(emailAddress: String) -> Bool {
    guard let components = URLComponents(string: "mailto://" + emailAddress),
          let host = components.host else {
        return false
    }
    return host.components(separatedBy: ".").count == 2
}
print(validate(emailAddress: "hello@gmail.com")) // true
print(validate(emailAddress: "hello@gmail.com.net")) // false
print(validate(emailAddress: "hello")) // false
Your requirement has a big flaw in it though: valid domains can have two dots, like someone@bbc.co.uk. Getting a regex pattern to validate an email is hard. Gmail, for example, will direct all emails sent to jsmith+abc@gmail.com to the same inbox as jsmith@gmail.com. The best way is to perform some rudimentary check on the email address, then email the user and ask them to click a link to confirm the email.
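For instance, the validate function above rejects that perfectly valid address (an illustrative check reusing the function from this answer):
print(validate(emailAddress: "someone@bbc.co.uk")) // false, even though the address is valid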
You can try the pattern below.
/^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@((REPLACE_THIS_WITH_EMAIL_DOMAIN+\.)+[^<>()[\]\.,;:\s@\"]{2,})$/i;
For example:
/^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@((gmail+\.)+[^<>()[\]\.,;:\s@\"]{2,})$/i;

Using WWW::Mechanize to scrape multiple pages under a directory - Perl

I'm working on a project to site scrape every interview found here into an HTML ready document to be later dumped into a DB which will automatically update our website with the latest content. You can see an example of my current site scraping script which I asked a question about the other day: WWW::Mechanize Extraction Help - PERL
The problem I can't seem to wrap my head around is knowing if what I'm trying to accomplish now is even possible. Because I don't want to have to guess when a new interview is published, my hope is to be able to scrape the website which has a directory listing of all of the interviews and automatically have my program fetch the content on the new URL (new interview).
Again, the site in question is here (scroll down to see the listing of interviews): http://millercenter.org/president/clinton/oralhistory
My initial thought was to have a regex of .\ at the end of the link above, in hopes that it would automatically search any links found under that page. I can't seem to get this to work using WWW::Mechanize, however. I will post what I have below, and if anyone has any guidance or experience with this, your feedback would be greatly appreciated. I'll also summarize my tasks below the code so that you have a concise understanding of what we hope to accomplish.
Thanks to any and all that can help!
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use WWW::Mechanize::TreeBuilder;
my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/\.");
# find all <dl> tags
my @list = $mech->find('dl');
foreach ( @list ) {
    print $_->as_HTML();
}
# # find all links
# my @links = $mech->links();
# foreach my $link (@links) {
#     print $link->url, "\n";
# }
To summarize what I'm hoping is possible:
Extract the content of every interview found here in an HTML ready document like I did here: WWW::Mechanize Extraction Help - PERL. This would require the 'get' action to be able to traverse the pages listed under the /oralhistory/ directory, which can perhaps be solved using a regex?
Possibly extract the respondent name and position fields on the directory page to be populated in a title field (this isn't that big of a deal if it can't be done)
No, you can't use wildcards in URLs... :-(
You'll have to parse the listing page yourself, and then get and process the pages in a loop.
Extracting specific fields from a page's contents will be a straightforward task with WWW::Mechanize...
UPDATE: answering OP comment:
Try this logic:
use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple;
use File::Basename;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get("http://millercenter.org/president/clinton/oralhistoryml");
# find all <dl> tags
my #list = $mech->find('dl');
foreach my $link (#list) {
my $url = $link->url();
my $localfile = basename($url);
my $localpath = "./$localfile";
print "$localfile\n";
getstore($url, $localpath);
}
My answer is focused on the approach of how to do this. I'm not providing code.
There are no IDs in the links, but the names of the interview pages seem to be fine to use. You need to parse them out and build a lookup table.
Basically you start by building a parser that fetches all the links that look like an interview. That is fairly simple with WWW::Mechanize. The page URL is:
http://millercenter.org/president/clinton/oralhistory
All the interviews follow this schema:
http://millercenter.org/president/clinton/oralhistory/george-mitchell
So you can find all links in that page that start with http://millercenter.org/president/clinton/oralhistory/. Then you make them unique, because there is a teaser box slider that showcases some of them and has a 'read more' link to the page. Use a hash to deduplicate, like this:
my %seen;
foreach my $url (@urls) {
    $mech->get($url) unless $seen{$url};
    $seen{$url}++;
}
Then you fetch the page, do your stuff, and write it to your database. Use the URL or the interview name part of the URL (e.g. george-mitchell) as the primary key. If there are other presidents and you want those as well, adapt in case the same name shows up for several presidents.
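A hedged one-liner for pulling that name out of the URL, with the pattern assumed from the schema above:
my ($slug) = $url =~ m{/oralhistory/([^/]+)$};  # e.g. "george-mitchell"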
Then you go back and add a cache lookup into your code. You grab all the IDs from the DB before you start fetching the page, and put those in a hash.
# prepare query and stuff...
my %cache;
while (my $res = $sth->fetchrow_hashref) {
    $cache{$res->{id}}++;
}
# later...
foreach my $url (@urls) {
    next if $cache{$url}; # or grab the ID out of the url
    next if $seen{$url};
    $mech->get($url);
    $seen{$url}++;
}
You also need to filter out the links that are not interviews. One of those would be http://millercenter.org/president/clinton/oralhistory/clinton-description, which is the 'read more' target of the first paragraph on the page.
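A hedged sketch of that filter; the excluded slug is the one example named above, and the list would grow as more non-interview links turn up:
@urls = grep { $_ !~ m{/clinton-description$} } @urls;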

Regular expression for validating url with parameters

I have been searching high and low for a solution to this, but to no avail. I am trying to prevent users from entering poorly formed URLs. Currently I have this regular expression in place:
^(http|https)\://.*$
This does a check to make sure the user is using http or https in the URL. However I need to go a step further and validate the structure of the URL.
For example this URL: http://mytest.com/?=test is clearly invalid as the parameter is not specified. All of the regular expressions that I've found on the web return valid when I use this URL.
I've been using this site to test the expressions that I've been finding.
Look, I think the best solution is to test the URL like this:
var url = "http://mytest.com/?=test";
Take 2 steps:
1- Test only the base URL, such as:
http://mytest.com/
using this pattern (a RegExp literal, so it can also be used for splitting):
var pattern1 = /^(http:\/\/www\.|https:\/\/www\.|ftp:\/\/www\.|www\.)([0-9A-Za-z]+\.)([A-Za-z]){2,3}(\/)?/;
2- Split the URL string with pattern1 to get the query string, and IF the URL has a query string, test it with the following pattern:
var query = url.split(pattern1);
var q_str = query[query.length - 1];
var pattern2 = /^(\?)?([0-9A-Za-z]+=[0-9A-Za-z]+(\&)?)+$/;
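Putting both steps together as a runnable sketch (isValidUrl is a name introduced here for illustration; note that pattern1 requires a www. prefix, which is a limitation of the pattern above):
function isValidUrl(url) {
    var pattern1 = /^(http:\/\/www\.|https:\/\/www\.|ftp:\/\/www\.|www\.)([0-9A-Za-z]+\.)([A-Za-z]){2,3}(\/)?/;
    var pattern2 = /^(\?)?([0-9A-Za-z]+=[0-9A-Za-z]+(\&)?)+$/;
    if (!pattern1.test(url)) return false;     // step 1: the base URL must match
    var rest = url.split(pattern1).pop();      // step 2: whatever follows the base
    return rest === "" || pattern2.test(rest); // an empty query string is fine too
}
console.log(isValidUrl("http://www.mytest.com/?a=test")); // true
console.log(isValidUrl("http://www.mytest.com/?=test"));  // false: parameter name missing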
Good Luck,
I believe the problem you are having comes from the fact that what is or is not a valid parameter in a query string is not universally defined. And specifically for your problem, the criteria for a valid query are still not well defined from your single example of what should fail.
To be precise, check out RFC3986#3.4
Maybe you can make up criteria for what should be an "acceptable" query string, and from that you can get an answer. ;)