How can I use regex to validate iSCSI target names? - regex

I am trying to craft a regexp to validate iSCSI qualified names. An example of a qualified name is iqn.2011-08.com.example:storage. This example is minimal; I have seen other examples that are more extended.
So far, this is what I have to validate it:
print "Enter a new target name: ";
my $target_name = <STDIN>;
chomp $target_name;
if ($target_name =~ /^iqn\.\d{4}-\d{2}/xmi) {
print GREEN . "Target name is valid!" . RESET . "\n";
} else {
print RED . "Target name is not valid!" . RESET . "\n";
}
How can I extend that to work with the rest, up to the : character? I am not going to parse anything after the : because it is a description tag.
Is there a limit to how big a domain name can be?

According to RFC3720 (and in turn RFC1035):
/
(?(DEFINE)
(?<IQN_PAT>
iqn
\.
[0-9]{4}-[0-9]{2}
\.
(?&REV_SUBDOMAIN_PAT)
(?: : .* )?
)
(?<EUI_PAT>
eui
\.
[0-9A-Fa-f]{16}
)
(?<REV_SUBDOMAIN_PAT>
(?&LABEL_PAT) (?: \. (?&LABEL_PAT) )*
)
(?<LABEL_PAT>
[A-Za-z] (?: [A-Za-z0-9\-]* [A-Za-z0-9] )?
)
)
^ (?: (?&IQN_PAT) | (?&EUI_PAT) ) \z
/sx
It's not clear if the eui names accept lowercase hex digits or not. I figured it was safer to allow them.
If you condense the above, you get /^(?:iqn\.[0-9]{4}-[0-9]{2}(?:\.[A-Za-z](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+(?::.*)?|eui\.[0-9A-Fa-f]{16})\z/s.
(By the way, your use of /m is wrong, your use of /i is wrong, and \d can match far more than the allowed [0-9].)
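For readers who are not using Perl, here is a minimal Python sketch of the same condensed pattern. The function name is_valid_iscsi_name is made up for illustration. It assumes Python's re.fullmatch() is an acceptable stand-in for the ^ ... \z anchors and re.DOTALL for /s.

```python
import re

# Condensed pattern from the answer above. fullmatch() anchors both ends
# of the string, and DOTALL plays the role of Perl's /s modifier.
ISCSI_NAME = re.compile(
    r"iqn\.[0-9]{4}-[0-9]{2}"                         # iqn.YYYY-MM
    r"(?:\.[A-Za-z](?:[A-Za-z0-9\-]*[A-Za-z0-9])?)+"  # reversed domain labels
    r"(?::.*)?"                                       # optional :description
    r"|eui\.[0-9A-Fa-f]{16}",                         # or eui. + 16 hex digits
    re.DOTALL,
)


def is_valid_iscsi_name(name):
    """Return True if name looks like a well-formed iqn. or eui. name."""
    return ISCSI_NAME.fullmatch(name) is not None
```

Because the pattern uses [0-9] rather than \d, a name written with non-ASCII digits (for example full-width digits) is rejected, which is the point made about \d above.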

If you only need part before : then you can use following regexp:
if ($target_name =~ /^iqn\.(\d{4}-\d{2})\.([^:]+):/xmi) {
my ($date, $reversed_domain_name) = ($1, $2);
The regexp [^:]+ matches one or more non-: characters. It will match even if the domain name is not well formed. Further improvements depend on your goal: do you just need to get the individual components of the iSCSI name, or do you need to validate its syntax?
Is there a limit to how big a domain name can be?
From Wikipedia:
The full domain name may not exceed a total length of 253
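A rough Python sketch of how the component extraction and the length limit could be combined. The names split_iqn and MAX_DOMAIN_LENGTH are invented for this example, and it uses [0-9] instead of \d, following the remark in the other answer.

```python
import re

MAX_DOMAIN_LENGTH = 253  # total length limit quoted above

# Same idea as the Perl pattern above: capture the date and everything
# up to (but not including) the first colon.
IQN_PREFIX = re.compile(r"^iqn\.([0-9]{4}-[0-9]{2})\.([^:]+):")


def split_iqn(name):
    """Return (date, reversed_domain), or None if the name is rejected."""
    m = IQN_PREFIX.match(name)
    if m is None:
        return None
    date, reversed_domain = m.groups()
    if len(reversed_domain) > MAX_DOMAIN_LENGTH:
        return None
    return date, reversed_domain
```

Like the Perl version, this does not check that the domain part is well formed; it only splits the name and applies the length limit.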

Related

Regex Like for ORACLE with lookahead and negative lookahead

I am working with a program which uploads email addresses to another program, but it accepts emails in only one format:
I tried to write a regular expression to filter out the email addresses which are not accepted:
^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$
The description says:
username#domain
The at sign ('#') must be present and not first or last character.
The length of the name can have up to and including 64 characters.
The length of the domain can have up to and including 64 characters.
All email addresses are forced to lowercase when the email is sent. Therefore any email addresses requiring uppercase will most likely not be delivered correctly by the ISP as we will have changed it to lowercase.
username
Can contain:
A-Z
a-z
0-9
! # $ % * + - . = ? ~ | ` _ ^
The entire name can be surrounded by double quotes (though this is not supported by many ISPs). In this case, the following additional characters are allowed between the quotes - ( ) { } < > # , ; : [ ] (space)
domain
Can contain:
A-Z
a-z
0-9
Cannot contain 2 or more consecutive periods
Must contain at least 1 period
Domain - Cannot begin or end with a period or dash
Also, the part with [] does not work.
Thanks for your help.
Oracle does not, natively, support non-capturing groups, look-ahead or look-behind in regular expressions.
However, if you have Java enabled in the database then you can compile a Java class:
CREATE AND COMPILE JAVA SOURCE NAMED RegexParser AS
import java.util.regex.Pattern;
public class RegexpMatch {
public static int match(
final String value,
final String regex
){
final Pattern pattern = Pattern.compile(regex);
return pattern.matcher(value).matches() ? 1 : 0;
}
}
/
And create a PL/SQL wrapper function:
CREATE FUNCTION regexp_java_match(value IN VARCHAR2, regex IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'RegexpMatch.match( java.lang.String, java.lang.String ) return int';
/
and then you can use your regular expression (or any other regular expression that Java supports):
SELECT REGEXP_JAVA_MATCH(
'alice#example.com',
'^(?:([A-Za-z0-9!#$%*+-.=?~|`_^]{1,64})|(\"[A-Za-z0-9!#$%*+-.=?~|`_^(){}<>#,;: \[\]]{1,64}\"))\#(?!\.)(?!\-)(?!.*\.$)(?!.*\.\.)([A-Za-z0-9.-]{1,61})\.([a-z]{2,10})$'
) AS match
FROM DUAL
Which outputs:
MATCH
1
db<>fiddle here
Your regular expression can be re-written into a format that Oracle supports as:
(?:) non-capturing groups are not supported and should just be () capturing groups instead.
Look-ahead is not supported but you can rewrite the look-ahead patterns using character lists, so #(?!\.)(?!-)([A-Za-z0-9.-]{1,61}) can be rewritten as #[A-Za-z0-9][A-Za-z0-9.-]{0,60}.
The (?!.*\.$) look-ahead is redundant as the pattern ends with ([a-z]{2,10})$ and can never match a trailing period.
If you want to include ] and - in a character list then ] should be the first character and - the last in the set.
The only thing that cannot be implemented in an Oracle regular expression is simultaneously restricting the length of the post-# segment and ensuring there are no .. double dots; to do that you need to check for one of those two conditions in a second regular expression.
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[]A-Za-z0-9!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'
-- Top-level domain
|| '\.[a-z]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains - exclude .. patterns
|| '([^.]+\.)+[a-z]{2,10}$'
) AS match
FROM DUAL
Or, using POSIX character lists:
SELECT REGEXP_SUBSTR(
REGEXP_SUBSTR(
'alice#example.com',
'^('
-- Unquoted local-part
|| '[[:alnum:]!#$%*+.=?~|`_^-]{1,64}'
-- or
|| '|'
-- Quoted local-part
|| '"[][:alnum:]!#$%*+.=?~|`_^(){}<>#,;: [-]{1,64}"'
|| ')#'
-- Domains
|| '[[:alnum:]]([[:alnum:].-]{0,60})?'
-- Top-level domain
|| '\.[[:lower:]]{2,10}$'
),
-- Local-part
'^([^"]*?|".*?")'
|| '#'
-- Domains
|| '([^.]+\.)+[[:lower:]]{2,10}$'
) AS match
FROM DUAL
Which both output:
MATCH
alice#example.com
db<>fiddle here
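The two-pass idea is not specific to Oracle. As a minimal sketch, here it is in Python, deliberately avoiding look-ahead even though Python supports it. The names SHAPE, NO_DOUBLE_DOT and is_accepted are hypothetical, and the brackets in the quoted character list are escaped rather than reordered, since that is the usual Python convention.

```python
import re

# Pass 1: allowed characters and lengths, written without any look-ahead.
SHAPE = re.compile(
    r'^('
    r'[A-Za-z0-9!#$%*+.=?~|`_^-]{1,64}'                    # unquoted local-part
    r'|"[A-Za-z0-9!#$%*+.=?~|`_^(){}<>#,;: \[\]-]{1,64}"'  # quoted local-part
    r')#'
    r'[A-Za-z0-9]([A-Za-z0-9.-]{0,60})?'                   # domains
    r'\.[a-z]{2,10}$'                                      # top-level domain
)

# Pass 2: every domain label must be non-empty, which excludes "..".
NO_DOUBLE_DOT = re.compile(r'^([^"]*?|".*?")#([^.]+\.)+[a-z]{2,10}$')


def is_accepted(address):
    """Return True only if the address passes both checks."""
    return bool(SHAPE.search(address)) and bool(NO_DOUBLE_DOT.search(address))
```

An address such as alice#exa..mple.com passes the first check but fails the second, which is why both are needed.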

Powershell Regex expression to get part of a string

I would like to take part of a string to use it elsewhere. For example, I have the following strings:
Project XYZ is the project name - 20-12-11
I would like to get the value "XYZ is the project name" from the string. The word "Project" and character "-" before the number will always be there.
I think a lookaround regular expression would work here since "Project" and "-" are always there:
(?<=Project ).+?(?= -)
A lookaround can be useful for cases that deal with getting a sub string.
Explanation:
(?<= = positive lookbehind
Project = starting string (including space)
) = closing positive lookbehind
.+? = matches anything in between
(?= = positive lookahead
- = ending string
) = closing positive lookahead
Example in PowerShell:
Function GetProjectName($InputString) {
$regExResult = $InputString | Select-String -Pattern '(?<=Project ).+?(?= -)'
$regExResult.Matches[0].Value
}
$projectName = GetProjectName -InputString "Project XYZ is the project name - 20-12-11"
Write-Host "Result = '$($projectName)'"
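The same lookaround pattern works unchanged in other regex flavours. A small Python sketch follows; the function name get_project_name simply mirrors the PowerShell one above.

```python
import re


def get_project_name(text):
    """Return the text between 'Project ' and ' -', or None if absent."""
    # (?<=Project ) is a positive lookbehind, (?= -) a positive lookahead;
    # neither is included in the returned match.
    match = re.search(r"(?<=Project ).+?(?= -)", text)
    return match.group(0) if match else None


print(get_project_name("Project XYZ is the project name - 20-12-11"))
# prints: XYZ is the project name
```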
here is yet another regex version. [grin] it may be easier to understand since it uses somewhat basic regex patterns.
what it does ...
defines the input string
defines the prefix to match on
this will keep only what comes after it.
defines the suffix to match on
this part will keep only what is before it.
trigger the replace
the part in the () is what will be placed into the 1st capture group.
show what was kept
the code ...
$InString = 'Project XYZ is the project name - 20-12-11'
# "^" = start of string
$Prefix = '^project '
# ".+" = one or more of any character
# "$" = end of string
$Suffix = ' - .+$'
# "$1" holds the content of the 1st [and only] capture group
$OutString = $InString -replace "$Prefix(.+)$Suffix", '$1'
$OutString
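For comparison, a minimal Python sketch of the same prefix/suffix replace. The function name is invented for this example. PowerShell's -replace is case-insensitive by default, so the sketch assumes re.IGNORECASE is wanted to get the same behaviour with the lowercase prefix.

```python
import re


def strip_prefix_and_suffix(text):
    """Keep only what sits between 'project ' and ' - <anything>'."""
    # \1 refers to the first (and only) capture group.
    return re.sub(r"^project (.+) - .+$", r"\1", text, flags=re.IGNORECASE)
```

If the pattern does not match, re.sub returns the input unchanged, just as -replace does.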
# define the input string
$str = 'Project XYZ is the project name - 20-12-11'
# use regex (-match) including the .*? regex pattern
# this pattern means (.) any char, (*) any number of times, (?) as few as possible (lazy)
# to capture (into brackets) the desired pattern substring
$str -match "(Project.*?is the project name)"
# show result (the first capturing group)
$matches[1]
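The difference between a lazy .*? and a greedy .* only shows up when the delimiter occurs more than once. A short Python sketch, using a made-up input string with two " -" separators:

```python
import re

text = "Project A - B - 20-12-11"  # hypothetical input with two " -" separators

# Lazy: stop at the first " -" that lets the match succeed.
lazy = re.search(r"Project (.*?) -", text).group(1)

# Greedy: take as much as possible, backing off to the last " -".
greedy = re.search(r"Project (.*) -", text).group(1)

print(lazy)    # prints: A
print(greedy)  # prints: A - B
```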

How do I access the captures within a match?

I am trying to parse a csv file, and I am trying to access a named regex in a proto regex in Perl6. It turns out to be Nil. What is the proper way to do it?
grammar rsCSV {
regex TOP { ( \s* <oneCSV> \s* \, \s* )* }
proto regex oneCSV {*}
regex oneCSV:sym<noQuote> { <-[\"]>*? }
regex oneCSV:sym<quoted> { \" .*? \" } # use non-greedy match
}
my $input = prompt("Enter csv line: ");
my $m1 = rsCSV.parse($input);
say "===========================";
say $m1;
say "===========================";
say "1 " ~ $m1<oneCSV><quoted>; # this fails; it is "Nil"
say "2 " ~ $m1[0];
say "3 " ~ $m1[0][2];
Detailed discussion complementing Christoph's answer
I am trying to parse a csv file
Perhaps you are focused on learning Raku parsing and are writing some throwaway code. But if you want industrial strength CSV parsing out of the box, please be aware of the Text::CSV modules[1].
I am trying to access a named regex
If you are learning Raku parsing, please take advantage of the awesome related (free) developer tools[2].
in proto regex in Raku
Your issue is unrelated to it being a proto regex.
Instead the issue is that, while the match object corresponding to your named capture is stored in the overall match object you stored in $m1, it is not stored precisely where you are looking for it.
Where do match objects corresponding to captures appear?
To see what's going on, I'll start by simulating what you were trying to do. I'll use a regex that declares just one capture, a "named" (aka "Associative") capture that matches the string ab.
given 'ab'
{
my $m1 = m/ $<named-capture> = ( ab ) /;
say $m1<named-capture>;
# 「ab」
}
The match object corresponding to the named capture is stored where you'd presumably expect it to appear within $m1, at $m1<named-capture>.
But you were getting Nil with $m1<oneCSV>. What gives?
Why your $m1<oneCSV> did not work
There are two types of capture: named (aka "Associative") and numbered (aka "Positional"). The parens you wrote in your regex that surrounded <oneCSV> introduced a numbered capture:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) ) /; # extra parens added
say $m1[0]<named-capture>;
# 「ab」
}
The parens in / ( ... ) / declare a single top level numbered capture. If it matches, then the corresponding match object is stored in $m1[0]. (If your regex looked like / ... ( ... ) ... ( ... ) ... ( ... ) ... / then another match object corresponding to what matches the second pair of parentheses would be stored in $m1[1], another in $m1[2] for the third, and so on.)
The match result for $<named-capture> = ( ab ) is then stored inside $m1[0]. That's why say $m1[0]<named-capture> works.
So far so good. But this is only half the story...
Why $m1[0]<oneCSV> in your code would not work either
While $m1[0]<named-capture> in the immediately above code is working, you would still not get a match object in $m1[0]<oneCSV> in your original code. This is because you also asked for multiple matches of the zeroth capture because you used a * quantifier:
given 'ab'
{
my $m1 = m/ ( $<named-capture> = ( ab ) )* /; # * is a quantifier
say $m1[0][0]<named-capture>;
# 「ab」
}
Because the * quantifier asks for multiple matches, Raku writes a list of match objects into $m1[0]. (In this case there's only one such match so you end up with a list of length 1, i.e. just $m1[0][0] (and not $m1[0][1], $m1[0][2], etc.).)
Summary
Captures nest;
A capture quantified by either * or + corresponds to two levels of nesting not just one.
In your original code, you'd have to write say $m1[0][0]<oneCSV>; to get to the match object you're looking for.
[1] Install relevant modules and write use Text::CSV; (for a pure Raku implementation) or use Text::CSV:from<Perl5>; (for a Perl plus XS implementation) at the start of your code. (talk slides (click on top word, eg. "csv", to advance through slides), video, Raku module, Perl XS module.)
[2] Install CommaIDE and have fun with its awesome grammar/regex development/debugging/analysis features. Or install the Grammar::Tracer and/or Grammar::Debugger modules and write use Grammar::Tracer; or use Grammar::Debugger; at the start of your code (talk slides, video, modules).
The match for <oneCSV> lives within the scope of the capture group, which you get via $m1[0].
As the group is quantified with *, the results will again be a list, ie you need another indexing operation to get at a match object, eg $m1[0][0] for the first one.
The named capture can then be accessed by name, eg $m1[0][0]<oneCSV>. This will already contain the match result of the appropriate branch of the protoregex.
If you want the whole list of matches instead of a specific one, you can use >> or map, eg $m1[0]>>.<oneCSV>.
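To picture the nesting, here is a plain Python data structure standing in for the match tree. It is only an analogy for the shape described in these answers, not Raku's Match API, and the field values "a" and "b" are made up.

```python
# Hypothetical stand-in for the match tree of a two-field input.
m1 = [                      # numbered captures of the overall match
    [                       # m1[0]: a list, because ( ... )* is quantified
        {"oneCSV": "a"},    # m1[0][0]: first repetition of the group
        {"oneCSV": "b"},    # m1[0][1]: second repetition of the group
    ],
]

# Analogue of $m1[0][0]<oneCSV>
first_field = m1[0][0]["oneCSV"]

# Analogue of $m1[0]>>.<oneCSV>
all_fields = [repetition["oneCSV"] for repetition in m1[0]]
```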

Regex - Extracting a number when preceded OR followed by a currency sign

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
This is my tested solution.
It works for all 4 currencies when sign is placed before or after, with or without a space in between.
It works with a dot or a comma for decimals.
It works without decimal, or with just 1 number after the dot or comma.
It extracts several amounts in the same string, in any mix of the formats described above, as long as there is a space in between.
I think it covers everything, although I am sure it can be simplified.
It was needed for an international order form where clients enter the amounts themselves as well as the description in the same field.
You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group exists
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional
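Conditionals of this form are also available in Python's re module, so the idea can be tried outside PHP. A minimal sketch with the same pattern; the function name find_amounts and the sample sentence in the usage note are invented for illustration.

```python
import re

AMOUNT = re.compile(
    r"(\$ ?)?"                                   # group 1: leading currency sign
    r"[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?"  # the number itself
    r"(?:[pcm]|bn|[mb]illion)?"                  # optional magnitude suffix
    r"(?(1)| ?\$)",                              # no leading sign: require a trailing one
    re.IGNORECASE,
)


def find_amounts(text):
    """Return every amount that has a $ either before or after it."""
    return [m.group(0) for m in AMOUNT.finditer(text)]
```

For example, find_amounts("paid $1,200.50 and 300 $ and 45") returns ['$1,200.50', '300 $']; the bare 45 is skipped because it has no currency sign on either side.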
You should check for optional $ at the end of amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)
Live demo

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My program works just fine; that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPARATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPARATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by white space. If the person's surname is "Bunga Bunga" it won't work.
All of your subpatterns are non-capturing groups, which start with (?:. Non-capturing groups are cheaper for the regex engine because the matched subpattern is not stored, but that also means nothing is available afterwards in $1, $2, and so on.
To capture a pattern you merely need to place parentheses around the part you want to capture. So you could remove the non-capturing marker ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure but, from your code I think you may be attempting to use lookahead assertions as, for example, you test for surnames with spaces, if none then test for surnames with hyphens. This will not start from the same point every time, it will either match the first example or not, then move forward to test the next position with the second surname pattern, whether the regex will then test the second name for the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!/usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
For capturing the whole pattern the first pattern of the third example does not make sense, as this tells the regex to not capture the pattern group while also capturing the pattern group. Where this is useful is in the second pattern which is a fine grained pattern capture, in that the pattern captured is part of a non-capturing group.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe
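Applied to the bibliography lines from the question, capturing could look roughly like the following Python sketch. It is a simplified illustration rather than a port of the full validation regex: ENTRY, SURNAME and bib_prefix are made-up names, the surname and year patterns cover only the common cases, and it assumes the line has already passed validation.

```python
import re

# Author block: everything between the opening tag and " (YYYY)."
ENTRY = re.compile(
    r'<bibliomixed\s+id="[^"]*">'
    r"(?P<authors>.*?)"
    r"\s\((?P<year>[0-9]{4})[a-z]?\)\."
)

# A surname is captured when it is followed by ", " and one or more initials.
SURNAME = re.compile(
    r"([\w\u2019'-]+(?:\s[\w\u2019'-]+)*)"
    r",\s(?:[A-Z]\.[-\s]?)+"
)


def bib_prefix(line):
    """Return the line prefixed with <BIB>surnames, year</BIB>, or None."""
    entry = ENTRY.search(line)
    if entry is None:
        return None
    surnames = SURNAME.findall(entry.group("authors"))
    summary = ", ".join(surnames + [entry.group("year")])
    return "<BIB>" + summary + "</BIB>" + line
```

The only capturing parentheses are the ones around the surname and the named groups for the author block and the year; everything else stays non-capturing, which is the same division of labour as in the third Perl example above.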
One little nitpick:
id=".*?"
may be better as
id="\w*?"
since id names are required to be alphanumeric (plus underscore), IIRC.