preg_match to find specific string followed by arbitrary numbers - regex

I have the HTML markup of a web page as $subject. I'm using preg_match to search for a particular string artist_x? and trying to return just the x portion. x can be any number ranging from 1 to 12 digits. So, the pattern would have to match any/all of the following:
artist_1?
artist_12345?
artist_123456789012?
...and bring back:
1
12345
123456789012
...respectively.
This is the closest I can come up with:
$pattern = '|[artist_][0-9]{1,12}|';
preg_match($pattern, $subject, $matches);
But it's not working right... It's finding the string I'm looking for and even returning the expected number, but the returned number still has the underscore prepended. I've tried quite a few solutions to get this far, but am out of ideas.
Thanks!

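I don't really understand what your pattern is supposed to do. [artist_] is a character class: it matches a single character out of a, r, t, i, s or _, not the literal string artist_. That is why your match comes back as the underscore plus the digits.
Anyway, this should work: you just capture the number that interests you with a group. $matches will be an array containing the whole match first and then each captured group, so $matches[1] holds the number you want.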
$subject = "artist_3333";
$pattern = '/artist_(\d{1,12})/';
preg_match($pattern, $subject, $matches);
echo($matches[1]); // 3333
\d here means the same thing as [0-9]: it matches a digit.
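For comparison, here is the same capture-group idea as a minimal Python sketch. The function name find_artist_id and the (?![0-9]) lookahead are my own additions, not part of the answer above. The lookahead is there on the assumption that a run of 13 or more digits should be rejected rather than cut down to its first 12.

```python
import re

# Literal "artist_", then capture 1 to 12 digits.
# Assumption: (?![0-9]) rejects longer digit runs instead of truncating them.
ARTIST_ID = re.compile(r"artist_([0-9]{1,12})(?![0-9])")

def find_artist_id(subject):
    """Return the digits after 'artist_' as a string, or None if not found."""
    match = ARTIST_ID.search(subject)
    return match.group(1) if match else None

print(find_artist_id('<a href="/artist_12345?ref=1">'))
```

As in the PHP version, group 0 would be the whole match (artist_12345) and group 1 is just the number.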

Related

regex, period allowed, not comma

Hi, I'm looking for a regex for
Valid:
20000
20.000
If a comma is used, it should not match the comma or what comes after it.
Not valid
20.000,12
Right now I'm using:
([0-9]+([.][0-9]+)*)+?
But this one also takes the last 2 digits after comma.
You could use
^\d+(?:\.\d+)*
# start of line, 1+ digits, then optionally repeated groups like .1234
See a demo on regex101.com.
If you add a ^ to the beginning of the regex, only the part from the start of the string will match
^([0-9]+([.][0-9]+)*)+?
But I think
^\d+(\.\d+)*
is the better solution to match numbers
If you want to match floating point numbers, use ^\d*\.?\d+ (the period must be escaped, otherwise it matches any character).
There is nothing in the question that suggests that the outcome needs to be a valid number. It therefore looks like the expression only needs to accept digits and periods, in which case the regex is quite straightforward and the following should work:
^([0-9\.]+)$
Here are my tests to demonstrate the outcome
20000 - pass
20.000 - pass
20.000,12 - fail
1.000.000 - pass
23000,000 - fail
For info, I used the php code below for my test:
$testdata = array('20000', '20.000', '20.000,12', '1.000.000', '23000,000');
$pattern = "/^([0-9\.]+)$/";
foreach ($testdata as $k => $v) {
$result = preg_match($pattern, $v)? 'pass': 'fail';
echo "".$v." - ".$result."<br />";
}
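The answers in this thread do two different things: one matches only the leading part of the string, the other accepts or rejects the whole string. A rough Python sketch of the difference follows; the names leading_number and is_valid are made up for illustration.

```python
import re

# Matches from the start and simply stops before anything else (e.g. ",12").
PREFIX = re.compile(r"[0-9]+(?:\.[0-9]+)*")
# The whole string must consist of digits and periods only.
WHOLE = re.compile(r"[0-9.]+")

def leading_number(text):
    """Return the leading digits-and-periods part, or None."""
    m = PREFIX.match(text)  # match() anchors at the start of the string
    return m.group(0) if m else None

def is_valid(text):
    """True if the entire string is made of digits and periods."""
    return WHOLE.fullmatch(text) is not None

for value in ["20000", "20.000", "20.000,12", "1.000.000", "23000,000"]:
    print(value, leading_number(value), "pass" if is_valid(value) else "fail")
```

Note that the whole-string character class is loose: it would also accept a string made only of periods, so which variant fits depends on whether you want to extract or to validate.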

Regex against an odd value

I am parsing through a csv to export into two categories: Matches and NonMatches via powershell.
The data in the csv is formatted as follows:
SamAccountName
xx9999xx
aa0000aa
ab0909xc
etc
I need a regex that will allow me to filter out all the names that follow the aa0000bb naming convention and export them into a csv (and the same for the ones that don't match the aa0000bb convention).
Convention (because it's odd): the first 2 are letters ranging from a-z,then 4 numbers 0-9, then 2 letters a-z. Meaning you can have aa0000aa and by7690zi
Right now I have:
Import-Csv "C:\Path\to\csv" | ForEach {
$Name =$_.SamAccountName;
$Name -Match '[a-z]+[a-z]+[0-9]+[0-9]+[0-9]+[0-9]+[a-z]+[a-z]'} | Out-Null; $Matches
or
ForEach ($Name in $CSV)
{
If ($Name -Match '[a-z]+[a-z]+[0-9]+[0-9]+[0-9]+[0-9]+[a-z]+[a-z]')
{
Write-Host "$Matches"
}else{
Write-Host "No Match"
}
}
Both seem to output random things that are not correct.
I'm thinking (hoping really) that there is a way to match by the following:
a-z #For the first set of letters
0-9 #For the 4 digits
a-z #For the second set of letters
ie [a-z{2}]+[0-9{4}]+[a-z{2}]
I can run a normal 'xx9999xx' -Match '[a-z]+[a-z]+[0-9]+[0-9]+[0-9]+[0-9]+[a-z]+[a-z]' and it returns true and then
PS C:\Windows> $Matches
Name Value
---- -----
0 xx9999xx
But I still don't understand how to do that on a mass scale.
The correct regex to match your convention should be as follows.
Regex: [a-z]{2}[0-9]{4}[a-z]{2}
noob has a correct regex for your problem. Yours was matching more than you expected because the plus sign is a quantifier metacharacter in regex. It means
Between one and unlimited times, as many times as possible, giving back as needed
So if a name has more letters or digits than the convention allows, your pattern will happily match those too. This would therefore be a valid match:
aaaa00000000000aaaa
Using a fixed quantifier {2}, {4} etc. will match exactly what you want.
Regex101.com is a great resource for quick testing. It also gives a very good explanation of your regex.
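To do this "on a mass scale", test each name against the fixed-quantifier regex and sort it into one of two lists. Here is a minimal Python sketch of that idea, not PowerShell, with partition_names as a made-up helper name. It assumes the whole name must fit the convention, so it uses a full-string match. It is also case-sensitive, whereas PowerShell's -match ignores case by default.

```python
import re

# Two letters, four digits, two letters, and nothing else.
CONVENTION = re.compile(r"[a-z]{2}[0-9]{4}[a-z]{2}")

def partition_names(names):
    """Split names into (matches, non_matches) by the naming convention."""
    matches, non_matches = [], []
    for name in names:
        # fullmatch() rejects names with extra characters before or after.
        if CONVENTION.fullmatch(name):
            matches.append(name)
        else:
            non_matches.append(name)
    return matches, non_matches

good, bad = partition_names(["xx9999xx", "aa0000aa", "ab0909xc", "aaaa00000000000aaaa", "etc"])
print("Matches:", good)
print("NonMatches:", bad)
```

Without the full-string requirement, a longer name that merely contains a valid-looking stretch would still count as a match.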

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which will take the text before the : But I want the part after the email, up to the :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above and get the database ID (in ex: 1234, 4567 are 2 separate IDs) and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the stuff you don't care about to make the expression easier, say just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups: the first takes the leading ID (2222 or 3333) and the second takes the number list (1224 or 1234, 4657). From your pattern I can only assume the second means: "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*). Also, it is not a good idea to rely too heavily on the * quantifier, because it also matches the empty string (it matches zero or more times) and should only be used for things which are truly optional. Otherwise use +, which matches one or more times.
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;   # return the extracted digits in a list
        # return $1;           # just return the string as-is
    } else {
        return;                # empty list (undef in scalar context)
    }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
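The same loose approach can be sketched in Python, purely as an illustration of the idea and not a replacement for the Perl sub. It grabs whatever digits, commas and spaces sit between the closing > and the colon, then pulls the individual numbers out of that. Returning an empty list when nothing matches is my own choice here.

```python
import re

# Anything in <...>, optional spaces, then digits/commas/spaces up to a colon.
AFTER_EMAIL = re.compile(r"<[^>]*> *([0-9, ]+):")

def extract_nums(line):
    """Return the IDs found between the <...> part and the colon, as strings."""
    m = AFTER_EMAIL.search(line)
    if not m:
        return []
    return re.findall(r"[0-9]+", m.group(1))

sample = "(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf"
print(extract_nums(sample))
```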
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.

2-step regular expression matching with a variable in Perl

I am looking to do a 2-step regular expression look-up in Perl, I have text that looks like this:
here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words
I also have a hash with the following values:
%my_hash = ( 9337 => 'AA', 2214 => 'BB', 8790 => 'CC' );
Here's what I need to do:
Find a number
Look up the text code for the number using my_hash
Check if the text code appears within 50 characters of the identified number, and if true print the result
So the output I'm looking for is:
Found 9337, matches 'AA'
Found 2214, matches 'BB'
Found 1190, no matches
Found 8790, no matches
Here's what I have so far:
while ( $text =~ /(\d+)(.{1,50})/g ) {
    $num = $1;
    $text_after_num = $2;
    $search_for = $my_hash{$num};
    if ( $text_after_num =~ /($search_for)/ ) {
        print "Found $num, matches $search_for\n";
    }
    else {
        print "Found $num, no matches\n";
    }
}
This sort of works, except that the only correct match is 9337; the code doesn't match 2214. I think the reason is that the regular expression match on 9337 is including 50 characters after the number for the second-step match, and then when the regex engine starts again it is starting from a point after the 2214. Is there an easy way to fix this? I think the \G modifier can help me here, but I don't quite see how.
Any suggestions or help would be great.
You have a problem with greediness. The {1,50} quantifier will consume as much as it can. Your regex should be /(\d+)(.+?)(?=($|\d))/
To explain: the question mark makes the quantifier non-greedy (it stops as soon as the next part of the pattern can match). The (?=...) is a lookahead that says "check whether the next element is a digit; if so, match but do not consume it." This leaves the next number unconsumed, so the beginning of the regex can pick it up on the next iteration.
[EDIT]
I added an optional end value to the lookahead so that it wouldn't die on the last match.
Just use:
/\b\d+\b/g
Why match everything if you don't need to? You should use other functions to determine where the number is:
/(?=9337.{1,50}AA)/
This will fail if AA is further than 50 chars away from the end of 9337. Of course you will have to interpolate your variables to match your hash's keys and values. This was just an example for your first key/value pair.
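Putting the "match only the number, then look around it" suggestion together, here is a rough Python sketch. The function name lookup_codes and the window parameter are made up for illustration. Because each match consumes only the number itself, the text after it stays available for the next iteration. The 50-character check is done by slicing, on the assumption that "within 50 characters" means the 50 characters following the number.

```python
import re

def lookup_codes(text, codes, window=50):
    """For each number in text, report its code if the code appears nearby."""
    results = []
    for m in re.finditer(r"\b[0-9]+\b", text):
        number = m.group(0)
        code = codes.get(number)  # None when the number is not a known key
        nearby = text[m.end():m.end() + window]
        found = code is not None and code in nearby
        results.append((number, code if found else None))
    return results

TEXT = "here is some text 9337 more text AA 2214 and some 1190 more BB stuff 8790 words"
MY_HASH = {"9337": "AA", "2214": "BB", "8790": "CC"}

for number, code in lookup_codes(TEXT, MY_HASH):
    if code:
        print(f"Found {number}, matches '{code}'")
    else:
        print(f"Found {number}, no matches")
```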

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with fewer than 5 matching groups (e.g. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with 5 or more words will print.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(/\s+/, $myalbum);   # \s+ so runs of spaces don't produce empty fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
The + character means at least one match, which means your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all those fields to be there for it to be considered a match. You are not capturing anything because the string is not actually matching.
You are probably looking for the * character which means zero or more not one or more like +.
I suppose your capturing groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is with strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even as a human you wouldn't know where the album ends and the track starts.
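The substitute-then-split suggestion avoids the fixed number of groups altogether. Here is a minimal Python sketch of it, with album_words as a made-up helper name. It applies the same two substitutions as the question and then splits on runs of whitespace, so it works for any number of words.

```python
import re

def album_words(name):
    """Normalise separators the way the question does, then split into words."""
    cleaned = re.sub(r"[-_'&’]", " ", name)  # separators become spaces
    cleaned = re.sub(r"[,’.]", "", cleaned)  # punctuation is dropped
    return cleaned.split()  # split() with no argument collapses repeated spaces

for name in ["311_-_311", "311_-_From_Chaos", "30_Seconds_To_Mars_-_A_Beautiful_Lie"]:
    print(album_words(name))
```

Since the result is a list, its length can be checked before indexing instead of relying on $1 to $5 being filled.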