Search html file for random string using regex

I am trying to use Perl to search through an html file for a semi-random string, and then store the match in a variable or print it out.
The string is the name of a jpg image and always follows the pattern of 9 digits followed by 6 lower case letters, e.g.
140005917smpxgj.jpg
But it is random every time. I am sure Perl can do this, but I will admit I am getting a bit confused.

Not too complicated. You may want to watch out for varying caps in the extension, e.g. JPG. If that is a concern, you may add (?i) before the extension.
You may also wish to prevent partial names, e.g. discard a match that has more than 9 digits. That is the (?<!\d) part: Make sure no digit characters precede the match.
ETA: Now extracts multiple matches too, thanks to ikegami.
use strict;
use warnings;
use feature 'say';
my @match;
while (<>) {
    for (/(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g) {
        say;
        push @match, $_;
    }
}
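To see what the (?<!\d) and (?i) parts change, here is a small self-contained sketch. The sample HTML and its file names are made up for illustration; only the pattern comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical sample HTML; the file names are invented for this example.
my $html = <<'HTML';
<img src="/img/140005917smpxgj.jpg">
<img src="/img/7140005917smpxgj.jpg">
<img src="/img/777666999abcdef.JPG">
HTML

# Same pattern as above: nine digits not preceded by another digit,
# six lower-case letters, then a case-insensitive "jpg" extension.
# In list context with /g, the match returns every captured name.
my @names = $html =~ /(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g;

print "$_\n" for @names;
```

The second name has ten digits, so the lookbehind rejects it; the third is accepted even though its extension is upper case.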

Try this regex:
/\b\d{9}[a-z]{6}\.jpg/

perldoc perlre
use warnings;
use strict;
while (<DATA>) {
if (/ ( [0-9]{9} [a-z]{6} [.] jpg ) /x) {
print "$1\n";
}
}
__DATA__
foo 140005917smpxgj.jpg bar
sdfads 777666999abcdef.jpg dfgffgh
Prints:
140005917smpxgj.jpg
777666999abcdef.jpg

The solution regex is \d{9}[a-z]{6}\.jpg

Related

Perl - Find same sequence of characters in strings

In fact, I have a text file where sentences are written on each line and I have to find the same sequences of characters for each sentence of each line. For instance, one of the sentences is
no pain no gain
and I want to be able to determine that the sequence of shared characters in this string is ain.
I tried regular expressions (found on stackoverflow, by the way), but they only find runs of the same consecutive character, which is not what I'm looking for. As a beginner in Perl, I don't know how to implement this.
Thank you in advance for your time and attention.
edit: here is what I've tried but not what I want:
#!/usr/bin/perl
use utf8;
open $file, "<:encoding(utf8)", "text.txt";
while($ligne=<$file>)
{
while($ligne =~ /(.)\1+/g)
{
$gram = $1;
print "$ligne\n";
print "$gram\n";
}
}
This is a simple proof of concept that matches the ain of "pain" and then looks for that same text later in the string, where it finds it in "gain". I'm using a named capture group, Match, together with the backreference \k<Match>, which is how the regex matches ain (or no).
#!/usr/bin/perl
use strict;
use warnings;
my $string = "no pain no gain";
if ($string =~ m/(?<Match>[a-zA-Z]{3}).*\k<Match>/g) {
print "Match: $+{Match}\n";
}
Output:
Match: ain
Note that if you change the length specifier to 2, the match becomes "no", rather than "ain".
Implement a more robust regex for whatever your actual needs are, then iterate over every line you have and test for a match.
By the way, regex101.com is an amazing resource for learning and practicing regular expressions. I recommend it 10000%.
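One possible way to avoid hard-coding the length is to try lengths from longest to shortest and stop at the first hit. This is only a sketch: longest_repeat is a hypothetical helper, it considers runs of ASCII letters only (as the pattern above does), and it is not tuned for long lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest run of letters that occurs
# at least twice (non-overlapping) in $text, or nothing if no run of
# at least $min letters repeats.
sub longest_repeat {
    my ($text, $min) = @_;
    for (my $len = length $text; $len >= $min; $len--) {
        # Same idea as above: capture $len letters, then require the
        # identical text again later in the string.
        return $1 if $text =~ /([a-zA-Z]{$len}).*\1/;
    }
    return;
}

print longest_repeat("no pain no gain", 2), "\n";
```

For "no pain no gain" this prints ain, because no run of four or more letters repeats.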

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceded by EUR_AF=
\d+(\.\d+)? - consisting of one or more digits, optionally followed by a decimal point and more digits
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The captured value in $1 will be 0 or 0.39.
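A quick way to check the three cases from the question is to run the pattern over one string of each kind. The helper name eur_af and the sample INFO strings below are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the answers above.
sub eur_af {
    my ($info) = @_;
    # List-context match: the capture lands in $value, or undef on no match.
    my ($value) = $info =~ /(?:^|;)EUR_AF=([^;]*)/;
    return $value;
}

# Made-up INFO strings: value followed by ';', value at the very end,
# and a bare 0 at the start of the string.
for my $info ('AF=0.07;EUR_AF=0.39;ASN_AF=0.05',
              'AF=0.07;EUR_AF=0.39',
              'EUR_AF=0;AF=0.07') {
    print eur_af($info), "\n";
}
```

Because [^;]* stops at the next ; or at the end of the string, no special case is needed for a missing trailing semicolon.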

How to limit match length before a certain character?

I am using the following regular expression to scan input text files for valid emails.
[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
Now I also need to limit the matches to 20 characters before the '@' sign in the email address, but I'm not sure how to do it.
PS. I am using the Perl regular expression library (TPerlRegex) found in Delphi XE2.
Please can you help me?
Since your library is supposed to be Perl-compatible, it should support lookaheads. These are convenient for enforcing several "orthogonal" restrictions in one pattern:
(?=[^@]{1,20}@)[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
The lookahead will only match if there is an @ after no more than 20 non-@ characters. However, the lookahead does not actually advance the position of the regex engine in your subject string, so after the condition has been checked, the engine is still at the beginning of the email (or whichever position it is checking at the moment) and will continue with your pattern as before.
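One thing worth checking when scanning free text: the pattern is not anchored, so the engine may begin a match partway through a longer local part. The sketch below shows this in plain Perl with deliberately simplified character classes (an assumption made to keep it short, not the full pattern above), and adds a negative lookbehind as one possible guard.

```perl
use strict;
use warnings;

# Simplified stand-ins for the full pattern; these classes are assumptions.
my $local  = qr/[A-Za-z0-9._-]+/;
my $domain = qr/[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/;

# Lookahead only: at most 20 non-@ characters before the @.
my $loose  = qr/(?=[^\@]{1,20}\@)$local\@$domain/;
# Same, plus a lookbehind so a match cannot start inside a longer local part.
my $strict = qr/(?<![A-Za-z0-9._-])(?=[^\@]{1,20}\@)$local\@$domain/;

# Made-up input: the first address has 31 characters before the @.
my $text = 'ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net';

my @loose  = $text =~ /($loose)/g;
my @strict = $text =~ /($strict)/g;
print "loose:  @loose\n";
print "strict: @strict\n";
```

With the lookahead alone, the long address is reported as its last 20 local characters (LMNOPQRSTUVWXYZguest@host.com); with the lookbehind, only frank@email.net is returned. For the full pattern, the lookbehind class would need to mirror the local-part character class.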
Consider using Email::Address to capture email addresses, and then grepping the results for those having 20 or fewer characters before the @:
use strict;
use warnings;
use Email::Address;
my @addresses;
while ( my $line = <DATA> ) {
    push @addresses, $_
        for grep { /([^@]+)/ and length $1 < 21 }
        Email::Address->parse($line);
}
print "$_\n" for @addresses;
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net Line noise. test@host.com
Some stuff here... help@perl.org And even more here!
Nothing to see here. 01234567890123456789@numbers.com Nothing to see.
Output:
frank@email.net
test@host.com
help@perl.org
01234567890123456789@numbers.com

Perl Regex to match everything after # character

I have a bunch of text files that contain tags referenced by the # symbol. For e.g. a note is tagged 'home' if the note contains #home.
I am trying to find a Perl regex that will match everything after the # character, but not the # character itself.
I have this so far: (#\w+), which successfully matches the whole tag (e.g. it matches #home, #work etc.), but I can't find a way to modify it so that only the characters after the # get picked up.
I had a look at "perl regex to match all words following a character" but I couldn't work it out from that.
Any help would be great.
As @Quentin said, #(\w+) is the best solution.
#!/usr/bin/perl
while (<>) {
while (/#(\w+)/g) {
print $1, "\n";
}
}
If you DO want the match itself to be exactly the tag name, you can try (?<=#)\w+ instead. It matches the word characters after the #, with the # itself excluded.
#!/usr/bin/perl
while (<>) {
while (/(?<=#)\w+/g) {
print $&, "\n";
}
}
Reference: Using Look-ahead and Look-behind
Just move the # so it is outside the capturing group:
#(\w+)

Perl Regex (\d*\.\d{2})

I have run into a regex in Perl that seems to be giving me problems. I'm fairly new to Perl - but I don't think that's my problem.
Here is the code:
if ($line =~ m/<amount>(\d*\.\d{2})<\//) { $amount = $1; }
I'm essentially parsing an XML formatted file for a single tag. Here is the specific value that I'm trying to parse.
<amount>23.00000</amount>
Can someone please explain why my regex won't work?
EDIT: I should mention I'm trying to import the amount as a currency value. The trailing 3 decimals are useless.
You shouldn't use regex for parsing XML, but regardless, this will fix it:
if ($line =~ m|<amount>(\d*\.\d{2})\d*</|) { $amount = $1; }
The \d*\.\d{2} regex fragment only recognizes a number with exactly two decimal places. Your sample has five decimal places, and thus does not match this fragment.
You want to use \d*\.\d+ if you need at least one decimal place, or \d*\.\d{2,5} if you can have between 2 and 5 decimal places.
And you should not use back-tick characters in your regex, as they have no special meaning in a regex and are thus interpreted as regular characters.
So you want to use:
if ($line =~ m/<amount>(\d*\.\d{2,5})<\/amount>/) { $amount = $1; }
In a regex pattern, the sequence "{2}" means match exactly two instances of the preceding pattern.
So \d{2} will only match two digits, whereas your input text had five digits at that point.
If you don't want the trailing digits, then you can discard them using \d* outside the capture-parentheses.
Also, if your pattern contains slashes, consider using a different delimiter to avoid having to escape the slashes, e.g.
if ($line =~ m{<amount>(\d*\.\d{2})\d*</}) { $amount = $1; }
Also, if you want to parse XML, then you may want to consider using an XML library such as XML::LibXML.
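Putting the pieces together, a small sketch of the "discard the extra digits" approach could look like this. amount_2dp is a hypothetical helper name, and the second input line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the amount cut to two decimals, or undef
# if the line does not contain a matching <amount> element.
sub amount_2dp {
    my ($line) = @_;
    # Capture the first two decimals; the \d* outside the parentheses
    # consumes (and discards) any further digits.
    return $line =~ m{<amount>(\d*\.\d{2})\d*</amount>} ? $1 : undef;
}

print amount_2dp('<amount>23.00000</amount>'), "\n";
print amount_2dp('<amount>7.129</amount>'), "\n";
```

Note that this truncates rather than rounds: 7.129 becomes 7.12. If rounding matters for the currency value, capturing the full number and formatting it with sprintf("%.2f", ...) may be a better fit.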