How to limit match length before a certain character? - regex

I am using the following regular expression to scan input text files for valid emails.
[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*#(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
Now I also need to limit the matches to 20 characters before the '#' sign in the email address, but not sure how to do it.
PS. I am using the Perl regular expression library (TPerlRegex) found in Delphi XE2.
Please can you help me?

Since your library is supposed to be PERL compatible, it should support lookaheads. These are convenient to ensure several "orthogonal" restrictions in the pattern:
(?=[^#]{1,20}#)[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*#(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
The lookahead will only match if there is an # after no more than 20 non-# characters. However, the lookahead does not actually advance the position of the regex engine in your subject string, so after the condition has been checked, the engine is still at the beginning of the email (or whichever position it is checking at the moment) and will continue with your pattern as previously.

Consider using Email::Address to capture email addresses, and then grepping the results for those having 20 or fewer characters before the #:
use strict;
use warnings;
use Email::Address;
my #addresses;
while ( my $line = <DATA> ) {
push #addresses, $_
for grep { /([^#]+)/ and length $1 < 21 }
Email::Address->parse($line);
}
print "$_\n" for #addresses;
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZguest#host.com frank#email.net Line noise. test#host.com
Some stuff here... help#perl.org And even more here!
Nothing to see here. 01234567890123456789#numbers.com Nothing to see.
Output:
frank#email.net
test#host.com
help#perl.org
01234567890123456789#numbers.com

Related

Make a regular expression in perl to grep value work on a string with different endings

I have this code in perl where I want to extract the value of 'EUR_AF', in this case '0.39'.
Sometimes 'EUR_AF' ends with ';', sometimes it doesn't.
Alternatively, 'EUR_AF' may end with '=0' instead of '=0.39;' or '=0.39'.
How do I make the code handle that? Can't seem to find it online...I could of course wrap everything in an almost endless if-elsif-else statement, but that seems overkill.
Example text:
AVGPOST=0.9092;AN=2184;RSQ=0.5988;ERATE=0.0081;AC=144;VT=SNP;THETA=0.0045;AA=A;SNPSOURCE=LOWCOV;LDAF=0.0959;AF=0.07;ASN_AF=0.05;AMR_AF=0.10;AFR_AF=0.11;EUR_AF=0.039
Code: $INFO =~ m/\;EUR\_AF\=(.*?)(;)/
I did find that: $INFO =~ m/\;EUR\_AF\=(.*?0)/ handles the cases of EUR_AF=0, but how to handle alternative scenarios efficiently?
Extract one value:
my ($eur_af) = $s =~ /(?:^|;)EUR_AF=([^;]*)/;
my ($eur_af) = ";$s" =~ /;EUR_AF=([^;]*)/;
Extract all values:
my %rec = split(/[=;]/, $s);
my $eur_af = $rec{EUR_AF};
This regex should work for you: (?<=EUR_AF=)\d+(\.\d+)?
It means
(?<=EUR_AF=) - look for a string preceeded by EUR_AF=
\d+(\.\d+)? - consist of a digit, optionally a decimal digit
EDIT: I originally wanted the whole regex to return the correct result, not only the capture group. If you want the correct capture group edit it to (?<=EUR_AF=)(\d+(?:\.\d+)?)
I have found the answer. The code:
$INFO =~ m/(?:^|;)EUR_AF=([^;]*)/
seems to handle the cases where EUR_AF=0 and EUR_AF=0.39, ending with or without ;. The resulting $INFO will be 0 or 0.39.

How to find value after specific/static string using regex(perl)?

I'm still learning regex, and have a long ways to go so would appreciate help from any of you with more regex experience. I'm working on a perl script to parse multiple log files, and parse for certain values. In this case, I'm trying to get a list of user names.
Here's what my log file looks like:
[date timestamp]UserName = Joe_Smith
[date timestamp]IP Address = 10.10.10.10
..
Just testing, I've been able to pull it out using \UserName\s\=\s\w+, however I just want the actual UserName value, and not include the 'UserName =' part. Ideally if I can get this to work, I should be able to apply the same logic for pulling out the IP Address etc, but just hoping to get list of Usernames for the moment.
Also, the usernames are always in the format above of Firstname_Lastname, so I believe \w+ should always get everything I need.
Appreciate any help!
You should capture the part of the matched string that you are interested in using parentheses in the regular expression.
If the match succeeds, then captures are available in the built-in variables $1, $2 etc, numbered in the order that their opening parenthesis appears in the regular expressions.
In this case you need only a single capture so you need look only at $1.
Beware that you should always check that a regex match succeeded before using the values in the capture variables, as they retain the values from the last successful match and a failed match doesn't reset them.
use strict;
use warnings;
my $str = '[date timestamp]UserName = Joe_Smith';
if ($str =~ /UserName = (\w+)/) {
print $1, "\n";
}
output
Joe_Smith
Another way to do it:
my ($username) = $str =~ /UserName\s\=\s(\w+)/
or warn "no username parsed from '$str'\n";
You should make the regex as \UserName\s\=\s(\w+)$ And after this the part in the bracket will be available in the variable $1. My perl is a bit rusty, so if it doesnt work right, look at http://www.troubleshooters.com/codecorn/littperl/perlreg.htm#StringSelections

Search html file for random string using regex

I am trying to use Perl to search through an html file, looking for a semi-random string and store the match in a variable or print it out.
The string is the name of a jpg image and always follows the pattern of 9 digits followed by 6 lower case letters, i.e.
140005917smpxgj.jpg
But it is random every time. I am sure Perl can do this, but I will admit I am getting a bit confused.
Not too complicated. You may want to watch out for varying caps in the extension, e.g. JPG. If that is a concern, you may add (?i) before the extension.
You may also wish to prevent partial names, e.g. discard a match that has more than 9 digits. That is the (?<!\d) part: Make sure no digit characters precede the match.
ETA: Now extracts multiple matches too, thanks to ikegami.
while (<>) {
for (/(?<!\d)([0-9]{9}[a-z]{6}\.(?i)jpg)/g) {
say;
push #match, $_;
}
}
Try this regex:
/\b\d{9}[a-z]{6}\.jpg/
perldoc perlre
use warnings;
use strict;
while (<DATA>) {
if (/ ( [0-9]{9} [a-z]{6} [.] jpg ) /x) {
print "$1\n";
}
}
__DATA__
foo 140005917smpxgj.jpg bar
sdfads 777666999abcdef.jpg dfgffgh
Prints:
140005917smpxgj.jpg
777666999abcdef.jpg
the solution regex is \d{9}[a-z]{6}\.jpg

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
It Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warning;
my $data = "#Hello#stack\n#overflow\n";
$data =~ s/^?#//g ;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^?#//g ;
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my #digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", #digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
it would depend on your flavor of Regex and coding language (C#, PERL, etc.) but in C# you'd do something like
string rX = #"\D+";
Regex.replace(input, rX, "");
return input.SubString(0, 5);
Note: I'm not sure about that Regex match (others here may have a better one), but basically since Regex itself doesn't "replace" anything, only match patterns, you'd have to look for any non-digit characters; once you'd matched that, you'd need to replace it with your languages version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, "\D+", "");
to remove all non number characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
#copy over
$temp = $str;
#Remove non-numbers
$temp =~ s/\D//;
#Get the first 5 numbers, exactly.
$temp =~ /\d{5}/;
#Grab the match- ASSUMES that there will be a match.
$first_digits = $1
result =~ s/^(\d{5}).*/$1/
Replace any text starting with a digit 0-9 (\d) exactly 5 of them {5} with any number of anything after it '.*' with $1, which is the what is contained within the (), that is the first five digits.
if you want any first 5 characters.
result =~ s/^(.{5}).*/$1/
Use whatever programming language you are using to evaluate this.
ie.
regex.replace(text, "^(.{5}).*", "$1");