How to find value after specific/static string using regex (Perl)?

I'm still learning regex and have a long way to go, so I would appreciate help from anyone with more regex experience. I'm working on a Perl script to parse multiple log files for certain values. In this case, I'm trying to get a list of user names.
Here's what my log file looks like:
[date timestamp]UserName = Joe_Smith
[date timestamp]IP Address = 10.10.10.10
..
Just testing, I've been able to pull it out using \UserName\s\=\s\w+. However, I want only the actual UserName value, without the 'UserName =' part. Ideally, if I can get this to work, I should be able to apply the same logic to pull out the IP Address etc., but for the moment I'm just hoping to get a list of usernames.
Also, the usernames are always in the format above of Firstname_Lastname, so I believe \w+ should always get everything I need.
Appreciate any help!

You should capture the part of the matched string that you are interested in using parentheses in the regular expression.
If the match succeeds, the captures are available in the built-in variables $1, $2, etc., numbered in the order in which their opening parentheses appear in the regular expression.
In this case you need only a single capture so you need look only at $1.
Beware that you should always check that a regex match succeeded before using the values in the capture variables, as they retain the values from the last successful match and a failed match doesn't reset them.
use strict;
use warnings;
my $str = '[date timestamp]UserName = Joe_Smith';
if ($str =~ /UserName = (\w+)/) {
    print $1, "\n";
}
output
Joe_Smith
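Applied to a whole log, the same capture can be run once per line. The following is a minimal sketch, assuming the log format shown in the question; the helper name extract_values and the sample lines are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return every value that follows "<key> = " in the
# given lines. \Q...\E quotes the key so that a key containing a space,
# such as "IP Address", is matched literally.
sub extract_values {
    my ($key, @lines) = @_;
    my @values;
    for my $line (@lines) {
        # Only read $1 after a successful match
        if ($line =~ /\Q$key\E\s*=\s*(\S+)/) {
            push @values, $1;
        }
    }
    return @values;
}

# Sample lines in the format shown in the question (assumed data)
my @log = (
    '[date timestamp]UserName = Joe_Smith',
    '[date timestamp]IP Address = 10.10.10.10',
    '[date timestamp]UserName = Ann_Jones',
);

print "$_\n" for extract_values('UserName', @log);
print "$_\n" for extract_values('IP Address', @log);
```

The capture here is \S+ rather than \w+ so that the same helper also picks up values containing dots, such as the IP address; for usernames alone, \w+ as in the answer above would do.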

Another way to do it:
my ($username) = $str =~ /UserName\s\=\s(\w+)/
or warn "no username parsed from '$str'\n";

You should write the regex as UserName\s=\s(\w+)$ (without the leading backslash, since \U has a special meaning in a Perl pattern). After a successful match, the part in the parentheses will be available in the variable $1. My Perl is a bit rusty, so if it doesn't work right, look at http://www.troubleshooters.com/codecorn/littperl/perlreg.htm#StringSelections

Related

Possible to use one RegEx group for multiple matches?

Example Text:
[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]
Current RegEx:
[ABC.*?(?:value='(.*?)')+.*?]]]
What I want to achieve:
There's an extremely long text (HTTP Response) with data I want to grab. A single dataset contains multiple lines. On every line the data I want to collect is located inside the "value:''" tag. On each line there are multiple of those value tags. Is it somehow possible to use (optimize) the above regex to get the data of all value tags with just a single capturing group in the regex pattern?
To clarify what I want: alternatively I would have to use the following pattern:
[ABC.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?]]]
Using Perl, you can easily get at all matches of a regular expression, and most other regular expression libraries have similar capabilities. As you want to match a header, doing a repeated match with an anchor (\G) is the easiest:
use strict;
#use Regexp::Debugger;
my $data = "[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]";
my @matches = $data =~ /(?:^\[ABC|\G).*?\bvalue='([^']*)'/g;
print "[$_]" for @matches;
__END__
[123][5463][Text]
Most likely you will need to add the "global" flag to whatever regex library you are using for matching.
Personally, I would split this up into a two-step process. First, extract the string between [ABC[[ and ]]], and then extract all value='...' parts from that string. Also, most likely, you can parse the string [ABC[[...]]] in a sane way, counting opening and closing brackets. Or maybe that string is even JSON and you can just use a proper parser there?
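A two-step version might look like the sketch below. It assumes that every dataset starts with the literal text [ABC and runs until the next [ABC, which is a guess about the format rather than something the question confirms; the helper name values_per_record and the shortened sample data are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper. Step 1: cut the text into records, each starting
# at "[ABC" (assumed record marker). Step 2: collect every value='...'
# inside each record with a single capturing group and /g.
sub values_per_record {
    my ($data) = @_;
    my @records;
    for my $record (split /(?=\[ABC)/, $data) {
        my @values = $record =~ /\bvalue='([^']*)'/g;
        push @records, \@values if @values;
    }
    return @records;
}

# Shortened sample data with two records (assumed)
my $data = "[ABC[[value='123'SomeTextHere[]]][value='5463',x]][[value='Text';]]]]]"
         . "[ABC[[value='9']]]";

for my $values (values_per_record($data)) {
    print join(',', @$values), "\n";
}
```

Keeping the record split separate from the value extraction means each pattern stays short, and the values remain grouped by the dataset they came from.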

Regular expression tag matching

I have a very simple Perl function that returns the content of a tag in custom XML code I need to parse. However, if there are line returns inside of the tags, then it returns an empty value and I'm not sure how to fix it:
sub in_tag
{
    my ($text, $tag) = @_;
    my ($content) = $text =~ m/<$tag.*>(.*)<\/$tag>/;
    $content = $content . "";
    return $content;
}
# works
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John Doe</name><id type=\"number\">67</id></item>", "name");
# doesnt work
print in_tag("<item><creation type=\"date\">2014-01-03</creation><name type=\"word\">John\nDoe</name><id type=\"number\">67</id></item>", "name");
To make the . regex metacharacter match a newline, you need to use the /s flag:
m/..../s;
You also want to use non-greedy quantifiers in your regular expression. Put a ? after the * to still match zero or more, but with the provision that it doesn't go beyond text that would match the next part of the pattern:
m/<$tag.*?>(.*?)<\/$tag>/
I don't mind this simple sort of extraction for quick programs or small, uncomplicated inputs, but beyond that I like XML::Twig. It takes a bit to get used to, but once you get the hang of it you'll be able to do all sorts of fancy things with almost no effort.
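Putting the /s flag and the non-greedy quantifiers together, a revised in_tag could look like the sketch below. The \Q...\E quoting of the tag name, the [^>]* attribute match, and the empty-string fallback are additions for illustration, not part of the original function.

```perl
use strict;
use warnings;

# Sketch of in_tag with /s (so . also matches newlines) and a non-greedy
# (.*?) so the capture stops at the first closing tag.
sub in_tag {
    my ($text, $tag) = @_;
    my ($content) = $text =~ m/<\Q$tag\E\b[^>]*>(.*?)<\/\Q$tag\E>/s;
    # Return an empty string instead of undef when the tag is absent
    return defined $content ? $content : '';
}

my $xml = "<item><name type=\"word\">John\nDoe</name><id type=\"number\">67</id></item>";
print in_tag($xml, 'name'), "\n";
print in_tag($xml, 'id'), "\n";
```

This still cannot cope with nested tags of the same name, which is one reason a real XML parser is the safer choice for anything beyond simple input.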

Regex match a specific filename format Perl

I'm trying to match a filename format which is filename_nrows_ncols. I came up with (_[\d]+_[\d]+)$ and tested it in Rubular and it works there. http://www.rubular.com/r/W7DKNhmpMV
But when I'm trying to assign the match to a variable in my Perl code, I get a "Use of uninitialized value..." error. What's wrong with my regex? Thanks in adv.
$match =~ /(_[\d]+_[\d]+)$/;
Without seeing your code, it's hard to say, but I'd imagine it should look something like this:
if ($filename =~ /(\d+_\d+)$/) {
# Do something
}
By the way the [] around [\d] isn't necessary in this case. If you had something other than the \d within it, it would be.
-- EDIT --
I think I see what's wrong. You want the results of the regex to go into $match. If that's the case, assuming your filename is in the default variable, then you probably want this:
my ($match) = /(\d+_\d+)$/;
or if it's in another variable
my ($match) = $filename =~ /(\d+_\d+)$/;
The error, by the way, only appears to be a warning from "use warnings" or -W. It's a good one, though.
You'll need to provide the entire line of code that's causing that error. The regular expression itself looks fine, although you may be better off with something like ^(.+)_(\d+)_(\d+)$ if you plan on doing anything with the filename, nrows, or ncols (which would then be stored in $1, $2, and $3 respectively).
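As a rough sketch of that three-capture pattern in use (the helper name parse_filename and the sample filenames are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: split "name_nrows_ncols" into its three parts.
# Returns an empty list when the name does not fit the format.
sub parse_filename {
    my ($filename) = @_;
    if ($filename =~ /^(.+)_(\d+)_(\d+)$/) {
        return ($1, $2, $3);
    }
    return;
}

for my $filename ('matrix_10_20', 'no_dims') {
    # The list assignment is false when parse_filename returns nothing
    if (my ($name, $nrows, $ncols) = parse_filename($filename)) {
        print "$filename: name=$name rows=$nrows cols=$ncols\n";
    }
    else {
        print "$filename: no match\n";
    }
}
```

Because the assignment sits inside the if condition, the variables are only used after a successful match, which avoids the "uninitialized value" warning from the question.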

How to check for one Perl regex pattern while excluding another

As an example, I need to filter the following text and search for the word example, but omit results in which the line begins with a hash #
# filler text example
example
example 2
# test example 3
I've tried a few different combinations, but can't seem to get this right.
Update
I've tried /^[^#].*example/g and /^(?!#).*example.*/g but didn't seem to get any results
It is strangely common to attempt to bundle far too much functionality into a single regex, while people don't seem to do the same thing with any other operator.
There is nothing wrong with writing
if ( /example/ and not /^#/ ) {
    print;
}
and it is far clearer than any single equivalent regular expression.
You could change this to multiple statements if you wish; something like
while (<>) {
    next if /^#/;
    print if /example/;
}
Or you could allow comments to start in the middle of a line by creating a temporary variable that contains the text with all characters from the hash # onwards removed, and process that instead:
while (<>) {
    my ($trimmed) = /^([^#]*)/;
    print if $trimmed =~ /example/;
}
Note that if you are hoping to process Perl code using this, then there are cases which will have to receive special treatment where a hash doesn't denote the start of a comment, such as the $#array construct, or an alternative pattern delimiter like m#example#.
^((?!#).*example.*)$ will work better with your regex101 tester. Also use the flags gm instead of just g. The tester is processing the sample text as a single string, but devnull's answer works if you're processing the text line by line.
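For text held in a single string, the single-pattern approach with the gm flags can be sketched in Perl as follows. The sample text is the one from the question; the variable names are chosen for the example.

```perl
use strict;
use warnings;

# Sample text from the question, held in one string
my $text = join "\n",
    '# filler text example',
    'example',
    'example 2',
    '# test example 3';

# /m lets ^ and $ match at the start and end of every line;
# /g in list context returns the capture from every match.
my @hits = $text =~ /^((?!#).*example.*)$/gm;

print "$_\n" for @hits;
```

Only the lines "example" and "example 2" are returned, since the negative lookahead rejects any line whose first character is a hash.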

How to limit match length before a certain character?

I am using the following regular expression to scan input text files for valid emails.
[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
Now I also need to limit the matches to 20 characters before the '@' sign in the email address, but I'm not sure how to do it.
PS. I am using the Perl regular expression library (TPerlRegex) found in Delphi XE2.
Please can you help me?
Since your library is supposed to be Perl-compatible, it should support lookaheads. These are convenient to ensure several "orthogonal" restrictions in the pattern:
(?=[^@]{1,20}@)[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&*+/=?^_`{|}~-]+)*@(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?
The lookahead will only match if there is an @ after no more than 20 non-@ characters. However, the lookahead does not actually advance the position of the regex engine in your subject string, so after the condition has been checked, the engine is still at the beginning of the email (or whichever position it is checking at the moment) and will continue with your pattern as previously.
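The sketch below tries the lookahead in Perl with a much simplified stand-in for the full email pattern (an assumption made to keep the example short). It also adds a negative lookbehind, which is not part of the pattern above: when scanning free text, a match could otherwise begin partway through a longer local part, at a position from which 20 or fewer characters remain before the @.

```perl
use strict;
use warnings;

# Simplified stand-ins for the local part and domain (assumed patterns,
# far less strict than the full expression)
my $local  = qr/[\w.-]+/;
my $domain = qr/[\w-]+(?:\.[\w-]+)+/;

# (?<![\w.-])       : do not start in the middle of a longer local part
# (?=[^\@]{1,20}\@) : require an @ after at most 20 other characters
my $email = qr/(?<![\w.-])(?=[^\@]{1,20}\@)$local\@$domain/;

my $text  = 'ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net';
my @found = $text =~ /($email)/g;

print "$_\n" for @found;
```

With this input only frank@email.net is found; the first address has 31 characters before the @ and is skipped entirely rather than matched from its 12th character onwards.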
Consider using Email::Address to capture email addresses, and then grepping the results for those having 20 or fewer characters before the @:
use strict;
use warnings;
use Email::Address;
my @addresses;
while ( my $line = <DATA> ) {
    push @addresses, $_
        for grep { /([^@]+)/ and length $1 < 21 }
        Email::Address->parse($line);
}
print "$_\n" for @addresses;
__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZguest@host.com frank@email.net Line noise. test@host.com
Some stuff here... help@perl.org And even more here!
Nothing to see here. 01234567890123456789@numbers.com Nothing to see.
Output:
frank@email.net
test@host.com
help@perl.org
01234567890123456789@numbers.com