Regex Match "words" That Contain Periods perl - regex

I am going through a TCPDUMP file (in .txt format) and trying to pull out any line that contains the use of the "word" V.I.D.C.A.M. It is embedded between a bunch of periods and includes periods which is really screwing me up, here is an example:
E..)..#.#.8Q...obr.f...$[......TP..P<........SMBs......................................NTLMSSP.........0.........`b.m..........L.L.<...V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M..............W.i.n.d.o.w.s. .5...1...W.i.n.d.o.w.s. .2.0.0.0. .L.A.N. .M.a.n.a.g.e.r..
How do you handle something like that?

You need to escape the periods:
if ($string =~ m{V\.I\.D\.C\.A\.M\.}) {
...
}
or if your string is entirely quoted, use the \Q which escapes any metacharacters that follow.
if ($string =~ m{\QV.I.D.C.A.M.})
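To see the difference the escaping makes, here is a minimal sketch that filters lines with and without \Q. The sample lines and the variable names are made up for illustration; they are not from the original dump.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the tcpdump text file.
my @lines = (
    'E..)..#.NTLMSSP...V.I.D.C.A.M.....W.i.n.d.o.w.s.',
    'E..)..#.NTLMSSP...VxIxDxCxAxMx....W.i.n.d.o.w.s.',
);

my $needle = 'V.I.D.C.A.M.';

# \Q...\E quotes every metacharacter in the interpolated variable,
# so each "." only matches a literal period.
my @quoted = grep { /\Q$needle\E/ } @lines;

# Without \Q, each "." matches any character, so the second line matches too.
my @unquoted = grep { /$needle/ } @lines;

print scalar(@quoted),   " line(s) match with \\Q\n";
print scalar(@unquoted), " line(s) match without \\Q\n";
```

Keeping the search text in a variable and wrapping it in \Q...\E means the pattern does not have to be escaped by hand.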


perl regex remove newlines in string

I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)
The real problem is that you will only find one group of \R's when there could be many groups between quotes. The best thing to do is make a callback (eval) with a general match between quotes, then substitute the \R's in the replacement.
Something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge; # put the surrounding quotes back
edit: If you're looking for only 1 linebreak cluster, you have to exclude linebreaks leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
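Putting the pieces together, a minimal self-contained sketch could look like the following. The sample input and the name strip_breaks are hypothetical, and keeping the surrounding quotes in the replacement is an assumption about what the cleaned dump should look like.

```perl
use strict;
use warnings;

# Hypothetical input standing in for the slurped dump file.
my $input = qq{INSERT INTO t VALUES ( "first\nsecond\r\n\r\nthird" );\n};

# Strip every linebreak cluster inside one quoted string.
sub strip_breaks {
    my ($content) = @_;
    $content =~ s/\R+//g;
    return $content;
}

# Match each quoted string as a whole and rebuild it with the quotes kept.
(my $cleaned = $input) =~ s/"([^"]*)"/'"' . strip_breaks($1) . '"'/ge;

print $cleaned;
```

Because the outer match takes the whole quoted string, every linebreak cluster inside it is removed, not just the first one. Linebreaks outside the quotes are left alone.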

Strings with quotemeta enabled not able to match specific regex

For the following strings with quotemeta applied, the if statements are not able to match the .cpp and .o file names. Am I doing anything wrong here?
E\:\\P4\\NTG5\\PATHOLOGY_products\\arm\-qnx\-m650\-4\.4\.2\-osz\-trc\-dbg\\gen\\deliveries\\ntg5\\arm\\api\\sys\\most\\pf\\mss\\src\\private\\DSIDSYSMOSTServerMoCCAStream\.cpp\
`E\:\\P4\\NTG5\\PATHOLOGY_products\\arm\-qnx\-m650\-4\.4\.2\-osz\-trc\-dbg\\bin\\deliveries\\ntg5\\arm\\api\\sys\\most\\pf\\mss\\src\\DSIDSYSMOSTServerMoCCAStream\.o\`
if ($a_path =~ m/[\\>](\w+\.(?:cpp|c))/) {
$compile_line_array->source_filename($a_path);
$compile_line_array->include_list_index($include_path_cnt);
$j=0;
last;
}
if($a_path =~ m/[\\>](\w+\.(?:o))/) {
$compile_line_array->object_file($a_path);
}
The regexes match a word character followed by a .; if your strings have a backslash before every ., they will not match.
Somehow, you are not thinking about this correctly: "quotemeta" isn't something that is enabled or disabled, it is an operator that sticks backslashes before some characters in your string. Why are you using it in the first place?
Why do you have your filenames run through quotemeta? As you've demonstrated, that's going to backslash escape all your .'s. Therefore if that's what you want to match against, you'll have to add some backslashes to your regex.
if ($a_path =~ m/[\\>](\w+\\\.(?:cpp|c))/) {
or
if($a_path =~ m/[\\>](\w+\\\.(?:o))/) {
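To make the effect of quotemeta visible, here is a small sketch with a shortened, hypothetical path (the real ones are much longer). It shows the string before and after quotemeta and which pattern matches which form.

```perl
use strict;
use warnings;

# Hypothetical shortened path; single quotes turn each \\ into one backslash.
my $plain  = 'E:\\P4\\src\\Stream.cpp';
my $quoted = quotemeta $plain;   # backslash before every non-word character

print "$plain\n$quoted\n";

# Pattern for the plain path: a literal "." follows the name.
my $plain_re  = qr/[\\>](\w+\.(?:cpp|c))/;
# Pattern for the quotemeta'd path: a backslash sits between name and ".".
my $quoted_re = qr/[\\>](\w+\\\.(?:cpp|c))/;

my ($from_plain)  = $plain  =~ $plain_re;
my ($from_quoted) = $quoted =~ $quoted_re;

print "plain:  $from_plain\n";
print "quoted: $from_quoted\n";
```

If the paths only need to be matched, not interpolated into another regex, it is probably simpler to leave quotemeta out and keep the original patterns.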

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing with and learning Perl so that I can read log files. I want to search every line for a string of alphanumeric characters followed by a ; at the beginning of the line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
\A # absolute start of string
(?:$alphanum)+ # I can change this elsewhere
;
/x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
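A quick way to check such a pattern is to run it over a few lines that should and should not match. This is only a sketch with made-up log lines; it uses \A so that the anchor does not depend on the /m flag, and a capture group to pull out the token.

```perl
use strict;
use warnings;

# Hypothetical log lines for illustration.
my @log = (
    'abc123;rest of the record',
    ' abc123;leading space',
    'abc 123;space inside',
    ';no token at all',
);

# \A anchors at the very start of the string regardless of /m;
# the capture holds the token before the first ";".
my @tokens = map { /\A([a-z0-9]+);/i ? $1 : () } @log;

print "$_\n" for @tokens;
```

Only the first line yields a token here, because the other three do not start with an unbroken run of letters and digits followed by a ;.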

Perl Regex (\d*\.\d{2})

I have run into a regex in Perl that seems to be giving me problems. I'm fairly new to Perl - but I don't think that's my problem.
Here is the code:
if ($line =~ m/<amount>(\d*\.\d{2})<\//) { $amount = $1; }
I'm essentially parsing an XML formatted file for a single tag. Here is the specific value that I'm trying to parse.
<amount>23.00000</amount>
Can someone please explain why my regex won't work?
EDIT: I should mention I'm trying to import the amount as a currency value. The trailing 3 decimals are useless.
You shouldn't use regex for parsing XML, but regardless this will fix it:
if ($line =~ m|<amount>(\d*\.\d{2})\d*</|) { $amount = $1; }
The \d*\.\d{2} regex fragment only recognizes a number with exactly two decimal places. Your sample has five decimal places, and thus does not match this fragment.
You want to use \d*\.\d+ if you need at least one decimal place, or \d*\.\d{2,5} if you can have between 2 and 5 decimal places.
Also, do not put back-tick characters in your regex: they have no special meaning there and are matched as literal characters.
So you want to use:
if ($line =~ m/<amount>(\d*\.\d{2,5})<\/amount>/) { $amount = $1; }
In a regex pattern, the sequence "{2}" means match exactly two instances of the preceding pattern.
So \d{2} will only match two digits, whereas your input text had five digits at that point.
If you don't want the trailing digits, then you can discard them using \d* outside the capture-parentheses.
Also, if your pattern contains slashes, consider using a different delimiter to avoid having to escape the slashes, e.g.
if ($line =~ m{<amount>(\d*\.\d{2})\d*</}) { $amount = $1; }
Also, if you want to parse XML, then you may want to consider using an XML library such as XML::LibXML.
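For reference, a minimal runnable sketch of the "discard the trailing digits" approach, using the sample line from the question:

```perl
use strict;
use warnings;

my $line = '<amount>23.00000</amount>';

my $amount;
# Capture up to two decimals; \d* outside the parentheses swallows the rest.
if ($line =~ m{<amount>(\d*\.\d{2})\d*</amount>}) {
    $amount = $1;
}
print "$amount\n";
```

Note that this truncates the value to two decimals rather than rounding it. If rounding matters for the currency value, capturing all the digits and formatting with sprintf("%.2f", ...) is one alternative.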

Regex Replace Cleaning a string from unwanted characters

I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
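In Perl, the alternative regex could be used roughly like this. It is a sketch only: the helper name slugify is hypothetical, and trimming dashes from both ends is an extra step assumed here so that a trailing "!" does not leave a dangling dash.

```perl
use strict;
use warnings;

# Hypothetical helper name; not from the original answers.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;   # each run of unwanted characters becomes one dash
    $slug =~ s/^-+|-+$//g;       # drop dashes left at either end
    return $slug;
}

print slugify('Football & Rugby News!'), "\n";
```

Because " & " is a single run of unwanted characters, it collapses to one dash instead of three.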
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # all permitted characters, as a plain string so it can sit inside [...]
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news