perl regex remove newlines in string - regex

I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)

The real problem is that your pattern finds only one group of \R's, when there could be many groups between the quotes. The best thing to do is use a callback (the /e modifier evaluates the replacement as code) with a general match between quotes, then remove the \R's inside the replacement.
Something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge; # put back the quotes consumed by the match
Edit: If you're looking for only one linebreak cluster, you have to exclude linebreaks from the text leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
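Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. The function name strip_newlines_in_quotes and the sample dump line are made up for illustration. The sketch assumes Perl 5.10 or later (for \R) and that quoted strings never contain escaped double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every line-break cluster found inside
# double-quoted strings, leaving text outside the quotes untouched.
sub strip_newlines_in_quotes {
    my ($input) = @_;
    # /e runs the replacement as code; copy $1 before the inner s{}{} resets it.
    $input =~ s/"([^"]*)"/ my $c = $1; $c =~ s{\R+}{}g; qq{"$c"} /ge;
    return $input;
}

# Made-up sample standing in for one statement of the slurped dump.
my $dump = qq{INSERT INTO t VALUES ( "This is\nmy test\r\nstring." );\n};
print strip_newlines_in_quotes($dump);
```

Because the whole quoted string is matched first and cleaned inside the callback, any number of line-break groups within one string is handled, which is the point the answer makes.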

Related

Regex Match "words" That Contain Periods perl

I am going through a TCPDUMP file (in .txt format) and trying to pull out any line that contains the use of the "word" V.I.D.C.A.M. It is embedded between a bunch of periods and includes periods which is really screwing me up, here is an example:
E..)..#.#.8Q...obr.f...$[......TP..P<........SMBs......................................NTLMSSP.........0.........`b.m..........L.L.<...V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M.....V.I.D.C.A.M..............W.i.n.d.o.w.s. .5...1...W.i.n.d.o.w.s. .2.0.0.0. .L.A.N. .M.a.n.a.g.e.r..
How do you handle something like that?
You need to escape the periods:
if ($string =~ m{V\.I\.D\.C\.A\.M\.}) {
...
}
or, if the whole pattern is meant literally, use \Q, which escapes any metacharacters that follow (up to \E or the end of the pattern).
if ($string =~ m{\QV.I.D.C.A.M.\E}) {
...
}
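To see why the escaping matters, here is a small sketch comparing an unescaped pattern with a \Q-quoted one. The helper name contains_literal and the second sample string are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains $literal as plain text.
sub contains_literal {
    my ($line, $literal) = @_;
    return $line =~ /\Q$literal\E/ ? 1 : 0;
}

my $word = 'V.I.D.C.A.M.';
for my $line ('...L.L.<...V.I.D.C.A.M.....W.i.n.d.o.w.s.', 'VxIxDxCxAxMx') {
    # Unescaped, each dot matches any character, so the second line matches too.
    my $unescaped = $line =~ /V.I.D.C.A.M./ ? 1 : 0;
    printf "unescaped=%d literal=%d\n", $unescaped, contains_literal($line, $word);
}
```

Keeping the word in a variable and interpolating it under \Q also avoids retyping the backslashes whenever the search term changes.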

Perl - regex - I want to read and search each line for a string followed by a ";"

I'm playing and learning Perl so that I can read log files. I want to search every line and look for a string of alphanumeric followed by this ; at the beginning of each line.
This is part of what I have:
if ($line =~ /\S([a-zA-Z][a-zA-Z0-9]*)/)
but I think this is wrong.
Please advise.
"Alphanumeric" is a bit ambiguous now, since many people still infected with ASCII think it means A-Z with 0-9, but Perl thinks about it differently depending on the version (Know your character classes under different semantics). As with any regular expression, your job is to design a pattern that includes only what you want and doesn't exclude anything that you do want.
Also, many people still use the ^ to mean the beginning of the string, which it does if there's no /m flag. However, the re module can now set default flags, so your regex might not be what you think it is when another programmer tries to be helpful.
I tend to write things like:
my $alphanum = qr/[a-z0-9]/i;
my $regex = qr/
\A # absolute start of string
(?:$alphanum)+ # I can change this elsewhere
;
/x;
if( $line =~ $regex ) { ... }
Try:
if ($line =~ /^[a-z0-9]+;/i) { ... }
^ matches the start of a line. The + matches once or more. /i makes the search case-insensitive.
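A quick way to check either pattern is to run it over a few sample lines. The sketch below uses the \A-anchored form; the helper name and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line starts with ASCII letters/digits
# immediately followed by a semicolon.
sub starts_with_token {
    my ($line) = @_;
    return $line =~ /\A[a-z0-9]+;/i ? 1 : 0;
}

for my $line ('abc123;rest', ' abc;', ';only', 'foo bar;') {
    printf "%-14s %s\n", "'$line'", starts_with_token($line) ? 'match' : 'no match';
}
```

Only the first sample matches: the others fail because of a leading space, an empty token, or a space inside the token.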

Regex Replace Cleaning a string from unwanted characters

I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # permitted characters, as the body of a character class (a plain string, not qr//, which would add "(?^:...)" to the class)
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news
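The "replace every run of unwanted characters with a single dash" idea from the answers above can also be sketched as one small function. The name slugify is made up, and trimming leading and trailing dashes is an extra step assumed here because the sample title ends in punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase, collapse each run of non-alphanumerics
# into one dash, then drop dashes left at either end.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;
    $slug =~ s/^-+|-+$//g;
    return $slug;
}

print slugify('Football & Rugby News!'), "\n";
print slugify('Latest news'), "\n";
```

Because existing dashes are also outside a-z0-9, a title such as "football-&-rugby" collapses to a single dash between words as well.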

Perl Regular expression remove double tabs, line breaks, white spaces

I want to write a perl script that removes double tabs, line breaks and white spaces.
What I have so far is:
$txt=~s/\r//gs;
$txt=~s/ +/ /gs;
$txt=~s/\t+/\t/gs;
$txt=~s/[\t\n]*\n/\n/gs;
$txt=~s/\n+/\n/gs;
But,
1. It's not beautiful. Should be possible to do that with far less regexps.
2. It just doesn't work and I really do not know why. It leaves some double tabs, white spaces and empty lines (i.e. lines with only a tab or whitespace)
I could solve it with a while, but that is very slow and ugly.
Any suggestions?
You've got a bit of a mish-mash of stuff in there, not all of which corresponds to what you said. Let's break down what you have and then perhaps you can work from there to what you want.
$txt=~s/\r//s; # removes a single \r from the line. Did you mean to use g on this one?
$txt=~s/[\t ]\n//s; # match a single \t OR space right before a \n, and remove.
$txt=~s/ +/ /gs;# match at least 2 spaces, replace with a single space
$txt=~s/\t+/ /gs;# match at least 2 \t, replace with a single space
$txt=~s/\n /\n/s;# remove a space immediately following a \n
$txt=~s/\t /\t/s;# remove a space immediately following a \t
$txt=~s/\n+/ /gs;# match at least 2 \n, replace them all with a single space
I have the feeling that's not at all what you want to accomplish.
I'm honestly unclear on what you want to do. The way I read your stated intent, I would have thought you'd want to replace all double tabs with single tabs, all double line breaks with single line breaks, and all double spaces with single spaces. I'll further surmise that you want to actually do runs of those characters, not just doubles. Here's the regexes for what I've just said, hopefully that will give you something to go on:
(I've also removed all \r).
$txt=~s/\r//g; # remove all \r
$txt=~s/\t+/\t/g; # replace each run of tabs with a single tab
$txt=~s/\n+/\n/g; # replace each run of \n with a single \n
$txt=~s/ +/ /g; # replace each run of spaces with a single space
Given that your attempted regexes don't seem to match the way I read your stated desire, I suspect that there's some fuzziness about what you really want to do here. You might want to think further about what you are trying to accomplish, which should help the regexes become clearer.
I am not sure of your exact requirements, but here are a few hints that might get you going:
To compress all white space to spaces (probably too powerful!)
$txt=~s/\s+/ /g ;
To remove any white space at start of line
$txt=~s/^ +//gm ;
To compress multiple tabs to a space
$txt=~s/\t+/ /g ;
While I try to work out a quick real answer for you: have you looked at the docs? (And no, I'm not just saying RTFM.) perldoc is a great tool and has some useful info; may I suggest perldoc perlrequick and perldoc perlreref to get you going.
First of all, you might find it easier to split the long text into lines and operate on the lines separately and then join them again. Also if we make a new array to store the results to be joined we can easily exclude empty lines.
Finally, it strikes me that in operating on a long block of text, that text is likely to be external to your script. If you are really opening a file and slurping it into a variable, you could more easily do what I am leaving in as a comment block. To use this method, comment out the first block and uncomment the second block; the third block remains for either method. I include this because if you really are reading in a file and then splitting it, it saves a lot of work to just read it in line by line. You could then write it out to another file if desired.
#!/usr/bin/env perl
use strict;
use warnings;
my @return_lines;
### Begin "text in script" Method ###
my $txt = <<END;
hello world
hello world
hello world
hello world
END
#note last two are to test removing spaces after tabs
my @lines = split(/\n/, $txt);
foreach my $line (@lines) {
### Begin "text in external file" Method (commented) ###
#my $filename = 'file.txt';
#open( my $filehandle, '<', $filename);
#while (<$filehandle>) {
# my $line = $_;
### Script continues for either input method ###
$line =~ s/^\s*//; #remove leading whitespace
$line =~ s/\s*$//; #remove trailing whitespace
$line =~ s/\ {2,}/ /g; #remove multiple literal spaces
$line =~ s/\t{2,}/\t/g; #remove excess tabs (is this what you meant?)
$line =~ s/(?<=\t)\ *//g; #remove any spaces after a tab
push @return_lines, $line unless $line=~/^\s*$/; #remove empty lines
}
my $return_txt = join("\n", @return_lines) . "\n";
print $return_txt;
This is a bit unclear.
If you have a line like ab TABcTABTAB \n\n, what do you want as a result? I am reading the above as ab c\n?
In other words, is it correct that you want:
All the whitespace (e.g. any amount of spaces and tabs) in the middle of the lines converted to a single space?
All the whitespace at the beginning OR end of the line removed (except for newlines)?
Remove completely empty lines?
$s =~ s/[\t ]+$//mg; # Remove trailing spaces/tabs from every line
$s =~ s/^[\t ]+//mg; # Remove leading spaces/tabs from every line
$s =~ s/[\t ]+/ /g; # Replace each run of whitespace mid-line with 1 space
$s =~ s/^\n//mg; # Remove completely empty lines
Please note that I used the /m modifier (read perldoc perlre for details) so that the start/end-of-line anchors work within a multi-line string, and /g so that every line is processed rather than only the first match.
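As a rough sketch of the three rules listed in this answer (collapse inner whitespace, trim line ends, drop empty lines), the substitutions can be wrapped in one function. The name tidy_whitespace and the sample input are invented, and removing \r first is an assumption carried over from the question.

```perl
use strict;
use warnings;

# Hypothetical helper applying the rules above to a multi-line string.
sub tidy_whitespace {
    my ($s) = @_;
    $s =~ s/\r//g;           # drop carriage returns
    $s =~ s/[\t ]+$//mg;     # trim the end of every line
    $s =~ s/^[\t ]+//mg;     # trim the start of every line
    $s =~ s/[\t ]+/ /g;      # collapse inner runs of spaces/tabs
    $s =~ s/\n{2,}/\n/g;     # squeeze blank lines out of the middle
    $s =~ s/\A\n//;          # and a blank line at the very top
    return $s;
}

print tidy_whitespace("ab \tc\t\t \n\n  \t\nnext   line\n");
```

The order matters: trimming line ends first turns whitespace-only lines into truly empty ones, so the later newline squeeze can remove them.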

What is the regular Expression to uncomment a block of Perl code in Eclipse?

I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
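The same behaviour can be observed in Perl, which may help when testing a pattern outside Eclipse. In this sketch the function names are made up; the first shows that a repeated group keeps only its last capture, the second applies the per-line pattern ^#\s* globally.

```perl
use strict;
use warnings;

# Returns what the repeated group captured: only the final repetition survives.
sub last_capture {
    my ($block) = @_;
    return $block =~ /\A(?:#(.*)\n)+/ ? $1 : undef;
}

# Applies the per-line pattern ^#\s* to every line (/m for ^, /g for all lines).
sub uncomment_lines {
    my ($block) = @_;
    $block =~ s/^#\s*//mg;
    return $block;
}

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};
print "last capture:", last_capture($block), "\n";
print uncomment_lines($block);
```

Note that \s can also match a newline, so a line consisting of a lone # would be merged with the following line; [ \t]* is a safer choice if such lines may occur.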
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg; # /m: ^ matches at each line start; /g: every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg; # strip the leading # from every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the global g switch (and m, if the whole block is in one string, so that ^ matches at every line start).
s/^#(.+)/$1/gm
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]*)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]*) and matches any leading whitespace (spaces and tabs), including none.
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.
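For use outside Eclipse, the ideas from these answers can be combined in a short Perl sketch: keep the indentation, remove the # and one optional space after it, and leave a #! shebang line alone. The function name uncomment_block and the sample code are made up, and skipping "#!" is an assumption based on the shebang warning above.

```perl
use strict;
use warnings;

# Hypothetical helper: uncomment every line of a block, preserving indentation.
sub uncomment_block {
    my ($code) = @_;
    # /m makes ^ match at each line start; (?!!) skips "#!" shebang lines.
    $code =~ s/^([\t ]*)#(?!!) ?/$1/mg;
    return $code;
}

my $code = qq{#!/usr/bin/perl\n# print "yes";\n    # print "no";\n#print "blah";\n};
print uncomment_block($code);
```

As noted earlier, this only handles the simple case: it cannot tell a comment from a # that appears at the start of a line inside a string or heredoc.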