I want to write a perl script that removes double tabs, line breaks and white spaces.
What I have so far is:
$txt=~s/\r//gs;
$txt=~s/ +/ /gs;
$txt=~s/\t+/\t/gs;
$txt=~s/[\t\n]*\n/\n/gs;
$txt=~s/\n+/\n/gs;
But,
1. It's not beautiful. Should be possible to do that with far less regexps.
2. It just doesn't work and I really do not know why. It leaves some double tabs, white spaces and empty lines (i.e. lines with only a tab or whitespace)
I could solve it with a while, but that is very slow and ugly.
Any suggestions?
You've got a bit of a mish-mash of stuff in there, not all of which corresponds to what you said. Let's break down what you have and then perhaps you can work from there to what you want.
$txt=~s/\r//s; # removes a single \r from the line. Did you mean to use g on this one?
$txt=~s/[\t ]\n//s; # match a single \t OR space right before a \n, and remove.
$txt=~s/ +/ /gs;# match at least 2 spaces, replace with a single space
$txt=~s/\t+/ /gs;# match at least 2 \t, replace with a single space
$txt=~s/\n /\n/s;# remove a space immediately following a \n
$txt=~s/\t /\t/s;# remove a space immediately following a \t
$txt=~s/\n+/ /gs;# match at least 2 \n, replace them all with a single space
I have the feeling that's not at all what you want to accomplish.
I'm honestly unclear on what you want to do. The way I read your stated intent, I would have thought you'd want to replace all double tabs with single tabs, all double line breaks with single line breaks, and all double spaces with single spaces. I'll further surmise that you want to actually do runs of those characters, not just doubles. Here's the regexes for what I've just said, hopefully that will give you something to go on:
(I've also removed all \r).
$txt=~s/\r//gs;# remove all \r
$txt=~s/\t+/\t/gs;# replace all runs of > 1 tab with a single tab
$txt=~s/\n+/\n/gs;# replace all runs of > 1 \n with a single \n
$txt=~s/ +/ /gs;# replace all runs of > 1 space with a single space
Given that your attempted regexes don't seem to match the way I read your stated desire, I suspect that there's some fuzziness about what you really want to do here. You might want to think further about what you are trying to accomplish, which should help the regexes become clearer.
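As a hedged aside, if the goal really is just "collapse runs of the same character", the substitutions above can also be sketched with tr and its /s (squeeze) flag. The sub name squeeze_runs is only an illustrative choice, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every \r, then squeeze each run of the same
# character (tab, newline or space) down to a single occurrence.
sub squeeze_runs {
    my ($txt) = @_;
    $txt =~ tr/\r//d;       # delete all carriage returns
    $txt =~ tr/\t\n //s;    # empty replacement list + /s: squeeze repeats
    return $txt;
}

print squeeze_runs("a\t\tb  c\r\n\n\nd\n");
```

Note that this only squeezes runs of one and the same character; a mixed run such as space, tab, space is left alone, just as with the separate regexes.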
I am not sure of your exact requirements, but here are a few hints that might get you going:
To compress all white space to spaces (probably too powerful!)
$txt=~s/\s+/ /g ;
To remove any white space at start of line
$txt=~s/^ +//gm ;
To compress multiple tabs to a space
$txt=~s/\t+/ /g ;
As I try to work out a quick real answer for you, have you looked at the docs (and no I'm not just saying rtfm). perldoc is a great tool and has some useful info, may I suggest perldoc perlrequick and perldoc perlreref to get you going.
First of all, you might find it easier to split the long text into lines and operate on the lines separately and then join them again. Also if we make a new array to store the results to be joined we can easily exclude empty lines.
Finally, it strikes me that in operating on a long block of text, that text is likely to be external to your script. If you are really opening a file and globbing it into a variable, you could more easily do what I am leaving in as a comment block. To use this method comment the first block and remove the comment on the second block, the third block remains for either method. I include this because if you really are reading in a file then splitting it, it saves a lot of work to just read it in line by line. You could then write it out to another file if desired.
#!/usr/bin/env perl
use strict;
use warnings;
my @return_lines;
### Begin "text in script" Method ###
my $txt = <<END;
hello world
hello world
hello world
hello world
END
#note last two are to test removing spaces after tabs
my @lines = split(/\n/, $txt);
foreach my $line (@lines) {
### Begin "text in external file" Method (commented) ###
#my $filename = 'file.txt';
#open( my $filehandle, '<', $filename);
#while (<$filehandle>) {
# my $line = $_;
### Script continues for either input method ###
$line =~ s/^\s*//; #remove leading whitespace
$line =~ s/\s*$//; #remove trailing whitespace
$line =~ s/\ {2,}/ /g; #remove multiple literal spaces
$line =~ s/\t{2,}/\t/g; #remove excess tabs (is this what you meant?)
$line =~ s/(?<=\t)\ *//g; #remove any spaces after a tab
push @return_lines, $line unless $line=~/^\s*$/; #remove empty lines
}
my $return_txt = join("\n", @return_lines) . "\n";
print $return_txt;
This is a bit unclear.
If you have a line like ab TABcTABTAB \n\n, what do you want as a result? I am reading the above as ab c\n?
In other words, is it correct that you want:
All the whitespace (e.g. any amount of spaces and tabs) in the middle of the lines converted to a single space?
All the whitespace at the beginning OR end of the line removed (except for newlines)?
Remove completely empty lines?
$s =~ s/[\t ]+$//mg; # Remove ending spaces/tabs on every line
$s =~ s/^[\t ]+//mg; # Remove starting spaces/tabs on every line
$s =~ s/[\t ]+/ /g; # Replace duplicate whitespace mid-string with 1 space
$s =~ s/^\n//mg; # Remove completely empty lines
Please note that I used the /m modifier (read perldoc perlre for details) so that the ^ and $ anchors match at the start and end of every line within a multi-line string, and /g so that every occurrence is replaced rather than just the first.
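Putting the three rules listed above into one place, a minimal sketch could look like the following. The sub name tidy_whitespace and the sample string are assumptions made for illustration, and removing \r is an extra assumption about the input.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the three rules listed above.
sub tidy_whitespace {
    my ($s) = @_;
    $s =~ s/\r//g;          # assumption: Windows line endings are unwanted
    $s =~ s/^[ \t]+//mg;    # leading spaces/tabs on every line
    $s =~ s/[ \t]+$//mg;    # trailing spaces/tabs on every line
    $s =~ s/[ \t]+/ /g;     # any remaining run becomes one space
    $s =~ s/^\n+//mg;       # lines that are now empty
    return $s;
}

print tidy_whitespace("ab \tc\t\t \n\n  \t\nnext   line\n");
```

The order matters here: trimming first means a line holding only a tab or a space becomes empty and is then caught by the last substitution.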
Related
I have a Perl script which runs over a database dump in a plain text file, trying to remove all instances of newlines and possibly other odd characters when I see strings between quotes:
INSERT INTO ... VALUES ( "... these are the lines I'm interested in." )
I slurp in the file:
@file = <FILE>;
and:
foreach my $line (@file) {
$line =~ s/"[^"]*(\R)+[^"]*"//g;
# I want to get rid of newlines in strings
# And other odd characters I might come across
}
One character class I used instead of (\R) was:
([\r\n\t\v\f]+)
and I would try to:
$line =~ s/"[^"]+?([\r\n\t\v\f]+)[^"]*"//g;
I'm sure I'm missing something. I try to start matching with a literal double quote, scan past anything not a double quote (non-greedy, at least one match), reach the characters I want to get rid of, and keep scanning not double quote (any number of other characters not a double quote) until I reach the ending double quote.
So I wanted to replace $1 capture above with nothing.
I've tried on-line regex builders, and
/"[^"]*?([\r\n\t\f\v]+)[^"]*"/
worked with an on-line test, using a short paragraph with newlines and tabs in it, although it was in PHP pcre mode. I thought it would have worked with Perl.
Perhaps I'm not escaping some characters properly in the regex for Perl? Or the pattern is just not going to work the way I want it to, because it's wrong.
Thank you, any help appreciated.
The regex at regex101.com:
"[^"]*?([\r\n\f\t\v]+)[^"]*?"
matches for strings like this:
"This is
my\t test
string.
So there!"
I'm thoroughly puzzled now. :)
The real problem is that you will only find one group of \R's when there could be many groups between quotes. The best thing to do is make a callback (eval) with a general match between quotes, then substitute the \R's in the replacement.
something like:
sub repl {
my ($content) = @_;
$content =~ s/\R+//g;
return $content;
}
$input =~ s/"([^"]*)"/ '"' . repl($1) . '"' /ge; # re-add the quotes consumed by the match
edit: If you're looking for only 1 linebreak cluster, you have to
exclude linebreaks leading up to it. For example: [^"\r\n]+
edit2: To slurp the file into $input, do a
$/ = undef;
my $input = <$fh>;
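The callback idea and the slurp can be combined into one self-contained sketch. This is only an illustration under stated assumptions: the sub name strip_breaks_in_quotes, the sample INSERT statement and the in-memory filehandle are all made up here, and the code assumes the quoted strings contain no escaped double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: delete line-break clusters, but only inside "...".
sub strip_breaks_in_quotes {
    my ($input) = @_;
    $input =~ s{"([^"]*)"}{
        my $content = $1;         # copy before running another regex
        $content =~ s/\R+//g;     # \R needs Perl 5.10 or later
        qq{"$content"};           # put the surrounding quotes back
    }ge;
    return $input;
}

# Assumption: an in-memory filehandle stands in for the real dump file.
my $dump = qq{INSERT INTO t VALUES ( "one\r\ntwo\n\nthree" );\n};
open my $fh, '<', \$dump or die "Cannot open in-memory file: $!";
my $input = do { local $/; <$fh> };   # slurp; $/ is restored afterwards
close $fh;

print strip_breaks_in_quotes($input);
```

Using local $/ inside a do block keeps the change to the input record separator from leaking into the rest of the script, which a bare $/ = undef would do.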
To all Perl gurus! I have the following snippet of code, and there is a specific line that I am trying to understand. I have been reading around and managed to work out that it's a Perl regex, but I haven't been able to understand what each part is doing. Correct me if I am wrong in what I am about to say.
This particular part reads the EDID content from files which are in hex. I believe the previous guy was trying to take out any spaces and newlines, but I'm not completely sure.
for (my $int=1;$int<9;$int++){
my $line = <$info>;
$line =~ s/\r?\n$//;
chomp $line;
$line =~ s/\s+//g;
if ( $line eq "00000000000000000000000000000000" ){
print "bad EDID information in file $file --- all 0's\r\n";
close $info;
close $OUTFILE;
exit 1;
}
print $OUTFILE $line
}
now, this part is the one that throws me off.
$line =~ s/\r?\n$//;
What I want to understand is what s/ \r? \n $// is doing. I believe \n is a newline, but I'm not sure about the other parts. Any comment or help is always welcome.
While hwnd's answer is factually correct, it does not explain why this regex is there.
Windows and Unix (including OS X) use different ways to express the end of a line. That regex deletes both kinds ensuring it will work no matter which type of machine produced the file or which type is reading it.
Windows and many Internet protocols use carriage return (ASCII 015) and a line feed (ASCII 012); this comes from when computer displays were electric typewriters and had to be told to move the print head (the carriage) back to the first column (carriage return) and then advance a line (line feed). Unix uses just a line feed (ASCII 012). Carriage return in a regex is \r or \015. Line feed (aka newline) is \n or \012.
The $ is redundant because the newline will be at the end of the line anyway; it should probably be removed.
The call to chomp is redundant. chomp will remove a newline of the type native to the current operating system. On Unix it will remove \n and on Windows it will remove \r\n (it will actually remove the value of $/). However, if you're working with a Windows file on a Unix machine, or vice versa, it will not adapt to the type of file. The regex is safer.
$line =~ s/\s+//g; The /g makes it match as many times as possible removing all whitespace anywhere in the line. Since carriage return and newlines are whitespace, this makes both chomp and s/\r?\n$// redundant.
All three lines could be reduced to $line =~ s{\s+}{}g.
In case you do not already know, s/// is the substitution operator.
The pattern matches an optional carriage return followed by a newline sequence and the end of the string.
\r? # '\r' (carriage return) (optional)
\n # '\n' (newline)
$ # before an optional \n, and the end of the string
Your predecessor has written the equivalent of a chomp that is intended to work on both Windows and Linux text files. The former has CR LF line endings "\r\n" and the latter has just LF "\n".
A better way to write this, assuming you're not interested in trailing tabs or spaces, would be s/\s+$//, since both CR and LF are "whitespace".
Better still, if you can guarantee that you are running on version 10 or later of Perl 5 (put use 5.010 at the top of the program) would be s/\s+\z//.
Or, if you want to retain trailing spaces but remove the line terminator(s), s/[\r\n]+\z// will do that for you, and will also cope with old-fashioned Mac text files, which have just CR at the end.
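To make the last suggestion concrete, here is a small sketch that runs s/[\r\n]+\z// over the three line-ending styles mentioned. The sub name strip_eol and the sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: remove only the line terminator, whatever its style.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

# Windows (CR LF), Unix (LF) and old Mac (CR) endings, each after a space.
for my $raw ("00FF \r\n", "00FF \n", "00FF \r") {
    my $line = strip_eol($raw);
    print "[$line]\n";    # the trailing space is preserved
}
```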
I'm creating a method to turn page titles into strings suitable for URL rewriting.
Example: "Latest news", would be "latest-news"
The problem is the page titles are out of my control and some are similar to the following:
Football & Rugby News!. Ideally this would become football-rugby-news.
I've done some work to get this to football-&-rugby-news!
Is there a possible regex to identify unwanted characters in there and the extra '-' ?
Basically, I need numbers and letters separated by a single '-'.
I only have basic knowledge of regex, and the best I could come up with was:
[^a-z0-9-]
I'm not sure if I'm being clear enough here.
Try a 'replace all' with something like this.
[^a-zA-Z0-9\\-]+
Replace the matches with a dash.
Alternative regex:
[^a-zA-Z0-9]+
This one will avoid multiple dashes if a dash itself is found near other unwanted characters.
This Perl script also does what you're looking for. Of course you'd have to feed it the string by some other means than just hardcoding it; I merely put it in there for the example.
#!/usr/bin/perl
use strict;
use warnings;
my $string = "Football & Rugby News!";
$string = lc($string); # lowercase
my $allowed = 'a-z0-9\s-'; # all permitted characters; a plain string, because a qr// object would add its own (?^:...) wrapper inside the character class
$string =~ s/[^$allowed]//g; # remove all characters that are NOT in $allowed
$string =~ s/\s+/-/g; # replace all kinds of whitespace with '-'
print "$string\n";
prints
football-rugby-news
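For comparison, the "replace every run of unwanted characters with one dash" idea from the earlier answer can be sketched as a small function. The name slugify is hypothetical, and trimming the dashes left at either end is an extra step assumed here to get the result the question asks for.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any module.
sub slugify {
    my ($title) = @_;
    my $slug = lc $title;
    $slug =~ s/[^a-z0-9]+/-/g;    # every run of other characters becomes one dash
    $slug =~ s/^-+|-+$//g;        # trim dashes left at either end
    return $slug;
}

print slugify("Football & Rugby News!"), "\n";
```

Because the whole run " & " is replaced in one go, no doubled dashes appear and no separate cleanup pass for "--" is needed.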
I'm trying to open a file, match a particular line, and then wrap HTML tags around that line. Seems terribly simple but apparently I'm missing something and don't understand the Perl matched pattern variables correctly.
I'm matching the line with this:
$line =~ m/(Number of items:.*)/i;
Which puts the entire line into $1. I try to then print out my new line like this:
print "<p>" . $1 . "<\/p>";
I expect it to print this:
<p>Number of items: 22</p>
However, I'm actually getting this:
</p>umber of items: 22
I've tried all kinds of variations - printing each bit on a separate line, setting $1 to a new variable, using $+ and $&, etc. and I always get the same result.
What am I missing?
You have an \r in your match, which when printed results in the malformed output.
edit:
To explain further, chances are your file has windows style \r\n line endings. chomp won't remove the \r, which will then get slurped into your greedy match, and results in the unpleasant output (\r means go back to the start of the line and continue printing).
You can remove the \r by adding something like
$line =~ tr/\015//d;
Can you provide a complete code snippet that demonstrates your problem? I'm not seeing it.
One thing to be cautious of is that $1 and friends refer to captures from the last successful match in that dynamic scope. You should always verify that a match succeeds before using one:
$line = "Foo Number of items: 97\n";
if ( $line =~ m/(Number of items:.*)/i ) {
print "<p>" . $1 . "<\/p>\n";
}
You've just learned (for future reference) how dangerous .* can be.
Having banged my head against similar unpleasantnesses, these days I like to be as precise as I can about what I expect to capture. Maybe
$line =~ m/(Number of items:\s+\d+)/;
Then I'm sure of not capturing the offending control character in the first place. Whatever Cygwin may be doing with Windows files, I can remain blissfully ignorant.
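The difference between the greedy and the precise pattern can be seen in a short sketch. The sample line is an assumption standing in for one read from a file with Windows-style line endings.

```perl
use strict;
use warnings;

# Assumption: a line as read from a file with \r\n line endings.
my $line = "Number of items: 22\r\n";
chomp $line;    # with the default $/ this removes "\n" only; "\r" remains

my ($greedy)  = $line =~ /(Number of items:.*)/i;
my ($precise) = $line =~ /(Number of items:\s+\d+)/i;

# Make the invisible carriage return visible before printing.
(my $shown = $greedy) =~ s/\r/\\r/g;
print "greedy:  <p>$shown</p>\n";
print "precise: <p>$precise</p>\n";
```

The dot matches everything except a newline, so the stray \r ends up inside the greedy capture, while \s+\d+ stops at the last digit.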
I need a regular expression to uncomment a block of Perl code, commented with # in each line.
As of now, my find expression in the Eclipse IDE is (^#(.*$\R)+) which matches the commented block, but if I give $2 as the replace expression, it only prints the last matched line. How do I remove the # while replacing?
For example, I need to convert:
# print "yes";
# print "no";
# print "blah";
to
print "yes";
print "no";
print "blah";
In most flavors, when a capturing group is repeated, only the last capture is kept. Your original pattern uses + repetition to match multiple lines of comments, but group 2 can only keep what was captured in the last match from the last line. This is the source of your problem.
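Perl behaves the same way, so the effect can be reproduced outside Eclipse. The following is a sketch only: \n stands in for \R, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $block = qq{# print "yes";\n# print "no";\n# print "blah";\n};

# Same shape as the Eclipse pattern: one "#", then a repeated inner group.
my ($whole, $last);
if ($block =~ /(^#(.*\n)+)/m) {
    ($whole, $last) = ($1, $2);
}

print "group 1 spans ", ($whole =~ tr/\n//), " lines\n";
print "group 2 holds: $last";
```

Group 1 covers the whole commented block, but group 2 was overwritten on each repetition and only holds the text of the final line.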
To fix this, you can remove the outer repetition, so you match and replace one line at a time. Perhaps the simplest pattern to do this is to match:
^#\s*
And replace with the empty string.
Since this performs match and replacement one line at a time, you must repeat it as many times as necessary (in some flavors, you can use the g global flag, in e.g. Java there are replaceFirst/All pair of methods instead).
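In Perl, the one-line-at-a-time replacement might be sketched as below. The sub name uncomment is hypothetical, and [ \t]? is used instead of \s* as a deliberate assumption: \s also matches newlines, so on a line holding only "#" it would swallow the line break.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one leading "#" and at most one following
# space or tab from every line.
sub uncomment {
    my ($code) = @_;
    $code =~ s/^#[ \t]?//mg;    # /m: ^ matches at each line start; /g: all lines
    return $code;
}

print uncomment(qq{# print "yes";\n# print "no";\n# print "blah";\n});
```

A sketch like this treats every line starting with # as a comment, so a #! line at the top of a script would be stripped as well.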
References
regular-expressions.info/Repeating a Captured Group vs Capturing a Repeated Group
Related questions
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
.NET regex keeps all repeated matches
Special note on Eclipse keyboard shortcuts
In Java mode, Eclipse already has keyboard shortcuts to add/remove/toggle block comments. By default, Ctrl+/ binds to the "Toggle comment" action. You can highlight multiple lines, and hit Ctrl+/ to toggle block comments (i.e. //) on and off.
You can hit Ctrl+Shift+L to see a list of all keyboard shortcuts. There may be one in Perl mode to toggle Perl block comments #.
Related questions
What is your favorite hot-key in Eclipse?
Hidden features of Eclipse
Search with ^#(.*$) and replace with $1
You can try this one: -
use strict;
use warnings;
my $data = "#Hello\n#stack\n#overflow\n";
$data =~ s/^#//mg; # /m lets ^ match at the start of every line
print $data;
OUTPUT:-
Hello
stack
overflow
Or
open(IN, '<', "test.pl") or die $!;
read(IN, my $data, -s "test.pl"); #reading a file
$data =~ s/^#//mg; # /m lets ^ match at the start of every line
open(OUT, '>', "test1.pl") or die $!;
print OUT $data; #Writing a file
close OUT;
close IN;
Note: Take care of #!/usr/bin/perl in the Perl script, it will uncomment it also.
You need the GLOBAL g switch.
s/^#(.+)/$1/g
In order to determine whether a perl '#' is a comment or something else, you have to compile the perl and build a parse tree, because of Schwartz's Snippet
whatever / 25 ; # / ; die "this dies!";
Whether that '#' is a comment or part of a regex depends on whether whatever() is nullary, which depends on the parse tree.
For the simple cases, however, yours is failing because (^#(.*$\R)+) repeats a capturing group, which is not what you wanted.
But anyway, if you want to handle simple cases, I don't even like the regex that everyone else is using, because it fails if there is whitespace before the # on the line. What about
^\s*#(.*)$
? This will match any line that begins with a comment (optionally with whitespace, e.g., for indented blocks).
Try this regex:
(^[\t ]+)(\#)(.*)
With this replacement:
$1$3
Group 1 is (^[\t ]+) and matches all leading whitespace (spaces and tabs).
Group 2 is (#) and matches one # character.
Group 3 is (.*) and matches the rest of the line.