How to delete a selected duplicated line from a file using a Perl script - regex

Let's say I have a file a.txt with the following content:
aa
aa
bb
cc
dd
ee
ff
ee
gg
I want the following output:
aa
aa
bb
cc
dd
ee
ff
gg
Note that I want to delete duplicates of one particular line only: ee.
How can I do that? I tried the following one-liner:
perl -ne 'print unless $a{$_}++' a.txt
but it is deleting all duplicated lines.

To remove duplicates of only specific ("target") lines, add that condition:
perl -lne'$tgt = "ee"; print unless $h{$_}++ and $_ eq $tgt' file
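For example, run against the a.txt above it keeps both aa lines and drops only the repeated ee:
$ perl -lne'$tgt = "ee"; print unless $h{$_}++ and $_ eq $tgt' a.txt
aa
aa
bb
cc
dd
ee
ff
gg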
If there may be multiple such targets, then check whether the current line matches any one of them. A nice tool for that is any from List::Util.
perl -MList::Util=any -lne'
$line = $_;
@tgt = qw(ee aa);
print unless $h{$line}++ and any { $_ eq $line } @tgt
' file
Target(s) can be read from the command line as arguments, if there is a benefit to not have them hardcoded.
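For example, a sketch of that variant, assuming the targets are passed on the command line before the file name (the BEGIN/splice argument handling here is just one way to do it, not part of the original answer):
perl -MList::Util=any -lne'
    BEGIN { @tgt = splice @ARGV, 0, @ARGV - 1 }   # every argument except the last one is a target
    $line = $_;
    print unless $h{$line}++ and any { $_ eq $line } @tgt
' ee aa file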
Note: In older versions any is in List::MoreUtils, and not in List::Util.

Related

In Perl, how to match two consecutive Carriage Returns?

Hi StackOverflow buddies,
I'm on the Windows platform; I have a data file, but something went wrong and (I don't know why) all combinations of "Carriage Return + New Line" became "Carriage Return + Carriage Return + New Line". (190128 edit:) For example:
(screenshots showing the file viewed as plain text and the same file viewed in hex mode were attached here)
For practical purposes I need to remove the extra "0D" in double "0D"s like ".... 30 30 0D 0D 0A 30 30 ....", and change it to ".... 30 30 0D 0A 30 30 ....".
190129 edit: Besides, to ensure that my problem can be reproduced, I uploaded my data file to GitHub (you should download & unzip it before using; in a binary / hex editor you can see 0D 0D 0A in the first line): https://github.com/katyusza/hello_world/blob/master/ram_init.zip
I used the following Perl script to remove the extra Carriage Return, but to my astonishment my regex just does NOT work!! My entire code is (190129 edit: pasted the entire Perl script here):
use warnings ;
use strict ;
use File::Basename ;
#-----------------------------------------------------------
# command line handling, file open \ create
#-----------------------------------------------------------
# Capture input filename from command line:
my $input_fn = $ARGV[0] or
die "Should provide input file name at command line!\n";
# Parse input file name, and generate output file name:
my ($iname, $ipath, $isuffix) = fileparse($input_fn, qr/\.[^.]*/);
my $output_fn = $iname."_pruneNonPrintable".$isuffix;
# Open input file:
open (my $FIN, "<", $input_fn) or die "Open file error $!\n";
# Create output file:
open (my $FO, ">", $output_fn) or die "Create file error $!\n";
#-----------------------------------------------------------
# Read input file, search & replace, write to output
#-----------------------------------------------------------
# Read all lines in one go:
$/ = undef;
# Read entire file into variable:
my $prune_txt = <$FIN> ;
# Do match & replace:
$prune_txt =~ s/\x0D\x0D/\x0D/g; # do NOT work.
# $prune_txt =~ s/\x0d\x0d/\x30/g; # do NOT work.
# $prune_txt =~ s/\x30\x0d/\x0d/g; # can work.
# $prune_txt =~ s/\x0d\x0d\x0a/\x0d\x0a/gs; # do NOT work.
# Print end time of processing:
print $FO $prune_txt ;
# Close files:
close($FIN) ;
close($FO) ;
I did everything I could to match two consecutive Carriage Returns, but failed. Can anyone please point out my mistake, or tell me the right way to go? Thanks in advance!
On Windows, file handles have a :crlf layer given to them by default.
This layer converts CR LF to LF on read.
This layer converts LF to CR LF on write.
Solution 1: Compensate for the :crlf layer.
You'd use this solution if you want to end up with system-appropriate line endings.
# ... read ... # CR CR LF ⇒ CR LF
s/\r+\n/\n/g; # CR LF ⇒ LF
# ... write ... # LF ⇒ CR LF
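In terms of the script from the question, Solution 1 keeps the open calls as they are and changes only the substitution; a sketch of the relevant lines:
# the default :crlf layer has already turned CR CR LF into CR LF on read
$prune_txt =~ s/\r+\n/\n/g;   # collapse the remaining CR before each newline
# on write, the :crlf layer turns LF back into CR LF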
Solution 2: Remove the :crlf layer.
You'd use this solution if you want to end up with CR LF unconditionally.
Use <:raw and >:raw instead of < and > as the mode.
# ... read ... # CR CR LF ⇒ CR CR LF
s/\r*\n/\r\n/g; # CR CR LF ⇒ CR LF
# ... write ... # CR LF ⇒ CR LF
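Applied to the script from the question, Solution 2 would look roughly like this (a sketch; only the open modes and the substitution change, the file-name handling stays the same):
open (my $FIN, "<:raw", $input_fn) or die "Open file error $!\n";
open (my $FO, ">:raw", $output_fn) or die "Create file error $!\n";
$/ = undef;                       # slurp the whole file
my $prune_txt = <$FIN>;
$prune_txt =~ s/\r*\n/\r\n/g;     # CR CR LF (or a bare LF) => CR LF
print $FO $prune_txt;
close($FIN);
close($FO);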
The first of your regexes appears to work fine for me, which means that there may be an issue in some other piece of code. Please provide a Minimal, Complete, and Verifiable Example, which means including sample input data and so on.
$ perl -wMstrict -e 'print "Foo\r\r\nBar\r\r\n"' >test.txt
$ hexdump -C test.txt
00000000 46 6f 6f 0d 0d 0a 42 61 72 0d 0d 0a |Foo...Bar...|
0000000c
$ cat test.pl
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dump;
my $filename = 'test.txt';
open my $fh, '<:raw:encoding(ASCII)', $filename or die "$filename: $!";
my $prune_txt = do { local $/; <$fh> }; # slurp file
close $fh;
dd $prune_txt;
$prune_txt =~ s/\x0D\x0D/\x0D/g;
dd $prune_txt;
$ perl test.pl
"Foo\r\r\nBar\r\r\n"
"Foo\r\nBar\r\n"
By the way, it's not immediately obvious to me which encoding your file is using. In the above example, you may need to adjust the :encoding(...) layer appropriately.

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
And I need to extract the words and numbers that appear 2-4 times, i.e. {2,4}.
I've tried many regex lines and even regex101.
I can't really put my finger on what's not working.
This is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
Native grep doesn't support the \w and {} notations. You have to use extended regular expressions.
Use the -E option:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use the -w option to match words, so that it matches entire words instead of partial matches.
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
Note
You can eliminate the use of an unnecessary cat by providing the file as input to grep instead.
You were very close; within the character class notation [], the special notation \w is treated literally. Take it out of []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It will scan the whole file and print the words whose number of occurrences (in the whole file) falls in the range [2,4].
Output is:
uu
ab
88
1
Using AWK, this solution counts the word instances per line, not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
delete clears the array for each new line. The fields are used as array indexes (hash keys) and each one's value is incremented by one. Indexes (fields) with counts between 2 and 4 inclusive are printed.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
@_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for @_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w with word boundaries around) into the @_ array. Then it counts the number of times each word occurred in the file and stores the result in the %h hash. The final block prints only the items that occurred 2, 3, or 4 times, no more and no less.
Note: in Perl you should always use strict; and use warnings; in order to detect issues at an early stage.
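For reference, a sketch of the same script with those pragmas enabled (lexical variables replace @_ and %h; the logic is unchanged):
use strict;
use warnings;

$/ = undef;                          # slurp mode
my $text = <>;
my @words = $text =~ /\b(\w+)\b/g;   # collect all the words
my %count;
$count{$_}++ for @words;             # count occurrences per word
for my $word (keys %count) {
    print "$word\n" if $count{$word} >= 2 and $count{$word} <= 4;
}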

Vim formula to keep only the first two words in each line

How do I keep only the first two words (removing all the rest) in each line in Vim?
E.g.
AA AA subst:pl:acc:m3+subst:pl:acc:n2
aa ad acta brev:npun
into
AA AA
aa ad
You can use my beloved :normal command:
:%norm EElD
or AWK:
:%!awk '{print $1, $2}'
or a substitution, if the real file doesn't look like the sample provided:
:%s/\S\+\s\+\S\+\zs.*
Applied to all lines:
:%s/\v^(\s*\S+\s+\S+).+/\1/
How about using a macro:
<ESC>gg
qa2wDj0
q
and repeating it with
@a
You can specify a count by saying
n@a

In Perl, how can I use the regex substitution operator to replace non-ASCII characters in a substring?

How can I use this command:
perl -pi -e 's/[^[:ascii:]]/#/g' file
so that it changes only the characters from offset A to offset B of each line?
As an alternative to rubber boots' answer, you can operate on a substring instead of the whole string to begin with:
perl -pi -e 'substr($_, 5, 5) =~ s/[^[:ascii:]]/#/g' file
To illustrate:
perl -e 'print "\xff" x 16' | \
perl -p -e 'substr($_, 5, 5) =~ s/[^[:ascii:]]/#/g' | \
hd
will print
ff ff ff ff ff 23 23 23 23 23 ff ff ff ff ff ff
In this code, the first offset is 0-based, and you have to use the length instead of the second offset, so it will be
substr($_, A-1, B-A).
With the reservation that I may not have understood your question correctly: if the offsets A and B are 5 and 10, then it should be like this:
perl -pi -e 's/(?<=.{5})(?<!.{10})[^[:ascii:]]/#/g' file
Explanation:
[^[:ascii:]] <- the character which is looked for
(?<=.{5}) <- if at least 5 chars were before (offset 5)
(?<!.{10}) <- but no more than 10 characters before (offset 10)
The constructs:
(?<= ...) and (?<! ...)
are called positive and negative lookbehinds, which are zero-width assertions.
(You can google them, see section Look-Around Assertions in perlre)
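As a quick check, borrowing the all-0xFF test input from the substr-based answer above, the lookbehind version should produce the same bytes:
perl -e 'print "\xff" x 16' | \
perl -p -e 's/(?<=.{5})(?<!.{10})[^[:ascii:]]/#/g' | \
hd
should print
ff ff ff ff ff 23 23 23 23 23 ff ff ff ff ff ff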
Addendum 1
You mentioned substr() in your title, which I overlooked at first. This would work, of course, too:
perl -pi -e 'substr($_,5,10)=~s/[^[:ascii:]]/#/g' file
The description of substr EXPR,OFFSET,LENGTH can be found in perldoc.
This example nicely illustrates the use of substr() as a left-value.
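As a small standalone illustration of substr() as an lvalue (an example of my own, not from the question):
my $s = "hello world";
substr($s, 0, 5) = "HELLO";   # assigning to substr() modifies $s in place
print "$s\n";                 # prints "HELLO world"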
Addendum 2
While updating this post, Grrrr added the same solution as an answer, but his came first by a minute! (so he deserves the bounty)
Regards
rbo

Regex to remove lines in file(s) that end with the same or given letters

I need a bash script for Mac OS X working in this way:
./script.sh * folder/to/files/
#
# or #
#
./script.sh xx folder/to/files/
This script should:
read a list of files
open each file and read each line
if a line ends with the same letters ('*' mode) or with the custom letters ('xx'), then
remove that line and re-save the file
back up the original file
My first approach to do this:
#!/bin/bash
# ck init params
if [ $# -le 0 ]
then
echo "Usage: $0 <letters>"
exit 0
fi
# list files in current dir
list=`ls BRUTE*`
for i in $list
do
# prepare regex
case $1 in
"*") REGEXP="^.*(.)\1+$";;
*) REGEXP="^.*[$1]$";;
esac
FILE=$i
# backup file
cp $FILE $FILE.bak
# removing line with same letters
sed -Ee "s/$REGEXP//g" -i '' $FILE
cat $FILE | grep -v "^$"
done
exit 0
But it doesn't work as I want.
What's wrong?
How can I fix this script?
Example:
$cat BRUTE02.dat BRUTE03.dat
aa
ab
ac
ad
ee
ef
ff
hhh
$
If I use '*' I want every file cleaned of lines that end with the same (repeated) letters.
If I use 'ff' I want every file cleaned of lines that end with 'ff'.
Ah, it's on Mac OS X. Remember that its sed is a little different from classical Linux sed.
man sed
sed [-Ealn] command [file ...]
sed [-Ealn] [-e command] [-f command_file] [-i extension] [file ...]
DESCRIPTION
The sed utility reads the specified files, or the standard input if no files are specified, modifying the input as specified by a list of commands. The input is then written to the standard output.
A single command may be specified as the first argument to sed. Multiple commands may be specified by using the -e or -f options. All commands are applied to the input in the order they are specified regardless of their origin.
The following options are available:
-E      Interpret regular expressions as extended (modern) regular expressions rather than basic regular expressions (BRE's). The re_format(7) manual page fully describes both formats.
-a      The files listed as parameters for the ``w'' functions are created (or truncated) before any processing begins, by default. The -a option causes sed to delay opening each file until a command containing the related ``w'' function is applied to a line of input.
-e command
        Append the editing commands specified by the command argument to the list of commands.
-f command_file
        Append the editing commands found in the file command_file to the list of commands. The editing commands should each be listed on a separate line.
-i extension
        Edit files in-place, saving backups with the specified extension. If a zero-length extension is given, no backup will be saved. It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.
-l      Make output line buffered.
-n      By default, each line of input is echoed to the standard output after all of the commands have been applied to it. The -n option suppresses this behavior.
The form of a sed command is as follows:
[address[,address]]function[arguments]
Whitespace may be inserted before the first address and the function portions of the command.
Normally, sed cyclically copies a line of input, not including its terminating newline character, into a pattern space (unless there is something left after a ``D'' function), applies all of the commands with addresses that select that pattern space, copies the pattern space to the standard output, appending a newline, and deletes the pattern space.
Some of the functions use a hold space to save all or part of the pattern space for subsequent retrieval.
Anything else?
Is my problem clear?
Thanks.
I don't know the bash shell too well, so I can't evaluate what the failure is.
This is just an observation of the regex as I understood it (this may be wrong).
The '*' mode regex looks OK:
^.*(.)\1+$ matches lines that end with the same letter repeated.
But the literal mode might not do what you think.
current: ^.*[$1]$ matches lines that end with any one of the characters in $1
This shouldn't use a character class.
Change it to: ^.*$1$
Realize, though, that the string in $1 (before it goes into the regex) should be escaped
in case there are any regex metacharacters contained within it.
Otherwise, do you intend to have a character class?
perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
'
Example:
echo "aa
ab
ac
ad
ee
ef
ff" | perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
' '*'
produces
ab
ac
ad
ef
A possible issue:
When you put * on the command line, the shell replaces it with the name of all the files in your directory. Your $1 will never equal *.
And some tips:
You can replace
This:
# list files in current dir
list=`ls BRUTE*`
for i in $list
With:
for i in BRUTE*
And:
This:
cat $FILE | grep -v "^$"
With:
grep -v "^$" $FILE
Besides the possible issue, I can't see anything jumping out at me. What do you mean by "clean"? Can you give an example of what a file should look like before and after, and what the command would look like?
This was the problem!
grep '\(.\)\1[^\r\n]$' *
On Mac OS X, ( ), { }, etc. must be quoted (escaped with a backslash)!
Solved, thanks.