Removing both duplicates (not just the repeated) from a text file? - regex

By this I mean, erase all rows in a text file that are repeated, NOT just the duplicates. I mean both the row that is duplicated and the duplicated row. This would leave me only with the list of rows that weren't repeated. Perhaps a regular expression could do this in notepad++? But which one? Any other methods?

If you're on a unix-like system, you can use the uniq command.
ezra@ubuntu:~$ cat test.file
ezra
ezra
john
user
ezra@ubuntu:~$ uniq -u test.file
john
user
Note that uniq only compares adjacent rows, so the duplicates must be next to each other. You'll have to sort the file first if they're not.
ezra@ubuntu:~$ cat test.file
ezra
john
ezra
user
ezra@ubuntu:~$ uniq -u test.file
ezra
john
ezra
user
ezra@ubuntu:~$ sort test.file | uniq -u
john
user
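If sorting is not acceptable because the original row order matters, a two-pass awk idiom is one possible alternative. This is only a sketch: the function name drop_repeated and the sample file are made up for illustration, and the file is read twice, so it has to be a regular file rather than a pipe.

```shell
# Keep only the lines that occur exactly once, in their original order.
# Pass 1 (NR == FNR) counts every line; pass 2 prints lines whose count is 1.
drop_repeated() {
    awk 'NR == FNR { count[$0]++; next } count[$0] == 1' "$1" "$1"
}

# Hypothetical sample input, same rows as the unsorted example above.
sample=$(mktemp)
printf 'ezra\njohn\nezra\nuser\n' > "$sample"
drop_repeated "$sample"    # prints john, then user
rm -f "$sample"
```

Unlike sort | uniq -u, this leaves the surviving rows in the order they had in the file.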

If you have access to a regex engine that supports PCRE-style syntax, this is straightforward:
s/(?:^|(?<=\n))(.*)\n(?:\1(?:\n|$))+//g
(?:^|(?<=\n)) # Behind us is beginning of string or newline
(.*)\n # Capture group 1: all characters up until next newline
(?: # Start non-capture group
\1 # backreference to what was captured in group 1
(?:\n|$) # a newline or end of string
)+ # End non-capture group, do this 1 or more times
Context is a single string
use strict; use warnings;
my $str =
'hello
this is
this is
this is
that is';
$str =~ s/
(?:^|(?<=\n))
(.*)\n
(?:
\1
(?:\n|$)
)+
//xg;
print "'$str'\n";
__END__
output:
'hello
that is'

Related

Lazy Grep -P: How to show only the 1st match from each line

I need to print only the 1st match from each line.
My file contains text something like this:
cat t.txt
abcsuahrcb
abscuharcb
bsaucharcb
absuhcrcab
Here is the command I am trying with:
cat t.txt | grep -oP 'a.*?c'
It gives:
abc
ahrc
absc
arc
auc
arc
absuhc
I need it to return:
abc
absc
auc
absuhc
These are the 1st possible matches from each line.
Any other alternatives like sed and awk will work, but not something which needs to be installed on Ubuntu.
Perl to the rescue:
perl -lne 'print $1 if /(a.*?c)/' t.txt
-n reads the input line by line, running the code for each;
-l removes newlines from input lines and adds them to output;
The code tries to match a.*?c, if matched, it stores the result in $1;
As there's no loop, only one match per line is attempted.
A sed variation on The fourth bird's answer:
$ sed -En 's/^[^a]*(a[^c]*c).*/\1/p' t.txt
abc
absc
auc
absuhc
Where:
-En - enable extended regex support, suppress automatic printing of pattern space
^[^a]* - from start of line match all follow-on characters that are not a
(a[^c]*c) - (1st capture group) match letter a plus all follow-on characters that are not c followed by a c
.* - match rest of line
\1/p - print contents of 1st capture group
One awk idea:
$ awk 'match($0,/a[^c]*c/) { print substr($0,RSTART,RLENGTH)}' t.txt
abc
absc
auc
absuhc
Where:
if we find a match then the match() call is non-zero (ie, 'true') so ...
print the substring defined by the RSTART/RLENGTH variables (which are auto-populated by a successful match() call)
Using grep you could write the pattern as matching from the first a to the first c using a negated character class.
Using -P for Perl-compatible regular expressions, you can make use of \K to forget what is matched so far.
Note that you don't have to use cat but you can add the filename at the end.
grep -oP '^[^a]*\Ka[^c]*c' t.txt
The pattern matches:
^ Start of string
[^a]* Optionally match any char except a
\K Forget what is matched so far
a Match literally
[^c]* Optionally match any char except c
c Match literally
Output
abc
absc
auc
absuhc
Another option with gnu-awk and the same pattern, only now using and printing the capture group 1 value:
awk 'match($0,/^[^a]*(a[^c]*c)/, a) { print a[1]}' t.txt

How to grep/perl/awk overlapping regex

Trying to pipe a string into a grep/perl regex to pull out overlapping matches. Currently, the results only appear to pull out sequential matches without any "lookback":
Attempt using egrep (both on GNU and BSD):
$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using perl style grep (-P):
$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using awk showing only the first match:
$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary
The overlapping results I'd like to see from a simple working bash pipe command are:
bob mary
mary mike
mike bill
bill kim
kim jim
jim john
Any ideas?
Lookahead is your friend here
echo "bob mary mike bill kim jim john" |
perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.
So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.
It is nice though that we can also capture that pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.
Then, because of the /g modifier, the engine moves on to find another word+spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and the next one is "looked" for (and captured). Etc.
See Lookahead and lookbehind assertions in perlretut
Use the Perl one-liners below, which avoid the lookahead (which can still be your friend):
For whitespace-delimited words:
echo "bob mary mike bill kim jim john" | perl -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
For words defined as \w+ in Perl, delimited by the non-word characters \W+:
echo "bob.mary,mike'bill kim jim john" | perl -F'/\W+/' -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\W+/' : Split into @F on \W+ (one or more non-word characters), rather than on whitespace.
$#F : the last index of the array @F, into which the input line is split.
0..($#F-1) : the range of indexes (numbers), from the first (0) to the penultimate ($#F-1) index of the array @F.
$F[$_] and $F[$_+1]: two consecutive elements of the array @F, with indexes $_ and $_+1, respectively.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
You can also use awk
awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<< 'bob mary mike bill kim jim john'
This solution iterates over all whitespace-separated fields except the last and prints the current field ($i) + the output field separator (a space here) + the subsequent field value ($(i+1)).
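The same field loop extends to windows wider than two words. The sketch below is not part of the original answer: the function name ngrams and its window-size argument are hypothetical, and n=2 reproduces the pairs above.

```shell
# Print every run of n consecutive whitespace-separated fields on each line.
# $1 is the window size (hypothetical parameter); input is read from stdin.
ngrams() {
    awk -v n="$1" '{
        for (i = 1; i + n - 1 <= NF; i++) {
            out = $i
            for (j = 1; j < n; j++) out = out " " $(i + j)
            print out
        }
    }'
}

echo 'bob mary mike bill' | ngrams 2    # bob mary, mary mike, mike bill
echo 'bob mary mike bill' | ngrams 3    # bob mary mike, mary mike bill
```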
Or, another perl solution that uses a very common technique to capture the overlapping pattern inside a positive lookahead:
perl -lane 'while (/(?=\b(\p{L}+\s+\p{L}+))/g) {print $1}' <<< 'bob mary mike bill kim jim john'
Details:
(?= - start of a positive lookahead
\b - a word boundary
(\p{L}+\s+\p{L}+) - capturing group 1: one or more letters, one or more whitespaces, one or more letters
) - end of the lookahead.
Here, only Group 1 values are printed ({print $1}).
Performance consideration
Of the Perl solutions here, mine turns out to be the slowest and Timur's the fastest; the awk solution, however, is faster than any of the Perl solutions. Results:
# ./wiktor_awk.sh
real 0m17.069s
user 0m12.264s
sys 0m5.314s
# ./timur_perl.sh
real 0m18.201s
user 0m15.612s
sys 0m6.139s
# ./zdim.sh
real 0m23.559s
user 0m19.883s
sys 0m7.359s
# ./wiktor_perl.sh
real 2m12.528s
user 1m52.857s
sys 0m20.201s
Note I created *.sh files for each solution like
#!/bin/bash
N=10000
time(
for i in $(seq 1 $N); do
<SOLUTION_HERE> &>/dev/null;
done)
and ran for f in *.sh; do chmod +x "$f"; done to make them executable.

Search with grep only lines that start with #

Before I get my ass kicked, I want you to know that I checked several documents on "grep" and I couldn't find what I'm looking for, or maybe my English is too limited to get the idea.
I have a lot of markdown documents. Each document contains a first level heading (#) which is always on line 1.
I can search for ^# and that works, but how can I tell grep to look for certain words on the line that starts with #?
I want this
grep 'some words' file.markdown
But also specify that the line starts with a #.
You may use
grep '^# \([^ ].*\)\{0,1\}some words' file.markdown
Or, using ERE syntax
grep -E '^# ([^ ].*)?some words' file.markdown
Details
^ - start of a line
# - a # char
\([^ ].*\)\{0,1\} - an optional sequence of patterns (a \(...\) is a capturing group in BRE syntax, in ERE, it is (...)) (\{0,1\} is an interval quantifier that repeats the pattern it modifies 1 or 0 times):
[^ ] - any char but a space
.* - any 0+ chars
some words - some words text.
See an online grep demo:
s="# Get me some words here
#some words here I don't want
# some words here I need"
grep '^# \([^ ].*\)\{0,1\}some words' <<< "$s"
# => # Get me some words here
# # some words here I need
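If a single pattern feels hard to read, chaining two greps is another possible approach: the first keeps only heading lines, the second looks for the words. This is a sketch with a made-up function name (heading_grep), and it is slightly looser than the pattern above, since it also accepts extra spaces after the #.

```shell
# Stage 1 keeps lines starting with "# "; stage 2 searches those lines for $1.
# Input is read from stdin.
heading_grep() {
    grep '^# ' | grep -e "$1"
}

s="# Get me some words here
#some words here I don't want
# some words here I need"
printf '%s\n' "$s" | heading_grep 'some words'
```

On the sample text this prints the first and third lines, the same two lines as the single-pattern command.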

I am using sed with a regex in Linux but not getting the expected answer

I have a file in which the content looks like this
1234 Name Surname Address
and I should have an output file like this
(123) Name Surname Address
I am using the Linux sed command
sed -e 's/^\([0-9]\{3\}\)\./&/g' file_name
But my command doesn't change the input file at all.
Please help,
You don't need two capture groups; one is enough. You also don't need to capture the remaining characters after the numbers:
$ sed 's/\([0-9]\{3\}\)[0-9]/(\1)/g' file
(123) Name Surname Address
Through GNU sed,
$ sed -r 's/([0-9]{3})[0-9]/(\1)/g' file
(123) Name Surname Address
Explanation:
([0-9]{3}) # Captures first three numbers.
[0-9] # Matches the last number.
(\1) # In the replacement part, it replaces the four numbers by `(captured group)`
You can use awk
awk '{$1="("substr($1,1,3)")"}1' file
(123) Name Surname Address
It adds parentheses to the three first characters of field #1 and prints it.
And it's shorter than sed :)
You can use this sed,
sed -e 's/^\([0-9]\{3\}\).\(.*\)/(\1)\2/g' yourfile
Explanation:
^\([0-9]\{3\}\) - first 3 digits ( group 1 )
. - matching any one character
\(.*\) - remaining content ( group 2 )
(\1) - accessing the first group using \1 and adding parentheses around it
\2 - accessing the second group
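As for why the command in the question changes nothing: \. requires a literal dot after the three digits, which the input does not have, and even on a match & would put the matched text back unchanged. The sketch below shows both the original command and an anchored variant of the first answer; the function name wrap_prefix is made up for illustration.

```shell
# The question's command: no literal "." follows the digits, so nothing matches,
# and "&" would re-insert the matched text as-is anyway.
printf '1234 Name Surname Address\n' | sed -e 's/^\([0-9]\{3\}\)\./&/g'

# Anchored fix: capture the three leading digits, drop the fourth digit,
# and wrap the captured group in parentheses.
wrap_prefix() { sed 's/^\([0-9]\{3\}\)[0-9]/(\1)/'; }
printf '1234 Name Surname Address\n' | wrap_prefix    # (123) Name Surname Address
```

Anchoring with ^ and dropping the g flag limits the change to the number at the start of each line, which matters if digits also appear later in the address.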

regex: remove all but groups

I need to remove all but matched groups in regex. A toy example would be:
echo 'spam 123 ham 345 eggs' | perl -pe 's/( \d+ )/SOMETHING/g'
123 345
What perl regex will remove everything but the matched groups? The matched group can be more complex than just digits - I can define the groups to match, but outside of the groups the input can contain any random characters
Just join all matches instead. I don't know Perl, but something like this might work:
$result = join('', $subject =~ m/\s*\d+\s*/g);