Remove duplicate words and just print lines in which this occurs - regex

I have a challenge: check whether each sentence in a file contains two identical consecutive words. If it does, print the sentence with the duplicate removed; otherwise, don't print the sentence.
Example:
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
abc h h h h
After running the program the output will be:
dea 123 zy45
12
xyz%$#! kk
abc h h h
This is what I have so far:
sed '/\([^\([^ ]\+\)[ ]\+\1]\)/d' F4 >|tmp
This only separates the sentences that contain a doubled word from the sentences that don't; it does not remove the duplicate.

Your sed expression was quite close. However, it needed some tweaking to make it work:
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' file
dea 123 zy45
12
xyz%$#! kk
abc h h h
The idea is the one you already implemented: match a word with [^ ] and see if it is matched again with \1. What I added is replacing the whole match with \1\2, that is, the word plus whatever followed the repeat, so the repeated word disappears while the separator after it is kept.
Instead of [^ ] it is convenient to use \S, and \s instead of [ ]. Note the \b word boundary at the start, which prevents false positives like fedorqui qui, and the \1(\s|$) at the end, which prevents other false positives like hello helloa (thanks WalterA for the examples!). \s|$ matches either whitespace or the end of the line. A trailing \b would not do here: \b only matches between a word character and a non-word character, so it fails after xyz%$#!, where ! is followed by a space.
To prevent every line from being printed, we use sed -n. This way, only the lines on which the substitution succeeded are printed (with the p flag).
Note the use of -r, which avoids having to escape the capture groups. Without it, the command would be:
sed -n 's/\b\([^ ]\+\)[ ]\+\1\([ ]\|$\)/\1\2/p' file
Let's test it with a more comprehensive input:
$ cat a
abc2 1 def2 3 abc2
F4
--------------
dea 123 123 zy45
12 12
abc cd abc cd
xyz%$#! xyz%$#! kk
xyzxyz
fedorqui qui
hello helloa
abc h h h h
$ sed -nr 's/\b(\S+)\s+\1(\s|$)/\1\2/p' a
dea 123 zy45
12
xyz%$#! kk
abc h h h
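For comparison, the same idea can be sketched in Python. This is only an illustration, not part of the answer above: the function name collapse_first_repeat and the constant REPEAT are made up here, and the replacement keeps the separator captured after the repeated word so that the neighbouring words do not get glued together.

```python
import re

# Same shape as the sed pattern: a word, whitespace, the same word again,
# then either whitespace or the end of the line.
REPEAT = re.compile(r"\b(\S+)\s+\1(\s|$)")

def collapse_first_repeat(line):
    """Return the line with its first doubled word collapsed, or None if there is none."""
    # count=1 mirrors sed's s///p without the g flag: only the first match is replaced.
    new_line, count = REPEAT.subn(r"\1\2", line, count=1)
    return new_line if count else None

sample = [
    "abc2 1 def2 3 abc2",
    "dea 123 123 zy45",
    "12 12",
    "abc cd abc cd",
    "xyz%$#! xyz%$#! kk",
    "xyzxyz",
    "fedorqui qui",
    "hello helloa",
    "abc h h h h",
]

for line in sample:
    result = collapse_first_repeat(line)
    if result is not None:  # like sed -n ... p: print only the changed lines
        print(result)
```

Returning None for unchanged lines plays the role of sed -n: the caller decides what to print.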

I was looking for a sed solution, since that seemed easy, but perhaps awk is better in this case (F4 is the input file):
awk '{
  for (i=2; i<=NF; i++) {
    if ($(i-1)==$i) {
      $i="";
      printf("%s\n", $0);
      break;
    }
  }
}' F4
I am not completely happy with this solution, since it leaves a doubled field separator in $0 after deleting the repeated word. Then again, the OP did not literally say that a space or tab should be deleted too.
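The field-by-field approach can also be sketched in Python, where rebuilding the line from the remaining words avoids the doubled separator. This is a sketch under assumptions: the function name drop_first_adjacent_duplicate is invented for illustration, and joining with a single space means any run of spaces or tabs in the input is normalized to one space.

```python
def drop_first_adjacent_duplicate(line):
    """Remove the first word that repeats its left neighbour; return None if no such word."""
    words = line.split()  # split on any whitespace, like awk's default field splitting
    for i in range(1, len(words)):
        if words[i - 1] == words[i]:
            del words[i]            # drop the repeated word itself
            return " ".join(words)  # rebuild with single spaces
    return None

for line in ["dea 123 123 zy45", "abc cd abc cd", "abc h h h h"]:
    result = drop_first_adjacent_duplicate(line)
    if result is not None:
        print(result)
```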

Related

Remove "." from digits

I have a string like this:
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into:
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
That is, I only want to convert a.b -> ab where a and b are integers.
Waiting for help.
Assuming you are using Python, you can use capture groups in the regex, either numbered or named, and then reference the groups in the replacement while leaving out the dot.
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each pattern group (the content in parentheses) by its index.
text = re.sub(r"(\d+)\.(\d+)", r"\1\2", text)
Named: you reference each pattern group by a name you specified.
text = re.sub(r"(?P<before>\d+)\.(?P<after>\d+)", r"\g<before>\g<after>", text)
Either one gives:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However you should be aware that leaving out the . in decimal numbers will change their value. So you should be careful with whatever you are doing with these numbers afterwards.
Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'
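Python's re module has no \K, but a similar effect can be sketched with a lookbehind and a lookahead, so that only the dot itself is matched and removed. The names DOT_BETWEEN_DIGITS and strip_inner_dots below are made up for this illustration.

```python
import re

# Match a dot only when a digit sits directly on both sides of it.
# The lookbehind and lookahead do not consume the digits, so only the dot is replaced.
DOT_BETWEEN_DIGITS = re.compile(r"(?<=\d)\.(?=\d)")

def strip_inner_dots(text):
    """Remove every dot that is directly between two digits."""
    return DOT_BETWEEN_DIGITS.sub("", text)

print(strip_inner_dots("1.2 1.23 12.34 1. .2"))
print(strip_inner_dots("lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."))
```

Compiling the pattern once, as here, corresponds to hoisting the regex into BEGIN in the Ruby one-liner.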

How to grep any word that appears between 2 and 4 times?

My file is:
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
And I need to extract the words and numbers that appear 2-4 times ({2,4}).
I've tried many regex lines and even regex101.
I can't really put my finger on what's not working.
This is the closest I've got so far:
egrep -o '[\w]{2,4}' A1
Plain grep uses basic regular expressions, in which an unescaped {} is taken literally. You have to use extended regular expressions.
Use the -E option:
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e. force grep to behave as egrep).
Also use -w to match whole words instead of partial ones:
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
Example
$ grep -Ewo "\w{2,4}" file
ab
12ab
1cd
uu
88
ab
33
33
ab
cd
uu
88
88
33
33
33
cw
ab
Note
You can eliminate the unnecessary use of cat by passing the file to grep as an argument instead.
You were very close. Inside a bracket expression [], \w is treated literally (as a backslash and a w), so move it outside the []:
egrep -o '\w{2,4}'
Also egrep is deprecated in favor of grep -E, and you don't need the cat as grep takes file(s) as argument(s):
grep -Eo '\w{2,4}' file.txt
I would use awk for it:
awk '{for(i=1;i<=NF;i++)a[$i]++}
END{for(x in a)if(a[x]>1&&a[x]<5)print x}' file
It scans the whole file and finds the words whose number of occurrences (in the file) is in the range [2,4].
Output is:
uu
ab
88
1
Using AWK, this solution counts the word instances per line not per file:
awk '{delete array; for(i = 1; i <= NF; i++) array[$i]+=1; for(i in array) if(array[i] >= 2 && array[i] <= 4) printf "%s ", i; printf "\n" }' input.txt
delete clears the array for each new line. The fields are used as array indexes (hash keys), and each occurrence increments the stored value by one. The indexes (fields) whose values are between 2 and 4 inclusive are printed.
Output:
ab 1 33
ab 88 33
Perl implementation for a file small enough to process its content as a single string:
$/ = undef;
$_ = <>;
@_ = /(\b\w+\b)/gs;
my %h; $h{$_}++ for @_;
for (keys %h) {
print "$_\n" if $h{$_} >= 2 and $h{$_} <= 4;
}
Save it into a script.pl and run:
perl script.pl < file
Of course, you can pass the code via -e option as well: perl -e 'the code' < file.
Input
ab 12ab 1cd uu 88 ab 33 33 1 1
ab cd uu 88 88 33 33 33 cw ab
Output
88
uu
ab
1
There is no 33 in the output, since it occurs 5 times in the input.
The code reads the file in slurp mode into the default variable ($_), then collects all the words (\w+ with word boundaries around it) into the @_ array. Then it counts the number of times each word occurs in the file and stores the result in the %h hash. The final block prints only the items that occur 2, 3, or 4 times, no more and no less.
Note that in Perl you should always use strict; and use warnings; in order to detect issues at an early stage.
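The two counting strategies discussed in this thread (per file and per line) can be sketched in Python with collections.Counter. The function names words_in_range and words_in_range_per_line are invented for this illustration, and \w+ is assumed to be an acceptable definition of a word, as in the Perl version.

```python
import re
from collections import Counter

def words_in_range(text, low=2, high=4):
    """Return the set of words that occur between low and high times in text."""
    counts = Counter(re.findall(r"\w+", text))
    return {word for word, n in counts.items() if low <= n <= high}

def words_in_range_per_line(text, low=2, high=4):
    """Apply the same count to each line separately, one set per line."""
    return [words_in_range(line, low, high) for line in text.splitlines()]

sample = "ab 12ab 1cd uu 88 ab 33 33 1 1\nab cd uu 88 88 33 33 33 cw ab"

print(sorted(words_in_range(sample)))           # counted over the whole input
for found in words_in_range_per_line(sample):   # counted line by line
    print(sorted(found))
```

Sets are returned because the order of the words is not meaningful here; sorted() is only used to make the printed output stable.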

grep between two strings if pattern in the middle linux

I want to grep between two strings only if there is a pattern between them.
For example, in the text below, the first wanted string is Start, the second is END, and the pattern is 1 2 3, each on its own line.
Start
abc
abc
1
2
3
abc
END
bla
bla
Start
abc
abc
1
2
4
abc
END
bla
bla
Start
abc
abc
1
2
3
abc
abc
END
the result should be:
Start
abc
abc
1
2
3
abc
END
Start
abc
abc
1
2
3
abc
abc
END
thanks!
sed -ne '/Start/{:a;N;/END/!b a;/\n1\n2\n3\n/p}'
Line by line:
we only need text starting with 'Start':
sed -ne '/Start/{
we found 'Start'; now add everything up to 'END' to the pattern space.
set a label named 'a':
:a
append the next line to the pattern space:
N
if 'END' has not been found yet, jump back to 'a':
/END/!b a
now check whether the block contains the desired pattern 1 2 3 and, if so, print it;
the numbers are separated by '\n' because they were on separate lines:
/\n1\n2\n3\n/p
}'
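The same collect-then-test logic can be sketched in Python. This is an illustration under assumptions, not a translation of the sed script: the function name blocks_containing is made up, and the markers are compared as whole lines, whereas sed's /Start/ and /END/ match anywhere in a line.

```python
def blocks_containing(lines, start="Start", end="END", wanted=("1", "2", "3")):
    """Yield each start..end block whose inner lines contain `wanted` consecutively."""
    wanted = list(wanted)
    block = None  # None while we are outside a block
    for line in lines:
        if block is None:
            if line == start:      # assumption: the marker is the whole line
                block = [line]
            continue
        block.append(line)
        if line == end:
            inner = block[1:-1]
            # Slide a window over the inner lines looking for the wanted run.
            if any(inner[i:i + len(wanted)] == wanted
                   for i in range(len(inner) - len(wanted) + 1)):
                yield block
            block = None           # wait for the next start marker

text = """Start
abc
1
2
3
END
bla
Start
1
2
4
END"""

for found in blocks_containing(text.splitlines()):
    print("\n".join(found))
```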
grep is not suitable; use sed instead:
sed -n "/Start/,/END/p" input.txt
This prints every block from Start to END; it does not check for the 1 2 3 pattern in between. I'm assuming the input is in a file input.txt.

Match "\b(OneTwoThree|OneTwo|TwoThree)\b" with least repetition

I managed this with PCRE only, but I'd like it to work with Javascript's RegExp as well. That, and the regex is ugly. Are there any other, saner ways of accomplishing this?
Note, that while the topic says "OneTwoThree", I'm using "qwe" for brevity.
$ cat test.txt | grep -oP '\b(q(\g<we>|\g<w>)|(?<we>(?<w>w)e))\b'
qwe
qw
we
File test.txt contains:
qwe qw we q w e qq qe wq ww eq ew ee qqq qqw qqe qwq qww qeq qew qee wqq wqw wqe wwq www wwe weq wew wee eqq eqw eqe ewq eww ewe eeq eew eee
(Only the first three should match.)
Something like this would work for your sample data:
/\b(qwe?|we)\b/
/\b(q?we|qw)\b/
Which you can test here.
But for the full pattern you specified in the title it would be
/\b(OneTwo(Three)?|TwoThree)\b/
/\b((One)?TwoThree|OneTwo)\b/
Now, this is not more readable, but it does reduce redundancy slightly:
/\b(?!w\b)q?we?\b/
Which you can test here
Or for your full pattern:
/\b(?!Two\b)(One)?Two(Three)?\b/
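To check the candidates against the test data from the question, here is a small sketch using Python's re module rather than JavaScript; the constructs involved (\b, optional items, negative lookahead) are available in both. The names CANDIDATES, WORDS and matches are made up for this illustration, and the alternation is written with a non-capturing group so that re.findall returns whole matches.

```python
import re

CANDIDATES = {
    "alternation": r"\b(?:qwe?|we)\b",
    "lookahead": r"\b(?!w\b)q?we?\b",
}

# Contents of test.txt from the question.
WORDS = (
    "qwe qw we q w e qq qe wq ww eq ew ee qqq qqw qqe qwq qww qeq qew qee "
    "wqq wqw wqe wwq www wwe weq wew wee eqq eqw eqe ewq eww ewe eeq eew eee"
)

def matches(pattern, text=WORDS):
    """Return every whole-word match of pattern in text, in order."""
    return re.findall(pattern, text)

for name, pattern in CANDIDATES.items():
    print(name, matches(pattern))
```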
Maybe this but not sure -
# \b(?=..)q?we?\b
\b
(?= . . )
q? w e?
\b

RegEx to extract numeric value immediately before a search character /string found

What would be the best regex to extract the number (digits only) immediately before a search string?
ABC Y C S 1 $ 46CC MAN 25/ 31
I need to extract 25 in this case, but it is not of fixed length. Any help?
'\d+(?=/)'
should work. See the test with grep:
kent$ echo "ABC Y C S 1 $ 46CC MAN 25/ 31 "|grep -Po '\d+(?=/)'
25
Perl regex:
while ($subject =~ m!\d+(?=.*/)!g) {
# matched text = $&
}
Output:
1
46
25
So basically it keeps matching as long as a / exists somewhere later in the string.
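The difference between the two lookaheads in this thread can be seen side by side in a short Python sketch. The function names are invented for illustration; the patterns are the ones quoted in the answers.

```python
import re

subject = "ABC Y C S 1 $ 46CC MAN 25/ 31"

def digits_right_before_slash(text):
    """Digits that are immediately followed by a slash."""
    return re.findall(r"\d+(?=/)", text)

def digits_with_slash_later(text):
    """Digits that have a slash anywhere later on the same line."""
    return re.findall(r"\d+(?=.*/)", text)

print(digits_right_before_slash(subject))
print(digits_with_slash_later(subject))
```

Only the first pattern answers the question as asked (the number immediately before the /); the second one also picks up earlier numbers.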