Extract the surrounding words together with the match


I am looking for every occurrence of a search term, e.g. ddd, in a file, and I want to output its surroundings, like this:
File.txt
aaa bbb ccc ddd eee fff
ttt uuu iii eee ddd
ddd
ggg jjj kkk ddd lll
output
ccc ddd eee
eee ddd
ddd
kkk ddd lll
As a starting point, I am using this piece of code:
#!/usr/bin/perl -w
while (<>) {
    while (/ddd(\d{1,3})/g) {
        print "$1\n";
    }
}

You can try the following; it gives the output you want:
while (<>) {
    if (/((?:\w+ )?ddd(?: \w+)?)/) {
        print "$1\n";
    }
}
Regex used:
( # open the grouping.
(?:\w+ )? # an optional word of at least one char followed by a space.
ddd # 'ddd'
(?: \w+)? # an optional space followed by a word of at least one char.
) # close the grouping.
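For comparison, here is a minimal Python sketch of the same pattern. The function name `surroundings` and the added `\b` anchors are assumptions of this sketch, not part of the answer above; `findall` is used so that every occurrence on a line is reported, not only the first.

```python
import re

# Optional preceding word, the search term, optional following word.
# The \b anchors keep 'ddd' from matching inside a longer word such as
# 'adddb' (an assumption; the Perl pattern above does not anchor the term).
PATTERN = re.compile(r"(?:\w+ )?\bddd\b(?: \w+)?")

def surroundings(line):
    """Return every match of the term together with its neighbouring words."""
    return PATTERN.findall(line)

lines = [
    "aaa bbb ccc ddd eee fff",
    "ttt uuu iii eee ddd",
    "ddd",
    "ggg jjj kkk ddd lll",
]
for line in lines:
    for hit in surroundings(line):
        print(hit)
```

Because each match consumes its neighbours, two adjacent occurrences such as "ddd ddd" come back as a single hit; a word-splitting approach avoids that limitation.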

#!/usr/bin/perl -w
while (<>) {
    if (/((?:[a-z]{3} )?ddd(?: [a-z]{3})?)/) {
        print "$1\n";
    }
}

while (<>) {
    chomp;
    my @words = split;
    for my $i (0..$#words) {
        if ($words[$i] eq 'ddd') {
            print join ' ', $i > 0 ? $words[$i-1] : (), $words[$i], $i < $#words ? $words[$i+1] : ();
            print "\n";
        }
    }
}

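#!/usr/bin/perl
while (<>) {
    chomp;
    my @F = split /\s+/;
    for my $i (0 .. $#F) {
        next unless $F[$i] eq 'ddd';
        # Only include neighbours that exist; $F[$i-1] with $i == 0 would
        # wrap around to the last word of the line.
        my @out = ($i > 0 ? $F[$i-1] : (), $F[$i], $i < $#F ? $F[$i+1] : ());
        print "@out\n";
    }
}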

Related

I have to remove a set of duplicate words from a given string using Perl

For example:
I have to scrape addresses from multiple websites. Sometimes an address contains a repeated country name or a repeated part of the address.
$string1="No 3, 3rd street mumbai india 3rd street";
$string2="#3 1019 GM Amsterdam Funda Real Estate BV 1019 GM Amsterdam The Netherlands";
I need to remove a repeated group of n words from the given string.
In the strings above,
$string1 contains "3rd street" twice; the duplicate needs to be removed.
$string2 contains "1019 GM Amsterdam" twice.
The output should be:
$string1="No 3, 3rd street mumbai india";
$string2="#3 1019 GM Amsterdam Funda Real Estate BV The Netherlands";

I have tried a brute-force method. Try the following:
use warnings;
use strict;
use POSIX;
my $string1="aaa bbb aaa ccc aaa bbb";
#my $string1="fff ggg hhh ddd jjj fff ggg hhh";
#my $string2 = "fff ggg hhh ddd jjj fff ggg hhh fff ggg mmm";
my $string1_count = () = $string1=~m/\s+/g;
my $string_divide = ceil($string1_count/2);
for(my $i = $string_divide; $i > 1; $i--)
{
last if($string1 =~s/((?:\w+\s?){$i}).+\K\1//g);
}
print "$string1\n";

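Just try this:
my $string1 = "aaa bbb aaa ccc aaa bbb";
my $string2 = "fff ggg hhh ddd jjj fff ggg hhh";
my @split = split / /, $string1;
# A hash keyed by word drops the repeats. Note that this works on single
# words, and sorting the keys discards the original word order.
my %seen = map { $_ => 1 } @split;
my @unique = keys %seen;
my $string3 = join " ", sort @unique;
print "$string3\n";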
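The question asks for the later copy of a repeated word group to be removed while keeping the rest in order. A minimal Python sketch of that idea follows; the function name `drop_repeated_group` and the `min_words` parameter are hypothetical, and the sketch removes only one repeated group (the longest one it finds), which is an assumption about the desired behaviour.

```python
def drop_repeated_group(text, min_words=2):
    """Remove the later copy of the longest word group that occurs twice."""
    words = text.split()
    # Try the longest possible group first, then shorter ones.
    for n in range(len(words) // 2, min_words - 1, -1):
        for i in range(len(words) - 2 * n + 1):
            group = words[i:i + n]
            # Look for a second, non-overlapping copy after the first one.
            for j in range(i + n, len(words) - n + 1):
                if words[j:j + n] == group:
                    return " ".join(words[:j] + words[j + n:])
    return text  # nothing repeated: leave the text unchanged

print(drop_repeated_group("No 3, 3rd street mumbai india 3rd street"))
```

Whitespace is normalised to single spaces by `split()` and `join()`, which is fine for scraped addresses but may not suit other inputs.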

How to use a regular expression in awk or sed to find all homopolymers in a DNA sequence?

Background
A homopolymer is a sub-sequence of DNA made of consecutive identical bases, like AAAAAAA. Here is an example in Python that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
my effort
I wrote a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
question
How can I use regular expressions in awk or sed to get the same result?

grep -o will get you that in one line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is gnu-sed.
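For comparison with the Python snippet in the question, the same back-reference idea can be sketched with Python's re module. The function name `homopolymers` is an assumption of this sketch; unlike the grep -i command above, it is case-sensitive.

```python
import re

def homopolymers(dna):
    """Split a sequence into runs of identical letters using a back-reference."""
    # ([A-Za-z])\1* = one letter followed by zero or more copies of itself.
    # finditer is used because findall would return only the captured letter.
    return [m.group(0) for m in re.finditer(r"([A-Za-z])\1*", dna)]

print(homopolymers("ACCCGGGTTTAACCGGACCCAA"))
```

This avoids listing each base explicitly, as the 'A+|T+|C+|G+' pattern does.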

Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
split breaks the single input line into the array chars[], one character per element. We then walk the array and compare each character with the next one (chars[i] != chars[i+1]). If they are equal, we print the character without a newline and move on to the next one. If they differ, we print the character followed by \n, a newline, which ends the current homopolymer.
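The compare-with-the-next-character idea described above is what itertools.groupby does in Python: it groups consecutive equal items. A minimal sketch, with the function name `runs` chosen only for illustration:

```python
from itertools import groupby

def runs(dna):
    """Group consecutive identical characters into strings, without a regex."""
    # groupby yields (key, iterator) pairs, one pair per run of equal items.
    return ["".join(group) for _, group in groupby(dna)]

print(runs("ACCCGGGTTTAACCGGACCCAA"))
```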

regex pattern match and extraction

I am trying to write a Perl regex to extract the words longer than 2 letters after the colon (:). For example, if the input is subject:I am about to write a regex, I need my $variable to hold only the words of more than 2 letters, i.e. $variable = "subject:about write regex".
Here is my program where the regex and pattern matching is done but when I print, my variable is empty. What am I doing wrong?
#!/usr/bin/perl
while (<STDIN>) {
foreach my $query_part (split(/\s+/, $_)) {
my($query_part_subject) = $query_part =~ /([^\w\#\.]+)?((?:\w{3,}|[\$\#()+.])+)(?::(\w{3,}.+))?/ ;
print "query_part : $query_part_subject \n";
}
}
exit(0);

Try doing this:
#!/usr/bin/perl
use strict; use warnings;
while (<DATA>) {
    s/.*?://;
    print join "\n", grep { length($_) > 2 } split;
}
__DATA__
subject:a bb ccc dddd fffff
OUTPUT
ccc
dddd
fffff
NOTE
From my understanding of your question, I display only the words longer than 2 characters that come after the : character.
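If the prefix before the colon should be kept, as in the "subject:about write regex" result the question asks for, the split-and-filter idea can be sketched in Python as follows. The function name `long_words_after_colon` and the `min_len` parameter are hypothetical.

```python
def long_words_after_colon(line, min_len=3):
    """Keep the text before the first colon and only the longer words after it."""
    prefix, sep, rest = line.partition(":")
    if not sep:
        return line  # no colon: nothing to filter
    kept = [word for word in rest.split() if len(word) >= min_len]
    return prefix + sep + " ".join(kept)

print(long_words_after_colon("subject:I am about to write a regex"))
```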

It isn't clear from your question. Is this what you are looking for?
$txt='I am about to create regex';
$x='(I)';
$y='.*?';
$z='(am)';
$re1=$x.$y.$z;
if ($txt =~ m/$re1/is)
{
$var1=$1;
$word1=$2;
print "($var1) ($word1) \n";
}

perl regex matching failed

I want to match two different strings, with the results in $1 and $2.
In this example, if $a is 'xy abc', I expect $1 to be 'xy abc' and $2 to be 'abc', but the 'abc' part ends up in $3.
Can you please help me write a regex in which $1 holds the whole string and $2 holds the second part?
I am using perl 5.8.5.
my @data=('abc xy','xy abc');
foreach my $a (@data) {
print "\nPattern= $a\n";
if($a=~/(abc (xy)|xy (abc))/) {
print "\nMatch: \$1>$1< \$2>$2< \$3>$3<\n";
}
}
Output:
perl test_reg.pl
Pattern= abc xy
Match: $1>abc xy< $2>xy< $3><
Pattern= xy abc
Match: $1>xy abc< $2>< $3>abc<

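This can be done with a branch reset group, (?|...), which requires Perl 5.10 or later (newer than the 5.8.5 mentioned in the question):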
(?|(abc (xy))|(xy (abc)))
Why even bother with capturing the whole thing? You can use $& for that.
my @data = ('abc xy', 'xy abc');
for (@data) {
print "String: '$_'\n";
if(/(?|abc (xy)|xy (abc))/) {
print "Match: \$&='$&', \$1='$1'\n";
}
}

Because only one of captures $2 and $3 can be defined, you can write
foreach my $item (@data) {
print "\nPattern= $item\n";
if ($item=~/(abc (xy)|xy (abc))/) {
printf "Match: whole>%s< part>%s<\n", $1, $2 || $3;
}
}
which gives the output
Pattern= abc xy
Match: whole>abc xy< part>xy<
Pattern= xy abc
Match: whole>xy abc< part>abc<
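The same "take whichever inner group matched" idea can be sketched in Python, whose re module has no branch reset group. The function name `whole_and_part` is an assumption of this sketch; groups that did not take part in the match come back as None.

```python
import re

PATTERN = re.compile(r"(abc (xy)|xy (abc))")

def whole_and_part(text):
    """Return (whole match, second part), or None when nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    whole, *parts = m.groups()
    # Exactly one inner group took part in the match; the other one is None.
    part = next(p for p in parts if p is not None)
    return whole, part

for item in ("abc xy", "xy abc"):
    print(item, "->", whole_and_part(item))
```

Testing against None rather than truthiness means an empty or "0"-like capture would still be picked up correctly.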

If you can live with allowing more capture variables than $1 and $2, then use the substrings from the branch of the alternative that matched.
for ('abc xy', 'xy abc') {
print "[$_]:\n";
if (/(abc (xy))|(xy (abc))/) {
print " - match: ", defined $1 ? "1: [$1], 2: [$2]\n"
: "1: [$3], 2: [$4]\n";
}
else {
print " - no match\n";
}
}
Output:
[abc xy]:
- match: 1: [abc xy], 2: [xy]
[xy abc]:
- match: 1: [xy abc], 2: [abc]

Removing spaces between single letters

I have a string that may contain an arbitrary number of single-letters separated by spaces. I am looking for a regex (in Perl) that will remove spaces between all (unknown number) of single letters.
For example:
ab c d should become ab cd
a bcd e f gh should become a bcd ef gh
a b c should become abc
and
abc d should be unchanged (because there are no single letters followed by or preceded by a single space).
Thanks for any ideas.

Your description doesn't really match your examples. It looks to me like you want to remove any space that is (1) preceded by a letter which is not itself preceded by a letter, and (2) followed by a letter which is not itself followed by a letter. Those conditions can be expressed precisely as nested lookarounds:
/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))/
tested:
use strict;
use warnings;
use Test::Simple tests => 4;
sub clean {
(my $x = shift) =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g;
$x;
}
ok(clean('ab c d') eq 'ab cd');
ok(clean('a bcd e f gh') eq 'a bcd ef gh');
ok(clean('a b c') eq 'abc');
ok(clean('abc d') eq 'abc d');
output:
1..4
ok 1
ok 2
ok 3
ok 4
I'm assuming you really meant one space character (U+0020); if you want to match any whitespace, you might want to replace the space with \s+.
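For readers who want to try the two conditions outside Perl, here is a minimal Python sketch. It is restricted to ASCII letters (an assumption; \pL above covers all Unicode letters) and it captures the preceding letter instead of using a nested lookbehind, so it is a variation on the pattern above, not a literal translation.

```python
import re

LETTER = "[A-Za-z]"
# A lone letter (not preceded by a letter), one space, then a lone letter
# (not followed by a letter). The first letter is captured and put back.
SINGLE_GAP = re.compile(rf"(?<!{LETTER})({LETTER}) (?={LETTER}(?!{LETTER}))")

def clean(text):
    """Remove the single space between two single letters."""
    return SINGLE_GAP.sub(r"\1", text)

for sample in ("ab c d", "a bcd e f gh", "a b c", "abc d"):
    print(repr(sample), "->", repr(clean(sample)))
```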

You can do this with lookahead and lookbehind assertions, as described in perldoc perlre:
use strict;
use warnings;
use Test::More;
is(tran('ab c d'), 'ab cd');
is(tran('a bcd e f gh'), 'a bcd ef gh');
is(tran('a b c'), 'abc');
is(tran('abc d'), 'abc d');
sub tran
{
my $input = shift;
(my $output = $input) =~ s/(?<![[:lower:]])([[:lower:]]) (?=[[:lower:]])/$1/g;
return $output;
}
done_testing;
Note the current code fails on the second test case, as the output is:
ok 1
not ok 2
# Failed test at test.pl line 7.
# got: 'abcd efgh'
# expected: 'a bcd ef gh'
ok 3
ok 4
1..4
# Looks like you failed 1 test of 4.
I left it like this as your second and third examples seem to contradict each other as to how leading single characters should be handled. However, this framework should be enough to allow you to experiment with different lookaheads and lookbehinds to get the exact results you are looking for.

This piece of code
#!/usr/bin/perl
use strict;
my @strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
foreach my $string (@strings) {
print "$string --> ";
$string =~ s/\b(\w)\s+(?=\w\b)/$1/g; # the only line that actually matters
print "$string\n";
}
prints this:
a b c --> abc
ab c d --> ab cd
a bcd e f gh --> a bcd ef gh
abc d --> abc d
I think/hope this is what you're looking for.

This should do the trick:
my $str = ...;
$str =~ s/ \b(\w) \s+ (\w)\b /$1$2/gx;
That removes the space between all single word characters. Feel free to replace \w with a more restrictive character class if needed. There may also be some edge cases related to punctuation characters that you need to deal with, but I can't guess that from the info you have provided.
As Ether helpfully points out, this fails on one case. Here is a version that should work (though not quite as clean as the first):
s/ \b(\w) ( (?:\s+ \w\b)+ ) /$1 . join '', split m|\s+|, $2/gex;
I liked Ether's test based approach (imitation is the sincerest form of flattery and all):
use warnings;
use strict;
use Test::Magic tests => 4;
sub clean {
(my $x = shift) =~ s{\b(\w) ((?: \s+ (\w)\b)+)}
{$1 . join '', split m|\s+|, $2}gex;
$x
}
test 'space removal',
is clean('ab c d') eq 'ab cd',
is clean('a bcd e f gh') eq 'a bcd ef gh',
is clean('a b c') eq 'abc',
is clean('abc d') eq 'abc d';
returns:
1..4
ok 1 - space removal 1
ok 2 - space removal 2
ok 3 - space removal 3
ok 4 - space removal 4

It's not a regex, but since I am lazy by nature I would do it this way.
#!/usr/bin/env perl
use warnings;
use 5.012;
my @strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
for my $string ( @strings ) {
    my @s; my $t = '';
    for my $el ( split /\s+/, $string ) {
        if ( length $el > 1 ) {
            push @s, $t if $t;
            $t = '';
            push @s, $el;
        } else { $t .= $el; }
    }
    push @s, $t if $t;
    say "@s";
}
OK, my way is the slowest:
               Rate no_regex Alan_Moore Eric_Storm canavanin
no_regex   130619/s       --       -60%       -61%      -63%
Alan_Moore 323328/s     148%         --        -4%       -8%
Eric_Storm 336748/s     158%         4%         --       -5%
canavanin  352654/s     170%         9%         5%        --
I didn't include Ether's code because (as he has tested) it returns different results.

Now I have the slowest and the fastest.
#!/usr/bin/perl
use 5.012;
use warnings;
use Benchmark qw(cmpthese);
my @strings = ('a b c', 'ab c d', 'a bcd e f gh', 'abc d');
cmpthese( 0, {
    Eric_Storm => sub{ for my $string (@strings) { $string =~ s{\b(\w) ((?: \s+ (\w)\b)+)}{$1 . join '', split m|\s+|, $2}gex; } },
    canavanin => sub{ for my $string (@strings) { $string =~ s/\b(\w)\s+(?=\w\b)/$1/g; } },
    Alan_Moore => sub{ for my $string (@strings) { $string =~ s/(?<=(?<!\pL)\pL) (?=\pL(?!\pL))//g; } },
    keep_uni => sub{ for my $string (@strings) { $string =~ s/\PL\pL\K (?=\pL(?!\pL))//g; } },
    keep_asc => sub{ for my $string (@strings) { $string =~ s/[^a-zA-Z][a-zA-Z]\K (?=[a-zA-Z](?![a-zA-Z]))//g; } },
    no_regex => sub{ for my $string (@strings) { my @s; my $t = '';
        for my $el (split /\s+/, $string) {if (length $el > 1) { push @s, $t if $t; $t = ''; push @s, $el; } else { $t .= $el; } }
        push @s, $t if $t;
        #say "@s";
    } },
});
Output:
               Rate no_regex Alan_Moore Eric_Storm canavanin keep_uni keep_asc
no_regex    98682/s       --       -64%       -65%      -66%     -81%     -87%
Alan_Moore 274019/s     178%         --        -3%       -6%     -48%     -63%
Eric_Storm 282855/s     187%         3%         --       -3%     -46%     -62%
canavanin  291585/s     195%         6%         3%        --     -45%     -60%
keep_uni   528014/s     435%        93%        87%       81%       --     -28%
keep_asc   735254/s     645%       168%       160%      152%      39%       --

This will do the job.
(?<=\b\w)\s(?=\w\b)

Hi, I have written some simple JavaScript to do this; you can convert it into any language.
function compressSingleSpace(source){
let words = source.split(" ");
let finalWords = [];
let tempWord = "";
for(let i=0;i<words.length;i++){
if(tempWord!='' && words[i].length>1){
finalWords.push(tempWord);
tempWord = '';
}
if(words[i].length>1){
finalWords.push(words[i]);
}else{
tempWord += words[i];
}
}
if(tempWord!=''){
finalWords.push(tempWord);
}
source = finalWords.join(" ");
return source;
}
function convertInput(){
let str = document.getElementById("inputWords").value;
document.getElementById("firstInput").innerHTML = str;
let compressed = compressSingleSpace(str);
document.getElementById("finalOutput").innerHTML = compressed;
}
label{
font-size:20px;
margin:10px;
}
input{
margin:10px;
font-size:15px;
padding:10px;
}
input[type="button"]{
cursor:pointer;
background: #ccc;
}
#firstInput{
color:red;
font-size:20px;
margin:10px;
}
#finalOutput{
color:green;
font-size:20px;
margin:10px;
}
<label for="inputWords">Enter your input and press Convert</label><br>
<input id="inputWords" value="check this site p e t z l o v e r . c o m thanks">
<input type="button" onclick="convertInput(this.value)" value="Convert" >
<div id="firstInput">check this site p e t z l o v e r . c o m thanks</div>
<div id="finalOutput">check this site petzlover.com thanks</div>
