Identifying pseudo-duplicates with Perl - regex

I have a list that contains names. There are multiples of the same name. I want to catch the first instance of these pseudo-dupes and anchor them.
Example input
Josh Smith
Josh Smith0928340938
Josh Smith and friends
hello
hello1223
hello and goodbye.
What I want to do is identify the first occurrence of each name (here Josh Smith and hello) and put an anchor such as a pipe | in front of it so that I can validate it. The list is large and the names are not known in advance, so I cannot search specifically for the first match of Josh Smith and so on.
My desired output would be this:
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.
I did not provide any code. I am a little in the dark on how to go about this and was hoping maybe someone had been in a similar situation using regex or Perl.

Based on my understanding of your requirements, I think you are looking for something like this:
$prefix = '';
$buffered = '';
$count = 0;
while ($line = <>) {
    $linePrefix = substr($line, 0, length($prefix));
    if ($buffered ne '' && $linePrefix eq $prefix) {
        $buffered .= $line;
        $count++;
    } else {
        if ($buffered ne '') {
            print "|" if ($count > 1);
            print $buffered;
        }
        $buffered = $line;
        $prefix = $line;
        chomp $prefix;
        $count = 1;
    }
}
if ($buffered ne '') {
    if ($count > 1) {
        print "|";
    }
    print $buffered;
}

Actually, IMO this is a rather interesting question, because you can be creative. Since you do not know how to identify the root name, I have to ask: do you have to? I have a feeling that you do not need a perfect solution, so I would go for something simple:
#!/usr/bin/perl -wn
$N = 4;
if (@prev) {
    $same_start = length $_ >= $N &&
        substr($prev[0], 0, $N) eq substr($_, 0, $N);
    unless ($same_start) {
        print "|", shift @prev if $#prev;
        @prev = grep { print; 0 } @prev;
    }
}
push @prev, $_;
}{ print "|" if @prev > 1; print for @prev
edit: fixed bug: <print "|", shift @prev;> to <print "|", shift @prev if $#prev;>
Sample output:
$ perl josh.pl <josh-input.txt
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.
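The same fixed-length-prefix idea can also be written as a reusable sub, which is easier to test than a one-liner. This is only a sketch: the name mark_group_heads and the choice to pass the prefix length as the first argument are my own assumptions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: mark the first line of every run of consecutive
# lines that share their first $n characters with the run's first line.
sub mark_group_heads {
    my ($n, @lines) = @_;
    my (@out, @group);
    my $flush = sub {
        return unless @group;
        # Only runs with more than one member count as pseudo-duplicates.
        $group[0] = "|$group[0]" if @group > 1;
        push @out, @group;
        @group = ();
    };
    for my $line (@lines) {
        my $same = @group
            && length($line) >= $n
            && substr($group[0], 0, $n) eq substr($line, 0, $n);
        $flush->() unless $same;
        push @group, $line;
    }
    $flush->();    # do not forget the last run
    return @out;
}

my @names = ('Josh Smith', 'Josh Smith0928340938', 'Josh Smith and friends',
             'hello', 'hello1223', 'hello and goodbye.');
print "$_\n" for mark_group_heads(4, @names);
```

A prefix length of 4 is as arbitrary here as in the answer; names that merely share their first four characters would be grouped together as well.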

Related

Port awk one-liner to perl using regexes (Summing groups of data)

Given the following input:
£ cat problem
Team 7
John: 19
Sue: 20
Pam: 35
Team 42
Jeff: 12
Sam: 3
Phil: 26
Jill: 10
Team 9
Bill: 19
John: 7
Linda: 15
I am trying to reproduce the following output in a perl one-liner:
£ awk '/Team/ {x=$2} /: *[0-9]+/ {myarray[x]+=$2}; END{for (key in myarray) {print "Team " key ": " myarray[key]}}' problem
Team 42: 51
Team 7: 74
Team 9: 41
This (basic) problem and its data actually come from a Perl tutorial: http://learnperl.scratchcomputing.com/tutorials/csss/
I'm interested in something close to the following, of which I have tried literally over 50 variations:
£ perl -e 'while(<>){if (/Team/){$x = /(\d+)/;}else{/(\d+)/;$myarray{$x}+= {$1}; }} foreach $key (keys %myarray){print "$key\n";}' problem
I know that 1) these regexes are not returning the match (the += {$1} was an attempt to remedy that), and 2) even if they were, I am probably populating the hash incorrectly. Googling for solutions only turns up verbose, multiline code. I was waiting to get the keys printing correctly before I even bothered moving on to the values, btw.
Without examining the awk code, I'd write this perl:
perl -lne '
    if (/Team (\d+)/) {
        $t = $1;
    } elsif (/: (\d+)/) {
        $score{$t} += $1
    }
} END {
    printf "Team %d: %d\n", $_, $score{$_} for keys %score
' problem
Team 7: 74
Team 9: 41
Team 42: 51
The -n option implicitly wraps a while (<>) {...} loop around the given code. See perldoc perlrun for all the details.
The } END { bit exploits that, and allows you to collect data inside the while loop, then do stuff with it when the loop is complete.
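For comparison, here is roughly the same logic as an ordinary script, without -n and without the } END { trick. It is a sketch only: the sub name team_totals is made up for illustration, and the input is passed in as a list instead of being read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the scores listed under each "Team N" heading.
sub team_totals {
    my @lines = @_;
    my (%score, $team);
    for (@lines) {
        if (/Team (\d+)/) {
            $team = $1;                  # remember the current team
        }
        elsif (defined $team && /: *(\d+)/) {
            $score{$team} += $1;         # add this member's score
        }
    }
    return %score;
}

my %score = team_totals('Team 7', 'John: 19', 'Sue: 20', 'Pam: 35',
                        'Team 42', 'Jeff: 12', 'Sam: 3');
# Sort numerically, because the order of hash keys is not predictable.
printf "Team %d: %d\n", $_, $score{$_} for sort { $a <=> $b } keys %score;
```

The explicit loop plays the role of the implicit while (<>), and the code after the loop does what the END block does in the one-liner.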
You can see how perl deals with one-liners by adding the -MO=Deparse option:
perl -MO=Deparse -lne '
    if (/Team (\d+)/) {
        $t = $1;
    } elsif (/: (\d+)/) {
        $score{$t} += $1
    }
} END {
    printf "Team %d: %d\n", $_, $score{$_} for keys %score
' problem
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    if (/Team (\d+)/) {
        $t = $1;
    }
    elsif (/: (\d+)/) {
        $score{$t} += $1;
    }
}
sub END {
    printf "Team %d: %d\n", $_, $score{$_} foreach (keys %score);
}
-e syntax OK
I'm not sure why it turns the END block into a subroutine though.
A couple of notes about your code:
Use better variable names.
$x = /(\d+)/
m// used in scalar context returns a true/false value, regardless of any capturing parentheses (ref: http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators).
To assign the captured text to variables, you must have list context on the left-hand side: ($a, $list, $of, $variables) = (a, list, of, values), so
($x) = /(\d+)/; # with parentheses on the left-hand side
$myarray{$x}+= {$1}
When you use {...} as an expression, you are creating a hash reference.
In this case, {$1}, your hash does not have enough elements: it requires an even-numbered list of key-value pairs.
You just want $myarray{$x} += $1
foreach $key (keys %myarray){...}
An alternate way to iterate over a hash is to use the each function in a while loop:
while (my ($key, $value) = each %myarray) {
    print "Team $key: $value\n";
}
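The difference between scalar and list context for m// is easy to see side by side. In this small sketch the two sub names are invented for the demonstration; only the context of the assignment differs between them.

```perl
use strict;
use warnings;

# Hypothetical helpers that differ only in the context of the match.
sub match_in_scalar_context {
    my ($text) = @_;
    my $result = $text =~ /(\d+)/;      # scalar context: success flag
    return $result;
}

sub match_in_list_context {
    my ($text) = @_;
    my ($result) = $text =~ /(\d+)/;    # list context: captured text
    return $result;
}

print match_in_scalar_context('Team 42'), "\n";   # prints 1
print match_in_list_context('Team 42'), "\n";     # prints 42
```

This is why $x = /(\d+)/ in the question stored 1 for every team instead of the team number.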

How to obtain the captured groups in perl

I'm currently working on a simple script to detect the relations between people. I define the $np to be the regex of a Proper Noun.
In my first task, the goal is to find all the proper nouns that appear left of another proper noun in which the number of words between them is less than a fixed value. I have written the following code:
$/ = '';
my $PM = qr{\b[A-Z][\w-]*\w};
my $de = qr{d[aoe]s?};
my $s = qr{[\n ]};
my $np = qr{$PM (?: $s $PM | $s $de $s $PM )*}x;
while(<>){
    while(/($np)/g){
        print("$1 : ");
        my @x = m/(?=(?: $s+ (?: [\w-]+ | ($np)) ){1,7})/gx;
        my $y = join(", ", @x);
        print("$y\n");
    }
}
I expected it to print all the proper nouns in the file and, for each one of them, the proper nouns that are in the window. However, this is not happening.
How can I make it work like I want?
PS: I'm a perl newbie
EDIT: Some people recommended to add samples of input and output expected.
If I had a file with the following text
John asked Mary to meet Anna.
then, I would like my script to print
John : Mary, Anna
Mary : Anna
Anna :
However, in its current state, I get commas printed in an infinite loop.
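You get an infinite loop because both matches run against $_ and therefore share its match position: the inner list-context match resets pos($_) when it finishes, so the outer while starts searching from the beginning again. You can run the outer match on a copy in a named variable to avoid this: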
while( <> ) {
    my $line = $_;
    while( $line =~/($np)/g ) {
        ...
    }
}
See Using regular expressions in Perl and pos() for details.
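To see the position being reset, here is a small stand-alone sketch. It uses a plain /\w+/ pattern and a made-up sub name (count_words_with_copy) rather than the $np regex from the question, so it only illustrates the pos() behaviour, not the proper-noun matching.

```perl
use strict;
use warnings;

# Count outer-loop iterations when the inner match runs on a different
# variable than the outer one, so the two positions do not interfere.
sub count_words_with_copy {
    my ($text) = @_;
    my $copy  = $text;
    my $loops = 0;
    while ($copy =~ /\w+/g) {
        my @inner = $text =~ /\w+/g;   # resets pos($text), not pos($copy)
        $loops++;
    }
    return $loops;
}

my $s = 'John asked Mary';
$s =~ /\w+/g;                  # scalar-context match stops after "John"
my $before = pos($s);          # 4
my @all    = $s =~ /\w+/g;     # list-context match runs until it fails
my $after  = pos($s);          # undef: the position was reset

print "before: $before\n";
print "after: ", defined $after ? $after : 'undef', "\n";
print "loops: ", count_words_with_copy($s), "\n";   # 3, one per word
```

If the inner match in count_words_with_copy ran against $copy instead of $text, the outer loop would never terminate, which is the situation in the question.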
Is using a regex a requirement? It doesn't seem like the best way to me.
I'd just split the string into words and go through them, something like the code below.
I'm not sure I understand your specification correctly. What my code does is find proper names that appear within a certain distance to the left of another proper name.
my $window = 3;
my %result;
while(<DATA>){
    my @words = map {$_ =~ s/[[:punct:]]$//; $_} split;
    for (my $index = $#words; $index > 0; $index--) {
        my $word = $words[$index];
        next unless is_name($word);
        # look at up to $window words to the left of the current name
        my $start_index = $index - $window;
        $start_index = 0 if $start_index < 0;
        my $end_index = $index - 1;
        $end_index = 0 if $end_index < 0;
        my @neighbours = grep {is_name($_)} @words[$start_index .. $end_index];
        $result{$word} = [@neighbours] if @neighbours;
    }
}
sub is_name {
    shift =~ /^[A-Z][\w-]*\w$/;
}
__DATA__
John asked Mary to meet Anna.

Perl6: Confused about BagHash/Matching

I'm attempting to count matches of a regex using a BagHash, and getting odd results.
my $fh = open "versions.txt";
my $versions = BagHash.new();
while (defined my $line = $fh.get) {
    my $last = '';
    if $line ~~ /(\d+)\.?(\d*)/ {
        say 'match ' ~ $/[0];
        if $last !eq $/[0] {
            say 'not-same: ' ~ $/[0];
            $versions{$/[0]}++
        }
        $last = $/[0];
    }
    else {
        $last = '';
    }
}
say 'count: ' ~ $versions.elems;
say 'count: ' ~ $versions.elems;
Output is:
match 234
not-same: 234
match 999
not-same: 999
count: 1 # I expect 2 here.
The test case I'm working with is:
version history thingy
version=234.234
version=999
What am I missing?
You are resetting $last with each iteration. Also, don't trust say: it calls .gist, which produces a short, human-readable summary and is meant to avoid flooding a terminal or logfile with infinite lists. Use dd (Rakudo internal) or a module to dump debug output. If you had used dd, you would have seen that $/[0] contains a Match, a complex structure that is not suited to serve as a Hash key.
# my @lines = slurp('version.txt');
my @lines = ('version=234.234', 'version=999');
my BagHash $versions.=new;
for @lines {
    ENTER my $last = '';
    if .Str ~~ /(\d+) '.'? (\d*)/ {
        $versions{$0.Str}++ if $last ne $0.Str;
        $last = $0.Str
    }else{
        $last = ''
    }
};
dd $versions;
# OUTPUT«BagHash $versions = ("234"=>1,"999"=>1).BagHash␤»
The whole point of BagHash is that its constructor will do the counting for you. If you supply lazy lists all the way down, this can be fairly efficient.
my @lines = ('version=234.234', 'version=999');
dd BagHash.new(@lines».split('=')».[1]);
# OUTPUT«("234.234"=>1,"999"=>1).BagHash␤»
Bug #1: you almost certainly want your $last declaration outside of the loop, so you don't keep resetting it to ''.
Bug #2: you probably only want to update $last when you have found a version number, not for all lines.
Bug #3: you used the Match object as the key to the BagHash rather than the string value of the version. You can coerce a match to the string it matched with ~$/[0], and $0 is a shortcut for $/[0].
I cleaned up your code and got the version below, which works but is still quite far from idiomatic Perl 6:
my $fh = open "versions.txt";
my $versions = BagHash.new();
my $last = '';
for $fh.lines -> $line {
    if $line ~~ /(\d+)\.?(\d*)/ {
        say 'match ' ~ $/[0];
        if $last ne $/[0] {
            say 'not-same: ' ~ $/[0];
            $versions{~$/[0]}++;
            $last = $/[0];
        }
    }
    else {
        $last = '';
    }
}
say $versions;
say 'count: ' ~ $versions.elems;
I would personally have written this as follows if it were throwaway code:
my $versions = "versions.txt".IO.lines.comb(/(\d+)\.?(\d*)/).Bag;
say $versions.elems;
If you need the file later, want to do more with each line, or this is for production:
my %versions;
for "versions.txt".IO.lines -> $line {
    if $line ~~ /((\d+)\.?(\d*))/ {
        %versions{$0}++;
    }
}
say %versions.elems;

replace all to "hello" except string that is in between double quotes

$ cat file1
"rome" newyork
"rome"
rome
What do I need to fill in the blank?
$ sed ____________________ file1
I want output like
"rome" newyork
"rome"
hello
If my input is like this:
$ cat file1
/temp/hello/ram
hello
/hello/temp/ram
and I want to change only the hello that is not surrounded by slashes (change hello to happy), what should I do? The desired output is:
/temp/hello/ram
happy
/hello/temp/ram
sed 's/[^\"]rome[^\"]/hello/g' your_file
tested below:
> cat temp
"rome" newyork
"rome"
rome
> sed 's/[^\"]rome[^\"]/hello/g' temp
"rome" newyork
"rome"
hello
>
Why is rome changed to hello but newyork is not? If I'm reading the question correctly, you're trying to replace everything not in double quotes with hello?
Depending on the exact use cases you want (what happens to the input string ""?), you probably want something like this:
sed 's/\".*\"/hello/'
I don't see a direct way to replace every occurrence except those enclosed in " ".
However, a brute-force method that chains several sed commands can achieve it:
cat file1 | sed "s/\"rome\"/\"italy\"/g" | sed "s/rome/hello/g" | sed "s/\"italy\"/\"rome\"/g"
The second problem can be solved with a simple perl one-liner (assuming only one hello per line):
perl -pe 'next if /\//; s/hello/happy/;'
The first problem requires some internal bookkeeping to keep track of whether you are inside a string or not. This can also be solved with perl:
#!/usr/bin/perl -w
use strict;
use warnings;
my $state_outside_string = 0;
my $state_inside_string = 1;
my $state = $state_outside_string;
while (my $line = <>) {
    my @chars = split(//,$line);
    my $not_yet_printed = "";
    foreach my $char (@chars) {
        if ($char eq '"') {
            if ($state == $state_outside_string) {
                $state = $state_inside_string;
                $not_yet_printed =~ s/rome/hello/;
                print $not_yet_printed;
                $not_yet_printed = "";
            } else {
                $state = $state_outside_string;
            }
            print $char;
            next;
        }
        if ($state == $state_inside_string) {
            print $char;
        } else {
            $not_yet_printed .= $char;
        }
    }
    $not_yet_printed =~ s/rome/hello/;
    print $not_yet_printed;
}
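A shorter alternative to the character-by-character state machine, not taken from the answers above, is to let a single substitution match quoted strings first and put them back unchanged. This sketch assumes quotes are balanced and never escaped; the sub name replace_outside_quotes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $from with $to everywhere except inside
# double-quoted strings.
sub replace_outside_quotes {
    my ($line, $from, $to) = @_;
    # The first alternative consumes a whole quoted string and puts it
    # back unchanged; the second only gets a chance on unquoted text.
    $line =~ s/("[^"]*")|\Q$from\E/defined($1) ? $1 : $to/ge;
    return $line;
}

print replace_outside_quotes($_, 'rome', 'hello'), "\n"
    for '"rome" newyork', '"rome"', 'rome';
```

With the three sample lines this prints "rome" newyork, "rome" and hello, which is the output asked for in the question.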

How to determine number of times a word appears in text?

How can I find the number of times a word is in a block of text in Perl?
For example my text file is this:
#! /usr/bin/perl -w
# The 'terrible' program - a poorly formatted 'oddeven'.
use constant HOWMANY => 4; $count = 0;
while ( $count < HOWMANY ) {
$count++;
if ( $count == 1 ) {
print "odd\n";
} elsif ( $count == 2 ) {
print "even\n";
} elsif ( $count == 3 ) {
print "odd\n";
} else { # at this point $count is four.
print "even\n";
}
}
I want to find the number of times the word "count" appears in that text file. The file is named terrible.pl.
Ideally it should use a regex and a minimum number of lines of code.
EDIT: This is what I have tried:
use IO::File;
my $fh = IO::File->new('terrible.pl', 'r') or die "$!\n";
my %words;
while (<$fh>) {
    for my $word ($text =~ /count/g) {
        print "x";
        $words{$word}++;
    }
}
print $words{$word};
Here's a complete solution. If this is homework, you learn more by explaining this to your teacher than by rolling your own:
perl -0777ne "print+(@@=/count/g)+0" terrible.pl
If you are trying to count how many times the word "count" appears, this will work:
my $count = 0;
open(my $input, '<', 'terrible.pl') or die "Cannot open terrible.pl: $!\n";
while (<$input>) {
    $count++ while ($_ =~ /count/g);
}
close($input);
print "$count times\n";
I'm not actually sure what your example code is doing, but you're almost there:
perl -e '$text = "lol wut foo wut bar wut"; $count = 0; $count++ while $text =~ /wut/g; print "$count\n";'
You can use the /g modifier to continue searching the string for matches. In the example above, each successful match of 'wut' in the $text var runs the loop body once, so $count ends up as the number of instances.
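If you only need the total, the loop can be replaced by assigning the match to an empty list. The following is a small sketch; the sub name count_matches is hypothetical, and \Q...\E is used so that the search word is treated literally.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping occurrences of $word in $text.
sub count_matches {
    my ($text, $word) = @_;
    # The empty list () forces list context, so /g returns every match;
    # assigning that list to a scalar yields the number of matches.
    my $count = () = $text =~ /\Q$word\E/g;
    return $count;
}

print count_matches('lol wut foo wut bar wut', 'wut'), "\n";   # prints 3
```

Note that this counts substrings, so for terrible.pl it also counts the "count" inside $count.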
You can probably use something like so:
use IO::File;
my $fh = IO::File->new('test.txt', 'r') or die "$!\n";
my %words;
while (<$fh>) {
    for my $word (split / /) {
        $words{$word}++;
    }
}
That will give you a count of every "word" (defined as a group of characters separated by a space) and store it in a hash keyed by the word, whose value is the number of times that word was seen.
perldoc perlrequick has an answer. The term you want in that document is "scalar context".
Given that this appears to be a homework question, I'll point you at the documentation instead.
So, what are you trying to do? You want the number of times something appears in a block of text. You can use the Perl grep function. That will go through a block of text without needing to loop.
If you want an odd/even return value, you can use the modulo arithmetic function. You can do something like this:
if ($number % 2) {
    print "$number is odd\n"; #Returns a "1" or true
}
else {
    print "$number is even\n"; #Returns a "0" or false
}