Regex to replace gibberish - regex

I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II

One of the simpleset solutions(not involving regexpes):
#pseudopython
number_of_punct = sum([1 if c.ispunct() else 0 for c in line])
if number_of_punct >len(line)/2: line_is_garbage()
well. Or rude regexpish s/[!,'"##~$%^& ]{5,}//g

A simple heuristic, similar to anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!#$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
// garbage

No idea how well it would work, but I have considered this problem in the past, idly. I've on occasions played with a little programmatic device called a markov chain
Now the wikipedia article probably won't make much sense until you see some of the other things a markov chain is good for. One example of a markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, Given the letter "T", the letter h is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelyhood of it appearing. Then, you can flag the letters under a particular threshold likelyhood for removal. In other words, a surprise filter. Filter out surprises.
There's some leeway for how you generate your freqency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.

I did a combo of eliminating lines that don't contain at least two 3 letter words, or one 6 letter word.
([a-z|A-Z]{3,}\s){2,}|([a-z|A-Z]{6,})
http://www.regexpal.com/

Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;
while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
next unless length $chunk;
my #tokens = split ' ', $chunk;
# what is a word?
my #words = grep {
/^[A-Za-z]{2,}[.,]?$/
or /^[0-9]+$/
or /^a|I$/
or /^(?:[A-Z][.])+$/
} #tokens;
# completely arbitrary threshold
my $score = #words / #tokens;
print $chunk, "\n" if $score > 0.5;
}
sub read_chunk {
my ($fh) = #_;
my ($chunk, $line);
while ( my $line = <$fh> ) {
if( $line =~ /\S/ ) {
$chunk .= $line;
last;
}
}
while (1) {
$line = <$fh>;
last unless (defined $line) and ($line =~ /\S/);
$chunk .= $line;
}
return $chunk;
}
__DATA__
Paste the text above after __DATA__ above (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.

Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask user to help you and specify the type of text you work with. e.g. if it is a page from a book then you would expect the majority of lines to be the same length and mainly consisting of letters, spaces and punctuation.

Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.

Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.

I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random.
One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.

I'd claim a regex like "any punctuation followed by anything except a space is spam'.
So in .NET it's possibly something like
.Replace("\\p{1,}[a-zA-Z0-9]{1,}", "");
Then you'd consider "any word with two or more punctuations consecutively:
.Replace(" \\p{2,} ", "");
Seems like a good start anyway.

I like #Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).

Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non readable stuff with regex.

Related

Need to add prefix to captured names, based on given lists

I have two arrays #Mister and #Mrs and need to add prefix based on the values.
#Mister = qw(Parasuram Raghavan Srivatsan);
#Mrs = qw(Kalai Padmini Maha);
my $str = "I was invited the doctor Parasuram and Kalai and civil Engineer Raghavan and Padmini and finally Advocate Srivatsan and Maha";
#Mr. Parasuram Mr. Raghavan Mr. Srivatsan
if(grep ($_ eq $str), #Mister)
{ $str=~s/($_)/Mr. $1/g; }
#Mrs. Kalai Mrs. Padmini Mrs. Maha`
if(grep ($_ eq $str), #Mrs)
{ $str=~s/($_)/Mrs. $1/g; }
Output Should be:
I was invited the doctor Mr. Parasuram and Mrs. Kalai and civil Engineer Mr. Raghavan and Mrs. Padmini and finally Advocate Mr. Srivatsan and Mrs. Maha
Could someone simplify the way I am doing and whats wrong in this code.
A simple take
my $mr_re = join '|', #Mister;
my $mrs_re = join '|', #Mrs;
$str =~ s/\b($mr_re)\b/Mr. $1/g;
$str =~ s/\b($mrs_re)\b/Ms. $1/g;
(note that I used the neutral Ms above instead of Mrs)
However, when we consider the bewildering complexity of names, the \b doesn't take care of all ways for a name to contain another. An easy example: the - is readily found in names and \b is an anchor between \w and \W, where \w does not include -.
Thus Name-Another would be matched by Name alone as well.
If there are characters other than alphanumeric (plus _) that can be inside names consider
my $w_re = /[a-z-]/i; # list all characters that can be in a name
$str =~ s/(?<!$w_re)($mr_re)(?!$w_re)/Mr. $1/g; # same for Ms.
where negative lookarounds ?<! and ?! are assertions that match your non-name characters (those not listed in $w_re) but do not consume them. Thus they delimit acceptable names.
The same holds for accents, and yet many other characters used in names in various cultures. The task of forming a satisfactory $w_re may be a tricky one even for one particular use case.
If names can come in multiple words (with spaces), in order to handle names within others you would have to parse them in general. That is a complex task; seek modules as little regex won't cut it.
A simple fix would be to preprocess lists to check for names with multiple words that contain other names from your lists, and to handle that case by case.
For your example with hard coded and verifiable names the above works. However, in general, when assembling a regex from strings make sure that all (ascii) non-word chars are escaped so that you actually have the intended literal characters without a special meaning
my $mr_re = join '|', map { quotemeta } #Mister;
my $mrs_re = join '|', map { quotemata } #Mrs;
See quotemeta; inside a regex use \Q, see it in perlbackslash and in perlre.
Note that this problem critically relies on sensible input.
If names are duplicated in lists the problem is ill-posed: If they repeat in the sentence it is unknown which is which, if they don't it is unknown whether it is Mr. or Ms. Thus the name lists should be first checked for duplicates.
"Could someone simplify the way I am doing and whats wrong in this code."
The first part is addressed by zdim in a way I would do it too, but the "what's wrong" part could get some more addressing, in my opinion (just nitpicking, but maybe useful for someone):
if(grep ($_ eq $str), #Mister)
{ $str =~ s/($_)/Mr. $1/g; }
Your list entries will never equal the $str, I think you meant $str =~/$_/
Either use an additional pair of parenthesis around both condition and #list or use the block form of grep (grep { $str =~ /$_/ } #Mister) - otherwise grep will miss the list as argument, since it takes the one existing pair as limiter for it's argument list right now.
the $_ used in the grep command is not available outside of the command, so the $str-substitution would use whatever the value of $_ is currently. In the example it would most likely be undef, so that between each character in the former $str 'Mr. ' is inserted.
Like I said: A perfectly good solution to your problem is given in zdim's answer, but you also asked "what's wrong in this code".
#ssr1012 and other readers: Be careful! It's tempting to think there is a universal solution for this problem. But, unfortunately, even #zdim's approach will give undesirable results if the same name appears in both arrays, and it is still tricky if a name in one array is the same as a name in the other array except for a few additional characters at the start or end.
Here's your example using slightly different names:
my #Mister = qw(Parasuram Mahan Srivatsan);
my #Mrs = qw(Kalai Padmini Maha);
...
# I was invited the doctor Mr. Parasuram and Ms. Kalai and civil Engineer Mr. Ms. Mahan and Ms. Padmini and finally Advocate Mr. Srivatsan and Ms. Maha
See the "Mr. Ms. Mahan"? You don't have enough information for a universal solution. This is only reliable if your names are hard-coded and checked first to avoid collisions.
Even if you added first names, you might not have enough information - guessing gender from first names is unreliable in many language cultures.

Perl Regex match something, but make sure that the match string does not contain a string

I have files with sequences of conversations where speakers are tagged. The format of my files is:
<SPEAKER>John</SPEAKER>
I am John
<SPEAKER>Lisa</SPEAKER>
And I am Lisa
I am now looking to identify the first sequence in each document in which John speaks and Lisa speaks right afterwards (and I then want to then retain the entire part of the document that follows this sequence, including the sequence).
I built this regex:
^.*?(<SPEAKER>John<\/SPEAKER>.*?<SPEAKER>Lisa<\/SPEAKER>.*)
but it of course also captures the case where there is a sequence of speakers is John-Michael-Lisa, i.e. where there is someone speaking between John and Lisa.
How can I get the right match?
Here is a regex you can use to match what you describe:
(<SPEAKER>John<\/SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa<\/SPEAKER>.*)
And a small demo showing that it works: https://regex101.com/r/iW8vS5/1
However, as both kchinger and owler mentioned, regex probably isn't the best way to do this. A regex solution would likely be significantly slower than a small snippet of code for any long document.
This isn't a purely regex solution, maybe someone else can do that, but instead I wrote a small loop to check each line. If it finds what you want, it will keep the rest of the document. You would need to feed it the correct sequence if it wasn't a full document. A regex to do what you want might be kind of slow since it will be relatively complicated, but you'd have to benchmark against a pure regex solution (if someone comes up with one) if speed is important.
edit to note: ?!Lisa is a negative lookahead if you haven't seen it. Some combined negative lookaheads might be what you need to use to do it in one regex, but good luck reading it later.
open(my $input,"input2.txt")||die "can't open the file";
my $output = "";
my $wanted = 0;
while(<$input>)
{
$wanted = 1 if(/<SPEAKER>John<\/SPEAKER>/);
$wanted = 2 if(/<SPEAKER>Lisa<\/SPEAKER>/ && $wanted == 1);
if(/<SPEAKER>(?!Lisa)/ && /<SPEAKER>(?!John)/ && $wanted == 1)
{
$wanted = 0;
last;
}
$output = $output . $_ if($wanted);
}
print "$output" if $wanted;

How to implement for 10 lines and 79 characters in each line in perl

There is a variable which can have a maximum of 10 lines and each line can have a maximum of 79 characters. Beyond, 10th line and beyond 79th character of each line, nothing should get displayed. How to implement this in perl. I have no idea for how to implement for 10 lines. Can anybody help me out with this ? I am not finding any solution for the same.
The code for counting the number of characters will be :
#!/usr/bin/perl
my $string = "As a describer of life and manners, he must be allowed to stand perhaps the first of the first rank. His humour, which, as Steele observes, is peculiar to himself, is so happily diffused as to give the grace of novelty to domestic scenes and daily occurrences. He never "o'ersteps the modesty of nature," nor raises merriment or wonder by the violation of truth. His figures neither divert by distortion nor amaze by aggravation. He copies life with so much fidelity that he can be hardly said to invent; yet his exhibitions have an air so much original, that it is difficult to suppose them not merely the product of imagination"
if(length($string) > 79)
{
$string = substr($string,0,79);
}
print "my string is :",$string;
But for , lines how to check ? and How to club it with lines code ?
Assuming that you want the string split at word boundaries and reformatted so that internal newlines are treated as blanks and that you want to print at most 10 lines each with at most 79 characters plus the newline, then this code seems to do the job. Note that the string in the question contains both single and double quotes, so I used q{} to delimit the string.
#!/usr/bin/env perl
use strict;
use warnings;
use constant MAXLINELEN => 79;
use constant MAXNUMLINES => 10;
my $string = q{As a describer of life and manners, he must be allowed to stand perhaps the first of the first rank. His humour, which, as Steele observes, is peculiar to himself, is so happily diffused as to give the grace of novelty to domestic scenes and daily occurrences. He never "o'ersteps the modesty of nature," nor raises merriment or wonder by the violation of truth. His figures neither divert by distortion nor amaze by aggravation. He copies life with so much fidelity that he can be hardly said to invent; yet his exhibitions have an air so much original, that it is difficult to suppose them not merely the product of imagination};
sub print_up_to_10_lines_of_79_chars_split_at_words
{
my($string) = #_;
my(#words) = split /\s+/, $string;
my $line_num = 0;
my $line_len = 0;
my $pad = "";
foreach my $word (#words)
{
my $len = length($word);
if ($line_len + length($pad) + $len > MAXLINELEN)
{
last if (++$line_num >= MAXNUMLINES);
print "\n";
$pad = "";
$line_len = 0;
}
printf "%s%s", $pad, $word;
$line_len += length($pad) + $len;
$pad = " ";
}
print "\n";
}
print "First string: (", length($string), ")\n";
print_up_to_10_lines_of_79_chars_split_at_words($string);
$string .= ". $string.";
print "Second string: (", length($string), ")\n";
print_up_to_10_lines_of_79_chars_split_at_words($string);
Sample output:
First string: (629)
As a describer of life and manners, he must be allowed to stand perhaps the
first of the first rank. His humour, which, as Steele observes, is peculiar to
himself, is so happily diffused as to give the grace of novelty to domestic
scenes and daily occurrences. He never "o'ersteps the modesty of nature," nor
raises merriment or wonder by the violation of truth. His figures neither
divert by distortion nor amaze by aggravation. He copies life with so much
fidelity that he can be hardly said to invent; yet his exhibitions have an air
so much original, that it is difficult to suppose them not merely the product
of imagination
Second string: (1261)
As a describer of life and manners, he must be allowed to stand perhaps the
first of the first rank. His humour, which, as Steele observes, is peculiar to
himself, is so happily diffused as to give the grace of novelty to domestic
scenes and daily occurrences. He never "o'ersteps the modesty of nature," nor
raises merriment or wonder by the violation of truth. His figures neither
divert by distortion nor amaze by aggravation. He copies life with so much
fidelity that he can be hardly said to invent; yet his exhibitions have an air
so much original, that it is difficult to suppose them not merely the product
of imagination. As a describer of life and manners, he must be allowed to stand
perhaps the first of the first rank. His humour, which, as Steele observes, is
If your requirements differ from the assumptions I stated, then obviously the code has to be changed, but you have to state your requirements precisely. It would, for example, be perfectly feasible to build an answer string that, given a long input, contains the output rather than printing to standard output. If your splitting requirements are different, the processing will be different.
printf can be used to truncate a string:
printf "%.79s", $string; # Limit the string to 79 characters
To print only 10 lines you'll need to employ a loop of some kind. Here's an example using a foreach loop and a counter:
use strict;
use warnings;
my #lines = ...;
my $line_num = 0;
for my $line (#lines) {
last if ++$line_num > 10; # Increment counter and exit loop after 10th line
printf "%.79s", $line;
}
Alternatively, use splice to only take 10 lines:
for my $line (splice #lines, 10) {
printf "%.79s", $line;
}

How to use Perl to parse specified formatted text with regex?

Question abstract:
how to parse text file into two "hashes" in Perl. One store key-value pairs taken from the (X=Y) part, another from the (X:Y) part?
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
they are kept in one file, and only the symbol between the two digits denotes the difference.
===============================================================================
I just spent around 30 hours learning Perl during last semester and managed to finish my Perl assignment in an "head first, ad-hoc, ugly" way.
Just received my result for this section as 7/10, to be frank, I am not happy with this, particularly because it recalls my poor memory of trying to use Regular Expression to deal with formatted data, which rule is like this :
1= (the last digit in your student ID,or one if this digit is zero)
2= (the second last digit in your student ID,or one if this digit is zero)
3= (the third last digit in your student ID, or one if this digit is zero)
4= (the forth last digit in your student ID, or one if this digit is zero)
2:1
3:1
4:1
1:2
1:3
1:4
2:3 (if the last digit in your student ID is between 0 and 4) OR
3:4 (if the last digit in your student ID is between 5 and 9)
3:2 (if the second last digit in your student ID is between 0 and 4) OR
4:3 (if the second last digit in your student ID is between 5 and 9)
An example of the above configuration file: if your student ID is 10926029, it has to be:
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
The assignment was about Pagerank calculation, which algorithm is simplified so I came up with the answer to that part in 5 minutes. However, it was the text parsing part that took me heaps of time.
The first part of the text (Page=Pagerank) denotes the pages and their corresponding pageranks.
The second part (FromNode:ToNode) denotes the direction of a link between two pages.
For a better understanding, please go to my website and check the requirement file and my Perl script here
There are massive comments in the script so I reckon it is not hard at all to see how stupid I was in my solution :(
If you are still on this page, let me justify why I ask this question here in SO:
I got nothing else but "Result 7/10" with no comment from uni.
I am not studying for uni, I am learning for myself.
So, I hope the Perl gurus can at least guide me the right direction toward solving this problem. My stupid solution was sort of "generic" and probable would work in Java, C#, etc. I am sure that is not even close to the nature of Perl.
And, if possible, please let me know the level of solution, like I need to go through "Learning Perl ==> Programming Perl ==> Master Perl" to get there :)
Thanks for any hint and suggestion in advance.
Edit 1:
I have another question posted but closed here, which describes pretty much like how things go in my uni :(
Is this what you mean? The regex basically has three capture groups (denoted by the ()s). It should capture one digit, followed by either = or : (that's the capture group wrapping the character class [], which matches any character within it), followed by another single digit.
my ( %assign, %colon );
while (<DATA>) {
chomp;
my ($l, $c, $r) = $_ =~ m/(\d)([=:])(\d)/;
if ( q{=} eq $c ) { $assign{$l} = $r; }
elsif ( q{:} eq $c ) { $colon{$l} = $r; }
}
__DATA__
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
As for the recommendation, grab a copy of Mastering Regular Expressions if you can. It's very...thorough.
Well, if you don't want to validate any restrictions on the data file, you can parse this data pretty easily. The main issue lies in selecting the appropriate structure to store your data.
use strict;
use warnings;
use IO::File;
my $file_path = shift; # Take file from command line
my %page_rank;
my %links;
my $fh = IO::File->new( $file_path, '<' )
or die "Error opening $file_path - $!\n";
while ( my $line = $fh->readline ) {
chomp $line;
next unless $line =~ /^(\d+)([=:])(\d+)$/; # skip invalid lines
my $page = $1;
my $delimiter = $2;
my $value = $3;
if( $delimiter eq '=' ) {
$page_rank{$page} = $value;
}
elsif( $delimiter eq ':' ) {
$links{$page} = [] unless exists $links{$page};
push #{ $links{$page} }, $value;
}
}
use Data::Dumper;
print Dumper \%page_rank;
print Dumper \%links;
The main way that this code differs from Pedro Silva's is that mine is more verbose and it also handles multiple links from one page properly. For example, my code preserves all values for links from page 1. Pedro's code discards all but the last.

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, i see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern , based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be simplified by [the first "l" part here]{2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
$Sq = $_;
$Sq =~ s/[.\s]//g; # squish away . and " " in the temporary save version
$Sq = lc($Sq);
/^llc$/ and $_ = 'L.L.C.'; # try to match, if so save the canonical version
/^ibm/ and $_ = 'IBM'; # a different match
print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my #strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my #seps = ('.',',','\s');
my $sep_regex = '[' . join('', #seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (#strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', #seps) . ']*';
foreach my $str (#strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.