Let's say I want to find, in a large string (300,000 letters), the word "dogs" with exactly 40,000 letters between consecutive letters. So I do:
$mystring =~ m/d.{40000}o.{40000}g.{40000}s/;
This works quite well in other (slower) languages, but Perl fails with "Quantifier in {,} bigger than 32766 in regex".
So:
Can we use a bigger number as the quantifier somehow?
If not, is there another good way to find what I want? Note that "dogs" is only an example; I want to do this for any word and any jump size (and fast).
If you really need to do this fast I would look at a custom search based on the ideas of Boyer-Moore string search. A regular expression is parsed into a finite state machine. Even a clever, compact representation of such a FSM is not going to be a very effective way to execute a search like you describe.
If you really want to continue along the lines you are now you can just concatenate two expressions like .{30000}.{10000} which is the same as .{40000} in practice.
I think index might be better suited for this task. Something along the lines of the completely untested:
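sub has_dogs {
    my $str = shift;    # reference to the string, to avoid copying it
    my $start = 0;
    while (-1 < (my $pos = index $$str, 'd', $start)) {
        # reading past the end of the string yields undef
        no warnings qw(uninitialized substr);
        # 40,000 letters in between puts the next letter 40,001 positions on
        if ( ('o' eq substr($$str, $pos + 40_001, 1)) and
             ('g' eq substr($$str, $pos + 80_002, 1)) and
             ('s' eq substr($$str, $pos + 120_003, 1)) ) {
            return 1;
        }
        $start = $pos + 1;    # resume the search after this 'd'
    }
    return;
}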
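The question asks for any word and any jump size, so here is a minimal sketch of the same index/substr idea generalized. The name find_spaced_word and its argument order are made up for illustration, and it is only checked on small inputs, not benchmarked.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based position where $word starts with
# exactly $gap characters between consecutive letters, or -1 if not found.
sub find_spaced_word {
    my ($str_ref, $word, $gap) = @_;
    my @letters = split //, $word;
    return -1 unless @letters;
    my $step  = $gap + 1;                     # distance between letter positions
    my $span  = $step * $#letters;            # offset of the last letter
    my $limit = length($$str_ref) - $span;    # the first letter must start before this
    my $pos   = -1;
    while (($pos = index($$str_ref, $letters[0], $pos + 1)) != -1) {
        last if $pos >= $limit;               # not enough room left for the word
        my $ok = 1;
        for my $i (1 .. $#letters) {
            if (substr($$str_ref, $pos + $i * $step, 1) ne $letters[$i]) {
                $ok = 0;
                last;
            }
        }
        return $pos if $ok;
    }
    return -1;
}

my $text = 'xdaaoaagaasx';
print find_spaced_word(\$text, 'dogs', 2), "\n";    # prints 1
```

Passing the string by reference follows the answer above and avoids copying a 300,000-letter string on every call.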
40,000 = 2 * 20,000
/d(?:.{20000}){2}o(?:.{20000}){2}g(?:.{20000}){2}s/s
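If you would rather stay with a regex, the splitting trick can be automated. The sketch below builds the pattern for any word and gap by breaking the gap into quantifiers no larger than 32766, the limit quoted in the error message. The helper names gap_pattern and spaced_word_regex are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: express .{$gap} as a sequence of quantifiers that
# each stay at or below $max (32766 is the limit from the error message).
sub gap_pattern {
    my ($gap, $max) = @_;
    $max ||= 32766;
    my $pattern = '';
    while ($gap > 0) {
        my $n = $gap < $max ? $gap : $max;
        $pattern .= ".{$n}";
        $gap -= $n;
    }
    return $pattern;
}

# Hypothetical helper: compile a regex matching the letters of $word
# separated by exactly $gap arbitrary characters (newlines included, via /s).
sub spaced_word_regex {
    my ($word, $gap) = @_;
    my $sep  = gap_pattern($gap);
    my $body = join $sep, map { quotemeta($_) } split //, $word;
    return qr/$body/s;
}

my $re       = spaced_word_regex('dogs', 40_000);
my $haystack = join 'x' x 40_000, qw(d o g s);
print "found at ", $-[0], "\n" if $haystack =~ $re;
```

This only works around the quantifier limit; it says nothing about speed, so benchmark it against the index approach on real data.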
I have files with sequences of conversations where speakers are tagged. The format of my files is:
<SPEAKER>John</SPEAKER>
I am John
<SPEAKER>Lisa</SPEAKER>
And I am Lisa
I am now looking to identify the first sequence in each document in which John speaks and Lisa speaks right afterwards (and I then want to retain the entire part of the document that follows this sequence, including the sequence).
I built this regex:
^.*?(<SPEAKER>John<\/SPEAKER>.*?<SPEAKER>Lisa<\/SPEAKER>.*)
but it of course also captures the case where the sequence of speakers is John-Michael-Lisa, i.e. where someone speaks between John and Lisa.
How can I get the right match?
Here is a regex you can use to match what you describe (enable the single-line /s flag so that . also matches newlines):
(<SPEAKER>John<\/SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa<\/SPEAKER>.*)
And a small demo showing that it works: https://regex101.com/r/iW8vS5/1
However, as both kchinger and owler mentioned, regex probably isn't the best way to do this. A regex solution would likely be significantly slower than a small snippet of code for any long document.
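For completeness, this is roughly how the regex above could be applied from Perl. It is a sketch that assumes the whole document has already been read into one string; the sub name tail_from_john_then_lisa is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the document from the first "John, then Lisa"
# sequence onward, or undef if there is no such sequence.
sub tail_from_john_then_lisa {
    my ($doc) = @_;
    # (?:(?!<SPEAKER>).)* consumes John's words but refuses to cross
    # another <SPEAKER> tag, so the next speaker has to be Lisa.
    return $doc =~ m{(<SPEAKER>John</SPEAKER>(?:(?!<SPEAKER>).)*<SPEAKER>Lisa</SPEAKER>.*)}s
        ? $1
        : undef;
}

my $doc = join "\n",
    '<SPEAKER>John</SPEAKER>',    'Hi',
    '<SPEAKER>Michael</SPEAKER>', 'Hello',
    '<SPEAKER>John</SPEAKER>',    'I am John',
    '<SPEAKER>Lisa</SPEAKER>',    'And I am Lisa', '';
print tail_from_john_then_lisa($doc);
```

With this sample the output starts at the second John, because Michael speaks between the first John and Lisa.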
This isn't a purely regex solution, maybe someone else can do that, but instead I wrote a small loop to check each line. If it finds what you want, it will keep the rest of the document. You would need to feed it the correct sequence if it wasn't a full document. A regex to do what you want might be kind of slow since it will be relatively complicated, but you'd have to benchmark against a pure regex solution (if someone comes up with one) if speed is important.
edit to note: ?!Lisa is a negative lookahead if you haven't seen it. Some combined negative lookaheads might be what you need to use to do it in one regex, but good luck reading it later.
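open(my $input, '<', "input2.txt") || die "can't open the file: $!";
my $output = "";
my $wanted = 0;    # 0 = searching, 1 = John is speaking, 2 = sequence found
while(<$input>)
{
    if($wanted < 2)
    {
        if(/<SPEAKER>John<\/SPEAKER>/)
        {
            $wanted = 1;
            $output = "";    # restart at the most recent John
        }
        $wanted = 2 if(/<SPEAKER>Lisa<\/SPEAKER>/ && $wanted == 1);
        if(/<SPEAKER>(?!Lisa)/ && /<SPEAKER>(?!John)/ && $wanted == 1)
        {
            # someone else spoke after John: discard and keep scanning
            $wanted = 0;
            $output = "";
        }
    }
    $output = $output . $_ if($wanted);
}
print "$output" if $wanted == 2;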
I have a file containing several rows of code, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";
# Print out.
while (<IN>) {
print "Number: $1\n"
while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split that line and then pick and choose
my $cutoff = 59;
while (my $line = <IN>) {
    my @nums = split ',\s*', $line;
    print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex then you may want to use /e modifier, so in the substitution part you can run code. Then you can test the particular match and print it there.
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group, which captures the number 59 and its decimals, or two-digit numbers from 60-99 and their decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, communicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool for numeric comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59;
Assuming the line ends with a comma:
print foreach map { s/.+?(\d+\.\d+),$/$1/; $_ } <IN>;
In case there might be something after the rightmost comma:
print foreach map { s/.+?(\d+\.\d+),[^,]*$/$1/; $_ } <IN>;
But I wouldn't use a regexp in that case:
print "$_\n" foreach map { (split /,/)[-2] } <IN>;
I would suggest not using a regex when split is a better tool for the job. Likewise - regex is very bad at detecting numeric values - it works on text based patterns.
But how about:
while ( <> ) {
print ((split /,\s*/)[2],"\n");
}
If you want to test a conditional:
while ( <> ) {
my @fields = split /,\s*/;
print $fields[2],"\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However, it's probably worth noting that 3-argument open with a lexical file handle is recommended now.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.
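Putting the pieces together, here is a small self-contained sketch: a lexical file handle opened with 3-argument open, split to get the third column, and a numeric comparison. It reads from an in-memory string so it runs without file.txt, and the sub name third_column_at_least is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the third-column values that are >= $cutoff.
sub third_column_at_least {
    my ($fh, $cutoff) = @_;
    my @hits;
    while (my $line = <$fh>) {
        my @fields = split /,\s*/, $line;
        # skip lines without a numeric third column
        next unless defined $fields[2]
            && $fields[2] =~ /^\s*-?\d+(?:\.\d+)?\s*$/;
        push @hits, $fields[2] if $fields[2] >= $cutoff;
    }
    return @hits;
}

my $sample = join '',
    "160101, 0100, 58.8,\n",
    "160101, 0200, 59.3,\n",
    "160101, 0300, 59.5,\n",
    "160101, 0400, 59.1,\n";

# 3-argument open on a scalar reference reads from the string
open(my $in, '<', \$sample) or die "Can't open in-memory file: $!";
print "Number: $_\n" for third_column_at_least($in, 59);
close $in;
```

For a real file, replace \$sample with the file name.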
I have a string of characters which I want to break down in its substrings on the spaces between words, but the number of spaces spanning between a substring should not be more than 4.
E.g.: String:
"Baicalein, a specific lipoxygenase (LOX) inhibitor, has anti-inflammatory and antioxidant effects."
The resulting substrings should look like
1. Baicalein,
2. Baicalein, a
3. Baicalein, a specific
4. Baicalein, a specific lipoxygenase
5. Baicalein, a specific lipoxygenase (LOX)
6. a
7. a specific...
I feel there must be some way with Regex, but I'm not sure
EDIT
Code that I have used:
my @arr = split('\s', $line);
for(my $i=0; $i<$#arr; $i++)
{
my $str1 = $arr[$i];
my $str2 = $arr[$i].' '.$arr[$i+1];
my $str3 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2];
my $str4 = $arr[$i].' '.$arr[$i+1].' '.$arr[$i+2].' '.$arr[$i+3];
}
I have very long strings and by this approach it takes a lot of time.
Thanks in Advance
You could create an inner loop to avoid the repeated code. Also, repeatedly gluing stuff with the dot operator is less efficient.
my @substrings;
for (my $i=0; $i<=$#arr; ++$i)
{
    for (my $j=0; $j<5 && $i+$j<=$#arr; ++$j)
    {
        push @substrings, join(' ', @arr[$i..$i+$j]);
    }
}
You'll notice the additional boundary condition to prevent the inner loop from going past the end of the input array, and the use of a new array @substrings to contain the results. Finally, see how indentation helps you see what goes where.
I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, I see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern, based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{5,}//g; # strip words of more than 4 characters
$suffix =~ s/\b(?:a|the|an|of)\b//ig; # strip a few common words
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
The first two "l" parts can be collapsed with a quantified group: write the first "l" part once and follow it with {2}, e.g. (?:l[,. ]*){2}c[,. ]*.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
    my $Sq = $_;
    $Sq =~ s/[.\s]//g; # squish away . and " " in the temporary save version
    $Sq = lc($Sq);
    $Sq =~ /^llc$/ and $_ = "L.L.C.\n"; # try to match, if so save the canonical version
    $Sq =~ /^ibm/ and $_ = "IBM\n"; # a different match
    print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
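A minimal sketch of such a map in Perl, assuming the variants differ only in case, periods, commas and whitespace. The table entries and the sub name canonical_suffix are illustrative, not a complete list of suffixes.

```perl
use strict;
use warnings;

# Illustrative table only: squished lowercase form => preferred spelling.
my %canonical = (
    llc  => 'L.L.C.',
    inc  => 'Inc.',
    corp => 'Corp.',
);

# Hypothetical helper: return the canonical suffix, or undef if unknown.
sub canonical_suffix {
    my ($raw) = @_;
    (my $key = lc $raw) =~ s/[.,\s]//g;    # drop periods, commas, whitespace
    return $canonical{$key};
}

for my $raw ('l. l. c.', 'LLC,', 'l l c', 'gmbh') {
    my $hit = canonical_suffix($raw);
    printf "%-12s => %s\n", "'$raw'",
        defined $hit ? $hit : '(unknown, add a rule)';
}
```

Unknown entries come back as undef, which is the hook for the "discover new variants" tool: log them and add rules as you go.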
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl
use strict;
use warnings;
my @strings = (
"l.l.c",
"llc",
"LLC",
"lLc",
"l,l,c",
"L . L C ",
"l W c"
);
my @seps = ('.',',','\s');
my $sep_regex = '[' . join('', @seps) . ']*';
my $regex_def = join '', (
'[lL]',
$sep_regex,
'[lL]',
$sep_regex,
'[cC]'
);
print "definition: $regex_def\n";
foreach my $str (@strings) {
if ( $str =~ /$regex_def/ ) {
print "$str matches\n";
} else {
print "$str doesn't match\n";
}
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
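One common scoring system of this kind is Levenshtein edit distance (the number of single-character insertions, deletions and substitutions needed to turn one string into another). The sketch below is a plain dynamic-programming version written for illustration. CPAN has ready-made modules for this, but none is assumed here.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance between $s and $t, keeping only two rows of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));    # distance from the empty prefix of $s
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

for my $candidate ('l.l.c', 'lLc', 'l W c', 'inc') {
    printf "%-8s distance to llc: %d\n",
        $candidate, edit_distance(lc $candidate, 'llc');
}
```

A small distance relative to the string length suggests a variant of the same suffix. Where to put the threshold is a judgment call that depends on your data.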
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', @seps) . ']*';
foreach my $str (@strings) {
my $copy = $str;
$copy =~ s/$sep_regex//g;
$copy = lc $copy;
print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.
I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
One of the simplest solutions (not involving regexps):
# python
import string

def is_garbage(line):
    # garbage if more than half of the characters are punctuation
    number_of_punct = sum(1 for c in line if c in string.punctuation)
    return number_of_punct > len(line) / 2
Or, as a crude regex: s/[!,'"\@#~\$%^& ]{5,}//g
A simple heuristic, similar to the anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!@#$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
// garbage
No idea how well it would work, but I have idly considered this problem in the past. I've on occasion played with a little programmatic device called a Markov chain.
Now the wikipedia article probably won't make much sense until you see some of the other things a markov chain is good for. One example of a markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, Given the letter "T", the letter h is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelihood of it appearing. Then, you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter. Filter out surprises.
There's some leeway for how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
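To make the idea concrete, here is a rough sketch of the simplest variant (one letter following another), scoring whole lines instead of single letters. The add-one smoothing and the alphabet size of 128 are my own assumptions, as are the sub names. A real filter would need a much larger corpus and a tuned threshold.

```perl
use strict;
use warnings;

# Build the frequency table: how often each character follows another.
sub train_bigrams {
    my ($corpus) = @_;
    my (%pair, %first);
    my $text = lc $corpus;
    for my $i (0 .. length($text) - 2) {
        my $x = substr($text, $i, 1);
        my $y = substr($text, $i + 1, 1);
        $pair{$x}{$y}++;
        $first{$x}++;
    }
    return { pair => \%pair, first => \%first };
}

# Average surprise (negative log probability) per character pair.
# Higher means the line looks less like the corpus.
sub surprise {
    my ($model, $line) = @_;
    my $text  = lc $line;
    my $pairs = length($text) - 1;
    return 0 if $pairs < 1;
    my $total = 0;
    for my $i (0 .. $pairs - 1) {
        my $x    = substr($text, $i, 1);
        my $y    = substr($text, $i + 1, 1);
        my $seen = $model->{pair}{$x}{$y} // 0;
        my $from = $model->{first}{$x}    // 0;
        # add-one smoothing over an assumed alphabet of 128 symbols
        $total += -log( ($seen + 1) / ($from + 128) );
    }
    return $total / $pairs;
}

my $model = train_bigrams(
    'the quick brown fox jumps over the lazy dog and the cat sat on the mat');
printf "%-12s %.2f\n", $_, surprise($model, $_) for 'the cat', q{~;'(:};
```

Lines whose score exceeds some threshold would be dropped. With this toy corpus the gibberish line scores higher than the English one.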
I did a combo of eliminating lines that don't contain at least two 3-letter words, or one 6-letter word.
([a-zA-Z]{3,}\s){2,}|([a-zA-Z]{6,})
http://www.regexpal.com/
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;
while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
next unless length $chunk;
my @tokens = split ' ', $chunk;
# what is a word?
my @words = grep {
/^[A-Za-z]{2,}[.,]?$/
or /^[0-9]+$/
or /^(?:a|I)$/
or /^(?:[A-Z][.])+$/
} @tokens;
# completely arbitrary threshold
my $score = @words / @tokens;
print $chunk, "\n" if $score > 0.5;
}
sub read_chunk {
my ($fh) = @_;
my ($chunk, $line);
while ( my $line = <$fh> ) {
if( $line =~ /\S/ ) {
$chunk .= $line;
last;
}
}
while (1) {
$line = <$fh>;
last unless (defined $line) and ($line =~ /\S/);
$chunk .= $line;
}
return $chunk;
}
__DATA__
Paste the text above after __DATA__ above (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask user to help you and specify the type of text you work with. e.g. if it is a page from a book then you would expect the majority of lines to be the same length and mainly consisting of letters, spaces and punctuation.
Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random.
One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.
I'd claim a rule like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like
.Replace("\\p{P}{1,}[a-zA-Z0-9]{1,}", "");
Then you'd consider "any word with two or more consecutive punctuation characters":
.Replace(" \\p{P}{2,} ", "");
Seems like a good start anyway.
I like @Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).
Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non readable stuff with regex.