Need string matching algorithm - regex

Data:
x.txt: a simple text file (around 1 MB)
y.txt: a dictionary file (around 100,000 words, i.e. 1 lakh).
I need to find whether any of the words in y.txt is present in x.txt.
I need an algorithm that takes as little execution time as possible, and a suggestion for the language best suited to it.
PS: Please suggest any algorithm apart from the BRUTE FORCE METHOD.
I need pattern matching rather than exact string matching.
For instance:
x.txt: "The Old Buzzards were disestablished on 27 April"
y.txt: "establish"
Output should be: Found establish in x.txt : Line 1
Thank you.

It is not clear to me whether you need this to get a job done or whether it is homework. If you need it to get a job done, then:
#!/usr/bin/env bash
# Build one big alternation regex from the dictionary: newline -> '|'
Y=$(tr '\n' '|' < y.txt)
echo "${Y%|}"    # show the constructed regex (trailing '|' stripped)
if grep -E "${Y%|}" x.txt
then
    echo "found"
else
    echo "no luck"
fi
is hard to beat: you slurp in all the patterns from a file, construct a single regular expression (the echo shows it), and hand it to grep, which constructs a finite state automaton for you. That is going to fly, as it compares every character in the text at most once. If it is homework, then I suggest you consult Cormen et al., 'Introduction to Algorithms', or the first few chapters of the Dragon Book, which will also explain what I just said.
Forgot to add: y.txt should contain your patterns one per line, but as a nice side effect your patterns do not have to be single words.

Supposing you have any Set implementation in your standard library, here is some pseudo-code:
dictionary = empty set

def populate_dict():
    for word in dict_file:
        add(dictionary, word)

def validate_text(text_file):
    for word in text_file:         ### O(|text_file|)
        if word in dictionary:     ### O(log |dictionary|)
            report(word)

populate_dict()
every_now_and_then(populate_dict)
That would give you O(t * log d) instead of the brute-force O(t * d), where t and d are the lengths of the input text file and dictionary respectively. I don't think anything faster is possible, since you can't read the file faster than O(t) and can't search faster than O(log d).
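For concreteness, here is a runnable Python rendering of the pseudo-code above (the file names are taken from the question; splitting on whitespace is a simplifying assumption, and note that Python's hash-based set actually gives average O(1) membership tests rather than O(log d)):

dictionary = set()

# Build the dictionary set from y.txt, one word per line.
with open('y.txt') as f:
    for word in f:
        dictionary.add(word.strip())

# Scan x.txt and report every word found in the dictionary.
with open('x.txt') as f:
    for lineno, line in enumerate(f, 1):
        for word in line.split():
            if word in dictionary:
                print("Found {} in x.txt : Line {}".format(word, lineno))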

This is a search algorithm I had in mind for a while.
Basically, the algorithm works in two steps.
In the first step, all the words from y.txt are inserted into a tree. Every path in the tree from the root to a leaf is a word. The leaf is empty.
For example, the tree for the words dog and day is the following.
<root>--<d>--<a>--<y>--<>
          \--<o>--<g>--<>
The second part of the algorithm is a search down the tree. When you reach an empty leaf, you have found a word.
Here is the implementation in Groovy; if more comments are needed, just ask.
//create a tree to store the words in a compact and fast to search way
//each path of the tree from root to an empty leaf is a word
def tree = [:]
new File('y.txt').eachLine { word ->
    def t = tree
    word.each { c ->
        if (!t[c]) {
            t[c] = [:]
        }
        t = t[c]
    }
    t[0] = 0 //word terminator (the leaf)
}
println tree //for debug purpose

//search for the words in x.txt
new File('x.txt').eachLine { str, line ->
    for (int i = 0; i < str.length(); i++) {
        if (tree[str[i]]) {
            def t = tree[str[i]]
            def res = str[i]
            def found = false
            for (int j = i + 1; j < str.length(); j++) {
                if (t[str[j]] == null) {
                    if (found) {
                        println "Found $res at line $line, col $i"
                        res = str[j]
                        found = false
                    }
                    break
                } else if (t[str[j]][0] == 0) {
                    found = true
                    res += str[j]
                    t = t[str[j]]
                    continue
                } else {
                    t = t[str[j]]
                    res += str[j]
                }
                found = false
            }
            if (found) println "Found $res at line $line, col $i" //I know, an ugly repetition, it's for words at the end of a line. I will fix this later
        }
    }
}
This is my y.txt:
dog
day
apple
daydream
and this is my x.txt:
This is a beautiful day and I'm walking with my dog while eating an apple.
Today it's sunny.
It's a daydream
The output is the following:
$ groovy search.groovy
[d:[o:[g:[0:0]], a:[y:[0:0, d:[r:[e:[a:[m:[0:0]]]]]]]], a:[p:[p:[l:[e:[0:0]]]]]]
Found day at line 1, col 20
Found dog at line 1, col 48
Found apple at line 1, col 68
Found day at line 2, col 2
Found daydream at line 3, col 7
This algorithm should be fast because the depth of the tree doesn't depend on the number of words in y.txt. The depth is equal to the length of the longest word in y.txt.


match words with few differences allowed

I was wondering if there is any tool to match almost the same word from a bash terminal.
The following file, called list.txt, contains one word per line:
ban
1ban
12ban
12ban3
It is easy to find the words containing "ban":
grep "ban" list.txt
Question:
How do I match words that have at most X letters of difference?
With the search word "ban", I expect the match "1ban" for X=1.
Concerning the notion of distance, I want a maximum of:
X deletions
or X substitutions
or X insertions
Any tool will do, but preferably something you can call from the command line in a bash terminal.
NOTE: The Levenshtein distance will count an insertion of 2 letters as 1 difference. This is not what I want.
You may use the Python PyPI regex module, which supports fuzzy matching.
Since you actually want to match words with at most X differences (X substitutions OR X insertions OR X deletions), you may create a Python script like this:
#!/usr/bin/env python3
import regex, sys

def main(argv):
    if len(argv) < 3:
        print("USAGE: fuzzy_search -searchword -xdiff -file")
        exit(-1)
    search = argv[0]
    xdiff = argv[1]
    file = argv[2]
    # print("Searching for {} in {} with {} differences...".format(search, file, xdiff))
    with open(file, "r") as f:
        contents = f.read()
    print(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents))

if __name__ == "__main__":
    main(sys.argv[1:])
Here, {s<=1,i<=1,d<=1} means we allow 1 or 0 substitutions (s<=1), 1 or 0 insertions (i<=1), or 1 or 0 deletions (d<=1) in the word we search for.
The \b are word boundaries; thanks to that construct, only whole words are matched (no "cat" in "vacation" will get matched).
Save as fuzzy_search.py.
Then, you may call it as
python3 fuzzy_search.py "ban" 1 file
where "ban" is the word the fuzzy search is being performed for and 1 is the higher limit of differences.
The result I get is
['ban', '1ban']
You may change the output format to one match per line:
print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))
Then, the result is
ban
1ban
Alternatively, you can check the difference character by character in Python, as shown below:
def is_diff(str1, str2):
    # True when the strings differ by exactly one substitution at the
    # same position (insertions and deletions are not handled).
    diff = False
    for char1, char2 in zip(str1, str2):
        if char1 != char2:
            if diff:
                return False
            diff = True
    return diff

with open('list.txt') as f:
    for line in f:
        print(is_diff('ban', line.strip()))

How do I delete the first 2 lines which match a text given by me (using sed)?

E.g., file.txt contains the following lines:
abc
def
def
abc
abc
def
and I want to delete the first 2 "abc" lines, using sed.
While @EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurrences "N" through "M" of a line matching a given pattern using sed
is indeed a very tricky one, in my opinion. There seem to be many suggestions for how to replace the "N"th occurrence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurrences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
    # First match found. Skip line and jump to second section
    n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
    # Second match found. Skip line and jump to final section
    n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def
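For comparison, here is the same "drop the first two matching lines" logic as a short Python sketch (the file name is taken from the question; this is an illustration of the awk logic, not a replacement for the one-liner):

count = 0
with open('file.txt') as f:
    for line in f:
        # Skip a matching line while fewer than 2 matches have been skipped.
        if 'abc' in line and count < 2:
            count += 1
            continue
        print(line, end='')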

Join lines after specific word till another specific word

I have a .txt file of a transcript that looks like this
MICHAEL: blablablabla.
further talk by Michael.
more talk by Michael.
VALERIE: blublublublu.
Valerie talks more.
MICHAEL: blibliblibli.
Michael talks again.
........
All in all, this pattern goes on for up to 4000 lines, and not with just two speakers but with up to seven, all with unique names written in upper-case letters (as in the example above).
For some text mining I need to rearrange this .txt file in the following way
Join the lines following one speaker - but only the ones that still belong to him - so that the above file looks like this:
MICHAEL: blablablabla. further talk by Michael. more talk by Michael.
VALERIE: blublublublu. Valerie talks more.
MICHAEL: blibliblibli. Michael talks again.
Sort the now properly joined lines of the .txt file alphabetically, so that all lines spoken by one speaker end up together. However, the sort should not reorder the sentences spoken by each speaker (after the speakers' lines have been grouped).
I know some basic vim commands, but not enough to figure this out, especially the first part. I do not know what kind of pattern I can use in vim so that it only joins the lines belonging to each speaker.
Any help would be greatly appreciated!
Alright, first the answer:
:g/^\u\+:/,/\n\u\+:\|\%$/join
And now the explanation:
g stands for global and executes the following command on every line that matches
/^\u\+:/ is the pattern :g searches for: ^ is start of line, \u is an upper-case character, \+ means one or more matches, and : is, unsurprisingly, :
then comes the tricky bit: we give the executed command a range, from the match to some other pattern match. /\n\u\+:\|\%$/ consists of two parts separated by the pipe \|. \n\u\+: is a newline followed by the previous pattern, i.e. the line before the next speaker. \%$ is the end of the file
join does what it says on the tin
So to put it together: For each speaker, join until the line before the next speaker or the end of the file.
The closest to the sorting I know of is
:sort /\u\+:/ r
which will sort only by the speaker name but may reorder the rest of the lines, so it isn't really what you are looking for
Well, I don't know much about vim, but I was able to match the lines corresponding to a particular speaker, and here is the regex for that.
Regex: /([A-Z]+:)([A-Za-z\s\.]+)(?!\1)$/gm
Explanation:
([A-Z]+:) captures the speaker's name which contains only capital letters.
([A-Za-z\s\.]+) captures the dialogue.
(?!\1)$ uses a backreference to the speaker's name and checks whether the next speaker is the same as the last one. If not, it matches until a new speaker is found.
I hope this will help you with matching at least.
In vim you might take a two-step approach: first, replace all newlines.
:%s/\n\+/ /g
Then insert a new line before the terms UPPERCASE: except the first one:
:%s/ \([[:upper:]]\+:\)/\r\1/g
For the sorting you can use Vim's built-in sort:
:%sort!
You can combine them using a pipe symbol:
:%s/\n\+/ /g | %s/ \([[:upper:]]\+:\)/\r\1/g | %sort!
and map them to a key in your vimrc file:
:nnoremap <F5> :%s/\n\+/ /g \| %s/ \([[:upper:]]\+:\)/\r\1/g \| %sort! <CR>
If you press F5 in normal mode, the transformation happens. Note that the | needs to get escaped in the nnoremap command.
Here is a script solution to your problem.
It's not well tested, so I added some comments so you can fix it easily.
To make it run, just:
fill the g:speakers var at the top of the script with the uppercase names you need;
source the script (ex: :sav /tmp/script.vim|so %);
run :call JoinAllSpeakLines() to join the lines by speakers;
run :call SortSpeakLines() to sort
You may adapt the different patterns to better fit your needs, for example adding some space tolerance (\u\{2,}\s*\ze:).
Here is the code:
" Fill the following array with all the speakers names:
let g:speakers = [ 'MICHAEL', 'VALERIE', 'MATHIEU' ]
call sort(g:speakers)
function! JoinAllSpeakLines()
" In the whole file, join all the lines between two uppercase speaker names
" followed by ':', first inclusive:
silent g/\u\{2,}:/call JoinSpeakLines__()
endf
function! SortSpeakLines()
" Sort the whole file by speaker, keeping the order for
" each speaker.
" Must be called after JoinAllSpeakLines().
" Create a new dict, with one key for each speaker:
let speakerlines = {}
for speaker in g:speakers
let speakerlines[speaker] = []
endfor
" For each line in the file:
for line in getline(1,'$')
let speaker = GetSpeaker__(line)
if speaker == ''
continue
endif
" Add the line to the right speaker:
call add(speakerlines[speaker], line)
endfor
" Delete everything in the current buffer:
normal gg"_dG
" Add the sorted lines, speaker by speaker:
for speaker in g:speakers
call append(line('$'), speakerlines[speaker])
endfor
" Delete the first (empty) line in the buffer:
normal gg"_dd
endf
function! GetOtherSpeakerPattern__(speaker)
" Returns a pattern which matches all speaker names, except the
" one given as a parameter.
" Create an new list with a:speaker removed:
let others = copy(g:speakers)
let idx = index(others, a:speaker)
if idx != -1
call remove(others, idx)
endif
" Create and return the pattern list, which looks like
" this : "\v<MICHAEL>|<VALERIE>..."
call map(others, 'printf("<%s>:",v:val)')
return '\v' . join(others, '|')
endf
function! GetSpeaker__(line)
" Returns the uppercase name followed by a ':' in a line
return matchstr(a:line, '\u\{2,}\ze:')
endf
function! JoinSpeakLines__()
" When cursor is on a line with an uppercase name, join all the
" following lines until another uppercase name.
let speaker = GetSpeaker__(getline('.'))
if speaker == ''
return
endif
normal V
" Search for other names after the cursor line:
let srch = search(GetOtherSpeakerPattern__(speaker), 'W')
echo srch
if srch == 0
" For the last one only:
normal GJ
else
normal kJ
endif
endf

How to use Perl to parse specified formatted text with regex?

Question abstract:
How do I parse a text file into two "hashes" in Perl, one storing key-value pairs taken from the (X=Y) part, the other from the (X:Y) part?
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
They are kept in one file, and only the symbol between the two digits tells them apart.
===============================================================================
I just spent around 30 hours learning Perl last semester and managed to finish my Perl assignment in a "head-first, ad-hoc, ugly" way.
I just received my result for this section: 7/10. To be frank, I am not happy with it, particularly because it recalls my painful memory of trying to use regular expressions to deal with formatted data, whose rules are like this:
1= (the last digit in your student ID, or one if this digit is zero)
2= (the second last digit in your student ID, or one if this digit is zero)
3= (the third last digit in your student ID, or one if this digit is zero)
4= (the fourth last digit in your student ID, or one if this digit is zero)
2:1
3:1
4:1
1:2
1:3
1:4
2:3 (if the last digit in your student ID is between 0 and 4) OR
3:4 (if the last digit in your student ID is between 5 and 9)
3:2 (if the second last digit in your student ID is between 0 and 4) OR
4:3 (if the second last digit in your student ID is between 5 and 9)
An example of the above configuration file: if your student ID is 10926029, it has to be:
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
The assignment was about PageRank calculation; the algorithm was simplified, so I came up with the answer to that part in 5 minutes. However, it was the text parsing part that took me heaps of time.
The first part of the text (Page=Pagerank) denotes the pages and their corresponding pageranks.
The second part (FromNode:ToNode) denotes the direction of a link between two pages.
For a better understanding, please go to my website and check the requirement file and my Perl script here
There are massive comments in the script, so I reckon it is not hard at all to see how stupid my solution was :(
If you are still on this page, let me justify why I am asking this question on SO:
I got nothing but "Result 7/10", with no comment, from uni.
I am not studying for uni; I am learning for myself.
So, I hope the Perl gurus can at least point me in the right direction toward solving this problem. My stupid solution was sort of "generic" and would probably work in Java, C#, etc. I am sure that is not even close to the nature of Perl.
And, if possible, please let me know the level of the solution, like whether I need to go through "Learning Perl ==> Programming Perl ==> Mastering Perl" to get there :)
Thanks in advance for any hints and suggestions.
Edit 1:
I have another question posted but closed here, which describes pretty much how things go at my uni :(
Is this what you mean? The regex basically has three capture groups (denoted by the ()s). It should capture one digit, followed by either = or : (that's the capture group wrapping the character class [], which matches any character within it), followed by another single digit.
my ( %assign, %colon );
while (<DATA>) {
    chomp;
    my ( $l, $c, $r ) = $_ =~ m/(\d)([=:])(\d)/;
    if    ( q{=} eq $c ) { $assign{$l} = $r; }
    elsif ( q{:} eq $c ) { $colon{$l}  = $r; }
}
__DATA__
1=9
2=2
3=1
4=6
2:1
3:1
4:1
1:2
1:3
1:4
3:4
3:2
As for the recommendation, grab a copy of Mastering Regular Expressions if you can. It's very...thorough.
Well, if you don't want to validate any restrictions on the data file, you can parse this data pretty easily. The main issue lies in selecting the appropriate structure to store your data.
use strict;
use warnings;
use IO::File;

my $file_path = shift;    # Take file from command line

my %page_rank;
my %links;

my $fh = IO::File->new( $file_path, '<' )
    or die "Error opening $file_path - $!\n";

while ( my $line = $fh->readline ) {
    chomp $line;
    next unless $line =~ /^(\d+)([=:])(\d+)$/;    # skip invalid lines
    my $page      = $1;
    my $delimiter = $2;
    my $value     = $3;
    if ( $delimiter eq '=' ) {
        $page_rank{$page} = $value;
    }
    elsif ( $delimiter eq ':' ) {
        $links{$page} = [] unless exists $links{$page};
        push @{ $links{$page} }, $value;
    }
}

use Data::Dumper;
print Dumper \%page_rank;
print Dumper \%links;
The main way that this code differs from Pedro Silva's is that mine is more verbose and it also handles multiple links from one page properly. For example, my code preserves all values for links from page 1. Pedro's code discards all but the last.

Regex to replace gibberish

I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
One of the simplest solutions (not involving regexes):
# Python (str has no ispunct(); use string.punctuation instead):
import string
number_of_punct = sum(1 for c in line if c in string.punctuation)
if number_of_punct > len(line) / 2:
    line_is_garbage()
Well, or the rude regexpish way: s/[!,'"@#~$%^& ]{5,}//g
A simple heuristic, similar to the anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C..Z, ...]   // alphanumeric symbols
listB = [!#$%^&...]                           // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
    // garbage
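A minimal Python rendering of that heuristic; treating whitespace as neutral and the default garbage_ratio of 1.0 are my own assumptions, not tested values:

def is_garbage(line, garbage_ratio=1.0):
    # Na: alphanumeric symbols; Nb: everything else (whitespace ignored).
    na = sum(1 for c in line if c.isalnum())
    nb = sum(1 for c in line if not c.isalnum() and not c.isspace())
    return nb > 0 and na / nb <= garbage_ratio

On the sample above, a line like ", ':, Ie" comes out well below the threshold, while the prose lines contain almost no non-alphanumeric symbols.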
No idea how well it would work, but I have considered this problem in the past, idly. I've on occasion played with a little programmatic device called a Markov chain.
Now the Wikipedia article probably won't make much sense until you see some of the other things a Markov chain is good for. One example of a Markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So, for instance, given the letter "T", the letter "h" is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text and rate each letter according to the likelihood of it appearing. Then you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter: filter out surprises.
There's some leeway in how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
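To make the idea concrete, here is a small Python sketch of such a "surprise filter" using letter-bigram frequencies. The file names and the cut-off of 12 are placeholders I made up, not tuned values:

import math
from collections import Counter

def bigram_model(text):
    # Relative frequency of each adjacent character pair in the corpus.
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    return {p: n / total for p, n in pairs.items()}

def surprise(line, model, floor=1e-8):
    # Average negative log-probability per bigram; higher means more surprising.
    pairs = list(zip(line, line[1:]))
    if not pairs:
        return float('inf')
    return -sum(math.log(model.get(p, floor)) for p in pairs) / len(pairs)

# 'reference.txt' (known-good text) and 'ocr_output.txt' are assumed names;
# tune the threshold on real data.
model = bigram_model(open('reference.txt').read())
for line in open('ocr_output.txt'):
    line = line.rstrip('\n')
    if line and surprise(line, model) < 12:
        print(line)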
I did a combo of eliminating lines that don't contain at least two 3-letter words or one 6-letter word (note that inside a character class | is just a literal character, so [a-zA-Z] is what's wanted):
([a-zA-Z]{3,}\s){2,}|[a-zA-Z]{6,}
http://www.regexpal.com/
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;

while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
    next unless length $chunk;
    my @tokens = split ' ', $chunk;
    # what is a word?
    my @words = grep {
        /^[A-Za-z]{2,}[.,]?$/
            or /^[0-9]+$/
            or /^(?:a|I)$/
            or /^(?:[A-Z][.])+$/
    } @tokens;
    # completely arbitrary threshold
    my $score = @words / @tokens;
    print $chunk, "\n" if $score > 0.5;
}

sub read_chunk {
    my ($fh) = @_;
    my ($chunk, $line);
    while ( my $line = <$fh> ) {
        if ( $line =~ /\S/ ) {
            $chunk .= $line;
            last;
        }
    }
    while (1) {
        $line = <$fh>;
        last unless ( defined $line ) and ( $line =~ /\S/ );
        $chunk .= $line;
    }
    return $chunk;
}
__DATA__
Paste the text from the question after __DATA__ (not repeating it here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file, etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask the user to help you by specifying the type of text they work with. E.g., if it is a page from a book, then you would expect the majority of lines to be the same length and to consist mainly of letters, spaces and punctuation.
Well, a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seem to be a lot of line breaks where the gibberish is, so that may be an indicator too.
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could count letter and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags a line as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
I guess that a regex would not help here. A regex basically matches deterministic input, i.e. it has a predefined set of patterns that it will match, whereas gibberish is in most cases random.
One way would be to invert the problem, i.e. match the relevant text instead of matching the gibberish. As a rough illustration, see the sketch below.
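A rough sketch of that inversion in Python: keep a line only if it begins with a run of plausible words. The pattern and the "at least three words" rule are arbitrary assumptions, not tuned values:

import re

# A "plausible word": letters (optionally hyphenated or apostrophized),
# optionally followed by one punctuation mark, then whitespace or end of line.
GOOD_LINE = re.compile(r"""^(?:[A-Za-z]+(?:[-'][A-Za-z]+)*[,.!?'"]?(?:\s+|$)){3,}""")

def keep_readable(text):
    return [ln for ln in text.splitlines() if GOOD_LINE.match(ln.strip())]

This is deliberately crude: on the sample above it keeps the prose and drops almost all of the noise, but a line like "I I I ~ \ ..." still slips through, so in practice you would combine it with a ratio heuristic like the ones suggested earlier.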
I'd claim a rule like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like
.Replace("\\p{P}{1,}[a-zA-Z0-9]{1,}", "");
Then you'd consider "any word with two or more punctuation marks consecutively":
.Replace(" \\p{P}{2,} ", "");
Seems like a good start anyway.
I like @Breton's answer. I'd suggest using his corpus approach along with a library of known "bad scans", which might be easier to identify because "junk" has more internal consistency than "good text" if it comes from bad OCR scans (the number of distinct glyphs is lower, for example).
Another good technique is to use a spell checker/dictionary and look up the "words" after you've eliminated the non-readable stuff with regex, as sketched below.
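A quick sketch of that dictionary lookup in Python, assuming a plain word list in words.txt (the file name and the 0.5 threshold are made-up placeholders):

with open('words.txt') as f:
    vocab = {w.strip().lower() for w in f}

def looks_like_text(line, threshold=0.5):
    # Keep a line when at least half of its tokens are dictionary words.
    tokens = [t.strip(".,!?;:'\"()") for t in line.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t.lower() in vocab)
    return hits / len(tokens) >= threshold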