What do I use to match MS Word chars in a regex?

I need to find and delete all the non-standard ASCII chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that Unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.

Working with character sets is probably the best way to handle this, but for what it's worth, I've had some success with a quick-and-dirty approach: the character class
[\x80-\x9F]
This works because, for me, the problem "Word chars" are the ones in that range: Windows-1252 uses 0x80-0x9F for smart quotes, dashes and the like, but in Unicode and ISO-8859-1 those are control codes rather than printable text, and I've got no way of sanitising user input.
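For instance, stripping that range in Perl might look like this (a sketch, assuming the text is still an undecoded byte string; the sample input is invented):
use strict;
use warnings;

# Hypothetical input as it might arrive from Word via a windows-1252
# copy/paste: 0x93/0x94 are curly quotes, 0x96 is an en dash.
my $str = "some \x93smart\x94 quotes and an \x96 en dash";

$str =~ s/[\x80-\x9F]//g;    # drop everything in the 0x80-0x9F range
print "$str\n";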

Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
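As a small illustration of the "read it with the right encoding" point, here is a Perl sketch (the byte string is invented; decode('cp1252', ...) is the Encode module's name for windows-1252):
use strict;
use warnings;
use Encode qw(decode);

# Decoding as windows-1252 instead of ISO-8859-1 turns 0x93/0x94 into
# real curly quotes rather than C1 control characters.
my $bytes = "\x93hello\x94 \x96 world";
my $text  = decode('cp1252', $bytes);

printf "first char is U+%04X\n", ord($text);    # U+201C, left curly quote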

I usually use a JEdit macro that replaces the most common of them with a more ASCII-friendly version (a rough Perl sketch of the same mapping follows the list), i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
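Such a replacement table might look like this in Perl (a sketch only, not the actual JEdit macro; the codepoints are the usual Word suspects and the sample text is made up):
use strict;
use warnings;

my $text = "dashes \x{2014} and dots\x{2026}";

my %ascii_friendly = (
    "\x{2013}" => '-',      # en dash  -> minus sign
    "\x{2014}" => '-',      # em dash  -> minus sign
    "\x{2026}" => '...',    # one-character ellipsis -> three dots
    "\x{2022}" => '*',      # list-item bullet -> asterisk
);

my $alternation = join '|', map { quotemeta } keys %ascii_friendly;
$text =~ s/($alternation)/$ascii_friendly{$1}/g;

print "$text\n";            # dashes - and dots...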
The macro is easily adaptable to Word/OpenOffice/whatever, and of course can be modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers

What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
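For example, in Perl you can refer to such a character by its codepoint escape rather than pasting the literal character into the pattern (a minimal sketch):
use strict;
use warnings;

my $text = "it\x{2019}s";        # U+2019 is the curly apostrophe
$text =~ s/\x{2019}/'/g;         # escape the codepoint in the regex
print "$text\n";                 # it's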

My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl

use strict;
use warnings;

my %seen;

while (<>) {
    for my $character (grep { ord($_) > 127 } split //) {
        $seen{$character}++;
    }
}

print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl

use strict;
use warnings;

my %map = (
    chr(128) => "foo",
    # etc.
);

while (<>) {
    s/([\x{80}-\x{FF}])/$map{$1}/g;    # /g so every occurrence on the line is replaced
    print;
}

What I would do is use AutoHotKey, or Python SendKeys, or some sort of Visual Basic script that would send all possible keys (with and without Shift applied) to a Word document.
In SendKeys it would be a script of the form
import SendKeys

chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])   # + 1 so 'z' is included
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])
specials = ''.join(['-', '=', '\\', '/', '.', ',', '`'])
all = chars + nums + specials
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text and use it as a database of all displayable keys for your keyboard layout (you might want to switch the default input language more than once to capture absolutely all displayable characters).
If a character is in the resulting text document, it is displayable; otherwise it is not. No need for a regexp. You can of course afterwards embed the character range within a script or a program.
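Once that reference document exists, the check itself is simple; here is a rough Perl sketch (the file name typeable.txt and the sample input are just examples):
use strict;
use warnings;

# Build a lookup of every character that appears in the reference file
# generated by the keystroke run above.
open my $fh, '<', 'typeable.txt' or die "Can't open reference file: $!";
my %displayable = map { $_ => 1 } split //, do { local $/; <$fh> };
close $fh;

my $input = "plain text with a curly \x{2019} apostrophe";
my $clean = join '', grep { $displayable{$_} } split //, $input;
print "$clean\n";    # characters not found in the reference file are dropped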

Related

Including regex on variable before matching string

I'm trying to find and count the occurrences of words, read from a text file, in another text file. So far I can only find a word when it is written correctly and not munged ("a" changed to "#" or "i" changed to "1"). Is it possible to add a regex to my strings for matching, or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
    my $fileName            = $_[0];
    my $stringToCount       = $_[1];
    my $numberOfOccurrences = 0;
    my @wordArray = wordsInFileToArray($fileName);

    foreach (@wordArray)
    {
        my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
        $numberOfOccurrences += $numberOfNewOccurrences;
    }

    return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by a user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible changes are completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
    '#' => 'a',
    '1' => 'i',
    ...
);
my $re = list2re keys %norms;

s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note, your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See the docs for quotemeta.
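For example (a small illustration of why \Q matters, not part of the original code; the strings are invented):
use strict;
use warnings;

my $randomString = 'c++';                  # contains regex metacharacters
my $text         = 'learning c++ basics';

# Without \Q the '+' signs are treated as quantifiers and the pattern
# fails to compile ("Nested quantifiers"); eval just keeps the demo alive.
print "unquoted matched\n" if eval { $text =~ /$randomString/i };
print "quoted matched\n"   if $text =~ /\Q$randomString/i;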
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, which depend on those details, are:
If the user-defined substitutions are global (replace every occurrence of a character in every string), the user can submit a mapping, say as a hash, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled words, if found). There may be false positives, so also plan on some post-processing.
If the user can supply a list of substitutions along with the words they apply to (either the mangled or the corresponding unmangled ones), then we can do a more targeted run.
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';

use String::Approx qw(amatch);

my $target  = qq(russianhacker);
my @text    = qw(that h#cker was a russ1#anh#ck3r);

my @matches = amatch($target, ["25%"], @text);

say for @matches;    #==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument to amatch specifies the percentage deviation from the target string that is acceptable. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches which then need to be filtered out, so there will be some post-processing to do.
Second, we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (the target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
I ended up solving the problem by including the regex directly in the variable that I'll use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
    my $fileName            = $_[0];
    my $mungedWordToCount   = $_[1];
    my $numberOfOccurrences = 0;

    open (my $inputFile, "<", $fileName) or die "Can't open file: $!";

    $mungedWordToCount =~ s/a/\[a\#4\]/gi;

    while (my $currentLine = <$inputFile>)
    {
        chomp ($currentLine);
        $numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
    }

    close ($inputFile) or die "Can't close file: $!";

    return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
I didn't know that Perl would just interpret the regex inside the variable, since I had tried that before and could only get the desired results by defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

How can I substitute in strings in Perl 6 by codepoint rather than by grapheme?

I need to remove diacritical marks from a string using Perl 6. I tried doing this:
my $hum = 'חוּם';
$hum.subst(/<-[\c[HEBREW LETTER ALEF] .. \c[HEBREW LETTER TAV]]>/, '', :g);
I am trying to remove all the characters that are not in the range between HEBREW LETTER ALEF (א) and HEBREW LETTER TAV (ת). I'd expected this code to return "חום"; however, it returns "חם".
I guess that what happens is that by default Perl 6 works by graphemes, considers וּ to be one grapheme, and removes all of it. It's often sensible to work by graphemes, but in my case I need it to work by codepoints.
I tried to find an adverb that would make it work by codepoint, but couldn't find it. Perhaps there is also a way in Perl 6 to use Unicode properties to exclude diacritics, or to include only letters, but I couldn't find that either.
Thanks!
My regex-fu is weak, so I'd go with a less magical solution.
First, you can remove all marks via samemark:
'חוּם'.samemark('a')
Second, you can decompose the graphemes via .NFD and operate on individual codepoints - eg only keeping values with property Grapheme_Base - and then recompose the string:
Uni.new('חוּם'.NFD.grep(*.uniprop('Grapheme_Base'))).Str
In case of mixed strings, stripping marks from Hebrew characters only could look like this:
$str.subst(:g, /<:Script<Hebrew>>+/, *.Str.samemark('a'));
Here is a simple approach:
my $hum = 'חוּם';

my $min = "\c[HEBREW LETTER ALEF]".ord;
my $max = "\c[HEBREW LETTER TAV]".ord;

my @ords;
for $hum.ords {
    @ords.push($_) if $min ≤ $_ ≤ $max;
}

say join('', @ords.map: { .chr });
Output:
חום

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly formatted; to use only one backslash, you'd need to write your string with single quotes. With double quotes, you need two backslashes here.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')

regular expression matching all 8 character strings except "00000000"

I am trying to figure out a regular expression which matches any string with 8 symbols, which doesn't equal "00000000".
Can anyone help me?
Thanks
In Perl at least, you can use a negative lookahead assertion: ^(?!0{8}).{8}$, but personally I'd rather write it like so:
length $_ == 8 and $_ ne '00000000'
Also note that if you do use the regexp, depending on the language you might need a flag to make the dot match newlines as well, if you want that. In perl, that's the /s flag, for "single-line mode".
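A quick Perl demonstration of the lookahead version (just a sketch; the test strings are invented):
use strict;
use warnings;

for my $s ('00000000', 'a0000000', '1234567', 'abcdefgh') {
    # ^(?!0{8}).{8}$ : exactly eight characters, but not eight zeros;
    # /s lets the dot match a newline too, as mentioned above.
    printf "%-10s %s\n", $s, $s =~ /^(?!0{8}).{8}$/s ? 'matches' : 'no match';
}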
Unless you are being forced into it for some reason, this is not a regex problem. Just use len(s) == 8 && s != "00000000" or whatever your language uses to compare strings and lengths.
If you need a regex, ^(?!0{8})[A-Za-z0-9]{8}$ will match a string of exactly 8 characters. Changing the values inside the [] will allow you to set the accepted characters.
As mentioned in the other answers, regular expressions are not the right tool for this task. I suspect it is homework, thus I'll only hint at a solution instead of stating it explicitly.
The regexp "any 8 symbols except 00000000" may be broken down as a sum of eight regexps of the form "8 symbols with a non-zero symbol in the i-th position". Try to write down such an expression and then combine them into one using alternation ("|").
Unless you have unspecified requirements, you really don't need a regular expression for this:
if len(myString) == 8 and myString != "00000000":
...
(in the language of your choice, of course!)
If you need to extract all eight-character strings not equal to "00000000" from a larger string, you could use
"(?=.{8})(?!0{8})."
to identify the first character of each sequence and extract eight characters starting with its index.
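In Perl, that index-based extraction might look like this (a sketch; note that it yields overlapping windows, which may or may not be what you want):
use strict;
use warnings;

my $text = 'xy00000000abcdefgh';
my @windows;

# Each successful match consumes one character, so every position that
# starts an 8-character window other than "00000000" is visited.
while ($text =~ /(?=.{8})(?!0{8})./g) {
    push @windows, substr($text, $-[0], 8);
}

print "$_\n" for @windows;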
Of course, one would simply check
if stuff != '00000000'
...
but for the record, one could easily employ
heavyweight regex (in Perl) for that ;-)
...
use re 'eval';
my @strings = qw'00000000 00A00000 10000000 000000001 010000';
my $L = 8;

print map "$_ - ok\n",
    grep /^(.{$L})$(??{ $^N ne '0' x $L ? '' : '^$' })/,
    @strings;
...
prints
00A00000 - ok
10000000 - ok
go figure ;-)
Regards
rbo
Wouldn't ([1-9]**|\D*){8} do it? Or am I missing something here (which is actually just the inverse of ndim's, which seems like it oughta work).
I am assuming the characters was chosen to include more than digits.
OK, so that was wrong. Professor Bolo, did I get a passing grade? (I love regular expressions, so I am really curious.)
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '00000000'):
...     print 'match'
...
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10000000'):
... print 'match'
match
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10011100'):
... print 'match'
match
>>>
That work?

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!
One of the annoying things I've noticed is that entity names are entered differently:
For example, I see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.
I'm trying to catch all these variants.
So it would be something like:
"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"
Which isn't so bad... except there are about 40 entity suffixes that I can think of.
The best thing I can think of is programmatically building up this pattern, based on my list of suffixes.
I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.
You could just strip out excess crap. Using Perl:
my $suffix = "l. lc.."; # the worst case imaginable!
$suffix =~ s/[.\s]//g;
# no matter what variation $suffix was, it's now just "llc"
Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:
my $suffix = $full_name;
$suffix =~ s/\w{4,}//g; # strip words of more than 4 characters
$suffix =~ s/(a|the|an|of)//ig; # strip a few common cases
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing
It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases, use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.
Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.
If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.
Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
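For instance, a plain-array version in Perl might look something like this (a sketch; the suffix list is invented and far shorter than the real 40):
use strict;
use warnings;

my @suffixes = ('llc', 'inc', 'ltd', 'corp');    # illustrative only

sub normalize_suffix {
    my ($field) = @_;
    (my $squished = lc $field) =~ s/[.,\s]//g;   # strip punctuation and spaces first
    for my $suffix (@suffixes) {
        return uc $suffix if $squished eq $suffix;
    }
    return;                                      # no known suffix
}

my $canonical = normalize_suffix('l. l. c,');
print defined $canonical ? $canonical : 'unknown', "\n";    # prints LLC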
The first two "l" parts can be simplified to (l[,\.\ ]*){2}.
You can squish periods and whitespace first, before matching: for instance, in perl:
while (<>) {
    $Sq = $_;
    $Sq =~ s/[.\s]//g;                 # squish away . and " " in the temporary save version
    $Sq = lc($Sq);
    $Sq =~ /^llc$/ and $_ = 'L.L.C.';  # try to match the squished version; if so, save the canonical version
    $Sq =~ /^ibm/  and $_ = 'IBM';     # a different match
    print $_;
}
Don't use regexes, instead build up a map of all discovered (so far) entries and their 'canonical' (favourite) versions.
Also build a tool to discover possible new variants of postfixes by identifying common prefixes to a certain number of characters and printing them on the screen so you can add new rules.
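A bare-bones sketch of that map-plus-discovery idea in Perl (the entries and input handling are invented for illustration):
use strict;
use warnings;

# Canonical versions of every variant discovered so far.
my %canonical = (
    'llc'   => 'L.L.C.',
    'l l c' => 'L.L.C.',
    'l.l.c' => 'L.L.C.',
    'inc'   => 'Inc.',
    'inc.'  => 'Inc.',
);

my %unmapped;
while (my $entry = <STDIN>) {
    chomp $entry;
    my $key = lc $entry;
    if (exists $canonical{$key}) {
        print "$entry -> $canonical{$key}\n";
    } else {
        $unmapped{$key}++;    # candidates for new rules
    }
}

# Review these by hand and promote them to %canonical as needed.
print "unmapped: $_ ($unmapped{$_} times)\n" for sort keys %unmapped;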
In Perl you can build up regular expressions inside your program using strings. Here's some example code:
#!/usr/bin/perl

use strict;
use warnings;

my @strings = (
    "l.l.c",
    "llc",
    "LLC",
    "lLc",
    "l,l,c",
    "L . L C ",
    "l W c"
);

my @seps = ('.', ',', '\s');
my $sep_regex = '[' . join('', @seps) . ']*';

my $regex_def = join '', (
    '[lL]',
    $sep_regex,
    '[lL]',
    $sep_regex,
    '[cC]'
);

print "definition: $regex_def\n";

foreach my $str (@strings) {
    if ( $str =~ /$regex_def/ ) {
        print "$str matches\n";
    } else {
        print "$str doesn't match\n";
    }
}
This regular expression could also be simplified by using case-insensitive matching (which means $match =~ /$regex/i ). If you run this a few times on the strings that you define, you can easily see cases that don't validate according to your regular expression. Building up your regular expression this way can be useful in only defining your separator symbols once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
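For instance, the String::Approx module that came up earlier in this thread could be pressed into service along these lines (a sketch; the tolerance of two edits and the field list are arbitrary, and loose settings will admit false positives that need post-filtering):
use strict;
use warnings;
use String::Approx qw(amatch);

my @fields = ('LLC', 'lLc', 'l.l.c', 'gmbh');

# Case-insensitive match against "llc", allowing up to two edits.
my @close = amatch('llc', ['i', '2'], @fields);

print "$_\n" for @close;    # expect LLC, lLc and l.l.c to come back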
Alternatively, as other people have suggested you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:
my $sep_regex = '[' . join('', @seps) . ']*';

foreach my $str (@strings) {
    my $copy = $str;
    $copy =~ s/$sep_regex//g;
    $copy = lc $copy;
    print "$str -> $copy\n";
}
If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.