I am trying to extract text in the form:
1.The tree has only green apples.
a. the person is boy.
b. the person is a liar.
2.The car drove very fast.
a. it was driven by a mad man.
b. it was by a little girl.
I want the "1.The tree has only green apples." parts of all the paragraphs.
My results list should contain "1.The tree has only green apples." and "2.The car drove very fast."... I have written a regex "\d{1,2}.\s(.+?\n\n|.+?$)" and tested it on various engines, where it works. I can't see why this will not work in Delphi XE5.
Here is the code that I use it from:
procedure TForm1.btSearch;
var
  regex: TRegEx;
  mycoll: TMatchCollection; // this declaration was missing from the original
  mygrps: TGroupCollection;
  i, j: Integer;
begin
  regex := TRegEx.Create(edit1.Text);
  mycoll := regex.Matches(memo1.Text);
  if mycoll.Count > 0 then
  begin
    label2.Caption := 'Count: ' + IntToStr(mycoll.Count);
    memo2.Lines.Add('First Collection: ');
    for i := 0 to mycoll.Count - 1 do
    begin
      memo2.Lines.Add('Match #' + IntToStr(i) + ': ' + mycoll.Item[i].Value);
      memo2.Lines.Add('Group: ' + IntToStr(i));
      mygrps := mycoll.Item[i].Groups;
      for j := 0 to mygrps.Count - 1 do
        memo2.Lines.Add('Value: ' + mygrps.Item[j].Value);
    end;
  end;
end;
This has nothing to do with Delphi at all. Your regular expression appears to be incorrect; it doesn't match your supplied text in any engine I can find.
Given the following text (from your sample, properly formatted):
1. The tree has only green apples.
a. the person is boy.
b. the person is a liar.
2. The car drove very fast
a. it was driven by a mad man.
b. it was by a little girl.
The following regular expression matches the two lines you've indicated as your desired result (tested in JGSoft, .NET, PCRE, Java, Perl, JavaScript, XMLSchema, and XPath engines):
\d{1,2}\.\s.*
When you tested with "various engines", you didn't take into account that various applications and programming languages handle line breaks differently. In Delphi, TMemo.Text returns a string with CRLF line breaks. To match one such line break in Delphi you need the regex \r\n.
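To see the line-break effect outside Delphi, here is a small Python sketch (Python chosen purely for illustration): with CRLF input, `.` stops at `\n` but happily matches `\r`, so each match drags a stray carriage return along unless you handle it.

```python
import re

# sample text as a TMemo would hand it over: CRLF line breaks
text = ("1. The tree has only green apples.\r\n"
        "a. the person is boy.\r\n"
        "2. The car drove very fast.\r\n")

# '.' stops at \n but matches \r, so each match ends with a stray CR
with_cr = re.findall(r'\d{1,2}\.\s.*', text)

# excluding \r explicitly yields clean lines
clean = re.findall(r'\d{1,2}\.\s[^\r\n]*', text)
```

The same applies in Delphi: either strip the `\r` from the match or exclude it in the pattern.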
Related
I'm trying to determine whether a term appears in a string.
A space must appear before and after the term; a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes: \s is the same as \\s, not s; \( is the same as \\(, not (; and so on. But you should never rely on getting lucky, or assume that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
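A quick demonstration of the difference:

```python
# In a normal string, Python consumes the backslash: "\"" is just one quote.
print("\"", len("\""))    # the backslash is gone

# In a raw string the backslash survives: r"\"" is two characters.
print(r"\"", len(r"\""))
```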
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of (1 or more of the letter p)" in some regexp languages (where it is equivalent to "1 or more of the letter p"), "always fail" in others, and "raise an exception" in still others. Python's re falls into the last group. In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So ,|.|;|: (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"... which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
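As a sketch of that suggestion (the alternatives !+, ?+ and 's stay outside the class, since a character class can't express repetition or two-character sequences):

```python
import re

term = "google"  # any search term; re.escape makes its metacharacters safe
regexPart1 = r"\s"
# one-character alternatives collapsed into a class
regexPart2 = r"(?:s|'s|!+|\?+|[,.;:()\"])?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2, re.IGNORECASE)

match = p.search("I love google!!! ")
```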
And I'm sure there are other ways this could be improved.
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In Python, simply write:
if term in string:
    # do whatever
I had an example_str = "i love you c++" and got the "multiple repeat" error when using it in a regex. The error occurs because the string contains "++", which collides with the repeat operators of regex syntax. My fix was to use re.escape(example_str); here is my code.
example_str = "i love you c++"
regex_word = re.search(rf'\b{re.escape(example_str)}\b', text_to_search)  # text_to_search is the string being scanned
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']
This is an algorithmic question; any pseudocode/verbal explanation will do (although a Python solution outline would be ideal).
Let's have a query word A, for example pity. And let's have a set of other strings, Bs, each of which is composed of one or more space-delimited words: pious teddy, piston tank yard, pesky industrial strength tylenol, oh pity is me! etc.
The goal is to identify those strings B from which A could be constructed. Here "constructed" means we could take prefixes of one or more words in B, in order, and concatenate them together to get A.
Example:
pity = piston tank yard
pity = pesky industrial strength tylenol
pity = oh pity is me!
On the other hand, pious teddy should not be identified, because there's no way we can take prefixes of the words pious and teddy and concatenate them into pity.
The checking should be fast (ideally some regexp), because the set of strings B is potentially large.
You can use a pattern of \bp(?:i|\w*(?>\h+\w+)*?\h+i)(?:t|\w*(?>\h+\w+)*?\h+t)(?:y|\w*(?>\h+\w+)*?\h+y) to match those words. It assumes spaces are used as word separators. This is rather easy to construct: take the first letter of the word to be matched, then loop over the remaining letters and append (?:[letter]|\w*(?>\h+\w+)*?\h+[letter]) for each.
This pattern is basically an unrolled version of \bp(?:i|.*?\bi)(?:t|.*?\bt)(?:y|.*?\by), which matches each of the second to last letters either as exactly the next letter or as the first letter of a following word (because of the word boundary).
You can see it in action here: https://regex101.com/r/r3ZVNE/2
I have added the last sample as non-matching one for some tests I did with atomic groups.
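For completeness, the same construction loop sketched in Python. Two assumptions: [ \t] replaces \h, which Python's re does not support, and a lazy group replaces the atomic group, which should only affect backtracking behavior, not which strings are accepted.

```python
import re

def build_pattern(word):
    # First letter anchored at a word boundary; each following letter matches
    # either immediately next, or as the start of a later word.
    pat = r'\b' + re.escape(word[0])
    for ch in word[1:]:
        c = re.escape(ch)
        pat += r'(?:{0}|\w*(?:[ \t]+\w+)*?[ \t]+{0})'.format(c)
    return re.compile(pat, re.IGNORECASE)

p = build_pattern('pity')
candidates = ['pious teddy', 'piston tank yard',
              'pesky industrial strength tylenol', 'oh pity is me!']
matches = [s for s in candidates if p.search(s)]
```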
In Delphi I would do it like this:
program ProjectTest;

uses
  System.SysUtils,
  RegularExpressions;

procedure CheckSpecialMatches(const Matchword: string; const CheckList: array of string);
var
  I: Integer;
  Pat, C: string;
  RegEx: TRegEx;
begin
  Assert(Matchword.Length > 0);
  Pat := '\b' + Matchword[1];
  for I := Low(Matchword) + 1 to High(Matchword) do
    Pat := Pat + Format('(?:%0:s|\w*(?>\h+\w+)*?\h+%0:s)', [Matchword[I]]);
  RegEx := TRegEx.Create(Pat, [roCompiled, roIgnoreCase]);
  for C in CheckList do
    if RegEx.IsMatch(C) then
      WriteLn(C);
end;

const
  CheckList: array[0..3] of string = (
    'pious teddy',
    'piston tank yard',
    'pesky industrial strength tylenol',
    'prison is ity');
  Matchword = 'pity';

begin
  CheckSpecialMatches(Matchword, CheckList);
  ReadLn;
end.
I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding \d{5} form a capture group, which makes the captured text available through the pseudovariable $1. The (?!\d) is a more advanced construct known as a negative lookahead assertion: it peeks at the next character to check that it's not a digit (because then it could be a 6-or-more-digit number whose first 5 digits happened to match). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
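The same substitution expressed in Python, for anyone testing outside VB.NET (the sample address is made up):

```python
import re

addr = "999 E 99TH ST CITY ST 99999"
# capture the 5-digit zip and re-insert it after a tab instead of the space
out = re.sub(r' (\d{5})(?!\d)', '\t\\1', addr)
```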
I'm trying to adapt a Haxe Markdown library ( http://code.google.com/p/mdown/ ) into an official haxelib that works across platforms. I'm running into some weirdness where something works on flash and javascript, but not neko.
See this sample code:
var str = "<p>This is a blockquote</p>";
var out = ~/(^|\n)/g.replace(str, "$1 ");
trace(out);
On Javascript and Flash I get this, as expected:
" <p>This is a blockquote</p>"
On Neko I get this:
" < p > T h i s i s a b l o c k q u o t e < / p > "
I can work around it for now (not use regular expressions) - but can anyone show me at what point this is breaking?
Thanks,
Jason
p.s. This might help answer the question: http://haxe.org/doc/cross/regexp#implementation-details
If you use the m flag to convert it into a multiline regex, you can leave out the newline part. That might help.
The relevant part of documentation is right at the beginning of the linked page:
m : multiline matching, ^ and $ represent the beginning and end of a line
As for why your problem is happening, it would appear that Neko's regex library is wrongly simplifying your regex to an empty pattern, which will match between every character. You could put a . at the end of the regex and move the space to the front of your replacement string, which might prevent that bug from occurring, and it should be compatible with all the implementations.
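To illustrate the multiline-anchor idea in isolation, here is a Python sketch (Python's re doesn't exhibit the Neko bug, so this only shows the intended behavior): with the multiline flag, ^ matches at the start of every line, and the replacement indents each line rather than firing between characters.

```python
import re

s = "<p>line one</p>\n<p>line two</p>"
# ^ with re.MULTILINE matches at the start of each line
out = re.sub(r'^', ' ', s, flags=re.MULTILINE)
```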
I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
One of the simplest solutions (not involving regexes):
#pseudopython
number_of_punct = sum(1 for c in line if c in string.punctuation)
if number_of_punct > len(line) / 2:
    line_is_garbage()
Or the rude, regexish version: s/[!,'"##~$%^& ]{5,}//g
A simple heuristic, similar to the anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!#$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
// garbage
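A minimal runnable version of this heuristic in Python (the ratio threshold is arbitrary and would need tuning against real scans):

```python
import re

def is_garbage(line, garbage_ratio=1.0):
    # Na: alphanumeric symbols, Nb: other non-whitespace symbols
    na = len(re.findall(r'[A-Za-z0-9]', line))
    nb = len(re.findall(r'[^A-Za-z0-9\s]', line))
    return nb > 0 and na / nb <= garbage_ratio
```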
No idea how well it would work, but I have considered this problem in the past, idly. I've on occasion played with a little programmatic device called a Markov chain.
Now the Wikipedia article probably won't make much sense until you see some of the other things a Markov chain is good for. One example of a Markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, weighting the random choices to emulate the frequency patterns of an examined corpus. For instance, given the letter "T", the letter h is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text and rate each letter according to the likelihood of it appearing. Then you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter: filter out surprises.
There's some leeway in how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side too, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
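A tiny sketch of the scoring idea using letter bigrams (a real filter would want a much larger reference corpus than this one-line example):

```python
import math
from collections import Counter

def bigram_model(corpus):
    # relative frequency of each adjacent letter pair in the reference text
    pairs = Counter(zip(corpus, corpus[1:]))
    total = sum(pairs.values())
    return {p: n / total for p, n in pairs.items()}

def surprise(text, model, floor=1e-6):
    # average negative log-probability per bigram; higher = more surprising
    probs = [model.get(p, floor) for p in zip(text, text[1:])]
    return -sum(math.log(p) for p in probs) / max(len(probs), 1)

model = bigram_model("federal prosecutors charged a miami man with data theft")
```

Chunks whose average surprise exceeds a tuned threshold would be flagged as gibberish.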
I did a combo of eliminating lines that don't contain at least two 3-letter words, or one 6-letter word.
([a-zA-Z]{3,}\s){2,}|[a-zA-Z]{6,}
http://www.regexpal.com/
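A runnable Python version of this filter (note the class is written [a-zA-Z]; a | inside a character class would match a literal pipe, not act as alternation):

```python
import re

# keep lines containing either two 3+-letter words or one 6+-letter word
pattern = re.compile(r'([a-zA-Z]{3,}\s+){2,}|[a-zA-Z]{6,}')

lines = [
    "Federal prosecutors on Monday charged a Miami man",
    ", ':, Ie",
    "~ cil~ 0 2:",
]
kept = [ln for ln in lines if pattern.search(ln)]
```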
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;
while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
    next unless length $chunk;

    my @tokens = split ' ', $chunk;
    # what is a word?
    my @words = grep {
        /^[A-Za-z]{2,}[.,]?$/
            or /^[0-9]+$/
            or /^(?:a|I)$/
            or /^(?:[A-Z][.])+$/
    } @tokens;

    # completely arbitrary threshold
    my $score = @words / @tokens;
    print $chunk, "\n" if $score > 0.5;
}

sub read_chunk {
    my ($fh) = @_;
    my ($chunk, $line);

    while ( $line = <$fh> ) {
        if ( $line =~ /\S/ ) {
            $chunk .= $line;
            last;
        }
    }

    while (1) {
        $line = <$fh>;
        last unless (defined $line) and ($line =~ /\S/);
        $chunk .= $line;
    }

    return $chunk;
}

__DATA__
__DATA__
Paste the text from the question after __DATA__ (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file, etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask user to help you and specify the type of text you work with. e.g. if it is a page from a book then you would expect the majority of lines to be the same length and mainly consisting of letters, spaces and punctuation.
Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random.
One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.
I'd claim a regex like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like
input = Regex.Replace(input, "\p{P}+[a-zA-Z0-9]+", "");
Then you'd consider "any word with two or more punctuation marks consecutively":
input = Regex.Replace(input, "\s\p{P}{2,}\s", " ");
Seems like a good start anyway.
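A rough Python rendering of these two rules (Python's re has no .NET-style \p{P}, so [^\w\s] stands in for "punctuation"; the exact patterns are my guess at the intent):

```python
import re

def strip_ocr_noise(text):
    # rule 1: punctuation immediately followed by more non-space characters
    text = re.sub(r'[^\w\s][^\s]+', '', text)
    # rule 2: any remaining runs of two or more punctuation marks
    text = re.sub(r'[^\w\s]{2,}', '', text)
    return text

cleaned = strip_ocr_noise("Gonzalez is a former informant ::t ;; e;~~~")
```

Like the answer says, only a start: rule 1 will also eat legitimate contractions such as don't.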
I like @Breton's answer. I'd suggest using his corpus approach along with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower, for example).
Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non-readable stuff with a regex.