Getting Everything Between Two Characters Across New Lines - regex

This is a sample of the text I am working with.
6) Jake's Taxi Service is a new entrant to the taxi industry. It has achieved success by staking out a unique position in the industry. How did Jake's Taxi Service mostly likely achieve this position?
A) providing long-distance cab fares at a higher rate than
competitors; servicing a larger area than competitors
B) providing long-distance cab fares at a lower rate than competitors;
servicing a smaller area than competitors
C) providing long-distance cab fares at a higher rate than
competitors; servicing the same area as competitors
D) providing long-distance cab fares at a lower rate than competitors;
servicing the same area as competitors
Answer: D
I am trying to match the entire question including the answer options: everything from the question number to the word Answer.
This is my current regex:
((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL)
searchCounter is just a variable that corresponds to the current question, in this case 6. I think the issue has something to do with searching across the new lines.
EDIT: Full source code
searchCounter = 1
bookDict = {}
with open ('StratMasterKey.txt', 'rt') as myfile:
    for line in myfile:
        question_pattern = re.compile((rf'(?<={searchCounter}\) ).*?(?=Answer).*'), re.DOTALL)
        result = question_pattern.search(line)
        if result != None:
            bookDict[searchCounter] = result[0]
            searchCounter +=1

The reason your regex fails is that you read the file line by line with for line in myfile:, while your pattern searches for matches in a single multiline string.
Replace for line in myfile: with contents = myfile.read() and then use result = question_pattern.search(contents) to get the first match, or result = question_pattern.findall(contents) to get multiple matches.
A note on the regex: I am not fixing the whole pattern since, as you mentioned, it is out of scope of this question. However, since the input is now a multiline string, you need to remove re.DOTALL and use [\s\S] where the pattern must match any char, and . where it must match any char but a line break char. Also, the lookaround construct is redundant; you may safely replace (?=Answer) with Answer. Finally, to check if there is a match, you may simply use if result: and then grab the whole match value by accessing result.group().
Full code snippet:
import re

searchCounter = 6  # the question number to look up
with open('StratMasterKey.txt', 'rt') as myfile:
    contents = myfile.read()
question_pattern = re.compile(rf'(?<={searchCounter}\) )[\s\S]*?Answer.*')
result = question_pattern.search(contents)
if result:
    print(result.group())
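To collect every question in one pass instead of one number at a time, something along these lines may work. This is only a sketch: it assumes each question starts a line with its number and ends on a line containing Answer:, and the sample text and the names text, pattern and questions are made up for illustration.

```python
import re

# Hypothetical sample standing in for the contents of StratMasterKey.txt.
text = """6) How did Jake's Taxi Service most likely achieve this position?
A) providing long-distance cab fares at a higher rate than
competitors; servicing a larger area than competitors
D) providing long-distance cab fares at a lower rate than competitors;
servicing the same area as competitors
Answer: D
7) Which option is a placeholder?
A) this one
B) that one
Answer: A
"""

# ^ with re.MULTILINE anchors at each line start; [\s\S]*? lazily spans
# line breaks up to the first "Answer:", and .* takes the rest of that line.
pattern = re.compile(r"^(\d+)\) ([\s\S]*?Answer:.*)", re.MULTILINE)
questions = {int(number): body for number, body in pattern.findall(text)}

for number, body in questions.items():
    print(number, "->", body.splitlines()[-1])
```

Capturing the number with (\d+) avoids rebuilding the pattern for every value of searchCounter.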

Related

Splunk - check logs that are equal to any string I provide

I want to capture only the logs that hold the values "Zero tolerance", "bolt from the blue" or "A change is as good as a rest". I've tried this, but it doesn't work; it only captures the first one. /description=(?Zero tolerance | bolt from the blue | A change is as good as a rest)
Keep in mind that the strings to check need to be provided by me.
code = random05, description=bird in the hand is worth two in the bush, level=5
code = random02, description=bolt from the blue, level=8
code = random09, description=bunch of fives, level=3
code = random05, description=A chain is only as strong as its weakest link, level=0
code = random08, description=A change is as good as a rest, level=3
There are more logs but they are not showing.
It looks like you want to match anything after description that contains one of the strings you specified. You can then use
description=(?<des>.*(?:Zero tolerance|bolt from the blue|A change is as good as a rest).*)
So, here, "des" group will match any zero or more chars other than line break chars as many as possible, then one of the specified strings (note the spaces around | are removed) and then again any zero or more chars other than line break chars as many as possible.
If you need to limit the match by the first comma, replace . with [^,].
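The pattern can be tried outside Splunk with a short Python sketch. Note that Python spells named groups (?P<des>...) rather than (?<des>...). The [^,] variant is used, and the names phrases, logs and matched are illustrative only.

```python
import re

# The phrases are supplied by the caller; re.escape keeps them literal.
phrases = ["Zero tolerance", "bolt from the blue", "A change is as good as a rest"]
alternation = "|".join(re.escape(p) for p in phrases)

# [^,]* instead of .* keeps the match inside the description field.
pattern = re.compile(rf"description=(?P<des>[^,]*(?:{alternation})[^,]*)")

logs = [
    "code = random05, description=bird in the hand is worth two in the bush, level=5",
    "code = random02, description=bolt from the blue, level=8",
    "code = random09, description=bunch of fives, level=3",
    "code = random05, description=A chain is only as strong as its weakest link, level=0",
    "code = random08, description=A change is as good as a rest, level=3",
]

matched = [m.group("des") for m in map(pattern.search, logs) if m]
print(matched)
```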

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} are known as a capture group, and are responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
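For readers who want to test the pattern outside VB.Net, here is the same substitution sketched in Python. The sample line is the anonymised one from the question, and zip_pattern and fixed are illustrative names.

```python
import re

line = "LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999"

# A space, then exactly five digits that are not followed by a sixth digit.
zip_pattern = re.compile(r" (\d{5})(?!\d)")

# Swap the leading space for a tab and keep the captured digits (group 1).
fixed = zip_pattern.sub("\t" + r"\1", line)
print(repr(fixed))
```

Shorter digit runs such as 999 are left alone because \d{5} needs five digits directly after the space.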

Finding if a string matches a pattern

At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) are largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full text index. Also it must match according to the word position. eg. it should short circuit if the first word of the input is not found among the first word of the patterns. It should take care of the any match ie %s in the pattern.
One solution is to put the patterns in an in memory database (eg. redis) and do a full text index on it. (this will not match according to word position) but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in memory database. Also note that you are looking for the closest match. One or more words will not match. The highest number of matches is the pattern you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
  "Hi": {"there": {"%s": null}},
  "What": {"a": {"lovely": {"day": {"today": null}}}},
  "Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
  "Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index is recursive descending according to the word position. So search for the first word; if found, search for the next within the object returned by the first, and so on. Identical words at a given level share a single key. You should also match the "any" case (%s). This should be blindingly fast in memory.
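A minimal Python sketch of such a word index follows. It rests on several assumptions that are not part of the answer: punctuation is dropped, apostrophes are removed (so "isn't" becomes "isnt", as in the index above), each %s stands for exactly one word, and the names tokenize, build_index and lookup are made up for illustration.

```python
import re

WILDCARD = "%s"
END = ""  # hypothetical marker key: a complete pattern ends at this node

def tokenize(text):
    # Keep %s placeholders and plain words; punctuation is dropped.
    return re.findall(r"%s|\w+", text.replace("'", ""))

def build_index(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for token in tokenize(pattern):
            node = node.setdefault(token, {})
        node[END] = pattern  # remember which pattern ends here
    return root

def lookup(index, text):
    def walk(node, words):
        if not words:
            return node.get(END)
        head, rest = words[0], words[1:]
        # Try the literal word first, then fall back to the wildcard branch.
        for key in (head, WILDCARD):
            child = node.get(key)
            if child is not None:
                found = walk(child, rest)
                if found is not None:
                    return found
        return None
    return walk(index, tokenize(text))

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
index = build_index(patterns)
print(lookup(index, "Will you be meeting Linda today, John?"))
print(lookup(index, "Goobledigoock"))
```

The lookup stops as soon as a word has neither a literal nor a wildcard branch, which is the short circuit described above.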
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re

def create_random_sentence():
    nwords = random.randint(4, 10)
    sentence = []
    for i in range(nwords):
        sentence.append("".join(random.choice(string.ascii_lowercase) for x in range(random.randint(3, 10))))
    ret = " ".join(sentence)
    print(ret)
    return ret

# Regex metacharacters in the literal text (. and ?) are escaped by hand.
patterns = [r"Hi there, [a-zA-Z]+\.",
            r"What a lovely day today!",
            r"Lovely sunset today, [a-zA-Z]+, isn't it\?",
            r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]

for i in range(95):
    patterns.append(create_random_sentence())

monster_pattern = "|".join("(%s)" % x for x in patterns)
print(monster_pattern)
print("--------------")
monster_regexp = re.compile(monster_pattern)

inputs = ["Hi there, John.",
          "What a lovely day today!",
          "Lovely sunset today, John, isn't it?",
          "Will you be meeting Linda today, John?",
          "Goobledigoock"] * 2000

for i in inputs:
    ret = monster_regexp.search(i)
    if ret:
        print(".", end=" ")
    else:
        print("x", end=" ")
I've created a hundred patterns. At the time of writing this was the maximum number of groups the Python regexp library would accept in a single pattern (the limit was lifted in Python 3.5). 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups: (group1)|(group2)|(group3)|.... That's the monster_regexp. I'm guessing you'll have to sanitise the patterns for things that can have meanings in regular expressions (like ? etc.).
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings, 80% of which should match and 20% of which will not. It short-circuits, so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp so they will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re

patterns = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?"
]

def make_re_pattern(pattern):
    # Characters like . ? etc. have special meaning in regular expressions.
    # Escape the string to avoid interpreting them as such.
    # Older versions of re.escape also escape %, so %s is swapped for XXX first.
    p = re.escape(pattern.replace("%s", "XXX"))
    return p.replace("XXX", r"\w+")

# Join all the patterns into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))

def match(s):
    """Given an input string, returns the matched pattern or None."""
    m = rx.match(s)
    if m:
        # Find the index of the matching group.
        index = next(i for i, group in enumerate(m.groups()) if group is not None)
        return patterns[index]

# Testing with a couple of patterns
print(match("Hi there, John."))
print(match("Will you be meeting Linda today, John?"))
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern, making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performances.
The problem isn't clear to me. Do you want to take the patterns and build regexes out of them?
Many regex engines have a "quoted string" option (\Q ... \E). So you could take the string and make it
^\QHi there, \E(?:.*)\Q.\E$
These will be regexes that match exactly the string you want outside your variables.
If you want to use a single regex to find the one pattern that matched, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
If you use a proper parser (I've used PEG.js), it might be more maintainable, though. So that's another option if you think you might get stuck in regex hell.
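Python's re module has no \Q...\E syntax, but re.escape plays the same role. The sketch below builds one anchored regex per template and also returns the text each %s captured. The names template_to_regex and identify are hypothetical, and the lazy (.+?) group is just one possible choice for the variable parts.

```python
import re

def template_to_regex(template):
    # Escape the literal pieces and join them with one capture group per %s.
    literal_parts = [re.escape(part) for part in template.split("%s")]
    return re.compile("(.+?)".join(literal_parts))

def identify(text, templates):
    """Return (template, captured values) for the first full match, else None."""
    for template in templates:
        m = template_to_regex(template).fullmatch(text)
        if m:
            return template, m.groups()
    return None

templates = [
    "Hi there, %s.",
    "What a lovely day today!",
    "Lovely sunset today, %s, isn't it?",
    "Will you be meeting %s today, %s?",
]
print(identify("Will you be meeting Linda today, John?", templates))
print(identify("Goobledigoock", templates))
```

With around 1,500 templates it would make sense to compile the regexes once up front rather than on every call.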

How can I use regexextract function in Google Docs spreadsheets to get "all" occurrences of a string?

My text string is in cell D2:
Decision, ERC Case No. 2009-094 MC, In the Matter of the Application for Authority to Secure Loan from the National Electrification Administration (NEA), with Prayer for Issuance of Provisional Authority, Dinagat Island Electric Cooperative, Inc. (DIELCO) applicant(12/29/2011)
This function:
=regexextract(D2,"\([A-Z]*\)")
will grab (NEA) but not (DIELCO)
I would like it to extract both (NEA) and (DIELCO)
You can use capture groups, which will cause regexextract() to return an array. You can use this as the cell result, in which case you will get a range of results, or you can feed the array to another function to reformat it to your purpose. For example:
regexextract( "abracadabra" ; "(bra).*(bra)" )
will return the array:
{bra,bra}
Another approach would be to use regexreplace(). This has the advantage that the replace is global (like s/pattern/replacement/g), so you do not need to know the number of results in advance. For example:
regexreplace( "aBRAcadaBRA" ; "[a-z]+" ; "..." )
will return the string:
...BRA...BRA
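Purely for comparison, and not as spreadsheet syntax, the two ideas can be checked in Python: a "find all" call returns every occurrence directly, and a global replace behaves like the regexreplace() example above. The excerpt of the cell text and the variable names are illustrative.

```python
import re

cell = ("Loan from the National Electrification Administration (NEA), "
        "Dinagat Island Electric Cooperative, Inc. (DIELCO) applicant(12/29/2011)")

# Every parenthesised run of capital letters, in order of appearance.
found = re.findall(r"\([A-Z]+\)", cell)
print(found)

# Global replace: each lowercase run is swapped for "...".
replaced = re.sub(r"[a-z]+", "...", "aBRAcadaBRA")
print(replaced)
```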
Here are two solutions, one using the specific terms in the author's example, the other one expanding on the author's sample regex pattern which appears to match all ALLCAPS terms. I'm not sure which is wanted, so I gave both.
(Put the block of text in A1)
Generic solution for all words in ALLCAPS
=regexreplace(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b","|"),"\W+","|"),"^\||\|$","")
Result:
ERC|MC|NEA|DIELCO
NB: The brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
If you want space separation, the formula is a little simpler:
=trim(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b"," "),"\W+"," "))
Result:
ERC MC NEA DIELCO
(One way I like playing with regex in google spreadsheets is to read the regex pattern from another cell so I can change it without having to edit or re-paste into all the cells using that pattern. It looks like this:
Cell A1:
Block of text
Cell B1 (no quote marks):
\b\w[^A-Z]*\b
Formula, in any cell:
=trim(regexreplace(REGEXREPLACE(A1,B$1," "),"\W+"," "))
By anchoring it to B$1, I can fill all my rows at once and the reference won't increment.)
Previous answer:
Specific solution for selected terms (ERC, DIELCO)
=regexreplace(join("|",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")),"(^\||\|$)","")
Result:
ERC|DIELCO
As before, the brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
This formula will find any ERC or DIELCO, or both in the block of text. The initial order doesn't matter, but the output will always be ERC followed by DIELCO (the order of appearance is lost). This fixes the shortcoming with the previous answer using "(bra).*(bra)" in that isolated ERC or DIELCO can still be matched.
This also has a simpler form with space separation:
=trim(join(" ",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")))
Result:
ERC DIELCO
Please try:
=SPLIT(regexreplace(A1 ; "(?s)(.)?\(([A-Z]+)\)|(.)" ; "🧸$2");"🧸")
or
=REGEXEXTRACT(A1;"\Q"&REGEXREPLACE(A1;"\([A-Z]+\)";"\\E(.*)\\Q")&"\E")

Regex to replace gibberish

I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
One of the simplest solutions (not involving regexps):
import string

def is_garbage(line):
    # a line is garbage if more than half of its characters are punctuation
    number_of_punct = sum(c in string.punctuation for c in line)
    return number_of_punct > len(line) / 2
Or, as a crude regex: s/[!,'"@#~$%^& ]{5,}//g
A simple heuristic, similar to anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!#$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
// garbage
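A runnable take on this heuristic might look like the following. The threshold of 1.0 and the name looks_like_garbage are arbitrary choices for illustration, whitespace is ignored, and a line with no symbols at all is treated as clean to avoid dividing by zero.

```python
def looks_like_garbage(line, garbage_ratio=1.0):
    visible = [c for c in line if not c.isspace()]
    n_alnum = sum(c.isalnum() for c in visible)   # Na
    n_other = len(visible) - n_alnum              # Nb
    if n_other == 0:
        return False  # nothing but letters and digits (or an empty line)
    return n_alnum / n_other <= garbage_ratio

samples = [
    "accounts on top of 40 million he stole previously.",
    "~.\\O ::t",
    ". '(.. ~!' ': f I I",
]
for sample in samples:
    print(looks_like_garbage(sample), repr(sample))
```

Short junk lines made only of letters, such as "Pl", slip through this check, so it would need to be combined with another test.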
No idea how well it would work, but I have idly considered this problem in the past. I've on occasion played with a little programmatic device called a Markov chain.
Now the wikipedia article probably won't make much sense until you see some of the other things a markov chain is good for. One example of a markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, Given the letter "T", the letter h is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelihood of it appearing. Then, you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter. Filter out surprises.
There's some leeway for how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
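As a rough sketch of the "surprise filter", the code below builds a letter-pair frequency table from a reference text and scores a line by its average log-likelihood. Everything here is an assumption made for illustration: the tiny corpus (one paragraph from the question), the add-one smoothing, the alphabet_size of 128 and the function names.

```python
import math
from collections import Counter

def train(corpus):
    """Count how often each character follows each other character."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    first_counts = Counter(corpus[:-1])
    return pair_counts, first_counts

def average_log_likelihood(line, model, alphabet_size=128):
    """Higher (closer to 0) means the line looks more like the corpus."""
    pair_counts, first_counts = model
    if len(line) < 2:
        return float("-inf")
    total = 0.0
    for a, b in zip(line, line[1:]):
        # Add-one smoothing so unseen pairs get a small, non-zero probability.
        probability = (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        total += math.log(probability)
    return total / (len(line) - 1)

corpus = ("Federal prosecutors on Monday charged a Miami man with the largest "
          "case of credit and debit card data theft ever in the United States, "
          "accusing the one-time government informant of swiping 130 million "
          "accounts on top of 40 million he stole previously.")
model = train(corpus)

good = average_log_likelihood("case of credit and debit card data theft", model)
junk = average_log_likelihood("~.\\O ::t", model)
print(good, junk)
```

A real filter would train on a much larger corpus and pick the cut-off score empirically.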
I used a combination: eliminate lines that don't contain at least two 3-letter words or one 6-letter word.
([a-zA-Z]{3,}\s){2,}|([a-zA-Z]{6,})
http://www.regexpal.com/
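Applied as a line filter in Python, that rule could look like this. The sample lines come from the question, the names keep and cleaned are illustrative, and the character class is written as [a-zA-Z] because a | inside brackets would match a literal pipe.

```python
import re

# Two consecutive words of 3+ letters (each followed by whitespace),
# or a single run of 6+ letters.
keep = re.compile(r"(?:[a-zA-Z]{3,}\s){2,}|[a-zA-Z]{6,}")

lines = [
    "Federal prosecutors on Monday charged a Miami man with the largest",
    ", ':, Ie",
    "t:lHj~(1l",
    "Gonzalez is a former informant for the U.S. Secret Service who helped",
    "-....c......",
]
cleaned = [line for line in lines if keep.search(line)]
print(cleaned)
```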
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;

while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
    next unless length $chunk;
    my @tokens = split ' ', $chunk;
    # what is a word?
    my @words = grep {
        /^[A-Za-z]{2,}[.,]?$/
            or /^[0-9]+$/
            or /^(?:a|I)$/    # grouped so both anchors apply to each alternative
            or /^(?:[A-Z][.])+$/
    } @tokens;
    # completely arbitrary threshold
    my $score = @words / @tokens;
    print $chunk, "\n" if $score > 0.5;
}

sub read_chunk {
    my ($fh) = @_;
    my ($chunk, $line);
    # skip blank lines until the first line of the next chunk
    while ( my $line = <$fh> ) {
        if( $line =~ /\S/ ) {
            $chunk .= $line;
            last;
        }
    }
    # collect lines until a blank line or end of input
    while (1) {
        $line = <$fh>;
        last unless (defined $line) and ($line =~ /\S/);
        $chunk .= $line;
    }
    return $chunk;
}
__DATA__
Paste the text above after __DATA__ above (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask the user to help you and specify the type of text you work with, e.g. if it is a page from a book then you would expect the majority of lines to be the same length and to consist mainly of letters, spaces and punctuation.
Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random.
One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.
I'd claim a rule like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like
.Replace("\\p{P}+[a-zA-Z0-9]+", "");
Then you'd consider "any word with two or more punctuations consecutively":
.Replace(" \\p{P}{2,} ", "");
Seems like a good start anyway.
I like @Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).
Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non readable stuff with regex.