Regular expression to expand to sentence - regex

I'm trying to extract regions around keywords from longer passages of text. They should include complete sentences, based on the following conditions:
n=250 characters before / after the keyword should be included if they exist (the keyword can be closer than this to the start / end of the text)
from there it should expand further to include the complete sentence (let's assume here we can define sentence borders with ".?! or :" knowing it's not completely accurate)
I have already managed to expand to the end of the last sentence, but not to the start of the first one. In the following example, vitamin is the keyword and the italic part is what the regex captures. However, it should capture from "An extra 24 hours..." onwards.
Apparently I don't get the corresponding group at the front, with neither a lazy quantifier nor a lookbehind.
/((.{0,250}(vitamin)\b.{0,250})(.+?(\.|\!|\?|\:))?)/ig
Well, this year you’re getting an extra day to get ahead on your taxes or (finally) clean out the garage. (Hey, we’re not trying to tell you what do but you might as well be productive.) February 29 is back on the calendar this year because it’s a leap year. Whether you love or loathe the extra winter day, you’re probably wondering why it happens in the first place. An extra 24 hours — or day — is built into the calendar every four years to ensure it aligns with the Earth’s movement around the sun. There’s 365 days in a calendar year, but it actually takes longer for the Earth’s annual journey — about 365.2421 days — around the star that gives us light, life and vitamin D. The difference may seem like no big deal to us, but over time, it adds up. “To ensure consistency with the true astronomical year, it is necessary to periodically add in an extra day to make up the lost time and get the calendar back in sync with the heavens,” according the history.com.
Acknowledgement of the need for a leap year happened around the time of Julius Caesar. In 46 B.C., Caesar enlisted the help of astronomer Sosigenes to update the calendar so that it had 12 months and 365 days, including a leap year every four years.

You can try something like this:
(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])
It works by consuming 250 characters of text before and after the keyword "vitamin". From that point it finds the first punctuation point (.?!:) before/after the 250 characters of text.
Here's a sample of it in action.
You can use extra parentheses () to strategically group exactly the output you want. For example, the above answer includes the ending period from the preceding sentence in the output. So you could use
(([.?!:]([^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:]))
and use group 3 from the result set which doesn't have this ending period.

I do not see how the specification in the question can be matched by a regex. It boils down to the following logic problem:
to match as many characters as possible, but no more than 250, before/after the keyword, .{0,250} needs to be greedy; it can be neither lazy (.{0,250}?) nor possessive (.{0,250}+)
if this part is greedy, you will miss the occurrences of the keyword that start before the .{0,250} part is matched.
As I understand it, the same logic applies to matching back to the start of the sentence.
I played around with the following more or less meaningful regex:
[.?!:]?([^.?!:]*?(.{0,250}\byear\b.{0,250})[^.?!:]*[.?!:]?) misses first 'year'
[.?!:]?([^.?!:]*?(.{0,250}?\byear\b.{0,250})[^.?!:]*[.?!:]?) gets the first 'year' but fails on others.
I suggest you write your own extraction logic in a function, either using regex or not, to achieve the extraction you want.
You could for example find the index of the start of the keyword \bkeyword\b and the full stops (\.[^\d]|[.?!:]$) and then with this information extract the part of the text you want.
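The index-based approach described above could be sketched roughly as follows in Python. This is only an illustration under a few assumptions: extract_context and its n parameter are hypothetical names, sentence borders are the plain ".?!:" set from the question (so a decimal such as 365.2421 also counts as a border), and the keyword is matched case-insensitively as a whole word.

```python
import re

# Sentence terminators as defined in the question (knowingly inexact).
SENTENCE_END = re.compile(r"[.?!:]")

def extract_context(text, keyword, n=250):
    """Return one snippet per whole-word occurrence of keyword.

    Each snippet covers up to n characters on either side of the keyword,
    then grows outwards to the nearest sentence terminators.
    """
    snippets = []
    word = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in word.finditer(text):
        # Clamp the fixed-size window to the bounds of the text.
        left = max(0, m.start() - n)
        right = min(len(text), m.end() + n)
        # Expand left: last terminator before the window (-1 if there is none).
        prev_end = max(text.rfind(ch, 0, left) for ch in ".?!:")
        start = prev_end + 1
        # Expand right: first terminator at or after the window's right edge.
        nxt = SENTENCE_END.search(text, right)
        end = nxt.end() if nxt else len(text)
        snippets.append(text[start:end].strip())
    return snippets
```

Because the window is clamped with max/min, a keyword closer than n characters to the start or end of the text is still found, which is the case the single-regex attempts struggle with.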

Related

Select all lines starting with a character and keep only one entry

I'm using a firefox addon called "rikaisama", this addon is a pop up dictionary for japanese and it allows epwing dictionary files. In the addon option we can use one regular expression to remove unnecessary parts of a dictionary entry.
I'm using the "Kenkyusha's New Japanese-English Dictionary" epwing file but it has far too many examples to be readable.
Example of an entry:
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
・締め切りに間に合う meet the deadline
・開演に間に合う arrive before curtain time
▲9 時の札幌行きに間に合うように空港に着いた. I arrived in time for the nine o'clock flight to Sapporo.
・「間に合うかな」「走っても間に合いそうにないね」 "Will we be in time?"―"It doesn't look like we'll be in time even if we run."
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
・これだけあれば丸 1 年は間に合う. This will last us [see us through] one whole year. | This will be enough for a whole year.
Here, all lines starting with "▲" or "・" are examples and all lines matching this regex are definitions:
\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
I already managed to come up with this regular expression on my own but it removes all examples:
\n[^″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
Is it possible to have a regex that matches this regex AND the line following the match?
Desired result:
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
Any help appreciated !
You almost had it I think - just add \n.* so it becomes /\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*/. That makes it get the next line...
See it in action here: https://regexr.com/442st
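To check what the suggested pattern picks up outside the addon, here is a small Python sketch. It is for illustration only: the entry is an abridged copy of the one in the question, and the names PATTERN, entry, matches, headword and kept are made up for this example.

```python
import re

# Pattern from the answer: a definition line plus the line that follows it.
PATTERN = re.compile(r"\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*")

# Abridged copy of the dictionary entry from the question.
entry = (
    "まにあう【間に合う】 ローマ(maniau)\n"
    "1 〔時間に遅れない〕 be in time 《for…》.\n"
    "▲7 時の列車に間に合う catch [make] the 7 o'clock train\n"
    "・締め切りに間に合う meet the deadline\n"
    "2 〔役に立つ〕 be useful; be of use [service].\n"
    "▲「費用はどのぐらいかな」 \"And what is the expense?\"\n"
    "・これだけあれば丸 1 年は間に合う. This will be enough for a whole year.\n"
)

# Each match is "\n<definition line>\n<first example line>".
matches = PATTERN.findall(entry)

# The headword line has no leading newline, so it is kept separately.
headword = entry.split("\n", 1)[0]
kept = headword + "".join(matches)
print(kept)
```

Since . does not match a newline by default, each .* stops at the end of its line, so every match spans exactly two lines.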

Search for a string with inconsistent inner characters, but consistent outer ones

How would you construct a search in order to look for a string where the outer characters are consistent, but the inner ones could change?
For example, suppose I have this text (and a lot of other text like it):
In January 2015, I went for a walk.
In June 2005, I went for a jog.
During December 2000, they went for a drive.
I want to search for the year. The only things I know that are consistent are the presence of the year, followed by a comma.
How would I search for '20xy,' where xy could be anything from 00 to 17?
Edit: Searching for just '20' (or variations of) is no good, as the number 20 may appear earlier in the text of one of the documents I'm working with.
Edit 2:
What I'm after is the index of the first instance that the year appears where it is followed by a comma. The year could be anything from 2000 to present, but the comma is always present. If the year appears earlier in the string without a comma, then we ignore it.
eg. The year is 2000 and this is followed by 2001, when I went swimming.
In this example, I want to ignore 2000 and find 2001.
You could use regular expressions with re.search(), looking first for 200[0-9] and then for 201[0-7], each followed by the comma.
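A minimal sketch of that idea in Python, with both ranges folded into one pattern. The names YEAR_COMMA and first_year_index are hypothetical, and the upper bound of 2017 is taken from the question, so it would need adjusting for later years.

```python
import re

# 2000-2009 or 2010-2017, immediately followed by a comma.
# (?<!\d) stops the match from starting in the middle of a longer number.
YEAR_COMMA = re.compile(r"(?<!\d)20(?:0\d|1[0-7]),")

def first_year_index(text):
    """Return the index of the first 'year,' in text, or -1 if there is none."""
    m = YEAR_COMMA.search(text)
    return m.start() if m else -1
```

For the example from the second edit, the 2000 without a comma is skipped and the index of 2001 is returned.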

Excluding % from a Regex number search

I'm attempting to create a Regex that finds only 2-digit integers or numbers with a precision of 2 decimal points.
In the example string at the bottom, I want to find only the following:
21 and 10.50
Using this expression, 100% is getting captured, in addition to the strings I desire to capture:
(\d){1,2}(\.?)([0-9]?[0-9]?){1,2}
I know I need to use ^% somewhere, but I can't figure out where it goes. Any suggestions are greatly appreciated.
Here's my sample string:
Earn Up to $21 Per Hour - Deliver Food with !!
Delivery Drivers work when they want and make great money when they do.
All orders are prepaid, just pick them up and deliver them to hungry diners. No waiting in line or fumbling with receipts and prepaid cards.
It's fast and easy to start working. Get started today.
Apply Now
Why choose ?
More orders than any other takeout platform
100% of our restaurants are official partners
Competitive pay: Per order fee + mileage + tips
We guarantee an hourly minimum of $10.50/hour*
Create your own schedule & work the hours you want
Word boundaries in your regular expression will grant you a bit more control.
Since word boundaries are a bit strict, we need to introduce an OR condition to address both cases which will satisfy your regex.
(\b[\d]{2}\.[\d]{2}\b)|(\b[\d]{2}\b)
Edit: Try this one,
\b[\d]{2}\b(\.[\d]{2})?
The first example has a chance to fail as it is order dependent due to the way it short-circuits. This I believe should address multiple cases properly.
I think this should work:
(?<!\d)((\d+\.\d\d)|(\d\d))(?!%|\d)
Demo (and explanation)
EDIT:
Improved version:
(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)
Demo (and explanation)
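As a quick check of the improved version, here it is run in Python against the relevant lines of the sample string. This is just a sketch; NUMBER, sample and matches are names chosen for the example.

```python
import re

# The improved pattern: 1-2 digits, optional 1-2 decimals,
# not preceded by a digit and not followed by % or another digit.
NUMBER = re.compile(r"(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)")

# The three lines of the sample string that contain digits.
sample = """Earn Up to $21 Per Hour - Deliver Food with !!
100% of our restaurants are official partners
We guarantee an hourly minimum of $10.50/hour*"""

matches = NUMBER.findall(sample)
print(matches)  # ['21', '10.50']
```

The lookbehind and lookahead together reject every position inside 100%, because each candidate is either preceded or followed by another digit or by the % sign.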
You can try this variant: (\d{1,}|[\d.])\b(?!%)
It uses a negative lookahead (?!%) to exclude digits followed by a % sign.
Details at regex101

Using RegEx to Find a Block of Text

I'm attempting to block a long string of unnecessary text that's on every page of a document.
Ex: "36075 This is another page and this is the date March 4 2013"
I know this must be very simple, but I'm hoping there is a way to block text verbatim. Is the only way to block this text by using a lot of /d/s/w+/+ etc or is there a way to say, "match 36075 This is another page and this is the date March 4 2013".
This would be SO HELPFUL to know. Thank you for helping!
From what you wrote I assume you need to get leading numbers from string, to do it you just need to use this pattern: ^\d+ which from this input:
36075 This is another page and this is the date March 4 2013
will return this:
36075
For future questions like this, please provide an example string and the expected output, as well as what you have tried.
I realized the issue I was having. I didn't need to use RegEx. The program I was using has the functionality to match specific words or groups of words and pronounce them differently. What I discovered is that it will not match the words unless the word groups are input exactly the way the program typically reads them.
Ergo --> The channel saw
the end of the British hold over
Would have to be listed as one group for, "The channel saw" and a second group for "the end of the British hold over"
In addition, there were some numbers --> 11960_30_o_ho_
and if the program naturally read 119 and then 60_3 and then _o_ho_ then three strings would need to be input for each section.
A few frustrating hours later, problem solved :) Thank you for your assistance.

Regex to match time

I want my users to be able to enter a time in a form.
If more info is necessary: users use this to express how much time is needed to complete a task, and it will be saved in a database if filled in.
Here is what I have:
/^$|^([0-1]?[0-9]|2[0-4]):([0-5][0-9])(:[0-5][0-9])?$/
It matches an empty form or 01:30 and 01:30:00 formatted times. I really won't need the seconds, as every task takes at least a minute, but when I tried removing that part it just crashed my code and removed support for the empty string. I really don't understand regex at all.
What I'd like is for it to also match simple minutes and simple hours, for instance 3:30, 3:00, 5. Is this possible? It would greatly improve the user experience and limit wasted typing. But I'd like to keep the leading zero optional in case some users find it natural to type it.
I think the following pattern does what you want:
p=r"((([01]?\d)|(2[0-4])):)?([0-5]\d)?(:[0-5]\d)?"
The first part:
((([01]?\d)|(2[0-4])):)?
is an optional group which deals with hours in format 00-24.
The second part:
([0-5]\d)?
is an optional group which deals with minutes if hours or seconds are present in your expression. The group also deals with expressions containing only minutes or only hours.
The third part:
(:[0-5]\d)?
is an optional group dealing with seconds.
The following samples show the pattern at work:
In [180]: re.fullmatch(p,'14:25:30').group()
Out[180]: '14:25:30'
In [182]: re.fullmatch(p,'2:34:05').group()
Out[182]: '2:34:05'
In [184]: re.fullmatch(p,'02:34').group()
Out[184]: '02:34'
In [186]: re.fullmatch(p,'59:59').group()
Out[186]: '59:59'
In [188]: re.fullmatch(p,'59').group()
Out[188]: '59'
In [189]: re.fullmatch(p,'').group()
Out[189]: ''
As every group is optional the pattern matches also the empty string. I've tested it with Python but I think it will work with other languages too with minimal changes.
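One caveat worth checking: because every group in the pattern above is optional, an unanchored re.match succeeds on any input, and a lone digit such as 5 only matches as the empty string. Below is a minimal sketch of an anchored variant. TIME_RE and is_valid_time are hypothetical names, the 2[0-4] hour range is kept from the question, and reading a bare number as hours is an assumption, since the question doesn't say whether 5 means hours or minutes.

```python
import re

# Empty string, or hours (leading zero optional) followed by
# optional ":MM" and optional ":SS".
# Assumption: a bare number such as "5" is read as hours.
TIME_RE = re.compile(r"(?:(?:[01]?\d|2[0-4])(?::[0-5]\d){0,2})?")

def is_valid_time(value):
    """Return True if the whole input is empty or a valid time."""
    # fullmatch anchors the pattern at both ends of the input.
    return TIME_RE.fullmatch(value) is not None
```

With this variant, 5, 3:30, 3:00, 01:30 and 01:30:00 are accepted along with the empty string, while inputs such as 25, 3:7 or arbitrary text are rejected.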