How to get a string before the last occurrence of a specific character before a maximum character count?

I have some long but variable-length texts that are divided into sections marked by ********************. I need to post those texts into a field that only accepts 2048 characters, so I will need to divide that text into groups of no more than 2048 characters but which do not contain an incomplete section.
My regex so far is ^([\s\S]{1,2048})([\s\S]{1,2048})([\s\S]{1,2048})
However, this has two problems:
1) It divides the text into groups that can include an incomplete section. What I want is a complete section, even if it is not a full 2048 characters. Assume the example below is at the end of 2048 characters.
Here's my actual result. Notice that the "7 Minute Workout" section is cut off mid-section:
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
********************
7 Minute Workout: Lose Weight (📱)
Scientifically-proven and featured by the New York Times, a 7-minute high intensity workout proven to lose weig
Here's my desired result. Notice that the "7 Minute Workout" section is entirely omitted because it could not be included in its entirety while staying under the 2048 character limit.
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
2) The second problem with this regex is that the text I need to input varies greatly in length; it may be less than 2048 or it could be 10,000+ characters. My regex obviously only works for texts up to 6,144 characters long. Do I just keep duplicating the regex a crazy number of times to get longer than the longest text I could enter, or is there a way to get it to repeat?
Addendum: Several asked about the use case/environment for this question. No, it’s not a spambot 🙂. Rather, I’m trying to use Apple’s Shortcuts app to cross-post items from my website to followers on Kik. Unfortunately, Kik has a 2048 character limit, so I can’t post it all at once. I’m trying to use regex to split the text into appropriate sections so I can copy them from Shortcuts and paste them one at a time into Kik.

A couple of notes:
There is no need to use groups at all; just use the match results directly, since each match represents one section.
Use a lazy quantifier instead of a greedy one by adding ? after {1,2048} so that the match is cut in the right place.
In my regex, I used only the global flag g, without the multiline flag m.
The regex below works only with sections of 2048 characters or fewer. A section with more than 2048 characters will be skipped.
The regex below uses a positive lookahead to detect the end of the section without consuming it.
Here is the regex:
^|\*[\s\S]{1,2048}?(?=\n\*|$)
Example: https://regex101.com/r/hezvu5/1/
==== Update ====
To make the results greedy, to match as many sections as possible without splitting the last section, use this regex:
^|\*[\s\S]{1,2048}(?=\n\*|$)
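For anyone who can run a script instead of a single regex, the same grouping can be sketched in Python. This is only an illustration, not part of the answer above: the function name chunk_sections and its limit parameter are hypothetical, and it assumes every section starts with a line of exactly 20 asterisks. Note that len() counts Unicode code points, which may not be how Kik counts emoji.

```python
import re

def chunk_sections(text, limit=2048):
    """Group complete sections into chunks of at most `limit` characters."""
    # Split in front of every separator line (assumed: exactly 20 asterisks),
    # keeping each separator with the section that follows it.
    # Splitting on a zero-width match needs Python 3.7 or later.
    sections = [s for s in re.split(r"(?=^\*{20}$)", text, flags=re.MULTILINE) if s]

    chunks, current = [], ""
    for section in sections:
        # Close the current chunk if adding this section would exceed the limit.
        if current and len(current) + len(section) > limit:
            chunks.append(current)
            current = ""
        # A single section longer than `limit` still becomes its own oversized chunk.
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Because whole sections are packed one after another, the text length no longer matters, and joining the chunks gives back the original text.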

Related

regex search window size

I have a large document that I am running a regex on. Below is an example of a similar expression:
(?=( aExample| bExample)(?=.*(XX))(?=.*(P1)))
This works much of the time, but sometimes, because of other text in the document, the condition is met only by looking across the entire document; e.g., there might be 10 characters between "aExample" and "XX", but 1,000 characters between "XX" and "P1". I would like to restrict the expression to a window of N characters (let's say 50 for the sake of the example) so that the regex is a little more conservative. How can I reduce the window the regex searches to N characters instead of the entire string/document? Any help is appreciated. Thanks!
(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))
You want to limit how many characters each . can consume, so replace .* with a bounded quantifier in braces, such as .{1,50}.
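A small Python sketch of the bounded version, in case it helps to experiment. The helper name windowed_pattern and the sample strings are made up for illustration; the pattern itself is the one given above with the window size substituted in. Keep in mind that both windows are measured from the end of the keyword, and that . does not cross newlines unless re.DOTALL is set.

```python
import re

def windowed_pattern(window=50):
    """Build the lookahead pattern with both gaps limited to `window` characters."""
    # Doubled braces are literal braces in an f-string, so this yields .{1,50} etc.
    return re.compile(
        rf"(?=( aExample| bExample)(?=.{{1,{window}}}(XX))(?=.{{1,{window}}}(P1)))"
    )

near = "see aExample then XX and P1 here"
far = "see aExample then XX " + "-" * 1000 + " P1"

pattern = windowed_pattern(50)
print(bool(pattern.search(near)))  # both markers fall inside the window
print(bool(pattern.search(far)))   # P1 is too far away
```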

Regex (Python) data extraction - overlapping or incomplete results

I'm trying to extract data from some WHO codebooks that I've converted from PDF to text with the Python slate library.
The text I want to hit starts with 2 digits, dash, 2 digits, followed by some text and ends with "Q"+1 or 2 digits and again "Q"+1 or 2 digits
17-17How old are you?Q1Q1
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Sometimes those phrases end with a blank; sometimes the next question starts immediately. Here are three questions; observe Q4Q424-29 and Q5Q530-30:
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q424-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q530-30During the past 30 days, how often did you go hungry because there was not enough food in your home?Q6Q7
With
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d)*?
I get pretty close, but I'm missing the second digit when the second "Q" has two digits.
I've tried to add a negative lookahead
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d((\d)(?!\d\d-))
to exclude the start of the pattern with two digits and a dash.
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d{1,2}
includes the second digit of the "Q" but generates overlapping results e.g. at Q4Q424-29 where the first string ends with Q4Q42 and the second string starts with 4-29.
The regex with parts of the original sample text is here: https://regex101.com/r/d9Dlga/2/
Any suggestions on how to extract the correct strings, like:
17-17How old are you?Q1Q1
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4
24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Thanks!
I see the problem now. New attempt that I think works:
\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})
I put a negative lookahead at the end to test if a new section has begun.
9 matches
Correctly grabs the full 2-digit endings
Demo
The following pattern should work:
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d(?!\d-))?
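Since the question is about Python, here is a minimal sketch that runs the first answer's pattern over the run-together sample from the question. The function name extract_questions is hypothetical. It assumes each item sits on a single line, because . does not match newlines without re.DOTALL.

```python
import re

# Lazy body, then a negative lookahead: if the text after the match looks like
# "digit, dash, two digits", the last digit taken belongs to the next item's
# range, so the engine backtracks and gives that digit back.
QUESTION = re.compile(r"\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})")

def extract_questions(text):
    """Return every codebook item found in `text` as a list of strings."""
    return QUESTION.findall(text)

sample = (
    "20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4"
    "24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5"
    "30-30During the past 30 days, how often did you go hungry because there was "
    "not enough food in your home?Q6Q7"
)

for item in extract_questions(sample):
    print(item)
```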

Regular expression to expand to sentence

I'm trying to extract regions around keywords from longer passages of text. They should include complete sentences, based on the following conditions:
n=250 characters before/after the keyword should be included if they exist (the keyword can be closer than this to the start/end of the text)
from there it should expand further to include the complete sentence (let's assume here we can define sentence borders with ".?! or :" knowing it's not completely accurate)
I already achieved the expanding to the end of the last sentence, but not to start of the first in the following example, where vitamin is the keyword and the italic is captured by the regex. However, it should capture from "An extra 24 hours..."
Apparently I don't get the corresponding group up front, using neither lazy matching nor a lookbehind.
((.{0,250}(vitamin)\b.{0,250})(.+?(\.|\!|\?|\:))?)/ig
Well, this year you’re getting an extra day to get ahead on your taxes or (finally) clean out the garage. (Hey, we’re not trying to tell you what do but you might as well be productive.) February 29 is back on the calendar this year because it’s a leap year. Whether you love or loathe the extra winter day, you’re probably wondering why it happens in the first place. An extra 24 hours — or day — is built into the calen dar every four years to ensure it aligns with the Earth’s movement around the sun. There’s 365 days in a calendar year, but it actually takes longer for the Earth’s annual journey — about 365.2421 days — around the star that gives us light, life and vitamin D. The difference may seem like no big deal to us, but over time, it adds up. “To ensure consistency with the true astronomical year, it is necessary to periodically add in an extra day to make up the lost time and get the calendar back in sync with the heavens,” according the history. com.
Acknowledgement of the need for a leap year happened around the time of Julius Caesar. In 46 B.C., Caesar enlisted the help of astronomer Sosigenes to update the calendar so that it had 12 months and 365 days, including a leap year every four years.,
You can try something like this:
(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])
It works by consuming 250 characters of text before and after the keyword "vitamin". From that point it finds the first punctuation point (.?!:) before/after the 250 characters of text.
Here's a sample of it in action.
You can use extra parentheses () to strategically group exactly the output you want. For example, the above answer includes the ending period from the preceding sentence in the output. So you could use
(([.?!:]([^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:]))
and use group 3 from the result set which doesn't have this ending period.
I do not see how the specification in the question can be matched by a regex. It boils down to the following logic problem:
to match as many characters as possible but no more than 250 before/after the keyword, .{0,250} needs to be greedy and can neither be lazy .{0,250}? nor possessive .{0,250}+
if this part is greedy, you will miss the occurrences of the keyword that start before the .{0,250} part is matched.
As I understand it, the same logic applies to matching back to the start of the sentence as well.
I played around with the following more or less meaningful regex:
[.?!:]?([^.?!:]*?(.{0,250}\byear\b.{0,250})[^.?!:]*[.?!:]?) misses first 'year'
[.?!:]?([^.?!:]*?(.{0,250}?\byear\b.{0,250})[^.?!:]*[.?!:]?) gets the first 'year' but fails on others.
I suggest you write your own extraction logic in a function, either using regex or not, to achieve the extraction you want.
You could for example find the index of the start of the keyword \bkeyword\b and the full stops (\.[^\d]|[.?!:]$) and then with this information extract the part of the text you want.
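A rough Python sketch of that index-based idea follows. The function name extract_context is invented for illustration, and it handles only the first occurrence of the keyword. It also uses the question's simplification that any of . ? ! : ends a sentence, so a decimal such as 365.2421 would count as a boundary.

```python
import re

TERMINATORS = ".?!:"

def extract_context(text, keyword, n=250):
    """Return the keyword with n characters either side, widened to whole sentences."""
    match = re.search(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE)
    if match is None:
        return None

    # Fixed-width window around the keyword, clamped to the text.
    left = max(match.start() - n, 0)
    right = min(match.end() + n, len(text))

    # Widen left: start just after the last terminator before the window (or at 0).
    start = max(text.rfind(ch, 0, left) for ch in TERMINATORS) + 1

    # Widen right: stop just after the first terminator at or beyond the window.
    closing = re.search(r"[.?!:]", text[right:])
    end = right + closing.end() if closing else len(text)

    return text[start:end].strip()
```

For example, extract_context("First sentence. Second one has vitamin D in it. Third sentence here.", "vitamin", n=5) returns only the middle sentence.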

Parsing out a number after a specific word appears

I am trying to collect the forecasted high temperature from the National Weather Service from their text-based website. The website I am trying to pull information from can be found here.
So far I have been able to pull the first number that appears after each day. Most of the time this is the high temperature, but occasionally they will put a precipitation amount before the forecasted temperature for the day. I want to find a way to pull the digits that follow the word "high". It should also be noted that sometimes they use "high near", "high around", or some other variation, so the number won't necessarily be the next string following "high".
Below is my code. I intend to run this every day at a certain time, so I will get the current day's forecast and the forecasts for up to six days later. If you were to run this code in the evening, you would get the next seven days of forecasted temperatures, with the first temperature actually referring to next week's forecast.
My end goal is to put this onto trendy, so I'm sure this would be easier to accomplish in other formats, but I want to stick straight to Matlab.
url = 'http://forecast.weather.gov/MapClick.php?lat=40.48622&lon=-74.45181587699966&unit=0&lg=english&FcstType=text&TextType=1';
html = urlread(url);
DayForm = 'long';
today = clock;
today = today(:,3);
nvalue = zeros(6,1);
for i = 0:6
    [~, getDay] = weekday(today+i,DayForm);
    target = ['<b>' getDay ':'];
    [a,b] = regexp(html,'\d');
    strPos = find( a > strfind(html,target),1,'first');
    nvalue(i+1) = str2double(html(a(strPos):b(strPos)+1));
end
EDIT: after implementing the answer, here is my updated code:
url = 'http://forecast.weather.gov/MapClick.php?lat=40.48622&lon=-74.45181587699966&unit=0&lg=english&FcstType=text&TextType=1';
html = urlread(url);
fcast = zeros(7,1);
target = 'with\sa\shigh\s\w*\s?([0-9]+)';
[~,b] = regexp(html,target);
for i = 1:7
    fcast(i) = str2double(html(b(i)-1:b(i)));
end
It seems MATLAB supports only GNU extended regex, which is limiting, so MrAzzaman's answer may not work. Although his answer accounts for "mph" values that are preceded by the word "high", the following regex should match and capture the digits you want in capture group $1.
with\sa\shigh\s\w*\s?([0-9]+)
It finds "with a high", then a space, then an optional word, then an optional space, followed by the capture group that contains the temperature.
It should work.
This is slightly complicated by the fact that they also occasionally say things like "winds as high as 32 mph". The following works, though there may be more edge cases that aren't accounted for:
high\D+(\d+)\D(?!mph)
This searches for the word 'high', and then slurps all of the characters until it reaches a digit. It grabs the digits in a group, and then grabs the next non-digit character (this ensures it grabs all of the digits). It then uses a negative lookahead to make sure the next 3 letters aren't 'mph' (which would suggest that the number indicates a wind, rather than a temperature).
As I said, there may be more edge cases, but it seems to work for the present web page.
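To see how that pattern behaves outside MATLAB, here is a short Python sketch run against a made-up forecast string (the function name forecast_highs and the sample text are illustrative only). One consequence of the trailing \D is that a temperature at the very end of the input, with no character after it, would not match.

```python
import re

# "high", then non-digits, then the captured number, then one non-digit
# that must not be followed by "mph".
HIGH_TEMP = re.compile(r"high\D+(\d+)\D(?!mph)")

def forecast_highs(text):
    """Return the numbers that follow 'high', skipping wind speeds in mph."""
    return [int(digits) for digits in HIGH_TEMP.findall(text)]

sample = (
    "Sunny, with a high near 75. Winds as high as 32 mph. "
    "Monday: Cloudy, with a high around 68."
)
print(forecast_highs(sample))
```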

Regular Expression - Matching and extracting complicated conditions

I'm trying to write a regular expression that will match these conditions:
Maximum of 8000 characters (any characters, including "\r\n")
Maximum of 10 lines (separated by \r\n).
Extract only the first 4 lines from the matched text.
I can't find a good way to do it.
Thanks!!
Regular expressions are not what you need. They are used to match a certain pattern, not a certain length. If you are holding the data in a string, myString.length <= 8000 is all you need for the character count (using the correct syntax for your language, of course). For the number of lines, you will have to count the number of \r\n sequences in your string (can be done iteratively). To get the first four lines, just find the 4th \r\n and get everything before that with a substring method.
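The steps described above could look roughly like this in Python. This is a sketch under the question's stated assumptions (lines separated by \r\n, limits of 8000 characters and 10 lines); the function name first_lines and its parameters are hypothetical.

```python
def first_lines(text, max_chars=8000, max_lines=10, keep=4):
    """Return the first `keep` lines if `text` is within both limits, else None."""
    # Length check: no regex needed.
    if len(text) > max_chars:
        return None

    # Line check: split on the \r\n separator named in the question.
    lines = text.split("\r\n")
    if len(lines) > max_lines:
        return None

    # Keep only the leading lines, rejoined with the same separator.
    return "\r\n".join(lines[:keep])
```

Returning None for invalid input is just one choice; raising an exception would work equally well.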
Description
This expression does the following:
validates the input string is between zero and 8,000 characters
validates there are at most 10 lines of new line delimited text
then captures the first 4 new line delimited lines of text
\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r\n\Z]+){0,4}
This requires the options m (multiline) and s (dot matches all characters).
Expanded
\A anchors to the beginning of the string; this anchor allows the use of the s option, which allows the . to match carriage return and line feed characters
(?=.{0,8000}\Z) look ahead and validate there are between zero and 8000 characters
(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z) look ahead and validate there are no more than 10 new line delimited lines
(?:^.*?[\r\n\Z]+){0,4} match the first 4 lines of text
PHP Code Example:
You didn't specify a language so I'm including this PHP example to show how it works and the sample output.
Input Text
This input test is 8 lines of new line delimited strings. There are only 1779 characters here.
Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then
she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word "and"
and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe
and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.
Code
<?php
$sourcestring="your source string";
preg_match('/\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r|\n\Z]+){0,4}/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
)