Regular Expression - Matching and extracting complicated conditions - regex

I'm trying to write a regular expression that will match these conditions:
Maximum of 8000 characters (any characters, including "\r\n")
Maximum of 10 lines (separated by \r\n).
to extract from the matched text only the first 4 lines.
Can't find a good way do it...:/
Thanks!!

Regular expressions are not what you need. They are used to match a certain pattern, not a certain length. If you are holding the data in a string, myString.length <= 8000 is all you need for the character count (using the correct syntax for your language, of course). For the number of lines, you will have to count the number of \r\n sequences in your string (can be done iteratively). To get the first four lines, just find the 4th \r\n and get everything before that with a substring method.

Description
This expression does the following:
validates the input string is between zero and 8,000 characters
validates there are at most 10 line of new line delimited text
then captures the first 4 new line delimited lines of text
\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r\n\Z]+){0,4} This requires options: m multiline, and s dot matches all characters
Expanded
\A anchor to the begining of the string, this anchor allows the use of the s option which allows the . to match new line and line feed characters
(?=.{0,8000}\Z) look ahead and validate there are between zero and 8000 characters
(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z) look ahead and validate there are no more then 10 new line delimited lines
(?:^.*?[\r\n\Z]+){0,4} match the first 4 lines of text
PHP Code Example:
You didn't specify a language so I'm including this PHP example to show how it works and the sample output.
Input Text
This input test is 8 lines of new line delimited strings. There are only 1779 characters here.
Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then
she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word "and"
and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe
and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.
Code
<?php
$sourcestring="your source string";
preg_match('/\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r|\n\Z]+){0,4}/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
)

Related

Regex to select two spaces between words (or two spaces before a letter)

I'm cleaning up some ancient HTML help files and there are a lot of double-spaces that I'd like to clean up (replace each double-space with single).
Sample: <li><b>% Successful</b>. The percentage of jobs that returned a confirmation.</li>
I want to find double-spaces only before the start of sentences or between words (so between the setting label or between 'percentage' and 'of', not spaces in isolation or before the XML tags).
I tried a simple search for two spaces, but that also brings up the tab/space mixture that the creator used for formatting the indents, so I'm getting five useless results for every relevant one.
Is there a single regex that would help with both use cases, or is it better to use two different ones for each format? I'm fine either way, just am still pretty new to regexes and not sure where to start on this one.
Spaces between periods and text (three here for a better visual):
<li><b>Submit Time</b>. The time the job was scheduled.</li>
Spaces between words (three here as well):
<li><b>End Time</b>. The date/time when the job was completed or canceled.</li>
I'm looking for multiple spaces either between words or at the start of sentences (generally after the setting name period).

Regex select words longer than 4 characters but only one instance if duplicates

I am trying to format text in InDesign using GREP Style.
The goal is to select words longer then 4 letters in a paragraph but if the word has been duplicated in a paragraph it should not select more then first instance of this word.
This is sample text:
"The Lord's right hand is lifted high; the Lord's right hand has done mighty things!"
The solution should give
Lord right hand lifted high done mighty things
i have done the first part
[[:word:]]{4,}
but don't have a clue how to deal with those duplicates.
Is order a requirement? If not, how about words longer than 4 characters not followed by that same word later in the text? See:
([[:word:]]{4,})(?!.*\1)
https://regex101.com/r/Ug4dLZ/1
Result: lifted high Lord right hand done many things
To be more comprehensive, include word breaks (i.e. count "Person" and "Personhood" as 2 separate words):
([[:word:]]{4,})(?!.*\b\1\b)

How to get a string before the last occurrence of a specific character before a maximum character count?

I have some long but variable-length texts that are divided into sections marked by ********************. I need to post those texts into a field that only accepts 2048 characters, so I will need to divide that text into groups of no more than 2048 characters but which do not contain an incomplete section.
My regex so far is ^([\s\S]{1,2048})([\s\S]{1,2048})([\s\S]{1,2048})
However, this has two problems:
1) It divides the text into groups that can include an incomplete section. What I want is a complete section, even if it is not a full 2048 characters. Assume the example below is at the end of 2048 characters.
Here's my actual result. Notice that the "7 Minute Workout" section is cut off mid-section
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
********************
7 Minute Workout: Lose Weight (📱)
Scientifically-proven and featured by the New York Times, a 7-minute high intensity workout proven to lose weig
Here's my desired result. Notice that the "7 Minute Workout" section is entirely omitted because it could not be included in its entirety while staying under the 2048 character limit.
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
2) The second problem with this regex is that the text I need to input varies greatly in length; it may be less than 2048 or it could be 10,000+ characters. My regex obviously only works for texts up to 6,144 characters long. Do I just keep duplicating the regex a crazy number of times to get longer than the longest text I could enter, or is there a way to get it to repeat?
Addendum: Several asked about the use case/environment for this question. No, it’s not a spambot 🙂. Rather, I’m trying to use Apple’s Shortcuts app to cross-post items from my website to followers on Kik. Unfortunately, Kik has a 2048 character limit, so I can’t post it all at once. I’m trying to use regex to split the text into appropriate sections so I can copy them from Shortcuts and paste them one at a time into Kik.
Couple Notes:
No need to use groups at all, just use match results directly as each match represent one section.
Use lazy quantifier instead of greedy by adding ? after {1,2048} to make the match cut in the right place.
In my regex, I used only Global g without the multiline m.
The code below will work only with sections that have 2048 characters or less. If the section has more than 2048 characters, it will be skipped.
The regex below uses Positive Lookahead to signal the end of the section without matching.
Here is the regex:
^|\*[\s\S]{1,2048}?(?=\n\*|$)
Example: https://regex101.com/r/hezvu5/1/
==== Update ====
To make the results greedy, to match as many sections as possible without splitting the last section, use this regex:
^|\*[\s\S]{1,2048}(?=\n\*|$)

Matching variable words in a string

This will sound extremely nerdy, but I play this online game that writes its in-game events to a log file. There's a program I'm using that is capable of reading this log file, and it's also capable of interpreting regex. My goal is to write a regex command that analyzes a certain string from this log file and then spits out certain parts of the string onto my screen.
The string that gets written to the log file has the following syntax (variables in bold):
NAME hits/bashes/crushes/claws/whatever NEWNAME for NUMBER points of damage.
If it matters, NUMBER will never contain commas or spaces, and the action verb (hits, bashes, whatever) will only ever be a single word without any special characters, spaces, numbers, etc.
What I'd like this program to do is interpret the regex code that I enter and spit out a result that says: NAME attacks NEWNAME
The catch is, NAME and NEWNAME can have the following range of possibilities (names and examples picked at random):
Kevin
Kevin's pet
Kevin from Oregon
Kevin from Oregon's pet
Kevin from Oregon`s pet (note the grave accent there instead of the apostrophe)
It's pretty simple if it's just something like Kevin hits Josh for 10728 points of damage. In this case, my regex is the following code block (please note that the program interprets the {N} wildcard on its own as any number without the need for regex):
(?<char1>\w+) \w+ (?<char2>\w+) for {N} points of damage.
...and my output reads...
${char1} attacks ${char2}
Whenever the game outputs that string of Kevin hits Josh for 10728 points of damage. to the log file, the program I'm using picks up on it and correctly outputs Kevin attacks Josh to my screen.
However, using that regex line results in a failure when spaces, apostrophes, grave accents, and/or any combination of the three are present in either NAME or NEWNAME.
I tried to alter the regex line to read...
(?<char1>[a-zA-Z0-9_ ]+) \w+ (?<char2>[a-zA-Z0-9_ ]+) for {N} points of damage.
...but when I encounter the string Kevin bashes Josh of Texas for 2132344 points of damage., for example, the output to my screen winds up being:
Kevin bashes Josh attacks Texas.
I'm trying different things but ultimately not coming up with something that's spitting out the proper format of NAME attacks NEWNAME when those two variables contain spaces, apostrophes, grave accents, and/or any combination of the three.
Any help or tips on what I'm doing wrong or how I can further alter that regex line would be extremely appreciated!
This is going to sound even nerdier, but I think the question isn't the regex, it's what tool you use the regex in.
Your biggest problem thus far has been the names. I suggest ignoring the names, and focusing only on the elements you know are there. The names are what's left.
I tried this myself using GNU sed:
sed -e 's/for [[:digit:]]\+ points of damage//' -e 's/hits\|bashes\|crushes/attacks/'
You see, first we can eliminate the end of the sentence, which is wholly superfluous. Then, we simply switch the verb to "attacks".
If the program uses a synonym for "attacks" that you don't have yet, you'll still have reasonable output; you can then fix your regex to include the new synonym.
You are guaranteed trouble if somebody's name includes "bashes" (or whatever) in it.
The second sed expression should be improved to be relevant only at a word boundary, but I'll leave that as an exercise for the reader. :)

Separating out a list with regex?

I have a CSV file which has been generated by a system. The problem is with one of the fields which used to be a list of items. An example of the original list is below....
The serial number of the desk is 45TYTU
This is the second item in the list
The colour of the apple is green
The ID code is 489RUI
This is the fourth item in the list.
And unfortunately the system spits out the code below.....
The serial number of the desk is 45TYTUThis is the second item in the listThe colour of the apple is greenThe ID code is 489RUIThis is the fourth item in the list.
As you can see, it ignores the line breaks and just bunches everything up. I am unable to modify the system that generates this output so what I am trying to do is come up with some sort of regex find and replace expression that will separate them out.
My original though would be to try and detect when an upper case letter is in the middle of a lower case word, but as in one of the items in the example, when a serial number is used it throws this out.
Anyone any suggestions? Is regex the way to go?
--- EDIT ---
I think i need to simplify things for myself, if I ignore the fact that lines that end in a serial number will break things for now. I need to just create an expression that will insert a line break if it detects that an upper case letter is being used after a lower case one
--- EDIT 2 ---
Using the example given by fardjad everything works for the sample data given, the strong was...
(.(?=[A-Z][a-z]))
Now as I test with more data I can see an issue appearing, certain lines begin with numbers so it is seeing these as serial numbers, you can see an example of this at http://regexr.com?2vfi5
There are only about 10 known numbers it uses at the start of the lines such as 240v, 120v etc...
Is there a way to exclude these?
That won't be a robust solution but this is what you asked. It matches the character before an uppercase letter followed by a lowercase one. You can simply use regex replace and append a new line character:
(.(?=[A-Z][a-z]))
see this demo.
You could search for this
(?<=\p{Ll})(?=\p{Lu})
and replace with a linebreak. The regex matches the empty space between a lowercase letter \p{Ll} and an uppercase letter \p{Lu}.
This assumes you're using a Unicode-aware regex engine (.NET, PCRE, Perl for example). If not, you might also get away with
(?<=[a-z])(?=[A-Z])
but this of course only detects lower-/uppercase changes in ASCII words.