I'm trying to match all line breaks that are not followed by another line break so that I can convert the first line break to a space, but still keep paragraphs separated, so that:
Lorem ipsum dolor sit amet, consectetur adipiscing elit
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor
in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat
will be transformed to this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat
So far I have .*?\r\n(?<!(\r\n)), which I feel is really close, but I cannot seem to get it quite right. Any help is appreciated. Thanks.
Use the regex \r?\n(?!\r?\n). You can find an online explanation and demonstration here.
This regex uses a negative lookahead to make sure that the line break is not followed by another line break. The line breaks are matched by \r?\n because some line breaks are represented by a carriage return (\r) followed by a newline (\n), while others are just a newline.
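A minimal Python sketch of this answer's substitution, assuming Python's re module is an acceptable stand-in for the regex flavor in the question. The name join_single_breaks is made up for illustration.

```python
import re

# A line break (\r\n or \n) that is NOT followed by another line break.
SINGLE_BREAK = re.compile(r"\r?\n(?!\r?\n)")

def join_single_breaks(text: str) -> str:
    """Replace every matched line break with a space."""
    return SINGLE_BREAK.sub(" ", text)

print(repr(join_single_breaks("one\ntwo\n\nthree\nfour")))
```

Note that in a run of consecutive line breaks the last one is followed by text, so it matches too: a blank line between paragraphs ends up as a single break followed by a space.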
The only reliable way to find a solitary line break is to find one between two non-whitespace characters.
Otherwise it may be bordered by any number of other line breaks.
So you can't look in only one direction, and in either direction the break could be
padded with whitespace that is not a line break, so you're better off doing it this way.
The simplest is to do a global
Find: (\S[^\S\r\n]*)\r\n([^\S\r\n]*\S)
Replace: $1 $2 (<-that's 'capture group 1' + 'space' + 'capture group 2')
( \S [^\S\r\n]* ) # (1)
\r \n
( [^\S\r\n]* \S ) # (2)
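As a rough illustration, the find/replace pair above could be tried in Python like this. It assumes CRLF line endings, as in the pattern, and the name join_wrapped_lines is hypothetical.

```python
import re

# (1) last non-whitespace char of a line plus trailing blanks, then CRLF,
# (2) leading blanks plus the first non-whitespace char of the next line.
PAIR = re.compile(r"(\S[^\S\r\n]*)\r\n([^\S\r\n]*\S)")

def join_wrapped_lines(text: str) -> str:
    """Replace a solitary CRLF with a space, keeping both captured groups."""
    return PAIR.sub(r"\1 \2", text)

sample = "first line\r\nsecond line\r\n\r\nnew paragraph\r\nmore"
print(repr(join_wrapped_lines(sample)))
```

Because each match consumes the first character of the following line, a line that is only one character long can be skipped in a single pass.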
Extra info
Also, the capture groups can be replaced with lookarounds,
which lets you trim spurious non-linebreak whitespace as well.
Find: (?<=\S)[^\S\r\n]*\r\n[^\S\r\n]*(?=\S)
Replace: (<- that's a space)
(?<= \S )
[^\S\r\n]* \r \n [^\S\r\n]*
(?= \S )
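A small Python sketch of the lookaround variant, again assuming CRLF input; join_and_trim is an illustrative name, not part of the answer.

```python
import re

# Blanks + CRLF + blanks, but only when sitting between two
# non-whitespace characters (checked without consuming them).
LONE_BREAK = re.compile(r"(?<=\S)[^\S\r\n]*\r\n[^\S\r\n]*(?=\S)")

def join_and_trim(text: str) -> str:
    """Replace a solitary CRLF and its surrounding blanks with one space."""
    return LONE_BREAK.sub(" ", text)

sample = "first line  \r\n  second line\r\n\r\nnew paragraph\r\nmore"
print(repr(join_and_trim(sample)))
```

Since the lookarounds consume nothing, neighbouring matches cannot overlap, and the blanks around the break are dropped instead of being copied back.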
I have this find regex:
^(?=.{35})(?!.*(?:-\h)).{0,35}[\h.]
It matches every line until the last whitespace/dot before the 35th position of the line, it also excludes lines starting with a dash.
Now I want to include lines starting with a dash and longer than 35 characters.
I tried with:
^(?=.{35})(?!.*(?:-\h.{0,35})).{0,35}[\h.]
But it doesn't work as expected.
What am I doing wrong?
Example text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
- Include this line Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
- This line is too short.
Thanks
Matching at least 35 chars after the - can be done using ^-.{35,}. If you want to match both cases, you could use an alternation | that matches either of the alternatives:
^(?:(?=.{35})(?!.*(?:-\h)).{0,35}[\h.]|-.{35,})
Regex demo
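To see what the alternation matches on the example text, here is a hedged Python sketch. Python's re module has no \h, so [ \t] is used as an assumed stand-in for horizontal whitespace; find_matches is a made-up helper name.

```python
import re

# \h is not supported by Python's re module; [ \t] is an assumed substitute.
PATTERN = re.compile(
    r"^(?:(?=.{35})(?!.*-[ \t]).{0,35}[ \t.]|-.{35,})",
    re.MULTILINE,
)

LINE_1 = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
          "eiusmod tempor incididunt ut labore et dolore magna aliqua.")
LINE_2 = ("- Include this line Ut enim ad minim veniam, quis nostrud "
          "exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.")
LINE_3 = "- This line is too short."

def find_matches(text: str) -> list:
    """Return the matched text for every line that qualifies."""
    return PATTERN.findall(text)

for match in find_matches("\n".join([LINE_1, LINE_2, LINE_3])):
    print(repr(match))
```

The first alternative stops at the last space or dot within the first 36 characters, the second takes the whole long dash line, and the short dash line produces no match.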
Let's say I want to replace any number of repeats of string 1 with an equal number of repeats of string 2, using regular expressions. For example, string 1 = "apple", string 2 = "orange".
I imagine something like this:
s/apple{2,}/orange{N}/
but I don't know how to specify the N to match the number of repeats of apple. Is that even possible?
Note: as pointed out by xhienne, I am looking for repeats, therefore at least two occurrences of the string 1.
Sample input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. apple Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. appleappleapple Excepteur sint occaecat cupidatat non proident, appleappleappleapple sunt in culpa qui officia deserunt mollit anim id est laborum.
Sample output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. apple Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. orangeorangeorange Excepteur sint occaecat cupidatat non proident, orangeorangeorangeorange sunt in culpa qui officia deserunt mollit anim id est laborum.
A possible solution is to use a regex flavor that supports the \G operator:
(?:\G(?!\A)|(?=(?:apple){2}))apple
See the regex demo
Details
(?:\G(?!\A)|(?=(?:apple){2})) - a non-capturing group that matches either of the two alternatives:
\G(?!\A) - the end of the previous successful match (the (?!\A) lookahead excludes the start of the string, which \G would otherwise also match)
| - or
(?=(?:apple){2}) - a location in string that is followed with two occurrences of apple substring
apple - an apple substring.
Note that the regex does not need to count much, it just finds a place where a string repeats 2 times, then, it replaces all consecutive, adjoining matches.
Since this problem initially arose while you were using vim (which doesn't support the \G operator used by Wiktor Stribiżew in his answer), here is an answer specifically for vim:
:s/\(apple\)\{2,\}/\= substitute(submatch(0), "apple", "orange", "g")/g
(of course, this cannot be considered as a true regex since it makes use of a vim function to do a sub-substitution in the matched text)
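For engines without \G, the same "match the whole run, then rewrite it" idea as the vim command can be sketched in Python with a replacement function. This is a swapped-in technique rather than either answer's exact method, and swap_repeats is a hypothetical name.

```python
import re

def swap_repeats(text: str, old: str = "apple", new: str = "orange") -> str:
    """Replace each run of two or more `old` with the same number of `new`."""
    run = re.compile(r"(?:%s){2,}" % re.escape(old))
    # The run consists only of copies of `old`, so its length gives the count.
    return run.sub(lambda m: new * (len(m.group(0)) // len(old)), text)

print(swap_repeats("apple x appleappleapple y appleapple"))
```

A single apple is left alone because the quantifier requires at least two consecutive occurrences.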
I have outgoing emails which go like:
Dear XYZ,
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
CASE ID: 123654
Best Regards,
XYZ
The text could be one or two paragraphs. I want to make two regex. One should give me the text in paragraphs and the other should give me the number that is the CASE ID. The result should look like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
123654
I have managed to create a RegEx to get the case using (CASE ID\s*[:]+\s*(\w*)\s*) but I haven't been able to extract the paragraph. Any help will be much appreciated.
Basically, you can (and probably should) use a single regex that returns both values as match groups.
In almost any other language it would look like this (using the "gs" flags so that the dot also matches newlines):
(.+?)CASE ID: (\d+)
But for VBScript we need something like this:
(.*?[^\$]*)CASE ID: (\d+)
Also you need to deal with matchgroups like this:
Dim RegEx : Set RegEx = New RegExp
RegEx.Pattern = "(.*?[^\$]*)CASE ID: (\d+)"
RegEx.Global = True
RegEx.MultiLine = True
Dim strTemp : strTemp = "Lorem ipsum " & VbCrLf & "Cannot be translated to english " & VbCrLf & "CASE ID: 153"
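' Run the regex once and reuse the result for both groups
Dim Matches : Set Matches = RegEx.Execute(strTemp)
WScript.Echo Matches(0).SubMatches(0) ' the paragraph text
WScript.Echo Matches(0).SubMatches(1) ' the case ID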
The thing is that this will only work if the constant string "CASE ID: " is contained in the message. If the string differs, e.g. the space after the ":" is missing, it will not work.
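For comparison, the "almost any other language" form could look like this in Python, where re.DOTALL plays the role of the s flag. This is only a sketch; split_mail is an invented name and the sample string is the one from the VBScript snippet.

```python
import re
from typing import Optional, Tuple

# Group 1: everything before the marker, group 2: the digits after it.
CASE_RE = re.compile(r"(.+?)CASE ID: (\d+)", re.DOTALL)

def split_mail(body: str) -> Optional[Tuple[str, str]]:
    """Return (text, case_id), or None when the marker is not found."""
    m = CASE_RE.search(body)
    if m is None:
        return None
    return m.group(1), m.group(2)

sample = "Lorem ipsum \r\nCannot be translated to english \r\nCASE ID: 153"
print(split_mail(sample))
```

Returning None makes the missing-marker case explicit instead of failing on an empty match collection.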
I need to delete a paragraph enclosed in parentheses, like the one below, without touching the rest of the text:
(Text to delete Lorem ipsum dolor sit amet, consectetur linebreak->
in voluptate velit esse cillum. Excepteur sint proident, mollit anim id est laborum.)
Text that shouldnt be touched Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation llamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehend in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat upidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
For now I have /\(.*\)[\n]*/ to match a paragraph, but with the line breaks it obviously doesn't work. I was thinking about something along the lines of /\(.*[\n]*\)[\n]*/, but that didn't work either. Searching here turns up (?<=\()(.*?)(?=\)), but it's Python, so it won't work, and other links are about parentheses within parentheses, so that's different from my problem.
The \n is to simplify the (\r|\n|\r\n) linebreak thing.
So is there a way to do it, or is the regexp in groovy not capable of this?
You could use something like /(?s)\(.+?\)/. The (?s) flag makes the period character also match line breaks.
The expression will look for round brackets and stop at the first occurrence of a close bracket.
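The question is about Groovy, but the effect of (?s) can be checked quickly in Python, whose re module accepts the same inline flag. A minimal sketch with an invented helper name:

```python
import re

# (?s) lets "." match line breaks; ".+?" stops at the first ")".
PAREN_BLOCK = re.compile(r"(?s)\(.+?\)")

def drop_parenthesized(text: str) -> str:
    """Remove every parenthesized block, even one spanning several lines."""
    return PAREN_BLOCK.sub("", text)

sample = "(delete me\nacross lines.)\nKeep this."
print(repr(drop_parenthesized(sample)))
```

The line break after the closing bracket is not part of the match, so it stays in the output.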
I'm trying to split a text of n phrases into paragraphs using regular expressions (i.e. : after a certain number of phrases, begin a new paragraph) with Notepad++.
I have come up with the following regex (in this case, every 3 phrases -> new paragraph) :
(([\S\s]*?)(\.)){3}
So far so good. However, how do I match the phrases now? $1, $2 will only match the braces..
Example text:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
Desired result (using a count of 2):
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat.
Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
How about:
Find what: ((?:[^.]+\.){2})
Replace with: $1\n
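The same find/replace can be tried outside Notepad++ with a short Python sketch; the function name break_every and its parameter n are assumptions for illustration.

```python
import re

def break_every(text: str, n: int = 2) -> str:
    """Insert a newline after every group of n period-terminated phrases."""
    # [^.]+ also matches line breaks, so phrases may span several lines.
    pattern = re.compile(r"((?:[^.]+\.){%d})" % n)
    return pattern.sub(r"\1\n", text)

print(repr(break_every("A one. B two. C three. D four. E five.")))
```

The space that followed the period stays at the start of the next group, and a trailing group with fewer than n phrases is left unchanged.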
Find using this pattern:
((.*?\.){2})
Breaking it down a bit...
The inner parentheses ...
( )
... provide the group which is affected by {2}.
The outer parentheses ...
( )
...provide the delimiters for the replace pattern. Since they are "top-level", they are what the replace pattern \1 will attach to.
Note the outer parentheses have to enclose the {2}. I'm not good at thinking through how regex will handle everything, but fortunately Notepad++ offers instant confirmation -- just press "Find" to watch it jump through the matches.
The replace pattern is followed by your return and new line, so the whole string looks like this:
\1\r\n
If you want an optional space, make sure you add \s? ... probably like this, but I didn't test it:
((.*?\.\s?){2})
If the issue is inserting a space with the results, just add a space (or two, if you're old-school like me) to the replace pattern:
\1 \r\n
Finding n sentences that end with a period is quite easy. For instance, for two sentences:
(?:.*?\.){2}
To make it a paragraph (insert new line) you replace with
$0\r\n\r\n
This inserts two carriage return + line feed pairs, which is the Windows way of marking a new line. On Unix files \n\n would be enough. If you only want one line break, just use $0\r\n
If you want to make it an HTML paragraph, use the same search and replace with
<p>$0</p>