I'm trying to find a way to make an array of matched patterns out of a string.
I'll explain myself with an example.
From a string like
Lorem ipsum dolor **sit** amet, consectetur adipiscing elit.
Nulla elementum euismod mi. Morbi eu eros eget augue vestibulum semper.
Curabitur sapien purus, **semper** in consequat eu, gravida vitae purus.
I need to apply a regexp to extract the words sit and semper,
and I really don't know how to manage it.
I would think a regex such as \*{2}(.*?)\*{2} would take care of it. For using regular expressions in Objective-C (assuming you're on an Apple platform), look at the NSRegularExpression documentation for iOS or Mac.
You can do it like this:
\*{2}([^*]+)\*{2}
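For illustration only, here is a minimal Python sketch of the lazy pattern from the first answer. Python is my assumption (the question is about Objective-C and NSRegularExpression), and the name extract_bold is hypothetical. It shows how the capture group turns into an array of matched words.

```python
import re

# Lazy capture of whatever sits between a pair of double asterisks.
BOLD = re.compile(r"\*{2}(.*?)\*{2}")

def extract_bold(text):
    """Return the captured text of every **...** pair, in order of appearance."""
    return BOLD.findall(text)

sample = (
    "Lorem ipsum dolor **sit** amet, consectetur adipiscing elit.\n"
    "Curabitur sapien purus, **semper** in consequat eu, gravida vitae purus."
)
print(extract_bold(sample))  # ['sit', 'semper']
```

With NSRegularExpression the same idea would mean iterating over the matches and reading capture group 1 of each.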
I have an OCR text document where paragraphs have been broken into individual lines. I'd like to make them whole paragraphs on a single line again (as per the original PDF).
How can I use regex, or find and replace, to remove the line breaks between two lines of text and replace them with a space?
Eg:
Every line of text is on a newline. I'd like them to be whole paragraphs on a single line.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vehicula tellus faucibus metus consequat
scelerisque. Maecenas sit amet urna quis ipsum interdum consequat. Praesent elementum libero nec
velit suscipit placerat accumsan vitae lacus. Aliquam erat volutpat. Etiam egestas lectus sed orci
venenatis, ullamcorper gravida elit pulvinar. Pellentesque imperdiet, augue pulvinar sodales dapibus,
tortor magna rutrum nulla, vel ullamcorper mi purus a diam. Ut id odio sed arcu aliquet lobortis.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Donec quam arcu, egestas feugiat eleifend blandit, vulputate non elit. Nulla a erat vel leo maximus
viverra at ac lorem. Nam non imperdiet lorem. Fusce tempor arcu massa, non commodo ligula lobortis
nec. Aliquam sit amet fringilla sapien, non euismod metus. Donec orci mi, sagittis vitae lobortis eu,
aliquet nec libero. Sed sodales magna lacus, pretium lobortis magna varius nec. Pellentesque quis
ipsum viverra orci lobortis egestas. Aliquam porttitor tincidunt ipsum, egestas placerat ante
consectetur in. Morbi porttitor lacus eu augue tincidunt, at aliquet lorem consectetur.
You might be looking for a programmatic/dynamic approach for every new scan, so I'm not sure this answers your question, but since you have Visual Studio Code in your tags I will explain how to do this in VS Code.
Open keyboard shortcuts from File > Preferences > Keyboard shortcuts, and bind editor.action.joinLines to a shortcut of your choice like for example Ctrl + J.
Then open the text you want to fix in VS Code, select it and press that keybinding. Everything in the selection will be joined onto one line. I hope I helped!
I am using two regular expressions when removing linebreaks from OCR texts.
They can be used in the Find&Replace dialog from VS Code.
Remove linebreaks at lines ending with a hyphen: (?<=\w)- *\n *
Replace remaining linebreaks with a single space, but keep blank lines: (?<!\n) *\n *(?!\n)
Note that each " *" (a space followed by *) in the regular expressions trims whitespace at the end and beginning of the lines.
There is also a Python tool based on Flair called dehyphen that does the job.
In my experience it produces useful results, but it can take considerably longer than replacing linebreaks with regular expressions.
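If a scripted version is useful, the two find-and-replace steps above can be sketched in Python roughly as follows. The function name join_ocr_lines and the sample text are my own assumptions; the two patterns are the ones given in this answer.

```python
import re

def join_ocr_lines(text):
    """Rejoin hard-wrapped OCR lines into paragraphs, keeping blank lines."""
    # Step 1: glue together words that were hyphenated across a line break.
    text = re.sub(r"(?<=\w)- *\n *", "", text)
    # Step 2: turn each remaining single line break into one space.
    # The lookarounds skip line breaks that belong to a blank line.
    return re.sub(r"(?<!\n) *\n *(?!\n)", " ", text)

sample = "Lorem ipsum con-\nsectetur elit. \nNam vehicula.\n\nDonec quam\narcu."
print(join_ocr_lines(sample))
```

Note that step 1 cannot tell a wrapped word from a genuine compound that happened to break at its hyphen, so "well-" followed by "known" on the next line would come out as "wellknown".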
I want to ignore a certain substring inside the match, rather than exclude matches that contain it.
For example
I have the text:
Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-
ty egeet qwewerty lectus. Proinera risus massa, placerat in q-
werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-
erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,
euismod at augue at, dapibus feugiat qweerty
I need to find every qwerty, even when it is split by -\n.
My solution is to add (?:-\n)? after every character:
/q(?:-\n)?w(?:-\n)?e(?:-\n)?r(?:-\n)?t(?:-\n)?y/gm
But it looks bulky (even for this example, which contains only 6 characters), and the regex is hard to modify later. Is there some magic to make the regex shorter?
No, regex is not good at this kind of match. The easiest way would be to remove - and \n first.
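A rough Python sketch of that suggestion follows (Python and all names here are my assumptions, not part of the question). It also shows a second option that goes beyond the answer: if the match positions in the original text matter, the bulky pattern can be generated from the word instead of typed by hand.

```python
import re

SOFT_BREAK = "-\n"  # hyphen plus newline introduced by the line wrapping

def find_dehyphenated(word, text):
    # Remove the soft breaks first, then search with a plain pattern.
    return re.findall(re.escape(word), text.replace(SOFT_BREAK, ""))

def tolerant_pattern(word):
    # Build the "optional soft break after every character" regex
    # programmatically, so changing the word needs no manual editing.
    optional_break = "(?:" + re.escape(SOFT_BREAK) + ")?"
    return re.compile(optional_break.join(re.escape(ch) for ch in word))

sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-\n"
    "ty egeet qwewerty lectus. Proinera risus massa, placerat in q-\n"
    "werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-\n"
    "erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,\n"
    "euismod at augue at, dapibus feugiat qweerty"
)
print(len(find_dehyphenated("qwerty", sample)))   # 5
print(tolerant_pattern("qwerty").findall(sample))
```

The first function is simpler but loses the offsets into the original text; the second keeps them at the cost of a longer (generated) pattern.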
I have a certain amount of content like this:
<p><strong>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut ullamcorper enim ut nulla fringilla, non elementum nunc dapibus. Donec porta a lorem in vestibulum. Aenean viverra vulputate finibus. Sed malesuada nibh vitae enim luctus, at placerat diam vehicula.</strong></p>
<p>Quisque eu nisl sed tellus congue aliquet ac id risus. Etiam eget nisi ac lectus cursus suscipit. Mauris a dictum justo. Aliquam eget mi vel nunc imperdiet ultricies.</p>
<iframe width="480" height="270" frameborder="0" src="https://www.youtube.com/embed/EgqUJOudrcM" allowfullscreen="" ></iframe>
All I am trying to do is get the YouTube video ID.
So far, I have come up with the following Regular Expression:
/<iframe.*src=["\'].*youtube\.com\/embed\/(.*)["\'] ?>/
This works if the src attribute is the last attribute in the tag, otherwise it doesn't. How can my regular expression be written so as to overcome this?
Works in this case
But not in this one
As you can see, in the second example, my Regex also matches the attribute after src. I know why this happens, I just can't work out how to prevent it.
I'm certainly no Regex expert, so any suggestions to improve what I currently have are welcome.
With this one:
<iframe.*?src=".*?youtube\.com\/embed\/(\w+)
The lazy .*? avoids matching too much and stops at the first src attribute.
Then it matches the URL directly.
Edit: you just want the ID, not the full URL, so the capture group now holds only the ID.
You can use the following regex:
<iframe[^>]*src=\"[^\"]+\/([^\"]+)\"[^>]*>
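To check that the ID is captured regardless of where src sits in the tag, here is a small Python sketch (the question's delimiters suggest PHP, so Python is only my choice for illustration, and youtube_ids is a hypothetical name). It combines the ideas of both answers: [^>]* keeps the match inside the tag, and the capture stops at the first character that cannot be part of the ID. I am assuming IDs consist of letters, digits, "_" and "-".

```python
import re

# Assumption: a video ID is made of letters, digits, "_" and "-".
YOUTUBE_EMBED = re.compile(
    r"""<iframe\b[^>]*?\bsrc=["'][^"']*youtube\.com/embed/([\w-]+)""",
    re.IGNORECASE,
)

def youtube_ids(html):
    """Return the video IDs of all YouTube embed iframes in the markup."""
    return YOUTUBE_EMBED.findall(html)

src_in_middle = (
    '<iframe width="480" height="270" frameborder="0" '
    'src="https://www.youtube.com/embed/EgqUJOudrcM" allowfullscreen="" ></iframe>'
)
src_last = '<iframe width="480" src="https://www.youtube.com/embed/EgqUJOudrcM"></iframe>'

print(youtube_ids(src_in_middle))  # ['EgqUJOudrcM']
print(youtube_ids(src_last))       # ['EgqUJOudrcM']
```

For anything beyond a quick extraction, an HTML parser is usually more robust than a regex over markup.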
Let's say I have the following text:
Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.
aaBaaB
aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB
Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB
I'm trying to figure out how to perform a regular expression search and replace on this in such a way that the output is:
Lorem ipsum dolor sit amet, consectetur aaBaaB adipiscing elit.
aaB
Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaB
Fusce nec tortor in dolor aaBaaB porttitor viverra.
That is, to remove one "aaB" from each pattern of it. Is this actually possible, and if so, how would it be done? Specifically, I intend to do this in Sublime Text 2 as a RegEx search/replace in a file.
You can use a positive lookahead:
(?=(?<w>[a-z]{2}[A-Z]{1})\s)\k<w>
You just need to make sure you have case-sensitive matching on.
example: http://regex101.com/r/sK8bG1
Use either the leading or the trailing whitespace to remove the first or the last substring of each run. Either of these works:
(\s+)(aaB) with $1 in the Replace field
or
(aaB)(\s+) with $2 in the Replace field
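As a variant that does not depend on the surrounding whitespace (so it also handles a run at the very start or end of the text), one could match the whole run and keep everything except its first "aaB". The sketch below is in Python purely for testing; drop_one is a hypothetical name, and this is a different technique from the two answers above.

```python
import re

def drop_one(text, token="aaB"):
    """Remove one token from every run of consecutive tokens."""
    unit = re.escape(token)
    # Match the first token, capture the rest of the run, keep only the rest.
    return re.sub(f"{unit}((?:{unit})*)", r"\1", text)

print(drop_one("consectetur aaBaaBaaB adipiscing"))  # consectetur aaBaaB adipiscing
```

In an editor's search/replace this would correspond to searching for aaB((?:aaB)*) and replacing with $1. Whitespace around a run that disappears completely is left in place, so a separate trim pass may still be needed.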
I am trying to write an expression that takes a block of text and returns everything up to the last full stop before an ellipsis or three full stops (... or …). So the idea is that the example test string:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec...
Would become:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum.
So far I have come up with this pathetic attempt. I keep getting right up until the last full-stop (because the quantifier consumes the previous two full-stops so there is nothing for the look ahead to fail on). I just can't seem to wrap my head around it:
Dim testText As String = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec..."
Dim ellipsisExpression As String = "(.*\.(?!\.\.))"
Dim ellipsisMatch As Match
ellipsisMatch = Regex.Match(testText, ellipsisExpression)
If ellipsisMatch.Success Then
testText = ellipsisMatch.Groups(1).Value
End If
Edit: I also need this expression to take any ... occurring earlier in the text into account. For example, the string:
`begin. this is a test... test complete. beginning shutdown... shutting down... `
should return
`begin. this is a test... test complete.`
The aim of this expression is to find the most flowing text before any truncation has occurred. A block of text with closure so it doesn't confuse readers expecting to be able to 'get more'.
You could replace [^.]*(?:\.{3}|…).* with an empty string to get the desired result.
For example:
result = Regex.Replace(input, "[^.]*(?:\.{3}|…).*", "")
Use this:
result = Regex.Replace(input, "(.+\.).+(?:\.{3}|…)\s*", "$1")
Edit:
Use this regex instead:
(.+[^.]\.)(?:(?:[^.]{2})|$)
You could match that with:
.*(?<!\.)\.(?!\.)(?=(?:[^.]+|\.{3})*(?:\.{3}|…)$)
Or replace
(?<!\.)\.(?!\.)(?:[^.]+|\.{3})*(?:\.{3}|…)$
with a single full stop (.).
I think I have come up with a solution that works for me. Thank you to everyone who answered previously, but this expression seems to do what I need and doesn't execute as slowly as some of the other answers. It also takes other sentence-terminating punctuation into account, such as ! or ?, and not just the full stop.
(.*([^\.](?=\.|\?|!)(?!\.\.\.)).)
This gets the last sentence-terminating character (defined with the lookahead). In this case the terminators are ?, ! and a . that isn't the start of .... This also solves the ellipsis character issue, since the lookahead is effectively a whitelist of sentence terminators and … is not on it. The expression succeeds in finding the largest block of text with closure.
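For anyone who wants to try the expression outside .NET, here is a minimal Python sketch of the same pattern (Python and the name trim_to_closure are my assumptions; the original code is VB.NET). Like the VB snippet in the question, it leaves the text unchanged when nothing matches, and it assumes the text contains no line breaks.

```python
import re

# Same expression as above: greedy up to the last character that is followed
# by ., ? or ! where that terminator does not start a "..." sequence.
CLOSED = re.compile(r"(.*([^.](?=\.|\?|!)(?!\.\.\.)).)")

def trim_to_closure(text):
    """Cut the text after its last properly terminated sentence."""
    match = CLOSED.search(text)
    return match.group(1) if match else text

print(trim_to_closure("Donec non fermentum ipsum. Curabitur nec..."))
print(trim_to_closure("begin. this is a test... test complete. beginning shutdown... shutting down... "))
print(trim_to_closure("Is it done? Not yet…"))
```

The three calls print "Donec non fermentum ipsum.", "begin. this is a test... test complete." and "Is it done?" respectively.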