In Notepad++, I want a regular expression that keeps only the email addresses in a text and removes everything else.
Imagine that we have the following text:
Lorem ipsum dolor sit amet, <micorreo0#gmail.com>consectetur adipiscing elit. Curabitur ac risus molestie, laoreet ligula vitae, tincidunt risus.<micorreo1#gmail.com> Aliquam ut felis efficitur, iaculis nunc in, feugiat dui.<micorreo2#gmail.com> Etiam sodales ligula tellus, id vehicula augue aliquet eu. Nulla blandit maximus lectus, quis consequat metus vulputate suscipit. Duis finibus lorem justo, non sollicitudin urna aliquet a. Sed at ligula justo. Nam est ex, suscipit in facilisis nec, rutrum vitae urna<micorreo3#gmail.com>.
The email addresses are always enclosed in < and >.
Ctrl+H
Find what: [^<]*(<[^>]+>)[^<]*
Replace with: $1
Replace all
Explanation:
[^<]* : 0 or more any character that is not <
( : begin capture group
< : literally <
[^>]+ : 1 or more any character that is not >
> : literally >
) : end group
[^<]* : 0 or more any character that is not <
Replacement:
$1 : group 1
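For readers who want to try the same substitution outside Notepad++, here is a minimal Python sketch. The sample text and the function names are made up for illustration; the pattern is the one explained above, with the group reference written as \1 because Python's re module does not use $1.

```python
import re

# Hypothetical sample; '#' stands in for '@' as in the question's text.
text = ("Lorem ipsum, <user0#example.com>dolor sit amet."
        "<user1#example.com> Nulla facilisi<user2#example.com>.")


def keep_bracketed(s):
    """Drop everything except the <...> tokens (same pattern as above)."""
    return re.sub(r"[^<]*(<[^>]+>)[^<]*", r"\1", s)


def list_bracketed(s):
    """Alternative: collect the contents of each <...> as a list."""
    return re.findall(r"<([^>]+)>", s)


print(keep_bracketed(text))
print(list_bracketed(text))
```

Note that the pattern needs at least one < to match, so a text without any <...> token is returned unchanged rather than emptied.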
Is it possible to find all punctuation marks of a given type, only when a key phrase exists?
For example:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pulvinar ac augue nec auctor. Vestibulum eleifend, sem non placerat porttitor, urna neque pulvinar enim, ut ullamcorper massa libero nec tellus. Sed est massa, congue eu auctor gravida, efficitur sit amet lacus. Nullam tincidunt posuere sollicitudin. Sed ac ullamcorper risus, ac cursus justo. Phasellus vehicula quam nec libero venenatis venenatis. Donec metus erat: maximus in risus eu: imperdiet: dignissim mauris. Aliquam sit amet augue vel ex tincidunt convallis. Morbi a sem neque. Nam tellus dolor, congue in mi eu, laoreet sodales lectus. Fusce sed ullamcorper purus. Nulla facilisi.
For above, as long as "neque" is in the text, I want to find all occurrences of ":"
I've tried something like this without luck:
(.*\neque\b.*)(?!^)([:])
This works well on my system.
Explanation:
Extract the given phrase and store it in a variable.
If the phrase exists, find the symbol and count its occurrences.
#!/bin/bash
a="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pulvinar ac augue nec auctor. Vestibulum eleifend, sem non placerat porttitor, urna neque pulvinar enim, ut ullamcorper massa libero nec tellus. Sed est massa, congue eu auctor gravida, efficitur sit amet lacus. Nullam tincidunt posuere sollicitudin. Sed ac ullamcorper risus, ac cursus justo. Phasellus vehicula quam nec libero venenatis venenatis. Donec metus erat: maximus in risus eu: imperdiet: dignissim mauris. Aliquam sit amet augue vel ex tincidunt convallis. Morbi a sem neque. Nam tellus dolor, congue in mi eu, laoreet sodales lectus. Fusce sed ullamcorper purus. Nulla facilisi."
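# Keep the first occurrence of the key phrase (empty if it is absent).
b=$(echo "$a" | grep -o "neque" | head -1)
echo "$b"
if [ "$b" == "neque" ]
then
    # grep -o prints each ":" on its own line, so wc -l counts them.
    number_of_occurrences=$(echo "$a" | grep -o ":" | wc -l)
    echo "$number_of_occurrences"
fi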
Your desired action is not clear. From the highlights in your example, it looks like you want to find all words that end in :, but only if the word neque exists anywhere in the text. Assuming that is the case, you can use this regex:
/(?=.*\bneque\b)\w+:/g
Explanation:
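(?=.*\bneque\b) - positive lookahead for neque with word boundaries; if this fails, the whole regex fails. Note that the lookahead only scans from the current position to the end of the line, so neque must appear somewhere after the word being matched (as the last neque does in your example)
\w+: - look for a word that is followed by :
use the g flag to find all occurrences of words followed by :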
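The behaviour of the lookahead can be checked with a small Python sketch. The two shortened sample strings and the function names are assumptions made for illustration. The second function is not the regex above but a two-step alternative (test for the keyword first, then collect the words), which does not depend on where neque sits relative to the colons.

```python
import re

LOOKAHEAD = re.compile(r"(?=.*\bneque\b)\w+:")


def lookahead_only(text):
    """Single-regex version: the keyword must follow the match position."""
    return LOOKAHEAD.findall(text)


def check_then_find(text, keyword="neque"):
    """Two-step version: keyword may be anywhere in the text."""
    if not re.search(rf"\b{re.escape(keyword)}\b", text):
        return []
    return re.findall(r"\w+:", text)


keyword_after = "Metus erat: maximus eu: imperdiet: dignissim. Morbi a sem neque."
keyword_before = "Urna neque metus erat: maximus eu: imperdiet: dignissim."

print(lookahead_only(keyword_after))    # keyword follows the colons
print(lookahead_only(keyword_before))   # keyword precedes the colons
print(check_then_find(keyword_before))
```

With the keyword only before the colons, the single-regex version finds nothing, while the two-step version still returns the three words.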
EDIT: After seeing that the bash tag has been added, here is a shell script version using a shortened string. The first example has the neque keyword, the second one not:
$ echo 'Urna neque metus erat: maximus in risus eu: imperdiet: dignissim.'\
> | egrep '\bneque\b' | egrep -o '\w+:'
erat:
eu:
imperdiet:
$
$ echo 'Urna metus erat: maximus in risus eu: imperdiet: dignissim.'\
> | egrep '\bneque\b' | egrep -o '\w+:'
$
Explanation:
use first egrep to filter by required keyword neque using word boundaries
use second 'egrep' with -o flag to extract words followed by :
So I have this text that I am trying to parse with Regex:
Name: Test Data 1
Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec feugiat nulla id nisi venenatis blandit.
Donec blandit egestas orci, at tristique dui vehicula in. Maecenas fringilla fringilla enim, in pulvinar ex gravida
in. Nam cursus facilisis ante, sed tristique nisl sagittis sed. In auctor felis id neque suscipit ullamcorper. Nunc
faucibus elit sed metus vestibulum, ullamcorper pulvinar nisi auctor. Praesent sodales orci mauris, eget dapibus
mauris sodales in. Ut iaculis, ante vitae ullamcorper semper, metus tortor auctor purus, eu convallis nulla lacus
in tellus. Phasellus feugiat tempus neque, in fringilla nisi scelerisque sed. Donec elementum diam nec mattis dignissim.
I am trying to parse it to load it into a database.
With this expression, I am trying to get a match on the "Name" and "Description" parameters but also trying to get a match on the parameter value as well (which can sometimes be multi-line).
(.*):\s(.*)
I have been searching for a while now and I cannot seem to be able to make it match the whole paragraph but stop when it hits a blank line.
I would like the result to be as follows:
1st Match
Group 1: Name
Group 2: Test Data 1
2nd Match
Group 1: Description
Group 2: Description value with multi-line
https://regex101.com/r/mG2ms9/3
Thanks
You can use the following:
(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))
Here it is on regex101.
[\s\S] matches any character, even a new line (whereas '.' does not, by default).
Then we're matching as few characters as possible (*?) up until the point where the next line is either blank (\n), starts with a word character (\w), or is the end of the string ($).
We can get away with the \w option since all of the new lines in the description parameter are followed by a space. If this isn't always the case, you could replace \w with something like .*: to check instead if the next line contains ':' and stop if so.
Note that I disabled multi-line mode; it's not suitable here.
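As a rough illustration of how this could feed a database load, here is a Python sketch using the same expression. The sample records are made up, and it assumes (as discussed above) that continuation lines start with whitespace and that the input ends with a newline; the function name is hypothetical.

```python
import re

# Same expression as above; multi-line mode is deliberately not enabled.
FIELD = re.compile(r"(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))")

# Hypothetical input: two records separated by a blank line,
# with an indented continuation line in the first description.
sample = (
    "Name: Test Data 1\n"
    "Description: Lorem ipsum dolor sit amet,\n"
    "  consectetur adipiscing elit.\n"
    "\n"
    "Name: Test Data 2\n"
    "Description: Donec feugiat nulla.\n"
)


def parse_fields(text):
    """Return (parameter, value) pairs, collapsing line breaks in values."""
    return [(key, " ".join(value.split())) for key, value in FIELD.findall(text)]


for key, value in parse_fields(sample):
    print(f"{key!r}: {value!r}")
```

Collapsing the whitespace in each value is a design choice for this sketch; drop the join/split step if the original line breaks should be stored.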
I am trying to write an expression to take a block of text and return everything up to the last full-stop before an ellipsis or three full-stops (... or …). So the idea is that the example text test string:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec...
Would become:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum.
So far I have come up with this pathetic attempt. I keep getting right up until the last full-stop (because the quantifier consumes the previous two full-stops so there is nothing for the look ahead to fail on). I just can't seem to wrap my head around it:
Dim testText As String = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam felis nisi, eleifend ut quam eget, venenatis vestibulum turpis. Nam dignissim laoreet iaculis. Etiam sit amet rhoncus sem. Duis laoreet justo tellus, at volutpat risus molestie sed. Etiam posuere, arcu vitae faucibus hendrerit, lorem elit consequat urna, id congue eros felis in mauris. Donec non fermentum ipsum. Curabitur nec..."
Dim ellipsisExpression As String = "(.*\.(?!\.\.))"
Dim ellipsisMatch As Match
ellipsisMatch = Regex.Match(testText, ellipsisExpression)
If ellipsisMatch.Success Then
testText = ellipsisMatch.Groups(1).Value
End If
Edit: I also need this expression to take any ... sequence in the text into account. For example, the string:
`begin. this is a test... test complete. beginning shutdown... shutting down... `
should return
`begin. this is a test... test complete.`
The aim of this expression is to find the most flowing text before any truncation has occurred. A block of text with closure so it doesn't confuse readers expecting to be able to 'get more'.
You could replace [^.]*(?:\.{3}|…).* with an empty string to get the desired result.
For example:
result = Regex.Replace(input, "[^.]*(?:\.{3}|…).*", "")
Use this:
result = Regex.Replace(input, "(.+\.).+(?:\.{3}|…)\s*", "$1")
Edit:
Use this regex instead:
(.+[^.]\.)(?:(?:[^.]{2})|$)
You could match that with:
.*(?<!\.)\.(?!\.)(?=(?:[^.]+|\.{3})*(?:\.{3}|…)$)
Or replace
(?<!\.)\.(?!\.)(?:[^.]+|\.{3})*(?:\.{3}|…)$
with a single full stop (.).
I think I have come up with a solution that works for me. Thank you to everyone who answered, but this expression seems to do what I need and doesn't run as slowly as some of the other answers. It also takes other sentence-terminating punctuation into account, such as ! or ?, and not just the full stop.
(.*([^\.](?=\.|\?|!)(?!\.\.\.)).)
This gets the last sentence-terminating character (defined by the lookahead). In this case the terminators are ?, ! and any . that does not begin a ... sequence. This also solves the ellipsis character (…) issue, since the lookahead is effectively a whitelist of sentence terminators. The expression succeeds in finding the largest block of text with closure.
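To see how this expression behaves on the examples from the question, here is a small Python sketch. The question uses VB.NET; the pattern is copied unchanged, and the function name and the fallback of returning the input when nothing matches are assumptions for illustration.

```python
import re

# The expression from above, unchanged.
CLOSED_TEXT = re.compile(r"(.*([^\.](?=\.|\?|!)(?!\.\.\.)).)")


def trim_to_closure(text):
    """Return the longest leading block that ends in a real sentence terminator."""
    match = CLOSED_TEXT.search(text)
    # Assumption: if no terminator is found, hand the text back untouched.
    return match.group(1) if match else text


print(trim_to_closure(
    "begin. this is a test... test complete. beginning shutdown... shutting down... "
))
print(trim_to_closure("Donec non fermentum ipsum. Curabitur nec..."))
print(trim_to_closure("Is it done? Not yet…"))
```

Because . does not match line breaks by default, this sketch only looks within a single line; a multi-line input would need the single-line (dot-all) option.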
Let's say there is some text like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum. Nullam lacinia ipsum vitae enim consequat iaculis quis in augue. Phasellus fermentum congue blandit. Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris." Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc."
and I want it to become this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. "Vestibulum interdum dolor nec sapien blandit a suscipit arcu fermentum[dot] Nullam lacinia ipsum vitae enim consequat iaculis quis in augue[dot] Phasellus fermentum congue blandit[dot] Donec laoreet, ipsum et vestibulum vulputate, risus augue commodo nisi, vel hendrerit sem justo sed mauris[dot]" Phasellus ut nunc neque, id varius nunc. In enim lectus, blandit et dictum at, molestie in nunc. Vivamus eu ligula sed augue pretium tincidunt sit amet ac nisl. "Morbi eu elit diam, sed tristique nunc[dot]"
I found a regex that selects each quoted sentence, "(.)+?", and I can use it like this:
regex('"(.)+?"','[sentence]')
But can we replace only the dots inside each match, so that I get the output shown in the example above?
I'm not sure regexps are able to suit your needs on their own.
You should implement an algorithm that replaces nested dots until the string doesn't contain nested dots anymore.
For example in PHP:
$string = 'He asked "Please." while she answered "No. Or maybe yes."';
var_dump($string);
while(preg_match('/"[^"]*\.[^"]*"/', $string)) {
$string = preg_replace('/("[^"]*)\.([^"]*")/', '$1[dot]$2', $string);
}
var_dump($string);
which prints:
string 'He asked "Please." while she answered "No. Or maybe yes."' (length=57)
string 'He asked "Please[dot]" while she answered "No[dot] Or maybe yes[dot]"' (length=69)
This is what I would do.
echo
preg_replace_callback('~(?<!\\\)"(.+?)((?<!\\\)")~',
/*
Pattern:
--------
(?<!\\\)" a double quote not preceded by a backward (escaping) slash
(.+?) anything (with min 1 char.) between condition above and below
((?<!\\\)") a double quote not preceded by a backward (escaping) slash
*/
// for anything that matches the above pattern
// the following function is called
function ($m) {
    return preg_replace('~\.~', '[dot]', $m[0]);
},
// which replaces each dot with [dot] and returns the match
$str);
EDIT: Added explanations in comments.
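Try replacing:
(\"[^\.]*)\.([^\"]*)
with:
\1[dot]\2
This works well in my editor. Each pass replaces only one dot per match, so repeat it until nothing changes. Also note that some environments write the group reference in the replacement with $ instead of \ (e.g. $1 in PHP).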
With JavaScript I would just do a basic replace:
str = str.replace(/".+?"/g,function(m) {
return m.replace(/\./g,'[dot]');
});
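The same callback idea carries over to other languages. Below is a minimal Python sketch; the function name is made up, and it uses "[^"]*" instead of ".+?" for the quoted span, which is an assumption that quotes are never escaped or nested.

```python
import re


def mask_dots_in_quotes(text):
    """Replace every '.' inside double-quoted spans with '[dot]'."""
    return re.sub(
        r'"[^"]*"',                                   # one quoted span
        lambda m: m.group(0).replace(".", "[dot]"),   # rewrite only the match
        text,
    )


print(mask_dots_in_quotes('He asked "Please." while she answered "No. Or maybe yes."'))
```

Dots outside the quotes are left alone because the callback only ever sees the matched span.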
Is there a regex to match "all characters including newlines"?
For example, in the regex below, there is no output from $2 because (.+?) doesn't include new lines when matching.
$string = "START Curabitur mollis, dolor ut rutrum consequat, arcu nisl ultrices diam, adipiscing aliquam ipsum metus id velit. Aenean vestibulum gravida felis, quis bibendum nisl euismod ut.
Nunc at orci sed quam pharetra congue. Nulla a justo vitae diam eleifend dictum. Maecenas egestas ipsum elementum dui sollicitudin tempus. Donec bibendum cursus nisi, vitae convallis ante ornare a. Curabitur libero lorem, semper sit amet cursus at, cursus id purus. Cras varius metus eu diam vulputate vel elementum mauris tempor.
Morbi tristique interdum libero, eu pulvinar elit fringilla vel. Curabitur fringilla bibendum urna, ullamcorper placerat quam fermentum id. Nunc aliquam, nunc sit amet bibendum lacinia, magna massa auctor enim, nec dictum sapien eros in arcu.
Pellentesque viverra ullamcorper lectus, a facilisis ipsum tempus et. Nulla mi enim, interdum at imperdiet eget, bibendum nec END";
$string =~ /(START)(.+?)(END)/;
print $2;
If you don't want to add the /s regex modifier (perhaps you still want . to retain its original meaning elsewhere in the regex), you may also use a character class. One possibility:
[\S\s]
That is, a character which either is not whitespace or is whitespace. In other words, any character.
You can also change modifiers locally in a small part of the regex, like so:
(?s:.)
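These options can be compared side by side. The sketch below uses Python rather than Perl, with a shortened, made-up test string; Python's re.DOTALL corresponds to Perl's /s and re.MULTILINE to /m, and the scoped (?s:.) form needs Python 3.6 or later.

```python
import re

# Hypothetical two-line sample standing in for the longer text.
text = "START first line\nsecond line END"


def middle(pattern, flags=0):
    """Return the text captured between START and END, or None if no match."""
    match = re.search(pattern, text, flags)
    return match.group(2) if match else None


print(middle(r"(START)(.+?)(END)"))                 # plain '.', stops at the newline
print(middle(r"(START)(.+?)(END)", re.DOTALL))      # like /s
print(middle(r"(START)([\s\S]+?)(END)"))            # character class
print(middle(r"(START)((?s:.)+?)(END)"))            # locally scoped modifier
print(middle(r"(START)(.+?)(END)", re.MULTILINE))   # like /m, only affects ^ and $
```

Only the plain and multi-line variants fail to match here, since neither lets . cross the line break.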
Add the s modifier to your regex to cause . to match newlines:
$string =~ /(START)(.+?)(END)/s;
Yep, you just need to make . match newlines:
$string =~ /(START)(.+?)(END)/s;
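Note that "multiline" (/m) is a different modifier: it only makes ^ and $ match at line boundaries and does not make . match newlines, so this still prints nothing for the example above:
$string =~ /(START)(.+?)(END)/m;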