Regular expression matching a sequence of words - regex

Let's suppose we have a paragraph like this:
Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit
amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit
amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit
amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem
turpis ipsum, sit amet.
I need a case-insensitive PCRE regex pattern that selects only the words
1 lorem
2 ipsum
3 sit
4 amet
in that specific order, ignoring punctuation and skipping occurrences like
Sit amet lorem ipsum
Lorem turpis ipsum, sit amet

A simple, straightforward approach is to allow a fixed set of punctuation characters between the words. You can add any other punctuation character inside the []:
([Ll]orem)[\s,.!:\-()?]+(ipsum)[\s,.!:\-()?]+(sit)[\s,.!:\-()?]+(amet)
or allow any run of non-word characters, i.e. anything that is not [A-Za-z0-9_] (\W already covers whitespace, so the \s is redundant):
([Ll]orem)[\s\W]+(ipsum)[\s\W]+(sit)[\s\W]+(amet)
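Case sensitivity can usually be switched off with a flag, depending on the programming language (in PCRE, the i modifier or an inline (?i)). Otherwise you have to spell out every relevant variation manually, like ([Ll]orem) (not [L|l], which would also match a literal |).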
Regex101 Example
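For readers who want to try this outside regex101, here is a minimal Python sketch of the second pattern applied to the paragraph from the question. The use of re.IGNORECASE in place of [Ll] and the added \b word boundaries are assumptions of this sketch, not part of the answer above.

```python
import re

# Paragraph from the question, with the original line breaks kept.
text = (
    "Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit\n"
    "amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit\n"
    "amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit\n"
    "amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem\n"
    "turpis ipsum, sit amet."
)

# \W+ allows any run of non-word characters (punctuation, spaces, newlines)
# between the words; \b (an addition in this sketch) stops the pattern from
# matching inside longer words.
pattern = re.compile(r"\b(lorem)\W+(ipsum)\W+(sit)\W+(amet)\b", re.IGNORECASE)

matches = [m.group(0) for m in pattern.finditer(text)]
for found in matches:
    print(repr(found))
```

The sequences "sit amet: Lorem ipsum" and "Lorem turpis ipsum, sit amet" are skipped because the four words do not appear consecutively in the required order.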

Related

Regex that matches multiple new lines until finding pattern

I am not very familiar with regex and I am having trouble creating one that solves my problem.
I want to create a regex that finds the following example: (What the regex should match is in bold)
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sodales tincidunt ipsum ut ullamcorper
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt Phasellus rhoncus quam id eros volutpat, ac sodales magna
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt
Number Name Degree
11111111 LOREM IPSUM COMPUTER ENGINEERING
31837183 DOLOR IPSUM COMPUTER ENGINEERING
Total: 2
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Number Name Degree
128172818211 SIT AMET IPSUM COMPUTER ENGINEERING
12183781 CONSECTETUR ELIT COMPUTER ENGINEERING
128172818212 ETIAM SODALES COMPUTER ENGINEERING
128172818213 IPSUM UT COMPUTER ENGINEERING
128172818215 SODALES MAGNA COMPUTER ENGINEERING
Total: 5
What I have accomplished so far is a regex that successfully matches the data lines and the first line of the action type, but not the subsequent lines. I would like to match everything that comes after Action type up to the line that contains Number, Name and Degree.
The regex I am currently using is (Action type: .+?\n|[0-9]{8,12} .+?\n). A preview of the current execution on regex101.com is attached.
As you can see, it works well for the second example, but it does not fulfil my needs with regard to the first one.
Is it possible to adapt my current regex to handle these multi-line blocks?
Try:
^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+
Regex demo.
^Action type:.*?(?=^Number Name Degree) - this matches all text beginning with Action type: until ^Number Name Degree is found.
^\d{8,12}[^\n]+ - this matches all lines beginning with 8-12 digits.
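Note: the expression needs the (?s) modifier so that . also matches newlines, together with multiline mode (m) so that ^ matches at the start of every line.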
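As a rough illustration, the pattern can be tried in Python as follows. The sample text is a shortened version of the data in the question, and re.MULTILINE | re.DOTALL stands in for the m and s modifiers; both choices are assumptions of this sketch.

```python
import re

# Shortened sample modelled on the question's data.
text = (
    "Action type: Lorem ipsum dolor sit amet\n"
    "Phasellus rhoncus quam id eros volutpat\n"
    "Number Name Degree\n"
    "11111111 LOREM IPSUM COMPUTER ENGINEERING\n"
    "31837183 DOLOR IPSUM COMPUTER ENGINEERING\n"
    "Total: 2\n"
    "Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "Number Name Degree\n"
    "128172818211 SIT AMET IPSUM COMPUTER ENGINEERING\n"
    "Total: 1\n"
)

# MULTILINE: ^ matches at the start of each line.
# DOTALL: . also matches newlines, so .*? can span several lines.
pattern = re.compile(
    r"^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+",
    re.MULTILINE | re.DOTALL,
)

matches = pattern.findall(text)
for found in matches:
    print(repr(found))
```

Each "Action type:" match includes the newline just before the "Number Name Degree" header, so calling strip() on the result may be convenient.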

regex not capturing newline

I am trying to parse log files using regex. The logs look like this:
2022-04-01 00:00:00.0000|DEBUG|LOREM:LOREM|IPSUM:LOREM:LOREMIPSUM Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. Ut condimentum nisl ipsum (Failed:1/Total:5) [10.0000 ms].
2022-04-01 00:00:00.0000|DEBUG|LOREM:IPSUM|lorem ipsum \\SOME-PATH[Lorem Ipsum] (ID:000000-0000-0000-0000). Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. //line return here
Ut condimentum nisl ipsum.
2022-04-01 00:00:00.0000|DEBUG|LOREM:IPSUM|lorem ipsum \\SOME-PATH[Lorem Ipsum] (ID:000000-0000-0000-0000). Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. //line return here
Ut condimentum nisl ipsum.
Here is what I have tried (live version on regex 101 https://regex101.com/r/RoDU5L/1)
^(?<timestamp>^[\d-]+\s[\d:.]+)\|DEBUG\|(.*?)?\r?$|.*?(?<path>\\.*\]\s)(?<description>.*)+$ /gm
The problem is that it does not capture the last line, "Ut condimentum nisl ipsum."
Thanks for your help
You can use
^(?<timestamp>^[\d-]+\s[\d:.]+)\|DEBUG\|(.*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)*)|.*?(?<path>\\.*\]\s)(?<description>.*)+$
See the regex demo.
The .*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)* part now matches
.* - any zero or more chars other than line break chars, as many as possible
(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)* - zero or more occurrences of
\r?\n(?![\d-]+\s[\d:.]+\|) - CRLF or LF line ending not immediately followed by a datetime-like pattern and a | right after
.* - any zero or more chars other than line break chars, as many as possible.
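The continuation-line idea can be sketched in Python like this. It is a simplified version that keeps only the first alternative of the pattern; the group name message and the shortened sample log are assumptions of this sketch, and Python requires the (?P<name>...) spelling for named groups.

```python
import re

# Shortened sample log; the second and third entries continue on a new line.
log = (
    "2022-04-01 00:00:00.0000|DEBUG|LOREM:LOREM|first entry on one line\n"
    "2022-04-01 00:00:01.0000|DEBUG|LOREM:IPSUM|second entry starts here\n"
    "Ut condimentum nisl ipsum.\n"
    "2022-04-01 00:00:02.0000|DEBUG|LOREM:IPSUM|third entry\n"
    "Ut condimentum nisl ipsum.\n"
)

# After the first line of an entry, keep consuming following lines as long as
# they do NOT start with a timestamp followed by "|".
entry_re = re.compile(
    r"^(?P<timestamp>[\d-]+\s[\d:.]+)\|DEBUG\|"
    r"(?P<message>.*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)*)",
    re.MULTILINE,
)

# rstrip() drops the trailing newline that the last entry picks up.
entries = [
    (m.group("timestamp"), m.group("message").rstrip())
    for m in entry_re.finditer(log)
]
for timestamp, message in entries:
    print(timestamp, repr(message))
```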

Ignore a substring in RegEx pattern

I want to ignore a certain substring inside the match, rather than exclude matches in which that substring occurs.
For example
I have the text:
Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-
ty egeet qwewerty lectus. Proinera risus massa, placerat in q-
werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-
erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,
euismod at augue at, dapibus feugiat qweerty
I need to find every qwerty, even if it is split by -\n.
My solution is to add (?:-\n)? after every character:
/q(?:-\n)?w(?:-\n)?e(?:-\n)?r(?:-\n)?t(?:-\n)?y/gm
But it looks bulky (even for this example, which contains only 6 characters), and it is too hard to modify the regex later. Is there some trick to make the regex shorter?
No, regex is not good at this kind of match. The easiest way would be to remove - and \n first.
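A minimal Python sketch of that suggestion: strip the "-\n" breaks first, then search for the plain word. The second half shows a possible alternative that is not part of the answer above, namely generating the bulky pattern from the word so it stays easy to change; the helper name split_tolerant is hypothetical.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-\n"
    "ty egeet qwewerty lectus. Proinera risus massa, placerat in q-\n"
    "werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-\n"
    "erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,\n"
    "euismod at augue at, dapibus feugiat qweerty"
)

# Approach from the answer: remove the hyphenated line breaks, then search.
joined = text.replace("-\n", "")
count_after_join = len(re.findall(r"qwerty", joined))


def split_tolerant(word: str) -> str:
    """Build a pattern that allows an optional '-\\n' between the letters."""
    return r"(?:-\n)?".join(re.escape(ch) for ch in word)


# Alternative: search the original text so match positions stay valid.
count_in_original = len(re.findall(split_tolerant("qwerty"), text))

print(count_after_join, count_in_original)
```

The first approach is simpler, while the generated pattern may be preferable when the match offsets must refer to the unmodified text.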

regex - multiple occurrences within match

Is it possible to find multiple matching groups within the full match using ONLY regex?
Given the text below
{1234} Lorem ipsum dolor sit amet, consectetur adipiscing elit. ** Sed
iaculis nisi et dapibus consectetur. Vestibulum ** feugiat sapien, sed
sagittis magna. Phasellus euismod tempor augue, ** eget dictum mi
sagittis sit amet. Quisque sit amet diam vel magna imperdiet pulvinar
vel ac lectus. {4321} Lorem ipsum....
I'm trying to group all the occurrences of ** between the numbers.
I came up with the following:
\{\d+\}.+?(\*\*)+.+\{\d.+\}
https://regex101.com/r/s746be/2
As you can see, it only captures the first occurrence because of the lazy question mark, or the last one if I remove the question mark.
Why not break it down to a few simple steps instead of using a really big inefficient Regex?
You can try doing something like this:
1) Grab the text between {1234} and {4321} using a simple regex like this:
/\{\d+\}(.*?)\{\d+\}/
2) Extract the matched text between these two delimiters
3) Run a second global regex search on this matched inner text using a simple regex pattern like so:
/\*\*/g
Hope this helps
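The steps above can be sketched in Python roughly as follows. The re.DOTALL flag is an assumption of this sketch: it is needed here because the sample text spans several lines and . does not match newlines by default.

```python
import re

text = (
    "{1234} Lorem ipsum dolor sit amet, consectetur adipiscing elit. ** Sed\n"
    "iaculis nisi et dapibus consectetur. Vestibulum ** feugiat sapien, sed\n"
    "sagittis magna. Phasellus euismod tempor augue, ** eget dictum mi\n"
    "sagittis sit amet. Quisque sit amet diam vel magna imperdiet pulvinar\n"
    "vel ac lectus. {4321} Lorem ipsum...."
)

# Steps 1 and 2: grab the text between the two {number} delimiters.
outer = re.search(r"\{\d+\}(.*?)\{\d+\}", text, re.DOTALL)
inner = outer.group(1) if outer else ""

# Step 3: find every ** inside that inner text.
positions = [m.start() for m in re.finditer(r"\*\*", inner)]

print(len(positions), positions)
```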

Remove one iteration from every instance of a pattern with a RegEx?

Let's say I have the following text:
Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.
aaBaaB
aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB
Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB
I'm trying to figure out how to perform a regular expression search and replace on this in such a way that the output is:
Lorem ipsum dolor sit amet, consectetur aaBaaB adipiscing elit.
aaB
Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaB
Fusce nec tortor in dolor aaBaaB porttitor viverra.
That is, to remove one "aaB" from each pattern of it. Is this actually possible, and if so, how would it be done? Specifically, I intend to do this in Sublime Text 2 as a RegEx search/replace in a file.
You can use a positive lookahead:
(?=(?<w>[a-z]{2}[A-Z]{1})\s)\k<w>
You just need to make sure you have case-sensitive matching on.
example: http://regex101.com/r/sK8bG1
Use either the leading or trailing whitespace to remove the first or last substring. Either of these works:
(\s+)(aaB) with $1 in the Replace field
or
(aaB)(\s+) with $2 in the Replace field
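For anyone checking the result outside Sublime Text, here is a small Python sketch of the second variant. Python writes the replacement as \2 rather than $2, and the trailing newline on the sample text is an assumption so that the final aaB is followed by whitespace.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.\n"
    "aaBaaB\n"
    "aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB\n"
    "Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB\n"
)

# Drop the last "aaB" of each run (the one followed by whitespace) and keep
# the whitespace itself via the second group.
result = re.sub(r"(aaB)(\s+)", r"\2", text)

# Trim leftover spaces at line edges for comparison with the desired output.
lines = [line.strip() for line in result.splitlines()]
for line in lines:
    print(line)
```

A lone aaB leaves its surrounding whitespace behind, which is why the lines are stripped before printing.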