Regular expressions: Finding BB code in a piece of text - regex

I'm trying to match the "url" BBCode tag in an arbitrary piece of text. Example text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. [url]http://www.google.com[/url] Donec purus nunc, rhoncus vitae tempus vitae, [url=www.facebook.com]facebook[/url] elementum sit amet justo.
I want to find both "url" tags from this text:
[url]http://www.google.com[/url]
[url=www.facebook.com]facebook[/url]
I am not that good with regular expressions so this is as far as I could get:
\[url(=[a-z]*)?\][a-z]*\[/url\]
I think I just need to replace [a-z] with something that matches on anything EXCEPT the characters '[' and ']'. Can anybody help me out with this please?

The following expression should do it for you
\[url(=(.*?))?\](.*?)\[\/url\]

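((\[url\].*?\[/url\])|(\[url=.*?\](.*?)\[/url\]))
This will pull both results. The quantifier after url= is lazy so that a match cannot run on into a later tag on the same line.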
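The idea from the question, matching anything except '[' and ']', can also be sketched directly. This is a minimal Python example, not taken from either answer; the name URL_TAG is made up for illustration, and the pattern assumes that neither the URL nor the link text contains square brackets.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "[url]http://www.google.com[/url] Donec purus nunc, rhoncus vitae tempus vitae, "
    "[url=www.facebook.com]facebook[/url] elementum sit amet justo."
)

# [^\[\]]* matches any run of characters that are neither '[' nor ']'.
# Group 1 is the optional "=target" part, group 2 is the tag body.
URL_TAG = re.compile(r"\[url(?:=([^\[\]]*))?\]([^\[\]]*)\[/url\]")

matches = [m.group(0) for m in URL_TAG.finditer(text)]
for m in URL_TAG.finditer(text):
    print(m.group(0), "| target:", m.group(1), "| body:", m.group(2))
```

Because the character class cannot cross a bracket, a match can never spill from one tag into the next, which is the risk with an unrestricted `.*`.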

Related

Regex that matches multiple new lines until finding pattern

I am not very familiar with regex and I am having trouble creating one that solves my problem.
I want to create a regex that matches the following example: (What the regex should match is in bold)
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sodales tincidunt ipsum ut ullamcorper
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt Phasellus rhoncus quam id eros volutpat, ac sodales magna
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt
Number Name Degree
11111111 LOREM IPSUM COMPUTER ENGINEERING
31837183 DOLOR IPSUM COMPUTER ENGINEERING
Total: 2
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Number Name Degree
128172818211 SIT AMET IPSUM COMPUTER ENGINEERING
12183781 CONSECTETUR ELIT COMPUTER ENGINEERING
128172818212 ETIAM SODALES COMPUTER ENGINEERING
128172818213 IPSUM UT COMPUTER ENGINEERING
128172818215 SODALES MAGNA COMPUTER ENGINEERING
Total: 5
So far I have a regex that successfully matches the numbered lines and the first line of the action type, but not the lines that follow it. I would like to match everything that comes after "Action type:" up to the line that contains Number, Name and Degree.
The regex I am currently using is (Action type: .+?\n|[0-9]{8,12} .+?\n). A preview of the current execution using regex101.com is attached.
As you can see, it works well for the second example, but it does not meet my needs for the first one.
Is it possible to adapt my current regex to handle these multi-line blocks?
Try:
^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+
Regex demo.
^Action type:.*?(?=^Number Name Degree) - this matches all text beginning with Action type: until ^Number Name Degree is found.
^\d{8,12}[^\n]+ - this matches all lines beginning with 8-12 digits.
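Note: the expression needs the (?s) modifier so that . also matches newlines, and multiline mode, (?m), so that ^ matches at the start of every line.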
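As a rough check, the expression can be run in Python, where re.S plays the role of the (?s) modifier and re.M makes ^ match at the start of each line. The sample text below is a shortened version of the data in the question, so it is an assumption about the real input rather than a copy of it.

```python
import re

# Shortened sample in the same layout as the question's data.
text = """Action type: Lorem ipsum dolor sit amet
Phasellus rhoncus quam id eros volutpat
Number Name Degree
11111111 LOREM IPSUM COMPUTER ENGINEERING
31837183 DOLOR IPSUM COMPUTER ENGINEERING
Total: 2
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Number Name Degree
128172818211 SIT AMET IPSUM COMPUTER ENGINEERING
Total: 1
"""

# re.S: '.' also matches newlines; re.M: '^' matches at every line start.
pattern = re.compile(
    r"^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+",
    re.S | re.M,
)

matches = pattern.findall(text)
for m in matches:
    print(repr(m))
```

Each "Action type:" match includes every line up to, but not including, the "Number Name Degree" header; the numbered rows are returned as separate matches.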

Ignore a substring in RegEx pattern

I want to ignore a certain substring inside the matched text, rather than reject the match when that substring is present.
For example
I have the text:
Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-
ty egeet qwewerty lectus. Proinera risus massa, placerat in q-
werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-
erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,
euismod at augue at, dapibus feugiat qweerty
I need to find every qwerty, even if it is interrupted by -\n.
My solution is to add (?:-\n)? after every character:
/q(?:-\n)?w(?:-\n)?e(?:-\n)?r(?:-\n)?t(?:-\n)?y/gm
But it looks bulky (even for this example, which contains only 6 characters) and it is hard to modify the regex later. Is there some trick to make the regex shorter?
No, regex is not good at this kind of match. The easiest way would be to remove - and \n first.
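Both routes can be sketched in Python. The first follows the suggestion above and strips the "-\n" breaks before searching. The second is an alternative not mentioned in the answer: it builds the bulky pattern from the word programmatically, so the regex stays easy to change and match positions still refer to the original text. The variable names are illustrative only.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-\n"
    "ty egeet qwewerty lectus. Proinera risus massa, placerat in q-\n"
    "werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-\n"
    "erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,\n"
    "euismod at augue at, dapibus feugiat qweerty"
)
word = "qwerty"

# Route 1: remove the hyphenated line breaks, then search for the plain word.
joined = text.replace("-\n", "")
count_after_cleanup = len(re.findall(re.escape(word), joined))

# Route 2: generate q(?:-\n)?w(?:-\n)?e... from the word instead of typing it.
pattern = re.compile(r"(?:-\n)?".join(re.escape(ch) for ch in word))
hits = [m.group(0) for m in pattern.finditer(text)]

print(count_after_cleanup, hits)
```

Route 1 is simpler when only the occurrences matter; route 2 is worth the extra line when offsets into the unmodified text are needed.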

How to handle embedded commas and quotes in a regular expression search string

I have a CSV file and I want to convert
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, maecenas porttitor congue massa
To
<text>
<name>Lorem ipsum dolor sit amet</name>
<element>consectetuer adipiscing elit</element>
<desc> maecenas porttitor congue massa</desc>
</text>
I'm able to get this simple case done with the search expression being:
^([^,]*),([^,]*),([^,]*),
^ - look for the beginning of the line
([^,]*), - look for zero or more characters that are not a comma, followed by a comma, and group it (do this 3 times)
And the replacement expression as:
<text>\n <name>$1</name>\n <element>$2</element>\n <desc>$3</desc>\n</text>\n
This works for the simple case. However, sometimes a value in the CSV has embedded commas, in which case the value has quotes around it.
Lorem ipsum dolor sit amet, "consectetuer, adipiscing elit", maecenas porttitor congue massa
So the second value (which will be an <element>) should end up with:
<text>
<name>Lorem ipsum dolor sit amet</name>
<element>consectetuer, adipiscing elit</element>
<desc> maecenas porttitor congue massa</desc>
</text>
That is, <element> should have the embedded comma. I don't need to keep the quotes.
And then to make it a bit messier, the string might also contain quotes, which are escaped with quotes (or at least that's how the CSV is given to me, which was generated from a google sheet and saved as a CSV)
Lorem ipsum dolor sit amet, "and he said, ""no way!"", to my astonishment", maecenas porttitor congue massa
I want to end up with:
<text>
<name>Lorem ipsum dolor sit amet</name>
<element>and he said, "no way!", to my astonishment</element>
<desc> maecenas porttitor congue massa</desc>
</text>
So <element> should have the embedded commas and escaped quotes (with the escape character, which is a second quote, removed).
I got lost on trying to create the search regular expression.
Something along these lines should work:
^\s* ( " (?:[^"]|(?:""))*" |(?:[^,]*)), \s*(" (?:[^"]|(?:""))*" |(?:[^,]*)), \s*(" (?:[^"]|(?:""))*" |(?:[^,]*))
It's the same pattern basically...Repeated 3 times.
Whitespace, followed by a capturing group that is either a sequence of non-commas, or preferably, a " followed by (anything that is not a ") OR a "", lastly followed by a closing quote.
You'll need to check the "Ignore Whitespace" button at the link below.
regex storm
Using a {3} notation instead of repeating the pattern 3 times can work and could even be used to replace the "" but I'm a little unsure about how to get at repeated capture groups through the UI
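For reference, here is the same three-field idea as a small Python sketch. It is an illustration under assumptions, not the exact expression above: FIELD, LINE and unquote are hypothetical names, the pattern is written without the free-spacing layout, and the quotes are stripped in a second step because the capture group still contains them.

```python
import re

# One field: either a quoted value (where "" stands for a literal quote)
# or a run of characters that contains no comma.
FIELD = r'\s*("(?:[^"]|"")*"|[^,]*)'
LINE = re.compile("^" + FIELD + "," + FIELD + "," + FIELD)

def unquote(value):
    """Drop the surrounding quotes and collapse "" into a single quote."""
    if value.startswith('"') and value.endswith('"') and len(value) >= 2:
        return value[1:-1].replace('""', '"')
    return value

line = ('Lorem ipsum dolor sit amet, "and he said, ""no way!"", '
        'to my astonishment", maecenas porttitor congue massa')

fields = [unquote(f) for f in LINE.match(line).groups()]
print(fields)
```

The quoted alternative is listed first so that a field starting with a quote is consumed as a whole, embedded commas included, before the plain alternative is tried.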
I am no visual-studio-code expert. But I think this can be done without regex
The following Python code should give an idea.
The key is to ignore commas until the quotes are paired.
data = 'Lorem ipsum dolor sit amet, "and he said, ""no way!"", to my astonishment", maecenas porttitor congue massa'
items = data.split(',')
result = []
for item in items:
    # If the last item has an odd number of quotes, it still needs pairing:
    # the comma was inside a quoted value, so ignore it as a separator
    if result and result[-1].count('"') % 2:
        # Append to the last element, restoring the comma removed by split()
        result[-1] += ',' + item
    else:
        result.append(item)
print("\n".join(result))
Output
Lorem ipsum dolor sit amet
"and he said, ""no way!"", to my astonishment"
maecenas porttitor congue massa
Please let me know if you need more explanation for the code
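If Python is an option anyway, the standard csv module already understands quoted fields and doubled quotes, so the pairing logic does not have to be written by hand. The sketch below assumes the three-column layout from the question; skipinitialspace=True is needed because the quoted value follows a comma and a space.

```python
import csv
import io
from xml.sax.saxutils import escape

line = ('Lorem ipsum dolor sit amet, "and he said, ""no way!"", '
        'to my astonishment", maecenas porttitor congue massa')

# csv.reader handles the embedded commas and turns "" back into ".
reader = csv.reader(io.StringIO(line), skipinitialspace=True)
name, element, desc = next(reader)

# escape() protects &, < and > so the values are safe inside XML elements.
xml = (
    "<text>\n"
    f" <name>{escape(name)}</name>\n"
    f" <element>{escape(element)}</element>\n"
    f" <desc>{escape(desc)}</desc>\n"
    "</text>\n"
)
print(xml)
```

Note that skipinitialspace also removes the leading space that the question's expected output keeps inside the desc element.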

regex - multiple occurrences within match

Is it possible to find multiple matching groups within the full match using ONLY regex?
Given the text below
{1234} Lorem ipsum dolor sit amet, consectetur adipiscing elit. ** Sed
iaculis nisi et dapibus consectetur. Vestibulum ** feugiat sapien, sed
sagittis magna. Phasellus euismod tempor augue, ** eget dictum mi
sagittis sit amet. Quisque sit amet diam vel magna imperdiet pulvinar
vel ac lectus. {4321} Lorem ipsum....
I'm trying to capture all the occurrences of ** between the numbers.
I came up with the following:
\{\d+\}.+?(\*\*)+.+\{\d.+\}
https://regex101.com/r/s746be/2
As you can see, it only captures the first occurrence because of the lazy question mark, or the last one if I remove the question mark.
Why not break it down to a few simple steps instead of using a really big inefficient Regex?
You can try doing something like this:
1) Grab the text between {1234} and {4321} using a simple regex like this:
/\{\d+\}(.*?)\{\d+\}/
2) Extract the matched text between these two delimiters
3) Run a second global regex search on this matched inner text using a simple regex pattern like so:
/\*\*/g
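The steps above can be sketched in Python as follows. One assumption is added: because the sample text spans several lines, the first search uses re.S so that . also matches newlines.

```python
import re

text = """{1234} Lorem ipsum dolor sit amet, consectetur adipiscing elit. ** Sed
iaculis nisi et dapibus consectetur. Vestibulum ** feugiat sapien, sed
sagittis magna. Phasellus euismod tempor augue, ** eget dictum mi
sagittis sit amet. Quisque sit amet diam vel magna imperdiet pulvinar
vel ac lectus. {4321} Lorem ipsum...."""

# Steps 1 and 2: grab the text between the two {number} delimiters.
outer = re.search(r"\{\d+\}(.*?)\{\d+\}", text, re.S)
inner = outer.group(1)

# Step 3: find every ** inside that inner text.
stars = [m.start() for m in re.finditer(r"\*\*", inner)]
print(len(stars), stars)
```

The offsets in stars are relative to the inner text; add outer.start(1) to each one to get positions in the original string.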
Hope this helps

Remove one iteration from every instance of a pattern with a RegEx?

Let's say I have the following text:
Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.
aaBaaB
aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB
Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB
I'm trying to figure out how to perform a regular expression search and replace on this in such a way that the output is:
Lorem ipsum dolor sit amet, consectetur aaBaaB adipiscing elit.
aaB
Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaB
Fusce nec tortor in dolor aaBaaB porttitor viverra.
That is, to remove one "aaB" from each pattern of it. Is this actually possible, and if so, how would it be done? Specifically, I intend to do this in Sublime Text 2 as a RegEx search/replace in a file.
You can use a positive lookahead:
(?=(?<w>[a-z]{2}[A-Z]{1})\s)\k<w>
You just need to make sure you have case-sensitive matching on.
example: http://regex101.com/r/sK8bG1
Use either the leading or trailing whitespace to remove the first or last substring. Either of these work:
(\s+)(aaB) with $1 in the Replace field
or
(aaB)(\s+) with $2 in the Replace field
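Outside Sublime Text, the same idea can be tried in Python. This sketch uses a plain lookahead rather than either expression above: it removes an aaB only when it is followed by whitespace or the end of the text, which is the last aaB of every run. Unlike the expected output in the question, it leaves the neighbouring whitespace in place.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.\n"
    "aaBaaB\n"
    "aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB\n"
    "Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB"
)

# (?=\s|$) checks what follows without consuming it, so only the final
# "aaB" of each run is matched and replaced with nothing.
result = re.sub(r"aaB(?=\s|$)", "", text)
print(result)
```

Matching is case-sensitive by default in Python, which the pattern relies on.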