Regex expression to bold using asterisks - regex

I have a question regarding using a regular expression to bold text within a string using asterisks.
The other questions on this topic work well for simple scenarios; however, we have encountered some issues.
Our particular scenario is for asterisks to be replaced with <bold></bold> tags.
It must also be able to handle multiple asterisks as well as an uneven number of asterisks.
Our example input text is as follows:
string exampleText1 = "**** PLEASE NOTE *** Testing, *nuts*, **please note..., test";
string exampleText2 = "**Test text (10)";
Our current regex is as follows:
Regex _boldRegex = new Regex(@"(\*)+([^*?$]+)+(\*)");
string value = _boldRegex.Replace(exampleText1, @"<bold>$2</bold>");
Example 1 should show "<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test" where the groups of asterisks are treated as single asterisks and an unfinished tag is ignored.
Example 2 crashes the program because it expects a 'closing' asterisk. It should show "*Test text (10)".
Can anyone help by suggesting a new regex, bearing in mind the ability to handle groups of asterisks and also an uneven number of asterisks?
Thanks in advance.

For your example data, you might use an optional part with a capture group that captures the repeated character class (excluding newlines) between runs of one or more *.
In the callback of replace, you can test for the existence of group 1, and do the replacements based on that.
\*+(?:([^*?$\n\r]+)\*+)?
The pattern matches:
\*+ Match 1+ times *
(?: Non capture group
( Capture group 1
[^*?$\n\r]+ Match 1+ times any char other than the listed in the character class
) Close group 1
\*+ Match 1+ times *
)? Close the non-capture group and make it optional
See a regex demo.
For example
Regex _boldRegex = new Regex(@"\*+(?:([^*?$\n\r]+)\*+)?");
string exampleText1 = @"**** PLEASE NOTE *** Testing, *nuts*, **please note..., test
**Test text (10)";
string value = _boldRegex.Replace(exampleText1, m =>
    m.Groups[1].Success ? String.Format("<bold>{0}</bold>", m.Groups[1].Value) : "*"
);
Console.WriteLine(value);
Output
<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test
*Test text (10)
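For readers who are not on .NET, the same idea can be sketched in Python. This is only an illustration of the callback approach above, assuming Python's standard re module; the helper name bold_asterisks is made up for this example.

```python
import re

# Same pattern as in the answer above, written for Python's re module.
BOLD_RE = re.compile(r"\*+(?:([^*?$\n\r]+)\*+)?")


def bold_asterisks(text):
    """Wrap text enclosed by runs of asterisks in <bold> tags.

    A run of asterisks without a closing run collapses to a single '*'.
    bold_asterisks is a hypothetical helper name, not part of any library.
    """
    def replace(match):
        inner = match.group(1)  # None when the optional part did not take part
        if inner is None:
            return "*"
        return "<bold>{}</bold>".format(inner)

    return BOLD_RE.sub(replace, text)


print(bold_asterisks("**** PLEASE NOTE *** Testing, *nuts*, **please note..., test"))
print(bold_asterisks("**Test text (10)"))
```

As in the C# version, the decision is made per match: group 1 only takes part when a closing run of asterisks exists.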

Related

Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I want to parse out each key-value pair in the string that follows the Key(value) format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could appear without () like such: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"
You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
Note that I revamped the pattern: inside each optional group, the leading whitespace (\s+) and the field itself are now obligatory, and only the group as a whole is optional, which makes matching more efficient.
The main thing changed is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, it matches
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": ( (matched with \(), then zero or more repetitions of any char other than ( and ) or the Group "senderpartyname" pattern recursed ( see (?:[^()]++|\g<senderpartyname>)*) and then a ) char (matched with \))
)? - end of the group, one or zero repetitions (optional)
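The recursion in the group above amounts to matching balanced parentheses. Python's standard re module has no recursion support, so the sketch below swaps in a different technique, a plain depth counter, to show what "balanced" means here. The function name extract_fields is hypothetical. The sketch returns each value without its outer parentheses, whereas the named group in the pattern above is written to span them as well.

```python
def extract_fields(line):
    """Collect Key(value) pairs, allowing balanced nested parentheses in values.

    Hypothetical helper for illustration; it is not a Grok or Logstash feature.
    """
    fields = {}
    pos = 0
    while True:
        start = line.find("(", pos)
        if start == -1:
            break
        # The key is the run of word characters directly before "(".
        key_start = start
        while key_start > 0 and (line[key_start - 1].isalnum() or line[key_start - 1] == "_"):
            key_start -= 1
        key = line[key_start:start]
        # Walk forward, tracking nesting depth, until the matching ")" is found.
        depth = 0
        end = -1
        for i in range(start, len(line)):
            if line[i] == "(":
                depth += 1
            elif line[i] == ")":
                depth -= 1
                if depth == 0:
                    end = i
                    break
        if end == -1:
            break  # unbalanced parentheses: stop rather than guess
        if key:
            fields[key] = line[start + 1:end]
        pos = end + 1
    return fields


log = ("2021/05/23 16:01:26.094 High Messaging.Message.Delivered "
       "Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) "
       "SenderPartyName(Mcdonalds (CFH) Restaurant Glen) "
       "ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)")
print(extract_fields(log)["SenderPartyName"])
```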

openrefine extracting values between symbols

I am trying to extract a string of text from a whole field with OpenRefine.
This is an extract of my dataset:
172. D3B: 23Y1-Up, 27Y1-Up (36 LK) 6-S/F Rollers, 4-D/F Rollers, 2-Carrier Rollers
179. D3C: 23Y2508-UP (37LK) 6-S/F, 4-D/F, 2-T/C
180. 27Y5050-UP (37LK) 6-S/F, 4-D/F, 2-T/C
181. 2XF622-UP (37LK) 6-S/F, 4-D/F, 2-T/C
182. 3RF0147-UP (36LK) 6-S/F, 4-D/F, 2-T/C
200. D4D:67A1-UP, 78A1-UP, 85A1-UP, 86A1-UP, 59J1-644, 58J1-UP, 49J1-473, 22C1-UP, 91A1-UP, 88A1-UP
I want to extract 23Y1-Up, 27Y1-Up from record 172,
23Y2508-UP from record 179, 27Y5050-UP from record 180 and the whole 67A1-UP, 78A1-UP, 85A1-UP, 86A1-UP, 59J1-644, 58J1-UP, 49J1-473, 22C1-UP, 91A1-UP, 88A1-UP from record 200
So basically the rule would be to extract everything between the : (if present) and the ( (if present), maybe restricting it to cases where the string UP occurs one or more times.
So I am adding a new column based on the existing column using value.match.
I tried to adapt some queries to my needs, but I am very far from succeeding despite multiple attempts.
I started with the regex value.match(/\:?\s*(\w+\.?)+?.*/)[0], which I thought would isolate any word AFTER the colon (and the space), but it works only with words BEFORE it...
Yesterday I successfully extracted the numbers before the LK that is also relevant information for my dataset, but I can't grasp this.
Any help is much appreciated!
Thanks
value.match only succeeds when the pattern matches the whole string.
You can use a single capture group with a negated character class that excludes the ( character:
^[^:]*:\s*([^(]+).*$
^[^:]*:\s* Match until the first : followed by optional whitespace chars
( Capture group 1
[^(]+ Match 1+ occurrence of any char except (
) Close group 1
.*$ Match the rest of the line
regex demo
Or capture in a group matching only word characters separated by a hyphen
^[^:]*:\s*(\w+-\w+(?:,\s+\w+-\w+)*).*$
regex demo
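Both patterns can be tried outside OpenRefine with a few lines of Python. This is a rough sketch using the standard re module; the helper name extract and the trailing strip() are assumptions added for the illustration. Note that a record without a colon, such as record 180, does not match either pattern.

```python
import re

# The two patterns from the answer above.
UP_TO_PAREN = re.compile(r"^[^:]*:\s*([^(]+).*$")
SERIALS_ONLY = re.compile(r"^[^:]*:\s*(\w+-\w+(?:,\s+\w+-\w+)*).*$")


def extract(pattern, value):
    """Return capture group 1 without surrounding whitespace, or None if there is no match.

    extract is a hypothetical helper; both patterns must match the whole value.
    """
    m = pattern.match(value)
    return m.group(1).strip() if m else None


rec_172 = "172. D3B: 23Y1-Up, 27Y1-Up (36 LK) 6-S/F Rollers, 4-D/F Rollers, 2-Carrier Rollers"
rec_179 = "179. D3C: 23Y2508-UP (37LK) 6-S/F, 4-D/F, 2-T/C"
rec_180 = "180. 27Y5050-UP (37LK) 6-S/F, 4-D/F, 2-T/C"

print(extract(UP_TO_PAREN, rec_172))
print(extract(SERIALS_ONLY, rec_179))
print(extract(UP_TO_PAREN, rec_180))  # no colon in this record, so no match
```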

Regex for SQL Query

Hello everyone, I have the following problem:
I have a long list of SQL queries which I would like to adapt to one of my changes. Ultimately it is a renaming problem, and I'm afraid I am making the solution more complicated than it needs to be.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
The INSERT column list contained another column, which I removed simply via Notepad++ by searching for the term "example, " and replacing it with an empty string. Only the corresponding entry in VALUES can't be removed this way, because the text varies there. So far I have only worked with the text file in which I adjusted the list of queries.
So, as you can see, there is one more entry in VALUES than in the column list (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove it, including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th value in the comma-separated list should be removed. However, even after research I do not know how to realize this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach, or is there another method? Regex is the only way I know to solve this problem, because the values differ from query to query.
And how can I then apply the regex to adapt the queries (they are stored locally on my computer and are not yet included in the code)?
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)
As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bVALUES +\( - Match a word boundary, the literal 'VALUES', at least one space, and a literal opening parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
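If the queries live in a text file, the same find/replace can also be scripted. The following is a minimal Python sketch of the pattern above, assuming the standard re module; drop_14th_value is a hypothetical helper name. Like the pattern itself, it assumes that none of the first 14 values contains a comma.

```python
import re

# Group 1 keeps "VALUES (" plus the first 13 values; the 14th value and its comma are dropped.
DROP_14TH = re.compile(r"(\bVALUES +\((?:[^,]+,){13})[^,]+,")


def drop_14th_value(sql):
    """Remove the 14th entry of the VALUES list (hypothetical helper name)."""
    return DROP_14TH.sub(r"\1", sql)


query = (
    "INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, "
    "bem, anrede, salutation, email, name2, name3, association, project) "
    "VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', "
    "N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', "
    "N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);"
)
print(drop_14th_value(query))
```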
You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.

Regular expression not capturing optional group

I'm using the following regular expression pattern:
.*(?<line>^\s*Extends\s+#(?<extends>[_A-Za-z0-9]+)\s*$)?.*
And the following text:
Name #asdf
Extends #extendedClass
Origin #id
What I don't understand is that both captured groups (line and extends) are empty, but when I remove the last question mark from the expression, the groups are captured.
The line group must be optional since the Extends line is not always present.
I created a fiddle using this expression, which can be accessed at https://regexr.com/4rekk
EDIT
I forgot to mention that I'm using the multiline and dotall flags along with the expression.
It's already been mentioned that the leading .* is capturing everything when you make your (?<line>) group optional. The following is not directly related to your question but it may be useful information (if not, just ignore):
You need to be careful elsewhere. You are using ^ and $ to match the start and end of lines as well as the start and end of the string. But the $ character will not consume the newline character that marks the end of a line. So:
'Line 1\nLine 2'.match(/^Line 1$^Line 2/m) returns null
while
'Line 1\nLine 2'.match(/^Line 1\n^Line 2/m) returns a match
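The same behaviour can be reproduced in Python, whose re.MULTILINE flag treats ^ and $ the same way for this case. This is just a quick check with the standard re module, not part of the original fiddle.

```python
import re

text = "Line 1\nLine 2"

# "$" asserts the end of the line but does not consume the newline,
# so "^" cannot match right after it.
without_newline = re.search(r"^Line 1$^Line 2", text, re.MULTILINE)

# Consuming the newline explicitly lets the second "^" match.
with_newline = re.search(r"^Line 1\n^Line 2", text, re.MULTILINE)

print(without_newline)
print(with_newline.group(0))
```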
So in your case if you were trying to capture all three lines, any of which were optional, you would write the regex for one of the lines as follows to make sure you consume the newline:
/(?<line>^\s*Extends\s+#(?<extends>[_A-Za-z0-9]+)[^\S\n]*\n)?/ms
Where you had specified \s*$, I have [^\S\n]*\n. [^\S\n]* is a double negative that matches zero or more characters that are neither non-whitespace nor a newline, so it consumes all whitespace characters except the newline. If you wanted to look for any of the three lines in your example (any or all are optional), then the following code snippet should do it. I have used the RegExp constructor to create the regex so that it can be split across multiple lines. Unfortunately, it takes a string as its argument, so the backslashes have to be doubled:
let s = ` Name #asdf
Extends #extendedClass
Origin #id
`;
let regex = new RegExp(
"(?<line0>^\\s*Name\\s+#(?<name>[_A-Za-z0-9]+)[^\\S\\n]*\\n)?" +
"(?<line>^\\s*Extends\\s+#(?<extends>[_A-Za-z0-9]+)[^\\S\\n]*\\n)?" +
"(?<line2>^\\s*Origin\\s+#(?<id>[_A-Za-z0-9]+)[^\\S\\n]*\\n)?",
'm'
);
let m = s.match(regex);
console.log(m.groups);
The above code snippet seems to have a problem under Firefox (an invalid regex flag, 's', is flagged on a line that doesn't exist in the above snippet). See the following regex demo.
And without named capture groups:
let s = ` Name #asdf
Extends #extendedClass
Origin #id
`;
let regex = new RegExp(
"(^\\s*Name\\s+#([_A-Za-z0-9]+)[^\\S\\n]*\\n)?" +
"(^\\s*Extends\\s+#([_A-Za-z0-9]+)[^\\S\\n]*\\n)?" +
"(^\\s*Origin\\s+#([_A-Za-z0-9]+)[^\\S\\n]*\\n)?",
'm'
);
let m = s.match(regex);
console.log(m);
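For comparison, here is a rough Python counterpart of the snippets above. It assumes the standard re module, where named groups are written (?P<name>...) and raw strings avoid the doubled backslashes; parse_header is a hypothetical helper name. Because every part is optional, the pattern also matches the empty string, so this only works when the text starts with these lines.

```python
import re

# Group names mirror the JavaScript snippet above.
HEADER_RE = re.compile(
    r"(?P<line0>^\s*Name\s+#(?P<name>[_A-Za-z0-9]+)[^\S\n]*\n)?"
    r"(?P<line>^\s*Extends\s+#(?P<extends>[_A-Za-z0-9]+)[^\S\n]*\n)?"
    r"(?P<line2>^\s*Origin\s+#(?P<id>[_A-Za-z0-9]+)[^\S\n]*\n)?",
    re.MULTILINE,
)


def parse_header(text):
    """Return the three optional values; a missing line yields None (hypothetical helper)."""
    m = HEADER_RE.search(text)  # always matches, possibly with all groups unset
    return {key: m.group(key) for key in ("name", "extends", "id")}


print(parse_header(" Name #asdf\nExtends #extendedClass\nOrigin #id\n"))
print(parse_header(" Name #asdf\nOrigin #id\n"))
```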

Extracting email addresses from messy text in OpenRefine

I am trying to extract just the emails from a text column in OpenRefine. Some cells have just the email, but others have the name and email in john doe <john#doe.com> format. I have been using the following GREL/regex, but it does not return the entire email address. For the above example I'm getting ["n#doe.com"]
value.match(
/.*([a-zA-Z0-9_\-\+]+#[\._a-zA-Z0-9-]+).*/
)
Any help is much appreciated.
The n is captured because you are using .* before the capturing group, and since it can match any 0+ chars other than line break chars greedily the only char that can land in Group 1 during backtracking is the char right before #.
If partial matches are acceptable, get rid of the .* and use
/[^<\s]+#[^\s>]+/
See the regex demo
Details
[^<\s]+ - 1 or more chars other than < and whitespace
# - a # char
[^\s>]+ - 1 or more chars other than whitespace and >.
Python/Jython implementation:
import re
res = ''
m = re.search(r'[^<\s]+#[^\s>]+', value)
if m:
    res = m.group(0)
return res
There are other ways to match these strings. In case you need a full string match, use .*<([^<]+#[^>]+)>.*, where .* will not gobble the name since it has to stop before an obligatory <.
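To see the three patterns side by side, here is a small Python check using the standard re module. The sample cells keep the # separator exactly as written in the question, and the variable names are made up for this sketch. It also shows that the full-string variant finds nothing in a cell that holds only the address.

```python
import re

GREEDY = re.compile(r".*([a-zA-Z0-9_\-\+]+#[\._a-zA-Z0-9-]+).*")  # pattern from the question
PARTIAL = re.compile(r"[^<\s]+#[^\s>]+")                          # partial-match pattern
FULL = re.compile(r".*<([^<]+#[^>]+)>.*")                         # full-string variant

with_name = "john doe <john#doe.com>"
bare = "john#doe.com"

print(GREEDY.match(with_name).group(1))    # the leading .* leaves only one char before '#'
print(PARTIAL.search(with_name).group(0))
print(FULL.match(with_name).group(1))
print(FULL.match(bare))                    # requires <...>, so a bare address does not match
print(PARTIAL.search(bare).group(0))
```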
If some cells contain just the email, it's probably better to use @wiktor-stribiżew's partial match. In the development version of OpenRefine, there is now a value.find() function that can do this, but it will only be officially released in the next version (2.9). In the meantime, you can reproduce it using Python/Jython instead of GREL:
import re
return re.findall(r"[^<\s]+#[^\s>]+", value)[0]