Using regex for HTML-parsed text - regex

Based on the following text returned from the server storing an HTML-parsed text string for tagging users, how do I use regex here for the name, "Dave Park":
[u=8367|Dave Park]
I tried the following regex, but to no avail:
|(\\w*)]

For some reason you seem to have escaped exactly what you shouldn't escape, and have not escaped several special symbols in regex that do need escaping.
Taking the full pattern, and escaping the correct part and adding the capture group, you should end up with this:
\[u=\d+\|([^\]]+)\]
This matches a literal [ bracket, the u= string followed by multiple numbers, and then the literal |, then the group containing any characters that are not a closing ] bracket, and finally, the literal closing ] bracket.
I'm sort of wondering why you're not also capturing the obvious ID in the first part, but, well, you can do that simply by putting round brackets around the \d+ in my posted pattern.
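As a rough sketch of how the pattern above could be used, here it is in Python with the ID captured as well. Python, the name `parse_user_tag`, and the sample sentence are assumptions for illustration; the question does not say which language is in use.

```python
import re

# The answer's pattern, with round brackets added around \d+ to capture the ID too.
TAG_PATTERN = re.compile(r"\[u=(\d+)\|([^\]]+)\]")


def parse_user_tag(text):
    """Return (user_id, name) for the first tag in text, or None if there is none."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_user_tag("Hello [u=8367|Dave Park]!"))
```

Both values come back as strings, so the ID would still need converting with `int()` if a number is wanted.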

You were very close. You needed to escape the | character and include spaces as a legal character in your capture group. So something like this:
\|([\w ]*)]

Related

Regular Expression for email formatting without hyphen at first and last position

I have created the regular expression which will take the email address as in following format:
abc#xyz.com.in
Regular Expression
/^(?!-)[\w-\.]+#([\w-]+\.)+[\w-]{2,4}/
I am trying to accept only email addresses that do not have a hyphen at the start or at the end.
Invalid Format
-abc#xyz.com
abc#xyz.com-
valid format
abc#xyz.com
abc#xyz.com.in
Your regex can be edited in a simple way (see a demo at Regex101):
/^[\w\.]+[\w\.\-]*#[\w\.]+\.[\w\.]{2,4}$/
^: This is the beginning of the line
[\w\.]+: This is the first part of the email before # can have only word characters (\w) or dot (\.) at least once.
[\w\.\-]*: After that, the same characters from the list before can occur including the dash (\-) and as many times as you want. Remember, the dash has to be escaped if used in the list between [ and ], otherwise it represents a range instead of the dash itself.
#: This matches itself.
[\w\.]+: After the # character, there must be at least one character from the list.
\.: Then followed by the dot literally.
[\w\.]{2,4}: Finally the last 2-4 characters.
$: And the end of a line.
The difference between this and your Regex is just a little:
/^[\w\.]+[\w\.\-]*#[\w\.]+\.[\w\.]{2,4}$/
/^(?!-)[\w-\.]+#([\w-]+\.)+[\w-]{2,4}/
I avoided the negative look-ahead and instead specified (whitelisted) the characters that can occur at each position; I generally try to avoid blacklisting unless it is really needed. The rest of the regex is quite similar, except that you should escape the dash - character between the list brackets [ and ].
Finally, I omitted the capturing groups ( and ) and leave it up to you to place them wherever you need.
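To check the edited regex against the four samples from the question, a small Python sketch could look like this. Python and the names `EMAIL_PATTERN` and `is_valid` are assumptions for illustration, and the samples keep the # separator exactly as it appears in the question.

```python
import re

# The edited pattern from the answer above, without the surrounding / delimiters.
EMAIL_PATTERN = re.compile(r"^[\w\.]+[\w\.\-]*#[\w\.]+\.[\w\.]{2,4}$")


def is_valid(address):
    """True if the address matches the edited pattern."""
    return EMAIL_PATTERN.match(address) is not None


samples = ["abc#xyz.com", "abc#xyz.com.in", "-abc#xyz.com", "abc#xyz.com-"]
results = {sample: is_valid(sample) for sample in samples}

for sample, ok in results.items():
    print(sample, "valid" if ok else "invalid")
```

The two addresses with a leading or trailing hyphen are rejected because neither the first nor the last character class contains the dash.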
Add \w to each end of your regex, and include the end anchor $:
^\w[\w.-]+#([\w-]+\.)+[\w-]{2,4}\w$
Note also the dot doesn't need escaping within a character class.
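Running the same four samples through this pattern in Python gives a quick sanity check. This is only a sketch; `ANCHORED_PATTERN` and `is_valid_anchored` are made-up names. As far as I can tell, the trailing `[\w-]{2,4}\w` needs a final label of at least three characters, so `abc#xyz.com.in` is rejected by the pattern as written.

```python
import re

# The pattern from the answer above, unchanged.
ANCHORED_PATTERN = re.compile(r"^\w[\w.-]+#([\w-]+\.)+[\w-]{2,4}\w$")


def is_valid_anchored(address):
    """True if the address matches the pattern with \\w at both ends."""
    return ANCHORED_PATTERN.match(address) is not None


samples = ["abc#xyz.com", "abc#xyz.com.in", "-abc#xyz.com", "abc#xyz.com-"]
anchored_results = {sample: is_valid_anchored(sample) for sample in samples}

for sample, ok in anchored_results.items():
    print(sample, "valid" if ok else "invalid")
```

If two-letter endings such as `.in` must be accepted, loosening the final quantifier (for example to `{1,3}`) is one possible adjustment, though that goes beyond what the answer proposes.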
a complete email RegEx
/^(([^<>()[\]\\.,;:\s#"]+(\.[^<>()[\]\\.,;:\s#"]+)*)|(".+"))#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/

regex code: does or does not contain a character

I can't figure this out. I want to capture the string inside the square brackets, with or without characters in it.
[5123512], [412351, 1235123, 5125123], [12312-AA] and []
I want to convert the square brackets into double quotes:
[5123512] ==> "5123512"
[412351, 1235123, 5125123] ==> "412351, 1235123, 5125123"
[12312-AA] ==> "12312-AA"
[] ==> ""
I tried \[\d+\] and it is not working.
This is my sample data; it is in JSON format.
Square brackets inside the description must not change, only those in the attributes.
{"results":
[{"listing": 4613456431,"sku": [5123512],"category":[412351, 1235123,
5125123],"subcategory": "12312-AA", "description":"This is [123sample]"}
{"listing": 121251,"sku":[],"category": [412351],"subcategory": "12312-AA",
"description": "product sample"}]}
TIA
Your regex doesn't work for three reasons:
[ is a meta-character that opens a character class. To match a literal [, you need to escape it with a backslash. ] also is a meta-character when it follows the [ meta-character, but if you escape the [ you shouldn't need to escape the ] (not that it hurts to do so).
\d only matches decimal digits, but your sample contains the letter A. If that is a hexadecimal digit, you will probably want to use [\dA-F] instead of \d, or [\dA-Fa-f] if the digits can appear in lowercase. If it can be any letter, you could use [\dA-Z] or [\dA-Za-z], depending on whether you need to match lowercase letters.
+ means "one or more occurrences", so it wouldn't match an empty []. Use the * "0 or more occurrences" quantifier instead.
Additionally, you probably need to capture the sequence of digits in a (capturing group) in order to be able to reference it in your replacement pattern.
However, as Andrew Morton suggests, it looks like you should be able to use a plain text search/replace.
First off, regex is a horrible tool for parsing JSON formatted data. I'm sure you'll find plenty of tools to simply read your JSON in vb.net and mangle it in simpler ways than taking it in as text... For example: How to parse json and read in vb.net
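To illustrate the parse-it-as-JSON route, here is a minimal sketch in Python rather than vb.net (an assumption, purely for illustration). The helper `lists_to_strings` is hypothetical, and the record is a cleaned-up, valid variant of one sample row, since the sample data as posted is missing a comma between its two objects.

```python
import json


def lists_to_strings(record):
    """Replace each list-valued attribute with a comma-separated string."""
    converted = {}
    for key, value in record.items():
        if isinstance(value, list):
            converted[key] = ", ".join(str(item) for item in value)
        else:
            converted[key] = value
    return converted


raw = (
    '{"listing": 121251, "sku": [], "category": [412351, 1235123], '
    '"subcategory": "12312-AA", "description": "This is [123sample]"}'
)
record = lists_to_strings(json.loads(raw))
print(json.dumps(record))
```

Because only real JSON arrays are converted, the square brackets inside the description string are left alone without any special handling.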
Original answer (edited slightly):
You're almost there, but here's a few things you need to change:
in your regex pattern, escape the square brackets: \[ and \]
if you only want to capture all characters in the brackets, then . is a good way to go
the plus sign + means "at least one" — if you want to match empty brackets too, use *? instead
the question mark means "lazy" — it explicitly tells the regex to match the shortest sequence of characters possible (instead of going over to the next square bracket...)
wrap the .*? in parentheses so that you can refer to that part later when substituting
finally, the output value / pattern to substitute with is \1 or $1, depending on the context
or "\1" or "$1" if you really need the double quotes in the output — maybe you just need a string variable?
All in all this becomes:
Find this: \[(.*?)\]
Replace with: \1
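The find/replace pair above, with the double quotes included in the replacement, can be sketched in Python like this (Python and the name `brackets_to_quotes` are assumptions; the question targets a different environment).

```python
import re

# Lazy match of anything between an escaped [ and ], captured as group 1.
BRACKETS = re.compile(r"\[(.*?)\]")


def brackets_to_quotes(text):
    """Replace every [...] with "..." (the captured content in double quotes)."""
    return BRACKETS.sub(r'"\1"', text)


examples = ["[5123512]", "[412351, 1235123, 5125123]", "[12312-AA]", "[]"]
converted = [brackets_to_quotes(example) for example in examples]

for before, after in zip(examples, converted):
    print(before, "==>", after)
```

Applied to the whole JSON text, this would also rewrite the brackets inside the description and the outer results array, which is one more reason to prefer a JSON parser for the full document.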

Trying to match a sequence if not preceded by one group, but yes if preceded by another

This is getting a little meta, but I'm trying to figure out a regex to match regexes for syntax highlighting purposes. There's a nice long backstory, but in the interest of brevity I'll skip it. Here's what I'm trying to do: I need to match a comment (preceded by # and terminated at the end of the line) only if it is not inside a character class ([...]), although it should be matched if there is a complete (closed) character class earlier in the line.
The complicating factor is escaped square brackets — while a plain [ earlier in the line not followed by a closing ] would indicate that we're still in a character class, and therefore illegal, an escaped bracket \[ could be present, with or without the presence of a closing escaped bracket \].
Maybe some examples will help. Here are some instances where a comment should be matched:
(\h{8}-\h{4}-\h{4}-\h{4}-\h{12}) # match UUID (no square brackets at all)
([A-Za-z_][A-Za-z0-9_]*) # valid Python identifier (paired unescaped square brackets)
(\||\[|\?) # match some stuff (escaped opening square bracket)
Here is an example of where an "attempted comment" should not be matched:
[A-Za-z # letters
0-9_-.] # numbers and other characters
(the first line should not be matched, the second one is fine)
I'm by no means a regex master (which is why I'm asking this question!), but I have tried fiddling around with positive and negative lookbehinds, and trying to nest them, but I've had zero luck except with
(?<!\[)((#+).*$)
which matches a comment only if not preceded by an opening square bracket. Once I started nesting the lookarounds, though, and trying to match if the opener was preceded by an escape, I got stumped. Any help would be ... helpful.
It is rather simple, but it works with the cases from your example. So try this:
(?<=[\][)]\s)(#(.*))$
It matches a comment only if it is preceded by a closing bracket and a space.
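A quick way to try this first, simple pattern against the lines from the question is the Python sketch below. The name `find_comment` is made up, and the sample comments are shortened. Python's `re` accepts the lookbehind because it has a fixed width of two characters.

```python
import re

# A comment counts only if the two characters before # are a bracket and whitespace.
COMMENT = re.compile(r"(?<=[\][)]\s)(#(.*))$")


def find_comment(line):
    """Return the comment text (including #) or None if no comment is matched."""
    match = COMMENT.search(line)
    return match.group(1) if match else None


lines = [
    r"(\h{8}-\h{4}-\h{4}-\h{4}-\h{12}) # match UUID",
    r"([A-Za-z_][A-Za-z0-9_]*) # valid Python identifier",
    r"(\||\[|\?) # match some stuff",
    r"[A-Za-z # letters",
    r"0-9_-.] # numbers and other characters",
]
found = [find_comment(line) for line in lines]

for line, comment in zip(lines, found):
    print(repr(line), "->", comment)
```

Note that the character class in the lookbehind also contains an opening [, so a line such as `[ # text` would still be matched; this is one of the gaps the later patterns try to close.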
EDIT
As I thought, your case is much more complicated, so maybe try this one:
^(?=(?:[-\w\d?*.+|{}\\\/\s<>\]]|(?:\\[\[\]()]))+(#+.*)$)|^(?=^[\[(].+?[\])]\s*(#+.*)$)
It will match only by groups (it does not match any text at all, as it uses only positive lookaheads, but grouping in lookarounds is allowed). Or, if you want to match directly, match more text and then get what you want from the groups with something like:
^(?:(?:[-\w\d?*.+|{}\\\/\s<>\]])|(?:\\[\[\]()])|^[\[(].+?[\])])+\s*(#+.*)$
However, in both cases you would probably need to add more of the characters that occur in regular expressions to the first alternative (?:[-\w\d?*.+|{}\\\/\s<>\]]). For example, if you also want it to match the comment in (\[ # works if escaped [ is in group, you need to add ( to the alternative. But I am not sure whether that is what you wanted.
EDIT "invalid scope"
Try with:
^(?:(?:[-\w\d?*.+|{}\\\/\s<>\]\(])|(?:\\[\[\]()])|^[\[(].+?[\])])+\s*(?<valid>(?:#+).*)$|^[-\[\w\d?*.+|{}\\\/\s<>\(]+(?<invalid>(?:#+).*)$
Think you mean this,
^\[[^\]]*\].*#.*$|#(.*)$
OR
^\[[^\]]*\].*#.*$(*SKIP)(*F)|#.*$

Non-brute force regex to remove commas from numbers in CSV list

The main thing I am trying to do here is learn regex so that I have a better understanding of it. What I am trying to do is a find and replace using regex to remove only the commas that are within the numbers.
I can do this using multiple find/replace patterns, and I can also do this using a brute force method of matching a large number and ignoring commas, however I am wondering if there is some way to place the numbers and comma into a capture group but ignore the commas from output.
Here is an example of a list of numbers:
"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"
So my 'brute-force' method is the following:
\"([0-9]+)[,]?([0-9]*)[,]?([0-9]*)[,]?([0-9]*[.]+[0-9]+)\"
This would account for any number up to 999,999,999,999.00. It contains the four capture groups $1$2$3$4 and will output any number I would expect in the format that I want.
Example of wanted output using a replace of $1$2$3$4:
7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01
What I would like to do is something like this (pseudo code):
[\"]([0-9]+)([(?:,)[0-9]*][.]+[0-9]+)[\"]
The idea behind this is:
Match the first quotation mark but ignore it
Match a group of numbers and place in capture group $1
Match either a number or comma followed by a period and one or more numbers and store in a capture group, but leave the commas out of the capture group.
Match the last quotation mark but ignore it
I've been reading and reading but can't seem to find a way to ignore part of a capture group the way I want to do it. Any suggestions or can it not be done?
A two step method would be to match the commas first then remove the quotes, which might work too:
(,)(?=([0-9]{2,3}[.,]))
Well, regexr uses ECMAScript regex, so you might use something like
"|([0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with $1.
Otherwise, with PCRE, you might use something like:
"|(?<=[0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with nothing, where it makes use of lookarounds to make sure that the comma in question is surrounded by [0-9] (ECMAScript did not support lookbehinds when this was written; they were added in ES2018).
" matches a literal quote character.
| means OR, so the regex matches a " or a ([0-9]),(?=[0-9]) (or (?<=[0-9]),(?=[0-9]))
([0-9]) is a capture group to get one digit.
, matches a literal comma.
(?=[0-9]) is a positive lookahead and ensures that the comma is followed by a digit, without matching the digit itself.
(?<=[0-9]) is a positive lookbehind and ensures that the comma is preceded by a digit, again without matching the digit itself.
(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$) ensures that there are an odd number of quotes ahead, and this in turn means that this will match a comma only within quotes, assuming that there are no unbalanced or escaped quotes.
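As a sketch of the single-pass version, the lookbehind pattern explained above also runs under Python's `re` module, since the lookbehind is fixed-width. Python and the names `ONE_PASS` and `cleaned` are assumptions; the answer itself targets PCRE.

```python
import re

# Match a quote, OR a comma that sits between two digits and is followed by an
# odd number of quotes (which means it is inside a quoted field).
ONE_PASS = re.compile(
    r'"|(?<=[0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)'
)

s = (
    '"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00",'
    '"0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'
)

# Replacing every match with nothing drops the quotes and the in-number commas.
cleaned = ONE_PASS.sub("", s)
print(cleaned)
```

The unquoted `1,1,1` run keeps its commas because an even number of quotes follows each of them, so the final lookahead fails there.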
In two steps:
First remove all commas within quotes (i.e. commas that are followed by an odd number of quotes). This even works with escaped quotes, since in CSV files quotes are escaped by doubling:
>>> import re
>>> s = '"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'
>>> s = re.sub(r',(?!(?:[^"]*"[^"]*")*[^"]*$)', '', s)
>>> s
'"7033.00","0.00","7033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123123123.00","12344438.01"'
Then remove all the quotes:
>>> s.replace('"', '')
'7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01'

RegEx with strange behaviour: matching String with back reference to allow escaping and single and double quotes

Matching a string that allows escaping is not that difficult.
Look here: http://ad.hominem.org/log/2005/05/quoted_strings.php.
For the sake of simplicity I chose the approach, where a string is divided into two "atoms": either a character that is "not a quote or backslash" or a backslash followed by any character.
"(([^"\\]|\\.)*)"
The obvious improvement now is to allow different quotes and use a backreference.
(["'])((\\.|[^\1\\])*?)\1
Also multiple backslashes are interpreted correctly.
Now to the part, where it gets weird: I have to parse some variables like this (note the missing backslash in the first variable value):
test = 'foo'bar'
var = 'lol'
int = 7
So I wrote quite an expression. I found out that the following part of it does not work as expected (only difference to the above expression is the appended "([\r\n]+)"):
(["'])((\\.|[^\1\\])*?)\1([\r\n]+)
Despite the missing backslash, 'foo'bar' is matched. I used RegExr by gskinner for this (online tool) but PHP (PCRE) has the same behaviour.
To fix this, you can hardcode the quote by replacing the backreferences with '. Then it works as expected.
Does this mean the backreference does actually not work in this case? And what does this have to do with the linebreak characters, it worked without it?
You can't use a backreference inside a character class; \1 will be interpreted as octal 1 in this case (at least in some regex engines, I don't know if this is universally true).
So instead try the following:
(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)
or, as a verbose regex:
(["']) # match a quote
(?: # either match...
\\. # an escaped character
| # or
(?!\1). # any character except the previously matched quote
)* # any number of times
\1 # then match the previously matched quote again
(?:[\r\n]+) # plus one or more linebreak characters.
Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.
Your regex insists on finding at least one carriage return after the matched string - why? What if it's the last line of your file? Or if there is a comment or whitespace after the string? You probably should drop that part completely.
Also note that you don't have to make the * lazy for this to work - the regex can't cross an unescaped quote character - and that you don't have to check for backslashes in the second part of the alternation since all backslashes have already been scooped up by the first part of the alternation (?:\\.|(?!\1).). That's why this part has to be first.
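The difference between the two approaches can be reproduced in Python, whose `re` module also treats `\1` inside a character class as a numeric character escape rather than a backreference. This is only a sketch: the names `FIXED`, `BUGGY`, and `first_string` are made up, and the trailing linebreak group is dropped from the fixed pattern, as suggested above.

```python
import re

# Fixed version: the negative lookahead keeps . from consuming the opening quote.
FIXED = re.compile(r'''(["'])(?:\\.|(?!\1).)*\1''')

# Original version: inside [...] the \1 is a character code, not the quote.
BUGGY = re.compile(r'''(["'])((\\.|[^\1\\])*?)\1([\r\n]+)''')


def first_string(text):
    """Return the first quoted string found by the fixed pattern, or None."""
    match = FIXED.search(text)
    return match.group(0) if match else None


escaped = first_string(r"x = 'foo\'bar'")       # escaped quote stays inside
mixed = first_string('say "it\'s" now')         # other quote type is allowed inside
unescaped = first_string("test = 'foo'bar'")    # stops at the unescaped quote
buggy_match = BUGGY.search("test = 'foo'bar'\n").group(0)

print(escaped, mixed, unescaped, repr(buggy_match))
```

With the original pattern, the required linebreak forces the lazy quantifier to keep expanding, and because the character class does not actually exclude the quote, it runs straight over the unescaped `'` until it reaches one that is followed by a newline.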