Regular Expression for poorly defined key value pairs - regex

I am using regular expressions to parse text files that look like the following:
<diagnostics> data=filenames/sometimes with/spaces\filename with or without spaces.dat start=0 end=90 overload=2 offset=871
<region> data=another file.filetype <diagnostics> replay=true
I would like to find all data names by scanning individual lines. If there were no spaces in the folder or filenames I could match against data= and then scan until a space with pattern:
data=([^ \n]*)
I might scan until a .xxxx filename is found, but in theory periods can be part of the folder or partial filenames. The actual pattern is to scan until data= is found and then keep going until end of line or until either one of the following: <, unknownTagNoSpaces=.
<stuff> data=(folder one/folder\value I want.whatever) (unknownTagNoSpaces)=
<stuff> replay=false data=(value I want followed by newline.xxx)
data=(folder/value I want.hhhh) <something>
So the regular expression might be to stop:
data=[^/\n|=|</]*
This almost works, except in the case of the equals sign: when the match runs into a key's =, I also have to drop the word before it (which contains no spaces) and the space before that word, so that for data=value.docx otherkey=something the match does not include otherkey.
Is this possible with regular expressions? I think the answer might be no.

I hope I understood what you want, so here is my try:
data=((?:(?> *[^ \n<=]+)(?!=))*)
It uses atomic groups; I hope your regex engine supports them.
Explanation:
data=((?:(?> *[^ \n<=]+)(?!=))*) whole regex
data=( ) match 'data=' and the stuff behind it as first capture group
(?: )* repeat as long as the contained stuff is valid
(?> ) atomic group: treat as one part, do not break apart, "tokenize"
' *' match all spaces here (has some nice effect explained later)
[^ \n<=]+ match (at least one) symbol that is not a space, newline, '<' or '='
(?!=) ensure there is no equal sign
The atomic group captures preceding whitespace and all valid symbols, thus stopping at spaces.
Since spaces are consumed at the start of each group, there will be no trailing whitespace. Leading whitespace, however, must be matched (but can be excluded from the capture group) because the 'data=' prefix is also part of the match.
The atomic group magic happens when the '=' is encountered. It is not allowed inside the atomic group, and if it is found directly after the group, the entire group is discarded.
In this case the group consists of the attribute's name and the spaces before it.
Example on regex101

I thought about a solution without atomic groups:
data=((?: *(?![^ ]+=)[^< ]+)*)
Explanation:
data=((?: *(?![^ ]+=)[^< ]+)*) whole regex
data=( ) match 'data=' and the stuff behind it as first capture group
(?: )* repeat as long as the contained stuff is valid
' *' match all spaces here
(?![^ ]+=) check that no "attribute" (non-space characters followed by '=') comes next
[^< ]+ match all the valid symbols
This regex basically checks, before each chunk of text, that the chunk is not an attribute name followed by '=', and only then matches it.
Example on regex101
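As a quick check of the variant without atomic groups, here is a minimal Python sketch that applies it line by line to the sample input from the question. The helper name data_value is made up for illustration, and the third sample line is the data=value.docx case mentioned in the question.

```python
import re

# Pattern from the answer above (the variant without atomic groups).
DATA_RE = re.compile(r"data=((?: *(?![^ ]+=)[^< ]+)*)")

def data_value(line):
    """Return the value after 'data=' on a single line, or None if absent."""
    m = DATA_RE.search(line)
    return m.group(1) if m else None

lines = [
    r"<diagnostics> data=filenames/sometimes with/spaces\filename with or without spaces.dat start=0 end=90 overload=2 offset=871",
    "<region> data=another file.filetype <diagnostics> replay=true",
    "data=value.docx otherkey=something",
]

values = [data_value(line) for line in lines]
for v in values:
    print(repr(v))
```

The lines are fed in one at a time because [^< ] also matches a newline, so running the pattern over a whole multi-line string could let a match run past the end of a line.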

How can I use Neovim regex to make a separated selection (of two characters not following one another)?

I'm attempting to replace all strings in a text file that are surrounded by single quotes and contain a # followed first by a digit and then by any mix of other letters or #s, like these:
'package#1.1.k'
'otherpackage#14'
'anotherpackage#7.8'
and I wish to select only the single quotes and remove them like so:
export MYVAR="/dir/'package#1.1.k':$MYVAR"
to:
export MYVAR="/dir/package#1.1.k:$MYVAR"
in the entire file. I figured out a way to remove the trailing single quote and the leading single quote in two separate commands (using Vim's \zs and \ze, which work similarly to lookarounds):
:%s/'.*#.[0-9]*.*\zs'//g
:%s/\zs'\ze.*#.[0-9]*.*//g
However, I am curious about doing it in a single operation as I want to learn more about using Neovim, and apply the answer to future operations.
I am actually using Neovim, if that comes with any additional features related to this.
You need to use
%s/'\([^#'0-9]*#[0-9][^']*\)'/\1/g
Details:
' - a single quote
\( - start of a capturing group
[^#'0-9]* - zero or more chars other than #, ' and digits
# - a # char
[0-9] - a digit
[^']* - zero or more chars other than a ' char
\) - end of the capturing group
' - a single quote.
The replacement is \1, the Group 1 backreference/placeholder.
You can replace matches of the following regular expression with the content of capture group 1.
'([^#']*#\d[^#']*)'
Demo
I'm not familiar with Vim's regex engine, but I understand it is quite robust. Considering that the regular expression I have suggested is nothing fancy (just making use of a capture group) I'm confident it should work in Vim.
The regular expression can be broken down as follows.
' # match a single quote
( # begin capture group 1
[^#']* # match zero or more characters other than '#' and a single-quote
#\d # match '#' followed by a digit
[^#']* # match zero or more characters other than '#' and a single-quote
) # end capture group 1
' # match a single quote
I've assumed that there can be at most one '#' between the single quotes, but if more than one is permitted (if, for example, 'a#11#22*' is to be converted to a#11#22*) change the regular expression to
'([^']*#\d[^']*)'
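To try the capture-group approach outside the editor, here is a small Python sketch using the last pattern above (the one that allows several '#'). It uses Python's re module rather than Vim's engine, so it only illustrates the idea; the function name strip_quotes is hypothetical.

```python
import re

# Quoted string containing '#' followed by a digit; group 1 is the text
# between the quotes.
QUOTED_PKG = re.compile(r"'([^']*#\d[^']*)'")

def strip_quotes(text):
    """Replace each match with the content of capture group 1."""
    return QUOTED_PKG.sub(r"\1", text)

line = "export MYVAR=\"/dir/'package#1.1.k':$MYVAR\""
result = strip_quotes(line)
print(result)
```

Quoted strings that contain no '#' followed by a digit are left untouched, because the pattern cannot match them.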

Regex Pattern to Match except when the clause enclosed by the tilde (~) on both sides

I want to extract every occurrence of the clause match-this in the string, except where it is enclosed by tildes (~) on both sides.
For example, in this string:
match-this~match-this~ match-this ~match-this#match-this~match-this~match-this
There should be 5 matches from above. The matches are explained below (enclosed by []):
Either match-this~ or match-this is correct for first match.
match-this is correct for 2nd match.
Either ~match-this# or ~match-this is correct for 3rd match.
Either #match-this~ or #match-this or match-this~ is correct for 4th match.
Either ~match-this or match-this is correct for 5th match.
I can use the pattern ~match-this~ to catch these ~match-this~, but when I tried the negation of it, (?!(~match-this)), it literally catches all nulls.
When I tried the pattern [^~]match-this[^~], it catches only one match (the 2nd match from above). And when I tried to add an asterisk wildcard to either negation of the tilde, either [^~]match-this[^~]* or [^~]*match-this[^~], I got only 2 matches. When I put the asterisk wildcard on both, it catches all match-this, including those which are enclosed by tildes ~.
Is it possible to achieve this with only one regex test? Or does it need more?
If you also want to match #match-this~ as a separate match, you would have to account for # while matching, as [^~] also matches #
You could match what you don't want, and capture in a group what you want to keep.
~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)
Explanation
~[^~#]*~ Match any char except ~ or # between ~
| Or
( Capture group 1
(?:(?!match-this).)* Match any char if not directly followed by match-this
match-this Match literally
(?:(?!match-this)[^#~])* Match any char except ~ or # if not directly followed by match-this
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"~[^~#]*~|((?:(?!match-this).)*match-this(?:(?!match-this)[^#~])*)"
s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"
res = [m for m in re.findall(pattern, s) if m]
print(res)
Output
['match-this', ' match-this ', '~match-this', '#match-this', 'match-this']
If all five matches can be "match-this" (contradicting the requirement for the 3rd match) you can match the regular expression
~match-this~|(\bmatch-this\b)
and keep only matches that are captured (to capture group 1). The idea is to discard matches that are not captured and keep matches that are captured. When the regex engine matches "~match-this~" its internal string pointer is moved just past the closing "~", thereby skipping an unwanted substring.
Demo
The regular expression can be broken down as follows.
~match-this~ # match literal
| # or
( # begin capture group 1
\b # match a word boundary
match-this # match literal
\b # match a word boundary
) # end capture group 1
Being so simple, this regular expression would be supported by most regex engines.
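The "discard what is not captured" idea can be sketched in Python as follows, using the string from the question. Only matches in which group 1 took part are kept; their start offsets are collected here so they can be told apart.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

# First alternative swallows the unwanted ~match-this~; the second captures.
pattern = re.compile(r"~match-this~|(\bmatch-this\b)")

# Keep only the matches where capture group 1 participated.
kept = [m.start(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)  # [0, 23, 35, 46, 68]
```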
For this you need both kinds of lookarounds. This will match the 5 spots you want, and there's a reason why it only works this way and not another and why the prefix and/or suffix can't be included:
(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)
Explaining lookarounds:
(?=...) is a positive lookahead: what comes next must match
(?!...) is a negative lookahead: what comes next must not match
(?<=...) is a positive lookbehind: what comes before must match
(?<!...) is a negative lookbehind: what comes before must not match
Why other ways won't work:
[^~] is a class with negation, but it always needs one character to be there and also consumes that character for the match itself. The former is a problem for a starting text. The latter is a problem for having advanced too far, so a "don't match" character is gone already.
(^|[^~]) would solve the first problem: either the text starts here or the preceding character must not be a tilde. We could do the same for ending texts, but this is a dead end anyway.
Only lookarounds remain, and even then we have to code all 3 variants, hence the two |.
As per the nature of lookarounds the character in front or behind cannot be captured. Additionally if you want to also match either a leading or a trailing character then this collides with recognizing the next potential match.
It's the difference between telling the engine to "not match" a character and telling the engine to "look out" for something without actually consuming characters and advancing the current position in the text. Also, not every regex engine supports all lookarounds, so it matters where you actually want to use it. For me it works fine in TextPad 8 and should also work fine in PCRE (e.g. in PHP). As per regex101.com/r/CjcaWQ/1 it also works as I expect.
What irritates me: if the leading and/or trailing character of a found match is important to you, then just extract it from the input when processing all the matches, since they also come with starting positions and lengths: first match at position 0 for 10 characters means you look at input text position -1 and 10.
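A short Python sketch of both points: the three-way lookaround pattern finds the five spots, and the neighbouring characters are then read from the input using each match's start and end positions. The variable names are only for illustration.

```python
import re

s = "match-this~match-this~ match-this ~match-this#match-this~match-this~match-this"

pattern = re.compile(
    r"(?<=~)match-this(?!~)|(?<!~)match-this(?=~)|(?<!~)match-this(?!~)"
)

found = []
for m in pattern.finditer(s):
    # Look one character to the left and right of the match, if present.
    before = s[m.start() - 1] if m.start() > 0 else ""
    after = s[m.end()] if m.end() < len(s) else ""
    found.append((m.start(), before, after))

for item in found:
    print(item)
```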

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next additions I've struggled with though! trying to add additional features to the regex to not match on the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, it would make that :
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows :
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows :
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it : no need to repeat that we match the start of the string in its clauses
Each alternative matches the end of the string. We can put the alternatives in a group, which will be followed by the end-of-string anchor
We have created a new group, which we have to take into account in our back-reference : to reference the same group, it now must address \2 rather than \1. An alternative in certain regex flavours would have been to use a non-capturing group (?:...)
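For a quick sanity check, here is a minimal Python sketch of the simplified pattern with all six titles from the question listed. The /i flag becomes re.IGNORECASE, and the helper name is_valid_name is made up for this example.

```python
import re

# Lookahead rejects bare titles and strings made of one repeated character;
# group 2 is the (.) inside the lookahead, hence the \2 back-reference.
NAME_RE = re.compile(
    r"^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$",
    re.IGNORECASE,
)

def is_valid_name(name):
    return NAME_RE.match(name) is not None

for candidate in ["O'Brian", "Anne-Marie", "Mrs", "aaa", "bbbbb", "a"]:
    print(candidate, is_valid_name(candidate))
```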

Regular expression to find number in parentheses, but only at beginning of string

Disclaimer: I'm new to writing regular expressions, so the only problem may be my lack of experience.
I'm trying to write a regular expression that will find numbers inside of parentheses, and I want both the numbers and the parentheses to be included in the selection. However, I only want it to match if it's at the beginning of a string. So in the text below, I would want it to get (10), but not (2) or (Figure 50).
(10) Joystick Switch - Contains control switches (Figure 50)
Two (2) heavy lifting straps
So far, I have (\(\d+\)) which gets (10) but also (2). I know ^ is supposed to match the beginning of a string (or line), but I haven't been able to get it to work. I've looked at a lot of similar questions, both here and on other sites, but have only found parts of solutions (finding things inside of parentheses, finding just numbers at the beginning of a string, etc.) and haven't quite been able to put them together to work.
I'm using this to create a filter in a CAT tool (for those of you in translation) which means that there's no other coding languages involved; essentially, I've been using RegExr to test all of the other expressions I've written, and that's worked fine.
The regex should be
^\(\d+\)
^ Anchors the regex at the start of the string.
\( Matches (. It has to be escaped because ( has a special meaning in regex
\d+ Matches one or more digits
\) Matches the )
Capturing brackets like (\(\d+\)) are not necessary, as the pattern matches no other characters. They are needed only when you want to extract part of a matched pattern.
For example, if you want to match (50) but extract only the digits, 50, you can use
\((\d+)\)
Here the \d+ part is inside capture group 1. That is, capture group 1 will be 50, whereas the entire matched string is (50).
Regex Demo
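The question is about a CAT tool filter, but the behaviour of both patterns can be checked with a few lines of Python. This is only an illustrative sketch using the two sample lines from the question.

```python
import re

LEADING_NUMBER = re.compile(r"^\(\d+\)")

lines = [
    "(10) Joystick Switch - Contains control switches (Figure 50)",
    "Two (2) heavy lifting straps",
]

# Only a parenthesized number at the very start of the line matches.
results = []
for line in lines:
    m = LEADING_NUMBER.search(line)
    results.append(m.group(0) if m else None)
print(results)  # ['(10)', None]

# With an inner capture group, group 1 holds just the digits.
digits = re.search(r"\((\d+)\)", "(50)").group(1)
print(digits)  # 50
```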
Like so:
^\(\d+\)
^ anchor
( and ) are regex metacharacters, so they need to be escaped with \
So \( and \) match literal parentheses.
Unescaped ( and ) create a capture group.
\d+ match 1 or more digits
Demo

What does /([^.]*)\.(.*)/ mean?

While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their text.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder: what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheses group part of the pattern together (and capture what it matches). The square brackets denote a "character group" (also called a character class), meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverses its meaning. Since the only character beside the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as it needs to.
In summary, the expression finds any sequence of characters that are not periods, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regexes on regexpal
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
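The examples above can be reproduced with a short Python sketch. Python's re module is used here instead of JavaScript, and the helper name split_first_dot is hypothetical.

```python
import re

SPLIT_RE = re.compile(r"([^.]*)\.(.*)")

def split_first_dot(text):
    """Return (before_first_dot, after_first_dot), or None if there is no dot."""
    m = SPLIT_RE.search(text)
    return m.groups() if m else None

for sample in ["foo...bar", ".foobar", ".", "file.ext", "nodot"]:
    print(repr(sample), split_first_dot(sample))
```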
IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
the . character matches any character except line break characters (\r or \n).
the ^ at the start of a character class negates the class; inside the class, the dot is a literal period
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any character other than a period, repeated zero or more times.
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more non-period characters followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.