Regex to parse a config file, where the # sign represents a comment

With the strings
Test=Hello World #Some more text
Test=Hello World
I need both to capture the "Test" group and the "Hello World" group. If the string starts with a "#" it should not be captured at all.
The below expressions work for the first and second strings, respectively:
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])
^((?!#).+)(?:=)(.+[\S])
How do I express a logical OR (alternation) between two non-capturing regex groups?
I tried doing something like
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])|(?:.*)
but can't get it to work out correctly.
More Details
Background: This is being done in C# (.NET Framework 4.0). A file is being read line by line. The text to the left of the equals sign is a variable name and the text to the right of the equals sign is the variable's value. This file is being used as a config file.
General cases:
Note: Trailing whitespace (any whitespace after the last non-whitespace character) should not be captured. This includes any space between the end of the second group and the pound sign.
1) All characters, except for whitespace, followed immediately by an equals sign, followed immediately by any set of characters followed by a space and a pound sign. e.g.
this=is valid #text
s0_is=this #text
and=th.is #text
the=characters after the # Pound sign are irrelevant
2) The exact same situation as case 1 except that there is no trailing space between the second capture group and the pound sign. e.g.
this=is valid#text
s0_is=this#text
and=th.is#text
the=characters after the# Pound sign are irrelevant
3) The same situation as in cases one and two; however, where there is no # sign at all (see the above note about trailing whitespace). e.g.
this=is valid
s0_is=this
and=th.is
the=characters after the
For all three of these cases the capture groups should be as shown below, respectively (the | symbol is used to distinguish between capture groups):
this|is valid
s0_is|this
and|th.is
the|characters after the
Special cases:
1) The first character of the line is a # sign. This should result in nothing being captured.
2) The # sign occurs immediately after the = sign. This should result in the second capture group being null.
3) The # sign occurs anywhere else not otherwise explicitly stated above. This should result in nothing being captured.
4) There should be no whitespace preceding the first character of the new line; however, this case is unlikely to actually occur.
5) A space immediately after the equals sign is invalid.
Invalid cases (where nothing should be captured):
th is=is not valid#text
nor =this#text
or_this=something
also= this

I suspect you're making this more difficult than it needs to be. Try this regex:
^(\w+)=([^\s#]+(?:[ \t]+[^\s#]+)*)
I used [ \t]+ instead of \s+ to prevent it from matching the newline and spilling over onto the next line--assuming the input really is multiline, of course. You can still apply it to standalone strings if that's what you prefer.
EDIT: In answer to your comment, try this regex:
^(\w+)=(\w+(?:[ \t]+\w+)*)
With the first regex I was trying to avoid making confining assumptions and I got a little carried away. If you can use \w+ for all words it becomes much easier, as you can see.
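For illustration, here is a minimal sketch of the \w+-based regex applied line by line. It is written in Python rather than C# purely for brevity; the function name parse_line and the sample lines are assumptions made for the demo, while the pattern itself is unchanged.

```python
import re

# The \w+-based pattern from the answer above, unchanged.
CONFIG_LINE = re.compile(r'^(\w+)=(\w+(?:[ \t]+\w+)*)')

def parse_line(line):
    """Return (name, value) for a config line, or None if it does not match."""
    m = CONFIG_LINE.match(line)
    return (m.group(1), m.group(2)) if m else None

for line in ["Test=Hello World #Some more text", "Test=Hello World",
             "#Test=Hello World", "also= this"]:
    print(repr(line), "->", parse_line(line))
```

Because \w does not match '.', a value such as th.is is captured only up to the dot with this pattern.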

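The alternation operator | has the lowest precedence, so
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])|(?:.*)
means: match
^((?!#).+)(?:=)(.+[\S])(?:[\s]*[#])
OR
(?:.*)
To apply the alternation to the trailing part only, wrap both alternatives in a group:
^((?!#).+)(?:=)(.+[\S])(?:(?:[\s]*[#])|(?:.*))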
That said, (?:.*) seems kind of pointless there. Why don't you make the trailing comment optional instead:
^((?!#).+)(?:=)(.+?\S)\s*(?:#.*)?$
That will optionally match the comment, which is what I think you're trying to do, and it would be the better option in this case. The $ anchor matters: without it, the lazy .+? would stop after the first two characters of the value.
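A quick way to check the optional-comment idea is a sketch like the one below (Python here for brevity, although the question is about C#). The name split_line is hypothetical, and the pattern assumes the whole line should be consumed, so it is anchored with $ and lets \s* absorb any whitespace before the comment.

```python
import re

# Optional trailing comment; $ forces the lazy value group to run up to
# the comment (or the end of the line) instead of stopping early.
LINE = re.compile(r'^((?!#).+)(?:=)(.+?\S)\s*(?:#.*)?$')

def split_line(line):
    """Return (name, value), or None for comment lines and non-matches."""
    m = LINE.match(line)
    return m.groups() if m else None

print(split_line("Test=Hello World #Some more text"))
print(split_line("Test=Hello World"))
print(split_line("#Test=Hello World"))
```

This sketch does not enforce the stricter rules from the question (for example, rejecting a space right after the equals sign); it only shows how the optional group and the anchor interact.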

Related

How do I match a colon with no values after it?

I'm new to the website and to regular expressions as well.
I want to bookmark the emails in a list that have no value after the colon ":".
Here is an example:
abcdef#gmail.com:123456
abcdEF452#gmail.com:test123##NEW
abcdef#gmail.com:
abcdef#gmail.com:
I only want to bookmark the last two ones so it would be like this:
abcdef#gmail.com:
abcdef#gmail.com:
The following regex will match the "pre-colon" pattern if and only if it is followed by nothing but whitespace until the end of the line:
\w+#\w+\.\w+:\s*$
Note that matching email addresses with 100% correctness is more complicated than this, but this will likely do for your use case.
If you only want to find strings that end with a colon, then all you need is :$.
I find this request a bit odd, perhaps if you could elaborate a bit more on your use case I may be able to provide a better approach or solution.
Now, I think that this expression should work the way you expect:
[\w\.]+#[a-z0-9][a-z0-9-]*[a-z0-9]?
Add the colon sign at the end if you need to match for the colon sign as well.
I noticed that the other proposed expressions don't account for email addresses with a dot in the username part or with dashes in the domain part. You may use a combination of all the solutions if you are more familiar with RegEx. I highly recommend you test the expression before moving it to production; you can easily run further tests on this page: https://regexr.com/.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
Will be a more adequate RegEx since the Internet Engineering Task Force established limits on how an email address can be formatted and this accounts for those additional characters. More details on this page https://www.mailboxvalidator.com/resources/articles/acceptable-email-address-syntax-rfc/.
As a friendly reminder, Stack Overflow can be best used when you have already invested some effort in fixing some problem, rather than having a community member provide you with a straight answer. This and other suggestions are listed on this other page https://stackoverflow.com/tour.
Try this:
[a-zA-Z]+#[a-zA-Z]+: # Only a-zA-Z, numbers are not accepted
Note: the last character is a space " "
[\w+]+#[\w+]+: # \w+ = Matches one or more [A-Za-z0-9_]
Without the trailing space, it matches only those with no character after the colon.
[\w+]+#[\w+]+.*:$ # Matches only when there is also .XXX. For example: .com or .de
Given this:
abcdEF452#gmail.com:test123##NEW
There are three parts to this:
1. Before the #.
2. Between the # and the :
3. After the :
If we assume (1) has to be there and not empty.
If we assume (2) has to be there and not empty.
If we assume (3) the ':' is required but the trailing part can be empty.
I don't want to make assumptions about other requirements.
Then I would use:
[^#]+#[^:]+:.*$
Meaning:
[^#] => Anything apart from the '#' character.
[^#]+ => The above 1 or more times.
[^#]+# => The above followed by '#' character.
[^:] => Anything apart from the ':' character.
[^:]+ => The above 1 or more times.
[^:]+: => The above followed by ':' character.
.* => Any character 0 or more times.
$ end of line.
So if we want to make sure we only find things that don't have anything after the ':' we can simplify a bit.
[^#]+#[^:]+:$
Make sure we have the '#' and ':' parts and they are non-empty. But the colon is followed by the end of line.
If you don't care about part (1) or (2) we can simplify even more.
[^:]+:$
The line must contain a ':'. We don't care what is in front, as long as there is at least one character before the ':' and zero after.
Final simplification.
:$
If you don't care about anything except that the colon is not followed by anything.
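To see the strictest and the loosest of these patterns side by side, here is a small Python sketch. The list holds the sample lines from the question; the variable names are made up for the example, and re.search is used so that the patterns can be applied exactly as written above.

```python
import re

lines = [
    "abcdef#gmail.com:123456",
    "abcdEF452#gmail.com:test123##NEW",
    "abcdef#gmail.com:",
    "abcdef#gmail.com:",
]

# Strictest form from above: non-empty parts (1) and (2), nothing after ':'.
strict = re.compile(r'[^#]+#[^:]+:$')
# Loosest form: only require that the line ends with a colon.
loose = re.compile(r':$')

strict_hits = [s for s in lines if strict.search(s)]
loose_hits = [s for s in lines if loose.search(s)]
print(strict_hits)
print(loose_hits)
```

On this sample both patterns select the same two lines; they differ only on input such as a bare ":", which the loose form accepts and the strict form rejects.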

Dart regex for capturing groups but ignoring certain similar patterns

I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
String f = '~the calculator is on and working~I entered 50 into the calculator' +
    '~~I press add button~~holding equal button ~~~The result should be 50';
List<String> givens = f.split(RegExp(r'~+'));
List<String> whens = f.split(RegExp(r'~~+'));
List<String> thens = f.split(RegExp(r'~~~+'));
for (String ss in givens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in whens) {
  print(ss);
}
print('xxxxxxxxxxxx');
for (String ss in thens) {
  print(ss);
}
Which will result with:
The givens capture group also captured the ones with ~~ and ~~~ which is not intended.
The whens capture group also captured the ones with a single ~, which made it very confusing.
Lastly, the thens capture group also captured the others which is also not intended.
I only need to capture the strings starting with the specific pattern but will stop when they see a different one.
Example: givens should only capture 'the calculator is on and working' and 'I entered 50 into the calculator' only.
Any hints or help is greatly appreciated!
I think the problem is that you started off by splitting the string into pieces. It might be easier to search for the elements with a pattern that looks for some text preceded by either one, two or three ~ chars.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceded by one tilde, you have to make sure it does not match when there are other tildes before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any char which is in the list, so a, b or c. If you add the ^ char at the beginning of the list then it means "not these chars".
[^~]+ means search for one or more characters that are not ~. This will capture the phrases between the tildes.
A positive lookbehind is written (?<=something present). We want to search for a tilde, so we would put (?<=~) as the positive lookbehind. But the problem is that it will also match the phrases with several tildes in front. To avoid that, we can say that the tilde should be preceded either by ^ (meaning the beginning of the string) or by [^~] (meaning not a tilde). To say "either this or that", we use the syntax (this|that|or even that). But using parentheses will capture the content and we don't need that. To disable group capturing we can add ?: at the beginning of the group, leading finally to (?:[^~]|^), meaning either a non-tilde char or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+
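The same idea can be sketched in Python, with one caveat: Python's re module only accepts fixed-width lookbehinds, so the (?<=(?:[^~]|^)~) form above would be rejected there. The sketch below therefore swaps in a different formulation with the same intent: a run of exactly n tildes, guarded by negative lookarounds, followed by a captured phrase. The helper name phrases_after is made up for the example.

```python
import re

def phrases_after(text, n):
    """Return the phrases introduced by a run of exactly n tildes."""
    # (?<!~) and (?!~) ensure the run is not part of a longer run of tildes.
    pattern = r'(?<!~)~{%d}(?!~)([^~]+)' % n
    return re.findall(pattern, text)

f = ('~the calculator is on and working~I entered 50 into the calculator'
     '~~I press add button~~holding equal button ~~~The result should be 50')

print(phrases_after(f, 1))  # givens
print(phrases_after(f, 2))  # whens
print(phrases_after(f, 3))  # thens
```

The phrases are returned exactly as they appear between the markers, so any whitespace next to a tilde is kept; strip it afterwards if needed.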

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours for a regex solution that retrieves both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code found in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
List the delimiters longest first (e.g. "=," before "="); otherwise the alternation, which tries alternatives left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
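As a rough illustration, the lookahead-based pattern can be exercised with a few lines of Python. The names DELIMS, PAIR and parse are invented for this sketch; the delimiter list and the pattern are the ones given above, assembled from a single delimiter string so the two copies cannot drift apart.

```python
import re

# Delimiters listed longest first, as advised above.
DELIMS = r'=/|=>|=,|=|&,1'
# Group 1: delimiter. Group 2: everything up to the next delimiter.
PAIR = re.compile(r'(%s)((?:.(?!%s))+.)' % (DELIMS, DELIMS))

def parse(text):
    """Return a list of (delimiter, value) tuples."""
    return PAIR.findall(text)

print(parse('=/A9999XYZ=>100T0479&,1Blah'))
# [('=/', 'A9999XYZ'), ('=>', '100T0479'), ('&,1', 'Blah')]
```

Note that, as written, the pattern needs each value to be at least two characters long, because the repeated group and the final . each consume one character.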

Regex for specific file structure

I need to parse a file with the following simple structure:
some string 1
some string 2
some string 3
some string x
some string y
some string z
...
The file consists of 2 parts separated by "\n\n" or "\r\n\r\n". In my example, this separator appears after "some string 3". Each part is optional: if the first part is omitted, there will be 1 empty line (\n|\r\n) before the second part (but my regex requires 2). If the second part is omitted, there may be any number of empty lines after the first part (including none at all).
I'm trying to achieve desired result with regex like this:
(?isx: \h* (.+)? \h* (?:(?:\n|\r\n){2,} \h* (.+))? \s*)
But with no success, because the first "(.+)?" is very greedy, and if I make the 2nd part non-optional it violates my requirement that both parts be optional. I know that I can use split /(?:\n|\r\n)/, $str in this case, but this file could have a more complex structure in the future, so I can't use split.
Can someone help me with this?
You actually might want to use a non-greedy group, since you don't want to match your separator.
(?isx: (?:
(.*?) # Non greedy
(?:\r?\n){2,} # also matches \r\n\n but that might not be of concern
|\r?\n) # one empty line.
(.*) # second group
)
I don't know what you wanted to achieve with the \hs. If you want to ensure that there is something in the lines (right now, the . could also match \n or spaces) you could try something like (?:[^\n]+\n)*? for the groups.
Also, for brevity's sake, I avoided the explicit ? you used. There might be a difference in results: if a starred group matches nothing, you get the empty string; if a group does not take part in the match at all, the value of its group variable is undefined. Here is a short example to show the difference:
"aa" =~ /(c)?(d*)aa/
Here $1 is undefined, while $2 is the empty string. This minor difference might yield some annoying warnings or unexpected results if someone tested with defined for the contents of a group.
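Python draws the same distinction, which may be a convenient way to experiment with it. The snippet below is an analogue of the Perl one-liner above rather than Perl itself: an optional group that did not take part in the match yields None, while a starred group that matched zero characters yields the empty string.

```python
import re

m = re.match(r'(c)?(d*)aa', 'aa')
print(m.group(1))        # optional group that did not participate
print(repr(m.group(2)))  # starred group that matched zero characters
```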

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
But my expected output is:
"." "" "ami.12.0" "allo.12"
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
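The "put the match back" technique is not specific to R. Here is a hedged Python sketch of the same idea; the function name strip_numbers is invented for the example, and the dots are escaped on the assumption that they are meant literally.

```python
import re

KEEP_OR_DIGITS = re.compile(r'(ami\.12\.0|allo\.12)|[0-9]+')

def strip_numbers(s):
    # Protected words are captured in group 1 and written back;
    # bare digit runs leave group 1 unset and are replaced by ''.
    return KEEP_OR_DIGITS.sub(lambda m: m.group(1) or '', s)

print([strip_numbers(s) for s in ['0.12', '1245', 'ami.12.0 00', 'allo.12 1']])
# ['.', '', 'ami.12.0 ', 'allo.12 ']
```

The space that separated the protected word from the removed number is left in place, so trim the result if trailing whitespace is unwanted.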
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". For example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see whether that 12 is the start of either of the two ignored strings. It is not, so the 12 gets replaced. It would be best to generalize this, but in your specific case you can probably achieve the result with a negative lookbehind for every prefix of the protected words that can be followed by a digit sequence. A second lookbehind, (?<![[:digit:]]), stops a match from starting in the middle of a protected number (otherwise the 2 of 12 would still be removed). So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)(?<![[:digit:]])[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)