I have a long input string that contains certain field names embedded in it. For instance:
SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'
The actual field name may change, but it is always in the form of word-word. I need to perform a regex replace on the string so that the output will look like this:
SELECT some - name, some - name FROM [some-table] WHERE [some-column] = 'some - value'
In other words, when the field name is enclosed in square brackets, it should be left untouched, but when it is not, spaces should be inserted on either side of the dash. There are no nested square brackets, and the string may contain one or more such field names.
You can do this:
Regex.Replace(input, "(?<!\[[^-\]]*)(\w+)-(\w+)(?![^-\]]*\])", "$1 - $2")
Here's an explanation of the pattern:
(?<!\[[^-\]]*) - This is a negative look-behind. It asserts that matches cannot be immediately preceded by text that matches the sub-pattern \[[^-\]]*. In other words, the matches we are looking for cannot be preceded by a [ character followed by any number of characters that are not a - or a ].
(\w+)-(\w+) - Matches one or more word-characters, then a dash, and then one or more word characters following the dash. By enclosing the sub-patterns on either side of the dash in capturing groups, we can then refer to their values as $1 and $2 in the replacement pattern.
(?![^-\]]*\]) - This is a negative look-ahead. Similar to the negative look-behind, it asserts that matches cannot be immediately followed by text which matches the sub pattern [^-\]]*\]. In other words, a match cannot be followed by any number of characters that are not a - or a ] and then a closing ].
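As a quick check of the pattern outside .NET, here is a minimal sketch in JavaScript. That choice is an assumption made for illustration: current Node.js and browsers also accept variable-length look-behind. The names `bracketAware`, `sampleQuery`, `spaced` and `missed` are made up for the example.

```javascript
// The pattern from the answer, written as a JavaScript regex literal.
const bracketAware = /(?<!\[[^-\]]*)(\w+)-(\w+)(?![^-\]]*\])/g;

const sampleQuery =
  "SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'";

// $1 and $2 refer to the words on either side of the dash.
const spaced = sampleQuery.replace(bracketAware, "$1 - $2");
console.log(spaced);

// Limitation: a later bracketed name WITHOUT a dash satisfies the look-ahead's
// sub-pattern, so the earlier unbracketed name is left unchanged.
const missed = "SELECT some-name FROM [table]".replace(bracketAware, "$1 - $2");
console.log(missed);
```

The first call produces the output asked for in the question. The second call shows a case the pattern does not cover, so it seems safest when every bracketed name in the input contains a dash.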
At first glance, you might assume that you could simply assert that it must not be immediately preceded by a [ character and that it must not be immediately followed by a ] character. In other words, (?<!\[)(\w+)-(\w+)(?!\]). However, that pattern would still match the text ome-nam in the input [some-name] because the text ome-nam is not immediately preceded or followed by the brackets.
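The partial match is easy to reproduce. A small sketch, again in JavaScript as an assumed stand-in for the .NET engine (`naive` and `naiveMatches` are illustrative names):

```javascript
// Naive pattern: only checks the characters directly next to the match.
const naive = /(?<!\[)(\w+)-(\w+)(?!\])/g;

// The engine drops the first and last letters to satisfy both assertions.
const naiveMatches = "[some-name]".match(naive);
console.log(naiveMatches); // [ 'ome-nam' ]

const naiveResult = "[some-name]".replace(naive, "$1 - $2");
console.log(naiveResult); // [some - name]
```

So the bracketed name still gets its dash padded, which is exactly what the longer look-behind and look-ahead are there to prevent.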
Dim regex As Regex = New Regex("\[[^-]*-[^-]*\]")
Dim match As Match = regex.Match("A long string containing square brackets [some-name]")
If match.Success Then
    Console.WriteLine(match.Value)
End If
Or you could use Regex.IsMatch:
Return Regex.IsMatch("A long string containing square brackets [some-name]",
"\[[^-]*-[^-]*\]")
You may match and capture the [...] substrings and then match only the remaining hyphens, the ones not surrounded with whitespace, to replace them:
Dim nStr As String = "SELECT 'some-name' FROM [some-name]"
Dim nResult = Regex.Replace(nStr, "(\[.+?])|\s*-\s*", New MatchEvaluator(Function(m As Match)
        If m.Groups(1).Success Then
            Return m.Groups(1).Value
        Else
            Return " - "
        End If
    End Function))
So, what is happening is:
(\[[^]]+]) - matches and stores the value of [...] substring inside the Group(1) buffer (or \[.+?] can be used here to match a [, then 1 or more any characters and then ] - with RegexOptions.Singleline flag so that . could match a newline, too)
(?<!\s)-(?!\s) - matches any hyphen not preceded ((?<!\s)) or followed ((?!\s)) with whitespace (\s). Actually, we may even use \s*-\s* (where \s* stands for zero or more whitespaces as many as possible since * is a greedy quantifier matching zero or more occurrences of the quantified subpattern) here to remove any whitespace there is to make sure we just insert 1 space before and after -.
If Group 1 matches, then we just re-insert it (Return m.Groups(1).Value), else we insert the space-enclosed hyphen Return " - ".
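The same capture-or-replace idea can be sketched in JavaScript, where a replacement callback plays the role of the MatchEvaluator. This is only an illustration of the technique, not the VB.NET code itself; `bracketOrHyphen` and `spaceHyphens` are hypothetical names.

```javascript
// Group 1 captures a whole [...] chunk; the second alternative matches a
// hyphen together with any whitespace around it.
const bracketOrHyphen = /(\[[^\]]+\])|\s*-\s*/g;

function spaceHyphens(text) {
  // Callback arguments: the full match, then group 1 (undefined when the
  // second alternative matched).
  return text.replace(bracketOrHyphen, (whole, bracketed) =>
    bracketed !== undefined ? bracketed : " - "
  );
}

console.log(spaceHyphens("SELECT 'some-name' FROM [some-name]"));
console.log(spaceHyphens("some  -name, other-name FROM [table]"));
```

Because the bracketed chunks are consumed by the first alternative, a hyphen inside them is never seen by the second one, and a bracketed name without a hyphen causes no trouble either.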
Just to check if it exists, you could try
\[[^\]]+-[^\]]+\]
It matches a literal [ and then any characters, except ], up to (including) a hyphen. Then again any characters, except ], up to a literal ].
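To see how this existence check differs from the looser \[[^-]*-[^-]*\] shown earlier, here is a short JavaScript sketch (the test strings and variable names are made up for the comparison):

```javascript
// Looser pattern: [^-] may run across a closing bracket.
const loose = /\[[^-]*-[^-]*\]/;
// Stricter pattern: everything between [ and ] must stay inside one pair.
const strict = /\[[^\]]+-[^\]]+\]/;

const bracketedName = "A long string containing square brackets [some-name]";
const hyphenOutside = "[a] b-c [d]";

console.log(loose.test(bracketedName), strict.test(bracketedName)); // true true
console.log(loose.test(hyphenOutside), strict.test(hyphenOutside)); // true false
```

In the second string the hyphen sits between two bracket pairs, and only the stricter pattern rejects it.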
I don't know the VB.NET syntax, but you can use a regex such as
/[\s\'](\w+)\-(\w+)/g
This finds a (\w+)-(\w+) that is preceded by whitespace or '; replace the match with capture group 1, a space-padded dash, and capture group 2.
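One detail to watch with this approach: the leading whitespace or quote is part of the match, so it is lost unless it is captured and put back. A minimal JavaScript sketch of that variation (the extra capture group and the names are my own additions, not part of the answer):

```javascript
// Group 1 keeps the leading whitespace or quote so it survives the replace.
const afterSpaceOrQuote = /([\s'])(\w+)-(\w+)/g;

const query =
  "SELECT some-name, some-name FROM [some-table] WHERE [some-column] = 'some-value'";
const padded = query.replace(afterSpaceOrQuote, "$1$2 - $3");
console.log(padded);
```

This handles the sample query, but it would miss a name at the very start of the string and would also touch a bracketed name that contains a space before the hyphenated word.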
Related
I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can I match only [Some Text][2]?
Note: Some Text can contain any character, including [ and ], and the numbers in square brackets can be any number, not only 1 and 2. The Some Text that I want to match can be at the beginning of the line, and there can be multiple Some Texts.
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
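The difference between the two patterns shows up directly on the string from the question. A short sketch (variable names are illustrative):

```javascript
const labelled = "[Some Text][1][Some Text][2][Some Text][3][Some Text][4]";

// Lazy dot: starts at the leftmost "[" and expands until "][2]" is found.
const lazyHit = labelled.match(/\[.*?\]\[2\]/)[0];
// Negated class: cannot cross a "[" or "]", so only the adjacent pair matches.
const classHit = labelled.match(/\[[^\]\[]*\]\[2\]/)[0];

console.log(lazyHit);  // [Some Text][1][Some Text][2]
console.log(classHit); // [Some Text][2]
```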
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
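A quick way to convince yourself that the tempered greedy token behaves as described is to run it next to a plain lazy dot on the same input:

```javascript
const sample = "abc 0 abc 1 def";

// Plain lazy dot starts at the first "abc" and runs through the second one.
const lazyDot = sample.match(/abc.*?def/)[0];
// Tempered token: each consumed character must not begin another "abc".
const tempered = sample.match(/abc(?:(?!abc).)*?def/)[0];

console.log(lazyDot);  // abc 0 abc 1 def
console.log(tempered); // abc 1 def
```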
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])
How to split some strings defined in a specific format:
[length namevalue field]name=value[length namevalue field]name=value[length namevalue field]name=value[length namevalue field]name=value
Is it possible, with a Find/Replace regex in Notepad++, to isolate each name=value pair by replacing [length namevalue field] with a white space?
The main problem is related to numeric value where a simple \d{4} search doesn't work.
Eg.
INPUT:
0010name=mario0013surname=rossi0006age=180006phone=0014address=street
0013name=marianna0013surname=rossi0006age=210006phone=0015address=street1
0003name=pia0015surname=rossini0005age=30017phone=+39221122330020address=streetstreet
OUTPUT:
name=mario surname=rossi age=18 phone= address=street
name=marianna surname=rossi age=21 phone= address=street1
name=pia surname=rossini age=3 phone=+3922112233 address=streetstreet
You can use
\d{4}(?=[[:alpha:]]\w*=)
\d{4}(?=[^\W\d]\w*=)
The patterns match
\d{4} - four digits
(?=[[:alpha:]]\w*=) - that are immediately followed with a letter and then any zero or more word chars followed with a = char immediately to the right of the current position.
(?=[^\W\d]\w*=) - that are immediately followed with a letter or an underscore and then any zero or more word chars followed with a = char immediately to the right of the current position.
In Notepad++, if you want to remove the match at the start of the line and replace with space anywhere else, you can use
^(\d{4}(?=[[:alpha:]]\w*=))|(?1)
and replace with (?1: ). The above explained pattern, \d{4}(?=[[:alpha:]]\w*=), is matched and captured into Group 1 if it is at the start of a line (^), and just matched anywhere else ((?1) recurses the Group 1 pattern, so as not to repeat it). The (?1: ) replacement means we replace with empty string if Group 1 matched, else, we replace with a space.
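Outside Notepad++ the conditional replacement is not available, but the same logic can be sketched with a callback. The JavaScript below is an assumption-laden illustration: it uses the [^\W\d] form of the lookahead (JavaScript has no [[:alpha:]]), repeats the sub-pattern instead of recursing with (?1), and the names `lengthField` and `stripLengths` are hypothetical.

```javascript
// Group 1 participates only when the length field sits at the start of a line.
const lengthField = /^(\d{4}(?=[^\W\d]\w*=))|\d{4}(?=[^\W\d]\w*=)/gm;

function stripLengths(text) {
  // Empty string at the start of a line, a single space anywhere else.
  return text.replace(lengthField, (whole, atLineStart) =>
    atLineStart !== undefined ? "" : " "
  );
}

const records = [
  "0010name=mario0013surname=rossi0006age=180006phone=0014address=street",
  "0003name=pia0015surname=rossini0005age=30017phone=+39221122330020address=streetstreet",
].join("\n");

const cleaned = stripLengths(records);
console.log(cleaned);
```

On the sample lines this yields the name=value pairs separated by single spaces, including the tricky age=18 and phone=+3922112233 values that end in digits.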
I have to rename the toString output variables in several hundred files with many occurrences in each. In the most efficient way possible, how could I parse this text:
.append(", myVariable=").append(myVariable)
.append(", myOtherVariable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", myVarWithURL=").append(myVarWithURL);
and it becomes:
.append(", my_variable=").append(myVariable)
.append(", my_other_variable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", my_var_with_url=").append(myVarWithURL);
The ones on the right are to remain unchanged, while the ones to the left of the equals sign are to be changed, if they contain uppercase characters.
These will be of arbitrary lengths with a varying number of upper case letters. I was thinking I had to do some sort of lookahead but could not get the replacement value to work correctly.
I have the flexibility of being able to do this in IntelliJ or Notepad++, so I can easily perform the \l \L operators to make a replacement value lowercase.
This was my thought process:
in: myLongCamelCasedVariable
re: ([a-z]+)([A-Z]{1})([a-z]+) // repeat grouping for capturing
group 1 group 2 group 3 group 4
my + [ L + ong ] + [ C + amel ] + [ C + ased ] + [ V + ariable ]
Is it possible to use a regular expression to effectively capture the various groups of 'text' in the larger text string, and 'loop' over that and apply the output?
Out: $1_\l$2 .... etc
Now I am just stuck
You may use
Find What: (?:\G(?!\A)|",\h*)\K(\b|[a-z]+)([A-Z]+)(?=\w*=")
Replace With: $1_\L$2
Match case: True
Details:
(?:\G(?!\A)|",\h*) - start matching from the end of the previous successful match (\G(?!\A)) or (|) a ", and zero or more horizontal whitespaces (",\h*)
\K - remove the text matched so far from the match memory buffer
(\b|[a-z]+) - Group 1: word boundary or one or more lowercase letters
([A-Z]+) - Group 2: one or more uppercase letters
(?=\w*=") - immediately to the right, there must be zero or more word chars followed with a =" substring.
The replacement is $1_\L$2: Group 1, _, and then lowercased Group 2 value.
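If the renaming ever has to be scripted rather than done in the editor, here is a rough JavaScript sketch of the same transformation. It swaps in a different technique: JavaScript has no \G, \K or \L, so a callback does the lowercasing instead. The function names `toSnake` and `renameLabels` are invented for the example, and it assumes the labels always appear as `", name="`.

```javascript
// camelCase -> snake_case: put "_" between a lowercase letter or digit and an
// uppercase letter, then lowercase everything.
function toSnake(name) {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

// Only rewrite names that sit between `", ` and `="`, i.e. the quoted labels;
// the argument of the second append(...) is never matched.
function renameLabels(source) {
  return source.replace(/(",\s*)([A-Za-z]\w*)(?==")/g, (whole, prefix, name) =>
    prefix + toSnake(name)
  );
}

const javaLine = '.append(", myVarWithURL=").append(myVarWithURL);';
const renamed = renameLabels(javaLine);
console.log(renamed);
```

Labels that are already lowercase, such as mylowervariable, pass through unchanged.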
You could match sequences of an uppercase char followed by optional uppercase chars and then optional lowercase chars.
In the replacement use _ followed by the lowercased match \L$0
Find what:
(?>,\h+[a-z]+|\G(?!^))\K[A-Z][A-Z]*[a-z]*
(?> Atomic group
,\h+[a-z]+ Match a comma, 1 or more spaces and 1 or more lowercase chars
| Or
\G(?!^) Assert the current position at the end of the previous match but not at the start of the string (so the first part of the alternation has to match first)
) Close atomic group
\K Forget what is matched so far
[A-Z][A-Z]*[a-z]* Match an uppercase char followed by optional upper and lowercase chars
Replace with:
_\L$0
Without using \K you can use 2 capture groups.
(?>(, [a-z]+)|\G(?!^))([A-Z][A-Z]*[a-z]*)
In the replacement use $1_\L$2
So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) will do fine for the very first line, but the problem is: there could be many lines: could be one or more than one.
\n will only appear if there are more than one lines.
In string version, I get it like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, \n need not be at the end, but if there is more than one line, \n needs to be at the end of each line except the very last.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another number sequence, zero or more times
$ end of the string
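A few quick checks of the pattern, sketched in JavaScript (the variable name `dottedLines` is illustrative). Note that in JavaScript, without the m flag, $ matches only at the very end of the string; some other engines also let $ match just before a final newline, so the third result may differ there.

```javascript
const dottedLines = /^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$/;

console.log(dottedLines.test("1.2."));                   // true
console.log(dottedLines.test("1.2.\n3.4.5.\n5.6.7.10")); // true
console.log(dottedLines.test("1.2.\n"));                 // false (trailing newline)
console.log(dottedLines.test("1.2.\n\n3.4."));           // false (empty line)
```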
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 1+ digits and a dot, twice
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that what follows is a newline and then a line starting with the pattern
| Or
\d+\.\d+\. Match 1+ digits and a dot, twice
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that no newline follows
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, which is the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends the $ to your current regex and sets the m flag. This may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
    match;
while (match = regex.exec(text)) {
  console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
 * an empty string on split.
 *
 * "1.2.3.".split(".") //=> ["1", "2", "3", ""]
 * "1.2.3.".slice(0, -1) //=> "1.2.3"
 * "1.2.3".split(".") //=> ["1", "2", "3"]
 */
console.log(
  text.match(regex)
      .map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers
I am using a regular expression to allow and reject strings based on the criteria--
The expression used-
^([\w\.,'()#&-]|\s)*$
Allows-
exmple_
example(ggg)
exam.pl56e
exam.pl56e.hbhbh.
exampleghh. vgvj
example (bb)ste kklk ae
_example_
Currently, it allows adding period in the middle of the string as well as at the end.
I just want to reject string if the period is added at the end of the string but allow it to be added in the middle using the above regular expression
For example, reject-
Test.test1.
Example.
Test Test.
test#example.
exam.pl56e.hbhbh.
You may use a single character class in the pattern (merge \s with the previous character class) to simplify the pattern, and use either
^([\w.,'()#&\s-]*[\w,'()#&\s-])?$
See the regex demo.
Details
^ - start of string
([\w.,'()#&\s-]*[\w,'()#&\s-])? - an optional sequence (if you want to match at least 1 char, remove ( and )?) of:
[\w.,'()#&\s-]* - 0+ word, ., ,, ', (, ), #, &, whitespace or hyphen chars
[\w,'()#&\s-] - a word, ,, ', (, ), #, &, whitespace or hyphen chars (but no .!)
$ - end of string
Or, a lookbehind version:
^[\w.,'()#&\s-]*$(?<!\.)
It matches a string that only consists of the chars inside the character class, and after the end of string is matched, the lookbehind checks if the last char is a dot. If it is, the match is failed.
Or, a lookahead
^(?!.*\.$)[\w.,'()#&\s-]*$
Here, (?!.*\.$) checks if the string ends with . after any 0+ chars, and if it does, no match is returned. Else, the string is matched against the [\w.,'()#&\s-]* pattern.
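The three variants can be compared side by side on some of the strings from the question. A small JavaScript sketch (the object and array names are made up; JavaScript accepts the lookbehind placed after $):

```javascript
const variants = {
  lastCharClass: /^([\w.,'()#&\s-]*[\w,'()#&\s-])?$/,
  lookbehind: /^[\w.,'()#&\s-]*$(?<!\.)/,
  lookahead: /^(?!.*\.$)[\w.,'()#&\s-]*$/,
};

const accepted = ["example(ggg)", "exam.pl56e", "exampleghh. vgvj", "_example_"];
const rejected = ["Test.test1.", "Example.", "Test Test.", "test#example."];

// One row per variant: [name, all accepted strings pass, all rejected strings fail].
const results = Object.entries(variants).map(([name, re]) => [
  name,
  accepted.every((s) => re.test(s)),
  rejected.every((s) => !re.test(s)),
]);
console.log(results);
```

All three rows should report true for both checks, and all three variants also accept the empty string.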
Just specify that the last character cannot be a period.
^([\w\.,'()#&-]|\s)*[^.]$
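One caveat with this shorter version: [^.] accepts any character other than a period in the last position, including characters that are not in the allowed set, and it requires at least one character. A quick JavaScript check (the name `lastNotDot` is illustrative):

```javascript
const lastNotDot = /^([\w\.,'()#&-]|\s)*[^.]$/;

console.log(lastNotDot.test("Example."));  // false, as intended
console.log(lastNotDot.test("Example"));   // true
console.log(lastNotDot.test("Example!"));  // true, although "!" is not an allowed character
console.log(lastNotDot.test(""));          // false, at least one character is required
```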
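A nice trick I've learned is to blacklist certain otherwise-allowed expressions by placing them on their own in an uncaptured alternative in front of the captured ones. The blacklisted alternative still matches, but the capture groups stay empty, so you check the groups rather than the overall match.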
# sentences containing `foo` or `bar` but not the word `foobar`
^.*foobar.*$|(^.*foo.*$)|(^.*bar.*$)
This is admittedly a bit...verbose here:
^(?:[\w\.,'()#&-]|\s)*\.$|^([\w\.,'()#&-]|\s)*$
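Since the trick depends on inspecting the capture groups, here is a minimal JavaScript sketch of how the foo/bar example might be consumed (the helper `accepts` is a hypothetical name):

```javascript
const fooOrBar = /^.*foobar.*$|(^.*foo.*$)|(^.*bar.*$)/;

function accepts(text) {
  const m = fooOrBar.exec(text);
  // The overall match also succeeds for "foobar" strings; only a participating
  // capture group means the string is actually accepted.
  return m !== null && (m[1] !== undefined || m[2] !== undefined);
}

console.log(accepts("some foo here"));    // true
console.log(accepts("some bar here"));    // true
console.log(accepts("some foobar here")); // false
console.log(accepts("nothing"));          // false
```

The same check applies to the period pattern above: a string ending in a period matches overall, but group 1 does not participate.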
So it might be better to use a negative lookbehind
^([\w\.,'()#&-]|\s)*$(?<!\.)
You could use a negative lookahead (?!) to assert that what follows are not the characters in the character class repeated zero or more times ending with a dot at the end of the string.
^(?![\w\.,'()#&\s-]*\.$)[\w\.,'()#&\s-]*$
Note that with the asterisk *, the character class matches zero or more times, so an empty string is matched as well.