Regex Enforcing match - regex

OK, I have this regex:
^[\w\s]+=["']\w+['"]
Now the regex will match:
a href='google'
a href="google"
and also
a href='google"
How can I force the regex to match the closing quote to the opening one?
If the first quote is a single quote, how can I make sure the last quote is also a single quote and not a double quote?

Read about backreferences.
^[\w\s]+=(["'])\w+?\1
Note that you want to put a ? after the second + or else it will be greedy. However, in general this is not the right way to parse HTML. Use Beautiful Soup.
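As a quick check of the backreference pattern above, here is a minimal Python sketch. Python is my choice here (the question does not name a language), and the helper name quotes_match is hypothetical.

```python
import re

# (["']) captures the opening quote; \1 requires the same character to close.
PATTERN = re.compile(r"""^[\w\s]+=(["'])\w+?\1""")

def quotes_match(text):
    """Return True if the value is closed by the same quote that opened it."""
    return PATTERN.search(text) is not None

for sample in ("a href='google'", 'a href="google"', "a href='google\""):
    print(sample, quotes_match(sample))
```

The third sample opens with a single quote and closes with a double quote, so the backreference rejects it.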

I am afraid you will have to do it the long way:
^[\w\s]+=("\w+"|'\w+')
More technically, ensuring correct matching / nesting of quotes is not something a regular grammar can express in general, so for more complex problems you would have to use a proper parser (or Perl 6 style extended regular expressions, although technically those do not count as regular expressions).

Put the opening ["'] in a capture group and replace the closing ['"] with \1 to use a back reference to that group:
^[\w\s]+=(["'])\w+\1

What exactly do you want to match? It sounds like you want to match:
word (tagname)
mandatory whitespace
word (attr name)
optional whitespace
=
optional whitespace
either single quoted or double quoted anything (attr value)
That would be: ^(\w+)\s+(\w+)\s*=\s*(?:'([^']*)'|"([^"]*)")
This will allow matches like:
a href='' - empty attr
a href='Hello world' - spaces and other non-word characters in quoted part
a href="one 'n two" - quotes of different kind in quoted part
a href = 'google' - spaces on both sides of =
And disallow things like these that your original regexp allows:
a b c href='google' - extra words
='google' - only spaces on the left
href='google' - only attr on the left
It still doesn't sound exactly right - you're trying to match a tag with exactly one attribute?
With this regexp, tag name will be in $1, attr name in $2, and attr value in either $3 or $4 (the other being nil - most languages distinguish group not taken with nil vs group taken but empty with "" if you need it).
Regexp that would ensure attr value gets in the same group would be messier if you wanted to allow single quotes in doubly quoted attr value and vice versa - something like ^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3 ((?!) is zero-width negative look-ahead - (?:(?!\3).) means something like [^\3] except the latter isn't supported).
If you don't care about this, ^(\w+)\s+(\w+)\s*=\s*(['"])([^'"]*)\3 will do just fine (for both, $3 will be the quote type and $4 the attr value).
By the way re (["'])\w+?\1 above - \w doesn't match quotes, so this ? doesn't change anything.
Having said all that, use a real HTML parser ;-)
These regexps will work in Perl and Ruby. Other languages usually copy Perl's regexp system, but often introduce minor changes so some adjustments might be necessary. Especially the one with negative look-aheads might be unsupported.
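For what it's worth, the look-ahead variant also seems to work unchanged in Python's re module. A minimal sketch, assuming Python; the names ATTR and parse_tag are hypothetical:

```python
import re

# Group 3 is the opening quote; (?:(?!\3).)* consumes characters
# up to, but not including, the next occurrence of that same quote.
ATTR = re.compile(r"""^(\w+)\s+(\w+)\s*=\s*(['"])((?:(?!\3).)*)\3""")

def parse_tag(text):
    """Return (tag, attr, value), or None if the text does not match."""
    m = ATTR.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(4)

print(parse_tag('a href="one \'n two"'))
print(parse_tag("a b c href='google'"))
```

The attr value always lands in group 4 here, whichever quote type was used.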

Try this:
^[\w\s]+="\w+"|^[\w\s]+='\w+'

Related

Regex to match(extract) string between dot(.)

I want to select some string combinations (joined with dots (.)) from a very long string (SQL). The full string could be a single line or multiple lines with newline separators, and the combination could be at the start (on the first line), on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words (or a space) that must be present wherever this logic is checked? In the above case, if the statement is an update the regex should pick test.table, but if the statement is like select test.table the regex should not pick up this combination, and the same applies for a suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
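Here is a minimal sketch of both patterns run against Example3, assuming Python's re module (the question does not say which language is used). Note that Python only accepts fixed-width lookbehinds; INTO and FROM happen to have the same length, so this works, but adding a keyword of a different length such as UPDATE would need a separate lookbehind.

```python
import re

SQL = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

# Every run of non-whitespace characters that contains a dot.
ANY_DOTTED = re.compile(r"[^\s]+\.[^\s]+")
# Same, but only directly after INTO or FROM plus one whitespace character.
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

all_dotted = ANY_DOTTED.findall(SQL)
tables = AFTER_KEYWORD.findall(SQL)
print(tables)
```

The unrestricted pattern also picks up names such as wu_id.Item_Nbr and as_t.old, which is why the keyword lookbehind is needed for this use case.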
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w repeated between 0 and unlimited times, followed by a dot, followed by another series of word characters repeated between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note: this solution is written for your use case (SQL strings). If the input contains something like '.word' or 'word..word', the pattern will still match it; I assume your strings do not contain such text.
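If inputs like '.word' or 'word..word' do turn up, one possible tightening (my own variation, not part of the answer above) is to require at least one word character on both sides of every dot. A minimal Python sketch; the names STRICT and dotted_names are hypothetical:

```python
import re

# One or more word characters, then one or more ".word" segments.
STRICT = re.compile(r"\w+(?:\.\w+)+")

def dotted_names(text):
    """Return every dot-joined run of word characters in text."""
    return STRICT.findall(text)

print(dotted_names("I am testing something like test.test.test in sentence."))
print(dotted_names(".word word..word a.b"))
```

The trailing full stop after "sentence" is not picked up because nothing word-like follows the dot.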

Perl regex - Is (x|y)* equivalent to [xy]*?

As the title says, does the regex pattern (x|y)* match the same string as [xy]*?
Yes, they match the exact same set of strings.
They are not equivalent. (x|y)* sets a backreference, [xy]* doesn't.
Thus (?:x|y)* and [xy]* are equivalent in behavior, as neither sets a backreference.
It's close to equivalent, but the first form makes a capture from the group delimited by ( ) that can be retrieved with $1 (for the first one) when the regex matches.
If you want to avoid capturing, use
(?:re)
Where re is the regex.
Note
this only works if x and y are exactly x and y, not if they are general regexes
See Backtracking
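The capture difference is easy to see in practice. The question is about Perl, so take this as a stand-in: a minimal sketch using Python's re module, where groups() plays the role of $1.

```python
import re

text = "xyx"

capturing = re.fullmatch(r"(x|y)*", text)
non_capturing = re.fullmatch(r"(?:x|y)*", text)
char_class = re.fullmatch(r"[xy]*", text)

# All three match the whole string, but only the first records a group.
print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
print(char_class.group(0), char_class.groups())
```

With a repeated capturing group, the group holds the text from its last iteration.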

how to group in regex matching correctly?

consider following scenario
input string = "WIPR.NS"
i have to replace this with "WIPR2.NS"
i am using following logic.
match pattern = "(.*)\.NS$" \\ any string that ends with .NS
replace pattern = "$12.NS"
In above case, since there is no group with index 12, i get result $12.NS
But what i want is "WIPR2.NS".
If i don't have digit 2 to replace, it works in all other cases but not working for 2.
How to resolve this case?
Thanks in advance,
Alok
Usually depends entirely on your regex engine (I'm not familiar with those that use $1 to represent a capture group, I'm more used to \1 but you'd have the same problem with that).
Some will provide a delimiter that you can use, like:
replace pattern = "${1}2.NS"
which clearly indicates that you want capture group 1 followed by the literal 2.NS.
In fact, by looking at this page, it appears that's exactly the way to do it (assuming .NET):
To replace with the first backreference immediately followed by the digit 9, use ${1}9. If you type $19, and there are less than 19 backreferences, the $19 will be interpreted as literal text, and appear in the result string as such.
Also keep in mind that Jay provides an excellent answer for this specific use case that doesn't require capture groups at all (by just replacing .NS with 2.NS).
You may want to look into that as a possibility - I'll leave this answer here since:
it's the accepted answer; and
it's probably better for the more complex cases, like changing X([A-Z])4([A-Z]) with X${1}5${2}, where you have variable text on either side of the bit you wish to modify.
You don't need to do anything with what precedes the .NS, since only what is being matched is subject to replacement.
match pattern = "\.NS$" (any string that ends with .NS -- don't forget to escape the .)
replace pattern = "2.NS"
You can further refine this with lookaround zero-width assertions, but that depends on your regex engine, and you have not specified the environment/programming language in which you are working.
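Since the environment was not specified, here is how both approaches might look in Python, purely as an illustration. Python's re module spells the delimited group reference \g<1>, which plays the same role as ${1}; a bare \12 in the replacement would be read as a reference to group 12.

```python
import re

ticker = "WIPR.NS"

# Approach 1: keep the capture group and delimit the reference explicitly.
with_group = re.sub(r"(.*)\.NS$", r"\g<1>2.NS", ticker)

# Approach 2: match only the suffix, so no group reference is needed.
without_group = re.sub(r"\.NS$", "2.NS", ticker)

print(with_group, without_group)
```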

How to ignore whitespace in a regular expression subject string?

Is there a simple way to ignore the white space in a target string when searching for matches using a regular expression pattern? For example, if my search is for "cats", I would want "c ats" or "ca ts" to match. I can't strip out the whitespace beforehand because I need to find the begin and end index of the match (including any whitespace) in order to highlight that match and any whitespace needs to be there for formatting purposes.
You can stick optional whitespace characters \s* in between every other character in your regex. Although granted, it will get a bit lengthy.
/cats/ -> /c\s*a\s*t\s*s/
While the accepted answer is technically correct, a more practical approach, if possible, is to just strip whitespace out of both the regular expression and the search string.
If you want to search for "my cats", instead of:
myString.match(/m\s*y\s*c\s*a\s*t\s*s\s*/g)
Just do:
myString.replace(/\s*/g,"").match(/mycats/g)
Warning: You can't automate this on the regular expression by just replacing all spaces with empty strings because they may occur in a negation or otherwise make your regular expression invalid.
Addressing Steven's comment to Sam Dufel's answer
Thanks, sounds like that's the way to go. But I just realized that I only want the optional whitespace characters if they follow a newline. So for example, "c\n ats" or "ca\n ts" should match. But wouldn't want "c ats" to match if there is no newline. Any ideas on how that might be done?
This should do the trick:
/c(?:\n\s*)?a(?:\n\s*)?t(?:\n\s*)?s/
See this page for all the different variations of 'cats' that this matches.
You can also solve this using conditionals, but they are not supported in the javascript flavor of regex.
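The same pattern can be generated for any search word rather than typed by hand. A minimal sketch, assuming Python; the helper name newline_tolerant is hypothetical:

```python
import re

def newline_tolerant(word):
    """Allow an optional newline plus following whitespace between letters."""
    gap = r"(?:\n\s*)?"
    return re.compile(gap.join(re.escape(ch) for ch in word))

CATS = newline_tolerant("cats")
print(CATS.pattern)
print(bool(CATS.search("c\n ats")), bool(CATS.search("c ats")))
```

Whitespace is only tolerated when it follows a newline, so "c ats" on a single line is rejected.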
You could put \s* in between every character in your search string, so if you were looking for cats you would use c\s*a\s*t\s*s
It's long but you could build the string dynamically of course.
You can see it working here: http://www.rubular.com/r/zzWwvppSpE
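Building the pattern dynamically could look roughly like this. It is a sketch in Python with a hypothetical helper name, find_ignoring_whitespace; it returns the start and end index in the original string, which is what the question needs for highlighting.

```python
import re

def find_ignoring_whitespace(needle, haystack):
    """Return (start, end) of needle in haystack, allowing whitespace
    between its characters, or None if it is not found."""
    # re.escape keeps characters such as "." literal in the built pattern.
    pattern = r"\s*".join(re.escape(ch) for ch in needle)
    m = re.search(pattern, haystack)
    return m.span() if m else None

text = "the ca ts sat"
span = find_ignoring_whitespace("cats", text)
print(span, repr(text[span[0]:span[1]]))
```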
If you only want to allow spaces, then
\bc *a *t *s\b
should do it. To also allow tabs, use
\bc[ \t]*a[ \t]*t[ \t]*s\b
Remove the \b anchors if you also want to find cats within words like bobcats or catsup.
This approach can be used to automate this
(the following exemplary solution is in python, although obviously it can be ported to any language):
you can strip the whitespace beforehand AND save the positions of non-whitespace characters so you can use them later to find out the matched string boundary positions in the original string like the following:
import re

def regex_search_ignore_space(regex, string):
    no_spaces = ''
    char_positions = []
    for pos, char in enumerate(string):
        if re.match(r'\S', char):  # upper \S matches non-whitespace chars
            no_spaces += char
            char_positions.append(pos)
    match = re.search(regex, no_spaces)
    if not match:
        return match
    # match.start() and match.end() are indices of start and end
    # of the found string in the spaceless string
    # (as we have searched in it).
    start = char_positions[match.start()]  # in the original string
    # match.end() is exclusive, so map the last matched character and add 1;
    # this also avoids an IndexError when the match ends the string.
    end = char_positions[match.end() - 1] + 1  # in the original string
    matched_string = string[start:end]
    # the match WITH spaces is returned.
    return matched_string

with_spaces = 'a li on and a cat'
print(regex_search_ignore_space('lion', with_spaces))
# prints 'li on'
If you want to go further you can construct the match object and return it instead, so the use of this helper will be more handy.
And the performance of this function can of course also be optimized, this example is just to show the path to a solution.
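As one possible step in that direction, here is a sketch that returns the (start, end) span in the original string rather than the matched text. It does not build a real match object; the function name search_span_ignore_space is hypothetical.

```python
import re

def search_span_ignore_space(regex, string):
    """Search regex in string with whitespace removed and return the
    (start, end) span of the match in the original string, or None."""
    # Positions of the non-whitespace characters in the original string.
    positions = [i for i, ch in enumerate(string) if not ch.isspace()]
    stripped = "".join(string[i] for i in positions)
    m = re.search(regex, stripped)
    if m is None or m.start() == m.end():
        return None  # no match, or an empty match with nothing to map back
    start = positions[m.start()]
    end = positions[m.end() - 1] + 1  # m.end() is exclusive
    return start, end

with_spaces = 'a li on and a cat'
print(search_span_ignore_space('lion', with_spaces))
```

Collecting the positions in one pass and joining once also avoids the repeated string concatenation in the loop.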
The accepted answer will not work if and when you are passing a dynamic value (such as "current value" in an array loop) as the regex test value. You would not be able to input the optional white spaces without getting some really ugly regex.
Konrad Hoffner's solution is therefore better in such cases as it will strip both the regex and test string of whitespace. The test will be conducted as though both have no whitespace.

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like to have a regex that matches all <![FruitName]!>, between these <![FruitName]!>, there may be some garbage text, my first attempt is like this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
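Both patterns can be tried on the sample string from the question. A minimal sketch, assuming Python; the names NAME and NAME_AND_TEXT are hypothetical:

```python
import re

DATA = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Lazy match: stop at the first "]!>" after the opening "<![".
NAME = re.compile(r"<!\[(.*?)]!>")
# Fruit name plus the text that follows it, up to the next tag or the end.
NAME_AND_TEXT = re.compile(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)")

names = NAME.findall(DATA)
pairs = [(m.group(1), m.group(2)) for m in NAME_AND_TEXT.finditer(DATA)]

print(names)
print(pairs[0])
```

A name that contains ], ! or > on their own still comes through, since only the full ]!> sequence ends the match.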
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (with a simple regex that is more understandable). Split your string using "!>" as the separator. Go through each field and check for <!. If found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for(i=1;i<=NF;i++){
        # check if there is <!
        if($i ~ /<!/){
            # if <! is found, substitute from front till <!
            gsub(/.*<!/,"",$i)
        }
        # print result
        print $i
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.
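The same split-based idea might look like this in Python. This is only a sketch with a hypothetical function name, extract_fruits; unlike the awk script, it skips fields that contain no <! instead of printing them.

```python
def extract_fruits(text):
    """Split on "!>" and keep what follows the last "<!" in each field."""
    fruits = []
    for field in text.split("!>"):
        if "<!" in field:
            # rpartition mirrors the greedy gsub(/.*<!/, "") in the awk version.
            fruits.append(field.rpartition("<!")[2])
    return fruits

DATA = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")
print(extract_fruits(DATA))
```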