How to match only expressions containing not equal groups? - regex

I'm trying to capture only expressions with a difference in given groups by using regular expressions.
For example I need to capture these (in bold):
;TEXT;2;34;1;0;;;;;;3200;
PRINT_Polohopis.dgn;Different TEXT;2;64;1;0;;;;;;3200;
but not these (if it is the same):
;TEXT;2;34;1;0;;;;;;3200;
PRINT_Polohopis.dgn;TEXT;2;64;1;0;;;;;;3200;
So far I managed to create this regex:
^;([\w\s]*;).*\n(?:[\w\s_\.]*);(?:(?!(\1))(\K[\w\s]*;))
which works only if I include a semicolon inside the capturing groups.
Is it possible to capture those groups in a better way?

Something like this might work for you:
/^;([^;]+);.*?\n[^;]+;(?!\1;)([^;]+)/
The trick here is that a negative lookahead is used to make sure \1 (the back reference) does not appear at the desired position:
/^; / # Start of string and literal ;
([^;]+); # Capture all but ; followed by literal ;
.*?\n # Match rest of line
[^;]+; # Match all but ; followed by literal ;
(?!\1;) # Negative lookahead to make sure the captured
# group is not at this position, followed
# by literal ;
([^;]+) # Capture all but ;
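To try the pattern outside an online tester, here is a minimal Python sketch. It assumes the two records arrive as one string separated by \n; the helper name differing_pair is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as above; re.MULTILINE lets ^ match at the start of any line.
PATTERN = re.compile(r"^;([^;]+);.*?\n[^;]+;(?!\1;)([^;]+)", re.MULTILINE)

def differing_pair(block):
    """Return (first_text, second_text) if the two fields differ, else None."""
    m = PATTERN.search(block)
    return (m.group(1), m.group(2)) if m else None

different = (";TEXT;2;34;1;0;;;;;;3200;\n"
             "PRINT_Polohopis.dgn;Different TEXT;2;64;1;0;;;;;;3200;")
same = (";TEXT;2;34;1;0;;;;;;3200;\n"
        "PRINT_Polohopis.dgn;TEXT;2;64;1;0;;;;;;3200;")

print(differing_pair(different))  # ('TEXT', 'Different TEXT')
print(differing_pair(same))       # None
```

Keeping the ; inside the lookahead matters: without it, a second field such as TEXT2 would be rejected just because it starts with TEXT.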

Related

Regular Expression Nucleotide Search

I am trying to find a regular expression that tells me whether a dinucleotide (two letters) appears twice in a row in my sequence. Here is an example:
Let's suppose I have this sequence (The character ; is to make clear that I am talking about dinucleotides):
"AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"
The result I expect is that it matches the pattern AGAG and TATA.
I have already tried this, but it fails because it matches any two dinucleotides, not the same one twice:
([ATGC]{2}){2}
You will need to use backreferences.
Start with matching one pair:
[ATGC]{2}
will match any pair of two of the four letters.
You need to put that in capturing parentheses and refer to the contents of the parentheses with \1, like so:
([ATGC]{2});\1
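As a quick check, this is how the backreference pattern behaves in Python on the sequence from the question. It is only a sketch; the variable names are illustrative.

```python
import re

sequence = "AT;GC;TA;CC;AG;AG;CC;CA;TA;TA"

# Group 1 captures one dinucleotide; \1 requires the very same text again.
repeated = re.compile(r"([ATGC]{2});\1")

matches = [m.group(0) for m in repeated.finditer(sequence)]
print(matches)  # ['AG;AG', 'TA;TA']
```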
Suppose the string were
"TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"
^^ ^^ ^^ ^^ ^^ ^^
If you wish to match "TA" twice (and "AG" once) you could apply @Andy's solution.
If you wish to match "TA" just once, no matter the number of instances of "TA;TA" in the string, you could match
([ATGC]{2});\1(?!.*\1;\1)
and retrieve the contents of capture group 1.
The expression can be broken down as follows.
([ATGC]{2}) # match two characters, each from the character class,
# and save to capture group 1
;\1 # match ';' followed by the content of capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1;\1 # match the content of capture group 1 followed by ';'
# followed by the content of capture group 1
) # end negative lookahead
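The difference between the two patterns can be seen side by side in a short Python sketch (the variable names are mine, not part of the answer). The negative lookahead rejects a repeated pair whenever the same repeated pair occurs again later, so only the last occurrence of each one is reported.

```python
import re

sequence = "TA;TA;GC;TA;CC;AG;AG;CC;CA;TA;TA"

every_repeat = re.compile(r"([ATGC]{2});\1")
last_repeat_only = re.compile(r"([ATGC]{2});\1(?!.*\1;\1)")

all_pairs = [m.group(1) for m in every_repeat.finditer(sequence)]
unique_pairs = [m.group(1) for m in last_repeat_only.finditer(sequence)]

print(all_pairs)     # ['TA', 'AG', 'TA']
print(unique_pairs)  # ['AG', 'TA']
```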

Regex Expression to remove "autoplay" parameter in url

I'm trying to match the url https://youtube.com/embed/id and its parameters, e.g. ?start=10&autoplay=1, but I need the autoplay parameter removed or set to 0.
These are some example urls and what I want the results to look like:
http://www.youtube.com/embed/JW5meKfy3fY?autoplay=1
I want to remove the autoplay parameter and its value:
http://www.youtube.com/embed/JW5meKfy3fY
2nd example
http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1
results should be
http://www.youtube.com/embed/JW5meKfy3fY?start=10
I have tried (https?:\/\/www.youtube.com\/embed\/[a-zA-Z0-9\\-_]+)(\?[^\t\n\f\r \"']*)(\bautoplay=[01]\b&?) and replace with $1$2, but it matches with a trailing ? and & in example 1 and 2 respectively. Also, it doesn't match at all for a url like
http://www.youtube.com/embed/JW5meKfy3fY
I have the regex and examples on here
NB:
The string I am working on contains HTML with one or more youtube urls in it, so I don't think I can easily use Go's net/url package to parse the url.
You're asking for a regex but I think you'd be better off using Go's "net/url" package. Something like this:
import "net/url"
//...
u, _ := url.Parse("http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1")
q := u.Query()
q.Del("autoplay")
u.RawQuery = q.Encode()
cleanURL := u.String()
In real life you'd want to handle the error returned by url.Parse, of course.
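Since the question mentions that the URLs sit inside HTML, the two ideas can be combined: a loose regex only locates each embed URL, and a URL parser does the actual query editing. The following is a Python analogue of the Go snippet, not the answerer's code. It assumes the & separators in the HTML are not escaped as &amp;, and the names EMBED_URL, strip_autoplay and clean_html are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Loose locator: an embed URL runs until whitespace, a quote or an angle bracket.
EMBED_URL = re.compile(r"""https?://www\.youtube\.com/embed/[^\s"'<>]+""")

def strip_autoplay(match):
    """Rebuild one matched URL without its autoplay parameter."""
    parts = urlsplit(match.group(0))
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "autoplay"]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def clean_html(html):
    """Apply strip_autoplay to every embed URL found in the HTML string."""
    return EMBED_URL.sub(strip_autoplay, html)

page = '<iframe src="http://www.youtube.com/embed/JW5meKfy3fY?start=10&autoplay=1"></iframe>'
print(clean_html(page))
# <iframe src="http://www.youtube.com/embed/JW5meKfy3fY?start=10"></iframe>
```

Because the query string is rebuilt from the remaining parameters, no stray ? or & is left behind, and a URL without any query string passes through unchanged.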
Here's a solution that ensures a valid page URI. Simply match this and keep only capture groups 1 and 3.
Edit: The pattern is not elegant, but it ensures no stale ampersands remain. The previous solution was more elegant and wouldn't have broken anything, but it isn't worth the tradeoff in my opinion.
Pattern
(https?:\/\/www\.youtube\.com\/embed\/[^?]+\?.*?)(&autoplay=[01]|autoplay=[01]&?)(.*)
See the demo here.
As the OP has linked to a regex tester that employs the PCRE (PHP) engine, I offer a PCRE-compatible solution. The one token I've used in the regular expression below that is not widely supported in other regex engines is \K (though it is supported by Perl, Ruby, Python's PyPI regex module, R with perl=TRUE and possibly other engines).
\K causes the regex engine to reset the beginning of the match to the current location in the string and to discard any previously-matched characters in the match it returns (if there is one).
With one caveat you can replace matches of the following regular expression with empty strings.
(?x) # assert 'extended'/'free spacing' mode
\bhttps?:\/\/www\.youtube\.com\/embed\/
# match literal
(?=.*autoplay=[01]) # positive lookahead asserts 'autoplay='
# followed by '0' or '1' appears later in
# the string
[a-zA-Z0-9\\_-]+ # match 1+ of the chars in the char class
[^\t\n\f\r \"']* # match 0+ chars other than those in the
# char class
(?<![?&]) # negative lookbehind asserts that previous
# char was neither '?' nor '&'
(?: # begin non-capture group
(?=\?) # use positive lookahead to assert next char
# is a '?'
(?: # begin a non-capture group
(?=.*autoplay=[01]&)
# positive lookahead asserts 'autoplay='
# followed by '0' or '1', then '&' appears
# later in the string
\? # match '?'
)? # end non-capture group and make it optional
\K # reset start of match to current location
# and discard all previously-matched chars
\?? # optionally match '?'
autoplay=[01]&? # match 'autoplay=' followed by '0' or '1',
# optionally followed by '&'
| # or
(?=&) # positive lookahead asserts next char is '&'
\K # reset start of match to current location
# and discard all previously-matched chars
&autoplay=[01]&? # match '&autoplay=' followed by '0' or '1',
# optionally followed by '&'
) # end non-capture group
The one limitation is that it fails to match all instances of .autoplay=.. if more than one such substring appears in the string.
I wrote this expression with the x flag, called extended or free spacing mode, to be able to make it self-documenting.

Regular expression: data between brackets (mysqli > pdo)

To migrate a system to PDO, we would like to replace some queries. A regular expression would speed up the process. We are looking for a way to match:
mysqli_query($db, [expression follows here])
So, we made use of the following regex:
mysqli_query\(\$db, (.*?)\)
The problem is that this breaks when the query itself contains parentheses, for example a join query with an opening and closing ( ).
Example: mysqli_query($db, "SELECT users.id FROM users JOIN (... ) on .. WHERE users.id=1")
Is it possible to edit the regex so that a ) is allowed when a ( was opened before it? In other words, the number of ( and ) should be equal.
You need an editor with PCRE-compatible regexes (Dreamweaver probably only supports JavaScript-style regexes); then you can use a recursive regex like this:
mysqli_query\(\$db, ((?:[^()]++|\((?1)\))*)\)
Test it live on regex101.com.
Explanation:
( # Match and capture in group 1:
(?: # Start of non-capturing group that either matches
[^()]++ # a sequence of characters except parentheses
| # or
\( # an opening parenthesis
(?1) # followed by an expression that follows the same rules as group 1
\) # and a closing parenthesis.
)* # Do this any number of times (including 0)
) # End of group 1
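If no PCRE-style engine is at hand (Python's built-in re module, for instance, has no recursion), the same balancing rule can be written as a plain depth counter. This is a swapped-in technique, not the recursive regex itself, and the function name find_mysqli_args is hypothetical. Like the regex, it assumes that parentheses inside the query are balanced.

```python
def find_mysqli_args(source, prefix="mysqli_query($db, "):
    """Return the second argument of every mysqli_query($db, ...) call."""
    results = []
    start = source.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        depth = 1            # the "(" of mysqli_query( is already open
        pos = begin
        while pos < len(source) and depth:
            if source[pos] == "(":
                depth += 1
            elif source[pos] == ")":
                depth -= 1
            pos += 1
        if depth == 0:
            # pos is one past the closing ")", so drop that character
            results.append(source[begin:pos - 1])
        start = source.find(prefix, pos)
    return results

code = 'mysqli_query($db, "SELECT users.id FROM users JOIN (SELECT 1) t WHERE users.id=1");'
print(find_mysqli_args(code))
# ['"SELECT users.id FROM users JOIN (SELECT 1) t WHERE users.id=1"']
```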
Try the following: mysqli_query\(\$db, ('|")(.+?)\1\)[\s;]. This assumes that the expression will be between single or double quotes. This might fail depending on what content comes after it though, and for some notations within the SQL query itself (e.g. if it encounters \')).
https://regex101.com/r/1NRYUw/1

Fail on Character in RegEx expression

I have an expression that works, except I want it to fail if it sees a ; anyplace other than the first character. It works for (as desired)
; Person1: Role1,
but it also works for (not desired)
; Person1: Role1; Person2: Role2,
So, trying to figure out how to modify the following expression to fail on a semicolon after the first character (and actually the semicolon would only be found in the third group):
(^|; )(.+?:)(.+?)(,)
Sorry, I don't know the flavor. Usage is in an addon for a music tagging program.
If you don't want to allow a semicolon, then tell the regex engine so:
^;?\s*([^;:]+):\s*([^,;]+),$
Explanation:
^ # Start of string
;?\s* # Match optional ; and optional whitespace
([^;:]+) # Capture one or more characters except ; and : in group 1
:\s* # Match : and optional whitespace
([^,;]+) # Capture one or more characters except , and ; in group 2
, # Match ,
$ # End of string
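Since the flavor is unknown, here is a rough check in Python, whose syntax for this particular pattern is the same as in most engines. The function name parse_entry is made up for the example.

```python
import re

ENTRY = re.compile(r"^;?\s*([^;:]+):\s*([^,;]+),$")

def parse_entry(text):
    """Return (person, role) for a single entry, or None if it does not match."""
    m = ENTRY.match(text)
    return m.groups() if m else None

print(parse_entry("; Person1: Role1,"))                  # ('Person1', 'Role1')
print(parse_entry("; Person1: Role1; Person2: Role2,"))  # None
```

The second call fails because neither capture group is allowed to consume a semicolon.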

RegEx to match some wrapped texts

Consider following text:
aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf
I want to retrieve the text between parentheses, but adjacent parentheses should be considered a single unit. How can I do that?
For above example desired output is:
( I)
(as)(dfdsf)(adf)
(sfg).(dfdf)
(asd #54 54 !fa.)
Assumption
No nesting (), and no escaping of ()
Parentheses are chained together with the . character, or by being right next to each other (no flexible spacing allowed).
(a)(b).(c) is considered a single token (the . is optional).
Solution
The regex below is to be used with global matching (match all) function.
\([^)]*\)(?:\.?\([^)]*\))*
Please add the delimiter on your own.
Explanation
Breakdown of the regex (spacing is insignificant). Everything from # onward is a comment and not part of the regex.
\( # Literal (
[^)]* # Match 0 or more characters that are not )
\) # Literal ). These first 3 lines match an instance of wrapped text
(?: # Non-capturing group
\.? # Optional literal .
\([^)]*\) # Match another instance of wrapped text
)* # The whole group is repeated 0 or more times
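For illustration, this is what "global matching" looks like in Python, where findall plays that role. It is a sketch using the sample text from the question.

```python
import re

text = "aas( I)f df (as)(dfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asd #54 54 !fa.) sdf"

# One wrapped text, then any number of further ones joined directly or by a dot.
CHAINED = re.compile(r"\([^)]*\)(?:\.?\([^)]*\))*")

units = CHAINED.findall(text)
print(units)
# ['( I)', '(as)(dfdsf)(adf)', '(sfg).(dfdf)', '(asd #54 54 !fa.)']
```

The pattern contains only a non-capturing group, so findall returns the full matches rather than group contents.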
I'd go with: /(?:\(\w+\)(?:\.(?=\())?)+/g
\(\w+\) to match a-zA-Z0-9_ inside literal parentheses
(?:\.(?=\())? to match a literal . only if it's followed by another opening parenthesis
The whole thing wrapped in (?:)+ to join adjacent captures together
var str = "aas(I)f df (asdfdsf)(adf).dgdf(sfg).(dfdf) asdfsdf dsfa(asdfa) sdf";
str.match(/(?:\(\w+\)(?:\.(?=\())?)+/g);
// -> ["(I)", "(asdfdsf)(adf)", "(sfg).(dfdf)", "(asdfa)"]
Try [^(](\([^()]+([)](^[[:alnum:]]*)?[(][^()]+)*\))[^)]. Capture group 1 is what you want.
This expression assumes that any character other than parentheses may occur in the text between parentheses, and it won't match portions with nested parentheses.
This one should do the trick:
\([A-Za-z0-9]+\)