Regex for extracting each word between hyphens - regex

I am learning regex and trying to write a pattern that matches each of the strings exactly, without the '-', so that I can iterate over the groups and print the respective strings.
I have a string that looks like "Abcd001-wd2s-vwe1-20180e3103.txt"
I was able to write a regex for extracting Abcd001, wd2s and .txt from above text as shown below
(\A[^-]+)=> Abcd001
(-[^-]+-)=> wd2s
(\..*)=>.txt
However, I was unable to come up with the correct pattern for extracting the exact strings vwe1 and 20180e3103
It will be really helpful if you can guide me on this or if there is a better approach to achieve this?
Please note: [^-.]+ may give me all the words separately, but I am looking for an option where I have a group defined for each of these strings so that it's a one-to-one mapping.
Thanks!

To get vwe1 or 20180e3103 from the example data, you might use a quantifier {2} or {3} to repeat matching one or more word characters followed by a hyphen: (?:\w+-){2}.
Then you could capture in a group ([^-.]+) matching not a hyphen or a dot.
(?:\w+-){2}([^-.]+)
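A minimal Python sketch of the idea above, using the sample file name from the question. The variable names and the second, full-string pattern with one group per part are illustrative assumptions, not part of the original answer.

```python
import re

name = "Abcd001-wd2s-vwe1-20180e3103.txt"

# Skip two "word-" prefixes, then capture the third part.
third = re.search(r"(?:\w+-){2}([^-.]+)", name).group(1)

# Skip three "word-" prefixes, then capture the fourth part.
fourth = re.search(r"(?:\w+-){3}([^-.]+)", name).group(1)

# Assumed alternative: one pattern with a dedicated group per part.
# It only fits names with exactly four hyphen-separated parts and an extension.
full = re.compile(r"^([^-]+)-([^-]+)-([^-]+)-([^-.]+)(\..*)$")
parts = full.match(name).groups()

print(third)
print(fourth)
print(parts)
```

The single full-string pattern gives the one-to-one group mapping asked for, at the cost of hard-coding the number of parts.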

Try the below regex
/\-([^\)]+)\-/gmi;
Also check the similar implementation:
https://stackoverflow.com/a/50336050/8179245

Related

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match alphanumerical characters \w only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - or an underscore _, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, a look-behind is not necessary in this situation (or indeed the lookahead), however it is indeed a good idea to start playing around with them. The look-around assertions do not actually consume the characters in the string. As you can see here, the full match for both matches only consists of what is captured by the capture group. You may find more information here.
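To make the difference concrete, here is a small Python sketch that runs the greedy pattern and the negated-character-class pattern against both example URLs from the question. The variable names are illustrative; Python's re module is assumed only because it accepts this fixed-width look-behind.

```python
import re

urls = [
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA",
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA",
]

# Greedy .* runs to the end of the string; the optional lookahead never stops it.
greedy = re.compile(r"(?<=track/)(.*)(?=\??)")

# [^?]* cannot cross a question mark, so it stops right before the query string.
negated = re.compile(r"(?<=track/)([^?]*)(?=\??)")

greedy_ids = [greedy.search(u).group(1) for u in urls]
negated_ids = [negated.search(u).group(1) for u in urls]

print(greedy_ids)
print(negated_ids)
```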
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
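As a rough sketch, the same pattern can be wrapped in a small Python helper. The function name extract_track_id and the choice to return None when nothing matches are assumptions made for illustration.

```python
import re
from typing import Optional

# "track/" followed by one or more word characters, captured as group 1.
TRACK_ID = re.compile(r"track/(\w+)")

def extract_track_id(url: str) -> Optional[str]:
    """Return the ID after 'track/', or None if the URL has no track segment."""
    match = TRACK_ID.search(url)
    return match.group(1) if match else None

print(extract_track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"))
print(extract_track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"))
```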
Regex : (\w+(?=\?))|(\w+&)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to search for a word which has '?' just after it, and if that's unsuccessful it will fetch the last word.

Regex to match reoccurring character groups

I'm trying to write a regex that would match groups of exactly three characters, that reoccur within the text at least one time.
What I came up with is this simple regex: (.{3}).*\g1, using the g (global) and s (dot also matches newline) flags. However, it is clearly faulty, as it only finds some of the groups I'm hoping to capture. Any idea how I can improve it? Here is the link to an example input: https://regex101.com/r/Cuiva1/2
Edit: Here's the full list of groups I was hoping to capture, as requested in the comment: GLT,VIW,IWK,KTL,GLT,LTK,LIS,KTX,TXK,XDL,KTL
If your input is always multiple triplets of uppercase characters and you're only looking for ones that repeat, then you need something more complex to avoid backtracking into a previous triplet:
/(?>[^A-Z]*+([A-Z]{3}))(?=(?:[^A-Z]*+[A-Z]{3})*?\1)|(?>[^A-Z]*+[A-Z]{3})/g
The matches from index 1 will hold what you want. If your strings are not that well formatted (i.e. they may contain a string of any length in between repeating patterns), then you can use a simpler pattern, but you'll get totally inconsistent results and miss some matches.
I re-read your desired output, you're not going to achieve this with regex. VIW and IWK are overlapping, which won't work in a single preg_match_all(). Just use string functions.
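One way to read "just use string functions" is a sliding window over the text, sketched here in Python rather than PHP. The function name, the sample string, and the assumption that the input is one contiguous string are all illustrative; the real input from the linked example is not reproduced here. Unlike the list in the question, this sketch reports each repeated triplet once.

```python
from collections import Counter
from typing import List

def repeated_triplets(text: str, size: int = 3) -> List[str]:
    """Return every substring of length `size` that occurs more than once,
    in order of first appearance. Overlapping windows are all considered."""
    windows = [text[i:i + size] for i in range(len(text) - size + 1)]
    counts = Counter(windows)
    seen = set()
    result = []
    for window in windows:
        if counts[window] > 1 and window not in seen:
            seen.add(window)
            result.append(window)
    return result

# Hypothetical sample: ABC and BCD overlap and both occur twice.
print(repeated_triplets("ABCDABCD"))
```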

How to regex replace all characters of specific type in lines that have a specific word

There is a multiline text in which I'm interested in specific lines, indicated by specific words. For example, I am interested in the lines that have ".jpg" in them.
I'm trying to use a lookahead:
(?=\.jpg)
In these lines I would like to delete specific characters, for example all matches of "_".
Sample input:
https://somewebpage/stuff1_stuff2_stuff3.jpg
Desired output:
https://somewebpage/stuff1stuff2stuff3.jpg
I'm trying to write this regex for the latest Notepad++.
My problem is that I can't seem to properly combine the positive lookahead with my regex recursively:
([^_]*)(_?)
Any help is appreciated.
[_-](?=.*\.jpg) worked for me. Replace with an empty string to remove the characters, or just do a find. You can expand your character list of course, but I think this covers you.
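The same pattern can be tried outside Notepad++; this is a minimal Python sketch, where the second (.png) line is a made-up example added to show that lines without ".jpg" are left alone. It relies on "." not matching a newline by default, so the lookahead only inspects the rest of the current line.

```python
import re

text = (
    "https://somewebpage/stuff1_stuff2_stuff3.jpg\n"
    "https://somewebpage/other_file.png"  # hypothetical line without .jpg
)

# Remove "_" or "-" only when ".jpg" appears later on the same line.
cleaned = re.sub(r"[_-](?=.*\.jpg)", "", text)

print(cleaned)
```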

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, and matches the second and third correctly. The problem is that the fourth and fifth only capture the Dxxx and the last string. I need each of the strings between the '|' to be captured.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
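For comparison, the two-step fallback described above (detect the leading Dxxx, then split) might look like this in Python. The function name parse and its return shape are assumptions for illustration only.

```python
import re
from typing import List, Optional, Tuple

# Leading "D" plus three digits, followed by either "|" or the end of the string.
PREFIX = re.compile(r"^D([0-9]{3})(?:\||$)")

def parse(s: str) -> Optional[Tuple[str, List[str]]]:
    """Return (digits, parts) for a Dxxx string, or None if it does not match."""
    match = PREFIX.match(s)
    if match is None:
        return None
    rest = s[match.end():]
    parts = rest.split("|") if rest else []
    return match.group(1), parts

print(parse("something arbitrary"))
print(parse("D123"))
print(parse("D197|what.org|when.net"))
```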
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
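A short Python sketch of the repeated pattern, built programmatically so the repetition count is explicit. MAX_PARTS is an assumed upper limit chosen for this example; the pattern itself is the one from the answer.

```python
import re

MAX_PARTS = 4  # assumed upper limit on the number of "|"-separated parts

# Prefix, then the same lazy "part + separator-or-end" unit repeated MAX_PARTS times.
pattern = re.compile(
    r"^D([0-9]{3})(?:$|\|)" + r"(.*?)(?:$|\|)" * MAX_PARTS
)

samples = [
    "something arbitrary",
    "D123",
    "D197|what.org|when.net",
    "D297|who.197d234.whatever|when.net|some other arbitrary string",
]

for s in samples:
    m = pattern.match(s)
    print(s, "->", m.groups() if m else None)
```

Unused trailing groups come back as empty strings, so the caller has to filter them out.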
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)

How to evaluate a RegExp in an array with match groups?

I need to parse an array-like text with regular expression and get the match groups.
One example of the text I want to parse is this:
['red','green', 'blue']
I want to use match groups, because I want to extract them.
I am using this regular expression, but the groups found by it are not like what I expected:
\[ *('.+?')( *, *('.+?'))* *\]
The idea is to parse in this order:
A square bracket
Any number of spaces
A group with:
    Single quote
    Any character
    Single quote
Zero or more groups of:
    Any number of spaces
    A comma
    Any number of spaces
    A group with:
        Single quote
        Any character
        Single quote
Any number of spaces
A square bracket
And get one group with each parsed array element.
Can you help me?
Hint: an easy way to test a regexp is the site http://rubular.com
This isn't going to be a comprehensive answer, but I'm fairly certain you can't check for whitespace by doing " *"; at least, it may depend on the language you're using.
Here's a C# regex example that shows some of the language requirements to check for whitespace: regex check for white space in middle of string
Edit: I see you added Ruby as your language. Unfortunately I'm not well versed in Ruby, so I cannot help you with specifics, sorry.
Edit2: Seeing as you're forcing yourself into Ruby to debug your regex statement, might I suggest: http://www.debuggex.com/ which tries to stay language independent?
Try this regex: '([^']+)', it should give you the following match groups red, green, blue according to rubular.com
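A quick check of this simpler pattern, sketched in Python instead of Ruby: re.findall returns the contents of group 1 for every match, so each array element comes back as its own item. The variable names are illustrative.

```python
import re

text = "['red','green', 'blue']"

# Each match is one quoted element; group 1 holds the text between the quotes.
items = re.findall(r"'([^']+)'", text)

print(items)
```

Note that this does not validate the surrounding brackets or commas; it only pulls out the quoted values.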
You can match an arbitrary number of groups with one regex:
^\[\s*|(?:\G'([^']+)'\s*(?:,\s*|]$))+
or like this (should be more performant):
^\[\s*+|(?>\G'([^']++)'\s*+(?>,\s*+|]$))++
This works in Ruby, as asked before; whether it works in Delphi, I don't know.