Regex for capturing the nth occurrence of a char - regex

I want to capture the third comma in strings like:
98,52,"110,18479456000019"
I thought of something like a character except:
[^"0123456789]
But the result was that every comma was captured.
After that, I tried several regexes for capturing the nth occurrence, which seemed like the right approach, but none of them worked.
How do I solve this problem?

There are several ways to capture the third ,. This RegEx is one way to do so:
([\d,])\x22\d+(,)\d+\x22
where the desired , is captured by the second group (,), which you can reference as $2.
I have added additional boundaries to this RegEx for safety, which you can remove if you do not need them.
\x22 is just the hex escape for ", which you can replace with a literal " if you wish:
([\d,])"\d+(,)\d+"
You can also use a backslash (\) to escape a character where necessary.
If your input is a bit more complex, you might create a middle boundary before the third , and list all possible characters in that boundary ([\d\w\"]+), as in this RegEx:
(\d+,){2}[\d\w\"]+(,)
and capture the third , using $2. This time the expression places no constraint on what follows the comma, and it still works.
You might also add a start ^ in the regex:
^(\d+,){2}[\d\w\"]+(,)
as an additional left boundary which means your input must start with this expression.
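As a rough illustration, here is how the two patterns above could be tried in Python (the language is my choice; the question does not name one). Python has no $2, so the sketch reads group 2 from the match object instead, and the final replacement with a decimal point is only an assumed use case.

```python
import re

line = '98,52,"110,18479456000019"'

# Pattern from the answer: the comma inside the quoted number is group 2.
quoted = re.compile(r'([\d,])"\d+(,)\d+"')

# Anchored variant: two "digits + comma" fields, a run of word characters
# or quotes, then the comma to capture.
anchored = re.compile(r'^(\d+,){2}[\d\w"]+(,)')

for pattern in (quoted, anchored):
    match = pattern.search(line)
    # start(2) is the index of the captured comma in the original string.
    print(pattern.pattern, "->", match.start(2))

# Assumed use case: replace only the third comma with a decimal point.
index = anchored.search(line).start(2)
fixed = line[:index] + "." + line[index + 1:]
print(fixed)
```

Both patterns locate the same comma; which one to prefer depends on how strictly the surrounding text should be checked.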

Related

Regex to find if all the characters in a word are the same specific character

I have a set of words coming in one by one like aa, ##, ???, ~~~, ?~ etc
I need a regex to check whether a word consists only of ? or only of ~.
Of the above input examples, ??? and ~~~ should match but not the others.
I tried ^[\s?]*$ and ^[\s~]*$ separately and it works, I am trying to combine them.
^[\s?||~]*$ doesn't work as it also recognizes ?~ as valid.
Any help?
You can use this regex, which looks for a string starting with a ~ or a ?, and then asserts that every other character in the string is the same as the first one using a backreference (\1):
^([~?])\1+$
You need to use a backreference to achieve your desired result.
If you want only ~ or ? use
^([~?])\1+$
If you want any repetitive pattern, use
^(.)\1+$
Explanation: (.) or ([~?]) captures the first character.
Then \1+ checks that the same character repeats one or more times (a backreference).
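You want to match lines that consist entirely of any number of tildes or question marks. In basic (grep/sed-style) regex syntax that would be ^\(~\|?\)*$. The parentheses that make a group and the vertical bar that does the 'or' need to be backslash escaped. Note that this also accepts mixed strings such as ?~, so it does not enforce that all characters are the same.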
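To compare the approaches, here is a small Python sketch run over the sample words from the question. The character-class pattern is my own stand-in for the "any mix of ~ and ?" behaviour, included only for contrast.

```python
import re

words = ["aa", "##", "???", "~~~", "?~"]

# Backreference version: the first character must be ~ or ?, and every
# following character must equal it. \1+ requires at least two characters;
# use \1* if a single "?" or "~" should also match.
same_char = re.compile(r'^([~?])\1+$')

# Plain character-class version, shown for contrast: it accepts any mix
# of ~ and ?, so "?~" passes.
any_mix = re.compile(r'^[~?]+$')

same = [w for w in words if same_char.match(w)]
mixed = [w for w in words if any_mix.match(w)]
print(same)
print(mixed)
```

Only the backreference version rejects ?~, which is the behaviour the question asks for.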

regex match combined with something before a string if exists

I am trying to extract sub-strings from strings such as these:
Test strings:
cat_zoo_New_York_US
dog_zoo_South_Carolina
dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
Sub-strings to return:
cat, New_York
dog, South_Carolina
dolphin, Montreal
pokemon, d
The regex pattern I have tried is
([\w]+)(?:(_zoo_|_home_))(((?!(_US|_Canada|_K2-155))\w)+)
which I don't think is very concise and it returns other sub-strings besides what I need. Do you have any other suggestions?
Thanks!
Some updates after The fourth bird's answer (03/15/2018):
First of all, I like the idea of using both ([^_]+) and (?:) for different parts of the sample strings. But let me extend the sample strings a little.
cat_zoo_New_York_US
dog_zoo_South_Carolina
yellow_dolphin_zoo_Montreal_Canada
pokemon_home_d_K2-155
pokemon_home_zoo_d_K2-155
I actually want to use anchor strings such as 'zoo', 'home' or 'home_zoo' to separate the characters before and after them, while matching (and discarding) the trailing country (or whatever place ID is specified). This makes the question a bit less general (I like the idea of using _, but let me make it trickier so I can learn more).
Two questions here:
1. What is the function of (?=) and .* in (?=(?:_US|_Canada|_K2-155|$)).*$? It seems that if I use (?:_US|_Canada|_K2-155|$) instead, it still works...
2. Since I extended the anchor strings to allow _ in them, I used:
(.*?)(?:_*)(?:home_zoo|zoo|home)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
It seems OK, but if I use:
(.*?)(?:_*)(?:home|zoo|home_zoo)(?:_*)(.*?)(?:_*)(?:US|Canada|K2-155|$)
it matches home first for the last sample string. Is there a greedy way to handle this without having to specify the order of the alternatives?
Again, I would rather not maintain a long list of anchor strings, but I have no other idea for making it more general.
Thanks again!
You could try it like this:
^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$
This will capture 2 groups. You could, for example, use this in a replacement as group1, group2.
First capture the part before the first underscore in group 1, like cat. Then match the second part together with its surrounding underscores, like _zoo_ or _home_.
From that point, capture in a group until a lookahead (?= finds one of your values or the end of the string.
That would match:
^ Begin of the string
([^_]+) Match in a capturing group not an _ one or more times (group 1)
_[^_]+_ match _ then not an _ one or more times followed by _
(.*?) Capture in a group any character zero or more times non-greedy (group 2)
(?= Positive lookahead that asserts what is on the right side is
(?: Non capturing group
_US|_Canada|_K2-155|$ your values or end of the string
) Close group
) Close lookahead
.*$ Match any character zero or more times till the end of the string
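One way to see what the lookahead and the trailing .*$ each contribute is to run a replacement with and without the tail. The sketch below uses Python's re.sub as an assumed environment; the question itself does not say which engine is in use.

```python
import re

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

full = re.compile(r'^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$)).*$')

# Same pattern without the trailing .*$ : the lookahead only checks for the
# suffix, it does not consume it, so the suffix survives a replacement.
no_tail = re.compile(r'^([^_]+)_[^_]+_(.*?)(?=(?:_US|_Canada|_K2-155|$))')

with_tail = [full.sub(r'\1, \2', s) for s in samples]
without_tail = [no_tail.sub(r'\1, \2', s) for s in samples]
print(with_tail)
print(without_tail)
```

The captured groups are identical in both cases; the difference only shows up when the whole match is replaced, because the unconsumed suffix is left in the output.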
Edit: After the updated question, perhaps this will suit your requirements:
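^(.*?)_(?:home_zoo|zoo|home)_(.*?)(?=(?:_US|_Canada|_K2-155|$))
This will match any character zero or more times non-greedily (.*?), then an underscore, a non-capturing group (?:home_zoo|zoo|home) and another underscore to separate the characters before and after. The longer alternative home_zoo is listed first because alternatives are tried from left to right.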
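On the question of not having to order the alternatives by hand: one possible approach is to build the alternation programmatically and sort the anchor strings longest first. The sketch below is in Python and is only an illustration. The helper name build_pattern is hypothetical, and the pattern includes an explicit _ after the anchor word so that group 2 does not begin with an underscore.

```python
import re

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "yellow_dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
    "pokemon_home_zoo_d_K2-155",
]

def build_pattern(anchors, suffixes):
    """Hypothetical helper: compile the extraction regex from word lists."""
    # Alternation is tried left to right, so put longer anchors first;
    # otherwise "home" would win over "home_zoo".
    anchor_alt = "|".join(sorted(map(re.escape, anchors), key=len, reverse=True))
    suffix_alt = "|".join("_" + re.escape(s) for s in suffixes)
    return re.compile(rf'^(.*?)_(?:{anchor_alt})_(.*?)(?=(?:{suffix_alt}|$))')

# The anchors are deliberately given in the "wrong" order.
pattern = build_pattern(["home", "zoo", "home_zoo"], ["US", "Canada", "K2-155"])

results = [pattern.match(s).groups() for s in samples]
for s, r in zip(samples, results):
    print(s, "->", r)
```

Sorting by length is a simple heuristic that works when one anchor is a prefix of another; it does not remove the need to list the anchors and suffixes somewhere.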
Well, I tried a more straightforward approach. If your data is more complex than the sample that you gave above, this may fail. Otherwise, for the above text, it works fine.
Here is the expression that I used:
^([^_]*)_[^_]*_(.*)_.*$
 1      23    45   67
Basically, what I did was:
1. Group the first character run, which contains no _, starting at the beginning of the line.
2. Then an _ follows that group.
3. Next comes an arbitrary-length string with no _ in it.
4. Then comes an _.
5. Group the next arbitrary-length string.
6. Then comes another _.
7. Finally, match the rest of the string.
Replace it with \1, \2 (first group, second group).
If you are using vim, you can also achieve the same thing in vim with the following command:
:%s/^\([^_]*\)_[^_]*_\(.*\)_.*$/\1, \2/g
UPDATE
^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)((?:_US)|(?:_Canada)|(?:_K2-155))*$
You can find the new fiddle here: https://regex101.com/r/qQ2dE4/273
What is the difference between this one and the previous one?
Now I cheat a little, in that I look for adjectives that modify the state name, like South_ or New_. You can add more here, like East_, West_, Old_ or whatever, if such cases occur in your data.
There are cases where the country is omitted from the data. Also, the last token on the very last line does not seem to follow a pattern. So I explicitly listed those options in the expression, like US, Canada, etc. You may need to add more exceptional cases here as well.
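To show why the update was needed, the following Python sketch (Python being an assumed environment) runs both expressions from this answer over the four original sample strings and compares the captured groups.

```python
import re

samples = [
    "cat_zoo_New_York_US",
    "dog_zoo_South_Carolina",
    "dolphin_zoo_Montreal_Canada",
    "pokemon_home_d_K2-155",
]

# First expression: always drops the last underscore-separated token.
simple = re.compile(r'^([^_]*)_[^_]*_(.*)_.*$')

# Updated expression: allows South_/New_ prefixes and an optional,
# explicitly listed suffix.
updated = re.compile(
    r'^([^_]*)_[^_]*_(((?:South_)|(?:New_))*[^_]*)'
    r'((?:_US)|(?:_Canada)|(?:_K2-155))*$'
)

simple_results = [simple.match(s).group(1, 2) for s in samples]
updated_results = [updated.match(s).group(1, 2) for s in samples]
print(simple_results)
print(updated_results)
```

The first expression loses Carolina on the second line because there is no country to discard, while the updated one keeps it at the cost of a hand-maintained list of prefixes and suffixes.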

regex - Removing text from around numbers in Notepad++

I have a large subset of data that looks like this:
MyApp.Whatever\app.config(115): More stuff here, but possibly with numbers or parenthesis...
I'd like to create a replace filter using Notepad++ that would find the line number "(115):" and replace it with a tab character followed by the same number.
I've been trying filters such as (\(\d+\):) and (\(\[0-9]+\):), but they keep returning the entire value in the \1 output.
How would I create a filter using Notepad++ that would successfully replace (115): with tab character + 115?
Use a lazy quantifier: (\(\d+?\):), where the ? prevents it from being greedy. Also, since everything is inside one (), it is all grouped and treated as \1.
If it was in perl I'd say \((\d+?)\): which should match only the inner part.
Edit:
Just talked with my colleague - he said s/\((\d+)\)/\t\1/ and if you needed app config in front you could just put that in the front.
This should work for your needs.
Replace
\((\d+)\):
with
\t$1
Replacing (\(\d+\):) with \t\1 will keep the parentheses and the colon, since you've included them in the group (the outer parentheses), and I think that's what you mean by "they keep returning the entire value."
Instead of escaping the inner parentheses, escape the outer ones as the other answers have suggested: \((\d+)\): says to match a left paren, then match and capture a group of digits, then match a right paren and a colon. Replacing that with \t\1 will get rid of the parens and colon, which were not in the captured group.
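The same grouping idea can be checked outside Notepad++. The sketch below uses Python's re.sub purely as a stand-in; the sample line is shortened from the question and the extra (42) is an invented detail to show that other parenthesized numbers are left alone.

```python
import re

line = r"MyApp.Whatever\app.config(115): More stuff here (42)"

# Escaped outer parentheses match the literal ( and ); the inner, unescaped
# pair captures only the digits, so the replacement keeps just the number.
pattern = re.compile(r'\((\d+)\):')

# A real tab character followed by a reference to group 1.
# count=1 limits the change to the first line-number marker.
result = pattern.sub("\t" + r"\1", line, count=1)
print(repr(result))
```

In Notepad++ the replacement field would be written as \t$1 or \t\1, as the answers above describe.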

How do I make part of a regex match optional?

This is an example string:
123456#p654321
Currently, I am using this match to capture 123456 and 654321 into two different groups:
([0-9].*)#p([0-9].*)
But on occasions, the #p654321 part of the string will not be there, so I will only want to capture the first group. I tried to make the second group "optional" by appending ? to it, which works, but only as long as there is a #p at the end of the remaining string.
What would be the best way to solve this problem?
You have the #p outside of the capturing group, which makes it a required piece of the result. You are also using the dot character (.) improperly. Dot (in most reg-ex variants) will match any character. Change it to:
([0-9]*)(?:#p([0-9]*))?
The (?:) syntax is how you get a non-capturing group. We then capture just the digits that you're interested in. Finally, we make the whole thing optional.
Also, most reg-ex variants have a \d character class for digits. So you could simplify even further:
(\d*)(?:#p(\d*))?
As another person has pointed out, the * operator could potentially match zero digits. To prevent this, use the + operator instead:
(\d+)(?:#p(\d+))?
Your regex will actually match no digits, because you've used * instead of +.
This is what (I think) you want:
(\d+)(?:#p(\d+))?
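A quick way to confirm the behaviour of the optional non-capturing group is to test it against both forms of the input. This is a minimal Python sketch; the question does not say which regex flavour is in use, so the None result for a missing group is specific to Python.

```python
import re

# Group 1: leading digits. The (?:...)? wrapper makes "#p" plus the
# second run of digits optional as a unit.
pattern = re.compile(r'(\d+)(?:#p(\d+))?')

both = pattern.fullmatch("123456#p654321").groups()
first_only = pattern.fullmatch("123456").groups()

print(both)
print(first_only)
```

When the #p part is absent, the second group simply did not participate in the match, so code that reads it should handle that case.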

How to match everything up to the second occurrence of a character?

So my string looks like this:
Basic information, advanced information, super information, no information
I would like to capture everything up to second comma so I get:
Basic information, advanced information
What would be the regex for that?
I tried: (.*,.*), but I get
Basic information, advanced information, super information,
This will capture up to but not including the second comma:
[^,]*,[^,]*
English translation:
[^,]* = as many non-comma characters as possible
, = a comma
[^,]* = as many non-comma characters as possible
[...] is a character class. [abc] means "a or b or c", and [^abc] means anything but a or b or c.
You could try ^(.*?,.*?),
The problem is that .* is greedy and matches the maximum number of characters. The ? after * changes the behaviour to non-greedy.
You could also put parentheses around each .*? segment to capture the strings separately if you want.
I would take a DRY approach, like this:
^([^,]*,){1}[^,]*
This way you can match everything up to the nth occurrence of a character without repeating yourself, except for the last pattern.
Although in the original poster's case the group and its repetition are unnecessary, I think this will help others who need to match the pattern more than twice.
Explanation:
^ From the start of the line
([^,]*,) Create a group matching everything except the comma character until it meets a comma.
{1} Repeat the above pattern (the number of times you need) - 1 times. So if you need 2, put 1; if you need 20, put 19.
[^,]* Repeat the pattern one last time without the trailing comma.
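The n - 1 rule can be wrapped in a small helper so the count is computed rather than typed. The following Python sketch is an illustration only: the function name up_to_nth is hypothetical, it assumes a single-character separator, and it uses a non-capturing group since the group itself is not needed.

```python
import re

def up_to_nth(text, n, sep=","):
    """Return text up to, but not including, the nth occurrence of sep.

    If text has exactly n - 1 separators the whole text is returned;
    if it has fewer, the result is None.
    """
    s = re.escape(sep)
    # (n - 1) repetitions of "non-separators + separator", then one
    # final run of non-separators without the trailing separator.
    pattern = rf'^(?:[^{s}]*{s}){{{n - 1}}}[^{s}]*'
    match = re.match(pattern, text)
    return match.group(0) if match else None

text = "Basic information, advanced information, super information, no information"
print(up_to_nth(text, 2))
print(up_to_nth(text, 1))
print(up_to_nth(text, 5))
```

For n = 2 the generated pattern is the same as the one above, just with (?:...) in place of the capturing group.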
Try this approach:
(.*?,.*?),.*