I need to extract numeric values from strings like "£17,000 - £35,000 dependent on experience"
That string is just an example: I can have 17k, 17.000, 17 or 17,000. Each string contains 0, 1 or 2 numbers (never more than 2), and they can appear anywhere in the string, separated by anything else. I just need to extract them and put the first in one place and the second in another.
I came up with this:
([0-9]+k?[.,]?[0-9]+)
It gives me two matches (don't mind the k?[.,]?, it's correct), both in the $1 group. I need 17,000 in $1 and 35,000 in $2. How can I accomplish this? I could also use two different regexes.
Using regex
With every opening round bracket you create a new capturing group. So to get a second capturing group, $2, you need to match the second number with another bracketed part of your regex, and of course you also need to match the part between the two numbers.
([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+)
See here on Regexr
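As a rough illustration of the two-group pattern above, here is a minimal Python sketch. Python and the variable names are assumptions made for illustration; the question does not say which engine is in use. Groups 1 and 2 play the role of $1 and $2.

```python
import re

# Pattern from the answer: two capturing groups with a dash between them.
RANGE = re.compile(r"([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+)")

text = "£17,000 - £35,000 dependent on experience"
match = RANGE.search(text)

# group(1) holds the first number, group(2) the second.
low, high = match.group(1), match.group(2)
print(low, high)
```

Note that this pattern only matches when both numbers are present and a dash sits between them, so strings with a single number would need separate handling.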
But it could be that Solr has regex functions that put all matches into an array, which might be easier to use.
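Whether Solr offers such a function is not something this sketch assumes. For comparison only, this is what the "all matches into an array" idea looks like in Python, using the single-number pattern from the question; the helper name first_two is hypothetical.

```python
import re

# Single-number pattern from the question, without capturing groups.
NUMBER = re.compile(r"[0-9]+k?[.,]?[0-9]+")

def first_two(text):
    """Return (first, second) numbers found in text, with None where missing."""
    found = NUMBER.findall(text)[:2]
    found += [None] * (2 - len(found))
    return tuple(found)

print(first_two("£17,000 - £35,000 dependent on experience"))
```

This handles 0, 1 or 2 numbers uniformly. One caveat of the pattern itself: it needs at least two digits, and for an input such as 17k the trailing k is not part of the match.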
Match the entire price range with two capture groups rather than matching each amount with a single capture group:
([0-9]+k?[.,]?[0-9]+) - ([0-9]+k?[.,]?[0-9]+)
However, I'm worried (yeah, I'm minding it :p) about that regex as it will match some strange things:
182k,938 - 29.233333
Both of these will be matched. The pattern can definitely be improved if you can give more information on your input types.
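To make the concern concrete, the following Python sketch checks the two odd inputs against the original number pattern and against a stricter variant. The stricter pattern is purely hypothetical: it assumes that '.' and ',' are only ever thousands separators followed by exactly three digits, which the question does not confirm.

```python
import re

ORIGINAL = re.compile(r"[0-9]+k?[.,]?[0-9]+")
# Hypothetical stricter variant: digits, then optional groups of exactly three
# digits after a '.' or ',', then an optional trailing 'k'.
STRICTER = re.compile(r"[0-9]+(?:[.,][0-9]{3})*k?")

odd = ["182k,938", "29.233333"]
expected = ["17k", "17.000", "17", "17,000"]

# fullmatch is used so that partial matches do not count as acceptance.
original_accepts_odd = [bool(ORIGINAL.fullmatch(s)) for s in odd]
stricter_accepts_odd = [bool(STRICTER.fullmatch(s)) for s in odd]
stricter_accepts_expected = [bool(STRICTER.fullmatch(s)) for s in expected]
```

If decimal amounts such as 17.5 can occur in the input, the stricter variant would need to be adjusted.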
What about something along the lines of
[£]?([0-9]+k?[.,]?[0-9]+) - [£]?([0-9]+k?[.,]?[0-9]+)
This should now give you two groups.
Edit: Might need to clean up the spaces too
I know there are several similar answers, but I am struggling to find one that fits my use case.
I need a regex to extract IDs that are 6 characters long and have a mix of numbers and characters.
The IDs will start with one of the following chars [eEdDwWaA]
I have had some solutions that have nearly worked, but the tool I want to plug this regex into does NOT support positive look around and every answer seems to use this.
The string I need to find can be anywhere in text and will either be preceded by a whitespace or a backslash.
Example of what I would want to match is eh3geh (case insensitive)
Here is what I have so far [eEdDwWaA](?:[0-9]+[a-z]|[a-z]+[0-9],{5})[a-z0-9]*
This works for the most part but it is not consistently matching and I'm not sure why.
If you can't use a lookahead, one idea is to capture using The Trick.
The trick is that we match what we don't want on the left side of the alternation (the |), then capture what we do want on the right side.
[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b
.[a-z]{5} matches what we don't want: letters only (left side)
|(...) but captures what we need into group one (right side)
Here is the demo at regex101
On the program side, read the captures of group 1, skipping any match where the group is null or empty.
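The program-side step can be sketched in Python as follows. The sample text is made up for illustration, and Python is an assumption since the question does not name the tool; case-insensitive matching is switched on because the question asks for it.

```python
import re

# Left alternative: any character followed by five letters (an all-letter
# token we want to skip). Right alternative: the wanted ID, captured in group 1.
ID_PATTERN = re.compile(
    r"[\\ ](?:.[a-z]{5}|([eEdDwWaA][a-z0-9]{5}))\b",
    re.IGNORECASE,
)

# Made-up sample: two IDs, one after a space and one after a backslash,
# plus all-letter words that the left alternative swallows.
text = r"user eh3geh logged in from \d12abc and wombat"

# Keep only the matches where group 1 actually captured something.
ids = [m.group(1) for m in ID_PATTERN.finditer(text) if m.group(1)]
print(ids)
```

Words such as "logged" and "wombat" still produce a match, but with an empty group 1, which is why the filter on the group is needed.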
So basically I want to reformat a 10 digit number like so:
1234567890 --> (123) 456-7890
A long way to do this would be to have each number be its own capture group and then back-reference each one individually:
'([0-9])([0-9])...([0-9])' --> (\1\2\3) \4\5\6-\7\8\9\10
This seems unnecessary and verbose, but when I try the following
'([0-9]){10}'
There appears to be only one back-reference, and it's of the last digit in the number.
Is there a more elegant way to reference each character as its own capture group?
Thanks!
The following pattern will do the job: ^(\d{3})(\d{3})(\d{4})$
^(\d{3}): beginning of the string, then exactly 3 digits
(\d{3}): exactly 3 digits
(\d{4})$: exactly 4 digits, then end of the string.
Then replace by: (\1) \2-\3
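For illustration, the same find and replace in Python, which is an assumed language here since the question does not name one. The function name format_phone is hypothetical.

```python
import re

def format_phone(digits):
    """Reformat a 10-digit string as (123) 456-7890; other input is returned unchanged."""
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", digits)

print(format_phone("1234567890"))
```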
Although the other answer with its example regex patterns hopefully shed light on the correct application of capture groups, it does not directly answer the question. If you fail to understand how regular expressions work (capture groups in particular), you may find yourself wanting to do the same thing with a different pattern in the future.
Is there a more elegant way to reference each character as its own capture group?
The initial answer is "No", there is no way to reference an individual capture of a single capture group using traditional replacement syntax - regardless of whether it is a single digit or any other capture group. Consider that you indicate a precise number of matches with {10} and it seems perfectly reasonable to be able to access each capture. But what if you had indicated a variable number of matches with + or {,3}? There would be no well-defined way of knowing how many possible captures occurred.
If the same regex pattern had had more capture groups following the "repeated" capture group, there would be no way of correctly referencing the later groups. Example: Given the pattern ([a-z])+(\d){3}, the first capture group could match 4 letters one time, then the next time match 11 letters. If you wanted to refer to the captured digits, how would you do that? You could not, since \1, \2, \3, ... would all be reserved for possible capture instances of the first group.
But the inability of basic regular expression syntax to do what you want does not remove the validity of your question, nor does it necessarily place the solution outside the realm of many regular expression implementations. Various regex implementations (e.g. language syntax and regex libraries) resolve this limitation by providing objects for accessing repeated captures. (The C# and .NET regex library is one example, as in match.Groups[1].Captures[3].) So even though you can't use basic replacement patterns to get what you want, the answer is often "Yes", depending on the specific implementation.
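As one data point, Python's built-in re module behaves like the traditional case described above: a repeated group keeps only its final capture. The sketch below shows that, along with one simple workaround (matching one digit at a time), which is offered as an illustration rather than as the only option.

```python
import re

m = re.fullmatch(r"([0-9]){10}", "1234567890")
last_capture = m.group(1)  # only the final repetition is retained

# Workaround: match a single digit repeatedly and collect every match.
digits = re.findall(r"[0-9]", "1234567890")
print(last_capture, digits)
```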
So I am looking through different numbers that have different delimiters, and I want to find all numbers that have the same delimiter.
Basically, I want the following
123+456+7890 // MATCH
123-456-7890 // MATCH
123.456.7890 // MATCH
123+456-7890 // FAILURE
My current regex I plan to use was
\d{3}[+-.]\d{3}[+-.]\d{4}
However, it would match number sequences that have the different delimiters. I don't want to use one big huge OR for something like this because the real life equivalent has many more characters that could fit there.
Is there a way to match the same character in multiple locations?
You can use a captured group and a back-reference to ensure same delimiter is used again.
^\d{3}([+.-])\d{3}\1\d{4}$
([+.-]) here we capture delimiter in group #1
\1 Here we are using back-reference of same delimiter
RegEx Demo
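A quick check of this pattern against the four samples from the question, sketched in Python (an assumed language, as the question does not specify one):

```python
import re

# Group 1 captures the first delimiter; \1 requires the same one again.
SAME_DELIM = re.compile(r"^\d{3}([+.-])\d{3}\1\d{4}$")

samples = ["123+456+7890", "123-456-7890", "123.456.7890", "123+456-7890"]
results = {s: bool(SAME_DELIM.match(s)) for s in samples}
print(results)
```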
You can use a back reference like this:
\d{3}([+.-])\d{3}\1\d{4}
The first delimiter that is matched by [+.-] is kept inside a capturing group so that it can be referenced later.
\1 is a backreference to the first capturing group, which in this case is ([+.-]), so it ensures that the delimiter is the same as the previous one.
Regex 101 Demo
You can read more about backreferences here
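A side note on the character class: the question writes it as [+-.], while the answers write [+.-]. The order matters, because a hyphen between two characters denotes a range. The Python sketch below shows the difference; the variable names are illustrative only.

```python
import re

# With '-' between '+' and '.', the class is the range '+' through '.',
# which also contains ','.
as_range = re.compile(r"[+-.]")
# With '-' placed last, it is a literal hyphen.
as_literal = re.compile(r"[+.-]")

comma_in_range = bool(as_range.fullmatch(","))
comma_in_literal = bool(as_literal.fullmatch(","))
print(comma_in_range, comma_in_literal)
```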
The following is in PHP, but the regex will also be used in JavaScript.
I'm trying to extract repeating patterns from a string.
The string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, and matches the second and third correctly. The problem is that for the fourth and fifth it only captures the Dxxx and the last string. I need each of the strings between the '|' to be captured.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
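Since the question mentions Python as a possible target, here is a minimal sketch of the fixed-number-of-groups pattern there. The helper name parse and the choice to drop empty trailing groups are assumptions for illustration.

```python
import re

# Dxxx prefix plus four optional slots, as in the pattern above.
FIXED = re.compile(
    r"^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)"
)

def parse(s):
    """Return (code, [non-empty parts]), or None if s does not start with Dxxx."""
    m = FIXED.match(s)
    if m is None:
        return None
    code, *parts = m.groups()
    # Unused slots match the empty string, so filter them out.
    return code, [p for p in parts if p]

print(parse("D197|what.org|when.net"))
```

Strings with more than four parts after the prefix would have the extra parts ignored, so the number of repeated slots has to cover the largest expected input.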
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
(?|^D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
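The branch reset group (?|...) is a PCRE feature that Python's re module does not have, so the idea cannot be copied over verbatim. The sketch below is a Python adaptation under that constraint: the start anchor sits inside the first alternative, two ordinary groups are used, and m.lastindex picks whichever one took part in the match. The function name split_record is hypothetical.

```python
import re

# Alternative 1: the Dxxx prefix at the start (group 1).
# Alternative 2: any string that does not start with Dxxx, consumed whole.
# Alternative 3: a '|'-separated part anywhere after the start (group 2).
PATTERN = re.compile(r"^D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$)")

def split_record(s):
    """Return the prefix digits and each part, or [] for non-matching strings."""
    return [m.group(m.lastindex) for m in PATTERN.finditer(s) if m.lastindex]

print(split_record("D197|what.org|when.net"))
```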
someText
1
2
3
4
moreText
I would like to add a prefix before each digit.
But using (\w+\R)(\d+\R)+(\w+) and \1prefix\2\3 will only prefix the last digit and erase the others.
Is there a way to do it with a single regex, or should I write a script on the side?
The problem with your regex is the greedy repetition in (\d+\R)+, specifically the last +. It reads, "match this group as many times as you can, as long as the overall match still succeeds". So for your text it gobbles up 1, 2, 3 and 4 before it can't gobble any more, and only the last repetition ends up in the second capture group. It's in the nature of regex engines that they cannot express variadic groups; how would you address them anyway? So the short answer, I think, is that regexes are the wrong tool for a fully automated process and you'll have to write a script.
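The behaviour described above can be reproduced in Python as a small sketch. Python's re module has no \R, so \n stands in for it here, and the editor in the question may differ in other details.

```python
import re

text = "someText\n1\n2\n3\n4\nmoreText"
pattern = re.compile(r"(\w+\n)(\d+\n)+(\w+)")  # \n stands in for \R

# The repeated group keeps only its last repetition.
kept = pattern.search(text).group(2)

# So the replacement prefixes the last number and drops the others.
result = pattern.sub(r"\1prefix\2\3", text)
print(repr(kept), repr(result))
```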
However, for a slightly less automated process that still incorporates your surrounding text, you could try
find: (\w+\R)((?:\d+\R)+)(\w+)
replace: \1prefix\2\3
We wrap the second group plus its greedy quantifier in an extra set of capturing parentheses and turn the inner group into a non-capturing one. Now the full run of digit lines is in its own group, and we can add the prefix to the first one. An interesting side effect is that the newly prefixed number then matches the first group (\w+\R), so if you run the find/replace again it hits the next number, and so on until the pattern no longer matches.
This way, you should be able to run through your files at least only hitting the areas you are interested in adding this prefix to and it shouldn't take nearly as long as finding every digit in every file.
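For the fully automated route, a short script is one option. The following is a minimal Python sketch, assuming \n line endings (Python's re has no \R) and using the hypothetical helper name add_prefix. It reuses the pattern with the extra capturing group and prefixes every line of that group in a replacement callback.

```python
import re

# Word line, a run of number lines (captured as a whole), then a word line.
BLOCK = re.compile(r"(\w+\n)((?:\d+\n)+)(\w+)")

def add_prefix(text, prefix="prefix"):
    """Prefix every number line that sits between two word lines."""
    def repl(m):
        numbers = m.group(2).splitlines(keepends=True)
        return m.group(1) + "".join(prefix + n for n in numbers) + m.group(3)
    return BLOCK.sub(repl, text)

print(add_prefix("someText\n1\n2\n3\n4\nmoreText"))
```

Because matches do not overlap, two number runs that share a single word line between them would need a second pass or a lookahead-based variant.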