Extract information through regexp

Extract information through regexp - regex

I have a question about groups in a rule i created to extract dates from text.
Let's consider the following string:
fherfrefercr17hfeuetvbyeituew
The string is composed by everything at the beginning, then there is a number composed by one or two digits and then everything again. I need to extract only the number "17" from the string listed above.
With the following rule i extract only 7 and not 17.
.*(\d{1,2}).*
Can anyone help me with that please?

Overview
Given your pattern:
.*(\d{1,2}).*
This works in the following way:
.* Match any character any number of times
The quantifier here is considered to be greedy because it will match as many characters as possible so long as the pattern matches the string.
\d{1,2} Since your pattern says to match 1 or 2 digits and the previous token is greedy, the regex is just going to match a single digit because this still satisfies the pattern (the previous token stole the first digit).
Code
There are multiple ways you can fix this issue
Method 1
This will simply extract all numbers (1+ digits) from the string. If you want to only match 1 or two digits use \d\d? or \d{1,2} instead.
\d+
\d\d?
\d{1,2}
Method 2
This method turns the greedy quantifier * (in .*) into a lazy quantifier .*?. This will match any character any number of times, but as few as possible. The drawback to this method is that it's expensive because the engine needs to backtrack.
.*?\d{1,2}.*
Method 3
This method matches any non-digit character any number of times, then it matches one or two digits. This is likely the solution you're looking for.
\D*(\d{1,2}).*

Related

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceeded or followed by a word e.g A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 2 & 3 but still fails on condition 3 as it matches the colon present in the string representation of time. How do I fix this?
edit: it has to be in a single regex statement if possible

You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.

Word characters \w include numbers [a-zA-Z0-9_]
So just use [a-ZA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
Test Here

Regex : extract the biggest number from x to y figures

I have an Url formatted as follow : https://www.mywebsite.com/subdomain/123456789.htm. I know that the webpage number is built with exactly 9 or 10 digits. I would like to extract this number using a Regex.
The Regex I use to perform this operation is :
^https://www.mywebsite.com/[A-Za-z0-9_.-~/]+([0-9]{9,10}).htm$
The problem is that when the number is 10 digits long, I get a match which is good but only the last 9 digits are captured. For example : https://www.mywebsite.com/subdomain/1234567890.htm captures 234567890 only.
I could easily create two regexes (one with 9 digits and one with 10) and take the longest number if both matches, but is there any elegant way to solve this problem using Regex?
EDIT
Following remarks which have been made below, there is actually a mistake in my original Regex : the first character group matches the first digit of the 10, and leaves only the 9 others for the capturing group. I've added a screenshot below. Adding a forward slash to the Regex before the capturing group solved the issue, thanks!

As per #TheFourthBird, you are missing a match on the forward slash. Maybe a slightly different approach to yours would be a non-capturing group:
^https://www.mywebsite.com/(?:[^/]+/)+(\d{9,10}).htm$

The character class [A-Za-z0-9_.-~/]+ matches all the character that follow until the end of the line.
This part ([0-9]{9,10}). will then backtrack until it can match the resulting digits, which it can starting from 9 digits and that will be in the capturing group.
Note to either escape the hyphen \- or place it at the start or end of the character class or else it could possible match a range.
One option is to use a word bounary \b before matching the digits
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+\b([0-9]{9,10})\.htm$
Regex demo
Another way could be matching the / right before the digits.
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+/([0-9]{9,10})\.htm$
Regex demo
If there can also be chars a-zA-Z or an underscoe before the digits and a lookbehind is supported, you could also assert that there is not a digit before (?<!\d)
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+(?<!\d)([0-9]{9,10})\.htm$
Regex demo

One more approach. This gets all the numbers between / and htm
(\d+)(?=\.htm)
RegexDemo

Using regex to match numbers which have 5 increasing consecutive digits somewhere in them

First off, this has sort of been asked before. However I haven't been able to modify this to fit my requirement.
In short: I want a regex that matches an expression if and only if it only contains digits, and there are 5 (or more) increasing consecutive digits somewhere in the expression.
I understand the logic of
^(?=\d{5}$)1*2*3*4*5*6*7*8*9*0*$
however, this limits the expression to 5 digits. I want there to be able to be digits before and after the expression. So 1111345671111 should match, while 11111 shouldn't.
I thought this might work:
^[0-9]*(?=\d{5}0*1*2*3*4*5*6*7*8*9*)[0-9]*$
which I interpret as:
^$: The entire expression must only contain what's between these 2 symbols
[0-9]*: Any digits between 0-9, 0 or more times followed by:
(?=\d{5}0*1*2*3*4*5*6*7*8*9*): A part where at least 5 increasing digits are found followed by:
[0-9]*: Any digits between 0-9, 0 or more times.
However this regex is incorrect, as for example 11111 matches. How can I solve this problem using a regex? So examples of expressions to match:
00001459000
12345
This shouldn't match:
abc12345
9871234444

While this problem can be solved using pure regular expressions (the set of strictly ascending five-digit strings is finite, so you could just enumerate all of them), it's not a good fit for regexes.
That said, here's how I'd do it if I had to:
^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$
Core idea: 0?1?2?3?4?5?6?7?8?9? matches an ascending numeric substring, but it doesn't restrict its length. Every single part is optional, so it can match anything from "" (empty string) to the full "0123456789".
We can force it to match exactly 5 characters by combining a look-ahead of five digits and an arbitrary suffix (which we capture) and a backreference \1 (which must exactly the suffix matched by the look-ahead, ensuring we've now walked ahead 5 characters in the string).
Live demo: https://regex101.com/r/03rJET/3
(By the way, your explanation of (?=\d{5}0*1*2*3*4*5*6*7*8*9*) is incorrect: It looks ahead to match exactly 5 digits, followed by 0 or more occurrences of 0, followed by 0 or more occurrences of 1, etc.)

Because the starting position of the increasing digits isn't known in advance, and the consecutive increasing digits don't end at the end of the string, the linked answer's concise pattern won't work here. I don't think this is possible without being repetitive; alternate between all possibilities of increasing digits. A 0 must be followed by [1-9]. (0(?=[1-9])) A 1 must be followed by [2-9]. A 2 must be followed by [3-9], and so on. Alternate between these possibilities in a group, and repeat that group four times, and then match any digit after that (the lookahead in the last repeated digit in the previous group will ensure that this 5th digit is in sequence as well).
First lookahead for digits followed by the end of the string, then match the alternations described above, followed by one or more digits:
^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|2(?=[3-9])|3(?=[4-9])|4(?=[5-9])|5(?=[6-9])|6(?=[7-9])|7(?=[89])|8(?=9)){4}\d+
Separated out for better readability:
^(?=\d+$)\d*?
(?:
0(?=[1-9])|
1(?=[2-9])|
2(?=[3-9])|
3(?=[4-9])|
4(?=[5-9])|
5(?=[6-9])|
6(?=[7-9])|
7(?=[89])|
8(?=9)
){4}
\d+
The lazy quantifier in the first line there \d*? isn't necessary, but it makes the pattern a bit more efficient (otherwise it initially greedily matches the whole string, requiring lots of failing alternations and backtracking until at least 5 characters before the end of the string)
https://regex101.com/r/03rJET/2
It's ugly, but it works.

Difference between regex quantifiers plus and star

I try to extract the error number from strings like "Wrong parameters - Error 1356":
Pattern p = Pattern.compile("(\\d*)");
Matcher m = p.matcher(myString);
m.find();
System.out.println(m.group(1));
And this does not print anything, that became strange for me as the * means * - Matches the preceding element zero or more times from Wiki
I also went to the www.regexr.com and regex101.com and test it and the result was the same, nothing for this expression \d*
Then I start to test some different things (all tests made on the sites I mentioned):
(\d)* doesn't work
\d{0,} doesn't work
[\d]* doesn't work
[0-9]* doesn't work
\d{4} works
\d+ works
(\d+) works
[0-9]+ works
So, I start to search on the web if I could find an explanation for this. The best I could find was here on the Quantifier section, which states:
\d? Optional digit (one or none).
\d* Eat as many digits as possible (but none if necessary)
\d+ Eat as many digits as possible, but at least one.
\d*? Eat as few digits as necessary (possibly none) to return a match.
\d+? Eat as few digits as necessary (but at least one) to return a match.
The question
As english is not my primary language I'm having trouble to understand the difference (mainly the (but none if necessary) part). So could you Regex expert guys explain this in simple words please?
The closest thing that I find to this question here on SO was this one: Regex: possessive quantifier for the star repetition operator, i.e. \d** but here it is not explained the difference.

The * quantifier matches zero or more occurences.
In practice, this means that
\d*
will match every possible input, including the empty string. So your regex matches at the start of the input string and returns the empty string.

but none if necessary means that it will not break the regex pattern if there is no match. So \d* means it will match zero or more occurrences of digits.
For eg.
\d*[a-z]*
will match
abcdef
but \d+[a-z]*
will not match
abcdef
because \d+ implies that at least one digit is required.

\d* Eat as many digits as possible (but none if necessary)
\d* means it matches a digit zero or more times. In your input, it matches the least possible one (ie, zero times of the digit). So it prints none.
\d+
It matches a digit one or more times. So it should find and match a digit or a digit followed by more digits.

With the pattern /d+ at least one digit will need to be reached, and then the match will return all subsequent characters until a non-digit character is reached.
/d* will match all the empty strings (zero or more), as well at the match. The .Net Regex parser will return all these empty string groups in its set of matches.

Simply:
\d* implies zero or more times
\d+ means one or more times

Match against 1 hyphen per any number of digit groups

I'm trying to come up with some regex to match against 1 hyphen per any number of digit groups. No characters ([a-z][A-Z]).
123-356-129811231235123-1235612346123451235
/[^\d-]/g
The one above will match the string below, but it will let the following go through:
1223--1235---123123-------
I was looking at the following post How to match hyphens with Regular Expression? for an answer, but I didn't find anything close.
#Konrad Rudolph gave a good example.
Regular expression to match 7-12 digits; may contain space or hyphen
This tool is useful for me http://www.gskinner.com/RegExr/

Assuming it can't ever start with a hyphen:
^\d(-\d|\d)*$
broken down:
^ # match beginning of line
\d # match single digit
(-\d|\d)+ # match hyphen & digit or just a digit (0 or more times)
$ # match end of line
That makes every hyphen have to have a digit immediately following it. Keep in mind though, that the following are examples of legal patterns:
213-123-12314-234234
1-2-3-4-5-6-7
12234234234
gskinner example

Alternatively:
^(\d+-)+(\d+)$
So it's one or more group(s) of digits followed by hyphen + final group of digits.
Nothing very fancy, but in my tests it matched only when there were hyphen(s) with digits on both sides.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extract information through regexp - regex

Related

Find certain colons in string using Regex

Regex : extract the biggest number from x to y figures

Using regex to match numbers which have 5 increasing consecutive digits somewhere in them

Difference between regex quantifiers plus and star

Match against 1 hyphen per any number of digit groups

Categories

Resources