I'm creating a regex to process the line below as read from a file.
30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00
I have successfully created the regex, but my issue is that the string may sometimes appear as below, with an additional 13-digit number (0027200004750 in this example):
31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00
Please assist in altering the pattern to ignore the integer of 13 digits that I don't need.
Below is my regex pattern
((?:(?:[0-2]?\d{1})|(?:[3][01]{1}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3})))(?![\d])(\s+)((?:(?:[0-1][0-9])|(?:[2][0-3])|(?:[0-9])):(?:[0-5][0-9])(?::[0-5][0-9])?(?:\s?(?:am|AM|pm|PM))?)(\s+)(\d{6})(\*{6})(\d{4})(\s+)(No)(\d+)(\s+)([+-]?\d*\.\d+)(?![-+0-9\.])
Advice and contribution will be highly appreciated.
The regular expression in question was most likely created using a regular expression builder.
Here is your regular expression reduced to its component parts, simplified and with support for both variants of valid strings.
Date with incomplete validation (invalid days of the month are still possible):
(?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d
Whitespace(s) between date and time:
[\t ]+
\s also matches newline characters and other rarely used whitespace characters, which is why I'm using [\t ]+ instead of \s.
Time with at least hour and minute, with incomplete validation (leap seconds are not handled, and AM or PM is accepted with an invalid hour):
(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?
Whitespace(s), number with 6 digits, 6 asterisks, number with 4 digits, whitespace(s):
[\t ]+\d{6}\*{6}\d{4}[\t ]+
Optionally a number with 13 digits not marked for backreferencing:
(?:\d{13}[\t ]+)?
Number with undetermined number of digits, whitespace(s), optional plus or minus sign, floating point number (without exponent):
No\d+[\t ]+[+-]?[\d,.]+
And here is the entire expression with 2 additionally added pairs of parentheses to mark the strings of real interest for further processing.
((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d[\t ]+(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?[\t ]+\d{6}\*{6}\d{4}[\t ]+)(?:\d{13}[\t ]+)?(No\d+[\t ]+[+-]?[\d,.]+)
The first marking group matches (including the trailing whitespace):
30/05/2014 17:58:19 418087******2093
31/05/2014 15:06:29 410741******7993
\1 or $1 can be used to reference this part of entire found string.
The second marking group matches:
No415000345536 5,000.00
No415100345732 1,500.00
\2 or $2 can be used to reference this part of entire found string.
Hint: (...) is a marking group. (?:...) is a non-marking group because of ?: immediately after opening parenthesis.
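As a quick check, here is a minimal Python sketch that applies the combined expression to both sample lines. The helper name split_line is hypothetical, and stripping the trailing whitespace from the first group is my own choice, not part of the expression.

```python
import re

# The combined expression from above, split over several lines for readability.
PATTERN = re.compile(
    r"((?:0?[1-9]|[12]\d|3[01])[-:\/.](?:0?[1-9]|1[012])[-:\/.](?:19|20)\d\d[\t ]+"
    r"(?:[01]?\d|2[0-3]):[0-5][0-9](?::[0-5][0-9])?(?:[\t ]?(?:am|AM|pm|PM))?"
    r"[\t ]+\d{6}\*{6}\d{4}[\t ]+)"
    r"(?:\d{13}[\t ]+)?"            # optional 13-digit number, not captured
    r"(No\d+[\t ]+[+-]?[\d,.]+)"
)


def split_line(line):
    """Return (date/time/card part, No.../amount part), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Group 1 ends with the whitespace before the next field, so strip it.
    return m.group(1).rstrip(), m.group(2)


short_line = "30/05/2014 17:58:19 418087******2093 No415000345536 5,000.00"
long_line = "31/05/2014 15:06:29 410741******7993 0027200004750 No415100345732 1,500.00"
print(split_line(short_line))
print(split_line(long_line))
```

Both variants yield the same two parts, and the 13-digit number is simply skipped.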
I've got a Regular Expression meant to validate that a phone number string is either empty, or contains 10-14 digits in any format. It works for requiring a minimum of 10 but continues to match beyond 14 digits. I've rarely used lookaheads before and am not seeing the problem. Here it is with the intended interpretation in comments:
/// ^ - Beginning of string
/// (?= - Look ahead from current position
/// (?:\D*\d){10,14} - Match 0 or more non-digits followed by a digit, 10-14 times
/// \D*$ - Ending with 0 or more non-digits
/// .* - Allow any string
/// $ - End of string
^(?=(?:\D*\d){10,14}\D*|\s*$).*$
This is being used in an asp.net MVC 5 site with the System.ComponentModel.DataAnnotations.RegularExpressionAttribute so it is in use server side with .NET Regexes and client-side in javascript with jquery validate. How can I get it to stop matching if the string contains more than 14 digits?
The problem with the regular expression
^(?=(?:\D*\d){10,14}\D*|\s*$).*$
is that there is no end-of-line anchor between \D* and |. Consider, for example, the string
12345678901234567890
which contains 20 digits. The lookahead will be satisfied because (?:\D*\d){10,14} will match
12345678901234
and then \D* will match zero non-digits. By contrast, the regex
^(?=(?:\D*\d){10,14}\D*$|\s*$).*$
will fail (as it should).
There is, however, no need for a lookahead. One can simplify the earlier expression to
^(?:(?:\D*\d){10,14}\D*)?$
Making the outer non-capture group optional allows the regex to match empty strings, as required.
There may be a problem with this last regex. Consider the string
\nabc12\nab12c3456d789efg
The first match of (?:\D*\d) will be \nabc1 (as \D matches newlines) and the second match will be 2, the third, \nab1, and so on, for a total of 11 matches, satisfying the requirement that there be 10-14 digits. This undoubtedly is not intended. The solution is to change the regex to
^(?:(?:[^\d\n]*\d){10,14}[^\d\n]*)?$
[^\d\n] matches any character other than a digit and a newline.
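The three expressions can be compared with a short Python sketch. It is only an illustration: the question targets .NET and JavaScript, the helper name is_valid_phone is hypothetical, and I use fullmatch because Python's $ would otherwise also match just before a trailing newline.

```python
import re

BUGGY = re.compile(r"^(?=(?:\D*\d){10,14}\D*|\s*$).*$")
SIMPLE = re.compile(r"^(?:(?:\D*\d){10,14}\D*)?$")
FIXED = re.compile(r"^(?:(?:[^\d\n]*\d){10,14}[^\d\n]*)?$")


def is_valid_phone(text):
    """True for an empty string or a single line containing 10-14 digits."""
    return FIXED.fullmatch(text) is not None


twenty_digits = "12345678901234567890"
multi_line = "\nabc12\nab12c3456d789efg"

print(BUGGY.match(twenty_digits) is not None)  # True: the original accepts 20 digits
print(is_valid_phone(twenty_digits))           # False
print(SIMPLE.match(multi_line) is not None)    # True: \D* crosses the newlines
print(is_valid_phone(multi_line))              # False
print(is_valid_phone("(123) 456-7890"))        # True
print(is_valid_phone(""))                      # True
```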
I have a question about groups in a rule I created to extract dates from text.
Let's consider the following string:
fherfrefercr17hfeuetvbyeituew
The string consists of arbitrary characters at the beginning, then a number of one or two digits, and then arbitrary characters again. I need to extract only the number "17" from the string listed above.
With the following rule I extract only 7 and not 17.
.*(\d{1,2}).*
Can anyone help me with that please?
Overview
Given your pattern:
.*(\d{1,2}).*
This works in the following way:
.* Match any character any number of times
The quantifier here is considered to be greedy because it will match as many characters as possible so long as the pattern matches the string.
\d{1,2} Match 1 or 2 digits. Because the previous token is greedy, it consumes as much as it can and gives back only what the overall match requires, which is a single digit. The regex therefore captures just 7, since that still satisfies the pattern (the previous token stole the first digit).
Code
There are multiple ways you can fix this issue
Method 1
This will simply extract all numbers (1+ digits) from the string. If you want to only match 1 or two digits use \d\d? or \d{1,2} instead.
\d+
\d\d?
\d{1,2}
Method 2
This method turns the greedy quantifier * (in .*) into a lazy quantifier .*?. This will match any character any number of times, but as few as possible. The drawback to this method is that it's expensive because the engine needs to backtrack.
.*?(\d{1,2}).*
Method 3
This method matches any non-digit character any number of times, then it matches one or two digits. This is likely the solution you're looking for.
\D*(\d{1,2}).*
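The difference between the approaches can be seen in a small Python sketch, with each pattern written with a capturing group around the digits. The variable names are only for illustration.

```python
import re

text = "fherfrefercr17hfeuetvbyeituew"

# Greedy .* keeps the "1" and leaves only the last digit for the group.
greedy = re.match(r".*(\d{1,2}).*", text).group(1)

# Lazy .*? stops as early as possible, so the group gets both digits.
lazy = re.match(r".*?(\d{1,2}).*", text).group(1)

# \D* cannot consume digits at all, so the group gets both digits.
non_digit = re.match(r"\D*(\d{1,2}).*", text).group(1)

print(greedy, lazy, non_digit)  # 7 17 17
```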
I have data indexed in this format: 676767 2343423 2344444 32494444. I need a regular expression for a pattern analyzer that extracts the last 7 digits from the right. Example output: 2494444. The pattern we have tried, [0-9]{7}, is not working.
In ElasticSearch, the pattern is anchored by default. That means, you cannot rely on partial matches, you need to match the entire string and capture the last consecutive 7 digits.
Use
.*([0-9]{7})
where
.* - will match any 0+ chars other than newline (as many as possible) and then will backtrack to match...
([0-9]{7}) - 7 digits placed into Capture group 1.
The Sense plug-in returns the captured value if a capturing group is defined in the regular expression pattern, so, no additional extraction work (or group accessing work) needs to be done.
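The anchoring behaviour can be imitated in Python with re.fullmatch, which also requires the whole string to match. This is only a sketch of the idea in a different engine, not Elasticsearch itself.

```python
import re

value = "676767 2343423 2344444 32494444"

# Anchored matching: seven digits alone cannot cover the whole string.
anchored_plain = re.fullmatch(r"[0-9]{7}", value)

# .* consumes everything, then backtracks until the last 7 characters
# can be matched as digits and captured in group 1.
anchored_capture = re.fullmatch(r".*([0-9]{7})", value)

print(anchored_plain)             # None
print(anchored_capture.group(1))  # 2494444
```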
I'm trying to apply a data validation formula to a column, checking if the content is a valid international telephone number. The problem is I can't have +1 or +some dial code because it's interpreted as an operator. So I'm looking for a regex that accepts all these, with the dial code in parentheses:
(+1)-234-567-8901
(+61)-234-567-89-01
(+46)-234 5678901
(+1) (234) 56 89 901
(+1) (234) 56-89 901
(+46).234.567.8901
(+1)/234/567/8901
A starting regex can be this one (which is also where I took the examples from).
This regex matches all the examples you gave (tested with https://fr.functions-online.com/preg_match_all.html)
/^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$/m
1. ^ Matches the beginning of the string or of a line.
2. To match (+1) and (+61): \(\+\d+\): The plus sign and the parentheses have to be escaped since they have special meaning in a regex. \d+ stands for any digit (\d), and the plus means one or more (the plus could be replaced by {1,2}).
3. [\/\. \-] Matches a dot, space, slash or hyphen exactly once.
4. \(?\d{3}\)?: The question mark makes each parenthesis optional (? = 0 or 1 time). Three digits are expected.
5. [\/\. \-] Same as step 3.
6. [\d\- \.\/]{7,11}: Expects between 7 and 11 characters, each a digit, hyphen, space, dot or slash.
7. $ Matches the end of the line or the end of the string.
The m modifier allows the caret (^) and dollar sign ($) to match at line breaks. Remove it if you want those symbols to match only the beginning and the end of the string.
Slashes are used as the delimiter for this regex (there are other characters that you can use).
I must admit I don't like the last part of the regex, as it does not ensure that you have at least 7 digits.
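A minimal Python sketch of this first regex, run against the examples from the question, shows both the matches and the weak spot. The PHP delimiters and the m modifier are translated to re.MULTILINE, and the test string made of hyphens is my own made-up input.

```python
import re

PHONE = re.compile(
    r"^\(\+\d+\)[\/\. \-]\(?\d{3}\)?[\/\. \-][\d\- \.\/]{7,11}$",
    re.MULTILINE,
)

EXAMPLES = [
    "(+1)-234-567-8901",
    "(+61)-234-567-89-01",
    "(+46)-234 5678901",
    "(+1) (234) 56 89 901",
    "(+1) (234) 56-89 901",
    "(+46).234.567.8901",
    "(+1)/234/567/8901",
]

all_accepted = all(PHONE.match(s) is not None for s in EXAMPLES)

# The last character class also allows separators, so a tail without any
# digits is accepted as well.
no_digits_tail = "(+1)-234-" + "-" * 7
weak_spot = PHONE.match(no_digits_tail) is not None

print(all_accepted, weak_spot)  # True True
```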
It would probably be better to remove all the separators (for example with the PHP function str_replace) and deal only with parentheses and numbers, using this regex
/(\(\+\d+\))(\(?\d{3}\)?)(\d{3})(\d{4})/m
Notice that in this last regex I used 4 capturing groups to match the four digit sections of the phone number. This regex keeps the parentheses and the plus sign of the first group and the optional parentheses of the second group. To keep only the digits in each group, you can use this regex:
/\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})/m
Note: The groups are for formatting the phone number after validating it. It is probably better for you to keep all your phone number in your database in the same format.
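The strip-then-match idea might look like this in Python. Several things here are assumptions of mine: re.sub stands in for PHP's str_replace, fullmatch is used so that trailing characters are rejected, and the helper name normalize and its output format are hypothetical.

```python
import re

# Digits-only variant of the regex, applied after the separators are removed.
DIGITS_ONLY = re.compile(r"\(\+(\d+)\)\(?(\d{3})\)?(\d{3})(\d{4})")


def normalize(raw):
    """Return the number in one fixed format, or None if it does not fit."""
    compact = re.sub(r"[/. \-]", "", raw)  # drop slash, dot, space, hyphen
    m = DIGITS_ONLY.fullmatch(compact)
    if m is None:
        return None
    country, area, prefix, line = m.groups()
    return f"+{country} {area} {prefix} {line}"


print(normalize("(+1)-234-567-8901"))    # +1 234 567 8901
print(normalize("(+61)-234-567-89-01"))  # +61 234 567 8901
print(normalize("(+1)-234"))             # None
```

Note that this variant expects exactly 10 digits after the dial code, so it is stricter than the first regex.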
Well, these are the different possibilities you can use.
Note: These regexes should be compatible with all regex engines, but it is good practice to specify which language you work with, because regex engines do not all handle advanced features the same way.
For example, lookbehind was not supported by JavaScript before ES2018, and .NET allows more powerful control over lookbehind than PHP.
Let me know if you need more information.
I have a string like:
$str1 = "12 ounces";
$str2 = "1.5 ounces chopped";
I'd like to get the amount from the string whether it is a decimal or not (12 or 1.5), and then grab the measurement that immediately follows it (ounces).
I was able to use a pretty rudimentary regex to grab the measurement, but getting the decimal/integer has been giving me problems.
Thanks for your help!
If you just want to grab the data, you can just use a loose regex:
([\d.]+)\s+(\S+)
([\d.]+): [\d.]+ will match a sequence of strictly digits and . (it means 4.5.6 or .... will match, but those cases are not common, and this is just for grabbing data), and the parentheses signify that we will capture the matched text. The . here is inside character class [], so no need for escaping.
Followed by arbitrary spaces \s+ and maximum sequence (due to greedy quantifier) of non-space character \S+ (non-space really is non-space: it will match almost everything in Unicode, except for space, tab, new line, carriage return characters).
You can get the number in the first capturing group, and the unit in the 2nd capturing group.
You can be a bit stricter on the number:
(\d+(?:\.\d*)?|\.\d+)\s+(\S+)
The only change is (\d+(?:\.\d*)?|\.\d+), so I will only explain this part. This is a bit stricter, but whether stricter is better depends on the input domain and your requirements. It will match an integer such as 34 or a number with a decimal part such as 3.40000, and it allows cases like .5 and 34. to pass. It will reject a number with excess . characters, or one that consists only of a . character. The | acts as OR, separating 2 different patterns: \d+(?:\.\d*)? and \.\d+.
\d+(?:\.\d*)?: This will match and (implicitly) assert at least one digit in integer part, followed by optional . (which needs to be escaped with \ since . means any character) and fractional part (which can be 0 or more digits). The optionality is indicated by ? at the end. () can be used for grouping and capturing - but if capturing is not needed, then (?:) can be used to disable capturing (save memory).
\.\d+: This will match for the case such as .78. It matches . followed by at least one (signified by +) digit.
This is not a good solution if you want to make sure you get something meaningful out of the input string. You need to define all expected units before you can write a regex that only captures valid data.
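For illustration, here is the stricter pattern in a short Python sketch (the question uses PHP, and the helper name amount_and_unit is hypothetical). It does not validate the unit, in line with the caveat above.

```python
import re

STRICT = re.compile(r"(\d+(?:\.\d*)?|\.\d+)\s+(\S+)")


def amount_and_unit(text):
    """Return (amount as float, following word), or None if nothing matches."""
    m = STRICT.search(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


print(amount_and_unit("12 ounces"))           # (12.0, 'ounces')
print(amount_and_unit("1.5 ounces chopped"))  # (1.5, 'ounces')
print(amount_and_unit(".5 cups"))             # (0.5, 'cups')
```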
Use this regular expression: \b\d+([\.,]\d+)?
To get integers and decimals that either use a comma or a dot plus the next word, use the following regex:
/\d+([\.,]\d+)?\s\S+/