Find numbers with a fixed length in a text file - regex

I have a text file which does not have any specific format. It contains text and numbers. I want to get only the numbers that have exactly 24 digits. I want to remove all the other text and get those numbers separated by a space or newline.
I can select numbers with 24 digits by using [0-9]{24} but I want to remove all extra text and leave the numbers there.
For example, if the file is like this:
asafa sfasd asd 123 15 1 asd ad7a sd78a6s da87ds6a 8s7d .123 1.
32 141.23 . 123456789012345678901234 asafa sfasd asd 123 15 1 asd ad7a sd78a6s da87ds6
a 8s7d .123 1.32 141.23 . 123456789012345678901234 asafa sfasd asd 123 15 1 asd ad7a sd78a6s da87ds6a 8s7d .12
3 1.32 141.23 . 123456789012345678901234 asafa sfasd asd 123 15 1 asd ad7a sd78a6s da87ds6a 8s7d .123 1.32 141.23 . 123456789012345678901234
I want to get
123456789012345678901234 123456789012345678901234 123456789012345678901234 123456789012345678901234
separated by space or newline (any separator would do.) Numbers are not always the same in the file, this is just an example to show what I'm going to do.
Thanks.

You might use the following regex and replace each match with a space (an empty replacement string would run the 24-digit numbers together):
(?>(?:\D|(?<!\d)\d{1,23}(?!\d)|(?<!\d)\d{25,}(?!\d))+)
It matches everything that is not a digit, plus any run of digits that is not exactly 24 digits long.
REGEX EXPLANATION:
(?>...) - An atomic group syntax, we do not backtrack inside the group (it increases performance)
(?:\D|(?<!\d)\d{1,23}(?!\d)|(?<!\d)\d{25,}(?!\d))+ - A non-capturing group where we list our alternatives (the patterns we want to match) that are listed with | alternation operator:
\D - a non-digit
(?<!\d)\d{1,23}(?!\d) - Any sequence of 1 to 23 digits that are not preceded with a digit (thanks to the negative look-behind (?<!\d)), and are not followed by a digit (thanks to the negative look-ahead (?!\d))
(?<!\d)\d{25,}(?!\d) - Similar to the above, but it matches sequences of 25 or more digits.
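Outside an editor, the same result can be reached from the other direction: instead of deleting everything that is not a 24-digit number, collect the 24-digit runs directly. The following is only a minimal Python sketch, assuming the text is already loaded into a string; the function name extract_fixed_length_numbers and the sample text are made up for illustration.

```python
import re

def extract_fixed_length_numbers(text, length=24):
    """Return every run of exactly `length` digits found in `text`."""
    # The lookarounds reject runs that are part of a longer digit sequence.
    pattern = re.compile(r"(?<!\d)\d{%d}(?!\d)" % length)
    return pattern.findall(text)

# Hypothetical sample: two 24-digit numbers and one 25-digit number.
n24 = "123456789012345678901234"
sample = f"asd 123 15 1.32 . {n24} sd78a6s {n24}5 da87 {n24}"

numbers = extract_fixed_length_numbers(sample)
print(" ".join(numbers))  # the two 24-digit numbers, separated by a space
```

Joining the list with "\n" instead of " " gives one number per line.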

Related

regex to extract a specific set of numbers

I need a regular expression to extract a specific set of numbers from a string. The string could contain letters, special characters and spaces.
Input examples:
This is a test 99 12 3456
This is test 2 94123456
This is test 3 357 95123456
This is test 4 35797123 456
And so on…
The regex should look for a string of 8 numbers starting with 94 or 95 or 96 or 97 or 99 followed by 6 more numbers.
example:
94<6 more numbers here>
95<6 more numbers here>
96<6 more numbers here>
97<6 more numbers here>
99<6 more numbers here>
or 11 numbers starting with 357 followed by 94 or 95 or 96 or 97 and 6 more numbers.
example:
35794<6 more numbers here>
35795<6 more numbers here>
35796<6 more numbers here>
35797<6 more numbers here>
35799<6 more numbers here>
So the output should either be 8 numbers, or 11 numbers. Less than 8 or more than 11 is not a valid output. Also anything between 8 and 11 is not valid.
Hope this makes it more clear
Thanks for your help
Maybe this:
(357|94|95|96)[\d ]{6,}
Which means "357" or "94" or "95" or "96" followed by at least six digits and/or spaces. I wasn't sure exactly what you want. It would be better just to post the exact input and output desired.
If you’re working in an environment that supports lookbehinds, you can ensure you’re not matching a partial number by using a negative lookbehind and negative lookahead:
/(?<!\d)(?:357)?9[4-79]\d{6}(?!\d)/
(?<!\d): Negative lookbehind (ensure there isn’t a digit before the matching expression)
(?:357)?: Create a non-capturing group of 357 to attach an optional quantifier (match 357 zero to 1 times)
9: Match 9
[4-79]: Character set with range 4-7 and 9 (match one of these characters)
\d{6}: Match a digit exactly 6 times
(?!\d): Negative lookahead (ensure there isn’t a digit after the matching expression)
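As a rough check of the pattern above, here is a small Python sketch (Python's re module supports both lookarounds). The sample strings are partly taken from the question and partly invented. Note that the pattern does not look through spaces inside a number, so an input such as "35797123 456" is not matched.

```python
import re

# Optional 357 prefix, then 94-97 or 99, then six digits,
# with no digit directly before or after the match.
PHONE = re.compile(r"(?<!\d)(?:357)?9[4-79]\d{6}(?!\d)")

samples = [
    "This is test 2 94123456",
    "This is test 3 357 95123456",
    "This is test 4 35797123 456",
    "call 35799123456 now",   # invented: 11-digit form
    "too long 941234567",     # invented: 9 digits, must be rejected
]

for s in samples:
    print(repr(s), "->", PHONE.findall(s))
```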
This regular expression will do it if you remove the spaces from the input first: 3579[4-9](?:\d{8}|\d{6})

regex: Numbers and spaces (10 or 14 numbers)

How I can write a regex which accepts 10 or 14 digits separated by a single space in groups of 1,2 or 3 digits?
examples:
123 45 6 789 1 is valid
1234 567 8 9 1 is not valid (group of 4 digits)
123 45 6 789 109 123 8374 is not valid (not 10 or 14 digits)
EDIT
This is what I have tried so far
[0-9 ]{10,14}+
But it validates also 11,12,13 numbers, and doesn't check for group of numbers
You may use this regex with lookahead assertion:
^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$
Here (?=...) is lookahead assertion that enforces presence of 10 or 14 digits in input.
\d{1,3}(?: \d{1,3})+ matches input with 1 to 3 digits separated by space with no space allowed at start or end.
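A quick way to try this pattern is a small Python sketch like the one below. The helper name is_valid is made up, and the 14-digit and 12-digit inputs are invented test cases in addition to the ones from the question.

```python
import re

# Lookahead: exactly 10 or 14 digits overall (spaces optional between them).
# Main part: groups of 1 to 3 digits separated by single spaces.
GROUPED = re.compile(
    r"^(?=(?:\d ?){10}(?:(?:\d ?){4})?$)\d{1,3}(?: \d{1,3})+$"
)

def is_valid(s):
    return GROUPED.match(s) is not None

for s in [
    "123 45 6 789 1",             # 10 digits, valid groups
    "1234 567 8 9 1",             # group of 4 digits
    "123 45 6 789 109 123 8374",  # wrong digit count
    "123 456 789 012 34",         # invented: 14 digits
    "123 456 789 012",            # invented: 12 digits
]:
    print(repr(s), is_valid(s))
```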
aggtr,
You can match your use case with the following:
^(?:\d\s?){10}$|^(?:\d\s?){14}$
^ means the beginning of the string and $ means the end of the string.
(?:...) is a non-capturing group. The part before the | therefore matches a string that consists of exactly 10 repetitions of a digit followed by an optional whitespace character, from the beginning to the end of the string. The alternative after the | does the same for 14 repetitions, so the pattern accepts either 10 or 14 digits.
Edit: I missed the part of your requirement to have the digits grouped by 1, 2, or 3 digits.

PCI Compliance regex detect pattern with spaces

I have to write a regular expression that detects credit card numbers in text. I have one, but it fails when extra spaces are inserted into the number. Take this example (not a valid credit card number):
4320 7589 9456 0123
The regex is:
4\d{3}(\s+|-)?\d{4}(\s+|-)?\d{4}(\s+|-)?\d{4}
This regex matches the example above easily, but if someone inserts a space between any two digits, like this:
4 320 7589 9456 0123
it no longer matches. I need a regex that detects any variation with spaces, special symbols or letters between the digits. Some examples:
43 20 75 89 94 56 01 23
4 3 2 0 7 5 8 9 9 4 5 6 0 1 2 3
4320a7589b9456c0123
4320$7589$9456$0123
4320_7589_9456_0123
I don't know whether I can strip the spaces and symbols from the text before analyzing it.
I am posting because you actually asked for help with a pattern that matches any number of non-digits between the first 4 and the 15 digits that follow it.
The pattern is
^4(?:\D*\d){15}$
Regex breakdown:
^ - start of string
4 - literal 4
(?:\D*\d){15} - 15 occurrences of sequences of...
\D* - 0 or more non-digit symbols before..
\d - a digit
$ - end of string
If you need to capture, you can capture (like ^4((?:\D*\d){3})((?:\D*\d){4})((?:\D*\d){4})((?:\D*\d){4})$), but the submatches will still contain the "junk" in-between digits.
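To try the pattern on the examples from the question, a short Python sketch follows. The helper names looks_like_card and digits_only are made up. Removing the non-digits afterwards is one possible way to get rid of the "junk" mentioned above, not something the pattern does by itself.

```python
import re

# A leading 4, then 15 more digits, each optionally preceded by non-digits.
CARD_LIKE = re.compile(r"^4(?:\D*\d){15}$")

def looks_like_card(text):
    return CARD_LIKE.match(text) is not None

def digits_only(text):
    # Drop everything that is not a digit to recover the bare number.
    return re.sub(r"\D", "", text)

examples = [
    "4320 7589 9456 0123",
    "4 320 7589 9456 0123",
    "43 20 75 89 94 56 01 23",
    "4 3 2 0 7 5 8 9 9 4 5 6 0 1 2 3",
    "4320a7589b9456c0123",
    "4320$7589$9456$0123",
    "4320_7589_9456_0123",
]

for e in examples:
    print(repr(e), looks_like_card(e), digits_only(e))
```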

Regex preg_match to neutralize a pricelist, keeping only digits, dots and commas

I am using preg_match (PHP version 5.5.*) and want to ignore all alphabetic letters [a-zA-Z] and special symbols such as $ and -, only to match numbers, commas, dots. Whitespaces between numbers such as 6 000 should be matched. Commas after a number that is not followed by another number should be ignored, such as 6, would only match 6
Note that this is used in a single string and never in a list, like the sample below. I use the list to show what input and desired output is, "per line".
Sample input:
1
1,99
1.99
10
100
5999 dollars
2 USD
$2,99
Our price 2.99
Price: $ 20
200 $
20,-
6 999 USD
Desired output:
1
1,99
1.99
10
100
5999
2
2,99
2.99
20
200
20
6 999
I have tried /([0-9.,\s]+)/ but the output of 6 999 USD becomes 6.
Edit
The code we are using looks like this:
preg_match($regex, $value, $extractions);
array_shift($extractions);
$this->persist($extractions);
Update:
If you have &nbsp; instead of spaces, you can do two things. My recommendation is to just do a str_replace() first:
$number = str_replace('&nbsp;', ' ', $number);
The other option is to also check for &nbsp; alongside the [\s,] group:
[\d.](?:[\d.]|(?:[\s,]|&nbsp;)(?=\d))*
Example:
preg_match('/[\d.](?:[\d.]|[\s,](?=\d))*/', $number, $matches);
$number = reset($matches);
Explanation:
So I classified the valid characters (digits, spaces, commas, and periods) into two groups: [\d.] and [\s,]. A number must start with a digit or a period ($.99 == .99 != 99). Then we use a repeated non-capturing group (?:...)* to take care of our alternation and lookahead assertions. Any time there is a [\d.], we match it with no questions asked. Otherwise (|), if it is a [\s,], we assert that it is followed by a digit using a lookahead ((?=...)).
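For readers who want to try the pattern outside PHP, here is a rough Python port of the preg_match call, checked against a few inputs from the question. The function name extract_price is made up, and the sketch assumes Python's re.search is an acceptable stand-in for preg_match here.

```python
import re

# Start at a digit or period; keep digits and periods freely, but keep
# whitespace or a comma only when a digit follows immediately.
PRICE = re.compile(r"[\d.](?:[\d.]|[\s,](?=\d))*")

def extract_price(value):
    m = PRICE.search(value)
    return m.group(0) if m else None

for value in ["6 999 USD", "20,-", "$2,99", "Price: $ 20", "Our price 2.99"]:
    print(repr(value), "->", repr(extract_price(value)))
```

Because the match may start at a period, a string with a period before the price (an abbreviation, for example) would match at that period first.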
Example:
preg_replace('/\s*[^\d\s,.]+\s*|,(?!\d)/', '', $number);
Explanation:
[^\d\s,.]+ will match 1+ characters that are not either a digit, whitespace, a comma, or a period. We put \s* on either side to grab any extra whitespace around these unwanted characters (like in "Our price "). The only unwanted character this doesn't match is a trailing comma. We use an alternation (|), then look for a comma, and then make sure that it is not followed by a digit using a negative lookahead ((?!...)).
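The replacement approach can be tried the same way. This is a rough Python port of the preg_replace call, with a made-up helper name strip_price, run against inputs from the question.

```python
import re

# Remove runs of unwanted characters (with surrounding whitespace),
# and any comma that is not followed by a digit.
JUNK = re.compile(r"\s*[^\d\s,.]+\s*|,(?!\d)")

def strip_price(value):
    return JUNK.sub("", value)

for value in ["6 999 USD", "20,-", "$2,99", "Price: $ 20", "200 $", "Our price 2.99"]:
    print(repr(value), "->", repr(strip_price(value)))
```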

RegEx with counting digits and allow special chars

I've done some searching but can't find the right regex.
I would like a regex for text that contains only digits, whitespace and plus signs,
like: [0-9 +]
But with a min/max limit for only the digits in that text.
My suggestions ended up with something like this:
^[0-9 \+](?=(.*[0-9]){5,8})$
Should be OK:
"123 456 7"
"12345"
"+ 123 456 78"
Should not be ok:
"123456789"
"+ 124 578a"
"+123456789"
Anyone got a solution that might do the trick?
Edit:
I can see that I was too brief in explaining what I'm aiming for.
My regex conditions should be:
Must include between 5-8 digits
Allow whitespaces and plus signs
I'm guessing from your own regex that between 5 and 8 digits in a row without whitespace in between are allowed. If that's true, then the following regex might do the trick (example written in Python). It allows a single group of between 5 and 8 digits. If there is more than one group, each group must have exactly 3 digits, except for the last group, which can be between 1 and 3 digits long. A single plus sign on the left is optional.
Are you parsing phone numbers? :)
In [176]: regex = re.compile(r"""
^ # start of string
(?: \+\s )? # optional plus sign followed by whitespace
(?:
(?: \d{3}\s )+ # one or more groups of three digits followed by whitespace
\d{1,3} # one group of between one and three digits
| # ALTERNATIVE
\d{5,8} # one group of between five and eight digits
)
$ # end of string
""", flags=re.X)
# --- MATCHES ---
In [177]: regex.findall('123 456 7')
Out[177]: ['123 456 7']
In [178]: regex.findall('12345')
Out[178]: ['12345']
In [179]: regex.findall('+ 123 456 78')
Out[179]: ['+ 123 456 78']
In [200]: regex.findall('12345678')
Out[200]: ['12345678']
# --- NON-MATCHES ---
In [180]: regex.findall('123456789')
Out[180]: []
In [181]: regex.findall('+ 124 578a')
Out[181]: []
In [182]: regex.findall('+123456789')
Out[182]: []
In [198]: regex.findall('123')
Out[198]: []
In [24]: regex.findall('1234 556')
Out[24]: []
You can do something like this:
^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$
The first group (?:[ +]*[0-9]){5} are the 5 minimum digits, with any amount of spaces and plus before, the second part (?:(?:[ +]*[0-9])?){3} matches the optional digits, with any amount of spaces and plus before.
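A short Python sketch to check this pattern against the strings from the question (the lists ok and bad simply repeat the asker's examples). Note that the spaces and plus signs are only accepted before a digit, so trailing spaces are rejected.

```python
import re

# Five mandatory digits plus up to three optional ones, each digit
# optionally preceded by any number of spaces and plus signs.
FIVE_TO_EIGHT = re.compile(r"^(?:[ +]*[0-9]){5}(?:(?:[ +]*[0-9])?){3}$")

ok = ["123 456 7", "12345", "+ 123 456 78"]
bad = ["123456789", "+ 124 578a", "+123456789"]

for s in ok + bad:
    print(repr(s), FIVE_TO_EIGHT.match(s) is not None)
```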
You were very close - you need to anchor the lookahead to the start of input, and add a second negative lookahead for the upper bound of the quantity of digits:
^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$
Also, FYI, you don't need to escape the plus sign within the character class, and [0-9] can be written as \d.
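The lookahead version can be cross-checked against a plain digit count. The sketch below is one possible way to do that in Python; the function names valid_by_regex and valid_by_count are made up, and the counting variant is an alternative shown for comparison, not part of the answer above.

```python
import re

LOOKAHEAD = re.compile(r"^(?=(.*\d){5,8})(?!(.*\d){9,})[\d +]+$")

def valid_by_regex(s):
    return LOOKAHEAD.match(s) is not None

def valid_by_count(s):
    # Same rule without lookaheads: check the allowed characters first,
    # then count the digits.
    if not re.fullmatch(r"[0-9 +]+", s):
        return False
    return 5 <= sum(c.isdigit() for c in s) <= 8

cases = ["123 456 7", "12345", "+ 123 456 78",
         "123456789", "+ 124 578a", "+123456789"]

for s in cases:
    print(repr(s), valid_by_regex(s), valid_by_count(s))
```

Counting the digits in code tends to be easier to read and to adjust when the limits change, at the cost of not being a single pattern.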