Regex to capture text not beginning with number but containing number - regex

I am writing a perl script and part of it is to capture data that does not begin with a number. I have tried (\w)\s+(\d+)\s+(\S+)\s+(\d.+). Below are some parts of the file(too big to put all lines here).
The text I want to capture is
BBACCap 8 N/A 48,46,44,42,40,38,36,34,32,
or can be
IG-XL_DataTool N/A N/A N/A
or
DC-30 1 N/A 1,0
The regex does match for the above data I need however I am also capturing data(which I don't want) such as
1 2
2 3, 4
and also(which I don't want)
1.0 BBAC-15 805-004-50 0301B5C5 0829-E 5445
aka: 805-004-02,805-004-03
but only E 5445
aka: 805-004-02,805-004-03 from the above.
Any help on this?

It's hard to be sure what you need, but it looks like you can split each line on whitespace and select just the first three fields, rejecting any line whose first field starts with a decimal digit
Here's a demonstration which reads from files specified on the command line
while ( <> ) {
my #fields = split;
next if $fields[0] =~ /^[0-9]/;
print "#fields[0..2]\n";
}

Related

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

How to capture a multiple patterns using regex?

I have several text files, containing error values. The values are different in each file and so i'm not able to get the exact line where the value is present.
The example is as follows:
v1 = 1111
v2 = A:10 B:2
Text:
12.10.08,11:12:39,183769 1111,10352,003,12,11:12:39,183 Syntax-->12345
(would like to capture v1)
01.01.02,06:10:56,243648 00488,00000,018,01,06:10:56,243 A:10 B:2--1212 (would like to capture v2)
The regex is as follows:
((\d{2}[.]\d{2}[.]\d{2}),(\d{2}[:]\d{2}[:]\d{2},\d*\s*(('+v1+')[,].*|\S*\s('+v2+')).*))
Irrespective of the value passed, it should go through text and grab the value. If v1 is present, should provide the complete text and if v2 is present the same.
But with one regex equation.
You might use:
\d{2}\.\d{2}\.\d{2},\d{2}:\d{2}:\d{2},\d{6}(?: \d{5}(?:,\d+)+:\d{2}:\d{2},\d+)? (\d{4}\b|[A-Z]:\d{2} [A-Z]:\d)
Explanation
\d{2}\.\d{2}\.\d{2},\d{2}:\d{2}:\d{2},\d{6} Match the format of the starting digits
(?: \d{5}(?:,\d+)+:\d{2}:\d{2},\d+)? Optionally match the part starting with 5 digits up until a time like format
( Capturing group
\d{4}\b Match 4 digits
| Or
[A-Z]:\d{2} [A-Z]:\d Match A:10 B: format
) Close group
Regex demo

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas followed by comma and another one or more non-commas followed by a vertical space (linebreak), that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either another vertical whitespaces or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
Here you can see, how it works
However, if you are using any programming language to check this, I wouldn't walk on the path of regex, as checking for those repetitions can be done a lot easier. Here is one example in Python, I hope you can adopt it for your needs:
oldcoords = [0,0,0]
lines = [line.rstrip('\n') for line in open(r'C:\temp\gps.csv')]
for line in lines:
gpscoords = line.split(',')[3:6]
if gpscoords == oldcoords:
repetitions += 1
else:
oldcoords = gpscoords
repetitions = 0
if repetitions == 4: #or however you define more than a few
print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line==$current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath should do the trick.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631

c# split text file by changing the line number

I'm trying to split text file by line numbers,
for example, if I have text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it to 3 parts (1,2,3),
How can I do this? the size of the text is very large (~20,000,000 characters) and I need an efficient way (like regex).
Another idea, you can use linq to get the groups you're after, by splitting by each first word. Note that this will take each first word, so make sure you only have numbers there. This is using the split/join antipattern, but it seems to work nice here.
var lines = from line in s.Split("\r\n".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries)
let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
group line by lineNumber
into g
select String.Join("\n", g);
Notes:
GroupBy is gurenteed to return lines in the order they appeared.
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
You can use a regex, but Split will not work too well. You can Match for the following pattern:
^(\d).*$ # Match first line, capture number
([\r\n]+^\1.*$)* # Match additional lines that begin with the same number
Example: here
I did try to split by$(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elementnt in the array.

RegEx: Uk Landlines, Mobile phone numbers

I've been struggling with finding a suitable solution :-
I need an regex expression that will match all UK phone numbers and mobile phones.
So far this one appears to cover most of the UK numbers:
^0\d{2,4}[ -]{1}[\d]{3}[\d -]{1}[\d -]{1}[\d]{1,4}$
However mobile numbers do not work with this regex expression or phone-numbers written in a single solid block such as 01234567890.
Could anyone help me create the required regex expression?
[\d -]{1}
is blatently incorrect: a digit OR a space OR a hyphen.
01000 123456
01000 is not a valid UK area code. 123456 is not a valid local number.
It is important that test data be real area codes and real number ranges.
^\s*(?(020[7,8]{1})?[ ]?[1-9]{1}[0-9{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3})?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*|[0-9]+[ ]?[0-9]+$
The above pattern is garbage for many different reasons.
[7,8] matches 7 or comma or 8. You don't need to match a comma.
London numbers also begin with 3 not just 7 or 8.
London 020 numbers aren't the only 2+8 format numbers; see also 023, 024, 028 and 029.
[1-9]{1} simplifies to [1-9]
[ ]? simplifies to \s?
Having found the intial 0 once, why keep searching for it again and again?
^(0....|0....|0....|0....)$ simplifies to ^0(....|....|....|....)$
Seriously. ([1]|[2]|[3]|[7]){1} simplifies to [1237] here.
UK phone numbers use a variety of formats: 2+8, 3+7, 3+6, 4+6, 4+5, 5+5, 5+4. Some users don't know which format goes with which number range and might use the wrong one on input. Let them do that; you're interested in the DIGITS.
Step 1: Check the input format looks valid
Make sure that the input looks like a UK phone number. Accept various dial prefixes, +44, 011 44, 00 44 with or without parentheses, hyphens or spaces; or national format with a leading 0. Let the user use any format they want for the remainder of the number: (020) 3555 7788 or 00 (44) 203 555 7788 or 02035-557-788 even if it is the wrong format for that particular number. Don't worry about unbalanced parentheses. The important part of the input is making sure it's the correct number of digits. Punctuation and spaces don't matter.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)(?:\d{5}\)?[\s-]?\d{4,5}|\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3})|\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4}|\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}|8(?:00[\s-]?11[\s-]?11|45[\s-]?46[\s-]?4\d))(?:(?:[\s-]?(?:x|ext\.?\s?|\#)\d+)?)$
The above pattern matches optional opening parentheses, followed by 00 or 011 and optional closing parentheses, followed by an optional space or hyphen, followed by optional opening parentheses. Alternatively, the initial opening parentheses are followed by a literal + without a following space or hyphen. Any of the previous two options are then followed by 44 with optional closing parentheses, followed by optional space or hyphen, followed by optional 0 in optional parentheses, followed by optional space or hyphen, followed by optional opening parentheses (international format). Alternatively, the pattern matches optional initial opening parentheses followed by the 0 trunk code (national format).
The previous part is then followed by the NDC (area code) and the subscriber phone number in 2+8, 3+7, 3+6, 4+6, 4+5, 5+5 or 5+4 format with or without spaces and/or hyphens. This also includes provision for optional closing parentheses and/or optional space or hyphen after where the user thinks the area code ends and the local subscriber number begins. The pattern allows any format to be used with any GB number. The display format must be corrected by later logic if the wrong format for this number has been used by the user on input.
The pattern ends with an optional extension number arranged as an optional space or hyphen followed by x, ext and optional period, or #, followed by the extension number digits. The entire pattern does not bother to check for balanced parentheses as these will be removed from the number in the next step.
At this point you don't care whether the number begins 01 or 07 or something else. You don't care whether it's a valid area code. Later steps will deal with those issues.
Step 2: Extract the NSN so it can be checked in more detail for length and range
After checking the input looks like a GB telephone number using the pattern above, the next step is to extract the NSN part so that it can be checked in greater detail for validity and then formatted in the right way for the applicable number range.
^\(?(?:(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?\(?(?:0\)?[\s-]?\(?)?|0)([1-9]\d{1,4}\)?[\s\d-]+)(?:((?:x|ext\.?\s?|\#)\d+)?)$
Use the above pattern to extract the '44' from $1 to know that international format was used, otherwise assume national format if $1 is null.
Extract the optional extension number details from $3 and store them for later use.
Extract the NSN (including spaces, hyphens and parentheses) from $2.
Step 3: Validate the NSN
Remove the spaces, hyphens and parentheses from $2 and use further RegEx patterns to check the length and range and identify the number type.
These patterns will be much simpler, since they will not have to deal with various dial prefixes or country codes.
The pattern to match valid mobile numbers is therefore as simple as
^7([45789]\d{2}|624)\d{6}$
Premium rate is
^9[018]\d{8}$
There will be a number of other patterns for each number type: landlines, business rate, non-geographic, VoIP, etc.
By breaking the problem into several steps, a very wide range of input formats can be allowed, and the number range and length for the NSN checked in very great detail.
Step 4: Store the number
Once the NSN has been extracted and validated, store the number with country code and all the other digits with no spaces or punctuation, e.g. 442035557788.
Step 5: Format the number for display
Another set of simple rules can be used to format the number with the requisite +44 or 0 added at the beginning.
The rule for numbers beginning 03 is
^44(3\d{2})(\d{3])(\d{4})$
formatted as
0$1 $2 $3 or as +44 $1 $2 $3
and for numbers beginning 02 is
^44(2\d)(\d{4})(\d{4})$
formatted as
(0$1) $2 $3 or as +44 $1 $2 $3
The full list is quite long. I could copy and paste it all into this thread, but it would be hard to maintain that information in multiple places over time. For the present the complete list can be found at: http://aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_GB_Telephone_Numbers
Given that people sometimes write their numbers with spaces in random places, you might be better off ignoring the spaces all together - you could use a regex as simple as this then:
^0(\d ?){10}$
This matches:
01234567890
01234 234567
0121 3423 456
01213 423456
01000 123456
But it would also match:
01 2 3 4 5 6 7 8 9 0
So you may not like it, but it's certainly simpler.
Would this regex do?
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Wed, Sep 8, 2010, 06:38:28
/// Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [1]: A numbered capture group. [\+44], zero or one repetitions
/// \+44
/// Literal +
/// 44
/// [2]: A numbered capture group. [\s+], zero or one repetitions
/// Whitespace, one or more repetitions
/// [3]: A numbered capture group. [\(?]
/// Literal (, zero or one repetitions
/// [area_code]: A named capture group. [(\d{1,5}|\d{4}\s+?\d{1,2})]
/// [4]: A numbered capture group. [\d{1,5}|\d{4}\s+?\d{1,2}]
/// Select from 2 alternatives
/// Any digit, between 1 and 5 repetitions
/// \d{4}\s+?\d{1,2}
/// Any digit, exactly 4 repetitions
/// Whitespace, one or more repetitions, as few as possible
/// Any digit, between 1 and 2 repetitions
/// [5]: A numbered capture group. [\)?]
/// Literal ), zero or one repetitions
/// [6]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// [tel_no]: A named capture group. [(\d{1,4}(\s+|-)?\d{1,4}|(\d{6}))]
/// [7]: A numbered capture group. [\d{1,4}(\s+|-)?\d{1,4}|(\d{6})]
/// Select from 2 alternatives
/// \d{1,4}(\s+|-)?\d{1,4}
/// Any digit, between 1 and 4 repetitions
/// [8]: A numbered capture group. [\s+|-], zero or one repetitions
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// -
/// Any digit, between 1 and 4 repetitions
/// [9]: A numbered capture group. [\d{6}]
/// Any digit, exactly 6 repetitions
///
///
/// </summary>
public Regex MyRegex = new Regex(
"(\\+44)?\r\n(\\s+)?\r\n(\\(?)\r\n(?<area_code>(\\d{1,5}|\\d{4}\\s+"+
"?\\d{1,2}))(\\)?)\r\n(\\s+|-)?\r\n(?<tel_no>\r\n(\\d{1,4}\r\n(\\s+|-"+
")?\\d{1,4}\r\n|(\\d{6})\r\n))",
RegexOptions.IgnoreCase
| RegexOptions.Singleline
| RegexOptions.ExplicitCapture
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
//// Replace the matched text in the InputText using the replacement pattern
// string result = MyRegex.Replace(InputText,MyRegexReplace);
//// Split the InputText wherever the regex matches
// string[] results = MyRegex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = MyRegex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = MyRegex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = MyRegex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = MyRegex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = MyRegex.GetGroupNumbers();
Notice how the spaces and dashes are optional and can be part of it.. also it is now divided into two capture groups called area_code and tel_no to break it down and easier to extract.
Strip all whitespace and non-numeric characters and then do the test. It'll be musch , much easier than trying to account for all the possible options around brackets, spaces, etc.
Try the following:
#"^(([0]{1})|([\+][4]{2}))([1]|[2]|[3]|[7]){1}\d{8,9}$"
Starts with 0 or +44 (for international) - I;m sure you could add 0044 if you wanted.
It then has a 1, 2, 3 or 7.
It then has either 8 or 9 digits.
If you want to be even smarter, the following may be a useful reference: http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom
It's not a single regex, but there's sample code from Braemoor Software that is simple to follow and fairly thorough.
The JS version is probably easiest to read. It strips out spaces and hyphens (which I realise you said you can't do) then applies a number of positive and negative regexp checks.
Start by stripping the non-numerics, excepting a + as the first character.
(Javascript)
var tel=document.getElementById("tel").value;
tel.substr(0,1).replace(/[^+0-9]/g,'')+tel.substr(1).replace(/[^0-9]/g,'')
The regex below allows, after the international indicator +, any combination of between 7 and 15 digits (the ITU maximum) UNLESS the code is +44 (UK). Otherwise if the string either begins with +44, +440 or 0, it is followed by 2 or 7 and then by nine of any digit, or it is followed by 1, then any digit except 0, then either seven or eight of any digit. (So 0203 is valid, 0703 is valid but 0103 is not valid). There is currently no such code as 025 (or in London 0205), but those could one day be allocated.
/(^\+(?!44)[0-9]{7,15}$)|(^(\+440?|0)(([27][0-9]{9}$)|(1[1-9][0-9]{7,8}$)))/
Its primary purpose is to identify a correct starting digit for a non-corporate number, followed by the correct number of digits to follow. It doesn't deduce if the subscriber's local number is 5, 6, 7 or 8 digits. It does not enforce the prohibition on initial '1' or '0' in the subscriber number, about which I can't find any information as to whether those old rules are still enforced. UK phone rules are not enforced on properly formatted international phone numbers from outside the UK.
After a long search for valid regexen to cover UK cases, I found that the best way (if you're using client side javascript) to validate UK phone numbers is to use libphonenumber-js along with custom config to reduce bundle size:
If you're using NodeJS, generate UK metadata by running:
npx libphonenumber-metadata-generator metadata.custom.json --countries GB --extended
then import and use the metadata with libphonenumber-js/core:
import { isValidPhoneNumber } from "libphonenumber-js/core";
import data from "./metadata.custom.json";
isValidPhoneNumber("01234567890", "GB", data);
CodeSandbox Example