Using regex to capture phone numbers with spaces inserted at differing points - regex

I want to be able to extract a complete phone number from text, irrespective of how many spaces interrupt the number.
For example in the passage:
I think Emily was her name, and that her number was either 0421032614 or 0423 032 615 or 04321 98 564
I would like to extract:
0421032614
0423032615
0432198564
I can extract the first two using
(\d{4}[\s]?)(\d{3}[\s]?)+
But this is contingent on me knowing ahead of time how the ten numbers will be grouped (i.e. where the spaces will be). Is there any way to capture the ten numbers with a more flexible pattern?

Remove all whitespace first, then loop over the matches of a ten-digit pattern. Note that stripping every space also joins digit runs that were separate in the original text:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String reg = "(\\d{10})";
        String word = "I think Emily was her name, and that her number was either 0421032614 or 0423 032 615 or 04321 98 564";
        word = word.replaceAll("\\s+", ""); // replace all the whitespace with nothing
        Pattern pat = Pattern.compile(reg);
        Matcher mat = pat.matcher(word);
        while (mat.find()) {
            // group 1 is the ten-digit run captured by the pattern
            System.out.println(mat.group(1));
        }
    }
}
Output:
0421032614
0423032615
0432198564
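If stripping every space from the text is too aggressive, one alternative is to let the pattern itself tolerate the spaces. The sketch below is a variant under stated assumptions, not the method from the answer above: the class name PhoneExtractor is made up for illustration, and it assumes exactly ten digits per number with at most one whitespace character between consecutive digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds ten-digit numbers whose digits may be
// separated by single whitespace characters.
final class PhoneExtractor {
    // One digit, then nine more digits, each optionally preceded by one
    // whitespace character. \b keeps the match from starting or ending
    // inside a longer run of digits.
    private static final Pattern TEN_DIGITS =
            Pattern.compile("\\b\\d(?:\\s?\\d){9}\\b");

    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = TEN_DIGITS.matcher(text);
        while (m.find()) {
            // Remove the whitespace inside this one match only.
            numbers.add(m.group().replaceAll("\\s", ""));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String text = "I think Emily was her name, and that her number was "
                + "either 0421032614 or 0423 032 615 or 04321 98 564";
        // prints [0421032614, 0423032615, 0432198564]
        System.out.println(extract(text));
    }
}
```

Because the whitespace is removed per match rather than from the whole text, two unrelated digit groups separated by words are never merged.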

Related

Regex match the number in between numbers

I have a list of string containing time in the following format.
15 min 43 sec
I want to extract only the 43. I was practicing at http://regexr.com/ but could not find an answer. The best I have come up with so far is \d+\s+min+\s+(\d*)+\s+sec, which matches the whole string, but it should match only 43. Thanks in advance.
A rudimentary and fast solution can be... \s(\d+)\s
But try to find a better one ;)
Use lookaround:
(\d+)(?=\s+sec)
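As a rough Java sketch of how the lookahead could be used (the class and method names here are invented for illustration): the lookahead checks that whitespace and "sec" follow the digits without consuming them, so the match itself is just the number.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the lookahead pattern above.
final class SecondsExtractor {
    private static final Pattern SECONDS = Pattern.compile("(\\d+)(?=\\s+sec)");

    static OptionalInt seconds(String input) {
        Matcher m = SECONDS.matcher(input);
        // find() scans the input; matches() would require the whole
        // string to be the number, which is not what we want here.
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1)))
                        : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(seconds("15 min 43 sec").getAsInt()); // prints 43
    }
}
```

Since the pattern uses \s+, an input such as "15min 43sec" (no space before "sec") yields no match; changing it to \s* would accept that form as well.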
The following pattern contains two capturing groups (for minutes and seconds) and allows an arbitrary amount of whitespace between the values. If only the seconds need to be extracted, one group would suffice.
To extract the values, match the pattern against an input (using a Matcher) and read the value of the corresponding group (matcher.group(n), where 1 is the first group):
Pattern pattern = Pattern.compile("(\\d+)\\s*min\\s*(\\d+)\\s*sec");
String[] data = {"15 min 43 sec", "15min 43sec", "15 min 43 sec"};
for (String d : data) {
Matcher matcher = pattern.matcher(d);
if (matcher.matches()) {
int minutes = Integer.parseInt(matcher.group(1));
int seconds = Integer.parseInt(matcher.group(2));
System.out.println(minutes + ":" + seconds);
} else {
System.out.println("no match: " + d);
}
}

Checking phone numbers for equivalence

What's the best way to check phone numbers in different formats for equivalence?
Different formats:
"(708) 399 7222"
"7083997222"
"708-399-7222"
"708399-7222"
"+1 (708) 399-7222"
"+1 (708)399-7222"
Additional Difficulty: what if the phone number isn't prefaced by the country code?
You can't use a single regular expression. To get a canonical representation that can be compared:
1. Replace an initial + with your international call prefix. In many countries this is 00.
2. If the number doesn't start with the prefix, add the prefix and the country code for your country.
3. Remove all non-digits.
This will be sufficient if you only have to deal with calls made from a single country, for example if you are developing something for internal use at a phone company. If you have to accept a wide range of inputs from different countries with various prefixes I suggest finding a well tested library to do this.
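The three steps could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name, method name and parameters are invented, the caller supplies the international call prefix and country code, and national trunk prefixes (such as a leading 0 in some countries) are not handled.

```java
// Hypothetical helper: turns a phone number into a digits-only canonical
// form so that two numbers can be compared with String.equals.
final class PhoneCanonicalizer {
    static String canonical(String raw, String intlPrefix, String countryCode) {
        String s = raw.trim();
        // Step 1: a leading '+' stands for the international call prefix.
        if (s.startsWith("+")) {
            s = intlPrefix + s.substring(1);
        }
        // Step 2: assume a number without the prefix is a domestic one.
        if (!s.startsWith(intlPrefix)) {
            s = intlPrefix + countryCode + s;
        }
        // Step 3: drop everything that is not a digit.
        return s.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        // Assumed values for calls made from the US: prefix 011, country code 1.
        String a = canonical("(708) 399 7222", "011", "1");
        String b = canonical("+1 (708)399-7222", "011", "1");
        System.out.println(a.equals(b)); // prints true
    }
}
```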
You can implement a PhoneEqualityComparer class. If you deal only with US numbers, the following code will work:
sealed class PhoneEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return string.Equals(NormalizePhone(x), NormalizePhone(y));
}
public int GetHashCode(string obj)
{
return NormalizePhone(obj).GetHashCode();
}
private static string NormalizePhone(string phone)
{
if (phone.StartsWith("+1"))
phone = phone.Substring(2);
return Regex.Replace(phone, @"\D", string.Empty);
}
}
Sample usage:
string[] phones = new[] {
"(708) 399 7222",
"7083997222",
"708-399-7222",
"708399-7222",
"+1 (708) 399-7222",
"+1 (708)399-7222"
};
string[] uniquePhones = phones.Distinct(new PhoneEqualityComparer()).ToArray();
Try this regex
\(?\d{3}\)?-? *\d{3}-? *-?\d{4}
Or:
^\+?(\d[\d-]+)?(\([\d-.]+\))?[\d-]+\d$
Regex Demo
This is a pretty neat regex that will replace:
any plus sign followed by one or more digits
any character that is not a digit or a line break character
with a blank string
string text = Regex.Replace( inputString, @"\+\d+|[^0-9\r\n]", "" , RegexOptions.None | RegexOptions.Multiline );
Input:
"(708) 399 7222"
"7083997222"
"708-399-7222"
"708399-7222"
"+1 (708) 399-7222"
"+1 (708)399-7222"
Output:
7083997222
7083997222
7083997222
7083997222
7083997222
7083997222

RegEx. Check and pad string to ensure certain string format is used

Is it possible to take a string, and reformat it to ensure the output is always the same format.
I have an identification number that always follows the same format:
e.g.
166688205F02
16 66882 05 F 02
(15/16) (any 5 digit no) (05/06) (A-Z) (any 2 digit no)
Sometimes these are expressed as:
66882 5F 2
668825F2
66882 5 F 2
I want to take any of these abbreviated forms and pad them into the proper format shown above (defaulting to 16 for the first group).
Is this possible?
Your numbers can be matched by the following regex:
^ *(1[56])? *(\d{5}) *(0?[56]) *([A-Z]) *(\d{1,2}) *$
Here is a rough breakdown. I named the parts of the identification number; you may have more appropriate names for them:
^ * #Start the match at the beginning of a string and consume all leading spaces if any.
(1[56])? #GROUP 1: The Id number prefix. (Optional)
* #Consume spaces if any.
(\d{5}) #GROUP 2: The five digit identifier code.
* #Consume spaces if any.
(0?[56]) #GROUP 3: The two digit indicator code.
* #Consume spaces if any.
([A-Z]) #GROUP 4: The letter code.
* #Consume spaces if any.
(\d{1,2}) #GROUP 5: The end code.
*$ #End the match with remaining spaces and the end of the string.
You didn't mention the language you are using. Here is a function I wrote in C# that uses this regex to reformat an input identification number.
private string FormatIdentificationNumber(string inputIdNumber) {
const string DEFAULT_PREFIX = "16";
const string REGEX_ID_NUMBER = @"^ *(1[56])? *(\d{5}) *(0?[56]) *([A-Z]) *(\d{1,2}) *$";
const int REGEX_GRP_PREFIX = 1;
const int REGEX_GRP_IDENTIFIER = 2;
const int REGEX_GRP_INDICATOR = 3;
const int REGEX_GRP_LETTER_CODE = 4;
const int REGEX_GRP_END_CODE = 5;
Match m = Regex.Match(inputIdNumber, REGEX_ID_NUMBER, RegexOptions.IgnoreCase);
if (!m.Success) return inputIdNumber;
string prefix = m.Groups[REGEX_GRP_PREFIX].Value.Length == 0 ? DEFAULT_PREFIX : m.Groups[REGEX_GRP_PREFIX].Value;
string identifier = m.Groups[REGEX_GRP_IDENTIFIER].Value;
string indicator = m.Groups[REGEX_GRP_INDICATOR].Value.PadLeft(2, '0');
string letterCode = m.Groups[REGEX_GRP_LETTER_CODE].Value.ToUpper();
string endCode = m.Groups[REGEX_GRP_END_CODE].Value.PadLeft(2, '0');
return String.Concat(prefix, identifier, indicator, letterCode, endCode);
}
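You can remove the space characters by replacing them with an empty string.
In JS, for example (replace with a string argument only replaces the first occurrence, so a global regex is needed):
"66882 5F 2".replace(/\s/g, '') // Will output "668825F2"
"66882 5 F 2".replace(/\s/g, '') // Will output "668825F2"
In a regex, \s matches any whitespace character.
First eliminate the spaces, then validate the result with this regex:
^1[56]([0-9]{5})0[56][A-Z]([0-9]{2})$
Note that this regex only accepts the full 12-character form; it does not add any missing padding.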

How to separate a line of input into multiple variables?

I have a file that contains rows and columns of information like:
104857 Big Screen TV 567.95
573823 Blender 45.25
I need to parse this information into three separate items, a string containing the identification number on the left, a string containing the item name, and a double variable containing the price. The information is always found in the same columns, i.e. in the same order.
I am having trouble accomplishing this. Even when not reading from the file and just using a sample string, my attempt just outputs a jumbled mess:
string input = "104857 Big Screen TV 567.95";
string tempone = "";
string temptwo = input.substr(0,1);
tempone += temptwo;
for(int i=1 ; temptwo != " " && i < input.length() ; i++)
{
temptwo = input.substr(i,i);
tempone += temptwo;
}
cout << tempone;
I've tried tweaking the above code for quite some time, but no luck, and I can't think of any other way to do it at the moment.
You can find the first space and the last space using std::string::find_first_of and std::string::find_last_of. Use these to split the string into three parts: the first space comes after the first variable and the last space comes before the third variable, and everything in between is the second variable.
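The same first-space/last-space idea can be sketched in Java (the question is about C++, so treat this only as an illustration of the technique; the class and method names are invented). It assumes the id and the price contain no spaces and that the line has at least three fields.

```java
// Hypothetical helper: splits "id name-with-spaces price" at the first
// and the last space of the line.
final class LineSplitter {
    static String[] split(String line) {
        String t = line.trim();
        int first = t.indexOf(' ');     // end of the id
        int last = t.lastIndexOf(' ');  // start of the price
        if (first < 0 || first == last) {
            throw new IllegalArgumentException("expected at least three fields: " + line);
        }
        String id = t.substring(0, first);
        String name = t.substring(first + 1, last).trim();
        String price = t.substring(last + 1);
        return new String[] { id, name, price };
    }

    public static void main(String[] args) {
        String[] parts = split("104857 Big Screen TV 567.95");
        double price = Double.parseDouble(parts[2]);
        // prints 104857 | Big Screen TV | 567.95
        System.out.println(parts[0] + " | " + parts[1] + " | " + price);
    }
}
```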
How about following pseudocode:
string input = "104857 Big Screen TV 567.95";
string[] parsed_output = input.split(" "); // split input string with 'space' as delimiter
// parsed_output[0] = 104857
// parsed_output[1] = Big
// parsed_output[2] = Screen
// parsed_output[3] = TV
// parsed_output[4] = 567.95
int id = stringToInt(parsed_output[0]);
string product = concat(parsed_output[1], parsed_output[2], ... ,parsed_output[length-2]);
double price = stringToDouble(parsed_output[length-1]);
I hope, that's clear.
Well, try breaking down the file's components:
You know a number always comes first, and a number has no whitespace.
The string following the number CAN have whitespace, but won't contain any digits (I would assume).
After this title, you're going to have another number (with no whitespace).
From these components, you can deduce:
Grabbing the first number is as simple as reading it in with the file stream's >> operator.
Getting the string requires you to grab one character at a time, appending each to a string, until you reach a digit. The last number is read just like the first, using the file stream's >> operator.
This seems like homework, so I'll let you put the rest together.
I would try a regular expression, something along these lines:
^([0-9]+)\s+(.+)\s+([0-9]+\.[0-9]+)$
I am not very good at regex syntax, but ([0-9]+) corresponds to a sequence of digits (this is the id), ([0-9]+\.[0-9]+) is the floating point number (the price) and (.+) is the string that is separated from the two numbers by sequences of "space" characters: \s+.
The next step would be to check if you need this to work with prices like ".50" or "10".
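For illustration, here is how that regular expression might be applied in Java (not the asker's C++; the class and method names are invented). The pattern is used exactly as written above, so a price without a decimal point, such as "10", is rejected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one line with the regex from the answer.
final class LineRegexParser {
    private static final Pattern LINE =
            Pattern.compile("^([0-9]+)\\s+(.+)\\s+([0-9]+\\.[0-9]+)$");

    // Returns {id, name, price}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // The greedy (.+) can keep extra trailing spaces, so trim the name.
        return new String[] { m.group(1), m.group(2).trim(), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("104857 Big Screen TV 567.95");
        // prints 104857 | Big Screen TV | 567.95
        System.out.println(parts[0] + " | " + parts[1] + " | " + Double.parseDouble(parts[2]));
    }
}
```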

Regular Expression to find numbers with same digits in different order

I have been looking for a regular expression with Google for an hour or so now and can't seem to work this one out :(
If I have a number, say:
2345
and I want to find any other number with the same digits but in a different order, like this:
2345
For example, I match
3245 or 5432 (same digits but different order)
How would I write a regular expression for this?
There is an "elegant" way to do it with a single regex:
^(?:2()|3()|4()|5()){4}\1\2\3\4$
will match the digits 2, 3, 4 and 5 in any order. All four are required.
Explanation:
(?:2()|3()|4()|5()) matches one of the numbers 2, 3, 4, or 5. The trick is now that the capturing parentheses match an empty string after matching a number (which always succeeds).
{4} requires that this happens four times.
\1\2\3\4 then requires that all four backreferences have participated in the match - which they do if and only if each number has occurred once. Since \1\2\3\4 matches an empty string, it will always match as long as the previous condition is true.
For five digits, you'd need
^(?:2()|3()|4()|5()|6()){5}\1\2\3\4\5$
etc...
This will work in nearly any regex flavor except JavaScript.
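Writing the pattern by hand gets tedious, so here is a small Java sketch that generates it from a string of digits. The class and method names are invented, and the sketch assumes the input consists of distinct digits (a repeated digit would need a different construction).

```java
import java.util.regex.Pattern;

// Hypothetical helper: builds the "empty capturing group" pattern for a
// given string of distinct digits.
final class DigitPermutation {
    // For "2345" this produces ^(?:2()|3()|4()|5()){4}\1\2\3\4$
    static Pattern build(String digits) {
        StringBuilder alternation = new StringBuilder();
        StringBuilder backrefs = new StringBuilder();
        for (int i = 0; i < digits.length(); i++) {
            if (i > 0) {
                alternation.append('|');
            }
            // Each digit is followed by an empty group that records its use.
            alternation.append(digits.charAt(i)).append("()");
            // The backreference fails unless that group took part in the match.
            backrefs.append('\\').append(i + 1);
        }
        return Pattern.compile(
                "^(?:" + alternation + "){" + digits.length() + "}" + backrefs + "$");
    }

    public static void main(String[] args) {
        Pattern p = build("2345");
        System.out.println(p.matcher("5432").matches()); // prints true
        System.out.println(p.matcher("2245").matches()); // prints false
    }
}
```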
I don't think a regex is appropriate. So here is an idea that is faster than a regex for this situation:
check string lengths, if they are different, return false
make a hash from the character (digits in your case) to integers for counting
loop through the characters of your first string:
increment the counter for that character: hash[character]++
loop through the characters of the second string:
decrement the counter for that character: hash[character]--
break if any count is negative (or nonexistent)
loop through the entries, making sure each is 0:
if all are 0, return true
else return false
EDIT: Java Code (I'm using Character for this example, not exactly Unicode friendly, but it's the idea that matters now):
import java.util.*;
public class Test
{
public boolean isSimilar(String first, String second)
{
if(first.length() != second.length())
return false;
HashMap<Character, Integer> hash = new HashMap<Character, Integer>();
for(char c : first.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count++;
hash.put(c, count);
}
else
{
hash.put(c, 1);
}
}
for(char c : second.toCharArray())
{
if(hash.get(c) != null)
{
int count = hash.get(c);
count--;
if(count < 0)
return false;
hash.put(c, count);
}
else
{
return false;
}
}
for(Integer i : hash.values())
{
if(i.intValue()!=0)
return false;
}
return true;
}
public static void main(String ... args)
{
//tested to print false
System.out.println(new Test().isSimilar("23445", "5432"));
//tested to print true
System.out.println(new Test().isSimilar("2345", "5432"));
}
}
This will also work for comparing letters or other character sequences, like "god" and "dog".
Put the digits of each number in two arrays, sort the arrays, find out if they hold the same digits at the same indices.
RegExes are not the right tool for this task.
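A minimal Java sketch of the sort-and-compare idea (the class and method names are invented for illustration): two numbers consist of the same digits exactly when their sorted character arrays are equal.

```java
import java.util.Arrays;

// Hypothetical helper: compares two digit strings after sorting them.
final class SortedDigits {
    static boolean sameDigits(String first, String second) {
        char[] a = first.toCharArray();
        char[] b = second.toCharArray();
        Arrays.sort(a);
        Arrays.sort(b);
        // Arrays.equals also returns false when the lengths differ.
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        System.out.println(sameDigits("2345", "5432")); // prints true
        System.out.println(sameDigits("2345", "5542")); // prints false
    }
}
```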
You could do something like this to ensure the right characters and length
[2345]{4}
Ensuring they only exist once is trickier and why this is not suited to regexes
(?=.*2.*)(?=.*3.*)(?=.*4.*)(?=.*5.*)[2345]{4}
The simplest regular expression is just all 24 permutations added up via the or operator:
/2345|3245|5432|.../;
That said, you don't want to solve this with a regex if you can get away with it. A single pass through the two numbers as strings is probably better:
1. Check the string length of both strings - if they're different you're done.
2. Build a hash of all the digits from the number you're matching against.
3. Run through the digits in the number you're checking. If you hit a match in the hash, mark it as used. Keep going until you don't get an unused match in the hash or run out of items.
I think it's very simple to achieve if you're OK with matching a number that doesn't use all of the digits. E.g. if you have a number 1234 and you accept a match with the number of 1111 to return TRUE;
Let me use PHP for an example as you haven't specified what language you use.
$my_num = 1245;
$my_pattern = '/[' . $my_num . ']{4}/'; // this resolves to pattern: /[1245]{4}/
$my_pattern2 = '/[' . $my_num . ']+/'; // as above, but numbers can be of any length
$number1 = 4521;
$match = preg_match($my_pattern, $number1); // returns 1 (match)
$number2 = 2222444111;
$match2 = preg_match($my_pattern2, $number2); // returns 1 (match)
$number3 = 888;
$match3 = preg_match($my_pattern, $number3); // returns 0 (no match)
$match4 = preg_match($my_pattern2, $number3); // returns 0 (no match)
Something similar will work in Perl as well.
Regular expressions are not appropriate for this purpose. Here is a Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my $src = '2345';
my @test = qw( 3245 5432 5542 1234 12345 );
my $canonical = canonicalize( $src );
for my $candidate ( @test ) {
next unless $canonical eq canonicalize( $candidate );
print "$src and $candidate consist of the same digits\n";
}
sub canonicalize { join '', sort split //, $_[0] }
Output:
C:\Temp> ks
2345 and 3245 consist of the same digits
2345 and 5432 consist of the same digits