grep with regex for phone number


I would like to extract phone numbers from a file. The numbers come in different forms. I can handle any single form, but I don't know how to write one regex that covers them all. For example:
xxx-xxx-xxxx
(xxx)xxx-xxxx
xxx xxx xxxx
xxxxxxxxxx
I can only handle forms 1, 3, and 4 together:
grep '[0-9]\{3\}[ -]\?[0-9]\{3\}[ -]\?[0-9]\{4\}' file
Is there a single regex that can handle all four of these forms?

grep '\(([0-9]\{3\})\|[0-9]\{3\}\)[ -]\?[0-9]\{3\}[ -]\?[0-9]\{4\}' file
Explanation:
([0-9]\{3\}) three digits inside parentheses
\| or
[0-9]\{3\} three digits not inside parens
...with grouping parentheses - \(...\) - around the alternation so the rest of the regex behaves the same no matter which alternative matches.
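A quick way to check the alternation above is to run it against one sample of each form. This is only a sketch: the sample numbers are made up, and it assumes GNU grep, since `\?` and `\|` in basic regular expressions are GNU extensions.

```shell
# Pattern from the answer above, stored in a variable for readability.
phone_re='\(([0-9]\{3\})\|[0-9]\{3\}\)[ -]\?[0-9]\{3\}[ -]\?[0-9]\{4\}'

# Hypothetical test data: one line per format, plus one line with no number.
# grep -c counts the lines that match.
matches=$(printf '%s\n' \
  '555-123-4567' \
  '(555)123-4567' \
  '555 123 4567' \
  '5551234567' \
  'no number here' | grep -c "$phone_re")

echo "matching lines: $matches"
```

All four formats should be counted and the last line skipped.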

There are usually four patterns of phone numbers
1. xxx-xxx-xxxx: grep -o '[0-9]\{3\}\-[0-9]\{3\}\-[0-9]\{4\}' file.txt
2. (xxx)xxx-xxxx: grep -o '([0-9]\{3\})[0-9]\{3\}\-[0-9]\{4\}' file.txt
3. xxx xxx xxxx: grep -o '[0-9]\{3\}\s[0-9]\{3\}\s[0-9]\{4\}' file.txt
4. xxxxxxxxxx: grep -o '[0-9]\{10\}' file.txt
In all
grep -o '\([0-9]\{3\}\-[0-9]\{3\}\-[0-9]\{4\}\)\|\(([0-9]\{3\})[0-9]\{3\}\-[0-9]\{4\}\)\|\([0-9]\{10\}\)\|\([0-9]\{3\}\s[0-9]\{3\}\s[0-9]\{4\}\)' file.txt
Of course, one could simplify the regex above but we can also leave this simplification to grep itself ~

This is just a modified version of Alan Moore's solution. It guards against cases where the last part of the number has more than four digits, or where the total number of digits is more than 10:
grep '\(\(([0-9]\{3\})\|[0-9]\{3\}\)[ -]\?\)\{2\}[0-9]\{4\} '
Explanation:
\(([0-9]\{3\})\|[0-9]\{3\}\) matches exactly three digits (e.g. 234)
with or without surrounding parentheses. \| performs the 'OR' operation.
The first \( ... \) groups together the above format followed by a space or - or no space at all - ([ -]\?) does that.
The \{2\} matches exactly two occurrences of the above
The [0-9]\{4\} ' matches exactly one occurrence for a 4 digit number followed by a space
And it's a bit shorter as well. Tested on RHEL and Ubuntu. Cheers!!

You can just OR (|) your regexes together -- will be more readable that way too!

My first thought is that you may find it easier to see if your candidate number matches against one of four regular expressions. That will be easier to develop/debug, especially as/when you have to handle additional formats in the future.
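One way to keep the four patterns separate, as suggested above, is to pass each one to grep with its own `-e` option; grep then reports a match if any of them matches. The function name `find_phones` and the sample input are made up for this sketch.

```shell
# Each format gets its own -e pattern, so adding a new format later
# means adding one line rather than editing one long regex.
find_phones() {
  grep -E -o \
    -e '[0-9]{3}-[0-9]{3}-[0-9]{4}' \
    -e '\([0-9]{3}\)[0-9]{3}-[0-9]{4}' \
    -e '[0-9]{3} [0-9]{3} [0-9]{4}' \
    -e '[0-9]{10}' \
    "$@"
}

# Usage: reads the named files, or standard input if none are given.
printf 'call 555-123-4567 now\nor (555)123-4567\nnothing\n' | find_phones
```

The `-o` option prints only the matched number rather than the whole line.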

grep -P '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}\ [0-9]{3}\ [0-9]{4}|[0-9]{10}|\([0-9]{3}\)[0-9]{3}-[0-9]{4}'

Try this one:
^(\d{10}|((([0-9]{3})\s){2})[0-9]{4}|((([0-9]{3})\-){2})[0-9]{4}|([(][0-9]{3}[)])[0-9]{3}[-][0-9]{4})$
This only applies to the formats you mention above:
xxxxxxxxxx
xxx xxx xxxx
xxx-xxx-xxxx
(xxx)xxx-xxxx

We can put all the required phone number validations one by one using an or condition which is more likely to work well (but tiresome coding).
grep '^[0-9]\{10\}$\|^[0-9]\{3\}[-][0-9]\{3\}[-][0-9]\{4\}$\|^[0-9]\{3\}[ ][0-9]\{3\}[ ][0-9]\{4\}$\|^[(][0-9]\{3\}[)][0-9]\{3\}[-][0-9]\{4\}$' phone_number.txt
returns all the specific formats :
920-702-9999
(920)702-9999
920 702 9999
9207029999

\+?(1[ -])?(\(\d{3}\)[ -]|(\d{3}[ -]?)){2}\d{4}
works for:
123-678-1234
123 678 1234
(123)-678-1234
+1-(123)-678-1234
1-(123)-678-1234
1 123 678 1234
1 (123) 678 1234

grep -oE '\(?\<[0-9]{3}[-) ]?[0-9]{3}[ -]?[0-9]{4}\>'
Matches all your formats.
The \< and \> word boundaries prevent matching numbers that are too long, such as 123-123-12345 or 1234-123-1234
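The effect of the word boundaries can be checked with a small sketch. The sample lines are made up, and `\<` and `\>` are assumed to be available, as they are in GNU grep.

```shell
# Hypothetical sample lines: two valid numbers and two that are too long.
samples='123-123-1234
123-123-12345
1234-123-1234
(123)123-1234'

# Same pattern as above; -o prints only the matched text.
bounded=$(printf '%s\n' "$samples" |
  grep -oE '\(?\<[0-9]{3}[-) ]?[0-9]{3}[ -]?[0-9]{4}\>')

printf '%s\n' "$bounded"
```

Only the first and last sample lines should be printed; the two over-long numbers are rejected rather than partially matched.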

I got this:
debian:tmp$ cat p.txt
333-444-5555
(333)333-6666
123 456 7890
1234567890
debian:tmp$ egrep '\(?[0-9]{3}[ )-]?[0-9]{3}[ -]?[0-9]{4}' p.txt
333-444-5555
(333)333-6666
123 456 7890
1234567890
debian:tmp$ egrep --version
GNU grep 2.5.3
Copyright (C) 1988, 1992-2002, 2004, 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
debian:tmp$

Related

How to prefer use of an optional part of a regular expression?

Is there a way of telling a regular expression (specifically sed) to prefer using an optional component when the input also matches without using that component?
I'm trying to extract a number from a string that may optionally be preceded by a prefix. It works in the following cases:
echo dummy/123456/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
123456
echo dummy/prefix123456/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
123456
but if the string contains both a prefixed number and a "bare" number, it chooses the bare number:
echo dummy/prefix123456/987654/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
987654
Is there a way of forcing sed to prefer the match including the prefix (123456)? All search results I've found talk of greedy/lazy options, which – as far as I can tell – don't apply here.
Clarifications
The dummy portions in the examples above may contain slashes.
The bit I'm interested in is either the first slash-delimited run of three or more digits (.../123456/...) or the first slash-delimited run of 3+ digits with a prefix (.../prefix123456/...), whichever occurs first.

You may try this sed command:
sed '
/.*\/prefix\([0-9]\{3,\}\)\/.*/{
s//\1/
b
}
s/.*\/\([0-9]\{3,\}\)\/.*/\1/
' file
which will print out
123456
123456
123456
123456
where the content of file is
dummy/123456/dummy
dummy/prefix123456/dummy
dummy/prefix123456/987654/dummy
dummy/987654/prefix123456/dummy

With GNU awk you could try the following code. Written and tested with the shown samples only.
awk 'match($0,/\/(prefix){0,1}([0-9]+)/,arr){print arr[2]}' Input_file
Explanation: this uses GNU awk's match function with the regex \/(prefix){0,1}([0-9]+), which has two capturing groups. The matched values are stored in the array arr, and if the match succeeds, the second element of that array (the digits) is printed.

sed's BRE and ERE have no lazy quantifier, so you cannot write the leading .* as .*?.
However, based on your use-cases, you may use this sed:
sed -E 's~[^/]*/(prefix){0,1}([0-9]{3,})/.*~\2~' file
123456
123456
123456
where input is:
cat file
dummy/123456/dummy
dummy/prefix123456/dummy
dummy/prefix123456/987654/dummy
Here we are using negated character class (bracket expression) [^/]* instead of .* to allow pattern to match 0 or more of any char that is not a /.
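The difference between the greedy `.*` and the negated class can be seen side by side on the problem line from the question. This is a small sketch; the variable names are made up.

```shell
line='dummy/prefix123456/987654/dummy'

# Greedy .* swallows as much as it can, so the regex ends up
# capturing the last slash-delimited number.
greedy=$(printf '%s\n' "$line" |
  sed -E 's~.*/(prefix){0,1}([0-9]{3,})/.*~\2~')

# [^/]* cannot cross a slash, so the match stops at the first
# slash-delimited field and the optional prefix gets used.
bounded=$(printf '%s\n' "$line" |
  sed -E 's~[^/]*/(prefix){0,1}([0-9]{3,})/.*~\2~')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

Note that this relies on the leading dummy portion containing no slashes.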
If you can consider perl then .*? with a negative lookahead will work for you:
perl -pe 's~^.*?/(?:prefix)?(\d{3,})(?!.*prefix\d{3}).*~$1~' file

Find the first name that starts with any letter other than S using regex

I am new to regex. From a text file, I am trying to find the lines where the last name starts with S, followed by a comma and a space, and the first name does not start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter

The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
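The difference between `*` and `+` mentioned above only shows up when a last name is the single letter S. A small sketch, using made-up sample data that includes such a line:

```shell
# Hypothetical sample data; "S, Paul" is the single-letter edge case.
people='Randall, Steven
Schmidt, Paul
Sells, Simon
Smith, Peter
S, Paul'

# * allows zero letters after the S, so "S, Paul" is accepted.
loose=$(printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]*, [^S]')

# + requires at least one more letter, so "S, Paul" is rejected.
strict=$(printf '%s\n' "$people" | grep -E -e '^S[[:alpha:]]+, [^S]')

printf 'with *:\n%s\n' "$loose"
printf 'with +:\n%s\n' "$strict"
```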

Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

egrep command for lines that have one or more instance of 1234 but no other numbers?

So I'm fairly new to regular expressions and I'm wondering how this would be implemented as an egrep command.
I basically want to look for lines in a file that have one or more instances of "1234", but no other numbers. (non-digit characters are allowed).
Examples:
1234 - valid
12341234 - valid
12345 - invalid (since 5 is there)

You can use grep to extract the lines that contain 1234, then replace 1234 with something that doesn't appear in the input, then remove lines that still contain any digits, and replace the special string back by 1234:
< input-file grep 1234 \
| sed 's/1234/\x1/g' \
| grep -v '[0-9]' \
| sed 's/\x1/1234/g'

So, we want to select lines that have 1234 one or more times but no other digits:
grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
How it works
The regex begins with ^ and ends with $. That means that it must match the whole line.
Inside the regex are two parts:
([^[:digit:]]*1234)+ matches one or more 1234 with no other digits.
[^[:digit:]]* matches any non-digits that follows the last 1234.
In olden times, one would use [0-9] to match digits. With Unicode, that is no longer reliable, so we use [:digit:], which is Unicode-safe.
Example
Let's use this test file:
$ cat file
this 1234 is valid
12341234 valid
not valid 12345
not 2 valid 1234 line
no numbers so not valid
Here is the result:
$ grep -E '^([^[:digit:]]*1234)+[^[:digit:]]*$' file
this 1234 is valid
12341234 valid

If you want no other digit after your 1234 block:
egrep '\<(1234)+(\>|[^0-9])' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
[^0-9] --> non-digit characters
+ --> one or more times
If you want only "words" made up by the "1234" block, then you can egrep this:
egrep '\<(1234)+\>' *
\< and \> --> word delimiters
(1234) --> the word you're looking for
+ --> one or more times

Grep pattern between quotes

I'm trying to grep a code base to find alpha numeric codes between quotes. So, for example my code base might contain the line
some stuff "A234DG3" maybe more stuff
And I'd like to output: A234DG3
I'm lucky in that I know my string is 7 long and only integers and the letters A-Z, a-z.
After a bit of playing I've come up with the following, but it's just not coming out with what I'd like
grep -ro '".*"' . | grep [A-Za-z0-9]{7} | less
Where am I going wrong here? It feels like grep should give me what I want, but am I better off using something else? Cheers!

The problem is that an RE is pretty much required to match the longest sequence it can. So, given something like:
a "bcd" efg "hij" klm "nop" q
A pattern of ".*" should match: "bcd" efg "hij" klm "nop" (everything from the first quote to the last quote), not just "bcd".
You probably want a pattern more like "[^"]*" to match the open-quote, an arbitrary number of other things, then a close quote.
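The two patterns can be compared directly on the example line. This is a minimal sketch; the variable names are made up.

```shell
line='a "bcd" efg "hij" klm "nop" q'

# Greedy: one match running from the first quote to the last quote.
greedy=$(printf '%s\n' "$line" | grep -o '".*"')

# Negated class: the match cannot run past a closing quote,
# so each quoted string is reported separately.
tight=$(printf '%s\n' "$line" | grep -o '"[^"]*"')

printf 'greedy:\n%s\n' "$greedy"
printf 'tight:\n%s\n' "$tight"
```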

Using basic or extended POSIX regular expressions, there is no way to extract just the value between the quotes with grep. Because of that, I would use sed for a portable solution:
sed -n 's/.*\"\([^"]\+\)".*/\1/p' <<< 'some stuff "A234DG3" maybe more stuff'
However, having GNU goodies, GNU grep will support PCRE expressions with the -P command line option. You can use this:
grep -oP '.*?"\K[^"]+(?=")' <<< 'some stuff "A234DG3" maybe more stuff'
.*?" matches everything up to and including the first quote. The \K escape clears the matching buffer and therefore works like a handy, dynamic lookbehind assertion. (I could have used a real lookbehind but I like \K.) [^"]+ matches the text between the quotes. (?=") is a lookahead assertion that ensures a " follows the match, without including it in the match.

So after more playing about I've come up with this which gives me what I'm after:
grep -r -E -o '"[A-Za-z0-9]{7}"' . | less
With the -E allowing the use of the {7} length matcher

Shell script linux, validating integer

I think this code checks whether a value is an integer. I found it on the internet and I'm trying to understand what each part of the line means. I have been reading the grep man pages, but it's really difficult for me. Could anyone explain what each part of the grep command means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!

Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9
Here -E is for extended regex support and -q is for quiet in grep command.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"

I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful, you could easily analyze the
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: ESC character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetition
[0-9]: range of numbers from 0-9
+: one or more repetitions
$: end of line (no more characters allowed)
so it accepts values like: -312353243 or +1243 or 5678
but does not accept: 3 456 or 6.789 or 56$ (because of the dollar sign).
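The accepted and rejected values listed above can be checked by wrapping the grep call in a small function. The name `is_integer` is made up for this sketch, and it assumes the value is a single line of text.

```shell
# Succeeds (exit status 0) if the argument is an optionally signed integer.
# printf is used instead of echo so that values starting with "-" are
# passed through unchanged.
is_integer() {
  printf '%s\n' "$1" | grep -Eq '^(\+|-)?[0-9]+$'
}

for value in -312353243 +1243 5678 '3 456' 6.789 '56$'; do
  if is_integer "$value"; then
    echo "$value: integer"
  else
    echo "$value: not an integer"
  fi
done
```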
