Grep Regex: How to find multiple area codes in phone numbers?

I have a file in which each line consists of a name, room number, house address, and phone number.
I want to search for the lines that have an area code of either 404 or 202.
I tried "(404)|(202)", but it also returns lines where those digits appear anywhere in the phone number rather than in the area code, for example:
John Smith 300 123 N. Street 808-543-2029
I do not want the line above. I am targeting lines like these:
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200

Let's consider this test file:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
The distinguishing feature of area codes in this file, as opposed to other three-digit numbers, is that they have a space before them and a - after them. Thus, use:
$ grep -E ' (202|404)-' addresses
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
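A minimal sketch of why the surrounding space and hyphen matter. It feeds the same three lines through printf instead of a file; the variable names loose and strict are only for illustration and are not part of the original answer.

```shell
# The three test lines from above, kept in a variable for reuse.
lines='John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200'

# Unanchored pattern from the question: 202 or 404 anywhere on the line.
loose=$(printf '%s\n' "$lines" | grep -cE '(404)|(202)')

# Anchored pattern: a space before and a hyphen after the three digits.
strict=$(printf '%s\n' "$lines" | grep -cE ' (202|404)-')

echo "loose=$loose strict=$strict"
```

The unanchored pattern counts the John Smith line because of its room number, while the anchored one counts only the two lines whose phone number starts with 202 or 404.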
More complex example
Suppose that phone numbers appear at the end of lines but can have any of the three forms 808-543-2029, 8085432029, or 808 543 2029 as in the following example:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
To select the lines with 202 or 404 area codes:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
If the phone numbers may be followed by stray whitespace, then use:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})[[:blank:]]*$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
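As a quick check of the longer pattern, here is a sketch that runs it over four lines, one of which has trailing spaces. The fourth line (Ann Lee) is made up for illustration: it has 404 as a room number and 202 as the area code.

```shell
# Pattern from above: area code, then either separated or run-together digits,
# then optional trailing blanks up to the end of the line.
pattern=' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})[[:blank:]]*$'

matched=$(printf '%s\n' \
  'John Smith 202 404 N. Street 808-543-2029' \
  'Danny Brown 173 555 W. Avenue 2023831540  ' \
  'Martha Keith 567 322 S. Example 404 653 1200' \
  'Ann Lee 404 12 E. Road 202-383-1540' |
  grep -cE "$pattern")

echo "matching lines: $matched"
```

Only the John Smith line is rejected: its 202 and 404 are not followed by the remaining seven digits of a phone number.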

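You need to add a word boundary token \b right at the beginning of the expression, such as \b(202|404). On its own this still matches any standalone 202 or 404, such as a room number, so keep the hyphen after the group as well: \b(202|404)-.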


Not understanding group/value/capture attributes of Powershell object matches method

Because of my lack of understanding of PowerShell objects, my question may not be worded accurately. I take it from the PowerShell 7.3 ForEach-Object documentation that I am using a script block and the PowerShell automatic variable $_, but that is about as relevant to my example as these docs get.
I'm trying to access each of two parts of a collection of name/address listings in a text file: namely the first three listings (001 - 003) or the second three (004 - 006).
Using $regexListings and $testListings, I have tested that I can access the first three or second three listings by referring to the capture groups, e.g. $1 and $2. See this example running on regex101.
When I run the following PowerShell code:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches}
Output is:
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 204
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
My interpretation of the Powershell output is:
there are 3 match groups?
no captures available
the value is all of it?
Why does the Powershell script output Captures {0} when the link page (regex101) above describes two capture groups which I can access?
The documentation Groups, Captures, and Substitutions is helpful but doesn't address this kind of issue. I have gone on using trial & error examples like:
ForEach-Object {$_.Matches.Groups}
ForEach-Object {$_.Matches.Captures}
ForEach-Object {$_.Matches.Value}
And I'm still none the wiser.
Information overflow. What's being output is what's relevant to us, the administrators. Capture group 0 is the entire value since $regexListings indeed matches the entire string. This is where PowerShell attempts to be helpful with its rich type system and displays what we may find useful, although this may just be how the creators of the cmdlet implemented it. So you were on the right track with $_.Matches.Groups, which should have exposed the capture groups and their values.
If you're looking to access those values, as mentioned above, you'd have to iterate over .Matches.Groups within that ForEach-Object. What you're passing to that cmdlet isn't the individual captures but the match of the expression as a whole. This is why you're better off saving the result to a variable and indexing into the groups, such as $var.Matches.Groups[0] or $var.Matches.Groups[1]. You can also use the automatic variable $Matches to avoid some of the confusion: it is populated by the -match operator, so you can index into the captures with $Matches[n] instead. Using your same example:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings -match $regexListings
$Matches
Which outputs:
True # this is output by -match letting you know it's succeeded in matching.
Name Value
---- -----
1 001 AALTON Alan 25 Every Street ...
0 001 AALTON Alan 25 Every Street ...
Now you have a hashtable that gives a more representative view of the pattern matching.
In order to access each of two parts of the listings I needed to be able to see them in the output using:
$regexListings = '(?ms)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 102
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
ValueSpan :
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 103
Length : 101
Value : 004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
The differences from the question code being:
using multi line modifier (?ms) instead of (?s) in the regex
using {$_.Matches.Captures} as the regex contains capture grouping
These captures can be accessed by assigning the result to a variable and then indexing, e.g.:
$result = $testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
$result[1]
$result[0]

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
perl -ne 'print "$1\n" while /(#[a-z0-9_]+)/gi' file
This will also work if you have lines with multiple #handles in them.
A Twitter handle regex is #\w+. To remove everything else, match and capture that pattern, let the second alternative match any other single character, and replace each match with a backreference to the capture group:
(#\w+)|.
Use DOTALL mode so that . also matches newline characters. Replace with $1 (or \1, depending on the tool you are using).
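If you would rather do this on the command line than in an editor, a rough equivalent is to print only the matching part of each line. The sketch below swaps in grep -o for the capture-and-replace approach; -o is not required by POSIX but is available in both GNU and BSD grep, and the shortened sample text is only for illustration.

```shell
sample='1
katyperry KATY PERRY (#katyperry)
Followings 158
justinbieber Justin Bieber (#justinbieber)
74 748 878
taylorswift13 Taylor Swift (#taylorswift13)'

# -o prints each match on its own line; lines without a handle print nothing.
handles=$(printf '%s\n' "$sample" | grep -oE '#[[:alnum:]_]+')

printf '%s\n' "$handles"
```

Because every match is printed separately, a line containing several #handles yields one output line per handle.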
Straight regex, tested in Caret:
#.*[^)]
The above matches the # and everything after it on the line, excluding a trailing close parenthesis.
#.*\b
This one does the same thing in the Caret text editor by stopping at the last word boundary.
How to awk and sed this:
Get usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

SED / RegEx Puzzle

I have a file with many log lines like the example below. What I'd like to do is add a CR after each piece of process information. I figured I'd do this with sed using the command below:
sed -rn 's/([0-9]+) \(([a-z._0-9]+)\) ([0-9]+) ([0-9]+)/ \2,\1,\3,\4 \n/gp' < file
This partially works, but I still get the Total: 3266 #015 from the log, which appears at the end of each line. I didn't expect this, as it isn't matched by the regular expression.
I've tested the regular expression on the available websites, where it always looks good and finds what I'd expect. It's only when I combine it with sed that I don't quite get the result I was expecting.
Any help or suggestions would be most appreciated,
Thanks
Andy
This is a single line of the stats:
1 (init) 3686400 123 148 (klogd) 3690496 116 16364 (memlogger.sh) 3686400 144 17 0 225 (dropbear) 1847296 113 242 (mini_httpd) 2686976 167 281 (snmpd) 4812800 231 283 (logmuxd) 2514944 262 284 (watchdog) 3551232 82 285 (controld) 5259264 610 287 (setupd) 5120000 436 289 (checkpoold) 3424256 129 296 (trap_sender_d) 3457024 165 298 (watch) 3686400 114 299 (processwatchdog) 3420160 119 314 (timerd) 3637248 219 315 (init) 3686400 116 16365 (cat) 3694592 120 Total: 3266 #015
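sed prints the whole line after the substitution, including any text the expression did not match, which is why the "Total:" part survives. Just remove it first: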
sed -rn 's/ +Total:.*//;
s/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)/ \2,\1,\3,\4\n/gp'
You can also match the "Total:" optionally:
sed -rn 's/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)( *Total:.*)?/ \2,\1,\3,\4\n/gp'
# ------------^
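A small sketch of the first variant on a shortened version of the log line (two processes instead of the full list). It assumes GNU sed, since -r and \n in the replacement are GNU features, and it splits the script into two -e expressions purely for readability.

```shell
# Shortened sample line; the real log line has many more processes.
line='1 (init) 3686400 123 148 (klogd) 3690496 116 Total: 3266 #015'

out=$(printf '%s\n' "$line" |
  sed -rn \
    -e 's/ +Total:.*//' \
    -e 's/([0-9]+) +\(([a-z._0-9]+)\) +([0-9]+) +([0-9]+)/ \2,\1,\3,\4\n/gp')

printf '%s\n' "$out"
```

Each process ends up on its own line as name,pid,size,count, and the Total: trailer is gone. The space that separated two processes in the input is not part of the match, so every line after the first keeps one extra leading space.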

Print lines until the second field changes

Let's say this is my command line output:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Mike UK 552
Luc US 23
Jenny US 23
I want to print all lines starting from the first line and stop printing once the second field changes to something other than "US", even if there are more "US" lines after that. So I want the output to be:
Mike US 11
John US 3
Dina US 1002
Dan US 44
This is the code I have right now:
awk '$2 == "US"{a=1}$2 != "US"{a=0}a'
It works fine as long as there are no more "US" after the range I matched. So my current code will output like this:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Luc US 23
Jenny US 23
As you may notice, it dropped the "UK" line and kept printing which is not what I'm trying to achieve here.
Here is a generic approach: it prints until the second field changes, regardless of the data in the second field.
awk '$2!=f && NR>1 {exit} 1; {f=$2}' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
This one just tests whether the second field is US and exits if it is not. It may be closer to what you asked for:
awk '$2!="US" {exit}1' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
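The two commands give the same result on the sample data, so here is a sketch of where they differ: input that does not start with "US". The input lines (including Anna) are made up for illustration.

```shell
input='Mike UK 552
Anna UK 7
Luc US 23
Jenny US 23'

# Generic version: stops at the first change in field 2, whatever its value.
generic=$(printf '%s\n' "$input" | awk '$2!=f && NR>1 {exit} 1; {f=$2}')

# "US"-only version: stops at the first line whose field 2 is not "US".
us_only=$(printf '%s\n' "$input" | awk '$2!="US" {exit} 1')

printf 'generic:\n%s\n' "$generic"
printf 'us_only:\n%s\n' "$us_only"
```

The generic version prints the two UK lines and stops at the first US line, while the "US"-only version prints nothing because the very first line already fails the test.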
I'm sure there is something more elegant, but this does the job:
awk 'BEGIN { P=1 } P == 1 && $2 != "US" { P = 0 }P' filename
This might work for you (GNU sed):
sed '/US/!Q' file
If the line does not contain US quit.
For specifically the second field:
sed '/^\S\+\s\+US\b/!Q' file
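To illustrate why the second form is safer, here is a sketch with a name that happens to contain "US". The USHER line is invented for this example, and both commands assume GNU sed (Q, \S, \s and \b are GNU extensions).

```shell
input='Mike US 11
USHER UK 5
Luc US 23'

# Matches US anywhere on the line, so the USHER line does not stop the output.
anywhere=$(printf '%s\n' "$input" | sed '/US/!Q' | grep -c .)

# Matches US only as the second field, so output stops at the USHER line.
second_field=$(printf '%s\n' "$input" | sed '/^\S\+\s\+US\b/!Q' | grep -c .)

echo "anywhere=$anywhere second_field=$second_field"
```

With the looser pattern all three lines are printed; with the field-specific one only the first line is.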

GNU Sed REGEX find and alter string (no replace)

I have the following text:
11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE 02/12 387
Bisham village Bisham Buckinghamshire SL7 1RR United Kingdom Neil Noakes 06/13 488
6 Kynaston Road London London N16 0EX United Kingdom MR N P SALTMARSH 04/13 907
116 Long Acre London London WC2E 9SU United Kingdom Lorna J Gradden 11/14 415
How can I use sed to match the dates in "mm/yy" format and alter them to "|mm/yy|"?
Like: 11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE|02/12|387
Thanks!
Does this work for you?
sed -r 's# ([0-9]{2}/[0-9]{2}) #|\1|#' file
Example 1 (this keeps the spaces around the date):
cat t.txt | sed -E 's/([0-9]{2}\/[0-9]{2})/|\1|/g'
11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE |02/12| 387
Bisham village Bisham Buckinghamshire SL7 1RR United Kingdom Neil Noakes |06/13| 488
6 Kynaston Road London London N16 0EX United Kingdom MR N P SALTMARSH |04/13| 907
116 Long Acre London London WC2E 9SU United Kingdom Lorna J Gradden |11/14| 415
or
sed -E 's/([0-9]{2}\/[0-9]{2})/|\1|/g' t.txt
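The two answers differ in how they treat the spaces around the date. A small sketch on the first sample line, using -E for both commands (the variable names tight and loose are only for illustration):

```shell
line='11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE 02/12 387'

# Includes the surrounding spaces in the match, so they are replaced by the pipes.
tight=$(printf '%s\n' "$line" | sed -E 's# ([0-9]{2}/[0-9]{2}) #|\1|#')

# Matches only the date itself, so the surrounding spaces are kept.
loose=$(printf '%s\n' "$line" | sed -E 's/([0-9]{2}\/[0-9]{2})/|\1|/g')

printf '%s\n%s\n' "$tight" "$loose"
```

Only the first form produces BONE|02/12|387 exactly as requested in the question; the second leaves a space on either side of the pipes.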