GNU Sed REGEX find and alter string (no replace) - regex

I have the following text:
11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE 02/12 387
Bisham village Bisham Buckinghamshire SL7 1RR United Kingdom Neil Noakes 06/13 488
6 Kynaston Road London London N16 0EX United Kingdom MR N P SALTMARSH 04/13 907
116 Long Acre London London WC2E 9SU United Kingdom Lorna J Gradden 11/14 415
How can I use sed to match the dates in "mm/yy" format and change them to "|mm/yy|"?
Like: 11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE|02/12|387
Thanks!

does this work for you?
sed -r 's# ([0-9]{2}/[0-9]{2}) #|\1|#' file

Example 1
cat t.txt | sed -E 's/([0-9]{2}\/[0-9]{2})/|\1|/g'
11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE |02/12| 387
Bisham village Bisham Buckinghamshire SL7 1RR United Kingdom Neil Noakes |06/13| 488
6 Kynaston Road London London N16 0EX United Kingdom MR N P SALTMARSH |04/13| 907
116 Long Acre London London WC2E 9SU United Kingdom Lorna J Gradden |11/14| 415
or
sed -E 's/([0-9]{2}\/[0-9]{2})/|\1|/g' t.txt
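For comparison, the same substitution can be sketched in Python. This is only an illustration of the first sed answer (the one that also consumes the surrounding spaces); the function name mark_dates is made up for this example.

```python
import re

# A space, a two-digit/two-digit date, and a space; the date is captured.
DATE = re.compile(r" ([0-9]{2}/[0-9]{2}) ")

def mark_dates(line):
    """Replace ' mm/yy ' with '|mm/yy|' (hypothetical helper name)."""
    return DATE.sub(r"|\1|", line)

line = "11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE 02/12 387"
print(mark_dates(line))
# prints: 11 Cherrywood Rise Ashford Kent TN25 4QA United Kingdom N B BONE|02/12|387
```

Because the spaces are part of the match but outside the capture group, they are dropped from the result, which is what the question asked for.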

Related

Not understanding group/value/capture attributes of Powershell object matches method

Because of my lack of understanding of PowerShell objects, my question may not be worded accurately. From the PowerShell 7.3 ForEach-Object documentation, I take it that I am using a script block and the PowerShell automatic variable $_, but that is about as close as those docs get to my example.
I'm trying to access each of two parts of a collection of text file type name/address listings. Namely the first three listings (001 - 003) or the second three (004 - 006)
Using $regexListings and $testListings, I have tested that I can access the first three or second three listings using references to the capture groups, e.g. $1 and $2. See this example running here: regex101
When I run the following Powershell code:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches}
Output is:
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 204
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
My interpretation of the Powershell output is:
there are 3 match groups?
no captures available
the value is all of it?
Why does the Powershell script output Captures {0} when the link page (regex101) above describes two capture groups which I can access?
The documentation Groups, Captures, and Substitutions is helpful but doesn't address this kind of issue. I have gone on using trial & error examples like:
ForEach-Object {$_.Matches.Groups}
ForEach-Object {$_.Matches.Captures}
ForEach-Object {$_.Matches.Value}
And I'm still none the wiser.
Information overflow. What's being output is what's relevant to us, the administrators. Capture group 0 is the entire value since $regexListings indeed matches the entire string. This is where PowerShell attempts to be helpful with its rich type system and displays what we may find useful; although, this may just be the implementation of the creators of the cmdlet. So, you were on the right track with $_.Matches.Groups which should've exposed the capture groups and the values for the RegEx matching.
If you're looking to access those values, as mentioned above, you'd have to iterate over .Matches.Groups within that Foreach-Object. What you're passing isn't the individual captures to that cmdlet, but rather the captures of the expression as a whole. This is why you're better off saving to a variable and indexing through the group capture(s) such as: $var.Matches.Groups[0], or $var.Matches.Groups[1], etc. You can also just use the automatic variable $matches to get some confusion out of the way: seeing as it's populated via the -Match operator, you can index through the captures with $matches[n] instead. Using your same example:
$regexListings = '(?s)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings =
'001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road'
$testListings -match $regexListings
$Matches
Which outputs:
True # this is output by -match letting you know it's succeeded in matching.
Name Value
---- -----
1 001 AALTON Alan 25 Every Street ...
0 001 AALTON Alan 25 Every Street ...
Now you have a hashtable that gives a more readable representation of the pattern matching.
In order to access each of two parts of the listings I needed to be able to see them in the output using:
$regexListings = '(?ms)(001.*?003.*?$)|(004.*?006.*?$)'
$testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 0
Length : 102
Value : 001 AALTON Alan 25 Every Street
002 BROWN James 101 Browns Road
003 BROWN Jemmima 101 Browns Road
ValueSpan :
Groups : {0, 1, 2}
Success : True
Name : 0
Captures : {0}
Index : 103
Length : 101
Value : 004 BROWN John 101 Browns Road
005 CAMPBELL Colin 57 Camp Avenue
006 DONNAGAN Dolores 11 Main Road
ValueSpan :
The differences from the question code being:
using multi line modifier (?ms) instead of (?s) in the regex
using {$_.Matches.Captures} as the regex contains capture grouping
You can access these captures by assigning the result to a variable and then indexing into it, e.g.:
$result = $testListings | Select-String -AllMatches -Pattern $regexListings | ForEach-Object {$_.Matches.Captures}
$result[1]
$result[0]
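As a rough cross-check outside PowerShell, the same pattern can be run through Python's re module. This is an analogy only: Python's engine is not the .NET one, it has no Captures collection, and the sample text here is joined with plain \n, so indexes and lengths will not match the PowerShell output above.

```python
import re

# Same pattern as above; (?ms) turns on multiline ($ matches at line ends)
# and dotall (. matches newlines).
pattern = re.compile(r"(?ms)(001.*?003.*?$)|(004.*?006.*?$)")

listings = "\n".join([
    "001 AALTON Alan 25 Every Street",
    "002 BROWN James 101 Browns Road",
    "003 BROWN Jemmima 101 Browns Road",
    "004 BROWN John 101 Browns Road",
    "005 CAMPBELL Colin 57 Camp Avenue",
    "006 DONNAGAN Dolores 11 Main Road",
])

matches = list(pattern.finditer(listings))
for m in matches:
    # group(0) is the whole match; group(1) and group(2) are the two
    # alternatives, and the one that did not take part is None.
    print(m.start(), m.group(1) is not None, m.group(2) is not None)
```

Each match object carries every group of the pattern, but only the alternative that actually matched has a value, which mirrors why Groups shows {0, 1, 2} on both PowerShell matches.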

Using a regex to find specific text between two multi-character dividers (Not HTML)

The problem: I have a text file which represents a list of trips. The characters which indicate a separation between two trips are: a space, 132 '-' chars, and then a new line. I'm trying to find those trips which contain the text 'CUN'. Note that the '-' also can appear in trips, but not in a group of 132; at most, they will appear two in a row.
To give an idea of what the data itself looks like (with slight redactions), here is a sample:
------------------------------------------------------------------------------------------------------------------------------------
ALPA MEAL CODE UPDATES: M* => PILOT PAID MEAL Mm => MEAL MOVED TO THIS SEGMENT Mc => THIS MEAL WAS CHANGED
1DSL EFF 04/01/19 THRU 04/30/19 737 LOS ANGELES APR 2019 04/01/19 PAGE 58
EQP FLT# DPT ARV DPTR ARVL GRND ML FTM ACM DTM IND D/C SU|MO TU WE TH FR|SA
EFF 04/06/19 THRU 04/06/19 ID L5265 - BASIC (HNL) |-- -- -- -- --| 6
RPT: 0803 --|-- -- -- -- --|--
37K 489 LAX DEN 0903 1228 1.02 2.25 2.25 --|-- -- -- -- --|--
37K 1472 DEN SAN 1330 1447 16.38 L 2.17 4.42 6.59 .00 --|-- -- -- -- --|--
RLS: 1502 HTL: XXXXXX SAN DIEGO XXX-XXX-XXXX OP=> XXX-XXX-XXXX --|-- --
XXXXXX XXX-XXX-XXXX
RPT: 0640
37X 662 SAN SFO 0725 0907 1.39 B 1.42 1.42
37K 1122 SFO HNL 1046 1328 20.32 L S 5.42 7.24 10.03 .00
RLS: 1343 HTL: XXXXXX WAIKIKI XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 0915
37K 1221 HNL LAX 1000 1834 1.16 L S 5.34 5.34
37K 2048 LAX LAS 1950 2112 17.23 1.22 6.56 9.12 .00
RLS: 2127 HTL: XXXXXX VEGAS XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1350
37K 640 LAS ORD 1435 2012 1.03 3.37 3.37
37K 1186 ORD LAX 2115 2344 D 4.29 8.06 10.09 .00
RLS: 2359
DAYS- 4 CRD-27.08* FTM-27.08* TAFB- 87.56 INT- .00 NTE- .00 M$-229.74 T/C- .00 .34*
------------------------------------------------------------------------------------------------------------------------------------
EFF 04/12/19 THRU 04/12/19 ID L5266 - BASIC |-- -- -- -- --|--
RPT: 0803 --|-- -- -- -- 12|--
73Q 489 LAX DEN 0903 1228 1.17 2.25 2.25 --|-- -- -- -- --|--
37K 716 DEN AUS 1345 1653 1.02 L 2.08 4.33 --|-- -- -- -- --|--
37K 630 AUS IAH 1755 1855 1.01 1.00 5.33 --|-- --
37K 530 IAH SAT 1956 2059 17.10 1.03 6.36 11.11 .00
RLS: 2114 HTL: XXXXXX SAN ANTONIO XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1324
73G 1967 SAT ORD 1409 1655 1.10 2.46 2.46
37K 246 ORD TPA 1805 2155 21.05 D 2.50 5.36 7.46 .00
RLS: 2210 HTL: XXXXXX TAMPA XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX
RPT: 1815
37K 352 TPA IAD 1900 2109 1.06 2.09 2.09
37K 1448 IAD LAX 2215 0049 D S 5.34 7.43 9.49 .00
RLS: 0104
DAYS- 4 CRD-20.00* FTM-19.55* TAFB- 65.01 INT- .00 NTE- 1.49 M$-159.29 T/C- .05 .25*
------------------------------------------------------------------------------------------------------------------------------------
EFF 04/15/19 THRU 04/15/19 ID L5267 - BASIC (CAM) |-- -- -- -- --|--
RPT: 0803 --|-- -- -- -- --|--
73Q 489 LAX DEN 0903 1228 3.40 2.25 2.25 --|15 -- -- -- --|--
37K 543 DEN MCO 1608 2137 15.53 3.29 5.54 10.49 .00 --|-- -- -- -- --|--
RLS: 2152 HTL: XXXXXX ORLANDO XXX-XXX-XXXX OP=> XXX-XXX-XXXX --|-- --
XXXXXX XXX-XXX-XXXX
RPT: 1245
73Q 1601 MCO EWR 1330 1613 3.06 2.43 2.43
37K 1054 EWR CUN 1919 2220 18.40 D 4.01 6.44 11.05 .00
RLS: 2250 HTL: XXXXXX CANCUN XXX-XXX-XXXX OP=> XXX-XXX-XXXX
XXXXXX XXX-XXX-XXXX
RPT: 1615
37K 1118 CUN SFO 1700 2051 1.54 D S 5.51 5.51
37K 257 SFO LAX 2245 0020 1.35 7.26 10.20 .00
RLS: 0035
DAYS- 4 CRD-20.04* FTM-20.04* TAFB- 64.32 INT- 9.52 NTE- .00 M$-170.94 T/C- .00 .25*
------------------------------------------------------------------------------------------------------------------------------------
With python or similar, this is trivial to do; simply split the string on the divider ' ' + '-' * 132 + '\n', and search within each section for 'CUN'.
So, this is not a question on how to do this programmatically, but more specifically whether it can be done only with regexes. I was looking over the file with a text editor (Sublime Text, which uses the perl Boost library), and the 'obvious' way to find the trips that had CUN in them didn't work:
(?s)^ -{132}\n.+?CUN.+? -{132}\n
Even though the parts of the trip before and after CUN are captured by the non-greedy ., the problem is the expression matches on the first divider, then all the trips afterwards until it finds one with CUN in it, then finishes on the next divider.
If the divider between the sections was a single, unique character, this would be trivially easy to do. If the divider was, for instance, '#', and it didn't appear in the body of the trip, we could use:
(?s)#[^#]+?CUN[^#]+?#
I don't think this is the same case as attempting to parse parts of HTML or XML or other non-regular languages; it feels like it should be much simpler. But what I feel, of course, may not be so. So, my question, is can this be done with just regexes easily? How, specifically? Or is it a typical kind of problem that regexes are bad at?
To state the problem more abstractly:
Consider a text file that possesses sections; each section is separated from another by a unique string of characters called the divider. While some of the characters within the divider may appear in the sections, the entire divider will never appear within a section; in fact the sections are defined by the dividers. What regular expression will allow you to capture an entire section (and only that section) when some part of the section contains some simple string of interest? Or can this not be done?
If you use Sublime, you can find a whole block containing the word CUN with:
^ -{132}(?:\R(?! -{132}$|.*\bCUN\b).*)*+\R.*\bCUN\b.*(?:\R(?! -{132}$).*)*
Explanation
^ Start of a line
-{132} Match a space and 132 hyphens
(?: Non capture group
\R(?! -{132}$|.*\bCUN\b).* Match a newline, then the rest of the line, provided the line is not a divider and does not contain the word CUN
)*+ Close the non capture group and optionally repeat using a possessive quantifier to prevent some backtracking if there is no match
\R Match a newline
.*\bCUN\b.* match a line with the word CUN
(?:\R(?! -{132}$).*)* Optionally match the remaining lines, stopping before the next divider line
Regex demo
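The same idea (never let the match step over a divider line) can be sketched with Python's re module. This is an adaptation under stated assumptions, not the Sublime pattern itself: Python's re has no \R, so the sketch assumes plain \n line endings, and it uses a lazy quantifier instead of the possessive one. The name trips_with_cun and the shortened sample trips are made up for illustration.

```python
import re

DIVIDER = " " + "-" * 132  # a space followed by 132 hyphens

SECTION_WITH_CUN = re.compile(
    r"^ -{132}\n"              # opening divider line
    r"(?:(?! -{132}$).*\n)*?"  # non-divider lines before the CUN line
    r".*\bCUN\b.*\n"           # a line containing the word CUN
    r"(?:(?! -{132}$).*\n)*",  # remaining lines up to the next divider
    re.MULTILINE,
)

def trips_with_cun(text):
    """Return every section (with its opening divider) that mentions CUN."""
    return [m.group(0) for m in SECTION_WITH_CUN.finditer(text)]

# Shortened, made-up trips in the same layout as the sample data.
sample = "\n".join([
    DIVIDER,
    "ID L5265 - BASIC (HNL)",
    "37K 489 LAX DEN 0903 1228",
    DIVIDER,
    "ID L5267 - BASIC (CAM)",
    "37K 1054 EWR CUN 1919 2220",
    "37K 1118 CUN SFO 1700 2051",
    DIVIDER,
    "ID L5266 - BASIC",
    "37K 716 DEN AUS 1345 1653",
    DIVIDER,
]) + "\n"

found = trips_with_cun(sample)
print(len(found))
```

The closing divider is deliberately left unconsumed, so it can serve as the opening divider of the next candidate section.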

Swap two words with regex in a text file using bash

I have file adr.txt with info:
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
Street King, UK, 00939
Street Williams, US, 19123
Warren Avenue, UK, 93891
Street Court, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Street Front, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Street Olive, US, 63007
Street Williams, US, 54675
Franklin Avenue, US, 82479
I need to swap the words "Street" and the name of the street to get the following - the name should come first, and then the word "Street".
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
King Street, UK, 00939
Williams Street, US, 19123
Warren Avenue, UK, 93891
Court Street, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Front Street, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Olive Street, US, 63007
Williams Street, US, 54675
Franklin Avenue, US, 82479
As an example, this "sed" command works for me: sed -i 's/\(Street\) \(King\)/\2 \1/' adr.txt
What regular expression can be used to automatically catch all words with a street name?
I tried sed -i 's/\(Street\) \([WO]{1}[a-z]*[es]{1}\,\)/\2 \1/' adr.txt
I checked the regular expression [WO]{1}[a-z]*[se]{1}\, on regex101.com. It finds the names "Williams," and "Olive," there, but it does not work in the "sed" command.
Your regex attempt was very weird. The simple sed solution would be
sed -i 's/^\(Street\) \([^ ,]*\)/\2 \1/' adr.txt
Though perhaps better to use Awk for this.
awk '/^Street [^ ,]+,/ {
two=$2; $2=$1 ",";
sub(/,$/, "", two);
$1=two }1' adr.txt >newfile
mv newfile adr.txt
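As an aside, https://regex101.com/ supports a number of regex dialects, but none of them is exactly the one understood by sed. In particular, sed's default basic regular expressions treat an unescaped {1} as literal text (the interval syntax there is \{1\}, or {1} with sed -E), which is why the attempt matched nothing.
Also, {1} in a regex is never useful; if you want to repeat something once, the expression before {1} already does exactly that.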
$ cat file
Warren Avenue, UK, 93891
Street Court, UK, 89730
Street Front, US, 41911
Cedar Lane, UK, 21563
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Street Muhammad Ali, US, 82479
Street Albert Einstein Jr., US, 82479
awk -F',' -v OFS="," '/^Street /{$1=gensub(/^(Street) (.*)$/,"\\2 \\1",1,$1)}1' file
Warren Avenue, UK, 93891
Court Street, UK, 89730
Front Street, US, 41911
Cedar Lane, UK, 21563
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Muhammad Ali Street, US, 82479
Albert Einstein Jr. Street, US, 82479
Simple sed solution:
sed -E 's/^(Street) ([^,]+)/\2 \1/' adr.txt
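The last sed pattern carries over to Python almost unchanged. A minimal sketch follows; swap_street is a hypothetical helper name, and the regex is the same "Street, then everything up to the first comma" idea.

```python
import re

# "Street" at the start of the line, then the name up to the first comma.
STREET_FIRST = re.compile(r"^(Street) ([^,]+)")

def swap_street(line):
    """Move a leading 'Street' behind the street name (hypothetical helper)."""
    return STREET_FIRST.sub(r"\2 \1", line)

for line in [
    "Street King, UK, 00939",
    "Street Muhammad Ali, US, 82479",
    "Franklin Avenue, US, 33123",
]:
    print(swap_street(line))
# prints:
# King Street, UK, 00939
# Muhammad Ali Street, US, 82479
# Franklin Avenue, US, 33123
```

Matching up to the comma rather than up to the next space is what lets multi-word names such as "Muhammad Ali" move as a unit.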

Grep Regex: How to find multiple area codes in phone number?

I have a file: each line consists of a name, room number, house address, phone number.
I want to search for the lines that have the area codes of either 404 or 202.
I did "(404)|(202)" but it also gives me lines that had the numbers in the phone number in general instead of from area code, example:
John Smith 300 123 N. Street 808-543-2029
I do not want the above, I am targeting lines like this, examples:
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
Let's consider this test file:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
The distinguishing feature of area codes, as opposed to other three digit numbers, is that they have a space before them and a - after them. Thus, use:
$ grep -E ' (202|404)-' addresses
Danny Brown 173 555 W. Avenue 202-383-1540
Martha Keith 567 322 S. Example 404-653-1200
More complex example
Suppose that phone numbers appear at the end of lines but can have any of the three forms 808-543-2029, 8085432029, or 808 543 2029 as in the following example:
$ cat addresses
John Smith 202 404 N. Street 808-543-2029
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
To select the lines with 202 or 404 area codes:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
If it is possible that the phone numbers are followed by stray whitespaces, then use:
$ grep -E ' (202|404)([- ][[:digit:]]{3}[- ][[:digit:]]{4}|[[:digit:]]{7})[[:blank:]]*$' addresses
Danny Brown 173 555 W. Avenue 2023831540
Martha Keith 567 322 S. Example 404 653 1200
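If the filtering ever has to move out of grep, the last pattern translates to Python roughly as follows. This is a sketch, with [0-9] standing in for [[:digit:]] and [ \t] for [[:blank:]]; has_target_area_code is a made-up name.

```python
import re

# A space, the area code, then the rest of the number in one of the three
# forms (dashes, spaces, or no separators), anchored at the end of the line.
AREA_CODE = re.compile(
    r" (202|404)([- ][0-9]{3}[- ][0-9]{4}|[0-9]{7})[ \t]*$"
)

def has_target_area_code(line):
    """True if the line ends in a phone number with area code 202 or 404."""
    return AREA_CODE.search(line) is not None

addresses = [
    "John Smith 202 404 N. Street 808-543-2029",
    "Danny Brown 173 555 W. Avenue 2023831540",
    "Martha Keith 567 322 S. Example 404 653 1200",
]
selected = [line for line in addresses if has_target_area_code(line)]
print("\n".join(selected))
# prints:
# Danny Brown 173 555 W. Avenue 2023831540
# Martha Keith 567 322 S. Example 404 653 1200
```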
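A word boundary token \b right at the beginning of the expression, such as \b(202|404), is a start, but on its own it still matches the 202 in 808-543-2029, because there is also a boundary right after the hyphen. Requiring the rest of the number as well fixes that, such as \b(202|404)-[0-9]{3}-[0-9]{4}\b.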
Demo.

Print lines until the second field changes

Let's say this is my command line output:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Mike UK 552
Luc US 23
Jenny US 23
I want to print all lines starting from the first line and stop printing once the second field changes to something other than "US", even if there are more "US" lines after that. So I want the output to be:
Mike US 11
John US 3
Dina US 1002
Dan US 44
This is the code I have right now:
awk '$2 == "US"{a=1}$2 != "US"{a=0}a'
It works fine as long as there are no more "US" after the range I matched. So my current code will output like this:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Luc US 23
Jenny US 23
As you may notice, it dropped the "UK" line and kept printing which is not what I'm trying to achieve here.
Here is a generic approach: it prints until the second field changes, regardless of the data in the second field.
awk '$2!=f && NR>1 {exit} 1; {f=$2}' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
This one just tests whether the second field is US and exits if it is not. Maybe closer to your question:
awk '$2!="US" {exit}1' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
I'm sure there is something more elegant, but this does the job:
awk 'BEGIN { P=1 } P == 1 && $2 != "US" { P = 0 }P' filename
This might work for you (GNU sed):
sed '/US/!Q' file
If the line does not contain US quit.
For specifically the second field:
sed '/^\S\+\s\+US\b/!Q' file
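The "stop at the first change" logic shared by the awk and sed answers can also be sketched in Python. Like the generic awk version, it compares against the second field of the first line instead of hard-coding US; the function name is hypothetical.

```python
def until_second_field_changes(lines):
    """Return the leading lines whose second field equals the first line's."""
    kept = []
    expected = None
    for index, line in enumerate(lines):
        fields = line.split()
        second = fields[1] if len(fields) > 1 else ""
        if index == 0:
            expected = second      # remember the first line's second field
        elif second != expected:
            break                  # stop for good, like awk's exit or sed's Q
        kept.append(line)
    return kept

output = [
    "Mike US 11",
    "John US 3",
    "Dina US 1002",
    "Dan US 44",
    "Mike UK 552",
    "Luc US 23",
    "Jenny US 23",
]
print("\n".join(until_second_field_changes(output)))
# prints:
# Mike US 11
# John US 3
# Dina US 1002
# Dan US 44
```

The break is the important part: the code in the question only toggles a flag, so printing resumes when US shows up again, whereas stopping the loop makes the cut-off permanent.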