I have a file, adr.txt, with the following contents:
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
Street King, UK, 00939
Street Williams, US, 19123
Warren Avenue, UK, 93891
Street Court, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Street Front, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Street Olive, US, 63007
Street Williams, US, 54675
Franklin Avenue, US, 82479
I need to swap the word "Street" and the street name so that the name comes first, followed by the word "Street":
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
King Street, UK, 00939
Williams Street, US, 19123
Warren Avenue, UK, 93891
Court Street, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Front Street, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Olive Street, US, 63007
Williams Street, US, 54675
Franklin Avenue, US, 82479
For example, this sed command works for me:
sed -i 's/\(Street\) \(King\)/\2 \1/' adr.txt
What regular expression would catch every street name automatically?
I tried:
sed -i 's/\(Street\) \([WO]{1}[a-z]*[es]{1}\,\)/\2 \1/' adr.txt
I checked the regular expression [WO]{1}[a-z]*[se]{1}\, on regex101.com. It finds the names "Williams," and "Olive," but it does not work in the sed command.
Your regex attempt was very weird. The simple sed solution would be
sed -i 's/^\(Street\) \([^ ,]*\)/\2 \1/' adr.txt
Though perhaps better to use Awk for this.
awk '/^Street [^ ,]+,/ {
two=$2; $2=$1 ",";
sub(/,$/, "", two);
$1=two }1' adr.txt >newfile
mv newfile adr.txt
As an aside, https://regex101.com/ supports a number of regex dialects, but none of them is exactly the one understood by sed.
Also, {1} in a regex is never useful; if you want to repeat something once, the expression before {1} already does exactly that.
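As a rough illustration of the {1} point, here is a small Python check. Python's re module is used only for demonstration and is not the sed dialect; the sample strings "King," and "Wes," are made up. In sed's default basic syntax an unescaped {1} is also taken literally (the interval would have to be written \{1\}, or the command run with -E), which is likely why the attempt matched nothing.

```python
import re

# The asker's pattern with {1}, and the same pattern with {1} removed.
with_interval = re.compile(r"[WO]{1}[a-z]*[es]{1},")
without_interval = re.compile(r"[WO][a-z]*[es],")

# "King," and "Wes," are hypothetical extra samples.
samples = ["Williams,", "Olive,", "King,", "Wes,"]

# Each tuple holds (match with {1}, match without {1}) for one sample.
results = [
    (bool(with_interval.fullmatch(s)), bool(without_interval.fullmatch(s)))
    for s in samples
]

for sample, outcome in zip(samples, results):
    print(sample, outcome)
```

Both patterns agree on every sample, so the {1} changes nothing.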
$ cat file
Warren Avenue, UK, 93891
Street Court, UK, 89730
Street Front, US, 41911
Cedar Lane, UK, 21563
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Street Muhammad Ali, US, 82479
Street Albert Einstein Jr., US, 82479
awk -F',' -v OFS="," '/^Street /{$1=gensub(/^(Street) (.*)$/,"\\2 \\1",1,$1)}1' file
Warren Avenue, UK, 93891
Court Street, UK, 89730
Front Street, US, 41911
Cedar Lane, UK, 21563
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Muhammad Ali Street, US, 82479
Albert Einstein Jr. Street, US, 82479
Simple sed solution:
sed -E 's/^(Street) ([^,]+)/\2 \1/' adr.txt
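For comparison, the same substitution can be sketched in Python. This is only an illustration of the pattern, not a replacement for the sed or awk answers; the function name swap_street is hypothetical.

```python
import re

def swap_street(line: str) -> str:
    """Move a leading 'Street' behind the street name.

    The street name is taken to be everything up to the first comma,
    so multi-word names are handled as well.
    """
    return re.sub(r"^Street ([^,]+)", r"\1 Street", line)

lines = [
    "Street King, UK, 00939",
    "Warren Avenue, UK, 93891",
    "Street Albert Einstein Jr., US, 82479",
]

for line in lines:
    print(swap_street(line))
```

Lines that do not start with "Street " are returned unchanged.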
This is the address format. The street number may vary (12 or 412, for example), and so may the number of words in the street name (here, Finch Ave East):
1460 Finch Ave East, Toronto, Ontario, A1A1A1
So I tried this:
^[0-9]+\s+[a-zA-Z]+\s+[a-zA-Z]+\s+[a-zA-Z]+[,]{1}+\s[a-zA-Z]+[,]{1}+\s+[a-zA-Z]+[,]{1}+\s[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$
I usually recommend using regex capture groups, so you can break your matching problem down into smaller parts. In most cases I use \d, \w and \s to match digits, word characters and whitespace.
I usually experiment on https://regex101.com before I put it into code, because it provides a nice interactive way to play with expressions and samples.
For your question, the expression I came up with is:
$regexp = "^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$"
In PowerShell I like to use the direct regex class, because it offers more granularity than the standard -match operator.
# Example match and results
$sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"
$match = [regex]::Match($sample, $regexp)
$match.Success
$match | Select -ExpandProperty groups | Format-Table Name, Value
# Constructed fields
@{
number = $match.Groups[1]
street = $match.Groups[2]
city = $match.Groups[4]
state = $match.Groups[5]
areacode = $match.Groups[6]
}
This results in $match.Success being $true, and the Groups list contains the following numbered capture groups:
Name Value
---- -----
0 1460 Finch Ave East, Toronto, Ontario, A1A1A1
1 1460
2 Finch Ave East
3 East
4 Toronto
5 Ontario
6 A1A1A1
7 A1
For constructing the fields, you can ignore 3 and 7 as those are partial-groups:
Name Value
---- -----
areacode A1A1A1
street Finch Ave East
city Toronto
state Ontario
number 1460
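The group numbering can be checked outside PowerShell too. Below is a minimal Python sketch using the same expression; it assumes Python's re module treats these constructs the same way as .NET, which holds for this pattern. It shows why groups 3 and 7 hold only a fragment: a repeated inner group keeps just its last repetition.

```python
import re

# Same expression as above; groups are numbered by opening parenthesis.
pattern = re.compile(r"^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$")

sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"
match = pattern.match(sample)

# groups() returns groups 1..7 as a tuple.
groups = match.groups()

for index, value in enumerate(groups, start=1):
    print(index, value)
```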
To add to mákos' excellent answer, I would suggest using named capture groups and the $Matches automatic variable. This makes it easy to grab the individual fields and turn them into objects for multiple input strings:
function Split-CanadianAddress {
param(
[Parameter(Mandatory,ValueFromPipeline)]
[string[]]$InputString
)
$Pattern = "^(?<Number>\d+)\s*(?<Street>(\w+\s*)+),\s*(?<City>(\w+\s*)+),\s*(?<State>(\w+\s*)+),\s*(?<AreaCode>(\w\d)*)$"
foreach($String in $InputString){
if($String -match $Pattern){
$Fields = @{}
$Matches.Keys |Where-Object {$_ -isnot [int]} |ForEach-Object {
$Fields.Add($_,$Matches[$_])
}
[pscustomobject]$Fields
}
}
}
The $Matches hashtable will contain both the numbered and the named capture groups, which is why I copy only the named entries to the $Fields variable before creating the pscustomobject.
Now you can use it like:
PS C:\> $sample |Split-CanadianAddress
Street : Finch Ave East
State : Ontario
AreaCode : A1A1A1
Number : 1460
City : Toronto
I've updated the pattern to allow for spaces in city and state names as well (think "New Westminster, British Columbia").
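The named-group idea carries over to other engines. Here is a hedged Python sketch of the same approach: the function name split_canadian_address is hypothetical, the inner repetitions are written as non-capturing groups so that only the named fields are captured, and the second sample address is made up for illustration.

```python
import re

# Named groups use (?P<name>...) in Python; (?:...) does not capture.
PATTERN = re.compile(
    r"^(?P<Number>\d+)\s*(?P<Street>(?:\w+\s*)+),\s*"
    r"(?P<City>(?:\w+\s*)+),\s*(?P<State>(?:\w+\s*)+),\s*"
    r"(?P<AreaCode>(?:\w\d)*)$"
)

def split_canadian_address(text):
    """Return a dict of the named fields, or None if the text does not match."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None

first = split_canadian_address("1460 Finch Ave East, Toronto, Ontario, A1A1A1")
# Hypothetical address with spaces in the city and state names.
second = split_canadian_address("12 Main St, New Westminster, British Columbia, V3L1A1")

print(first)
print(second)
```

Because groupdict() returns only named groups, no filtering of numbered entries is needed here.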
My program reads a file, and I don't want it to process empty lines:
while (<FICC>) {
my $ligne=$_;
if ($ligne =~ /^\s*$/){}else{
print " $ligne\n";}
}
But this code still prints empty lines.
The file I am testing with contains:
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
sir christopher warren US Secretary of state secretary of state
external economic case federal economic affair conference the Federal Office case
US bill clinton bill clinton Mr. Bush
Nestle food cs holding swiss Swiss Performance Index Performance
I think it is because of using the \n within your code. Just remove that \n from your code and it should be fine.
Usually people do chomp after reading a line from a file to remove the end of line character.
An easier way to write that is probably to invert the logic and only print lines that contain non-whitespace characters.
while (<FICC>) {
my $ligne = $_;
if ($ligne =~ /\S/) {
print " $ligne"; # No need for linefeed here as $ligne already has one
}
}
Update: Demo using your sample data:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
my $ligne = $_;
if ($ligne =~ /\S/) {
print " $ligne";
}
}
__END__
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
sir christopher warren US Secretary of state secretary of state
external economic case federal economic affair conference the Federal Office case
US bill clinton bill clinton Mr. Bush
Nestle food cs holding swiss Swiss Performance Index Performance
Output:
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
sir christopher warren US Secretary of state secretary of state
external economic case federal economic affair conference the Federal Office case
US bill clinton bill clinton Mr. Bush
Nestle food cs holding swiss Swiss Performance Index Performance
Which seems correct to me.
The reason is that "$ligne\n" adds a newline to the end of a string that already ends with one, so use chomp as below.
I think the nicer way of doing this is with next (skip to next loop iteration) as it removes some brackets from your code:
while (<FICC>) {
chomp(my $ligne = $_); # chomp returns a count, not the string, so assign first
next if $ligne =~ /^\s*$/;
print " $ligne\n";
}
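The same two steps (strip the line ending, then skip lines with nothing but whitespace) can be sketched in Python for comparison. This is an illustration only; the function name non_blank_lines and the sample text are hypothetical.

```python
import io

def non_blank_lines(handle):
    """Yield each line without its trailing newline, skipping blank lines."""
    for raw in handle:
        line = raw.rstrip("\n")   # the counterpart of chomp
        if not line.strip():      # empty or whitespace only
            continue              # the counterpart of next
        yield line

# io.StringIO stands in for an open file handle.
sample = io.StringIO("first\n\n   \t\nsecond\n")
kept = list(non_blank_lines(sample))

for line in kept:
    print(" " + line)
```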
I have several strings in my text file that look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace one character with another. This time I want to replace the comma-space sequence with a pipe, but only for the first match, so that the country name is not affected.
I need to convert it to something like this:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, so the country name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
I tried sed 's/, /|/' file.txt, but it doesn't work.
The output should look like this:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first occurrence of the pattern in the pattern buffer, unless you pass the g option.
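For contrast, here is a small Python sketch of "first match only" versus "every match". Note that Python's re.sub has the opposite default from sed: it replaces every match unless count is given.

```python
import re

line = "Brisbane, Queensland, Australia|BNE"

# count=1 mirrors sed's default behaviour (no g flag).
first_only = re.sub(r", ", "|", line, count=1)

# Without count, every match is replaced, like sed's g flag.
every_match = re.sub(r", ", "|", line)

print(first_only)
print(every_match)
```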
Since you have not posted the expected output for your sample file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, the line is reconstructed.
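The field-counting logic can be sketched in Python as follows. This is an approximation of the awk one-liner rather than an exact port: the function name pipe_first_field is hypothetical, and where the awk version keeps only the first three fields, this sketch rejoins all remaining fields.

```python
import re

def pipe_first_field(line: str) -> str:
    """Replace the first comma separator with '|' only when the line
    has more than two comma-separated fields."""
    fields = re.split(r", *", line)
    if len(fields) > 2:
        return fields[0] + "|" + ", ".join(fields[1:])
    return line

for line in ["Brisbane, Queensland, Australia|BNE", "Bristol, VA|TRI"]:
    print(pipe_first_field(line))
```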
Using only a regex, how can I extract the state, which is the text just before the third comma?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as @Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
Then take capture group 1 as the desired string.
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'
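Python

As one more data point, here is a hedged Python sketch showing both routes on the same input: a regex that skips two comma-terminated fields and captures the next two, and plain field-splitting. The variable names are illustrative only.

```python
import re

address = "54 West 21st Street Suite 603, New York,New York,United States, 10010"

# Regex route: skip two fields, then capture the third and fourth.
match = re.match(r"(?:[^,]*,){2}([^,]*),([^,]*)", address)
state, country = match.group(1), match.group(2)

# Splitting route: split on commas and trim surrounding whitespace.
fields = [field.strip() for field in address.split(",")]

print(state, "/", country)
print(fields[2], "/", fields[3])
```

Splitting is usually easier to read when the tool supports it; the regex route is useful when only a pattern can be supplied.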