Trouble with a sed regex - regex

I'm trying to handle a file containing currencies with sed but can't figure out where my error is.
This is an extract from the file:
AED: United Arab Emirates DirhamAFN: Afghan AfghaniALL: Albanian LekAMD: Armenian DramANG: Netherlands Antillean GuldenAOA: Angolan KwanzaARS: Argentine PesoAUD: Australian DollarAWG: Aruban FlorinAZN: Azerbaijani ManatBAM: Bosnia & Herzegovina Convertible MarkBBD: Barbadian DollarBDT: Bangladeshi TakaBGN: Bulgarian LevBIF: Burundian FrancBMD: Bermudian DollarBND: Brunei DollarBOB: Bolivian BolivianoBRL: Brazilian Real*BSD: Bahamian DollarBWP: Botswana PulaBZD: Belize DollarCAD: Canadian Dollar[...]
I want to add a newline before each group of three uppercase letters followed by the character ":".
What I tried was sed -e 's/\([A-Z]{3}:)/\n\1/g list1.txt > list2.txt, but nothing changes. In fact, even when I just try /[A-Z]{3}/blabla/, nothing happens.
I am puzzled.

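By default sed uses basic regular expressions, where an unescaped {3} is literal text and groups are written \( ... \). Either switch to extended syntax with -r, or escape the braces and both parentheses:
sed -r 's/([A-Z]{3}:)/\n\1/g' list1.txt
# or
# sed -e 's/\([A-Z]\{3\}:\)/\n\1/g' list1.txt
returns: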
AED: United Arab Emirates Dirham
AFN: Afghan Afghani
ALL: Albanian Lek
AMD: Armenian Dram
ANG: Netherlands Antillean Gulden
AOA: Angolan Kwanza
ARS: Argentine Peso
AUD: Australian Dollar
AWG: Aruban Florin
AZN: Azerbaijani Manat
BAM: Bosnia & Herzegovina Convertible Mark
BBD: Barbadian Dollar
BDT: Bangladeshi Taka
BGN: Bulgarian Lev
BIF: Burundian Franc
BMD: Bermudian Dollar
BND: Brunei Dollar
BOB: Bolivian Boliviano
BRL: Brazilian Real*
BSD: Bahamian Dollar
BWP: Botswana Pula
BZD: Belize Dollar
CAD: Canadian Dollar
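For comparison, the same substitution can be sketched with Python's re module, whose syntax is close to sed -r here. This is only an illustration, not part of the sed answer; the function name split_currencies is made up, and the lstrip call is an added assumption that drops the blank line the substitution would otherwise leave before the first code.

```python
import re

def split_currencies(text: str) -> str:
    """Start a new line before every three-uppercase-letter code followed by ':'."""
    # Same pattern as the sed -r command: one group, re-emitted after a newline.
    broken = re.sub(r"([A-Z]{3}:)", r"\n\1", text)
    # Assumption: the leading newline before the first code is not wanted.
    return broken.lstrip("\n")

if __name__ == "__main__":
    sample = "AED: United Arab Emirates DirhamAFN: Afghan AfghaniALL: Albanian Lek"
    print(split_currencies(sample))
```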

Related

Swap two words with regex in a text file using bash

I have file adr.txt with info:
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
Street King, UK, 00939
Street Williams, US, 19123
Warren Avenue, UK, 93891
Street Court, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Street Front, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Street Olive, US, 63007
Street Williams, US, 54675
Franklin Avenue, US, 82479
I need to swap the word "Street" and the street name, so that the name comes first and the word "Street" follows it:
Franklin Avenue, US, 33123
Laurel Drive, US, 59121
King Street, UK, 00939
Williams Street, US, 19123
Warren Avenue, UK, 93891
Court Street, UK, 89730
Country Club Road, US, 10865
Madison Avenue, US, 36975
Front Street, US, 41911
Cedar Lane, UK, 21563
Garfield Avenue, UK, 00842
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Cedar Avenue, US, 72635
Windsor Drive, US, 34384
Devon Court, UK, 13789
Garfield Avenue, US, 86115
Olive Street, US, 63007
Williams Street, US, 54675
Franklin Avenue, US, 82479
As an example, this sed command works for me: sed -i 's/\(Street\) \(King\)/\2 \1/' adr.txt. What regular expression can I use to match every street name automatically?
I tried sed -i 's/\(Street\) \([WO]{1}[a-z]*[es]{1}\,\)/\2 \1/' adr.txt
I checked the regular expression [WO]{1}[a-z]*[se]{1}\, on regex101.com. It finds the names "Williams," and "Olive,", but it does not work in the sed command.
Your regex attempt was very weird. The simple sed solution would be
sed -i 's/^\(Street\) \([^ ,]*\)/\2 \1/' adr.txt
Though perhaps better to use Awk for this.
awk '/^Street [^ ,]+,/ {
two=$2; $2=$1 ",";
sub(/,$/, "", two);
$1=two }1' adr.txt >newfile
mv newfile adr.txt
As an aside, https://regex101.com/ supports a number of regex dialects, but none of them is exactly the one understood by sed.
Also, {1} in a regex is never useful; if you want to repeat something once, the expression before {1} already does exactly that.
$ cat file
Warren Avenue, UK, 93891
Street Court, UK, 89730
Street Front, US, 41911
Cedar Lane, UK, 21563
Street Cottage, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Street Muhammad Ali, US, 82479
Street Albert Einstein Jr., US, 82479
awk -F',' -v OFS="," '/^Street /{$1=gensub(/^(Street) (.*)$/,"\\2 \\1",1,$1)}1' file
Warren Avenue, UK, 93891
Court Street, UK, 89730
Front Street, US, 41911
Cedar Lane, UK, 21563
Cottage Street, US, 33205
Arlington Avenue, US, 94008
Windsor Drive, US, 34384
Garfield Avenue, US, 86115
Franklin Avenue, US, 82479
Muhammad Ali Street, US, 82479
Albert Einstein Jr. Street, US, 82479
Simple sed solution:
sed -E 's/^(Street) ([^,]+)/\2 \1/' adr.txt
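As a cross-check outside sed and awk, the swap can be sketched in Python. The pattern follows the sed -E idea above (everything up to the first comma is the street name), so multi-word names are handled too. The names STREET_FIRST and swap_street are hypothetical.

```python
import re

# Hypothetical names; the pattern mirrors the sed -E expression above.
STREET_FIRST = re.compile(r"^Street ([^,]+)")

def swap_street(line: str) -> str:
    """Move a leading 'Street' behind the street name; leave other lines unchanged."""
    return STREET_FIRST.sub(r"\1 Street", line)

if __name__ == "__main__":
    for line in ["Street King, UK, 00939", "Warren Avenue, UK, 93891"]:
        print(swap_street(line))
```

Anchoring on ^ and stopping at the first comma avoids having to list the initial letters of each street name, which is what the original attempt tried to do.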

I am trying to write a regex in PowerShell for Canadian addresses

This is the address format.
The street number may vary (12 or 412, for example), and so may the number of words in the street name (here "Finch Ave East"):
1460 Finch Ave East, Toronto, Ontario, A1A1A1
So I tried this:
^[0-9]+\s+[a-zA-Z]+\s+[a-zA-Z]+\s+[a-zA-Z]+[,]{1}+\s[a-zA-Z]+[,]{1}+\s+[a-zA-Z]+[,]{1}+\s[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$
I usually recommend using regex capture groups, so you can break the matching problem down into smaller pieces. In most cases I use \d, \w and \s for matching digits, word characters and whitespace.
I usually experiment on https://regex101.com before I put it into code, because it provides a nice interactive way to play with expressions and samples.
Regarding your question, the expression that I came up with is:
$regexp = "^(\d+)\s*((\w+\s*)+),\s*(\w+),\s*(\w+),\s*((\w\d)*)$"
In PowerShell I like to use the direct regex class, because it offers more granularity than the standard -match operator.
# Example match and results
$sample = "1460 Finch Ave East, Toronto, Ontario, A1A1A1"
$match = [regex]::Match($sample, $regexp)
$match.Success
$match | Select -ExpandProperty groups | Format-Table Name, Value
# Constructed fields
@{
number = $match.Groups[1]
street = $match.Groups[2]
city = $match.Groups[4]
state = $match.Groups[5]
areacode = $match.Groups[6]
}
This results in $match.Success being $true, and the Groups list contains the following numbered capture groups:
Name Value
---- -----
0 1460 Finch Ave East, Toronto, Ontario, A1A1A1
1 1460
2 Finch Ave East
3 East
4 Toronto
5 Ontario
6 A1A1A1
7 A1
For constructing the fields, you can ignore groups 3 and 7: they are the inner repeated groups and only hold the last repetition they matched:
Name Value
---- -----
areacode A1A1A1
street Finch Ave East
city Toronto
state Ontario
number 1460
To add to mákos' excellent answer, I would suggest using named capture groups and the $Matches automatic variable. This makes it easy to grab the individual fields and turn them into objects for multiple input strings:
function Split-CanadianAddress {
param(
[Parameter(Mandatory,ValueFromPipeline)]
[string[]]$InputString
)
$Pattern = "^(?<Number>\d+)\s*(?<Street>(\w+\s*)+),\s*(?<City>(\w+\s*)+),\s*(?<State>(\w+\s*)+),\s*(?<AreaCode>(\w\d)*)$"
foreach($String in $InputString){
if($String -match $Pattern){
$Fields = @{}
$Matches.Keys |Where-Object {$_ -isnot [int]} |ForEach-Object {
$Fields.Add($_,$Matches[$_])
}
[pscustomobject]$Fields
}
}
}
The $Matches hashtable contains both the numbered and the named capture groups, which is why I copy only the named entries to the $Fields variable before creating the pscustomobject.
Now you can use it like:
PS C:\> $sample |Split-CanadianAddress
Street : Finch Ave East
State : Ontario
AreaCode : A1A1A1
Number : 1460
City : Toronto
I've updated the pattern to allow for spaces in city and state names as well (think "New Westminster, British Columbia").
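The named-group approach carries over to other regex engines. Below is a minimal sketch in Python, given only for comparison with the PowerShell version. The field names (number, street, city, province, postal_code) and the helper parse_address are hypothetical. The postal-code part reuses the letter-digit pattern from the question's own attempt, and matching fields with [^,]+ is an assumption that no field contains a comma.

```python
import re

# Hypothetical field names; the postal-code part is taken from the question's attempt.
ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<province>[^,]+),\s*"
    r"(?P<postal_code>[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d)$"
)

def parse_address(line: str):
    """Return the named fields as a dict, or None if the line does not match."""
    match = ADDRESS.match(line.strip())
    return match.groupdict() if match else None

if __name__ == "__main__":
    print(parse_address("1460 Finch Ave East, Toronto, Ontario, A1A1A1"))
```

Because every field is a named group, there are no leftover partial groups to filter out.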

Detect empty line (contain only space) in perl

My program reads a file, and I don't want to process empty lines:
while (<FICC>) {
my $ligne=$_;
if ($ligne =~ /^\s*$/){}else{
print " $ligne\n";}
}
But this code also prints empty lines.
The file that I test with contains:
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
     
sir christopher warren US Secretary of state secretary of state
     
external economic case federal economic affair conference the Federal Office case
     
US bill clinton bill clinton Mr. Bush
     
Nestle food cs holding swiss Swiss Performance Index Performance
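I think it is because of the \n in your print. $ligne still ends with the newline read from the file, so print " $ligne\n" outputs two newlines, and the second one shows up as an empty line. Just remove that \n from your code and it should be fine.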
Usually people do chomp after reading a line from a file to remove the end of line character.
An easier way to write that is probably to invert the logic and only print lines that contain non-whitespace characters.
while (<FICC>) {
my $ligne = $_;
if ($ligne =~ /\S/) {
print " $ligne"; # No need for linefeed here as $ligne already has one
}
}
Update: Demo using your sample data:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
my $ligne = $_;
if ($ligne =~ /\S/) {
print " $ligne";
}
}
__END__
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
sir christopher warren US Secretary of state secretary of state
external economic case federal economic affair conference the Federal Office case
US bill clinton bill clinton Mr. Bush
Nestle food cs holding swiss Swiss Performance Index Performance
Output:
Ms. Ruth Dreifuss Dreifuss Federal Councillor Federal ruth
sir christopher warren US Secretary of state secretary of state
external economic case federal economic affair conference the Federal Office case
US bill clinton bill clinton Mr. Bush
Nestle food cs holding swiss Swiss Performance Index Performance
Which seems correct to me.
Another reason is that "$ligne\n" appends a newline to a string that already ends with one, so use chomp as below. Note that chomp returns the number of characters removed, not the string, so it has to be applied to the variable itself.
I think the nicer way of doing this is with next (skip to the next loop iteration), as it removes some brackets from your code:
while (<FICC>) {
chomp(my $ligne = $_);
next if $ligne =~ /^\s*$/;
print " $ligne\n";
}
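The skip-blank-lines logic is the same in any language. Here is a minimal sketch in Python, given only to illustrate the chomp-then-test order. The function name non_blank_lines is hypothetical, and rstrip("\n") plays the role of chomp.

```python
def non_blank_lines(lines):
    """Yield every line that contains at least one non-whitespace character."""
    for line in lines:
        stripped = line.rstrip("\n")  # same role as Perl's chomp
        if stripped.strip():          # empty or whitespace-only lines are skipped
            yield stripped

if __name__ == "__main__":
    data = ["US bill clinton Mr. Bush\n", "     \n", "Nestle food cs holding\n"]
    for kept in non_blank_lines(data):
        print(" " + kept)
```

Since the newline is removed before the test, the print adds exactly one line ending per kept line.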

Sed remove only first occurrence of a string

I have several strings in my text file which look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace any character with another one. This time I want to replace the comma-space with a pipe, but only for the first match, so that the country name is not affected.
I need to convert it to something like this:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, and the country name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
Running sed 's/, /|/' file.txt doesn't work.
The output should look like this:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first match on each line (that is, in the pattern space), unless you pass the g flag.
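The "first match only" behaviour has a direct counterpart in Python, shown here as a small sketch for comparison. The function name first_comma_to_pipe is hypothetical, and the count argument of str.replace limits the number of replacements to one.

```python
def first_comma_to_pipe(line: str) -> str:
    """Replace only the first ', ' in the line with '|'."""
    # The third argument caps the number of replacements, like s/, /|/ without g.
    return line.replace(", ", "|", 1)

if __name__ == "__main__":
    print(first_comma_to_pipe("Brisbane, Queensland, Australia|BNE"))
```

Like the sed command, this also rewrites lines that contain a single comma-space, such as "Bristol, VA|TRI".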
Since you have not posted the expected output for your whole test file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, it reconstructs the line.

Regex code for address separated by commas

How can I extract the state, i.e. the text just before the third comma, using only a regex?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler) as @Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
And use capture group #1 for your desired string.
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'
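Python

To put the two approaches side by side, here is a sketch in Python that gets the state and the country both by field splitting and by regex. The variable names are made up for the example. The regex extends the (?:[^,]*,){2}([^,]*) pattern from the earlier answer with a second group for the country, which is an addition of this sketch.

```python
import re

address = "54 West 21st Street Suite 603, New York,New York,United States, 10010"

# Field splitting: split on commas and trim the spaces around each part.
fields = [part.strip() for part in address.split(",")]
state, country = fields[2], fields[3]

# Regex: skip two comma-terminated fields, then capture the next two.
match = re.match(r"(?:[^,]*,){2}([^,]*),([^,]*)", address)

if __name__ == "__main__":
    print(state, "|", country)
    print(match.group(1), "|", match.group(2))
```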