GNU sed remove spaces from digit number in text file - regex

I have the following bogus data:
Dominik Dryja|4111 2386 0873 0189|0315
Laivonen Eero|5111 0620 0750 8041|0813
Jukka Valimaa|5111 6500 0489 0035|0415
Rafael Diaz de Leon|4111 3036 6209 4796|0516
Mr Jonathan Bird|4111 6150 0291 7415|0215
ERRANTE VINCENZO|4222 6111 0038 6639|0114
YOSHIO MOTOKI|5222 3200 0374 7129|0513
I. A. VLACHOGIANNIS|4333 0115 6936 2003|0315
Soumya Kanti Deb|4333 0590 0165 4877|1019
WU KE ZHAN|5444 8213 7236 0431|0716
I am trying to strip the spaces ONLY from the digit number so that it looks like this:
Dominik Dryja|4111238608730189|0315
Laivonen Eero|5111062007508041|0813
Jukka Valimaa|5111650004890035|0415
Rafael Diaz de Leon|4111303662094796|0516
Mr Jonathan Bird|4111615002917415|0215
ERRANTE VINCENZO|4222611100386639|0114
YOSHIO MOTOKI|5222320003747129|0513
I. A. VLACHOGIANNIS|4333011569362003|0315
Soumya Kanti Deb|4333059001654877|1019
WU KE ZHAN|5444821372360431|0716
I tried sed -r '#|([0-9]{4})\ ([0-9]{4})\ ([0-9]{4})\ ([0-9]{4})|#\1\2\3\4#g'
but for some reason without success. Any idea where I'm mistaken?
Thanks!

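Your script starts with #, which sed treats as the start of a comment, so the command does nothing. The s command is missing, and with -r an unescaped | means alternation rather than a literal pipe. You can simplify your sed: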
sed 's/\([0-9]\{4\}\) /\1/g' inFile

Assuming there's always a single space between numbers:
sed 's/\([0-9]\) \([0-9]\)/\1\2/g'
This works with your example.
The idea is simple: remove every single space that sits between two digits.
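For comparison, the same "space between two digits" rule can be sketched in Python. This is only an illustration, not part of the sed answer: the function name strip_digit_spaces is made up here, and lookarounds are used instead of capture groups so that the digits themselves are not consumed by the match.

```python
import re

def strip_digit_spaces(line: str) -> str:
    # Remove a single space only when a digit sits on both sides of it.
    # The lookbehind/lookahead check the neighbours without consuming them.
    return re.sub(r"(?<=[0-9]) (?=[0-9])", "", line)

sample = "Rafael Diaz de Leon|4111 3036 6209 4796|0516"
print(strip_digit_spaces(sample))
```

Spaces inside the name field are left alone because neither neighbour is a digit.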

Related

RegEx for a multiple line search and replace using sed

I need a regex that takes the \n at the end of a line as its starting point (anything before it is arbitrary) and matches through the 15 digits and 49 alpha on the second line. I need to remove all of that, so that the second line joins the first one.
Attempt
sed -r -e '{N;s/\n[[:digit:]]{15}[[:space:]]{49}//}'
Input
QC H0H 0H0 CA
:70:NOFX TRADE TR
100000100200621 ADE RELATED WOOD PURCHASE
What needs to be removed is the linefeed after TRADE TR, so that ADE RELATED is brought up to the TR and it spells TRADE.
Desired Output
QC H0H 0H0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE
This might work for you (GNU sed):
sed -E 'N;s/\n[[:digit:]]{15}[[:space:]]{49}//;P;D' file
This opens up a two line window and amends the second of them if the substitute command matches. It always prints the first of the two lines and then removes it.
With GNU sed:
$ sed -Ez 's/\n[[:digit:]]{15}[[:space:]]{49}//' file
QC J0B 2Y0 CA
:70:NOFX TRADE TRADE RELATED WOOD PURCHASE
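A rough Python sketch of the same substitution, working on the whole text at once (similar in spirit to the -z variant). It assumes the 15 digits are followed by exactly 49 space characters that appear collapsed in the listing of the question. The sample text and the name join_continuation are made up for illustration.

```python
import re

# Assumption: the 15 digits are followed by 49 spaces of padding, which
# are not visible in the question's listing.
text = (
    "QC H0H 0H0 CA\n"
    ":70:NOFX TRADE TR\n"
    "100000100200621" + " " * 49 + "ADE RELATED WOOD PURCHASE\n"
)

def join_continuation(s: str) -> str:
    # Drop the newline, the 15-digit prefix and the 49 characters of padding,
    # which glues the rest of the second line onto the first.
    return re.sub(r"\n[0-9]{15}[ \t]{49}", "", s)

print(join_continuation(text), end="")
```

Lines that do not start with 15 digits plus the padding are left untouched, because the pattern simply does not match there.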

Sed remove only first occurrence of a string

I have several strings in my text file which look like this:
Brisbane, Queensland, Australia|BNE
I know how to use the sed command to replace any character with another one. This time I want to replace the comma-space sequence with a pipe, but only for the first match, so that the country name is not affected at the same time.
I need to convert it to something like this:
Brisbane|Queensland, Australia|BNE
As you can see, only the first comma-space was replaced, not the second one, and the country name "Queensland, Australia" stays complete. Can someone help me achieve this? Thanks.
Here is a sample of my file:
Brisbane, Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
Running sed 's/, /|/' file.txt doesn't work.
The output should look like this:
Brisbane|Queensland, Australia|BNE
Simply don't use the g option. Your sed command should look like this:
sed 's/, /|/'
By default, the s command replaces only the first occurrence of a match in the pattern space, unless you pass the g option.
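The "first occurrence only" behaviour has a direct counterpart in Python, shown here only as a comparison. The count argument of str.replace plays the role of leaving out g. The function name first_comma_to_pipe is made up for this sketch.

```python
def first_comma_to_pipe(line: str) -> str:
    # The third argument (count=1) limits the replacement to the first ", ".
    return line.replace(", ", "|", 1)

print(first_comma_to_pipe("Brisbane, Queensland, Australia|BNE"))
```

Note that, like the sed command, this replaces the first ", " on every line, including lines that contain only one.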
Since you have not posted the expected output for your sample file, we can only guess what you need. Here is my guess:
awk -F", *" 'NF>2{$0=$1"|"$2 OFS $3}1' OFS=", " file
Brisbane|Queensland, Australia|BNE
Bristol, United Kingdom|BRS
Bristol, VA|TRI
Brive-La-Gaillarde, France - Laroche|BVE
Brno, Czech Republic - Bus service|ZDN
Brno, Czech Republic - Turany|BRQ
As you can see, it counts the fields to decide whether a | is needed. If it is, it reconstructs the line.
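The field-counting idea can be sketched in Python as well. This is a simplified illustration, not a translation of the awk one-liner: it splits on the literal ", " rather than on a comma followed by any number of spaces, and it keeps every field after the second instead of only the third. The name pipe_if_three_parts is hypothetical.

```python
def pipe_if_three_parts(line: str) -> str:
    parts = line.split(", ")
    if len(parts) > 2:
        # City, region, country: turn only the first separator into a pipe.
        return parts[0] + "|" + ", ".join(parts[1:])
    # Two parts or fewer: leave the line as it is.
    return line

for line in ["Brisbane, Queensland, Australia|BNE", "Bristol, VA|TRI"]:
    print(pipe_if_three_parts(line))
```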

Using Sed to capitalize the first letter of each word

Here is the data I want to capitalize:
molly w. bolt 334-78-5443
walter q. bugg 984-49-0032
noah p. way 887-12-0921
kerry t. bricks 431-09-1239
ping h. yu 109-32-9845
Here is the script I have written so far to capitalize the first letter of each name, including the initial:
h
s/\(.\).*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\(.\)\(.*\)/\1\3/
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
It gives me:
MOLLY W. BOLT 334-78-544Molly 3. bolt 334-78-5443
WALTER Q. BUGG 984-49-003Walter 2. bugg 984-49-0032
NOAH P. WAY 887-12-092Noah 1. way 887-12-0921
KERRY T. BRICKS 431-09-123Kerry 9. bricks 431-09-1239
PING H. YU 109-32-984Ping 5. yu 109-32-9845
I want to only have:
Molly W. Bolt 334-78-544
Walter Q. Bugg 984-49-003
Noah P. Way 887-12-092
Kerry T. Bricks 431-09-123
Ping H. Yu 109-32-984
What would I change?
How about this (GNU sed):
$ sed 's/\b[a-z]/\u&/g' myfile
Molly W. Bolt 334-78-5443
Walter Q. Bugg 984-49-0032
Noah P. Way 887-12-0921
Kerry T. Bricks 431-09-1239
Ping H. Yu 109-32-9845
A (GNU) sed version that should work with UTF-8 too:
sed -E 's/[[:alpha:]]+/\u&/g'
#or
sed -E 's/\S+/\u&/g'
Or perl:
perl -pe 's/(\w+)/\u$1/g'
This searches for "word strings" (\w+), substitutes (s///) each one with $1, its first character uppercased by \u, and does so everywhere in the line (g).
Or the simpler
perl -pe 's/\S+/\u$&/g'
which capitalizes any run of non-space characters.
The variant
perl -CSDA -pe 's/\S+/\u$&/g'
will work with UTF-8 encoded files too, e.g. from the input
павел андреевич чехов 234
γεοργε πατσασογλοθ 123
čajka šumivá 345
will print
Павел Андреевич Чехов 234
Γεοργε Πατσασογλοθ 123
Čajka Šumivá 345
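As a cross-check of the \S+ approach, here is a minimal Python sketch of the same rule: uppercase the first character of every run of non-space characters. The name capitalize_words is made up for this example. Python 3 strings are Unicode, so no extra switch comparable to -CSDA is needed inside the function.

```python
import re

def capitalize_words(line: str) -> str:
    # For each run of non-space characters, uppercase only its first
    # character and keep the rest unchanged.
    return re.sub(r"\S+", lambda m: m.group()[0].upper() + m.group()[1:], line)

print(capitalize_words("molly w. bolt 334-78-5443"))
```

Applied to "čajka šumivá 345", the same function should return "Čajka Šumivá 345", since str.upper() is Unicode-aware.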
For in-place file editing, use:
perl -i.bak -CSDA -pe 's/\S+/\u$&/g' some filenames ....
This also creates a .bak (backup) file.
If you have bash 4.2+ and only need to convert the values of variables, you can use:
for name in павел андреевич чехов γεοργε πατσασογλοθ čajka šumivá
do
echo "${name^}" #capitalize the $name
done
prints
Павел
Андреевич
Чехов
Γεοργε
Πατσασογλοθ
Čajka
Šumivá
Also, here is a solution for a sed that doesn't know \u: https://stackoverflow.com/a/11804643/632407
Quite simple with python also:
$ python3 -c 'with open("myfile") as f: print(f.read().title(), end="")'
https://docs.python.org/2/library/stdtypes.html
sed 's/^/ /;s/ [aA]/ A/g;s/ [bB]/ B/g;s/ [cC]/ C/g;s/ [dD]/ D/g;s/ [eE]/ E/g;s/ [fF]/ F/g;s/ [gG]/ G/g;s/ [hH]/ H/g;s/ [iI]/ I/g;s/ [jJ]/ J/g;s/ [kK]/ K/g;s/ [lL]/ L/g;s/ [mM]/ M/g;s/ [nN]/ N/g;s/ [oO]/ O/g;s/ [pP]/ P/g;s/ [qQ]/ Q/g;s/ [rR]/ R/g;s/ [sS]/ S/g;s/ [tT]/ T/g;s/ [uU]/ U/g;s/ [vV]/ V/g;s/ [wW]/ W/g;s/ [xX]/ X/g;s/ [yY]/ Y/g;s/ [zZ]/ Z/g;s/^.//' YourFile
This is a POSIX version (no GNU sed needed).
It works on your sample, but not on input like {andrea,georges ..., because it assumes that words are at the start of the line OR after a space character.

Regular expression to match second or last decimal number in a string

String:
<LF><CR>A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35<LF><CR>
I need to match only the 7.00. This number could be anywhere from 0.00 to 14.00 (it's a pH reading).
Right now I can only come up with [0-9]{1,2}\.[0-9]{2}, which also matches the software revision number that appears earlier in the string (2.59).
Any help is greatly appreciated.
EDIT: Thanks everyone. I figured it out by using [0-9]{1,2}\.[0-9]{2}(?=,p)
Simply find all entries and get the last:
>>> import re
>>> s = "A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35"
>>> re.findall(r"[0-9]{1,2}\.[0-9]{2}", s)[-1]
'7.00'
You can improve that regex by using the fact that pH is between 0 and 14 (the first digit can only be 1, etc.). Or better, just split on commas or use the csv module.
Maybe you can use this:
pH,(([0-9]|1[0-4])\.\d{2}),pH
Group 1 matches the number that you need, and the pattern also checks the data.
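A short Python sketch of how that pattern could be applied to the sample string. The variable names are made up here, and the inner group is written as non-capturing so that group 1 is unambiguously the whole reading.

```python
import re

s = ("A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,"
     "pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35")

# The reading must sit between two "pH" fields and have an integer part
# of 0-9 or 10-14, followed by exactly two decimals.
PH_FIELD = re.compile(r"pH,((?:[0-9]|1[0-4])\.\d{2}),pH")

match = PH_FIELD.search(s)
reading = match.group(1) if match else None
print(reading)
```

The revision number 2.59 is skipped because it is not surrounded by pH fields.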
If the format of the string is fixed, i.e. the data is in the 9th position when you split on commas, use e.g. awk:
$ awk -F, '{print $8, $9}' input
pH 7.00
or using perl in awk-mode:
$ perl -F, -lane 'print $F[8]' input
7.00
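The same positional lookup can be sketched with Python's csv module. It assumes, like the awk and perl commands, that the reading is always the 9th comma-separated field. The variable names are illustrative only.

```python
import csv
import io

s = ("A214 pH/ISE,X00066,2.59,ABCDE,10/16/13 22:06:59,ABC1,CH-1,"
     "pH,7.00,pH,0.0, mV,25.0,C,100.0,%,M100,#35")

# Parse the single record; indices are 0-based, so field 9 is row[8].
row = next(csv.reader(io.StringIO(s)))
print(row[7], row[8])
```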
Or this regexp
pH,(\d+\.\d{2})
See it live on http://www.rubular.com/r/3kkWNVBAi8

SED: Inserting an existing pattern, to several other places on the same line

Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I currently have this kind of organization in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every areacode can have unknown number of numbers, and also the phone numbers are not fixed in length.
What I would like to know, is how could I combine areacode and phone number, to look something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was to add again a line break before each phone number and to match these sections with regex, and then just add the first remembered item to second, and first to third:
\1-\2, \1-\3, etc
But of course, since sed can only remember 9 groups and there can be more than 10 numbers on one line, this doesn't work. Moreover, the variable length of the list of phone numbers made this a no-go.
I'm again primarily looking for a sed solution, as I've been trying to get proficient with it, but more efficient solutions with other tools are of course welcome!
$ cat input.txt | sed '1d;s/NUM:/ /g' | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
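For readers who want to see the logic outside sed and awk, here is a minimal Python sketch of the same transformation: take the first field as the area code and prefix it to every number. The function name combine and the inline sample list are made up for this illustration.

```python
def combine(line: str) -> str:
    # First whitespace-separated field is the area code, the rest are numbers.
    area, *numbers = line.split()
    # Strip the "NUM:" tag from each number and prefix the area code.
    return ",".join(f"{area}-{n.replace('NUM:', '', 1)}" for n in numbers)

lines = [
    "Areacode: List of phone numbers:",
    "4444 NUM:111111 NUM:2222222 NUM:33333333",
    "5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555",
]
for line in lines[1:]:  # skip the header line, like 1d / NR>1
    print(combine(line))
```

Because the numbers are handled in a loop, there is no limit of 9 remembered groups and the list can be of any length.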
TXR:
@(collect)
@area @(coll :mintimes 1)NUM:@{num /[0-9]+/}@(end)
@(output)
@(rep)@area-@num, @(last)@area-@num@(end)
@(end)
@(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555