Using Sed to capitalize the first letter of each word - regex

Here is the data I want to capitalize:
molly w. bolt 334-78-5443
walter q. bugg 984-49-0032
noah p. way 887-12-0921
kerry t. bricks 431-09-1239
ping h. yu 109-32-9845
Here is the script I have written so far to capitalize the first letter of each name, including the initial:
h
s/\(.\).*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\(.\)\(.*\)/\1\3/
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
It gives me:
MOLLY W. BOLT 334-78-544Molly 3. bolt 334-78-5443
WALTER Q. BUGG 984-49-003Walter 2. bugg 984-49-0032
NOAH P. WAY 887-12-092Noah 1. way 887-12-0921
KERRY T. BRICKS 431-09-123Kerry 9. bricks 431-09-1239
PING H. YU 109-32-984Ping 5. yu 109-32-9845
I want to only have:
Molly W. Bolt 334-78-544
Walter Q. Bugg 984-49-003
Noah P. Way 887-12-092
Kerry T. Bricks 431-09-123
Ping H. Yu 109-32-984
What would I change?

How about this (GNU sed):
$ sed 's/\b[a-z]/\u&/g' myfile
Molly W. Bolt 334-78-5443
Walter Q. Bugg 984-49-0032
Noah P. Way 887-12-0921
Kerry T. Bricks 431-09-1239
Ping H. Yu 109-32-9845
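For comparison, the same substitution can be sketched in Python. This is only an illustration, assuming Python's re module is an acceptable stand-in for GNU sed; the function name capitalize_words is made up for the example and is not part of the answer above.

```python
import re


def capitalize_words(line: str) -> str:
    """Uppercase every lowercase ASCII letter that starts a word."""
    # \b[a-z] mirrors the sed pattern; the callback plays the role of \u&.
    return re.sub(r"\b[a-z]", lambda m: m.group(0).upper(), line)


if __name__ == "__main__":
    for row in ("molly w. bolt 334-78-5443", "ping h. yu 109-32-9845"):
        print(capitalize_words(row))
```

Like the sed command, this leaves digits and punctuation untouched because only [a-z] at a word boundary is matched.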

GNU sed, which should work with UTF-8 too:
sed -E 's/[[:alpha:]]+/\u&/g'
#or
sed -E 's/\S+/\u&/g'
Or Perl:
perl -pe 's/(\w+)/\u$1/g'
This searches for "word strings" (\w+), substitutes (s///) each one with $1, its first character uppercased by \u, and does so everywhere in the line (g).
Or the simpler
perl -pe 's/\S+/\u$&/g'
which capitalizes any string of non-space characters.
The variant
perl -CSDA -pe 's/\S+/\u$&/g'
will work with UTF-8 encoded files too, e.g. from the input
павел андреевич чехов 234
γεοργε πατσασογλοθ 123
čajka šumivá 345
will print
Павел Андреевич Чехов 234
Γεοργε Πατσασογλοθ 123
Čajka Šumivá 345
For in-place file editing, use:
perl -i.bak -CSDA -pe 's/\S+/\u$&/g' some filenames ....
This will create a .bak (backup) file for each edited file.
If you have bash 4.2+ and only need to convert values held in variables, you can use:
for name in павел андреевич чехов γεοργε πατσασογλοθ čajka šumivá
do
echo "${name^}" #capitalize the $name
done
prints
Павел
Андреевич
Чехов
Γεοργε
Πατσασογλοθ
Čajka
Šumivá
Also, here is a solution for a sed that doesn't know \u: https://stackoverflow.com/a/11804643/632407

Quite simple with python also:
$ python -c 'with open("myfile") as f:print(f.read().title())'
https://docs.python.org/2/library/stdtypes.html

sed 's/^/ /;s/ [aA]/ A/g;s/ [bB]/ B/g;s/ [cC]/ C/g;s/ [dD]/ D/g;s/ [eE]/ E/g;s/ [fF]/ F/g;s/ [gG]/ G/g;s/ [hH]/ H/g;s/ [iI]/ I/g;s/ [jJ]/ J/g;s/ [kK]/ K/g;s/ [lL]/ L/g;s/ [mM]/ M/g;s/ [nN]/ N/g;s/ [oO]/ O/g;s/ [pP]/ P/g;s/ [qQ]/ Q/g;s/ [rR]/ R/g;s/ [sS]/ S/g;s/ [tT]/ T/g;s/ [uU]/ U/g;s/ [vV]/ V/g;s/ [wW]/ W/g;s/ [xX]/ X/g;s/ [yY]/ Y/g;s/ [zZ]/ Z/g;s/^.//' YourFile
POSIX (non-GNU sed) version.
It works on your sample, but not on input such as {andrea,georges ..., because it assumes that each word is at the start of the line or follows a space character.

Related

Find the first name that starts with any letter other than S using regex

I am new to regex and I am trying to find, in a text file, the last names that start with S, followed by a comma, then a space, and then a first name that doesn't start with S.
I am using the terminal on a MacBook.
This is my regex
^[S\w][,]?[' ']?[A-RT-Z]?
My full command
cat People.txt | grep -E ^[S\w][,]?[' ']?[A-RT-Z]?
The first name is the second word and the last name is the first word on each line.
The results I get:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
What I am expecting to get
Schmidt, Paul
Smith, Peter
The first rule of writing regular expressions in a shell script (or at the terminal) is "enclose the regular expression in single quotes" so that the shell doesn't try to interpret the metacharacters in the regex. You might sometimes use double quotes instead of single quotes if you need to match single quotes but not double quotes or if you need to interpolate a variable, but aim to use single quotes. Also, avoid UUoC — Useless Use of cat.
Your question currently shows two regular expressions:
^[S\w][,]?[' ']?[A-RT-Z]?
cat People.txt | grep -E ^[S\w][,]?[' ']?[P\w+]?
If written as suggested, these would become:
grep -E -e '^[Sw],? ?[A-RT-Z]?' People.txt
grep -E -e '^[Sw],? ?[Pw+]?' People.txt
The shell removes the backslashes in your rendition. The + in the character class matches a plus sign. You don't need square brackets around the comma (though they do no major harm). I use the -e option for explicitness, and so I can add extra arguments after the regex (-w or -l or -n or …) when editing commands via history. (I also dislike having options recognized after non-option arguments; I often run with $POSIXLY_CORRECT set in my environment. That's a personal quirk.)
The first of the two commands looks for a line starting S or w, followed by an optional comma, an optional blank, and an optional upper-case letter other than S. The second is similar except that it looks for an optional P or w. None of this bears much relationship to the question.
You need an expression more like one of these:
grep -E -e '^[S][[:alpha:]]*, [^S]' People.txt
grep -E -e '^[S][a-zA-Z]*, [^S]' People.txt
These allow single-character names — just S — but you can use + instead of * to require one or more letters.
There are lots of refinements possible, depending on how much you want to work, but this does the primary job of finding 'first word on the line starts with S, and is followed by a comma, a blank, and the second word does not start with S'.
Given a file People.txt containing:
Randall, Steven
Rogers, Timothy
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Titus, Persephone
Williams, Shirley
Someone
S
Your regular expressions produce the output:
Schmidt, Paul
Sells, Simon
Smith, Peter
Stephens, Sheila
Someone
S
My commands produce:
Schmidt, Paul
Smith, Peter
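As a cross-check, the second grep pattern can be tried in Python. This is a minimal sketch that assumes Python's re module behaves like grep -E for this simple pattern; the names PATTERN, PEOPLE and select_names are invented for the illustration.

```python
import re

# Last name starting with S, then a comma and a blank,
# then a first name whose first character is not S.
PATTERN = re.compile(r"^S[A-Za-z]*, [^S]")

PEOPLE = [
    "Randall, Steven",
    "Rogers, Timothy",
    "Schmidt, Paul",
    "Sells, Simon",
    "Smith, Peter",
    "Stephens, Sheila",
    "Titus, Persephone",
    "Williams, Shirley",
    "Someone",
    "S",
]


def select_names(lines):
    """Return the lines that the pattern matches, in input order."""
    return [line for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    print("\n".join(select_names(PEOPLE)))
```

The lines "Someone" and "S" drop out because the pattern requires the comma and blank after the last name.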
Something like this seems to work fine:
^S.*, [^S].*$
^S.* - must start with S and start capturing everything
, [^S] - leading up to a comma, space, not S
.*$ - capture the rest of the string
https://regex101.com/r/76bfji/1

Shell script sed - replace numbers

I have a file that contains address, name and phone number.
Original line:
Elizabeth Salnger 117 Someone St - Fresno, CA013023459876AcccountActive
Expected line:
Elizabeth Salnger 117 Someone St - Fresno, CA099999999999AcccountActive
I have a function
sed -r 's/./&\n/64;s//\n&/52;:a;s/(\n.[0-9]+)[0-9](.*\n)/\1P\2/;ta;s/\n//g'
Note: this function is used to convert various files, which is why I have to specify positions in the sed command.
Not sure it's what you're looking for, but you can try this sed command:
sed -E 's/(.*,[^0-9]*)[0-9]*(.*)/\1099999999999\2/' infile
If it works for you, I can add an explanation.
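The idea behind that substitution can be sketched in Python: keep everything up to the last comma plus the non-digits after it, swap the run of digits that follows, and keep the rest. This is an illustration only; the function name mask_number is hypothetical, and the replacement value is taken from the expected line in the question.

```python
import re


def mask_number(line: str) -> str:
    """Replace the digit run that follows the last comma's non-digit tail."""
    # \g<1> is used instead of \1 because a group reference followed
    # directly by digits would otherwise be read as a different group number.
    return re.sub(
        r"(.*,[^0-9]*)[0-9]*(.*)",
        r"\g<1>099999999999\g<2>",
        line,
        count=1,
    )


if __name__ == "__main__":
    original = "Elizabeth Salnger 117 Someone St - Fresno, CA013023459876AcccountActive"
    print(mask_number(original))
```

In sed the same ambiguity does not arise, since back-references there are limited to a single digit (\1 to \9).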

Regexp with sed, alphabetical order without duplicates

I need an expression which allows letters only in alphabetical order without duplications, white spaces allowed.
For example:
abc d efg
abcd efg
bcdefg h
I have to use sed, so I can't use lookahead expressions.
sed reads a file and must find, in each line, the substring that matches the examples.
The best I have so far is this:
sed -nr 's/^[a-g]*(a?b?c?d?e?f?g?)[a-g]*$/\1/gp' test.txt
It doesn't handle white space, and in fact doesn't work at all.
I suggest you try this, for letters in the [a-h] range:
sed -nr '/^a? *b? *c? *d? *e? *f? *g? *h? *$/p' test.txt
With GNU sed:
sed -nE '/^a{0,1} *b{0,1} *c{0,1} *d{0,1} *e{0,1} *f{0,1} *g{0,1} *h{0,1} *i{0,1} *j{0,1} *k{0,1} *l{0,1} *m{0,1} *n{0,1} *o{0,1} *p{0,1} *q{0,1} *r{0,1} *s{0,1} *t{0,1} *u{0,1} *v{0,1} *w{0,1} *x{0,1} *y{0,1} *z{0,1} *$/p' file
cat test.txt | sed -e "/\([a-z]\).*\1/d" | grep -E "^ *a* *b* *c* *d* *e* *f* *g* *h* *i* *j* *k* *l* *m* *n* *o* *p* *q* *r* *s* *t* *u* *v* *w* *x* *y* *z* *$"
or
grep -E "^ *a? *b? *c? *d? *e? *f? *g? *h? *i? *j? *k? *l? *m? *n? *o? *p? *q? *r? *s? *t? *u? *v? *w? *x? *y? *z? *$" test.txt
or
sed -nE '/^ *a? *b? *c? *d? *e? *f? *g? *h? *i? *j? *k? *l? *m? *n? *o? *p? *q? *r? *s? *t? *u? *v? *w? *x? *y? *z? *$/p' test.txt
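Typing out 26 optional letters by hand is error-prone, so the same pattern can be generated. The following is a sketch in Python, assuming its re module is an acceptable substitute for grep -E or sed -E here; ORDERED and is_ordered are names made up for the example.

```python
import re
from string import ascii_lowercase

# Builds "^ *a? *b? *c? ... *z? *$": each letter at most once, in order,
# with any number of spaces allowed between letters.
ORDERED = re.compile("^ *" + " *".join(c + "?" for c in ascii_lowercase) + " *$")


def is_ordered(line: str) -> bool:
    """True if the line contains only spaces and letters in alphabetical order."""
    return ORDERED.match(line) is not None


if __name__ == "__main__":
    for sample in ("abc d efg", "abcd efg", "bcdefg h", "aab", "ba"):
        print(repr(sample), is_ordered(sample))
```

Note that, like the grep and sed versions, this pattern also accepts an empty line, since every letter is optional.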
This might work for you (GNU sed):
sed -r 'h;s/ //g;/(.).*\1/d;s/.*/&\nzyxwvutsrqponmlkjihgfedcba/;:a;ta;/^\n/!s/^(.)(.*\n.*)\1.*/\2/;ta;/^.+\n/d;x'
Copy the line then remove spaces. If the line contains duplicates delete it. Otherwise starting from the front, remove each character in alphabetical order and if successful reinstate the original line. Otherwise delete the line.
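The same check (no duplicates, letters in alphabetical order, spaces ignored) can also be written without a regex. This Python sketch is not a translation of the sed script above; is_strictly_alphabetical is a hypothetical name, and rejecting empty lines is an assumption made for the example.

```python
def is_strictly_alphabetical(line: str) -> bool:
    """True if the lowercase letters, ignoring spaces, strictly increase."""
    letters = line.replace(" ", "")
    # Assumption: an empty line or any non a-z character is rejected.
    if not letters or not all("a" <= c <= "z" for c in letters):
        return False
    # Strictly increasing means both "in order" and "no duplicates".
    return all(a < b for a, b in zip(letters, letters[1:]))


if __name__ == "__main__":
    for sample in ("abc d efg", "abca", "acb"):
        print(repr(sample), is_strictly_alphabetical(sample))
```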
sed -nE 's/[a-z]*(^a{0,1} *b{0,1} *c{0,1} *d{0,1} *e{0,1} *f{0,1} *g{0,1} *h{0,1} *i{0,1} *j{0,1} *k{0,1} *l{0,1} *m{0,1} *n{0,1} *o{0,1} *p{0,1} *q{0,1} *r{0,1} *s{0,1} *t{0,1} *u{0,1} *v{0,1} *w{0,1} *x{0,1} *y{0,1} *z{0,1} *)[a-z]*$/\1/gp' file.txt
This was enough for me. Thanks to all. Great answers.

Regex code for address separated by commas

How can I extract the state, i.e. the text just before the third comma, using only a regex?
54 West 21st Street Suite 603, New York,New York,United States, 10010
I've managed to extract the rest how I wanted but this one is a problem.
Also, how can I extract the "United States" please?
It looks like you want to use capturing groups:
.*,.*,(.*),(.*),.*
The first capturing group will be "New York" and the second will be "United States" (try it on Rubular).
Or you can split by commas (which will probably be even simpler), as @Jerry points out, assuming the language/tool you're using supports that.
You can use this regex:
(?:[^,]*,){2}([^,]*)
And use captured group # 1 for your desired String.
TL;DR
A lot depends on your regular expression engine, and whether you really need a regular expression or field-splitting. You can do field-splitting in Ruby and Awk (among others), but sed and grep only do regular expressions. See some examples below to get you started.
Ruby
str = '54 West 21st Street Suite 603, New York,New York,United States, 10010'
str.match /(?:.*?,){2}([^,]+)/
$1
#=> "New York"
GNU sed
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
sed -rn 's/([^,]+,){2}([^,]+).*/\2/p'
GNU awk
$ echo '54 West 21st Street Suite 603, New York,New York,United States, 10010' |
awk -F, '{print $3}'
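Both approaches, field-splitting and the regex with a capture group, can be sketched in Python as well. This is an illustration only; the function names are invented for the example, and stripping the surrounding blanks from each field is an assumption about what is wanted.

```python
import re

ADDRESS = "54 West 21st Street Suite 603, New York,New York,United States, 10010"


def state_and_country(address: str):
    """Split on commas and return the third and fourth fields, trimmed."""
    fields = [field.strip() for field in address.split(",")]
    return fields[2], fields[3]


def state_by_regex(address: str):
    """Skip two comma-terminated fields, then capture the third."""
    match = re.match(r"(?:[^,]*,){2}([^,]*)", address)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    print(state_and_country(ADDRESS))
    print(state_by_regex(ADDRESS))
```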

GNU sed remove spaces from digit number in text file

I have the following bogus data:
Dominik Dryja|4111 2386 0873 0189|0315
Laivonen Eero|5111 0620 0750 8041|0813
Jukka Valimaa|5111 6500 0489 0035|0415
Rafael Diaz de Leon|4111 3036 6209 4796|0516
Mr Jonathan Bird|4111 6150 0291 7415|0215
ERRANTE VINCENZO|4222 6111 0038 6639|0114
YOSHIO MOTOKI|5222 3200 0374 7129|0513
I. A. VLACHOGIANNIS|4333 0115 6936 2003|0315
Soumya Kanti Deb|4333 0590 0165 4877|1019
WU KE ZHAN|5444 8213 7236 0431|0716
I'm trying to strip the spaces ONLY from the digit field, so that it looks like this:
Dominik Dryja|4111238608730189|0315
Laivonen Eero|5111062007508041|0813
Jukka Valimaa|5111650004890035|0415
Rafael Diaz de Leon|4111303662094796|0516
Mr Jonathan Bird|4111615002917415|0215
ERRANTE VINCENZO|4222611100386639|0114
YOSHIO MOTOKI|5222320003747129|0513
I. A. VLACHOGIANNIS|4333011569362003|0315
Soumya Kanti Deb|4333059001654877|1019
WU KE ZHAN|5444821372360431|0716
I tried sed -r '#|([0-9]{4})\ ([0-9]{4})\ ([0-9]{4})\ ([0-9]{4})|#\1\2\3\4#g'
but for some reason without success. Any idea where I'm mistaken?
Thanks!
You can simplify your sed:
sed 's/\([0-9]\{4\}\) /\1/g' inFile
Assuming there's always a single space between numbers:
sed 's/\([0-9]\) \([0-9]\)/\1\2/g'
Works with your example.
The code is simple - remove all single spaces if they happen between two digits.
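The "space between two digits" rule can be sketched in Python with lookarounds, which match the space without consuming the neighbouring digits. This is an illustration under the assumption that Python's re module is acceptable here; strip_digit_spaces is a made-up name.

```python
import re


def strip_digit_spaces(line: str) -> str:
    """Remove every single space that sits directly between two digits."""
    # The lookbehind and lookahead only test for digits, so adjacent
    # matches such as "1 2 3" are all handled in one pass.
    return re.sub(r"(?<=[0-9]) (?=[0-9])", "", line)


if __name__ == "__main__":
    print(strip_digit_spaces("Dominik Dryja|4111 2386 0873 0189|0315"))
    print(strip_digit_spaces("Rafael Diaz de Leon|4111 3036 6209 4796|0516"))
```

With the capture-group form used in the sed command, a run of single digits such as "1 2 3" would need a second pass, because each match consumes the digit that the next match would need. That does not matter for the four-digit groups in this data.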