How to sed, awk, or tr certain characters in certain situations? - regex

How can I use awk, sed, or tr to replace a " " with a ","? More specifically, how do I do it when the number of fields differs from line to line? I know the simple sed approach:
sed 's/ /,/g'
Here's an example of the problem:
Ted 36 Shaker Heights 04-25-1978
Robin 34 Vancouver 07-23-1980
Marshall 36 St. Cloud 11-28-1978
Lily 37 New York 03-22-1978
I need to sed, awk, or tr so the result becomes
Ted,36,Shaker Heights,04-25-1978
Robin,34,Vancouver,07-23-1980
Marshall,36,St. Cloud,11-28-1978
Lily,37,New York,03-22-1978
I am having trouble with the space within the city name. Any suggestions on how to fix that? The number of fields per line is not always consistent: it will be either 4 or 5, depending on the city.

If the city is always surrounded by numbers, you can just check for the transition from digits to non-digits or vice versa:
sed 's/\([0-9]\) \([^0-9]\)/\1,\2/g;s/\([^0-9]\) \([0-9]\)/\1,\2/g'

Try this:
sed -E 's/ ([0-9]+) /,\1,/;s/ ([0-9-]+)$/,\1/' file
Output:
Ted,36,Shaker Heights,04-25-1978
Robin,34,Vancouver,07-23-1980
Marshall,36,St. Cloud,11-28-1978
Lily,37,New York,03-22-1978

A dumb and basic approach that relies on the greediness of .*:
sed -r 's/^([^ ]*) ([0-9]*) (.*) /\1,\2,\3,/' file
or shorter:
sed -r 's/ ([0-9]*) (.*) /,\1,\2,/' file
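As an alternative sketch to the sed answers above (not taken from any of them), awk can treat the first two fields and the last field as fixed and rejoin whatever lies between them as the city. This assumes the name and age never contain spaces and the date is always the final field; the variable names are illustrative.

```shell
# Sample rows from the question.
input='Ted 36 Shaker Heights 04-25-1978
Robin 34 Vancouver 07-23-1980
Marshall 36 St. Cloud 11-28-1978
Lily 37 New York 03-22-1978'

# Fields 1, 2 and NF are fixed; fields 3..NF-1 are rejoined as the city.
result=$(printf '%s\n' "$input" | awk '{
    city = $3
    for (i = 4; i < NF; i++) city = city " " $i
    print $1 "," $2 "," city "," $NF
}')

printf '%s\n' "$result"
```

Because the city is rebuilt from fields, runs of several spaces inside a city name would collapse to one.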

Related

Using grep to remove text after the first, or second, occurrence of a four digit string. Issue with hyphenated text

I am trying to use grep and sed to format text and need help with my grep statement to include hyphens and preceding text in the output.
Example strings:
Merry.Ex-Mas.2014.1080p.Text.x265-JOHN
30.Rock.A.One-Time.Special.2020.1080p.Text.x265-JOHN
Creature.from.the.Black.Lagoon.REMASTERED.1954.1080p.BluRay.x265-JOHN
1984.1984.1080p.Text.x265-JOHN
The desired output would be:
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
Creature from the Black Lagoon 1954
1984 1984
Thanks to #grzegorz-pudłowski I have this line of code, but for some reason each hyphen and everything in front of it is being removed:
`grep -E -o '(\\w*[\\.]?)*(19|20)[0-9]{2}'`
(the extra escapes are needed in AppleScript)
Those grep commands result in:
Mas.2014
Time.Special.2020
Creature.from.the.Black.Lagoon.1954
1984.1984
I then pipe to sed to replace periods with spaces:
| sed 's/\\. */ /g'"
The original answer from #grzegorz-pudłowski that was removed from Stack Overflow:
In this situation grep should work better than sed. I guess you have a bunch of files and you want to rename them or whatnot. So I would use something like this:
echo "Title.Text.2012.1080p.text.text" | grep -E -o "(\w*[\.]?)*(19|20)[0-9]{2}"
So... -E is the "extended regex" flag. You can use egrep instead. The next flag is -o, which makes grep print only the matched part (as you want to throw away the rest of the string).
The regexp is simple:
(\w*[\.]?)* matches zero or more groups, each consisting of zero or more word characters (letters, digits, underscore) optionally followed by a dot.
(19|20) matches 19 or 20, as you want to match a year (assuming years 1900-2099, so change this part if you want a wider range).
[0-9]{2} matches two digits from 0 to 9.
After that you can pipe the result to mv or whatnot. If you are grepping a file, however, just use:
grep -E -o "(\w*[\.]?)*(19|20)[0-9]{2}" filename.txt
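The breakdown above also accounts for the hyphen loss reported in the question: \w does not match "-", so the match can only start after the last hyphen before the year. A minimal sketch of one possible fix, which swaps \w for a bracket expression that also admits the hyphen (this is not part of the original answer, and it assumes hyphens are the only extra character needed):

```shell
# Titles from the question.
titles='Merry.Ex-Mas.2014.1080p.Text.x265-JOHN
30.Rock.A.One-Time.Special.2020.1080p.Text.x265-JOHN
1984.1984.1080p.Text.x265-JOHN'

# [[:alnum:]_-] is \w plus the hyphen; tr then turns dots into spaces.
result=$(printf '%s\n' "$titles" |
    grep -E -o '([[:alnum:]_-]*\.?)*(19|20)[0-9]{2}' |
    tr '.' ' ')

printf '%s\n' "$result"
```

Extra words between the title and the year, such as REMASTERED, are still kept by this pattern and would need separate handling.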
EDIT2: In case OP wants to stick with his original solution with additional steps then try following.
grep -E -o "(\w+\.){1,}.*(19|20)[0-9]{2}" Input_file | sed 's/\./ /g'
EDIT: As per OP's comment adding more generic solution.
awk '
match($0,/[0-9]{4}\.[0-9]+[a-zA-Z]+\..*/){
  # RSTART is the first digit of the year, so RSTART+3 is its last digit
  val=substr($0,1,RSTART+3)
  gsub(/\./," ",val)
  print val
  val=""
}
' Input_file
Could you please try following, written and tested with shown samples in GNU sed.
sed -E 's/\.[0-9]+p\.Text\..*Text//;s/\./ /g' Input_file
2nd solution: Using awk.
awk '
BEGIN{
  FS="."
}
match($0,/\.[0-9]+p\.Text\..*Text/){
  $1=$1
  print substr($0,1,RSTART-1)
}
' Input_file
A sed expression using BRE (Basic Regular Expressions, plus the GNU \w and \s extensions) can be written as:
sed 's/[.]/ /g;s/\w\w*p\s.*$//' file
Where the first substitution globally replaces each '.' with a space and then the second deletes from the word ending in 'p' to the end of line. \w matches [A-Za-z0-9_], so you can tighten the matching criteria by adjusting the match of characters before 'p' if needed.
Example Use/Output
$ sed 's/[.]/ /g;s/\w\w*p\s.*$//' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
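One way to tighten the match as suggested above is to require digits before the 'p', so that an ordinary word ending in p is not mistaken for the resolution. A minimal sketch, assuming the resolution token always has the form digits followed by p; the second title is a made-up example, and the pattern uses only POSIX BRE:

```shell
# First title is from the example file; the second is hypothetical and
# contains a word ("Up") that ends in p.
titles='Merry.Ex-Mas.2014.1080p.Text.x265.Text
Step.Up.2006.720p.Text.x265.Text'

# Dots become spaces, then everything from " <digits>p " onward is deleted.
result=$(printf '%s\n' "$titles" | sed 's/[.]/ /g;s/ [0-9][0-9]*p .*$//')

printf '%s\n' "$result"
```

Matching the leading space as part of the deleted text also avoids leaving a trailing space on each line.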
Per-Edits To Include Additional Strings
Including additional strings such as:
"WALL-E.2008.1080p.BluRay.x265-JOHN", and
"WALL-E.2008.REMASTERED.1080p.BluRay.x265-RARBG"
To use BRE you would need:
sed 's/[.]/ /g;s/^[0-9][0-9]*[ ]\([0-9][0-9][0-9][0-9]\).*$/\1 \1/;s/[ ]\([0-9][0-9][0-9][0-9]\).*$/ \1/' file
Example Input File
$ cat file
Merry.Ex-Mas.2014.1080p.Text.x265.Text
30.Rock.A.One-Time.Special.2020.1080p.Text.x265.Text
1984.1984.1080p.Text.x265.Text
WALL-E.2008.1080p.BluRay.x265-JOHN
WALL-E.2008.REMASTERED.1080p.BluRay.x265-RARBG
Example Use/Output
$ sed 's/[.]/ /g;s/^[0-9][0-9]*[ ]\([0-9][0-9][0-9][0-9]\).*$/\1 \1/;s/[ ]\([0-9][0-9][0-9][0-9]\).*$/ \1/' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
WALL-E 2008
WALL-E 2008
This can be solved using sed substitution:
sed -E 's/(.*(19|20)[0-9]{2}).*/\1/; s/\./ /g' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
Details:
(.*(19|20)[0-9]{2}): Match longest string till we get a year string and capture in group #1
.*: Match remaining part till end
\1: Put 1st capture group back
s/\./ /g: replace each dot with a space
You may use
sed -E 's/\.1080p\..*//g;s/\./ /g' file
Details
-E - enables POSIX ERE syntax
s/\.1080p\..*//g - removes the .1080p. and all text to the end of string
s/\./ /g - replaces dots with spaces.
Test:
#!/bin/bash
s='Merry.Ex-Mas.2014.1080p.
30.Rock.A.One-Time.Special.2020.1080p.
1984.1984.1080p.'
sed -E 's/\.1080p\..*//g;s/\./ /g' <<< "$s"
Output:
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984

regex -- grepping for alphabetic characters only

I have a quick regex question.
Let's say I have a list of packages:
packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64
What I want returned is:
packageA
packageB
packageB-utils
I've tried
grep -oP '^[a-z]*' myfile.txt
and
awk -F"[_-]" '{print $1}' myfile.txt
Any ideas? I think I'm sort of close, but I just can't get packageB-utils
.*?(?=-\d)
.*? => everything non greedy
(?=-\d) => until "-" followed by a digit
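The pattern above uses a lookahead, which is PCRE syntax. A sketch of how it might be applied, assuming a grep built with -P support (GNU grep usually is), next to a POSIX sed equivalent that needs no lookahead. The ^ anchor is an addition here: without it, grep -o would go on to print later fragments of the same line.

```shell
# Package list from the question.
packages='packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64'

# PCRE form: lazily match from the start of the line up to "-<digit>".
# Left empty if this grep has no -P support.
via_grep=$(printf '%s\n' "$packages" | grep -oP '^.*?(?=-\d)') || via_grep=''

# POSIX form: delete everything from the first "-<digit>" onward.
via_sed=$(printf '%s\n' "$packages" | sed 's/-[0-9].*//')

printf '%s\n' "$via_sed"
```

Both forms assume the package name itself never contains a hyphen immediately followed by a digit.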
Try this. It selects everything up to the last letter:
grep -o "^[a-zA-Z-]*[a-zA-Z]" file.txt
Or, if your package name also contains digits, you can use sed to trim out everything after -0:...:
sed 's|-[0-9]*:.*||' file.txt
With sed using grouping:
sed -rn 's/([A-Za-z\-]+)\-(.*)/\1/p' packages.txt
Should yield:
#packageA
#packageB
#packageB-utils
packages.txt contains:
packageA-0:8.39-6.fc24.x86_64
packageB-0:6.4-1.fc24.x86_64
packageB-utils-0:3.63-2.fc24.x86_64

sed replace : delete a character between two strings while preserving a regex

I have a CSV file which is delimited by #~#.
There is a field which contains 0 followed by n (more than 1) '.' (dot) characters.
I need to remove the zero and preserve the dots that follow. I also have to take care that floating-point numbers are not affected.
So, effectively, replace #~#0.....#~# with #~#.....#~# (the number of dots can be anything from 1 upward).
To limit the replacement to fields matching the pattern, use this:
$ echo "#~#0.12#~#0.....#~#0.1#~#0.#~#" | sed -r 's/#~#0(\.+)#~#/#~#\1#~#/g'
This will preserve 0.12 and 0.1 but replace 0..... and 0.:
#~#0.12#~#.....#~#0.1#~#.#~#
+ in regex means one or more. Anchoring with the field delimiters will make sure nothing else will be replaced.
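One caveat with anchoring on both delimiters: two adjacent matching fields share the #~# between them, and a single pass with the g flag consumes that delimiter with the first match, so the second field is skipped. A sketch of one way around this, using a sed loop that repeats the substitution until nothing changes (the sample line is made up for illustration):

```shell
# Hypothetical line with two adjacent "0 plus dots" fields and one float.
line='#~#0..#~#0...#~#0.5#~#'

# Single pass with g: the second field is missed.
once=$(printf '%s\n' "$line" | sed -E 's/#~#0(\.+)#~#/#~#\1#~#/g')

# Loop: ":a" sets a label, "ta" jumps back to it after each successful
# substitution, so the line is rescanned from the start each time.
looped=$(printf '%s\n' "$line" |
    sed -E -e ':a' -e 's/#~#0(\.+)#~#/#~#\1#~#/' -e 'ta')

printf '%s\n%s\n' "$once" "$looped"
```

The float 0.5 is left alone in both cases because its dot is not followed directly by the delimiter.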
Using sed you can do:
s='#~#0.....#~#'
sed -r 's/(^|#~#)0(\.+($|#~#))/\1\2/g' <<< "$s"
#~#.....#~#
sed -r 's/(^|#~#)0(\.+($|#~#))/\1\2/g' <<< "#~#0.00#~#"
#~#0.00#~#
$ echo "#~#0.....#~#" | sed 's/#0/#/g'
#~#.....#~#
Escape the dots and include all characters that should match:
echo "#~#0.1234#~#0.....#~#" | sed 's/#~#0\.\./#~#../g'
Using variables does not improve it much:
delim="#~#"
echo "#~#0.1234#~#0.....#~#" | sed "s/${delim}0\.\./${delim}../g"

replace more than one special character with sed

I'm a newbie with regex, so sed is giving me a headache.
I need help replacing all special characters in the given company names with "-".
So this is the given string:
FML Finanzierungs- und Mobilien Leasing GmbH & Co. KG
I want the result:
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
I tried the following:
nr=$(echo "$name" | sed -e 's/ /-/g')
This replaces all whitespace with -, but what is the right expression to replace the others? My own searches via Google were not very successful.
That depends on what you consider to be a special character -- I say this because you appear to consider & a regular character but not ., which seems a bit odd. Anyway, I imagine something of the form
nr=$(echo "$name" | sed 's/[^[:alnum:]&]\+/-/g')
would serve you best. Here [^[:alnum:]&] matches any character that is not alphanumeric or &, and [^[:alnum:]&]\+ matches a sequence of one or more such characters, so the sed call replaces all such sequences in $name with a hyphen. If there are other characters that you consider regular, add them to the set. Note that the handling of umlauts and suchlike depends on your locale.
Also note that echo may cause trouble if $name begins with a hyphen (it could be parsed as options for echo), so if you can tether yourself to bash,
nr=$(sed 's/[^[:alnum:]&]\+/-/g' <<< "$name")
might be more robust.
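If bash is not an option, printf offers the same protection against a leading hyphen in any POSIX shell. A minimal sketch, which also spells the GNU \+ as the POSIX interval \{1,\}; the character set kept as "regular" is the one from the answer above:

```shell
name='FML Finanzierungs- und Mobilien Leasing GmbH & Co. KG'

# printf '%s\n' prints $name verbatim, even if it starts with "-".
# \{1,\} means "one or more" in POSIX BRE, like GNU's \+.
nr=$(printf '%s\n' "$name" | sed 's/[^[:alnum:]&]\{1,\}/-/g')

printf '%s\n' "$nr"
```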
Apparently you want to remove - and . and then replace spaces with -.
This would do it, by saying sed -e 'one thing' -e 'another thing':
$ echo "$name" | sed -e 's/[-\.]//g' -e 's/ /-/g'
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
Note we enclose within square brackets all the characters that we want to treat equally: [-\.] means either - or . (inside a bracket expression the dot is already literal, so the backslash is not strictly needed; outside brackets an unescaped dot would match any character).
Does this help you:
awk -vOFS=- '{gsub(/[.-]/,"");$1=$1}1' <<< "$name"
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG
gsub(/[.-]/,"") removes . and -
-vOFS=- sets the output field separator to -
$1=$1 reconstructs the line so it uses the new field separator
1 prints the line.
To get it to a variable
nr=$(awk -vOFS=- '{gsub(/[.-]/,"");$1=$1}1' <<< "$name")
Try it this way also:
echo "$name" | sed 's/ \|- \|\. /-/g'
Output:
FML-Finanzierungs-und-Mobilien-Leasing-GmbH-&-Co-KG

Regex correct but not working in sed for 2-character words

I've used regex101.com and a few others to check that this is correct and it seems to be. I want to remove all words which are two characters long or less. My current implementation is:
head -n 10 abstracts.txt | sed 's/ [a-zA-Z]{1,2} //g'
And it's just not doing anything. I would like to go from something like this:
This is a short sentence.
To this:
This short sentence.
Thanks for any help.
Escape the curly brackets and use word boundary:
head -n 10 abstracts.txt | sed 's/ [a-zA-Z]\{1,2\}\b//g'
Don't match on literal spaces; use \b for word boundaries:
echo 'This is a short sentence' | sed -e 's/\b[a-zA-Z]\{1,2\}\b//g'
This short sentence
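For context on why the original command did nothing: without -E, sed reads the pattern as a BRE, where { and } are literal characters. A sketch of the ERE spelling, assuming GNU sed (\b is a GNU extension). Since the word-boundary version above removes only the words and leaves the spaces around them, this variant also consumes one optional space after each removed word:

```shell
text='This is a short sentence.'

# -E switches sed to ERE, where {1,2} is an interval, not literal braces.
# " ?" takes one following space along with each short word.
result=$(printf '%s\n' "$text" | sed -E 's/\b[a-zA-Z]{1,2}\b ?//g')

printf '%s\n' "$result"
```

A short word at the very end of a line would still leave the space before it in place.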
Just for test, using awk
awk '{for (i=1;i<=NF;i++) if (length($i)<3) $i="";gsub(/ +/," ")}1'
This short sentence.
This might work for you (GNU sed):
sed -e 's/\b\w\w\?\b\s\+\|\s\+\w\w\?$//g' file
This removes one- or two-character words and the following spaces throughout a line, or the preceding spaces and a one- or two-character word at the end of a line.