How to replace arbitrary combinations of (special) characters and numbers using sed and regular expressions - regex

I have a CSV file with nearly arbitrarily filled columns like this:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""
and now I want to replace the comma between the two numbers with a point:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
I tried a lot but nothing helped. :-(
sed s/[0-9],[0-9]/./g data.csv
works, but it deletes the digits immediately before and after the comma. So I tried things like
sed s/\(\.[0-9]\),\([0-9]\.\)/\1.\2/g data.csv
but that changed nothing.

Try with s/\([0-9]\),\([0-9]\)/\1.\2/g:
$ echo '"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""' | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
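You don't need the additional dot \. in the capturing groups: it matches a literal dot, and there is no dot next to that comma in your data. Also quote the sed script so that the shell does not strip the backslashes before sed sees them.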
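One caveat with the pattern above: it rewrites every comma that sits between two digits, so two adjacent unquoted numeric columns would be joined as well. Here is a minimal sketch of a stricter variant, assuming the timestamp always ends in HH:MM:SS. The sample line and the variable names are made up for illustration.

```shell
# Hypothetical sample line: two unquoted numeric columns, then the timestamp.
line='1,2,"bla",19.05.16 12:00:03,123456789,"x"'

# Loose pattern: any comma between two digits becomes a dot.
loose=$(printf '%s\n' "$line" | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g')

# Stricter pattern: only a comma that directly follows HH:MM:SS is replaced.
strict=$(printf '%s\n' "$line" |
    sed -E 's/([0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9])/\1.\2/g')

printf 'loose:  %s\n' "$loose"
printf 'strict: %s\n' "$strict"
```

The loose version also turns 1,2 into 1.2, while the strict one leaves those columns alone. If your file never has two bare numbers next to each other, the shorter pattern is enough.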

Related

Find and Replace Specific characters in a variable with sed

Problem: I have a variable with characters I'd like to prepend another character to within the same string stored in a variable
Ex. "[blahblahblah]" ---> "\[blahblahblah\]"
Current Solution: Currently I accomplish what I want with two steps, each step attacking one bracket
Ex.
temp=[blahblahblah]
firstEscaped=$(echo $temp | sed s#'\['#'\\['#g)
fullyEscaped=$(echo $firstEscaped | sed s#'\]'#'\\]'#g)
This gives me the result I want but I feel like I can accomplish this in one line using capturing groups. I've just had no luck and I'm getting burnt out. Most examples I come across involve wanting to extract the text between brackets instead of what I'm trying to do. This is my latest attempt to no avail. Any ideas?
There may be more efficient ways (a single s/// with a fancier regex), but this works for your sample input:
fully=$(echo "$temp" | sed 's/\([[]\)/\\\1/;s/\([]]\)/\\\1/') ; echo "$fully"
output
\[blahblahblah\]
Note that it is quite OK to chain multiple sed commands together, separated by ; or, in a sed script file, by newlines.
Read about sed capture-groups using \(...\) pairs, and referencing them by number, i.e. \1.
IHTH
$ temp=[blahblahblah]
$ fully=$(echo "$temp" |sed 's/\[\|\]/\\&/g'); echo "$fully"
\[blahblahblah\]
Brief explanation:
\[\|\]: matches either '[' or ']'. The brackets are escaped so that they are taken literally, and \| is the GNU sed spelling of alternation in a basic regex.
&: refers to the text that was matched. The \\ in front of it inserts a literal backslash; the & itself is not escaped.
As @Gordon Davisson suggested, you can also use a bracket expression, which avoids the GNU-specific \| alternation:
sed 's/[][]/\\&/g'
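The reason [][] works: inside a bracket expression, a ] that comes first is taken literally, so the set contains both ] and [. A small sketch to check it, with illustrative variable names and a second, made-up input that has more than one pair of brackets:

```shell
temp='[blahblahblah]'
nested='a[1][2]'   # hypothetical extra input with several brackets

# One substitution handles both bracket characters; \\ emits a literal
# backslash and & re-inserts whichever bracket was matched.
escaped=$(printf '%s\n' "$temp" | sed 's/[][]/\\&/g')
escaped_nested=$(printf '%s\n' "$nested" | sed 's/[][]/\\&/g')

printf '%s\n%s\n' "$escaped" "$escaped_nested"
```

This form uses only basic regex features, so it should behave the same in GNU and BSD sed.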

Select a single character in an alphanumeric string in bash

I have an issue with string manipulation in bash. I have a list of names, each name being composed of two parts, chars and numbers: for example
abcdef01234
I want to cut the last character before the numeric part starts, in this case
f
I think there is a regular expression to help me with this but just can't figure it out. AWK/sed solutions are accepted too. Hope someone can help.
Thank you.
In bash it can be done with parameter expansion with substring removal and string indexes, e.g.,
a=abcdef01234 # your string
tmp=${a%%[0-9]*} # remove everything from the first digit to the end
echo "${tmp:(-1)}" # output the last remaining character
Output: f
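Since the question mentions a list of names, here is a sketch of the same idea wrapped in a function and applied in a loop. It assumes every name starts with its letter part. The function name and the extra sample names are made up, and it uses only POSIX parameter expansion instead of the bash-specific ${tmp:(-1)}, so it should also run under plain sh.

```shell
# Hypothetical helper: print the last character before the first digit.
last_letter_before_digits() {
    prefix=${1%%[0-9]*}                 # drop everything from the first digit on
    rest=${prefix%?}                    # the prefix without its final character
    printf '%s\n' "${prefix#"$rest"}"   # what is left is that final character
}

for name in abcdef01234 xy7 q42z9; do
    last_letter_before_digits "$name"
done
```

For the three sample names this prints f, y and q, one per line.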
You can use a regexp like [a-zA-Z]+([a-zA-Z])[0-9]+. If you know how to use sed, it is pretty easy.
Check https://regex101.com/r/XCkKM5/1
Capture group 1 will be the letter you want.
^\w+([a-zA-Z])\d+$
As a sed command (on OSX) this will be :
echo "abcdef12345" | sed -E "s#^[a-zA-Z]+([a-zA-Z])[0-9]+\$#\1#"
You can also try this awk version:
echo "abcdef01234" | awk '{match($0,/[a-zA-Z]+/);print substr($0,RSTART+RLENGTH-1,1)}'
I assume "I have a list of names" means a file, here named file. Using grep's PCRE mode and a (positive) lookahead:
$ grep -oP "[a-z](?=[^a-z])" file
f
It prints every (lowercase) letter that is immediately followed by a non-(lowercase)-letter character; for the sample name that is just f.

Add curly braces to string after a match (sed)

I'm a beginner with regexes and I'm trying to achieve something relatively simple:
I have a dataset arranged like this:
1,AAA,aaaa,BBB,bbbbbb ...
2,AAA,aaaaaaa,BBB,bbb ...
3,AAA,aaaaa,BBB,bb ...
I'm looking into adding curly brackets to the strings of various length (alphanumeric chars) following AAA or BBB (these are constant):
1,AAA,{aaaa},BBB,{bbbbbb} ...
2,AAA,{aaaaaaa},BBB,{bbb} ...
3,AAA,{aaaaa},BBB,{bb} ...
So I have tried with sed this way:
sed 's/(AAA|BBB)[[:punct:]].[[:alnum:]]/\1{&}/g' dataset.txt
However I got this result:
1,AAA,{AAA,aa}aa,BBB,{BBB,bb}bbbb, ...
2,AAA,{AAA,aa}aaaaa,BBB,[BBB,bb}b, ...
3,AAA,{AAA,aa}aaa,BBB,{BBB,bb} ...
Obviously, the & in the replacement part of sed is going to be the whole matched pattern; however, I would like & to be only what comes after AAA or BBB. What am I doing wrong?
I have also tried adding word boundaries after [^ ], to no avail. Am I trying too hard with sed? Should I use a language that allows lookbehind instead?
Thanks for any help!
Try this:
sed 's/\(AAA\|BBB\),\([^,]*\)/\1,{\2}/g' dataset.txt
You can always have more than one capture group in your regex, to capture different parts. You can even move the [:punct:] part inside the first capture group (note that sed has no (?:...) non-capturing groups, so the inner (AAA|BBB) counts as group 2 and the value is group 3):
sed -E 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
I don't understand what the . between [:punct:] and [:alnum:] was doing, so I removed it. Because of that . and the missing quantifier, your regex was matching only the following:
{AAA,aa}
{BBB,bb}
i.e., it was matching just 2 characters after the comma that follows AAA or BBB: one for . and one for [[:alnum:]].
To match all the alphanumeric characters after the , up to the next , you need a quantifier: [[:alnum:]]+
Following sed should work.
On Linux:
sed -i.bak -r 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
OR on OSX:
sed -i.bak -E 's/((AAA|BBB)[[:punct:]])([[:alnum:]]+)/\1{\3}/g' dataset.txt
-i edits the input file in place; with the .bak suffix, a backup of the original is kept as dataset.txt.bak.
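Before editing the file in place, it may be worth trying the substitution on a few lines first. Here is a minimal sketch using the sample rows from the question (without the trailing "..."). The variable names are made up, and it assumes the separator after AAA/BBB is always a comma.

```shell
data='1,AAA,aaaa,BBB,bbbbbb
2,AAA,aaaaaaa,BBB,bbb
3,AAA,aaaaa,BBB,bb'

# Group 1 is the constant key, group 2 the alphanumeric value that follows it;
# only the value is wrapped in braces.
braced=$(printf '%s\n' "$data" |
    sed -E 's/(AAA|BBB),([[:alnum:]]+)/\1,{\2}/g')

printf '%s\n' "$braced"
```

The first output line should be 1,AAA,{aaaa},BBB,{bbbbbb}. Once the result looks right, the same expression can be used with -i.bak on the real file.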

Simplify points in KML using regex

I am trying to cut down the file size of a kml file I have.
The coordinates for the polygons are this accurate:
-113.52106535153605,53.912817815321503,0.
I am not very good with regex, but I think it would be possible to write one that selects the eight characters before the commas. I'd run a search and replace so the result would be
-113.521065,53.9128178,0.
Any regex experts out there think this is possible?
Try this
\d{8}(?=,)
and replace with an empty string
Here is something that might work. It replaces the 8 characters before each comma, plus the comma itself, with just a comma: s/.{8},/,/g
echo "-113.52106535153605,53.912817815321503,0." | sed 's/.\{8\},/,/g'
So you can run sed on the file you have like this:
sed 's/.\{8\},/,/g' file.kml > newfile.kml
I just had to do the same thing. This is perl instead of sed, but it looks for a run of eight uninterrupted digits and then replaces any further uninterrupted digits after that with nothing. It worked great.
cat originalfile.kml | perl -pe 's/(?<=\d{8})\d*//g' > shortenedfile.kml
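A different approach, in case the numbers do not all have the same length: instead of chopping a fixed count of characters before the comma, keep a fixed count of decimal places. This is only a sketch, assuming six decimals are precise enough (roughly 0.1 m of latitude). The variable names are made up, and the result differs slightly from the one asked for in the question because both values end up with the same number of decimals.

```shell
coords='-113.52106535153605,53.912817815321503,0.'

# Keep the dot and the first six digits after it (group 1),
# and drop any digits that follow.
trimmed=$(printf '%s\n' "$coords" | sed -E 's/(\.[0-9]{6})[0-9]+/\1/g')

printf '%s\n' "$trimmed"
```

This prints -113.521065,53.912817,0. for the sample line. Note that, run on a whole KML file, it would shorten every long decimal number, not only those inside the coordinates.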

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p in s/\.png//p forces the print if a substitution was done, but does not force it otherwise.
That is pretty easy to do with sed, and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. It also captures all the non-period chars before .png", using the group brackets \( and \), so we can reuse what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would change in the output. However, following the & there is a newline character, represented by a backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala's and Glenn Jackman's (with the dots escaped, so that they only match a literal period):
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input
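To see the difference between the variants, here is a small sketch run on input that also contains a line without .png. The extra line and the variable names are made up for illustration.

```shell
input='#"Afghanistan.png",
some other line
#"American_Samoa.png",'

# Plain p prints every line first, so the non-matching line is doubled too.
all_doubled=$(printf '%s\n' "$input" | sed 'p; s/\.png//')

# With an address, p and s only run on lines that contain an image name.
only_png=$(printf '%s\n' "$input" |
    sed -e '/^#".*\.png",/{p;s/\.png//;}')

printf '%s\n' "$only_png"
```

With the address in place, "some other line" appears once in the output; with the plain p version it appears twice.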