gsub regex pattern - regex

I am using gsub to substitute tabs with commas:
gsub(/\t/,",")
so a\tb becomes a,b.
In some instances I have two tabs following each other.
For example
a\t\tb
In that case gsub converts it to a,,b
In cases like that, I want the string to be converted to a,-,b (a minus sign in between).
I tried writing two separate gsubs:
gsub(/\t/,",") // for tab
gsub(/,,/,"/,-,/") // for consecutive commas
The second doesn't seem to work.
What's wrong with it? Is there a way I can combine both in one gsub?

I take it you're asking about awk?
I don't think it can be done with a single gsub, in fact I needed three:
$ abc=$(echo 'a.b..c...d....e.....f' | tr . '\t')
$ echo "$abc" | awk '{gsub(/\t/, ","); gsub(/,,/, ",-,"); gsub(/,,/, ",-,"); print}'
a,b,-,c,-,-,d,-,-,-,e,-,-,-,-,f
The problem is that a single gsub on /,,/ will consume both commas, so it will leave a gap between the next pair of commas, if there are three or more consecutive ones. In a more powerful regexp engine, such as Perl, it can be done in a single pass using a lookahead:
$ echo "$abc" | perl -pe 's/\t/,/g; s/,(?=,)/,-/g;'
a,b,-,c,-,-,d,-,-,-,e,-,-,-,-,f
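The repeated gsub can also be written as a loop in awk, which makes the intent explicit: keep substituting until nothing changes. This is only a minimal sketch; the function name tabs_to_csv is made up for illustration and is not part of awk.

```shell
# tabs_to_csv is a hypothetical helper name used only for this sketch.
tabs_to_csv() {
    awk '{
        gsub(/\t/, ",")                    # every tab becomes a comma
        while (gsub(/,,/, ",-,") > 0) {}   # repeat until no empty field is left
        print
    }'
}

printf 'a\tb\t\tc\t\t\td\n' | tabs_to_csv
# prints: a,b,-,c,-,-,d
```

gsub returns the number of substitutions it made, so the loop stops on the first pass that finds no adjacent commas.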

Related

can sed replace words in pattern substring match in one line?

original line in file sed.txt:
outer_string_PATTERN_string(PATTERN_And_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string
I only need to replace PATTERN with pattern inside the parentheses. The replacement is not necessarily the lowercase form; it could be any other word.
expect result:
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
I can use the pattern ([^)]*) to find the substrings in which words should be replaced. But I can't use it to restrict the substitution to those substrings; using it as an address replaces every PATTERN on the line.
:/tmp$ sed 's/([^)]*)/---/g' sed.txt
outer_string_PATTERN_string---PATTERN_outer_string---_outer_string
:/tmp$ sed '/([^)]*)/s/PATTERN/pattern/g' sed.txt
outer_string_pattern_string(pattern_And_pattern_pattern_i)pattern_outer_string(i_pattern_inner)_outer_string
I also tried to use the regex group in sed to capture and replace the words, but I can't figure out the command.
Can sed implement that? And how to achieve that? THX.
Can sed implement that?
It can be done using GNU sed and basic regular expressions
(BRE):
sed '
s/)/)\n/g
:1
s/\(([^)]*\)PATTERN\([^)]*)\n\)/\1pattern\2/
t1
s/\n//g
' < file
where
1st s inserts a newline after each )
2nd s replaces the last (* is greedy) PATTERN inside ()s with pattern
t loops back if a substitution was made
3rd s strips all inserted newlines
EDIT
2nd substitute command edited according to OP's suggestion
since there is no need to match \n inside ().
Can sed implement that?
Yes. But you do not want to do it in sed. Use other programming language, like Python, Perl, or awk.
how to achieve that?
Implementing a non-greedy regex is not simple in sed. In general, it consists of:
taking a chunk of the input
processing the chunk
putting it in hold space
shuffling hold space with pattern space to separate what has already been processed from what has not
repeating
shuffling with hold space again
outputting the result
Anyway, the following script:
#!/bin/bash
sed <<<'outer_string_PATTERN_string(PATTERN_i_PATTERN_PATTERN_i)PATTERN_outer_string(i_PATTERN_inner)_outer_string' '
:loop;
/\([^(]*\)\(([^)]*)\)\(.*\)/{
# Lowercase the second part.
s//\1\L\2\E\n\3/;
# Mix with hold space.
G;
s/\(.*\)\n\(.*\)\n\(.*\)/\3\1\n\2/;
# Put processed stuff into hold space
h; s/\n.*//; x;
# Process the other stuff again.
s/.*\n//;
bloop;
};
# Is hold space empty?
x; /^$/!{
# Pattern space has trailing stuff - add it.
G; s/\n//;
# We will print it.
h;
# Clear hold space
s/.*//
};x;
'
outputs:
PATTERN_outer_string(i_pattern_inner)outer_string_PATTERN_string(pattern_i_pattern_pattern_i)_outer_string
As an alternative, it is easier to do this in gnu awk with RS that matches (...) substring:
awk -v RS='\\([^)]+)' '{gsub(/PATTERN/, "pattern", RT); ORS=RT} 1' file
outer_string_PATTERN_string(pattern_i_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
Steps:
RS='\\([^)]+)' captures a (...) string as record separator
gsub function then replaces PATTERN with pattern in matched text i.e. RT
ORS=RT sets ORS as the new modified RT
1 prints each record to stdout
Another alternative solution using lookahead assertion in a perl regex:
perl -pe 's/PATTERN(?=[^()]*\))/pattern/g' file
Solved by this:
:/tmp$ sed 's/(/\n(/g' sed.txt | sed 's/)/)\n/g' | sed '/([^)]*)/s/PATTERN/pattern/g' | sed ':a;N;$!ba;s/\n//g'
outer_string_PATTERN_string(pattern_And_pattern_pattern_i)PATTERN_outer_string(i_pattern_inner)_outer_string
put each (...) group on its own line
on the lines containing (...), replace PATTERN with pattern
merge the lines back into one line
Thanks to the answers to "How can I replace a newline (\n) using sed?"

Find and Replace Specific characters in a variable with sed

Problem: I have a string stored in a variable, and I'd like to prepend another character to certain characters within it.
Ex. "[blahblahblah]" ---> "\[blahblahblah\]"
Current Solution: Currently I accomplish what I want with two steps, each step attacking one bracket
Ex.
temp=[blahblahblah]
firstEscaped=$(echo $temp | sed s#'\['#'\\['#g)
fullyEscaped=$(echo $firstEscaped | sed s#'\]'#'\\]'#g)
This gives me the result I want but I feel like I can accomplish this in one line using capturing groups. I've just had no luck and I'm getting burnt out. Most examples I come across involve wanting to extract the text between brackets instead of what I'm trying to do. This is my latest attempt to no avail. Any ideas?
There may be more efficient ways, (only 1 s/s/r/ with a fancier reg-ex), but this works, given your sample input
fully=$(echo "$temp" | sed 's/\([[]\)/\\\1/;s/\([]]\)/\\\1/') ; echo "$fully"
output
\[blahblahblah\]
Note that it is quite OK to chain together multiple sed operations, separated by ; or, if in a sed script file, by newlines.
Read about sed capture-groups using \(...\) pairs, and referencing them by number, i.e. \1.
IHTH
$ temp=[blahblahblah]
$ fully=$(echo "$temp" |sed 's/\[\|\]/\\&/g'); echo "$fully"
\[blahblahblah\]
Brief explanation,
\[\|\]: matches either '[' or ']'. The brackets are escaped to make them literal, and \| is the alternation operator (a GNU extension in basic regular expressions).
&: refers to the text that matched; the \\ in front of it inserts a literal backslash, which has to be escaped itself.
As Gordon Davisson suggested, you may also use a bracket expression and avoid the alternation altogether,
sed 's/[][]/\\&/g'
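The bracket-expression variant can be wrapped in a small function to try it on other strings. This is a sketch only; escape_brackets is a hypothetical name chosen for illustration.

```shell
# escape_brackets is a hypothetical helper name used only for this sketch.
escape_brackets() {
    # [][] is a bracket expression holding ] and [; the ] has to come first
    # to be taken literally. \\& writes a backslash followed by the match.
    printf '%s\n' "$1" | sed 's/[][]/\\&/g'
}

escape_brackets '[blahblahblah]'
# prints: \[blahblahblah\]
```

Passing the string as a quoted argument and printing it with printf avoids the unquoted echo, which could otherwise let the shell treat [blahblahblah] as a glob pattern.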

How to replace arbitrary combinations of (special) characters and numbers using sed and regular expressions

I have a csv file with nearly arbitrarily filled columns like this:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""
and now I want to replace the comma between the two numbers with a point:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
I tried a lot but nothing helped. :-(
sed s/[0-9],[0-9]/./g data.csv
works but it deletes the two digits before and after the comma. So I tried things like
sed s/\(\.[0-9]\),\([0-9]\.\)/\1.\2/g data.csv
but that changed nothing.
Try with s/\([0-9]\),\([0-9]\)/\1.\2/g:
$ echo '"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""' | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
You don't really need the additional dot \. in the capturing groups.
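The difference between the first attempt and the working command can be seen on a shortened line. This is a sketch; the sample value in line and the function name fix_decimal_comma are made up for illustration.

```shell
# Shortened, hypothetical sample line.
line='12:00:03,123456789,"x"'

# No groups: the whole match (digit, comma, digit) is replaced by the dot.
printf '%s\n' "$line" | sed 's/[0-9],[0-9]/./g'
# prints: 12:00:0.23456789,"x"

# fix_decimal_comma is a hypothetical helper name.
# The two groups put the surrounding digits back around the dot.
fix_decimal_comma() {
    sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
}
printf '%s\n' "$line" | fix_decimal_comma
# prints: 12:00:03.123456789,"x"
```

The single quotes around the sed script matter: without them the shell strips the backslashes before sed sees them, which is one reason an unquoted attempt can silently change nothing.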

Why those two sed commands get different result?

I have a csv file, example.csv, containing:
hello,world,wow
this,is,amazing
I want to get the first column's elements. At first I wrote a sed command like:
sed -n 's/\([^,]*\),*/\1/p' example.csv
output:
helloworld,wow
thisis,amazing
Then I modified my command to the following and get what I want:
sed -n 's/\([^,]*\).*/\1/p' example.csv
output:
hello
this
In command 1 I used a comma (,) and in command 2 I replaced the comma with a dot (.), and it works as expected. Can anyone explain how sed produces the first output? Is it because of the dot (.) or because of the capture group and back-reference?
In both regexes, \([^,]*\) consumes the same part of the string: all the characters preceding the first comma. The difference is in how the remaining part of each regex is treated.
In the first one, it's ,* - zero or more commas. All it can consume is the comma itself; the rest of the line isn't covered by the pattern.
In the second one, it's .* - zero or more of any character. It covers the remaining string completely, as it has nothing to stop at.
In both cases the part of the string covered by the pattern is replaced by the contents of the capturing group (all the characters before the first comma), and whatever is covered by the remaining part of the regex is simply removed. So in the first case only the first comma is erased; in the second, the comma and the rest of the string.
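The two behaviours can be reproduced side by side on a single line of the sample file. This is a sketch; the function name first_field is made up for illustration.

```shell
# ,* only consumes the commas directly after the first field,
# so only the first comma disappears.
printf 'hello,world,wow\n' | sed 's/\([^,]*\),*/\1/'
# prints: helloworld,wow

# first_field is a hypothetical helper name.
# .* consumes everything up to the end of the line, so only group 1 survives.
first_field() {
    sed 's/\([^,]*\).*/\1/'
}
printf 'hello,world,wow\n' | first_field
# prints: hello
```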
The reason is that the pattern matches only the first part of the line, i.e. only the hello, part is replaced. The ,* part takes an arbitrary number of commas, and nothing is specified after it, so nothing else is part of the match. For example:
hello,,,,,,,,,,,,,,,,,,world
would be replaced to
helloworld
A good example would be
sed -n 's/\([^,]*\),*$/\1/p' example.csv
This will work if and only if all the commas are at the end of the line and will trim them, e.g.
hello,,,,,,
Hope this makes the problem a bit clearer.
In a regex, the . (dot) is a placeholder for any single character.
Can I suggest not using sed?
cut -d, -f1 example.csv
Personally, I'm a huge sed fan, but cut is much more appropriate in this instance.
If you want the first field, why not use awk?
awk -F, '{print $1}' file
hello
this
Using sed with back reference
sed -nr 's/([^,]*),.*/\1/p' file
hello
this
To make it work you need the .* so that the pattern covers the whole line.
The -r option (extended regular expressions) means you do not need to escape the parentheses as \( and \).

Matching A File Name Using Grep

The overarching problem:
So I have a file name that comes in the form of
JohnSmith14_120325_A10_6.raw
and I want to match it using regex. I have a couple of issues in building a working example but unfortunately my issues won't be solved unless I get the basics.
So I have just recently learned about piping and one of the cool things I learned was that I can do the following.
X=ll_paprika.sc (don't ask)
VAR=`echo $X | cut -p -f 1`
echo $VAR
which gives me paprika.sc
Now when I try to execute the pipe idea in grep, nothing happens.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
Can anyone explain what I am doing wrong?
Second question:
How does one match a single underscore using regex?
Here's what I am ultimately trying to do;
VAR=`echo $X | grep -e "^[a-bA-Z][a-bA-Z0-9]*(_){1}[0-9]*(_){1}[a-bA-Z0-9]*(_){1}[0-9](\.){1}(raw)"
So the basic idea of my pattern here is that the file name must start with a letter, and then it can have any number of letters and numbers following it. It must have an _ to delimit a series of numbers, another _ to delimit the next set of numbers and characters, and another _ to delimit the next set of numbers, and then it must have a single period followed by raw. This looks grossly wrong and ugly (because I am not sure about the syntax). So how does one match a file extension? Can someone put up a simple example for something like ll_paprika.sc so that I can figure out how to do my own regex?
Thanks.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
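The reason this isn't doing what you want is that grep matches a line and returns the whole line, not just the matching part. Note also that *.sc is a shell glob, not a regular expression: unquoted, the shell may expand it to file names before grep sees it, and as a basic regex its leading * is taken literally, so it does not match ll_paprika.sc. With a regex that does match, such as '\.sc$', grep returns the whole line and that is what ends up in $VAR.
If you want to get just a part of it, the cut approach is probably better. There is a grep -o option that returns only the matching portion, but for this you'd basically have to put in the thing you were looking for, at which point why bother?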
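The difference between plain grep and grep -o can be seen with a quoted regex. This is a sketch; extension_of is a hypothetical name, and the pattern for "everything from the last dot" is an assumption chosen for illustration.

```shell
name=ll_paprika.sc

# Without -o, grep prints the whole line that contains a match.
printf '%s\n' "$name" | grep '\.sc$'
# prints: ll_paprika.sc

# extension_of is a hypothetical helper name.
# With -o, only the text matched by the regex is printed.
extension_of() {
    printf '%s\n' "$1" | grep -o '\.[^.]*$'
}
extension_of "$name"
# prints: .sc
```

Quoting the pattern keeps the shell from expanding it as a glob before grep runs.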
the file name must start with a letter
grep -E "^[a-zA-Z]
and then it can have any number of letters and numbers following it
[a-zA-Z0-9]*
and it must have an _ delimit a series of numbers and another _ to delimit the next set of numbers and characters and another _ to delimit the next set of numbers
_[0-9]+_[a-zA-Z0-9]+_[0-9]+
and then it must have a single period following by raw.
\.raw"
For the first, use:
VAR=`echo $X | egrep '\.sc$'`
For the second, you can try this alternative instead:
VAR=`echo $X | egrep '^[[:alpha:]][[:alnum:]]*_[[:digit:]]+_[[:alnum:]]+_[[:digit:]]+\.raw'`
Note that your character classes from your expression differ from the description that follows in that they seem to only be permissive of a-b for lower case characters in some places. This example is permissive of all alphanumeric characters in those places.
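The second pattern can be checked against the sample file name from the question. This is a sketch; is_valid_raw_name is a hypothetical name, and the trailing $ anchor is an addition not present in the expression above, assumed here so that nothing may follow .raw.

```shell
# is_valid_raw_name is a hypothetical helper name.
# The trailing $ is an assumption: it rejects names with text after .raw.
is_valid_raw_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[[:alpha:]][[:alnum:]]*_[[:digit:]]+_[[:alnum:]]+_[[:digit:]]+\.raw$'
}

is_valid_raw_name JohnSmith14_120325_A10_6.raw && echo match
# prints: match
is_valid_raw_name 14John_120325_A10_6.raw || echo "no match"
# prints: no match
```

grep -q prints nothing and reports the result through its exit status, which suits a yes/no check better than capturing the output in a variable.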