sed and regex to replace ',' except inside a string

I have an input of the following schema
10,0,'string1_string2,_string3','',8,0,0,0.59,'20140101205216','20140128074836',584266915,5934
and I would like to replace all comma "," characters with tabs using sed. The constraint is not to replace "," inside text strings (i.e., the comma in 'string1_string2,_string3' should not be replaced with a tab). A regex to do this is ,(?!,_).
However the following sed does not work. I've tried all escaping permutations too.
sed s/",\(\?\!,_\)"/"\t"/g
Is there a way to do this?

On Mac OS X 10.9.1, you can use:
sed -E -e "s/('[^']*'|[^,]*),/\1X/g"
except that you'd replace the X with an actual tab. For your input line, that yields:
10X0X'string1_string2,_string3'X''X8X0X0X0.59X'20140101205216'X'20140128074836'X584266915X5934
which has X's where you want tabs. With GNU sed, you can use -r in place of -E (though it also recognizes -E). Mac sed will not expand \t to a tab; GNU sed will. With Bash, you can use the ANSI-C Quoting mechanism to have the shell embed a tab in the string passed to sed:
sed -E -e "s/('[^']*'|[^,]*),/\1"$'\t'"/g"
Without the extended regular expressions (activated by -r or -E), it isn't worth trying in sed; use awk instead.
The regex looks for either a single quote followed by zero or more non-quotes and a closing single quote, or zero or more non-commas; either alternative must be followed by a comma. It replaces the match with the remembered either/or string and a tab (X stands in for the tab above because it is more visible).
devnull points out that the answer above replaces the comma in a string at the end of a line. There's a workaround for that:
sed -E -e "s/('[^']*'|[^,]*)(,|$)/\1"$'\t'"/g; s/"$'\t'"$//"
The s///g before the semicolon adds a tab to the end of each line; the s/// after the semicolon removes the tab that was just added.
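For anyone who wants to try the two-step substitution without typing a tab by hand, here is a minimal sketch. The function name csv_to_tsv is made up for illustration, and the tab is produced with printf so the script does not rely on Bash's $'\t' quoting. It assumes single-quoted fields that never contain an escaped quote.

```shell
# csv_to_tsv is a hypothetical helper name, not part of the original answer.
csv_to_tsv() {
    tab=$(printf '\t')    # a literal tab, so sed never has to interpret \t
    # Step 1: a quoted field or a run of non-commas, followed by a comma or end of line.
    # Step 2: drop the tab that step 1 appended at the end of the line.
    sed -E -e "s/('[^']*'|[^,]*)(,|\$)/\1${tab}/g" -e "s/${tab}\$//"
}

line="10,0,'string1_string2,_string3','',8"
# Show tabs as X so the result is visible, as in the answer above.
out=$(printf '%s\n' "$line" | csv_to_tsv | tr '\t' 'X')
printf '%s\n' "$out"
```

The comma inside the quoted field should survive, while every separator comma becomes a tab.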

You could use Text::ParseWords:
perl -MText::ParseWords -n -l -e 'print join("\t", parse_line(",", 1, $_));' filename
For your input, it'd result in:
10 0 'string1_string2,_string3' '' 8 0 0 0.59 '20140101205216' '20140128074836' 584266915 5934

I would suggest using Perl, if available, because it supports lookarounds:
s="10,0,'string1_string2,_string3','',8,0,0,0.59,'20140101205216','20140128074836',584266915,5934"
perl -pe "s/,(?=(([^']*'){2})*[^']*$)/\t/g" <<< "$s"
10\t0\t'string1_string2,_string3'\t''\t8\t0\t0\t0.59\t'20140101205216'\t'20140128074836'\t584266915\t5934
PS: The tabs are shown as \t only for readability.

This seems to work if I understand your question correctly:
sed -E 's/,([^_])/\t\1/g'
Output:
10 0 'string1_string2,_string3' '' 8 0 0 0.59 '20140101205216' '20140128074836' 584266915 5934

Related

bash tool to search and replace text (while leaving text in the middle the same)

I have text files that look like this:
foo(bar(some_id))
I want to replace that with
bleh(some_id)
I can come up with the regex to find the instances, which is: foo\(bar\([a-zA-Z0-9_]+\)\). But I don't know how to express that I want to keep the text in the middle the same.
Any suggestion? (I'm thinking of using sed or awk or any standard bash tool, whichever is easier )

You can use
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/'
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/'
The first pattern is POSIX ERE compliant, hence the -E option.
The foo\(bar\(([^()]*).* POSIX ERE pattern matches foo(bar(, then captures any zero or more chars other than ( and ) into Group 1 (\1 refers to this group value from the replacement pattern), and then matches the rest of string. After the replacement, the Group 1 value remains. You may add .* at the start if there is text before foo(bar(.
The second sed command is POSIX BRE equivalent of the above command.
See an online demo:
s='foo(bar(some_id))'
sed -E 's/foo\(bar\(([^()]*).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
sed 's/foo(bar(\([^()]*\).*/bleh(\1)/' <<< "$s"
# => bleh(some_id)
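If the line contains other text that must be kept, a variant that matches only the foo(bar(...)) span (instead of swallowing the rest of the line with .*) may be more suitable. This is a sketch under the assumption that the identifier consists of letters, digits and underscores, as in the question; the sample line is invented.

```shell
# Invented sample line with two occurrences and surrounding text.
s='x = foo(bar(some_id)) + foo(bar(other_id));'

# Match the whole foo(bar(ID)) span, capture ID, and rebuild it as bleh(ID).
# The g flag repeats this for every occurrence on the line.
out=$(printf '%s\n' "$s" | sed -E 's/foo\(bar\(([A-Za-z0-9_]+)\)\)/bleh(\1)/g')
printf '%s\n' "$out"
# => x = bleh(some_id) + bleh(other_id);
```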

Using sed
$ sed 's/.*\(([^)]*)\).*/bleh\1/' input_file
bleh(some_id)

Bash: String manipulation with sed and Regular expression is not working: replace a string by slash

I hope you can help me out:
Here is one of my lines that I have to string manipulate:
./period/0.0.1/projectname/path/path/-rw-rw-r--filename.txt 2462
Where the last number is the file size and needed for later calculations.
The sequence -rw-rw-r-- is from a file listing Output where I separated files from directories and skipped all lines starting with "d".
Now I need to get rid of the rights sequence in the lines.
Here is my regex, that exactly hits that target: [/][-][-rwx]{9,9}
I checked that with a regex checker and get exactly what I want:
the string /- including the following 9 characters.
What I want: replace this string " /- including the following 9 characters " by a single slash /.
To avoid escaping I use a pipe as the separator in sed. The following sed command works correctly:
sed 's|teststring|/|g' inputfile > outputfile
The problem:
When I replace "teststring" by my regex it is not manipulating anything:
sed 's|[/][-][-rwx]{9,9}|/|g' inputfile > outputfile
I get no errors at all, but there are no string manipulations in the resulting outputfile.
What am I doing wrong here??
Please help!

You can use this sed with extended regex:
sed -E 's|/-[-rwx]{9}|/|g' file
./period/0.0.1/projectname/path/path/filename.txt 2462
No need to use [/] and [-] in your regex
Use -E for extended regex matching
{9,9} is the same as {9}

You may use
sed 's|/-[-rwx]\{9\}|/|g'
Note that in POSIX BRE patterns, in order to specify a limiting quantifier, you need to escape the braces.
See the Bash demo:
s='./period/0.0.1/projectname/path/path/-rw-rw-r--filename.txt 2462'
echo $s | sed 's|/-[-rwx]\{9\}|/|g'
Output:
./period/0.0.1/projectname/path/path/filename.txt 2462
NOTE: It is not a good idea to wrap each individual char with a bracket expression, [/] = / and [-] = -.

sed uses the BRE regex flavour by default, where braces should be escaped.
Either escape them :
sed 's|[/][-][-rwx]\{9,9\}|/|g' inputfile > outputfile
Or switch to ERE :
sed -r 's|[/][-][-rwx]{9,9}|/|g' inputfile > outputfile # for GNU sed
sed -E 's|[/][-][-rwx]{9,9}|/|g' inputfile > outputfile # for BSD sed & modern GNU sed
As a side note, your regex can be simplified to /-[-rwx]{9} :
sed -E 's|/-[-rwx]{9}|/|g' inputfile > outputfile
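To see the difference between the flavours side by side, here is a small sketch that runs the same substitution three ways on the line from the question. The variable names are arbitrary.

```shell
s='./period/0.0.1/projectname/path/path/-rw-rw-r--filename.txt 2462'

# BRE (sed default): the interval braces must be escaped.
bre=$(printf '%s\n' "$s" | sed 's|/-[-rwx]\{9\}|/|g')

# ERE (-E): the braces are written without backslashes.
ere=$(printf '%s\n' "$s" | sed -E 's|/-[-rwx]{9}|/|g')

# BRE with unescaped braces: {9} is taken literally, so nothing matches
# and sed silently prints the line unchanged, as described in the question.
unescaped=$(printf '%s\n' "$s" | sed 's|/-[-rwx]{9}|/|g')

printf '%s\n' "$bre" "$ere" "$unescaped"
```

The first two results should be identical, and the third should be the untouched input line.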

sed find and replace fastq regex

I have a file such as
head testSed.fastq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:NATCAGCN+TAGATCGCCAAGTTAA
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:NCAGCAGN+TATCTTCTATAAATAT
NCAGCAGN
And I am attempting to replace the string after the final colon with 0 (in this example on lines 1,5,9 - but globally) using a regular expression.
I have checked my regex using egrep '[ATGCN]{8}\+[ATGCN]{16}$' testSed.fastq, which returns all the lines I would expect.
However when I try to use sed -i 's/[ATGCN]{8}\+[ATGCN]{16}$/0/g' testSed.fastq the original file is unchanged and no replacement occurs.
How can I fix this? Is my regex not specific enough?

Do you need a regex for this?
awk -F: -v OFS=: '/^#/ && NF > 1 {$NF = "0"} 1' testfile
That won't save in-place. If you have GNU awk you can
gawk -F: -v OFS=: -i inplace '...' file
ref: https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html

Your regex is structured as an ERE rather than a BRE, which is sed's default interpretation. Not all sed implementations support ERE, but you can check man sed in your environment to determine whether it's possible for you. Look for the -r or -E options. Alternatively, stay with BRE and write the bounds with backslashes before the curly braces; in that case the + must be left unescaped, since a plain + is a literal plus sign in BRE.
That said, rather than matching the precise text in the last field, why not just look for the string that starts with a colon, and is followed by no-more-colons? The following RE is both BRE and ERE compatible.
$ sed '/^#/s/:[^:]*$/:0/' testq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:0
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:0
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:0
NCAGCAGN
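For completeness, the regex from the question can also be made to work once the flavour matches. A minimal sketch on the first header line from the question, showing the ERE form and an equivalent BRE form (variable names are arbitrary):

```shell
header='#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA'

# ERE: bare braces are intervals, and \+ is a literal plus sign.
fixed_ere=$(printf '%s\n' "$header" | sed -E 's/[ATGCN]{8}\+[ATGCN]{16}$/0/')

# BRE: intervals need backslashes, and a plain + is already literal.
fixed_bre=$(printf '%s\n' "$header" | sed 's/[ATGCN]\{8\}+[ATGCN]\{16\}$/0/')

printf '%s\n' "$fixed_ere" "$fixed_bre"
```

Both commands should print the header with the index pair replaced by 0.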

How to add a line break before and after a regex in a text file?

This is an excerpt from the file I want to edit:
>chr1|-|9|S|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG >chr1|+|9|Y|somatic ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
I would like a new text file in which a line break is added before ">" and after "somatic" or "germline". How can I do this in R or Unix?
Expected output:
>chr1|-|9|S|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG
>chr1|+|9|Y|somatic
ACCACAGCCCTGTTTTACGTTGCGTCATCGCCCCGGGTGCCTGGTGACGTCACCAGCCCGCTCG

By the looks of your input, you could simply replace spaces with newlines:
tr -s ' ' '\n' <infile >outfile
(Some tr dialects don't like \n. Try '\012' or a literal newline: opening quote, newline, closing quote.)
If that won't work, you can easily do this in sed. If somatic is static, just hard-code it:
sed -e 's/somatic */&\n/g' -e 's/ >/\n>/g' file >newfile
The usual caveats about different sed dialects apply. Some versions don't like \n for newline, some want a newline or a semicolon instead of multiple -e arguments.
On Linux, you can modify the file in-place:
sed -i 's/somatic */&\
/g
s/ >/\
>/g' file
(For variation, I'm showing how to do this if your sed doesn't recognize \n but allows literal newlines, and how to put the script in a single multi-line string.)
On *BSD (including MacOS) you need to add an argument to -i always; sed -i '' ...
If somatic is variable, but you always want to replace the first space after a wedge, try something like
sed 's/\(>[^ ]*\) /\1\n/g'
>[^ ]* matches a wedge followed by zero or more non-space characters. The parentheses capture the matched string into \1. Again, some sed variants don't want backslashes in front of the parentheses, or are otherwise just ... different.
If you have very long lines, you might bump into a sed which has problems with that. Maybe try Perl instead. (Luckily, no dialects to worry about!)
perl -i -pe 's/(>[^ ]*) /$1\n/g;s/ >/\n>/g' file
(Skip the -i option if you don't want to modify the input file. Then output will be to standard output.)
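Since the question mentions both "somatic" and "germline", here is a sketch that handles either keyword and avoids relying on \n in the replacement by keeping a literal newline in a shell variable. The variable names and the shortened sample sequences are made up for illustration.

```shell
# A literal newline held in a variable, for sed versions without \n support.
nl='
'
in='>chr1|-|9|S|somatic ACGT >chr1|+|9|Y|germline TTGA'

# First expression: break the line after either keyword (eating the spaces).
# Second expression: break the line before each remaining " >".
out=$(printf '%s\n' "$in" |
    sed -E -e "s/(somatic|germline) +/\1\\${nl}/g" -e "s/ >/\\${nl}>/g")
printf '%s\n' "$out"
```

Each header and each sequence should end up on its own line.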

(\bsomatic\b|\bgermline\b)|(?=>)
Try this. Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/53
If there's no support for lookahead then try
(\bsomatic\b|\bgermline\b)
Replace by $1\n. See demo.
http://regex101.com/r/tF5fT5/50
and
(>)
Replace by \n$1. See demo.
http://regex101.com/r/tF5fT5/51

Thank you everyone!
I used:
tr -s ' ' '\n' <infile >outfile
as suggested by tripleee and it worked perfectly!

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.

For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes
For those familiar with Perl-compatible regexes: sed does not implement PCRE, but GNU sed accepts the \s shorthand as an extension, so there you can use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*). Be aware that POSIX does not treat \t inside a bracket expression as a tab (GNU sed does, as an extension), so for strict portability type a literal tab character in its place.
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace after the backslash. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error-prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
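One way to take the guesswork out of the portable form is to let the shell supply the literal tab and newline, so sed never has to interpret \t or \n itself. A minimal sketch, with arbitrary variable names and invented sample text:

```shell
tab=$(printf '\t')   # literal tab character
nl='
'                    # literal newline character

# One space-or-tab followed by zero or more of them, replaced by
# backslash + newline, which is how older seds spell a newline.
words=$(printf 'alpha beta\tgamma  delta\n' |
    sed -e "s/[ ${tab}][ ${tab}]*/\\${nl}/g")
printf '%s\n' "$words"
```

Double quotes are used here only so the variables expand; the single backslash that sed needs is written as \\ inside them.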

All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt

Option 1
printf '%s\n' $(cat testfile)
Option 2
tr ' ' '\n' < testfile

This should do the work:
sed -E -e 's/[ \t]+/\n/g'
[ \t] means a space or a tab. If you want any kind of whitespace, you could also use \s.
[ \t]+ means one or more spaces or tabs. The + needs -E; without it, GNU sed expects \+ instead.
s/x/y/ means replace the pattern x with y (here \n is a newline).
The g at the end means the replacement is repeated for every occurrence on each line.

You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
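One thing to watch for with any of these substitutions: leading or trailing blanks on a line turn into empty lines in the output. A sketch of a follow-up pipeline that drops the empty lines and keeps each word once; the sample text and variable names are invented, and the literal newline variable is there only to avoid depending on \n support.

```shell
nl='
'
list=$(printf '  to be  or not\tto be \n' |
    sed -E "s/[[:blank:]]+/\\${nl}/g" |   # one word per line
    sed '/^$/d' |                         # drop empty lines from leading/trailing blanks
    LC_ALL=C sort -u)                     # unique words, byte-order sort
printf '%s\n' "$list"
```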

You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new

Using gawk:
gawk '{$1=$1}1' OFS="\n" file
