Converting Files with regexp Pattern in sed

I want to turn this (Mitarbeiter.csv):
Max;Mustermann;02.03.1964;501;GL;Prokurist
Monika;Mueller;02.02.1972;500;Sek;Chefsekretaerin
Michael;Maier;06.07.1985;617;Aquise;-
into this (header-content.html):
<tr><td>Max</td><td>Mustermann</td><td>501</td></tr>
<tr><td>Monika</td><td>Mueller</td><td>500</td></tr>
<tr><td>Michael</td><td>Maier</td><td>617</td></tr>
by using sed
I've tried:
sed 's#^\([^\]+\);\([^\]+\);[^\]+;\([^\]+\);.*$#<tr><td>\2</td><td>\1</td><td>\3</td></tr>\n#g' <Mitarbeiter.csv >header-content.html
but that does nothing. Output is same as Mitarbeiter.csv

awk might be a little better suited to what you're trying to do:
awk -F\; '{printf "<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n",$1,$2,$4}'

sed -r -ne 's:^([^;]+);([^;]+);[^;]+;([^;]+);.*:<tr><td>\1</td><td>\2</td><td>\3</td></tr>:p'
Or if you're using OSX or an older version of FreeBSD or NetBSD, replace the -r with -E to use extended regular expressions.
If you want to skip using ERE for portability (i.e. you're using Solaris or HP/UX or somesuch), the regexp might be:
^\([^;][^;]*\);\([^;][^;]*\);[^;]*;\([^;][^;]*\);.*
Note that these both require at least 1 character per field. If fields are allowed to be empty ... well, update your question before we spend more time on things that might not be necessary. :-)
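If empty fields do turn out to matter, a minimal sketch of the BRE variant with * in place of "one or more" could look like this. The sample row with an empty first field is made up for illustration and is not part of the question's data.

```shell
# Hypothetical row whose first field is empty.
row=';Mustermann;02.03.1964;501;GL;Prokurist'

# [^;]* also accepts an empty field; -n plus the p flag prints only
# lines on which the substitution succeeded.
html=$(printf '%s\n' "$row" |
  sed -n 's:^\([^;]*\);\([^;]*\);[^;]*;\([^;]*\);.*:<tr><td>\1</td><td>\2</td><td>\3</td></tr>:p')

printf '%s\n' "$html"
```

The empty field simply becomes an empty <td></td> cell.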

A few points:
You need the -r switch for extended regex patterns. Without it, sed uses basic regex syntax, in which an unescaped + is a literal plus sign, so your pattern never matches.
[^\] excludes the backslash rather than the field separator; use [^;] instead.
sed is greedy, and even -r does not support non-greedy matching.
The g flag replaces every match on a line. Your pattern is anchored with ^ and $, so it can match at most once per line and g is unnecessary.
So your command should be:
sed -r 's#^([^;]+);([^;]+);[^;]+;([^;]+);.*$#<tr><td>\1</td><td>\2</td><td>\3</td></tr>#' < Mitarbeiter.csv > header-content.html
Note that your items cannot have a semicolon in them, as that is the field separator. If you have a true CSV file, this won't work, as it will not ignore a semicolon that is wrapped in quotes or preceded by an escape character.
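To see the difference between the two regex syntaxes in isolation, here is a minimal sketch on one row of the sample data. It assumes a sed that accepts -E for extended regexes (current GNU and BSD versions do); the variable names are only for illustration.

```shell
line='Max;Mustermann;02.03.1964;501;GL;Prokurist'

# BRE (sed's default): a bare + is a literal plus sign, so the pattern
# does not match and the line is printed unchanged.
bre_out=$(printf '%s\n' "$line" | sed 's#^\([^;]+\);.*$#<td>\1</td>#')

# ERE (-E): + means "one or more" and groups need no backslashes.
ere_out=$(printf '%s\n' "$line" | sed -E 's#^([^;]+);.*$#<td>\1</td>#')

printf 'BRE: %s\nERE: %s\n' "$bre_out" "$ere_out"
```

The BRE line comes back untouched, which is the "output is same as Mitarbeiter.csv" symptom from the question.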

Why would you want to use sed?
awk '{print "<tr><td>"$1"</td><td>"$2"</td><td>"$4"</td></tr>"}' FS=';' Mitarbeiter.csv > header-content.html

If you insist on using sed, you can try:
$ p='\([^;]*\);'
$ sed "s#$p$p$p$p.*#<tr><td>\1</td><td>\2</td><td>\4</td></tr>#" \
Mitarbeiter.csv > header-content.html

Related

sed find and replace fastq regex

I have a file such as
head testSed.fastq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:NATCAGCN+TAGATCGCCAAGTTAA
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:NCAGCAGN+TATCTTCTATAAATAT
NCAGCAGN
And I am attempting to replace the string after the final colon with 0 (in this example on lines 1,5,9 - but globally) using a regular expression.
I have checked my regex with egrep '[ATGCN]{8}\+[ATGCN]{16}$' testSed.fastq, which returns all the lines I would expect.
However when I try to use sed -i 's/[ATGCN]{8}\+[ATGCN]{16}$/0/g' testSed.fastq the original file is unchanged and no replacement occurs.
How can I fix this? Is my regex not specific enough?

Do you need a regex for this?
awk -F: -v OFS=: '/^#/ && NF > 1 {$NF = "0"} 1' testfile
That won't save in-place. If you have GNU awk you can
gawk -F: -v OFS=: -i inplace '...' file
ref: https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html

Your regex is structured as an ERE rather than a BRE, which is sed's default interpretation. Not all sed implementations support ERE, but you can check man sed in your environment to determine whether it's possible for you. Look for -r or -E options. You can alternately use bounds by preceding the curly braces with backslashes.
That said, rather than matching the precise text in the last field, why not just look for the string that starts with a colon, and is followed by no-more-colons? The following RE is both BRE and ERE compatible.
$ sed '/^#/s/:[^:]*$/:0/' testq
#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:0
NGTCACTN
+
#>AAAAF#
#M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:0
NATCAGCN
+
#>>AA?C#
#M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:0
NCAGCAGN
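The BRE/ERE point can also be checked directly against one header line from the question. This is only a sketch: it assumes a sed that understands -E, and it shows the original pattern written once for each syntax.

```shell
hdr='#M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA'

# ERE: {8} is a bound and \+ is a literal plus sign.
ere=$(printf '%s\n' "$hdr" | sed -E 's/[ATGCN]{8}\+[ATGCN]{16}$/0/')

# BRE: bounds need backslashes, and an unescaped + is already literal.
bre=$(printf '%s\n' "$hdr" | sed 's/[ATGCN]\{8\}+[ATGCN]\{16\}$/0/')

printf '%s\n%s\n' "$ere" "$bre"
```

Both forms replace the index pair after the final colon with 0.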

use regular expressions to identify html form action tags

I am trying to sed -i to update all my html forms for url shortening. Basically I need to delete the .php from all the action="..." tags in my html forms.
But I am stuck at just identifying these instances. I am trying this testfile:
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
And I am using this expression:
grep -R "action\s?=\s?(.*)php(\"|\')" testfile
And grep returns nothing at all.
I've tried a bunch of variations, and I can see that even the \s? isn't working because just this grep command also returns nothing:
grep -R "action\s?=\s?" testfile
grep -R "action\\s?=\\s?" testfile
(the latter I tried thinking maybe I had to escape the \ in \s).
Can someone tell me what's wrong with these commands?
Edit:
Fix 1 - apparently I need to escape the question mark in \s? so that it is treated as an optional quantifier rather than a literal question mark.

The way you're using it, grep accepts basic POSIX regex syntax. The single quote does not need to be escaped in it [1], but some of the metacharacters you use do -- in particular, ?, (), and |. You can use
grep -R "action\s\?=\s\?\(.*\)php\(\"\|'\)" testfile
I recommend, however, that you use extended posix regex syntax by giving grep the -E flag:
grep -E -R "action\s?=\s?(.*)php(\"|')" testfile
As you can see, that makes the whole thing much more readable.
Addendum: To remove the .php extension from all action attributes in a file, you could use
sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' testfile
Shell strings make this look scarier than it is; the sed code is simply
s/\(action\s*=\s*["'][^"']*\)\.php\(["']\)/\1\2/g
I amended the regex slightly so that in a line action='foo.php' somethingelse='bar.php' the right .php would be removed. I tried to make this as safe as I can, but be aware that handling HTML with sed is always hacky.
Combine this with find and its -exec filter to handle a whole directory.
[1] The double quote needs to be escaped because you use a double-quoted shell string, not because the regex requires it.
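A rough sketch of the find combination on a throwaway directory follows. The file names and form markup are invented for the demonstration, [[:space:]] stands in for \s, and the -i syntax shown is the GNU one (BSD/macOS sed expects -i '' instead).

```shell
# Throwaway directory with two hypothetical form files.
dir=$(mktemp -d)
printf '%s\n' '<form action="yo.php" method="post">' > "$dir/a.html"
printf '%s\n' "<form action = 'test.php'>" > "$dir/b.html"

# Run the substitution in place on every .html file below $dir.
find "$dir" -type f -name '*.html' -exec \
  sed -i 's/\(action[[:space:]]*=[[:space:]]*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' {} +

result_a=$(cat "$dir/a.html")
result_b=$(cat "$dir/b.html")
printf '%s\n%s\n' "$result_a" "$result_b"
rm -rf "$dir"
```

Trying this on a copy of the real files first is a sensible precaution, since -i overwrites them.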

You can use the -P option (a GNU grep extension) to use Perl regexes:
$ grep -P "action\s?=\s?(.*)php(\"|\')" test
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"

Try this unescaped plain regex, which only selects text within quotes:
action\s?=\s?["'](.*)\.php["']
You can fiddle around with it here:
https://regex101.com/r/lN8iG0/1
On the command line this would be:
grep -P "action\s?=\s?[\"'](.*)\.php[\"']" test

how to extract these fields via sed?

I'm trying to grep for individual quantities in lines like this:
foo=24.587 bar=88 fox=jobs
and extract, say, all the '88' values. The number of columns isn't consistent, so awk followed by cut won't cut it.
I tried using sed like this:
sed -e 's/.*\s\(bar=.+\)\s.*/\1/g'
and that just dumps the entire line. I'm not sure how to correct this regexp and, more importantly, why doesn't it do what I expect?

Use -r (extended regex). This tends to use regexen more like you may expect. You have to remove the backslashes from the parens, though:
$ echo "foo=24.587 bar=88 fox=jobs" | sed -r 's/.*\s(bar=.+)\s.*/\1/g'
bar=88
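The pattern above depends on bar= having a space on both sides, and the greedy .+ can swallow later fields when more than one follows. As an alternative sketch (not what the answer proposes), grep -o with a negated character class picks out the pair wherever it sits. -o is available in GNU and BSD grep but is not required by POSIX, and the variable names are illustrative.

```shell
line='foo=24.587 bar=88 fox=jobs'

# -o prints only the matched text; [^ ]* stops at the next space,
# so the position of bar= within the line does not matter.
pair=$(printf '%s\n' "$line" | grep -o 'bar=[^ ]*')

# Strip the key with a parameter expansion to keep just the value.
value=${pair#bar=}

printf '%s\n%s\n' "$pair" "$value"
```

Note that this would also match a key that merely ends in bar, such as foobar=1.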

sed -r 's/.*\s(bar=.+)\s.*/\1/g'

Points to slashes with sed

I have text file like this format:
...
SomeText.any_text/ch SomeText2.any_3/ch 5.6e-5
SomeText.any_text/ch something.else.point.separated/ch4 5.4e5
...
Each line has three elements: two strings made of alphanumeric characters, underscores, dots and slashes, and one floating-point number.
I need to replace the dots with slashes, but only in the strings.
I have tried to use sed with a regular expression like this:
sed 's/\([\w_]\+\)\(\.\)/\1\//g'
but it does not give the result I want.

This might work for you (GNU sed):
sed 's/[^ ]*$/\n&/;h;y/./\//;G;s/\n.*\n//' file
Explanation:
s/[^ ]*$/\n&/ insert a newline before the last field
h copy the pattern space (PS) to the hold space (HS)
y/./\// translate all .'s to /'s in the PS
G append a newline then HS to the PS
s/\n.*\n// remove everything between the first and last newlines i.e. delete the old strings
This idiom can be used to change part of a line without resorting to complicated regexps.
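As a quick check of the hold-space idiom, here is a sketch that runs it on the two sample lines from the question. It assumes GNU sed, because \n in the replacement text is a GNU extension.

```shell
converted=$(printf '%s\n' \
  'SomeText.any_text/ch SomeText2.any_3/ch 5.6e-5' \
  'SomeText.any_text/ch something.else.point.separated/ch4 5.4e5' |
  sed 's/[^ ]*$/\n&/;h;y/./\//;G;s/\n.*\n//')

printf '%s\n' "$converted"
```

The dots in the two strings become slashes, while the trailing number keeps its decimal point because it is restored from the untouched copy in the hold space.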

Your elements look like fields. Therefore, my preferred method would be to use awk:
awk '{ for (i=1; i<=2; i++) gsub(/\./, "/", $i) }1' file.txt
Results:
SomeText/any_text/ch SomeText2/any_3/ch 5.6e-5
SomeText/any_text/ch something/else/point/separated/ch4 5.4e5

You can do this in classic sed notation with a couple of loops, one to fix dots in the first field, and one to fix dots in the second field.
sed -e ':f1' -e 's/^\([^ .]*\)\./\1\//' -e 't f1' \
-e ':f2' -e 's/^\([^ ][^ ]*\) \([^ .]*\)\./\1 \2\//' -e 't f2'
The ^ anchors are crucial to this working correctly. Yes, you can write it all on one line in a single argument to sed; I prefer the clarity of separate arguments when the script is as complex as this. A typical sed script is inscrutable enough without adding any extra obstacles to comprehension.
sed ':f1;s/^\([^ .]*\)\./\1\//;t f1;:f2;s/^\([^ ][^ ]*\) \([^ .]*\)\./\1 \2\//;t f2'
For your input sample (two lines), the output is:
SomeText/any_text/ch SomeText2/any_3/ch 5.6e-5
SomeText/any_text/ch something/else/point/separated/ch4 5.4e5
If you're using GNU sed, you might need to add --posix to the options, though it seemed to behave itself correctly (so it probably recognized that I wasn't using any non-POSIX notations and therefore stuck with POSIX).
Tested on Mac OS X 10.7.5 with BSD sed and GNU sed.

awk '{gsub(/\./,"/",$1); gsub(/\./,"/",$2); print}' your_file

matching a specific substring with regular expressions using awk

I'm dealing with a specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING

While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option is a GNU grep extension that may not be available elsewhere. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'

The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
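Since the goal is a bash script, the same multi-pass idea can also be sketched with parameter expansion alone, with no external tools. The function name random_part is hypothetical, and the sketch assumes the four leading fields are always separated by underscores and that -W only appears where the optional suffix begins.

```shell
# Hypothetical helper: print RANDOMSTR without the optional -W... suffix.
random_part() {
  name=${1%.raw.gz}        # drop the trailing .raw.gz
  name=${name%-W*}         # drop the trailing -W... suffix, if any
  name=${name#*_*_*_*_}    # drop the first four underscore-separated fields
  printf '%s\n' "$name"
}

random_part '20100613_M4_28007834.005_F_SOME-STRING.raw.gz'
random_part '20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz'
```

Because # removes the shortest matching prefix, underscores inside RANDOMSTR itself are left alone.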

Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
