How to use sed to replace u'sometext' with 'sometext' - regex

I have a file with text in it. I simply want to strip the leading u from all instances of u'sometext' so that it leaves 'sometext'. I haven't been able to figure out how to get sed to match on u' and replace it with '.
The sed command I thought would work:
echo ['a', u'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null'] | sed "s/u'/'/g"
output:
[a, uupdate for microsoft office 2013 (kb4022166) 32-bit edition, unknown, null]
what I wanted:
['a', 'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null']
More examples of what is in the file:
"[u'cpe:/o:microsoft:windows_7::sp1:x64-enterprise', u'cpe:/a:adobe:acrobat:11.0.19']"
What I would like to have:
"['cpe:/o:microsoft:windows_7::sp1:x64-enterprise', 'cpe:/a:adobe:acrobat:11.0.19']"

Try wrapping the whole string in double quotes, like this:
echo "['a', u'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null']" | sed "s/u'/'/g"
OUTPUT:
['a', 'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null']
Without the double quotes, the shell treats the input as several separate words and removes the single quotes before sed ever sees them, so there is no u' left to match.
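Since the real data lives in a file, shell quoting should not come into play there at all. A minimal sketch, assuming a temporary file stands in for the real one (the shortened sample line is made up for illustration):

```shell
# Write one sample line to a temporary file (stand-in for the real file).
tmp=$(mktemp)
printf '%s\n' "['a', u'update for microsoft office 2013', 'unknown', 'null']" > "$tmp"

# sed reads the file directly, so the single quotes reach it untouched.
result=$(sed "s/u'/'/g" "$tmp")
echo "$result"

rm -f "$tmp"
```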

You will need to use a word boundary, denoted by the special sequence \b (a GNU extension in sed), which goes immediately before the first thing to be matched on a boundary:
$ echo "[u'a', u'hello']" | sed "s/\bu'/'/g"
['a', 'hello']

$ echo "[u'a', u'hello', u'version 7-u']" | sed "s/u\('[^']*'\)/\1/g"
['a', 'hello', 'version 7-u']
$ echo "['a', u'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null']" | sed "s/u\('[^']*'\)/\1/g"
['a', 'update for microsoft office 2013 (kb4022166) 32-bit edition', 'unknown', 'null']
$ echo "[u'cpe:/o:microsoft:windows_7::sp1:x64-enterprise', u'cpe:/a:adobe:acrobat:11.0.19']" | sed "s/u\('[^']*'\)/\1/g"
['cpe:/o:microsoft:windows_7::sp1:x64-enterprise', 'cpe:/a:adobe:acrobat:11.0.19']
Note though that both the above and the currently accepted answer would fail if you can have a u at the end of a single-quote-delimited string earlier in the line. e.g.:
$ echo "['u', 'a']" | sed "s/u\('[^']*'\)/\1/g"
['', 'a']
$ echo "['u', 'a']" | sed "s/\bu'/'/g"
['', 'a']
so, assuming that is an issue, we can use a more robust approach with awk (in this case using GNU awk for multi-char RS and RT):
$ echo "['u', 'a']" | awk -v RS="'[^']*'" -v ORS= 'RT{sub(/u$/,"")} {print $0 RT}'
['u', 'a']
$ echo "[u'a', u'hello', u'version 7-u']" | awk -v RS="'[^']*'" -v ORS= 'RT{sub(/u$/,"")} {print $0 RT}'
['a', 'hello', 'version 7-u']
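If GNU awk is not available, one possible sed-only sketch is to make the u optional, so that every quoted string is consumed as a single match and a u at the end of one string can never be paired with the opening quote of the next. This assumes the strings contain no escaped single quotes; the function name strip_u_prefix is made up for illustration:

```shell
# Hypothetical helper: drop a "u" prefix from single-quoted strings on stdin.
# Each match is an optional "u" followed by one complete quoted string, so
# unprefixed strings such as 'u' are matched (and kept) as a whole.
strip_u_prefix() {
  sed -E "s/u?('[^']*')/\1/g"
}

printf '%s\n' "['u', 'a']" | strip_u_prefix
printf '%s\n' "[u'a', u'hello', u'version 7-u']" | strip_u_prefix
```

A bare word ending in u directly in front of a quoted string (for example foou'bar') would still lose its u, so this is only safe for input shaped like the lists above.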

Related

Why isn't Mac sed matching what I expect?

echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' | sed -En 's,(\w+-\w+-\w+-\w+-\w+),\1,p'
I'm using extended regex with -E (-r in GNU sed) and -n to print only the lines where a replacement happened. Assuming my regex101 test is correct,
I'm expecting 5EF5105C-7EED-4017-979C-A6185E927B84 in the output, but I get nothing.
If you're just trying to get the serial number out from inside the parens, and you're not actually modifying anything, then use grep:
$ echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' \
| grep -E '\w+-\w+-\w+-\w+-\w+' -o
5EF5105C-7EED-4017-979C-A6185E927B84
-o tells grep "Just output what matched, not the entire line".
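If you would rather stay with sed, two things are worth knowing: \w is a GNU extension rather than POSIX, so not every sed accepts it, and s only replaces the text that matched, so the rest of the line survives unless the pattern also covers it with .* on both sides. A sketch using POSIX character classes, assuming the identifier is always five hyphen-separated alphanumeric groups inside parentheses (extract_id is a made-up name):

```shell
# Hypothetical helper: print only the parenthesised identifier from each line.
# The leading and trailing .* swallow the rest of the line, so the
# replacement \1 leaves just the identifier.
extract_id() {
  sed -En 's/.*\(([[:alnum:]]+(-[[:alnum:]]+){4})\).*/\1/p'
}

echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' | extract_id
```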

What command line tool should I use to extract a string from a file using a complex regexp

Problem
In a linux script I want to assign several variables with data from a file.
The command should be able to:
use complex regexp: meaning normal java/javascript/php/perl regexps,
I can adapt to some variations
return at stdout the first capturing group (or a capturing group specified by parameter). If it supports named capturing groups could receive the name as a parameter: `regextractor ""
ideally should already exist in standard linux distributions maybe on osx as well
Sample
I'm searching for a command like
mydate=`regextractor "^date:\s*(\S{10}).*$" myfile.md`
mytitle=`regextractor "^title:\s*(.*)\s*$" myfile.md`
echo $mydate - $mytitle
My file:
---
slug: article1
date: 2012-01-29 15:34:01
title: What is the best monetary system invented til now?
author: raisercostin<raisercostin#gmail.com>
tags: currency,monetary,system
type: question
toslug: article
The regexp was tested here: https://regex101.com/r/y311eP/1
I tried
UPDATE: grep - the default grep is using a simplified regex
grep -o "toslug:\(.*\)" myfile.md => toslug: article
UPDATE: grep extended - it supports complex regex but not capturing groups
grep --color=no -oE "toslug:(.*)" myfile.md => toslug: article
sed - it doesn't support capturing groups so a line is returned.
sed -nE "s/^date:\s*(\S*)//p" myfile.md
UPDATE: it actually supports capturing groups but you need to enable --regexp-extended with -E parameter
awk - it doesn't support capturing groups so a line is returned
awk '/^date:\s*(.*)$/' myfile.md
UPDATE
On my Linux vagrant-ubuntu-trusty-64 3.13.0-101-generic #148-Ubuntu SMP Thu Oct 20 22:08:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux:
sed doesn't have the -E option documented but works for some cases
WORKED sed -nE 's/^date:\s*(\S*)/\1/p' myfile.md => 2012-01-29 15:34:01
DIDN'T WORK sed -nE 's/^date:\s*(\S{10})/\1/p' myfile.md => 2012-01-29 15:34:01
DIDN'T WORK sed -nE 's/^date:\s*(\S*)$/\1/p' myfile.md => no line
WORKED awk '{ match($0, /^date:\s*(\S{10}).*$/, a); if(a[1])print a[1]}' myfile.md => 2012-01-29
On my Darwin costins-MBP.router1 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64:
PARTIALLY WORKED sed -nE 's/^date:\s*(\S*)/\1/p' myfile.md => 2012-01-29 15:34:01 - the initial spaces are not consumed
DIDN'T WORK sed -nE 's/^date:\s*(\S{10})/\1/p' myfile.md => no line
DIDN'T WORK sed -nE 's/^date:\s*(\S*)$/\1/p' myfile.md => no line
DIDN'T WORK awk '{ match($0, /^date:\s*(\S{10}).*$/, a); if(a[1])print a[1]}' myfile.md => awk: syntax error at source line 1{ match($0, >>> /^date:\s*(\S{10}).*$/, <<<
Here is a perl oneliner:
perl -E 'undef$/;$_=<>;($d,$t)= $_ =~ /\ndate:\s*(\S{10}).+\ntitle:\s*(.+?)\R/s;say "$d - $t";' <file.txt
Output:
2012-01-29 - What is the best monetary system invented til now?
file.txt
---
slug: article1
date: 2012-01-29 15:34:01
title: What is the best monetary system invented til now?
author: raisercostin<raisercostin#gmail.com>
tags: currency,monetary,system
type: question
toslug: article
(gnu) awk supports captures:
awk '/^date/ { match($0, /^date:[[:blank:]]*([^[:blank:]]{10}).*$/, a); print a[1]}' myfile.md
gives: 2012-01-29
awk '/title/ { match($0, /^title:[[:blank:]]*([^[:blank:]]*.*$)/, a); print a[1]}' myfile.md
gives: What is the best monetary system invented til now?
Based on Jan Smydke's response. This captures up to two groups and prints them.
Input:
regextractor () {
perl -E 'undef$/;$_=<>;($v1,$v2)= $_ =~ /'$1'/m;say "$v1$v2";' <$2
}
regextractor "^date:\s*(\S{10}).*$" myfile.md
regextractor "^title:\s*(.*)\s*$" myfile.md
Output:
2012-01-29
What is the best monetary system invented til now?
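For a version that avoids both perl and GNU-only features, here is a sketch built on sed -E with POSIX character classes. \s and \S are GNU extensions, which probably explains the differing Linux and macOS results in the question, and because s only replaces what matched, the pattern has to cover the whole line. The helper name first_group and the temporary sample file are assumptions for illustration:

```shell
# Hypothetical helper: print the first capture group of the first matching line.
# $1 = POSIX extended regex that matches the whole line (must not contain "/")
# $2 = file to search
first_group() {
  sed -nE "s/$1/\\1/p" "$2" | head -n 1
}

# Temporary sample file standing in for myfile.md.
sample=$(mktemp)
cat > "$sample" <<'EOF'
---
slug: article1
date: 2012-01-29 15:34:01
title: What is the best monetary system invented til now?
toslug: article
EOF

mydate=$(first_group '^date:[[:space:]]*([^[:space:]]{10}).*$' "$sample")
mytitle=$(first_group '^title:[[:space:]]*(.*)$' "$sample")
echo "$mydate - $mytitle"

rm -f "$sample"
```

The regex is pasted straight into the sed script, so a pattern containing / or coming from untrusted input would need escaping first.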

Sed invalid range end

I have strings like these:
volume 5
vol. 5
V. 5
v. 5
I'm trying to turn them into this format:
\textbf{5}
with this sed command
s/\(v[a-Z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/
but I keep getting invalid range end. Am I doing something wrong with the 0-9 range?
If you check the ASCII table, you will see that the value of a (97) is higher than the value of Z (90), so a-Z is an invalid range. Moreover, you need a case-insensitive pattern: add the I modifier (GNU sed only):
echo 'volume 5' | sed 's/\(v[a-z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/gI'
echo 'vol. 5' | sed 's/\(v[a-z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/gI'
echo 'V. 5' | sed 's/\(v[a-z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/gI'
echo 'v. 5' | sed 's/\(v[a-z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/gI'
produces
volume \textbf{5}
vol. \textbf{5}
V. \textbf{5}
v. \textbf{5}
Since the BSD implementation of sed does not support case-insensitive matching, on macOS, you need to install GNU sed with the following brew command:
brew install gnu-sed
and then use
gsed -e 's/\(v[a-z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/gI'
etc.
Or, add the uppercase letters to the bracket expression:
sed 's/\(v[a-zA-Z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/g'
And if you want to make sure only ASCII letters are matched add
LC_ALL=C sed 's/\(v[a-zA-Z]*[.]*\) \([0-9]*\)/\1 \\textbf{\2}/g'
This worked for me:
sed -r "s/([vV][a-zA-Z]*[.]*) ([0-9]*)/\1 \\\textbf{\2}/"
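Another option that should work in both GNU and BSD sed is the POSIX class [[:alpha:]], which avoids ranges and the GNU-only I flag. A sketch along the lines of the answers above (bold_volume is a made-up name, and at least one digit is required here):

```shell
# Hypothetical helper: wrap the number after "v...", "V." etc. in \textbf{}.
# [[:alpha:]] matches letters of either case without relying on a range.
bold_volume() {
  sed 's/\([vV][[:alpha:]]*\.*\) \([0-9][0-9]*\)/\1 \\textbf{\2}/'
}

printf '%s\n' 'volume 5' 'vol. 5' 'V. 5' 'v. 5' | bold_volume
```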

Replace string if first letter is uppercase using sed

I tried to write a sed answer to the question "Edit a file using sed/awk" using:
sed -e 's/^[A-Z]/$:$&/' file.txt
but the result is:
wednesday
$:$Weekday
$:$thursday
$:$Weekday
$:$friday
$:$Weekday
$:$saturday
$:$MaybeNot
$:$sunday
$:$MaybeNot
$:$monday
$:$Weekday
$:$tuesday
$:$Weekday
Why does it replace lines whose first character is lowercase?
This is a "feature", according to a bug report, caused by unexpected character ordering in the locale.
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | sed -e 's/[A-Z]/./g'
..........................a.........................
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | sed -e 's/[a-z]/./g'
.........................Z..........................
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | LC_ALL=C sed -e 's/[A-Z]/./g'
..........................abcdefghijklmnopqrstuvwxyz
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | LC_ALL=C sed -e 's/[a-z]/./g'
ABCDEFGHIJKLMNOPQRSTUVWXYZ..........................
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | sed -e 's/[[:upper:]]/./g'
..........................abcdefghijklmnopqrstuvwxyz
$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" | sed -e 's/[[:lower:]]/./g'
ABCDEFGHIJKLMNOPQRSTUVWXYZ..........................
$ sed --version
GNU sed version 4.2.1
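Applied to the original command, the character-class form from the experiments above would look roughly like this (mark_capitalized is a made-up name, and the two sample lines are taken from the question's output):

```shell
# Hypothetical helper: prefix lines that start with an uppercase letter.
# [[:upper:]] does not depend on how the locale orders A-Z and a-z.
mark_capitalized() {
  sed -e 's/^[[:upper:]]/$:$&/'
}

printf '%s\n' wednesday Weekday | mark_capitalized
```

Running the original [A-Z] version under LC_ALL=C should give the same result for ASCII input.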

Curly braces in awk reg exp

I am trying to match a fixed number of digits using curly braces in awk, but I get no result.
# This outputs nothing
echo "123" | awk '/^[0-9]{3}$/ {print $1;}'
# This outputs 123
echo "123" | awk '/^[0-9]+$/ {print $1;}'
Do I need to do something specific to use curly braces?
Mac OS X awk (BSD awk) works with the first command shown:
$ echo "123" | /usr/bin/awk '/^[0-9]{3}$/ {print $1;}'
123
$
GNU awk does not. Adding backslashes doesn't help GNU awk. Using option --re-interval does, and so does using --posix.
$ echo "123" | /usr/gnu/bin/awk --re-interval '/^[0-9]{3}$/ {print $1;}'
123
$ echo "123" | /usr/gnu/bin/awk --posix '/^[0-9]{3}$/ {print $1;}'
123
$
(I'm not sure where mawk 1.3.3 dated 1996 comes from, but it is probably time to get an updated version of awk for your machine.)
The awk on Ubuntu 20.04.4 LTS is up to date (released in 2020), but it is mawk.
As Ed Morton stated in a comment above, "mawk is a minimal functionality awk, optimized for speed of execution,...".
Seems those optimizations were at the expense of functionality.
SOLUTION
Install GNU awk (gawk):
$ sudo apt install gawk -y
$ awk -W version
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.
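If installing gawk is not an option, the interval can be avoided altogether by spelling out the repetition, which should work in any awk, including old mawk builds. A small sketch (only_three_digits is a made-up name):

```shell
# Hypothetical helper: print lines that consist of exactly three digits.
# The bracket expression is repeated instead of using {3}, so no interval
# support is needed.
only_three_digits() {
  awk '/^[0-9][0-9][0-9]$/ { print $1 }'
}

printf '%s\n' 123 12 1234 abc | only_three_digits
```

For longer counts, a check such as length($0) == 8 combined with /^[0-9]+$/ keeps the pattern readable.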