Cryptic sed command syntax confusion - regex

Can someone explain, how this sed command works here?
pkg info | sed -e 's/\([^.]*\).*/\1/' -e 's/\(.*\)-.*/\1/'
This command removes version numbers from packages and prints into stdout like this
yajl-2.1.0 Portable JSON parsing and serialization library in ANSI C
youtube_dl-2018.12.03 Program for downloading videos from YouTube.com
zathura-0.4.1 Customizable lightweight pdf viewer
zathura-pdf-poppler-0.2.9_1 Poppler render PDF plugin for Zathura PDF viewer
zip-3.0_1 Create/update ZIP files compatible with PKZIP
zsh-5.6.2 The Z shell
and turns into this
yajl
youtube_dl
zathura
zathura-pdf-poppler
zip
zsh
But I am having a hard time understanding the parts \([^.]*\).* and \(.*\)-.*. I understand the \, -e, and s parts, but those wildcards seem very cryptic here.

In your regex \([^.]*\).*, \( opens a capturing group. Inside it, [^.] matches any character except a literal dot and * means zero or more of them, so [^.]* captures everything up to (but not including) the first dot. \) closes the group, and the trailing .* matches whatever remains after group 1.
The second regex, \(.*\)-.*, works similarly: \(.*\) greedily captures as much as possible, so the group ends just before the last hyphen; then - matches that hyphen and the final .* matches the remaining text.
To explain with an example, let's take youtube_dl-2018.12.03.
Here, \([^.]*\) captures everything up to the first dot, so it captures youtube_dl-2018, and the remaining .* matches .12.03 together with the rest of the line. The whole match is replaced by \1, so youtube_dl-2018 is passed on to the next expression, -e 's/\(.*\)-.*/\1/'.
In the second regex, \(.*\)-.*, the group \(.*\) captures youtube_dl because that is the text before the last hyphen, and -.* matches the remainder, -2018. Since the match is again replaced by \1, the final text is youtube_dl.
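The two steps can be reproduced in isolation. This is a minimal sketch that feeds one of the sample lines through each expression separately; the variable names line, step1, and step2 are illustrative only.

```shell
# One sample line from the question's pkg info output.
line='youtube_dl-2018.12.03 Program for downloading videos from YouTube.com'

# First expression: keep everything before the first dot.
step1=$(printf '%s\n' "$line" | sed -e 's/\([^.]*\).*/\1/')

# Second expression: keep everything before the last remaining hyphen.
step2=$(printf '%s\n' "$step1" | sed -e 's/\(.*\)-.*/\1/')

printf '%s\n' "$step1" "$step2"
```

The first printed line is the intermediate result and the second is the final package name.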
Looking at your data, I believe you can also simplify your command, as the first expression seems redundant. Try the following command and see whether it outputs the same result:
pkg info | sed -e 's/\(.*\)-.*/\1/'
This simplified command is only safe because none of your lines contains a . before a later - (for example, a hyphen in the description after the version number); otherwise, keep your own command with its two sed expressions.
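To see when the first expression matters, here is a small sketch using a hypothetical line (not from the question's data) whose description contains a hyphen after the version number. The package name foo and the variable names are assumptions for illustration.

```shell
# Hypothetical pkg info line with a hyphen in the description.
line='foo-1.2 A cross-platform tool'

# Original two-expression command: cut at the first dot, then at the last hyphen.
two_step=$(printf '%s\n' "$line" | sed -e 's/\([^.]*\).*/\1/' -e 's/\(.*\)-.*/\1/')

# Simplified command: cut at the last hyphen of the whole line.
one_step=$(printf '%s\n' "$line" | sed -e 's/\(.*\)-.*/\1/')

printf 'two expressions: %s\n' "$two_step"
printf 'one expression:  %s\n' "$one_step"
```

With this input the two-expression form yields the bare package name, while the single expression cuts at the hyphen inside the description and leaves the version in place.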
On another note, if you use -r (GNU sed) or -E (OS X) for extended regex, you don't need to escape the parentheses. Each expression still needs its own -e, so the command becomes:
pkg info | sed -r -e 's/([^.]*).*/\1/' -e 's/(.*)-.*/\1/'

It is a roundabout way of saying:
Remove everything from the first dot onward, then everything from the last remaining hyphen onward.
In each step, the part before the delimiter is matched and remembered.
Alternatives:
# Incorrect: these cut at the first hyphen, not the last:
# pkg info | sed 's/[-.].*//'
# pkg info | cut -d "-" -f1 | cut -d"." -f1
# pkg info | awk -F "-|[.]" '{print $1}'
# The dot is not needed when you remove the substring starting with the last hyphen
pkg info | sed 's/-[^-]*$//'
pkg info | rev | cut -d"-" -f2- | rev
pkg info | awk -F "[.]" '{print $1}' | awk -F "[-]" -vOFS='-' 'NF>1 { NF--;print;}'

Silly invisible-text GNU grep method that works on the console,
but which would fail if sent to a file or piped to a filter:
pkg info | GREP_COLORS='ms=30;30;30' grep '\-[^-]*\s.*$'
How it works: grep is used to find the last hyphen before a
space, and everything after that (i.e. everything we don't want
to see), which grep shows in the highlight colors defined in the
GREP_COLORS environment variable. Since the highlight color
30;30;30 is a black font (on a black background), the unwanted
text is invisible. This relies on grep coloring its output, i.e.
running with --color (often enabled through a shell alias).
If the terminal background is already black, GREP_COLORS='ms=30'
would be sufficient.
sed method based on not printing the grep regex:
pkg info | sed 's#\(^.*\)\(-[^-]*[[:space:]].*$\)#\1#'
...this method can be sent to pipes and filters. Shorter version using GNU sed:
pkg info | sed 's#\(^.*\)\(-.*\s.*\)#\1#'

Related

How to extract jira ticket number with sed?

I want to extract Jira ticket number from the branch name with sed.
This is what I have
echo "PTW-123-branch-name" | sed 's/.*\([A-Z]+-[0-9]+[^-]\).*/\1/'
expected result: PTW-123
What is wrong with the regexp?

You may use this sed:
echo "PTW-123-branch-name" | sed 's/\([0-9]\)-.*$/\1/'
PTW-123
Details:
\([0-9]\)-: Matches a digit and captures it in group #1 followed by hyphen
.*$: Match remaining string until end
\1: Is replacement that puts captured digit back in output
Alternatively you can use cut also:
echo "PTW-123-branch-name" | cut -d- -f1,2
PTW-123
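As for what is wrong with the original attempt: sed uses basic regular expressions by default, where an unescaped + is a literal plus sign rather than a quantifier, and the leading .* is greedy. The following is a hedged sketch of one possible fix, assuming a sed that accepts -E (both GNU and BSD sed do) and assuming the ticket key sits at the start of the branch name; the variable names are illustrative.

```shell
branch='PTW-123-branch-name'

# -E enables extended regexes, so + means "one or more".
# Anchoring at ^ avoids a greedy leading .* swallowing part of the key.
ticket=$(printf '%s\n' "$branch" | sed -E 's/^([A-Z]+-[0-9]+).*/\1/')

printf '%s\n' "$ticket"
```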

If you are OK with GNU grep, try the following. The output of echo is passed to grep on standard input. The -o option prints only the matched portion and -P enables PCRE. The pattern uses a non-greedy match from the start of the line up to the first run of digits that is followed by -, and grep prints that match if one is found.
echo "PTW-123-branch-name" | grep -oP '^.*?\d+(?=-)'

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific URL in a file, for example "cnn.com", but I also have a bunch of localhost addresses and 0.0.0.0 entries, and both have some with a single space in front. I also don't want to chop out subdomains like ads.cnn.com, so that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to whack out the cnn.com! My file is called raw.txt
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help on getting my multiple option command kicking with SED.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2

The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says: a line that starts with any number of spaces (^[ ]*) OR (|) simply starts (^), followed by your match for 127.0.0.1. Since [ ]* also matches zero spaces, ^[ ]* alone would be enough.
To string these together, you can use the | OR operator inside parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
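A quick way to check the combined pattern is to run it on a few made-up lines. This sketch assumes GNU sed (for \b) and uses -E, which GNU sed accepts as an equivalent of -r; the input lines are a hypothetical stand-in for raw.txt.

```shell
# Hypothetical stand-in for raw.txt: two address lines, the bare domain,
# and two lines that must survive.
kept=$(printf '%s\n' \
  '127.0.0.1 localhost' \
  ' 0.0.0.0 example.invalid' \
  'cnn.com' \
  'ads.cnn.com' \
  'xcnn.com' |
  sed -E '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d')

printf '%s\n' "$kept"
```

Only ads.cnn.com and xcnn.com remain, because the pattern is anchored to the start of the line (allowing leading spaces).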

sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to try to get sed to act as if it does, see Is it possible to escape regex metacharacters reliably with sed. To remove a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words, then the approach is to list them all, create a hash table indexed by them, and test whether the first word of your line is an index of that table. Note that split() stores the words as array values, so they have to be copied into the indices first:
awk 'BEGIN{split("foo bar other word",tmp); for (i in tmp) badWords[tmp[i]]} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
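The hash-table approach can be sketched on a few made-up lines. The sample input and the names tmp, badWords, and filtered are assumptions for illustration. The point to note is that awk's `in` operator tests array indices, so the words produced by split() are first copied into the indices of badWords.

```shell
filtered=$(printf '%s\n' 'foo 1' 'keep foo' 'bar 2' 'baz 3' 'word up' |
  awk 'BEGIN {
         # split() fills tmp[1..n] with the words as values.
         n = split("foo bar other word", tmp)
         # Referencing badWords[word] creates that index.
         for (i = 1; i <= n; i++) badWords[tmp[i]]
       }
       # Print only lines whose first field is not a listed word.
       !($1 in badWords)')

printf '%s\n' "$filtered"
```

The line "keep foo" is kept because only the first field is compared, as a plain string rather than a regex.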

How to use sed to grab regular expression

I'd like to grab the digits in a string like so :
"sample_2341-43-11.txt" to 2341-43-11
And so I tried the following command:
echo "sample_2341-43-11.txt" | sed -n -r 's|[0-9]{4}\-[0-9]{2}\-[0-9]{2}|\1|p'
I saw this answer, which is where I got the idea.
Use sed to grab a string, but it doesn't work on my machine:
it gives an error "illegal option -r".
it doesn't like the \1, either.
I'm using sed on MacOSX yosemite.
Is this the easiest way to extract that information from the file name?

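You need to set up your grouping and match the rest of the line so that it is replaced by the group. Also, the - does not need to be escaped. The -n option suppresses sed's automatic printing, so together with the p flag only lines where the substitution succeeded are printed; it can be dropped here. On Mac OS X, use -E instead of -r, which is what triggers the "illegal option" error (GNU sed accepts -E as well).
echo "sample_2341-43-11.txt" | sed -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/'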
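The effect of combining -n with the p flag can be checked with a short sketch. It assumes a sed that accepts -E; the second file name and the variable names are made up for illustration.

```shell
# The substitution keeps only the date-like group; p prints on success.
pattern='s/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/p'

# Matching input: the captured group is printed.
date_part=$(printf '%s\n' 'sample_2341-43-11.txt' | sed -E -n "$pattern")

# Non-matching input: -n suppresses the line, so nothing is printed.
no_match=$(printf '%s\n' 'sample_without_date.txt' | sed -E -n "$pattern")

printf 'match: [%s]\nno match: [%s]\n' "$date_part" "$no_match"
```

Without -n and p, the non-matching line would be echoed unchanged, which is usually not wanted when extracting a value.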

The -r option is not supported by the Mac version of sed (it uses -E for extended regular expressions).
You can use grep instead:
echo "sample_2341-43-11.txt" | grep -Eo "[0-9]+(-[0-9]+)+"
OUTPUT
2341-43-11

echo "one1sample_2341-43-11.txt" \
| sed 's/[^[:digit:]-]\{1,\}/ /g;s/ \{1,\}/ /g;s/^ //;s/ $//'
1 2341-43-11
This extracts all digit sequences, including any - characters between them (so it would also let something like --12 through, but that can easily be handled).
It is POSIX compliant.
If the line contains several numbers, they are all printed on the same line, separated by a space character (this could be changed to a newline if wanted).

You can also try it this way:
sed 's/[^_]\+_\([^.]\+\).*/\1/' <<< sample_2341-43-11.txt
Output:
2341-43-11
Explanation:
[^_]\+_ - Match the content up to and including _ (sample_)
\([^.]\+\) - Match the content until . and capture it (2341-43-11)
.* - Discard the remaining characters (.txt)

You can go with what the poster above said. Alternatively, the
pattern "\d+-\d+-\d+" would match what you are looking for in a tool that supports \d (sed does not). See the demo here:
https://regex101.com/r/kO2cZ1/3

Extract multiple occurrences on the same line using sed/regex

I am trying to loop through each line in a file and find and extract the names that start with ${ and end with }. So as the final output I am expecting only SOLDIR and TEMP (from inputfile.sh).
I have tried using the following script but it seems it matches and extracts only the second occurrence of the pattern TEMP. I also tried adding g at the end but it doesn't help. Could anybody please let me know how to match and extract both/multiple occurrences on the same line ?
inputfile.sh:
.
.
SOLPORT=\`grep -A 4 '\[LocalDB\]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`
.
.
script.sh:
infile='inputfile.sh'
while read line ; do
echo $line | sed 's%.*${\([^}]*\)}.*%\1%g'
done < "$infile"

May I propose a grep solution?
grep -oP '(?<=\${).*?(?=})'
It uses Perl-style lookaround assertions and lazily matches anything between '${' and '}'.
Feeding your line to it, I get
$ echo "SOLPORT=\`grep -A 4 '[LocalDB]' \${SOLDIR}/solidhac.ini | grep \${TEMP} | awk '{print $2}'\`" | grep -oP '(?<=\${).*?(?=})'
SOLDIR
TEMP
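Where grep -P is not available, a similar result can be had without lookarounds. This is a different technique from the one above, shown only as a sketch: grep -o (widely available, though not required by POSIX) prints each ${...} occurrence on its own line, and sed then strips the delimiters. The sample line is a simplified, made-up variant of the input.

```shell
# Single quotes keep ${TEMP} and ${SOLDIR} literal.
line='SOLPORT=$(grep ${TEMP} ${SOLDIR}/solidhac.ini)'

vars=$(printf '%s\n' "$line" |
  grep -o '\${[^}]*}' |              # one ${NAME} per output line
  sed -e 's/^\${//' -e 's/}$//')     # drop the leading ${ and trailing }

printf '%s\n' "$vars"
```

The names are printed in the order in which they appear on the line.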

This might work for you (but maybe only for your specific input line):
sed 's/[^$]*\(${[^}]\+}\)[^$]*/\1\t/g;s/$[^{$]\+//g'

Extracting multiple matches from a single line using sed isn't as bad as I thought it'd be, but it's still fairly esoteric and difficult to read:
$ echo 'Hello ${var1}, how is your ${var2}' | sed -En '
# Replace ${PREFIX}${TARGET}${SUFFIX} with ${PREFIX}\a${TARGET}\n${SUFFIX}
s#\$\{([^}]+)\}#\a\1\n#
# Continue to next line if no matches.
/\n/!b
# Remove the prefix.
s#.*\a##
# Print up to the first newline.
P
# Delete up to the first newline and reprocess what's left of the line.
D
'
var1
var2
And all on one line:
sed -En 's#\$\{([^}]+)\}#\a\1\n#;/\n/!b;s#.*\a##;P;D'
Since POSIX extended regexes don't support non-greedy quantifiers or putting a newline escape in a bracket expression, I've used a BEL character (\a) as a sentinel at the end of the prefix instead of a newline. A newline could be used, but then the second substitution would have to be the questionable s#.*\n(.*\n.*)#\1#, which might involve a pathological amount of backtracking by the regex engine.

Best way to complete this Perl regex one-liner

I'm trying to use a Perl one-liner to munge some output from grepping svn diff, so I can automatically test the files. We have a run_test.sh script that can take multiple PHP files prepended with 'Test' as its arguments.
So far I have the following which successfully prepends 'Test' to the file names
[gjempty#gjempty-rhel4 classes]$ svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/'
--- commerce/TestLCart.php (revision 104387)
--- commerce/manufacturing/TestLRoutingData.php (revision 104387)
Now I'd just like to grab the file/path to pass it to our run_test.sh. I can finish it off with awk as below, but am trying to improve my Perl/one-liner skills. So how do I revise the perl one-liner to additionally extract only the file path?
svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/' | awk '{print $2}' | xargs run_test.sh

You're just wanting the file names, so svn st is what you want. Instead of getting large quantities of noise which could potentially contain (revision in it, and the main lines you want, you'll get it like this: M commerce/LCart.php. Then you can just chop off \S* (any number of non-whitespace characters) followed by \s* (any number of whitespace characters), and take what's left. You could do the \S*\s* differently, but that's the simplest way to get all cases.
svn st | perl -wpl -e 's|\S*\s*(.*)/(.*)$|$1/Test$2|'
(Switched it after posting from using s/// to s||| so the / doesn't need to be escaped; good idea, Axeman.)

You can get rid of the grep and the awk fairly easily.
svn diff | perl -wnl -e '/\(revision/ or next; m|(\S+)/(\S+)|; print "$1/Test$2";'
I changed the -p to -n. -p means while (<>) { <your code>; print $_; }, and -n is the same but without the print, since the new version has an explicit print instead.
Rather than an s/// substitution, I used an m// pattern match. I changed the delimiter to | to avoid backslashing the slash (a cause of Leaning Toothpick Syndrome). You can use almost any punctuation character you want.
\S is similar to . but matches only non-whitespace characters. Your .*s in the pattern were actually matching the entire chunks of the line before and after the slash, but the new pattern only matches the pathname of the file. Since the + is "greedy", the first one ($1) will get more string when there are multiple slashes in the pathname, the same as with your substitution pattern.

Better version:
No default print ( -n)
Extract substring first
Subst on that
print value
perl -wnl -e '($_)=m{---\s+(\S+)} and s|/([^/]+)$|/Test$1| and print "$_\n";'
You don't need awk now. And adding '(revision to the expression,
perl -wnl -e '($_)=m{---\s+(\S+)\s+\(revision} and s|/([^/]+)$|/Test$1| and print "$_\n";'
you don't need grep either.
But I have several subversion tools I created, and if all you want are the changed files 'svn st' is better.
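svn st | perl -wnle 'm/^[CM]\s+(\S+)/ and $r=rindex($1,"/")+1 and print substr($1,0,$r),"Test",substr($1,$r),"\n"'
This time I chose a rindex + substr method: $r is the position just after the last slash, so the path is split there and Test is inserted in between. Now there's no regex backtracking.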
