How to use sed to grab regular expression - regex

I'd like to grab the digits in a string like so :
"sample_2341-43-11.txt" to 2341-43-11
And so I tried the following command:
echo "sample_2341-43-11.txt" | sed -n -r 's|[0-9]{4}\-[0-9]{2}\-[0-9]{2}|\1|p'
I saw this answer, "Use sed to grab a string", which is where I got the idea, but it doesn't work on my machine:
it gives an error "illegal option -r".
it doesn't like the \1, either.
I'm using sed on MacOSX yosemite.
Is this the easiest way to extract that information from the file name?

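You need to capture the part you want in a group and match the rest of the line as well, so that the substitution replaces the whole line with just the group. Also, the - does not need to be escaped. The -n option suppresses sed's automatic printing, so with -n a line is printed only when the p flag fires; without -n the p flag is not needed here. On macOS, use -E instead of -r for extended regular expressions (GNU sed accepts -E too):
echo "sample_2341-43-11.txt" | sed -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/'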
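As a rough sketch of how -n and the p flag work together, the snippet below wraps the substitution in a small function. The name extract_date and the second file name are made up for illustration. With -n and p, a name that does not contain the pattern produces no output at all instead of being echoed back unchanged.

```shell
# extract_date is a hypothetical helper name, not part of the original answer.
extract_date() {
  # -n: no automatic printing; the trailing p prints only lines where the
  # substitution succeeded. -E enables extended regular expressions.
  printf '%s\n' "$1" |
    sed -n -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/p'
}

hit=$(extract_date "sample_2341-43-11.txt")   # name that contains the pattern
miss=$(extract_date "notes.txt")              # name that does not

echo "hit=[$hit] miss=[$miss]"
```

This makes it easy to test for "no match" in a script with something like [ -n "$hit" ].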

The -r option is not supported by the Mac (BSD) version of sed; extended regular expressions are enabled there with -E instead.
You can also use grep:
echo "sample_2341-43-11.txt" | grep -Eo '[0-9]+(-[0-9]+)+'
OUTPUT
2341-43-11

echo "one1sample_2341-43-11.txt" \
| sed 's/[^[:digit:]-]\{1,\}/ /g;s/ \{1,\}/ /g;s/^ //;s/ $//'
1 2341-43-11
This extracts every run of digits and hyphens (so a stray sequence such as --12 would also be kept, but that is easy to filter out).
It is POSIX compliant.
If the line contains several numbers, they all end up on the same line, separated by a single space (this could be changed to a newline if wanted).

You can also try it this way:
sed 's/[^_]\+_\([^.]\+\).*/\1/' <<< sample_2341-43-11.txt
Output:
2341-43-11
Explanation:
[^_]\+ - Match everything up to the _ (sample)
\([^.]\+\) - Match everything up to the . and capture it (2341-43-11)
.* - Match the remaining characters (.txt) so that they are discarded
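The same "everything after the first _ and before the first ." idea can also be sketched without sed, using plain shell parameter expansion. The variable names file and stem are made up for this example, and it assumes the file name really does contain an underscore and a dot.

```shell
file="sample_2341-43-11.txt"

stem=${file#*_}     # strip the shortest prefix ending in "_"  -> 2341-43-11.txt
stem=${stem%%.*}    # strip the longest suffix starting at "." -> 2341-43-11

echo "$stem"
```

Because no external process is started, this form is handy inside loops over many file names.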

You can go with what the poster above said. Well, making use of this
pattern "\d+-\d+-\d+" would match what you are looking for. See demo here
https://regex101.com/r/kO2cZ1/3

Related

“sed” command to remove a line that matches an exact string on first word

I've found an answer to my question here: "sed" command to remove a line that match an exact string on first word
...but only partially because that solution only works if I query pretty much exactly like the answer person answered.
They answered:
sed -i "/^maria\b/Id" file.txt
...to chop out only a line starting with the word "maria" in it and not maria if it's not the first word for example.
I want to chop out a specific URL from a file, for example "cnn.com", but I also have a bunch of localhost addresses and 0.0.0.0 entries, and some of both have a single space in front. I also don't want to chop out subdomains like ads.cnn.com. So that code "should" work, but it doesn't when I string in more commands with the -e option. My code below seems to clean things up well, except that I can't get it to remove cnn.com! My file is called raw.txt.
sed -r -e 's/^127.0.0.1//' -e 's/^ 127.0.0.1//' -e 's/^0.0.0.0//' -e 's/^ 0.0.0.0//' -e '/#/d' -e '/^cnn.com\b/d' -e '/::/d' raw.txt | sort | tr -d "[:blank:]" | awk '!seen[$0]++' | grep cnn.com
When I grep for cnn.com I see all the cnn's INCLUDING the one I don't want which is actually "cnn.com".
ads.cnn.com
cl.cnn.com
cnn.com <-- the one I don't want
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
If I just use that one piece of code with the cnn.com chop out it seems to work.
sed -r '/^cnn.com\b/d' raw.txt | grep cnn.com
* I'm not using the "-e" option
Result:
ads.cnn.com
cl.cnn.com
cnn.dyn.cnn.com
customad.cnn.com
gdyn.cnn.com
jfcnn.com
kermit.macnn.com
metrics.cnn.com
projectcnn.com
smetrics.cnn.com
tiads.sportsillustrated.cnn.com
trumpincnn.com
victory.cnn.com
xcnn.com
Nothing I do seems to work when I string commands together with the "-e" option. I need some help getting my multiple-option command working with sed.
Any advice?
Ubuntu 12 LTS & 16 LTS.
sed (GNU sed) 4.2.2
The . is a metacharacter in regex which means "match any one character". So you accidentally created a regex that will also catch cnnPcom or cnn com or cnn\com. While it probably works for your needs, it would be better to be more explicit:
sed -r '/^cnn\.com\b/d' raw.txt
The difference here is the \ backslash before the . period. That escapes the period metacharacter so it's treated as a literal period.
As for your lines that start with a space, you can catch those in a single regex (Again escaping the period metacharacter):
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d' raw.txt
This (^[ ]*|^) says a line that starts with any number of repeating spaces ^[ ]* OR | starts with ^ which is then followed by your match for 127.0.0.1.
And then, to string these together, you can use the | OR operator inside parentheses to catch all of your matches:
sed -r '/(^[ ]*|^)(127\.0\.0\.1|cnn\.com|0\.0\.0\.0)\b/d' raw.txt
Alternatively you can use a ; semicolon to separate out the different regexes:
sed -r '/(^[ ]*|^)127\.0\.0\.1\b/d; /(^[ ]*|^)cnn\.com\b/d; /(^[ ]*|^)0\.0\.0\.0\b/d;' raw.txt
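One possible explanation for the original pipeline, offered as an assumption because raw.txt is not shown: if the unwanted entry appears in the file as "0.0.0.0 cnn.com", then after the s commands strip the address the line still begins with a space, so /^cnn.com\b/d never matches. The sketch below uses invented sample lines, strips the address together with the surrounding spaces, and only then deletes the exact host name.

```shell
# The four input lines are invented sample data, not the real raw.txt.
# Command 1 strips a leading address plus surrounding spaces,
# command 2 drops comment lines, command 3 drops the bare cnn.com entry.
cleaned=$(printf '%s\n' \
    '0.0.0.0 cnn.com' \
    ' 127.0.0.1 ads.cnn.com' \
    '0.0.0.0 xcnn.com' \
    '# a comment' |
  sed -E \
    -e 's/^ *(127\.0\.0\.1|0\.0\.0\.0) *//' \
    -e '/#/d' \
    -e '/^cnn\.com$/d')

echo "$cleaned"
```

Anchoring with $ instead of \b keeps the expression portable, but it assumes there is no trailing whitespace after the host name.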
sed doesn't understand matching on strings, only regular expressions, and it's ridiculously difficult to get sed to act as if it does (see "Is it possible to escape regex metacharacters reliably with sed"). Removing a line whose first space-separated word is "foo" is just:
awk '$1 != "foo"' file
To remove lines that start with any of "foo" or "bar" is just:
awk '($1 != "foo") && ($1 != "bar")' file
If you have more than just a couple of words, the approach is to list them all, build a hash table indexed by the words themselves (split alone would index the array by 1, 2, 3, ...), and then test whether the first word of your line is an index of that table:
awk 'BEGIN{n=split("foo bar other word",tmp); for (i=1;i<=n;i++) badWords[tmp[i]]} !($1 in badWords)' file
If that's not what you want then edit your question to clarify your requirements and include concise, testable sample input and the expected output given that input.
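A small runnable sketch of the hash-table approach, using four invented input lines. The array names words and bad are arbitrary. Note that the comparison is an exact string match on the first field, so "food" is kept even though it starts with "foo", and "foo" as a second word does not trigger removal.

```shell
kept=$(printf '%s\n' 'foo one' 'bar two' 'food three' 'baz foo' |
  awk 'BEGIN {
         # split() fills words[1..n]; copy the values into the keys of bad[]
         n = split("foo bar", words)
         for (i = 1; i <= n; i++) bad[words[i]]
       }
       !($1 in bad)')   # print lines whose first field is not a key of bad[]

echo "$kept"
```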

Capture text between two tokens

I'm trying to get the text between two tokens.
For example, let's say the text is:
arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end
The output should be: CaptureThis
And the two tokens are: :start: and /end
The closest I could get was using this regex:
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"
VALUE=$(echo "${INPUT}" | sed -e 's/:start:\(.*\)\/end/\1/')
... but this returns most of the string: arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end
How do I get all of the other text out of the way?
You could use (GNU) grep with Perl regular expressions (look-arounds) and the -o option to only return the match:
$ grep -Po '(?<=:start:).*(?=/end)' <<< 'arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'
CaptureThis
Try this:
$ sed 's/^.*:start:\(.*\)\/end.*$/\1/' <<<'arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end'
CaptureThis
The problem with your approach was that you only replaced part of the input line, because your regex didn't capture the entire line.
Note how the command above anchors the regex both at the beginning of the line (^.*) and at the end (.*$) so as to ensure that the entire line is matched and thus replaced.
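To make the difference concrete, here is a short side-by-side sketch. The variable names partial and whole are made up for this example. Without the leading and trailing .* only the matched portion of the line is replaced, so the text in front of :start: survives.

```shell
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"

# Regex covers only ":start:CaptureThis/end"; the prefix is left in place.
partial=$(echo "$INPUT" | sed 's/:start:\(.*\)\/end/\1/')

# Regex covers the whole line, so the whole line is replaced by the group.
whole=$(echo "$INPUT" | sed 's/^.*:start:\(.*\)\/end.*$/\1/')

echo "partial: $partial"
echo "whole:   $whole"
```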
You could use :
VALUE=$(echo "${INPUT}" | sed -e 's/.*:start:\(.*\)\/end.*/\1/')
If the tokens are liable to change, you could use variables - but since "/end" has a "/", that could lead to sed getting confused, so you'd probably want to change its delimiter to some non-conflicting character (like a "?"), so :
TOKEN1=":start:"
TOKEN2="/end"
VALUE=$(echo "${INPUT}" | sed -e "s?.*$TOKEN1\(.*\)$TOKEN2.*?\1?")
There is no need for any external utilities, bash parameter-expansion will handle it all for you:
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"
token=${INPUT##*:}
echo ${token%/*}
Output
CaptureThis
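A variation on the expansion above that uses the two tokens themselves rather than the last ":" and the last "/". This is only a sketch; the variable name rest is invented, and TOKEN1 and TOKEN2 are borrowed from the earlier sed answer. Quoting the variables inside the expansion makes the shell treat them as literal strings, so no delimiter or metacharacter escaping is needed.

```shell
INPUT="arn:aws:dfasdfasdf/asdfa:start:CaptureThis/end"
TOKEN1=":start:"
TOKEN2="/end"

rest=${INPUT#*"$TOKEN1"}     # drop everything through the first TOKEN1
VALUE=${rest%%"$TOKEN2"*}    # drop everything from the first TOKEN2 onward

echo "$VALUE"
```

If INPUT does not contain TOKEN1 at all, the first expansion leaves the string unchanged, so a real script would probably want to check for that case.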

regex: not match a group rather than single characters

echo test.a.wav|sed 's/[^(.wav)]*//g'
.a.wav
What I want is to remove every character until it reaches the whole group .wav(that is, I want the result to be .wav), but it seems that sed would remove every character until it reaches any of the four characters. How to do the trick?
Groups do not work inside [], so the dot and the parentheses are simply members of the character class.
How about:
echo test.a.wav|sed 's/.*\(\.wav\)/\1/g'
Note, there may be other valid solutions, but you provide no context on what you are trying to do to determine what may be the best solution.
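The following sketch illustrates both points. The variable names are invented for the example. The first command shows that [^(.wav)] behaves like the plain character class [^().wav], and the other two show the capture-group version on the original input and on a longer, made-up file name.

```shell
# Same result as the command in the question: each of ( ) . w a v is
# just a class member, so only "test" is removed.
same_as_original=$(echo test.a.wav | sed 's/[^().wav]*//g')

# Greedy .* consumes everything before the last ".wav", which is kept via \1.
fixed=$(echo test.a.wav | sed 's/.*\(\.wav\)/\1/')
longer=$(echo test.a.wav.more | sed 's/.*\(\.wav\)/\1/')

echo "$same_as_original"
echo "$fixed"
echo "$longer"
```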
The feature you're requesting (negative lookahead) isn't supported by sed, but Perl does the trick.
$ echo 'test.a.wav' | perl -pe 's/^(?:(?!\.wav).)*//g'
.wav
Instead of regex, you can use awk like this:
echo test.a.wav.more | awk -F".wav" '{print FS$2}'
.wav.more
It splits the line on your pattern, then prints the pattern followed by the rest of the line.
This might work for you (GNU sed):
sed ':a;/^\.wav/!s/.//;ta;/./!d' file
or:
sed 's/\.wav/\n&/;s/^[^\n]*\n//;/./!d' file
N.B. This deletes the line if it is empty. If this is not wanted just remove /./!d from the above commands.

regex to convert www.evernote.com URL to use evernote protocol

I'm writing a simple script that will take URLs pointing to Evernote notes online, and convert them to the evernote:/// protocol. The regex I'm using matches and modifies the URL correctly when I try it out in a regex tester (I'm using Patterns for OS X). However, when I use it with sed, it just returns the original string.
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#https?:_/_/www_.evernote_.com_/shard_/(..)_/nl_/(......)_/(.+_/)#evernote:_/_/_/view_/$2_/$1_/$3$3#'
Any idea why this isn't working? Thanks!
[Edit: In case anyone's interested, this was for the AppleScript bit of a Keyboard Maestro macro:
set theURL to the clipboard
set ENcode to "echo \"" & theURL & "\" | sed -E 's#https?://www.evernote.com/shard/(..)/nl/(.*)/(.+/)#evernote:///view/\\2/\\1/\\3\\3#' | pbcopy"
do shell script ENcode
Thanks to @DreadPirateShawn for helping me fix the regex.
]
Using the extended regex flag -E, removing the underscores, and replacing each $1 pattern with \1 yields a functional regex here:
$ echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed -E 's#https?://www\.evernote\.com/shard/(..)/nl/(......)/(.+/)#evernote:///view/\2/\1/\3\3#'
evernote:///view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359/1875e55a-e512-4cf9-9b18-9e93c6a27359/
(Confirmed on Ubuntu 12.04 and OS X.)
If you don't use -E, then you also need to change s? to [s]* (basic regular expressions have no portable ? operator) and escape the grouping parentheses:
$ echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#http[s]*://www\.evernote\.com/shard/\(.*\)/nl/\(.*\)/\(.*/\)#evernote:///view/\2/\1/\3\3#'
evernote:///view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359/1875e55a-e512-4cf9-9b18-9e93c6a27359/
In the latter example, I also replaced each (....)-type sequence with (.*) -- unless you're absolutely positive of the length of each sequence (and even then perhaps), the (.*) approach will be a bit more flexible.
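A middle ground between fixed-width (....) groups and fully greedy (.*) groups is to match "anything except a slash" for each path segment. The sketch below wraps that in a function; the name to_evernote is hypothetical, and the output layout simply mirrors the result shown above.

```shell
# to_evernote is a made-up helper name for illustration.
to_evernote() {
  # [^/]+ matches exactly one path segment, so each group cannot
  # spill over into its neighbour the way .* can.
  printf '%s\n' "$1" |
    sed -E 's#^https?://www\.evernote\.com/shard/([^/]+)/nl/([^/]+)/([^/]+)/?$#evernote:///view/\2/\1/\3/\3/#'
}

converted=$(to_evernote "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/")
echo "$converted"
```

A URL that does not match the expected shape is passed through unchanged, which may or may not be what a calling script wants.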
I think you're trying this:
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed -re 's#https://www.evernote.com/shard/(..)/nl/(......)/(.+)/#evernote://view/\2/\1/\3#'
evernote://view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359
Making no use of Extended regex:
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#https://www.evernote.com/shard/\(..\)/nl/\(......\)/\(.\+\)/#evernote://view/\2/\1/\3#'
evernote://view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359

put regular expression in variable

output=`grep -R -l "${images}" *`
new_output=`regex "slide[0-9]" $output`
Basically $output is a string like this:
slides/_rels/slide9.xml.rels
The number in $output will change. I want to grab "slide9" and put that in a variable. I was hoping new_output would do that but I get a command not found for using regex. Any other options? I'm using a bash shell script.
Well, regex is not a program like grep. ;)
But you can use
grep -Eo "(slide[0-9]+)"
as a simple approach. -o means: show only the matching part, -E means: extended regex (allows more sophisticated patterns).
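Put together with command substitution, a minimal sketch might look like this. It uses the sample path from the question as a stand-in for the real grep -R -l result.

```shell
# Stand-in for the real output of: grep -R -l "${images}" *
output="slides/_rels/slide9.xml.rels"

# -o prints only the matched text; "slides" is skipped because no digit follows.
new_output=$(printf '%s\n' "$output" | grep -Eo 'slide[0-9]+')

echo "$new_output"
```

If $output contains several file names, grep -o prints one match per line, so new_output would then hold several lines.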
Reading "I want to grab "slide9" and put that in a variable", I assume you want only what matches your regexp to be put in $new_output. If so, then you can change that to:
new_output=`egrep -R -l "${images}" * | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
Note that no setting of output= is required (unless you use it for something else).
If you need $output elsewhere, then use this instead:
output=`grep -R -l "${images}" *`
new_output=`echo "${output}" | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
sed's s/// command is similar to Perl's s/// command and has an equivalent in most languages.
Here I'm matching zero or more characters (.*) before and after the slide[0-9][0-9]* part and capturing that part with \( ... \) so it can be back-referenced. The parentheses are escaped because sed uses basic regular expressions by default; in that syntax an unescaped + is a literal plus sign, which is why "one or more digits" is written as [0-9][0-9]*. We then replace the whole match (i.e. the whole line) with \1, which expands to the first captured group, in this case your slide-number match.
In these situations using awk is better :
output="`grep -R -l "main" codes`"
echo $output
tout=`echo $output | awk -F. '{for(i=1;i<=NF;i++){if(index($i,"/")>0){n=split($i,ar,"/");print ar[n];}}}'`
echo $tout
This prints the filename without the extension. If you want to grab only slide9, then use the solutions provided by others.
Sample output :
A@A-laptop ~ $ bash try.sh
codes/quicksort_iterative.cpp codes/graham_scan.cpp codes/a.out
quicksort_iterative graham_scan a