regex to convert www.evernote.com URL to use evernote protocol - regex

I'm writing a simple script that will take URLs pointing to Evernote notes online, and convert them to the evernote:/// protocol. The regex I'm using matches and modifies the URL correctly when I try it out in a regex tester (I'm using Patterns for OS X). However, when I use it with sed, it just returns the original string.
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#https?:_/_/www_.evernote_.com_/shard_/(..)_/nl_/(......)_/(.+_/)#evernote:_/_/_/view_/$2_/$1_/$3$3#'
Any idea why this isn't working? Thanks!
fort
[Edit: In case anyone's interested, this was for the AppleScript bit of a Keyboard Maestro macro:
set theURL to the clipboard
set ENcode to "echo \"" & theURL & "\" | sed -E 's#https?://www.evernote.com/shard/(..)/nl/(.*)/(.+/)#evernote:///view/\\2/\\1/\\3\\3#' | pbcopy"
do shell script ENcode
Thanks to @DreadPirateShawn for helping me fix the regex.
]

Using the extended regex flag -E, removing the underscores, and replacing each $1 pattern with \1 yields a functional regex here:
$ echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed -E 's#https?://www\.evernote\.com/shard/(..)/nl/(......)/(.+/)#evernote:///view/\2/\1/\3\3#'
evernote:///view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359/1875e55a-e512-4cf9-9b18-9e93c6a27359/
(Confirmed on Ubuntu 12.04 and OS X.)
If you don't use -E, then you also need to change s? to [s]* (a basic regular expression treats ? as an ordinary character) and escape the grouping parentheses:
$ echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#http[s]*://www\.evernote\.com/shard/\(.*\)/nl/\(.*\)/\(.*/\)#evernote:///view/\2/\1/\3\3#'
evernote:///view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359/1875e55a-e512-4cf9-9b18-9e93c6a27359/
In the latter example, I also replaced each (....)-type sequence with (.*) -- unless you're absolutely positive of the length of each sequence (and even then perhaps), the (.*) approach will be a bit more flexible.
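As a minimal sketch, the -E form above can be wrapped in a function for reuse. The name to_evernote_url is hypothetical and not part of the original script. It matches each path segment with [^/]+ instead of a fixed width or .*, on the assumption that no segment contains a slash:

```shell
# Hypothetical helper: convert an Evernote web URL to the evernote:/// form.
# Assumes the URL layout shown in the question (shard/<shard>/nl/<user>/<guid>/).
to_evernote_url() {
  printf '%s\n' "$1" |
    sed -E 's#^https?://www\.evernote\.com/shard/([^/]+)/nl/([^/]+)/([^/]+/).*$#evernote:///view/\2/\1/\3\3#'
}

to_evernote_url "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/"
```

Because [^/]+ cannot cross a slash, each group is pinned to exactly one path segment, which avoids the greedy matching surprises that .* can cause on longer URLs.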

I think you're trying this:
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed -re 's#https://www.evernote.com/shard/(..)/nl/(......)/(.+)/#evernote://view/\2/\1/\3#'
evernote://view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359
Making no use of Extended regex:
echo "https://www.evernote.com/shard/s2/nl/227468/1875e55a-e512-4cf9-9b18-9e93c6a27359/" | sed 's#https://www.evernote.com/shard/\(..\)/nl/\(......\)/\(.\+\)/#evernote://view/\2/\1/\3#'
evernote://view/227468/s2/1875e55a-e512-4cf9-9b18-9e93c6a27359

Related

sed - print translated HEX using capture group

I would like to print directly with sed a HEX value translation by isolating the HEX values in capture groups. This works:
echo bbb3Accc | sed -n 's/3A/\x3A/p'
bbb:ccc
...but this doesn't work:
echo bbb3Accc | sed 's/\(3A\)/\x\1/'
bbbx3Accc
...or an actual capture group REGEX matching based on URL encoded strings:
echo bbb%3Accc | sed 's/%\([A-Za-z0-9]\)/\x\1/'
bbbx3Accc
Apparently sed no longer interprets and translates the HEX value if it is constructed from a REGEX capture group, together with the \x escape.
But I am wondering if there's a workaround that I am not aware of, to make this work only with sed. Note that I am aware that I can do a bash command substitution and wrap the sed syntax in a echo -e but I would like to avoid that.
Your question isn't clear but maybe this is what you're trying to do using GNU awk for multi-char RS, RT, and strtonum():
$ echo 'bbb%3Accc%21ddd' |
gawk -v RS='%[[:xdigit:]]{2}' 'sub(/%/,"0x",RT){RT=sprintf("%c",strtonum(RT))} {ORS=RT} 1'
bbb:ccc!ddd
As mentioned in the comments, \xAB is interpreted by sed's parser, rather than as an expression, so \x won't work in the way you were trying.
sed is pretty primitive and your example is beyond what it is intended for, so you'd be better off using something more general purpose. For example, in Perl:
$ echo bbb3Accc | perl -ple 's/([0-9A-F]{2})/chr(hex($1))/ge'
bbb:ccc
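If neither gawk nor Perl is available, one possible workaround is to do the decoding in the shell itself. This is not a sed-only solution. The sketch below assumes a POSIX shell whose printf accepts 0x-prefixed numbers and octal escapes in the format string; urldecode and its variable names are hypothetical:

```shell
# Hypothetical helper: decode %XX sequences using only POSIX shell features.
# Variables are global because POSIX sh has no "local".
# Limitation: a decoded newline (%0A) is lost, because command substitution
# strips trailing newlines.
urldecode() {
  ud_rest=$1
  ud_out=
  while [ -n "$ud_rest" ]; do
    case $ud_rest in
      %[0-9A-Fa-f][0-9A-Fa-f]*)
        # Take the two hex digits that follow the percent sign.
        ud_hex=${ud_rest#%}
        ud_hex=${ud_hex%"${ud_hex#??}"}
        # Convert hex to a 3-digit octal escape and let printf emit the byte.
        ud_out=$ud_out$(printf "\\$(printf '%03o' "0x$ud_hex")")
        ud_rest=${ud_rest#%??}
        ;;
      *)
        # Copy one ordinary character unchanged.
        ud_out=$ud_out${ud_rest%"${ud_rest#?}"}
        ud_rest=${ud_rest#?}
        ;;
    esac
  done
  printf '%s\n' "$ud_out"
}

urldecode 'bbb%3Accc%21ddd'
```

The conversion happens at run time inside printf, which is why it works where the \x\1 attempt in sed does not: sed resolves \x escapes when it parses the script, before any capture group has a value.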

How to use sed to grab regular expression

I'd like to grab the digits in a string like so :
"sample_2341-43-11.txt" to 2341-43-11
And so I tried the following command:
echo "sample_2341-43-11.txt" | sed -n -r 's|[0-9]{4}\-[0-9]{2}\-[0-9]{2}|\1|p'
I saw this answer, which is where I got the idea.
Use sed to grab a string, but it doesn't work on my machine:
it gives an error "illegal option -r".
it doesn't like the \1, either.
I'm using sed on MacOSX yosemite.
Is this the easiest way to extract that information from the file name?
You need to define a capture group for \1 to refer to, and match the rest of the line so that the substitution removes it. Also, the - does not need to be escaped. The -n option suppresses sed's automatic printing of each line, so only lines printed explicitly (for example by the p flag) appear in the output.
echo "sample_2341-43-11.txt" | sed -r 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/'
The -r flag is a GNU extension; the sed shipped with OS X Yosemite rejects it, and the equivalent option there is -E.
You can use grep instead:
echo "sample_2341-43-11.txt" | grep -Eo "[0-9]+(-[0-9]+)+"
OUTPUT
2341-43-11
echo "one1sample_2341-43-11.txt" \
| sed 's/[^[:digit:]-]\{1,\}/ /g;s/ \{1,\}/ /g;s/^ //;s/ $//'
1 2341-43-11
This extracts every run of digits and hyphens (so it would also accept something like --12, which can easily be handled if needed).
It is POSIX compliant.
All numbers found on a line are printed on the same line, separated by a space character (this could be changed to a newline if wanted).
You can also try this (\+ is a GNU sed extension; use \{1,\} with a POSIX sed):
sed 's/[^_]\+_\([^.]\+\).*/\1/' <<< sample_2341-43-11.txt
Output:
2341-43-11
Explanation:
[^_]\+ - Match the content until _ (sample_)
\([^.]\+\) - Match the content until . and capture it (2341-43-11)
.* - Discard the remaining characters (.txt)
You can go with what the poster above said. Alternatively, the pattern "\d+-\d+-\d+" would match what you are looking for in a PCRE-style engine (sed does not treat \d as a digit class). See the demo here:
https://regex101.com/r/kO2cZ1/3
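Pulling the points from the answers above together, here is a sketch that should run on both GNU sed and the BSD sed on OS X, since both accept -E. The function name extract_date is hypothetical. It uses a capture group so that \1 has something to refer to, and combines -n with the p flag so that nothing is printed when the pattern does not match:

```shell
# Hypothetical helper: print the NNNN-NN-NN part of a file name, or nothing.
extract_date() {
  printf '%s\n' "$1" |
    sed -n -E 's/.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*/\1/p'
}

extract_date "sample_2341-43-11.txt"
```

Without -n and p, a non-matching input line would be echoed back unchanged, which is easy to mistake for a successful extraction.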

PCRE Regex to SED

I am trying to take PCRE regex and use it in SED, but I'm running into some issues. Please note that this question is representative of a bigger issue (how to convert PCRE regex to work with SED) so the question is not simply about the example below, but about how to use PCRE regex in SED regex as a whole.
This example is extracting an email address from a line, and replacing it with "[emailaddr]".
echo "My email is abc#example.com" | sed -e 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g'
I've tried the following replace regex:
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}
([a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4})
[a-zA-Z0-9]+[#][a-zA-Z0-9]+[.][A-Za-z]{2,4}
I've tried changing the delimiter of sed from s/find/replace/g to s|find|replace|g as outlined here (stack overflow: pcre regex to sed regex).
I am still not able to figure out how to use PCRE regex in SED, or how to convert PCRE regex to SED. Any help would be great.
Want PCRE (Perl Compatible Regular Expressions)? Why don't you use perl instead?
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com"
Output:
My email is [emailaddr]
Write output to a file with tee:
perl -pe 's/[a-zA-Z0-9]+[#][a-zA-Z0-9]+[\.][A-Za-z]{2,4}/[emailaddr]/g' \
<<< "My email is abc#example.com" | tee /path/to/file.txt > /dev/null
Use the -r flag enabling the use of extended regular expressions. ( -E instead of -r on OS X )
echo "My email is abc#example.com" | sed -r 's/[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[A-Za-z]{2,4}/[emailaddr]/g'
GNU sed uses basic regular expressions or, with the -r flag, extended regular expressions.
Your regex as a POSIX basic regex (thanks mklement0):
[[:alnum:]]\{1,\}#[[:alnum:]]\{1,\}\.[[:alpha:]]\{2,4\}
Note that this expression will not match all email addresses (not by a long shot).
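For multiline matching, use -0, which makes perl read the whole file as a single record (as long as it contains no NUL bytes):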
perl -0pe 's/search/replace/gms' file
Sometimes this might be helpful too as a work-around:
str=$(grep -Poh "pcre-pattern" file)
sed -i "s/$str/$something_else/" file
-o, --only-matching:
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
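On the broader question of converting a PCRE pattern for sed, much of the work is replacing Perl shorthand classes with POSIX bracket expressions, because POSIX regular expressions do not define \d, \w, or \s. The sketch below uses two hypothetical helpers (mask_numbers and squeeze_blanks) and made-up sample strings to show the substitution \d to [[:digit:]] and \s to [[:space:]] under sed -E:

```shell
# PCRE original (for comparison): s/\d+/[num]/g
# Hypothetical helper: replace each run of digits with the text [num].
mask_numbers() {
  printf '%s\n' "$1" | sed -E 's/[[:digit:]]+/[num]/g'
}

# PCRE original (for comparison): s/\s+/ /g
# Hypothetical helper: collapse each run of whitespace into one space.
squeeze_blanks() {
  printf '%s\n' "$1" | sed -E 's/[[:space:]]+/ /g'
}

mask_numbers 'room 12, floor 3'
squeeze_blanks 'a   b'
```

Features such as lookahead, lookbehind, and non-greedy quantifiers have no POSIX equivalent, so patterns that rely on them are better left to perl, as suggested above.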

put regular expression in variable

output=`grep -R -l "${images}" *`
new_output=`regex "slide[0-9]" $output`
Basically $output is a string like this:
slides/_rels/slide9.xml.rels
The number in $output will change. I want to grab "slide9" and put that in a variable. I was hoping new_output would do that but I get a command not found for using regex. Any other options? I'm using a bash shell script.
Well, regex is not a program like grep. ;)
But you can use
grep -Eo "(slide[0-9]+)"
as a simple approach. -o means: show only the matching part, -E means: extended regex (allows more sophisticated patterns).
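To get the match into a variable, the grep call can be placed inside a command substitution. A minimal sketch follows; slide_name is a hypothetical function name, and head -n 1 is an added assumption that only the first match is wanted:

```shell
# Hypothetical helper: print the first "slide<digits>" found in a string.
slide_name() {
  printf '%s\n' "$1" | grep -Eo 'slide[0-9]+' | head -n 1
}

new_output=$(slide_name 'slides/_rels/slide9.xml.rels')
printf '%s\n' "$new_output"
```

The leading "slides" directory is not matched, because the pattern requires at least one digit directly after "slide".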
Reading I want to grab "slide9" and put that in a variable. I assume you want what matches your regexp to be the only thing put in $new_output? If so, then you can change that to:
new_output=`egrep -R -l "${images}" * | sed 's/.*\(slide[0-9]\{1,\}\).*/\1/'`
Note no setting of output= is required (unless you use that for something else)
If you need $output to use elsewhere then instead use:
output=`grep -R -l "${images}" *`
new_output=`echo ${output} | sed 's/.*\(slide[0-9]\{1,\}\).*/\1/'`
sed's s/// command is similar to Perl's s/// command and has an equivalent in most languages.
Here I'm matching zero or more characters (.*) before and after the "slide" plus digits pattern and capturing that pattern with \( ... \). In a basic regular expression the parentheses must be escaped to form a group, and + is an ordinary character, so "one or more" has to be written as \{1,\}; with -E or -r the parentheses are left unescaped and + works. We then replace the whole match (i.e. the whole line) with \1, which expands to the first captured group, in this case the slide name.
In these situations using awk is better :
output="`grep -R -l "main" codes`"
echo $output
tout=`echo $output | awk -F. '{for(i=1;i<=NF;i++){if(index($i,"/")>0){n=split($i,ar,"/");print ar[n];}}}'`
echo $tout
This prints the filename without the extension. If you want to grab only slide9, then use the solutions provided by others.
Sample output :
A#A-laptop ~ $ bash try.sh
codes/quicksort_iterative.cpp codes/graham_scan.cpp codes/a.out
quicksort_iterative graham_scan a

Java regex and sed aren't the same...?

Get these strings:
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
Apply this exp in java: ^(06700|067|00)([0-9]*).
My intention is to remove the leading "06700", "067" and "00" from the beginning of the string.
It all works in Java, where group 2 always holds the number I want, but in sed it isn't the same:
$ cat strings|sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
00543515703528
00582124628575
0034911320020
0034911320020
005217721320739
0902345623
067913187056
00543515703528
What the heck am I missing?
Cheers,
f.
When using extended regular expressions, you also need to omit the \ before ( and ). This works for me:
sed -r 's/^(06700|067|00)([0-9]*)/\2/g' strings
Note also that there's no need for a separate call to cat.
I believe your problem is this:
sed defaults to BRE: The default
behaviour of sed is to support Basic
Regular Expressions (BRE). To use all
the features described on this page
set the -r (Linux) or -E (BSD) flag to
use Extended Regular Expressions
Source
Without this flag, the | character is interpreted literally. Try this example:
echo "06700|067|0055555" | sed -e 's/^\(06700|067|00\)\([0-9]*\)/\2/g'
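For comparison, here is a sketch of the same substitution with extended regular expressions enabled, where | is an alternation operator and the parentheses group without backslashes. The function name strip_prefix is hypothetical; -E is used rather than -r on the assumption that the script should run on both GNU and BSD sed:

```shell
# Hypothetical helper: drop a leading 06700, 067, or 00 from a digit string.
strip_prefix() {
  printf '%s\n' "$1" | sed -E 's/^(06700|067|00)([0-9]*)/\2/'
}

strip_prefix 00543515703528
strip_prefix 067913187056
strip_prefix 0902345623
```

The third call prints its input unchanged, because 0902345623 starts with none of the three prefixes.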