Extracting string after matched pattern in Shell - regex

How can I extract whatever string comes after a matched pattern in a shell script? I know how to do this in Perl, but I don't know how to do it in shell scripting.
Here is an example:
Subject_01: This is a sample subject and this may vary
I need to extract whatever string follows "Subject_01:".
Any help please.

It depends on your shell.
If you're using a POSIX-style Bourne shell (sh, bash, or, I believe, pdksh), you can use parameter expansion like this:
$ string="Subject_01: This is a sample subject and this may vary"
$ output="${string#*: }"
$ echo "$output"
This is a sample subject and this may vary
$
Note that this is fairly rigid about format. The pattern above strips exactly ONE space after the colon. If there are more, the extra spaces remain at the beginning of $output; if there is none, the pattern does not match and the string is left unchanged.
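The spacing limitation just described can be worked around with two more expansions. This is only a minimal sketch, assuming a POSIX shell; the variable name rest and the extra spaces in the sample string are mine, not from the question:

```shell
string="Subject_01:    This is a sample subject and this may vary"

# Drop everything up to and including the first colon.
rest="${string#*:}"

# ${rest%%[![:space:]]*} is the leading run of whitespace (everything before
# the first non-space character); removing it as a prefix trims the value.
output="${rest#"${rest%%[![:space:]]*}"}"

printf '%s\n' "$output"
```

This stays inside the shell (no external processes) and works whether the colon is followed by zero, one, or many spaces.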
If you're using some other shell, you may have to do something like this, with the cut command:
> setenv string "Subject_01: This is a sample subject and this may vary"
> setenv output "`echo '$string' | cut -d: -f2`"
> echo $output
This is a sample subject and this may vary
> setenv output "`echo '$string' | sed 's/^[^:]*: *//'`"
> echo $output
This is a sample subject and this may vary
>
The first example uses cut, which is very small and simple, but -f2 returns only the text up to the next colon and keeps the space after the first one. The second example uses sed, which can do far more, but is a (very) little heavier in terms of CPU.
YMMV. There's probably a better way to handle this in csh (my second example uses tcsh), but I do most of my shell programming in Bourne.
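To see where cut and sed differ, here is a rough sketch of the same two pipelines in POSIX sh syntax (not csh). The sample subject with a second colon and the variable names are made up for illustration:

```shell
string="Subject_01: Re: meeting at 10:30"

# cut -f2 returns only the second colon-separated field, leading space included.
with_cut=$(printf '%s\n' "$string" | cut -d: -f2)

# cut -f2- returns the second field and everything after it.
with_cut_rest=$(printf '%s\n' "$string" | cut -d: -f2-)

# sed removes the label, the first colon, and any spaces that follow it.
with_sed=$(printf '%s\n' "$string" | sed 's/^[^:]*: *//')

printf '[%s]\n' "$with_cut" "$with_cut_rest" "$with_sed"
```

If the subject text can itself contain colons, the sed form (or cut with -f2-) is probably the safer choice.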

Related

Perl - Read input pipe as single variable

I have a very simple Perl script in which I run a regex against the data piped to my perl command, e.g.:
cat /tmp/myfile.txt | perl -wnE"say for /my_pattern/gi"
Note, I do realize that in my current setup the -n option wraps my command in a loop, e.g.:
while(<>){
say for /my_pattern/gi
}
...thus iterating over each line of input.
I'd like to change this so that I can run the regex against the entire output of my cat command at once. All the examples I find show processing input line by line.
Any help would be appreciated.
Update:
To be clear, I'm referring to any pipe, not only reading from a file as in my example (think curl, wget, echo, etc...) I'm not sure it is even possible given the fact that the originating command could be long-lived or run for an "indefinite" period of time.
To answer your question directly (undefining $/ in BEGIN makes the -n loop read all input as a single record):
cat aaaa.txt | perl -ne 'BEGIN{undef $/};print for /a/gi'
To get your existing line-by-line version to work:
cat aaaa.txt | perl -ne 'print if /aaaa/gi'

Converting LaTeX pmatrix command to amsmath pmatrix environment using sed

I have an old LaTeX document (with a lot of formatting commands) that I want to convert to the more modern LaTeX (I want to do the update for several reasons, not the least of which is to reduce the coupling between content and formatting). At any rate, the document has a lot of calls to the deprecated command \pmatrix{ .... } which I would like to replace with the new amsmath command \begin{pmatrix} ... \end{pmatrix}. I have been trying to use sed to do this conversion but I have never used it before and I am having trouble.
Here is a MWE
LaTeX input string
\pmatrix{0&0\cr \frac{1}{2}&0\cr 0&0\cr}\pmatrix{1&1\cr 1&1\cr 1&1\cr}
with the expected output
\begin{pmatrix}0&0\\ \frac{1}{2}&0\\ 0&0\end{pmatrix}\begin{pmatrix}1&1\\ 1&1\\ 1&1\end{pmatrix}
The commands that I have been trying to use are variants of the following
sed 's/\\pmatrix{\(.*\cr[ ]*\)}/\\begin{pmatrix}\1 \\end{pmatrix}/g' <$WORKING_FILE >$OUTPUT_FILE
but the closest output that I have been able to achieve is
\begin{pmatrix}0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}
I am pretty sure that the problem is related to having two calls to pmatrix side by side, but I am not sure how to modify the regex to make this work.
I have searched google, but being so new to regex, I just got confused by all of the variations out there and which to use, and how to properly format such a thing.
The following might work for you:
sed -re 's/(\\pmatrix)\{([^}]*)}/\\begin{pmatrix}\2\\end{pmatrix}/g' -e 's/\\cr/\\\\/g' -e 's/\\\\\\end/\\end/g' inputfile
This works by:
substituting \pmatrix{...} with \begin{pmatrix}...\end{pmatrix}
substituting \cr with \\
turning the resulting \\\end back into \end
EDIT: As per your update, you might be better off splitting the relevant parts using grep before piping to sed:
grep -oP '\\pmatrix.*?\\cr}' inputfile | sed -re 's/\\pmatrix\{(.*)}/\\begin{pmatrix}\1\\end{pmatrix}/g;s/\\cr/\\\\/g;s/\\\\\\end/\\end/g'
This might work for you (GNU sed):
sed -r 's/\\cr/\n/g;s/\\(pmatrix)\{([^\n]*)\n([^\n]*)\n([^\n]*)\n\}/\\begin{\1}\2\\\\ \3\\\\ \4\\end{\1}/g;s/\n/\\cr/g' file
This converts each \cr to a newline, performs the global substitution (which assumes exactly three rows per matrix), and then converts any remaining newlines back to \cr.
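Another possible approach sidesteps the nested braces by not matching the matrix body at all. It is only a sketch and rests on one assumption that you should check against your document: every \pmatrix body ends with \cr} (as in the MWE), and \cr} does not occur anywhere else. The variable names are mine:

```shell
input='\pmatrix{0&0\cr \frac{1}{2}&0\cr 0&0\cr}\pmatrix{1&1\cr 1&1\cr 1&1\cr}'

# 1. The final "\cr}" of each matrix becomes "\end{pmatrix}".
# 2. Each "\pmatrix{" becomes "\begin{pmatrix}".
# 3. Every remaining "\cr" becomes "\\".
converted=$(printf '%s\n' "$input" | sed \
  -e 's/\\cr *}/\\end{pmatrix}/g' \
  -e 's/\\pmatrix{/\\begin{pmatrix}/g' \
  -e 's/\\cr/\\\\/g')

printf '%s\n' "$converted"
```

Because the body is never captured, braces inside it (such as \frac{1}{2}) and the number of rows do not matter. Basic regex syntax is used, so { and } are literal and no -r or -E flag is needed.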

Find non-ASCII codepoints in a file

I am currently using this regex to find the non-ASCII code points in a file, no matter what encoding:
$ cat test.txt | hd | grep -P " [8-9a-f][\da-f]"
Is there a better, more concise, or less hacky method? I usually use grep -P "[^\x00-\x7f]" to find the offensive characters but here I am looking for the offensive code points.
Note that the current hacky method does have the nice side effect of showing the surrounding ASCII characters, which is very nice for context.
Using hd, this should be faster:
hd test.txt |grep -w '[89a-f][0-9a-f]'
(grep -P invokes libpcre and is slower. grep -w searches just "words" and will default to standard posix regex, which is nearly as fast as a -F plain text query. Removing the cat from the pipe also saves (trivial) effort.)
If you didn't want the context, you could give grep the -o flag. If you want the context called out more clearly, consider --color (or even --color=always if you're piping the output somewhere and don't mind the coloring control characters). You may also find grep's -n flag useful, which will give you line numbers.
I think you can use grep's -a flag to achieve what you're looking for in a single command (this forces everything to be read as text rather than the useless "Binary file test.txt matches" output), though you may not like what the output does to your terminal. Maybe pipe it into a file and then view that file with vim (which, unlike less, won't render control characters):
grep -aP '[^\x00-\x7f]' test.txt > found-highchars
view found-highchars
This may or may not be faster than piping through hd and grep.
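If hd or grep -P is not available, something along these lines might do for listing just the offending byte values. It is a rough sketch that swaps in od for hd; the sample input (an "é" encoded as UTF-8) and the variable name are made up:

```shell
# od -An -tx1 -v prints every byte as two hex digits, without offsets.
# tr puts each byte on its own line; grep keeps values 80 through ff.
high_bytes=$(printf 'caf\303\251\n' \
  | od -An -tx1 -v \
  | tr -s ' ' '\n' \
  | grep '^[89a-f][0-9a-f]$')

printf '%s\n' "$high_bytes"
```

Unlike the hd pipeline, this drops the surrounding ASCII context, so it is mainly useful for checking whether any high bytes exist and which ones.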

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items per line that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except a dash and the seven-digit number preceding the ".html/blablabla" bit.
Currently my sed command only changes the date part. It would have to leave the title and the start of the URL alone, then cut everything out of the URL until it reaches the seven-digit number. It needs to preserve that number and cut everything after it. Oh yeah, and we need to keep a dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.
Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
The key to writing this sort of regular expression is to be very careful about the boundaries of what you expect to match, so that the random gunk you want to get rid of doesn't cause problems. Also, bear in mind that you can use characters other than / as the delimiter of an s command.
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
Something like this maybe?
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
You can combine it with your first sed command like so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
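As suggested above, it is worth testing the expression before wiring it into the feed. A minimal check of the basic-regex substitution against the sample line from the question might look like this (the variable names line and short are mine):

```shell
line='Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom'

# Keep "://host/" and the "-digits" just before ".html/"; drop the rest of the URL.
short=$(printf '%s\n' "$line" \
  | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|')

printf '%s\n' "$short"
```

Lines that do not contain a matching URL pass through unchanged, which is usually what you want for a feed.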

Extracting username from UNIX path using Regex

I need to get a username from an Unix path with this format:
/home/users/myusername/project/number/files
I just want "myusername". I've been trying for almost an hour and I'm completely clueless.
Any idea?
Thanks!
Maybe just /home/users/([a-zA-Z0-9_\-]*)/.*?
Note that the critical part [a-zA-Z0-9_\-]* has to contain all valid characters for Unix usernames. As far as I know, a username should only contain digits, letters, dashes and underscores.
Also note that the extracted username is not the whole match, but the first group (indicated by (...)).
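To make the "first group, not the whole match" point concrete, here is a sketch that applies the same idea with sed from a POSIX shell. The variable names are mine, and the character class is the one assumed above for valid usernames:

```shell
path_in="/home/users/myusername/project/number/files"

# -n plus the p flag prints only when the substitution succeeds, so a path
# that does not match yields an empty result instead of the original string.
username=$(printf '%s\n' "$path_in" \
  | sed -n 's|^/home/users/\([A-Za-z0-9_-]*\)/.*|\1|p')

no_match=$(printf '%s\n' "/var/tmp/file" \
  | sed -n 's|^/home/users/\([A-Za-z0-9_-]*\)/.*|\1|p')

printf 'username: [%s]\n' "$username"
printf 'no match: [%s]\n' "$no_match"
```

Using | as the delimiter avoids having to escape every slash in the path.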
The best answer to this depends on what you are trying to achieve. If you want to know the user who owns that file, you can use the stat command. Unfortunately its syntax differs slightly between operating systems, but the following two commands work:
Mac OS X
stat -f '%Su' /home/users/myusername/project/number/files
Red Hat/Fedora/CentOS
stat -c '%U' /home/users/myusername/project/number/files
If you really do want the string following /home/users, then either of the regexes provided above will do that. You could use one in a bash script as follows (Mac OS X):
USERNAME=$(echo '/home/users/myusername/project/number/files' | \
sed -E -e 's!^/home/users/([^/]+)/.*$!\1!g')
Check http://rubular.com/r/84zwJmV62G. The first capture group, not the entire match, is the username.
In a Bourne shell, something like:
string="/home/users/STRINGWEWANT/some/subdir/here"
echo "$string" | awk -F/ '{print $4}'
would be one option, assuming it's always the third element of the path (awk field 4, because the leading slash makes field 1 empty). There are more lightweight options that use only shell builtins:
echo ${x#*users/}
will strip out everything up to and including 'users/'
echo ${y%%/*}
will strip out the remainder.
So to put it all together:
path="/home/users/STRINGWEWANT/some/other/dirs"
y=${path#*users/} && echo "${y%%/*}"
STRINGWEWANT
Also check out the bash man page and search for "Parameter Expansion".
(\/home\/users\/)([^\/]+)
The second capture group (\2 or $2 in most tools) will be myusername.