Keep string between start and end pattern - regex

I have a text file containing this kind of content :
d__Affenpinscher|c__Abyssinian|h__Kathiawari|
a__Gold|y__Slix|c__Kathiawari|c__Cact
And I would like to obtain all the occurrences that start with "c__" and end with "|", so that the final result is:
c__Abyssinian
c__Cact
I'm not that good with regular expression, so thanks for your help in advance.
edit : I'm looking for a bash command so grep/sed/awk are available
I tried to start from a basic example like :
sed -n "/<PRE>/,/<\/PRE>/p" input.html
with <PRE> and </PRE> being the start and the end of the pattern
to
sed -n "/c__/,/|/p" breedList.txt > breedC.txt
But I didn't get the wanted output.
Edit 2: I tried to adapt this answer from a similar thread, How to use sed/grep to extract text between two words?, but I must be doing something wrong since my output is just empty.
Here is the command I tried :
echo "d__Affenpinscher|c__Abyssinian|h__Kathiawari|" | grep -o -P '(?<=c__).*?(?=|)'

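The answer from rkta did the trick, thanks :)
echo "d__Affenpinscher|c__Abyssinian|h__Kathiawari|" | grep -o -P '(?<=c__).*?(?=\|)'
The vertical bar | is a special character and needs to be escaped. Unescaped, (?=|) is an alternation of two empty patterns that always succeeds, so the lazy .*? matches nothing and the output is empty.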
You say: start with "c__" and end with "|", but c__Cact doesn't end with |
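As the comment above points out, c__Cact has no trailing "|", and the lookbehind version also drops the "c__" prefix from what it prints. A minimal sketch of an alternative, assuming the goal is every token that starts with "c__" and runs up to the next "|" or the end of the line (the variable names input and result are only for illustration):

```shell
# Sample input copied from the question.
input='d__Affenpinscher|c__Abyssinian|h__Kathiawari|
a__Gold|y__Slix|c__Kathiawari|c__Cact'

# Match "c__" followed by any run of characters that are not "|".
# No lookaround is needed, so plain grep (without -P) is enough.
result=$(printf '%s\n' "$input" | grep -o 'c__[^|]*')
printf '%s\n' "$result"
```

On this input the sketch prints c__Abyssinian, c__Kathiawari and c__Cact. Note that c__Kathiawari is included even though the question's expected output omits it, since it also starts with "c__".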

Related

Filtering a variable in bash script using regex tr or awk

row1='+00 00:30:07.880000'
rowX=$(echo "$row1" | tr -dc '0-9')
I basically want to filter out all the special characters and spaces.
I would like the output to be as follows:
row1=003007.880000
You don't need regular expressions or external commands like tr for this. Bash's built-in parameter expansion can do it:
row1='+00 00:30:07.880000'
row1=${row1//[^0-9.]/}
echo "row1=$row1"
outputs row1=00003007.880000.
The output has two leading zeros that are not in the output suggested in the question. Maybe there's an unstated requirement to remove prefixes delimited by spaces. If that is the case, possible code is:
row1='+00 00:30:07.880000'
row1=${row1##* }
row1=${row1//[^0-9.]/}
echo "row1=$row1"
That outputs row1=003007.880000.
See How do I do string manipulations in bash? for explanations of ${row1//[^0-9.]/} and ${row1##* }.
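The two expansions can be followed one step at a time. This is only a sketch of what each step leaves behind; the intermediate variable names last_field and digits are made up for illustration:

```shell
row1='+00 00:30:07.880000'

# ${var##* } removes the longest prefix that ends in a space,
# leaving only the last space-separated field: 00:30:07.880000
last_field=${row1##* }

# ${var//pattern/} deletes every match of the pattern; here, every
# character that is neither a digit nor a dot: 003007.880000
digits=${last_field//[^0-9.]/}

echo "row1=$digits"
```

Both forms are bash parameter expansions, so no external process is started.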
This is the easiest way to do that:
$ echo '+00 00:30:07.880000' | tr -dc '0-9.'
00003007.880000

Extract substring from string with sed

I want to extract MIB-Objects from snmpwalk output. The output FILE looks like:
RFC1213-MIB::sysDescr.0.0.0.0.192.168.1.2 = STRING: "Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u1 (2017-06-18) x86_64"
RFC1213-MIB::sysObjectID.0 = OID: RFC1155-SMI::enterprises.8072.3.2.10
..
First, I read the output file, split each line at the = character, and remove RFC1213-MIB:: and everything from .0 to the end of the string.
while read -r; do echo "${REPLY%%=*}" | sed -e 's/RFC1213-MIB::\(.*\)\.0/\1/'; done < "$FILE"
My current output:
sysDescr.0.0.0.192.168.1.2
sysObjectID
How can I remove the other values? Is there a better solution of extracting sysDescr, sysObjectID?
With awk:
awk -F'[:.]' '{print $3}'
(define : and . as field delimiters and display the 3rd field)
With sed (GNU):
sed 's/^[^:]*::\|\.0.*//g'
(replace with the empty string either everything up to and including the first :: at the start of the line, or the first .0 and all following characters up to the end of the line)
Maybe you can try with:
sed 's/RFC1213-MIB::\([^.]*\).*/\1/' "$FILE"
This keeps everything after the RFC1213-MIB:: string up to, but not including, the first dot (.).
If you don't want to use sed, you can just use parameter expansion. sed is an external process, so it won't be as fast as parameter expansion, which is built into bash.
while IFS= read -r line; do line=${line#*::}; line=${line%%.*}; echo "$line"; done < file
line=${line#*::} removes everything up to and including the first ::, so it assumes RFC1213-MIB itself contains no :: and is separated from sysDescr by ::.
line=${line%%.*} removes everything from the first . onward, so it assumes sysDescr is followed by a . character.
If you have more examples, that you think won't work, I can update my answer.
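To check those two assumptions against the sample lines from the question, the loop body can be wrapped in a small function. This is a sketch only; mib_name is a hypothetical helper name, not part of any tool:

```shell
# mib_name: hypothetical helper that prints the object name of one snmpwalk line.
mib_name() {
  line=$1
  line=${line#*::}    # drop the shortest prefix ending in "::" (the MIB name)
  line=${line%%.*}    # drop everything from the first "." onward
  printf '%s\n' "$line"
}

# Sample lines taken from the question.
mib_name 'RFC1213-MIB::sysDescr.0.0.0.0.192.168.1.2 = STRING: "Linux debian"'
mib_name 'RFC1213-MIB::sysObjectID.0 = OID: RFC1155-SMI::enterprises.8072.3.2.10'
```

The second line shows why the shortest-prefix form (#) matters: the later RFC1155-SMI:: is left alone because only the first :: is consumed, and the value is then cut off at the first dot anyway.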

Grep match a certain key/value set json

IF THIS IS THE FIRST TIME YOU'RE READING THIS QUESTION, SKIP RIGHT TO THE EDIT
So what I'm trying to do is match everything until a certain word
What I'm working with is similar to this:
{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere
The regex I got so far is:
grep -o "\{\"selling\":\"0\""
which will match up to {"selling":"0".
I want it to match {"selling":"0"morestuffhere"notes":"otherthingshere"} but NOT unwantedthingshere.
I don't know beforehand what "morestuffhere", "otherthingshere" and "unwantedthingshere" are going to be. So what I want to do is match everything from what I already have until "notes":"otherthingshere"}.
How do I do this?
EDIT: forgot to mention some key points. Sorry, had to hurry because dinner was ready.
My input consists of a series of key:value sets, as such:
{"key":"value", "otherkey":"othervalue","morekeys":"morevalues"},{"othersetkey":"othersetvalue","otherothersetkey":"otherothersetvalue","othersetmorekeys":"othersetmorevalues"}
and so on.
The first key/value set is different from the rest of them, and I don't want to match that set.
The first key of all sets other than the first is "selling", and I want to match all sets that have a "selling" value of 1. The last key of the set is "notes".
The input is JSON, so I added that to the tags.
Through sed,
sed -r 's/^[^{]*([^}]*).*$/\1}/g' file
Example:
$ echo 'dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere' | sed -r 's/^[^{]*([^}]*).*$/\1}/g'
{"selling":"0"morestuffhere"notes":"otherthingshere"}
I think you want something like this,
$ cat aa
dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}{"selling":"1"morestuffhere"notes":"otherthingshere"}bgj
$ sed -r 's/.*(\{"selling":"1"[^}]*)}.*/\1}/g' aa
{"selling":"1"morestuffhere"notes":"otherthingshere"}
OR
something like this,
$ cat aa
dSDGAadb{"selling":"0"morestuffhere"notes":"otherthingshere"}{"selling":"1"morestuffhere"notes":"otherthingshere"}bgj{"selling":"1"morestuffhere"notes":"otherthingshere"}
$ grep -oP '{\"selling\":\"1\"[^}]*}' aa
{"selling":"1"morestuffhere"notes":"otherthingshere"}
{"selling":"1"morestuffhere"notes":"otherthingshere"}
You could do this with grep:
grep -o '{[^}]*}' file
This matches an opening curly brace, followed by anything that isn't a closing curly brace, followed by a closing curly brace.
Testing it out on your input:
$ grep -o '{[^}]*}' <<<'{"selling":"0"morestuffhere"notes":"otherthingshere"}unwantedthingshere'
{"selling":"0"morestuffhere"notes":"otherthingshere"}
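For the requirement in the question's EDIT (keep only the sets whose "selling" value is 1), the same brace pattern can be combined with a second, fixed-string grep. This is a sketch under two assumptions: the sets are not nested, and no value contains a } character. The sample input is made up for illustration:

```shell
# Made-up input: one unrelated first set, then two "selling" sets.
input='{"key":"value"},{"selling":"0","notes":"a"},{"selling":"1","notes":"b"}'

# Step 1: print each {...} set on its own line.
# Step 2: keep only the lines containing the literal text "selling":"1".
matches=$(printf '%s\n' "$input" | grep -o '{[^}]*}' | grep -F '"selling":"1"')
printf '%s\n' "$matches"
```

With this input, only {"selling":"1","notes":"b"} is printed. For real JSON with nesting or escaped characters, a proper JSON parser remains the safer choice.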
What's wrong with
>> grep -o ".*}" file.txt
{"selling":"0"morestuffhere"notes":"otherthingshere"}
where file.txt contains your example string?
I've never found a good way to do this kind of thing with json in the shell with basic unix tools like grep, sed, etc. A quick and dirty ruby or python script is your friend,
#!/usr/bin/env ruby
# h.rb
require 'json'
key=ARGV.shift
json=ARGF.read
h=JSON.parse(json)
puts h.key?(key) ? h[key] : "not found"
And then pipe your json into the script specifying the key as a parameter,
$ echo '{"key":"value", "otherkey":"othervalue","morekeys":"morevalues"}' | /tmp/h.rb otherkey
othervalue
or from a file,
$ cat /tmp/h.json | /tmp/h.rb otherkey
othervalue

Regular expressions with grep

So I have a bunch of data that all looks like this:
janitor#1/2 of dorm#1/1
president#4/1 of class#2/2
hunting#1/1 hat#1/2
side#1/2 of hotel#1/1
side#1/2 of hotel#1/1
king#1/2 of hotel#1/1
address#2/2 of girl#1/1
one#2/1 in family#2/2
dance#3/1 floor#1/2
movie#1/2 stars#5/1
movie#1/2 stars#5/1
insurance#1/1 office#1/2
side#1/1 of floor#1/2
middle#4/1 of December#1/2
movie#1/2 stars#5/1
one#2/1 of tables#2/2
people#1/2 at table#2/1
Some lines have prepositions, others don't so I thought I could use regular expressions to clean it up. What I need is each noun, the # sign and the following number on its own line. So for example, the first lines of output should look like this in the final file:
janitor#1
dorm#1
president#4
etc...
The list is stored in a file called NPs. My code to do this is:
cat NPs | grep -E '\b(\w*[#][1-9]).' >> test
When I open test, however, it's the exact same as the input file. Any input as to what I'm missing? It doesn't seem like it should be a hard operation, so maybe I'm missing something about syntax? I'm using this command from a shell script that is called in bash.
Thanks in advance!
This should do what you need.
The -o option shows only the part of a matching line that matches the pattern.
grep -Eo '[[:alpha:]]+#[0-9]+' NPs > test
or even the -P option, which interprets the pattern as a Perl regular expression
grep -Po '[\w#]*(?=/)' NPs > test
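A quick way to try the -o approach without touching the real file is to feed it a few sample lines. This sketch uses a POSIX character class so that capitalised nouns such as December are matched regardless of locale; the variable names sample and nouns are only for illustration:

```shell
# Three lines taken from the question's data.
sample='janitor#1/2 of dorm#1/1
hunting#1/1 hat#1/2
middle#4/1 of December#1/2'

# One or more letters, a "#", then one or more digits; -o prints each
# match on its own line, so prepositions and the "/N" parts are dropped.
nouns=$(printf '%s\n' "$sample" | grep -Eo '[[:alpha:]]+#[0-9]+')
printf '%s\n' "$nouns"
```

Prepositions such as "of" are skipped automatically because they are not followed by a # sign.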
Using grep:
$ grep -o "\w*[#]\w*" inputfile
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2
By default, grep and its variations extract entire lines from the text if they match the pattern. If you need to modify lines, you should use sed, for example (note that this keeps only the first noun#number pair on each line):
sed 's/^\(\b\w*[#][1-9]\).*$/\1/' NPs
You need sed, not grep. (Or awk, or perl.) It looks like this would do what you want:
cat NPs | sed 's?/.*??'
or simply
sed 's?/.*??' NPs
s means "substitute". The next character is the delimiter between regular expressions. Usually it's "/", but since you need to search for "/", I used "?" instead. "." refers to any character, and "*" says "zero or more of what preceded me". Whatever is between the last two delimiters is the replacement string. In this case it's empty, so you're replacing "/" followed by zero or more of any character, with the empty string.
EDIT: Oh, I see now that you wanted to extract the last item on the line, too. Well, I'm sure that others' suggested regexps would work. If it were my problem, I'd probably filter the file in two steps, perhaps piping the results from one step to the next, or using multiple substitutions with sed: First delete the "of"s and middle spaces, and add newlines, and then run sed as above. It's not as cool as doing it all in one regexp, but each step is easier to understand. For even more simplicity and uncoolness, use three steps, replacing " of " with space in the first step. Since others have provided complete solutions, I won't work out the details.
Grep by default just searches for the text, so in your case it is printing the lines that match. I think you want to investigate sed instead to perform the replacement. (And you don't need to cat the file, just grep PATTERN filename)
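To get your output on separate lines, this worked for me:
sed 's|/.||g; s/ .. / /' NPs | tr ' ' '\n'
This uses two sed substitutions (one to strip the /N suffixes, one to replace a two-letter preposition with a single space) and tr to turn the remaining spaces into line feeds, so lines without a preposition are split as well.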
The -o option in grep, which causes it to print out only the matching text, as described in another answer, is probably even simpler!
An awk version:
awk '/#/ {print $NF}' RS="/" NPs
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2

Bash script regex for file size

I'm trying to extract the size (in KB) of a file. I'm trying to do so as follows:
textA=$(du a)
sizeA=$(expr match "$textA" '\(^[^\s]*\)')
textB=$(du b)
sizeB=$(expr match "$textB" '\(^[^\s]*\)')
echo $textA
echo $sizeA
echo $textB
echo $sizeB
[[ $sizeA == $sizeB ]] && echo "eq"
But this just prints textA and textB to the console. Both look like:
30745 a
Can someone please explain why the regex is not matching? I've tried to test the regex against the text on many sites, just to make sure, and it appears to capture the correct text.
I've also tried changing it to:
'^\([^\s]*\)'
But this way it will capture all the text. Any thoughts?
My expr match does not understand \s or other extended regexps. Try '\([0-9]*\)' instead.
But as others mentioned already, using regexp for getting "the first word" is a little overkill. I'd use du s | { read a b; echo $a; }, but you could also use the awk version or solutions using cut.
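Another way to take "the first word" without any regex or external command is a suffix-removing parameter expansion. A minimal sketch, using a hard-coded stand-in for the du output shown in the question (du separates the size and the name with a tab):

```shell
# Stand-in for textA=$(du a); the real value depends on the file.
textA=$(printf '30745\ta')

# Remove the longest suffix that starts with a whitespace character,
# which leaves only the leading size field.
sizeA=${textA%%[[:space:]]*}
echo "$sizeA"
```

This relies only on POSIX shell pattern matching, so it avoids the question of which regex syntax expr supports.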
Not a direct answer, but I would do it like this:
sizeA=$(du a | awk '{print $1}')
size=$(wc -c < file)
If you want to use du, I would use the bash builtin read:
read size filename < <(du file)
Note that you can't say du file | read size filename because in bash, components of a pipeline are executed in subshells, so the variables will disappear when the subshell exits.
Do not parse the output of du. If it is available, you can use stat (the GNU coreutils syntax is shown here) to get the size of a file in bytes:
sizeA=$(stat -c%s "${fileA}")
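Putting the pieces together for the original goal of comparing two files, here is a sketch that uses wc -c, since it behaves the same on GNU and BSD systems. It compares byte counts rather than the disk-usage blocks that du reports, and the temporary files are created only for the demonstration:

```shell
# Create two throwaway files of equal length for the demonstration.
tmpdir=$(mktemp -d)
printf 'hello' > "$tmpdir/a"
printf 'world' > "$tmpdir/b"

# Reading from stdin makes wc print only the count, without a file name.
sizeA=$(wc -c < "$tmpdir/a")
sizeB=$(wc -c < "$tmpdir/b")

# -eq compares numerically, so any padding around the count is harmless.
if [ "$sizeA" -eq "$sizeB" ]; then verdict="eq"; else verdict="ne"; fi
echo "$verdict"

rm -rf "$tmpdir"
```

A numeric comparison with -eq is used instead of the string comparison in the question because some wc implementations pad the count with spaces.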