Extracting a string between two patterns in bash with a regex - regex

I have a string with key/value pairs in a bash variable. The value I want is hidden like this.
{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
...............bunch more keys
}
What should I use to extract that value? sed, awk, expr match?
My thinking is this, I should extract the string that is preceded by "keyForValueIWant"=>" and is followed by " but I'm having a hard time deciding which tool to use.
expr match seems bad, because it grabs a string at the end of an expression or at the beginning of one, but my string is in the middle of a bunch of characters.
Basically, I can't figure out the regex syntax for a substring between two other substrings.

You can use the following sed command:
valueOfInterest=$(sed -n '/keyForValueIWant/ s/.*=>"\([^"]*\).*/\1/p' <<< "$input")
-n disables output by default. The regex /keyForValueIWant/ restricts the following action only to the/those lines which match the regex. The following substitute command filters the value out of the line and prints it /p.

Try awk as follows:
# Specify key of interest.
key='keyForValueIWant'
# Extract matching value, assuming that the input data is
# in shell variable $input:
value=$(awk -F'("|=>)' -v key="$key" '$2==key { print $5; exit }' <<<"$input")
# Print result.
echo "Value for $key: [$value]"
-F'("|=>)' tells awk to split each line into fields based on " or => as separators - effectively, this will put the the key in field 2 ($2), and the value in field 5 ($5)
The key of interest is passed as a shell variable ($key) to awk as a variable of the same name (-v key=...).
If the input line's key matches the specified key ($2==key), the 5th field - containing the value - is printed (print $5).
exit ensures that processing stops once a match is found to prevent unnecessary parsing of the remainder of the file (note: this assumes that the keys are true keys, i.e., that they are unique in the input file).

Related

Perl regex - print only modified line (like sed -n 's///p')

I have a command that outputs text in the following format:
misc1=poiuyt
var1=qwerty
var2=asdfgh
var3=zxcvbn
misc2=lkjhgf
etc. I need to get the values for var1, var2, and var3 into variables in a perl script.
If I were writing a shell script, I'd do this:
OUTPUT=$(command | grep '^var-')
VAR1=$(echo "${OUTPUT}" | sed -ne 's/^var1=\(.*\)$/\1/p')
VAR2=$(echo "${OUTPUT}" | sed -ne 's/^var2=\(.*\)$/\1/p')
VAR3=$(echo "${OUTPUT}" | sed -ne 's/^var3=\(.*\)$/\1/p')
That populates OUTPUT with the basic content that I want (so I don't have to run the original command multiple times), and then I can pull out each value using sed VAR1 = 'qwerty', etc.
I've worked with perl in the past, but I'm pretty rusty. Here's the best I've been able to come up with:
my $output = `command | grep '^var'`;
(my $var1 = $output) =~ s/\bvar1=(.*)\b/$1/m;
print $var1
This correctly matches and references the value for var1, but it also returns the unmatched lines, so $var1 equals this:
qwerty
var2=asdfgh
var3=zxcvbn
With sed I'm able to tell it to print only the modified lines. Is there a way to do something similar with in perl? I can't find the equivalent of sed's p modifier in perl.
Conversely, is there a better way to extract those substrings from each line? I'm sure I could match match each line and split the contents or something like that, but was trying to stick with regex since that's how I'd typically solve this outside of perl.
Appreciate any guidance. I'm sure I'm missing something relatively simple.
One way
my #values = map { /\bvar(?:1|2|3)\s*=\s*(.*)/ ? $1 : () } qx(command);
The qx operator ("backticks") returns a list of all lines of output when used in list context, here imposed by map. (In a scalar context it returns all output in a string, possibly multiline.) Then map extracts wanted values: the ternary operator in it returns the capture, or an empty list when there is no match (so filtering out such lines). Please adjust the regex as suitable.
Or one can break this up, taking all output, then filtering needed lines, then parsing them. That allows for more nuanced, staged processing. And then there are libraries for managing external commands that make more involved work much nicer.
A comment on the Perl attempt shown in the question
Since the backticks is assigned to a scalar it is in scalar context and thus returns all output in a string, here multiline. Then the following regex, which replaces var1=(.*) with $1, leaves the next two lines since . does not match a newline so .* stops at the first newline character.
So you'd need to amend that regex to match all the rest so to replace it all with the capture $1. But then for other variables the pattern would have to be different. Or, could replace the input string with all three var-values, but then you'd have a string with those three values in it.
So altogether: using the substitution here (s///) isn't suitable -- just use matching, m//.
Since in list context the match operator also returns all matches another way is
my #values = qx(command) =~ /\bvar(?:1|2|3)\s*=\s*(.*)/g;
Now being bound to a regex, qx is in scalar context and so it returns a (here multiline) string, which is then matched by regex. With /g modifier the pattern keeps being matched through that string, capturing all wanted values (and nothing else). The fact that . doesn't match a newline so .* stops at the first newline character is now useful.
Again, please adjust the regex as suitable to yoru real problem.
Another need came up, to capture both the actual names of variables and their values. Then add capturing parens around names, and assign to a hash
my %val = map { /\b(var(?:1|2|3))\s*=\s*(.*)/ ? ($1, $2) : () } qx(command);
or
my %val = qx(command) =~ /\b(var(?:1|2|3))\s*=\s*(.*)/g;
Now the map for each line of output from command returns a pair of var-name + value, and a list of such pairs can be assigned to a hash. The same goes with subsequent matches (under /g) in the second case..
In scalar context, s/// and s///g return whether it found a match or not. So you can use
print $s if $s =~ s///;

How to use sed to search and replace a pattern who appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is my command can't handle correctly the following line because there are multiple KEY- inside my pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times and I prefer to avoid that. I also tried with lazy operators but the problem is the same.
How can I create a regex and a sed command who can handle correctly all my file?
The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
See the online demo.
The (##)(.*?)(##) pattern will match strings between two adjacent ## substrings capturing the start delimiter into Group 1, end delimiter in Group 3, and all text in between into Group 2. Since the regex substitution re-sets all placeholders, the temporary variable is used to keep the value of the end delimiter ($end_delim=$3), then, "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" replaces the match with the value in the Group 1 of the first match (the first ##), then the Group 2 value with all KEY- replaced with VALUE-, and then the end delimiter.
If there are no KEY-s in between matches on the same line you may use a branch with sed by enclosing your command with :A and tA:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Note you missed the first placeholder in \VALUE-\2, it should be \1VALUE-\2.
See the online demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed allows the usage of loops and branches. The :A in the code above is a label, a special location marker that can be "jumped at" using the appropriate operator. t is used to create a branch, this "command jumps to the label only if the previous substitute command was successful". So, once the pattern matched and the replacement occurred, sed goes back to where it was and re-tries a match. If it is not successful, sed goes on to search for the matches further in the string. So, tA means go back to the location marked with A if there was a successful search-and-replace operation.
This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert ##'s to newlines. Using a loop, replace VAL- between matched newlines to VALUE-. When all done replace newlines by ##'s.

Extract unique lines by pattern

wish to ask bash regex question. I need to print unique lines from the list.
This list contains emails and some of them repeated many times and also some of them have same id and password but different mail accounts.
The list looks as follows:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
firstman#yahoo.com:pass1234
thirdman#cox.net:mypas345
someguy#mail.com:onepass789
firstman# someguy# repeated 2 times but with other mail providers.
I need to get following output:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
thirdman#cox.net:mypas345
uniq -u do this job just partly - it compares full line, instead i need to compare strings outside of #emailprovider: pattern.
How to "discard" this pattern while extracting unique lines ?
With AWK you can say:
awk -F'[#:]' '!seen[$1,$3]++' inputlist
yields:
firstman#gmail.com:pass1234
someguy#yahoo.com:onepass789
secondman#gmail.com:looksPass
thirdman#cox.net:mypas345
-F'[#:]' sets the field separator to either "#" or ":".
Then $1 holds the string before "#" and $3 does after ":".
The condition '!seen[$1,$3]++' tells AWK to print the line if the $1,$3 entry is not seen.
If you flip the fields around you can use --skip-fields=1 (or -f 1) to only consider the emails for uniqueness.
You can use the following awk command instead:
awk -F# '!s[$1]{s[$1]=1;print}' filename

Extract multiple independent regex matches per line

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of following criteria would be a bonus but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost first, but I am unsure how to check). In the output, however, the XC-string should always be before the XM-string, if at all possible.
The answers from here awk extract multiple groups from each line come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for reversed values:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\1, \2/;p;' file
Following awk solution may help you in same.
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear
You can try this sed
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
explanation :
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# if XCZ come before XMZ, go to A but before everything restore the pattern space with x
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XMZ come before XCZ, get the interresting parts and reorder it
b
# It is all for this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XCZ come before XMZ, get the interresting parts
' infile
another awk
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # if didn't match check other other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristic.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last of the repeated match before the first match of the other. It they appear in pairs, will print the first pair.
Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character advancing the searched string). When the pattern is found in the line $0, then you only need to extract the important parts, which start 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
If you are willing to use gawk, then you could use the patsplit function for string operations (Ref. here). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you directly the requested strings in a single call which include the XM:Z: or XM:C: part. Afterwards you can easily sort them and extract the last parts.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,5)","substr(xm,5)' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.

Last Occurrence of Character Field Separator AWK

I'm using a find command to find all files of a certain format, that command has been golden. I'm piping that output into an awk command and I want to use the last underscore as a field separator. The problem being that depending on the path the file is in, there could be one or two underscores before the fact.
find . -regex ".*prob[0-9]*_.*" | awk 'BEGIN { FS = "_.*$" } { print $1 " " $2 }'
I get what's wrong with the regular expression in my field separator, it thinks to separate on the underscore and whatever follows, is there away to specify just the single character itself. Moreover, how do I specifically use a field separator on the last occurrence of a character.
This is somewhat an extension of a question I asked earlier:
Suppress output to StdOut when piping echo
The files I get are generally like this, the wrinkle being that the directory can have an underscore as well:
/the/directory/probXXXXX_XX
where X is any integer.
A workaround I've been thinking of is separating at every underscore and then print every column... I'd rather like to get it working in the method above though.
A trick of awk that is not obvious is that $ is an operator; you can use it with a variable or even an expression, and in particular with expressions involving the predefined variable NF: $NF gets the last field, $(NF - 1) the second last field.