Using sed to find inside of quotes and skip escaped quotes - regex

I have a curl call that queries the JIRA REST API and returns a JSON string like the following (except on a single line):
{
"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog",
"id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112",
"key":"FOO-1218",
"fields":
{"summary":"the \"special\" field is not returning what is expected"}
}
I was trying to parse out the "summary" field using this sed script:
sed 's/^.*summary":"\([^"]*\)".*$/\1/'
Which works fine if the "summary" doesn't have an escaped \" inside of it - but of course, with the escaped quote, all I get back from the example is:
the \
My desired output would either be:
the \"special\" field is not returning what is expected
Or even more fancily this:
the "special" field is not returning what is expected
It doesn't appear that I can do a lookbehind in sed, is there a simple way to solve this in a bash script?

You're asking for a JSON parser written in sed. Sorry, but this is insane.
Here's an example of a sane way to do this in python:
import requests
response = requests.get(JIRA_API_ENDPOINT, headers = JIRA_HEADERS)
obj = response.json()
obj['fields']['summary']
There's also a good JIRA API wrapper in python, called jira-python. Just use that and you won't have to do any parsing at all. I've used it to good effect before. Link here: http://jira-python.readthedocs.org/en/latest/
Your coworkers will thank you.
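If the response is already in hand as text (for example, saved from the curl call), the standard-library json module is enough on its own. A minimal sketch, assuming the payload has the shape shown in the question; the sample string and variable names are illustrative only:

```python
import json

# Sample payload shaped like the one in the question (an assumption; a real
# response would come from curl or an HTTP client).
raw = r'{"id":"36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'

issue = json.loads(raw)
summary = issue["fields"]["summary"]  # the JSON parser decodes \" for us
print(summary)
```

The parser handles the escaped quotes, so no regex is involved at any point.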

For the inside of double quotes, you really want at least one of these facilities:
lookarounds (so you can check that what precedes and follows are quotes).
\K (so you can drop the opening quote)
the ability to examine capture groups (so you can match the whole quote, but only capture what's inside).
Typically, you would want something like this:
(?<=(?<!\\)")(?:\\"|[^"])*(?=")
In grep -P mode, which uses PCRE, you can tap into even more features, such as the possessive quantifier I'll add here:
(?<=(?<!\\)")(?:\\"|[^"])*+(?=")
Note that the [^"] can normally run across multiple lines, which you'd typically control with [^"\r\n], but grep only looks line by line anyway.
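The third facility in the list (match the whole quoted string, but capture only the inside) can be sketched in Python, whose re module exposes capture groups directly. This is an illustration of the idea rather than the grep -P pattern itself; the pattern name and the sample line are assumptions:

```python
import re

# Opening quote, then any run of "backslash + any character" or
# "anything except a quote or backslash", then the closing quote.
QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')

line = r'{"key":"FOO-1218","fields":{"summary":"the \"special\" field"}}'

# findall returns only the captured group, i.e. the text inside the quotes.
strings = QUOTED.findall(line)
print(strings)
```

Excluding the backslash from the character class keeps an escape from being split across the two alternatives.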

For this limited case, you could use something like
vnix$ sed -n 's/.*summary":"\(\([^\\"]*\|\\.\)*\)".*/\1/p' file.json
the \"special\" field is not returning what is expected
Inside the quoted string, double quotes are disallowed, except any character is allowed immediately after a literal backslash. The character class disallows backslashes, too, to prevent a backslash from "leaking" into the wrong partial match. The repeat after the character class is just an optimization to avoid needless backtracking.
Any attempt at generalizing this will quickly become quite unwieldy. The Friedl book has an example which stretches over more than a page just to illustrate the futility of this.

After serious struggling, I have figured out a method that is working for this very specific use-case. I convert the escaped quotes (\") into an even more obscure character sequence of five underscores (_), do the regex, and then convert it back:
sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
So the full test looks like this:
echo '{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}' | sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
And the output looks like this:
the "special" field is not returning what is expected

Related

How to locate a mismatched text delimiter

I'm trying to remove double quotes that appear within a string coming from a dB because it's causing a stream error in another application. I can't clean up the dB to remove these, so I need to replace the character on the fly.
I've tried using sed, ssed, and perl all without success. This regular expression is locating the problem quotes, but when I plug it into sed to replace them with a single quote my output still contains the double quote.
sed "s/(\?<\!\t|^)\"(\?\!\t|$)/'/g" test.txt
I'm on Mac, if this looks a bit odd.
The regex is valid, but when I test on a tab-delimited file containing this:
"foo" "rea"son" "text's"
My output is identical to the above. Any idea what I'm doing wrong?
Thanks
I assume you want to replace all occurrences of " that are not on a field boundary (i.e. either preceded or followed by a tab or the beginning/end of the string) with '.
This can be done using perl and the following substitution:
s/(?<=[^\t])"(?=[^\t\n])/'/g;
(With sed this is not directly possible as it does not support look-behind / look-ahead assertions.)
To use this code on the command line, it needs to be escaped for whatever shell you're using. Assuming bash or a similar sh-like shell:
perl -pe 's/(?<=[^\t])"(?=[^\t\n])/'\''/g' test.txt
Here I use '...' to quote most of the code. To get a single ' into the quoted string, I leave the quoted area ...', add an escaped single quote \', and switch back into a single-quoted string '.... That's why a literal ' turns into '\'' on the command line.
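For comparison, the same lookbehind/lookahead pair works unchanged in Python's re module. A minimal sketch, assuming tab-delimited input like the sample in the question; the function name is made up for illustration:

```python
import re

# Same lookarounds as the Perl substitution: a quote counts as "inner" only
# when a non-tab character precedes it and a non-tab, non-newline one follows.
INNER_QUOTE = re.compile(r'(?<=[^\t])"(?=[^\t\n])')

def fix_inner_quotes(line: str) -> str:
    """Replace double quotes that are not on a field boundary with '."""
    return INNER_QUOTE.sub("'", line)

sample = '"foo"\t"rea"son"\t"text\'s"'
print(fix_inner_quotes(sample))
```

Quotes at the very start or end of the line are left alone because the lookarounds each require an actual neighbouring character.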

Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname and city can vary in length.
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to get the standard xx_ from the beginning, then any code that has at least 3 characters, after which there must be another underscore, and so on.
Could anybody help?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
# an unmatched glob stays literal, so confirm the first entry really exists before printing
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
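A quick way to sanity-check the pattern is to run it against a few candidate names in Python. This is only a sketch: the file names are invented, and the lazy quantifiers are dropped because with a full match they do not change which names are accepted:

```python
import re

# Assumed convention: xx_<code>_<lastname>_<firstname>_<city>.<ext>,
# letters only, code >= 3 letters, the other parts >= 2 letters.
NAME = re.compile(r'xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)')

def matches_convention(filename: str) -> bool:
    # fullmatch anchors both ends, so "xx_..._city.doc.bak" is rejected.
    return NAME.fullmatch(filename) is not None

candidates = [
    "xx_abc_smith_john_paris.doc",
    "xx_ab_smith_john_paris.doc",    # code too short
    "xx_abc_smith_john_paris.txt",   # wrong extension
    "yy_abc_smith_john_paris.pdf",   # wrong prefix
]
print([f for f in candidates if matches_convention(f)])
```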

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If, however, in a fit of desperation I add seven parentheses into the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this?
Perhaps the following will be helpful or at least provide some direction. It will work on Perl 5.10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
Left enough of the date and the space (%20) as an anchor, then used \K to *K*eep all of that. Then captured the URI encoded text, which is later decoded and used as the substitution text.
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
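The same "replace only the part you care about" idea carries over to other regex engines. Here is a rough Python equivalent of that last substitution, assuming the href has the exact shape shown in the question; the function name is hypothetical:

```python
import re

# Three groups: everything up to %28, the digits, and the ".pdf" + closing quote.
HREF_PARENS = re.compile(r'(href="[^"]+)%28(\d+)%29(\.pdf")')

def decode_parens(html: str) -> str:
    # count=1 mirrors the Perl s/// without the /g flag.
    return HREF_PARENS.sub(r'\1(\2)\3', html, count=1)

tag = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">'
print(decode_parens(tag))
```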
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input

Regular Expression - Capture and Replace Select Sequences

Take the following file...
ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext
EFGH,5678,http://example.com/wer.exthttp://example/ljn.ext
Note that "ext" is a constant file extension throughout the file.
I am looking for an expression to turn that file into something like this...
ABCD,1234,http://example.com/mpe.ext
ABCD,1234,http://example/xyz.ext
EFGH,5678,http://example.com/wer.ext
EFGH,5678,http://example/ljn.ext
In a nutshell I need to capture everything up to the urls. Then I need to capture each URL and put them on their own line with the leading capture.
I am working with sed to do this and I cannot figure out how to make it work correctly. Any ideas?
If the number of URLs in each line is guaranteed to be two, you can use:
sed -r "s/([A-Z0-9,]{10})(.+\.ext)(.+\.ext)/\1\2\n\1\3/" < input
This does not require the first two fields to be a particular width or limit the set of (non-comma) characters between the commas. Instead, it keys on the commas themselves.
sed 's/\(\([^,]*,\)\{2\}\)\(.*\.ext\)\(http:.*\)/\1\3\n\1\4/' inputfile.txt
You could change the "2" to match any number of comma-delimited fields.
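If sed is not a hard requirement, the same comma-keyed split is easy to sketch in Python, and it copes with any number of URLs per line. It assumes, as the question states, that every URL ends in the constant ".ext" and that ".ext" does not occur earlier inside a URL; the function name and the fields parameter are made up for illustration:

```python
import re

def split_urls(line: str, fields: int = 2) -> list:
    """Emit one output line per URL, each prefixed by the leading fields."""
    parts = line.split(",", fields)           # ['ABCD', '1234', 'http://...']
    prefix = ",".join(parts[:fields]) + ","
    # Lazy match so each URL stops at the first ".ext".
    urls = re.findall(r'https?://.*?\.ext', parts[fields])
    return [prefix + url for url in urls]

for out in split_urls("ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext"):
    print(out)
```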
I have no sed available to me at the moment.
Wouldn't
sed -r 's/(....),(....),(.*\.ext)(http.*\.ext)/\1,\2,\3\n\1,\2,\4/g'
do the trick?
Edit: removed the lazy quantifier

Why doesn't this simple regex match what I think it should?

I have a data file that looks like the following example. I've added '%' in lieu of \t, the tab control character.
1234:56% Alice Worthington
alicew% Jan 1, 2010 10:20:30 AM% Closed% Development
Digg:
Reddit:
Update%% file-one.txt% 1.1% c:/foo/bar/quux
Add%% file-two.txt% 2.5.2% c:/foo/bar/quux
Remove%% file-three.txt% 3.4% c:/bar/quux
Update%% file-four.txt% 4.6.5.3% c:/zzz
... many more records of the above form
The records I'm interested in are the lines beginning with "Update", "Add", "Remove", and so on. I won't know what the lines begin with ahead of time, or how many lines precede them. I do know that they always begin with a string of letters followed by two tabs. So I wrote this regex:
generate-report-for 1234:56 | egrep "^[[:alpha:]]+\t\t.+"
But this matches zero lines. Where did I go wrong?
Edit: I get the same results whether I use '...' or "..." for the egrep expression, so I'm not sure it's a shell thing.
Apparently \t isn't a special character for egrep. You can either use grep -P to enable the Perl-compatible regex engine, or insert literal tabs with Ctrl+V Ctrl+I.
Even better, you could use the excellent ack
It looks like the shell is parsing "\t\t" before it is sent to egrep. Try "\\t\\t" or '\t\t' instead. That is 2 slashes in double quotes and one in single quotes.
The file might not be exactly what you see. Maybe there are control characters hidden. It happens, sometimes. My suggestion is that you debug this. First, reduce to the minimum regex pattern that matches, and then keep adding stuff one by one, until you find the problem:
egrep "[[:alpha:]]"
egrep "[[:alpha:]]+"
egrep "[[:alpha:]]+\t"
egrep "[[:alpha:]]+\t\t"
egrep "[[:alpha:]]+\t\t.+"
egrep "^[[:alpha:]]+\t\t.+"
There are variations on that sequence, depending on what you find out at each step. Also, the first step can really be skipped, but this is just for the sake of showing the technique.
You can use awk, which does understand \t inside a regex:
awk '/^[[:alpha:]]+\t\t/' file
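To confirm that the pattern itself is sound and that only the tab escape is the problem, it can be tried in an engine that does understand \t. A small Python sketch; the sample lines are made up to resemble the data in the question, with real tabs in place of '%':

```python
import re

# In Python's re, \t inside the pattern is understood as a tab character.
RECORD = re.compile(r'^[A-Za-z]+\t\t.+')

lines = [
    "1234:56\tAlice Worthington",
    "Digg:",
    "Update\t\tfile-one.txt\t1.1\tc:/foo/bar/quux",
    "Add\t\tfile-two.txt\t2.5.2\tc:/foo/bar/quux",
]

# Keep only the lines that start with letters followed by two tabs.
records = [ln for ln in lines if RECORD.match(ln)]
print(records)
```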