How to debug a (PCRE) regex passed to grep?

I'm trying to debug a regex passed to grep that works for others but not on my system.
This is the full command, which should return the latest terraform release version:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": "v\K.*?(?=")'
It seems to work for others, but not for me.
Adding a * quantifier after the space that follows "tag_name": (so that zero or more spaces match) makes it work for me:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": *"v\K.*?(?=")'
Here's the response from the wget without piping to grep:
{
"url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583",
"assets_url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583/assets",
"upload_url": "https://uploads.github.com/repos/hashicorp/terraform/releases/20814583/assets{?name,label}",
"html_url": "https://github.com/hashicorp/terraform/releases/tag/v0.12.12",
"id": 20814583,
"node_id": "MDc6UmVsZWFzZTIwODE0NTgz",
"tag_name": "v0.12.12",
"target_commitish": "master",
"name": "",
"draft": false,
"author": {
"login": "apparentlymart",
"id": 20180,
"node_id": "MDQ6VXNlcjIwMTgw",
"avatar_url": "https://avatars1.githubusercontent.com/u/20180?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/apparentlymart",
"html_url": "https://github.com/apparentlymart",
"followers_url": "https://api.github.com/users/apparentlymart/followers",
"following_url": "https://api.github.com/users/apparentlymart/following{/other_user}",
"gists_url": "https://api.github.com/users/apparentlymart/gists{/gist_id}",
"starred_url": "https://api.github.com/users/apparentlymart/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/apparentlymart/subscriptions",
"organizations_url": "https://api.github.com/users/apparentlymart/orgs",
"repos_url": "https://api.github.com/users/apparentlymart/repos",
"events_url": "https://api.github.com/users/apparentlymart/events{/privacy}",
"received_events_url": "https://api.github.com/users/apparentlymart/received_events",
"type": "User",
"site_admin": false
},
"prerelease": false,
"created_at": "2019-10-18T18:39:16Z",
"published_at": "2019-10-18T18:45:33Z",
"assets": [],
"tarball_url": "https://api.github.com/repos/hashicorp/terraform/tarball/v0.12.12",
"zipball_url": "https://api.github.com/repos/hashicorp/terraform/zipball/v0.12.12",
"body": "BUG FIXES:\r\n\r\n* backend/remote: Don't do local validation of whether variables are set prior to submitting, because only the remote system knows the full set of configured stored variables and environment variables that might contribute. This avoids erroneous error messages about unset required variables for remote runs when those variables will be set by stored variables in the remote workspace. ([#23122](https://github.com/hashicorp/terraform/issues/23122))"
}
And using https://regex101.com I can see that "tag_name": "v\K.*?(?=") and "tag_name": *"v\K.*?(?=") both match the version number correctly.
So there must be something wrong with my system. I'm very curious why the original one doesn't work for me and how (if possible) to debug situations like this.

I've been able to narrow it down to the following. If I execute the wget command without the piped grep and without formatting the json response:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest"
then I get JSON without any whitespace (I'll post only a part of the response):
"html_url":"https://github.com/hashicorp/terraform/releases/tag/v0.12.12","id":20814583,"node_id":"MDc6UmVsZWFzZTIwODE0NTgz","tag_name":"v0.12.12","target_commitish":"master","name":"","draft":false
So naturally the original regex "tag_name": "v\K.*?(?=") fails because there is no space after the colon.
This is clearly not related to the regex that is passed to grep or to grep itself. I don't see the point in digging into the response itself here, so the original question can be considered resolved. (Though if someone knows what could be causing this, please post a comment.)
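One way to make the extraction independent of how the response happens to be formatted is to allow optional whitespace after the colon and to stay within POSIX extended syntax, so it does not rely on -P support either. The following is only a sketch: extract_tag_version is a made-up helper name, and the sample strings stand in for the real API response.

```shell
#!/bin/sh
# extract_tag_version: hypothetical helper. Reads JSON text on stdin and prints
# the value of "tag_name" without its leading "v".
extract_tag_version() {
    # Step 1: isolate the key/value pair, tolerating any whitespace after ':'.
    grep -Eo '"tag_name":[[:space:]]*"v[^"]*"' |
    # Step 2: keep only what sits between the leading v and the closing quote.
    sed -E 's/.*"v([^"]*)"$/\1/'
}

# Compact response (no space after the colon).
printf '%s\n' '{"id":20814583,"tag_name":"v0.12.12","name":""}' | extract_tag_version
# Pretty-printed response (space after the colon).
printf '%s\n' '  "tag_name": "v0.12.12",' | extract_tag_version
```

Both calls should print the same version string, so the helper behaves the same whether or not the server pretty-prints its JSON.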

It is quite possible that your regex engine does not understand \K; there are many regex dialects.
Sticking to POSIX extended regular expression (ERE) syntax, which is what egrep uses, usually yields good results across engines.
$ curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"'
"tag_name": "v0.12.12"
Now if you only want the version number, you need to extract the numbers in a second step (a lookaround such as ?! to exclude part of the pattern might not work in every engine).
curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"' | egrep -oe '([0-9]+\.?)+'
0.12.12

Related

Figuring out what this sed command does

I'm having a hard time trying to work out what the following command is doing.
I'm trying to monitor different services on Linux using systemctl. I need JSON output listing all the services that are running on the machine.
The problem is that with this command the status output is "enabled enabled". I only need the first value (the state). I have tried to delete the second one (the vendor preset) but can't get it working, basically because I don't understand the command. I know sed is replacing some strings, but with so many special characters it isn't readable to me.
echo "{\"data\":[$(systemctl list-unit-files --type=service|grep \.service|grep -v "#"|sed -E -e "s/\.service\s+/\",\"{#STATUS}\":\"/;s/(\s+)?$/\"},/;s/^/{\"{#NAME}\":\"/;$ s/.$//")]}"
Result:
"data": [{
"{#NAME}": "accounts-daemon",
"{#STATUS}": "enabled enabled"
},
{
"{#NAME}": "acpid",
"{#STATUS}": "disabled enabled"
}, {
"{#NAME}": "zabbix-agent",
"{#STATUS}": "enabled enabled"
}
]
}
Expected result:
"data": [{
"{#NAME}": "accounts-daemon",
"{#STATUS}": "enabled"
},
{
"{#NAME}": "acpid",
"{#STATUS}": "disabled"
}, {
"{#NAME}": "zabbix-agent",
"{#STATUS}": "enabled"
}
]
}
Command without "sed": systemctl list-unit-files --type=service
UNIT FILE                  STATE      VENDOR PRESET
accounts-daemon.service    enabled    enabled
acpid.service              disabled   enabled
zabbix-agent               static     enabled
The relevant substitution in your code is
s/(\s+)?$/
Try replacing it with one that deletes everything starting from the first separator (\s).
That is
s/\s.*$/
The modified command becomes
echo "{\"data\":[$(systemctl list-unit-files --type=service|grep \.service|grep -v "#"|sed -E -e "s/\.service\s+/\",\"{#STATUS}\":\"/;s/\s.*$/\"},/;s/^/{\"{#NAME}\":\"/;$ s/.$//")]}"
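To see what each substitution contributes, the pipeline can be tried on a few sample rows instead of live systemctl output. This is a sketch under some assumptions: units_to_json is a made-up helper name, [[:space:]] is used in place of \s so that it does not depend on GNU sed, and the second expression is written as ([[:space:]].*)? so that a row with only a STATE column still gets its closing "}," (that variation is not part of the answer above).

```shell
#!/bin/sh
# units_to_json: hypothetical helper. Reads "systemctl list-unit-files"-style
# rows on stdin and prints the discovery JSON with only the STATE column.
units_to_json() {
    rows=$(grep '\.service' | sed -E \
        -e 's/\.service[[:space:]]+/","{#STATUS}":"/' \
        -e 's/([[:space:]].*)?$/"},/' \
        -e 's/^/{"{#NAME}":"/' \
        -e '$ s/.$//')
    # 1st -e: turn ".service" plus padding into the start of the status field.
    # 2nd -e: drop everything from the first remaining whitespace, close the object.
    # 3rd -e: open the object and the name field.
    # 4th -e: on the last row only, remove the trailing comma.
    printf '{"data":[%s]}\n' "$rows"
}

# Sample rows standing in for real systemctl output.
units_to_json <<'EOF'
UNIT FILE                 STATE     VENDOR PRESET
accounts-daemon.service   enabled   enabled
acpid.service             disabled  enabled
EOF
```

The header row is dropped because it does not contain ".service"; real output would be piped in with systemctl list-unit-files --type=service | units_to_json.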

Regex Expression with the JQ Tool

I have a JSON file on which I am using the jq tool to get some lines out of it. However, I now need to get some information out of one of these lines using a regex. I'm stuck on two parts. The first is that I can't figure out the regular expression to get the part I want, and the second is that I do not know the correct syntax to apply the regex along with the jq tool. I have tried the following syntax and get the error "unterminated regexp":
jq '.msg.stdout_lines[2]' /tmp/vaultKeys.json | awk '{gsub(/\:(.*[\a-zA-Z0-9]))}1'
My json file is as follows:
{
"msg": {
"changed": true,
"cmd": [
"vault",
"operator",
"init"
],
"delta": "0:00:00.568974",
"end": "2018-11-29 15:42:00.243019",
"failed": false,
"rc": 0,
"start": "2018-11-29 15:41:59.674045",
"stderr": "",
"stderr_lines": [],
"stdout": "Unseal Key 1: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS\nUnseal Key 2: NY+CLIbgMJIv+e81FuB1OpV0m7rPuqZbIuYT142MrQLl\nUnseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6\nUnseal Key 4: xDwfI+kFHFRSzq2JyxSGArQsGjCrFiNbkGCP897Zfbuz\nUnseal Key 5: +O8/tTmDNSzaUBMT8QP+2xbvu5uulypf3+xmWzY8fSD3\n\nInitial Root Token: 6kO8ijZzyhcG5Nup5QUca0u3\n\nVault initialized with 5 key shares and a key threshold of 3. Please securely\ndistribute the key shares printed above. When the Vault is re-sealed,\nrestarted, or stopped, you must supply at least 3 of these keys to unseal it\nbefore it can start servicing requests.\n\nVault does not store the generated master key. Without at least 3 key to\nreconstruct the master key, Vault will remain permanently sealed!\n\nIt is possible to generate new unseal keys, provided you have a quorum of\nexisting unseal keys shares. See \"vault operator rekey\" for more information.",
"stdout_lines": [
"Unseal Key 1: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS",
"Unseal Key 2: NY+CLIbgMJIv+e81FuB1OpV0m7rPuqZbIuYT142MrQLl",
"Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6",
"Unseal Key 4: xDwfI+kFHFRSzq2JyxSGArQsGjCrFiNbkGCP897Zfbuz",
"Unseal Key 5: +O8/tTmDNSzaUBMT8QP+2xbvu5uulypf3+xmWzY8fSD3",
"",
"Initial Root Token: 6kO8ijZzyhcG5Nup5QUca0u3",
"",
"Vault initialized with 5 key shares and a key threshold of 3. Please securely",
"distribute the key shares printed above. When the Vault is re-sealed,",
"restarted, or stopped, you must supply at least 3 of these keys to unseal it",
"before it can start servicing requests.",
"",
"Vault does not store the generated master key. Without at least 3 key to",
"reconstruct the master key, Vault will remain permanently sealed!",
"",
"It is possible to generate new unseal keys, provided you have a quorum of",
"existing unseal keys shares. See \"vault operator rekey\" for more information."
]
}
}
Out of the line
"Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6"
I would like just
HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6
Currently, using my regex without the jq part, I only get
: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS
So to summarise I need help with
a) getting a correct regular expression and
b) the correct syntax to use the expression with the JQ Tool.
Thanks
For this particular case you can use split instead of regex.
jq -r '.msg.stdout_lines[2]|split(" ")[-1]' file
Do you have GNU grep?
jq -r '.msg.stdout_lines[2]' /tmp/vaultKeys.json | grep -Po '(?<=: ).+'
In the interests of one-stop shopping, you could for example use this invocation:
jq -r '.msg.stdout_lines[2]
| capture(": (?<s>.*)").s'
Of course there are many other possibilities, depending on your precise requirements.
There are many ways. Besides the obvious | grep -Po '(?<=: ).+\b', you could even use substr with awk on the raw (-r) output if the length of the prefix is fixed ("Unseal Key 3: " is 14 characters, so the key starts at position 15):
jq -r .. | awk '{print substr($0, 15)}'
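Once jq -r has produced the raw line, the label can also be stripped by the shell itself, without a regex. A minimal sketch, assuming a POSIX shell; value_after_colon is a made-up helper name:

```shell
#!/bin/sh
# value_after_colon: hypothetical helper. Prints everything that follows the
# first ": " in its argument.
value_after_colon() {
    line=$1
    # ${line#*: } removes the shortest prefix that ends in ": ".
    printf '%s\n' "${line#*: }"
}

value_after_colon 'Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6'
```

With the real file it could be called as value_after_colon "$(jq -r '.msg.stdout_lines[2]' /tmp/vaultKeys.json)". Because the prefix is matched by pattern rather than by length, the same helper also works for the "Initial Root Token: ..." line.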

Git Bash regex to match latest tag

My VCS has these tags
0.0.3.156-alpha+2
0.0.3.154
0.0.3.153
build-.139
build-.140
build-.142
build-0.0.1.28
build-0.0.1.29
build-0.0.1.30
build-0.0.1.32
I want to use git describe --match "<regex>" to get the latest tag of the form number.number.number.number (so it's 0.0.3.154 in this case).
I have tried git describe --match "[0-9]*.[0-9]*.[0-9]*.[0-9]*$" but it doesn't return anything, and neither do these patterns:
"[0-9]*.[0-9]*.[0-9]*.[0-9]+"
"[0-9]*.[0-9]*.[0-9]*.[0-9]{1,}"
I need to get the latest tag in order to bump the version for the next release, so I'm thinking of doing this automatically. Please let me know if I missed anything.
Thanks
UPDATE:
In my build.gradle file I have a function to get the tag like this (following @Marc's reply):
version getVersionFromTag()
def getVersionFromTag() {
def stdout = new ByteArrayOutputStream()
exec {
commandLine 'git', 'tag', '|' , 'grep', '^\([0-9]\+\.\?\)\+$', '|', 'sort' , '-nr', '|', 'head', '-1'
standardOutput = stdout
}
return stdout.toString().trim()
}
Here it gives the error Unexpected Char '\' in the regex above. Hence I removed the backslashes so that it becomes '^([0-9]+.?)+$'; then it runs fine, but my final artifact does not have the version appended to its name (i.e. helloword.jar instead of helloword-0.0.3.154.jar).
=> My question is: how should I put @Marc's suggested command into the gradle function correctly?
For testing I've put the output of your git describe in a file. This will do:
cat file | grep '^\([0-9]\+\.\?\)\+$' | sort -nr | head -1
0.0.3.154
Suppose you've created some irregularly formatted tags (like your build- tags) and you want to take those into account as well when finding the highest tag:
sed -E 's/^[^0-9.]*//' | grep '^\([0-9]\+\.\?\)\+$' | sort -nr | head -1
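A plain sort -n can misorder tags once a component reaches two digits (for example 0.0.10.1 versus 0.0.9.2), so a variant that compares the four components one by one may be safer. This is a sketch only; latest_plain_tag is a made-up helper name, and the tag list is the one from the question. As far as I know git describe --match expects a glob pattern rather than a regex, which is why the filtering is done with grep here.

```shell
#!/bin/sh
# latest_plain_tag: hypothetical helper. Reads tag names on stdin and prints
# the highest one of the form number.number.number.number.
latest_plain_tag() {
    # Keep only tags made of exactly four dot-separated numbers.
    grep -E '^[0-9]+(\.[0-9]+){3}$' |
    # Compare each of the four fields numerically, lowest first.
    sort -t. -k1,1n -k2,2n -k3,3n -k4,4n |
    # The last line is the highest version.
    tail -n 1
}

printf '%s\n' 0.0.3.156-alpha+2 0.0.3.154 0.0.3.153 build-.139 build-0.0.1.32 |
    latest_plain_tag
```

In practice the input would come from git tag. Tools that start a process without a shell, such as the exec block in the question, pass '|' on as a literal argument, so there the whole pipeline would probably have to be handed to a shell, for example through sh -c.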

Is there any easy way / API to find out the number of pipelines on a gocd server?

Sorry for the brief question, but just wondering if there's an API to find out the number of pipelines on a GoCD server.
The Pipeline Groups API will give you what you need after some JSON parsing.
$ curl 'https://ci.example.com/go/api/config/pipeline_groups' \
-u 'username:password'
Returns:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
[
{
"pipelines": [
{
"stages": [
{
"name": "up42_stage"
}
],
"name": "up42",
"materials": [
{
"description": "URL: https://github.com/gocd/gocd, Branch: master",
"fingerprint": "2d05446cd52a998fe3afd840fc2c46b7c7e421051f0209c7f619c95bedc28b88",
"type": "Git"
}
],
"label": "${COUNT}"
}
],
"name": "first"
}
]
You can grab the config.xml file, either from the config repo or via HTTP, and parse it.
As an alternative, you can just get the cctray file from your server at http://yourgoserver/go/cctray.xml and parse it.
It contains information about all the pipelines (including their stages).
I would recommend using yagocd:
from yagocd import Yagocd
go = Yagocd(server='https://build.gocd.io')
# login as guest
go._session.get('https://build.gocd.io/go/plugin/interact/gocd.guest.user.auth.plugin/index')
print(len(list(go.pipelines)))
Yes, of course. You can get the desired output in different ways. The first, easy way is to get the number of pipelines and other statistical information from the GoCD support URL (https://example.com/go/api/support), which requires admin privileges.
If the user does not have admin privileges, use the GoCD pipeline_groups API. The command below should give you the exact result with jq (a JSON processor):
$ curl 'https://example.com/go/api/config/pipeline_groups' -u 'username:password' | jq -r '.[] | .pipelines[].name' | wc -l
NOTE: Even so, only Go administrator users can get the actual number of pipelines.
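The counting can also be left entirely to jq, which avoids the extra wc -l. A small sketch, assuming jq is installed and the response has the shape shown earlier (an array of groups, each with a pipelines array); count_pipelines is a made-up helper name:

```shell
#!/bin/sh
# count_pipelines: hypothetical helper. Reads the pipeline_groups JSON on stdin
# and prints the total number of pipelines across all groups. Requires jq.
count_pipelines() {
    # Collect every pipeline of every group into one array, then count it.
    jq '[.[].pipelines[]] | length'
}

# Trimmed-down sample in the shape of the pipeline_groups response.
printf '%s\n' '[{"name":"first","pipelines":[{"name":"up42"}]}]' | count_pipelines
```

Against a real server the input would come from the curl call above, piped into count_pipelines.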

regular expression to extract data from html page

I want to extract all anchor tags from html pages. I am using this in Linux.
lynx --source http://www.imdb.com | egrep "<a[^>]*>"
but that is not working as expected, since the result contains unwanted parts:
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>
I want just
<a href >...</a>
Is there any good way?
If you have a -P option in your grep so that it accepts PCRE patterns, you should be able to use better regexes. Sometimes a minimal quantifier like *? helps. Also, you’re getting the whole input line, not just the match itself; if you have a -o option to grep, it will list only the part that matches.
grep -Po '<a[^<>]*>'
If your grep doesn’t have those options, try
perl -00 -nle 'print $1 while /(<a[^<>]*>)/gi'
Which now crosses line boundaries.
To do a real parse of HTML requires regexes substantially more complex than you are apt to wish to enter on the command line. Here’s one example, and here’s another. Those may not convince you to try a non-regex approach, but they should at least show you how much harder it is in the general case than in specific ones.
This answer shows why all things are possible, but not all are expedient.
Why can't you use options like --dump?
lynx --dump --listonly http://www.imdb.com
Try grep -Eo:
$ echo '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' | grep -Eo '<a[^>]*>'
<a class="amazon-affiliate-site-name" href="http://www.fabric.com">
But please read the answer that MAK linked to.
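If only the link targets are needed and the markup is as simple as in the sample line, the same grep -Eo idea can be narrowed to the href value. This is a rough sketch, not a substitute for a real HTML parser: extract_hrefs is a made-up helper name, and it only handles double-quoted href attributes that sit on the same line as their <a tag.

```shell
#!/bin/sh
# extract_hrefs: hypothetical helper. Reads HTML on stdin and prints the value
# of each double-quoted href found inside an <a ...> tag, one per line.
extract_hrefs() {
    # Match from "<a " up to the closing quote of the href value.
    grep -Eo '<a[[:space:]][^>]*href="[^"]*"' |
    # Strip everything except the quoted value.
    sed -E 's/.*href="([^"]*)"$/\1/'
}

printf '%s\n' '<a class="amazon-affiliate-site-name" href="http://www.fabric.com">Fabric</a><br>' |
    extract_hrefs
```

Single-quoted or unquoted attributes, uppercase tag names, and tags that span several lines are all missed, which is the kind of limitation the linked answers warn about.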
Here are some examples of why you should not use regex to parse HTML.
To extract values of 'href' attribute of anchor tags, run:
$ python -c'import sys, lxml.html as h
> root = h.parse(sys.argv[1]).getroot()
> root.make_links_absolute(base_url=sys.argv[1])
> print("\n".join(root.xpath("//a/@href")))' http://imdb.com | sort -u
Install lxml module if needed: $ sudo apt-get install python-lxml.
Output
http://askville.amazon.com
http://idfilm.blogspot.com/2011/02/another-class.html
http://imdb.com
http://imdb.com/
http://imdb.com/a2z
http://imdb.com/a2z/
http://imdb.com/advertising/
http://imdb.com/boards/
http://imdb.com/chart/
http://imdb.com/chart/top
http://imdb.com/czone/
http://imdb.com/features/hdgallery
http://imdb.com/features/oscars/2011/
http://imdb.com/features/sundance/2011/
http://imdb.com/features/video/
http://imdb.com/features/video/browse/
http://imdb.com/features/video/trailers/
http://imdb.com/features/video/tv/
http://imdb.com/features/yearinreview/2010/
http://imdb.com/genre
http://imdb.com/help/
http://imdb.com/helpdesk/contact
http://imdb.com/help/show_article?conditions
http://imdb.com/help/show_article?rssavailable
http://imdb.com/jobs
http://imdb.com/lists
http://imdb.com/media/index/rg2392693248
http://imdb.com/media/rm3467688448/rg2392693248
http://imdb.com/media/rm3484465664/rg2392693248
http://imdb.com/media/rm3719346688/rg2392693248
http://imdb.com/mymovies/list
http://imdb.com/name/nm0000207/
http://imdb.com/name/nm0000234/
http://imdb.com/name/nm0000631/
http://imdb.com/name/nm0000982/
http://imdb.com/name/nm0001392/
http://imdb.com/name/nm0004716/
http://imdb.com/name/nm0531546/
http://imdb.com/name/nm0626362/
http://imdb.com/name/nm0742146/
http://imdb.com/name/nm0817980/
http://imdb.com/name/nm2059117/
http://imdb.com/news/
http://imdb.com/news/celebrity
http://imdb.com/news/movie
http://imdb.com/news/ni7650335/
http://imdb.com/news/ni7653135/
http://imdb.com/news/ni7654375/
http://imdb.com/news/ni7654598/
http://imdb.com/news/ni7654810/
http://imdb.com/news/ni7655320/
http://imdb.com/news/ni7656816/
http://imdb.com/news/ni7660987/
http://imdb.com/news/ni7662397/
http://imdb.com/news/ni7665028/
http://imdb.com/news/ni7668639/
http://imdb.com/news/ni7669396/
http://imdb.com/news/ni7676733/
http://imdb.com/news/ni7677253/
http://imdb.com/news/ni7677366/
http://imdb.com/news/ni7677639/
http://imdb.com/news/ni7677944/
http://imdb.com/news/ni7678014/
http://imdb.com/news/ni7678103/
http://imdb.com/news/ni7678225/
http://imdb.com/news/ns0000003/
http://imdb.com/news/ns0000018/
http://imdb.com/news/ns0000023/
http://imdb.com/news/ns0000031/
http://imdb.com/news/ns0000128/
http://imdb.com/news/ns0000136/
http://imdb.com/news/ns0000141/
http://imdb.com/news/ns0000195/
http://imdb.com/news/ns0000236/
http://imdb.com/news/ns0000344/
http://imdb.com/news/ns0000345/
http://imdb.com/news/ns0004913/
http://imdb.com/news/top
http://imdb.com/news/tv
http://imdb.com/nowplaying/
http://imdb.com/photo_galleries/new_photos/2010/
http://imdb.com/poll
http://imdb.com/privacy
http://imdb.com/register/login
http://imdb.com/register/?why=footer
http://imdb.com/register/?why=mymovies_footer
http://imdb.com/register/?why=personalize
http://imdb.com/rg/NAV_TWITTER/NAV_EXTRA/http://www.twitter.com/imdb
http://imdb.com/ri/TRAILERS_HPPIRATESVID/TOP_BUCKET/102785/video/imdb/vi161323033/
http://imdb.com/search
http://imdb.com/search/
http://imdb.com/search/name?birth_monthday=02-12
http://imdb.com/search/title?sort=num_votes,desc&title_type=feature&my_ratings=exclude
http://imdb.com/sections/dvd/
http://imdb.com/sections/horror/
http://imdb.com/sections/indie/
http://imdb.com/sections/tv/
http://imdb.com/showtimes/
http://imdb.com/tiger_redirect?FT_LIC&licensing/
http://imdb.com/title/tt0078748/
http://imdb.com/title/tt0279600/
http://imdb.com/title/tt0377981/
http://imdb.com/title/tt0881320/
http://imdb.com/title/tt0990407/
http://imdb.com/title/tt1034389/
http://imdb.com/title/tt1265990/
http://imdb.com/title/tt1401152/
http://imdb.com/title/tt1411238/
http://imdb.com/title/tt1411238/trivia
http://imdb.com/title/tt1446714/
http://imdb.com/title/tt1452628/
http://imdb.com/title/tt1464174/
http://imdb.com/title/tt1464540/
http://imdb.com/title/tt1477837/
http://imdb.com/title/tt1502404/
http://imdb.com/title/tt1504320/
http://imdb.com/title/tt1563069/
http://imdb.com/title/tt1564367/
http://imdb.com/title/tt1702443/
http://imdb.com/tvgrid/
http://m.imdb.com
http://pro.imdb.com/r/IMDbTabNB/
http://resume.imdb.com
http://resume.imdb.com/
https://secure.imdb.com/register/subscribe?c=a394d4442664f6f6475627
http://twitter.com/imdb
http://wireless.amazon.com
http://www.3news.co.nz/The-Hobbit-media-conference--full-video/tabid/312/articleID/198020/Default.aspx
http://www.amazon.com/exec/obidos/redirect-home/internetmoviedat
http://www.audible.com
http://www.boxofficemojo.com
http://www.dpreview.com
http://www.endless.com
http://www.fabric.com
http://www.imdb.com/board/bd0000089/threads/
http://www.imdb.com/licensing/
http://www.imdb.com/media/rm1037220352/rg261921280
http://www.imdb.com/media/rm2695346688/tt1449283
http://www.imdb.com/media/rm3987585536/tt1092026
http://www.imdb.com/name/nm0000092/
http://www.imdb.com/photo_galleries/new_photos/2010/index
http://www.imdb.com/search/title?sort=num_votes,desc&title_type=tv_series&my_ratings=exclude
http://www.imdb.com/sections/indie/
http://www.imdb.com/title/tt0079470/
http://www.imdb.com/title/tt0079470/quotes?qt0471997
http://www.imdb.com/title/tt1542852/
http://www.imdb.com/title/tt1606392/
http://www.imdb.de
http://www.imdb.es
http://www.imdb.fr
http://www.imdb.it
http://www.imdb.pt
http://www.movieline.com/2011/02/watch-jon-hamm-talk-butthole-surfers-paul-rudd-impersonate-jay-leno-at-book-reading-1.php
http://www.movingimagesource.us/articles/un-tv-20110210
http://www.npr.org/blogs/monkeysee/2011/02/10/133629395/james-franco-recites-byron-to-the-worlds-luckiest-middle-school-journalist
http://www.nytimes.com/2011/02/06/books/review/Brubach-t.html
http://www.shopbop.com/welcome
http://www.smallparts.com
http://www.twinpeaks20.com/details/
http://www.twitter.com/imdb
http://www.vanityfair.com/hollywood/features/2011/03/lauren-bacall-201103
http://www.warehousedeals.com
http://www.withoutabox.com
http://www.zappos.com
To extract values of 'href' attribute of anchor tags you may also use xmlstarlet after converting HTML to XHTML using HTML Tidy (Mac OS X version released on 25 March 2009):
curl -s www.imdb.com |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -m "//x:a/@href" -v '.' -n |
grep '^[[:space:]]*http://' | sort -u | nl
On Mac OS X you may also use the command line tool linkscraper:
linkscraper http://www.imdb.com
see: http://codesnippets.joyent.com/posts/show/10772