Regex Expression with the JQ Tool

I have a JSON file which I am using the jq tool on to get some lines out of it. However, I now need to get some information out of one of those lines using a regex. I am stuck on two parts. The first is that I can't figure out the regular expression to get what I want, and the second is that I don't know the correct syntax to apply the regex along with jq. I have tried the following syntax and get the error "unterminated regexp":
jq '.msg.stdout_lines[2]' /tmp/vaultKeys.json | awk '{gsub(/\:(.*[\a-zA-Z0-9]))}1'
My json file is as follows:
{
  "msg": {
    "changed": true,
    "cmd": [
      "vault",
      "operator",
      "init"
    ],
    "delta": "0:00:00.568974",
    "end": "2018-11-29 15:42:00.243019",
    "failed": false,
    "rc": 0,
    "start": "2018-11-29 15:41:59.674045",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "Unseal Key 1: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS\nUnseal Key 2: NY+CLIbgMJIv+e81FuB1OpV0m7rPuqZbIuYT142MrQLl\nUnseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6\nUnseal Key 4: xDwfI+kFHFRSzq2JyxSGArQsGjCrFiNbkGCP897Zfbuz\nUnseal Key 5: +O8/tTmDNSzaUBMT8QP+2xbvu5uulypf3+xmWzY8fSD3\n\nInitial Root Token: 6kO8ijZzyhcG5Nup5QUca0u3\n\nVault initialized with 5 key shares and a key threshold of 3. Please securely\ndistribute the key shares printed above. When the Vault is re-sealed,\nrestarted, or stopped, you must supply at least 3 of these keys to unseal it\nbefore it can start servicing requests.\n\nVault does not store the generated master key. Without at least 3 key to\nreconstruct the master key, Vault will remain permanently sealed!\n\nIt is possible to generate new unseal keys, provided you have a quorum of\nexisting unseal keys shares. See \"vault operator rekey\" for more information.",
    "stdout_lines": [
      "Unseal Key 1: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS",
      "Unseal Key 2: NY+CLIbgMJIv+e81FuB1OpV0m7rPuqZbIuYT142MrQLl",
      "Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6",
      "Unseal Key 4: xDwfI+kFHFRSzq2JyxSGArQsGjCrFiNbkGCP897Zfbuz",
      "Unseal Key 5: +O8/tTmDNSzaUBMT8QP+2xbvu5uulypf3+xmWzY8fSD3",
      "",
      "Initial Root Token: 6kO8ijZzyhcG5Nup5QUca0u3",
      "",
      "Vault initialized with 5 key shares and a key threshold of 3. Please securely",
      "distribute the key shares printed above. When the Vault is re-sealed,",
      "restarted, or stopped, you must supply at least 3 of these keys to unseal it",
      "before it can start servicing requests.",
      "",
      "Vault does not store the generated master key. Without at least 3 key to",
      "reconstruct the master key, Vault will remain permanently sealed!",
      "",
      "It is possible to generate new unseal keys, provided you have a quorum of",
      "existing unseal keys shares. See \"vault operator rekey\" for more information."
    ]
  }
}
Out of the line
"Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6"
I would like just
HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6
Currently my regex gives me the following, and only when I use it without the jq syntax:
: ZA0Gas2GrHtdMlet1g63N6gvEPYf5mzZEfjPhMDRyAeS
So to summarise I need help with
a) getting a correct regular expression and
b) the correct syntax to use the expression with the JQ Tool.
Thanks

For this particular case you can use split instead of regex.
jq -r '.msg.stdout_lines[2]|split(" ")[-1]' file

Do you have GNU grep?
jq -r '.msg.stdout_lines[2]' /tmp/vaultKeys.json | grep -Po '(?<=: ).+'

In the interests of one-stop shopping, you could for example use this invocation:
jq -r '.msg.stdout_lines[2]
| capture(": (?<s>.*)").s'
Of course there are many other possibilities, depending on your precise requirements.
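If your jq build lacks regex support (capture and friends require jq compiled with Oniguruma), you can hand the line to sed instead. A minimal sketch on the sample line, stripping everything up to and including the first colon-and-space:

```shell
# Stand-in for: jq -r '.msg.stdout_lines[2]' /tmp/vaultKeys.json | sed 's/^[^:]*: //'
printf '%s\n' 'Unseal Key 3: HNWmsrXBsSV9JFuGfqpd+GvPYQzHEsLFlxKBfEyBhCZ6' |
  sed 's/^[^:]*: //'
```

This works for both the "Unseal Key N:" lines and the "Initial Root Token:" line, since all of them put the value after the first colon.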

There are many ways; besides the obvious | grep -Po '(?<=: ).+\b' you could even use substr with awk, since the "Unseal Key N: " prefix has a fixed length of 14 characters:
jq -r '.msg.stdout_lines[2]' /tmp/vaultKeys.json | awk '{print substr($0, 15)}'

Related

How to determine if a string is located in AWS S3 CSV file

I have a CSV file in AWS S3.
The file is very large: 2.5 gigabytes.
The file has a single column of strings, over 120 million of them:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine if the string xyz.com is located in the file?
I only need to know if the string is there or not, I don't need to return the file.
Also, it would be great if I could pass multiple strings to search for and return only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com']
Will return => ['xyz.com','ggg.com']
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")
res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    Expression="SELECT * FROM S3Object s WHERE s._1 IN ('xyz.com', 'ggg.com')",  # _1 refers to the first column
)
See this AWS blog post for an example with output parsing.
If you use the aws s3 cp command you can send the output to stdout:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The dash (-) sends the output to stdout.
Here are two examples of grep checking for multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
To learn more about grep, please look at the manual: GNU Grep 3.7
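For the multi-string case in the question, grep can also read the whole pattern list from a file with -F (fixed strings) and -x (whole-line match), which scales better than chaining -e flags. A sketch, where the two process substitutions stand in for a patterns file and the downloaded CSV:

```shell
# With real data this would be:
#   aws s3 cp s3://yourbucket/foo.csv - | grep -Fxf patterns.txt
grep -Fxf <(printf '%s\n' xyz.com fds.com ggg.com) \
          <(printf '%s\n' apc.com xyz.com ggg.com dddd.com)
```

This prints exactly the requested behavior for Query => ['xyz.com','fds.com','ggg.com']: only xyz.com and ggg.com, the strings actually present in the file.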

Duplicate AWS S3 files differing only by a carriage return at the end of the Object URL

I have an S3 bucket with nearly duplicate files.
If I run the AWS CLI, I get the same file paths, differing only by a few bytes:
2021-09-23 16:36:36 134626 Original/53866358.xml
2021-09-23 16:36:36 134675 Original/53866358.xml
If I look at the individual object pages, both have the same key:
The only difference is that one has %0D (ASCII carriage return) at the end of its Object URL. Presumably, this is the larger file. My question is: How can I get a unique reference to each of these using the AWS S3 CLI? I'd like to delete the ones with the carriage-return at the end.
This is an interesting problem. Just to lay the groundwork for how my solution helps, I recreated the issue with a simple Python script:
import boto3
s3 = boto3.client('s3')
s3.put_object(Bucket='example-bucket', Key='temp/key', Body=b'normal key')
s3.put_object(Bucket='example-bucket', Key='temp/key\r', Body=b'this is not the normal key')
From there, you can see the issue as you describe:
$ aws s3 ls s3://example-bucket/temp/
2021-12-03 20:14:45 10 key
2021-12-03 20:14:45 26 key
You can list the objects with more details using the cli (some details have been removed from the output here):
$ aws s3api list-objects --bucket example-bucket --prefix temp/
{
  "Contents": [
    {
      "Key": "temp/key",
      "Size": 10
    },
    {
      "Key": "temp/key\r",
      "Size": 26
    }
  ]
}
To remove the object with the CR in the key name, a script would be easiest, but you can delete it with the CLI, just with a somewhat awkward syntax:
## If you're using Unix or Mac
$ aws s3api delete-object --cli-input-json '{"Bucket": "example-bucket", "Key": "temp/key\r"}'
## If you're using Windows:
C:> aws s3api delete-object --cli-input-json "{""Bucket"": ""example-bucket"", ""Key"": ""temp/key\r""}"
Note the syntax required to quote the JSON object, and to escape the quotes on Windows.
From there, it's simple to verify this worked as expected:
$ aws s3 ls s3://example-bucket/temp/
2021-12-03 20:14:45 10 key
$ aws s3 cp s3://example-bucket/temp/key final_check.txt
download: s3://example-bucket/temp/key to ./final_check.txt
$ type final_check.txt
normal key
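To spot which keys carry the trailing carriage return before deleting them, you can filter the key listing for a CR at end of line. A minimal sketch on sample input; the printf stands in for the output of an aws s3api list-objects key listing:

```shell
# Sample key list standing in for the real listing output
printf 'temp/key\ntemp/key\r\n' |
  grep $'\r$' |   # keep only keys that end with a carriage return
  cat -v          # render the CR visibly as ^M
```

The cat -v step is just for display; in a cleanup script you would feed the matching keys to delete-object instead.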

Grepping two patterns from event logs

I am seeking to extract timestamps and ip addresses out of log entries containing a varying amount of information. The basic structure of a log entry is:
<timestamp>, <token_1>, <token_2>, ... ,<token_n>, <ip_address> <token_n+2>, <token_n+3>, ... ,<token_n+m>,-
The number of tokens n between the timestamp and ip address varies considerably.
I have been studying regular expressions and am able to grep timestamps as follows:
grep -o "[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}"
And ip addresses:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
But I have not been able to grep both patterns out of log entries which contain both. Every log entry contains a timestamp, but not every entry contains an ip address.
Input:
2021-04-02T09:06:44.248878+00:00,Creation Time,EVT,WinEVTX,[4624 / 0x1210] Source Name: Microsoft-Windows-Security-Auditing Message string: An account was successfully logged on.\n\nSubject:\n\tSecurity ID:\t\tS-1-5-18\n\tAccount Name:\t\tREDACTED$\n\tAccount Domain:\t\tREDACTED\n\tLogon ID:\t\tREDACTED\n\nLogon Type:\t\t\t10\n\nNew Logon:\n\tSecurity ID:\t\tREDACTED\n\tAccount Name:\t\tREDACTED\n\tAccount Domain:\t\tREDACTED\n\tLogon ID:\t\REDACTED\n\tLogon GUID:\t\tREDACTED\n\nProcess Information:\n\tProcess ID:\t\tREDACTED\n\tProcess Name:\t\tC:\Windows\System32\winlogon.exe\n\nNetwork Information:\n\tWorkstation:\tREDACTED\n\tSource Network Address:\t255.255.255.255\n\tSource Port:\t\t0\n\nDetailed Authentication Information:\n\tLogon Process:\t\tUser32 \n\tAuthentication Package:\tNegotiate\n\tTransited Services:\t-\n\tPackage Name (NTLM only):\t-\n\tKey Length:\t\t0\n\nThis event is generated when a logon session is created. It is generated on the computer that was accessed.\n\nThe subject fields indicate the account on the local system which requested the logon. This is most commonly a service such as the Server service or a local process such as Winlogon.exe or Services.exe.\n\nThe logon type field indicates the kind of logon that occurred. The most common types are 2 (interactive) and 3 (network).\n\nThe New Logon fields indicate the account for whom the new logon was created i.e. the account that was logged on.\n\nThe network fields indicate where a remote logon request originated. 
Workstation name is not always available and may be left blank in some cases.\n\nThe authentication information fields provide detailed information about this specific logon request.\n\t- Logon GUID is a unique identifier that can be used to correlate this event with a KDC event.\n\t- Transited services indicate which intermediate services have participated in this logon request.\n\t- Package name indicates which sub-protocol was used among the NTLM protocols.\n\t- Key length indicates the length of the generated session key. This will be 0 if no session key was requested. Strings: ['S-1-5-18' 'DEVICE_NAME$' 'NETWORK' 'REDACTED' 'REDACTED' 'USERNAME' 'WORKSTATION' 'REDACTED' '10' 'User32 ' 'Negotiate' 'REDACTED' '{REDACTED}' '-' '-' '0' 'REDACTED' 'C:\\Windows\\System32\\winlogon.exe' '255.255.255.255' '0' '%%1833'] Computer Name: REDACTED Record Number: 1068355 Event Level: 0,winevtx,OS:REDACTED,-
Desired Output:
2021-04-02T09:06:44, 255.255.255.255
$ sed -En 's/.*([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}).*[^0-9]([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*/\1, \2/p' file
2021-04-02T09:06:44, 255.255.255.255
Your regexps can be reduced by removing some of the explicit repetition though:
$ sed -En 's/.*([0-9]{4}(-[0-9]{2}){2}T([0-9]{2}:){2}[0-9]{2}).*[^0-9](([0-9]{1,3}\.){3}[0-9]{1,3}).*/\1, \4/p' file
2021-04-02T09:06:44, 255.255.255.255
It could be simpler still if all of the lines in your log file start with a timestamp:
$ sed -En 's/([^,.]+).*[^0-9](([0-9]{1,3}\.){3}[0-9]{1,3}).*/\1, \2/p' file
2021-04-02T09:06:44, 255.255.255.255
If you are looking for lines that contain both patterns, it may be easiest to do two separate searches.
If you're searching your log file for lines that contain both "dog" and "cat", it's usually easiest to do this:
grep dog filename.txt | grep cat
The grep dog will find all lines in the file that match "dog", and then the grep cat will search all those lines for "cat".
You seem not to know the meaning of the "-o" switch.
Regular "grep" (without "-o") prints the entire line where the pattern is found. Adding "-o" prints only the matched text.
Combining two "grep" commands in a logical AND can be done using a pipe "|", so you can do this:
grep <pattern1> <filename> | grep <pattern2>
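If you would rather keep it to a single grep process, AND can also be emulated with alternation covering both orderings, at the cost of readability:

```shell
# Lines that contain both "dog" and "cat", in either order
printf '%s\n' 'the dog chased the cat' 'just a dog' 'just a cat' |
  grep -E 'dog.*cat|cat.*dog'
```

For more than two patterns the piped form scales much better, since the alternation needs one branch per ordering.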

How to debug a (PCRE) regex passed to grep?

I'm trying to debug a regex passed to grep that doesn't seem to be working just on my system.
This is the full command that should return the latest terraform release version:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": "v\K.*?(?=")'
Which seems to be working for others but not me.
Adding a * quantifier after "tag_name": to match extra spaces makes it work for me:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest" | grep -Po '"tag_name": *"v\K.*?(?=")'
Here's the response from the wget without piping to grep:
{
  "url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583",
  "assets_url": "https://api.github.com/repos/hashicorp/terraform/releases/20814583/assets",
  "upload_url": "https://uploads.github.com/repos/hashicorp/terraform/releases/20814583/assets{?name,label}",
  "html_url": "https://github.com/hashicorp/terraform/releases/tag/v0.12.12",
  "id": 20814583,
  "node_id": "MDc6UmVsZWFzZTIwODE0NTgz",
  "tag_name": "v0.12.12",
  "target_commitish": "master",
  "name": "",
  "draft": false,
  "author": {
    "login": "apparentlymart",
    "id": 20180,
    "node_id": "MDQ6VXNlcjIwMTgw",
    "avatar_url": "https://avatars1.githubusercontent.com/u/20180?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/apparentlymart",
    "html_url": "https://github.com/apparentlymart",
    "followers_url": "https://api.github.com/users/apparentlymart/followers",
    "following_url": "https://api.github.com/users/apparentlymart/following{/other_user}",
    "gists_url": "https://api.github.com/users/apparentlymart/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/apparentlymart/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/apparentlymart/subscriptions",
    "organizations_url": "https://api.github.com/users/apparentlymart/orgs",
    "repos_url": "https://api.github.com/users/apparentlymart/repos",
    "events_url": "https://api.github.com/users/apparentlymart/events{/privacy}",
    "received_events_url": "https://api.github.com/users/apparentlymart/received_events",
    "type": "User",
    "site_admin": false
  },
  "prerelease": false,
  "created_at": "2019-10-18T18:39:16Z",
  "published_at": "2019-10-18T18:45:33Z",
  "assets": [],
  "tarball_url": "https://api.github.com/repos/hashicorp/terraform/tarball/v0.12.12",
  "zipball_url": "https://api.github.com/repos/hashicorp/terraform/zipball/v0.12.12",
  "body": "BUG FIXES:\r\n\r\n* backend/remote: Don't do local validation of whether variables are set prior to submitting, because only the remote system knows the full set of configured stored variables and environment variables that might contribute. This avoids erroneous error messages about unset required variables for remote runs when those variables will be set by stored variables in the remote workspace. ([#23122](https://github.com/hashicorp/terraform/issues/23122))"
}
And using https://regex101.com I can see that "tag_name": "v\K.*?(?=") and "tag_name": *"v\K.*?(?=") both match the version number correctly.
So there must be something wrong with my system and I'm just very curious why the original one doesn't work for me and how (if possible) to debug in situations like this.
I've been able to narrow it down to the following. If I execute the wget command without the piped grep and without formatting the json response:
wget -qO - "https://api.github.com/repos/hashicorp/terraform/releases/latest"
then I get JSON without any whitespace (I'll post only a part of the response):
"html_url":"https://github.com/hashicorp/terraform/releases/tag/v0.12.12","id":20814583,"node_id":"MDc6UmVsZWFzZTIwODE0NTgz","tag_name":"v0.12.12","target_commitish":"master","name":"","draft":false
So naturally the original regex "tag_name": "v\K.*?(?=") fails because there is no space after the colon.
This is clearly not related to the regex that is passed to the grep or the grep itself. I don't see the point in digging into the response itself here so the original question can be considered resolved (Though if someone knows what could be causing this please post a comment.)
It is very likely that your regex engine does not understand \K; there are many regexp dialects.
Sticking to standard POSIX extended regexp features, as egrep does, usually yields good results across all engines.
$ curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"'
"tag_name": "v0.12.12"
Now if you only want the version number, you need to extract the digits in a second pass (lookarounds such as ?! are not available in egrep's extended regexps).
curl -s "https://api.github.com/repos/hashicorp/terraform/releases/latest" | egrep -oe '"tag_name": "v(.*)"' | egrep -oe '([0-9]+\.?)+'
0.12.12
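A POSIX sed alternative avoids both \K and the whitespace assumption entirely, since [[:space:]]* tolerates a response with or without a space after the colon. A sketch, with printf standing in for the API response:

```shell
# Works for both '"tag_name":"v0.12.12"' and '"tag_name": "v0.12.12"'
printf '%s\n' '"tag_name":"v0.12.12",' |
  sed -n 's/.*"tag_name":[[:space:]]*"v\([^"]*\)".*/\1/p'
```

This prints the bare version number in a single pass, and would slot into the original pipeline in place of the grep -Po stage.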

How to paginate over an AWS CLI response?

I'm trying to paginate over EC2 Reserved Instance offerings, but can't seem to paginate via the CLI (see below).
% aws ec2 describe-reserved-instances-offerings --max-results 20
{
  "NextToken": "someToken",
  "ReservedInstancesOfferings": [
    {
      ...
    }
  ]
}
% aws ec2 describe-reserved-instances-offerings --max-results 20 --starting-token someToken
Parameter validation failed:
Unknown parameter in input: "PaginationConfig", must be one of: DryRun, ReservedInstancesOfferingIds, InstanceType, AvailabilityZone, ProductDescription, Filters, InstanceTenancy, OfferingType, NextToken, MaxResults, IncludeMarketplace, MinDuration, MaxDuration, MaxInstanceCount
The documentation found in [1] says to use start-token. How am I supposed to do this?
[1] http://docs.aws.amazon.com/cli/latest/reference/ec2/describe-reserved-instances-offerings.html
With deference to a 2017 solution by marjamis, which must have worked on a prior CLI version, here is a working approach for paginating from AWS in bash, tested on a Mac laptop with aws-cli/2.1.2:
# The scope of this example requires that credentials are already available or
# are passed in with the AWS CLI command.
# The parsing example uses jq, available from https://stedolan.github.io/jq/
# The below command is the one being executed and should be adapted appropriately.
# Note that the max items may need adjusting depending on how many results are returned.
aws_command="aws emr list-instances --max-items 333 --cluster-id $active_cluster"
unset NEXT_TOKEN
function parse_output() {
  if [ -n "$cli_output" ]; then
    # The output parsing below also needs to be adapted as needed.
    echo "$cli_output" | jq -r '.Instances[] | "\(.Ec2InstanceId)"' >> listOfinstances.txt
    NEXT_TOKEN=$(echo "$cli_output" | jq -r ".NextToken")
  fi
}
# The command is run and output parsed in the below statements.
cli_output=$($aws_command)
parse_output
# The below while loop runs until either the command errors due to throttling or
# comes back with a pagination token. In the case of being throttled / throwing
# an error, it sleeps for three seconds and then tries again.
while [ "$NEXT_TOKEN" != "null" ]; do
  if [ "$NEXT_TOKEN" == "null" ] || [ -z "$NEXT_TOKEN" ]; then
    echo "now running: $aws_command"
    sleep 3
    cli_output=$($aws_command)
    parse_output
  else
    echo "now paginating: $aws_command --starting-token $NEXT_TOKEN"
    sleep 3
    cli_output=$($aws_command --starting-token $NEXT_TOKEN)
    parse_output
  fi
done # pagination loop
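Stripped of the AWS specifics, the control flow above reduces to the sketch below. Here fetch_page is a hypothetical stub standing in for the aws ... --starting-token call, and sed replaces jq so the sketch is self-contained:

```shell
# Stub standing in for an AWS CLI call that takes a pagination token:
# emits three pages, the last with a null NextToken
fetch_page() {
  case "${1:-}" in
    "") echo '{"Items":["a"],"NextToken":"t1"}' ;;
    t1) echo '{"Items":["b"],"NextToken":"t2"}' ;;
    t2) echo '{"Items":["c"],"NextToken":null}' ;;
  esac
}

token="" results=""
while :; do
  page=$(fetch_page "$token")
  # Pull out the item and the token with sed, keeping the demo dependency-free
  results="$results$(printf '%s' "$page" | sed -n 's/.*"Items":\["\([^"]*\)".*/\1/p')"
  token=$(printf '%s' "$page" | sed -n 's/.*"NextToken":"\([^"]*\)".*/\1/p')
  [ -z "$token" ] && break
done
echo "$results"   # prints "abc": one item from each of the three pages
```

The same shape — fetch, parse, extract token, loop until the token disappears — is what the full script implements with jq and the real CLI.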
Looks like some busted documentation.
If you run the following, this works:
aws ec2 describe-reserved-instances-offerings --max-results 20 --next-token someToken
Translating the error message: it said it expected NextToken, which is represented as --next-token on the CLI.
If you continue to read the reference documentation that you provided, you will learn that:
--starting-token (string)
A token to specify where to start paginating. This is the NextToken from a previously truncated response.
Moreover:
--max-items (integer)
The total number of items to return. If the total number of items available is more than the value specified in max-items then a NextToken will be provided in the output that you can use to resume pagination.