Regex Variable Size HexString Parsing - regex

I am not familiar with regex, and I need to parse a spec with it.
I need to get the KEK, Key and Wrap hex values into a string/hex array, but the hex strings vary in length and can contain spaces. Please see the example below:
The second example wraps 7 octets of key data with a 192-bit KEK.
KEK : 5840df6e29b02af1 ab493b705bf16ea1 ae8338f4dcc176a8
Key : 466f7250617369
Wrap : afbeb0f07dfbf541 9200f2ccb50bb24f
The explanation line gives the length of the key, which might be useful; for example, "7 octets of key data".
The other problem is that online regex tools and online Python interpreters running Python's regex library (re) behave differently, so I cannot be sure about the regex.
I tried to match a line using
(\w+)\s+:\s+([A-Fa-f\d][A-Fa-f\d]([A-Fa-f\d][A-Fa-f\d])*)
but it only parses a line up to the first space in the hex string.
Any recommendations?

In your pattern you match 2 chars and then optionally repeat 2 chars at a time, but you repeat a capture group without ever matching the spaces.
You can reuse the same mechanism: optionally repeat groups of 2 chars with 1 or more whitespace chars prepended in a non-capture group, and capture the whole repetition in an outer capture group.
(\w+)\s+:\s+((?:[A-Fa-f\d][A-Fa-f\d])+(?:\s+(?:[A-Fa-f\d][A-Fa-f\d])+)*)\b
The same mechanism as in the above pattern, but repeating 1 or more characters instead of exactly 2 at a time:
(\w+)\s+:\s+([A-Fa-f\d]+(?:\s+[A-Fa-f\d]+)*)\b
(\w+) Capture group 1, match 1+ word chars
\s+:\s+ match : between 1 or more whitespace chars
( Capture group 2
[A-Fa-f\d]+ Match 1+ times any of the ranges
(?:\s+[A-Fa-f\d]+)* Optionally repeat 1+ whitespace chars followed by 1+ chars from the ranges
) Close group 2
\b A word boundary to prevent a partial match
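Since the question mentions Python's re module, here is a minimal sketch of how the second pattern could be applied there. The function name parse_hex_fields and the sample variable are made up for illustration, and the sketch assumes every value has an even number of hex digits (bytes.fromhex raises ValueError otherwise).

```python
import re

# Pattern from the answer: label, colon, then hex chunks separated by whitespace.
HEX_LINE = re.compile(r"(\w+)\s+:\s+([A-Fa-f\d]+(?:\s+[A-Fa-f\d]+)*)\b")


def parse_hex_fields(text):
    """Return {label: bytes} for every 'Label : hex hex ...' entry in text."""
    fields = {}
    for label, hex_chunks in HEX_LINE.findall(text):
        compact = "".join(hex_chunks.split())  # drop the spaces between chunks
        fields[label] = bytes.fromhex(compact)
    return fields


sample = """The second example wraps 7 octets of key data with a 192-bit KEK.
KEK : 5840df6e29b02af1 ab493b705bf16ea1 ae8338f4dcc176a8
Key : 466f7250617369
Wrap : afbeb0f07dfbf541 9200f2ccb50bb24f
"""

fields = parse_hex_fields(sample)
for label, value in fields.items():
    print(label, len(value), value.hex())
```

The byte length of each value can then be checked against the description line, for example 7 bytes for Key in this sample.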

Thank you very much, The fourth bird; it definitely parses the hex mentioned above.
But I have noticed that there are some irregular lines like the ones below:
The first example wraps 20 octets of key data with a 192-bit KEK.
KEK : 5840df6e29b02af1 ab493b705bf16ea1 ae8338f4dcc176a8
Key : c37b7e6492584340 bed1220780894115 5068f738
Wrap : 138bdeaa9b8fa7fc 61f97742e72248ee 5ae6ae5360d1ae6a
: 5f54f373fa543b6a
The "Wrap" value spans two lines, and the continuation line starts with an extra colon (:). I don't know whether anything can be done here.

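One possible way to handle such continuation lines, offered only as a sketch and not taken from the thread: allow an optional colon before each further hex chunk, then strip colons and whitespace from the captured value. It assumes a continuation line always starts with ":" and belongs to the label above it; HEX_FIELD and parse_hex_fields are hypothetical names.

```python
import re

# Assumption: a line starting with ':' continues the value of the previous label.
HEX_FIELD = re.compile(
    r"(\w+)\s+:\s+([A-Fa-f\d]+(?:\s+(?::\s+)?[A-Fa-f\d]+)*)\b"
)


def parse_hex_fields(text):
    """Return {label: bytes}, joining continuation lines into one value."""
    fields = {}
    for label, raw in HEX_FIELD.findall(text):
        compact = re.sub(r"[\s:]+", "", raw)  # remove whitespace and stray colons
        fields[label] = bytes.fromhex(compact)
    return fields


sample = """The first example wraps 20 octets of key data with a 192-bit KEK.
KEK : 5840df6e29b02af1 ab493b705bf16ea1 ae8338f4dcc176a8
Key : c37b7e6492584340 bed1220780894115 5068f738
Wrap : 138bdeaa9b8fa7fc 61f97742e72248ee 5ae6ae5360d1ae6a
: 5f54f373fa543b6a
"""

fields = parse_hex_fields(sample)
print({label: len(value) for label, value in fields.items()})
```
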
Related

Notepad++ Search for and replace Underscore Characters in "GUIDs"

A colleague has written some C# code that outputs GUIDs to a CSV file. The code has been running for a while but it has been discovered that the GUIDs contain underscore characters, instead of hyphens :-(
There are several files which have been produced already and rather than regenerate these, I'm thinking that we could use the Search and Replace facility in Notepad++ to search across the files for "GUIDs" in this format:
{89695C16_C0FF_4E7C_9BB2_8B50FAC9D371}
and replace it with a properly formatted GUID like this:
{89695C16-C0FF-4E7C-9BB2-8B50FAC9D371}.
I have a RegEx to find the offending GUIDs (probably not very efficient):
(([A-Z]|[0-9]){8}_)(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4}_(([A-Z]|[0-9]){12}))
but I don't know what RegEx to use to replace the underscores with. Does anybody know how to do this?
You can use the following solution:
Find What: (?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\}))[a-f\d]*\K_
Replace with: -
Match case: OFF
See the regex demo online. Details:
(?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\})) - either the end of the previous match or a { char immediately followed by eight hex chars, four repetitions of an underscore and four hex chars, then eight more hex chars and a } char (with Match case off, a-f also matches A-F)
[a-f\d]* - zero or more hex chars
\K - match reset operator that discards the text matched so far from the overall match memory buffer
_ - an underscore.
You can match the pattern with 5 capture groups where you would match the underscores in between.
Then you can use the capture groups in the replacement with $1-$2-$3-$4-$5
{\K([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})(?=})
{ Match {
\K Clear the match buffer (forget what is matched so far)
([A-Z0-9]{8})_ Capture group 1, match 8 times a char A-Z0-9
([A-Z0-9]{4})_ Capture 4 times a char A-Z0-9 in group 2
([A-Z0-9]{4})_ Same for group 3
([A-Z0-9]{4})_ Same for group 4
([A-Z0-9]{12}) Capture 12 times a char A-Z0-9 in group 5
(?=}) Positive lookahead, assert } to the right
If the pattern should also match without the curly braces { and }, you can use word boundaries instead:
\b([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})\b
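The capture-group replacement can also be tried outside Notepad++. Below is a rough sketch in Python, assuming the GUIDs are always wrapped in braces; because Python's re module has no \K, the braces are matched literally and written back in the replacement. The name fix_guids is made up for this example.

```python
import re

# Five capture groups, with the underscores matched in between them.
GUID_WITH_UNDERSCORES = re.compile(
    r"\{([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})\}"
)


def fix_guids(text):
    """Replace the underscores inside brace-wrapped GUIDs with hyphens."""
    return GUID_WITH_UNDERSCORES.sub(r"{\1-\2-\3-\4-\5}", text)


line = "id,{89695C16_C0FF_4E7C_9BB2_8B50FAC9D371},some_other_column"
print(fix_guids(line))
```

Underscores elsewhere on the line, such as in some_other_column, are left alone because they are not part of a full GUID match.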

Find certain colons in string using Regex

I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
Preceded or followed by a word, e.g. A Book: Chapter 1 or A Book :Chapter 1
Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 1 & 2 but still fails on condition 3, as it matches the colon present in the string representation of a time. How do I fix this?
edit: it has to be in a single regex statement if possible
You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
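As a sketch of the "check if Group 1 matched" step, here is how the split could look in Python. The function name split_at_colons is hypothetical, and stripping whitespace from each piece is an extra choice made for this example, not something the question requires.

```python
import re

# Alternatives without a group are matched and skipped; only Group 1 is a split point.
COLON = re.compile(r"8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)")


def split_at_colons(text):
    """Split text at colons captured in Group 1, ignoring times and emoticons."""
    parts = []
    start = 0
    for match in COLON.finditer(text):
        if match.group(1) is None:
            continue  # a time or an emoticon was matched, not a split point
        parts.append(text[start:match.start(1)].strip())
        start = match.end(1)
    parts.append(text[start:].strip())
    return parts


print(split_at_colons("A Book: Chapter 1"))
print(split_at_colons("Meet at 16:00 :)"))
```
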
Word characters \w include digits: \w is [a-zA-Z0-9_]
So just use [a-zA-Z] instead
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)

use ultraedit find and replace Perl regex to insert colon into 4 digit time string

I have multiple 24-hour time strings through several files. For example, 1234, which I wish to replace with 12:34.
Finding them is easy: just \d\d\d\d; that I understand and it works. However, what replace string do I need? In other words, given xx:xx, what do I put in place of each x?
I've tried a number of things to no avail. I'm obviously not understanding how to get it to remember the digits it found and recall them in the replace string.
If the 4 digits in your example data represent 24-hour time strings, you could use 2 capturing groups between word boundaries to prevent a match within more than 4 digits. You can adjust the word boundaries to your requirements.
Match
\b(\d{2})(\d{2})\b
Replace
\1:\2 (group 1, a colon, then group 2)
Explanation
\b Match a word boundary
(\d{2}) Capture in a group 2 digits
(\d{2}) Capture in a group 2 digits
\b Match a word boundary
Note
Matching 4 digits does not verify a valid 24 hour time. You could match that using for example \b([01][0-9]|2[0-3])([0-5][0-9])\b and replace with \1:\2
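The same match and replace can be checked quickly outside UltraEdit. This is a small Python sketch using the stricter pattern from the note; add_colons is a name invented for the example.

```python
import re

# Hours 00-23 in group 1, minutes 00-59 in group 2, bounded by word boundaries.
TIME_4_DIGITS = re.compile(r"\b([01][0-9]|2[0-3])([0-5][0-9])\b")


def add_colons(text):
    """Insert a colon into standalone 4-digit 24-hour times."""
    return TIME_4_DIGITS.sub(r"\1:\2", text)


print(add_colons("Depart 0930, arrive 2359"))
print(add_colons("2561 12345"))
```

Values such as 2561 (not a valid time) and 12345 (more than 4 digits) are left unchanged.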

Regex to match a unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but the last group ends up with only one digit when it should have two.
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
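A quick way to check this against the two sample strings, sketched in Python. Using fullmatch to require that the whole string matches is an assumption made for this example, as is the function name split_code.

```python
import re

# Prefix letters, the repeating middle part, then exactly two trailing digits.
CODE = re.compile(r"([A-Z]+)([0-9a-z]+)([0-9]{2})")


def split_code(text):
    """Return (prefix, middle, last_two_digits), or None if text does not match."""
    match = CODE.fullmatch(text)
    return match.groups() if match else None


print(split_code("YM10a15b5c27"))
print(split_code("YM1b5c17"))
```

The middle group is greedy, so it first takes everything and then backtracks just far enough to leave two digits for the last group.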
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^, it would also match the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping capture group 1, which takes one or more letters from 'a' to 'z'
(?=\1) tells the regex to match the text only as long as it is followed by the same code that appears at the start.
All you have to do is extract the code with group(2)
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
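To see the backreference at work, here is a short Python sketch using the anchored variant of the pattern. The inputs are invented test strings and the helper name middle_part is hypothetical.

```python
import re

# Group 1: a doubled lowercase letter; Group 3: the middle; Group 4: Group 1 again.
WRAPPED = re.compile(r"^(([a-z])\2)([0-9a-z]+)(\1)$")


def middle_part(text):
    """Return the text between the doubled-letter prefix and its repeat, else None."""
    match = WRAPPED.search(text)
    return match.group(3) if match else None


print(middle_part("aa1234aa"))
print(middle_part("aa1234bb"))
```
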

Only match unique string occurrences

We are doing some Data Loss Prevention for emails, but the issue is when people reply to emails multiple times sometimes the credit card number or account number will appear multiple times.
How can we get Java Regex to only match strings once each.
So for example, we are using the following regex to catch account numbers that match 2 letters followed by 5 or 6 numbers. it will also omit CR in either case.
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (Ignore as its already found it above)
LM45855
Unique string occurrence can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern whose unique occurrence you are searching for. Note that both will work with the .NET regex library, but the second one is not supported by most other libraries (only the PyPI regex library for Python and JavaScript ECMAScript 2018 regex support it). Note that . does not match line break chars by default, so you need to pass a modifier like DOTALL: in most libraries you may add the (?s) modifier inside the pattern (in Ruby, (?m) does the same), or use specific flags that you pass to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
Details:
\b - a leading word boundary
((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing
(?!CR|cr) - the next two characters cannot be CR or cr, the negative lookahead check
[A-Za-z]{2} - 2 ASCII letters
\d{5,6} - 5 to 6 digits
\b - trailing word boundary
(?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed with a word boundary (\b), same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.
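The question is about Java, but the pattern itself is easy to try in Python's re module, which supports the same lookahead and backreference syntax. A minimal sketch follows; the sample text is adapted from the question with an extra CR12345 line added, and unique_accounts is a made-up name.

```python
import re

# Keep a match only if the same account number does not appear again later.
LAST_OCCURRENCE = re.compile(
    r"\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)"
)


def unique_accounts(text):
    """Return each account number once, at the position of its last occurrence."""
    return LAST_OCCURRENCE.findall(text)


email_body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855\nCR12345"
print(unique_accounts(email_body))
```

Because the lookahead keeps the last occurrence, CX12345 is reported in the position of its second appearance rather than its first.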
I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:
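import java.util.HashMap;
import java.util.Map;

String ccNumber = "CX12345";
// The keys of the map are the unique matching numbers seen so far
Map<String, Boolean> ccMap = new HashMap<>();
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    ccMap.put(ccNumber, Boolean.TRUE);
}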
Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
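For comparison, the same tally idea can be sketched in Python, where a dict preserves insertion order and so keeps each number at its first occurrence. This is only an illustration of the approach, not a replacement for the Java code above; unique_in_order is a hypothetical name.

```python
import re

# Same account pattern as in the question, applied to a whole block of text.
ACCOUNT = re.compile(r"\b(?!CR|cr)[A-Za-z]{2}[0-9]{5,6}\b")


def unique_in_order(text):
    """Return matching account numbers without duplicates, first occurrence first."""
    return list(dict.fromkeys(ACCOUNT.findall(text)))


email_body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855"
print(unique_in_order(email_body))
```
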