Search group in bin text with regex


I need to find groups in a large text, knowing:
the word that marks the start of a group
a word contained in the group
the word that marks the end of the group
The start word is: begin
The contained word is: 536916223
The end word is: end
In the text at the bottom, I need to find 2 groups.
I have tried:
\bbegin.*(\n*.*)*536916223(\n*.*)*\bbegin
but when I try this regex on http://regexr.com/ it times out, so I think the regex is not very good :(
The text is:
begin active link
export-version : 11
actlink-order : 2
wk-conn-type : 1
schema-name : HelpDesk
actlink-mask : 1
actlink-control: 750000002
enable : 1
action {
set-field : 0\536916222\101\4\1\1\
}
errhandler-name:
end
begin active link
export-version : 11
actlink-order : 2
wk-conn-type : 1
schema-name : HelpDesk
actlink-mask : 1
actlink-control: 610000092
enable : 1
permission : 0
action {
id : 536916223
focus : 0
access-opt : 1
option : 0
}
action {
set-field : 0\536916222\101\4\1\1\
}
errhandler-opt : 0
errhandler-name:
end
begin active link
actlink-order : 12
wk-conn-type : 1
schema-name : HelpDesk
actlink-mask : 2064
enable : 1
permission : 0
action {
id : 536916223
focus : 0
access-opt : 1
option : 0
}
action {
set-field : 0\536916222\101\4\1\1\
}
errhandler-opt : 0
errhandler-name:
end
Can someone suggest an optimized regex for this task?
Regards,
Vincenzo

Use an unrolled tempered greedy token:
/\bbegin.*(?:\n(?!begin|end(?:$|\n)).*)*\b536916223\b.*(?:\n(?!begin|end(?:$|\n)).*)*\nend/g
or a shorter version if we add MULTILINE modifier:
/^begin.*(?:\n(?!begin|end$).*)*\b536916223\b.*(?:\n(?!begin|end$).*)*\nend$/gm
Details:
\bbegin - a word begin (a word boundary \b can be added after it for surer matches)
.* - the rest of the line after begin
(?:\n(?!begin|end(?:$|\n)).*)* - the unrolled form of the tempered greedy token (?:(?!\n(?:begin|end(?:$|\n)))[\s\S])*, matching any run of lines that neither start with begin nor consist solely of end
\b536916223\b - the whole word 536916223
.* - the rest of the line after the number
(?:\n(?!begin|end(?:$|\n)).*)* - another unrolled tempered greedy token
\nend - the end word after a newline (a (?:$|\n) can be added after it for surer matches)
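As a rough illustration, the MULTILINE variant can be tried in Python against an abbreviated version of the question's data. The sample text and the names TEXT and BLOCK_WITH_ID are assumptions made for this sketch, not part of the original answer.

```python
import re

# Abbreviated sample modelled on the question's data: three begin...end blocks,
# of which only the second and third contain the number 536916223.
TEXT = r"""begin active link
actlink-control: 750000002
action {
set-field : 0\536916222\101
}
end
begin active link
actlink-order : 2
action {
id : 536916223
}
end
begin active link
actlink-order : 12
action {
id : 536916223
}
end"""

# The MULTILINE pattern from the answer; re.MULTILINE stands in for the m flag,
# and findall plays the role of the g flag.
BLOCK_WITH_ID = re.compile(
    r"^begin.*(?:\n(?!begin|end$).*)*\b536916223\b.*(?:\n(?!begin|end$).*)*\nend$",
    re.MULTILINE,
)

# No capturing groups are used, so findall returns the full text of each match.
blocks = BLOCK_WITH_ID.findall(TEXT)
print(len(blocks))  # 2
```

Because the lookahead refuses to step over a line that starts with begin or is exactly end, a block without the number fails quickly instead of swallowing its neighbour.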

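The .*(\n*.*)* part is needlessly complicated and causes heavy backtracking.
Since . does not match newline characters, we can use a class such as [\S\s] to match any character, newlines included. Another possible improvement (and possibly a correction) is to use a lazy quantifier, i.e. *?
The following pattern runs without the excessive backtracking, but note that it does not stop at block boundaries: when a block lacks the number, the match runs on into the next block.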
\bbegin[\S\s]*?536916223[\S\s]*?\bend
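A small Python check of how the lazy pattern behaves when the first block does not contain the number. The two-block sample text is made up for this sketch.

```python
import re

# Two blocks; only the second one contains the number.
text = "begin\nid : 1\nend\nbegin\nid : 536916223\nend"

lazy = re.compile(r"\bbegin[\S\s]*?536916223[\S\s]*?\bend")
first = lazy.search(text).group(0)

# The match starts at the first "begin" and runs through the second block,
# so it contains two "begin" words.
print(first.count("begin"))  # 2
```

If every block must be reported on its own, the pattern needs a guard that stops it from crossing an end line.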

Regex (with m modifier set):
^begin(?:(?!^end)[\s\S])*?536916223[\s\S]*?end
Explanation:
^begin # Match `begin` at start of line
(?: # Start of non-capturing group (a)
(?!^end)[\s\S] # Any character, as long as an `end` delimiter line does not start here
)*? # Zero or more times (un-greedy)
536916223 # Up to special word
[\s\S]*? # Match any other characters
end # Up to first `end` delimiter
Much more efficient version - (with m modifier set):
^begin.*(?:\n(?!^end).*)*536916223(?:.*\n)*?^end

Related

Powershell Regex match statement

I am trying to extract the nxxxxx number from the input below:
uniqueMember: uid=n039833,ou=people,ou=networks,o=test,c=us
uniqueMember: uid=N019560, ou=people, ou=Networks, o=test, c=Us
I tried:
[Regex]::Matches($item, "uid=([^%]+)\,")
but this gives:
Groups : {0, 1}
Success : True
Name : 0
Captures : {0}
Index : 14
Length : 43
Value : uid=N018315,ou=people,ou=Networks,o=test,
Success : True
Name : 1
Captures : {1}
Index : 18
Length : 38
Value : N018315,ou=people,ou=Networks,o=test
Some help with improving the match pattern would be appreciated.

You can use
[Regex]::Matches($s, "(?<=uid=)[^,]+").Value
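To save the values in a variable (avoid the name $matches, which PowerShell reserves as the automatic variable filled by the -match operator):
$uids = [Regex]::Matches($s, "(?<=uid=)[^,]+").Value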
Output:
n039833
N019560
Details:
(?<=uid=) - a positive lookbehind that requires uid= text to appear immediately to the left of the current location
[^,]+ - one or more chars other than a comma.
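For comparison, the same extraction can be sketched in Python. The variable names are illustrative only. The lookbehind form and the capture-group form return the same values on the question's input.

```python
import re

text = (
    "uniqueMember: uid=n039833,ou=people,ou=networks,o=test,c=us\n"
    "uniqueMember: uid=N019560, ou=people, ou=Networks, o=test, c=Us"
)

# Lookbehind form: the match itself is the uid value.
via_lookbehind = re.findall(r"(?<=uid=)[^,]+", text)

# Capture-group form: findall returns the contents of group 1.
via_group = re.findall(r"uid=([^,]+)", text)

print(via_lookbehind)  # ['n039833', 'N019560']
```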

You can use a capture group with a negated character class that excludes the comma; if you also don't want to match %, you can exclude that as well.
$s = "uniqueMember: uid=n039833,ou=people,ou=networks,o=test,c=us`nuniqueMember: uid=N019560, ou=people, ou=Networks, o=test, c=Us"
[regex]::Matches($s,'uid=([^,]+)') | Foreach-Object {$_.Groups[1].Value}
Output
n039833
N019560
Note that your current pattern requires a trailing comma to be present. If that is not always the case, you can omit it from the pattern. If you only want to exclude a comma, the pattern will be:
uid=([^,]+)

Using a regex to identify EQUIPMENTID numbers - VBA

I am struggling to construct a RegExp that identifies equipment numbers. It needs to identify them in multiple formats, including pooled equipment numbers, e.g. AFD21101, AFD21101-02-03 or AFD21101-2-3, with various prefixes as per the test data.
Any tips or feedback are welcome. It may be easier to use a separate RegExp for each scenario, but I had hoped to have a master pattern that identifies any of these formats and extracts them from a string for further, more detailed processing, possibly converting them to long format etc.
Any assistance is greatly appreciated. Hopefully I can return the favour.
What I've tried so far:
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]|[0-9xX-][0-9]|[0-9]
^[abcpfsmschafddfcpdcdplldt][glvmdugmrxftiichlewsnuabn][mmrprbdpucdsxtvuwcrslbubk][0-9][0-9xX][0-9xX][0-9xX][0-9xX]
^(BLM)|(SUB)|
(CVR)|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT|[0-9][0-9xX][0-9xX][0-9xX][0-9xX]
Test data - the pattern must handle multiple IDs separated by commas or spread over multiple lines, as per the examples below
// Example test data 1: (CSV+)
CRN21003 (CB-3), CRN21004 (CB-4)
// Example test data 2: (CSV)
CVR21404, CHU21437, AFD21401
// Example test data 3: (Multi-line)
MGD22401 - 16
DEC22401 - 16
// Example test data 4: (In string)
AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA
//Additional matches
AFD21101-03
AFD21101_03
AFD21101-02-03
AFD21101_02_03
AFD21101-2-3
AFD21101_2_3
FDR21407-08
BLM21401
SUB21601
CVR21601
Fdr21601
SMP21501
CRU21501
HXC21501
AFD21501
FTS21X01
DIX21301
DIT22501
FIT21X0X
FCV21501
Pattern:
Base is at most 8 characters:
1-3 letters (A-Z)
5 digits (0-9), with X allowed as a wildcard
Followed by pooled equipment IDs
e.g. AFD21101-2-3, AFD21101-02-03 or AFD21101_02_03
_ or - are delimiters indicating abbreviated subsequent equipment IDs or ranges.
AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103 in full form
Possible prefixes, continued
KV
CHU
PLW
BCR
DEC
CTR
CWR
V
DSS
PNL
MTR
LUB
LAU
CCL
DBB
TNK
THK
PIT
AGM2XXXX - valid
Some Invalid matches would be something like
AGM211011 or AGMXXXXX or 21101 or 2110 or AGM21101-094-034 or AGM (prefix only without a trailing 5 digit number/ X wildcard)

If I understand your issue correctly, you need to match strings that start with one of the given prefixes and are followed by numbers.
You could try the following regex.
^(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT)[0-9_-]+
Details:
^: start of string
(?:...): non-capturing group
(?:BLM|SUB|CVR|FDR|SMP|CRU|HXC|ATS|AFD|FTS|DIX|DIT|FIT|FCV|KV|FV|CHU|PLW|BCR|DEC|CTR|CWR|V|DSS|PNL|MTR|LUB|LAU|CCL|DBB|TNK|THK|PIT): list of prefixes.

It isn't 100% clear what you're intending to do because:
The test data you've supplied is comprised wholly of expected matches
The expected output is unclear, although this largely relates back to point 1!
However, there are many ways of getting the information you require. They all depend on how your source data is organised though...
// Example test data 1:
AFD11122 SOME OTHER RANDOM DATA
WDC11121_22 SOME OTHER RANDOM DATA
// Example test Data 2:
SOME RANDOM DATA AFD11122 AND SOME MORE RANDOM DATA WDC11121_22 WITH SOME MORE
Assuming that the data is at the start of the string AND that you want to capture each string as a whole:
// Option 1
/^(.*?)\s/
^ : Start of string
(.*?) : Non-greedy capture group
\s : First space (first because the capture group was non-greedy)
// Option 2
/^([ABCDEFHIKLMNPRSTUVWX][ABCDEFHILMNRSTUVWX]?[BCDKLMPRSTUVWX]?[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[ABCDEFHIKLMNPRSTUVWX] : Capture any letter in character set
[ABCDEFHILMNRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[BCDKLMPRSTUVWX]? : OPTIONALLY [?] capture any letter in character set
[x\d]{5} : Capture any number or x 5 times
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 3
/^((?:AFD|BCR|BLM....TNK|V)[\d_\-]*)/i
^ : Start of string
( : Start of capture group
(?: : Start of non-capturing group
AFD|BCR|BLM....TNK|V : List of prefixes separated with "|"
) : End of non-capturing group
[\d_\-]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
// Option 4
/^([a-z]{1,3}[x\d]{5}[_\-\d]*)/i
^ : Start of string
( : Start of capture group
[a-z]{1,3} : Capture any letter [range: a-z] 1 to 3 times {1,3}
[x\d]{5} : Capture any number [\d] or x [x] 5 times {5}
[_\-\d]* : Capture any number, hyphen, or underscore until you reach a character not in the set
) : End of capture group
i : FLAG - case insensitive
Based on your updates to the main question I would stick with option 4 unless you specifically need to make sure that only the set prefixes are matched.
In the event that your data looks more like Example Data 2 then the above expressions will need to be altered accordingly; some examples below:
/([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Remove the ^
/\b([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Add a word boundary to the start of the expression
/[^a-z]([a-z]{1,3}[x\d]{5}[_\-\d]*)/i : Start the expression with anything BUT a letter
How you alter it will depend on the data that you're searching through.
Updated RegEx based on latest question edits
/([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)/ig
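A quick way to check the updated expression is to run it over a few of the question's samples. The sketch below uses Python rather than VBA, and the sample list and the name EQUIPMENT_ID are chosen only for illustration.

```python
import re

# The updated pattern from above; re.IGNORECASE stands in for the i flag,
# and findall plays the role of the g flag.
EQUIPMENT_ID = re.compile(
    r"([a-z]{1,3}(?!xxxxx)[x\d]{5}(?!\d)[_\-\d]*)", re.IGNORECASE
)

samples = [
    "AFD11122 SOME OTHER RANDOM DATA WDC11121_22 SOME OTHER RANDOM DATA",
    "CRN21003 (CB-3), CRN21004 (CB-4)",
    "AGM211011",  # six digits: listed as invalid in the question
    "AGMXXXXX",   # wildcards only: listed as invalid in the question
]

# One list of extracted IDs per sample string.
found = [EQUIPMENT_ID.findall(s) for s in samples]
print(found[0])  # ['AFD11122', 'WDC11121_22']
```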

Try this:
[A-Z]{1,3}[\dX]{5}([_-])0?\d(\10?\d)?
This requires the separator to be consistent, i.e. either both - or both _, by capturing the separator and using a back reference to it (\1). The second "pooled ID" is optional.
As far as I can tell, this matches all of your pooled examples. Note that it requires at least one separator, so a bare ID such as BLM21401 is not matched.
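The question states that AFD21101-02-03 is equivalent to AFD21101, AFD21102, AFD21103. A minimal sketch of that expansion in Python, assuming each suffix simply replaces the same number of trailing digits of the base number. That rule is an interpretation of the question, and the names POOLED and expand_pooled are hypothetical.

```python
import re

# Prefix (1-3 letters), 5-character base (digits or X), then optional
# one- or two-digit suffixes separated by "-" or "_".
POOLED = re.compile(r"^([A-Za-z]{1,3})([0-9Xx]{5})((?:[_-][0-9]{1,2})*)$")


def expand_pooled(equipment_id: str) -> list[str]:
    """Expand a pooled ID such as AFD21101-02-03 into its full-form IDs."""
    m = POOLED.match(equipment_id)
    if m is None:
        raise ValueError(f"not a recognised equipment ID: {equipment_id!r}")
    prefix, base, tail = m.groups()
    ids = [prefix + base]
    for suffix in re.findall(r"[0-9]+", tail):
        # Assumed rule: the suffix overwrites the last len(suffix) digits.
        ids.append(prefix + base[: len(base) - len(suffix)] + suffix)
    return ids


print(expand_pooled("AFD21101-02-03"))  # ['AFD21101', 'AFD21102', 'AFD21103']
```

Three-digit suffixes, as in AGM21101-094-034, are rejected here because the question lists that form as invalid.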

Regex: match only line where numbers are located

Really tired of this regex. So many combinations.... I believe I need another brain :-)
Here is my problem; if someone could help, I'd highly appreciate it.
I have these 6 lines of a JSON response
...some JSON code here
"note" : "",
"note" : "Here is my note",
"note" : "My note $%$",
"note" : "Created bug 14569 in the system",
"note" : "Another beautiful note",
"note" : "##$%##%dgdeg"
...continuation of the JSON code
Using a regex, how do I match only the number 14569?
I have tried the following regexes; the first one matches all 6 lines:
"note"([\s\:\"a-zA-Z])*([0-9]*) - 6 matches (I only need one)
"note"([\s\:\"a-zA-Z])*(^[0-9]*) - no matches
"note"([\s\:\"a-zA-Z])*([0-9]*+?) - pattern error
"note"([\s\:\"a-zA-Z])*(^[0-9]*+#?) - no match
Thanks for your help!
Updated for Matt. Below is my full JSON object
"response": {
"notes": [{
"note" : "",
"note" : "Here is my note",
"note" : "My note $%$",
"note" : "Created bug 14569 in the system",
"note" : "Another beautiful note",
"note" : "##$%##%dgdeg"
}]
}

You could try this regex:
"note"\s*:\s*".*?([0-9]++).*"
It will give you the number in group 1 of the match.
If you don't want to match numbers that are part of a word (e.g. "bug11") then surround the capture group with word boundary assertions (\b):
"note"\s*:\s*".*?\b([0-9]++)\b.*"

If all that you care about is that the line includes a number, then that is all you need to look for.
/[0-9]/ # matches if the string includes a digit
Or, as you want to capture the number:
/([0-9]+)/ # matches (and captures) one or more digits
This is a common error that I see when beginners build regular expressions. They want to build a regex that matches the whole string - when, actually, they only need to match the bit of the string that they want to match.
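To illustrate the point, here is a short Python sketch that applies the small pattern to each of the question's lines and keeps only the captured digits. The list name lines is an assumption made for the example.

```python
import re

lines = [
    '"note" : "",',
    '"note" : "Here is my note",',
    '"note" : "My note $%$",',
    '"note" : "Created bug 14569 in the system",',
    '"note" : "Another beautiful note",',
    '"note" : "##$%##%dgdeg"',
]

# Only lines that contain at least one digit produce a match object.
hits = [m.group(1) for line in lines if (m := re.search(r"([0-9]+)", line))]
print(hits)  # ['14569']
```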
Update:
It might help to explain why some of your other attempts failed.
"note"([\s\:\"a-zA-Z])*([0-9]*) - 6 matches (I only need one)
The * means "match zero or more of the previous item", effectively making the item optional. This matches all lines as they all contain zero or more digits.
"note"([\s\:\"a-zA-Z])*(^[0-9]*) - no matches
The ^ means "the next item needs to be at the start of the string". You don't have digits at the start of your string.
"note"([\s\:\"a-zA-Z])*([0-9]*+?) - pattern error
Yeah. You're just adding random punctuation here, aren't you? *+? means nothing to the regex parser.
"note"([\s\:\"a-zA-Z])*(^[0-9]*+#?) - no match
This fails for the same reason as the previous attempt where you use ^ - the digits aren't at the start of the string. Also, the # has no special meaning in a regex, so #? means "zero or one # characters".

If you have JSON, why don't you parse the JSON and then grep through the result?
use JSON 'decode_json';
my $data = decode_json( $json_text );
my @rows = map { /\b(\d+)\b/ ? $1 : () } # keep only the number
map { $_->{note} } @$data;
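The same parse-first idea, sketched in Python. The JSON below is a hypothetical, well-formed variant of the question's data with one object per note, since the structure shown in the question repeats the "note" key inside a single object.

```python
import json
import re

# Hypothetical well-formed input: one object per note.
json_text = """
{"response": {"notes": [
  {"note": ""},
  {"note": "Here is my note"},
  {"note": "Created bug 14569 in the system"},
  {"note": "##$%##%dgdeg"}
]}}
"""

data = json.loads(json_text)

# Keep only the first standalone number of each note that has one.
numbers = [
    m.group(1)
    for item in data["response"]["notes"]
    if (m := re.search(r"\b(\d+)\b", item["note"]))
]
print(numbers)  # ['14569']
```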

This might work (?m-s)^[^"\r\n]*?"note"\h*:\h*"[^"\r\n]*?\d+[^"\r\n]*".*
https://regex101.com/r/ujDBa9/1
Explained
(?m-s) # Multi-line, no dot-all
^ # BOL
[^"\r\n]*? # Not a double quote yet
"note" \h* : \h* # Finally, a note
" [^"\r\n]*? \d+ [^"\r\n]* " # Is a number embedded within the quotes ?
.* # Ok, get the rest of the line

How can I search and replace guids in Sublime 3

I have a textfile where I would like to replace all GUIDs with space.
I want:
92094, "970d6c9e-c199-40e3-80ea-14daf1141904"
91995, "970d6c9e-c199-40e3-80ea-14daf1141904"
87445, "f17e66ef-b1df-4270-8285-b3c15da366f7"
87298, "f17e66ef-b1df-4270-8285-b3c15da366f7"
96713, "3c28e493-015b-4b48-957f-fe3e7acc8412"
96759, "3c28e493-015b-4b48-957f-fe3e7acc8412"
94665, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
94405, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"
To become:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,
How can I accomplish this in Sublime Text 3?

Ctrl+H
Find: "[\da-f-]{36}"
Replace: LEAVE EMPTY
Enable regex mode
Replace all
Explanation:
" : double quote
[ : start of character class
\d : any digit
a-f : or letter from a to f
- : or a dash
]{36} : end of character class, repeated exactly 36 times
" : double quote
Result for given example:
92094,
91995,
87445,
87298,
96713,
96759,
94665,
94405,

Try doing a search for this pattern in regex search mode:
"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
And then just replace with empty string. This should strip off the GUID, leaving you with the output you want.
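Outside the editor, the same replacement can be sketched in Python. This version narrows the letters to a-f, which is an assumption about the GUIDs being lowercase hexadecimal, and it strips the space left behind after the comma.

```python
import re

# A quoted GUID in 8-4-4-4-12 hexadecimal layout.
GUID = re.compile(
    r'"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"'
)

lines = [
    '92094, "970d6c9e-c199-40e3-80ea-14daf1141904"',
    '87445, "f17e66ef-b1df-4270-8285-b3c15da366f7"',
]

# Remove the GUID, then drop the trailing space that remains after the comma.
cleaned = [GUID.sub("", line).rstrip() for line in lines]
print(cleaned)  # ['92094,', '87445,']
```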

Another regex solution, using a slightly different search-and-replace strategy: we ignore the GUID format and simply keep the first column.
Search for ([^,]*,).* (again, don't forget to activate regex mode, the .* button).
Replace with $1.
Details about the regular expression
The idea here is to capture the first column of every line. A column is defined by a sequence of
"some non-comma characters": [^,]*
followed by a comma: [^,]*,
The first column can then be followed by anything .* (the GUID format doesn't matter): [^,]*,.*
Finally, we capture the first column using a capturing group: ([^,]*,).*
In the replace field we use a backreference $x, which refers to the x-th capturing group.
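The first-column strategy translates directly to Python, where the backreference is written \1 instead of $1. A minimal sketch with two of the question's lines:

```python
import re

lines = [
    '96713, "3c28e493-015b-4b48-957f-fe3e7acc8412"',
    '94665, "87ac12a3-62ed-4e1d-a1a6-51ae05e01b1a"',
]

# Keep group 1 (everything up to and including the first comma), drop the rest.
first_columns = [re.sub(r"([^,]*,).*", r"\1", line) for line in lines]
print(first_columns)  # ['96713,', '94665,']
```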

How can i remove before and after a particular xml tags in notepad++?

I have a huge XML file that holds a separate XML document on each line (thousands of lines).
I want to remove everything before and after the <employer> tag on every line, regardless of the tag's position in the line.
Input XML:
<Main><FirstName>xyz123</FirstName><employer>ABC Co.</employer><Salary>1000</Salary><Description>Manager</Description></Main>
<Main><FirstName>xyz123</FirstName><employer>ABC Co.</employer><Salary>1000</Salary></Main>
<Main><FirstName>xyz123</FirstName><Salary>1000</Salary><Description>Manager</Description></Main>
<Main><FirstName>xyz123</FirstName><employer>ABC Co.</employer><Salary>1000</Salary><Description>Manager</Description></Main>
Output would be something like this:
<employer>ABC Co.</employer>
<employer>ABC Co.</employer>
<employer>ABC Co.</employer>
<employer>ABC Co.</employer>

I'm not sure I fully understand your needs, but I guess you want:
Ctrl+H
Find what: ^.*(<employer>.*?</employer>).*$
Replace with: $1
Replace all
Explanation:
^ : beginning of line
.* : 0 or more any character
( : start group 1
<employer> : literally
.*? : 0 or more any character, non greedy
</employer> : literally
) : end group 1
.* : 0 or more any character
$ : end of line
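The same find-and-replace can be tried in Python to see what happens to a line that has no <employer> element, as in the third input line of the question. The name EMPLOYER_ONLY and the shortened sample lines are assumptions for this sketch.

```python
import re

EMPLOYER_ONLY = re.compile(r"^.*(<employer>.*?</employer>).*$")

lines = [
    "<Main><FirstName>xyz123</FirstName><employer>ABC Co.</employer>"
    "<Salary>1000</Salary></Main>",
    # No <employer> element on this line.
    "<Main><FirstName>xyz123</FirstName><Salary>1000</Salary></Main>",
]

# Python writes the backreference as \1 where Notepad++ uses $1.
result = [EMPLOYER_ONLY.sub(r"\1", line) for line in lines]
print(result[0])  # <employer>ABC Co.</employer>
```

A line without the element does not match the pattern at all, so it is left unchanged rather than emptied.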
