Using regex multiple capture groups to split up a string - regex

I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator, so the end result would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex on applications like Textpad and Notepad++. I have not figured out how to use a regex like /.+/g because the applications do not like the forward slashes. So I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find and replace with the above params a few times it would work fine (for the D lines only). The problem is there is an unknown number of lines to be made. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
Note: If I could get rid of the leading and trailing spaces EX: "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1" that would be even better but not necessary.
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/
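
For comparison, the desired transformation can be sketched outside the editor in a few lines of Python. This is only an illustration, not a Textpad/Notepad++ regex: the function name split_tilde_rows is made up here, and the sketch assumes that only three-field D and R rows are split, that every output field is quoted, and that stripping surrounding spaces (the optional wish in the note above) is wanted.

```python
import csv
import io

def split_tilde_rows(text):
    """Expand D and R rows so that each ~-separated value gets its own row."""
    out = io.StringIO()
    # QUOTE_ALL keeps every field wrapped in double quotes, like the input file.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if len(row) == 3 and row[1] in ("D", "R"):
            for part in row[2].split("~"):
                # strip() removes leading and trailing spaces from each value.
                writer.writerow([row[0], row[1], part.strip()])
        else:
            # V rows (and anything unexpected) pass through unchanged.
            writer.writerow(row)
    return out.getvalue()
```

Because the csv module handles the quoting, a value such as TEST,10 stays in a single field.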

Related

Regex capture group with non-uniform space group

I'm trying to parse the output of the "display interface brief" Comware switch command to convert it to a CSV file using RegEx. This command is printed using the following format:
Interface Link Speed Duplex Type PVID Description
BAGG51 UP 4G(a) F(a) T 1
FGE1/0/42 DOWN auto A T 1 ### LIVRE ###
GE6/0/20 UP 100M(a) F(a) A 1 LIVRE (MGMT - [WAN8-P8]
It's quite challenging for me because no matter which RegEx I try, it doesn't properly handle output such as "DOWN auto" and "100M(a) F(a)", which has only one space between the fields. I also couldn't find a way to properly handle the last field, which can contain one or more spaces: most of the RegExes I tried create a separate capture group for each space-separated word instead of treating the text as a single field.
I've tried countless ways to parse it, and I couldn't find much content about parsing non-uniform columns on the Internet or in the StackOverflow community.
I need to parse it into the following format, with 7 capture groups per line, respecting the end of line:
BAGG51;UP;4G(a);F(a);T;1
FGE1/0/42;DOWN;auto;A;T;1;### LIVRE ###
GE6/0/20;UP;100M(a);F(a);A;1;LIVRE (MGMT - [WAN8-P8]
The most successful RegEx that I have found so far is ^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+), replaced with $1;$2;$3;$4;$5;$6;$7 in Notepad++, but it doesn't properly handle the "Description" field, which can be empty.
The following pattern seems to be working here:
^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?
This follows your pattern with six mandatory capture groups, followed by an optional seventh capture group. The (?:[ ]+(.*))? at the end of the pattern matches one or more spaces followed by the rest of the line. Note that this pattern should be used in multiline mode.
Here is a working demo
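
As a rough check outside Notepad++, the same pattern can be tried in Python. This is a sketch under a few assumptions: the helper name to_csv_line is hypothetical, the pattern is applied one line at a time (so \s+ cannot run across a line break), and an empty description is simply left out rather than written as a trailing semicolon.

```python
import re

# Six mandatory whitespace-separated fields, then an optional free-text description.
ROW = re.compile(r'^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?')

def to_csv_line(line):
    """Join the captured fields with ';', or return the line unchanged if it does not match."""
    m = ROW.match(line.rstrip())
    if m is None:
        return line
    # Group 7 is None when the description is missing, so it is skipped.
    return ";".join(g for g in m.groups() if g is not None)

for raw in ["BAGG51 UP 4G(a) F(a) T 1",
            "FGE1/0/42 DOWN auto A T 1 ### LIVRE ###"]:
    print(to_csv_line(raw))
```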

Regex in Notepad++ to select on string length between specific XML tags

I'm working with Emergency Services data in the NEMSIS XSD. I have a field, which is constrained to only 50 characters. I've searched this site extensively, and tried many solutions - Notepad++ rejects all of them, saying not found.
Here's an XML Sample:
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE AND ALCOHOL WITHDRAWAL</E09_05>
</E09>
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is is okay</E09_05>
</E09>
I've tried solutions naming the E09_05 tag in different ways, using <\/E09_05> for the closing tag as I've seen in some examples, and as just </E09_05> as I've seen in others. I've tried ^.{50,}$ between them, or [a-zA-Z]{50,}$ between them, I've tried wrapping those in-between expressions in () and without. I even tried just [\s\S]*? in between the tags. The only thing that Notepad++ finds is when I use ^.{50,}$ by itself with no XML tags ... but then I wind up hitting on all the E13_01 tags (which are EMS narratives, and always > 50 characters) -- making for painstaking and wrist-aching clicks.
I wanted to XSLT this, but there is too much individual, hands-on tweaking of each E09_05 field to automate it. Perl is not an option in this environment (and not a tool I know at all anyway).
To be truly sublime, both E09_05 and E09_08 fields with string lengths >50 need to be what is selected on the search ... but no other elements of any kind or length.
Thanks in advance. I'm sure I'm just missing some subtle \, or () or [] somewhere ... hopefully ...
The following regex will find the text content of <E09_05> elements with more than 50 characters.
(?<=<E09_05>).{51,}?(?=</E09_05>)
Explanation
(?<=<E09_05>) Start matching right after <E09_05>
.{51,}? Match 51 or more characters (in a single line)
The ? makes it reluctant, so it'll stop at first </E09_05>
(?=</E09_05>) Stop matching right before </E09_05>
For truly sublime matching, i.e. both E09_05 and E09_08 fields with string lengths >50, use:
(?<=<(E09_0[58])>).{51,}?(?=</\1>)
Explanation
<(E09_0[58])> Match <E09_05> or <E09_08>, and capture the name as group 1
</\1> Use \1 backreference to match name inside </name>
If you want to shorten the text with ellipsis at the end, e.g. Hello World with max length 8 becomes Hello..., use:
Find what: (?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)
Replace with: \2...
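
The two patterns above can also be exercised in Python, whose re module accepts this fixed-width lookbehind. The following is a minimal sketch, not Notepad++ behaviour: the function names find_long_fields and truncate_long_fields are invented for illustration, and it assumes each element sits on a single line, as in the sample.

```python
import re

# Text of E09_05 / E09_08 elements that is longer than 50 characters.
LONG_TEXT = re.compile(r'(?<=<(E09_0[58])>).{51,}?(?=</\1>)')
# Same elements, capturing the first 47 characters so that 47 + "..." = 50.
TRUNCATE = re.compile(r'(?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)')

def find_long_fields(xml):
    """Return (tag name, text) pairs for every over-long field."""
    return [(m.group(1), m.group(0)) for m in LONG_TEXT.finditer(xml)]

def truncate_long_fields(xml):
    """Shorten over-long fields to 47 characters plus an ellipsis."""
    return TRUNCATE.sub(r'\2...', xml)
```

Other elements, such as the long E13_01 narratives, are not touched because the lookbehind only accepts the two named tags.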

Multiple replace regex in one Apache-NiFi statement

I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column that contains just the leading 3 digits of the number (for specific prefixes only; all other rows should get the string NOT_VALID). So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for a single replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces all with false.
Nifi uses Java regex
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group () optionally (?) captures 021 or 085 at the beginning of the string (^).
The replacement, $1, is the first group.
PS: Sites like https://regex101.com/ help in understanding regex.
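
To see what the pattern does on the sample numbers, here is a small Python sketch. It is not NiFi Expression Language: the function name provider_prefix and the explicit fallback step are assumptions added for illustration. In this Python version the regex alone leaves an empty string when the prefix is not 021 or 085, so the NOT_VALID value has to be supplied separately.

```python
import re

# Same pattern as in the expression above: an optional known prefix, then the rest.
PROVIDER = re.compile(r'^(021|085)?.*')

def provider_prefix(mobile, fallback="NOT_VALID"):
    """Return the 3-digit prefix if it is a known one, otherwise the fallback."""
    # count=1 mirrors replaceFirst; an unmatched group is replaced by "".
    prefix = PROVIDER.sub(lambda m: m.group(1) or "", mobile, count=1)
    return prefix or fallback

for number in ["02146477474", "08585377474", "07646474637"]:
    print(number, provider_prefix(number))
```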

How to linebreak in REGEX with brackets

I've got a list of data that looks exactly like this:
51.9499, 7.555780000000027; 51.49705, 9.389030000000048; 51.249182, 6.991165099999989; 47.3163508, 11.09513949999996; 51.33424979999999, 12.574196000000029; 50.0297493, 19.196331099999952; 47.8270212, 16.25014150000004;
and I want to beautify it a bit by having linebreaks behind the "; " so it would rather look like
51.9499, 7.555780000000027;
51.49705, 9.389030000000048;
51.249182, 6.991165099999989;
...
I am using Adobe Brackets and I am trying to put a hard linebreak into the replace dialogue but that doesn't work - what would instead work?
In the replace bar, click on the .* icon to change the replace method to regular expression.
Then you can use ;\s* replace with ;\n to beautify your code accordingly.
You can see an example of this regex being run here.
Outputs the following:
51.9499, 7.555780000000027;
51.49705, 9.389030000000048;
51.249182, 6.991165099999989;
47.3163508, 11.09513949999996;
51.33424979999999, 12.574196000000029;
50.0297493, 19.196331099999952;
47.8270212, 16.25014150000004;
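
The same substitution can be tried outside Brackets. A minimal Python sketch follows, with the function name break_after_semicolons made up for illustration:

```python
import re

def break_after_semicolons(data):
    """Replace each ';' plus any following whitespace with ';' and a newline."""
    return re.sub(r';\s*', ';\n', data)

sample = "51.9499, 7.555780000000027; 51.49705, 9.389030000000048; "
print(break_after_semicolons(sample), end="")
```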

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing . and ) characters from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script may look something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# Print each URL found in the sample text
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
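
To make the two points above concrete, here is a short Python sketch. It shows that $-_ inside a character class is a range that covers /, and that the simplified pattern still swallows a trailing ). sequence. The find_urls helper and its rstrip step are an assumption added here as one possible workaround (it would also strip a . or ) that legitimately ends a URL), not the RFC-compliant solution mentioned above.

```python
import re

# "$-_" inside a class is a range from "$" (0x24) to "_" (0x5F), so it covers "/".
RANGE_MATCHES_SLASH = re.fullmatch(r'[$-_]', '/') is not None    # True
LITERAL_MATCHES_SLASH = re.fullmatch(r'[$_-]', '/') is not None  # False

# Simplified pattern from above, with "/" added to the character class.
URL = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')

def find_urls(text, trailing=".)"):
    """Find URLs, then strip trailing '.' and ')' characters from each match."""
    return [m.group(0).rstrip(trailing) for m in URL.finditer(text)]

sample = ("see http://rda.ucar.edu/datasets/ds117.0/. and "
          "http://www.discover-earth.org/index.html).")
print(URL.findall(sample))  # raw matches still end in "." or ")."
print(find_urls(sample))    # trailing punctuation removed
```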
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
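des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Keep everything before the last "/"; this discards the final segment,
    # including any trailing "." or ")." characters
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)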