PowerShell regex to match between vertical bars ( | )

Below are two lines of the text that I am trying to match:
6 |UDP |ENABLED | |15006 |010.247.060.120 | UDP/IP Communications | UDP/IP Communications GH1870
10 |Gway |ONLINE | |41794 |127.000.000.001 | DM-MD64x64 | DM-MD64x64
Below is the regex I have so far, but it only matches the bottom line:
(?i)(?<cipid>([\w\.]+))\s*\|\s*(?<ty>\w+)?\s*\|\s*(?<stat>[\w ]+)\s*\|\s*(?<devid>\w+)?\s*\|\s*(?<prt>\d+)\s*\|\s*(?<ip>([\d\.]+))\s*\|\s*(?<mdl>[\w-]+)\s*\|\s*(?<desc>.+)
I was wondering if I could have a regular expression that just matches every character between every vertical line, instead of having to explicitly say what is between the vertical lines
Thanks all

This usually works. (?:^|(?<=\|))[^|]*?(?=\||$)
https://regex101.com/r/KMNc47/1
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
[^|]*? # Optional non-pipe chars
(?= \| | $ ) # Pipe ahead or EOS
Here it is with whitespace trim and includes a capture group.
(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)
https://regex101.com/r/KMNc47/2
Formatted
(?: ^ | (?<= \| ) ) # BOS or Pipe behind
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
Here it is in a Capture Collection configuration.
(?:(?:^|\|)\s*([^|]*?)\s*(?=\||$))+
https://regex101.com/r/KMNc47/3
Formatted
(?:
(?: ^ | \| ) # BOS or Pipe
\s*
( [^|]*? ) # (1), Optional non-pipe chars
\s*
(?= \| | $ ) # Pipe ahead or EOS
)+
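As a rough illustration, the trimmed capture-group version above can be tried on the two sample lines. This is only a sketch in Python's re module rather than PowerShell (the pattern itself is unchanged); the names FIELD and split_fields are made up for the example.

```python
import re

# Pattern from the answer above: trims whitespace and captures each field.
FIELD = re.compile(r"(?:^|(?<=\|))\s*([^|]*?)\s*(?=\||$)")

def split_fields(line):
    """Return the whitespace-trimmed fields of one pipe-delimited line."""
    return FIELD.findall(line)

lines = [
    "6 |UDP |ENABLED | |15006 |010.247.060.120 | UDP/IP Communications | UDP/IP Communications GH1870",
    "10 |Gway |ONLINE | |41794 |127.000.000.001 | DM-MD64x64 | DM-MD64x64",
]
rows = [split_fields(line) for line in lines]
for row in rows:
    print(row)
```

Both lines yield eight fields, with the empty fourth column returned as an empty string. For this sample the result is the same as splitting on "|" and stripping each piece.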

Related

Regex skip in C++

This is my string:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
# Block3 { }
Block4 {
anything here
}
I am using this regex to get each block name and inside content.
regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);
But this regex also matches the blocks that are inside comments. PHP (PCRE) has a “skip” technique that you can use to skip everything inside comments:
What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match
But this is C++, and I cannot use this skip method. The regex detects Block1, Block2, Block3 and Block4, but I want to skip Block1, Block2 and Block3 (which are inside comments) and get only Block4. How should I edit my regex so that it matches only what is outside the comments?
Since you requested this long regex, here it is.
This will not handle nested blocks like block{ block{ } };
there it would match only block{ block{ }.
Since you specified that you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion if you were to use
PCRE, Perl, or even Boost.Regex. Let me know if you'd want to see that.
As it is, it's flawed, but it works for your sample.
Another thing it won't do is parse preprocessor directives ('#...'), because
I forgot the rules for that (I thought I did it recently, but can't find a record).
To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].matched) etc. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to advance the match position.
The code is long and redundant because there are no function calls (recursion) in C++11 ECMAScript regexes. Like I said, use boost::regex or something.
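The idea of matching comments first and keeping only capture group 1 can be sketched on a much smaller scale. The following is a simplified Python illustration, not the C++ regex from this answer: it ignores quoted strings and line continuations, and it treats '#' lines as comments, which is an assumption based on the question's sample. TOKEN and find_blocks are hypothetical names.

```python
import re

# Alternatives are tried left to right, so comments are consumed first and
# any "Name { ... }" text inside them never reaches the capturing branch.
TOKEN = re.compile(
    r"""
      /\*.*?\*/             # /* ... */ comment
    | //[^\n]*              # // comment
    | \#[^\n]*              # '#' line, treated as a comment (assumption)
    | (\w+\s*\{[^}]*\})     # (1) a block outside comments
    """,
    re.S | re.X,
)

def find_blocks(src):
    """Return the text of each block that is not inside a comment."""
    return [m.group(1) for m in TOKEN.finditer(src) if m.group(1) is not None]

sample = """/*
Block1 {
anythinghere
}
*/
// Block2 { }
# Block3 { }
Block4 {
anything here
}
"""
blocks = find_blocks(sample)
print(blocks)
```

On the question's sample only Block4 ends up in group 1; the other matches exist purely to move the search position past the comments.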
Benchmark
Sample:
/*
Block1 {
anythinghere
}
*/
// Block2 { }
Block4 {
// CommentedBlock{ asdfasdf }
anyth"}"ing here
}
Block5 {
/* CommentedBlock{ asdfasdf }
anyth}"ing here
*/
}
Results:
Regex1: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 8
Elapsed Time: 1.95 s, 1947.26 ms, 1947261 µs
Regex Explained:
# Raw: (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
# Stringed: "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
( # (1 start), BLOCK
\w+ \s* \{
####################
(?: # ------------------------
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ------------------------
#####################
\}
) # (1 end), BLOCK
| # OR,
[\S\s] # Any other char
(?: # -------------------------
(?! # ASSERT: Here, cannot be a BLOCK{ }
\w+ \s* \{
(?: # ==============================
(?: # Comments inside a block
/\*
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/
|
//
(?: [^\\] | \\ \n? )*?
\n
)
|
(?: # Non - comments inside a block
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ [\S\s] [^'\\]* )*
'
|
(?! \} )
[\S\s]
[^}/"'\\]*
)
)* # ==============================
\}
) # ASSERT End
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini C++ parser to filter out comments. The answer to this related question might point you in the right direction.
Regex can be used to process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin/end quotes, and comments that can contain symbols.
If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context free grammars. If you are really curious, enroll in a Formal Language Theory class.
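To make the nesting point concrete, here is a minimal sketch of the counting that a parser does and a classic regular expression cannot: it tracks brace depth to find the matching close brace. It deliberately ignores comments and string literals, and find_matching_brace is a hypothetical helper, not part of any library.

```python
def find_matching_brace(text, open_index):
    """Return the index of the '}' that closes the '{' at open_index, or -1."""
    depth = 0
    for i in range(open_index, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1          # entering a nested block
        elif ch == "}":
            depth -= 1          # leaving a block
            if depth == 0:
                return i        # back at the level we started from
    return -1                   # unbalanced input

text = "block{ block{ } }"
close = find_matching_brace(text, text.index("{"))
print(text[: close + 1])
```

The depth counter is the piece of state that grows with the input, which is why arbitrary nesting needs more than a finite automaton.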

How to recursively match strings balanced with multi-character delimiters?

How can I recursively match a string balanced with multi-character delimiters?
Consider a LaTeX inline quotation such that 2 doubleticks (``) mark up where the quote begins, and 2 apostrophes (\x27\x27) where it ends.
The following code gives me ``five''. I want to capture two ``three `four' ``five'' three four'' six
my $str = q|one ``two ``three `four' ``five'' three four'' six'' seven|;
if ( $str =~ /
(
``
(?:
[^`']
|
(?1)
)*
''
)
/x
)
{
print "$1\n";
}
I guess it has to do with how to negate not a character class ([^`']) but multi-character strings.
(?:(?!PAT)(?s:.))* is to PAT as [^CHAR]* is to CHAR, so
(?:(?!``|'')(?s:.))*
matches any character that isn't the start of one of those two sequences. However, I think lookaheads are a little expensive, so I believe
(?: [^`']+ | `(?!`) | '(?!') )*
would be cheaper. We get the following:
/
(
``
(
(?: [^`']+ | `(?!`) | '(?!') )*
(?:
(?-2)
(?: [^`']+ | `(?!`) | '(?!') )*
)*
)
''
)
/x
We can simplify for a small performance drop.
/
(
``
(
(?: [^`']+
| `(?!`)
| '(?!')
| (?-2)
)*
)
''
)
/x
In both snippets, the text you want to capture is in $2.
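For comparison, the same balanced span can be found without regex recursion. Python's standard re module has no (?1)-style recursion, so this sketch swaps in a different technique: it scans for the two delimiters and keeps a depth counter. outermost_quote is a made-up name, and the approach assumes the delimiters never overlap (for example, no run of three backticks).

```python
import re

def outermost_quote(text, open_tok="``", close_tok="''"):
    """Return the contents of the first balanced open_tok ... close_tok span."""
    depth = 0
    start = None
    delimiters = re.escape(open_tok) + "|" + re.escape(close_tok)
    for m in re.finditer(delimiters, text):
        if m.group() == open_tok:
            if depth == 0:
                start = m.end()      # contents begin after the opener
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:m.start()]
    return None                      # no balanced span found

s = "one ``two ``three `four' ``five'' three four'' six'' seven"
inner = outermost_quote(s)
print(inner)
```

Single backticks and single apostrophes are ignored because the delimiter pattern only matches the doubled forms, which plays the same role as the `(?!`) and '(?!') branches in the Perl patterns.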

Deleting multiple lines from recursive pcregrep search

In the header of every file in my project I have a block of comments that were generated by VCC with revision history for the file. We have moved away from VCC and no longer want to have the revision history in the file since it is obsolete.
I currently have a pcregrep search that returns the exact results I'm looking for:
pcregrep -rM '(\$Rev)(?s)(.+?(?=\*\*))' *
I have tried piping the results into xargs sed, along with many other attempts to remove all the returned lines from the files, but I get various errors, including "File name too long".
I want to delete the entire block.
Since you are talking about C++ files, you can't just find comments,
you have to parse comments because literal strings could contain comment
delimiters.
This has been done before, no use re-inventing the wheel.
A simple grep is not going to cut it. You need a simple macro or a C# console app
that has better capabilities.
If you want to go this route, below is a regex for you.
Each match will match either group 1 (a comment block) or group 2 (non-comments).
You need to either build a new string by appending the result of each match,
or use a callback function to do the replacement.
Each time it matches group 2, just append that (or return it, if using a callback) unchanged.
When it matches group 1, you have to run another regular expression on the
contents to see if the comment block contains the revision information.
If it does contain it, don't append its contents (or return "" if using a callback).
If it doesn't, just append it unchanged.
So, it's a two-step process.
Pseudo-code:
// Read the source string in from the file.
string strSrc = ReadFile( name );
string strNew = "";
Matcher CmtMatch, RevMatch;
while ( GloballyFind( CommentRegex, strSrc, CmtMatch ) )
{
    if ( CmtMatch.matched(1) )
    {
        string strComment = CmtMatch.value(1);
        // Drop comment blocks that contain the revision information.
        if ( FindFirst( RevisionRegex, strComment, RevMatch ) )
            continue;
        else
            strNew += strComment;
    }
    else
        strNew += CmtMatch.value(2);
}
// Write out the new string here.
The same could be done via a ReplaceAll() using a callback function, if
using a Macro language. The logic goes in the callback.
It's not as hard as it looks, but if you want to do it right, I'd do it this way.
And then, hey, you've got a nifty utility to be used again.
Here is the regex expanded, formatted and compressed.
(constructed with RegexFormat 6 (Unicode))
# raw: ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)
# delimited: /((?:(?:^\h*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:\h*\n(?=\h*(?:\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|\/\*|\/\/))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\\s]*)/
# Dbl-quoted: "((?:(?:^\\h*)?(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/(?:\\h*\\n(?=\\h*(?:\\n|/\\*|//)))?|//(?:[^\\\\]|\\\\\\n?)*?(?:\\n(?=\\h*(?:\\n|/\\*|//))|(?=\\n))))+)|(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[\\S\\s][^/\"'\\\\\\s]*)"
# Sing-quoted: '((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\\]|\\\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\\[\S\s]|[^"\\\])*"|\'(?:\\\[\S\s]|[^\'\\\])*\'|[\S\s][^/"\'\\\\\s]*)'
( # (1 start), Comments
(?:
(?: ^ \h* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
\h* \n
(?=
\h*
(?: \n | /\* | // )
)
)?
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
(?: # End // comment
\n
(?= # <- To preserve formatting
\h*
(?: \n | /\* | // )
)
| (?= \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
In case you want something simpler:
This is the same regex without multiple-comment-block capture or format preservation. The same grouping and replacement principle applies.
# Raw: (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
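The two-step process can be sketched with a replace callback. This is a minimal Python illustration that uses the simpler two-group regex above; the revision test (a literal $Rev, borrowed from the question's pcregrep pattern) and the names COMMENT_OR_CODE, REVISION and strip_revision_comments are assumptions made for the example.

```python
import re

# Group 1: a comment. Group 2: a quoted string or a run of other text.
COMMENT_OR_CODE = re.compile(
    r'(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)'
    r'|("(?:\\[\S\s]|[^"\\])*"|\'(?:\\[\S\s]|[^\'\\])*\'|[\S\s][^/"\'\\]*)'
)
REVISION = re.compile(r"\$Rev")  # assumed marker of a revision-history block

def strip_revision_comments(src):
    """Remove comments that contain the revision marker; keep everything else."""
    def replace(m):
        comment = m.group(1)
        if comment is None:
            return m.group(2)                 # non-comment text: unchanged
        return "" if REVISION.search(comment) else comment
    return COMMENT_OR_CODE.sub(replace, src)

src = (
    "/* $Rev: 12 $\n"
    "** history\n"
    "*/\n"
    "int x = 1; // keep\n"
    'const char *s = "/* $Rev */";\n'
)
cleaned = strip_revision_comments(src)
print(cleaned)
```

The header comment is dropped, the ordinary // comment survives, and the string literal that merely looks like a revision comment is left alone because it is matched by group 2.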

Splitting string having special characters, words, numbers and URL

I have a .txt file which contains:
"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!."
After parsing, the expected output should be:
['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
How do I split this string to get that output using the re module?
I came up with this pattern:
pattern = re.compile(r"\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")
but this also splits my URL.
Can anyone please help?
Just pick a URL regex from somewhere and make it the first alternative.
An example only:
# (?!mailto:)(?:(?:https?|ftp)://)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:/[^\s]*)?|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]
(?! mailto: )
(?:
(?: https? | ftp )
://
)?
(?:
\S+
(?: : \S* )?
@
)?
(?:
(?:
(?:
[1-9] \d?
| 1 \d\d
| 2 [01] \d
| 22 [0-3]
)
(?:
\.
(?: 1? \d{1,2} | 2 [0-4] \d | 25 [0-5] )
){2}
(?:
\.
(?:
[1-9] \d?
| 1 \d\d
| 2 [0-4] \d
| 25 [0-4]
)
)
| (?:
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)
(?:
\.
(?: [a-z\u00a1-\uffff0-9]+ -? )*
[a-z\u00a1-\uffff0-9]+
)*
(?:
\.
(?: [a-z\u00a1-\uffff]{2,} )
)
)
| localhost
)
(?: : \d{2,5} )?
(?: / [^\s]* )?
| \d+
| [a-zA-Z]+ [a-zA-Z']*
| [^\w\s]
Outputs:
['"',"'",'the','url','address','i','checked','is',':','https://www.google.com/','for','2','times','and',"it's",'awesome','!','.','"']
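A minimal sketch of the same ordering idea, assuming a deliberately crude URL alternative (https?://\S+) in place of the long one above. That substitute is an assumption for brevity: it only recognizes http and https URLs and would swallow any punctuation directly attached to a URL.

```python
import re

# The URL alternative comes first so that it wins before the word,
# number and punctuation alternatives get a chance to split it.
TOKEN = re.compile(r"https?://\S+|\d+|[a-zA-Z]+[a-zA-Z']*|[^\w\s]")

text = "\"'the url address i checked is: https://www.google.com/ for 2times and it's awesome!.\""
tokens = TOKEN.findall(text)
print(tokens)
```

Because alternation is ordered, the position of the URL branch matters: moving it after the word branch would split the URL again.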

Regular Expression for a single occurrence within a String

I am new to Regular Expression and can't seem to do the proper syntax for what I need to do. I need regular expression for an alphanumeric string that can be 1-8 characters long and can contain at most 1 dash, but can't be a single dash alone.
Valid:
A-
-A
1234-678
ABC76-
Invalid:
-
F-1-
ABCD1234-
---
Thanks in advance!
One way. (Sorry if this is already posted)
# ^(?=[a-zA-Z0-9-]{1,8}$)(?=[^-]*-?[^-]*$)(?!-$).*$
^ # BOL
(?= [a-zA-Z0-9-]{1,8} $ ) # 1 - 8 alpha-num or dash
(?= [^-]* -? [^-]* $ ) # at most 1 dash
(?! - $ ) # not just a dash
.* $
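A quick way to check the pattern against the question's examples is a small test loop. The sketch below uses Python's re module as one possible engine (the question does not name one); PATTERN and is_valid are names made up for the example.

```python
import re

# The three lookaheads check length/charset, dash count, and "not a lone dash".
PATTERN = re.compile(r"^(?=[a-zA-Z0-9-]{1,8}$)(?=[^-]*-?[^-]*$)(?!-$).*$")

def is_valid(s):
    """True if s is 1-8 alphanumerics/dashes, has at most one dash, and is not just '-'."""
    return PATTERN.match(s) is not None

valid = ["A-", "-A", "1234-678", "ABC76-"]
invalid = ["-", "F-1-", "ABCD1234-", "---"]

for s in valid + invalid:
    print(repr(s), is_valid(s))
```

Each lookahead starts from the beginning of the string, so the three conditions are checked independently before .* consumes the input.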
Edit: Just extend it for segments separated by commas
# ^(?!,)(?:(?=(?:^|,)[a-zA-Z0-9-]{1,8}(?:$|,))(?=(?:^|,)[^-]*-?[^-]*(?:$|,))(?!(?:^|,)-(?:$|,)),?[^,]*)+(?<!,)$
^ # BOL
(?! , ) # does not start with comma
(?: # Grouping
(?=
(?: ^ | , )
[a-zA-Z0-9-]{1,8} # 1 - 8 alpha-num or dash
(?: $ | , )
)
(?=
(?: ^ | , )
[^-]* -? [^-]* # at most 1 dash
(?: $ | , )
)
(?!
(?: ^ | , )
- # not just a dash
(?: $ | , )
)
,? [^,]* # consume the segment
)+ # Grouping, do many times
(?<! , ) # does not end with comma
$ # EOL
Edit2: If your engine doesn't support lookbehinds, this is the same thing without one
# ^(?!,)(?:(?=(?:^|,)[a-zA-Z0-9-]{1,8}(?:$|,))(?=(?:^|,)[^-]*-?[^-]*(?:$|,))(?!(?:^|,)-(?:$|,))(?!,$),?[^,]*)+$
^ # BOL
(?! , ) # does not start with comma
(?: # Grouping
(?=
(?: ^ | , )
[a-zA-Z0-9-]{1,8} # 1 - 8 alpha-num or dash
(?: $ | , )
)
(?=
(?: ^ | , )
[^-]* -? [^-]* # at most 1 dash
(?: $ | , )
)
(?!
(?: ^ | , )
- # not just a dash
(?: $ | , )
)
(?! , $ ) # does not end with comma
,? [^,]* # consume the segment
)+ # End Grouping, do many times
$ # EOL
Try this regex:
/^(?!([^-]*-){2})[a-zA-Z0-9-]{1,8}$/
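^ and $ match the start and end of the string.
(?!([^-]*-){2}) is a negative lookahead that makes sure the string contains at most one hyphen.
[a-zA-Z0-9-]{1,8} matches 1 to 8 alphanumeric characters or hyphens. Note that this pattern still accepts a lone "-"; adding (?!-$) right after ^ rules that out.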
Reference: http://regular-expressions.info