Regex that matches table input

I have this kind of input:
||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country||
|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |
I am looking for two regexes. The first should match only the cells inside the header row ||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country|| when given the whole table as input, so it must not match anything in |1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |. For the second, I could in theory split the rest by newlines and by |.
I have tried something like [^\|\|]+(?=\|\|). Is this a good solution?

You can't negate a sequence of characters with a negated character class, only individual chars.
I suggest using a regex that will extract any chunks of chars other than | between double ||:
(?<=\|\|)[^|]+(?=\|\|)
See the regex demo.
Details
(?<=\|\|) - two | chars must be present immediately on the left
[^|]+ - 1+ chars other than |
(?=\|\|) - two | chars must be present immediately on the right.
If you ever need to make sure there are exactly two pipes on each side, and not match when there are three or more, you will need to make the pattern more precise: (?<=(?<!\|)\|\|)[^|]+(?=\|\|(?!\|)).
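As a minimal sketch of how the lookaround pattern above behaves, here it is in Python's re module. The choice of Python and the names table, header_cell, headers and rows are assumptions for illustration; the question does not name a regex flavor. The data rows are handled with a plain split, as the asker suggested, rather than with a second regex.

```python
import re

table = (
    "||ID||Part Number||Product Name||Serial Number||Status||Dunning Status"
    "||Commitment End||Address||Country||\n"
    "|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |"
)

# Runs of non-pipe characters with a double pipe on both sides.
header_cell = re.compile(r"(?<=\|\|)[^|]+(?=\|\|)")
headers = header_cell.findall(table)

# Data rows: lines that do not start with a double pipe.
# Drop the outer pipes, then split on the single pipes in between.
rows = [
    line[1:-1].split("|")
    for line in table.splitlines()
    if line and not line.startswith("||")
]

print(headers)
print(rows)
```

The lookarounds do not consume the pipes, so each || can serve as the right boundary of one cell and the left boundary of the next.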


Can you match a single character only that's within parenthesis for replacement using regex?

I have a weird case where the only real tool I have to use is Notepad++ without some heavy lifting, and I have a | delimited text file that has |s in the text that I need to remove.
Each | that I need to remove falls within parenthesis, so the text patterns look like this:
(123 | 456) (11.1 | 11.2)
...and so on.
My ideal result would be removing the |s contained within ()s and replacing with a -, so:
(123 - 456) (11.1 - 11.2)
So far I have:
\(.*\|.*\)
That matches each set of parenthesis that contains a | reliably, but I can't figure out a way to just match the | itself for replacement. Any ideas?
With your shown samples, please try the following regex in Notepad++:
find what: ([^|]*)\|([^)]*\))
Replace with: $1-$2
See the online demo for the above regex.
Explanation of the regex:
([^|]*) ## 1st capturing group: everything up to the next |.
\| ## A literal |.
([^)]*\)) ## 2nd capturing group: everything up to and including the next ).
You can use
(\([^()|]*)\|(?=[^()]*\))
and replace with $1-. Details:
(\([^()|]*) - Group 1: ( char and then zero or more chars other than (, ) and |
\| - a | char
(?=[^()]*\)) - there must be zero or more chars other than ( and ) and then a ) char immediately to the right of the current location
See the regex demo.
If you have multiple pipes (like in (123 | 456 | 23) (11.1 | 11.2 | 788 | 6896)):
(\G(?!^)|\()([^()|]*)\|(?=[^()]*\))
In this case, replace with $1$2-. See the regex demo. This pattern also works in some other common text editors, which is why I did not suggest a pattern with \K (see this regex demo).
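For readers who want to try the idea outside Notepad++, here is a hedged Python sketch. The single-pipe pattern is the one given above. Python's re module has no \G, so for the multi-pipe case the sketch swaps in a different technique: it matches each parenthesized group and replaces the pipes inside it with a callback. The function names are made up for illustration.

```python
import re

# Pattern from the answer: "(" plus non-paren, non-pipe chars, then a pipe
# that is followed by non-paren chars and a closing ")".
single = re.compile(r"(\([^()|]*)\|(?=[^()]*\))")

def replace_first_pipe(text):
    """Replace the first pipe inside each (...) group with a hyphen."""
    return single.sub(r"\1-", text)

# Alternative for several pipes per group (not the \G pattern):
# match every (...) group and rewrite the pipes inside it.
paren_group = re.compile(r"\([^()]*\)")

def replace_all_pipes(text):
    """Replace every pipe inside each (...) group with a hyphen."""
    return paren_group.sub(lambda m: m.group(0).replace("|", "-"), text)

print(replace_first_pipe("(123 | 456) (11.1 | 11.2)"))
print(replace_all_pipes("x|(123 | 456 | 23)|y"))
```

Pipes outside parentheses, which act as the real field delimiters, are left alone by both functions.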
I just tested this pattern. It is somewhat safer to use, but a little longer:
Find: (\(\d+[. \d]*)[|](?=[ \d.]*\))
Replace All: $1-

Golang: how to split a string into a slice containing the delimiters using a regular expression

I want to split a string into a slice but keep the delimiters. Is there a way to do that in Go?
For example
Input:
"Hello! It's, a? beautiful$ day* (today and tomorrow)."
Output:
[Hello | ! | It's | , | a | ? | beautiful | $ | day | * | ( | today | and | tomorrow | ) | . ]
where | represents the separation of elements.
Can anyone help?
You can do that by creating a regular expression that matches either a word or one of your special characters. I don't know exactly what your rules are, but given the input and desired output, this works:
[A-Za-z']+|[*?()$.,!]
You can then use FindAllString to return the individual strings.
See https://play.golang.org/p/se6B74G_Fv.
The idea of the regular expression is to match either a word, here defined as a sequence of one or more upper or lower case letters or apostrophes, or a "special character", defined as any of the non-word characters in your sample input. The regular expression contains two parts separated by a |, meaning that either alternative can match. A set of characters contained within square brackets is called a character class, which means that any of the characters can match (and a-z means all characters from a to z). Since the left alternative is a character class followed by a +, one or more of the characters in the class will match. The portion on the right only matches one character. You may need to adjust this to behave exactly the way you want. For more information on Go regular expressions, see https://github.com/google/re2/wiki/Syntax.
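The same "match the tokens instead of splitting" idea can be sketched in Python, used here only to show what the pattern returns. In Go the corresponding call is FindAllString with n = -1, as the answer says. The variable names are illustrative.

```python
import re

text = "Hello! It's, a? beautiful$ day* (today and tomorrow)."

# Either a word (letters and apostrophes) or one of the listed special chars.
token = re.compile(r"[A-Za-z']+|[*?()$.,!]")

# findall returns every non-overlapping match; whitespace matches neither
# alternative, so it is skipped.
tokens = token.findall(text)
print(tokens)
```

Because the pattern lists the special characters explicitly, any character that is in neither alternative (a digit, for example) is silently dropped, so the class may need extending for other inputs.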

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ascii but exclude certain characters such as "+" or "%" - I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!
Try this:
^(?:(?![%=+])[\x00-\x7F])+$
Demo.
This will loop through, make sure that the "bad" characters aren't there with a negative lookahead, then match the "good" characters, then repeat.
You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
RegEx Demo
(?![%=+]) is a negative lookahead that asserts that the next character is not one of %, = or +.
You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+------+-----+-----+-----+
| Char | Dec | Oct | Hex |
+------+-----+-----+-----+
| %    | 37  | 45  | 25  |
| +    | 43  | 53  | 2B  |
| =    | 61  | 75  | 3D  |
+------+-----+-----+-----+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
DEMO
Engine-wise this is more efficient than attempting an assertion for each character.
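A short Python sketch can make the difference between the three patterns concrete. The original pattern needs two characters because [\x00-\x7F] consumes one and [^%=+]+ needs at least one more. It also leaves the first character unchecked for %, = and +, and lets non-ASCII characters through in the tail. The helper is_valid is a made-up name for illustration.

```python
import re

original = re.compile(r"^[\x00-\x7F][^%=+]+$")             # from the question
lookahead = re.compile(r"^(?:(?![%=+])[\x00-\x7F])+$")     # lookahead answer
explicit = re.compile(r"^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$")  # range answer

def is_valid(pattern, text):
    """True if the pattern matches the text."""
    return pattern.match(text) is not None

for sample in ["a", "ab", "%b", "a+b", "aé"]:
    print(repr(sample),
          is_valid(original, sample),
          is_valid(lookahead, sample),
          is_valid(explicit, sample))
```

The lookahead and explicit-range patterns accept exactly the same single characters; they differ only in how the engine arrives at the answer.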

Can't get a specific regex to work in Perl

I have a string formatted like:
project-version-project_test-type-other_info-other_info.file_type
I can strip most of the information I need out of this string in most cases. My trouble arises when my version has an extra qualifying character in it (i.e. normally 5 characters but sometimes a 6th is added).
Previously, I was using substrings to remove the excess information and get the 'project_test-type' however, now I need to switch to a regex (mostly to handle that extra version character). I could keep using substrings and change the length depending on whether I have that extra version character or not but a regex seems more appropriate here.
I tried using patterns like:
my ($type) = $_ =~ /.*-.*-(.*)-.*/;
But the extra '-' in the 'project_test-type' means I can't simply space my regex using that character.
What regex can I use to get the 'project_test-type' out of my string?
More information:
As a more human readable example, the information is grouped in the following way:
project - version - project_test-type - other_info - other_info . file_type
'project' is a simple string of chars
'version' is normally a string of 5 integers, but is sometimes followed by a char, i.e. 11111 is normal and 11111A is the rarer occurrence.
'project_test-type' is a specific test associated with a project that can have both '_' and '-' in its otherwise char name
Both cases of 'other_info' are additional bits of information for the system like an IP address or another version number. The first has no fixed length while the second is always 10 characters long
Since no field other than the desired one can contain -, any extra - belongs to the desired field.
      +--------------------------- project
      |     +--------------------- version
      |     |    +---------------- project_test-type
      |     |    |     +---------- other_info
      |     |    |     |     +---- other_info.file_type
      |     |    |     |     |
  ____| ____|   _| ____| ____|
/^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/
[^-] matches a character that's not a -.
[^-]* matches zero or more characters that aren't -.
To match everything:
/^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/
[] defines character sets and ^ at the beginning of a set means "NOT". Also a - in a set usually means a range, unless it is at the beginning or end. So [^-]+ consumes as many non-dash characters as possible (at least one).
You can use
/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/
Explanation:
\w+\s*- ==> match a sequence of word characters followed by any amount of whitespace and a -
\d{5}[a-zA-Z]? ==> exactly 5 digits, optionally followed by one letter
(.*?) ==> match everything in a non-greedy way
(?=\s*-\s*\d) ==> look ahead for a - followed by a digit and stop there (since the IP address starts with a digit)
Demo and Explanation
Greedy/non-greedy approach
($type) = /.*?-.*?-(.*)-.*-.*/;
.*? is a non-greedy match, meaning match any number of any character, but no more than necessary to match the regular expression. Using .* between the second and third dashes is a greedy match, matching as many characters as possible while still matching the regular expression, and using this will capture words with any extra dashes in them.

Need Regex Pattern: Can't start w num; No special characters except underscore and hyphen; allows characters/nums

Closest I've gotten: ^[-_[a-zA-Z0-9]*$
That still allows the string to start with numbers. Apologies for asking such a question when there are resources everywhere. I just need something fast and have trouble figuring out regex.
Valid input examples: Account-Numbers_2010 | NewMoney | test_data | a1B2-c3_d4_5e-6f
Invalid input examples: 2010_Account_Numbers | New$Money | %test*data | 1aB2
This should do it:
"^[A-Za-z_-][A-Za-z0-9_-]*$"
[A-Za-z_-] means a letter or underscore or hyphen
[A-Za-z0-9_-]* is the same, but allows numbers too
So this will allow letters, underscores, hyphens, and numbers, but no numbers at the start.
Looking at your valid input example Account-Numbers_2010 | NewMoney | test_data | a1B2-c3_d4_5e-6f, you may want to also allow spaces and |. This one allows them:
"^[A-Za-z_ |-][A-Za-z0-9_ |-]*$"
This one correctly matches Account-Numbers_2010 | NewMoney | test_data | a1B2-c3_d4_5e-6f and not 2010_Account_Numbers | New$Money | %test*data | 1aB2.
You need 2 parts to the regex. The first character, and then the rest.
^[a-zA-Z_-][a-zA-Z0-9_-]*$
This says:
Start with any character from a-z or A-Z or _ or -. And then follow that by any alphanumeric character or _ or -.
I hope this helps
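A quick way to check the two-part pattern against the examples from the question is a small Python sketch. The names NAME and is_valid_name are made up for illustration, and Python is an assumption, since the question does not say which regex flavor is in use.

```python
import re

# First character: letter, underscore or hyphen.
# Remaining characters: the same set plus digits.
NAME = re.compile(r"^[A-Za-z_-][A-Za-z0-9_-]*$")

def is_valid_name(text):
    """True if the text is a valid name under the rules in the question."""
    return NAME.match(text) is not None

valid = ["Account-Numbers_2010", "NewMoney", "test_data", "a1B2-c3_d4_5e-6f"]
invalid = ["2010_Account_Numbers", "New$Money", "%test*data", "1aB2"]

for s in valid + invalid:
    print(s, is_valid_name(s))
```

In Python, $ also matches just before a trailing newline, so re.fullmatch may be a better fit when the input can end with one.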
^[a-zA-Z][a-zA-Z0-9_-]*$