Can't get a specific regex to work in Perl


I have a string formatted like:
project-version-project_test-type-other_info-other_info.file_type
I can strip most of the information I need out of this string in most cases. My trouble arises when my version has an extra qualifying character in it (i.e. normally 5 characters but sometimes a 6th is added).
Previously, I was using substrings to remove the excess information and get the 'project_test-type'; however, I now need to switch to a regex (mostly to handle that extra version character). I could keep using substrings and change the length depending on whether I have that extra version character, but a regex seems more appropriate here.
I tried using patterns like:
my ($type) = $_ =~ /.*-.*-(.*)-.*/;
But the extra '-' in the 'project_test-type' means I can't simply delimit the fields in my regex with that character.
What regex can I use to get the 'project_test-type' out of my string?
More information:
As a more human readable example, the information is grouped in the following way:
project - version - project_test-type - other_info - other_info . file_type
'project' is a simple string of chars
'version' is normally a string of 5 digits, but is sometimes followed by a char, i.e. 11111 is normal and 11111A is the rarer occurrence.
'project_test-type' is a specific test associated with a project that can have both '_' and '-' in its otherwise char name
Both cases of 'other_info' are additional bits of information for the system like an IP address or another version number. The first has no fixed length while the second is always 10 characters long

Since no field other than the desired one can contain -, any extra - belongs to the desired field.
      +---------------------------- project
      |     +---------------------- version
      |     |    +----------------- project_test-type
      |     |    |     +----------- other_info
      |     |    |     |     +----- other_info.file_type
      |     |    |     |     |
  ____| ____|   _| ____| ____|
/^[^-]*-[^-]*-(.*)-[^-]*-[^-]*\z/
[^-] matches a character that's not a -.
[^-]* matches zero or more characters that aren't -.

To match everything:
/^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$/
[] defines character sets and ^ at the beginning of a set means "NOT". Also a - in a set usually means a range, unless it is at the beginning or end. So [^-]+ consumes as many non-dash characters as possible (at least one).
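For anyone who wants to try the "match everything" pattern outside Perl, here is a minimal sketch in Python. The file names are made up for illustration (the question does not give a real one), and parse_name is just a helper name chosen for this example.

```python
import re

# Same pattern as above; the six groups follow the field order in the question.
FIELDS = re.compile(r"^([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)\.([a-zA-Z0-9]+)$")

def parse_name(name):
    """Return the six fields as a tuple, or None if the name does not match."""
    m = FIELDS.match(name)
    return m.groups() if m else None

# Hypothetical file names: one with a 5-digit version, one with the extra letter.
plain = parse_name("proj-11111-proj_test-type-10.0.0.1-0123456789.txt")
extra = parse_name("proj-11111A-proj_test-type-10.0.0.1-0123456789.txt")

print(plain[2])  # third group is the project_test-type field
print(extra[1], extra[2])
```

Because only the third group uses .+, it is the only field allowed to absorb extra dashes; every other field stops at the first - it meets.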

You can use
/\w+\s*-\s*\d{5}[a-zA-Z]?\s*-\s*(.*?)(?=\s*-\s*\d)/
Explanation:
\w+\s*- ==> match a run of word characters followed by any number of spaces and a -
\d{5}[a-zA-Z]? ==> exactly 5 digits, optionally followed by one letter
(.*?) ==> match everything in a non-greedy way
(?=\s*-\s*\d) ==> look ahead for a - followed by a digit and stop there (this relies on the IP address starting with a digit)

Greedy/non-greedy approach
($type) = /.*?-.*?-(.*)-.*-.*/;
.*? is a non-greedy match, meaning match any number of any character, but no more than necessary to match the regular expression. Using .* between the second and third dashes is a greedy match, matching as many characters as possible while still matching the regular expression, and using this will capture words with any extra dashes in them.
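To see the difference between the two approaches, here is a small sketch in Python that runs the all-greedy pattern from the question and the mixed pattern from this answer on the same string. The file name is a made-up example, not one from the question.

```python
import re

name = "proj-11111A-proj_test-type-10.0.0.1-0123456789.txt"  # hypothetical name

# All-greedy pattern from the question: the leading .* consumes as much as it
# can, so the capture group ends up on a later field.
greedy = re.match(r".*-.*-(.*)-.*", name).group(1)

# Lazy leading fields and a greedy capture, as in the answer above.
mixed = re.match(r".*?-.*?-(.*)-.*-.*", name).group(1)

print(greedy)  # 10.0.0.1
print(mixed)   # proj_test-type
```

The lazy .*? pins the first two fields to the shortest possible match, which leaves the greedy group free to take everything up to the last two dashes.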

Related

Regex that match table input

I have this kind of input
||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country||
|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |
I am looking for two regexes. The first should match only the cells inside the header row ||ID||Part Number||Product Name||Serial Number||Status||Dunning Status||Commitment End||Address||Country|| out of the whole table input, so it must not match the data row |1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|Closed|04. 08. 2020| | |. For the other, I could theoretically split by newlines and by |...
I have tried something like [^\|\|]+(?=\|\|). Is this a good solution?

You can't negate a sequence of characters with a negated character class, only individual chars.
I suggest using a regex that will extract any chunks of chars other than | between double ||:
(?<=\|\|)[^|]+(?=\|\|)
Details
(?<=\|\|) - two | chars must be present immediately on the left
[^|]+ - 1+ chars other than |
(?=\|\|) - two | chars must be present immediately on the right.
If you ever need to make sure there are exactly two pipes on each side, and not match if there are three or more, you will need to make the pattern more precise: (?<=(?<!\|)\|\|)[^|]+(?=\|\|(?!\|)).
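A minimal sketch of the basic lookaround pattern in Python, run on a shortened version of the table from the question (the shortening is only to keep the example small):

```python
import re

# Shortened copy of the table: one header row and one data row.
table = (
    "||ID||Part Number||Product Name||Serial Number||Status||\n"
    "|1|SX0486|Mobilný Hlas Postpaid|0911193419|Active|\n"
)

# A run of non-pipe characters with || immediately on both sides.
CELL = re.compile(r"(?<=\|\|)[^|]+(?=\|\|)")

headers = CELL.findall(table)
print(headers)
```

The data row produces no matches because its cells are separated by single pipes, so neither lookaround can find a double pipe next to them.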

golang how to split a string to a slice containing the delimiters using a regular expression

I want to split a string to a slice but want to keep the delimiters. Is there a way to do that in golang?
For example
Input:
"Hello! It's, a? beautiful$ day* (today and tomorrow)."
Output:
[Hello | ! | It's | , | a | ? | beautiful | $ | day | * | ( | today | and | tomorrow | ) | . ]
where | represents the separation of elements.
Can anyone help?

You can do that by creating a regular expression that matches either a word or one of your special characters. I don't know exactly what your rules are, but given the input and desired output, this works:
[A-Za-z']+|[*?()$.,!]
You can then use FindAllString to return the individual strings.
See https://play.golang.org/p/se6B74G_Fv.
The idea of the regular expression is to match either a word, here defined as a sequence of one or more upper or lower case letters or apostrophes, or a "special character", defined as any of the non-word characters in your sample input. The regular expression contains two parts separated by a |, meaning that either alternative can match. A set of characters contained within square brackets is called a character class, which means that any of the characters can match (and a-z means all characters from a to z). Since the left alternative is a character class followed by a +, one or more of the characters in the class will match. The portion on the right only matches one character. You may need to adjust this to behave exactly the way you want. For more information on Go regular expressions, see https://github.com/google/re2/wiki/Syntax.
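The pattern itself is not Go-specific. As a rough cross-check, here is the same idea sketched in Python, where re.findall plays the role that FindAllString plays in the answer:

```python
import re

# Either a word (letters and apostrophes) or one of the listed special characters.
TOKEN = re.compile(r"[A-Za-z']+|[*?()$.,!]")

text = "Hello! It's, a? beautiful$ day* (today and tomorrow)."
tokens = TOKEN.findall(text)
print(tokens)
```

Spaces are not part of either alternative, so they are skipped rather than returned, which matches the desired output in the question.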

RegEx that excludes characters doesn't begin matching until 2nd character

I'm trying to create a regular expression that will include all ascii but exclude certain characters such as "+" or "%" - I'm currently using this:
^[\x00-\x7F][^%=+]+$
But I noticed (using various RegEx validators) that this pattern only begins matching with 2 characters. It won't match "a" but it will match "ab." If I remove the "[^]" section, (^[\x00-\x7F]+$) then the pattern matches one character. I've searched for other options, but so far come up with nothing. I'd like the pattern to begin matching on 1 character but also exclude characters. Any suggestions would be great!

Try this:
^(?:(?![%=+])[\x00-\x7F])+$
This will loop through, make sure that the "bad" characters aren't there with a negative lookahead, then match the "good" characters, then repeat.

You can use a negative lookahead here to exclude certain characters:
^((?![%=+])[\x00-\x7F])+$
(?![%=+]) is a negative lookahead that asserts the next character is not one of %, = or +.

You could simply exclude those chars from the \x00-\x7f range (using the hex value of each char).
+----------------+
|Char|Dec|Oct|Hex|
+----------------+
| % |37 |45 |25 |
+----------------+
| + |43 |53 |2B |
+----------------+
| = |61 |75 |3D |
+----------------+
Regex:
^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$
Engine-wise this is more efficient than attempting an assertion for each character.
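Here is a quick sketch in Python that runs the original pattern, the lookahead version and the explicit-range version on a few made-up sample strings, to check that the two fixes agree and that the original really needs two characters:

```python
import re

ORIGINAL = re.compile(r"^[\x00-\x7F][^%=+]+$")            # pattern from the question
LOOKAHEAD = re.compile(r"^(?:(?![%=+])[\x00-\x7F])+$")    # lookahead answers
RANGES = re.compile(r"^[\x00-\x24\x26-\x2A\x2C-\x3C\x3E-\x7F]+$")  # explicit ranges

samples = ["a", "ab", "a+b", "100%", "x=y", "plain text"]

# For each sample: (lookahead result, explicit-range result).
results = {s: (bool(LOOKAHEAD.match(s)), bool(RANGES.match(s))) for s in samples}

for s, pair in results.items():
    print(repr(s), pair, bool(ORIGINAL.match(s)))
```

The original pattern fails on "a" because it is two consecutive pieces, [\x00-\x7F] and [^%=+]+, and each has to consume at least one character.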

sed making digit optional

I am attempting to replace a date in the format 08/09/2014 but at the same time also the format 8/9/14 using sed. I know the + sign is supposed to match one or more occurrences, and ? 0 or more. I've tried both but none of the dates are being replaced with "testing". I was expecting this would find 1 or more digits followed by a slash, 1 or more digits followed by a slash, 4 digits.
Do I need to escape the special character, or what is wrong here?
sed -f mySed.sed dates.csv
# mySed.sed file
s#[0-9]+/[0-9]+/[0-9][0-9][0-9][0-9]#testing#g
# sample line in dates.csv
...,20/01/2001,2/1/2009,...

You have made several mistakes. Here is a working example:
echo '20/01/2001,2/1/2009' | sed 's~[0-9]\{1,2\}/[0-9]\{1,2\}/\([0-9]\{2\}\)\{1,2\}~toto~g'
Note that the ? means "optional" (in other words, 0 or 1 time) and must be escaped.
To be more precise, I have chosen to use the {m,n} quantifier instead of +. But if you use +, don't forget to escape it as \+, otherwise it will be seen as a literal character. (This applies to sed's default basic regular expressions; \+ and \? are GNU extensions there.)

You need to escape the + quantifiers in your regular expression, and you can use a range for the last set.
s#[0-9]\+/[0-9]\+/[0-9]\{2,4\}#testing#g
Or you can use the range quantifier throughout your pattern.
s#[0-9]\{1,2\}/[0-9]\{1,2\}/[0-9]\{2,4\}#testing#g
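For comparison only, here is the same replacement sketched in Python. Python's re module uses the extended-style syntax, so the quantifiers are written without backslashes, which is the main difference from the sed scripts above. The sample line is adapted from the question, with a short-year date added.

```python
import re

# 1-2 digits, slash, 1-2 digits, slash, 2-4 digits (no escaping needed here).
DATE = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}")

line = "...,20/01/2001,2/1/2009,8/9/14,..."
replaced = DATE.sub("testing", line)
print(replaced)
```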

Simple regex validation

I want to implement the following validation: match at least 5 digits, with some other characters (for example letters and slashes) allowed in between. For example 12345, 1A/2345, B22226, 21113C are all valid combinations, but 1234 and AA1234 are not. I know that {5,} gives the minimum number of occurrences, but I don't know how to cope with the other characters; [0-9A-Z/]{5,} won't work. I just don't know where to put the other characters in the regex.
Thanks in advance!
Best regards,
Petar

Using the simplest regex features since you haven't specified which engine you're using, you can try:
.*([0-9].*){5}
|/|\   /|/| |
| | \ / | | +--> exactly five occurrences of the group
| |  |  | +----> end group
| |  |  +------> zero or more of any character
| |  +---------> any digit
| +------------> begin group
+--------------> zero or more of any character
This gives you any number (including zero) of characters, followed by a group consisting of a single digit and any number of characters again. That group is repeated exactly five times.
That'll match any string with five or more digits in it, along with anything else.
If you want to limit what the other characters can be, use something other than the dot (.). For example, alphas only would be:
[A-Za-z]*([0-9][A-Za-z]*){5}
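A short sketch in Python that runs the first pattern against the examples from the question. It uses fullmatch so the whole string has to fit the pattern; has_five_digits is just a name picked for this example.

```python
import re

# Any characters, then five repetitions of "a digit followed by any characters".
AT_LEAST_FIVE = re.compile(r".*([0-9].*){5}")

def has_five_digits(text):
    """True if text contains at least five digits anywhere."""
    return AT_LEAST_FIVE.fullmatch(text) is not None

valid = ["12345", "1A/2345", "B22226", "21113C"]   # examples from the question
invalid = ["1234", "AA1234"]

print([has_five_digits(s) for s in valid])
print([has_five_digits(s) for s in invalid])
```

Strings with more than five digits still pass, because the .* parts are free to absorb the extra digits.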

EDIT: I'm picking up your suggestion from a comment to paxdiablo's answer: This regex now implements an upper bound of five for the number of "other" characters:
^(?=(?:[A-Z/]*\d){5})(?!(?:\d*[A-Z/]){6})[\dA-Z/]*$
will match and return a string that has at least five digits and zero or more of the "other" allowed characters A-Z or /. No other characters are allowed.
Explanation:
^                 # Start of string
(?=               # Assert that it's possible to match the following:
 (?:              # Match this group:
  [A-Z/]*         # zero or more non-digits, but allowed characters
  \d              # exactly one digit
 ){5}             # five times
)                 # End of lookahead assertion.
(?!               # Now assert that it's impossible to match the following:
 (?:              # Match this group:
  \d*             # zero or more digits
  [A-Z/]          # exactly one "other" character
 ){6}             # six times (change this number to "upper bound + 1")
)                 # End of assertion.
[\dA-Z/]*         # Now match the actual string, allowing only these characters.
$                 # Anchor the match at the end of the string.
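Since the comment in the explanation says to change the number to "upper bound + 1", here is a sketch in Python that builds the pattern from the two bounds. make_validator and its parameter names are invented for this example, and re.ASCII is used so that \d only means 0-9.

```python
import re

def make_validator(min_digits=5, max_other=5):
    """Build the lookahead pattern above for the given bounds (illustrative helper)."""
    pattern = (
        r"^(?=(?:[A-Z/]*\d){%d})"   # at least min_digits digits
        r"(?!(?:\d*[A-Z/]){%d})"    # fewer than max_other + 1 other characters
        r"[\dA-Z/]*$"               # only digits, A-Z and / overall
    ) % (min_digits, max_other + 1)
    return re.compile(pattern, re.ASCII)

validator = make_validator()

for candidate in ["12345", "1A/2345", "1234", "AA1234", "ABCDE12345", "ABCDEF12345"]:
    print(candidate, bool(validator.match(candidate)))
```

With the default bounds, ABCDE12345 passes (five other characters) while ABCDEF12345 is rejected by the negative lookahead (six other characters).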

You may want to try counting the digits instead. I feel it's much cleaner than writing a complex regex.
>> "ABC12345".gsub(/[^0-9]/,"").size >= 5
=> true
The above says: remove everything that is not a digit, then check the length of what remains. You can do the same thing in your own choice of language. The most fundamental way would be to iterate over the string, counting each character that is a digit until the count reaches 5 (or the string ends), and act accordingly.
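Both variants described here can be sketched in Python as follows. The function names are made up for the example; the first mirrors the gsub one-liner and the second is the plain loop with an early exit.

```python
import re

def has_min_digits(text, minimum=5):
    """Strip every non-digit and compare the length of what is left."""
    return len(re.sub(r"[^0-9]", "", text)) >= minimum

def has_min_digits_loop(text, minimum=5):
    """Same check by iteration, stopping as soon as enough digits are seen."""
    seen = 0
    for ch in text:
        if "0" <= ch <= "9":
            seen += 1
            if seen >= minimum:
                return True
    return seen >= minimum

print(has_min_digits("ABC12345"), has_min_digits_loop("ABC12345"))
print(has_min_digits("AA1234"), has_min_digits_loop("AA1234"))
```

Note that this only counts digits; it does not restrict which other characters may appear, so that check would still need to be done separately.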
