RegEx for matching a string before a year - regex

I have directory names that include year numbers. I want to store in a variable everything that comes before the year number:
Input:
Holidays.uS.2019.bla.bla
Holidays.ca.old.2017.bla.bla
Holidays.2015.bla.bla.bla
Holidays.1.2.3.4.at.old.1999.bla.bla.bla.bla
The year is not always in the same place, but, it always has 4 digits.
I always need everything up to the year.
For an input:
Holidays.ca.old.2017.bla.bla
Output:
Holidays.ca.old
Attempt
set name Holidays.ca.old.2017.bla.bla
set numbers [regexp -all -inline {[0-9]+} $name]
Output from my code is the year number, and sometimes other wrong numbers.

This expression might help you to design one:
([\w\.]+)(\.[0-9]{4}.+)
Code:
set string "Holidays.1.2.3.4.at.old.1999.bla.bla.bla.bla"
set match [regsub {([\w\.]+)(\.[0-9]{4}.+)} $string "\\1"]
puts $match
Output
Holidays.1.2.3.4.at.old

You may use a regex to match a dot followed with 4 digits that are not followed with a word char, and then matching any other char 0 or more times, and remove the matched text using regsub like this:
regsub {\.[0-9]{4}\y.*} $name ""
See Tcl demo online:
set name "Holidays.ca.old.2017.bla.bla"
set res [regsub {\.[0-9]{4}\y.*} $name ""]
puts $res
# => Holidays.ca.old
Regex details
\. - a dot
[0-9]{4} - four digits
\y - a word boundary
.* - any 0 or more chars as many as possible.
If you want to see a demo of the regex at regex101.com, you need to replace \y with \b, see this demo here.
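For readers outside Tcl, here is a minimal Python sketch of the same idea, assuming the pattern carries over unchanged apart from the boundary syntax (Python's re module writes the word boundary as \b, not \y). The helper name prefix_before_year is made up for illustration.

```python
import re

# Same pattern as the Tcl answer; Python writes the word boundary as \b.
YEAR_AND_REST = re.compile(r"\.[0-9]{4}\b.*")


def prefix_before_year(name: str) -> str:
    """Drop the first '.YYYY' component and everything after it.

    Hypothetical helper name; names without a year are returned unchanged.
    """
    return YEAR_AND_REST.sub("", name, count=1)


if __name__ == "__main__":
    for name in [
        "Holidays.uS.2019.bla.bla",
        "Holidays.ca.old.2017.bla.bla",
        "Holidays.2015.bla.bla.bla",
        "Holidays.1.2.3.4.at.old.1999.bla.bla.bla.bla",
    ]:
        print(prefix_before_year(name))
```

The single digits in the last sample name are skipped because the pattern insists on exactly four digits followed by a word boundary.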

(\w|\.)+(?=\.\d{4})
Breakdown:
(\w|\.)+ One or more word characters (letters, digits or underscore) or literal periods.
(?=\.\d{4}) Positive lookahead for a literal period followed by four digits (a fifth digit is not ruled out, since nothing is checked after them).
Demo: https://regex101.com/r/vaofyC/6
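A small Python sketch of the lookahead variant, assuming [\w.]+ is an acceptable stand-in for (\w|\.)+ (it matches the same characters without an alternation). The function name is hypothetical. Because the prefix is matched greedily, this version stops at the last period followed by four digits, whereas the regsub approach above cuts at the first one.

```python
import re

# Greedy prefix of word characters and periods, followed by '.dddd'.
PREFIX = re.compile(r"[\w.]+(?=\.\d{4})")


def prefix_by_lookahead(name: str):
    """Return the matched prefix, or None if no '.dddd' component exists."""
    match = PREFIX.search(name)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(prefix_by_lookahead("Holidays.ca.old.2017.bla.bla"))
    print(prefix_by_lookahead("Holidays.2015.x.2017.y"))
```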

Thank you for your help, that's really nice
I use this in Tcl and it works perfectly for me:
set name_split [regsub {\.[0-9]{4}\y.*} $name ""]
I still need it for a bash script. How can I use it there?
This does not really work :(
name_split=$(echo $name | {\.[0-9]{4}\y.*}

Related

PowerShell Regular Expression match Y or Z

I am trying to match some strings using a regular expression in PowerShell, but I am running into difficulty because the format of the original strings I'm extracting from varies. Admittedly, I am not very strong at creating regular expressions.
I need to extract the numbers from each of these strings. They can vary in length, but in both cases they are preceded by FOO:
PC1-FOO1234567
PC2-FOO1234567/FOO98765
This works for the second example:
'PC2-FOO1234567/FOO98765' -match 'FOO(.*?)\/FOO(.*?)\z'
It lets me access the matched strings using $matches[1] and $matches[2] which is great.
It obviously doesn't work for the first example. I suspect I need some way to match on either / or the end of the string but I'm not sure how to do this and end up with my desired match.
Suggestions?
You may use
'FOO(.*?)(?:/FOO(.*))?$'
It will match FOO, then capture any 0 or more chars as few as possible into Group 1 and then will attempt to optionally match a sequence of patterns: /FOO, any 0 or more chars as many as possible captured into Group 2 and then the end of string position should follow.
See the regex demo
Details
FOO - literal substring
(.*?) - Group 1: any zero or more chars other than newline, as few as possible
(?:/FOO(.*))? - an optional non-capturing group matching 1 or 0 repetitions of:
/FOO - a literal substring
(.*) - Group 2: any 0+ chars other than newline as many as possible (* is greedy)
$ - end of string.
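To see how the optional group behaves on both inputs, here is a short Python sketch. It assumes the pattern behaves the same way in Python's re module as in .NET for these single-line strings; the function name is made up. When the /FOO part is absent, Group 2 simply does not participate and comes back as None.

```python
import re

# Group 1 is lazy; the optional non-capturing group holds Group 2.
FOO_NUMBERS = re.compile(r"FOO(.*?)(?:/FOO(.*))?$")


def extract_foo_numbers(text: str):
    """Return (group1, group2); group2 is None when there is no '/FOO' part."""
    match = FOO_NUMBERS.search(text)
    return match.groups() if match else None


if __name__ == "__main__":
    print(extract_foo_numbers("PC1-FOO1234567"))
    print(extract_foo_numbers("PC2-FOO1234567/FOO98765"))
```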
[edit - removed the unneeded pipe to Where-Object. thanks to mklement0 for that! [*grin*]]
this is a somewhat different approach. it splits on the foo, then replaces the unwanted / with nothing, and finally filters out any string that contains letters.
the pure regex solutions others offered will likely be faster, but this may be slightly easier to understand - and therefore to maintain. [grin]
# fake reading in a text file
# in real life, use Get-Content
$InStuff = @'
PC1-FOO1234567
PC2-FOO1234567/FOO98765
'@ -split [environment]::NewLine
$InStuff -split 'foo' -replace '/' -notmatch '[a-z]'
output ...
1234567
1234567
98765
To offer a more concise alternative with the -split operator, which obviates the need to access $Matches afterwards to extract the numbers:
PS> 'PC1-FOO1234568', 'PC2-FOO1234567/FOO98765' -split '(?:^PC\d+-|/)FOO' -ne ''
1234568 # single match from 1st input string
1234567 # first of 2 matches from 2nd input string
98765
Note: -split always returns a [string[]] array, even if only 1 string is returned; result strings from multiple input strings are combined into a single, flat array.
^PC\d+-|/ matches PC followed by 1 or more (+) digits (\d) at the start of the string (^) or (|) a / char., which matches both PC2-FOO at the beginning and /FOO.
(?:...), a non-capturing subexpression, must be used to prevent -split from including what the subexpression matched in the results array.
-ne '' filters out the empty elements that result from the input strings starting with a separator.
To learn more about the regex-based -split operator and in what ways it is more powerful than the string literal-based .NET String.Split() method, see this answer.
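The split-based idea can be sketched in Python as well, assuming re.split as the counterpart of PowerShell's -split. Python's re.split works on one string at a time, so the flattening across input strings and the filtering of empty elements are done by hand here; split_numbers is a hypothetical name.

```python
import re

# Separator: 'PCn-FOO' at the start of the string, or '/FOO' anywhere.
SEPARATOR = re.compile(r"(?:^PC\d+-|/)FOO")


def split_numbers(lines):
    """Split each line on the separator and collect the non-empty pieces."""
    numbers = []
    for line in lines:
        # A leading separator yields an empty first element; drop it.
        numbers.extend(part for part in SEPARATOR.split(line) if part)
    return numbers


if __name__ == "__main__":
    print(split_numbers(["PC1-FOO1234568", "PC2-FOO1234567/FOO98765"]))
```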

Matching numbers with non-digits embedded

I am trying to match strings of digits that contain non-digits within them. Using the default text in http://regexr.com/, the following should match:
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
The following should not match:
0123456789
12345
I tried:
/[^\n\ ]{1,}\d+\S+\d/g
But it would not match +42 and it incorrectly matched 0123456789 and 12345, and it treated "555.123.4567 +1-(800)-555-2468" as one string.
I tried to alleviate it by putting a $ at the end but that matched nothing. Not sure what I am doing wrong.
You can use this regex to match any text with at least one non-digit:
/^\d*[^\d\n]+\d.*$/mg
RegEx Demo
RegEx Breakup:
^ - Start
\d* - Match 0 or more digits
[^\d\n]+ - Match 1 or more of any character that is not a digit and not a newline
\d - Match a digit
.* - Match 0 or more of any character
$ - End
Try this:
^(?=.*\d)(?=.*[^\d\s])\S+$
This means "at least one digit and one non-digit and no whitespace".
See live demo.
If no newlines were in your input, you could use slightly simpler:
^(?=.*\d)(?=.*\D)\S+$
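A quick Python check of the lookahead pattern against the sample strings from the question, assuming each candidate has already been isolated as its own token (fullmatch then plays the role of ^ and $). The names here are illustrative only.

```python
import re

# At least one digit, at least one non-digit non-space, no whitespace at all.
MIXED = re.compile(r"(?=.*\d)(?=.*[^\d\s])\S+")


def is_mixed(token: str) -> bool:
    """True if the token contains a digit and a non-digit and no whitespace."""
    return MIXED.fullmatch(token) is not None


if __name__ == "__main__":
    samples = ["v2.1", "-98.7", "3.141", ".6180", "9,000", "+42",
               "555.123.4567", "+1-(800)-555-2468", "0123456789", "12345"]
    for sample in samples:
        print(sample, is_mixed(sample))
```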
Aren't you over-thinking this massively? What's wrong with using /\D/ to match a string that contains a non-digit?
I'm not sure what your exact requirements are, but if you're looking for a string that contains at least one digit and at least one non-digit, then the easiest approach is to use two regex matches - /\d/ && /\D/.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
while (<DATA>) {
chomp;
say "$_: " . (/\d/ && /\D/ ? 'matches' : 'doesn\'t match');
}
__DATA__
v2.1
-98.7
3.141
.6180
9,000
+42
555.123.4567
+1-(800)-555-2468
0123456789
12345
Looks like you want to dodge strings made up entirely of digits, or entirely of letters. So you can exclude those. That will also let in strings without any numbers, so also require a number.
my $exclude = qr/(?: [0-9]+ | [A-Za-z]+ )/x;
my @res = grep { not /^$exclude$/ and /\d/ } @strings;
If any other characters need be excluded (underscore?), add it to the list.
It is not clear how your input is coming, this takes a list of ready strings. Add word boundaries and/or /s, depending on the input. Or parse the input into a list of strings for this.
If the input comes as a multi-line string, my @strings = split '\n|\s+', $text;.

Replace last occurrence of character in string [duplicate]

This question already has answers here:
How to replace last occurrence of characters in a string using javascript
(3 answers)
Closed 6 years ago.
I've got the following string :
01/01/2014 blbalbalbalba blabla/blabla
I would like to replace the last slash with a space, and keep the first 2 slashes in the date.
The closest thing I have come up with was this kind of thing :
PS E:\> [regex] $regex = '[a-z]'
PS E:\> $regex.Replace('abcde', 'X', 3)
XXXde
but I don't know how to start from the end of the line. Any help would be greatly appreciated.
Editing my question to clarify :
I just want to replace the last slash character with a space character, therefore :
01/01/2014 blbalbalbalba blabla/blabla
becomes
01/01/2014 blbalbalbalba blabla blabla
Knowing that the length of "blabla" varies from one line to the other and the slash character could be anywhere.
Thanks :)
Use the following pattern to match:
(.*)[/](.*)
and the following to replace:
$1 $2
Explanation:
(.*) matches anything, any number of times (including zero). By wrapping it in parentheses, we make it available to be used again during the replace sequence, using the placeholder $ followed by the element number (as an example, because this is the first element, $1 will be the placeholder). When we use the relevant placeholder in the replace string, it will put all of the characters matched by this section of the regex into the resulting string. In this situation, the matched text will be 01/01/2014 blbalbalbalba blabla
[/] is used to match the forward slash character.
(.*) again is used to match anything, any number of times, similar to the first instance. In this case, it will match blabla, making it available in the $2 placeholder.
Basically, the three elements work together to find a number of characters, followed by a forward slash, followed by another number of characters. Because the first "match everything" is greedy (that is, it will attempt to match as many characters as possible), it will include all of the other forward slashes as well, up until the last. The reason that it stops short of the last forward slash is that including it would make the regex fail, as the [/] wouldn't be able to match anything any more.
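The greedy behaviour described above can be tried out with a few lines of Python. This is only a sketch: it assumes a single-line input and uses Python's \1 and \2 in place of the $1 and $2 placeholders, and the function name is invented.

```python
import re


def replace_last_slash(text: str) -> str:
    """Replace the last '/' with a space, relying on the greedy first group."""
    # The first (.*) grabs as much as it can, so the literal '/' in the
    # pattern can only line up with the final slash in the string.
    return re.sub(r"(.*)/(.*)", r"\1 \2", text, count=1)


if __name__ == "__main__":
    print(replace_last_slash("01/01/2014 blbalbalbalba blabla/blabla"))
```

A string without any slash is returned unchanged, because the pattern cannot match at all.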
You can also use lookahead:
'01/01/2014 blbalbalbalba blabla/blabla' -replace '/(?=[^/]+$)',' '
01/01/2014 blbalbalbalba blabla blabla
'/(?=[^/]+$)' will match a '/' character that comes right before a series of 'not /' characters immediately before EOL, but this is probably less efficient than the direct matches.
'01/01/2014 blbalbalbalba blabla/blabla' -replace '^(\d{2}/\d{2})/(\d{4} .*)','$1 $2'
# outputs this:
# 01/01 2014 blbalbalbalba blabla/blabla
Here's how you can do it without regular expressions:
$string = "01/01/2014 blbalbalbalba blabla/blabla"
$last_index = $string.LastIndexOf('/')
$chars = $string.ToCharArray()
$chars[$last_index] = ' '
$new_string = $chars -join ''
Another way:
$string = "01/01/2014 blbalbalbalba blabla/blabla"
$last_index = $string.LastIndexOf('/')
$new_string = $string.Remove($last_index, 1).Insert($last_index, ' ')
$ is the anchor for end of line.
So
(.*?)([a-z])$
should match what you want, and the thing in () is what you want to replace.
Best regards

Finding all the ten different digits in a random string

Sorry if this is answered somewhere, but I couldn't find it.
I need to write a regexp that matches strings containing each of the digits from 0 to 9 exactly once. For example:
e8v5i0l9ny3hw1f24z7q6
You can see that numbers [0-9] are present exactly once and in random order. (Letters are present also exactly once, but that is an advanced quest...) It must not match if a digit is missing or if any digit is present more than one time.
So what would be the best regexp to match on strings like these? I am still learning regex and couldn't find a solution. It is PCRE, running in perl environment, but I cannot use perl, only the regex part of it. Sorry for my english and thank you in advance.
What about this pattern to verify the string:
^\D*(?>(\d)(?!.*\1)\D*){10}$
^\D* Starts with any amount of characters, that are no digit
(?>(\d)(?!.*\1)\D*){10} followed by 10 repetitions of: a digit (captured in the first capturing group), a negative lookahead asserting that the same digit does not occur anywhere later in the string, then any amount of \D non-digits. Ten digits, none of which recurs later, must be ten different digits, i.e. each of [0-9] exactly once.
\d is a shorthand for [0-9], \D is the negation [^0-9]
Test at regex101, Regex FAQ
If you need the digit-string then, just extract the digits, e.g. php (test eval.in)
$str = "e8v5i0l9ny3hw1f24z7q6";
$pattern = '/^\D*(?>(\d)(?!.*\1)\D*){10}$/';
if(preg_match($pattern, $str)) {
echo preg_replace('/\D+/', "", $str);
}
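A Python sketch of the same check, with two stated assumptions: the atomic group (?>...) is swapped for a plain non-capturing group (?:...), since Python's re module only gained atomic groups in version 3.11, and the re.ASCII flag is used so that \d means [0-9] only. The swap should not change which strings match here, because \d and \D never compete for the same character. A set-based version without regex is included as a cross-check; both function names are hypothetical.

```python
import re

# Non-capturing group instead of the atomic group; re.ASCII keeps \d to [0-9].
TEN_DISTINCT = re.compile(r"^\D*(?:(\d)(?!.*\1)\D*){10}$", re.ASCII)


def has_each_digit_once(text: str) -> bool:
    """Regex check: exactly ten digits, none of them repeated."""
    return TEN_DISTINCT.search(text) is not None


def has_each_digit_once_plain(text: str) -> bool:
    """The same check without a regex, for comparison."""
    digits = [ch for ch in text if ch in "0123456789"]
    return len(digits) == 10 and len(set(digits)) == 10


if __name__ == "__main__":
    sample = "e8v5i0l9ny3hw1f24z7q6"
    print(has_each_digit_once(sample), has_each_digit_once_plain(sample))
```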
It is easy to create a regular expression that matches one specific permutation of the digits and ignores the other characters. E.g.
^[^\d]*0[^\d]*1[^\d]*2[^\d]*3[^\d]*4[^\d]*5[^\d]*6[^\d]*7[^\d]*8[^\d]*9[^\d]*$
You can combine the 10! expressions, one for every possible permutation, with |
Although this is completely impractical, it shows that such a regular expression (without lookahead) is indeed possible.
However this is something that is much better done without regular expression matching.
$s = "e8v5i0l9ny3hw1f24z7q6";
$s = preg_replace('/\D/', '', $s); // remove non-digits
if (strlen($s) == 10) // do we have exactly 10 digits?
    if (!preg_match('/(\d).*\1/', $s)) // no digit occurs twice, adjacent or not
        echo "String has 10 different digits";
http://ideone.com/eY4eGx

Regex: Matching 4-Digits within words

I have a body of text I'm looking to pull repeat sets of 4-digit numbers out from.
For Example:
The first is 1234 2) The Second is 2098 3) The Third is 3213
Now I know I'm able to get the first set of digits out by simply using:
/\d{4}/
...returning 1234
But how do I match the second set of digits, or the third, and so on...?
edit: How do i return 2098, or 3213
You don't appear to have a proper answer to your question yet.
The solution is to use the /g modifier on your regex. In list context it will find all of the numbers in your string at once, like this
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my @numbers = $str =~ /\b \d{4} \b/gx;
print "@numbers\n";
output
1234 2098 3213
Or you can iterate through them, using scalar context in a while loop, like this
while ($str =~ /\b (\d{4}) \b/gx) {
my $number = $1;
print $number, "\n";
}
output
1234
2098
3213
I have added the \b patterns to the regex so that it only matches whole four-digit numbers and doesn't, for example, find 1234 in 1234567. The /x modifier just allows me to add spaces so that the pattern is more intelligible.
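For comparison, the list-context /g match corresponds roughly to re.findall in Python. A minimal sketch, with an invented function name:

```python
import re


def four_digit_numbers(text: str):
    """Return every standalone four-digit number in the text, in order."""
    # \b on both sides keeps '1234' inside '1234567' from matching.
    return re.findall(r"\b\d{4}\b", text)


if __name__ == "__main__":
    text = "The first is 1234 2) The Second is 2098 3) The Third is 3213"
    print(four_digit_numbers(text))
```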
See http://perldoc.perl.org/perlre.html for discussion on the use of the 'g' modifier which will cause your regular expression to match ALL occurrences of its pattern, not just the first.
If you want a pattern that finds the $n'th 4-digit group, this seems to work:
$pat = "^(?:.*?\\b(\\d{4})\\b){$n}";
if ($s =~ /$pat/) {
print "Found $1\n";
} else {
print "Not found\n";
}
I did this by building a string pattern because I couldn't get a variable interpolated into a quantifier {$n}.
This pattern finds 4-digit groups that are on word boundaries (the \b tests); I don't know if that meets your requirements. The pattern uses .*? to ensure that as few characters as possible are matched between each four-digit group. The pattern is matched $n times, and the capture group $1 is set to whatever it was in the last iteration, i.e. the $n'th one.
EDIT: When I just tried it again, it seemed to interpolate $n in a quantifier just fine. I don't know what I did differently that it didn't work last time. So maybe this will work:
if ($s =~ /^(?:.*?\b(\d{4})\b){$n}/) { ...
If not, see amon's comment about qr//.
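The n'th-group pattern can also be sketched in Python, assuming n is at least 1 and building the pattern with an f-string (braces that belong to the regex are doubled). As in Perl, a capture group inside a repeated group keeps the text from its last iteration. The function name is hypothetical.

```python
import re


def nth_four_digit_group(text: str, n: int):
    """Return the n'th standalone four-digit number (n >= 1), or None."""
    # {{4}} becomes the literal quantifier {4}; {{{n}}} becomes e.g. {2}.
    pattern = rf"^(?:.*?\b(\d{{4}})\b){{{n}}}"
    match = re.search(pattern, text)
    # Group 1 holds what it captured on the final repetition.
    return match.group(1) if match else None


if __name__ == "__main__":
    text = "The first is 1234 2) The Second is 2098 3) The Third is 3213"
    for n in (1, 2, 3, 4):
        print(n, nth_four_digit_group(text, n))
```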
If the regex is only matched once, then match all three in one regex and extract them using matched groups:
^.*\b(\d{4})\b.*\b(\d{4})\b.*\b(\d{4})\b.*$
The three 4-digit numbers will be captured in groups 1, 2 and 3.
Ajb's answer with "gx" is the best. If you know you will have three numbers, this straightforward line does the trick:
my $str = 'The first is 1234 2) The Second is 2098 3) The Third is 3213';
my ($num1, $num2, $num3) = $str =~ /\b \d{4} \b/gx;
print "$num1, $num2, $num3\n";