Regex - Match n last numbers without the last one - regex

With regex - replace, I am trying to format a number like this:
The leading number should be separated by a +. Moreover, the last number should be separated by a + as well. The trickier part is that any 1s adjacent to either + in the middle part should be removed, without touching the first and the last number, e.g.,
011023040 -> 0+02304+0
111023920443 -> 1+02392044+3
13242311 -> 1+32423+1
I almost achieved this with the following regex:
'^([0-9]{1})([1]+)?([0-9]*)([0-9]{1})$'
And replace this with
'\1+\3+\4'
However, I have a problem with the last example, as this returns:
1+324231+1
However, the one before the second + should be removed.
Can anyone help me with this problem?

You have to use a non-greedy quantifier:
^([0-9])1*([0-9]*?)1*([0-9])$
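The *? makes the middle group lazy, so the second 1* can take the 1s that sit right before the last digit. With three groups, the replacement becomes \1+\2+\3.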
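As a quick check of the non-greedy pattern, here is a minimal Python sketch that runs it against the three examples from the question. The helper name format_number is only an illustration, and Python is an assumption here, since the question does not say which regex engine is in use.

```python
import re

# Pattern from the answer: first digit, optional 1s, lazy middle part,
# optional 1s, last digit.
PATTERN = re.compile(r'^([0-9])1*([0-9]*?)1*([0-9])$')

def format_number(s):
    # Hypothetical helper; input that does not match is returned unchanged.
    return PATTERN.sub(r'\1+\2+\3', s)

for sample in ('011023040', '111023920443', '13242311'):
    print(sample, '->', format_number(sample))
```

Because the middle group is lazy, it stops as early as the rest of the pattern allows, which leaves the 1s next to the last digit for the second 1*.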

I managed to group the numbers in the following way
^(\d)(1*)(\d+)(\d)$
by using multiline and global flags.
The replacement should look like \1+\3+\4

Related

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I may be on the wrong track. Maybe a function would be easier, I do not know.
I am attaching one that says that I am finding the character 'z'
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string in between them.
Is there a more efficient way to do this?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
    print(m.group(1))
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
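The pattern details above can be tried out with a short, self-contained sketch. The sample strings are shortened versions of the ones in the question, and first_block is a hypothetical helper name used only for this illustration.

```python
import re

# Shortened versions of the two sample strings from the question.
samples = [
    '1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.',
    '12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok',
]

def first_block(s):
    # re.S lets "." match newlines, so the lazy ".*?" can run up to the
    # first blank line (or to the end of the string if there is none).
    m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
    return m.group(1) if m else None

for s in samples:
    print(first_block(s))
```

If the string does not start with a digit, the helper returns None instead of raising.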

Regex substitution does not replace match character for character

I am trying to use Regex to dynamically capture all numbers in a string such as 1234-12-1234 or 1234-123-1234 without knowing the number of characters that will occur in each string segment. I have been able to capture this using positive look ahead via the following expression: [0-9]*(?=-). However, when I try to replace the numbers to Xs such that each number that occurs before the last dash is replaced by an X, the Regex does not return X's for numbers 1:1. Instead, each section returns exactly two X's. How can I get the regex to return the following:
1234-123-1234 -> XXXX-XXX-1234
1234-12-1234 -> XXXX-XX-1234
instead of the current
1234-123-1234 -> XX-XX-1234
?
The problem is the * placed directly after the digit class. In each segment, the pattern first matches the whole run of digits, which is replaced with a single X, and then matches the empty string just before the dash, which is replaced with another X. So every segment ends up as exactly two X's, no matter how many digits it has.
Use this instead:
[0-9](?=.*-)
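A minimal sketch of the suggested pattern in Python (an assumption, since the question does not name the language or tool). The function name mask is hypothetical.

```python
import re

def mask(s):
    # Replace each single digit that still has a dash somewhere after it.
    # Matching one digit at a time gives a 1:1 replacement.
    return re.sub(r'[0-9](?=.*-)', 'X', s)

print(mask('1234-123-1234'))
print(mask('1234-12-1234'))
```

Digits after the last dash have no dash ahead of them, so the lookahead fails and they are left alone.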

R digit-expression and unlist doesn't work

So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.
I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))
When I run this command, "yend_clean" is simply set to "character (empty)".
If I remove the "4$", I get all of the dates split into atoms so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.
So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.
Was hoping someone in here could point me in the right direction.
This is a regular expression question. Your regular expression is wrong. Use:
unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
or equivalently
sub("^(\\d{4}).*", "\\1", "2003-")
or, if really all you want is to remove the "-"
sub("-", "", "2003-")
Repetition in regular expressions is controlled by the {} quantifier, which you were missing. Additionally, $ means match the end of the string, so your expression translates as:
match any single digit, followed by a 4, followed by the end of the string
When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).
The pattern I propose says instead:
match the beginning of the string (^), followed by a digit repeated four times.
The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
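The same three behaviours can be reproduced outside R. The sketch below uses Python's re module purely as an illustration; Python's re does not support POSIX classes such as [[:digit:]], so [0-9] stands in for it.

```python
import re

date = '2003-'

# The broken pattern: one digit, then a literal 4, then end of string.
broken = re.search(r'[0-9]4$', date)

# Without the "4$": every digit is matched on its own.
single_digits = re.findall(r'[0-9]', '1992')

# The fix: exactly four digits anchored at the start of the string.
year = re.search(r'^[0-9]{4}', date).group()

# The substitution variant: keep only the parenthesised part.
year_via_sub = re.sub(r'^(\d{4}).*', r'\1', date)

print(broken, single_digits, year, year_via_sub)
```

Here broken is None because "2003-" does not end in a digit followed by 4, which mirrors the empty result in the question.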
If you mean the book Automated Data Collection with R, the code could be like this:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))
This assumes that you have a string such as "1993–2007, 2010-" and want to get the last given year, which is "2010". The first line, which matches four digits followed by a dash at the end of the string, returns "2010-", and the second line returns "2010".

Comma Separated Numbers Regex

I am trying to validate a comma separated list for numbers 1-8.
i.e. 2,4,6,8,1 is valid input.
I tried [0-8,]* but it seems to accept 1234 as valid. It is not requiring a comma and it is letting me type in a number larger than 8. I am not sure why.
[0-8,]* will match zero or more consecutive instances of 0 through 8 or ,, anywhere in your string. You want something more like this:
^[1-8](,[1-8])*$
^ matches the start of the string, and $ matches the end, ensuring that you're examining the entire string. It will match a single digit, plus zero or more instances of a comma followed by a digit after it.
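A small validation sketch in Python, assuming that language since the question does not name one. re.fullmatch plays the role of the ^ and $ anchors here, and is_valid is a hypothetical helper name.

```python
import re

# One digit 1-8, then zero or more ",digit" groups.
LIST_PATTERN = re.compile(r'[1-8](,[1-8])*')

def is_valid(s):
    # fullmatch requires the whole string to match, like ^...$.
    return LIST_PATTERN.fullmatch(s) is not None

for candidate in ('2,4,6,8,1', '1234', '9', '1,,4', '0,1'):
    print(repr(candidate), is_valid(candidate))
```

Only the first candidate passes; the others fail because of a missing comma, a digit outside 1-8, or an empty item.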
/^\d+(,\d+)*$/
This requires at least one digit between commas; otherwise you will accept input like 1,,,,,4.
[0-9]+(,[0-9]+)+
This works better for me for comma separated numbers in general, like: 1,234,933
You can try with this Regex:
^[1-8](,[1-8])+$
If you are using Python and want to find all matching substrings of the form
XX,XX,XXX or X,XX,XXX
such as 12,000 or 1,20,000, use re.findall:
import re
string = "I spent 1,20,000 on new project "
re.findall(r'(\b[1-8]*(,[0-9]*[0-9])+\b)', string, re.IGNORECASE)
The result is [('1,20,000', ',000')]: one tuple per match, with the full match first and the last repetition of the inner group second.
You need a number + comma combination that can repeat:
^[1-8](,[1-8])*$
If you don't want capturing (remembering) parentheses, add ?: to the parens, like so:
^[1-8](?:,[1-8])*$

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match, because regex patterns match contiguous chunks only.
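Both approaches from this answer can be sketched as follows. Python is used here only for illustration (the question is about PowerShell); the patterns are the ones given above.

```python
import re

name = 'my_file_name_01012013_111546.xls'

# Approach 1: capture name and extension, then join the two groups.
m = re.match(r'^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$', name)
rebuilt = m.group(1) + m.group(2)

# Approach 2: delete the date/time part and keep everything else.
stripped = re.sub(r'_[0-9]{8}_[0-9]{6}', '', name)

print(rebuilt, stripped)
```

The substitution is usually the shorter route, because it sidesteps the limitation that a single match must be one contiguous chunk.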
Try this (not tested), but it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by an underscore, 8 characters, another underscore, and 6 characters (16 characters in total).
Once you've changed it, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
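The zero-width behaviour described in this answer can be observed directly. The sketch below uses Python instead of PowerShell as an illustration; lookaheads behave the same way for this pattern.

```python
import re

name = 'my_file_name_01012013_111546.xls'

# The lookahead checks what follows but consumes nothing.
without_suffix = re.match(r'.*(?=_.{8}_.{6})', name).group()

# Matching resumes right after "my_file_name", so .{3} takes "_01".
with_three_more = re.match(r'.*(?=_.{8}_.{6}).{3}', name).group()

print(without_suffix)
print(with_three_more)
```

The second match ends three characters after the first one, which shows that the date/time part was only inspected, not skipped.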