Need a regex to find a number and text in a filename

I have filenames in the format:
021-05-05_10-10-12-111_Nancy_Test_123456-1234_194456454390816_OD_2021042911270.pdf
I need to find “123456-1234” and OD.
In the 123456-1234 number, the ‘-’ is a wildcard, so the number can be e.g. 1234561234, 123456**1234, or 123456_1234, but there will always be 10 digits (0-9), and the wildcard (if any) will be between the 6th and 7th digits.
The “OD” can be “OD” or “OS”, ignoring case.
The number and OD/OS must be moved to the beginning of the filename with a server name in between, followed by today's date, and OD/OS must be converted to uppercase.
E.g.: 123456-1234_servername1_OD_yyyy_mm_dd_ss_021-05-05_10-10-12-111_Nancy_Test_194456454390816_2021042911270.pdf
I'm using a file renaming program that will take the regex.
(Don't know if advertising is allowed at StackOverflow, if it is, I will of course provide a link to the program).
This is what I got so far:
(?:_od_|_sd_) gives me the OD or SD
(?<!\d)\d{10}(?!\d) gives me the 1234561234, but only if there are no wildcards between the 6th and 7th digits.
Furthermore, I can't figure out how to put them together and move them in front with the server name in between.

You can use
(?i)^(.*?)_(\d{6}\D*\d{4})_(\d+)_(od|sd)_
Replace with $2_servername1_$4_yyyy_mm_dd_ss_$1_$3_ (the trailing underscore is part of the replacement).
Details:
(?i) - case insensitive mode on
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
_ - an underscore
(\d{6}\D*\d{4}) - Group 2: six digits, zero or more non-digits, four digits
_ - an underscore
(\d+) - Group 3: one or more digits
_ - an underscore
(od|sd) - Group 4: od or sd (use od|os instead if the second value is OS, as the question text states)
_ - an underscore
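As a rough illustration of the pattern and replacement above, here is a minimal Python sketch. It is not the asker's renaming program: the function name rename, the server default and the yyyy_mm_dd date layout are assumptions made for the example, and Group 4 is uppercased in a replacement function because a plain $4 replacement keeps the original case.

```python
import re
from datetime import date
from typing import Optional

# Pattern from the answer; (?i) lets od/sd match in any case.
PATTERN = re.compile(r"(?i)^(.*?)_(\d{6}\D*\d{4})_(\d+)_(od|sd)_")


def rename(filename: str, server: str = "servername1", today: Optional[date] = None) -> str:
    """Move the 10-digit number and the OD/SD tag to the front of the name."""
    # Assumed date layout: yyyy_mm_dd (the question's "ss" part is left out).
    stamp = (today or date.today()).strftime("%Y_%m_%d")

    def build(m: "re.Match[str]") -> str:
        # Same order as $2_servername1_$4_<date>_$1_$3_, with Group 4 uppercased.
        return (
            f"{m.group(2)}_{server}_{m.group(4).upper()}_{stamp}_"
            f"{m.group(1)}_{m.group(3)}_"
        )

    # Names that do not match the pattern are returned unchanged.
    return PATTERN.sub(build, filename, count=1)


name = "021-05-05_10-10-12-111_Nancy_Test_123456-1234_194456454390816_od_2021042911270.pdf"
print(rename(name, today=date(2021, 5, 5)))
```

Passing the date in explicitly keeps the result reproducible; leaving it out falls back to the current day.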

Related

Regex choose based on string format

I have following formats of data:
CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats_1
CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats__1
I am using following regex:
^(.*)_(.*?_.*?)(_\d$|__\d$)
My requirement every time is to get CumulativeBinNetworksViews_totalSuccessfulHeartbeats. For the first case it's working fine, but for the second case it's printing "totalSuccessfulHeartbeats_1". How can I solve this?
You can use
^(.*)_([^_]+_[^_]+)__?\d$
Details:
^ - start of string
(.*) - Group 1: any zero or more chars other than line break chars as many as possible
_ - an underscore
([^_]+_[^_]+) - Group 2: one or more chars other than _, then a _, then one or more chars other than _
__? - one or two underscores
\d - a digit
$ - end of string.
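The behaviour described above can be checked with a short Python sketch. The helper name last_two_segments is made up for this example; the pattern is the one from the answer.

```python
import re

# Pattern from the answer: Group 2 cannot contain more than one underscore.
PATTERN = re.compile(r"^(.*)_([^_]+_[^_]+)__?\d$")

samples = [
    "CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats_1",
    "CumulativeReport_cumulativeReportBins_CumulativeBinNetworksViews_totalSuccessfulHeartbeats__1",
]


def last_two_segments(text):
    """Return Group 2, or None when the line does not match."""
    m = PATTERN.match(text)
    return m.group(2) if m else None


results = [last_two_segments(s) for s in samples]
print(results)
```

Because [^_]+ cannot cross an underscore, Group 2 can no longer swallow the extra _ in the second input, which is what went wrong with .*? in the original attempt.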

Regex table of contents

I have table of contents items that I need to parse with a regex. The data is not totally uniform and I can't get it to work in all cases.
Data is following:
1. Header 1
1.2. SubHeader2
1.2.1 Subheader
1.2.2. Another header
1.2.2.1 Test
1.2.2.2. Test2
So I need to get the number and the header in different groups. The number should be without the trailing dot, if there is one. The issue I'm struggling with is that not all of the numbers have the trailing dot.
I have tried
^([0-9\.]+)[\.]\s+(.+)$ -- Doesn't work when there is no trailing dot
^([0-9\.]+)[\.]?\s+(.+)$ -- Contains the trailing dot if it is there
You can use
^(\d+(?:\.\d+)*)\.?\s+(.+)
Details:
^ - start of string
(\d+(?:\.\d+)*) - Group 1: one or more digits and then zero or more repetitions of a . and one or more digits sequence
\.? - an optional .
\s+ - one or more whitespaces
(.+) - Group 2: any one or more chars other than line break chars, as many as possible.
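To see how the two groups come out for the sample data, here is a small Python sketch. The variable names are only illustrative; the pattern is the one given above.

```python
import re

# Group 1: dotted number without the trailing dot; Group 2: header text.
TOC_LINE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)")

lines = [
    "1. Header 1",
    "1.2. SubHeader2",
    "1.2.1 Subheader",
    "1.2.2. Another header",
    "1.2.2.1 Test",
    "1.2.2.2. Test2",
]

# Each entry becomes a (number, header) tuple.
parsed = [TOC_LINE.match(line).groups() for line in lines]
for number, header in parsed:
    print(number, "|", header)
```

The trailing dot stays outside Group 1 because (?:\.\d+)* only consumes a dot that is followed by a digit.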

Ignore Until "Spacebar+I or V or X" - Regex Expression

So... I had a regex which worked just fine (wasn't pretty but worked), until the Roman Numerals reached more than X.
Currently my Regex looks like this:
(.*?)(^(X{1,3})(I[XV]|V?I{0,3})$|^(I[XV]|V?I{1,3})$|^V$)*(.)( EP\. )(\d*)(.*)
The problem I have right now is that if the Roman numeral has a value of 10 or more, it ends up in the 1st group, which drives me nuts.
I need it to work in a way that all before roman numerals is ignored.
Test Text:
PEPA THE PIG XVI EP. 169 - BAD ENDING
Could you please help me fix the regex so it actually does what it is supposed to do?
You should reconsider using anchors in the middle of a regex: ^ requires the start of the string and $ requires the end of the string.
Besides, (.) before ( EP\. ) consumes the space, so the leading space in ( EP\. ) can no longer match.
Consider using
^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)
You might still need to check what exactly you want to match with (.).
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\b - a word boundary
(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V) - Group 2: one to three Xs followed by IX or IV, or by an optional V and then zero to three Is; or IX or IV; or an optional V followed by one to three Is; or V
\b - a word boundary
(.) - Group 3: any one char (other than a newline)
\b - a word boundary
(EP\.) - Group 4: EP.
\s* - zero or more whitespaces
(\d+) - Group 5: one or more digits
(.*) - Group 6: any zero or more chars other than line break chars, as many as possible
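A quick way to inspect the six groups is to run the suggested pattern over the test text. This is a Python sketch for illustration only; the asker's regex flavor is not stated, so the behaviour there is assumed to be comparable.

```python
import re

# Pattern from the answer, unchanged.
PATTERN = re.compile(
    r"^(.*?)\b(X{1,3}(?:I[XV]|V?I{0,3})|I[XV]|V?I{1,3}|V)\b(.)\b(EP\.)\s*(\d+)(.*)"
)

text = "PEPA THE PIG XVI EP. 169 - BAD ENDING"
m = PATTERN.match(text)
groups = m.groups() if m else None
print(groups)
```

The Roman numeral lands in Group 2 and the title before it stays in Group 1, including its trailing space.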

Regex to match an unlimited repeating pattern between two strings

I have a dataset with repeating pattern in the middle:
YM10a15b5c27
and
YM1b5c17
How can I get what is between "YM" and the last two numbers?
I'm using this, but the last group captures only one digit instead of two:
/([A-Z]+)([0-9a-z]+)([0-9]+)/
Capture exactly two characters in the last group:
/([A-Z]+)([0-9a-z]+)([0-9]{2})/
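The effect of {2} in the last group can be checked with a short Python sketch. The slashes are regex delimiters and are dropped here, and the helper name split_code is an assumption for the example.

```python
import re

# Same pattern as above, without the / delimiters.
PATTERN = re.compile(r"([A-Z]+)([0-9a-z]+)([0-9]{2})")


def split_code(text):
    """Return (prefix, middle, last two digits), or None when nothing matches."""
    m = PATTERN.match(text)
    return m.groups() if m else None


print(split_code("YM10a15b5c27"))
print(split_code("YM1b5c17"))
```

Group 2 is greedy, so it first takes everything after the prefix and then gives back exactly the two digits that Group 3 requires.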
You should use:
/^(?:([a-z]+))([0-9a-z]+)(?=\1)/
^ matches the start of the string. This is really important, because if your code is aaaa1234aaaa, then without the ^ the regex could also start matching at the aaaa at the end.
(?:([a-z]+)) is a non-capturing group wrapping Group 1, which captures one or more letters from 'a' to 'z'
(?=\1) is a positive lookahead that only lets the match succeed if it is immediately followed by the same text that Group 1 captured at the start.
All you have to do is extract the code with group(2)
Solution
If you want to match these strings as whole words, use \b(([a-z])\2)([0-9a-z]+)(\1)\b. If you need to match them as separate strings, use ^(([a-z])\2)([0-9a-z]+)(\1)$.
Explanation
\b - a word boundary (or if ^ is used, start of string)
(([a-z])\2) - Group 1: any lowercase ASCII letter, exactly two occurrences (aa, bb, etc.)
([0-9a-z]+) - Group 3: 1 or more digits or lowercase ASCII letters
(\1) - Group 4: the same text as stored in Group 1
\b - a word boundary (or if $ is used, end of string).
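The anchored variant can be tried out as follows. This is a Python sketch; the sample strings and the helper name middle are invented for the example and do not come from the original question.

```python
import re

# Anchored variant: a doubled letter, a middle part, then the same doubled letter.
PATTERN = re.compile(r"^(([a-z])\2)([0-9a-z]+)(\1)$")


def middle(text):
    """Return Group 3 (the part between the two markers), or None."""
    m = PATTERN.match(text)
    return m.group(3) if m else None


print(middle("aa1234aa"))
print(middle("aa1234bb"))
print(middle("ab1234ab"))
```

The second string fails because the closing marker differs from Group 1, and the third fails because the opening marker is not a doubled letter.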

Find matches ending with a letter that is not a starting letter of the next match

Intro
I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form
[Letter][between 2 and 4 numbers][optional letter that is not the next match starting letter]
The regex for this pattern is (I believe)
\w\d{2,4}\w?
Example
Here is an example
mystring='F328AG560F33'
In this example there are three codes:
'F328A' 'G560' 'F33'
I would like to extract these codes with a function like str_extract_all in R (preferably but not exclusively)
My solution so far
So far, I managed to come up with an expression like:
str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')
However when applied to the example above it returns
"F328" "G560F"
Basically it misses the letter A in the first code, and misses altogether the last code "F33" by mistakenly assigning F to the preceding code.
Question
What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.
Application
This question is of great relevance for example when mining patient Electronic Health Records that have not been validated.
You have a letter, two-to-four numbers then an optional letter. That optional letter, if it's there, will only ever be followed by another letter; or, put another way, never followed by a number. You can write a negative lookahead to capture this:
\w\d{2,4}(?:\w(?!\d))?
This at least works with PCRE. I don't know how R will handle it.
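For comparison, the same pattern can be tried in Python's re module, which also supports negative lookaheads. This is only a sketch in a different engine and says nothing about R's behaviour.

```python
import re

# The optional trailing \w is accepted only when no digit follows it.
PATTERN = re.compile(r"\w\d{2,4}(?:\w(?!\d))?")

mystring = "F328AG560F33"
codes = PATTERN.findall(mystring)
print(codes)
```

Here the F after G560 is rejected as a trailing letter because a digit follows it, so it starts the next code instead.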
Your matches are overlapping. In this case, you might use str_match_all that allows easy access to capturing groups and use a pattern with a positive lookahead containing a capturing group inside:
(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
Details
(?= - a positive lookahead start (it will be run at every location before each char and at the end of the string)
( - Group 1 start
[A-Z] - a letter (if you use a case insensitive modifier (?i), it will be case insensitive)
\d{2,4} - 2 to 4 digits
(?: - an optional non-capturing group start:
[A-Z] - a letter
(?!\d{2,4}) - not followed with 2 to 4 digits
)? - the optional non-capturing group end
) - Group 1 end
) - Lookahead end.
R demo:
> library(stringr)
> res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
> res[[1]][,2]
[1] "F328A" "G560" "F33"