I have a field that contains the name of the text file being used as the data source. The file name is formatted like "file_name_example_2022-11-17_14.45.56.txt", where "2022-11-17_14.45.56" is the date and time. I know I can use a series of RIGHT and LEFT calls to extract the date and time as a separate field, but I wanted to see if REGEXP_EXTRACT would provide a cleaner way to do it. I've been looking at regular expression documentation and can't seem to figure it out. I am trying to end up with a full date-time field.
So far I have tried
REGEXP_EXTRACT([File Paths], '\d(.+)')
and that results in "022-11-17_14.45.56.txt"
You can use
REGEXP_EXTRACT([File Paths], '\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}')
Details:
\d{4}-\d{1,2}-\d{1,2} - four digits, -, one or two digits, -, one or two digits
_ - a _ char
\d{1,2}\.\d{1,2}\.\d{1,2} - one or two digits, ., one or two digits, ., one or two digits.
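As a rough check of the pattern outside the original tool, here is a minimal Python sketch that applies the same expression and then parses the match into a real date-time value. The helper name extract_datetime and the strptime format string are assumptions made for illustration; they are not part of the original answer.

```python
import re
from datetime import datetime

# Same pattern as in the answer above.
DATE_TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}")


def extract_datetime(file_name):
    """Return the timestamp embedded in file_name, or None if there is none."""
    match = DATE_TIME_RE.search(file_name)
    if match is None:
        return None
    # Assumed layout: year-month-day_hour.minute.second
    return datetime.strptime(match.group(0), "%Y-%m-%d_%H.%M.%S")


parsed = extract_datetime("file_name_example_2022-11-17_14.45.56.txt")
print(parsed)  # 2022-11-17 14:45:56
```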
I'm not sure if using regex is the correct way to go about this here, but I wanted to try solving this with regex first (if it's possible)
I have an EDIFACT file where the data (in bold) in certain fields of some segments needs to be substituted (with different dates, same format):
UNA:+,? '
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'
UNH+00001+BRKE:01+00+0'
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'
INL+181095200+385762115+++0'
BEE+20080702++++0'
BAA+++J+J++++++J+++++++J++0'
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'
BHP+J+++++J+++++J+++++0'
BLA+++J+++++++++0'
BFA++++++++++++J++0'
BSA++J+++J+J+++0'
BAT+20190801+0'
DAT+**20080702**++++0'
UNT+000014+00001'
UNZ+00001+00001'
At first I was able to match those fields using a positive lookahead and a lookbehind (I had a different expression for matching each date).
Here, for example, is the expression I initially used to match the date in the "FAL" segment: (?<=\+[\d]{1}\+)\d{8}(?=\+\+). Then I saw that this date is sometimes preceded by 9 digits and sometimes by 1 (depending on the version), and followed by either ++ or a + and a date, so I added a logical OR like this: (?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+) and quickly realized it's not sustainable, because these EDIFACT files vary far beyond just 9 or 1 digits.
(I have 6 versions for each type, and I have 6 types total.)
Because I have a scheme/map indicating how each version should be built, and I know at which position (based on the + separator) the date is written in each version, I thought about matching the date based on the +: after the 7th occurrence of + in a certain line (say, in the FAL segment), match the next 8 digits.
Is this possible to achieve with regex? And if yes, could someone please tell me how?
I suggest using a pattern like
^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
where {7} can be adjusted to the value you need for each type of segment, and replacing with a backreference to Group 1 followed by the new date. In Python, that is \g<1>20200101 (where 20200101 is your new date); in PHP/.NET, it is ${1}20200101. In JS, plain $1 is enough (as in $120200101), because the pattern has only one capturing group.
To run on a multiline text, use m flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+).
Details
^ - start of string/line
((?:[^+\n]*\+){7}) - Group 1: seven repetitions of zero or more chars other than + and newline, each followed by a +
\d{8} - 8 digits
(?=\+(?:\d{8})?\+) - that are followed by a +, an optional chunk of 8 digits, and another +.
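The replacement can be sketched in Python as follows. The helper name replace_date_after_plus and its parameters are made up for illustration, and the sample text is a three-segment excerpt of the file from the question (without the bold markers).

```python
import re


def replace_date_after_plus(text, plus_count, new_date):
    """Replace the 8-digit date that follows the plus_count-th '+' on a line."""
    pattern = re.compile(
        r"^((?:[^+\n]*\+){" + str(plus_count) + r"})\d{8}(?=\+(?:\d{8})?\+)",
        re.MULTILINE,  # make ^ match at the start of every line
    )
    # \g<1> keeps everything up to and including the plus_count-th '+'.
    return pattern.sub(r"\g<1>" + new_date, text)


sample = "\n".join([
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+++1+0'",
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'",
    "DAT+20080702++++0'",
])

updated = replace_date_after_plus(sample, 7, "20200101")
print(updated)
```

With a count of 7 only the FAL line changes here; the INV date sits after the 5th + and the DAT date after the 1st, so each segment type would need its own count from the scheme.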
I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regexes only deal with characters, so we have to work out which characters are valid at each position in the date.
The first part is easy. The first two characters have to be 20.
Now it gets complicated. The next character can be a 1 or a 2, but what follows depends on the value of that character, so we split the rest of the regex into two sections: the first for when the third character is 1, and the second for when it is 2.
If the third character is a 1, then what must follow is the characters 9-12-, as the range starts at 2019-12-20. Now for the day part. The 9th character is the tens digit of the day; this can only be 2 or 3, as we are already in the last month and the minimum date is the 20th. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together, we now have a pattern for dates starting with 2019: 19-12-[23][0-9]
If the third character is a 2, then we can again match up to the day part of the date, as the range ends in January. This gives us a partial match of 20-01-, leaving us to work on the day part. Here we know that the first character of the day can be either a 1 or a 0; however, if it's a 1 then the last character must be a 0, and if it's a 0 then the last character can only be in the range 1 to 9. This gives us another alternation, (?:0[1-9]|10). Putting the second part together we get 20-01-(?:0[1-9]|10).
Combining these gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming that the date you are testing against is a validly formatted date.
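One way to sanity-check the expression is to compare it against a plain calendar comparison for every real date in a span. This is only a sketch; the function name regex_agrees_with_calendar and the span checked are arbitrary choices for illustration.

```python
import re
from datetime import date, timedelta

RANGE_RE = re.compile(r"20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))")
START, END = date(2019, 12, 20), date(2020, 1, 10)


def regex_agrees_with_calendar(first, last):
    """True if the regex and the date comparison agree for every day in [first, last]."""
    day = first
    while day <= last:
        matched = RANGE_RE.fullmatch(day.isoformat()) is not None
        if matched != (START <= day <= END):
            return False
        day += timedelta(days=1)
    return True


print(regex_agrees_with_calendar(date(2019, 1, 1), date(2020, 12, 31)))  # True
```

The check only feeds in real calendar dates. A string such as 2019-12-35 still matches the regex, which is why the assumption about validly formatted input matters.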
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that this accepts any dd value whose first digit is between zero and three and whose second digit is between zero and nine (so 00 and 32 to 39 pass too). You could instead list every day number you want to allow (20 to 31 and 1 to 10), like this: (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly... regular expressions are not the right tool for this. A regex applies a mask to something; it has no logical context. Use regex to extract the data/value from a string and validate those values using another language.
The 2nd regex above will, for example, match your dates, but also values outside of this range, since there is no connection between the first group 2019|2020 and the second group 12|01. It matches values like 2019-12-21 but also 2020-12-21.
To match only the values you want, you would need a really large regex like this (inner brackets only if you need them): ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...), continuing with all possible dates. Ask yourself: what would you do if you found such a regex in a project you have to work with ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) you can access and validate the matched values... The regex is smaller, you have a better association between code and regex and both are easier to maintain.
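A minimal Python sketch of this extract-then-validate approach follows. Python spells named groups as (?P<name>...) rather than (?<name>...); the function name in_range and its default bounds are assumptions for illustration, taken from the range in the question.

```python
import re
from datetime import date

# Same structure as the regex above, using Python's (?P<name>...) syntax.
DATE_RE = re.compile(
    r"(?P<yyyy>20[0-9]{2})-(?P<mm>[01][0-9]|[0-9])-(?P<dd>[0-3][0-9]|[0-9])"
)


def in_range(text, start=date(2019, 12, 20), end=date(2020, 1, 10)):
    """Extract year, month and day with the regex, then validate them in code."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return False
    try:
        value = date(
            int(match.group("yyyy")),
            int(match.group("mm")),
            int(match.group("dd")),
        )
    except ValueError:
        # The shape matched, but it is not a real calendar date (e.g. day 32).
        return False
    return start <= value <= end


print(in_range("2019-12-31"), in_range("2020-12-11"))  # True False
```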
How can you create a regular expression that checks if a user input matches characters formally found in a currency syntax? (number, period/decimal place, comma, or dollar sign?).
The following can find all characters listed above except for the dollar sign, any idea how to properly structure this?
/([0-9.,])/g
The regex I use for currency validation is as follows:
^(\$)?([1-9]{1}[0-9]{0,2})(\,\d{3})*(\.\d{2})?$|^(\$)?([1-9]{1}[0-9]{0,2})(\d{3})*(\.\d{2})?$|^(0)?(\.\d{2})?$|^(\$0)?(\.\d{2})?$|^$
RegExr is a great website for testing and reviewing these strings (perhaps you could make a regex string that's less of a beast!)
Are you just trying to test the characters? In that case
[0-9,.$]+
will suffice. Or are you testing for the format $1,123,123.12 with the correct placements of commas and everything?
In that case you would need something more like
(\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})?)
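Here is a small Python sketch of the stricter format check. It assumes the decimal point is escaped as \. so that it matches only a literal dot, and it uses fullmatch so the whole input has to fit the pattern; the helper name is_currency is made up for illustration.

```python
import re

# Optional $, 1-3 digits, comma-separated groups of 3 digits, optional cents.
CURRENCY_RE = re.compile(r"\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def is_currency(text):
    """True if the whole string looks like $1,123,123.12 (the $ and cents are optional)."""
    return CURRENCY_RE.fullmatch(text) is not None


for candidate in ["$1,123,123.12", "1,123", "12.50", "$1,12.00", "1234.00", "12x50"]:
    print(candidate, is_currency(candidate))
```

The last candidate shows why the escape matters: with an unescaped dot, 12x50 would be accepted.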
You need to define what you want your regex to match, more formally than "matches characters formally found in a currency syntax". We don't know which currencies you're interested in. We don't know how strict you need it to be.
Maybe you'll come up with something like:
These elements must come in this order:
A currency symbol ('£', '€' or '$') (your requirement might specify more currencies)
1 or more numeric digits
A period or a comma
Exactly two numeric digits
Once you have a specification like that, it's easy to translate into a regular expression:
[£€$] // one of these chars.
\d+ // '+' means 'one or more'
[.,] // '[]' means 'any one of these'.
\d\d // Two digits. Could also be written as '\d{2}'
Or concatenated together:
[£€$]\d+[.,]\d\d
If you've learned about escaping special characters like $ and ., you may be surprised not to see it done here. Within [], they lose their special meaning.
(There are dialects of regex -- check the documentation for whatever implementation you're using)
Your requirements may be different though. The example I've given doesn't match:
$ 12.00
$12
USD12
¥200.00
25¢
$0.00005
20 μBTC
44 dollars
£1/19/11¾d ("one pound, nineteen shillings and elevenpence three farthings")
Work out your requirement, then write your code to meet it.
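To see the example specification in action, this Python sketch runs the concatenated pattern against some of the strings listed above. The accepted samples are invented for illustration, and fullmatch is used so that a string like $0.00005 is not accepted on the strength of its first few characters.

```python
import re

SPEC_RE = re.compile(r"[£€$]\d+[.,]\d\d")

rejected = ["$ 12.00", "$12", "USD12", "¥200.00", "25¢", "$0.00005", "20 μBTC", "44 dollars"]
accepted = ["$12.00", "£3,50", "€1000.00"]  # invented samples that fit the spec

for text in rejected + accepted:
    print(repr(text), SPEC_RE.fullmatch(text) is not None)
```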
You should put a \ before special characters, and add a star (zero or more) or plus (one or more) quantifier to match the whole currency string, for example:
/([0-9\.,]*)/g
or, for a real price such as 200,00, where there are always two digits after the separator:
/(([0-9]+)(\.|,)([0-9]){2})/g
I am trying to extract a specific text from an Outlook subject line. This is required to calculate turn around time for each order entered in SAP. I have a subject line as below
SO# 3032641559 FW: Attached new PO 4500958640- 13563 TYCO LJ
My final output should be like this: 3032641559
I have been able to do this in MS excel with the formulas like this
=IFERROR(INT(MID([#[Normalized_Subject]],SEARCH(30,[#[Normalized_Subject]]),10)),"Not Found")
In the above formula, [#[Normalized_Subject]] is the name of the column in which the SO number exists. I have been asked to do this in Oracle, but I am very new to it. Your help on this would be greatly appreciated.
Note: in the above subject line the number 30 is common in every subject line.
The last parameter of REGEXP_SUBSTR() indicates the sub-expression you want to pick. In this case you can't just match 30 then some more numbers as the second set of digits might have a 30. So, it's safer to match the following, where x are more digits.
SO# 30xxxxxx
As a regular expression this becomes:
SO#\s30\d+
where \s indicates a whitespace character, \d indicates a numeric character, and the + means you want to match as many digits as there are. But we can use the sub-expression selection that is available; to do that you need sub-expressions, i.e. create groups where you want to split the string:
(SO#\s)(30\d+)
Put this in the function call and you have it:
regexp_substr(str, '(SO#\s)(30\d+)', 1, 1, 'i', 2)
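The same grouping idea can be tried outside the database. This is a Python sketch, not Oracle code: group 2 plays the role of the last argument of REGEXP_SUBSTR, and the helper name extract_so_number is an assumption for illustration.

```python
import re

# Group 1 is the "SO# " prefix, group 2 is the order number starting with 30.
SO_RE = re.compile(r"(SO#\s)(30\d+)", re.IGNORECASE)


def extract_so_number(subject):
    """Return the order number that follows 'SO# ', or None if there is none."""
    match = SO_RE.search(subject)
    return match.group(2) if match else None


subject = "SO# 3032641559 FW: Attached new PO 4500958640- 13563 TYCO LJ"
print(extract_so_number(subject))  # 3032641559
```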
I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this, but I'm more of a C guy these days). I have read some solutions which involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by the second character regardless of what comes before and after it, then the regex ^.(.) captures the second character in Group 1...
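For comparison, here is one possible reading of the requested order written as a Python sort key rather than as an Excel formula. The function name sort_key, the group numbering, and the handling of values that fit none of the four shapes are all assumptions; the #A## group is ordered by its letter first, then by its leading digit.

```python
import re


def sort_key(value):
    """Order: ###, ####, A###, then #A## by letter and then by first digit."""
    m = re.match(r"(\d{3})(?!\d)", value)
    if m:
        return (0, m.group(1))
    m = re.match(r"(\d{4})", value)
    if m:
        return (1, m.group(1))
    m = re.match(r"([A-Za-z])(\d{3})", value)
    if m:
        return (2, m.group(1).upper() + m.group(2))
    m = re.match(r"(\d)([A-Za-z])(\d{2})", value)
    if m:
        # Letter first, then the leading digit, then the trailing digits.
        return (3, m.group(2).upper(), m.group(1), m.group(3))
    return (4, value)  # anything else goes last (assumption)


values = ["1B00", "A100", "0999", "100", "0A50", "9A00", "Z999", "0000"]
ordered = sorted(values, key=sort_key)
print(ordered)
```

Because re.match only anchors at the start, extra characters to the right of each code are ignored by the key.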