I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this, but I'm more of a C guy these days). I have read some solutions that involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by the second character regardless of what comes before or after it, the regex ^.(.) captures that second character in group 1...
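For the full ordering asked for in the question, here is a minimal sketch in Python rather than VBA (the function name sort_key and the sample values are my own, not from the question). It classifies each string with a regex and builds a tuple key, so the #A## strings sort by letter first and then by leading digit. It assumes anything that fits none of the four shapes should go last.

```python
import re

def sort_key(value):
    """Build a tuple key: (type rank, primary, secondary, rest)."""
    s = value.strip()
    # #A##: digit, letter, two digits -> sort by letter, then leading digit
    m = re.match(r"(\d)([A-Za-z])(\d{2})", s)
    if m:
        return (3, m.group(2).upper(), m.group(1), m.group(3))
    # A###: letter followed by three digits
    m = re.match(r"([A-Za-z])(\d{3})", s)
    if m:
        return (2, m.group(1).upper(), m.group(2), "")
    # ####: four digits (checked before three digits)
    m = re.match(r"\d{4}", s)
    if m:
        return (1, m.group(0), "", "")
    # ###: three digits
    m = re.match(r"\d{3}", s)
    if m:
        return (0, m.group(0), "", "")
    # Assumption: unrecognised strings are placed after everything else
    return (4, s, "", "")

values = ["0B00", "9A99", "A000", "0000", "999", "000", "Z999", "0A00"]
print(sorted(values, key=sort_key))
```

The same idea carries over to a helper column in Excel: compute the key next to each row and sort the sheet by it.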
I have a field that holds the name of the text file being used as the data source. The file name is formatted like "file_name_example_2022-11-17_14.45.56.txt", with "2022-11-17_14.45.56" being the date and time. I know I can do a series of RIGHT and LEFT calls to extract the date time as a separate field, but I wanted to see if REGEXP_EXTRACT would provide a cleaner way to do it. I've been looking at regular expression documentation and can't seem to figure it out. I am trying to end up with a full date time field.
So far I have tried
REGEXP_EXTRACT([File Paths], '\d(.+)')
and that results in "022-11-17_14.45.56.txt"
You can use
REGEXP_EXTRACT([File Paths], '\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}')
See the regex demo.
Details:
\d{4}-\d{1,2}-\d{1,2} - four digits, -, one or two digits, -, one or two digits
_ - a _ char
\d{1,2}\.\d{1,2}\.\d{1,2} - one or two digits, ., one or two digits, ., one or two digits.
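If it helps to check the pattern outside the calculated field, here is a small Python sketch (the function name extract_timestamp is made up for illustration). It applies the same regex and then parses the match into a real date-time value, which is the end goal mentioned in the question.

```python
import re
from datetime import datetime

# Same pattern as above: yyyy-m(m)-d(d)_h(h).m(m).s(s)
STAMP = re.compile(r"\d{4}-\d{1,2}-\d{1,2}_\d{1,2}\.\d{1,2}\.\d{1,2}")

def extract_timestamp(file_name):
    """Return the embedded date-time, or None if the name has no timestamp."""
    match = STAMP.search(file_name)
    if match is None:
        return None
    # strptime raises ValueError for impossible values such as month 13
    return datetime.strptime(match.group(0), "%Y-%m-%d_%H.%M.%S")

print(extract_timestamp("file_name_example_2022-11-17_14.45.56.txt"))
```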
I need to validate with regex a date in format yyyy-mm-dd (2019-12-31) that should be within the range 2019-12-20 - 2020-01-10.
What would be the regex for this?
Thanks
Regexes only deal with characters, so we have to work out which characters are valid at each position in the date.
The first part is easy: the first two characters have to be 20.
Now it gets complicated. The next character can be a 1 or a 2, but what follows depends on its value, so we split the rest of the regex into two alternatives: one for when the third character is 1 and one for when it is 2.
If the third character is a 1, it must be followed by the characters 9-12-, because the range starts at 2019-12-20. Now for the day part. The 9th character is the tens digit of the day; it can only be 2 or 3, as we are already in the last month and the minimum date is the 20th. The last character can be any digit 0-9. This gives us a day match of [23][0-9]. Putting this together, the pattern for dates in 2019 is 19-12-[23][0-9]
If the third character is a 2, we can again match up to the day part of the date, as the range ends in January. This gives us a partial match of 20-01-, leaving us to work on the day part. Here we know that the first character of the day can be either a 0 or a 1; if it's a 1, the last character must be a 0, and if it's a 0, the last character can only be in the range 1 to 9. This gives us another alternation, (?:0[1-9]|10). Putting the second part together, we get 20-01-(?:0[1-9]|10).
Combining these together gives the final regex 20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))
Note that I'm assuming that the date you are testing against is a validly formatted date.
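A quick way to see what that assumption means in practice is to run the regex against a few boundary values. This is only a sketch in Python (the helper name in_range is mine). It uses fullmatch so that nothing before or after the date is tolerated.

```python
import re

RANGE_RE = re.compile(r"20(?:19-12-[23][0-9]|20-01-(?:0[1-9]|10))")

def in_range(text):
    """True if the whole string matches the range regex."""
    return RANGE_RE.fullmatch(text) is not None

for candidate in ["2019-12-19", "2019-12-20", "2020-01-10", "2020-01-11", "2019-12-32"]:
    print(candidate, in_range(candidate))
```

The last value, 2019-12-32, is accepted by the pattern even though it is not a real date, which is exactly why the input has to be a validly formatted date to begin with.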
Try this:
(2019|2020)\-(12|01)\-([0-3][0-9]|[0-9])
But be aware that this allows any dd value whose first digit is between zero and three and whose second digit is between zero and nine. You could instead list every day value you want to allow (20 through 31 and 1 through 10), like this: (20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10).
(2019|2020)\-(12|01)\-(20|21|22|23|24|25|26|27|28|29|30|31|01|1|02|2|03|3|04|4|05|5|06|6|07|7|08|8|09|9|10)
But honestly, regular expressions are not the right tool for this. A regex describes the shape of a string, not the logical relationship between its parts. Use a regex to extract the values from the string, and validate those values in another language.
The second regex above will, for example, match your dates, but also values outside the range, because nothing ties the year group 2019|2020 to the month group 12|01. So it matches values like 2019-12-25 (in range) but also 2020-12-25 (out of range).
To match only the values you want, you would need a really large regex like this (inner brackets only if you need them): ((2019)-(12)-(20)|(2019)-(12)-(21)|(2019)-(12)-(22)|...), continuing through all possible dates. Then ask yourself: what would you do if you found such a regex in a project you had to work with? ;)
Better solution (quick and dirty, there might be better solutions):
(?<yyyy>20[0-9]{2})\-(?<mm>[01][0-9]|[0-9])\-(?<dd>[0-3][0-9]|[0-9])
This way you have three named groups (yyyy, mm, dd) you can access and validate the matched values... The regex is smaller, you have a better association between code and regex and both are easier to maintain.
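As a rough illustration of "extract with regex, validate in code", here is a Python sketch. Note that Python spells named groups (?P<name>...) rather than (?<name>...). The function name is_valid and the START/END constants are my own choices.

```python
import re
from datetime import date

DATE_RE = re.compile(
    r"(?P<yyyy>20[0-9]{2})-(?P<mm>[01][0-9]|[0-9])-(?P<dd>[0-3][0-9]|[0-9])"
)
START, END = date(2019, 12, 20), date(2020, 1, 10)

def is_valid(text):
    """Extract the parts with the regex, then validate them as a real date."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        return False
    try:
        parsed = date(int(m.group("yyyy")), int(m.group("mm")), int(m.group("dd")))
    except ValueError:
        # e.g. month 13 or day 32: right shape, not a real date
        return False
    return START <= parsed <= END

print(is_valid("2019-12-31"), is_valid("2020-12-11"))
```

The range check is now an ordinary date comparison, so changing the range means editing two constants instead of rewriting the pattern.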
I'm trying to write a regex to parse a bank sort code from a database.
The reason I need a regex is that the sort code might be contained in a sentence.
But it also might not be a sort code at all, because the people entering data into the database sometimes put bank account numbers and phone numbers into the sort code column.
I can use
^[^0-9]*[0-9]{6}[^\d]*$
which works on
"blah123456blah"
but not on
"Emloyee 12's srt code : 123456"
Anything else I've tried gives me a match for 6 or more digits within a string (which is then most likely a bank account number).
Any help is greatly appreciated.
You say you are using
[0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2}
To add the boundaries like you need, add (^|[^0-9]) (either the string start position (^) or (|) a non-digit ([^0-9])) in front and ([^0-9]|$) (matching a non-digit or the end of string position ($)) at the end:
(^|[^0-9])[0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2}([^0-9]|$)
See the regex demo.
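Here is a small Python sketch of the same boundary idea, tried against the strings from the question (the function name find_sort_code is hypothetical, and the account-number string is my own test value). I made the two boundary groups non-capturing and wrapped the sort code itself in a capture group, so the surrounding non-digit characters are not part of the returned value.

```python
import re

SORT_CODE = re.compile(
    r"(?:^|[^0-9])"                                # start of string or a non-digit
    r"([0-9]{2}\s*-?\s*[0-9]{2}\s*-?\s*[0-9]{2})"  # the six digits, optional dashes
    r"(?:[^0-9]|$)"                                # a non-digit or end of string
)

def find_sort_code(text):
    """Return the first six-digit sort code in text, or None."""
    m = SORT_CODE.search(text)
    return m.group(1) if m else None

print(find_sort_code("Emloyee 12's srt code : 123456"))
print(find_sort_code("blah123456blah"))
print(find_sort_code("account 12345678"))
```

The eight-digit value is rejected because a digit follows the sixth digit. Digit groups separated by spaces (as in some phone numbers) may still slip through, so treat this as a first filter only.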
I have a large csv with a text column that has a max width of 200. In nearly all cases the data is fine. In some cases, the data is too long or has not quite been filled in properly. I would like to use regex to find the last instance of a specific numeric/character pairing and then remove everything after it.
eg data:
df <- data.frame(ID = c("1","2","3"),
text = c("A|explain what a is|12.2|Y|explain Y|2.36|",
"A|explain what a is|15.2|E|explain E|10.2|E|explain E but run out hal",
"D|explain what d is|0.48|Z|explain z but number 5 is present|"))
My specific character pair is any number followed by a |
This would mean Row 1 is fine, row 2 would have everything after '10.2' removed and row 3 would have everything after 0.48 removed
I tried this regex:
df[,2] <- sub("([^0-9]+[^|]*$)", "", df[,2])
It very nearly worked, but the few rows in my data that have a number in the explanation do not play along. Any clues? I'm not a great regexer yet; still learning the ropes.
I saw this question about grouping, but couldn't quite apply it to my problem.
Using sub, we capture as a group zero or more characters (.*) followed by one or more digits, a dot if present (\\.?), and one or more digits. The group must be followed by | and then the rest of the characters until the end of the string. Because .* is greedy, the group extends to the last number that is directly followed by |. In the replacement, only the capture group is kept (\\1).
sub('^(.*[0-9]+\\.?[0-9]+)\\|.*$', '\\1', df$text)
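For anyone who wants to try the idea outside R, here is a rough Python equivalent run on the three example rows (the function name trim_after_last_number is mine). It is a simplified variant, not the exact pattern above: it only requires a single digit directly before the |, whereas [0-9]+\\.?[0-9]+ needs at least two digits.

```python
import re

def trim_after_last_number(text):
    """Keep everything up to the last digit that is directly followed by '|'."""
    # Greedy .* pushes the match as far right as possible; rows without
    # any "digit|" pair are returned unchanged.
    return re.sub(r"^(.*[0-9])\|.*$", r"\1", text)

rows = [
    "A|explain what a is|12.2|Y|explain Y|2.36|",
    "A|explain what a is|15.2|E|explain E|10.2|E|explain E but run out hal",
    "D|explain what d is|0.48|Z|explain z but number 5 is present|",
]
for row in rows:
    print(trim_after_last_number(row))
```

As with the R version, the | after the last number is dropped along with everything that follows it.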
I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
#maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also covers some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. Match any amount of delimiters before the first word
2. Match a word (= at least one non-delimiter)
3. The word has to be followed by at least one delimiter
4. Or it can be at the end of the string (in case no delimiter follows at the end)
5. Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
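To cover EDIT #2 in the question (return fewer words when the text is shorter than requested), one option is to swap the exact count for a bounded quantifier {1,N}. That change is my own adaptation, not part of the pattern above. Here is a sketch in Python that builds the pattern from the delimiter string. The function name first_n_words is made up, and re.escape is used so the - does not need special placement inside the character class.

```python
import re

def first_n_words(text, n, delimiters=".,;:!?-*_"):
    """Return up to n leading words, including the delimiters between them."""
    # Character-class body: whitespace plus the escaped custom delimiters
    cls = r"\s" + re.escape(delimiters)
    # {1,n} instead of {n}: match as many words as exist, up to n
    pattern = rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{1,{n}}}"
    m = re.match(pattern, text)
    return m.group(0) if m else ""

sample = "one,two;three-four_five.six:seven eight nine! ten"
print(first_n_words(sample, 5))
print(first_n_words("one two three", 4))
```

With the sample string from the question and n = 5, this gives the same result as the fixed-count pattern. With a three-word input and n = 4, it returns the three words instead of failing to match.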
Thanks to #maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using #maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx