Need to capture single character, but ignore digit - regex

I'm parsing out flight info.
Here's the sample data:
E0.777 7 3:09
E0.319 N 1:43
E0.735 8 1:45
E0.735 N 1:48
E0.M80 9 3:21
E0.733 1:48
I need to populate fields like this:
Equipment: 735
On Time: N
Duration: 1:48
Problem I'm having is capturing the Y or N character but ignoring the single digit, then capturing the duration.
This is the expression I have tried:
#"^.{3}(.{3})\s?([N|Y]?)?(?:[0-9]\s+)?(\w{4})"
Edit: I updated the sample data to clarify my question. Equipment is not always three digits, it could be a character and two digits. The data between the equipment and the duration could be a boolean N or Y, a single digit, or white space. Only the boolean should be captured.

Firstly, you mix up the concepts of alternation and character classes: [Y|N] would match 3 different characters: Y or | or N. Either use (...) or leave out the pipe.
Secondly, your double ? after the character class does not really do anything. Thirdly, at the end you only match consecutive spaces if a digit was found. If there is no digit, the last ? skips the whole subpattern, and the spaces along with it.
Lastly, \w does not match :.
Try this:
#"^.{3}(\d{3})\s?(?:([NY])|\d)\s+(\d:\d\d)"
You should also think about restricting the repeated . at the beginning to a more precise pattern (e.g. \w{2}\., but I don't know the possibilities there).
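The first and last points are quick to confirm. A minimal sketch, using Python's re module purely for illustration (the question does not say which engine is in use, and the variable names are made up for this example):

```python
import re

# Inside a character class the pipe is a literal, so [Y|N] accepts "|" too.
pipe_in_class = re.fullmatch(r"[Y|N]", "|") is not None
pipe_in_plain_class = re.fullmatch(r"[YN]", "|") is not None

# \w covers letters, digits and underscore, so it cannot match the colon in "1:48".
word_chars_match = re.fullmatch(r"\w{4}", "1:48") is not None
explicit_match = re.fullmatch(r"\d:\d\d", "1:48") is not None

print(pipe_in_class, pipe_in_plain_class)
print(word_chars_match, explicit_match)
```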

#"^..\.(\d{3})\s(?:([YN])|\d)\s*(\S{4})"
Changed .{3} to ..\. which is a bit more specific about there being a literal . for character 3.
(?:([YN])|\d) matches either Y/N or a digit, but only captures a Y or N. Notice that it's [YN] not [Y|N].
Changed \w{4} to \S{4} since \w doesn't match colons :.

This will do it...
^\w\d\.(\d{3})\s(?:([YN])|\d)\s*(\d:\d{2})$
I made some other changes to your regex because it was easier for me to just rewrite it based on your data than to try to modify what you had.
This will capture the Y or N or it won't capture anything in that group. I also tried to be more specific with your duration regex.
Update: This works with your new requirements...
^\w\d\.(\w{3})\s(?:([YN])|\d|\s)\s*(\d:\d{2})$
You can see it working on your data here... http://regexr.com?32j1b
(hover over each line to see the matched groups)
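For anyone who wants to try this outside regexr, here is a rough Python sketch that fills the three fields from the sample data. One caveat: as the sample is shown in the question, the last line has a single space between the equipment and the duration, so this sketch makes the whole middle field optional. That change, and the names FLIGHT_RE and parse_flight, are assumptions for illustration, not part of the patterns posted above.

```python
import re

# Middle field is optional: either a captured Y/N or an ignored single digit.
FLIGHT_RE = re.compile(r"^\w\d\.(\w{3})\s+(?:([YN])\s+|\d\s+)?(\d:\d{2})$")

SAMPLE = [
    "E0.777 7 3:09",
    "E0.319 N 1:43",
    "E0.735 8 1:45",
    "E0.735 N 1:48",
    "E0.M80 9 3:21",
    "E0.733 1:48",
]

def parse_flight(line):
    """Return the three fields as a dict, or None if the line does not match."""
    m = FLIGHT_RE.match(line)
    if m is None:
        return None
    equipment, on_time, duration = m.groups()
    # on_time is None when the line holds a digit or nothing in that column.
    return {"Equipment": equipment, "On Time": on_time, "Duration": duration}

for line in SAMPLE:
    print(line, "->", parse_flight(line))
```

The digit alternative is deliberately left outside any capturing group, so only Y or N ever lands in the second field.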

This captures all lines with Y or N and ignores everything else:
^...(\d{3})\s*([YN])\s*(\d+:\d+)

How Can I Create a RegEx Pattern that will Get N Words Using Custom Word Boundary?

I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Lets call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
@maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} as follows, which also covers some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. Match any amount of delimiters before the first word
2. Match a word (= at least one non-delimiter)
3. The word has to be followed by at least one delimiter
4. Or it can be at the end of the string (in case no delimiter follows at the end)
5. Repeat 2. to 4. <NumWordsOut> times
Note how I changed the order of the -, it has to be at the start or end, otherwise it needs to be escaped: \-.
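A minimal Python sketch of the same idea, building the character class from the delimiter set in the question. Two things here are assumptions rather than part of the pattern above: the function name first_words is made up, and the quantifier is written as {1,N} instead of {N} so that a text with fewer than N words still returns what it has (the requirement from EDIT #2).

```python
import re

DELIMITERS = ".,;:!?-*_"   # strDelimiters from the question

def first_words(text, max_words, delimiters=DELIMITERS):
    """Return the leading part of text that holds at most max_words words."""
    # Whitespace plus the escaped delimiter set, for use inside [...].
    cls = r"\s" + re.escape(delimiters)
    # {1,N} instead of {N}: an assumption to tolerate texts with fewer words.
    pattern = rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{1,{max_words}}}"
    match = re.match(pattern, text)
    return match.group(0) if match else ""

sample = "one,two;three-four_five.six:seven eight nine! ten"
print(first_words(sample, 5))
print(first_words("one two three", 4))
```

Escaping the delimiter set with re.escape sidesteps the question of where the - sits inside the class.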
Thanks to @maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using @maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx

Regular Expressions in R

I found somewhat similar questions
R - Select string text between two values, regex for n characters or at least m characters,
but I'm still having trouble
say I have a string in r
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
And I need to be able to pull anything between the first element in the string that contains 2 characters (AK) and PADK,ADK. PADK and ADK will change in character but will always be 4 and 3 characters in length respectively.
So I would need to pull
ADAK NAS
I came up with this, but it's picking up everything from AK to ADK:
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}
If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
You'll have to switch on the perl = TRUE option (to use a decent regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4 letter word followed by a 3 letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
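As a quick sanity check of the capture-group form, here it is run against the string from the question. This sketch uses Python rather than R, and only the first pattern is shown because Python's re module does not support \K. The variable names are illustrative.

```python
import re

testing_string = "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"

# 2-letter word, lazily captured text, then a 4-letter and a 3-letter word.
pattern = r"\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b"

match = re.search(pattern, testing_string)
station = match.group(1) if match else None
print(station)
```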
Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})

RegEx which accepts only two decimal places

Hi, I am working on a regex. A correct response should NOT allow a number given to the tenths place only, as in RESPONSE = "925.0", nor should it allow trailing zeros after the hundredths place, as in RESPONSE = "925.000". The only correct responses are: 925, 0925, 0925., 925., 925.00, 00925
I worked on it and finally came up with this
"^-?(0)*(\d*(\.(00))?\d+.|(\d){1,3}(,(\d){3})*(\.(00))?)$"
It works for three-digit numbers, but it rejects 38400.00.
I am not quite certain whether the decimal places can be any digit or if they have to be zero. If the former, then this should do the trick:
^-?\d{1,3}(,?\d{3})*(\.(\d{2})?)?$
If the latter, then this:
^-?\d{1,3}(,?\d{3})*(\.(00)?)?$
The entire match starting with the decimal point is optional, and the two decimal places in that match are optional as well.
UPDATE I just realized that it appears you need to accept commas in the response as well - I assume for thousands, millions, etc.
UPDATE #2 per OP's comment
^-?(\d+|\d{1,3}(,\d{3})*)(\.(00)?)?$
UPDATE #3 Added link to regex101 for explanation of this regular expression.
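A small Python sketch that runs the last pattern against the responses listed in the question. The value 38,400.00 is an extra example added here to exercise the comma branch, and the names RESPONSE_RE and is_valid are made up for illustration.

```python
import re

RESPONSE_RE = re.compile(r"^-?(\d+|\d{1,3}(,\d{3})*)(\.(00)?)?$")

# Responses the question wants accepted, plus one comma-grouped example.
accepted = ["925", "0925", "0925.", "925.", "925.00", "00925", "38400.00", "38,400.00"]
# Responses the question wants rejected.
rejected = ["925.0", "925.000"]

def is_valid(response):
    """True when the whole response matches the pattern."""
    return RESPONSE_RE.match(response) is not None

for response in accepted + rejected:
    print(f"{response!r}: {is_valid(response)}")
```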
Have a try with:
^-?\d{1,3}(?:,?\d{3})*(?:\.(?:00)?)?$
I think your problem is that you're trying to match it in chunks of three, with commas separating, but 38400.00 doesn't have commas.
Try this:
^-?\d+(\.?(\d{2})?)$
The - matches a literal minus sign. The ? after it makes it optional. This allows negative numbers, so if you only want positive numbers matched, delete the -? that follows the ^.
\d represents every digit. The + after says that there can be as many as you want, as long as there's at least one.
Then there's a \., which is just a literal dot in the number. The ? makes it optional, as before. Since you seem to allow trailing periods, I assumed you wanted it to be considered separately from the following digits.
The inner () encloses the next group, which is two characters that match \d (two digits) and which may occur 0 or 1 times, as dictated by the ?. This allows people to either have no digits after the period or exactly two, but nothing else.
The ^ at the beginning specifies that the match has to start at the beginning of the line, and the $ at the end specifies that it has to end at the end of the line. If the input holds several responses on separate lines, remember to enable the multiline (m) flag so that ^ and $ apply to each line.
Disclaimer: I've not done much regex work before, so I could well be totally off. If it doesn't work, let me know.
Couldn't you do this without the ?'s
^[0-9,]+(\.){0,1}(\d{2}){0,1}$
improved: ^\d+[0-9,]*(\.){0,1}(\d{2}){0,1}$
Edit:
Broken down a bit as requested
Old one:
[0-9,]+
1 or more digits/commas (would have accepted ',' as true) so improved version:
\d+
for starts with 1 or more digits
[0-9,]*
0 or more digits/commas
followed by
(\.){0,1}
0 or 1 decimal point
Followed by
(\d{2}){0,1}
0 or 1 of (exactly 2 digits)
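To see what the change from the old to the improved version buys, here is a short Python comparison. The sample strings "," and "9,,25" are made up for this sketch. Note that neither version checks where the commas sit, so oddly grouped input still passes the improved one.

```python
import re

OLD = re.compile(r"^[0-9,]+(\.){0,1}(\d{2}){0,1}$")
IMPROVED = re.compile(r"^\d+[0-9,]*(\.){0,1}(\d{2}){0,1}$")

samples = ["925", "925.", "925.00", "38400.00", "925.0", ",", "9,,25"]

# Map each sample to (accepted by OLD, accepted by IMPROVED).
results = {s: (bool(OLD.match(s)), bool(IMPROVED.match(s))) for s in samples}

for sample, (old_ok, improved_ok) in results.items():
    print(f"{sample!r}: old={old_ok} improved={improved_ok}")
```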

best approach for my pattern match

So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha characters followed by
2 alpha characters followed by
2 characters followed by
3 optional characters
this is a standard format for SWIFT BIC code HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
The above returns HsBfGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity: I have a string which contains multiple lines, all starting with a tag made up of a colon, a two-digit number, an optional letter and a closing colon (e.g. :58A:). I want to specify which line to start matching in and return a BIC from anywhere in that line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and fewer characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.
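For the requirement in the second edit (pass in tag 53 or 59 and get the BIC back), one possible approach is to cut out the field for that tag first and then search it for a BIC. The Python sketch below does that against the sample message. Everything in it is an assumption for illustration: the names find_bic and BIC_RE, the reading of 4!a2!a2!c[3!c] as upper-case letters and digits, and the rule that a BIC must not be embedded in a longer alphanumeric run. That last rule holds for the sample message but not for the HsBfGB4LXXXHELLO test string.

```python
import re

MESSAGE = """\
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
"""

# 6 letters, 2 alphanumerics, optional 3 alphanumerics, upper case only
# (assumption), and not part of a longer alphanumeric token (assumption).
BIC_RE = re.compile(
    r"(?<![A-Za-z0-9])[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?(?![A-Za-z0-9])"
)

def find_bic(message, tag):
    """Return the first BIC in the field with the given two-digit tag, or None."""
    # A field runs from its tag up to the next line that starts with a tag.
    field_re = re.compile(
        rf"^:{re.escape(tag)}[A-Z]?:(.*?)(?=^:\d{{2}}[A-Z]?:|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    field = field_re.search(message)
    if field is None:
        return None
    bic = BIC_RE.search(field.group(1))
    return bic.group(0) if bic else None

print(find_bic(MESSAGE, "53"))
print(find_bic(MESSAGE, "59"))
```

Splitting the work in two steps keeps the tag handling separate from the BIC format, so continuation lines such as the one under :59: are searched as well.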

Regular Expression to match pattern once or more with no partial matches

Better explained with examples:
HHH
HHHH
HHHBBHHH
HHHBH
BB
HHBH
I need to come up with a regexp that matches only 3 H's or a multiple of 3 H's (so 6, 9, 12, ... H's are ok as well) and 5 H's are not ok. And if possible I don't want to use Perl regexps.
So for the input above the regexp would match (1), (3) and (6) only.
I'm just starting with regular expressions here so I don't exactly know how I'm supposed to approach this.
edit
Just to clear something up: an H can only be in one group of 3 H's. The group of 3 H's might be HHH or HHBH.
That's why in example 2 above it is not a match because the last H is not in a group of 3 H's. And you can't take the last 3 H's in a group because the middle 2 H's have already been inside a group before.
You can use the following regular expression:
^([^H]*H[^H]*H[^H]*H[^H]*)+$
It matches any string which contains in total 3 H or any multiple of 3. In between there might be any other character.
Explanation:
^ beginning of string
( start of group
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]*H any string of characters (or none) not including 'H' plus a single 'H'
[^H]* any string of characters (or none) which is not 'H'
)+ the group repeated once, twice, or more
$ end of string
By repeating the subpattern [^H]*H three times we make sure that there are indeed 3 H included, [^H]* allows any separating characters.
Note: use either egrep or run grep with additional argument -E.
Use this to match a multiple of 3 H's:
(H{3})+
Here is a complete regex for your examples:
^(H{3})+B*(H{3})*$
Edit: It looks like you need to count non-consecutive H's. In that case:
^(([^H]*H){3})+[^H]*$
That should match any string with a multiple of 3 H's.
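The difference between the two patterns shows up on the six examples from the question. A short Python sketch (the question prefers non-Perl tools, so treat this only as a convenient way to check; the names are illustrative):

```python
import re

examples = ["HHH", "HHHH", "HHHBBHHH", "HHHBH", "BB", "HHBH"]

CONSECUTIVE = re.compile(r"^(H{3})+B*(H{3})*$")
INTERLEAVED = re.compile(r"^(([^H]*H){3})+[^H]*$")

def matching_lines(regex):
    """1-based positions of the examples that the regex matches."""
    return [i for i, s in enumerate(examples, start=1) if regex.match(s)]

print("consecutive:", matching_lines(CONSECUTIVE))
print("interleaved:", matching_lines(INTERLEAVED))
```

Only the second pattern picks up example (6), where the three H's are split by a B.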
Given the requirement that H's can be arbitrarily interleaved with non-H's, but that the total number of H's must be a non-zero multiple of 3 (so XXX, containing no H's, is not a match), then the total regular expression is anything but trivial. This is not a beginner's regular expression.
I'm going to assume that the dialect of regular expression treats {} and () as metacharacters for counting and grouping, and includes + for one-or-more. If you're using a regular expression system that has a different requirement (\{\}, for example) then adjust accordingly.
You need the regex to match the whole string, so there are no stray H's allowed. So, it must start with ^ and end with $. You need to allow an arbitrary number of non-H's at front and back. The H's may be separated by an arbitrary number of non-H's. That leads to:
^([^H]*H[^H]*H[^H]*H)+[^H]*$
Ouch; that is hard to read! It says the line must consist of 1 or more (+) groups of an arbitrary number of non-H's followed by an H, an arbitrary number of non-H's, another H, an arbitrary number of non-H's and a third H; all of which can be followed by an arbitrary number of non-H's.
Using the {} for counting:
^(([^H]*H){3})+[^H]*$
That's still hard to read. Note that my description said "arbitrary number of non-H's at front and back", but I only use the [^H]* at the back; that's because the repeating pattern allows an arbitrary number of non-H's at the front anyway so there's no need to repeat that fragment.
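Since both forms are hard to read, a brute-force comparison can give some reassurance that they agree. The Python sketch below checks every string over the letters H and B up to length 7 against a plain count of H's. The helper h_count_ok and the choice of alphabet and length are assumptions made for this check.

```python
import re
from itertools import product

EXPANDED = re.compile(r"^([^H]*H[^H]*H[^H]*H)+[^H]*$")
COUNTED = re.compile(r"^(([^H]*H){3})+[^H]*$")

def h_count_ok(s):
    """True when s holds a non-zero multiple of 3 H's."""
    n = s.count("H")
    return n > 0 and n % 3 == 0

# Every string over {H, B} with length 0 through 7.
strings = ["".join(p) for n in range(8) for p in product("HB", repeat=n)]

# Strings where the two regexes and the plain count do not all agree.
disagreements = [
    s for s in strings
    if not (bool(EXPANDED.match(s)) == bool(COUNTED.match(s)) == h_count_ok(s))
]
print(len(strings), "strings checked,", len(disagreements), "disagreements")
```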