How can I write this as a regular expression?
tabspaceSTRINGtabspace
My data looks like this:
12345 adsadasdasdasd 30
34562 adsadasdasdasd asdadaads<adasdad 30
12313 adsadasdasdasd asdadas dsaads 313123<font="TNR">adsada 30
1232131 adsadasdasdasd asdadaads<adasdad"asdja <div>asdjaıda 30
I want to get
12345 30
34562 30
12313 30
1232131 30
\t*\t doesn't work.
try the following regular expression
\t.+\t
The problem there is your definition of String...
If you use something like the suggested above, it'll match
tabspaceSTRINGtabspacetabspace
You get the picture. This might be acceptable, if not, you need to limit your "STRING" definition, like:
\t\w+\t
or:
\t[a-zA-Z]+\t
What characters are allowed in your string?
\t\w+\t
\w would allow letters, digits and the underscore (depending on your regex engine ASCII or Unicode)
See it here on Regexr, a good platform to test regular expressions.
Your "regex" \t*\t would match 0 or more tabs and then one tab. The * is a quantifier meaning 0 or more and is referring to the character or group before (here to your \t)
If your whitespace are not tabs, try this
\s+.+\s+30
\s is a whitespace character (space, tab, newline (not important for Notepad++)).
If you are not sure about the strings you are looking for except that they are separated by tabs it is a good approach to describe such a string as everything but a tab: (^\t*)
[^\t]*\t([^\t]*)\t[^\t]*
You can test it on regexpad.com.
Related
Let's say that we have this text:
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-12
2020-10-16
2020-11-12
2020-11-23
2020-11-15
2020-12-01
2020-12-11
2020-12-30
I want to do something like this:
\d\d\d\d-(NOT10)-(30)
So i want to get all dates of any year, but not of the 10th month and it is important, that the day is 30.
I tried a lot to do this using negative lookahead asserations but i did not come up with any working regexes.
You can use negative lookaheads:
\d\d\d\d-(?!10)\d\d-30
The Part (?!10) ensures that no 10 follows at the point where it is inserted into the regex. Notice that you still need to match the following digits afterwards, thus the \d\d part.
Generally speaking you can not (to my knowledge) negate a part that then also matches parts of the string. But with negative lookaheads you can simulate this as I did above. The generalized idea looks something like:
(?!<special-exclusion-pattern>)<general-inclusion-pattern>
Where the special-exclusion-pattern matches a subset of the general-inclusion-pattern. In the above case the general inclusion pattern is \d\d and the special exclusion pattern ins 10.
Try :
/20\d{2}-(?:0[1-9]|1[12])-30/
Explanation :
20\d{2} it will match 20XX
(?:0[1-9]|1[12]) it will match 0X or 11, 12
30 it will match 30
Demo :https://regex101.com/r/O2F1eV/1
It's easiest to simply convert the substring (if present) that matches /^\d{4}-10-30$/ to an empty string, then split the resulting string on one or more newlines.
If your string were
2020-10-16
2020-10-30
2020-11-12
2020-11-23
and was held by the variable str, then in Ruby, for example,
str.sub(/^\d{4}-10-30$/,'')
#=> "2020-10-16\n\n2020-11-12\n2020-11-23\n"
so
str.sub(/^\d{4}-10-30$/,'').split
#=> ["2020-10-16", "2020-11-12", "2020-11-23"]
Whatever language you are using undoubtedly has similar methods.
I have a filename like this:
0296005_PH3843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
I needed to break down the name into groups which are separated by a underscore. Which I did like this:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
So far so go.
Now I need to extract characters from one of the group for example in group 2 I need the first 3 and 8 decimal ( keep mind they could be characters too ).
So I had try something like this :
(.*?)_([38]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It didn’t work but if I do this:
(.*?)_([PH]{2})(.*?) _(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
It will pull the PH into a group but not the 38 ? So I’m lost at this point.
Any help would be great
Try the below Regex to match any first 3 char/decimal and one decimal
(.?)_([A-Z0-9]{3}[0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
Try the below Regex to match any first 3 char/decimal and one decimal/char
(.?)_([A-Z0-9]{3}[A-Z0-9]{1})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
It will match any 3 letters/digits followed by 1 letter/digit.
If your first two letter is a constant like "PH" then try the below
(.?)_([PH]+[0-9A-Z]{2})(.?)(.*?)(.?)_(.?)(.*?)(.?)_(.?)
I am assuming that you are trying to match group2 starting with numbers. If that is the case then you have change the source string such as
0296005_383843C5_SEQ_6210_QTY_BILLING_D_DEV_0000000000000183.PS.
It works, check it out at https://regex101.com/r/zem3vt/1
Using [^_]* performs much better in your case than .*? since it doesn't backtrack. So changing your original regex from:
(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)_(.*?)(\d{16})(.*)
to:
([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
reduces the number of steps from 114 to 42 for your given string.
The best method might be to actually split your string on _ and then test the second element to see if it contains 38. Since you haven't specified a language, I can't help to show how in your language, but most languages employ a contains or indexOf method that can be used to determine whether or not a substring exists in a string.
Using regex alone, however, this can be accomplished using the following regular expression.
See regex in use here
Ensuring 38 exists in the second part:
([^_]*)_([^_]*38[^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
Capturing the 38 in the second part:
([^_]*)_([^_]*)(38)([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*?)(\d{16})(.*)
So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha character followed by
2 alpha characters followed by
2 characters followed by
3 optional character
this is a standard format for SWIFT BIC code HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
the following returns HSBCGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity. I have a string which contains multiple lines all starting with :2xnumber:optional letter (eg, :58A:) i want to specify a line to start matching in and return a BIC from anywhere in the line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and less characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.
Example Data (including newlines)
Alpha1
100
Bravo2
something else
Charlie3
200
Delta4
A==1
Echo5
300
Foxtrot6
I would like to get the output of:
Alpha1 100 Bravo2 something else
Charlie3 200 Delta4 A==1
Echo5 300 Foxtrot6
The pattern is:
AlphaNumeric
Numeric
AlphaNumeric
value that is not a single alphanumeric "word"
The first three parts are easy -- (\w+)\s+(\d+)\s+(\w+)\s+ -- but I don't know how to have the conditional fourth group. Is this possible? If so, how?
This pattern worked for me:
(\w+)\s+(\d+)\s+(\w+)(\s+([\w\s]+ \w+$))?
Before:
After:
Of course, \w includes underscores, so replace \w with [a-z0-9] if necessary.
Update
This pattern is more specific and should be more reliable:
(\w+)\n(\d+)\n(\w+)(\n([^\n]*[^\w\n][^\n]*))?
Before:
After:
this is the current expression i'm using, but when i input a value it only accepts 1 letter
when i input more than 1 i get an invalid input.
"regex":"/^(?:[a-zA-Z\ \']{30}|)$/",
what type of expression might be suitable for the input that i'm looking for?
ex:
John Franklin
or
(blank input)
The following should work for you:
/^([a-zA-Z ']*)$/
I've tested it and it appears to fit your needs.
For clarity, * means 'match 0 or any number of characters', which from my testing, satisfies your 'blank' requirement.
Letters, space, apostrophe, or blank:
/^[A-Za-z ']*$/
If you are capping it at 30 characters, then replace * with {0,30}. In some regex flavours you can omit the 0 and use {,30}.
x{30} means "exactly 30 instances of x". It sounds like you actually want up to 30 characters:
/^[a-zA-Z\ \']{0,30}$/
(using the {m,n} notation, meaning "between m and n instances").