How to substring in RegExp? - regex

Does anyone know how to extract a substring with a regular expression? I am currently profiling data and I see different formats, such as:
EB0000000
EB00000000PHL00000000F00000000
P0000000A
When I used my expression:
\b(?:[A-Z]{1}\d{7}[A-Z]{1}|[A-Z]{1}\d{7,8}|[A-Z]{2}\d{6}|[A-Z]{2}\d{7,8})\b
This captures the first and last samples. The second looks like malformed data, but I still want to capture EB and the 8 digits before PHL. Is this possible with a regex? TIA

Why is it so hard to write? Maybe there are some lines nearby that should not fall into the selection?
\b[A-Z\d]{8,}\b

It is possible, but you could change the order of the alternatives to put the most specific one at the beginning and then remove the word boundary at the end.
Note that you can omit {1}, since a token matches exactly once by default.
\b(?:[A-Z]{2}\d{7,8}|[A-Z]\d{7}[A-Z]|[A-Z]\d{7,8}|[A-Z]{2}\d{6})
In parts
\b Word boundary
(?: Non capture group
[A-Z]{2}\d{7,8} Match 2 times A-Z and 7-8 digits
| Or
[A-Z]\d{7}[A-Z] Match A-Z, 7 digits and A-Z
| Or
[A-Z]\d{7,8} Match A-Z and 7-8 digits
| Or
[A-Z]{2}\d{6} Match 2 times A-Z and 6 digits
) Close group
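As a quick check of the reordered pattern, here is a minimal Python sketch run against the three samples from the question. The names PATTERN and extract_ids are hypothetical and only serve the illustration.

```python
import re

# Pattern from the answer: most specific alternative first, no trailing \b.
PATTERN = re.compile(
    r"\b(?:[A-Z]{2}\d{7,8}|[A-Z]\d{7}[A-Z]|[A-Z]\d{7,8}|[A-Z]{2}\d{6})"
)

def extract_ids(text):
    """Return every substring of `text` that matches PATTERN."""
    return PATTERN.findall(text)

samples = ["EB0000000", "EB00000000PHL00000000F00000000", "P0000000A"]
for sample in samples:
    print(sample, "->", extract_ids(sample))
```

Because the trailing word boundary is gone, the second sample yields only its leading EB00000000. The rest of that string is never matched, since no word boundary occurs inside it.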

Related

Regex: Replace certain part of the matched characters

I want to be able to match with a certain condition, and keep certain parts of it. For example:
June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin
should turn into:
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
So, everything up to the end of the word Article should be matched; then only the first three characters of the month are kept, as well as the year, and this part replaces the matched characters.
My strong regex knowledge got me as far as:
.+Article(?)
You could use 2 capture groups and use them in the replacement (for example $1$2 or \1\2, depending on the engine):
\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b
\b A word boundary to prevent a partial word match
([A-Z][a-z]{2}) Capture group 1, match a single uppercase char and 2 lowercase chars (the 3-letter month abbreviation)
[a-z]* Match the rest of the month name, if any
(\s+\d{4})\b Capture group 2, match 1+ whitespace chars and 4 digits followed by a word boundary
.*?\bArticle\b Match as few chars as possible until the word Article
The replaced value will be
Jun 2021 Three-Suiters Via Puppets Kai-Ching Lin
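The two-capture-group replacement can be sketched in Python as follows. The pattern here pins group 1 to exactly three letters so that longer month names are shortened too. MONTH_YEAR_PREFIX and shorten_heading are hypothetical names, and Python writes the group references as \1\2.

```python
import re

# Group 1: three-letter month abbreviation; group 2: whitespace plus 4-digit year.
# Everything up to and including the word "Article" is consumed by the match.
MONTH_YEAR_PREFIX = re.compile(
    r"\b([A-Z][a-z]{2})[a-z]*(\s+\d{4})\b.*?\bArticle\b"
)

def shorten_heading(line):
    """Replace the matched prefix with the abbreviated month and the year."""
    return MONTH_YEAR_PREFIX.sub(r"\1\2", line, count=1)

print(shorten_heading(
    "June 2021 9 Feature Article Three-Suiters Via Puppets Kai-Ching Lin"
))
```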
You could use positive lookbehinds and replace each match with an empty string:
(?<=^[A-Z][a-z]{2})[a-z]*|(?<=\d{4}).*Article
(?<=^[A-Z][a-z]{2}) - behind me is the start of a line and 3 chars; presumably the first three chars of the month
[a-z]* - optionally, match the rest of the month
| - or
(?<=\d{4}) - behind me is 4 digits; presumably a year
.*Article - match everything leading up to and including "Article"
https://regex101.com/r/fbYdpH/1

In Scala, is it possible to insert commas via a regex to separate thousands in numbers?

In Scala, is it possible to actually insert commas via a regex to separate thousands in numbers where the comma definitely is not there to start with?
For example, I'd like to convert 30000.00 into 30,000.00.
I am not sure this is exactly what you need, but you can use this:
val formatter = java.text.NumberFormat.getNumberInstance(java.util.Locale.US)
formatter.setMinimumFractionDigits(2) // keep the two decimal places
println(formatter.format(30000.00)) // prints 30,000.00
This is not a Scala-specific answer.
You can use regex \d{1,3}(?=(?:\d{3})+\.) to find the matches and substitute each match with the same match plus an extra comma $0,.
Explanation:
\d{1,3} This matches a digit between 1 and 3 times
(?= Positive lookahead starts
(?: This indicates a Non-capturing group
\d{3} matches a digit exactly 3 times
) end of Non-capturing group.
+ matches the previous group one or more times
\. matches the character . literally
) Positive lookahead ends.
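The substitution itself does not depend on the language. As a rough check, here is a minimal sketch in Python rather than Scala, where the whole match is written \g<0> instead of $0. GROUP_START and add_thousands_separators are hypothetical names. Note that the lookahead requires a decimal point, so input without one is left unchanged.

```python
import re

# 1 to 3 digits that are followed by one or more groups of 3 digits and a dot.
GROUP_START = re.compile(r"\d{1,3}(?=(?:\d{3})+\.)")

def add_thousands_separators(number_text):
    """Append a comma to every match, i.e. before each 3-digit group."""
    return GROUP_START.sub(r"\g<0>,", number_text)

print(add_thousands_separators("30000.00"))
print(add_thousands_separators("1234567.89"))
```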

how to match a list of fixed length words separated by space or comma?

Each word's length must be 2 or 6-10, and words can be separated by a space or a comma. Words contain only letters and are not case sensitive.
Here is a group of words that should be matched:
RE,re,rereRE
Not matching groups:
RE,rere,rel
RE,RERE
Here is the pattern that I have tried
((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|\s+)?)
But unfortunately this pattern can match string like this: RE,RERE
It looks like the word boundary has not been set.
You could match chars a-z either 2 or 6-10 times using an alternation,
then repeat that pattern 0+ times, each time preceded by a comma or a space [ ,].
^(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*$
Explanation
^ Start of string
(?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match chars a-z 6 -10 or 2 times
(?: Non capturing group
[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match comma or space and repeat previous pattern
)* Close non capturing group and repeat 0+ times
$ End of string
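To see the anchored pattern at work on the examples from the question, here is a minimal Python sketch. WORD, WORD_LIST and is_valid_word_list are hypothetical names, and the pattern is assembled from the same two pieces described above.

```python
import re

# One word: 6-10 letters or exactly 2 letters.
WORD = r"(?:[A-Za-z]{6,10}|[A-Za-z]{2})"
# A word, then zero or more (separator + word) repetitions, anchored at both ends.
WORD_LIST = re.compile(rf"^{WORD}(?:[, ]{WORD})*$")

def is_valid_word_list(text):
    """True when the whole string is a list of 2- or 6-10-letter words."""
    return WORD_LIST.match(text) is not None

for candidate in ["RE,re,rereRE", "RE,rere,rel", "RE,RERE"]:
    print(candidate, "->", is_valid_word_list(candidate))
```

Since every word must be followed by a separator or the end of the string, a 4-letter word such as RERE cannot be split into two 2-letter matches.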
If lookarounds are supported, you might also assert what is directly on the left and on the right is not a non whitespace character \S.
(?<!\S)(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[ ,](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*(?!\S)
([a-zA-Z]{2}(,|\s)|[a-zA-Z]{6,10}|(,|\s))
This one will match only the words that have 2 letters, or between 6 and 10 letters.
\b,?([a-zA-Z]{6,10}|[a-zA-Z]{2}),?\b
You can use this (with the case-insensitive flag, since the pattern only lists a-z):
^(?!.*\b[a-z]{4}\b)(?:(?:[a-z]{2}|[a-z]{6,10})(?:,|[ ]+)?)+$
This regex will match your first case, but neither of your two other cases:
^((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|[ ]+|$))+$
I'm making the assumption here that each line should be a single match.

check if a string starts with number using regular expression

I am writing a Filebeat configuration in which I need to check whether a line starts with a number, such as the timestamp 03:32:33. I am currently doing it with:
\d
But it's not being recognised. Is there anything else I should do? I don't have much experience with regex. Help will be appreciated.
The real problem is that filebeat does not support \d.
Replace \d by [0-9] and your regular expression will work.
I suggest you take a look at Filebeat's Supported Patterns.
Also, be sure you've used ^, it stands for the start of the string.
Regex: (^\d)
1st Capturing group (^\d)
^ Match at the start of the string
\d match a digit [0-9]
You can use this regex:
^([0-9]{2}:?){3}
Assert position at the beginning of the string «^»
Match the regex below and capture its match into backreference number 1 «([0-9]{2}:?){3}»
Exactly 3 times «{3}»
You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «{3}»
Or, if you don’t want to capture anything, replace the capturing group with a non-capturing group to make your regex more efficient.
Match a single character in the range between “0” and “9” «[0-9]{2}»
Exactly 2 times «{2}»
Match the character “:” literally «:?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
You can use:
^\d{2}:\d{2}:\d{2}
The character ^ matches the start of a line.
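As a rough illustration of the anchored timestamp check, here is a minimal sketch in Python (not Filebeat, whose regex engine may behave differently). It uses [0-9] instead of \d, as suggested above. STARTS_WITH_TIME and starts_with_timestamp are hypothetical names.

```python
import re

# ^ anchors at the start; [0-9] is used instead of \d for portability.
STARTS_WITH_TIME = re.compile(r"^[0-9]{2}:[0-9]{2}:[0-9]{2}")

def starts_with_timestamp(line):
    """True when the line begins with an HH:MM:SS-shaped prefix."""
    return STARTS_WITH_TIME.match(line) is not None

for line in ["03:32:33 INFO started", "    at com.example.Foo", "INFO 03:32:33"]:
    print(repr(line), "->", starts_with_timestamp(line))
```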

regex how to match multiple patterns

Which pattern should I use if I want to match a first pattern but exclude a second one?
For example, I want to match the string 'id' followed by a digit, as long as that digit is not 6 or 9.
So it should match id1, id2, id3, etc., but not id6 and id9.
I tried this pattern and it's not working:
"id(\d|(?!6|9))"
You can use negative lookahead like this.
Regex: \bid(?![69])\d\b
Explanation:
\b ensures the word boundary.
(?![69]) negative lookahead makes sure that number is not 6 or 9.
\d matches a single digit after id.
It's not the best solution, but you can also do this using a positive lookahead:
\bid(?=\d)(?:\d\d+|[^69])\b
Regex Breakdown
\b #word boundary
id #Match id literally
(?=\d) #Find if the next position contains digit (otherwise fails)
(?: #Non capturing group
\d\d+ #If there is more than one digit, the match succeeds
| #OR (alternation)
[^69] #If it's a single digit, don't match 6 or 9
) #End of non capturing group
\b #word boundary
If you want to check id is not followed by 6 or 9 and you want to accept cases like id16 but not id61, then you can use
\bid(?=\d)[^69]\d*\b
The id(\d|(?!6|9)) pattern matches id followed by any single digit, or id at a position where the next character is not 6 or 9. The alternation (\d or (?!6|9)) still allows id6 and id9 because the first alternative "wins" in an NFA regex engine: once \d has matched the 6 or 9, the remaining alternatives are not tried.
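The behaviour of the original attempt can be reproduced with a short Python sketch. ORIGINAL and original_match are hypothetical names, and Python's re module is assumed to be representative of a backtracking engine here.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"id(\d|(?!6|9))")

def original_match(text):
    """Return the text matched by the original pattern, or None."""
    found = ORIGINAL.search(text)
    return found.group(0) if found else None

print(original_match("id6"))  # \d consumes the 6, so the lookahead is never consulted
print(original_match("idx"))  # \d fails, the lookahead alternative matches the empty string
```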
If you only need to exclude id6 and id9 themselves (so that id61 still matches), use
\bid(?![69]\b)\d+\b
If you want to reject every id whose number starts with 6 or 9, use
\bid(?![69])\d+
Here, \d+ matches one or more digits, \b stands for a word boundary (the digits should be preceded and followed with non-"word" characters), and the (?![69]) lookahead fails the match if there is 6 or 9 after id (with or without a word boundary check - depending on what you need).
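The lookahead variants in this thread differ in what they reject. The following Python sketch runs three of them over one sample string. The dictionary keys and the sample text are made up for the illustration.

```python
import re

VARIANTS = {
    # id followed by exactly one digit that is not 6 or 9
    "single_digit": re.compile(r"\bid(?![69])\d\b"),
    # any number except exactly 6 or exactly 9
    "not_exactly_6_or_9": re.compile(r"\bid(?![69]\b)\d+\b"),
    # any number that does not start with 6 or 9
    "not_starting_with_6_or_9": re.compile(r"\bid(?![69])\d+"),
}

SAMPLE = "id1 id6 id9 id16 id61 id90"

def matches_for(name):
    """Return all matches of the named variant in SAMPLE."""
    return VARIANTS[name].findall(SAMPLE)

for name in VARIANTS:
    print(name, "->", matches_for(name))
```

Which variant fits depends on whether multi-digit ids such as id61 should be accepted.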
UPDATE
If you need to exclude the ids whose number starts with 6 or 9, you can also use a character class instead of a lookahead:
\bid[0-578]\d*
Based on Shafizadeh's comment.