Tokenizing a string with a regular expression


Suppose I have a string like this: abc def ghi jkl (I put a space at the end for the sake of simplicity but it doesn't really matter for me) and I want to capture its "chunks" as follows:
abc
def
ghi
jkl
if and only if there are 1-4 "chunks" in the string. I have already tried the following regex:
^([^ ]+ ){1,4}$
at Regex101.com but it only captures the last occurrence. A warning about it is issued:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
How to correct the regular expression to achieve my goal?

Since you have no access to the code, the only solution you might use is a regex based on the \G operator that will only allow consecutive matches and a lookahead anchored at the start that will require 1 to 4 non-whitespace chunks in the string.
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+
See the regex demo
Details:
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks if:
^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the string start position (^) that is followed with 1 to 4 non-whitespace chunks, separated with 1+ whitespaces, and trailing/leading whitespaces are allowed, too
| - or
\G(?!^) - the current position at the end of the previous successful match (\G also matches the start of a string, thus we have to use the negative lookahead to exclude that matching position, since there is a separate check performed)
\s* - zero or more whitespaces
\K - a match reset operator discarding all the text matched so far
\S+ - 1 or more characters other than whitespace
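In an engine without \G and \K (Python's built-in re module is one), the same result can be reached in two steps. The sketch below is an alternative, not the single-pattern solution above: it reuses the lookahead's shape check as a full match and then collects the chunks with a second pattern. The function name tokenize is made up for illustration.

```python
import re

# Same shape check as the lookahead in the answer:
# 1 to 4 whitespace-separated chunks, optional leading/trailing whitespace.
SHAPE = re.compile(r"\s*\S+(?:\s+\S+){0,3}\s*")


def tokenize(text):
    """Return the chunks if the string holds 1 to 4 of them, else an empty list."""
    if SHAPE.fullmatch(text) is None:
        return []
    return re.findall(r"\S+", text)


print(tokenize("abc def ghi jkl "))
print(tokenize("one two three four five"))
```

The first call prints the four chunks; the second prints an empty list because the string holds five chunks.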

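It can be done on Linux using tr:
tr -sc 'a-zA-Z' '\n' < text.txt > out_text.txt
where text.txt is the file that contains the string to be tokenized. The -c option complements the set a-zA-Z and -s squeezes repeats, so every run of non-letter characters becomes a single newline.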
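For comparison, here is a rough Python equivalent of that tr command, assuming (as the tr set does) that only ASCII letters belong to a token. The helper name split_on_non_letters is hypothetical.

```python
import re


def split_on_non_letters(text):
    # Replace each run of non-letter characters with one newline,
    # mirroring tr -c (complement the set) combined with -s (squeeze repeats).
    return re.sub(r"[^a-zA-Z]+", "\n", text)


print(split_on_non_letters("abc def ghi jkl "))
```

Like the tr command, this does not enforce the 1-4 chunk limit from the question; it only splits.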

Related

Testing a single sentence with an optional period

I'm trying to write a regex that tests a single sentence. The sentence can contain any content and should either: end in a period and have nothing following that period or not have a period or any ending punctuation.
I started with this: .*?\.$ and it worked fine testing for a sentence ending in a period. But if I mark the period as optional .*?\.?$ then a sentence can have any ending including a period and text after that period.
To be clear, these should pass the test: He jumped over the fence. He jumped over the fence
And this should not pass the test: He jumped over the fence. She jumped over it too.

Try:
^(?:[^.]+\.|[^.]+)$
Regex demo.
^ - start of the string
(?:[^.]+\.|[^.]+) - match either [^.]+\. (one or more non-. characters and .) or [^.]+ (one or more non-. characters) in non-capturing group.
$ - end of the string
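A quick way to check this pattern against the three sentences from the question is sketched below in Python (the language is an assumption; the question does not name one). fullmatch() stands in for the ^ and $ anchors, and the function name is_single_sentence is illustrative.

```python
import re

# The answer's alternation without ^ and $; fullmatch() anchors both ends.
SENTENCE = re.compile(r"(?:[^.]+\.|[^.]+)")


def is_single_sentence(text):
    return SENTENCE.fullmatch(text) is not None


for candidate in (
    "He jumped over the fence.",
    "He jumped over the fence",
    "He jumped over the fence. She jumped over it too.",
):
    print(candidate, "->", is_single_sentence(candidate))
```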

This pattern .*?\.$ can match the whole line He jumped over the fence. She jumped over it too. because the . can also match a literal dot.
If you do not want to cross newlines, but do want to allow a dot inside the sentence (for example 1.2m) when the sentence ends with a dot, or otherwise want to match only characters other than ending punctuation, you can use the following pattern, provided lookahead assertions are supported:
^(?:[^\.\n]*(?:\.(?![^\S\n])[^\.\n]*)*\.|[^!?.\n]+)$
Explanation
^ Start of string
(?: Non capture group
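[^\.\n]* Match optional chars other than a dot or a newline
(?:\.(?![^\S\n])[^\.\n]*)* Optionally repeat matching a dot that is not directly followed by a whitespace char (other than a newline), then optional chars other than a dot or a newline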
\. Match a dot
| Or
[^!?.\n]+ Match 1+ times any char except for ! ? . or a newline (Or add more ending punctuation chars)
) Close the non capture group
$ End of string
See a regex101 demo
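The pattern above can be tried out with a short Python sketch. The test sentences containing 1.2m and a question mark are made up for illustration; the helper name accepts is hypothetical.

```python
import re

PATTERN = re.compile(r"^(?:[^\.\n]*(?:\.(?![^\S\n])[^\.\n]*)*\.|[^!?.\n]+)$")


def accepts(line):
    return PATTERN.search(line) is not None


print(accepts("The fence is 1.2m high."))   # inner dot is not followed by a space
print(accepts("He jumped over the fence"))  # no ending punctuation at all
print(accepts("He jumped over the fence. She jumped over it too."))
print(accepts("Did he jump?"))              # other ending punctuation
```

The first two calls should report a match and the last two should not, since a dot followed by a space can only be consumed as the final dot.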

You can use a regex like this:
.*?[^.]$
Optional (?) means that the regex matches whether or not the symbol is present in the string.
[^.]$ - requires the last character to be something other than a dot, that is, it excludes a dot at the end of the sentence.

Regex to validate subtract equations like "abc-b=ac"

I've stumbled upon a regex question.
How to validate a subtract equation like this?
A string subtract another string equals to whatever remains(all the terms are just plain strings, not sets. So ab and ba are different strings).
Pass
abc-b=ac
abcde-cd=abe
ab-a=b
abcde-a=bcde
abcde-cde=ab
Fail
abc-a=c
abcde-bd=ace
abc-cd=ab
abcde-a=cde
abc-abc=
abc-=abc
Here's what I tried and you may play around with it
https://regex101.com/r/lTWUCY/1/

Disclaimer: I see that some of the comments were deleted. So let me start by saying that, though short (in terms of code-golf), the following answer is not the most efficient in terms of steps involved. Though, looking at the nature of the question and its "puzzle" aspect, it will probably do fine. For a more efficient answer, I'd like to redirect you to this answer.
Here is my attempt:
^(.*)(.+)(.*)-\2=(?=.)\1\3$
See the online demo
^ - Start line anchor.
(.*) - A 1st capture group with 0+ non-newline characters right up to;
(.+) - A 2nd capture group with 1+ non-newline characters right up to;
(.*) - A 3rd capture group with 0+ non-newline characters right up to;
-\2= - A hyphen followed by a backreference to our 2nd capture group and a literal "=".
(?=.) - A positive lookahead to assert position is followed by at least a single character other than newline.
\1\3 - A backreference to what was captured in both the 1st and 3rd capture group.
$ - End line anchor.
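The pattern uses only backreferences and a lookahead, so it can be checked against the Pass and Fail lists from the question in most engines. Below is a minimal sketch in Python (an assumed choice of language); is_valid is an illustrative name.

```python
import re

EQUATION = re.compile(r"^(.*)(.+)(.*)-\2=(?=.)\1\3$")

should_pass = ["abc-b=ac", "abcde-cd=abe", "ab-a=b", "abcde-a=bcde", "abcde-cde=ab"]
should_fail = ["abc-a=c", "abcde-bd=ace", "abc-cd=ab", "abcde-a=cde", "abc-abc=", "abc-=abc"]


def is_valid(equation):
    return EQUATION.search(equation) is not None


for equation in should_pass + should_fail:
    print(equation, "->", "pass" if is_valid(equation) else "fail")
```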
EDIT:
I guess a bit more restrictive could be:
^([a-z]*)([a-z]+)((?1))-\2=(?=.)\1\3$

You may use this more efficient regex with a lookahead at the start with a capture group that matches text on the right hand side of - i.e. substring between - and = and captures it in group #1. Then in the main body of regex we just check presence of capture group #1 and capture text before and after \1 in 2 separate groups.
^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$
RegEx Demo
RegEx Details:
^: Start
(?=[^-]+-([^=]+)=.): Lookahead to make sure we have expression structure of pqr-pq=r and also more importantly capture substring between - and = in capture group #1. . after = is there for a reason to disallow any empty string after =.
([^-]*?): Match 0 or more non-- characters in capture group #2
\1: Back-reference to group #1 to make sure we match same value as in capture group #1
([^-]*): Match 0 or more non-- characters in capture group #3
-: Match a -
[^=]+: Match 1 or more non-= characters
=: Match a =
\2\3: Back-reference to group #2 and #3, which together form the result of the subtraction
$: End
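To see what each group holds, the sketch below (Python, chosen as an assumption) prints the three captures for one equation. Group #1 is set inside the lookahead and stays available to the rest of the pattern. The helper name parts is hypothetical.

```python
import re

EQUATION = re.compile(r"^(?=[^-]+-([^=]+)=.)([^-]*?)\1([^-]*)-[^=]+=\2\3$")


def parts(equation):
    """Return (subtrahend, text before it, text after it), or None if invalid."""
    match = EQUATION.search(equation)
    return match.groups() if match else None


print(parts("abcde-cd=abe"))
print(parts("abc-a=c"))
```

For abcde-cd=abe the groups are cd, ab and e; the second call returns None because bc, not c, remains after removing a.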

How to capture second match from the given text using regex

I am trying to capture the second match from the given text, i.e.,
hash=e1467eb30743fb0a180ed141a26c58f7&token=a62ef9cf-2b4e-4a99-9335-267b6224b991:IO:OPCA:117804471:OPI:false:en:opsdr:117804471&providerId=paytm
In the above text, I want to capture the second number with the length of 9 (117804471).
I tried the following, but it didn't work, so please help me resolve this.
https://regex101.com/r/vBJceR/1

You can use
^(?:.*?\K\b[0-9]{9}\b){2}
See the regex demo.
Details:
^ - start of string
(?: - start of a non-capturing group:
.*? - any zero or more chars other than line break chars (as few as possible) followed with
\K - match reset operator discarding text matched so far
\b[0-9]{9}\b - a 9-digit number as a whole word
){2} - two occurrences of the pattern sequence defined above.
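\K is not available everywhere (Python's re module, for one, lacks it). A possible variant, sketched here under that assumption, wraps the number in a capture group instead: a repeated group keeps the text of its last iteration, which is the second number. The function name and the shorter sample string with three distinct numbers are made up for illustration.

```python
import re

# Capture group instead of \K: after two iterations the group holds
# the text matched in the second iteration.
SECOND_NUMBER = re.compile(r"^(?:.*?\b([0-9]{9})\b){2}")


def second_nine_digit_number(text):
    match = SECOND_NUMBER.search(text)
    return match.group(1) if match else None


print(second_nine_digit_number("id:111111111:x:222222222:y:333333333"))
print(second_nine_digit_number("only:123456789"))
```

The first call returns 222222222; the second returns None because the string contains only one 9-digit number.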

How to capture everything until another capture group

I have the following template :
1251 Left Random Text I want to fill
It can go through multiple lines
As you can see
9841 Right Again we see a lot of random text with 3115 numbers
And this also goes
To multiple lines
0121 Right
5151 Right This one is just one line
I was wrong
9731 Left This one is just a line
5123 NA Instruction 5151 was wrong
4113 Right Instr 9841 was correct
We checked
I want to have 3 groups:
1251
Left
Random Text I want to fill
It can go through multiple lines
As you can see
I'm using
(\d+)\s(\w+)\s(.*)
but it stops at the end of the current line (so I get only Random Text I want to fill in group 3, although I want it to include everything up to As you can see).
If I use the single-line flag, I get only one match, with group 3 containing almost everything.
Here is live : https://regex101.com/r/W3x0mH/4

You could use a repeating group that matches all the following lines while asserting that the next line does not start with a digit:
(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)
Explanation
(\d+)\s(\w+)\s Match the first 2 groups
( Third capturing group
.* Match 0+ times any char except a newline
(?: Non capturing group
\r?\n(?!\d).* Match newline, assert what is on the right is not a digit
)* Close non capturing group and repeat 0+ times
) Close capturing group
Regex demo
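As a rough check, the pattern can be run over the first two records of the template. The sketch uses Python, which is an assumption since the question only links to regex101; the variable names are illustrative.

```python
import re

RECORD = re.compile(r"(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)")

# First two records from the question's template (no trailing newline).
sample = (
    "1251 Left Random Text I want to fill\n"
    "It can go through multiple lines\n"
    "As you can see\n"
    "9841 Right Again we see a lot of random text with 3115 numbers\n"
    "And this also goes\n"
    "To multiple lines"
)

records = [match.groups() for match in RECORD.finditer(sample)]
for number, side, text in records:
    print(number, side, repr(text))
```

Each record's third group spans all continuation lines. The number 3115 inside the text does not start a new record because it is consumed as part of the second match.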

You may use this regex with a lookahead:
^(\d+)\s(\w+)\s(.*?)(?=\n\d|\z)
with DOTALL and MULTILINE modifiers.
Updated Regex Demo
RegEx Details:
^: Line start
(\d+): Match and capture 1+ digits in group #1
\s: match a whitespace
(\w+): Match and capture 1+ word characters in group #2
\s: match a whitespace
(.*?): Match 0 or more of any character (non-greedy) provided next lookahead assertion is satisfied
(?=\n\d|\z): Lookahead assertion to assert that we have a newline followed by a digit or there is end of input
Faster Regex:
If you are using this regex on a long string then you should also keep overall performance in mind as a regex with DOTALL modifier will tend to get slow on a large size text. For that I suggest using this regex that doesn't need DOTALL modifier:
^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\z)
RegEx Demo 2
On regex101 demo this regex takes just 181 steps as compared to first one that takes 1300 steps.
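A Python port of the faster pattern might look like the sketch below. Two things are assumptions of the port rather than part of the answer: \z is written as \Z, the long-standing spelling of absolute end of input in Python's re module, and the MULTILINE flag is passed explicitly. The sample is the last three records of the template.

```python
import re

FAST = re.compile(r"^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\Z)", re.MULTILINE)

sample = (
    "9731 Left This one is just a line\n"
    "5123 NA Instruction 5151 was wrong\n"
    "4113 Right Instr 9841 was correct\n"
    "We checked"
)

records = [match.groups() for match in FAST.finditer(sample)]
for number, side, text in records:
    print(number, side, repr(text))
```

Because ^ only matches at line starts here, the numbers 5151 and 9841 inside the text are not treated as new records.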

For the third group, repeat any character while using negative lookahead for ^\d, which would indicate the start of a new match:
(\d+)\s(\w+)\s((?:(?!^\d)[\s\S])*)
https://regex101.com/r/W3x0mH/5

You may try with this regex:
^(\d+)\s+(\w+)\s+(.*?)(?=^\d|\z)
^(\d+)\s+ - the line begins with one or more digits (^\d+) followed by one or more whitespace characters (\s+)
(\w+)\s+ - one or more word characters (\w+, e.g. Left, Right, NA or something else) followed by one or more whitespace characters (\s+)
(.*?) - matches everything until it finds a line beginning with a digit or \z, the end of the string. This relies on the DOTALL and MULTILINE modifiers, so that . crosses newlines and ^ matches at line starts.
I think it fits your requirement.
Regex101

Regex: remove all except first character and last number

I know that ^. is first character and (\d+)(?!.*\d) is last number. I've tried using | between these and have been trying to find code for the second character, but with no success.
This is in R.
Take for example:
'ABCD some random words and spaces 1234' should output 'A4' when I do
sub([regex here], "", 'ABCD some random words and spaces 1234')

If you used ^.|(\d+)(?!.*\d), the pattern would only match the first char and remove it with sub, and would remove the first char and the last 1+ digits if used with gsub without backreferences in the replacement pattern. See this pattern demo.
You can use
sub("^(.).*(\\d).*$", "\\1\\2", "ABCD some random words and spaces 1234")
See the R demo and the regex demo.
This TRE regex pattern matches:
^ - start of string
(.) - Group 1 capturing any char
.* - 0+ any chars as many as possible up to the last...
(\\d) - Group 2 capturing a digit
.* - the rest of the string
$ - end of string.
The \\1\\2 replacement pattern re-inserts the values captured with Group 1 and Group 2 back to the result.
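For readers outside R, the same substitution can be sketched in Python. This is a port for comparison, not part of the original answer; count=1 mirrors the single replacement that R's sub performs, and the function name is illustrative.

```python
import re


def first_char_and_last_digit(text):
    # Group 1: first character; the greedy .* backtracks just enough
    # for group 2 to capture the last digit in the string.
    return re.sub(r"^(.).*(\d).*$", r"\1\2", text, count=1)


print(first_char_and_last_digit("ABCD some random words and spaces 1234"))
```

This prints A4, matching the output requested in the question.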
