Find the first letter and sign of a sentence with Regex.
A sentence sometimes begins with letters and sometimes with numbers:
15. Lorem ipsum is placeholder text
B. Lorem ipsum is placeholder text
C.Lorem ipsum is placeholder text
D . Lorem ipsum is placeholder text
E,Lorem ipsum is placeholder text
I wrote something like this:
[\dga-zA-Z.]{1\s}
Demo with regex101
But it doesn't work for every sentence. In particular, it fails when there is a space between the first letter/digit and the sign.
Where am I making a mistake?
Also, in terms of performance, does it make more sense to use regex or plain PHP for such scenarios?
Hello, this matches all of your provided examples:
([A-Za-z\d ]+)(\.|,)
What this does is the following:
It matches lowercase letters, uppercase letters, digits, or spaces. It should find one or more of those (the + sign).
The match should end with a dot or a comma: (\.|,). Note: in regex, the dot must be escaped to match a literal dot.
If that doesn't do the trick, comment below
Edit: demo here: click
The following regex matches a single letter or one or more digits at the beginning of a sentence, followed by either a single period or a comma:
^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$
This is the breakdown:
^ # Asserts position at start of the line
[a-zA-Z]{1}|[0-9]+ # Match a single alphabetic character or one or more digits
\s* # Matches whitespace characters between 0 and unlimited times
[.,]{1} # Matches a single period or comma character literal
.* # Matches the rest of the text
$ # Asserts position at end of the line
Group 1 - returns the letter/digits together with the period/comma (including any spaces between them), in case you need both.
Group 2 - returns only the letter or digits at the start of the sentence, which I assume is what you'll actually want most of the time.
Group 3 - returns the rest of the text.
The regex may need to be modified depending on what you want, for example if you don't want a match when there are spaces after the letter/digits, or if you want to allow more separator characters. Let me know if you have any additional constraints you'd like this regex to conform to.
See the DEMO
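As a rough illustration of how the three groups come out, here is a small Python sketch that runs the pattern from this answer over the sample lines from the question. The helper name split_marker is made up for the example, and Python's re module stands in for whatever engine you actually use.

```python
import re

# Pattern copied from the answer; the sample lines come from the question.
PATTERN = re.compile(r"^(([a-zA-Z]{1}|[0-9]+)\s*[.,]{1})(.*)$")

samples = [
    "15. Lorem ipsum is placeholder text",
    "B. Lorem ipsum is placeholder text",
    "C.Lorem ipsum is placeholder text",
    "D . Lorem ipsum is placeholder text",
    "E,Lorem ipsum is placeholder text",
]

def split_marker(line):
    """Return (marker_with_sign, marker, rest), or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None

for line in samples:
    print(split_marker(line))
```

A line that does not start with a single letter or a run of digits followed by a sign, such as "Lorem ipsum", yields None.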
Use: ^[\da-zA-Z]+\h*[.,]
Demo
Explanation:
^ # beginning of line
[\da-zA-Z]+ # 1 or more letter or digit
\h* # 0 or more horizontal spaces
[.,] # a dot or a comma
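If you want to try this pattern outside PCRE, note that Python's re module does not support \h. The sketch below assumes that [ \t] is a good enough substitute for "horizontal space" (it ignores other Unicode spaces); leading_marker is a hypothetical helper name.

```python
import re

# [ \t] replaces \h, which Python's re module does not support (assumption of this sketch).
MARKER = re.compile(r"^[\da-zA-Z]+[ \t]*[.,]")

def leading_marker(line):
    """Return the leading letters/digits plus the sign, or None if there is none."""
    m = MARKER.match(line)
    return m.group(0) if m else None

for line in ["15. Lorem ipsum", "D . Lorem ipsum", "E,Lorem ipsum", "Lorem ipsum"]:
    print(repr(line), "->", leading_marker(line))
```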
Related
So I have a text file in VS Code that contains several lines of text like so:
1801: Joseph Marie Jacquard, a French merchant and inventor invent a loom that uses punched wooden cards to automatically weave fabric designs. Early computers would use similar punch cards.
Now I'm trying to isolate the year, i.e. the first 4 characters of each line. I'm new to regex, and I know how to match the first 4 characters (I used ^.{4}), but how can I match everything EXCEPT the first 4 characters, so that I can replace it with nothing and be left with just the years?
Find: (?<=^\d{4}).*
Replace: with nothing
regex101 Demo
(?<=^\d{4}) is a positive lookbehind, (?<=...), that succeeds only right after four digits at the start of a line (^)
.* matches everything else up to the line terminator, so the : is included in the match
Because a lookbehind is not part of the match, the four digits are never consumed, so you don't have to worry about capture groups or a replacement string.
You can use:
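The same lookbehind idea can be tried outside the editor. This is a minimal Python sketch, assuming re.MULTILINE to mimic the editor's per-line ^; the first sample line is shortened from the question and the second line is made up to show that lines without a leading year are left alone.

```python
import re

# First line shortened from the question; second line is invented for the demo.
text = (
    "1801: Joseph Marie Jacquard\n"
    "no year at the start of this line"
)

# Delete everything that follows four digits at the start of a line.
years_only = re.sub(r"(?<=^\d{4}).*", "", text, flags=re.MULTILINE)
print(years_only)
```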
Find: ^(.{4}).+
Replace: $1
See the regex demo. Details:
^ - start of a line (in Visual Studio Code, ^ matches any line start)
(.{4}) - capturing group #1 that captures any four chars other than line break chars
.+ - one or more chars other than line break chars, as many as possible.
The $1 backreference in the replacement pattern replaces the match with Group 1 value.
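For comparison, the capture-group variant looks like this in Python, where the replacement backreference is written \1 instead of $1. This is only a sketch; the second input line is made up to show that the pattern keeps any four characters, not just digits.

```python
import re

# First line shortened from the question; second line is invented for the demo.
text = (
    "1801: Joseph Marie Jacquard\n"
    "abcd and the rest of a made-up line"
)

# Keep group 1 (the first four characters) and drop the rest of each line.
first_four = re.sub(r"^(.{4}).+", r"\1", text, flags=re.MULTILINE)
print(first_four)
```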
I have a list of thousands of records within a .txt document.
Some of them look like these records:
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record like this occurs, I'd like to get rid of the text in the last pair of quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex, but it didn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot), but even for records without a dot it does not work. I'm not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
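To check the logic outside Notepad++, here is a small Python sketch. Python's re module supports neither \K nor \h, so this version assumes a capture group in place of \K and [ \t] in place of \h; drop_last_field is a hypothetical helper name, and the records are the ones from the question.

```python
import re

records = [
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
]

# Group 1 keeps what \K would "forget"; [ \t]+ stands in for \h+.
LAST_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$')

def drop_last_field(line):
    """Remove the last quoted field (and the spaces before it) from a line."""
    return LAST_QUOTED.sub(r"\1", line)

for record in records:
    print(drop_last_field(record))
```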
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as a CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier on the last field in your pattern, so the regex tries to match exactly one letter or digit surrounded by quotes: "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more, using + as the quantifier: "[a-zA-Z0-9]+". If you can have zero or more, use *; for any other count from m to n, use {m,n} as before.
This is not a character count problem, but the third column can also contain dots, which the regex doesn't account for. You can add . inside the square brackets, where it matches only a literal dot. Outside a character class . is a wildcard, but it loses its special meaning inside one ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a literal space followed by +. As a benefit, \s also matches other whitespace such as tabs, so you don't have to account for those manually. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
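A quick way to see the effect of the corrected counts is to run both patterns against one of the records. The sketch below uses Python's re module rather than Notepad++, which is an assumption about tooling only; the patterns are copied from the question and from this answer.

```python
import re

# Pattern from the question (wrong digit counts, no quantifier on the last class).
ORIGINAL = re.compile(r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$')
# Shortened, corrected pattern from this answer.
FIXED = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$')

record = '201910031044 "00059" "11.31AG" "Senior Champion"'

original_matches = ORIGINAL.match(record) is not None
trimmed = FIXED.sub(r"\1", record)

print(original_matches)
print(trimmed)
```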
If you want to get rid of the last quoted value, you could capture everything before it in a group, then match the last pair of quotation marks and everything between them using a negated character class, and replace the match with the group.
If the values can be separated by spaces or tabs, you could use [ \t]+ to match those (\s could also match a newline).
Note that {17,17} and {8,8} can be written as {17} and {8}, which in this case should be {12} and {5}:
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match 1+ chars other than " or a newline, between "
In the replacement, use group 1: $1
Regex demo
In the app I use, I cannot select capture group 1 of a match.
The only result I can use is the full match of a regex,
but I need the 5th word, "jumps", as the match result and not the complete match "The quick brown fox jumps".
^(?:[^ ]*\ ){4}([^ ]*)
The quick brown fox jumps over the lazy dog
Here is a link https://regex101.com/r/nB9yD9/6
Since you need the entire match to be only the n-th word, you can try a positive lookbehind, which matches something only if it is preceded by something else.
To match only the fifth word, you want to match the first word that has four words before it.
To match four words (i.e. word characters followed by a space character):
(\w+\s){4}
To match a single word, but only if it was preceded by four other words:
(?<=(\w+\s){4})(\w+)
Test the result here https://regex101.com/r/QIPEkm/1
To find the 3rd word of a sentence, use:
^(?:\w+ ){2}\K\w+
Explanation:
^ # beginning of line
(?: # start non capture group
\w+ # 1 or more word character
# a space
){2} # group must appear twice (change {2} to {3} to get the 4th word and so on)
\K # forget all we have seen until this position
\w+ # 1 or more word character
Demo
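In an engine without \K, the same "skip n-1 words, then take one" logic needs a capture group instead. The sketch below shows that in Python, whose re module lacks \K, so it does not give a group-free full match like the PCRE pattern does; nth_word is a hypothetical helper name.

```python
import re

def nth_word(sentence, n):
    # Skip n-1 words (each followed by a space), then capture the next word.
    # Group 1 plays the role of the \K reset, which Python's re module lacks.
    m = re.match(r"(?:\w+ ){%d}(\w+)" % (n - 1), sentence)
    return m.group(1) if m else None

sentence = "The quick brown fox jumps over the lazy dog"
print(nth_word(sentence, 3))
print(nth_word(sentence, 5))
```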
It works https://regex101.com/r/pR22LK/2 with PCRE. Your app doesn't seem to support it, but I don't know how it works. I think you have to extract all the words in an array then select the ones you want. – Toto 23 hours ago
Hello Toto, your solution works in the app too, like PCRE, thanks! – gsxr1300
To match "the first" four words (i.e. word characters followed by a space character):
^(\w+\s){4}
To match a single word, but only if it was preceded by "the first" four other words:
(?<=^(\w+\s){4})(\w+)
Note the added ^.
If you want to know what "?<=" means, check this:
https://stackoverflow.com/a/2973495/11280142
I'm trying to write a regexp that will find the letters "AD" followed by 4 digits. In front of AD there should be a blank space.
Example: AD1239
My code: \bBC[0-9]{4}
The next part I don't know how to do: if a hyphen followed by more characters is attached, I want those included up to the next space.
Example: asdf AD3213-4332 asd
The above should output AD3213-4332
Any help is appreciated, Thanks
You can use this regex:
\bAD[0-9]{4}(?:-\S+)?
Here (?:-\S+)? is a non capturing group that will match an optional group that is a hyphen followed by 1+ non-space characters.
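A short Python sketch of this pattern against the examples from the question; find_codes is a made-up helper name. Because the optional group is non-capturing, findall returns the full matches.

```python
import re

CODE = re.compile(r"\bAD[0-9]{4}(?:-\S+)?")

def find_codes(text):
    """Return every AD code in text, including an attached hyphenated suffix."""
    return CODE.findall(text)

print(find_codes("asdf AD3213-4332 asd"))
print(find_codes("AD1239"))
```

Note that \b only requires a word boundary before AD, so a code at the very start of the string also matches, while something like xAD1234 does not.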
I need an expression that will only accept:
numbers
normal letters (no special characters)
-
Spaces are not allowed either.
Example:
The regular expression should match:
this-is-quite-alright
It should not match
this -is/not,soålright
You can use:
^[A-Za-z0-9-]*$
This matches strings, possibly empty, that are wholly composed of uppercase/lowercase letters (ASCII A-Z), digits (ASCII 0-9), and dashes.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
yep---------this-is-also-okay
And rejects:
this -is/not,soålright
hello world
Explanation:
^ and $ are beginning and end of string anchors respectively
If you're looking for matches within a string, then you don't need the anchors
[...] is a character class
a-z, A-Z, 0-9 in a character class define ranges
- as a last character in a class is a literal dash
* is zero-or-more repetition
regular-expressions.info
Anchors, Character Class, Repetition
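The accepted and rejected strings listed above can be checked with a few lines of Python. This sketch uses re.fullmatch in place of the ^ and $ anchors, which is a substitution on my part (in Python, $ also matches just before a trailing newline); is_allowed is a hypothetical helper name.

```python
import re

# fullmatch takes over the job of the ^ and $ anchors in the original pattern.
ALLOWED = re.compile(r"[A-Za-z0-9-]*")

def is_allowed(s):
    """True if s consists only of ASCII letters, digits, and dashes (may be empty)."""
    return ALLOWED.fullmatch(s) is not None

for s in ["this-is-quite-alright", "yep---------this-is-also-okay",
          "this -is/not,soålright", "hello world"]:
    print(repr(s), is_allowed(s))
```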
Variation
The specification was not clear, but if - is only to be used to separate "words", i.e. no double dash, no trailing dash, no preceding dash, then the pattern is more complex (only slightly!)
      _"alpha"_    separating dash
     /         \  /
^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$
 \__________/| \__________/|\
    "word"   |    "word"   | zero-or-more
             \_____________/
              group together
This matches strings that contain at least one "word", where a word consists of one or more "alpha" characters, and an "alpha" is a letter or a digit. More "words" can follow, and they're always separated by a single dash.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
And rejects:
--no-way
no-way--
no--way
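The stricter variation can be exercised the same way. In this Python sketch, re.fullmatch again replaces the ^ and $ anchors (my substitution), and is_slug is a made-up helper name; the test strings are the ones listed above.

```python
import re

# One or more "words" of letters/digits, separated by single dashes.
SLUG = re.compile(r"[A-Za-z0-9]+(-[A-Za-z0-9]+)*")

def is_slug(s):
    """True if s is dash-separated words with no leading, trailing, or double dash."""
    return SLUG.fullmatch(s) is not None

for s in ["this-is-quite-alright", "and-a-1-and-a-2-and-3-4-5",
          "--no-way", "no-way--", "no--way"]:
    print(repr(s), is_slug(s))
```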
[A-Za-z0-9-]+
But your question is confusing as it asks for letters and numbers and has an example containing a dash.
This is a community wiki, an attempt to compile links to related questions about this "URL/SEO slugging" topic. Community is invited to contribute.
Related questions
regex/php: how can I convert 2+ dashes to singles and remove all dashes at the beginning and end of a string?
-this--is---a-test-- becomes this-is-a-test
Regex for [a-zA-Z0-9-] with dashes allowed in between but not at the start or end
allow spam123-spam-eggs-eggs1 reject eggs1-, -spam123, spam--spam
Translate “Lorem 3 ipsum dolor sit amet” into SEO friendly “Lorem-3-ipsum-dolor-sit-amet” in Java?
Related tags
[slug]