To illustrate my problem with an example, in the following paragraph,
1.If we may believe the Egyptians, Hephaestus was the son of the Nile, and with him philosophy began, priests and prophets being its chief exponents. 2. Hephaestus lived 48,863 years before Alexander of Macedon, and in the interval there occurred 373 solar and 832 lunar eclipses. The date of the Magians, beginning with Zoroaster the Persian, was 5000 years before the fall of Troy, as given by Hermodorus the Platonist in his work on mathematics; but Xanthus the Lydian reckons 6000 years from Zoroaster to the expedition of Xerxes, and after that event he places a long line of Magians in succession, bearing the names of Ostanas, Astrampsychos, Gobryas, and Pazatas, down to the conquest of Persia by Alexander,
I want to remove the "1." and "2.", but not the "373", "832", or any of the other numbers. The document contains much more than just this example, so just removing single digit numbers won't work. I assume this is fairly straightforward, but I'm new to using regex and I've been finding it difficult.
Find \d+\. and replace with empty string.
Related
I'm using a firefox addon called "rikaisama", this addon is a pop up dictionary for japanese and it allows epwing dictionary files. In the addon option we can use one regular expression to remove unnecessary parts of a dictionary entry.
I'm using the "Kenkyusha's New Japanese-English Dictionary" epwing file but it has way too much examples to be readable.
Example of an entry :
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
・締め切りに間に合う meet the deadline
・開演に間に合う arrive before curtain time
▲9 時の札幌行きに間に合うように空港に着いた. I arrived in time for the nine o'clock flight to Sapporo.
・「間に合うかな」「走っても間に合いそうにないね」 "Will we be in time?"―"It doesn't look like we'll be in time even if we run."
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
・これだけあれば丸 1 年は間に合う. This will last us [see us through] one whole year. | This will be enough for a whole year.
Where all entries starting with "▲" or "・" are examples and all entries matching this regex are definitions :
\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
I already managed to come up with this regular expression on my own but it removes all examples:
\n[^″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*
Is it possible to have a regex matching this regex AND the following line of the match ?
Wished result :
まにあう【間に合う】 ローマ(maniau)
1 〔時間に遅れない〕 be in time 《for…》.
▲7 時の列車に間に合う catch [make] the 7 o'clock train
2 〔役に立つ〕 answer [serve, suit, meet] the purpose; be useful; be serviceable; be of use [service]; be good enough; 〔十分である〕 be enough; 〔用意ができる〕 be ready; 〔必要をみたす〕 meet the requirements; serve the [one's] turn [need].
▲「費用はどのぐらいかな」「5 万もあれば間に合うよ」 "And what is the expense?"―"Fifty-thousand yen should cover it."
Any help appreciated !
You almost had it I think - just add \n.* so it becomes /\n[″*〖〈《⇒=➡【〔(〜A-Za-z0-9].*\n.*/. That makes it get the next line...
See it in action here: https://regexr.com/442st
source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically 1985 is supposed to be 1985 and not 1,985. I tried to put a \? so every time the pattern matches it stops and a °+ after so it has to detect a ° to match the pattern, but no success. It just replaces the ° and everything before that, complete mess.
My knowledge of regular expressions however combined with the substitute is weak and I'm stuck here.
EDIT
the first 3 numbers represent heights of mountains, those 3 need to change with a (,) and the last number ( 1985 ) represents a year, which must not be changed.
Mathematical solutions are not going to work as loophole since there are mountains with a height off less than 1900
You haven't told us what is the difference between 1985 and other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/(\d*[2-90])(\d\d\d)/\1,\2/g
Alternatively if that isn't what you want, you can use c flag (:h s_flags):
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
this line will leave the last 3 columns untouched, just do substitution on the content before it:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. Vim's printf() doesn't support %'d, it is pity. If you do have number > 1m, we can find other solutions.
update
I solved it myself, by using 3 seperate commands; one for every number string in the file:
%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for(#F[2..4]) { s/(\d+)(\d{3})/\1,\2/;} print join "|", #F'
I'm trying to extract regions around keywords from longer passages of text. They should include complete sentences, based on the following conditions:
n=250 Charactars before / after keyword should be included if existing (the keyword can be closer then this to the start / end of the text)
from there it should expand further to include the complete sentence (let's assume here we can define sentence borders with ".?! or :" knowing it's not completely accurate)
I already achieved the expanding to the end of the last sentence, but not to start of the first in the following example, where vitamin is the keyword and the italic is captured by the regex. However, it should capture from "An extra 24 hours..."
Apparently I don't get the corresponding group up front, neither using lazy nor using lookbehind.
((.{0,250}(vitamin)\b.{0,250})(.+?(\.|\!|\?|\:))?)/ig
Well, this year you’re getting an extra day to get ahead on your taxes or (finally) clean out the garage. (Hey, we’re not trying to tell you what do but you might as well be productive.) February 29 is back on the calendar this year because it’s a leap year. Whether you love or loathe the extra winter day, you’re probably wondering why it happens in the first place. An extra 24 hours — or day — is built into the calen dar every four years to ensure it aligns with the Earth’s movement around the sun. There’s 365 days in a calendar year, but it actually takes longer for the Earth’s annual journey — about 365.2421 days — around the star that gives us light, life and vitamin D. The difference may seem like no big deal to us, but over time, it adds up. “To ensure consistency with the true astronomical year, it is necessary to periodically add in an extra day to make up the lost time and get the calendar back in sync with the heavens,” according the history. com.
Acknowledgement of the need for a leap year happened around the time of Julius Caesar. In 46 B.C., Caesar enlisted the help of astronomer Sosigenes to update the calendar so that it had 12 months and 365 days, including a leap year every four years.,
You can try something like this:
(([.?!:][^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:])
It works by consuming 250 characters of text before and after the keyword "vitamin". From that point it finds the first punctuation point (.?!:) before/after the 250 characters of text.
Here's a sample of it in action.
You can you use extra parentheses () to strategically group what exact output you want. For example, the above answer includes the ending period from the preceding sentence in the output. So you could use
(([.?!:]([^.?!:]*.{250}\bvitamin\b.{250})[^.?!:]*[.?!:]))
and use group 3 from the result set which doesn't have this ending period.
I do not see how the specification in the question can be matched by a regex. It boils down to the following logic problem:
to match as many characters as possible but no more than 250 before/after the keyword, .{0,250} needs to be greedy and can neither be lazy .{0,250}? nor possessive .{0,250}+
if this part is greedy, you will miss the occurrences of the keyword that start before the .{0,250} part is matched.
The same logic applies to my understanding to the 'match back to the start of the sentnence as well.
I played around with the following more or less meaningful regex:
[.?!:]?([^.?!:]*?(.{0,250}\byear\b.{0,250})[^.?!:]*[.?!:]?) misses first 'year'
[.?!:]?([^.?!:]*?(.{0,250}?\byear\b.{0,250})[^.?!:]*[.?!:]?) gets the first 'year' but fails on others.
I suggest you write your on extraction logic in a function, eihter using regex or not, to achieve the extraction you want.
You could for example find the index of the start of the keyword \bkeyword\b and the full stops (\.[^\d]|[.?!:]$) and then with this information extract the part of the text you want.
I trying to create a regular expression to catch the following conditions, but totally failing to get my head around it (Friday) and need a bit of help please?
Trying to capture UK phone numbers starting with area code or no area code, but excluding mobiles.
example: 01316691234 or 6691234 but not any number starting with 07
got this so far ^[0-9]1?(\d{6,11}) but struggling to exclude the 07 numbers.
This is based on the supposition that UK area codes:
start with 0 and are followed by 1 (usual) or 2 (London);
run to 3-5 digits
are followed by a phone number 6-7 digits long
Whilst this seems sound to me, I'm no telecoms anorak so you'll need to modify accordingly if any part of this supposition is wrong:
/^(0[12345689]\d{1,3} ?)?\d{6,7}$/
Either way, it's a bit of a can of worms. Postscodes and phone numbers don't lend themselves well to REGEX; the more tightly you refine it, the more at risk you are from new rules being added tomorrow - e.g. if they launched a new area code starting 03.
Use a negative look ahead to prevent numbers starting with "07" matching:
^(?!07)([0-9]1)?(\d{6,11})
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".
A few that I can think off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)
There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info - everything for simple literal matching to backreferences and lookahead.
To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsysth-Edwards Notation which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
Additionaly, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4
Validate phone numbers (extract area code + rest of number with grouping) (Assuming US phone number, otherwise generalize for you style)
Play around with validating email address (probably want to tell the students that this is hugely complicated regular expression but for simple ones it is pretty straight forward)
regexplib.com has a good library you can search through for examples.
H0w about extract first name, middle name, last name, personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about Reg Ex to remove line breaks and tabs from the input
I would start with the common ones:
validate email
validate phone number
separate the parts of a URL
Be cruel. Tell them parse HTML.
RegEx match open tags except XHTML self-contained tags
Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm
I'm a fan of parsing date strings. Define a few common data formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.
Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in visual studio uses regex to randomly generate data).
Rather than teaching examples based from the data set, I would do examples from the perspective of the rule set to get basics across. Give them simple examples to solve that leads them to use ONE of several basic groupings in each solution. Then have a couple of "compound" regex's at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Grouping:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/$rubble/and betty/
Modifiers:
s/Abcd/def/gi
After this, I would give a few examples illustrating the pitfalls of trying to match html tags or other strings that shouldn't be done with regex's to show the limitations.
Try to think of some tests that don't include ones that can be found with Google.
Asking a email validator should pose no trouble finding..
Try something like a 5 proof test.
Input 5 digit. Sum up each digit must be dividable by five: 12345 = 1+2+3+4+5 = 15 / 5 = 3(.0)