The text below contains both URLs and email addresses in the middle of its sentences, but I want to extract only the URLs with a regular expression. The expected results are as follows.
www.united.com
https://www.bbc.com/sport/football/64698988
https://linuxpip.org
www.gggggg.ac.us
github.com
What should I do?
Example text:
"Wembley, Wembley, we're the famous Man United and we're off to Wembley," was the chant from the home supporters against Leicester.
United rode their luck, needing David de Gea two make two world-class saves to keep them in the contest, but two goals from Marcus rash#icloud.co.kr Rashford and one from Jadon Sancho helped them to a comfortable victory. gsgad#gmail.com England international Rashford is in the form of his life, taking his tally to 24 goals for the campaign, but Bruno Fernandes' impressive www.united.com performances have gone under the radar, https://www.bbc.com/sport/football/64698988 with the Portuguese playmaker providing two more assists on Sunday.
Free-flowing up front but solid in defence, https://linuxpip.org United's clean sheet against Leicester was their 10th in the league this season, two more than the entirety of the last campaign.
Ten Hag's men were www.gggggg.ac.us without midfield maestro report#abcdefcaf.net Casemiro, and it showed for large parts of the first half when they failed to gain control github.com in the middle of the park, but the Brazil international's return from suspension will provide a boost against the Magpies.
The regular expression below matches both the URLs and the email addresses.
(https?:\/\/)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)
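One possible way to keep the URLs and skip the email addresses is to reject any candidate that touches an email separator. The sketch below is only an illustration in Python: the names `URL_RE` and `extract_urls` are made up for this example, and it assumes the separator is either `@` or the `#` that appears in the sample text. A lookbehind stops a match from starting inside an address, and a lookahead stops it from ending right before the separator.

```python
import re

# Hypothetical pattern, not taken from the question.
URL_RE = re.compile(
    r"(?<![\w@#.-])"                  # must not start inside a word or an email
    r"(?:https?://)?"                 # optional scheme
    r"(?:[a-zA-Z0-9-]+\.)+"           # one or more dot-terminated host labels
    r"[a-zA-Z]{2,6}\b"                # top-level domain
    r"(?![@#])"                       # must not be the local part of an email
    r"(?:/\S*)?"                      # optional path
)

def extract_urls(text):
    """Return URL-like tokens in order, with trailing punctuation stripped."""
    return [m.rstrip(".,;:!?") for m in URL_RE.findall(text)]

sample = (
    "two goals from Marcus rash#icloud.co.kr Rashford helped them to a comfortable "
    "victory. gsgad#gmail.com Bruno Fernandes' impressive www.united.com performances "
    "have gone under the radar, https://www.bbc.com/sport/football/64698988 with two "
    "more assists. Free-flowing up front, https://linuxpip.org United's clean sheet "
    "was their 10th. Ten Hag's men were www.gggggg.ac.us without midfield maestro "
    "report#abcdefcaf.net Casemiro, failing to gain control github.com in the middle."
)

for url in extract_urls(sample):
    print(url)
```

A known limit of this sketch: a sentence with no space after its full stop (for example "radar.The") would also look like a host name, so the result may need a second filtering pass on real text.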
You can copy this directly into a Playground and you will see that the text is cut off. Why is it cut off, and how can I prevent it?
import SwiftUI
import PlaygroundSupport
struct ContentView: View {
    var body: some View {
        VStack(alignment: .leading) {
            Text("Background").font(.title).padding()
Text("Ahmad Shah DURRANI unified the Pashtun tribes and founded Afghanistan in 1747. The country served as a buffer between the British and Russian Empires until it won independence from notional British control in 1919. A brief experiment in democracy ended in a 1973 coup and a 1978 communist countercoup. The Soviet Union invaded in 1979 to support the tottering Afghan communist regime, touching off a long and destructive war. The USSR withdrew in 1989 under relentless pressure by internationally supported anti-communist mujahidin rebels. A series of subsequent civil wars saw Kabul finally fall in 1996 to the Taliban, a hardline Pakistani-sponsored movement that emerged in 1994 to end the country's civil war and anarchy. Following the 11 September 2001 terrorist attacks, a US, Allied, and anti-Taliban Northern Alliance military action toppled the Taliban for sheltering Usama BIN LADIN.\nA UN-sponsored Bonn Conference in 2001 established a process for political reconstruction that included the adoption of a new constitution, a presidential election in 2004, and National Assembly elections in 2005. In December 2004, Hamid KARZAI became the first democratically elected president of Afghanistan, and the National Assembly was inaugurated the following December. KARZAI was reelected in August 2009 for a second term. The 2014 presidential election was the country's first to include a runoff, which featured the top two vote-getters from the first round, Abdullah ABDULLAH and Ashraf GHANI. Throughout the summer of 2014, their campaigns disputed the results and traded accusations of fraud, leading to a US-led diplomatic intervention that included a full vote audit as well as political negotiations between the two camps. In September 2014, GHANI and ABDULLAH agreed to form the Government of National Unity, with GHANI inaugurated as president and ABDULLAH elevated to the newly-created position of chief executive officer. 
The day after the inauguration, the GHANI administration signed the US-Afghan Bilateral Security Agreement and NATO Status of Forces Agreement, which provide the legal basis for the post-2014 international military presence in Afghanistan. After two postponements, the next presidential election has been re-scheduled for September 2019.\nThe Taliban remains a serious challenge for the Afghan Government in almost every province. The Taliban still considers itself the rightful government of Afghanistan, and it remains a capable and confident insurgent force fighting for the withdrawal of foreign military forces from Afghanistan, establishment of sharia law, and rewriting of the Afghan constitution. In 2019, negotiations between the US and the Taliban in Doha entered their highest level yet, building on momentum that began in late 2018. Underlying the negotiations is the unsettled state of Afghan politics, and prospects for a sustainable political settlement remain unclear.").lineLimit(5000).padding()
            Text("another")
            Text("text")
        }
    }
}
PlaygroundPage.current.setLiveView(ContentView())
I think it is to do with how SwiftUI renders the VStack with regard to the available screen space. You would expect the large Text to take up the remaining space and you wouldn't be able to see the final two Texts. However you can see them, and the large Text has been truncated. I think the VStack is trying to fit all the items on the screen. If you add additional items after the large Text then it truncates the large Text even further.
Setting a .frame with a maxHeight of .infinity has no effect.
However, if you wrap your VStack in a ScrollView or a List then the full text is shown.
If you don't want a line limit on your Text, pass nil to lineLimit rather than picking some arbitrarily large number. The documentation says:
If nil, no line limit applies.
struct ContentView: View {
    var body: some View {
        ScrollView {
            VStack(alignment: .leading) {
                Text("Background").font(.title).padding()
Text("Ahmad Shah DURRANI unified the Pashtun tribes and founded Afghanistan in 1747. The country served as a buffer between the British and Russian Empires until it won independence from notional British control in 1919. A brief experiment in democracy ended in a 1973 coup and a 1978 communist countercoup. The Soviet Union invaded in 1979 to support the tottering Afghan communist regime, touching off a long and destructive war. The USSR withdrew in 1989 under relentless pressure by internationally supported anti-communist mujahidin rebels. A series of subsequent civil wars saw Kabul finally fall in 1996 to the Taliban, a hardline Pakistani-sponsored movement that emerged in 1994 to end the country's civil war and anarchy. Following the 11 September 2001 terrorist attacks, a US, Allied, and anti-Taliban Northern Alliance military action toppled the Taliban for sheltering Usama BIN LADIN.\nA UN-sponsored Bonn Conference in 2001 established a process for political reconstruction that included the adoption of a new constitution, a presidential election in 2004, and National Assembly elections in 2005. In December 2004, Hamid KARZAI became the first democratically elected president of Afghanistan, and the National Assembly was inaugurated the following December. KARZAI was reelected in August 2009 for a second term. The 2014 presidential election was the country's first to include a runoff, which featured the top two vote-getters from the first round, Abdullah ABDULLAH and Ashraf GHANI. Throughout the summer of 2014, their campaigns disputed the results and traded accusations of fraud, leading to a US-led diplomatic intervention that included a full vote audit as well as political negotiations between the two camps. In September 2014, GHANI and ABDULLAH agreed to form the Government of National Unity, with GHANI inaugurated as president and ABDULLAH elevated to the newly-created position of chief executive officer. 
The day after the inauguration, the GHANI administration signed the US-Afghan Bilateral Security Agreement and NATO Status of Forces Agreement, which provide the legal basis for the post-2014 international military presence in Afghanistan. After two postponements, the next presidential election has been re-scheduled for September 2019.\nThe Taliban remains a serious challenge for the Afghan Government in almost every province. The Taliban still considers itself the rightful government of Afghanistan, and it remains a capable and confident insurgent force fighting for the withdrawal of foreign military forces from Afghanistan, establishment of sharia law, and rewriting of the Afghan constitution. In 2019, negotiations between the US and the Taliban in Doha entered their highest level yet, building on momentum that began in late 2018. Underlying the negotiations is the unsettled state of Afghan politics, and prospects for a sustainable political settlement remain unclear.")
                    .lineLimit(nil)
                    .padding()
                Text("another")
                Text("text")
            }
        }
    }
}
Hi, I am developing an encoder-decoder model with attention that predicts a WTO Panel Report from the factual relation given as text input.
A sample sentence for a factual relation is as follows:
sample_sentence = "On 23 January 1995, the United States received a request from Venezuela to hold consultations under Article XXII:1 of the General Agreement on Tariffs and Trade 1994 (\"General Agreement\"), Article 14.1 of the Agreement on Technical Barriers to Trade (\"TBT Agreement\") and Article 4 of the Understanding on Rules and Procedures Governing the Settlement of Disputes (\"DSU\"), on the rule issued by the Environmental Protection Agency on 15 December 1993, entitled \"Regulation of Fuels and Fuel Additives - Standards for Reformulated and Conventional Gasoline\" (WT/DS2/1). The consultations between Venezuela and the United States took place on 24 February 1995. As they did not result in a satisfactory solution of the matter, Venezuela, in a communication dated 25 March 1995, requested the Dispute Settlement Body (\"DSB\") to establish a panel to examine the matter under Article XXIII:2 of the General Agreement and Article 6 of the DSU (WT/DS2/2). On 10 April 1995, the DSB established a panel in accordance with the request made by Venezuela. On 28 April 1995, the parties to the dispute agreed that the Panel should have standard terms of reference (DSU, Art. 7) and agreed on the composition of the Panel as follows"
I am trying to use Google's Word2Vec to encode each word into a 300-dimensional word vector. However, tokens such as the number 23 are not included in the Word2Vec vocabulary.
What would be the solution for this problem?
1) Use another word embedding, for example GloVe?
2) Or any other advice?
Thanks in advance for your help.
Edit:
I think that to fulfil this task successfully, I first have to understand how current NMT applications deal with the Named Entity Recognition problem before they actually train the model.
Any suggested literature?
Word2Vec only learns vectors for words it has seen often enough in its training corpus.
Maybe try replacing the numbers in your source with text, i.e. ("On the twenty third of ...")?
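A rough sketch of that preprocessing step could look like the following. The helpers `number_to_words` and `spell_out_numbers` are hypothetical names written for this example, they only cover integers below one million, and whether the resulting words ("twenty", "thousand", ...) are actually present in the embedding vocabulary is an assumption that should be checked against the model.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer with 0 <= n < 1000000 as space-separated words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")

def spell_out_numbers(text, limit=1000000):
    """Replace standalone digit runs with words; larger numbers stay as they are."""
    def repl(match):
        n = int(match.group())
        return number_to_words(n) if n < limit else match.group()
    return re.sub(r"\b\d+\b", repl, text)

print(spell_out_numbers("On 23 January 1995, the DSB established a panel."))
```

Digits glued to letters, such as the 2 in "WT/DS2/1", are left alone by the `\b\d+\b` pattern, so identifiers like document symbols would still need their own handling.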
Before I write my own method, I am curious whether there is a regex that can help me.
The Context
I am cleaning raw text prior to running statistical analyses on the terms. The text is from websites and thus includes menus (many menus from many websites).
A typical list/menu appears as follows (Except with one line break between items):
STUDENT SERVICES
Guidance & Support
Core Services
Admissions & Records
Financial Aid
Counseling
Assessment Testing
Kickstart Orientation
Tutoring
Career & Transfer Center
Student Welcome Center
The Task at Hand
I want to remove all lists.
I need to remove text blocks where there is a line break after every first, second, third, or fourth word, but only if this pattern repeats 3 or more times consecutively (I don't want to remove single short sentences such as "Students always succeed.").
Can a regex identify this pattern?
NOTE: I am working in Java.
UPDATE with sample text
[[[I WANT TO REMOVE THIS LIST]]]
Offices & Services
Student Services
Activities & Athletics
Records & Registration
Costs & Financial Aid
Compliance & Diversity
Alumni
Faculty/Staff Resources
BMCC Foundation
Human Resources
BMCC Homepage>Academics>Health Education>Course Listings
[[[I WANT TO REMOVE THIS LIST]]]
Health Education Home
Course Listings
Faculty
[[[I WANT TO REMOVE THIS LIST]]]
Community Health Education
Gerontology
School Health Education
Public Health
Visit Admissions
Course Listings
[[[I WANT TO KEEP TEXT BELOW]]]
The following courses are offered by the Department of Health Education.
2CRS., 2HRS, 0 LAB HRS.
HED 100
Health Education
This is an introductory survey course to health education. The course provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. The primary areas of instruction include: health and wellness; stress; human sexuality; alcohol, tobacco and substance abuse; nutrition and weight management; and physical fitness. Students who have completed HED 110 - Comprehensive Health Education will not receive credit for this course.
3CRS., 3HRS, 0 LAB HRS.
HED 110
Comprehensive Health Education
This course in health educations offers a comprehensive approach that provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. Areas of specialization include: alcohol, tobacco and abused substances, mental and emotional health, human sexuality and family living, nutrition, physical fitness, cardiovascular health, environmental health and health care delivery. HED 110 fulfills all degree requirements for HE 100. Students who have completed HED 100 - Health Education will not receive credit for this course.
Assuming the part about the number of words is not important, try a regex pattern such as (([A-Za-z& ])*(\r\n|\n|\r)){5,}.
Change the quantifier of five as needed; it is just an example. A minimum of five would not match two lines followed by an extra newline, or a three-line list without a trailing newline.
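If the word count does matter, the same idea can be tightened so that a line only counts when it holds one to four whitespace-separated tokens. The following is a sketch in Python rather than Java, with made-up names (`SHORT_LINE`, `LIST_BLOCK`, `remove_lists`); the pattern itself uses only groups, quantifiers and character classes, so it should carry over to `java.util.regex` with `Pattern.MULTILINE`. It assumes every line, including the last one, ends with a line break.

```python
import re

# One line holding 1 to 4 tokens; [^\S\n] means "whitespace except newline".
SHORT_LINE = r"[^\S\n]*\S+(?:[^\S\n]+\S+){0,3}[^\S\n]*\n"
# Three or more such lines in a row form a list block.
LIST_BLOCK = re.compile(r"(?:^" + SHORT_LINE + r"){3,}", re.MULTILINE)

def remove_lists(text):
    """Delete every run of three or more consecutive short lines."""
    return LIST_BLOCK.sub("", text)

sample = (
    "Offices & Services\n"
    "Student Services\n"
    "Alumni\n"
    "Human Resources\n"
    "The following courses are offered by the Department of Health Education.\n"
    "Students always succeed.\n"
    "This is an introductory survey course to health education.\n"
)

print(remove_lists(sample))
```

The single short sentence survives because it is not part of a run of three. Short runs of course metadata (for example a code line followed by a title line) also survive, but any three short lines in a row are removed, so the thresholds are worth tuning on real pages.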
I have a large text and I am trying to find the words that occur most frequently before and after a given word in it.
For example:
I want to know which words occur most frequently after "lake". Ideally I would get something like: (word 1, # occurrences), (word 2, # occurrences), ...
The same for the words that come before it.
I tried NLTK bigrams, but it seems they only find the most common n-grams. Is it possible to fix one of the words and find the most frequent n-grams containing that fixed word?
Thanks for any help!
Are you looking for something like this?
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()
from nltk import bigrams
from collections import Counter
bgs = bigrams(text)
# Keep only the bigrams whose first word is 'lake'.
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)
# Count the words that follow 'lake'.
c = Counter(map(lambda item: item[1], lake_bgs))
print(c.most_common())
Which outputs:
[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]
Note that on Python 2 you might want to use itertools.ifilter, imap, etc. if you have a very long text; on Python 3, filter and map are already lazy.
Edit: Here is the code for before and after 'lake'.
from nltk import trigrams
from collections import Counter
tgs = trigrams(text)
# Materialise the matches as a list so they can be traversed twice.
lake_tgs = [item for item in tgs if item[1] == 'lake']
before_lake = [item[0] for item in lake_tgs]
after_lake = [item[2] for item in lake_tgs]
c = Counter(before_lake + after_lake)
print(c.most_common())
Note that this can be done using bigrams as well :)
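As a rough illustration of the bigram variant, here is a minimal sketch that keeps separate counts for the words before and after the focus word. It uses plain `zip` in place of `nltk.bigrams` so that it runs without NLTK; the function name `neighbours` and the short token list are invented for the example.

```python
from collections import Counter

def neighbours(tokens, word):
    """Count the words directly before and directly after `word`."""
    before = Counter()
    after = Counter()
    # zip(tokens, tokens[1:]) yields the same pairs as a bigram generator.
    for left, right in zip(tokens, tokens[1:]):
        if right == word:
            before[left] += 1
        if left == word:
            after[right] += 1
    return before, after

tokens = "the lake is deep and the lake is cold but a lake can dry".split()
before, after = neighbours(tokens, "lake")
print(before.most_common())  # [('the', 2), ('a', 1)]
print(after.most_common())   # [('is', 2), ('can', 1)]
```

Keeping two counters avoids mixing "before" and "after" words in one ranking, which is what the question asks for.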
Just to add to @Ohad's answer, here's an n-gram implementation in NLTK with some scalability.
#-*- coding: utf8 -*-
import string
from nltk import ngrams
from itertools import chain
from collections import Counter
text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
"""
def ngrammer(txt, n):
    # Remove punctuation and digits, then build n-grams line by line.
    cleaned = "".join(ch for ch in txt if ch not in string.punctuation and not ch.isdigit())
    sentences = cleaned.split('\n')
    return list(chain(*[ngrams(s.split(), n) for s in sentences]))

def before_after(ngs, word):
    # Keep only the n-grams (n >= 3) whose second token is the focus word.
    word_grams = [item for item in ngs if item[1] == word]
    before = [item[0] for item in word_grams]
    after = [item[2] for item in word_grams]
    return before, after

bgs = ngrammer(text, 2)   # bigrams
tgs = ngrammer(text, 3)   # trigrams
xgs = ngrammer(text, 10)  # 10-grams
focus = 'lake'
bf, af = before_after(xgs, focus)
c = Counter(bf + af)
# Most common word before or after 'lake' from the 10-grams.
print(c.most_common()[0])