Word2Vec Doesn't Contain Embedding for Number 23

Hi, I am developing an encoder-decoder model with attention that predicts a WTO panel report from the factual relation given as text input.
A sample sentence for the factual relation is as follows:
sample_sentence = "On 23 January 1995, the United States received a request from Venezuela to hold consultations under Article XXII:1 of the General Agreement on Tariffs and Trade 1994 (\"General Agreement\"), Article 14.1 of the Agreement on Technical Barriers to Trade (\"TBT Agreement\") and Article 4 of the Understanding on Rules and Procedures Governing the Settlement of Disputes (\"DSU\"), on the rule issued by the Environmental Protection Agency on 15 December 1993, entitled \"Regulation of Fuels and Fuel Additives - Standards for Reformulated and Conventional Gasoline\" (WT/DS2/1). The consultations between Venezuela and the United States took place on 24 February 1995. As they did not result in a satisfactory solution of the matter, Venezuela, in a communication dated 25 March 1995, requested the Dispute Settlement Body (\"DSB\") to establish a panel to examine the matter under Article XXIII:2 of the General Agreement and Article 6 of the DSU (WT/DS2/2). On 10 April 1995, the DSB established a panel in accordance with the request made by Venezuela. On 28 April 1995, the parties to the dispute agreed that the Panel should have standard terms of reference (DSU, Art. 7) and agreed on the composition of the Panel as follows"
I am trying to use Google's pre-trained Word2Vec to encode each word as a 300-dimensional word vector. However, some tokens, such as the number 23, are not included in the Word2Vec vocabulary.
What would be the solution for this problem?
1) Use another word embedding, for example GloVe?
2) Or is there any other advice?
Thanks in advance for your help.
Edit:
I think that to complete this task successfully, I first have to understand how current NMT applications deal with the Named Entity Recognition problem before they actually train the model.
Any suggested literature?

Word2Vec only learns vectors for words that occurred often enough in its training corpus, so rarer tokens can be missing from the vocabulary.
Maybe try replacing the numbers in your source with text, i.e. "On the twenty third of ..."?
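The replacement suggested above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions: the helper names (number_to_words, spell_out_small_numbers) are made up for illustration, it writes cardinals ("twenty three") rather than ordinals, it only handles stand-alone numbers from 0 to 99, and it deliberately leaves years and references such as "14.1" or "XXII:1" untouched. Whether the resulting words are in your embedding's vocabulary still has to be checked against the model you load.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# One- or two-digit numbers that stand alone: not glued to letters or other
# digits, and not part of a reference such as "14.1", "XXII:1" or "DS2/1".
SMALL_NUMBER = re.compile(r"(?<![\w.:/])\d{1,2}(?!\w|[.:]\d)")


def number_to_words(n):
    """Spell out an integer between 0 and 99 as space-separated words."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 are supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_out_small_numbers(text):
    """Replace stand-alone one- or two-digit numbers with words."""
    return SMALL_NUMBER.sub(lambda m: number_to_words(int(m.group())), text)


print(spell_out_small_numbers("On 23 January 1995, see Article 14.1"))
```

Each spelled-out word becomes its own token, so "23" turns into the two tokens "twenty" and "three" after whitespace tokenization.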

Choose the appropriate way to deal with weights in svyset in Stata

I decided to repost here a request for help that I put on Statalist yesterday. I have not yet received any hints and thought it could be useful to widen the audience by posting it here.
The link to the original post is the following:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1659627-choose-the-appropriate-way-to-deal-with-weights-in-svyset?view=thread
Dear Members,
I designed a questionnaire to gather respondents' willingness to get vaccinated against COVID-19 via a discrete choice experiment. I relied on a company specialized in political opinion polls and market research to administer the survey. The company computed a weight for each respondent based on 1) the geographical location where the respondent lives (five macro-areas of Italy), 2) whether the respondent has a bachelor's degree or not, and 3) which age group she/he belongs to (five classes are considered).
The sum of the weights is equal to the number of individuals in the database. The individuals in the age classes 30-39 and 40-49 are oversampled, as per our request (related to a research hypothesis), so the proportion of these two classes within the sample is larger than the actual proportion in the Italian population. The weights are computed to account for this feature and to guarantee that the sample is representative of the characteristics of the Italian population.
I will use the data to estimate a logit model, multinomial logit models and mixed logit models.
The issue I am facing is how to properly declare the nature of the weight. I have no experience in using Stata to deal with this issue.
I am using Stata 17 on a PC with Windows 10 Pro 64 bit.
Combining the information from the video, the svyset manual and the results from the help for "weight", I tried to work out the most appropriate solution.
I tried to add the code here multiple times as well, but I kept receiving an error message about how I formatted it. My apologies.

Google Cloud VideoIntelligence Speech Transcription - Transcription Size

I use Google Cloud Video Intelligence speech transcription as follows:
from google.cloud import videointelligence
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]
operation = video_client.annotate_video(gs_video_path, features=features)
result = operation.result(timeout=3600)
I then print each transcript and store it in Django objects backed by PostgreSQL as follows:
transcriptions = result.annotation_results[0].speech_transcriptions
for transcription in transcriptions:
    best_alternative = transcription.alternatives[0]
    confidence = best_alternative.confidence
    transcript = best_alternative.transcript
    # Store each distinct transcript only once.
    if not SpeechTranscript.objects.filter(text=transcript).exists():
        SpeechTranscript.objects.create(text=transcript, confidence=confidence)
        print(f"Adding -> {confidence:4.10%} | {transcript.strip()}")
For instance, the following is the text that I receive from a sample video:
94.9425220490% | I refuse to where is it short sleeve dress shirt. I'm just not going there the president of the United States is a visit to Walter Reed hospital in mid-july format was the combination of weeks of cajoling by trump staff and allies to get the presents for both public health and political perspective wearing a mask to protect against the spread of covid-19 reported in advance of that watery trip and I quote one presidential aide to the president to set an example for a supporters by wearing a mask and the visit.
94.3865835667% | Mask wearing is because well science our best way to slow the spread of the coronavirus. Yes trump or Matthew or 3 but if you know what he said while doing sell it still anybody's guess about what can you really think about NASCAR here is what probably have a mass give you probably have a hospital especially and that particular setting were you talking to a lot of people I think it's but I do believe it. Have a a time and a place very special line trump saying I've never been against masks but I do believe they have a time and a place that isn't exactly a ringing endorsement for mask wearing.
94.8513686657% | Republican skip this isn't it up to four men over the perfumer's that wine about time and place should be a blinking red warning light for people who think debate over whether last for you for next coronavirus. They are is finally behind us time in a place lined everything you need to know about weird Trump is like headed next time he'll get watery because it was a hospital and will continue to express not so scepticism to wear masks in public house new CDC guidelines recommending that mask to be worn inside and one social this thing is it possible outside he sent this?
92.9862976074% | He wearing a face mask as agreed presidents prime minister's dictators Kings Queens and somehow. I don't see it for myself literally main door he responded this way back backstage, but they said you didn't need it trump went to Michigan to this later and he appeared in which personality approaching Mark former vice president Joe Biden
94.6677267551% | In his microwave fighting for wearing a mask and he walked onto the stage where it is massive mask there's nobody understands and there's any takes it off you like to have it hanging off you. I think it makes them feel good frankly if you want to know the truth who's got the largest basket together. Seen it because trump thinks that maths make him and people generally I guess what a week or something is resistant wearing one in public from 1 today which has had a correlation between the erosion of the public's confidence and trump have the corner coronavirus and his number is SE6 a second term in the 67.
94.9921131134% | The coronavirus pandemic in the heels of national and swings they both lots of them that show trump slipping further and further behind former vice president Joe Biden when it comes to General Election good policy would seem to make for good politics at all virtually every infectious disease expert believes that wearing masks in public is our best to contain the spread of coronavirus until a vaccine would do well to listen to buy on this one a mare is the point we make episode every Tuesday and Thursday make sure to check them all out.
What is the expected size of each transcript generated in the speech transcription results? What determines the size of each transcript? What are the maximum and minimum character lengths? How should I size my SQL table column to be prepared for the expected transcript size?
As I mentioned in the comments, the Video Intelligence transcripts are split into segments of roughly 50-60 seconds of video.
I have created a Public Issue Tracker case so that the product team can clarify this information in the documentation. Although I do not have an ETA for this request, I encourage you to follow the case's thread.
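Until the segment length is documented, one option is to measure the transcripts you have already received and to avoid a fixed column width altogether: PostgreSQL's text type has no declared length limit, and Django's TextField maps to it. The helper below is a minimal sketch for the measuring part; the function name summarize_transcript_lengths is made up for illustration, and the numbers it returns only describe your own sample, not a limit guaranteed by the API.

```python
def summarize_transcript_lengths(transcripts):
    """Return basic character-length statistics for a list of transcripts."""
    lengths = [len(text.strip()) for text in transcripts]
    if not lengths:
        return None
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Example with placeholder strings; in practice pass the transcript texts
# collected from the annotation results.
stats = summarize_transcript_lengths(["first segment", "a second, longer segment"])
print(stats)
```

If a bounded column is still required, the observed maximum plus a generous margin is a starting point, but it remains an empirical guess.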

Text will be cut by SwiftUI, but why?

You can copy this directly into a Playground and you will see that the text is cut off, but I don't know why. How can I prevent this?
import SwiftUI
import PlaygroundSupport
struct ContentView: View {
    var body: some View {
        VStack(alignment: .leading) {
            Text("Background").font(.title).padding()
Text("Ahmad Shah DURRANI unified the Pashtun tribes and founded Afghanistan in 1747. The country served as a buffer between the British and Russian Empires until it won independence from notional British control in 1919. A brief experiment in democracy ended in a 1973 coup and a 1978 communist countercoup. The Soviet Union invaded in 1979 to support the tottering Afghan communist regime, touching off a long and destructive war. The USSR withdrew in 1989 under relentless pressure by internationally supported anti-communist mujahidin rebels. A series of subsequent civil wars saw Kabul finally fall in 1996 to the Taliban, a hardline Pakistani-sponsored movement that emerged in 1994 to end the country's civil war and anarchy. Following the 11 September 2001 terrorist attacks, a US, Allied, and anti-Taliban Northern Alliance military action toppled the Taliban for sheltering Usama BIN LADIN.\nA UN-sponsored Bonn Conference in 2001 established a process for political reconstruction that included the adoption of a new constitution, a presidential election in 2004, and National Assembly elections in 2005. In December 2004, Hamid KARZAI became the first democratically elected president of Afghanistan, and the National Assembly was inaugurated the following December. KARZAI was reelected in August 2009 for a second term. The 2014 presidential election was the country's first to include a runoff, which featured the top two vote-getters from the first round, Abdullah ABDULLAH and Ashraf GHANI. Throughout the summer of 2014, their campaigns disputed the results and traded accusations of fraud, leading to a US-led diplomatic intervention that included a full vote audit as well as political negotiations between the two camps. In September 2014, GHANI and ABDULLAH agreed to form the Government of National Unity, with GHANI inaugurated as president and ABDULLAH elevated to the newly-created position of chief executive officer. 
The day after the inauguration, the GHANI administration signed the US-Afghan Bilateral Security Agreement and NATO Status of Forces Agreement, which provide the legal basis for the post-2014 international military presence in Afghanistan. After two postponements, the next presidential election has been re-scheduled for September 2019.\nThe Taliban remains a serious challenge for the Afghan Government in almost every province. The Taliban still considers itself the rightful government of Afghanistan, and it remains a capable and confident insurgent force fighting for the withdrawal of foreign military forces from Afghanistan, establishment of sharia law, and rewriting of the Afghan constitution. In 2019, negotiations between the US and the Taliban in Doha entered their highest level yet, building on momentum that began in late 2018. Underlying the negotiations is the unsettled state of Afghan politics, and prospects for a sustainable political settlement remain unclear.").lineLimit(5000).padding()
            Text("another")
            Text("text")
        }
    }
}
PlaygroundPage.current.setLiveView(ContentView())
I think it has to do with how SwiftUI lays out the VStack within the available screen space. You would expect the large Text to take up the remaining space, so that you could not see the final two Texts. However, you can see them, and the large Text has been truncated. I think the VStack is trying to fit all the items on the screen. If you add more items after the large Text, it is truncated even further.
Setting a .frame with a maxHeight of .infinity has no effect.
However, if you wrap your VStack in a ScrollView or a List, the full text is shown.
If you don't want a line limit on your Text, you can pass nil to lineLimit rather than picking some arbitrarily large number. The documentation says:
If nil, no line limit applies.
struct ContentView: View {
    var body: some View {
        ScrollView {
            VStack(alignment: .leading) {
                Text("Background").font(.title).padding()
Text("Ahmad Shah DURRANI unified the Pashtun tribes and founded Afghanistan in 1747. The country served as a buffer between the British and Russian Empires until it won independence from notional British control in 1919. A brief experiment in democracy ended in a 1973 coup and a 1978 communist countercoup. The Soviet Union invaded in 1979 to support the tottering Afghan communist regime, touching off a long and destructive war. The USSR withdrew in 1989 under relentless pressure by internationally supported anti-communist mujahidin rebels. A series of subsequent civil wars saw Kabul finally fall in 1996 to the Taliban, a hardline Pakistani-sponsored movement that emerged in 1994 to end the country's civil war and anarchy. Following the 11 September 2001 terrorist attacks, a US, Allied, and anti-Taliban Northern Alliance military action toppled the Taliban for sheltering Usama BIN LADIN.\nA UN-sponsored Bonn Conference in 2001 established a process for political reconstruction that included the adoption of a new constitution, a presidential election in 2004, and National Assembly elections in 2005. In December 2004, Hamid KARZAI became the first democratically elected president of Afghanistan, and the National Assembly was inaugurated the following December. KARZAI was reelected in August 2009 for a second term. The 2014 presidential election was the country's first to include a runoff, which featured the top two vote-getters from the first round, Abdullah ABDULLAH and Ashraf GHANI. Throughout the summer of 2014, their campaigns disputed the results and traded accusations of fraud, leading to a US-led diplomatic intervention that included a full vote audit as well as political negotiations between the two camps. In September 2014, GHANI and ABDULLAH agreed to form the Government of National Unity, with GHANI inaugurated as president and ABDULLAH elevated to the newly-created position of chief executive officer. 
The day after the inauguration, the GHANI administration signed the US-Afghan Bilateral Security Agreement and NATO Status of Forces Agreement, which provide the legal basis for the post-2014 international military presence in Afghanistan. After two postponements, the next presidential election has been re-scheduled for September 2019.\nThe Taliban remains a serious challenge for the Afghan Government in almost every province. The Taliban still considers itself the rightful government of Afghanistan, and it remains a capable and confident insurgent force fighting for the withdrawal of foreign military forces from Afghanistan, establishment of sharia law, and rewriting of the Afghan constitution. In 2019, negotiations between the US and the Taliban in Doha entered their highest level yet, building on momentum that began in late 2018. Underlying the negotiations is the unsettled state of Afghan politics, and prospects for a sustainable political settlement remain unclear.")
                    .lineLimit(nil)
                    .padding()
                Text("another")
                Text("text")
            }
        }
    }
}

Correct regex for finding string with two or three words in it

I have a text as given:
(3) Reflects the adoption of SFAS No. 128, EARNINGS PER SHARE.
<PAGE>
ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS
OF OPERATION
YEAR ENDED DECEMBER 28, 1997 COMPARED TO THE YEAR ENDED DECEMBER 29, 1996
In November 1996, the Company initiated a major restructuring and growth
plan designed to substantially reduce its cost structure and grow the business
in order to restore higher levels of profitability for the Company. By July
1997, the Company completed the major phases of the restructuring plan. The
$225.0 million of annualized cost savings anticipated from the restructuring
results primarily from the consolidation of administrative functions within
Here, I want to extract "MANAGEMENT'S DISCUSSION AND ANALYSIS", which occurs after <PAGE>. There are many other occurrences of "MANAGEMENT'S DISCUSSION AND ANALYSIS" in the document (I have not copied the whole document, as it is 1000+ pages long).
I used the following regular expression:
pattern = ('?<=<PAGE>')('.*')('?=Management\'s Discussion')
but it gives this error:
TypeError: 'str' object is not callable
What is wrong, and how can I fix it?
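A note on the error itself: in Python, ('a')('b') is parsed as calling the string 'a' with the argument 'b', which is why the interpreter reports that a 'str' object is not callable. The groups have to be written inside a single pattern string instead. Two further points apply to the sample shown: the heading is in upper case, so the match has to ignore case, and the heading sits on the line after <PAGE>, so the pattern has to allow for the line break. The following is a minimal sketch, assuming the heading follows <PAGE> directly with at most an "ITEM 7."-style prefix in between; the names HEADING_AFTER_PAGE and headings_after_page are made up for illustration, and a 1000+ page filing may need a looser pattern.

```python
import re

HEADING_AFTER_PAGE = re.compile(
    r"<PAGE>\s*"                    # the page marker and the line break after it
    r"(?:ITEM\s+\d+[A-Z]?\.\s*)?"   # optional "ITEM 7." prefix (assumed layout)
    r"(MANAGEMENT'S\s+DISCUSSION\s+AND\s+ANALYSIS)",
    re.IGNORECASE,
)


def headings_after_page(text):
    """Return every heading that directly follows a <PAGE> marker."""
    return [match.group(1) for match in HEADING_AFTER_PAGE.finditer(text)]


sample = (
    "see MANAGEMENT'S DISCUSSION AND ANALYSIS elsewhere in this report.\n"
    "<PAGE>\n"
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS\n"
    "OF OPERATION\n"
)
print(headings_after_page(sample))
```

Only the occurrence after <PAGE> is returned; the earlier mention in running text is skipped because no page marker precedes it.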

Re-Training spaCy's NER v1.8.2 - Training Volume and Mix of Entity Types

I'm in the process of (re-) training spaCy's Named Entity Recognizer and have a couple of doubts that I hope a more experienced researcher/practitioner can help me figure out:
1) If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100 000 entities per label excessive?
2) If I introduce a new label, is it best if the number of entities with that label is roughly the same (balanced) during training?
3) Regarding the mixing in 'examples of other entity types':
   a) Do I just add random known categories/labels to my training set, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?
   b) Can I use the same text for various labels, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?
   c) On a similar note, let's assume I want spaCy to also recognize a second COMMODITY. Could I then just use the same sentence and label a different region, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(69,80, 'COMMODITY')], )? Is that how it's supposed to be done?
   d) What ratio between new and other (old) labels is considered reasonable?
Thanks
PS: I'm working with Python 2.7 on Ubuntu 16.04 using spaCy 1.8.2.
For a full answer by Matthew Honnibal, check out issue 1054 on spaCy's GitHub page. Below are the most important points as they relate to my questions:
Question (Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100 000 entities per label excessive?
Answer(A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have, and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc, and see how that affects your accuracy.
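The procedure described in this answer can be sketched as a small loop. This is only an illustration under stated assumptions: learning_curve is a made-up helper, and train_and_evaluate stands for whatever function trains the NER on a subset and returns an accuracy on a fixed held-out set; it is not a spaCy API. The dummy evaluator at the bottom exists only to make the sketch runnable.

```python
import random


def learning_curve(examples, sizes, train_and_evaluate, seed=0):
    """Train on growing subsets and record (subset size, score) pairs."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable subsets
    curve = []
    for size in sizes:
        subset = shuffled[:size]
        curve.append((len(subset), train_and_evaluate(subset)))
    return curve


# Dummy evaluator for demonstration only: pretends the score grows with the
# amount of data. Replace it with real training plus held-out evaluation.
def fake_evaluate(subset):
    return len(subset) / 1000


print(learning_curve(range(1000), [500, 750, 1000], fake_evaluate))
```

If the score is still rising steeply between the last two sizes, more annotated data is likely to help; if it has flattened, adding more examples of the same kind probably will not.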
Q 2: If I introduce a new label, is it best if the number of the entities of that label are roughly the same (balanced) during training?
A: There's a trade-off between making the gradients too sparse, and making the learning problem too unrepresentative of what the actual examples will look like.
Q 3: Regarding the mixing in 'examples of other entity types':
Do I just add random known categories/labels to my training set?
A: No, one should annotate all the entities in that text, so the example above, ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ), should be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], ).
Can I use the same text for various labels?
A: Not in the way the examples were given. See the previous answer.
What ratio between new and other (old) labels is considered reasonable?
A: See the answer to Q 2.
PS: Double citations are direct quotes from the github issue answer.
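If the training set was already built with one label per copy of a sentence, as in the question, the copies can be merged so that each text carries all of its entities, which is what the answer to Q 3 asks for. The following is a minimal sketch in plain Python; merge_annotations is a made-up helper name, and it assumes the (text, [(start, end, label), ...]) tuple format used in the question.

```python
def merge_annotations(examples):
    """Combine entries that share a text into one entry with all spans."""
    merged = {}
    for text, spans in examples:
        merged.setdefault(text, set()).update(spans)
    # Sort spans by position; dicts keep the texts in first-seen order.
    return [(text, sorted(spans)) for text, spans in merged.items()]


sentence = ("The Business Standard published in its recent issue "
            "on crude oil and natural gas ...")
separate = [
    (sentence, [(4, 21, "ORG")]),
    (sentence, [(55, 64, "COMMODITY")]),
    (sentence, [(69, 80, "COMMODITY")]),
]
combined = merge_annotations(separate)
print(combined)

# Sanity check that the character offsets point at the intended text.
for start, end, label in combined[0][1]:
    print(label, repr(sentence[start:end]))
```

Slicing the sentence with each offset pair, as in the last loop, is a cheap way to catch off-by-one errors before training.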