What is the regular expression that only extracts the URL address? - regex

There are URLs and email addresses in the middle of the sentences below, but I want to extract only the URLs with a regular expression. The expected results are as follows:
www.united.com
https://www.bbc.com/sport/football/64698988
https://linuxpip.org
www.gggggg.ac.us
github.com
What should I do?
Example sentence:
"Wembley, Wembley, we're the famous Man United and we're off to Wembley," was the chant from the home supporters against Leicester.
United rode their luck, needing David de Gea to make two world-class saves to keep them in the contest, but two goals from Marcus rash#icloud.co.kr Rashford and one from Jadon Sancho helped them to a comfortable victory. gsgad#gmail.com England international Rashford is in the form of his life, taking his tally to 24 goals for the campaign, but Bruno Fernandes' impressive www.united.com performances have gone under the radar, https://www.bbc.com/sport/football/64698988 with the Portuguese playmaker providing two more assists on Sunday.
Free-flowing up front but solid in defence, https://linuxpip.org United's clean sheet against Leicester was their 10th in the league this season, two more than the entirety of the last campaign.
Ten Hag's men were www.gggggg.ac.us without midfield maestro report#abcdefcaf.net Casemiro, and it showed for large parts of the first half when they failed to gain control github.com in the middle of the park, but the Brazil international's return from suspension will provide a boost against the Magpies.
Use the regular expression below to match both URLs and email addresses:
(https?:\/\/)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)
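If you only need the URLs and not the (obfuscated) email addresses, one option is a variant of that pattern with a negative lookbehind, so anything directly preceded by "@" (or "#", as in the sample) is skipped. Here is a minimal Python sketch; the character classes are an adaptation of the regex above, and the text variable is just a trimmed stand-in for the example paragraph:

import re

url_pattern = re.compile(
    r"(?<![@#\w.])"                            # not preceded by an email marker
    r"(?:https?://)?(?:www\.)?"                # optional scheme and www
    r"[-a-zA-Z0-9%._+~=]{2,256}\.[a-z]{2,6}"   # domain and TLD
    r"\b[-a-zA-Z0-9%_.~?&/=#:+]*"              # optional path/query/fragment
)

text = ("Marcus rash#icloud.co.kr Rashford scored, gsgad#gmail.com but "
        "www.united.com and https://www.bbc.com/sport/football/64698988 and "
        "https://linuxpip.org and www.gggggg.ac.us and github.com were kept")

print([m.group(0) for m in url_pattern.finditer(text)])
# ['www.united.com', 'https://www.bbc.com/sport/football/64698988',
#  'https://linuxpip.org', 'www.gggggg.ac.us', 'github.com']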

Related

Google Cloud VideoIntelligence Speech Transcription - Transcription Size

I use Google Cloud Video Intelligence speech transcription as follows:
from google.cloud import videointelligence

# Request speech transcription for a video stored in Cloud Storage.
video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.SPEECH_TRANSCRIPTION]
operation = video_client.annotate_video(gs_video_path, features=features)
result = operation.result(timeout=3600)
I then display each transcript and store it in Django objects backed by PostgreSQL as follows:
transcriptions = result.annotation_results[0].speech_transcriptions
for transcription in transcriptions:
    # Keep only the most likely alternative for each transcript chunk.
    best_alternative = transcription.alternatives[0]
    confidence = best_alternative.confidence
    transcript = best_alternative.transcript
    # Store the transcript only if it has not been saved already.
    if SpeechTranscript.objects.filter(text=transcript).count() == 0:
        SpeechTranscript.objects.create(text=transcript,
                                        confidence=confidence)
        print(f"Adding -> {confidence:4.10%} | {transcript.strip()}")
For instance, the following is the text that I receive from a sample video:
94.9425220490% | I refuse to where is it short sleeve dress shirt. I'm just not going there the president of the United States is a visit to Walter Reed hospital in mid-july format was the combination of weeks of cajoling by trump staff and allies to get the presents for both public health and political perspective wearing a mask to protect against the spread of covid-19 reported in advance of that watery trip and I quote one presidential aide to the president to set an example for a supporters by wearing a mask and the visit.
94.3865835667% | Mask wearing is because well science our best way to slow the spread of the coronavirus. Yes trump or Matthew or 3 but if you know what he said while doing sell it still anybody's guess about what can you really think about NASCAR here is what probably have a mass give you probably have a hospital especially and that particular setting were you talking to a lot of people I think it's but I do believe it. Have a a time and a place very special line trump saying I've never been against masks but I do believe they have a time and a place that isn't exactly a ringing endorsement for mask wearing.
94.8513686657% | Republican skip this isn't it up to four men over the perfumer's that wine about time and place should be a blinking red warning light for people who think debate over whether last for you for next coronavirus. They are is finally behind us time in a place lined everything you need to know about weird Trump is like headed next time he'll get watery because it was a hospital and will continue to express not so scepticism to wear masks in public house new CDC guidelines recommending that mask to be worn inside and one social this thing is it possible outside he sent this?
92.9862976074% | He wearing a face mask as agreed presidents prime minister's dictators Kings Queens and somehow. I don't see it for myself literally main door he responded this way back backstage, but they said you didn't need it trump went to Michigan to this later and he appeared in which personality approaching Mark former vice president Joe Biden
94.6677267551% | In his microwave fighting for wearing a mask and he walked onto the stage where it is massive mask there's nobody understands and there's any takes it off you like to have it hanging off you. I think it makes them feel good frankly if you want to know the truth who's got the largest basket together. Seen it because trump thinks that maths make him and people generally I guess what a week or something is resistant wearing one in public from 1 today which has had a correlation between the erosion of the public's confidence and trump have the corner coronavirus and his number is SE6 a second term in the 67.
94.9921131134% | The coronavirus pandemic in the heels of national and swings they both lots of them that show trump slipping further and further behind former vice president Joe Biden when it comes to General Election good policy would seem to make for good politics at all virtually every infectious disease expert believes that wearing masks in public is our best to contain the spread of coronavirus until a vaccine would do well to listen to buy on this one a mare is the point we make episode every Tuesday and Thursday make sure to check them all out.
What is the expected size of a transcript generated within the speech transcription results? What determines the size of each transcript? What are the maximum and minimum character lengths? How should I size my SQL table column in order to be prepared for the expected transcript length?
As I mentioned in the comments, the Video Intelligence transcripts are split into chunks covering roughly 50-60 seconds of the video each.
I have created a Public Issue Tracker case, link, so the product team can clarify this information within the documentation. Although I do not have an ETA for this request, I encourage you to follow the case's thread.
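Given that the chunk size is not documented, one defensive option on the storage side is simply not to use a fixed-width column at all. Below is a minimal sketch of the Django model, reusing the SpeechTranscript name from the question; treat it as an assumption about your schema:

from django.db import models

class SpeechTranscript(models.Model):
    # TextField maps to PostgreSQL's TEXT type, which has no practical length
    # limit, so an unexpectedly long transcript chunk cannot be truncated.
    text = models.TextField()
    confidence = models.FloatField()

With TEXT there is no column size to design up front; if you later need an upper bound for UI or indexing reasons, you can enforce it with a validator rather than with the column type.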

Regex to identify Store Credit Card numbers

There are very detailed regular expressions to identify Visa, MasterCard, Discover and other popular credit card numbers.
However, there are tons of other credit cards, popularly termed store credit cards (these are not the Visa- or Amex-powered cards). Examples of these cards are Amazon, GAP brands, Williams Sonoma, Macy's and so on. Most of these are Synchrony Bank credit cards.
Is there a regex to identify these different brand credit card numbers?
It's ludicrous to use a regex to identify the network; all it takes is a prefix match, at most.
A card number typically has 16 digits (lengths from 13 to 19 exist). The first few digits identify the network and the issuing bank.
Some people will say that Visa starts with 4 and MasterCard starts with 5, but that's a broad approximation at best. Have a look at your own card; it should hold most of the time.
It would be easy to figure out what a card is if one could get a registry of known prefixes, but there is no public registry to my knowledge. I highly doubt that any of the parties involved would like to publish that information.
The first eight digits (until recently this was six digits) of an international card number are known as the Issuer Identification Number (IIN), and the registry that maintains this index is the American Bankers Association.
The list of IINs is updated monthly and spans tens of thousands of rows. Unfortunately, a fixed regex isn't going to stay accurate for any length of time.
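For completeness, here is what the prefix-matching approach looks like instead of a regex. The prefix table below is a tiny, partly fictional illustration; real IIN data would have to come from a licensed, regularly refreshed registry, as noted above:

# Illustrative prefix table; the store-card entry is made up.
IIN_PREFIXES = {
    "4": "Visa",
    "51": "MasterCard",
    "34": "American Express",
    "603535": "Example store card",
}

def identify_network(card_number):
    digits = "".join(ch for ch in card_number if ch.isdigit())
    # Longest matching prefix wins, mirroring how IIN ranges are resolved.
    for length in range(8, 0, -1):
        network = IIN_PREFIXES.get(digits[:length])
        if network:
            return network
    return "Unknown"

print(identify_network("4111 1111 1111 1111"))   # Visa
print(identify_network("6035 3512 3456 7890"))   # Example store card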

Word2Vec Doesn't Contain Embedding for Number 23

Hi, I am in the course of developing an encoder-decoder model with attention that predicts a WTO Panel Report for a given factual relation supplied as text input.
A sample sentence for a factual relation is as follows:
sample_sentence = "On 23 January 1995, the United States received a request from Venezuela to hold consultations under Article XXII:1 of the General Agreement on Tariffs and Trade 1994 (\"General Agreement\"), Article 14.1 of the Agreement on Technical Barriers to Trade (\"TBT Agreement\") and Article 4 of the Understanding on Rules and Procedures Governing the Settlement of Disputes (\"DSU\"), on the rule issued by the Environmental Protection Agency on 15 December 1993, entitled \"Regulation of Fuels and Fuel Additives - Standards for Reformulated and Conventional Gasoline\" (WT/DS2/1). The consultations between Venezuela and the United States took place on 24 February 1995. As they did not result in a satisfactory solution of the matter, Venezuela, in a communication dated 25 March 1995, requested the Dispute Settlement Body (\"DSB\") to establish a panel to examine the matter under Article XXIII:2 of the General Agreement and Article 6 of the DSU (WT/DS2/2). On 10 April 1995, the DSB established a panel in accordance with the request made by Venezuela. On 28 April 1995, the parties to the dispute agreed that the Panel should have standard terms of reference (DSU, Art. 7) and agreed on the composition of the Panel as follows"
I am trying to use Google's pretrained Word2Vec model to encode each word into 300-dimensional word vectors; however, tokens like the number 23 turn out not to be included in the Word2Vec vocabulary.
What would be the solution to this problem?
1) Use another word embedding, for example GloVe?
2) Any other advice?
Thx in advance for your help
Edit:
I think that to successfully fulfill this task, I first have to understand how current NMT applications deal with the named entity recognition problem before they actually train.
Any suggested literature?
Word2Vec only learns words it has seen many times.
Maybe try replacing the numbers in your source with their text form, e.g. "On the twenty-third of ..."?
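As a rough sketch of how that out-of-vocabulary handling can look in code (the file name of the pretrained GoogleNews vectors and the "number" placeholder are assumptions, and the API shown is gensim 4.x):

import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def embed(token, dim=300):
    # Return the pretrained vector when the token is known.
    if token in model.key_to_index:
        return model[token]
    # Bare numbers such as "23" are usually missing from the vocabulary;
    # map them onto a generic in-vocabulary placeholder as a crude fallback.
    if token.isdigit():
        return model["number"]
    # Anything else unknown becomes a zero vector.
    return np.zeros(dim)

sentence = "On 23 January 1995 the United States received a request"
vectors = np.stack([embed(t) for t in sentence.split()])
print(vectors.shape)   # (10, 300)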

Text Analysis Tools

I am currently building a data table in Base SAS and using an index function to flag certain company names embedded in a paragraph of text in a column. If the company name exists, I flag it with a one. When I've looked into the paragraphs in more detail, this simple approach doesn't work. Take the example below:
"John Smith advised Coca-cola on its merger with Pepsi." I'm searching on both Coca-cola and Pepsi but only want to flag Coca-cola in this example, as John Smith "advised" them; I don't want both Coca-cola and Pepsi flagged with a "1". I understand that I can write code that takes the words after certain anchor words such as "advised" or "represented", which does work, as sketched below. What happens if one record simply lists all the companies that they have advised without using an anchor word to identify them? Are there any tools out there that can do this automatically with AI?
Thanks
Chris
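To illustrate the anchor-word idea mentioned in the question, here is a minimal Python sketch (the company list, anchor verbs and sample sentence are assumptions); it flags a company only when it appears shortly after a verb such as "advised" or "represented":

import re

companies = ["Coca-cola", "Pepsi"]                 # assumed watch list
anchors = r"(?:advised|represented|acted for)"     # assumed anchor verbs

def flag_companies(text):
    hits = {name: 0 for name in companies}
    for name in companies:
        # Flag the company only if it follows an anchor verb within a few words.
        pattern = rf"{anchors}\s+(?:\w+\s+){{0,3}}{re.escape(name)}"
        if re.search(pattern, text, re.IGNORECASE):
            hits[name] = 1
    return hits

print(flag_companies("John Smith advised Coca-cola on its merger with Pepsi"))
# {'Coca-cola': 1, 'Pepsi': 0}

As the question points out, this breaks down as soon as there is no anchor verb; at that point a named entity recognition or relation extraction model from a general-purpose NLP toolkit is the more robust route.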

Web service or mechanism to detect Person, Place or an Object

Is there a web service or a tool to detect whether a certain piece of text is the name of a person, a place or an object (a device)?
e.g.:
Input: Bill Clinton Output: Person
Input: Blackberry Output: Device
Input: New york Output: Place
Accuracy can be low. I have looked at OpenCyc but I couldn't get it to work. Is there a way I can use Wikipedia for this?
For a start, separating a person from a thing would be great.
I think Wikipedia would be a very good source. Given the input, you could try to find an entry in Wikipedia and scrape the resulting page (if it exists).
Persons and places should have fairly distinct sets of data - birthdates, locations, etc. - in the article that you could use to tell them apart, and anything else is an object.
It's worth a shot anyway.
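A rough sketch of that Wikipedia idea, using the public MediaWiki API; the keyword heuristics at the end are assumptions for illustration, not a tested classifier:

import requests

API = "https://en.wikipedia.org/w/api.php"

def classify(name):
    # Fetch the plain-text introduction of the best-matching article.
    params = {
        "action": "query", "format": "json", "prop": "extracts",
        "exintro": 1, "explaintext": 1, "redirects": 1, "titles": name,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    extract = next(iter(pages.values())).get("extract", "").lower()

    # Naive keyword checks: birth data suggests a person, geography suggests
    # a place, and everything else falls through to "object".
    if "born" in extract or "birth" in extract:
        return "Person"
    if any(word in extract for word in ("city", "state", "country", "capital")):
        return "Place"
    return "Object"

for query in ("Bill Clinton", "New York", "BlackBerry"):
    print(query, "->", classify(query))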
Looking at the output of Wolfram Alpha, it seems that you can possibly identify a person by searching Bill Clinton Birthday or just Bill Clinton, or you can identify a location by searching New York GPS coordinates or just New York, for even better results. Blackberry seems like a tough word for Alpha, because it keeps wanting to interpret it as a fruit. You might have luck searching Froogle to identify a device.
It seems like WA will give you a fairly decent accuracy, at least if you're using famous people/places.
How about using a search engine? Google would be good, and I think Yahoo! has tools for building your own search.
I googled:
Results 1 - 10 of about 27,100,000 for "bill clinton" person
Results 1 - 10 of about 6,050,000 for "bill clinton" place
Results 1 - 10 of about 601,000 for "bill clinton" device
He's a person!
Results 1 - 10 of about 391,000,000 for "new york" place.
Results 1 - 10 of about 280,000,000 for "new york" person.
Results 1 - 10 of about 84,100,000 for "new york" device.
It's a place!
Results 1 - 10 of about 11,000,000 for "blackberry" person
Results 1 - 10 of about 36,600,000 for "blackberry" place
Results 1 - 10 of about 28,000,000 for "blackberry" device
Unfortunately, blackberry is a place as well. :-/
Note that only in the case of 'blackberry' did "device" even get close. Maybe you need to weight the page hit values. What is your application? Do you have any idea which "devices" you'd have to classify? What is the possible range of inputs?
Maybe you want to combine the results you get from different sources.
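A sketch of that hit-count voting idea; hit_count() here is a hypothetical placeholder for whatever search API you have access to, since the counts above were simply read off Google's result pages:

def hit_count(query):
    # Hypothetical helper: return the number of search results for the query.
    raise NotImplementedError("plug in your preferred search API here")

def classify(name, categories=("person", "place", "device")):
    # Count how often the name co-occurs with each category word and let the
    # largest count win; weighting or normalising the counts (as suggested
    # above) would slot in here.
    counts = {cat: hit_count(f'"{name}" {cat}') for cat in categories}
    return max(counts, key=counts.get)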
I think the basic task you're trying to accomplish is more formally known as named entity recognition. This task is nontrivial, and by only inputting the name stripped of any context, you're making it even harder.
For example, we'd like to think examples such as "Bill Clinton" and "New York" are obviously unambiguous, but looking at their disambiguation pages in Wikipedia shows that there are several potential entities they may refer to. "New York" is a state, a city, and a movie title. "Bill Clinton" is a bit less ambiguous if you're only looking at Wikipedia, but I'm sure you'll find dozens of Bill Clintons in any phonebook. It might also be the name of someone's sailboat or pet dog. What if someone inputs "Washington"? That could be a U.S. President, a state, a district, a city, a lake, a street, an island, a movie, one of several U.S. Navy ships, or a bridge, as well as other things. Determining which is the "correct" usage you'd want the webservice to return could become very complicated.
As much as Cyc knows, I think you'll find it's still not as comprehensive as Wikipedia. However, the main downside to Wikipedia is that it's essentially unstructured. Personally, I find Cyc's API so convoluted and poorly documented, that parsing Wikipedia's natural language almost seems easier.
If I had to implement such a webservice from scratch, I'd start by downloading a snapshot of Wikipedia, and then writing a parser that would read through all the articles, and generate a named entity index based on article titles. You could manually "classify" a few dozen examples as person/place/object, and train a classifier (Bayesian,Maxent,SVM) to automatically classify other examples based on the word frequencies of their articles.
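A minimal sketch of that last step, the word-frequency classifier, using scikit-learn; the three toy "articles" below stand in for the manually classified Wikipedia pages and are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for manually labelled Wikipedia article texts.
train_texts = [
    "born august 1946 american politician served as president",   # person
    "city in the united states located on the eastern coast",     # place
    "smartphone device with a physical keyboard and email",       # object
]
train_labels = ["person", "place", "object"]

# Bag-of-words features plus multinomial Naive Bayes, as suggested above.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["politician born in 1947 who served as governor"])[0])
# person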