Text Analysis Tools - sas

I am currently building a data table in Base SAS and using the INDEX function to flag certain company names embedded in a paragraph of text in a column. If the company name exists, I flag it with a 1. Having looked at the paragraphs in more detail, I can see that this simple approach doesn't work. Take this example:
"John Smith advised Coca-Cola on its merger with Pepsi". I'm searching on both Coca-Cola and Pepsi but only want to flag Coca-Cola in this example, as John Smith "advised" them. I don't want both Coca-Cola and Pepsi flagged with a "1". I understand that I can write code that takes the words after certain anchor words such as "advised" or "represented", which does work. But what happens if a record simply lists all the companies that someone has advised without using any anchor words to identify them? Are there any tools out there that can do this automatically using AI?
Thanks
Chris
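The anchor-word approach described in the question can be sketched as follows. This is a minimal illustration in Python rather than SAS, and the names `COMPANIES`, `ANCHORS`, and `flag_advised` are hypothetical, not part of any existing tool. It flags a company only when its name directly follows one of the anchor words.

```python
import re

# Hypothetical inputs: both lists are illustrative assumptions.
COMPANIES = ["Coca-Cola", "Pepsi"]
ANCHORS = ["advised", "represented"]


def flag_advised(text, companies=COMPANIES, anchors=ANCHORS):
    """Return {company: 0 or 1}; 1 only if the name directly follows an anchor word."""
    anchor_alt = "|".join(re.escape(a) for a in anchors)
    flags = {}
    for company in companies:
        # e.g. "advised Coca-Cola" matches, "merger with Pepsi" does not.
        pattern = rf"\b(?:{anchor_alt})\s+{re.escape(company)}\b"
        found = re.search(pattern, text, flags=re.IGNORECASE)
        flags[company] = int(found is not None)
    return flags


sentence = "John Smith advised Coca-Cola on its merger with Pepsi"
print(flag_advised(sentence))
```

A record that merely lists companies without any anchor word gets 0 for every company under this rule, which is exactly the limitation the question raises.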

What is the regular expression that only extracts the URL address?

The text below contains both URLs and email addresses in the middle of its sentences. I want a regular expression that extracts only the URLs. The expected results are as follows.
www.united.com
https://www.bbc.com/sport/football/64698988
https://linuxpip.org
www.gggggg.ac.us
github.com
What should I do?
example sentence:
"Wembley, Wembley, we're the famous Man United and we're off to Wembley," was the chant from the home supporters against Leicester.
United rode their luck, needing David de Gea two make two world-class saves to keep them in the contest, but two goals from Marcus rash#icloud.co.kr Rashford and one from Jadon Sancho helped them to a comfortable victory. gsgad#gmail.com England international Rashford is in the form of his life, taking his tally to 24 goals for the campaign, but Bruno Fernandes' impressive www.united.com performances have gone under the radar, https://www.bbc.com/sport/football/64698988 with the Portuguese playmaker providing two more assists on Sunday.
Free-flowing up front but solid in defence, https://linuxpip.org United's clean sheet against Leicester was their 10th in the league this season, two more than the entirety of the last campaign.
Ten Hag's men were www.gggggg.ac.us without midfield maestro report#abcdefcaf.net Casemiro, and it showed for large parts of the first half when they failed to gain control github.com in the middle of the park, but the Brazil international's return from suspension will provide a boost against the Magpies.
The regular expression below matches both URLs and email addresses.
(https?:\/\/)?(www\.)?[-a-zA-Z0-9#:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9#:%_\+.~#?&//=]*)
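One possible way to keep only the URLs is to split the text on whitespace and discard every token that contains the `#` (or `@`) marking an email address in this sample. The sketch below takes that approach in Python. The pattern `URL_RE` and the function `extract_urls` are assumptions made for illustration and are deliberately simpler than the expression above.

```python
import re

# Assumed URL shape: optional scheme, optional "www.", dotted host, optional path.
URL_RE = re.compile(
    r"(?:https?://)?(?:www\.)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,6}(?:/\S*)?"
)


def extract_urls(text):
    """Return URL-like tokens, skipping anything that looks like an email address."""
    urls = []
    for token in text.split():
        token = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if "#" in token or "@" in token:     # email addresses in the sample use '#'
            continue
        if URL_RE.fullmatch(token):
            urls.append(token)
    return urls


sample = (
    "two goals from Marcus rash#icloud.co.kr Rashford, impressive www.united.com "
    "performances, https://www.bbc.com/sport/football/64698988 with the Portuguese. "
    "https://linuxpip.org United's men were www.gggggg.ac.us without "
    "report#abcdefcaf.net Casemiro, control github.com in the middle."
)
print(extract_urls(sample))
```

Filtering per token avoids having to encode "not an email" inside the regular expression itself, at the cost of assuming that URLs never contain whitespace.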

Replace the text of a specific cell given certain circumstances

I am creating a work schedule for the company I work at. There are four different jobs at the company and therefore 4 separate tabs for schedules.
I have a tab specifically for when an employee calls out sick or requests time off. When the user enters the employee's name, the date, and sick/request off on that tab, I would like the work schedule that the employee belongs to (Job1, Job2, Job3, or Job4) to update automatically.
Example:
This is John Doe's work schedule for Job 1 (and therefore located on the Job1 tab):
John Doe calls out sick on Friday, 01/18/19. The supervisor fills out the following on the Time Off Reqs/Sick tab:
Given that the user inputs the above data in the Time Off Reqs/Sick tab, I would like John Doe's schedule in the Job1 tab to change automatically to the following:
John Doe
Here is the link to my dummy data
Any help is greatly appreciated!
I was able to get what you were looking for. See the sheet (and make a copy to modify) HERE.
I used the formula below in each cell of the "Job1" sheet (for this example, I only did it for Column F).
=IFERROR(IF(AND(D2<>"SATURDAY",D2<>"SUNDAY",ISNA(QUERY(Requests!A$2:C,"Select C where A='"&$F$1&"' and B = date '"&TEXT(DATEVALUE(C2),"yyyy-mm-dd")&"'",0)))=TRUE,"Work",QUERY(Requests!A$2:C,"Select C where A='"&$F$1&"' and B = date '"&TEXT(DATEVALUE(C2),"yyyy-mm-dd")&"'",0)),"Off")
You could probably also use multiple INDEX/MATCH statements to get what you're looking for. If you restructure the data, you may be able to use ARRAYFORMULA to reduce the number of formulas you need to use.
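The decision logic of the formula can be restated outside the spreadsheet, which may make it easier to check. The Python sketch below is only an illustration: `schedule_cell` and the `requests` dictionary are hypothetical, and it assumes that column C of the Requests sheet holds the status text (for example "Sick").

```python
from datetime import date


def schedule_cell(employee, day, requests):
    """Mirror the formula: a logged request wins, weekends are "Off", otherwise "Work".

    requests maps (employee, date) -> status text, standing in for the QUERY lookup.
    """
    status = requests.get((employee, day))
    if status is not None:
        return status        # QUERY found a row for this employee and date
    if day.weekday() >= 5:   # Saturday (5) or Sunday (6): the IFERROR fallback
        return "Off"
    return "Work"


requests = {("John Doe", date(2019, 1, 18)): "Sick"}
print(schedule_cell("John Doe", date(2019, 1, 18), requests))  # Friday with a request
print(schedule_cell("John Doe", date(2019, 1, 19), requests))  # Saturday, no request
print(schedule_cell("John Doe", date(2019, 1, 21), requests))  # Monday, no request
```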

Extract substring starting with the comma, moving right, until you hit a space

I have a series of addresses in one column. I am trying to extract each component (Street Address, City, State, and Zip Code) into separate columns.
I was able to extract the zip codes rather easily with `=RIGHT(A1, 5)`. However, I am having a hard time extracting the city. All of the rows follow the same format below. My idea is to find the comma and extract the substring from right to left until hitting a space. How do I do this?
Here is an example of what the data looks like:
2209 Fake Street Arlington, TX 76015
3100 Fake Street Bedford, TX 76021
3558 Fake Street Flower Mound, TX 75028
4230 Fake Street Fort Worth, TX 76119
2662 Fake Street Bedford, TX 76021
That will only work for cities whose names are a single word. Looking for the street type (road, street, etc.) to mark the start of the city name won't work either when there is no street type. If your layout has no unique separator between street and city, you'll probably need a zip code lookup table to get the city.
In addition, you will need code to resolve cases where several city names share the same zip code. For example, in Texas, 76119 could refer to FORT WORTH, FOREST HILL, or FT WORTH. You may also need code to handle misspellings.
It might be that these are few enough to allow manual correction.
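The lookup-table idea could look roughly like the Python sketch below. The table `ZIP_TO_CITIES` is a hypothetical stand-in for a real zip code database and contains only the sample rows from the question. The function `split_address` is likewise an assumption, not an existing library call.

```python
import re

# Hypothetical lookup table: zip code -> city names that may appear for it.
ZIP_TO_CITIES = {
    "76015": ["Arlington"],
    "76021": ["Bedford"],
    "75028": ["Flower Mound"],
    "76119": ["Fort Worth", "Forest Hill", "Ft Worth"],
}

# Everything before the comma, then a two-letter state and a five-digit zip.
ADDRESS_RE = re.compile(r"^(?P<head>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$")


def split_address(address):
    """Split 'street city, ST zip' using the city names known for that zip."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        return None
    head = match.group("head")
    candidates = ZIP_TO_CITIES.get(match.group("zip"), [])
    # Try longer names first so multi-word cities are not cut short.
    for city in sorted(candidates, key=len, reverse=True):
        if head.lower().endswith(" " + city.lower()):
            return {
                "street": head[: -len(city)].rstrip(),
                "city": city,
                "state": match.group("state"),
                "zip": match.group("zip"),
            }
    return None  # unknown zip or misspelled city: leave for manual correction


print(split_address("3558 Fake Street Flower Mound, TX 75028"))
```

Rows that return `None` are the ones that would need the manual correction mentioned above.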

Getting stocks by industry via Yahoo Finance

I want to list all available industries (like http://biz.yahoo.com/p/ ) and show all corresponding stocks.
Until now I'm using YAHOO.Finance.SymbolSuggest.ssCallback for the symbol suggestion and http://finance.yahoo.com/d/quotes.csv?s=... for getting the stock's data.
Does anyone have any idea how to get all industries and corresponding stocks?
Is there another hidden Yahoo API?
The two main industry classifications are GICS, used by Standard and Poor's (and therefore by the S&P 500), and ICB, used by Dow Jones and FTSE and hence by NASDAQ, NYSE, and other markets.
Yahoo seems to use a third industry classification, by Morningstar, but since I'm not quite sure, I will cover each way of retrieving the data.
Morningstar
I don't know if Yahoo really sticks to this classification, but some names are really close, so let's look at it:
Go to their Index Data page, click on each sector, and then click View complete index holdings at the bottom.
It's not as precise as Yahoo's industry list, but it's all you can do with Morningstar. Not very convincing, I know...
GICS Sectors
GICS is now a trademark of Standard and Poor's, so the data has to be found on S&P's website.
Short answer: take a look at this page, you will need to be registered (it's free and easy) and you can download spreadsheets (xls) with stocks and corresponding sectors. Nevertheless, things aren't always easy, and you will have to do a bit of a search to retrieve all stocks with their corresponding industries. For example, the file INDICATED_RATE_CHANGE.xls will give you some companies and their sectors in each month of 2012. Using that and SP500_DividendAristocrats_2012.xls you should be able to retrieve at least a large part of S&P 500 companies.
ICB
ICB is used by NYSE, NASDAQ, etc., so it's a lot simpler than S&P and Morningstar. Here is your answer. BOOM! Direct link!
Link is dead :(
Finally
I strongly advise you to use the simplest and most widely used industry classification: the ICB. It will always be available and publicly displayed, since millions of investors rely on it every day, without having to use S&P financial services or Morningstar brokerage services...
EDIT
You can look at nasdaq.com to retrieve all companies and their corresponding sectors: here for NASDAQ and here for NYSE
Get all industry-IDs from here:
http://biz.yahoo.com/ic/ind_index.html
(look at the links)
Then use YQL ( https://developer.yahoo.com/yql/console/ )
with a query like this:
select * from yahoo.finance.industry where id=912

Web service or mechanism to detect Person, Place or an Object

Is there a web service or a tool to detect whether a given text is the name of a person, a place, or an object (device)?
e.g.:
Input: Bill Clinton Output: Person
Input: Blackberry Output: Device
Input: New York Output: Place
Accuracy can be low. I have looked at OpenCyc but I couldn't get it to work. Is there a way I can use Wikipedia for this?
For a start, telling a person apart from a thing would be great.
I think Wikipedia would be a very good source. Given the input, you could try to find an entry in Wikipedia and scrape the resulting page (if it exists).
Persons and places should have fairly distinct sets of data in the article (birthdates, locations, etc.) that you could use to tell them apart, and anything else is an object.
It's worth a shot anyway.
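As a rough illustration of that heuristic, the Python sketch below looks for person-like and place-like cues in article text that has already been fetched. The cue words and the function `classify_article` are assumptions chosen for this example, and the snippets are made-up stand-ins for scraped pages.

```python
import re

# Assumed cue words; a real article would need a richer set.
PERSON_CUES = re.compile(r"\b(born|died|birth date)\b", re.IGNORECASE)
PLACE_CUES = re.compile(r"\b(coordinates|population|located in)\b", re.IGNORECASE)


def classify_article(article_text):
    """Guess person/place/object from article text; None means no article was found."""
    if article_text is None:
        return "unknown"
    if PERSON_CUES.search(article_text):
        return "person"
    if PLACE_CUES.search(article_text):
        return "place"
    return "object"  # anything without person or place cues


# Made-up snippets standing in for scraped pages.
print(classify_article("An American politician born in 1946."))
print(classify_article("A city located in the northeastern United States."))
print(classify_article("A line of smartphones and tablets."))
```

Checking the person cues first is an arbitrary design choice here. Biographies often mention places, so the order matters.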
Looking at the output of Wolfram Alpha, it seems that you can possibly identify a person by searching Bill Clinton Birthday or just Bill Clinton, or you can identify a location by searching New York GPS coordinates or just New York, for even better results. Blackberry seems like a tough word for Alpha, because it keeps wanting to interpret it as a fruit. You might have luck searching Froogle to identify a device.
It seems like WA will give you a fairly decent accuracy, at least if you're using famous people/places.
How about using a search engine? Google would be good, and I think Yahoo! has tools for building your own search.
I googled:
Results 1 - 10 of about 27,100,000 for "bill clinton" person
Results 1 - 10 of about 6,050,000 for "bill clinton" place
Results 1 - 10 of about 601,000 for "bill clinton" device
He's a person!
Results 1 - 10 of about 391,000,000 for "new york" place.
Results 1 - 10 of about 280,000,000 for "new york" person.
Results 1 - 10 of about 84,100,000 for "new york" device.
It's a place!
Results 1 - 10 of about 11,000,000 for "blackberry" person
Results 1 - 10 of about 36,600,000 for "blackberry" place
Results 1 - 10 of about 28,000,000 for "blackberry" device
Unfortunately, blackberry is a place as well. :-/
Note that only in the case of 'blackberry' did "device" even get close. Maybe you need to weight the page hit values. What is your application? Do you have any idea which "devices" you'd have to classify? What is the possible range of inputs?
Maybe you want to combine the results you get from different sources.
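To make the weighting idea concrete, here is a small Python sketch that uses only the hit counts quoted above. Dividing each count by the label's total across the three queries is one possible weighting, chosen here as an assumption. The names `HITS`, `classify_raw`, and `classify_weighted` are hypothetical.

```python
# Hit counts copied from the search results quoted above.
HITS = {
    "bill clinton": {"person": 27_100_000, "place": 6_050_000, "device": 601_000},
    "new york": {"person": 280_000_000, "place": 391_000_000, "device": 84_100_000},
    "blackberry": {"person": 11_000_000, "place": 36_600_000, "device": 28_000_000},
}


def classify_raw(term, table=HITS):
    """Pick the label with the most hits for this term."""
    counts = table[term]
    return max(counts, key=counts.get)


def classify_weighted(term, table=HITS):
    """Divide each count by the label's total hits so rarer labels are not drowned out."""
    labels = table[term].keys()
    totals = {label: sum(row[label] for row in table.values()) for label in labels}
    scores = {label: table[term][label] / totals[label] for label in labels}
    return max(scores, key=scores.get)


for term in HITS:
    print(term, classify_raw(term), classify_weighted(term))
```

With these numbers the raw counts label "blackberry" as a place, while the weighted scores label it as a device and leave the other two answers unchanged.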
I think the basic task you're trying to accomplish is more formally known as named entity recognition. This task is nontrivial, and by only inputting the name stripped of any context, you're making it even harder.
For example, we'd like to think examples such as "Bill Clinton" and "New York" are obviously unambiguous, but their disambiguation pages in Wikipedia show that there are several potential entities they may refer to. "New York" is a state, a city, and a movie title. "Bill Clinton" is a bit less ambiguous if you're only looking at Wikipedia, but I'm sure you'll find dozens of Bill Clintons in any phonebook. It might also be the name of someone's sailboat or pet dog. What if someone inputs "Washington"? That could be a U.S. President, state, district, city, lake, street, island, movie, one of several U.S. Navy ships, or a bridge, among other things. Determining which is the "correct" usage you'd want the web service to return could become very complicated.
As much as Cyc knows, I think you'll find it's still not as comprehensive as Wikipedia. However, the main downside to Wikipedia is that it's essentially unstructured. Personally, I find Cyc's API so convoluted and poorly documented that parsing Wikipedia's natural language almost seems easier.
If I had to implement such a web service from scratch, I'd start by downloading a snapshot of Wikipedia, and then write a parser that reads through all the articles and generates a named entity index based on article titles. You could manually "classify" a few dozen examples as person/place/object, and train a classifier (Bayesian, Maxent, SVM) to automatically classify other examples based on the word frequencies of their articles.
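A minimal sketch of the Bayesian option, assuming a multinomial naive Bayes model with add-one smoothing over word counts, might look like this in Python. The training snippets are invented stand-ins for hand-labelled article text, and `train` and `classify` are hypothetical helper names.

```python
import math
import re
from collections import Counter, defaultdict

# Invented stand-ins for hand-labelled article text.
TRAINING = [
    ("person", "born in 1946 american politician served as president married"),
    ("person", "born in 1955 american businessman founder married children"),
    ("place", "city located in the state population area river"),
    ("place", "state in the northeastern region population capital city"),
    ("object", "smartphone device manufactured released operating system keyboard"),
    ("object", "device released battery screen software manufactured"),
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior; unseen words are ignored."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    words = [w for w in tokenize(text) if w in vocab]
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(TRAINING)
print(classify(MODEL, "he was born in 1961 and served as a politician"))
```

With real Wikipedia articles the training texts would be much longer, and stop words would probably need to be removed or down-weighted.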