How to scrape a particular part of an HTML page using a regular expression or HtmlAgilityPack in Visual Studio 2010 using C#? - regex

I have
var source="<p><a href="http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html"><img src="http://l.yimg.com/bt/api/res/1.2/TRLtYhdbTvFcX_GOU_0S4g--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2012-04-14T023232Z_5_CBRE83B1MAL00_RTROPTP_2_USA.JPG" width="130" height="86" alt="People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York" align="left" title="People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York" border="0" /></a>(Reuters) - An unusual stock split designed to preserve Google Inc founders' control of the Web search leader raised questions and some grumbling on Wall Street, even as investors focused on the company's short-term business concerns. Shares of Google closed 4 percent lower at $624.60 on Friday, driven by deepening worries about its search ad rates and payments to partners. The declining search trends underscored investor uncertainty about Google's growth prospects and unease about the company's pending $12.5 billion acquisition of Motorola Mobility. ...</p><br clear="all"/>"
Now I need to parse/scrape this to get the link address (i.e. http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html) into one variable and the image src into a separate variable. I also need the description text between </a> and </p>. Kindly help, I am badly stuck.

Try the code snippet below, which uses HtmlAgilityPack:
using HtmlAgilityPack;

// Verbatim string literal (@"..."): embedded double quotes are written as "".
var source = @"<p><a href=""http://in.news.yahoo.com/googles-stock-split-raises-questions-023232813.html""><img src=""http://l.yimg.com/bt/api/res/1.2/TRLtYhdbTvFcX_GOU_0S4g--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2012-04-14T023232Z_5_CBRE83B1MAL00_RTROPTP_2_USA.JPG"" width=""130"" height=""86"" alt=""People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York"" align=""left"" title=""People visit Google's stand at the National Retail Federation Annual Convention and Expo in New York"" border=""0"" /></a>(Reuters) - An unusual stock split designed to preserve Google Inc founders' control of the Web search leader raised questions and some grumbling on Wall Street, even as investors focused on the company's short-term business concerns. Shares of Google closed 4 percent lower at $624.60 on Friday, driven by deepening worries about its search ad rates and payments to partners. The declining search trends underscored investor uncertainty about Google's growth prospects and unease about the company's pending $12.5 billion acquisition of Motorola Mobility. ...</p><br clear=""all""/>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
// Description: the text content of the first <p> element.
var paraNode = doc.DocumentNode.SelectSingleNode("//p");
var desc = paraNode.InnerText;
// Link: the href of the anchor inside the paragraph.
// SelectSingleNode returns null when nothing matches, so check for null on real input.
var anchorNode = doc.DocumentNode.SelectSingleNode("//p/a");
var link = anchorNode.GetAttributeValue("href", null);
// Image: the src of the <img> nested inside that anchor.
var imgNode = doc.DocumentNode.SelectSingleNode("//p/a/img");
var src = imgNode.GetAttributeValue("src", null);
There are many ways to do this; this is just one approach, meant to give you an idea of how to do it with HtmlAgilityPack. XPath gives you a lot of power when parsing markup like this.
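For readers outside .NET, the same three extractions can be sketched with Python's standard-library html.parser. This is only an illustration under assumptions: the ArticleParser class and the example.com URLs are hypothetical, and the markup is assumed to have the shape <p><a href=...><img src=... /></a>text</p> that the XPath expressions //p/a and //p/a/img imply.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the first link, the first image source and the text inside <p>."""

    def __init__(self):
        super().__init__()
        self.link = None
        self.image_src = None
        self._in_paragraph = False
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "p":
            self._in_paragraph = True
        elif tag == "a" and self._in_paragraph and self.link is None:
            self.link = attributes.get("href")
        elif tag == "img" and self._in_paragraph and self.image_src is None:
            # Self-closing <img ... /> tags also arrive here by default.
            self.image_src = attributes.get("src")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._text_parts.append(data)

    @property
    def description(self):
        return "".join(self._text_parts).strip()


# Hypothetical input with the assumed structure.
html = (
    '<p><a href="http://example.com/story.html">'
    '<img src="http://example.com/pic.jpg" alt="x" /></a>'
    '(Reuters) - Some text ...</p><br clear="all"/>'
)
parser = ArticleParser()
parser.feed(html)
parser.close()
print(parser.link, parser.image_src, parser.description, sep="\n")
```

Unlike the XPath version, this walks the document as a stream of events, so it only suits simple, predictable markup like the snippet in the question.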

Related

Regular expression in SSIS

I have this data in a column:
<b>Dummy Alerts: </b>3/3Alerts have been addressed
Question Alert: Have you had problems or are your volumes lower than normal? " +
"Yes Alert is closed on 01/09/2018 at 01:08 PM
Question Alert: Have you been drinking more fluid? " +
" Yes Alert is closed on 10/09/2019 at 01:08 PM
Ram support visit performed 10/9/17, Weight 90.2kg (dry). " +
"TW achieved. No peripheral edema. BP within routine range per patient history. Urine output 1050ml. No PO fluid restriction at this time. " +
"Patient did forget to bring in flow sheets. Monitor UF trend with flow sheet review in one week. Michelle Mayhew Smith, RN."
and other similar types of records.
I want to select only the following text and insert it into a new column:
Ram support visit performed 10/9/17, Weight 90.2kg (dry).
TW achieved. No peripheral edema. BP within routine range per patient history. Urine output 1050ml. No PO fluid restriction at this time.
Patient did forget to bring in flow sheets.Monitor UF trend with flow sheet review in one week. Michelle Mayhew Smith, RN.
I would like to use a regular expression in SSIS.
SSIS does not support regex out of the box.
You can use the RegexClean and Regular Expression transformations from Darren Green. Alternatively, you can do whatever is needed in your own custom Script transformation, as described in this sample employing C# and .NET.
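As a rough illustration of the regex logic such a Script transformation might apply, here is a sketch in Python (an actual SSIS script would be C# or VB.NET). Everything in it is an assumption: the two noise patterns are inferred only from the sample rows in the question, the sample text is a cleaned-up excerpt of them, and extract_note is a hypothetical helper name.

```python
import re

# Assumed noise pattern 1: the bold "Dummy Alerts" header with its n/n counter.
DUMMY_HEADER = re.compile(
    r"<b>Dummy Alerts:\s*</b>\s*\d+/\d+\s*Alerts have been addressed"
)
# Assumed noise pattern 2: a question alert, up to and including its closing timestamp.
QUESTION_ALERT = re.compile(
    r"Question Alert:.*?Alert is closed on "
    r"\d{1,2}/\d{1,2}/\d{4} at \d{1,2}:\d{2} [AP]M",
    re.DOTALL,
)


def extract_note(text: str) -> str:
    """Remove the alert boilerplate and return the remaining free-text note."""
    text = DUMMY_HEADER.sub("", text)
    text = QUESTION_ALERT.sub("", text)
    # Collapse the whitespace left behind by the removed parts.
    return " ".join(text.split())


sample = (
    "<b>Dummy Alerts: </b>3/3Alerts have been addressed\n"
    "Question Alert: Have you been drinking more fluid? "
    "Yes Alert is closed on 10/09/2019 at 01:08 PM\n"
    "Ram support visit performed 10/9/17, Weight 90.2kg (dry). TW achieved."
)
note = extract_note(sample)
print(note)
```

Removing the known boilerplate, rather than trying to match the free-text note itself, is a design choice here: the note has no fixed shape, while the alert lines do.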

Regex: Recognizing and removing lists and menus

Before I write my own method, I am curious whether there is a regex that can help me.
The Context
I am cleaning raw text prior to running statistical analyses on the terms. The text is from websites and thus includes menus (many menus from many websites).
A typical list/menu appears as follows (Except with one line break between items):
STUDENT SERVICES
Guidance & Support
Core Services
Admissions & Records
Financial Aid
Counseling
Assessment Testing
Kickstart Orientation
Tutoring
Career & Transfer Center
Student Welcome Center
The Task at Hand
I want to remove all lists.
I need to remove text blocks where there is a line break after every first, second, third, or fourth word, but only if this pattern repeats 3 or more times consecutively (I don't want to remove single short sentences such as "Students always succeed.").
Can a regex identify this pattern?
NOTE: I am working in java.
UPDATE with sample text
[[[I WANT TO REMOVE THIS LIST]]]
Offices & Services
Student Services
Activities & Athletics
Records & Registration
Costs & Financial Aid
Compliance & Diversity
Alumni
Faculty/Staff Resources
BMCC Foundation
Human Resources
BMCC Homepage>Academics>Health Education>Course Listings
[[[I WANT TO REMOVE THIS LIST]]]
Health Education Home
Course Listings
Faculty
[[[I WANT TO REMOVE THIS LIST]]]
Community Health Education
Gerontology
School Health Education
Public Health
Visit Admissions
Course Listings
[[[I WANT TO KEEP TEXT BELOW]]]
The following courses are offered by the Department of Health Education.
2CRS., 2HRS, 0 LAB HRS.
HED 100
Health Education
This is an introductory survey course to health education. The course provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. The primary areas of instruction include: health and wellness; stress; human sexuality; alcohol, tobacco and substance abuse; nutrition and weight management; and physical fitness. Students who have completed HED 110 - Comprehensive Health Education will not receive credit for this course.
3CRS., 3HRS, 0 LAB HRS.
HED 110
Comprehensive Health Education
This course in health educations offers a comprehensive approach that provides students with the knowledge, skills, and behavioral models to enhance their physical, emotional, social, intellectual and spiritual health as well as facilitate their health decision-making ability. Areas of specialization include: alcohol, tobacco and abused substances, mental and emotional health, human sexuality and family living, nutrition, physical fitness, cardiovascular health, environmental health and health care delivery. HED 110 fulfills all degree requirements for HE 100. Students who have completed HED 100 - Health Education will not receive credit for this course.
Assuming the part about the number of words is not important, try a regex pattern of (([A-Za-z& ])*(\r\n|\n|\r)){5,}, example here. The \r\n alternative is listed first so that a Windows line ending counts as one line break rather than two.
Change the quantifier of five as needed; it is just an example. With five, the pattern would not match two lines followed by an extra newline, or a three-line list without a trailing newline.
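If the word count does matter, a variant along the following lines might work. It is a sketch in Python rather than Java, under these assumptions: a "word" is any run of non-whitespace characters, lines end in \n, and a list is three or more consecutive lines of one to four words each. The names SHORT_LINE, LIST_BLOCK and strip_lists are hypothetical.

```python
import re

# A "short line": one to four whitespace-separated tokens, no newline inside.
SHORT_LINE = r"[^\S\n]*\S+(?:[^\S\n]+\S+){0,3}[^\S\n]*"
# Three or more short lines in a row, each ended by a newline or the end of the text.
LIST_BLOCK = re.compile(rf"(?:^{SHORT_LINE}(?:\n|\Z)){{3,}}", re.MULTILINE)


def strip_lists(text: str) -> str:
    """Delete every run of three or more consecutive short lines."""
    return LIST_BLOCK.sub("", text)


sample = (
    "Intro sentence that is clearly longer than four words.\n"
    "Student Services\n"
    "Activities & Athletics\n"
    "Alumni\n"
    "The following courses are offered by the Department.\n"
    "Students always succeed.\n"
)
cleaned = strip_lists(sample)
print(cleaned)
```

The single short sentence at the end survives because it is a run of one; only the three-item menu is removed.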

Getting stocks by industry via Yahoo Finance

I want to list all available industries (like http://biz.yahoo.com/p/) and show all corresponding stocks.
So far I've been using YAHOO.Finance.SymbolSuggest.ssCallback for the symbol suggestion and http://finance.yahoo.com/d/quotes.csv?s=... for getting the stock's data.
Does anyone have any idea how to get all industries and corresponding stocks?
Is there another hidden Yahoo API?
The lists of available industries are called GICS Sectors for Standard and Poor's (the S&P 500 uses that) and ICB for Dow Jones and FTSE, which is why ICB is used by NASDAQ, NYSE and other markets.
It seems that Yahoo uses a third industry classification, by Morningstar, but since I'm not quite sure, I will go through the ways of retrieving the data for each.
Morningstar
I don't know whether Yahoo really sticks to this classification, but some names were really close, so let's look at it:
Go to their Index Data, click on each sector, and then click View complete index holdings at the bottom.
It's not as precise as Yahoo's industry list, but it's all you can do with Morningstar. Not very convincing, I know...
GICS Sectors
GICS Sectors are now a trademark of Standard and Poor's, so the data has to be sought on S&P's website.
Short answer: take a look at this page. You will need to register (it's free and easy), and then you can download spreadsheets (xls) with stocks and their corresponding sectors. Things aren't always that easy, though, and you will have to do a bit of searching to retrieve all stocks with their corresponding industries. For example, the file INDICATED_RATE_CHANGE.xls will give you some companies and their sectors for each month of 2012. Using that and SP500_DividendAristocrats_2012.xls, you should be able to retrieve at least a large part of the S&P 500 companies.
ICB
ICB is used by NYSE, NASDAQ, etc., and it's a lot simpler than S&P and Morningstar. Here is your answer. BOOM! Direct link!
Link is dead :(
Finally
I strongly advise you to use the simplest and most widely used industry classification: the ICB. It will always be available and publicly displayed, since millions of investors rely on it every day, without having to use S&P financial services or Morningstar brokerage services...
EDIT
You can look at nasdaq.com to retrieve all companies and their corresponding sector: here for Nasdaq and here for Nyse
Get all industry-IDs from here:
http://biz.yahoo.com/ic/ind_index.html
(look at the links)
Then use YQL ( https://developer.yahoo.com/yql/console/ )
with a query like this:
select * from yahoo.finance.industry where id=912

Programmatic access to detailed historical financial data [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 3 years ago.
I know that Yahoo has a great API for accessing detailed financial metrics about a company documented at http://www.gummy-stuff.org/Yahoo-data.htm. Yahoo also provides historical pricing data, documented at http://code.google.com/p/yahoo-finance-managed/wiki/csvHistQuotesDownload.
However, I'm trying to find a place where I can programmatically access detailed historical data, like what a company's earnings were 10 years ago, and not just the price of the stock. Does anyone know of such a site? I'm willing to pay, and I think http://www.mergent.com/servius offers this, but they seem very, very expensive. A single standardized financial report from a company costs 50 units, which is $2.50 under their pay-as-you-go plan.
Google seems to have pretty good historical financial data that appears to go back 5 years. I may try scraping them, but I'd like to go back much, much further. Any ideas?
Quandl provides a huge number of different databases with all sorts of data, not only EOD prices but also, for example, earnings per share and a lot of other things such as US employment data.
The API is easy to use and well documented. Quandl also provides an Excel plug-in, a Matlab plug-in, a Python package, and an R package, and a number of other languages are supported through community-maintained libraries.
Not all the data is free, though. For the more advanced databases there is a subscription fee. I think the price depends on the database and the number of potential users.
Check out this page: ADVFN Financial Data Scraper. You can download a spreadsheet with a built-in macro that scrapes up to 22 years of financial earnings data for any publicly traded company that ADVFN posts historic data for. Just keep in mind that it's not a quick process: for the 3000-odd companies pre-listed in the spreadsheet, the macro will need to run for a couple of days (obviously you can download less if you like). But you'll end up with over 8 million data values, saved locally in a spreadsheet for quick and easy analysis.
ADVFN posts up to 307 rows of data per company per year and this spreadsheet can capture them all, yielding a very comprehensive data base of historical financial data.
Wolfram Alpha has the data you desire
Examples:
http://www.wolframalpha.com/input/?i=msft+earnings
http://www.wolframalpha.com/input/?i=aapl+earnings+2001
http://www.wolframalpha.com/input/?i=ko+price+1983
I have not used it, but I see they provide a free API with an option to upgrade if you exceed their monthly limits.
Intrinio provides income statements, balance sheets, and statement of cashflows going back 10 years, in addition to stock prices and valuation ratios, via API. You can programmatically query the API to pull the data into your app.
Some examples:
https://api.intrinio.com/financials/standardized?identifier=YUM&statement=income_statement&fiscal_period=Q2&fiscal_year=2015
This grabs YUM's income statement from Q2, 2015.
https://api.intrinio.com/companies?latest_filing_date=2017-03-06
That shows all companies with a new filing date on or after 2017-03-06, which is useful for determining which fundamentals need to be updated.
https://api.intrinio.com/data_point?ticker=AAPL,MSFT&item=pricetoearnings
That pulls the current price to earnings ratio for Apple and Microsoft. You could swap in last_price as the item to get the current stock price.
https://api.intrinio.com/data_point?ticker=$FEDFUNDS&item=level
This call returns the current federal funds interest rate from the federal reserve.
https://api.intrinio.com/prices?ticker=AAPL
That returns the price history for AAPL.
Intrinio gives away 500 daily API calls to any developer.
It depends on what you want. If, say, you are looking for FX historical data, you can take a look at the Dukascopy historical data feed (http://www.dukascopy.com/swiss/english/data_feed/historical/).
It is possible to write some scripts to download data into your app.
You can get what you want from financialmodelingprep; they have quarterly income statements, balance sheets and cash flow statements. I include some sample code so you can see how I fetched the data with jQuery.
They also offer historical quotes, according to their documentation.
Fiddle: https://jsfiddle.net/7g238qrp/
$(document).ready(function() {
    var url = "https://financialmodelingprep.com/api/financials/income-statement/AAPL?period=quarter";
    $.ajax({
        url: url,
        type: "GET",
        crossDomain: true,
        success: function (response) {
            let resp = response;
            resp = resp.substring(5);
            resp = resp.substring(0, resp.length - 5);
            // if you want to convert to JSON
            //resp = JSON.parse(resp)
            //console.log(resp);
            $('#JonContent').text(resp);
        },
        error: function (xhr, status) {
            alert("error");
        }
    });
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<pre>
<div id="JonContent"></div>
</pre>

Yahoo! Finance API DOW [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
Until now, I've been using the INDU ticker to follow the DOW with the Yahoo! API. For whatever reason you were unable to directly follow ^dji ^djia or any other reasonable combination. Up until yesterday, INDU was working fine. However now I receive no data when requesting indu.
What other ticker can I use with the Yahoo! finance API that will return the DJIA?
This index is not available under any other name.
However, this problem was just a temporary glitch, now resolved by Yahoo. Unfortunately, their financial data availability is very flaky lately. E.g. data available on the web page, but CSV downloads give "N/A" for all fields, etc. There were similar incidents in recent months, with stock prices for random stocks given wrong values, and more.
So, if you're building a new service around these Yahoo services, be aware that:
These services are not reliable.
You're breaking Yahoo's ToS, so there's nothing you can do if the services are broken or not working; you cannot even complain to Yahoo in good faith.
According to Yahoo (post by Yahoo Developer Network Community Manager Robyn Tippins on Yahoo developer forums):
The reason for the lack of documentation is that we don't have a Finance API. It appears some have reverse engineered an API that they use to pull Finance data, but they are breaking our Terms of Service (no redistribution of Finance data) in doing this so I would encourage you to avoid using these webservices.
The formula for the DJIA isn't very complicated. If you are still able to pull quotes for individual stocks, you could use your code to pull the prices of the current 30 components of the DJIA, add them up, and divide by the current divisor. Of course, this has several disadvantages:
You need to make 30 requests instead of one.
You will have to adjust the divisor if there is a stock split.
You will have to change the queries when the components change.
The components of the DJIA are
AA AXP BA BAC CAT CSCO CVX DD DIS GE HD
HPQ IBM INTC JNJ JPM KFT KO MCD MMM MRK
MSFT PFE PG T TRV UTX VZ WMT XOM
The current divisor is 0.132129493.
The divisor changes whenever there is a stock split in one of the components. The components of the DOW changed 48 times from 1896 to 2009.
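The calculation described above (add up the component prices, divide by the divisor) can be sketched in a few lines of Python. The function name price_weighted_index and the three tickers and prices below are made up for illustration; a real run would pass all 30 component prices and whatever divisor is current at the time.

```python
from typing import Mapping


def price_weighted_index(prices: Mapping[str, float], divisor: float) -> float:
    """Sum the component prices and divide by the index divisor."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return sum(prices.values()) / divisor


# Hypothetical prices for three imaginary components, not real quotes.
example_prices = {"AAA": 10.0, "BBB": 20.0, "CCC": 30.0}
index_value = price_weighted_index(example_prices, divisor=0.5)
print(index_value)
```

Because the divisor is a plain argument, adjusting it after a stock split or a component change means changing one number and the ticker list, not the formula.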
It seems like Yahoo Finance does not support the web service to query ^DJI or INDU.
Check out this discussion:
http://developer.yahoo.com/forum/General-Discussion-at-YDN/Dow-Jones-Industrial-Average-Quote-Error/1317052217631-f9173931-04fd-4519-b1b3-efb65d7ff8fa/1317065435082
Assuming that your application does not need to be real time market data (to the second), you can use the RAW data that is provided to build the interactive graph on yahoo. This data is comma separated and updates about once every minute. The downside: it will include all the data from the trading day. The time given is in Unix time so a conversion would be needed. I tried this out for the ticker symbols you listed and the only one I was able to get data with was ^dji. Hopefully this is what you are looking for!
You can play with the link and see what happens to the data. For example, you can change the number of days.
http://chartapi.finance.yahoo.com/instrument/1.0/%5Edji/chartdata;type=quote;range=1d/csv/
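The Unix-time conversion mentioned above could look roughly like this in Python. The exact layout of the feed is an assumption here: the sketch supposes that each data row starts with a Unix timestamp in seconds and that any line whose first field is not a plain integer is header or metadata. The function name parse_chart_csv and the sample row are made up.

```python
import csv
import io
from datetime import datetime, timezone


def parse_chart_csv(csv_text: str):
    """Return (UTC datetime, remaining fields) for every row that starts with a Unix timestamp."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        # Assumed: non-numeric first fields mark header or metadata lines.
        if not record or not record[0].isdigit():
            continue
        moment = datetime.fromtimestamp(int(record[0]), tz=timezone.utc)
        rows.append((moment, record[1:]))
    return rows


# Made-up sample, not real feed output.
sample = "Timestamp,close\n1334361600,12345.67\n"
for moment, fields in parse_chart_csv(sample):
    print(moment.isoformat(), fields)
```

The timestamps are converted to UTC; shifting them to exchange time would be a separate step.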
I think Yahoo Finance All Currencies quote API Documentation will help you.
I found a Yahoo forum answer that says we cannot download CSV data for ^DJI.
Check also YQL console. This console will fetch values in JSON format.
The DIA ticker (SPDR Dow Jones Industrial Average) closely imitates the Dow.