I am trying to parse the Yahoo Answers feed: http://answers.yahoo.com/rss/allq
The issue is that every title starts with
[ Category ] : Open Question :
which I do not want. I want to write a regexp to remove this.
Anything that removes everything from the opening [ up to the first : should do it.
There is also a space after the :, and we need to remove that too.
Thanks in advance. I will also try to find a solution myself.
Have you considered using Yahoo's YQL service to parse this feed (or other web pages)?
Querying html using Yahoo YQL
Yahoo! Query Language
YQL Console
They already have sample queries for you to get at Yahoo Answers data:
answers.getbycategory:
http://developer.yahoo.com/yql/console/#h=select%20*%20from%20answers.getbycategory%20where%20category_id%3D2115500137%20and%20type%3D%22resolved%22
answers.getbyuser:
http://developer.yahoo.com/yql/console/#h=select%20*%20from%20answers.getbyuser%20where%20user_id%3D%22YbaMGtHFaa%22
answers.getquestion:
http://developer.yahoo.com/yql/console/#h=select%20*%20from%20answers.getquestion%20where%20question_id%3D%2220090526102023AAkRbch%22
answers.search:
http://developer.yahoo.com/yql/console/#h=select%20*%20from%20answers.search%20where%20query%3D%22cars%22%20and%20category_id%3D2115500137%20and%20type%3D%22resolved%22
(Just an FYI in case you weren't aware of this convenient service. I use it instead of screen scraping with RegEx's.)
The following regex should do the job:
^\[.*?: 
Usage sample in C#:
string resultString = Regex.Replace(subjectString, @"^\[.*?: ", "");
It matches the opening [ at the start of the string, then as few characters as possible up to the first :, plus the following space.
Hope this helps,
Tom.
Thanks @cmptrgeekken for pointing the non-greedy thing out!
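For anyone not on .NET, here is a minimal Python sketch of the same replacement. The sample title is made up to match the feed's format, and strip_category is just a hypothetical helper name.

```python
import re

# An opening bracket at the start of the string, then as few characters
# as possible up to the first ": " (colon plus one space).
PREFIX = re.compile(r"^\[.*?: ")

def strip_category(title: str) -> str:
    """Remove the leading '[ Category ] : ' prefix, if present."""
    return PREFIX.sub("", title, count=1)

# Hypothetical title in the feed's format.
sample = "[ Cars ] : Open Question : How do I change a tyre?"
cleaned = strip_category(sample)
print(cleaned)
```

Titles that do not start with [ are returned unchanged, because the pattern is anchored with ^.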
I have a response body which contains
"<h3 class="panel-title">Welcome
First Last </h3>"
I want to fetch 'First Last' as the output.
The regular expressions I have tried are
"Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))"
"Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)"
But I am not able to get the result. If I remove the newline and take it as
"<h3 class="panel-title">Welcome First Last </h3>" it is matched in an online regex tester.
I suspect your problem is the line break between "Welcome" and the user name. Try the "single-line mode" flag (?s) in your regex, which lets . match line breaks as well:
(?s)Welcome(\s*([A-Za-z]+))(\s*([A-Za-z]+))
(?s)Welcome \s*([A-Za-z]+)\s*([A-Za-z]+)
(This works in JMeter and any other Java- or PHP-based regex engine, but not in JavaScript. In the comments on the question you say you're using both JavaScript and JMeter. If it is a JMeter question, this will help; if it is JavaScript, try one of the other answers.)
Well, usually I don't recommend regex for this kind of work; DOM manipulation is the better tool.
But you can use the following regex to extract the text:
/(?:<h3.*?>)([^<]+)(?:<\/h3>)/i
See demo at https://regex101.com/r/wA2sZ9/1
This captures everything inside the h3 element, that is "Welcome" plus the first and last names, including the extra whitespace. I'm sure you can easily deal with the spaces.
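As a rough illustration of dealing with the spaces, here is a small Python sketch. It assumes the response body looks like the one in the question, and it simply drops the first word ("Welcome") after splitting on whitespace.

```python
import re

# Response body as shown in the question, with a line break after "Welcome".
body = '<h3 class="panel-title">Welcome\n First Last </h3>'

# Same idea as the pattern above: capture everything between <h3 ...> and </h3>.
match = re.search(r"<h3.*?>([^<]+)</h3>", body, re.IGNORECASE)

inner = match.group(1)      # "Welcome", a line break, then the names
words = inner.split()       # split() collapses spaces and line breaks
name = " ".join(words[1:])  # drop the leading "Welcome"
print(name)
```

The negated class [^<] matches line breaks too, so no extra flag is needed for the multi-line body.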
In the JMeter Regular Expression Extractor you can use:
<h3 class="panel-title">Welcome(.*?)</h3>
Then take the value using $1$.
In the data you have shown, Welcome is followed by a line break. If that is actually part of the response, then you have to use \n:
<h3 class="panel-title">Welcome\n(.*?)</h3>
Otherwise the first pattern is enough.
First verify this in JMeter using the regular expression tester on the response body.
Welcome([\s\S]+?)<
Try this, it will definitely work.
Regular expressions are greedy by default. Try this:
Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)
Groups 1 and 2 contain your data
Check it here
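A quick Python check of that pattern against the body from the question, as a sketch. Since \s already matches line breaks, no flag is needed for the break after "Welcome".

```python
import re

# Response body as shown in the question, with a line break after "Welcome".
body = '<h3 class="panel-title">Welcome\n First Last </h3>'

pattern = re.compile(r"Welcome\s*([A-Za-z]+)\s*([A-Za-z]+)")
m = pattern.search(body)

first, last = m.group(1), m.group(2)
print(first, last)
```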
I want to crawl pages related to Disney on the Bloomberg website. The URLs follow this pattern:
"http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney"
So I have written the rule below for it:
rules = [
Rule(SgmlLinkExtractor(allow=('/news/*/disney*',)), follow=True),
]
But the above rule isn't working as I want, and the crawled pages in my output are not related to Disney. Please help me fix this rule.
/news/* matches /news followed by any number of /.
The correct regex would be:
/news/.*/disney
You likely need the following regex:
/news/[^/]+/disney.*
which escaped looks like
\/news\/[^\/]+\/disney.*
This way [^/]+ only matches up to the next /, rather than matching anything.
Example here
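To compare the patterns from this thread outside the crawler, you can run them with Python's re.search, which, as far as I know, is how the link extractor applies its allow patterns to URLs. This is only a sketch; the second and third URLs are made up for illustration.

```python
import re

urls = [
    "http://bloomberg.com/news/2013-07-08/disney-welcometohomepageofdisney",
    "http://bloomberg.com/news/2013-07-08/apple-earnings",             # hypothetical
    "http://bloomberg.com/news/2013-07-08/apple/more/disney-mention",  # hypothetical
]

patterns = {
    "original": r"/news/*/disney*",        # "/*" means zero or more slashes
    "dot_star": r"/news/.*/disney",        # ".*" may span several path segments
    "one_segment": r"/news/[^/]+/disney",  # exactly one segment between the slashes
}

# For each pattern, record whether it is found in each URL.
results = {
    name: [bool(re.search(p, u)) for u in urls]
    for name, p in patterns.items()
}

for name, hits in results.items():
    print(name, hits)
```

The glob-style original does not match even the intended URL, the .* version also accepts a disney segment deeper in the path, and the [^/]+ version accepts only the date-then-disney layout.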
For example, I have a feed with an item title like this
Some text is better than one text http://t.co/blablabla #hashtag
Then I want to get only the URL using a regex, like this
http://t.co/blablabla
How do I do that?
(Sorry, I used Google Translate to write this question.)
Thanks for any answers.
Here's a random one I found with a simple google search:
(http|ftp|https)://([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?
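If the titles always separate the link from the surrounding text with whitespace, a much simpler pattern may be enough. The sketch below swaps in https?://\S+ (not the pattern above) and uses a hypothetical helper name, first_url.

```python
import re

# Simplified assumption: a URL starts with http:// or https:// and
# runs until the next whitespace character.
URL = re.compile(r"https?://\S+")

def first_url(text: str):
    """Return the first URL found in text, or None if there is none."""
    m = URL.search(text)
    return m.group(0) if m else None

title = "Some text is better than one text http://t.co/blablabla #hashtag"
url = first_url(title)
print(url)
```

This would also swallow trailing punctuation such as a full stop directly after the link, so it is only suitable when the input looks like the example title.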
I have a string and I can find the following
Kbps
Duration
Mb
Song Title
Website
http://abmp3.com/
I can't seem to find the URL. I used Expresso to create the regex and used the source from the webpage to get matches, but for some reason when I add href="(.*.mp3)" to the end of the pattern it won't find anything. The Kbps, duration, and Mb are all on the same line. The song title is on a different line, and so is the URL.
My question is: how would you add href="(.*.mp3)" to the end of the regex string?
Regex Code
":6px;"">(.* Kbps)<br>(.*)<br> (.* Mb)</div></td>\D+\S+<strong>(.*) mp3"
Need to add this to the end
href="(.*.mp3)"
Thanks in advance!
Looking at the website, it appears this would work for you:
href=\".*\.mp3\"
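One thing to watch with .* here is greediness: if two links sit on the same line, a greedy match runs from the first href to the last .mp3. The Python sketch below uses a made-up HTML fragment (the file paths are hypothetical) and a variation that captures with [^"]* so each match stays inside one attribute.

```python
import re

# Hypothetical fragment with two links on one line.
html = '<a href="/songs/a.mp3">A</a> <a href="/songs/b.mp3">B</a>'

# Greedy version: ".*" runs on to the last ".mp3" on the line.
greedy = re.findall(r'href="(.*\.mp3)"', html)

# Variation: [^"]* cannot cross a closing quote, so each link is matched separately.
links = re.findall(r'href="([^"]*\.mp3)"', html)

print(greedy)
print(links)
```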
Problem: I need a regex that checks whether a given author URL is valid.
Requirement: an author URL is basically a URL from a social networking site, blog, etc. that contains an author ID (profile ID).
For example:
www.facebook.com/RyanMathews
www.mouthshut.com/zobo.786
As I understand it, the regex would have to accept any string (any combination of characters) after the site's complete address followed by a "/".
I tried using this regex, but it doesn't support author IDs:
var urlregex = /^((https?:\/\/)?((([a-z\d])+(\-)?([a-z\d])+)+)(\.([a-z\d])+(\-)?([az\d])+)?)(\.[a-z]{2,4}?){1,2}$/i;
PS : Please explain the Regex & Logic too :D
This should help, but I recommend doing a little background reading:
What is the best regular expression to check if a string is a valid URL?
Getting parts of a URL (Regex)
Please spend some time to read these links and understand them, hope this helps, cheers!
^(http:\/\/){0,1}(www.[^\W]+.com)(\/[^\W]+)+
maybe this would work
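Since the question asked for the logic too, here is a minimal Python sketch of one way to read the requirement: an optional scheme, a dotted host name, then exactly one non-empty path segment as the author ID. The pattern and the name is_author_url are my own assumptions, not a general URL validator.

```python
import re

AUTHOR_URL = re.compile(
    r"^(?:https?://)?"    # optional http:// or https://
    r"(?:[a-z\d-]+\.)+"   # one or more host labels, each followed by a dot
    r"[a-z]{2,}"          # top-level domain, at least two letters
    r"/[^\s/]+"           # a slash, then the author ID (no spaces or slashes)
    r"/?$",               # optional trailing slash, then end of string
    re.IGNORECASE,
)

def is_author_url(url: str) -> bool:
    """True if url looks like site-address/author-id."""
    return AUTHOR_URL.match(url) is not None

checks = {
    "www.facebook.com/RyanMathews": is_author_url("www.facebook.com/RyanMathews"),
    "www.mouthshut.com/zobo.786": is_author_url("www.mouthshut.com/zobo.786"),
    "www.facebook.com": is_author_url("www.facebook.com"),
    "www.facebook.com/": is_author_url("www.facebook.com/"),
}
print(checks)
```

The author ID part uses [^\s/]+ rather than \w+ so that IDs containing dots, such as zobo.786, are accepted in full.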