So I'm trying to download all my old reddit posts using a combination of AutoPagerize and DownThemAll.
Here are two sample URLs I want to distinguish between:
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2med97
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2meana?context=3
The regexp I'm trying to use is this: (\b)http://www.reddit.com/([^?\s]*)?
I want all my reddit posts downloaded without any redundancy, so I want to match all of my reddit posts except for anything with a question mark (which is followed by a "context=3" parameter).
I've used RegEx Buddy to show that the regexp fits the first URL but not the second one. However, DownThemAll does not recognize this. Is DownThemAll's ability to parse regexp limited, or am I doing something wrong?
For now, I've just decided to download them all, but to use a renaming mask of *subdirs*.*text*.*html* so that I can later mass remove anything containing the word "context" in its filename.
Reddit does have an API. You might want to take a look at that instead, as it might be easier.
https://github.com/reddit/reddit/wiki/API
EDIT: Looks like http://www.reddit.com/user/USERNAME/.json might be what you want
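On the regex question itself, one possible explanation is that the pattern is unanchored, so the trailing optional group lets it match the part of the second URL that comes before the question mark. This is an assumption, since it depends on how DownThemAll applies its filters. Below is a minimal Python sketch of an anchored alternative; `NO_QUERY` and `without_query` are names made up for illustration.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(\b)http://www.reddit.com/([^?\s]*)?")

# Anchored variant: the whole URL must contain no "?" and no whitespace.
NO_QUERY = re.compile(r"^http://www\.reddit\.com/[^?\s]*$")

URLS = [
    "http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2med97",
    "http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2meana?context=3",
]


def without_query(candidates):
    """Keep only the URLs that carry no query string."""
    return [url for url in candidates if NO_QUERY.search(url)]


# An unanchored search still finds a match inside the "?context=3" URL ...
partial = ORIGINAL.search(URLS[1])
# ... while the anchored pattern keeps only the first URL.
kept = without_query(URLS)
```

If the tool only supports unanchored matching, filtering on the absence of "?" after downloading (as the renaming-mask workaround does) may still be the simpler route.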
So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to generate a URL list. The Kimono app has that option, and it actually splits the URL into sections, but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the chars, but that does not work, and I couldn't find any help on that in the official KB.
I know that import.io had "{alphanumeric}", for example, for different parts of the URL so that it matches them. Is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
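https://example.com/ matches literally
([^/]+)/ a run of characters that are not /, followed by a / (used twice, for the category and the sub-category)
([0-9]+)-([^/]+) digits, then a hyphen, then another run of characters that are not /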
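As a rough illustration of how the four capture groups line up with the URL parts, here is a short Python sketch. The product URL, the dictionary keys, and the `split_product_url` helper are all invented for the example, and the dot in the host name is escaped here, which the regex above leaves unescaped.

```python
import re

# Same structure as the suggested regex, with the "." in the host escaped.
PRODUCT_URL = re.compile(r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")


def split_product_url(url):
    """Return the URL parts as a dict, or None if the URL does not fit."""
    match = PRODUCT_URL.match(url)
    if match is None:
        return None
    category, sub_category, numeric_id, name = match.groups()
    return {
        "category": category,
        "sub_category": sub_category,
        "id": numeric_id,
        "name": name,
    }


# Hypothetical product page used only for illustration.
parts = split_product_url("https://example.com/tools/hammers/1234-claw-hammer/")
```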
I am creating a Google Form and trying to create a regex on one of the fields, because I need users to enter a profile link from a specific website. I'm a beginner with regex and this is what I have come up with:
/^(http:\/\/)?(steamcommunity\.com\/id\/)*\/?$/
But when I enter a test link such as http://steamcommunity.com/id/bagzli, it fails. I don't understand what is wrong with it.
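You missed a dot (meaning any character) after the (steamcommunity\.com\/id\/) group. Without it, the * repeats the whole group instead of matching the profile name. Try this: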
/^(http:\/\/)?(steamcommunity\.com\/id\/).*\/?$/
                                         ^-- added
The ultimate goal of what I was trying to accomplish was to ensure that certain text was entered in the box. I thought I had to use regex for that, but Google Forms also has a "Text Contains" feature, which I used to solve my problem. The regex by Zoff Dino did not work; I am not sure why, as it seems completely correct.
I will mark this as resolved as I managed to get my answer, even if it was not via regex.
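The difference between the two patterns in this thread can be checked outside Google Forms. The Python sketch below writes them without the surrounding / delimiters, since Python takes the bare pattern. `STRICT` is a stricter variant added here for illustration and is not from the thread. One possible reason the suggested pattern still failed in the form is that the delimiters were entered as literal characters, but that is an assumption and not confirmed anywhere above.

```python
import re

# Pattern from the question: "*" repeats the whole group, so nothing may
# follow "/id/" except an optional slash.
ORIGINAL = re.compile(r"^(http:\/\/)?(steamcommunity\.com\/id\/)*\/?$")

# Pattern from the answer: ".*" now matches the profile name.
FIXED = re.compile(r"^(http:\/\/)?(steamcommunity\.com\/id\/).*\/?$")

# Hypothetical stricter variant: requires a non-empty profile name.
STRICT = re.compile(r"^(http://)?steamcommunity\.com/id/[^/\s]+/?$")

link = "http://steamcommunity.com/id/bagzli"

original_ok = ORIGINAL.search(link) is not None
fixed_ok = FIXED.search(link) is not None
strict_ok = STRICT.search(link) is not None
```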
I am coding custom CSS for Facebook using Stylish.
Everything goes well except that I need to have some custom values under the condition of URL-suffix. The only thing that comes close is URL-prefix which is the exact opposite.
So I was wondering if I could do something like:
Detect if URL is like either:
www.facebook.com/*/posts or just */post
where * could be any value.
Is it possible to do this through RegEx?
I googled it but I couldn't make anything out of it.
I want to apply some CSS code only when viewing some individual Facebook posts, and the URLbar shows:
www.facebook.com/User/Posts/PostID.php
Therefore, I would only like to detect if Post or post/postID.php exists and apply the style.
The regex below matches links that contain the string /posts:
(?=.*?\/posts).*
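A quick way to sanity-check this pattern is to run it against a few URLs. The Python sketch below assumes the rule has to match the whole URL, so it uses a full match, and it adds case-insensitive matching because the address in the question is written with a capital "Posts". The `is_post_url` helper and the second test URL are invented for illustration.

```python
import re

# Lookahead pattern from the answer, compiled case-insensitively so that
# "/Posts" and "/posts" are treated the same.
POSTS = re.compile(r"(?=.*?\/posts).*", re.IGNORECASE)


def is_post_url(url):
    """True when the whole URL matches and contains "/posts" somewhere."""
    return POSTS.fullmatch(url) is not None


post = is_post_url("www.facebook.com/User/Posts/PostID.php")
other = is_post_url("www.facebook.com/User/photos")
```

Without the case-insensitive flag, the capitalised "/Posts" form from the question would not match.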
I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop down menu where the user can only search against one field, I'd like a single search box and to use a gmail like syntax. That's the best way I can describe it, and what I mean by a gmail like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct a SQL query. So for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin, and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with regex, try an online regex tester. Find a tutorial and play around. It's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
Thanks for the answers... I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go to guy for any naming needs :-)
Cheers,
Adam
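For reference, the string-splitting approach suggested above could look roughly like the Python sketch below. Splitting on plain spaces would break the quoted phrase apart, so this version uses the standard library's shell-style tokenizer instead. The `parse_search` name and the rule "a token with a colon is a field filter" are assumptions made for the example.

```python
import shlex


def parse_search(query):
    """Split a search string into field filters and free-text terms."""
    fields = {}
    free_text = []
    # shlex.split keeps a double-quoted phrase together as one token.
    for token in shlex.split(query):
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            free_text.append(token)
    return fields, free_text


fields, free_text = parse_search('username:bbaggins type:admin "made up plc"')
```

When building the SQL query, the field names would still need checking against a list of allowed columns, and the values should be passed as bound parameters instead of being concatenated into the statement.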
I'm trying to write some Perl to convert some HTML-based text over to MediaWiki format and hit the following problem: I want to search and replace within a delimited subsection of some text and wondered if anyone knew of a neat way to do it. My input stream is something like:
Please mail <a href="mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here">support</a>. if you want some help.
and I want to change Please help and Please can some one help me out here to Please%20help and Please%20can%20some%20one%20help%20me%20out%20here respectively, without changing any of the other spaces on the line.
Naturally, I also need to be able to cope with more than one such link on a line so splicing isn't such a good option.
I've taken a good look round Perl tutorial sites (it's not my first language) but didn't come across anything like this as an example. Can anyone advise an elegant way of doing this?
Your task has two parts. The first is to find and replace the mailto URIs; use an HTML parsing module for that. This topic is covered thoroughly on Stack Overflow.
The other part is to canonicalise the URI. The module URI is suitable for this purpose.
use URI::mailto;
# Sample href taken from the question
my @hrefs = ('mailto:help@myco.com&Subject=Please help&Body=Please can some one help me out here');
# as_string returns the canonical form, with spaces percent-encoded
print URI::mailto->new($_)->as_string for @hrefs;
__END__
mailto:help@myco.com&Subject=Please%20help&Body=Please%20can%20some%20one%20help%20me%20out%20here
Why don't you just search for the "Body=" tag up to the closing quote and replace every space with %20?
I would not even use regular expressions for that, since I don't find them useful for anything except mass changes where everything on the line is changed.
A simple loop might be the best solution.
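To make the "only touch the spaces inside the link" idea concrete, here is a small Python sketch. It is not the Perl solution from the answer, and it uses a regex with a replacement callback instead of a full HTML parser, so it only suits simple, well-formed input with double-quoted attributes. The sample line is a guess at the original markup, and `encode_mailto_spaces` is a made-up helper name.

```python
import re

# Matches a double-quoted href attribute that holds a mailto URI.
MAILTO_HREF = re.compile(r'href="mailto:[^"]*"')


def encode_mailto_spaces(html):
    """Percent-encode spaces inside mailto hrefs, leaving other text alone."""
    return MAILTO_HREF.sub(lambda m: m.group(0).replace(" ", "%20"), html)


# Assumed shape of the input line; the real markup may differ.
line = (
    'Please mail <a href="mailto:help@myco.com&Subject=Please help'
    '&Body=Please can some one help me out here">support</a> if you want some help.'
)
encoded = encode_mailto_spaces(line)
```

Because the substitution runs once per match, several links on the same line are handled without any manual splicing.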