Using comma vs underscore in urls - web-services

Only the first one works, second and third do not. Any ideas why?
http://abc.def.ghi.com:8080/mango?ll=76.93839283938492_-145.126282939231&locale=en_US&q= //returns json response
http://abc.def.ghi.com:8080/mango?ll=76.93839283938492,-145.126282939231&locale=en_US&q= //does not work, gives "Bad Request"
http://abc.def.ghi.com:8080/mango?ll=76.93839283938492%2C-145.126282939231&locale=en_US&q= //does not work, gives "Bad Request"

Okay so, here is my understanding. In general, commas are not typically used in URLs, but it is possible to have them; it's just not commonly practiced. For your #2 and #3 lines, there are a couple of things that could be causing the bad requests.
Here is a link worth looking into to get a better understanding. From what I understood of it, when a comma isn't encoded (or decoded) the way the server expects, the server-side parser can stumble over the address once it hits the "," because it doesn't know what to do with it. Worth noting: per RFC 3986 the comma is one of the reserved "sub-delims" characters and is perfectly legal in a query string, and your #3 line is already the correct percent-encoded form (%2C). Since that fails too, the server is most likely decoding it and then rejecting the comma in its own parsing of the ll parameter.
" https://www.searchenginenews.com/sample/content/are-you-using-commas-in-your-urls-heres-what-you-need-to-know "
Hope this helps. I know my answer isn't much, but you brought up an interesting question! I'm curious to learn more about it too.
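For what it's worth, here is a quick way to see the two encodings side by side using Python's standard library (a minimal sketch; the parameter names are the ones from the question):

from urllib.parse import quote, urlencode

ll = "76.93839283938492,-145.126282939231"

# By default, urlencode percent-encodes the comma as %2C:
print(urlencode({"ll": ll, "locale": "en_US", "q": ""}))
# -> ll=76.93839283938492%2C-145.126282939231&locale=en_US&q=

# RFC 3986 lists "," among the reserved sub-delims that are legal in a
# query string, so you can also leave it as a literal comma:
print(urlencode({"ll": ll, "locale": "en_US", "q": ""}, quote_via=quote, safe=","))
# -> ll=76.93839283938492,-145.126282939231&locale=en_US&q=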

Related

trying to stem a string in natural language using python-2.7

I am importing:
from nltk.stem.snowball import SnowballStemmer
and I have a string as follows:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following, which has a couple of 'e's missing. Can't figure out what is causing it. Any suggestions? Thanks for the feedback.
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"
You're using it correctly; that's just how the stemmer behaves. The Snowball (Porter2) stemmer is a rule-based suffix stripper, and the stems it produces are not guaranteed to be dictionary words: its rules strip a trailing "-e", which is why "everyone" becomes "everyon" and "message" becomes "messag". We can't expect perfection, but it's annoying when it happens with common words. It's also stemming "everything" to "everyth", as if it were a verb. At least here it's clear what it's doing. But "-e" is not a suffix in English...
The stemmer accepts the option ignore_stopwords=True, which suppresses stemming of words in the stopword list (these are common, usually irregular words that Porter thought fit to exclude because stemming them does more harm than good). Unfortunately it doesn't help with the particular examples you ask about.
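For reference, here is a minimal runnable version of your snippet with that option turned on (assuming NLTK plus its stopwords corpus - nltk.download('stopwords') - are installed):

from nltk.stem.snowball import SnowballStemmer

text_string = ("Hi Everyone If you can read this message youre properly "
               "using parseOutText Please proceed to the next part of the project")

# ignore_stopwords=True leaves stopwords such as "is" and "the" unstemmed
stemmer = SnowballStemmer("english", ignore_stopwords=True)
print(" ".join(stemmer.stem(word) for word in text_string.split(" ")))
# Stems are still not guaranteed to be dictionary words: "message"
# becomes "messag" either way, because the rules strip the final "-e".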

How do I imitate twitters url-shortener?

The main question is a bit short, so I'll elaborate.
I'm building an app for twitter with which you can do the basic actions (get posts, do a post, reply etc.)
Now I figured it would be a good idea if I'd check the max 140 char limit in my app.
So far so good, then someone asked if I could also do the url-shortener thing.
So at the moment I have a regex that picks up most (in fact too many) URLs, takes their length, and either adds or subtracts the difference from the 140 max.
It's still a bit buggy, but I can manage that.
Now my problem....
It seems Twitter is quite picky about what it considers a URL:
I got the most basic ones covered (starting with http(s):// and such), but Twitter also replaces some TLDs very easily ((www.)google.com, [whatever].net/.biz/.info are just a few of them),
but not .nl, .de or .tk.
Now I was wondering if perhaps someone has found out which ones they do and which ones they don't 'shorten'.
Now, because I'm pretty sure my regex isn't the best either, I'll drop it here as well:
((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:\/~\+#]*[\w\-\#?^=%&\/~\+#])?)|([\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:\/~\+#]*[\w\-\#?^=%&\/~\+#])?)
http://support.twitter.com/articles/78124-how-to-shorten-links-urls# indicates that all URLs posted to Twitter will be rewritten to be exactly 19 characters long.
I am using this:
var url_expression = /[-a-zA-Z0-9#:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&//=]*)?/gi;
Nobody has complained :)
I figured it out. I found a pretty important line on the TLD wiki page: it states that all country TLDs are two characters long, and the other way around, all two-character TLDs are country codes. With that in mind, I started testing a bunch of them with Twitter, and I'm pretty sure I now know which URLs Twitter shortens and which ones it doesn't.
All URLs starting with http:// or https://
All URLs like [something].[non-country TLD] # .com .biz .mobi etc. (except .arpa & .aero)
All URLs like [something].[something].[valid TLD] # including countries
Links like http://[user]:[pass]@[something].[tld] will NOT be shortened
Now to build a regex for it; I'll post it here as soon as I think I have it :D
This is what I've got so far:
/(^(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?:(?:[-\w]+\.)+(?:com|asia|cat|coop|edu|int|tel|pro|org|net|gov|mil|biz|info|mobi|name|jobs|museum|travel|([a-z]{2})))(?::[\d]{1,5})?(?:(?:(?:\/(?:[-\w~!$+|.,=\(\)]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)/gim;
One major flaw still in it: it also accepts [domain].[tld], which Twitter doesn't.
I hope this will help someone in the future. I'm pretty sure there's not a whole lot of easy-to-find info about this on the web (or at least I couldn't find it).
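If anyone wants to sanity-check their own pattern against the rules above, a small test harness helps. Here is a minimal Python sketch (the pattern is a simplified stand-in illustrating the rules, not the full regex above):

import re

GENERIC_TLDS = r'(?:com|net|org|biz|info|mobi|name|jobs|museum|travel)'
pattern = re.compile(
    r'^(?:'
    r'https?://\S+'                      # rule 1: explicit scheme, always shortened
    r'|[\w-]+\.' + GENERIC_TLDS +        # rule 2: [something].[non-country tld]
    r'|[\w-]+(?:\.[\w-]+)+\.[a-z]{2,}'   # rule 3: [something].[something].[any tld]
    r')$', re.IGNORECASE)

for candidate in ["http://example.nl", "google.com", "www.google.com",
                  "example.nl", "sub.example.nl"]:
    print(candidate, "->", "shorten" if pattern.match(candidate) else "leave")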

Weird &nbsp; appearing for no reason

I've just launched my new site, and in going through it in multiple browsers to see how it performs, I've noticed something weird.
Can you see the gap after the word 'but'? By my reasoning, the word 'was' on the following line should be next to it, as there is plenty of space for it - but as you can see, it isn't.
Although this screenshot is from Firefox (10), I'm getting the same thing in Chrome (17) and Internet Explorer (9).
Using Firebug to inspect the element, it is showing a &nbsp; between the 'was' and 'disappointed' (which would explain why it isn't on the line above) - but upon viewing the source, no such &nbsp; exists.
This is leading me to suspect that the browser is inserting them - but I have no idea why.
Anyway, the page in question is http://limeblast.co.uk/2012/02/currently-playing/
I used wget to download the page directly to a file, and I noticed that the space between 'was' and 'disappointed' - and all the other spaces that show up as &nbsp; - is encoded with two bytes, C2 A0 hex (the UTF-8 no-break space), while the other spaces are encoded with one byte, 20 hex.
Hope this helps.
Off-topic, I would also recommend justifying the text.
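To see the difference at the byte level yourself, here is a tiny Python check (a hypothetical snippet, just to illustrate the two encodings):

# U+00A0 is the no-break space; in UTF-8 it becomes the two bytes C2 A0
text = "was\u00a0disappointed"
print(text.encode("utf-8"))  # b'was\xc2\xa0disappointed'
# Normalizing it to an ordinary space (one byte, 20 hex):
print(text.replace("\u00a0", " ").encode("utf-8"))  # b'was disappointed'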
You should look into your colorbox implementation. It isn't working properly, and it's causing other issues on the page. I don't think it's necessarily related to what you're describing, but fix it and you'll get the colorbox effect you're after.
I took a look and tried a few things with your code; the best solution I would recommend is to force your <p> text to be justified - right now your text is left-aligned.
I had the same problem with text coming from the WP tinymce editor. This here solved it for me:
function b09_remove_forced_spaces($content) {
    // htmlentities() turns the UTF-8 no-break space (C2 A0) into "&nbsp;"
    $string = htmlentities($content, null, 'utf-8');
    $content = str_replace("&nbsp;", " ", $string);
    $content = html_entity_decode($content);
    return $content;
}
add_filter("the_content", "b09_remove_forced_spaces", 9);
based on https://stackoverflow.com/a/21801444/586823

Controlling Word Wrap in a container

I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information - what programming language are you using? PHP, JavaScript, something else?
I think you only need to replace \r\n, \r and \n with whitespace. PHP code for that (note the double quotes: in single quotes PHP would treat '\r\n' as literal backslashes, not line breaks):
$nowrap = str_replace("\r\n", ' ', $nowrap);
$nowrap = str_replace("\r", ' ', $nowrap);
$nowrap = str_replace("\n", ' ', $nowrap);
You can do that analogously in other languages (for JS, see the string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
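For comparison, the same normalization in Python (a hypothetical equivalent, not from the original thread):

# Replace the two-character CRLF first so it doesn't become two spaces
text = "first line\r\nsecond line\nthird line"
nowrap = text.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
print(nowrap)  # -> first line second line third line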
Depending on the situation (people always seem to add two line breaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by another newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
Edit: never mind - do not use this: since people tend to write their email text in paragraphs, you will break their signature / sign-off with this regex. One could fiddle around with a minimum line length before deeming a line 'breakable' (I chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there is no assurance that the line break wasn't intended. With a fiddleable line length you could base it on average users, but the question is: what do they mind more - the difference between breaking and non-breaking paragraphs, or the breaking of their signatures?
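For anyone who wants to experiment with the threshold, here is the same idea in Python (a sketch mirroring the PHP pattern above, with a made-up test string):

import re

text = ("This wrapped paragraph line easily exceeds the sixty-three character limit\n"
        "and continues here.\n"
        "-- \n"
        "short signature line")
# Join a line onto the next only when it is at least 63 characters long
# and the next line is not blank:
print(re.sub(r'([^\r\n]{63,})\r?\n(?!\r|\n)', r'\1 ', text))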
Thanks for getting back so quickly!
The discussion board uses PHP (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programming. If I am to do this at my current level of skill, I have only one of two options.
Using a preg_replace in PHP. The discussion board allows us to do this from a control panel. So if I could do it with one preg_replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so - it needed to be a little smaller. I don't care that a rare long signature line may lose its carriage return; that's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating. Thanks so much for that quick fix. I am joyous at the help I received here.]

Finding type of break in icu::BreakIterator

I'm trying to understand how to use icu::BreakIterator to find specific words.
For example I have following sentence:
To be or not to be? That is the question...
A word instance of the break iterator would put breaks there:
|To| |be| |or| |not| |to| |be|?| |That| |is| |the| |question|.|.|.|
Now, not every pair of break points is an actual word.
In the derived class icu::RuleBasedBreakIterator there is a getRuleStatus() method that returns some kind of information about the break, and it reports "word" status at the following points (marked "/"):
|To/ |be/ |or/ |not/ |to/ |be/?| |That/ |is/ |the/ |question/.|.|.|
But... it all depends on the specific rules, and there is absolutely no documentation explaining them (unless I just try things), and what would happen with different locales and languages where dictionaries are used? What happens with backward iteration?
Is there any way to get "Begin of Word" or "End of Word" information like in Qt QTextBoundaryFinder: http://qt.nokia.com/doc/4.5/qtextboundaryfinder.html#BoundaryReason-enum?
How should I solve such problem in ICU correctly?
Have you tried the ICU documentation? It appears to explain everything you are asking about, including handling of internationalisation, reverse iteration, and the rules - both the default set and how to create your own custom set. They also have code snippets to help.
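For what it's worth, the rule-status values sit in stable ranges defined by the C API (UBRK_WORD_NONE starts at 0, UBRK_WORD_NUMBER at 100, UBRK_WORD_LETTER at 200, and so on), so you can filter out non-word boundaries by status. Here is a minimal sketch using the PyICU bindings rather than C++ (an assumption on my part: that PyICU is installed and exposes getRuleStatus() like the underlying class):

import icu

text = "To be or not to be? That is the question..."
bi = icu.BreakIterator.createWordInstance(icu.Locale.getEnglish())
bi.setText(text)

start = bi.first()
for end in bi:                   # iterating yields successive boundary offsets
    status = bi.getRuleStatus()  # status tag of the rule that ended this segment
    # 0..99 = not a word (spaces, punctuation); 100..199 = numbers;
    # 200 and up = letter/kana/ideographic words
    if status >= 100:
        print(text[start:end])
    start = end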