How to remove emoticons from tweets in C++? - c++

I'm working on a Twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and process them a bit (lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me, I already looked at other similar questions, but none of them deals with C++; mostly R, Python and PHP.
I was thinking of using regex, however I can't get it to work. I tried it for removing hashtags and URLs and gave up; I ended up using plain string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special characters?
Thanks

I would recommend using regular expressions for this. You have two options: you can either extract only the characters you are interested in (if you are working with English tweets, this would probably be A-Z, a-z, digits and maybe some symbols, depending on your needs), or you can select invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's QRegularExpression engine, but the C++ standard library has regex support in <regex> (although I'm not sure how good its Unicode handling is), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/
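A minimal sketch of the whitelist idea, shown in Python for brevity (the same character-class pattern ports directly to std::regex or ICU's RegexMatcher in C++):

```python
import re

def strip_non_ascii(tweet):
    # Whitelist approach: keep only printable ASCII (letters, digits and
    # common punctuation).  Everything outside that range -- emoji and
    # other symbol characters -- is replaced with an empty string.
    return re.sub(r'[^\x20-\x7e]', '', tweet)

print(strip_non_ascii("great game today \U0001F600\U0001F44D #win"))
```

Note that this also strips accented letters; widen the character class if the tweets may contain non-English text.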

Related

Is it possible using NLP (Natural Language Processing)?

I have a set of Project Names, a set of keywords and a set of paragraphs.
Now my task is to check whether any keyword matches a project name, and whether any keyword matches a word in any paragraph.
If a set of paragraphs matches a keyword, and a project matches the same keyword, then I have to assign those paragraphs to that project.
I have been using string regex for this. But can it be implemented using Natural Language Processing concepts?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.
There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will be searching and matching strings.
You can do it using dictionaries, since the keyword-to-project mapping becomes easy with them, and regex will be in action too.
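The dictionary idea can be sketched like this (Python; the project names, keywords and the simple whole-word matching are illustrative assumptions, not from the question):

```python
def assign_paragraphs(projects, keywords, paragraphs):
    # Build keyword -> matching projects once, so each paragraph lookup
    # is a dictionary hit rather than a scan over all project names.
    kw_to_projects = {
        kw: [p for p in projects if kw.lower() in p.lower()]
        for kw in keywords
    }
    assignments = {p: [] for p in projects}
    for para in paragraphs:
        words = set(w.lower() for w in para.split())
        for kw, projs in kw_to_projects.items():
            if kw.lower() in words:
                for proj in projs:
                    assignments[proj].append(para)
    return assignments
```

A real version would strip punctuation before splitting into words, but the structure, one pass over projects to build the map and one pass over paragraphs to use it, stays the same.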

How to find non-localised strings in a Swift project

How can I find all strings in a project that are not being localised?
My goal is to add support for localisation by generating the XLIFF file via Editor -> Export For Localisation. In order to do that, I first added comments for the localiser where needed in the Storyboard.
Next, I need to find all strings in the code that do not use NSLocalizedString("...", comment: "...").
Is there a way to find all these strings?
I didn't succeed in writing a regex to find them, due to my lack of competence with regex.
My goal is to have a regex like this:
[withIdentifier: |NSLocalizedString(]".*"
in order to find all strings surrounded by quotes that are not preceded by certain keywords.
I tried, with no success, to use a negative lookahead, following
A regular expression to exclude a word/string.
It's not meant for automated replacement, but just to have a quick view if I haven't forgotten some strings.
Thank you very much!
OBJC
There is a possible way to find all the strings that are not wrapped in NSLocalizedString:
Go to Product -> Analyze.
In the left panel you will see every string that is not localised; click on one and Xcode will show you the issue.
SWIFT 3
I am not sure about this solution (not tested):
https://medium.com/@pinmadhon/finding-non-nslocalized-strings-in-xcode-8-in-swift-3-or-objc-589ee279a166
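Since variable-width negative lookbehind isn't supported by most regex engines, one workaround is to match every quoted string and do the "not preceded by" check in ordinary code. A rough sketch in Python (the keyword list is an assumption, and the simple string pattern ignores escaped quotes):

```python
import re

STRING = re.compile(r'"[^"\n]*"')
# Quoted literals directly preceded by these are considered fine.
ALLOWED = ("NSLocalizedString(", "withIdentifier: ", "comment: ")

def unlocalized_strings(source):
    # Find every quoted literal, then drop the ones whose immediately
    # preceding text ends with one of the allowed keywords.  Doing the
    # "not preceded by" check in code sidesteps the fixed-width
    # restriction on regex lookbehinds.
    hits = []
    for m in STRING.finditer(source):
        before = source[:m.start()]
        if not any(before.endswith(k) for k in ALLOWED):
            hits.append(m.group())
    return hits
```

Run over each .swift file, this gives the same "quick view" the question asks for, without needing a single clever regex.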

Python3 unicode regex

I'm not a native English speaker, but it happened that I've never written any regex for any non-ASCII text in my life, so I'm confused with a seemingly trivial case.
I have a large dictionary scraped from a website by a robot. All HTML tags are removed. My goal is to remove most carry-over hyphens. The idea is that >90% of the problematic punctuation has the form lowercase-lowercase, so it could be caught by a regex like '\p{Ll}-\p{Ll}'. This should be able to capture Russian lowercase characters too, при-мер for example.
However, it seems that \p isn't supported by Python's re engine. I'm not sure which alternative regex engine I'm supposed to choose, because googling doesn't turn up anything relevant to Python 3. I thought Python 3 was much more advanced when it comes to i18n and Unicode, and that it was supposed to have Unicode character class support.
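For reference, the third-party regex module (pip install regex) does support \p{Ll}. With only the standard library, one workaround is to match broadly and test case in code. A sketch, assuming the goal is simply to join hyphenated lowercase pairs:

```python
import re

def remove_carryover_hyphens(text):
    # \w is Unicode-aware in Python 3, so it matches Cyrillic letters too.
    # str.islower() stands in for the \p{Ll} property class that re lacks
    # (the third-party "regex" module supports \p{Ll} directly).
    def fix(m):
        a, b = m.group(1), m.group(2)
        if a.islower() and b.islower():
            return a + b          # drop the hyphen: "при-мер" -> "пример"
        return m.group(0)         # keep it: "UTF-8", "X-ray" stay intact
    return re.sub(r'(\w)-(\w)', fix, text)
```

This treats every lowercase-lowercase hyphen as a carry-over, which matches the >90% assumption in the question but will also join legitimate compounds.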

Recommended built-in WinXP language support for UTF-8 regex

It's my first foray into UTF-8 land. I'm an IIS admin, so I've never gotten to touch this professionally. I'm trying to help a missionary who has translated the Bible into an African language and now needs to do some global matching against large UTF-8 files. We're specifically matching for accented characters.
We're using older XP computers here, so I cobbled together a quick script in VBS knowing the language would be installed on their boxes already. After playing around for a few minutes, it appears VBS regexes handle UTF-8 by breaking each character up into 2 characters. To match a single â, my pattern is \u00c3\u00a2. Shouldn't this be \u00e2?
Since I'm out of my depth I thought I'd seek a little guidance. It almost looks like UTF-8 simply requires this kind of double matching (and UTF-8 is required.) Can someone tell me into which box canyon I'm coding? :-)
Downloading and installing Perl or Java is probably outside this project's bandwidth and technical know-how. The tool should be built in. MS Office is installed, so VBA is an option if there's some library that offers specific support. JavaScript is installed as well, though I don't know what versions.
Thanks
Unless you need to match two or more consecutive dots (e.g. you have .. or ... in your regex, but not .*), you can use any ASCII regex library on UTF-8 and expect it to work correctly.
The trick is to know what you are looking for. UTF-8 does that kind of byte breakup, so write your regex in whatever encoding you are familiar with, convert it to UTF-8, and it will work unless it contains "..".
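The byte breakup is easy to verify. A small demonstration (Python is used here only to inspect the bytes; they are the same bytes VBScript's byte-oriented engine sees):

```python
import re

# "â" is U+00E2.  Encoded as UTF-8 it becomes the two bytes 0xC3 0xA2,
# which a byte-oriented engine reads as two separate characters --
# exactly the \u00c3\u00a2 pattern observed in VBScript.
encoded = "â".encode("utf-8")
print([hex(b) for b in encoded])        # ['0xc3', '0xa2']

# A bytes regex matching the UTF-8 representation directly:
print(bool(re.search(b"\xc3\xa2", "pâté".encode("utf-8"))))   # True
```

So matching \u00c3\u00a2 instead of \u00e2 is the expected behaviour when the engine treats each UTF-8 byte as a character, not a bug in the script.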
What about PowerShell? It uses the .NET regular expressions library, and that is one of the best libraries available, especially for Unicode support.

Regex misspellings

I have a regex created from a list in a database to match names of building types in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions for making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that can handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing, and then others with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein distance, a measure of the difference between two strings: the number of edits (insertions, deletions and substitutions) needed to transform one string into the other.
It's also built into PHP as levenshtein().
So, I'd split the input file on non-word characters and measure the distance between each word and your target list of building names. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
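For readers outside PHP, the distance is a short dynamic-programming routine. A sketch (the max_dist threshold of 2 mirrors the "1 or 2 letters" from the question):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: prev[j] holds the
    # distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def is_misspelling(word, target, max_dist=2):
    return levenshtein(word.lower(), target.lower()) <= max_dist
```

Comparing every word against every building name is O(words x names), but as the question notes, the machine can take the load.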
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex()? – Teifion
Soundex is similar to the levenshtein function Triptych mentions; it is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment, but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate, and it matches from there. It can also match blocks of text other than lines. The link above points to a newer, GPLed version of agrep and also to a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article he also lists contributed implementations of the algorithm in various other languages.
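The core of that approach, generating every string one edit away from a word, is compact enough to sketch here (the lowercase alphabet is an assumption; extend it to cover your building names):

```python
def edits1(word):
    # All strings one edit away, Norvig-style: deletions, transpositions,
    # replacements and insertions over a fixed alphabet.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

Applying it twice gives edit distance 2, matching the "1 or 2 letters" tolerance asked about above, at the cost of a much larger candidate set.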