Is it possible using NLP? Natural Language processing - regex

I have a set of project names, a set of keywords and a set of paragraphs.
My task is to check whether any keyword matches a project name, and whether that keyword also matches a word in any paragraph.
If a keyword matches both a project and one or more paragraphs, then I have to assign those paragraphs to that project.
I have been using string regexes for this, but can it be implemented using Natural Language Processing concepts?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.

There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will still be searching and matching strings.

You can do it using dictionaries, since they make the mapping easy, with regex still doing the actual matching.
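A minimal sketch of the dictionary approach, in Python. The function name, the sample data and the choice of whole-word, case-insensitive matching are all assumptions made for illustration; the question does not say how a keyword should match.

```python
import re
from collections import defaultdict


def assign_paragraphs(projects, keywords, paragraphs):
    """Map each project to the paragraphs that share a keyword with it."""
    assignments = defaultdict(list)
    for keyword in keywords:
        # Assumption: whole-word, case-insensitive match.
        # re.escape keeps special characters in a keyword literal.
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        matching_projects = [p for p in projects if pattern.search(p)]
        if not matching_projects:
            continue  # keyword matches no project, so nothing to assign
        for paragraph in paragraphs:
            if pattern.search(paragraph):
                for project in matching_projects:
                    if paragraph not in assignments[project]:
                        assignments[project].append(paragraph)
    return dict(assignments)


# Hypothetical sample data
projects = ["Billing Portal", "Search Engine"]
keywords = ["billing", "search", "cache"]
paragraphs = [
    "The billing module needs a rewrite.",
    "Search latency went up.",
    "Cache invalidation is hard.",
]
result = assign_paragraphs(projects, keywords, paragraphs)
```

Here "cache" matches a paragraph but no project, so that paragraph is left unassigned. Every project and paragraph is still scanned once per keyword, which is the point made above: the dictionary only organizes the result.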

Related

Structural Search - Complete Match expression in IntelliJ

I find the Structural Search and Replace feature in IntelliJ IDEs very powerful.
While browsing through the existing templates and discovering my new superpowers, I came across the template called "logging without if".
My spider sense urged me to check out the "without" part as it uses invert condition in Complete Match.
However, I am baffled by the expression used in Complete Match.
Here it is:
if('_a) { 'st*; }
Please help me understand how this expression is used.
UPDATE 2017/01/19:
As pointed out by #Faibbus, the docs say that _a and _st are variables.
My confusion is with the variable names.
The names _a and _st only appear here, and nowhere else in the template.
What makes them variables? All other variables in Structural Search are surrounded by $dollar$ signs.
What is the role of the underscore as a variable prefix? What does the apostrophe do in that expression?
I don't find it clear at all. What am I missing?
The expression uses an internal search criteria language. With this language it is possible to specify a complete search query as text, without needing all the text fields and checkboxes of the usual Structural Search dialog. This language probably shouldn't have been exposed, and it will be better hidden in IntelliJ IDEA 2017.2.
That said, here's a short explanation of the features of the language used:
- A single tick mark (') indicates a variable. So there are two variables, _a and st.
- A variable that does not start with an underscore is the target of the search. There can be only one target per query, so st is the target.
- The * means zero or more times.
- The rest of the query is a regular Java fragment.
For other features of this search criteria language you can check out the source if you are interested.

How to remove emoticons from tweets in C++?

I'm working on a Twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and process them a bit (lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deals with C++. They are mostly about R, Python and PHP.
I was thinking of using regex, but I can't get it to work. I tried it for removing hashtags and URLs and gave up. I ended up using plain std::string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special characters?
Thanks
I would recommend using regular expressions for this. You have two options: you can either extract only the characters you are interested in (if you are working with English tweets this would probably be A-Z, a-z, numbers and maybe some symbols, depending on your needs), or you can select the invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's regular expression engine, but the C++ standard library has regex support (although I'm not sure how good it is with Unicode), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/
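The two options are independent of the language, so here is a rough sketch of both in Python for brevity; in C++ the same character classes would need a Unicode-aware engine. The function names and the exact character ranges are assumptions for illustration, and the ranges are not a complete list of emoji.

```python
import re

# Option 1: whitelist. Keep ASCII letters, digits, whitespace and a little
# punctuation; everything else (including accented letters) is dropped.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'-]")


def keep_allowed(text):
    return NOT_ALLOWED.sub("", text)


# Option 2: blacklist. Remove characters from a few emoji-related Unicode
# ranges (an assumed, incomplete selection) and leave everything else alone.
EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16, often attached to emoji
    "]+"
)


def strip_emoji(text):
    return EMOJI.sub("", text)
```

The whitelist is simpler but also strips legitimate non-English letters; the blacklist keeps them but has to be extended as new emoji appear. Neither touches ASCII emoticons such as :) unless the punctuation involved is excluded from the whitelist.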

regular expression to convert state names to abbreviations

I'm working on a project that requires using only regular expressions to convert state names (matching must be case-insensitive) to their two-letter abbreviations.
I cannot use any sort of development environment or link to any databases or xml or ini files.
Please help!
Since state names have nothing regular about them, regular expressions are the WRONG tool. I would suggest getting a new project.
However, the only solution (apart from stupid illogical hacks) is to hardcode every state:
s/Alabama/AL/i
s/Alaska/AK/i
...
s/Wyoming/WY/i
A list of the states and their abbreviations can be found here.
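The hardcoded substitutions can also be folded into a single case-insensitive pattern plus a lookup table. This is only a sketch in Python: it steps outside a strict "regex only" constraint, the names are made up for illustration, and the table is deliberately partial (three states) rather than the full list.

```python
import re

# Deliberately partial table; the remaining states are not reproduced here.
STATE_ABBREVIATIONS = {
    "alabama": "AL",
    "alaska": "AK",
    "wyoming": "WY",
}

# Longest names first, so a longer name is tried before any shorter name
# it might contain.
_names = sorted(STATE_ABBREVIATIONS, key=len, reverse=True)
STATE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)


def abbreviate_states(text):
    # Look the matched name up in lowercase, whatever case it was typed in.
    return STATE_PATTERN.sub(
        lambda m: STATE_ABBREVIATIONS[m.group(1).lower()], text
    )
```

For example, `abbreviate_states("From ALASKA to wyoming")` gives `"From AK to WY"`. The regex only finds the names; the mapping itself is still hardcoded, exactly as in the substitutions above.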

Need to create a gmail like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have a separate search box for each possible field, or a drop-down menu where the user can only search against one field, I'd like a single search box with a Gmail-like syntax. That's the best way I can describe it. What I mean by a Gmail-like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct an SQL query. So for example, type:admin would form part of the WHERE clause so that it finds any record where the field type is equal to admin, and the same for username. The text in quotes may be a free-text search, but I'm not sure about that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with regex, try an online regex tester. Find a tutorial and play around; it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
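The string-manipulation approach can be sketched in a few lines of Python. The function name and the return shape are assumptions for illustration; `shlex.split` from the standard library is used so that the quoted phrase stays in one piece, which a plain split on spaces would not do.

```python
import shlex


def parse_search(query):
    """Split a query into field:value pairs and free-text terms."""
    fields = {}
    free_text = []
    # shlex.split honours double quotes, so "made up plc" is one token.
    for token in shlex.split(query):
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            free_text.append(token)
    return fields, free_text


fields, free_text = parse_search('username:bbaggins type:admin "made up plc"')
# fields    -> {'username': 'bbaggins', 'type': 'admin'}
# free_text -> ['made up plc']
```

Two limitations of this sketch: `shlex.split` raises `ValueError` on an unbalanced quote, and a quoted phrase that contains a colon is treated as a field. When building the SQL, it is safer to check each field name against a list of known columns and pass the values as query parameters rather than concatenating them into the WHERE clause.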
Thanks for the answers. I did start doing it without regex and just wondered whether a regex would be simpler. It sounds like it wouldn't, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go to guy for any naming needs :-)
Cheers,
Adam

Regex misspellings

I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing, and others with letters added.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regexes for each special case.
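The splitting and thresholding steps described above might look like this. It is a sketch in Python rather than PHP, with the distance written out by hand so the example is self-contained; the function names, the threshold of 2 and the sample sentence are assumptions for illustration.

```python
import re


def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def find_buildings(text, buildings, max_distance=2):
    """Return (word, building) pairs whose distance is within the threshold."""
    found = []
    for word in re.split(r"\W+", text):
        if not word:
            continue
        for building in buildings:
            if levenshtein(word.lower(), building.lower()) <= max_distance:
                found.append((word, building))
    return found


matches = find_buildings(
    "Build a Unversity next to the Barracks", ["University", "Barracks"]
)
```

Here "Unversity" is one edit away from "University", so it is matched. Short building names need a tighter threshold, since almost any short word is within two edits of another, and multi-word building names would need the text to be compared in word groups rather than single words.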
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.