Structural Search - Complete Match expression in IntelliJ - regex

I find the Structural Search and Replace feature in IntelliJ IDEs very powerful.
While browsing through the existing templates and discovering my new superpowers, I came across the template called "logging without if".
My spider sense urged me to check out the "without" part, as it uses the invert condition option on Complete Match.
However, I am baffled by the expression used in Complete Match.
Here it is:
if('_a) { 'st*; }
Please help me understand how this expression is used.
UPDATE 2017/01/19:
As pointed out by @Faibbus, the docs say that _a and _st are variables.
My confusion is with the variable names.
The names _a and _st only appear here, and nowhere else in the template.
What makes them variables? All other variables in Structural Search are surrounded by $dollar$ signs.
What is the role of the underscore as a variable prefix? What does the apostrophe do in that expression?
I don't find it clear at all. What am I missing?

The expression uses an internal search criteria language. With this language it is possible to specify a complete search query as text, without needing all the text fields and checkboxes of the usual Structural Search dialog. This language probably shouldn't have been exposed and will be more hidden in IntelliJ IDEA 2017.2.
That said, here's a short explanation of the features of the language used:
- a single tick mark indicates a variable. So there are two variables, _a and st.
- a variable not starting with an underscore is the target of the search. There can be only one target per query. So st is the target.
- the * indicates zero or more times.
- the rest of the query is a regular Java fragment.
For other features of this search criteria language you can check out the source if you are interested.

How to store regex "literals" in Postgres?

I want to store regex pattern/option "literals" in a Postgres database, like:
/<pattern>/options
I think it's helpful to indicate the expected format and use of the text. Also, the application framework I'm using can coerce this kind of text into the proper Regex type.
I looked through the data types and provided extensions and didn't see anything specific. Am I missing one?
If there is no specialized type, is there a reasonable way to constrain TEXT so that it is likely to contain a regex (not to validate the regex, just to ensure there is text between forward slashes)? Does this work?
pattern TEXT CONSTRAINT is_regex (pattern LIKE '/%/%')
At the moment, I'm only using these literals in application code, which is why the TEXT to Regex transformation is very helpful. At some point, I might get better at CTEs and transform them back to regular TEXT (without forward-slashes or options) to be used in Postgres pattern matching functions.
PostgreSQL doesn't offer such a type (as of now), but generally speaking you have a few options to preserve database integrity (I can only assume you want this so that you don't have to worry about data read from the database breaking your application because it's not a valid regular expression).
Your best bet (which you already figured out) is to use a CHECK constraint, one way or the other. If you plan to use this pattern in multiple places, I suggest using domain types. That way, you don't have to define these constraints on multiple columns. Ironically, the best way to write such a CHECK constraint is to write a regexp pattern that matches your regexp patterns (because there are multiple regexp implementations with slight differences). It obviously won't be perfect, but it might be good enough. For example:
create domain likely_regexp as text
check (value ~ '^/([^/]*(\\/[^/]*)*[^\\])?/[a-z]*$');
But if you're okay with checking against PostgreSQL's implementation, you can (ab)use the fact that a CHECK constraint fails not only when the evaluated expression is false, but also when the expression throws (raises) an error. So you can call a regexp function to detect whether the value is actually a valid regular expression. You still have to split the value into the pattern part and the options part, though.
create domain pg_regexp as text
check (regexp_replace('', replace(substring(value from '^/(.*)/'), '\/', '/'),
'', substring(value from '/([^/]*)$')) = '');
https://rextester.com/YFG18381
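On the application side, the same split into pattern and options can be sketched in Python. This is only an illustration, not what any particular framework does: the function name parse_regex_literal and the mapping of option letters to flags are assumptions, since the question doesn't say which options are in use.

```python
import re

# Hypothetical mapping of option letters to Python flags; the question does
# not say which options the application actually uses.
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}

# Same shape as the domain checks above: /<pattern>/<options>
_LITERAL = re.compile(r"/(.*)/([a-z]*)", re.DOTALL)


def parse_regex_literal(text):
    """Turn a '/pattern/options' string into a compiled regex.

    Raises ValueError if the text is not shaped like a literal or uses an
    unknown option, and re.error if the pattern itself does not compile.
    """
    match = _LITERAL.fullmatch(text)
    if match is None:
        raise ValueError("not a /pattern/options literal: %r" % text)
    body, options = match.groups()
    flags = 0
    for letter in options:
        if letter not in _FLAGS:
            raise ValueError("unknown option %r" % letter)
        flags |= _FLAGS[letter]
    # Un-escape the delimiter, as the replace() call in the second domain does.
    return re.compile(body.replace(r"\/", "/"), flags)


if __name__ == "__main__":
    regex = parse_regex_literal(r"/a\/b/i")
    print(bool(regex.search("xA/Bx")))
```

Like the second domain, this validates against one specific regex implementation (Python's here), so a literal accepted by PostgreSQL may still be rejected by the application, and the other way around.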

Is it possible using NLP? Natural Language processing

I have a set of Project Names, a set of keywords and a set of paragraphs.
Now my task is to check whether keywords match any project names, and whether keywords match any word in any paragraph.
If some paragraphs match a keyword and a project matches the same keyword, then I have to assign those paragraphs to that project.
I have been using string regex for this. But can this be implemented using Natural Language Processing concepts?
If yes, please let me know how it can be implemented. It would be very helpful for me.
Thanks in advance.
There's no NLP involved in this as such. No matter what you do, you have to go through all the projects and all the paragraphs at least once. Yes, you can optimize the process by using hashmaps or dictionaries, but at the end of the day you will be searching and matching strings no matter what.
Dictionaries make the mapping from project to paragraphs easy, and regex still does the actual matching.
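A minimal sketch of that dictionary-plus-regex approach, assuming a keyword "matches" when it appears as a whole word, case-insensitively (the question doesn't define matching more precisely). The function name and the sample data are made up for illustration.

```python
import re
from collections import defaultdict


def assign_paragraphs(projects, keywords, paragraphs):
    """Map each project name to the paragraphs that share a keyword with it."""
    assignments = defaultdict(list)
    for keyword in keywords:
        # Whole-word, case-insensitive match; this rule is an assumption.
        word = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        matched_projects = [p for p in projects if word.search(p)]
        if not matched_projects:
            continue  # keyword matches no project, so nothing to assign
        matched_paragraphs = [t for t in paragraphs if word.search(t)]
        for project in matched_projects:
            for text in matched_paragraphs:
                if text not in assignments[project]:
                    assignments[project].append(text)
    return dict(assignments)


if __name__ == "__main__":
    projects = ["Payments Gateway", "Search Indexer"]
    keywords = ["payments", "indexer", "billing"]
    paragraphs = [
        "The payments team shipped refunds.",
        "Indexer latency dropped last week.",
        "Nothing relevant here.",
    ]
    for project, texts in assign_paragraphs(projects, keywords, paragraphs).items():
        print(project, "->", texts)
```

Every keyword is still checked against every project and every paragraph, so this is the same string matching as before, just with the results collected in a dictionary.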

Too Many Characters Included in Attempt to Parse a CSV File

Background
I am attempting to parse a CSV file using PCRE regular expressions. That is, making out (or extracting) the various different "cells" available in the CSV, to then put them in a somewhat nicely organized array containing all the parts that the process of parsing managed to make out.
The following regular expression is what I have come up with so far:
/(?:;|^)(?:(?:"(?:(?!"(;|$)).)*)|(?:([^;]*)))/g
I would highly recommend that you put this in a tester for regular expressions. Here is a slight bit of test data, that should match to a great extent.
"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"
The Problem
The regular expression manages to extract the correct number of parts. It is meant to retrieve the text from inside each and every "cell" available in the CSV (use the CSV provided above for reference). It does so only to some extent.
Here is the result of the groups in the regular expression above:
"There; \"be
;"but; someone spoke
hence the young man
hence the son
;"test;
As we can clearly see, for the cells that are "escaped" using double-quotation marks, the match also includes the opening " and sometimes even the leading semicolon. From my understanding, the group around the negative lookahead should not include those.
I have probably missed something very essential here. Perhaps someone can point me in the right direction towards a fix.
Edit and Potential Solution
It appears as though I might have managed to solve it. As opposed to what I said above, the negative lookahead does not actually appear to create a capture group, which I initially thought. As such, adding yet another group to the equation seems to parse out the segments I am after.
/(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))/g
I will, however, leave the question open for now, and will answer it myself if no other answer comes tumbling in. So as not to make it opinion-based, I would further ask whether there is a more efficient way, in terms of speed, than the one I am using above.
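For anyone who wants to try the revised expression outside a regex tester, here is a small sketch using Python's re module rather than PCRE. It assumes the two engines treat this particular pattern the same way; the function name split_cells is made up.

```python
import re

# The revised expression from the edit above, unchanged. Group 1 holds the
# text of a quoted cell and group 3 the text of an unquoted cell. Group 2
# sits inside the negative lookahead and never captures anything.
CELL = re.compile(r'(?:;|^)(?:(?:"((?:(?!"(;|$)).)*))|(?:([^;]*)))')


def split_cells(line):
    """Return the cell texts of one semicolon-separated line."""
    cells = []
    for match in CELL.finditer(line):
        quoted, unquoted = match.group(1), match.group(3)
        cells.append(quoted if quoted is not None else unquoted)
    return cells


if __name__ == "__main__":
    sample = r'"There; \"be";"but; someone spoke";hence the young man;hence the son;"test;"'
    for cell in split_cells(sample):
        print(cell)
```

The backslash in front of the escaped quote stays in the first cell, so un-escaping would still be a separate step.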

Regex and non-technical users

Given that:
You have some Key-Value data that can be modified
Modification is done by applying filters to the data. The filters that control what gets changed are created by non-technical people
The filters are set up using regular expressions. An example of a rule described as part of a filter may be "If a key matches some regex, replace value with some other value"
How would you:
If filters are to be produced by business people, who can't create regular expressions, in what form would you have them submit their input that would be easily translated to regex?
Agent Ransack contains a GUI editor for creating regular expressions from plain English, I would suggest taking a look at that and implementing your own variation of it.
See the screenshot for an example:
If it works for your case, I would go for "wildcard only" support, i.e. the asterisk * is the only special character allowed, and you translate it to the regex .*? in code.
Most non-technical people can grasp * meaning "anything".
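A rough sketch of that translation, assuming the whole key has to match the wildcard (the answer doesn't say whether partial matches should count). The names wildcard_to_regex and apply_filter are made up for illustration.

```python
import re


def wildcard_to_regex(wildcard):
    """Translate a '*'-only wildcard into a compiled regex.

    Everything except '*' is escaped, so characters such as '.' or '+'
    typed by a non-technical user are matched literally.
    """
    parts = wildcard.split("*")
    return re.compile(".*?".join(re.escape(part) for part in parts))


def apply_filter(data, wildcard, new_value):
    """Return a copy of data where values of matching keys are replaced."""
    pattern = wildcard_to_regex(wildcard)
    return {
        key: (new_value if pattern.fullmatch(key) else value)
        for key, value in data.items()
    }


if __name__ == "__main__":
    data = {"price.usd": 10, "priceXusd": 20, "name": "a"}
    print(apply_filter(data, "price.*", 0))
```

Escaping the literal parts is the important bit: without it, the dot in "price.*" would be treated as a regex metacharacter and "priceXusd" would match as well.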

regular expression to convert state names to abbreviations

I'm working on a project that requires only the use of regular expression to convert state names (must be case insensitive) to their two letter abbreviations.
I cannot use any sort of development environment or link to any databases or xml or ini files.
Please help!
Since state names don't have anything regular in them, regular expressions are the WRONG tool. I would suggest getting a new project.
However, the only solution (apart from stupid illogical hacks) is to hardcode every state:
s/Alabama/AL/i
s/Alaska/AK/i
...
s/Wyoming/WY/i
A list of the states and their abbreviations can be found here.
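If a scripting language ever becomes an option, the same hardcoded table can be sketched in Python. The table below only contains the three states shown above and would need the full list; the function name abbreviate is made up.

```python
import re

# Only the three states from the substitutions above; a complete table
# would have to list every state.
ABBREVIATIONS = {"alabama": "AL", "alaska": "AK", "wyoming": "WY"}

# One case-insensitive alternation over all hardcoded names.
_STATE = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in ABBREVIATIONS) + r")\b",
    re.IGNORECASE,
)


def abbreviate(text):
    """Replace every known state name in text with its abbreviation."""
    return _STATE.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)


if __name__ == "__main__":
    print(abbreviate("From ALASKA to wyoming via Alabama"))
```

This is still hardcoding every state; the regex only finds the names, and the lookup table does the actual conversion.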