False word elimination using Regex replacement - regex

I need to perform a content/keyword-based search across a list of files. For that I need to extract the keywords and store them in a MySQL database. The keywords are extracted in the following manner:
Read the file content
Remove special characters and additional white spaces if any using
Regex.Replace(input, "[^a-zA-Z0-9_]+", " ")
Remove am/is/are/be/being/been, have/has/having/had, do/does/doing/did, adjectives, phrases, adverbs, etc.
Remove endings like:
-IC-ATION fortification
-IC-ITY electricity
-IC-MENT fantastically
-AT-IV contemplative
-AT-OR conspirator
-IV-ITY relativity
-IV-MENT instinctively
-ABLE-ITY incapability
-ABLE-MENT charitably
-OUS-MENT famously
Can I do the whole operation using a single regular expression? Is there a simpler method for this? Here I have a reference algorithm for this operation.

I don't think it would be possible to implement a stemming algorithm using regular expressions exclusively. Maybe you should take a look at already existing implementations to get ideas. Here is a link to the Porter stemming algorithm in VB.net
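
A minimal Python sketch of the pipeline from the question, written as separate steps rather than one expression. The stop-word set and the suffix list are small illustrative assumptions drawn from the examples in the question; this is not the Porter algorithm, and the names strip_suffix and extract_keywords are hypothetical.

```python
import re

# Assumption: only the auxiliary verbs listed in the question.
STOP_WORDS = {
    "am", "is", "are", "be", "being", "been",
    "have", "has", "having", "had",
    "do", "does", "doing", "did",
}

# Assumption: a toy suffix list, longest first. A real stemmer has many more rules.
SUFFIXES = ("ication", "icity", "ative", "ivity", "ator")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text: str) -> list[str]:
    """Clean the text, drop stop words, then strip suffixes."""
    # Same character class as the Regex.Replace call in the question.
    cleaned = re.sub(r"[^a-zA-Z0-9_]+", " ", text).lower()
    return [strip_suffix(w) for w in cleaned.split() if w not in STOP_WORDS]
```

For example, extract_keywords("Electricity is being tested!") gives ['electr', 'tested']. Only the first step is a plain regex replacement; the other two depend on word lists and length checks, which is where a full stemmer adds many more rules.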

Related

How can I create a Regex that matches and transforms a period delimited path?

I am using den4b Renamer to rename a lot of files that follow a specific pattern. The program allows me to use RegEx: (https://www.den4b.com/wiki/ReNamer:Regular_Expressions)
I am stuck trying to conjure up an expression for a specific pattern.
My current RegEx:
Expression: ^(com\.)(([\w\s]*\.){0,4})([\w\s]*)$
Replace: \L$1\L$2\u$4
Note: \L and \u transform the sub-expression to lower and upper case, respectively, as defined in the table below:
Here are a few example strings so you can get an idea of the input:
Android File Transfer.svg
Angular Console.svg
Au.Edu.Uq.Esys.Escript.svg
Avidemux.svg
Blackmagic Fusion8.svg
Broken Sword.svg
Browser360 Beta.svg
Btsync GUI.svg
Buttercup Desktop.svg
Calc.svg
Calibre EBook Edit.svg
Calibre Viewer.svg
Call Of Duty.svg
com.GitHub.Plugarut.Pwned Checker.svg
com.GitHub.Plugarut.Wingpanel Monitor.svg
com.GitHub.Rickybas.Date Countdown.svg
com.GitHub.Spheras.Desktopfolder.svg
com.GitHub.Themix Project.Oomox.svg
com.GitHub.Unrud.Remote Touchpad.svg
com.GitHub.Unrud.Video Downloader.svg
com.GitHub.Weclaw1.Image Roll.svg
com.GitHub.Zelikos.Rannum.svg
com.Gitlab.Miridyan.Mt.svg
com.Inventwithpython.Flippy.svg
com.Neatdecisions.Detwinner.svg
com.Rafaelmardojai.Share Preview.svg
com.Rafaelmardojai.Webfont Kit Generator.svg
Distributor Logo Antix.svg
Distributor Logo Archlabs.svg
Distributor Logo Dragonflybsd.svg
DOSBox.svg
Drawio.svg
Drweb GUI.svg
For this question I am focused on the strings that begin with com.xxx.xxx.
Since I can't only target those names in Renamer, the expression has to "play nice" with the other input file names and correctly leave them alone. That's why I've prefixed my expression with ^(com\.)
What I want:
Transform the entire string to lower case except for the last period separated part of the string.
Strip white space from the entire string.
For instance:
Original: com.GitHub.Alcadica.Develop.svg
After my Regex: com.github.alcadica.Develop.svg
What I want: com.github.alcadica.Develop.svg
This specific file is correctly renamed. What I'm having trouble with are names that have spaces in any part of the string. I can't figure out how to strip whitespace:
Original: com.Belmoussaoui.Read it Later.svg
After my Regex: com.belmoussaoui.Read it Later.svg
What I want: com.belmoussaoui.ReaditLater.svg
Here is a hypothetical example because I couldn't find a file with more than four parts. I want my pattern to be robust enough to handle this:
Original: com.Shatteredpixel.Another Level.Next.Pixel Dungeon.svg
After my Regex: com.shatteredpixel.another level.next.Pixel Dungeon.svg
What I want: com.shatteredpixel.anotherlevel.next.PixelDungeon.svg
Note that since I'm not using any kind of programming language, I don't have access to common string operations like trim, etc. I can, however, stack expressions. But this would create more overhead and since I am renaming thousands of files at a time I'd ideally like to keep it to one find/replace expression.
Any help would be greatly appreciated. Please let me know if I can provide any more information to make this more clear.
Edit:
I got it to work with the following rules:
Really inefficient, but it works. (Thanks to Jeremy in the comments for the idea)
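
For reference outside ReNamer, the intended transformation can be pinned down with a short Python sketch. This is not a ReNamer rule; the function name rename is hypothetical, and it assumes the text after the final period is the file extension and is left untouched.

```python
import re


def rename(filename: str) -> str:
    """Lowercase every part except the last and strip whitespace, for com.* names only."""
    stem, dot, ext = filename.rpartition(".")
    # Leave names without an extension, or not starting with "com.", alone.
    if not dot or not stem.startswith("com."):
        return filename
    # Remove whitespace inside every period-separated part.
    parts = [re.sub(r"\s+", "", part) for part in stem.split(".")]
    # Lowercase all parts except the last one.
    parts[:-1] = [part.lower() for part in parts[:-1]]
    return ".".join(parts) + "." + ext
```

With this sketch, "com.Belmoussaoui.Read it Later.svg" becomes "com.belmoussaoui.ReaditLater.svg", while names such as "Calibre Viewer.svg" pass through unchanged. It can serve as a check on what the stacked ReNamer rules produce.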

Scanning a language with non-delimited strings with nested tokens

I want to create a lexer/parser for a language that has non-delimited strings.
Which part of the language is a string is defined by the command preceding it.
For example it has statements that look like this:
pause 5
alert Hello world[CRLF] this contains 'pause' once (1)
Alert in this instance can end with any string, including keywords and numbers.
Further complicating things, the text can contain tags like [CRLF] that I want to separate too.
Ideally I'd want this to be broken up into:
[PAUSE][INT 5]
[ALERT][STR "Hello world"][CRLF][STR " this contains 'pause' once (1)"]
I'm currently using flex but from what I've gathered this kind of thing isn't possible with flex.
How can I achieve what I want here?
(Since one of your tags is "regex", I'll suggest a non-flex approach.)
From the example, it seems like you could just:
match each line against ^(\w+) (.+) to obtain command and arguments-text, and then
get individual arguments by splitting the arguments-text on (\[\w+\]) (assuming your regex library's split function can return both the splitter-strings and the split-strings).
It's possible your actual situation is more complex and something like flex makes more sense, but I'm not really seeing it so far.
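
The two steps above can be sketched in Python, whose re.split returns the separators as well when the pattern contains a capturing group. The token names and the assumption that alert is the only string-taking command (with every other command taking an integer) are illustrative, not part of the original language.

```python
import re

LINE_RE = re.compile(r"^(\w+) (.+)$")   # command, then the rest of the line
TAG_RE = re.compile(r"(\[\w+\])")       # tags such as [CRLF]; the group keeps them in split()

# Assumption: commands whose argument is a free-form string.
STRING_COMMANDS = {"alert"}


def tokenize(line: str) -> list[tuple]:
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"cannot parse line: {line!r}")
    command, args = match.groups()
    tokens = [("CMD", command.upper())]
    if command.lower() in STRING_COMMANDS:
        for piece in TAG_RE.split(args):
            if not piece:
                continue  # split() yields empty strings around adjacent tags
            if TAG_RE.fullmatch(piece):
                tokens.append(("TAG", piece[1:-1]))
            else:
                tokens.append(("STR", piece))
    else:
        # Assumption: every other command takes a single integer.
        tokens.append(("INT", int(args)))
    return tokens
```

Because the command is read first, keywords and numbers inside the alert text are never re-scanned as tokens.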

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names and I need to find all the possible combination of each word in this string variable in stata:
For example variation of a word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use regex to try to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and use each typo as a regex match when comparing against your actual input.
Writing a regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) names for schools exist. In any case direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and finally get a list of all names that did not match the correct form or any of your regexp filters, so you can manually insert/correct them.
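
The typo-list idea can be sketched in Python (not Stata) as follows. The TYPOS table holds only the misspellings from the question, and the names build_patterns and standardize are hypothetical; a real list would be collected from the data.

```python
import re

# Assumption: a hand-collected table mapping each correct word to its known typos.
TYPOS = {
    "Academy": ["acdamey", "aacdemy", "dmcaamy", "aacedmy"],
}


def build_patterns(typos: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """Compile one whole-word, case-insensitive alternation per correct word."""
    return {
        correct: re.compile(
            r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
            re.IGNORECASE,
        )
        for correct, variants in typos.items()
    }


PATTERNS = build_patterns(TYPOS)


def standardize(name: str) -> str:
    """Replace every known typo in a school name with its correct form."""
    for correct, pattern in PATTERNS.items():
        name = pattern.sub(correct, name)
    return name
```

Names that come back unchanged and still do not contain a correct form are the ones to review by hand and add to the table.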

How to find a specific word containing a randomly located newline

As stated in the title, I'm trying to write a regex that finds a specific word (like apple) that may contain a newline (\r\n) at a random position.
In more detail:
Let's find the word 'apple' in a text file, where we don't know the exact position of the newline (\r\n), as below:
ap
ple
or
appl
e
I googled many pages but couldn't find an answer.
Do I have to write a beginner regex like the one below?
(a\r\npple|ap\r\nple|app\r\nle|appl\r\ne|apple\r\n|)
I need a smarter regex to find the exact word.
Update:
The word can vary, like "ripe apple", "rotten apple" and "brightapple".
In the third case, the white space was removed by the writer.
Update:
I have many txt files and have to find the string within them.
So removing \r\n is not practical (it would require too much memory and time).
You have to take the naive approach ("beginner regex") if you want to use regular expressions, since they belong to the type 3 grammars and cannot express the state needed (see also The difference between Chomsky type 3 and Chomsky type 2 grammar)
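
The naive pattern does not have to be typed by hand. As a sketch in Python, it can be generated by placing an optional \r\n between every pair of letters. The helper name newline_tolerant is hypothetical. Note that this version is slightly broader than the enumerated alternation, because it also accepts more than one break inside the word.

```python
import re


def newline_tolerant(word: str) -> re.Pattern:
    """Build a pattern matching `word` with an optional CRLF between any two characters."""
    optional_break = r"(?:\r\n)?"
    return re.compile(optional_break.join(re.escape(ch) for ch in word))


APPLE = newline_tolerant("apple")
# The generated pattern is: a(?:\r\n)?p(?:\r\n)?p(?:\r\n)?l(?:\r\n)?e
```

APPLE.search(text) then finds "apple", "ap\r\nple" and "brightappl\r\ne" alike, without removing the line breaks from the files first.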

Regex/Textmate Confusion

I'm trying to create a Textmate snippet, but have run into some difficulties. Basically, I want to type in a Name and split it into its parts.
Example,
Bill Gates: (Bill), (bill), (Gates), (gates), (Bill Gates), (Bill gates), (bill Gates), (bill gates)
EDIT**
So I most certainly can produce these results quite simply if I was using a programming language. For example, I could split the words and then call the uppercase or lowercase functions to produce this output.
But in my situation I am using TextMate and its regular expression capabilities to create a tab snippet. I want to type some trigger key, e.g. doit, press tab and then type in a username. Then the output above will be created. This won't save me that much time, but I feel like I come across this sort of thing in TextMate quite frequently and want to figure it out.
I have been using this as a reference, but still don't know how to use regexps to be selective with the words and upper- and lowercase the values (\u \U \l \L)
http://manual.macromates.com/en/snippets
You can use Ruby for TextMate snippets. That should make it simpler.