What is the name for the type of "loose" search/filter that allows for other characters in the middle? - regex

I'm not sure which community this belongs in, feel free to suggest a better one if this doesn't fit here.
In Visual Studio Code, when searching for a file, you can CMD/Ctrl + P to bring up the Quick Open search box for finding a file by name. The search doesn't have to be the exact name and it filters as long as the search query contains the characters in that order, while being "loose" enough to ignore any characters between those.
Example:
Searching "cat" would show the following:
bigcat.txt
cat.txt
candlelight.txt
In the above, all the strings contain the characters "c", "a" and "t" in that order, even if there are other characters between them. The regex would probably be something like /.*c.*a.*t.*/.
Is there a name for this type of search/filter?

Fuzzy Filter/Search
After looking through VS Code's GitHub issues list, I found an issue that mentioned it.
I also found a node module that does this exact same thing.
There is also a Wikipedia entry on Approximate String Matching, which is similar to the above.
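For illustration, the in-order matching described in the question can be sketched in a few lines of Python. This is only a minimal sketch of the filtering step, not how VS Code implements Quick Open (which also ranks the results); the function names fuzzy_match and fuzzy_regex are made up for this example.

```python
import re

def fuzzy_match(query: str, candidate: str) -> bool:
    """Return True if the characters of query appear in candidate in order."""
    remaining = iter(candidate.lower())
    # 'ch in remaining' consumes the iterator up to and including the match,
    # so each query character has to be found after the previous one.
    return all(ch in remaining for ch in query.lower())

def fuzzy_regex(query: str) -> re.Pattern:
    """Build the equivalent of /c.*a.*t/ for an arbitrary query string."""
    return re.compile(".*".join(re.escape(ch) for ch in query), re.IGNORECASE)

files = ["bigcat.txt", "cat.txt", "candlelight.txt", "dog.txt"]
print([name for name in files if fuzzy_match("cat", name)])
print([name for name in files if fuzzy_regex("cat").search(name)])
```

Both lines print the three file names from the question and leave out dog.txt. The leading and trailing .* from the question's regex are not needed when the pattern is used with search rather than a full match.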

Related

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do the search for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
or
(1) Tj
(.) Tj
(1) Tj
or
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
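To make the point concrete, here is a rough Python sketch that pulls the literal string operands out of each of the fragments above and joins them. It is a toy, not a PDF parser: it assumes the character codes happen to be plain ASCII (which, as explained next, is often not the case) and it ignores escapes, nested parentheses and hex strings. The names LITERAL and shown_text are invented for this example.

```python
import re

# Matches a simple PDF literal string such as (1.1). Escaped or nested
# parentheses are deliberately not handled in this sketch.
LITERAL = re.compile(r"\(([^()\\]*)\)")

def shown_text(content_stream: str) -> str:
    """Concatenate the literal string operands found in a content-stream fragment."""
    return "".join(LITERAL.findall(content_stream))

variants = [
    "(1.1) Tj",
    "(1) Tj\n(.) Tj\n(1) Tj",
    "[(1) -0.1 (.) 0.1 (1)] TJ",
]

for fragment in variants:
    # A plain substring search only succeeds on the first variant, even though
    # all three fragments draw the same text.
    print(repr(shown_text(fragment)), "1.1" in fragment)
```

All three fragments yield the same joined string, but a naive search for '1.1' in the raw stream only finds the first one.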
Secondly, the character code in a PDF content stream need not be (and often is not) a Unicode code point, or ASCII, or any other standard coding scheme; it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try to find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce one of two things. The default is a faked-up text page which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. The alternative is an 'XML-like' output which tells you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this output to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regular expression(s) to search the files and find your matches.
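As a rough sketch of that last step, the following Python assumes the pages were already written out with something like gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=page-%03d.txt input.pdf. The page-NNN.txt naming, the HEADING pattern and the find_headings function are assumptions made for this example, and the pattern is a guess at what a numbered heading looks like. Python's re module has no \R, so the pattern works line by line with re.MULTILINE instead.

```python
import re
from pathlib import Path

# Assumed heading shape: "1.", "1.1", "1.1.1." or "IV.2", followed by a title.
HEADING = re.compile(
    r"^[ \t]*((?:[0-9IVX]+\.)+[0-9IVX]*)[ \t]+(\S.*)$",
    re.MULTILINE,
)

def find_headings(directory, pattern=HEADING):
    """Return (number, title, page) tuples found in page-NNN.txt files."""
    results = []
    for path in sorted(Path(directory).glob("page-*.txt")):
        # The page number is taken from the digits at the end of the file name.
        page = int(re.search(r"(\d+)$", path.stem).group(1))
        text = path.read_text(encoding="utf-8", errors="replace")
        for match in pattern.finditer(text):
            results.append((match.group(1), match.group(2).strip(), page))
    return results

def format_index(headings):
    """Render the tuples in the 'Heading/page number' style from the question."""
    return "\n".join(f"{num} {title}/{page}" for num, title, page in headings)
```

Calling print(format_index(find_headings("."))) in the directory holding the page files would list each match with its page. Expect false positives (numbered list items, table rows), since the text alone does not say what is a heading.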

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names, and I need to find all the possible variations of each word in this string variable in Stata.
For example, variations of the word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use regex to try to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos, and use each typo as a regex match when comparing against your actual input.
Making a regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) names for schools exist. In any case direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the misspelled versions. You can then save the typos to use them as a filter/match/pattern.
To get some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and finally get a list of all names that did not match the correct form or any of your regexp filters, so you can manually insert/correct them.
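As an alternative to hand-written typo patterns, similarity matching against a list of correct words may be worth a look. The sketch below uses Python's standard difflib module rather than regex or Stata, so it is a different technique from the one described above. The VOCABULARY list, the function names and the 0.6 cutoff are assumptions for illustration only.

```python
from difflib import get_close_matches

# Hypothetical list of correctly spelled words; in practice it would be built
# from the clean part of the data.
VOCABULARY = ["academy", "school", "college"]

def standardize_word(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest known word, or None if nothing is similar enough."""
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def standardize_name(name, vocabulary=VOCABULARY):
    """Fix each word of a name; also return the words that could not be matched."""
    fixed, unmatched = [], []
    for word in name.split():
        match = standardize_word(word, vocabulary)
        if match is None:
            unmatched.append(word)  # left as-is for manual review
            fixed.append(word)
        else:
            fixed.append(match)
    return " ".join(fixed), unmatched

print(standardize_name("Sunrise acdamey"))
```

The unmatched list plays the role of the leftover names mentioned in the P.S.: anything not close to a known word is kept for manual correction. A cutoff that is too low will merge names that are similar but genuinely different, so it needs checking against real data.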

how to find a specific word having random located newline

As I stated in the title.
I'm trying to write a regex that finds a specific word (like apple) that may contain a newline (\r\n) at a random position.
To illustrate in more detail:
Let's find the word 'apple' in a text file, where we don't know the exact position of the newline (\r\n) within the word, like below:
ap
ple
or
appl
e
I also googled many pages but couldn't find the answer.
Do I have to write a beginner regex like the one below?
(a\r\npple|ap\r\nple|app\r\nle|appl\r\ne|apple\r\n|)
I need a smarter regex to find the exact word.
Update:
The word can vary, like "ripe apple", "rotten apple" and "brightapple".
In the case of the third item, the white space was removed by the writer.
Update:
I have many txt files, and I have to find the string within those.
So removing \r\n is not practical (too much memory and time required).
If you want to use regular expressions, you have to take the naive approach (the "beginner regex"), since they belong to the type 3 grammars and cannot express the state needed (see also The difference between Chomsky type 3 and Chomsky type 2 grammar).
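The enumerated alternatives do not have to be typed by hand, though. A minimal sketch in Python (the function name newline_tolerant is made up for this example) builds a pattern that allows an optional \r\n between any two adjacent letters of the word. That covers every alternative in the hand-written list that splits the word.

```python
import re

def newline_tolerant(word: str, newline: str = r"\r\n") -> re.Pattern:
    """Build a pattern matching word with an optional line break between letters."""
    optional_break = f"(?:{newline})?"
    return re.compile(optional_break.join(re.escape(ch) for ch in word))

pattern = newline_tolerant("apple")
print(pattern.pattern)

samples = ["ap\r\nple", "appl\r\ne", "ripe apple", "brightapple", "app\r\nly"]
for text in samples:
    print(repr(text), bool(pattern.search(text)))
```

The pattern can be applied to one file at a time without stripping the line breaks first, so memory use stays bounded by the largest single file.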

internal code-completion in vim

There's a completion type that isn't listed in the vim help files (notably: insert.txt), but which I instinctively feel the need for rather often. Let's say I have the words "Awesome" and "SuperCrazyAwesome" in my file. I find an instance of Awesome that should really be SuperCrazyAwesome, so I hop to the beginning of the word, enter insert mode, and then must type "SuperCrazy".
I feel I should be able to type "S", creating "SCrazy", and then simply hit a completion hotkey or two to have it find what's to the left of the cursor ("S"), what's to the right ("Crazy"), regex this against all words in the file ("/S\w*Crazy/"), and provide me with a completion popup menu of choices, or just do the replace if there's only one match.
I'd like to use the actual completion system for this. There exists a "user defined" completion which uses a function, and the help has a good example of completing from a given list. However, I can't seem to track down many of the particulars that I'd need to make this happen, including:
How do I get a list of all words in the file from a vim function?
Can I list words from all buffers (with filenames), as vim's complete does?
How do I, in insert mode, get the text in the word before/after the cursor?
Can completion replace the entire word, and not just up to the cursor?
I've been at this for a couple of hours now. I keep hitting dead ends, like this one, which introduced me to \%# for matching with the cursor position, which doesn't seem to work for me. For instance, a search for \w*\%# returns only the first character of the word I'm on, regardless of where I'm in it. The \%# doesn't seem to anchor.
Although it's not exactly following your desired method, in the past I've written https://github.com/mjbrownie/swapit, which might perform your task if you are looking for related keywords. It would fall down in this scenario if you have hundreds of matches.
It's mainly useful for 2-10 possible sequenced matches.
You would define a list
:SwapList awesomes Awesome MoreAwesome SuperCrazyAwesome FullyCompletelyAwesome UnbelievablyAwesome
and move through the matches with the increment/decrement keys (Ctrl-A and Ctrl-X).
There are also a few other cycling-type plugins, like swap words, that I know of on vim.org and GitHub.
The advantage here is you don't have to group words together with regex.
I wrote something like that years ago when working with 3rd party libraries that had rather long CamelCasePrefixes in every function, different for each component. That was in the pre-GitHub era and I considered it a lost jewel, but a search engine says I am not a complete ass and did post it to the Vim wiki.
Here it is: http://vim.wikia.com/wiki/Custom_keyword_completion
Just do not ask me what 'MKw' means. No idea.
This will need some adaptation to your needs, as it is looking up only the word up to the cursor, but the idea is there. It works for current buffer only. Iterating through all buffers would be sluggish as it is not creating any index. For those purposes I would go with external grep.
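Independently of how it gets wired into Vim, the matching step the question asks for (text left of the cursor, any word characters, text right of the cursor) is small. Here it is as a Python sketch of the logic only, not a Vim plugin. The function name complete_around_cursor is invented for this example, and the example assumes the half-typed word is "SAwesome" with the cursor after the "S".

```python
import re

def complete_around_cursor(buffer_text: str, left: str, right: str) -> list:
    """Return words in buffer_text that start with left and end with right."""
    pattern = re.compile(re.escape(left) + r"\w*" + re.escape(right))
    candidates = []
    for word in re.findall(r"\w+", buffer_text):
        # Skip the half-typed word itself and avoid listing duplicates.
        if word == left + right or word in candidates:
            continue
        if pattern.fullmatch(word):
            candidates.append(word)
    return candidates

text = "Awesome is fine, SuperCrazyAwesome is better, SAwesome is half typed."
print(complete_around_cursor(text, "S", "Awesome"))
```

A completion function would offer this list in the popup menu, or do the replacement directly when it has exactly one entry.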

Emacs-style Regex in Info-reader?

I am a Vim-user lost in the Emacs-style Regex of Info-reader. I want to match:
$ info find
?How-in-Info-reader? :%s#\(\\;.*\\+\)\|\(\\+.*\\;\)#WORKS!#g
INFO: "C-X n" to go through the matches
I am looking for the Emacs-counterpart for the Vim-command marked with "?How-in-Info-reader?".
How can you find the matches in Info-reader?
For the standalone info reader, your choices are more limited than when using Emacs proper for browsing *info* pages.
I'm not familiar with the details of ?How-in-Info-reader, but there are two ways (that I can see) to search in the standalone info browser.
M-x index-apropos SOMESTRING
will give you a list of all the index nodes which contain SOMESTRING.
The other searches are C-s (interactive search) and / or s (non-interactive search), which look for a particular string in the current view (they don't drop down into the nodes).
I think you're trying to replace either backslash-semi-anystring-backslashes or backslashes-anystring-backslash-semi with "WORKS!" everywhere in the file. It doesn't look like info is an editor. It doesn't even look like it has regex searching. In emacs, I'd type esc-control-s (to get incremental regular expression search, which means you can try out expressions and see how they work).
Once you're in emacs, the search string you presented should work just fine if I've understood your question. You can also type Esc-r, and then type the first string ("\(\\;.*\\+\)\|\(\\+.*\\;\)"), a RETURN, and the replacement string ("#WORKS!").