Inverted index for regex? Regex search engine? [closed] - regex

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.
My reasoning is that a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there are quite a lot of things to search. This is why there are inverted indexes, I guess.
Maybe something similar? I don't really know.
Here's a description of my idea:
The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.
An example of a search: [^ ]*ell[^ ]* .*\.
Something like that, for example. The reasoning behind this is that sometimes I want to search for something that can't be found because of the limitations of normal search engines.
It'll be a simple sed-like regex, maybe a bit JavaScript-like.
They are all similar anyway (in the basics).
Edit: I've seen regular expression search engine, but it's not what I am asking. I'm wondering if it's possible to build one.
Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that I have indexed that have that length?
Combining those ideas, I just realized that this could work as multiple searches over a shrinking data source, until everything that is left matches the regex. E.g. ell.*\. would search for everything with an e, then everything with an l following the e, then everything with another l following the el, and then any number of characters followed by a literal dot.
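One way to picture the "shrinking data source" idea is a trigram index used as a prefilter: index every three-character substring, use a literal that any match must contain to narrow down the candidate documents, and run the real regex only on what is left. The following is a minimal sketch under stated assumptions, not a full engine: the function names (build_index, search) are made up for illustration, and the required literal ("ell" here) is passed in by hand rather than extracted from the regex automatically.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs, index, pattern, required_literal):
    """Narrow candidates with the index, then confirm with the full regex.

    required_literal is a hand-picked substring that every match must
    contain (an assumption of this sketch). If it is shorter than three
    characters, no narrowing happens and every document is checked.
    """
    candidates = set(range(len(docs)))
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(i for i in candidates if regex.search(docs[i]))


docs = ["hello world.", "yellow", "nothing here.", "a bell rings."]
index = build_index(docs)
print(search(docs, index, r"[^ ]*ell[^ ]* .*\.", "ell"))  # [0, 3]
```

The index only rules documents out; the final regex pass is still needed, because a document can contain the literal without matching the whole pattern (as "yellow" does here).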

Related

Recognize pattern of characters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I am a college student and I need some help with a character-matching pattern: what will it match? Or maybe somebody can explain how it works?
" (k[abc]*p)+ "
Thank you for any help.
This is a bit of a vague question, as you're basically asking how regular expressions work.
First of all, I would recommend 'Mastering Regular Expressions', which is a pretty great O'Reilly book on regex.
Also, for a regex playground, I really like Rubular (http://www.rubular.com/). Although it is meant for Ruby, it can give you a good understanding of regular expressions in general, and it comes with a nice quick reference guide.
Taking some time to figure this out yourself will be very helpful; regular expressions are not going away.
In this case, your expression is evaluating everything inside the () as one chunk. So it's looking for a k, then zero or more (*) of the characters a, b, or c ([abc]), followed by a p, with that whole chunk repeated at least one time (+).
So things like kp, kap, and kabcp will match, as will repetitions such as kapkbp.
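A quick way to check a reading like this is to run the pattern against a few strings. The sketch below uses Python's re.fullmatch; the choice of Python and the helper name matches are assumptions for illustration (the question does not name a language), and the spaces that appear inside the quotes in the question are left out.

```python
import re

# The pattern from the question, without the surrounding spaces.
PATTERN = re.compile(r"(k[abc]*p)+")


def matches(text):
    """True if the whole string is one or more k[abc]*p chunks."""
    return PATTERN.fullmatch(text) is not None


for sample in ["kp", "kap", "kabcp", "kapkbp", "kxp", "ka", "ap"]:
    print(sample, matches(sample))
```

The first four samples are accepted and the last three are rejected: x is not in [abc], "ka" has no closing p, and "ap" has no leading k.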

Is it OK to mix string parsing while learning reg ex? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm performing some regular expression exercises in Python 2.7.3, on Windows 7. Some of these exercises have me looking for similar patterns on the same line. For example, I want to use regex to capture name1 and name2...
<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>
Would it be "cheating" or "missing the point" if I used any string parsing to capture name2? I feel like using regex the correct way alone should be able to capture both of those names, but string parsing is what I've always been familiar with.
An analogy would be like someone studying recursion in C++, but using a While loop. Recursion should NOT have any While loops (although of course it may be part of some other grand design).
Good question! Many beginners come into it believing they should be able to do everything with one regex match. After all, people are always saying how powerful regexes are, and what you're trying to do is so simple...
But no, the regex is responsible for finding the next match, that's all. Retrieving the substring that it matched, or finding multiple matches, or performing substitutions, that's all external to the act of matching the regex. That's why languages provide methods like Python's findall() and sub(): they do the kind of "string parsing" operations you're talking about, so you don't have to.
It occurred to me a while back that the process of mastering regexes is one of learning everything you can't do with them, and why not. Understanding which parts of the regex matching operation are performed by the regex engine, and which parts are the responsibility of the enclosing language or tool, is a good start.
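As a small illustration of that division of labour, here is a sketch in which one pattern describes a single name and findall() handles the repetition. It assumes the square brackets in the sample line are literal characters (they may only be placeholders in the question), and the helper name extract_names is made up for the example.

```python
import re

# Sample line copied from the question as written.
line = "<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>"

# The regex only describes ONE bracketed name; it knows nothing about
# how many there are on the line.
NAME = re.compile(r"\[([^\]]+)\]")


def extract_names(text):
    """Let findall() do the looping and substring extraction."""
    return NAME.findall(text)


print(extract_names(line))  # ['name1', 'name2']
```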

Transliteration between different writing systems [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I need to learn how to convert a text from one transliteration to another writing system. Apparently the best way would somehow involve regular expressions and Perl, probably from the command line? I've used regular expressions before in Notepad++ and TextWrangler, so I know some basics already. If there is some really good (and relatively easy and customizable) way to do this in Ruby or something else, I can start learning that as well. There is a constant need to transliterate linguistic sample texts in my field, Uralic linguistics, where many different variants of transliteration systems are used. So it is worth investing some time.
So the material I have now consists of lines with a sentence on each line. Some lines have other data like numbers, but those should stay as they are. I want to keep the punctuation marks as they are; this is just about converting one set of Unicode letter characters to another. I searched the site, but a lot was about converting from ASCII to Unicode and so on, which is not the problem here.
So the original text is like this (in broad Finno-Ugric Transcription):
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.
And I would need it in a form like this:
мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.
This continues for some thousand lines.
There is a clear correspondence between the characters used, but it is sometimes complex and involves dealing first with some digraphs and consonant + vowel combinations, etc. As you see from the example, in some situations Latin i corresponds to Cyrillic и, but in some positions it can remain as i. Different texts have different solutions, so I would need to adjust the rules in each case. I understand I would need to run a long series of regular expressions in a very specific order to make it work. This order I will figure out myself, but I need to know what kind of tool I have to feed these rules into and how to do it.
I also often have situations where I would like to have the original sentence and the transliterated one separated by a tab, so that the lines would have a form like this:
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.	мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!
Niko
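The workflow described in the question (an ordered series of replacements, digraphs first, with everything else left untouched) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the rule table is a tiny, hypothetical subset and not a real Finno-Ugric to Cyrillic mapping, context-dependent rules such as the i/и case are not covered, and the function names are made up.

```python
import re
import unicodedata

# Hypothetical, incomplete rule table, applied top to bottom.
# Order matters: the digraph "šč" must be handled before "š" and "č".
RULES = [
    ("šč", "щ"),
    ("š", "ш"),
    ("č", "ч"),
    ("ć", "ч"),
    ("ö", "ӧ"),
    ("k", "к"), ("u", "у"), ("r", "р"), ("s", "с"),
    ("v", "в"), ("y", "ы"), ("l", "л"),
]


def nfc(text):
    """Normalize so precomposed and combining forms compare equal."""
    return unicodedata.normalize("NFC", text)


COMPILED = [(re.compile(re.escape(nfc(src))), nfc(dst)) for src, dst in RULES]


def transliterate(line):
    """Apply every rule in order; unlisted characters stay as they are."""
    text = nfc(line)
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return text


def with_original(line):
    """Return the original and the transliteration separated by a tab."""
    return line + "\t" + transliterate(line)


print(with_original("kurs vylö."))
```

Each entry is compiled as a regex, so a literal rule can later be swapped for a context-sensitive pattern (for example one using a lookbehind) without changing the loop. Because the rules run one after another, a replacement must never produce text that a later rule would match again; mapping Latin to Cyrillic avoids that here.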

Can Regular Expressions search for groups no matter the order or whether they all exist? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 8 years ago.
So I want to search for A,B,C,D in a string in any order, but if C doesn't exist I still want it to give me A,B, and D, etc.
To be more specific, here is the exact problem I'm trying to solve. CSV file with lines that look like this:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
However, the W,H,M,P could be in any order. Plus they don't all exist on every line. So it looks more like this:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,(H)5555555,(P)5555555,(W)5555555,(M)5555555
Name,(M)5555555,(H)5555555,,
Name,(P)5555555,,,
What I need to accomplish is to put all items in the correct order so they line up under the correct columns. So the above should look like this when I'm done:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,,(H)5555555,(M)5555555,
Name,,,,(P)5555555
Edit: It appears I'm a bad Stack Overflow citizen. I didn't get answers fast enough for when my project needed to be done, and therefore forgot to come back and correct the issues in my post. I ended up writing a Python script to do this instead of just using find/replace in BBEdit or Sublime Text 2 like I was originally trying to do.
So I would like a method to do something like this that works in either BBEdit or Sublime Text. Or Vim for that matter. I'll try to keep a better eye on it this time, and I'll respond to the answers that already exist.
If your regex flavor supports lookarounds, this can be done with a simple regex-replace. Since lookaheads do not advance the position of the regex engine's cursor, we can use them to look for multiple patterns somewhere after one particular position. We can capture all these findings and write them back in the replacement string. To make sure that all of them are optional we could simply use ?, but in this case, I'll add an empty alternative to the lookahead - this is necessary to trick the engine when it's backtracking. The pattern could then look like this:
^Name,(?=.*([(]W[)]\d+)|)(?=.*([(]H[)]\d+)|)(?=.*([(]M[)]\d+)|)(?=.*([(]P[)]\d+)|).*
The .* at the end is to make sure that everything gets removed in the replacement.
And the replacement string like this:
Name,$1,$2,$3,$4
Here is a working demo using the ECMAScript flavor. It's a rather limited flavor, so this solution should be adaptable to most environments.
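For readers who want to try this outside a demo page, here is a hedged port to Python's re module. The answer itself targets the ECMAScript flavor; the function name reorder is an assumption, and the sketch relies on re.sub replacing unmatched groups with an empty string, which is the behaviour in Python 3.5 and later. Like the pattern above, it only handles lines that start with the literal text Name.

```python
import re

# Same pattern as in the answer, split across lines for readability.
# Each lookahead captures one tagged number if it exists anywhere on the
# line; the empty alternative lets the lookahead succeed when it does not.
PATTERN = re.compile(
    r"^Name,"
    r"(?=.*([(]W[)]\d+)|)"
    r"(?=.*([(]H[)]\d+)|)"
    r"(?=.*([(]M[)]\d+)|)"
    r"(?=.*([(]P[)]\d+)|)"
    r".*"
)


def reorder(line):
    """Rewrite one line so the fields appear in W, H, M, P order."""
    return PATTERN.sub(r"Name,\1,\2,\3,\4", line)


for row in [
    "Name,(H)5555555,(P)5555555,(W)5555555,(M)5555555",
    "Name,(M)5555555,(H)5555555,,",
    "Name,(P)5555555,,,",
]:
    print(reorder(row))
```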
Something like this?
^Name,(\((?:W|H|P|M)\)\d+(?:,)?)*[,]*$
This will give you all the matches per row. Then you simply need to allocate each match to the right column.
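The "allocate each match to the right column" step is not spelled out above. One possible sketch in Python follows; the column order W, H, M, P is taken from the question, while the function name align and the use of finditer with a per-item pattern (rather than the exact row pattern above) are assumptions.

```python
import re

# Matches a single tagged number such as "(W)5555555".
ITEM = re.compile(r"\(([WHMP])\)\d+")

# Target column order from the question.
COLUMNS = "WHMP"


def align(line):
    """Place each tagged number under its column, leaving gaps empty."""
    name = line.split(",", 1)[0]
    by_tag = {m.group(1): m.group(0) for m in ITEM.finditer(line)}
    return ",".join([name] + [by_tag.get(tag, "") for tag in COLUMNS])


print(align("Name,(M)5555555,(H)5555555,,"))
```

Unlike the literal Name in the patterns above, this keeps whatever text comes before the first comma, so it also works when the first field varies from row to row.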

How to use regex to find if a text contains specific words? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions must demonstrate a minimal understanding of the problem being solved. Tell us what you've tried to do, why it didn't work, and how it should work. See also: Stack Overflow question checklist
Closed 9 years ago.
I am using a file search tool which can use regex to find files which contain certain text. My regex skills are pretty simple. (I am going to assume the file is treated like a single text with some line breaks)
Let's say I want to find files which contain these 3 words: route, boy & skill.
How can I create two regexes: one to search for those words where each word needs to be a whole word (white space before or after, or at the beginning or end of a line), and another where one or more of the words could be part of another word (like a substring function)?
Update
I am not interested in regex tutorials and testers. If I need one, I certainly can google for one and find dozens. This is a regex that I simply can't create but which I will use over and over in that tool. Maybe regex doesn't support what I want and a regex expert can tell me that's the case. So no amount of regex tutorials and testers is going to help. I appreciate the links but they are not going to help me here.
Try the following regular expression:
(?=.*\broute\b)(?=.*\bboy\b)(?=.*\bskill\b)
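To see how this behaves, and how the substring variant asked about in the question differs, here is a small Python sketch. The helper name all_words_regex is made up, and the use of re.DOTALL is an assumption that the tool treats the file as one text in which . may need to cross line breaks; the search tool from the question may offer a similar option under another name.

```python
import re


def all_words_regex(words, whole_word=True):
    """Build one lookahead per word so the words may appear in any order.

    With whole_word=True each word is wrapped in \\b word boundaries;
    with whole_word=False the word may be part of a longer word.
    """
    boundary = r"\b" if whole_word else ""
    lookaheads = "".join(
        f"(?=.*{boundary}{re.escape(word)}{boundary})" for word in words
    )
    # DOTALL lets .* run across line breaks in a multi-line file.
    return re.compile(lookaheads, re.DOTALL)


words = ["route", "boy", "skill"]
whole = all_words_regex(words)
partial = all_words_regex(words, whole_word=False)

text = "boys with skills en route"
print(bool(whole.search(text)), bool(partial.search(text)))  # False True
```

The whole-word version rejects the sample text because "boys" and "skills" only contain the words as substrings, while the substring version accepts it.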