Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I need to learn how to change a transliteration of a text to another writing system. Apparently the best way would somehow involve regular expressions and perl, probably from command line? I've been using regular expressions earlier in Notepad++ and TextWrangler, so I know some basics already. If there is some really good (and relatively easy and customizable) way to do this in Ruby or something else, I can start learning that as well. There is a constant need to transliterate linguistic sample texts in my field in Uralic linguistics, where many different variants of transliteration systems are used. So it is worth investing some time.
So the material I have now consists of lines with a sentence on each line. Some lines have other data like numbers, but those should stay as they are. I want to keep the punctuation marks as they are, this is just about converting one set of unicode letter characters to another. I searched the site but a lot was about converting from ascii to unicode and so on - this is not the problem here.
So the original text is like this (in broad Finno-Ugric Transcription):
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.
And I would need it in a form like this:
мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.
This continues for some thousand lines.
There is a clear correspondence between the characters used, but it is sometimes complex and involves dealing first with some digraphs, consonant + vowel combinations, etc. As you see from the example, in some situations Latin i corresponds to Cyrillic и, but in some positions it can remain as i. Different texts have different solutions, so I would need to adjust the rules in each case. I understand I would need to run a long series of regular expressions in a very specific order to make it work. I will figure out this order myself, but I need to know what kind of tool I should feed these rules into and how to do it.
I also have often situations where I would like to have the original sentence and transliterated one separated by a tab, so that the lines would have a form like this:
mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.	мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!
Niko
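A minimal sketch of the ordered-rules idea in Python, assuming the input is UTF-8 text with one sentence per line. The rule table is hypothetical and deliberately incomplete: it only illustrates that longer sequences such as digraphs are listed before single letters, and the real correspondences have to be supplied for each text. The names RULES, transliterate and with_original are made up for this example.

```python
import re
import unicodedata

# Hypothetical, incomplete rule list: (regex, replacement) pairs applied in order.
# Digraphs come first so that their parts are not consumed by later rules.
RULES = [
    (r"šč", "щ"),
    (r"ć", "ч"),
    (r"ö", "ӧ"),
    (r"m", "м"),
    (r"d", "д"),
    (r"v", "в"),
]

# NFC-normalize the patterns so that precomposed and decomposed input
# (e.g. "š" vs. "s" + combining caron) are treated the same way.
COMPILED = [(re.compile(unicodedata.normalize("NFC", p)), r) for p, r in RULES]


def transliterate(line: str) -> str:
    """Apply every rule in order; characters no rule matches stay as they are."""
    out = unicodedata.normalize("NFC", line)
    for pattern, replacement in COMPILED:
        out = pattern.sub(replacement, out)
    return out


def with_original(line: str) -> str:
    """Return the original line and its transliteration separated by a tab."""
    return f"{line}\t{transliterate(line)}"
```

Digits and punctuation pass through untouched because no rule mentions them. To process a whole file, one could loop over its lines and print transliterate(line) or with_original(line) for each; changing the system then means editing only the rule list.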
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm performing some regular expression exercises in Python 2.7.3, on Windows 7. Some of these exercises have me looking for similar patterns on the same line. For example, I want to use regex to capture name1 and name2...
<XML tag><more tags>[name1]</XML tag><XML2 tag>[name2]<XML2 tag></more tags>
Would it be "cheating" or "missing the point" if I used any string parsing to capture name2? I feel like using regex the correct way alone should be able to capture both of those names, but string parsing is what I've always been familiar with.
An analogy would be like someone studying recursion in C++, but using a While loop. Recursion should NOT have any While loops (although of course it may be part of some other grand design).
Good question! Many beginners come into it believing they should be able to do everything with one regex match. After all, people are always saying how powerful regexes are, and what you're trying to do is so simple...
But no, the regex is responsible for finding the next match, that's all. Retrieving the substring that it matched, finding multiple matches, or performing substitutions is all external to the act of matching the regex. That's why languages provide methods like Python's findall() and sub(): to do the kind of "string parsing" operations you're talking about, so you don't have to.
It occurred to me a while back that the process of mastering regexes is one of learning everything you can't do with them, and why not. Understanding which parts of the regex matching operation are performed by the regex engine, and which parts are the responsibility of the enclosing language or tool, is a good start.
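To illustrate that division of labour, here is a small Python sketch. The pattern only describes what one bracketed name looks like; re.findall() does the repeated matching and re.sub() does the replacing. The sample line is a simplified, made-up stand-in for the one in the question.

```python
import re

# Made-up sample line with two bracketed names, loosely modelled on the question.
line = "<a><b>[name1]</a><c>[name2]</c></b>"

# The regex describes a single match: "[", one or more non-"]" characters, "]".
NAME = re.compile(r"\[([^\]]+)\]")

# The surrounding language handles "find every match" ...
names = NAME.findall(line)

# ... and "replace every match" (here: drop the brackets, keep the name).
stripped = NAME.sub(r"\1", line)
```

Neither call requires the regex itself to know how many names a line contains.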
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I am struggling with a homework problem. I have tried this problem for hours literally. I found a similar question here, but it is not exactly my problem.
The homework problem says 1. (20 points) Construct regular expressions for the following languages.
a) All strings of digits with at most one repeated digit.
The only way I see how this is possible would be to exhaustively somehow take care of every possible case. There are 10 different digits, so it's like A LOT of different cases. I think the max length string can be 11, because after 11, you have to have a second repeated digit. So the number of possible combinations is 10^11. I thought even about writing a DFA and just converting it to a regex, but even that seems like it's impossible.
Does anyone have any advice? We aren't allowed to use non-standard regex features, like groups, lookahead, etc. This is just a plain old regex kind of problem.
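A small Python sketch for experimenting with the language, under one assumed reading of the problem: every digit occurs at most once, except that at most one digit may occur exactly twice (this is the reading that gives the maximum length of 11 mentioned above; the assignment may intend something else). It is a membership check plus a brute-force enumeration over a small alphabet, not a regular expression. The names in_language and enumerate_language are made up for this example.

```python
from collections import Counter
from itertools import product


def in_language(s: str, alphabet: str = "0123456789") -> bool:
    """Assumed reading: no symbol occurs 3+ times, at most one occurs twice."""
    if any(ch not in alphabet for ch in s):
        return False
    counts = Counter(s)
    if any(c > 2 for c in counts.values()):
        return False
    return sum(1 for c in counts.values() if c == 2) <= 1


def enumerate_language(alphabet: str) -> list:
    """List every string in the language; only feasible for tiny alphabets."""
    max_len = len(alphabet) + 1  # longer strings would need a second repeat
    return [
        "".join(p)
        for n in range(max_len + 1)
        for p in product(alphabet, repeat=n)
        if in_language("".join(p), alphabet)
    ]
```

Under this reading the language is finite, so it is regular and could in principle be written as one large union of strings; a two-symbol alphabet such as "01" is small enough to list in full and look for structure before attempting ten digits.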
Response to comment:
It is not binary. I already asked the teacher.
"Commenters, “regular expression” has one well-defined meaning in computer science. Since this is homework, it’s almost certainly that which is meant (and even more so as it talks of “languages”), and not some specific library. There’s no ambiguity here, and no clarification needed." This is basically what we want. The standard regex stuff often used in theoretical CS classes. As far as what we learned in class, I go to USC if anyone is familiar with that and we only barely talked about this at all. We're onto a completely different topic now.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I was wondering if it would be possible at all to build an inverted index for all possible regular expressions... I have had a few ideas, but they are extremely vague at the moment.
My reasoning is that a search engine that uses regex would be pretty useful (I'm sure many people would agree), although the problem with a search engine is that there are quite a lot of things to search. This is why there are inverted indexes, I guess.
Maybe something similar? I don't really know.
Here's a description of my idea:
The search engine should be a regex search engine. Instead of being like a normal search engine which only matches words, this will match specific regex specified by the user.
An example of a search: [^ ]*ell[^ ]* .*\.
Something like that, for example. The reasoning behind this is that sometimes I want to search for something that can't be found because of the limitations of normal search engines.
It'll be a simple sed-like regex, maybe a bit JavaScript-like.
They are all similar anyway (in the basics).
Edit: I've seen regular expression search engine, but it's not what I am asking. I'm wondering if it's possible to build one.
Edit 2: Maybe an inverted index that has bits of words, and numbers (and their length), etc. Maybe some kind of table where I can quickly pick things out, so if I have a number of a certain length in my regex, I can quickly filter all the numbers that I have indexed that have that length?
Combining those ideas, I just realized that it could perhaps be done as multiple searches over a shrinking data source, until everything that is left matches the regex. E.g. ell.*\. would search for everything with an e, then everything with an l following the e, then everything with another l following the el, and then any number of characters followed by a period.
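A rough Python sketch of the "index bits of words, then shrink the candidate set" idea, using three-character substrings (trigrams) as the indexed bits. This is only one possible design and not a full regex index: it assumes the caller already knows a literal string that every match must contain (here "ell"), and working out such literals from an arbitrary regex automatically is not shown. All function names are made up for this example.

```python
import re
from collections import defaultdict


def trigrams(text: str) -> set:
    """All three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs: list) -> dict:
    """Inverted index: trigram -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs: list, index: dict, pattern: str, required_literal: str) -> list:
    """Shrink the candidates via the index, then run the real regex on the rest."""
    candidates = set(range(len(docs)))
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    rx = re.compile(pattern)
    return [docs[i] for i in sorted(candidates) if rx.search(docs[i])]
```

If the required literal is shorter than three characters, no filtering happens and every document is scanned, which shows the limit of the approach: a regex with no literal part cannot be narrowed down this way.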
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 8 years ago.
So I want to search for A,B,C,D in a string in any order, but if C doesn't exist I still want it to give me A,B, and D, etc.
To be more specific, here is the exact problem I'm trying to solve. CSV file with lines that look like this:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
However, the W,H,M,P could be in any order. Plus they don't all exist on every line. So it looks more like this:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,(H)5555555,(P)5555555,(W)5555555,(M)5555555
Name,(M)5555555,(H)5555555,,
Name,(P)5555555,,,
What I need to accomplish is to put all items in the correct order so they line up under the correct columns. So the above should look like this when I'm done:
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,(W)5555555,(H)5555555,(M)5555555,(P)5555555
Name,,(H)5555555,(M)5555555,
Name,,,,(P)5555555
Edit: It appears I'm a bad Stack Overflow citizen. I didn't get answers fast enough for when my project needed to be done, and therefore forgot to come back and correct the issues in my post. I ended up writing a Python script to do this instead of just using find/replace in BBEdit or Sublime Text 2 as I was originally trying to do.
So I would like a method to do something like this that works in either BBEdit or Sublime Text. Or Vim for that matter. I'll try to keep a better eye on it this time, and I'll respond to the answers that already exist.
If your regex flavor supports lookarounds, this can be done with a simple regex-replace. Since lookaheads do not advance the position of the regex engine's cursor, we can use them to look for multiple patterns somewhere after one particular position. We can capture all these findings and write them back in the replacement string. To make sure that all of them are optional we could simply use ?, but in this case, I'll add an empty alternative to the lookahead - this is necessary to trick the engine when it's backtracking. The pattern could then look like this:
^Name,(?=.*([(]W[)]\d+)|)(?=.*([(]H[)]\d+)|)(?=.*([(]M[)]\d+)|)(?=.*([(]P[)]\d+)|).*
The .* at the end is to make sure that everything gets removed in the replacement.
And the replacement string like this:
Name,$1,$2,$3,$4
Here is a working demo using the ECMAScript flavor. It's a rather limited flavor, so this solution should be adaptable to most environments.
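For anyone who wants to try the pattern outside an editor, here is the same idea as a Python sketch. It uses the pattern and replacement from above unchanged apart from Python's \1 syntax for group references, and it assumes Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string. The function name reorder is made up for this example.

```python
import re

# Same pattern as above, split over several lines for readability.
# Each lookahead either captures its field somewhere later in the line
# or falls back to the empty alternative and captures nothing.
PATTERN = re.compile(
    r"^Name,"
    r"(?=.*([(]W[)]\d+)|)"
    r"(?=.*([(]H[)]\d+)|)"
    r"(?=.*([(]M[)]\d+)|)"
    r"(?=.*([(]P[)]\d+)|)"
    r".*"
)


def reorder(line: str) -> str:
    """Rewrite one CSV line so the fields appear in W, H, M, P order."""
    return PATTERN.sub(r"Name,\1,\2,\3,\4", line)
```

As in the original pattern, the literal Name stands in for the real first column and would have to be replaced by something matching actual names.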
Something like this?
^Name,(\((?:W|H|P|M)\)\d+(?:,)?)*[,]*$
This will give you all the matches per row. Then you simply need to allocate each match to the right column.
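The "find the matches, then allocate each to its column" step could look roughly like this in Python. Note that this sketch swaps in a simpler pattern with re.findall() rather than the regex above, because a repeated capture group keeps only its last repetition. The names ORDER and align are made up for this example, and it assumes each label appears at most once per line.

```python
import re

ORDER = "WHMP"  # target column order

# One match per field: the label letter and the digits that follow it.
FIELD = re.compile(r"\(([WHMP])\)(\d+)")


def align(line: str) -> str:
    """Place every labelled field of one CSV line under its own column."""
    name, _, rest = line.partition(",")
    found = dict(FIELD.findall(rest))  # e.g. {"M": "5555555", "H": "5555555"}
    columns = [f"({key}){found[key]}" if key in found else "" for key in ORDER]
    return ",".join([name] + columns)
```

Missing fields simply become empty columns, so every output line has the same number of commas.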
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
I'm masochistically writing an open-source text editor for Mac and have finally reached the point at which I want to add syntax highlighting. I've been going back and forth on various solutions for the past few days, and I've finally decided to open the question to a wider audience.
Here are the options I see:
Define languages basically with a series of regex pattern matching (similar to how TextMate defines its languages)
Define languages with a formal grammar like BNF or PEG
Using regex pattern matching seems less than ideal as it cannot formally represent a language nearly as well as a formal grammar; however, some less formal languages will have a hard time fitting into BNF (i.e. Markdown -- though I know there's a great PEG implementation).
What are the performance tradeoffs for live syntax highlighting? What about flexibility for a wide range of languages?
If I go the BNF route, Todd Ditchendorf created the awesome ParseKit framework, which would work nicely out of the box. Does anyone know of anything similar for PEGs?
Unless you want to fight the battle of getting a full context-free (or worse, a full context-sensitive) grammar completely correct for every language you want to process (or worse, for every dialect of the language you want to process... how many kinds of C++ are there?), for the purposes of syntax highlighting you're probably better off giving up on complete correctness and accepting that sometimes you'll get it wrong. In that case, regexps seem like an extremely good answer. They can also be very fast, so they won't interfere with the person doing the editing.
If you insist on doing full syntax checking/completion (I don't think you are), then you'll need that full grammar. You'll also be a very long time in producing editors for real languages.
Sometimes it is better not to be too serious. A 98% solution that you can get is better than a 100% solution that never materializes.
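As a rough illustration of the regexp approach, here is a Python sketch of a tiny highlighter for an imaginary language. The token classes, the keyword list and the function name are all invented for this example; a real editor would load such patterns from a per-language definition and would be written in its own implementation language.

```python
import re

# One alternation with a named group per token class. Order matters:
# earlier alternatives win when several could match at the same position.
TOKEN_RE = re.compile(
    r"""
    (?P<comment>\#[^\n]*)                    # from "#" to the end of the line
  | (?P<string>"(?:\\.|[^"\\\n])*")          # double-quoted string with escapes
  | (?P<number>\b\d+(?:\.\d+)?\b)            # integer or decimal number
  | (?P<keyword>\b(?:if|else|while|return)\b)
    """,
    re.VERBOSE,
)


def highlight_spans(source: str) -> list:
    """Return (token_class, text) pairs for every recognised token."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]
```

Text that matches no class is skipped and left unstyled, which is the "sometimes you'll get it wrong" trade-off in practice: the scan is a single pass over the text, but nothing checks that the tokens form a valid program.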
It might not be exactly what you need since you are writing the editor yourself, but there is an awesome framework called Xtext that will actually generate a complete editor with syntax coloring, customizable outline view and auto-completion etc., based on a grammar for your language: http://eclipse.org/Xtext