As I stated on the title.
I'm try to find regex result on a specific word(like apple) having random newline(\r\n) special character.
Illustrate more detail...
Let's find a word 'apple' on the text file. but We don't know where is exact position of newline(\r\n) on the file like below...
ap
ple
or
appl
e
I also googled many pages but I couldn't find the answer.
Should I have to write beginner regex like below?
(a\r\npple|ap\r\nple|app\r\nle|appl\r\ne|apple\r\n|)
I need to find more smarter regex to find exact word.
updated.
the word can be vary like "ripe apple", "rotten apple" and "brightapple".
In the case of third item, white space removed by writer.
updated
i have many txt files. i have to find the string within those.
So remove /r/n is not useful and cannot handle(too much menory and time required).
You have to take the naive approach ("beginner regex") if you want to use regular expressions, since they belong to the type 3 grammars and cannot express the state needed (see also The difference between Chomsky type 3 and Chomsky type 2 grammar)
Related
Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything annoyingly is in lowercase which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of consonant clusters, groups of them that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGpt tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try to pass every single word inside the sentence to a function that checks wether the word is listed inside a dictionary. There is a good number of dictionary text files on GitHub. To speed up the process: use a hash map :)
You could also use an auto-corretion API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia) but it's hard to predict how precise exactly this will be.
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
See live demo.
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
I am using den4b Renamer to rename a lot of files that follow a specific pattern. The program allows me to use RegEx: (https://www.den4b.com/wiki/ReNamer:Regular_Expressions)
I am stuck trying to conjure up an expression for a specific pattern.
My current RegEx:
Expression: ^(com\.)(([\w\s]*\.){0,4})([\w\s]*)$
Replace: \L$1\L$2\u$4
Note: \L and \u transform the sub-expression to upper and lower case as defined in the table below:
Here are a few example strings so you can get an idea of the input:
Android File Transfer.svg
Angular Console.svg
Au.Edu.Uq.Esys.Escript.svg
Avidemux.svg
Blackmagic Fusion8.svg
Broken Sword.svg
Browser360 Beta.svg
Btsync GUI.svg
Buttercup Desktop.svg
Calc.svg
Calibre EBook Edit.svg
Calibre Viewer.svg
Call Of Duty.svg
com.GitHub.Plugarut.Pwned Checker.svg
com.GitHub.Plugarut.Wingpanel Monitor.svg
com.GitHub.Rickybas.Date Countdown.svg
com.GitHub.Spheras.Desktopfolder.svg
com.GitHub.Themix Project.Oomox.svg
com.GitHub.Unrud.Remote Touchpad.svg
com.GitHub.Unrud.Video Downloader.svg
com.GitHub.Weclaw1.Image Roll.svg
com.GitHub.Zelikos.Rannum.svg
com.Gitlab.Miridyan.Mt.svg
com.Inventwithpython.Flippy.svg
com.Neatdecisions.Detwinner.svg
com.Rafaelmardojai.Share Preview.svg
com.Rafaelmardojai.Webfont Kit Generator.svg
Distributor Logo Antix.svg
Distributor Logo Archlabs.svg
Distributor Logo Dragonflybsd.svg
DOSBox.svg
Drawio.svg
Drweb GUI.svg
For this question I am focused on the strings that begin with com.xxx.xxx.
Since I can't only target those names in Renamer, the expression has to "play nice" with the other input file names and correctly leave them alone. That's why I've prefixed my expression with ^(com\.)
What I want:
Transform the entire string to lower case except for the last period separated part of the string.
Strip white space from the entire string.
For instance:
Original: com.GitHub.Alcadica.Develop.svg
After my Regex: com.github.alcadica.Develop.svg
What I want: com.github.alcadica.Develop.svg
This specific file is correctly renamed. What I'm having trouble with are names that have spaces in any part of the string. I can't figure out how to strip whitespace:
Original: com.Belmoussaoui.Read it Later.svg
After my Regex: com.belmoussaoui.Read it Later.svg
What I want: com.belmoussaoui.ReaditLater.svg
Here is a hypothetical example because I couldn't find a file with more than four parts. I want my pattern to be robust enough to handle this:
Original: com.Shatteredpixel.Another Level.Next.Pixel Dungeon.svg
After my Regex: com.shatteredpixel.another level.next.Pixel Dungeon.svg
What I want: com.shatteredpixel.anotherlevel.next.PixelDungeon.svg
Note that since I'm not using any kind of programming language, I don't have access to common string operations like trim, etc. I can, however, stack expressions. But this would create more overhead and since I am renaming thousands of files at a time I'd ideally like to keep it to one find/replace expression.
Any help would be greatly appreciated. Please let me know if I can provide any more information to make this more clear.
Edit:
I got it to work with the following rules:
Really inefficient, but it works. (Thanks to Jeremy in the comments for the idea)
I'm not sure which community this belongs in, feel free to suggest a better one if this doesn't fit here.
In Visual Studio Code, when searching for a file, you can CMD/Ctrl + P to bring up the Quick Open search box for finding a file by name. The search doesn't have to be the exact name and it filters as long as the search query contains the characters in that order, while being "loose" enough to ignore any characters between those.
Example:
Searching "cat" would show the following:
bigcat.txt
cat.txt
candlelight.txt
In the above, all the strings contain "cat" within it, even if there are other characters between it. The regex would probably be something like /.*c.*a.*t.*/.
Is there a name for this type of search/filter?
Fuzzy Filter/Search
After looking through VS Code's GitHub issues list, I found an issue that mentioned it.
I also found a node module that does this exact same thing.
There is also a Wikipedia entry on Approximate String Matching, which is similar to the above.
I have a string variable containing school names and I need to find all the possible combination of each word in this string variable in stata:
For example variation of a word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending whether your data is already in the Excel sheets or a file, you can either use regex trying to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and pick each typo as regex match to use when comparing to your actual input.
Making regexp that would actually find all possible cases is next to impossible, especially if there are cases where very similar (but correct) names for schools exist. In any case direct regexps would be very messy and complex, so I would advice you to parse the data by finding first the correct form, excluding it and then using (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some sort of starting ideas, check this links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.s You should keep the count of all strings/school names and finally get a list of all names that did not match correct form or any of your regexp filters, so you can manually insert/correct them.
Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a method that the command line one does, but it doesn't output quite in the format you're looking for. You could also do this manually if you wanted though just by some simple scripting with regex.
The format of for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Such that where TAG matches what follows what's behind the /in a given word in the .dicfile, you can do the following (presuming you've already stripped the word of the /...):
if($word =~ /$match$/) $word =~ s/$remove$/$replace/;
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix.
There are three caveats:
The $match directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations such that if the match is something like [abc-gh], you'd be better to change it to (a|b|c|-|g|h) or [abcgh-] (hunspell doesn't use hyphen as a metacharacter) otherwise it'll be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-] or to use negative look behinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).