Using regexp to find a reoccurring pattern in MATLAB - regex

input = ' 12Z taj 20501 jfdjda OCNL jtjajd ptpa 23Z jfdakdkf tjajdfk OCNL fdkadja 02Z fdjafsdk fkdsafk OCNL fdkafk dksakj = '
Using regexp like this:
regexp(input,'\s\d{2,4}Z\s.*(OCNL)','match')
I'm trying to get the following output:
[1,1] = 12Z taj 20501 jfdjda OCNL jtjajd ptpa
[1,2] = 23Z jfdakdkf tjajdfk OCNL fdkadja
[1,3] = 02Z fdjafsdk fkdsafk OCNL fdkafk dksakj

You may use
(?<!\S)\d{2,4}Z\s+.*?\S(?=\s\d{2,4}Z\s|\s*=\s*$)
See the regex demo.
Details
(?<!\S) - there must be a whitespace or start of string immediately to the left of the current location
\d{2,4} - 2, 3 or 4 digits
Z - a Z letter
\s+ - 1+ whitespaces
.*?\S - any zero or more chars as few as possible and then a non-whitespace
(?=\s\d{2,4}Z\s|\s*=\s*$) - there must be either of the two patterns immediately to the right of the current location:
\s\d{2,4}Z\s - a whitespace, 2, 3 or 4 digits, Z and a whitespace
| - or
\s*=\s*$ - a = enclosed with 0+ whitespace chars at the end of the string.
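The pattern can be checked outside MATLAB with a minimal Python sketch. It assumes Python's re module handles this particular pattern the same way (the fixed-width lookbehind and the lookahead are both supported there); the variable names are illustrative only.

```python
import re

# The input string from the question, split over two literals for readability.
text = (" 12Z taj 20501 jfdjda OCNL jtjajd ptpa 23Z jfdakdkf tjajdfk OCNL "
        "fdkadja 02Z fdjafsdk fkdsafk OCNL fdkafk dksakj = ")

# Same pattern as in the answer: each match starts at a "<digits>Z" token and
# stops before the next such token or before the trailing "=".
pattern = re.compile(r'(?<!\S)\d{2,4}Z\s+.*?\S(?=\s\d{2,4}Z\s|\s*=\s*$)')

matches = pattern.findall(text)
for m in matches:
    print(m)
```

Because the lookahead does not consume the whitespace before the next token, that whitespace is still available to satisfy (?<!\S) when the following match starts.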

Related

Find if either followed by non number or end of file

I want to match the string b5, with an optional $ in front of the b or the 5:
=b5
b$5
= $b$5
($b5)
But the 5 can't be followed by another digit, and the b can't be preceded by a letter. So these should return false:
b55
ab5
I tried this:
\W\$*b\$*5\W
It works fine: it will match X=($b$5). The problem is that it no longer matches when the 5 is the last character in the line, because the trailing \W then has nothing to match.
You can use
(?:\W|^)\$*b\$*5(?:\W|$)
or
(?:\W|^)\$*b\$*5\b
See the RE2 regex demo.
Details
(?:\W|^) - a non-capturing group matching either a non-word char or start of string
\$* - zero or more $ chars
b - a b char
\$* - zero or more $ chars
5 - a 5 char
(?:\W|$) - a non-capturing group matching either a non-word char or end of string (first pattern), or
\b - a word boundary (second pattern).
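Both patterns can be tried against the strings from the question with a small Python sketch. The variable names are illustrative, and Python's re is assumed to behave like the RE2 demo for these constructs.

```python
import re

# First pattern: consumes the char after the 5 (or accepts end of string).
consuming = re.compile(r'(?:\W|^)\$*b\$*5(?:\W|$)')
# Second pattern: only asserts a word boundary after the 5.
boundary = re.compile(r'(?:\W|^)\$*b\$*5\b')

should_match = ['=b5', 'b$5', '= $b$5', '($b5)', 'X=($b$5)']
should_fail = ['b55', 'ab5']

# Record, for every sample string, whether each pattern finds a match.
results = {
    s: (bool(consuming.search(s)), bool(boundary.search(s)))
    for s in should_match + should_fail
}
for s, (first, second) in results.items():
    print(f'{s!r}: consuming={first}, boundary={second}')
```

The two variants agree on these inputs. The difference is that the first one includes the trailing non-word char in the match, while \b matches without consuming anything.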

Regexp to search names of books (Delphi 7 & DiRegEx 8.8.1)

I am using Delphi 7, and this is the first time I am using the DiRegEx library.
What I need to do is collect the names of the books in a list. The list is long, but to give an idea, it looks like this:
2 Tesalonickým 3:14
2 Tesalonickým 3:15
2 Tesalonickým 3:16
2 Tesalonickým 3:17
2 Tesalonickým 3:18
1 Timoteovi 1:1
1 Timoteovi 1:2
1 Timoteovi 1:3
1 Timoteovi 1:4
So what I want to find with RegEx.Match are the '2 Tesalonickým' and '1 Timoteovi' strings. That is, I want to search for ^some string\d\d?\d?:\d\d?\d? ...
My code is:
var
  Contents: TStringList;
  RegEx: TDIRegEx;
  WordCount: Integer;
  i: Integer;
  s: string;
begin
  Contents := TStringList.Create;
  RegEx := TDIPerlRegEx.Create{$IFNDEF DI_No_RegEx_Component}(nil){$ENDIF};
  try
    Contents.LoadFromFile('..\reference dlouhé CS.txt');
    RegEx.MatchPattern := '\w+';
    for i := 0 to Contents.Count - 1 do
    begin
      RegEx.SetSubjectStr(Contents[i]);
      WordCount := 0;
      if RegEx.Match(0) >= 0 then
      begin
        repeat
          Inc(WordCount);
          s := RegEx.MatchedStr;
          WriteLn(WordCount, ' - ', s);
        until RegEx.MatchNext < 0;
      end;
    end; // end for
  finally
    // free both objects once, after the loop, not inside it
    RegEx.Free;
    Contents.Free;
  end; // end try
end;
I need to modify the regex so that \d\d?\d?:\d\d?\d? is not part of the result but only serves as an "anchor" or "needle". How should I write the regexp?
Result:
This is the complete list of the 66 books of the Bible in UTF-8. There were some problems with the \w pattern because it does not include characters like Ž or š.
Genesis;Exodus;Leviticus;Numeri;Deuteronomium;Jozue;Soudců;Rút;1 Samuelova;2 Samuelova;1 Královská;2 Královská;1 Paralipomenon;2 Paralipomenon;Ezdráš;Nehemjáš;Ester;Jób;Žalmy;Přísloví;Kazatel;Píseň písní;Izajáš;Jeremjáš;Pláč;Ezechiel;Daniel;Ozeáš;Jóel;Ámos;Abdijáš;Jonáš;Micheáš;Nahum;Abakuk;Sofonjáš;Ageus;Zacharjáš;Malachiáš;Matouš;Marek;Lukáš;Jan;Skutky apoštolské;Římanům;1 Korintským;2 Korintským;Galatským;Efezským;Filipským;Koloským;1 Tesalonickým;2 Tesalonickým;1 Timoteovi;2 Timoteovi;Titovi;Filemonovi;Židům;Jakubův;1 Petrův;2 Petrův;1 Janův;2 Janův;3 Janův;Judův;Zjevení Janovo;
You may use
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d\d?\d?:\d)
Or
(*UCP)^(?:\d+\s+)?\w+(?=\s+\d{1,3}:\d)
The (*UCP) PCRE verb at the pattern start makes all shorthand classes Unicode-aware.
The patterns match
^ - start of the string
(?: - start of a non-capturing group
\d+ - 1+ digits,
\s+ - 1+ whitespaces and
)? - end of non-capturing group, 1 or 0 occurrences (? makes it optional)
\w+ - 1+ word chars...
(?=\s+\d{1,3}:\d) - followed with 1+ whitespaces, 1 to 3 digits, : and a digit.
See the regex demo.
The \w might need replacing with \p{L} if you only need to match letters.
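To illustrate how the lookahead keeps the chapter:verse part out of the match, here is a minimal Python sketch. It is not DIRegEx code: Python's re has no (*UCP) verb, but \w, \d and \s are already Unicode-aware for str patterns in Python 3, so the verb is simply dropped. The helper unique_books, the BOOK_MULTI variant and the two extra sample lines (Žalmy, Píseň písní) are assumptions added for illustration. The variant is there because \w+ alone covers only one word, so two-word names from the list would not match the original pattern.

```python
import re

# The answer's pattern, minus the PCRE-only (*UCP) verb.
BOOK = re.compile(r'^(?:\d+\s+)?\w+(?=\s+\d{1,3}:\d)')

# Hypothetical variant (not from the answer): also allows further words
# before the chapter:verse part, e.g. "Píseň písní".
BOOK_MULTI = re.compile(r'^(?:\d+\s+)?\w+(?:\s+\w+)*?(?=\s+\d{1,3}:\d)')

lines = [
    "2 Tesalonickým 3:17",
    "2 Tesalonickým 3:18",
    "1 Timoteovi 1:1",
    "1 Timoteovi 1:2",
    "Žalmy 1:1",          # illustrative line, not from the question
    "Píseň písní 1:1",    # illustrative line, not from the question
]

def unique_books(lines, pattern=BOOK):
    """Return each matched book name once, in first-seen order."""
    names = []
    for line in lines:
        m = pattern.match(line)
        if m and m.group() not in names:
            names.append(m.group())
    return names

print(unique_books(lines))
print(unique_books(lines, BOOK_MULTI))
```

With the original pattern the "Píseň písní" line is skipped, since the lookahead requires the reference to follow the first word directly.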

Remove the text before the second comma-only line (string replace pattern)

How can we remove the text up to and including the second line that consists of only a comma (line 5 in the example)? How can I do that using regex?
Example:
,
abc,xyz,ggg,nrmr
cde,jjj,kkkk,iiii,tem,posting
234,mm/dd/yy
,
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98
Expected:
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98
It seems you may use
val s = """,
abc,xyz,ggg,nrmr
cde,jjj,kkkk,iiii,tem,posting
234,mm/dd/yy
,
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98"""
val res = s.replaceFirst("(?sm)\\A(.*?^,$){2}", "").trim()
println(res)
// =>
// 454654,output2,sample
// 45646,output1,non-sample
// 16546,225.02
// ABC,2.98
See the Scala demo.
Pattern details:
(?sm) - s enables . to match any char in the string including newlines, and m makes ^ and $ match start/end of line respectively
\\A - the start of string
(.*?^,$){2} - 2 occurrences of:
.*? - any 0+ chars as few as possible up to the leftmost
^,$ - line that only contains ,.
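The same idea can be sketched in Python, assuming its re module treats the inline (?sm) flags and \A like the Java/Scala engine does for this pattern. The count=1 argument plays the role of replaceFirst; the variable names are illustrative.

```python
import re

text = """,
abc,xyz,ggg,nrmr
cde,jjj,kkkk,iiii,tem,posting
234,mm/dd/yy
,
454654,output2,sample
45646,output1,non-sample
16546,225.02
ABC,2.98"""

# Remove everything from the start of the string through the second line
# that contains only a comma, then trim the leftover newline.
result = re.sub(r'(?sm)\A(.*?^,$){2}', '', text, count=1).strip()
print(result)
```

The lazy .*? matters here: with a greedy .* the group would run to the last comma-only line it can reach instead of the nearest one.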

Regex - match words that follow the rule "i before e except after c"

Hi, I'm wondering how to match words that follow the spelling rule “i before e except after c” (such as brief, receipt, receive, pier) without matching words that break the rule, such as science.
What I have so far is incorrect, because science shouldn't match.
I don't really know how to do this without using lookbehind (which I know isn't very well supported).
This seems simple enough:
[A-Za-z]*(cei|[^c]ie)[A-Za-z]*
You can try this:
\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b
I suggest using a regex with a negative lookahead:
/\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*/i
See the regex demo
Details:
\b - a leading word boundary
(?![a-z]*cie) - a negative lookahead that fails the match if the word has cie after 0+ letters
[a-z]* - 0+ letters
(?:cei|ie) - cei or ie
[a-z]* - 0+ letters.
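A short Python sketch of the negative-lookahead pattern, run against the example words from the question. The names RULE, words and followers are illustrative, and re.IGNORECASE stands in for the /i flag.

```python
import re

# Negative lookahead rejects any word containing "cie"; the rest of the
# pattern then requires "cei" or "ie" somewhere in the word.
RULE = re.compile(r'\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*', re.IGNORECASE)

words = ['brief', 'receipt', 'receive', 'pier', 'science']
followers = [w for w in words if RULE.search(w)]
print(followers)
```

Because the lookahead sits right after \b, the check for cie runs once per word before any letters are consumed, so no lookbehind is needed.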
I was analysing the text of a novel, looking for the most common words that obeyed and broke the i before e rule. It would appear you're more likely to use a word that breaks rather than one which obeys the rule.
import operator
import re

ie = {}
cei = {}
ei = {}

with open('1342.txt', 'r') as file:
    for line in file:
        line = line.lower()
        # ditch anything that's not a word or a white-space character
        line = re.sub(r'[^\w\s]', '', line)
        # split line into words
        for word in line.split():
            # add the word to the appropriate dictionary or increment
            # frequency of occurrence if already in that dictionary
            if re.search(r'ie', word):
                ie[word] = ie.get(word, 0) + 1
            if re.search(r'cei', word):
                cei[word] = cei.get(word, 0) + 1
            # "ei" preceded by any letter other than c
            if re.search(r'[abd-z]ei', word):
                ei[word] = ei.get(word, 0) + 1

# sort each dictionary and display the 10 most common words from each
for counts in (ie, cei, ei):
    ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    for entry in ranked[:10]:
        print(entry)
    print()

Remove optional whitespace when splitting with math operators while keeping them in the result

How can I remove whitespace characters from the input string? I am using the following code:
Dim input As String = txtInput.Text
Dim symbol As String = "([-+*/])"
Dim substrings() As String = Regex.Split(input, symbol)
Dim cleaned As String = Regex.Replace(input, "\s", " ")
For Each match As String In substrings
lstOutput.Items.Add(match)
Next
Input: z + x
Output: z, + and x.
I want to get rid of the whitespace in the last item.
You may remove the redundant whitespace while splitting with
\s*([-+*/])\s*
See the regex demo. Also, it is a good idea to trim the input with .Trim() before passing it to Regex.Split.
Pattern details:
\s* - matches 0+ whitespaces (these will be discarded from the result as they are not captured)
([-+*/]) - Group 1 (captured texts will be output to the resulting array) capturing 1 char: -, +, * or /
\s* - matches 0+ whitespaces (these will be discarded from the result as they are not captured)
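The effect of the capturing group can be sketched in Python, whose re.split also keeps captured text in the result, as Regex.Split does in .NET. The helper name split_expression is hypothetical, and strip() stands in for .Trim().

```python
import re

def split_expression(expr):
    """Split on math operators, keeping them and dropping nearby whitespace."""
    # Whitespace around the operator is matched but not captured, so it is
    # discarded; the operator itself is captured and therefore kept.
    return re.split(r'\s*([-+*/])\s*', expr.strip())

print(split_expression(' z + x '))
print(split_expression('a*b - c'))
```

Trimming first matters because the pattern only removes whitespace next to an operator, not at the ends of the input.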