Java regular expression to get a value in matlab - regex

Hi, I was wondering how I can do this in MATLAB: I have a file, and somewhere in it is the string "1 to 10 of 434M". I would like to extract the "434M". Keep in mind that the M can also be another letter (K or B), but it is always a capital letter. The number before the letter has at most 3 digits, but can have fewer.
How would I get this out of a text in MATLAB?

Assume that you read your file line by line. Then for each line execute the following commands:
% line is the current line of the input file
[matchstart,~,~,~,tokenstring] = regexp(line, '1 to 10 of (\d+[MKB])');
if ~isempty(matchstart)
    % tokens are returned as a cell array of cell arrays: {match}{token}
    desired_string = tokenstring{1}{1};
end
This regular expression matches one or more digits before the letter (so it would also match e.g. 451274M). If it should only match numbers with 1 to 3 digits, use:
'1 to 10 of (\d{1,3}[MKB])'
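For comparison, here is a minimal Python sketch of the same capture-group idea. The helper name `extract_total` is hypothetical, and generalizing "1 to 10" to any pair of numbers is an assumption that goes beyond the question.

```python
import re

# Assumption: the page range ("1 to 10") may vary, so it is matched with \d+.
# The capture group holds 1 to 3 digits followed by one of the letters M, K or B.
PATTERN = re.compile(r'\d+ to \d+ of (\d{1,3}[MKB])\b')

def extract_total(text):
    """Return the captured total (e.g. '434M'), or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(extract_total('showing 1 to 10 of 434M entries'))
```

The `\b` after the letter keeps the pattern from matching inside a longer word; drop it if the value can be followed directly by other letters.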

Related

RegEx matching odd number of characters at the beginning and the end of string

I have a grammar in Lezer where I need to match a "custom string" which can start with any odd number of " and must end with the same number of them. It can span multiple lines as well, and anything inside needs to be skipped as far as the parser is concerned. I am struggling a little with the regex part of the matching.
str"test" // valid
str"""test""" // valid
str""test"" // not valid
str""""test"""" // not valid
I am trying to match the beginning and end of that string.
I tried, among other things, "("")*[^"], but it also matches the first character after the odd run of double quotes (due to the [^"]), which is something I would like to avoid.
For matching the end I have a similar issue.
So with the given input of:
1 str"test"
2 str"""
3 a
4 b
5 c
6 """
7 str""nope""
I am trying to match only str" for line 1 and str""" for line 2 and not match on line 7.
Need to match the ends as well (not in the same regex). So the match should be on " for line 1 and """ for line 6.
I have this so far: start: ^str"("")*[^"] end: [^"]"("")*$ but it is not optimal.
FYI I need start and end since the expectation is when you start writing and we hit a match on the beginning the highlighting in the editor should highlight all remaining text as a string until you have a matching odd number of ".
Any advice is appreciated.
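One way to avoid consuming the neighbouring character is to replace the [^"] with a zero-width lookaround. The Python sketch below only illustrates that idea on the sample input; the names `START` and `END` are made up here, and whether the target tokenizer supports lookahead and lookbehind is an assumption that would need checking.

```python
import re

# Start: str, then an odd run of quotes, and the next character must not be a quote.
START = re.compile(r'^str"(?:"")*(?!")', re.MULTILINE)
# End: an odd run of quotes at end of line that is not preceded by another quote.
END = re.compile(r'(?<!")"(?:"")*$', re.MULTILINE)

sample = '\n'.join([
    'str"test"',
    'str"""',
    'a',
    'b',
    'c',
    '"""',
    'str""nope""',
])

print(START.findall(sample))
print(END.findall(sample))
```

Run on its own, the end pattern also matches the opening `"""` on the second line, because that run sits at the end of its line. A tokenizer that only looks for the end after it has seen a start would not hit that case.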

Regex - Match n occurences of substring within any m-lettered window

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where a 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
I tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea how to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
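import re
# string holds the input sequence of 0s and 1s
# the lookahead lets candidates overlap; each one spans exactly seven 1s
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
# keep only the candidates that fit into a 20-character window
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]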
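The plain iteration mentioned above (walking the string instead of using a regex) could look like the following sketch. The function name `dense_windows` and its parameters `n` and `m` are hypothetical; it reports every span that starts at a 1 and reaches the n-th following 1 within m characters, so overlapping spans are all returned.

```python
def dense_windows(s, n=7, m=20):
    """Return substrings that start and end with '1', contain exactly n ones,
    and are at most m characters long."""
    ones = [i for i, ch in enumerate(s) if ch == '1']
    found = []
    for k in range(len(ones) - n + 1):
        start, end = ones[k], ones[k + n - 1]
        if end - start + 1 <= m:
            found.append(s[start:end + 1])
    return found

# Shortened version of the question's input (an assumption made for readability).
sample = '0' * 7 + '11' + '0' * 7 + '111' + '0' + '1111' + '0' * 30 + '11'
print(dense_windows(sample))
```

Collecting the positions of the 1s first keeps the check to one subtraction per candidate, which avoids regex backtracking on long inputs.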
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Using grep reverse to get rid of a line and a few before

I'd like to get rid of a line with a pattern containing:
CE1(2or8 # CE1(number 2 or 8
CE2(-1-17-2or8 # CE2(any number from -1 to 17, a dash, number 2 or 8
and 6 lines before that and 1 line after that.
grep -B6 -A1 'CE1([28]\|CE2([-1-17]-[28]' file
This attempt seems to match my pattern (does it do what I explicitly described?), but I was thinking of using the invert option (-v) to remove the matched lines from my file. Is that possible? It does not seem to work.
Not a complete answer, but some explanations:
A character class matches only one character. The hyphen in a character class, when it doesn't represent a literal hyphen (at the first position, at the end, when escaped or immediately after ^), defines a range of characters, not a range of numbers. (Experimenting with an ASCII table at hand helps to understand this.)
[-1-17] matches one of these characters that can be:
a literal hyphen (because at the beginning)
a character in the range 1-1 (so 1)
the character 7
To match an integer between -1 and 17, you need:
\(-1\|1[0-7]\|[0-9]\)
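The difference between a character range and a number range can be checked quickly. The sketch below uses Python's re module rather than grep, so the alternation is written with plain `|` instead of `\|`; the variable names are illustrative only.

```python
import re

# Which single characters does the class [-1-17] accept?
char_class = re.compile(r'[-1-17]')
accepted = [c for c in '-0123456789' if char_class.fullmatch(c)]
print(accepted)

# Alternation that accepts the integers -1 to 17 when matched against a whole string.
int_range = re.compile(r'-1|1[0-7]|[0-9]')
numbers = [n for n in range(-5, 25) if int_range.fullmatch(str(n))]
print(numbers)
```

Note that `fullmatch` anchors the alternation at both ends. Inside a larger grep pattern the surrounding text (here the `CE2(` before and the dash after) plays that role.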
The simplest and most robust (since it works even when the skipped range includes lines that match the regexp or when the range runs off the start/end of the input file) approach, IMHO, is 2 passes - the first to identify the lines to be skipped and the second to skip those lines:
$ cat file
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
$ awk -v b=3 -v a=1 'NR==FNR{if (/f/) for (i=NR-b;i<=NR+a;i++) skip[i]; next} !(FNR in skip)' file file
a 1
b 2
h 8
i 9
Just change /f/ to /<your regexp of choice>/ and set the b(efore) and a(fter) values as you like.
As for your particular regexp, you didn't provide any sample input and expected output for us to test against but I THINK what you want might be:
awk -v b=6 -v a=1 'NR==FNR{if (/CE(1\(|2\((-1|[0-9]|1[0-7])-)[28]/) for (i=NR-b;i<=NR+a;i++) skip[i]; next} !(FNR in skip)' file file
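The same two-pass idea can be sketched in Python for readers who prefer it over awk. The function name `drop_around` and its parameters are hypothetical, and the sample data is the nine-line file from the awk example above.

```python
import re

def drop_around(lines, pattern, before, after):
    """Remove every line matching pattern, plus `before` lines ahead of it
    and `after` lines following it."""
    regex = re.compile(pattern)
    skip = set()
    # Pass 1: collect the indices of all lines to drop.
    for i, line in enumerate(lines):
        if regex.search(line):
            skip.update(range(i - before, i + after + 1))
    # Pass 2: keep the rest; indices outside the file are simply never hit.
    return [line for i, line in enumerate(lines) if i not in skip]

sample = ['a 1', 'b 2', 'c 3', 'd 4', 'e 5', 'f 6', 'g 7', 'h 8', 'i 9']
print(drop_around(sample, r'f', before=3, after=1))
```

Because the skip set is complete before any line is emitted, overlapping ranges and ranges that run past the start or end of the input need no special handling.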

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas followed, twice, by a comma and one or more non-commas (e.g. 1,1,1); the group is followed by vertical whitespace (a line break) that is not part of the group
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
Here you can see, how it works
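To try the pattern outside an editor, it can be adapted to Python. This is a sketch under two assumptions: Python's re module treats `\v` as the vertical-tab character only, so `\n` is used instead, and newlines are excluded from the field classes so that a field cannot span lines. The name `STUCK` is made up, and the rows are taken from the question.

```python
import re

STUCK = re.compile(
    r'([^,\n]+(?:,[^,\n]+){2})\n'           # group 1: the last three fields of a line
    r'(?:(?:[^,\n]+,){3}\1(?:\n|$)){3,}'    # 3+ following lines ending in the same fields
)

rows = [
    '5451,1667,180007,35.7397387,97.8161897,375.8',
    '5448,1053z,180006,35.7397407,97.8161814,375.7',
    '36859,1598,202603.00,35.8867316,99.2515545,555.700',
    '36859,1598,202608.00,35.8867316,99.2515545,555.700',
    '36859,1142z,202610.00,35.8867316,99.2515545,555.700',
    '36859,1597,202612.00,35.8867316,99.2515545,555.700',
    '36859,1597,202614.00,35.8867316,99.2515545,555.700',
    '36859,1596,202616.00,35.8867316,99.2515545,555.700',
    '36859,1595,202618.00,35.8867316,99.2515545,555.700',
]

match = STUCK.search('\n'.join(rows))
if match:
    repeated_lines = match.group(0).count('\n') + 1
    print(match.group(1), 'repeats on', repeated_lines, 'consecutive lines')
```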
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done much more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = [0,0,0]
repetitions = 0
# read the file line by line and close it when done
with open(r'C:\temp\gps.csv') as f:
    for line in f:
        # the last three columns hold the gps coordinates
        gpscoords = line.rstrip('\n').split(',')[3:6]
        if gpscoords == oldcoords:
            repetitions += 1
        else:
            oldcoords = gpscoords
            repetitions = 0
        if repetitions == 4: # or however you define more than a few
            print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you correctly, then
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
should do the trick. The comparison uses eq because the captured fields are strings; a numeric == would only compare the leading number of each.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631

c# split text file by changing the line number

I'm trying to split text file by line numbers,
for example, if I have text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it to 3 parts (1,2,3),
How can I do this? the size of the text is very large (~20,000,000 characters) and I need an efficient way (like regex).
Another idea: you can use LINQ to get the groups you're after by grouping on the first word of each line. Note that this will take each first word, so make sure you only have numbers there. This uses the split/join antipattern, but it seems to work nicely here.
var lines = from line in s.Split("\r\n".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries)
let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
group line by lineNumber
into g
select String.Join("\n", g);
Notes:
GroupBy is guaranteed to return lines in the order they appeared.
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
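For illustration, the grouping behaviour described in these notes can be reproduced in a few lines of Python. This is a sketch, not a port of the C# code: the function name `group_by_first_word` is hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7 and later).

```python
def group_by_first_word(text):
    """Group lines by their first space-separated word, in order of first appearance."""
    groups = {}
    for line in text.splitlines():   # splitlines() also handles \r\n line endings
        if not line.strip():
            continue                 # skip empty lines
        key = line.split(' ', 1)[0]
        groups.setdefault(key, []).append(line)
    return ['\n'.join(lines) for lines in groups.values()]

sample = '1 aa\r\n1 bb\r\n2 cc\r\n3 dd\r\n1 ee\r\n'
print(group_by_first_word(sample))
```

As in the second note, the trailing "1 ee" line is merged into the first block rather than starting a new one.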
You can use a regex, but Split will not work too well. You can Match for the following pattern:
^(\d+)\b.*$ # Match first line, capture number (one or more digits)
([\r\n]+^\1\b.*$)* # Match additional lines that begin with the same number
Example: here
I did try to split by $(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elements in the array.
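The match-based approach (one match per block of consecutive lines sharing a number) can be tried out in Python as follows. This is a sketch under assumptions: the names `BLOCK` and `split_blocks` are made up, line endings are normalized to `\n` first, and `\b` is used so that a block numbered 1 does not absorb a line numbered 12.

```python
import re

# A line starting with a number, then any following lines starting with the same number.
BLOCK = re.compile(r'^(\d+)\b.*(?:\n\1\b.*)*', re.MULTILINE)

def split_blocks(text):
    """Return consecutive runs of lines that begin with the same number."""
    text = text.replace('\r\n', '\n')
    return [m.group(0) for m in BLOCK.finditer(text)]

sample = '1 aa\r\n1 bb\r\n2 cc\r\n2 dd\r\n3 ee\r\n12 ff\r\n1 gg\r\n'
print(split_blocks(sample))
```

Unlike the grouping approach, a number that reappears later (the final "1 gg" line) produces a separate block here instead of being merged into the first one.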