Regex which satisfies 3 separate cases - regex

I'm trying to figure out a regex that can be used with Java's String.split(regex) to get an array of "lines" from a file.
A carriage return does not mark the end of a line; a comma does, but not every comma. A comma inside parentheses, single quotes, or a comment (/* comment, more comment */) does not end a line.
Example:
1 test fixed(5,2),
2 another_test char(12),
2 a_third_test,
3 one pic'9{9}V.99',
3 two pic'9,999V.99',
3 three fixed(7,2),
/* test,t*/
/*test 2,*/
/*and more */
2 another_field fixed bin(13),
2 a_really_long_super_long_field_name_requiring_two_lines_for_declaration
char(1),
2 a_field char(8);
The output expected is (with \t and extra white spaces omitted for clarity):
1 test fixed(5,2)
2 another_test char(12)
2 a_third_test
3 one pic'9{9}V.99'
3 two pic'9,999V.99'
3 three fixed(7,2)
/* test,t*//*test 2,*//*and more */ 2 another_field fixed bin(13)
2 a_really_long_super_long_field_name_requiring_two_lines_for_declaration
char(1)
2 a_field char(8)
I've come up with 3 separate regex expressions to get the 3 pieces:
,(?![^(]*\)) - all commas not in parentheses
(,(?![^']*')) - all commas not in single quotes
(,(?![^\/\*]*\*\/)) - all commas not in a comment
I've tried joining them with an or (.*?)|(,)|'.*?'|(,)|\/*.*?*\/|(,) but get the following:
1 test fixed
2 another_test char
2 a_third_test
3 one pic
3 two pic
3 three fixed
2 another_field fixed bin
2 a_really_long_super_long_field_name_requiring_a_line_break_... char
2 a_field char
Is there a way to combine these 3 regex expressions (or is there a better one?) so that the split only happens on commas that satisfy all 3 conditions?
UPDATE:
I can accomplish the exact same thing with some simple Java, but I'd like to do so with regex as an academic pursuit.
StringBuilder temp = new StringBuilder();
for (String line : text.split("\n")) {
    String trimmed = line.trim();
    // endsWith is safe on empty lines, unlike charAt(length() - 1)
    if (trimmed.endsWith(",") || trimmed.endsWith(";")) {
        System.out.println(temp + trimmed);
        temp.setLength(0);
    } else {
        temp.append(trimmed);
    }
}

I think you might be overthinking this a little bit. It's important to keep in mind that regular expressions are made for parsing regular languages. When you need to check whether you're inside a comment or parens or whatever else to know what a comma means, what you're looking at is a context-sensitive language.
That being said, matching commas and semi-colons at the end of a line is easy enough. /\s*(.*?)[,;]$/gsm works for the test input in your question. However, this doesn't take into account something like
test fixed(5,2),
/* a,
multi-line,
comment,
*/
The best option to get around this in my opinion would be to discard comments before you start parsing with \/\*.*?\*\/. If you need to keep the comments, you could probably use negative lookarounds but those are very inefficient and you would be better off writing a tokenizer/parser.
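The "discard comments first" approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name split_declarations is made up for this example, comments are dropped rather than kept (unlike the expected output in the question), and a comma only counts as a terminator when it is the last character on its line, so trailing whitespace or a /* inside a quoted string would need extra handling.

```python
import re

# Non-greedy so that each /* ... */ pair is removed separately, even across lines.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# A statement runs up to the first comma or semicolon that ends a line.
STATEMENT = re.compile(r"\s*(.*?)[,;]$", re.DOTALL | re.MULTILINE)


def split_declarations(text):
    """Strip comments, then return each statement with whitespace collapsed."""
    without_comments = COMMENT.sub("", text)
    return [" ".join(s.split()) for s in STATEMENT.findall(without_comments)]


sample = (
    "1 test fixed(5,2),\n"
    "2 another_test char(12),\n"
    "3 two pic'9,999V.99',\n"
    "/* test,t*/\n"
    "/* a,\nmulti-line,\ncomment,\n*/\n"
    "2 long_name\n"
    "char(1),\n"
    "2 a_field char(8);\n"
)
for statement in split_declarations(sample):
    print(statement)
```

Commas inside parentheses or quotes survive here only because they are never the last character on a line in the sample; input that breaks that assumption needs a real tokenizer, as suggested above.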

Related

Regex to match position from right to left

I hope you can help me, I'm studying how to make a regex and now I have this problem:
Write a regex that accepts strings with 0 and 1 and that has a 1 on position 5 from right to left.
e.g. 10000 is accepted because it has a 1 in position 5 from right to left; 010000, 0010000 and 1110000 are accepted too.
I was thinking with something like: (0+1)*+1(0+1)(0+1)(0+1)(0+1)(0+1)
You can use this regex:
1[01]{4}$
If you want to match full input then use:
^[01]*1[01]{4}$
Here 1[01]{4}$ ensures that we have 4 digits of 0 and 1 after we match 1 thus making 1 at 5th position from right to left.
Well - think of it this way. It needs to be as many 1s and 0s as you please, followed by a 1, followed by 4 more ones or zeroes.
So:
my_regex =
"^[01]*" + // Starts with One or zero, zero or more times
"1" + // Followed by a one
"[01]{4}$" // Followed by four things, which could be either zero or one, before ending.
Your (0 + 1) syntax looks foreign to me. I'm using character classes to specify the [01] things but you could use (0|1) in their place, which is what your attempt looks more like.
The full thing, together, is ^[01]*1[01]{4}$
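A quick way to check the pattern against the examples from the question, sketched in Python. The function name is hypothetical, and re.fullmatch is used in place of the explicit ^ and $ anchors.

```python
import re

# Any binary prefix, then a 1, then exactly four more binary digits.
FIFTH_FROM_RIGHT_IS_ONE = re.compile(r"[01]*1[01]{4}")


def has_one_in_fifth_position(s):
    """True if s is a binary string whose 5th character from the right is 1."""
    return FIFTH_FROM_RIGHT_IS_ONE.fullmatch(s) is not None


for candidate in ["10000", "010000", "0010000", "1110000", "00000", "1000"]:
    print(candidate, has_one_in_fifth_position(candidate))
```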

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
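I think this is a solution for your problem. Since you're using Python, note that str.replace does not accept a regex; use re.sub, with \1 and \2 referring to the capture groups (score is the input string):
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", score)
How it works:
(\d{1,2}) capture group of 1 or 2 digits.
\s* find 0 or more whitespace characters.
- find -.
\1 is replaced with the content of capture group 1
\2 is replaced with the content of capture group 2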
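To validate the whole field and also get each result as a separate, space-free unit, one possible sketch in Python is below. The names parse_score, RESULT and SEPARATOR are invented for this example, and the separator rule (a comma or semicolon with optional spaces, or one or more spaces) is my reading of rules 3 and 5 in the question.

```python
import re

RESULT = r"\d{1,2} *- *\d{1,2}"
SEPARATOR = r"(?: *[,;] *| +)"
# One result, then any number of separator + result pairs, with optional outer spaces.
SCORE_LINE = re.compile(rf" *{RESULT}(?:{SEPARATOR}{RESULT})* *")
RESULT_PARTS = re.compile(r"(\d{1,2}) *- *(\d{1,2})")


def parse_score(text):
    """Return normalised results such as ['25-1', '2-25'], or None if invalid."""
    if not SCORE_LINE.fullmatch(text):
        return None
    return [f"{a}-{b}" for a, b in RESULT_PARTS.findall(text)]


print(parse_score("25-1 2 -25, 25- 3 ;4 - 25 15-10"))
print(parse_score("25-100"))
```

Validating first and extracting second keeps each regex short; the extraction pattern alone would happily pick results out of an otherwise invalid string.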

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas, followed twice by a comma and one or more non-commas (e.g. 1,1,1), followed by vertical whitespace (a line break) that is not part of the group
(?:[^,]+,){3} matches one or more non-commas followed by a comma, three times (the columns that don't have to be considered)
\1 is a backreference to group 1, matching only if the text is exactly the same as what group 1 captured
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = None
repetitions = 0
with open(r'C:\temp\gps.csv') as f:
    for line in f:
        # the last three columns are the gps coordinates
        gpscoords = line.rstrip('\n').split(',')[3:6]
        if gpscoords == oldcoords:
            repetitions += 1
        else:
            oldcoords = gpscoords
            repetitions = 0
        if repetitions == 4:  # or however you define more than a few
            print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath should do the trick. (The comparison uses eq because == would compare the strings numerically, i.e. only up to the first comma.)
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631
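For comparison, the same kind of line-range report can be sketched in Python with itertools.groupby. The function name find_stuck_gps and the min_run parameter are made up for this example; the default of 7 mirrors the threshold in the perl one-liner and is an assumption about what "more than a few lines" means.

```python
from itertools import groupby


def find_stuck_gps(lines, min_run=7):
    """Return (first_line, last_line, coords) for every run of identical GPS columns."""
    rows = [
        (number, tuple(line.strip().split(",")[3:6]))
        for number, line in enumerate(lines, start=1)
        if line.strip()  # ignore blank lines
    ]
    runs = []
    # groupby only groups consecutive rows, which is exactly what a "hang" is.
    for coords, group in groupby(rows, key=lambda row: row[1]):
        numbers = [number for number, _ in group]
        if len(numbers) >= min_run:
            runs.append((numbers[0], numbers[-1], ",".join(coords)))
    return runs


stuck = [
    "36859,1598,202603.00,35.8867316,99.2515545,555.700",
    "36859,1598,202608.00,35.8867316,99.2515545,555.700",
    "36859,1142z,202610.00,35.8867316,99.2515545,555.700",
    "36859,1597,202612.00,35.8867316,99.2515545,555.700",
    "36859,1597,202614.00,35.8867316,99.2515545,555.700",
    "36859,1596,202616.00,35.8867316,99.2515545,555.700",
    "36859,1595,202618.00,35.8867316,99.2515545,555.700",
    "5451,1667,180007,35.7397387,97.8161897,375.8",
    "5448,1053z,180006,35.7397407,97.8161814,375.7",
]
for first, last, coords in find_stuck_gps(stuck):
    print(f"lines {first} to {last}: {coords}")
```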

best approach for my pattern match

So, I've built a regex which follows this:
4!a2!a2!c[3!c]
which is translated to
4 alpha characters followed by
2 alpha characters followed by
2 characters followed by
3 optional characters
This is the standard format for a SWIFT BIC code, e.g. HSBCGB2LXXX
my regex to pull this out of string is:
(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})
Now this is targeting a specific tag (32) and works, however, I'm not sure if it's the cleanest, plus if there are any characters before H then it fails.
the string being matched against is:
:32B:HsBfGB4LXXXHELLO
the following returns HSBCGB4LXXX, but this:
:32B:2HsBfGB4LXXXHELLO
returns nothing.
EDIT
For clarity: I have a string which contains multiple lines, each starting with a colon, a two-digit number, an optional letter and another colon (e.g. :58A:). I want to specify the line to start matching in and return a BIC from anywhere in that line.
EDIT
Some more example data to help:
:20:ABCDERF Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
Ok, so I need to pass in as a variable the tag 53 or 59 and return the BIC HSBCGB2LXXX only!
Your regex can be simplified, and corrected to allow a character before the H, to:
:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)
The changes made were:
Lost the look behind - just make it part of the match
Inserting .? meaning "optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (just easier to read and fewer characters)
Group 1 of the match contains your target
I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.
However, the actual regular expression can be further cleaned:
instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.
All in all
(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})
is shorter and matches in the very same cases.
If you give a better description of the domain, probably further improvements are also possible.
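The edits to the question ask for the tag to be passed in as a variable. A rough sketch of that in Python is below. The function name find_bic is hypothetical, a field is assumed to run from its tag up to the next line that starts with a tag, and the BIC pattern follows 4!a2!a2!c[3!c] using upper-case letters only, which is my assumption. Any other upper-case word of exactly 8 or 11 characters in the field would also match, so treat this as a starting point rather than a validator.

```python
import re

# 6 letters (bank + country), 2 alphanumerics (location), optional 3-character branch.
BIC = re.compile(r"\b[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")


def find_bic(message, tag):
    """Return the first BIC-shaped token in the field with the given tag, or None."""
    field = re.search(
        rf"^:{re.escape(tag)}[A-Z]?:(.*?)(?=^:\d\d[A-Z]?:|\Z)",
        message,
        re.MULTILINE | re.DOTALL,
    )
    if field is None:
        return None
    match = BIC.search(field.group(1))
    return match.group(0) if match else None


message = "\n".join([
    ":20:ABCDERF Z",
    ":32A:140310AUD2120,",
    ":53B:/HSBCGB2LXXX",
    ":57A://AU124040",
    "AREFERENCE",
    ":59:/44556677",
    "A line which HSBCGB2LXXX contains a BIC",
    ":70:Another line of data",
])
print(find_bic(message, "53"))
print(find_bic(message, "59"))
```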

How can I parse this without regex?

A friend of mine said if the regex I'm using is too long, it's probably the wrong tool for the job. Any thoughts here on a better way to parse this text? I have a regex that returns everything to an array I can easily just chunk out, but if there's another simpler way I'd really like to see it.
Here's what it looks like:
2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E
Here's a break down of that:
2 is the line number, these range from 1 all the way to 99. If you can't see because of formatting, there is a space character prepending numbers less than 10.
The space may or may not be replaced by an *
AB is an important unit of data (UOD).
AB may be prepended by /CD which is another important UOD.
123 is an important UOD. It can range from 1 (prepended by 4 spaces) to 99999.
A is an important UOD.
01JAN is a day/month combination, I need to extract both UODs.
M is a day name short form. This may be a number between 1 and 7.
ABC is an important UOD.
DEF is an important UOD.
The space after DEF may be an *
AA1 may be zero characters, or it may be 5. It is unimportant.
100A is a timestamp, but may be in the format 1300. The A may be N when the time is 1200 or P for times in the PM.
We then see another timestamp.
The next date part may not be there, for example, this is valid:
93*DE/QQ51234 30APR J QWERTY*QQ0 1250 0520 /ABCD*ASDFAS /E
The data where /ABCD*ASDFAS /E appears is irrelevant to the application, but, this is where the second date stamp may appear. The front-slash may be something else (such as a letter).
Note:
It is not space delimited, some parts of the body run into others. Character position is only accurate for the first two or three items on the list
I don't think I left anything out, but, if there's an easier way to parse out a string like this than writing a regex, please let me know.
This is a perfect task for regular expressions. The text does not contain nesting and the items you're matching are fairly simple taken individually.
Most regular expression syntaxes have an extended (x) flag or mode that allows whitespace and comments to improve readability. For example:
$regex = '~
# The delimiter is ~ so that # can start comments inside the pattern.
# 2 is the line number, these range from 1 all the way to 99.
# There is a space character prepending numbers less than 10.
# The space may or may not be replaced by an *.
(?:[ *]\d|\d\d)
\s
# AB is an important unit of data (UOD).
# AB may be prepended by /CD which is another important UOD.
(/CD)?AB
\s
# 123 is an important UOD. It can range from 1 (prepended by 4 spaces)
# to 99999.
(?:\s{4}\d|\s{3}\d{2}|\s{2}\d{3}|\s\d{4}|\d{5})
~x';
And so on.
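The same idea in Python uses re.VERBOSE together with named groups, so each unit of data can be read back by name. This sketch covers only the first few fields of the first sample line; the group names are invented, the optional /CD part is left out, and the exact spacing is an assumption because the sample's whitespace may not have survived formatting.

```python
import re

LINE_PREFIX = re.compile(r"""
    [ *]?(?P<line_number>\d{1,2})   # line number 1-99, optionally padded by a space or *
    [ *]                            # separator: a space, which may be replaced by *
    (?P<unit>[A-Z]{2})              # two-letter unit of data ("AB" in the sample)
    \s*
    (?P<number>\d{1,5})             # 1 to 99999, left-padded with spaces
    (?P<code>[A-Z])                 # single-letter unit of data ("A")
    \s+
    (?P<day>\d{2})                  # day of month, e.g. 01
    (?P<month>[A-Z]{3})             # month abbreviation, e.g. JAN
""", re.VERBOSE)


def parse_prefix(line_text):
    """Return the leading fields as a dict, or None if the line does not match."""
    match = LINE_PREFIX.match(line_text)
    return match.groupdict() if match else None


print(parse_prefix(" 2 AB  123A 01JAN M ABCDEF AA1  100A  200A"))
```

The remaining fields would be appended to the same pattern in the same style, one commented line per field.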
A regex seems fine for this application, but for simplicity and readability, you might want to split this into several regexes (one for each field) so people can more easily follow which part of the regex corresponds to which variable.
You can always code your own parser by hand, but that would be more lines of code than a regex. The lines of code, however, will probably be simpler to follow for the reader.
Simply write a custom parser that handles it line by line. It seems like everything is at a fixed position rather than space/comma-delimited, so simply use those as indices into what you need:
line_number = int(line_text[0:2].lstrip(' *'))  # columns 0-1; a space or * pads numbers below 10
ab_unit = line_text[3:5]  # the two-character unit, e.g. AB
...
If it is indeed space-delimited, simply split() each line and then parse through each, splitting each chunk into component parts where appropriate.