perl regexp with sas - exact match one or the other - regex

I need to extract numbers written in words or in figures in a text.
I have a table that looks like that,
... 1 child ...
... three children ...
...four children ...
...2 children...
...five children
I want to capture a number written in words or in numeric figures. There is one number per line. So the desired output would be:
1
three
four
2
five
My regex looks like that:
prxparse("/one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|child|\d\d?/")
Any help ?

Description
This regex will match the numbers in the string, provided each number is surrounded by whitespace or sits at the start or end of the string.
(?<=\s|^)(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)(?=\s|$)
Live Example: http://www.rubular.com/r/6ua7fTb8IS
To match spelled-out numbers beyond one through ten, you'll need to include those words as well. This regex will capture the numbers from zero to one hundred (barring any typos):
(?<=\s|^)(?:[0-9]+|(?:(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)\s)?(?:one(?:[\s-]hundred)?|two|three|four|five|six|seven|eight|nine)|ten|eleven|twelve|(?:thir|four|fif|six|seven|eight|nine)teen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|zero)(?=\s|$)
Live Example: http://www.rubular.com/r/EIa18nx731
Perl Example
$string = <<END;
... 1 child ...
... three children ...
... four children ...
... 2 children...
... five children
END
@matches = $string =~ m/(?<=\s|^)(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)(?=\s|$)/gi;
print join("\n", @matches);
Yields
1
three
four
2
five
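For comparison, here is a minimal Python sketch of the same idea. It is not SAS or Perl, and it makes one assumption that differs from the answer above: it uses word boundaries (\b) instead of whitespace lookarounds, because some rows in the question ("...four children ...", "...2 children...") have no space before the number. The names NUMBER and extract_numbers are made up for illustration.

```python
import re

# Digits or a spelled-out number from one to ten, as a whole word.
# \b is an assumption here: it also accepts "...four", which the
# whitespace-based lookarounds would reject.
NUMBER = re.compile(
    r"\b(?:[0-9]+|one|two|three|four|five|six|seven|eight|nine|ten)\b",
    re.IGNORECASE,
)

def extract_numbers(lines):
    """Return the first number found on each line (None if there is none)."""
    found = []
    for line in lines:
        match = NUMBER.search(line)
        found.append(match.group(0) if match else None)
    return found

rows = [
    "... 1 child ...",
    "... three children ...",
    "...four children ...",
    "...2 children...",
    "...five children",
]
print("\n".join(str(n) for n in extract_numbers(rows)))
```

If the numbers are always delimited by whitespace, the lookaround version from the answer is the stricter choice.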

Related

How to use a Regular Expression to find a Specific word and return following 10 characters?

I need to find a regular expression that finds "Order #" and then return the following 10 characters.
For example, I have the following rows (ignore the row numbers; I'm just using them to show that each one is a new line in the original data):
Row 1 Order #100013661 By John DOE
Row 2 REFUND for CHARGE(Order #100013667 By Lara Croft
Row 3 Order #100013668 By Sammy
Row 4 Blah Blah Blah Order #10013664 By Fluffy fluff
I want the expression to return:
ROW 1 100013661
ROW 2 100013667
Row 3 100013668
Row 4 100013664
Use capturing groups for that:
Order #(.{9})
Use the tools in your hosting language to harvest the capturing group.
The regex you need is
(?<=Order #).{10}
Detailed explanation:
(?<=Order #) is a positive lookbehind: it matches if the literal string Order # occurs before current position;
.{10} matches any 10 characters.
Note that this won't match if the line has fewer than 10 characters after the search string. If you need to match up to 10 characters, not exactly 10 characters, replace {10} with {1,10}.
Order #(.{10}) or Order #(.{1,10}) if it could be up to 10 characters.
Order #(\d{1,10}) if they are always numbers.
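A short Python sketch of the two approaches above, using the rows from the question. The helper name order_ids is hypothetical. It also shows why \d is the safer choice when the numbers vary in length: a fixed .{10} picks up whatever follows a shorter number.

```python
import re

rows = [
    "Order #100013661 By John DOE",
    "REFUND for CHARGE(Order #100013667 By Lara Croft",
    "Order #100013668 By Sammy",
    "Blah Blah Blah Order #10013664 By Fluffy fluff",
]

# Capturing group: keep only the digits that follow "Order #".
ORDER = re.compile(r"Order #(\d{1,10})")

def order_ids(lines):
    """Return the order number from every line that contains one."""
    ids = []
    for line in lines:
        match = ORDER.search(line)
        if match:
            ids.append(match.group(1))
    return ids

print(order_ids(rows))

# Lookbehind variant: exactly 10 characters of any kind, so a 9-digit
# number drags the following space along with it.
ten_chars = re.search(r"(?<=Order #).{10}", rows[0]).group(0)
print(repr(ten_chars))
```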

Obtaining geographic decimal coordinates from proprietary text format using regex

Using only Notepad++ with regex support I would like to extract some data from a txt file, representing geographic coordinates and organize the output like that:
-123456789 becomes -123.456789
123456789 becomes 123.456789
-23456789 becomes -23.456789
56789 becomes 0.056789
-89 becomes -0.000089
I tried this: (-?)([0-9]*)([0-9]{6}) but it fails when the input is fewer than 6 digits long.
You will need 2 steps in notepad++ to do this. First, let's take a look at the regex:
(?<sign>-?)(?<first>\d+(?=\d{6}))?(?<last>\d+)
captures the necessary parts in groups.
Explanation: (you can lose the named grouping if you want)
(?<sign>-?) # read the '-' sign
(?<first>\d+(?=\d{6}))? # read as many digits as possible,
# leaving 6 digits at the end.
(?<last>\d+) # read the remaining digits.
see regex101.com
How to use this in notepad++? Using a two step-search and replace:
(-?)(\d+(?=\d{6}))?(\d+)
replace with:
\1(?2\2.:0.)000000\3 # copy sign, if group 2 contains any
# values, copy them, followed by '.'.
# If not show a '0.'
# Print 6 zeros, followed by group 3.
Next, replace the superfluous zeros.
\.(0+(?=\d{6}\b))(\d{6}) # Replace the maximum number of zeros
# leaving 6 digits at the end.
replace with:
.\2
You can do it in three steps:
Step 1: replace (-?)\b(\d{1,6})\b with \10000000\2
Step 2: replace (-?)(\d{0,})(\d{6}) with \1\2.\3
Step 3: replace 0{2,}\. with 0.
The idea is simple:
In step one, pad every number of at most 6 digits with 6 leading zeros, so that its length is guaranteed to exceed 6 digits.
In step two, insert the dot before the last 6 digits.
In step three, replace any run of multiple zeros before the dot with a single zero.
Note that step three as written also collapses zeros in a value such as 100.123456; anchoring it as \b0{2,}\. avoids that.
In the end, the output is:
-123.456789
123.456789
-23.456789
0.056789
-0.000089
You could use a Python Script plugin available for notepad++:
editor.rereplace('(\d+)', lambda m: ('%f' % (float(m.group(1))/1000000)))
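If the conversion can happen outside the editor, the same padding idea can be sketched in plain Python with string operations only, which sidesteps floating-point formatting. This is a standalone illustration, not the Notepad++ plugin API; the function name to_decimal_degrees is made up, and it assumes each input is an optional minus sign followed by digits.

```python
import re

def to_decimal_degrees(raw):
    """Insert a decimal point six digits from the right, padding with zeros."""
    match = re.fullmatch(r"(-?)(\d+)", raw.strip())
    if match is None:
        raise ValueError(f"not an integer coordinate: {raw!r}")
    sign, digits = match.groups()
    # Pad to at least 7 digits so there is always one digit before the dot.
    digits = digits.zfill(7)
    return f"{sign}{digits[:-6]}.{digits[-6:]}"

for value in ["-123456789", "123456789", "-23456789", "56789", "-89"]:
    print(value, "becomes", to_decimal_degrees(value))
```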

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group of three comma-separated fields (one or more non-commas, then twice a comma followed by one or more non-commas), followed by vertical whitespace (a line break) that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = [0,0,0]
repetitions = 0  # number of consecutive lines repeating the previous coordinates
lines = [line.rstrip('\n') for line in open(r'C:\temp\gps.csv')]
for line in lines:
    gpscoords = line.split(',')[3:6]  # last three columns
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4: #or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
If you can use Perl, and if I understood you correctly, this should do the trick (the fields are compared with eq, because the three fields together are a string, not a number):
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631
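Another way to express the same check in Python is to group consecutive lines by their last three columns. The sketch below uses itertools.groupby; the function name stuck_runs and the min_len threshold are assumptions for illustration. Unlike the loop-and-counter versions, it also reports a run that lasts until the end of the file.

```python
from itertools import groupby

def stuck_runs(lines, min_len=4):
    """Return (first_line, last_line, coords) for each run of identical GPS columns."""
    def gps_columns(line):
        # Columns 4 to 6 of the CSV row: latitude, longitude, altitude.
        return tuple(line.strip().split(",")[3:6])

    runs = []
    line_no = 1  # 1-based number of the first line in the current group
    for coords, group in groupby(lines, key=gps_columns):
        length = len(list(group))
        if length >= min_len:
            runs.append((line_no, line_no + length - 1, coords))
        line_no += length
    return runs

sample = [
    "5451,1667,180007,35.7397387,97.8161897,375.8",
    "5448,1053z,180006,35.7397407,97.8161814,375.7",
    "36859,1598,202603.00,35.8867316,99.2515545,555.700",
    "36859,1598,202608.00,35.8867316,99.2515545,555.700",
    "36859,1142z,202610.00,35.8867316,99.2515545,555.700",
    "36859,1597,202612.00,35.8867316,99.2515545,555.700",
    "36859,1597,202614.00,35.8867316,99.2515545,555.700",
    "36859,1596,202616.00,35.8867316,99.2515545,555.700",
    "36859,1595,202618.00,35.8867316,99.2515545,555.700",
]
for first, last, coords in stuck_runs(sample):
    print(f"lines {first} to {last} repeat {', '.join(coords)}")
```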

How to refer to the match order in the replace string?

I'm looking for a way to insert each match's ordinal position into its replacement string. For example, for the 1st matched string the value should be 1, for the 2nd it's 2 … for the nth it should be n. The value I'm looking for is the position of the matched string among all other matched strings.
Example for what I'm trying to get
Let's say that I have this original content ...
<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>
... and I want it to be manipulated to be like this ...
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
I already know that I need <"(.*?)"\((.*?)\)> to match each element. For the replace code I think I need something like #MATCH ORDER REFERENCE#\n\$1\n$2\n.
Note
I'm using Perl on Windows.
Use the /e modifier to evaluate the replacement. See Regexp Quote-Like Operators.
Then you can increase a counter on each replacement.
Code
my $text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>';
my $counter = 1;
$text =~ s/<"([^"]+)"\(([^()]+)\)>/$counter++."\n$1\n$2\n\n"/ge;
print $text;
Output
1
BOY
GUN
2
GIRL
BAG
3
SISTERS
CANDY
4
JOHN
HAT
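For readers who are not using Perl, the same counter-in-the-replacement idea can be sketched in Python, where re.sub accepts a function as the replacement. This is an analogue of the /e approach, not a translation the answer itself provides; the function name number_matches is hypothetical, and it emits a single newline after each field.

```python
import re
from itertools import count

ELEMENT = re.compile(r'<"([^"]+)"\(([^()]+)\)>')

def number_matches(text):
    """Replace each <"NAME"(ITEM)> element with its match order, NAME and ITEM."""
    counter = count(1)  # yields 1, 2, 3, ... one value per replacement

    def replacement(match):
        return f"{next(counter)}\n{match.group(1)}\n{match.group(2)}\n"

    return ELEMENT.sub(replacement, text)

text = '<"BOY"(GUN)><"GIRL"(BAG)><"SISTERS"(CANDY)><"JOHN"(HAT)>'
print(number_matches(text), end="")
```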

to search for consecutive list elements prefixed by number and dot in plain text

The text looks like this:
"Beginning. 1. The container is 1.5 meters long 2. It can hold up to 2lt of fluid. 3. It has 4 holes."
There may not be a dot at the end of each list element.
How can I split this text into a list as shown below?
"Beginning."
"The container is 1.5 meters long"
"It can hold up to 2lt of fluid."
"It has 4 holes."
In other words I need to match (\d+)\. such that all (\d+) are consecutive integers so that I can split and trim the text between them. Is it possible with regex? How far do I have to venture into the realm of computer science?
Use
\d+\.(?!\d)
as the splitting regex, e.g. in PHP:
$result = preg_split('/\d+\.(?!\d)/', $subject);
The negative lookahead (?!\d) ensures that no digit follows after the dot has been matched.
Or make the spaces mandatory - if that's an option:
$result = preg_split('/\s+\d+\.\s+/', $subject);
This is working c# code:
string s = "Beginning. 1. The container is 1.5 meters long 2. It can hold up to 2lt of fluid. 3. It has 4 holes.";
string[] res = Regex.Split(s, @"\s*\d+\.\s+");
foreach (var r in res)
{
Console.WriteLine(r);
}
Console.ReadLine();
I split on \s*\d+\.\s+, which means optional whitespace, followed by at least one digit, followed by a dot, then at least one whitespace character.
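None of the regexes above checks that the list numbers are actually consecutive, which the question asked about. A regex alone cannot count, but a few lines of code around it can. The Python sketch below is one possible way, under the assumption that the list starts at 1; the name split_numbered is made up. It reuses the \d+\.(?!\d) marker and skips any marker whose number is not the next expected one.

```python
import re

# A list marker: digits followed by a dot that is not followed by a digit,
# so "1.5" is not mistaken for item 1.
MARKER = re.compile(r"(\d+)\.(?!\d)")

def split_numbered(text):
    """Split text at markers 1., 2., 3., ... only when they appear in order."""
    parts = []
    start = 0
    expected = 1
    for match in MARKER.finditer(text):
        if int(match.group(1)) != expected:
            continue  # not the next item number, treat it as ordinary text
        parts.append(text[start:match.start()].strip())
        start = match.end()
        expected += 1
    parts.append(text[start:].strip())
    return parts

subject = ("Beginning. 1. The container is 1.5 meters long "
           "2. It can hold up to 2lt of fluid. 3. It has 4 holes.")
for part in split_numbered(subject):
    print(part)
```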