How to extract the year via regex from a string in Ruby

I'm trying to extract the year from a string with this format:
dataset_name = 'ALTVALLEDAOSTA000020191001.json'
I tried:
dataset_name[/<\b(19|20)\d{2}\b>/, 1]
/\b(19|20)\d{2}\b/.match(dataset_name)
I'm still reading the docs but so far I'm not able to achieve the result I want. I'm really bad at regex.

Since your dataset name always ends in yyyymmdd.json, you can slice out the four characters that start 13 characters from the end and stop before the 9th character from the end:
irb(main):001:0> dataset_name = 'ALTVALLEDAOSTA000020191001.json'
irb(main):002:0> dataset_name[-13...-9]
=> "2019"
You can also use a regex if you want a bit more precision:
irb(main):003:0> dataset_name =~ /(\d{4})\d{4}\.json$/
=> 18
irb(main):004:0> $1
=> "2019"
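The regex variant can be wrapped in a small helper that returns nil when the name does not have the expected shape. This is only a sketch: the method name year_from_dataset_name is made up for illustration, and it assumes the name always ends in yyyymmdd.json.

```ruby
# Hypothetical helper (the name is an assumption, not from the question).
# Assumes the file name ends in yyyymmdd.json.
def year_from_dataset_name(name)
  # Anchored only at the end (\z), because only the tail of the name matters.
  # String#[] with a regex and a group index returns nil when nothing matches.
  name[/(\d{4})\d{4}\.json\z/, 1]
end

puts year_from_dataset_name('ALTVALLEDAOSTA000020191001.json') # prints 2019
p year_from_dataset_name('README.md')                          # prints nil
```

Returning nil instead of raising lets the caller decide what to do with names that do not carry a date.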

There are many ways to get to Rome.
Starting with:
foo = 'ALTVALLEDAOSTA000020191001.json'
Stripping the extension to get the basename, then using a regex:
ymd = /(\d{4})(\d{2})(\d{2})$/
ext = File.extname(foo)
File.basename(foo, ext) # => "ALTVALLEDAOSTA000020191001"
File.basename(foo, ext)[ymd, 1] # => "2019"
File.basename(foo, ext)[ymd, 2] # => "10"
File.basename(foo, ext)[ymd, 3] # => "01"
Using a regex against the entire filename to grab just the year:
ymd = /^.*(\d{4})\d{4}/
foo[ymd, 1] # => "2019"
or extracting the year, month and day:
ymd = /^.*(\d{4})(\d{2})(\d{2})/
foo[ymd, 1] # => "2019"
foo[ymd, 2] # => "10"
foo[ymd, 3] # => "01"
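Because .* is greedy, where the capture group lands depends on what the pattern requires after it. A short sketch of that behaviour on the same file name (the variable names are only illustrative):

```ruby
foo = 'ALTVALLEDAOSTA000020191001.json'

greedy  = foo[/^.*(\d{4})/, 1]       # .* swallows as much as it can
bounded = foo[/^.*(\d{4})\d{4}/, 1]  # four more digits must follow the capture
lazy    = foo[/^.*?(\d{4})/, 1]      # .*? stops at the first run of four digits

puts greedy   # 1001
puts bounded  # 2019
puts lazy     # 0000
```

Only the bounded form yields the year here, since the digits in front of the date are zero padding.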
Using String's unpack:
ymd = '@18A4'
foo.unpack(ymd) # => ["2019"]
or:
ymd = '@18A4A2A2'
foo.unpack(ymd) # => ["2019", "10", "01"]
If the strings are consistent length and format, then I'd work with unpack, because, if I remember right, it is the fastest, followed by String slicing, with anchored, then unanchored regular expressions trailing.
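Rather than trusting memory on the speed ordering, it can be measured with the standard Benchmark module. The following is a rough sketch, not a rigorous benchmark: the iteration count is arbitrary, and the numbers will vary by Ruby version and machine.

```ruby
require 'benchmark'

foo = 'ALTVALLEDAOSTA000020191001.json'
n = 100_000 # arbitrary iteration count, chosen only for illustration

# All three approaches should agree before timing them.
# In the unpack template, '@18' moves to absolute offset 18 and 'A4' reads 4 characters.
results = [
  foo.unpack('@18A4').first,
  foo[-13...-9],
  foo[/(\d{4})\d{4}\.json\z/, 1]
]
puts results.inspect

Benchmark.bm(8) do |bm|
  bm.report('unpack') { n.times { foo.unpack('@18A4').first } }
  bm.report('slice')  { n.times { foo[-13...-9] } }
  bm.report('regex')  { n.times { foo[/(\d{4})\d{4}\.json\z/, 1] } }
end
```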

Related

How to write a regex for a date-time string

dateTime = "SATURDAY1200PM1230PMWEEKLY"
Desired Result: "12:00 PM - 12:30 PM"
I tried doing this: let str = "SATURDAY600PM630PMWEEKLY".split(/[^A-Z][0-9]{3,4}(A|P)M/);
But I keep getting an array with chars/numbers. I am unsure if split is the way to go here.
Try a match approach:
var dateTime = "SATURDAY1200PM1230PMWEEKLY";
var ts = dateTime.match(/\d{3,4}[AP]M/g)
.map(x => x.replace(/(\d{1,2})(\d{2})([AP]M)/, "$1:$2 $3"))
.join(" - ");
console.log(ts);
As the programming language was not given I will provide a straightforward solution in Ruby which I expect could be converted easily to most other languages.
str = "SATURDAY1130AM130PMWEEKLY"
rgx = /\A[A-Z]+(\d{1,2})(\d{2})([AP]M)(\d{1,2})(\d{2})([AP]M)[A-Z]+\z/
m = str.match(rgx)
#=> #<MatchData "SATURDAY1130AM130PMWEEKLY" 1:"11" 2:"30" 3:"AM" 4:"1" 5:"30" 6:"PM">
"%s:%s %s - %s:%s %s" % [$1, $2, $3, $4, $5, $6]
#=> "11:30 AM - 1:30 PM"
The regular expression could be broken down as follows.
\A # match beginning of string
[A-Z]+ # match one or more uppercase letters
(\d{1,2}) # match 1 or 2 digits, save to capture group 1
(\d{2}) # match 2 digits, save to capture group 2
([AP]M) # match 'AM' or 'PM', save to capture group 3
(\d{1,2}) # match 1 or 2 digits, save to capture group 4
(\d{2}) # match 2 digits, save to capture group 5
([AP]M) # match 'AM' or 'PM', save to capture group 6
[A-Z]+ # match one or more uppercase letters
\z # match end of string
The last statement could also be written:
"%s:%s %s - %s:%s %s" % m.captures
#=> "11:30 AM - 1:30 PM"
which of course is specific to Ruby.
Another way is to make use of a language's date-time library. Again, this could be done as follows in Ruby.
require 'time'
s1, s2 = str.scan(/\d{3,4}[AP]M/).map do |s|
s.sub(/(?=\d{2}[AP])/, ' ')
end
#=> ["11 30AM", "1 30PM"]
t1 = DateTime.strptime(s1, '%I %M%p')
#=> #<DateTime: 2022-02-01T11:30:00+00:00
# ((2459612j,41400s,0n),+0s,2299161j)>
t2 = DateTime.strptime(s2, '%I %M%p')
#=> #<DateTime: 2022-02-01T13:30:00+00:00
# ((2459612j,48600s,0n),+0s,2299161j)>
t1.strftime('%-l:%M %p') + " - " + t2.strftime('%-l:%M %p')
#=> "11:30 AM - 1:30 PM"
If you are wondering why .map do |s| s.sub(/(?=\d{2}[AP])/, ' ') end is needed in calculating s1 and s2 try removing it and changing the format string to '%I%M%p'.
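The scan idea can also be used without the date-time library. A minimal sketch, assuming the string contains exactly two times; the method name time_range is hypothetical:

```ruby
# Hypothetical helper; assumes exactly two times such as "1130AM" and "130PM"
# are embedded in the string. Returns nil otherwise.
def time_range(str)
  # Each match yields [hour, minutes, meridiem], e.g. ["11", "30", "AM"].
  times = str.scan(/(\d{1,2})(\d{2})([AP]M)/)
  return nil unless times.size == 2

  times.map { |hour, min, meridiem| "#{hour}:#{min} #{meridiem}" }.join(' - ')
end

puts time_range('SATURDAY1130AM130PMWEEKLY')  # 11:30 AM - 1:30 PM
puts time_range('SATURDAY1200PM1230PMWEEKLY') # 12:00 PM - 12:30 PM
p time_range('WEEKLY')                        # nil
```

Unlike the strptime version, this does not check that the hours and minutes are valid clock values.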
One solution is to use match and then convert the result into the string you want:
let str = "SATURDAY600PM630PMWEEKLY"
.match(/[\d]{3,4}(A|P)M/g)
.map((time) => {
const AMPM = time.slice(-2);
const m = time.slice(-4,-2);
const h = time.slice(0,-4);
return `${h}:${m} ${AMPM}`;
})
.join(' - ')
console.log(str)

make hash consistent when some of its keys are symbols and others are not

I use this code to merge a MatchData with a Hash:
params = {
:url => 'http://myradiowebsite.com/thestation'
}
pattern = Regexp.new('^https?://(?:www.)?myradiowebsite.com/(?<station_slug>[^/]+)/?$')
matchdatas = pattern.match(params[:url])
#convert named matches in MatchData to Hash
#https://stackoverflow.com/a/11690565/782013
datas = Hash[ matchdatas.names.zip( matchdatas.captures ) ]
params = params.merge(datas)
But this gives me mixed keys in my params hash:
{:url=>"http://myradiowebsite.com/thestation", "station_slug"=>"thestation"}
Which is a problem to get the hash values using the keys later. I would like to standardize them to symbols.
I'm learning Ruby. Can someone explain to me whether there is something wrong with this code, and how to improve it?
Thanks !
First, note that with
pattern =
Regexp.new('^https?://(?:www.)?myradiowebsite.com/(?<station_slug>[^/]+)/?$')
#=> /^https?:\/\/(?:www.)?myradiowebsite.com\/(?<station_slug>[^\/]+)\/?$/
we obtain
'http://wwwXmyradiowebsiteYcom/thestation'.match?(pattern)
#=> true
which means that the periods after 'www' and before 'com' need to be escaped:
pattern =
Regexp.new('\Ahttps?://(?:www\.)?myradiowebsite\.com/(?<station_slug>[^/]+)/?\z')
#=> /\Ahttps?:\/\/(?:www\.)?myradiowebsite\.com\/(?<station_slug>[^\/]+)\/?\z/
I've also replaced the beginning-of-line anchor (^) with the beginning-of-string anchor (\A) and the end-of-line anchor ($) with the end-of-string anchor (\z), though either can be used here since the string consists of a single line.
You are given the two keys you want in the hash you are returning: :url and :station_slug, so for
params = { :url => 'http://myradiowebsite.com/thestation' }
you can compute
m = params[:url].match(pattern)
#=> #<MatchData "http://myradiowebsite.com/thestation" station_slug:"thestation">
then so long as m is not nil (as here), write
{ :url => m[0], :station_slug => m["station_slug"] }
#=> {:url=>"http://myradiowebsite.com/thestation", :station_slug=>"thestation"}
See MatchData#[]. m[0] returns the entire match; m["station_slug"] returns the contents of the capture group named "station_slug".
Obviously, the name of the capture group can be any valid string, or you could make it an unnamed capture group and write
{ :url => m[0], :station_slug => m[1] }
You could transform the keys of datas to symbols:
Hash[ matchdatas.names.zip( matchdatas.captures ) ].transform_keys(&:to_sym)
Or define your params hash with string keys:
params = { 'url' => 'http://myradiowebsite.com/thestation' }
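A variation on the same idea, sketched under the assumption that a non-matching URL should simply leave params untouched: MatchData#named_captures returns the named groups as a Hash with String keys, which can then be symbolized before merging.

```ruby
pattern = %r{\Ahttps?://(?:www\.)?myradiowebsite\.com/(?<station_slug>[^/]+)/?\z}
params  = { url: 'http://myradiowebsite.com/thestation' }

match = pattern.match(params[:url])
# named_captures gives String keys ("station_slug"), so convert them to symbols.
# The trailing `if match` guards against nil when the URL does not match.
params = params.merge(match.named_captures.transform_keys(&:to_sym)) if match

p params
```

After this, params[:station_slug] and params[:url] can both be read with symbol keys.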

Why condition returns True using regular expressions for finding special characters in the string?

I need to validate the variable names:
name = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]
And just names "variable0", "var_1__Int" and "a" are correct.
I could identify most of the "wrong" variable names using a regex:
import re
if re.match("^\d|\W|.*-|[()[]{}]", name):
print(False)
else:
print(True)
However, I still get a True result for va[riable0. Why is that?
I control for all type of parentheses.
.match() checks for a match only at the beginning of the string, while .search() checks for a match anywhere in the string.
You can also simplify your regex to this and call search() method:
^\d|\W
That checks whether the first character is a digit, or whether a non-word character appears anywhere in the input.
Code:
>>> import re
>>> name = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]
>>> pattern = re.compile(r'^\d|\W')
>>> for s in name:
...     if pattern.search(s):
...         print(s + ' => False')
...     else:
...         print(s + ' => True')
...
2w2 => False
 variable => False
variable0 => True
va[riable0 => False
var_1__Int => True
a => True
qq-q => False
Your expression is:
"^\d|\W|.*-|[()[]{}]"
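But re.match() always matches from the beginning of the string, so your ^ is unnecessary. What you do need is a $ at the end, to make sure the entire input string matches and not just a prefix. Note also that [()[]{}] does not mean "any bracket": it is parsed as the character class [()[] followed by the literal text {}], so a lone [ is never caught by that alternative.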
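The "match the entire string" idea can also be turned around: describe what a valid name looks like and anchor both ends. The sketch below is in Ruby; the same pattern should carry over to Python with re.fullmatch. It assumes a valid name starts with a letter or underscore and continues with word characters, and unlike the patterns above it also rejects the empty string.

```ruby
# Assumed rule: a letter or underscore first, then any number of word characters.
# \A and \z anchor the whole string, so no stray character can slip through.
VALID_NAME = /\A[A-Za-z_]\w*\z/

names = ["2w2", " variable", "variable0", "va[riable0", "var_1__Int", "a", "qq-q"]
valid = names.select { |name| VALID_NAME.match?(name) }

puts valid.inspect # ["variable0", "var_1__Int", "a"]
```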

Regex pattern to match groups starting with pattern

I am extracting data from a text stream that is structured as follows:
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field precedes the record data which can then have any number of following fields, each with choice of recType from 2 to 9 (also, each field starts with a "/")
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
The data above is simplified. I know for certain that it only contains [A-Z0-9. ], but it can be of variable length (not just 4 chars as in the example).
Now, the following expression sort of works, but it's only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.
You can use a regex to match these blocks using 2 .NET regex features: 1) capture collection and 2) multiple capturing groups with the same name in the pattern. Then, we'll need some Linq magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
/ - a literal /
(?<fldNo>[2-9]) - a digit from 2 to 9 (Group "fldNo")
[-=] - a - or =
(?'fldData'[^/]+)- 1+ chars other than / (Group "fldData")
See C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;
public class Test
{
    public static void Main()
    {
        var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
        // Each match is one record; the Captures collections keep every repetition of a group.
        var res = Regex.Matches(str, @"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
            .Cast<Match>()
            .Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
                .Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
                    (first, second) => first + "=" + second))
            .ToList();
        foreach (var t in res)
            Console.WriteLine(string.Join(" ", t));
    }
}
I would suggest first splitting the string by /1, then using a pattern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
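The two-step idea (cut the stream into records, then pull the fields out of each record) can be sketched in Ruby as follows. The regexes here are assumptions based on the sample data: each record starts at a "/1" field, field numbers are single digits, and the data never contains "/".

```ruby
stream = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE"

# Step 1: cut the stream into records, each starting at a "/1" field.
record_rx = %r{/1[-=][^/]+(?:/[2-9][-=][^/]+)*}
# Step 2: pull every (field number, data) pair out of one record.
field_rx = %r{/([1-9])[-=]([^/]+)}

# scan returns whole matches for record_rx (no capture groups)
# and [number, data] pairs for field_rx (two capture groups).
records = stream.scan(record_rx).map { |record| record.scan(field_rx).to_h }

records.each { |fields| puts fields.map { |no, data| "#{no}=#{data}" }.join(' ') }
# 1=XXXX 2=YYYY 9=ZZZZ
# 1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
# 1=QQQQ 2=WWWW 3=PPPP 7=EEEE
```

Each record ends up as a hash keyed by field number, so a missing field shows up as a missing key rather than an empty capture.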
A single regex isn't the optimal tool for this (at least not used in this way). The main reason is that your stream has a variable number of entries, and a variable number of capture groups is not supported. I also noticed that some of the values are separated by "=" rather than a dash, which your current regex doesn't address.
The problem comes when you try to add a quantifier to a capture group: in most regex engines the group only remembers the last thing it captured (.NET's capture collections are an exception), so you end up with the first and last fields and lose all the rest. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
If your streams were all the same length, a single regex could be used. There is, however, a way to do it with a foreach loop and a much simpler regex applied to each part of the stream, which also validates the stream as it goes along.
Now, I'm not sure which language you're working with, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
    /*
     * Get an array of [num]-[letters] with explode. This will make an array that
     * contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
     */
    $arr = explode("/", substr($str, 1));
    $sorted = array();
    $key = 0;
    /*
     * Sort this data into key->values based on numeric ordering.
     * If the next one has a lower or equal starting number than the one before it,
     * a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
     * entry to be made, just in case the stream doesn't always start with 1.
     */
    foreach ($arr as $value)
    {
        // This will get the number at the start, and has the added bonus of making sure
        // each bit is in the right format.
        if (preg_match("/^([0-9]+)[=-]([A-Z]+)$/", $value, $matches)) {
            $newKey = (int)$matches[1];
            $match = $matches[2];
        } else
            throw new Exception("This is not a valid data stream!");
        // This bit checks if we've got a lower starting number than last time.
        if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
            $key += 1;
        // Now sort them..
        $sorted[$key][$newKey] = $match;
        // This will be compared in the next iteration of the loop.
        $lastKey = $newKey;
    }
    return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
    $extracted = extractFromStream($full);
    $stream1 = $extracted[0];
    $stream2 = $extracted[1];
    $stream3 = $extracted[2];
    print "<pre>";
    echo "Full extraction: \n";
    print_r($extracted);
    echo "\nFirst Stream:\n";
    print_r($stream1);
    echo "\nSecond Stream:\n";
    print_r($stream2);
    echo "\nThird Stream:\n";
    print_r($stream3);
    print "</pre>";
} catch (Exception $e) {
    echo $e->getMessage();
}
This will print
Full extraction:
Array
(
[0] => Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
[1] => Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
[2] => Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
)
First Stream:
Array
(
[1] => XXXX
[2] => YYYY
[9] => ZZZZ
)
Second Stream:
Array
(
[1] => AAAA
[3] => BBBB
[5] => CCCC
[8] => NNNN
[9] => DDDD
)
Third Stream:
Array
(
[1] => QQQQ
[2] => WWWW
[3] => PPPP
[7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)

Using regex in Scala to group and pattern match

I need to process phone numbers using regex and group them by (country code) (area code) (number). The input format:
country code: between 1-3 digits
area code: between 1-3 digits
number: between 4-10 digits
Examples:
1 877 2638277
91-011-23413627
And then I need to print out the groups like this:
CC=91,AC=011,Number=23413627
This is what I have so far:
String s = readLine
val pattern = """([0-9]{1,3}) [ -] ([0-9]{1,3}) [ -] ([0-9]{4,10})""".r
val ret = pattern.findAllIn(s)
println("CC=" + ret.group(1) + "AC=" + ret.group(2) + "Number=" + ret.group(3));
The compiler said "empty iterator." I also tried:
val (cc,ac,n) = s
and that didn't work either. How to fix this?
The problem is with your pattern. I would recommend using some tool like RegexPal to test them. Put the pattern in the first text box and your provided examples in the second one. It will highlight the matched parts.
You added spaces between your groups and [ -] separators, and it was expecting spaces there. The correct pattern is:
val pattern = """([0-9]{1,3})[ -]([0-9]{1,3})[ -]([0-9]{4,10})""".r
Also, if you want to access the groups explicitly, you need a Match. For example, findFirstMatchIn returns an Option[Match] for the first match, and findAllMatchIn returns an iterator over all matches:
val allMatches = pattern.findAllMatchIn(s)
allMatches.foreach { m =>
  println("CC=" + m.group(1) + ",AC=" + m.group(2) + ",Number=" + m.group(3))
}
val matched = pattern.findFirstMatchIn(s)
matched match {
  case Some(m) =>
    println("CC=" + m.group(1) + ",AC=" + m.group(2) + ",Number=" + m.group(3))
  case None =>
    println("There wasn't a match!")
}
I see you also tried extracting the string into variables. You have to use the Regex extractor in the following way:
val Pattern = """([0-9]{1,3})[ -]([0-9]{1,3})[ -]([0-9]{4,10})""".r
val Pattern(cc, ac, n) = s
println(s"CC=$cc,AC=$ac,Number=$n")
And if you want to handle errors:
s match {
  case Pattern(cc, ac, n) =>
    println(s"CC=$cc,AC=$ac,Number=$n")
  case _ =>
    println("No match!")
}
You can also take a look at string interpolation to make your strings easier to read: s"..."