Trouble sorting a list after using regex - regex

The code below is parsing data from this text sample:
rf-Parameters-v1020
supportedBandCombination-r10: 128 items
Item 0
BandCombinationParameters-r10: 1 item
Item 0
BandParameters-r10
bandEUTRA-r10: 2
bandParametersUL-r10: 1 item
Item 0
CA-MIMO-ParametersUL-r10
ca-BandwidthClassUL-r10: a (0)
bandParametersDL-r10: 1 item
Item 0
CA-MIMO-ParametersDL-r10
ca-BandwidthClassDL-r10: a (0)
supportedMIMO-CapabilityDL-r10: fourLayers (1)
I am having trouble replacing the first 'a' from the "ca-BandwidthClassUL-r10" line with 'u' and placing it before 'm' in the final output: [2 a(0) u m]
import re
regex = r"bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*\r?\nca-BandwidthClassUL-r10*: *(\w.*)(" \
r"?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(" \
r"\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*) "
regex2 = r"^.*bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*\r?\nca-BandwidthClassUL-r10*: *(\w.*)(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*)(?:\r?\n(?!bandEUTRA-r10:).*)*\r?\nbandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*)"
my_file = open("files.txt", "r")
content = my_file.read().replace("fourLayers", 'm').replace("twoLayers", " ")
#print(content)
#if 'BandCombinationParameters-r10: 1 item' in content:
result = ["".join(m) for m in re.findall(regex, content, re.MULTILINE)]
print(result)

You might use an optional part where you capture group 2.
Then you can print group 3 concatenated with u if there is group 2, else only print group 3.
As you are already matching the text in the regex, you don't have to do the separate replacement calls. You can use the text in the replacement itself.
bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*(?:\r?\n(ca-BandwidthClassUL-r10)?: *(\w.*))(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *\w.*\nsupportedMIMO-CapabilityDL-r10:
For example
import re
regex = r"bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*(?:\r?\n(ca-BandwidthClassUL-r10)?: *(\w.*))(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *\w.*\nsupportedMIMO-CapabilityDL-r10:"
s = "here the example data with and without ca-BandwidthClassUL-r10"
matches = re.finditer(regex, s, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    result = "{0}{1} m".format(
        match.group(1),
        match.group(3) + " u" if match.group(2) else match.group(3)
    )
    print(result)
Output
2a (0) u m
2a (0) m
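For a version that runs as-is, the sketch below applies the same pattern to the sample from the question. The names sample and format_match are illustrative, the sample lines are assumed to have no leading indentation, and only the case where the ca-BandwidthClassUL-r10 line is present is exercised.

```python
import re

# Pattern from the answer above; group 2 is the optional
# "ca-BandwidthClassUL-r10" label, group 3 is the value after it.
regex = (
    r"bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*"
    r"(?:\r?\n(ca-BandwidthClassUL-r10)?: *(\w.*))"
    r"(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*"
    r"\r?\nca-BandwidthClassDL-r10*: *\w.*\nsupportedMIMO-CapabilityDL-r10:"
)

# Sample lines copied from the question (assumption: no leading whitespace).
sample = "\n".join([
    "rf-Parameters-v1020",
    "supportedBandCombination-r10: 128 items",
    "Item 0",
    "BandCombinationParameters-r10: 1 item",
    "Item 0",
    "BandParameters-r10",
    "bandEUTRA-r10: 2",
    "bandParametersUL-r10: 1 item",
    "Item 0",
    "CA-MIMO-ParametersUL-r10",
    "ca-BandwidthClassUL-r10: a (0)",
    "bandParametersDL-r10: 1 item",
    "Item 0",
    "CA-MIMO-ParametersDL-r10",
    "ca-BandwidthClassDL-r10: a (0)",
    "supportedMIMO-CapabilityDL-r10: fourLayers (1)",
])


def format_match(match):
    """Append ' u' to the captured value only when the UL label was matched."""
    value = match.group(3)
    if match.group(2):
        value += " u"
    return "{0}{1} m".format(match.group(1), value)


results = [format_match(m) for m in re.finditer(regex, sample)]
print(results)  # ['2a (0) u m']
```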

Related

How to write a regex for a date-time string

dateTime = "SATURDAY1200PM1230PMWEEKLY"
Desired Result: "12:00 PM - 12:30 PM"
I tried doing this: let str = "SATURDAY600PM630PMWEEKLY".split(/[^A-Z][0-9]{3,4}(A|P)M/);
But I keep getting an array with chars/numbers. I am unsure if split is the way to go here.
Try a match approach:
var dateTime = "SATURDAY1200PM1230PMWEEKLY";
var ts = dateTime.match(/\d{3,4}[AP]M/g)
.map(x => x.replace(/(\d{1,2})(\d{2})([AP]M)/, "$1:$2 $3"))
.join(" - ");
console.log(ts);
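The same match-then-reformat idea can be sketched in Python for comparison. The function name format_time_range is hypothetical; the pattern captures hours, minutes and the AM/PM marker in one pass instead of matching and then replacing.

```python
import re


def format_time_range(text):
    """Find every h(h)mm + AM/PM token and join them as 'h:mm AM - h:mm PM'."""
    # (\d{1,2}) backtracks to a single digit when only three digits precede AM/PM.
    times = re.findall(r"(\d{1,2})(\d{2})([AP]M)", text)
    return " - ".join(f"{hours}:{minutes} {meridiem}" for hours, minutes, meridiem in times)


print(format_time_range("SATURDAY1200PM1230PMWEEKLY"))  # 12:00 PM - 12:30 PM
print(format_time_range("SATURDAY600PM630PMWEEKLY"))    # 6:00 PM - 6:30 PM
```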
As the programming language was not given I will provide a straightforward solution in Ruby which I expect could be converted easily to most other languages.
str = "SATURDAY1130AM130PMWEEKLY"
rgx = /\A[A-Z]+(\d{1,2})(\d{2})([AP]M)(\d{1,2})(\d{2})([AP]M)[A-Z]+\z/
m = str.match(rgx)
#=> #<MatchData "1130AM130PM" 1:"11" 2:"30" 3:"AM" 4:"1" 5:"30" 6:"PM">
"%s:%s %s - %s:%s %s" % [$1, $2, $3, $4, $5, $6]
#=> "11:30 AM - 1:30 PM"
The regular expression could be broken down as follows.
\A # match beginning of string
[A-Z]+ # match one or more uppercase letters
(\d{1,2}) # match 1 or 2 digits, save to capture group 1
(\d{2}) # match 2 digits, save to capture group 2
([AP]M) # match 'AM' or 'PM', save to capture group 3
(\d{1,2}) # match 1 or 2 digits, save to capture group 4
(\d{2}) # match 2 digits, save to capture group 5
([AP]M) # match 'AM' or 'PM', save to capture group 6
[A-Z]+ # match one or more uppercase letters
\z # match end of string
The last statement could also be written:
"%s:%s %s - %s:%s %s" % m.captures
#=> "11:30 AM - 1:30 PM"
which of course is specific to Ruby.
Another way is to make use of a language's date-time library. Again, this could be done as follows in Ruby.
require 'time'
s1, s2 = str.scan(/\d{3,4}[AP]M/).map do |s|
  s.sub(/(?=\d{2}[AP])/, ' ')
end
#=> ["11 30AM", "1 30PM"]
t1 = DateTime.strptime(s1, '%I %M%p')
#=> #<DateTime: 2022-02-01T11:30:00+00:00
# ((2459612j,41400s,0n),+0s,2299161j)>
t2 = DateTime.strptime(s2, '%I %M%p')
#=> #<DateTime: 2022-02-01T13:30:00+00:00
# ((2459612j,48600s,0n),+0s,2299161j)>
t1.strftime('%l:%M %p') + " - " + t2.strftime('%l:%M %p')
#=> "11:30 AM - 1:30 PM"
If you are wondering why .map do |s| s.sub(/(?=\d{2}[AP])/, ' ') end is needed in calculating s1 and s2 try removing it and changing the format string to '%I%M%p'.
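A rough Python counterpart of the date-time approach, assuming the default C locale for AM/PM parsing. The helper names parse_times and format_time are made up for this sketch. As in the Ruby version, a space is inserted before the minutes, because a format without a separator could read a token such as "100PM" as hour 10, minute 0.

```python
import re
from datetime import datetime


def parse_times(text):
    """Extract the time tokens and parse them with datetime.strptime."""
    tokens = re.findall(r"\d{3,4}[AP]M", text)
    parsed = []
    for token in tokens:
        # Put a space between hours and minutes: "130PM" -> "1 30PM".
        spaced = re.sub(r"(?=\d{2}[AP]M$)", " ", token)
        parsed.append(datetime.strptime(spaced, "%I %M%p"))
    return parsed


def format_time(moment):
    """Format as 'h:mm AM/PM' without a leading zero, independent of platform flags."""
    hour = moment.hour % 12 or 12
    meridiem = "AM" if moment.hour < 12 else "PM"
    return f"{hour}:{moment.minute:02d} {meridiem}"


start, end = parse_times("SATURDAY1130AM130PMWEEKLY")
print(format_time(start) + " - " + format_time(end))  # 11:30 AM - 1:30 PM
```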
The solution is to use match and then convert the result into your string:
let str = "SATURDAY600PM630PMWEEKLY"
  .match(/[\d]{3,4}(A|P)M/g)
  .map((time) => {
    const AMPM = time.slice(-2);
    const m = time.slice(-4,-2);
    const h = time.slice(0,-4);
    return `${h}:${m} ${AMPM}`;
  })
  .join(' - ')
console.log(str)

Scala regex on a whole column

I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
The opening parenthesis that is meant to match a literal ( is not escaped, which leaves the pattern with an unmatched group parenthesis
In a regular string literal, the backslashes have to be doubled
In the println you don't need all the parentheses and the dot
findAllIn returns a MatchIterator, and iterating it yields each matched string. Joining those matched strings with a comma will, in this case, give back the original string with a comma between the matches.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
  println(m.group(1))
  println(m.group(2))
})
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall returns only the captured substrings. In Scala, findAllIn yields the whole matched strings unless you go through its matchData property, whose Match objects expose the captures via subgroups.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
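For comparison with the pandas code in the question, the distinction between whole matches and captures can be sketched with Python's standard re module. The variable names whole_matches and captures are illustrative.

```python
import re

pattern = re.compile(r"\((str:[^,()]+),(\d+)\)")
text = "(str:ann smith,14)(str:frank chase,15)"

# group(0) is the entire match, comparable to what findAllIn yields in Scala.
whole_matches = [m.group(0) for m in pattern.finditer(text)]

# groups() holds only the captures, comparable to m.subgroups in Scala.
captures = [m.groups() for m in pattern.finditer(text)]

print(",".join(whole_matches))  # (str:ann smith,14),(str:frank chase,15)
for name, number in captures:
    print(name, number, sep=",")
```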

Regex match string where symbol is not repeated

I have strings like these:
group items % together into% FALSE
characters % that can match any single TRUE
How can I match sentences where the % symbol is not repeated?
I tried the pattern below, but it matches any sentence that contains a % symbol:
[%]{1}
You may use this regex in Python; it fails to match lines that have more than one % in them:
^(?!([^%]*%){2}).+
(?!([^%]*%){2}) is a negative lookahead that fails the match if % is found twice after line start.
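A small sketch of how that lookahead behaves in Python, using re.match so the pattern is tried only at the start of each string. The third test string is an added assumption to show that a line without any % also passes, since the lookahead only rejects two or more.

```python
import re

at_most_one_percent = re.compile(r"^(?!([^%]*%){2}).+")

lines = [
    "group items % together into%",
    "characters % that can match any single",
    "no percent sign here",  # hypothetical extra case
]

# True when the line has fewer than two % characters.
results = {line: bool(at_most_one_percent.match(line)) for line in lines}

for line, matched in results.items():
    print(line, "TRUE" if matched else "FALSE")
```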
You could use re.search as follows:
import re

items = ['group items % together into%', 'characters % that can match any single']
for item in items:
    output = item
    if re.search(r'^.*%.*%.*$', item):
        output = output + ' FALSE'
    else:
        output = output + ' TRUE'
    print(output)
This prints:
group items % together into% FALSE
characters % that can match any single TRUE
Just count them (Python):
>>> s = 'blah % blah %'
>>> s.count('%') == 1
False
>>> s = 'blah % blah'
>>> s.count('%') == 1
True
With regex:
>>> re.match('[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.match('[^%]*%[^%]*$','blah % blah % blah')
>>> re.match('[^%]*%[^%]*$','blah % blah blah')
<re.Match object; span=(0, 16), match='blah % blah blah'>
re.match only matches at the start of the string. If you use re.search, which can match anywhere in the string, anchor the pattern with ^ (start of string):
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd')
<re.Match object; span=(0, 12), match='gfdg%fdgfgfd'>
I am assuming that "sentence" in your question is the same as a line in the input text. With that assumption, you can use the following:
^[^%\r\n]*(%[^%\r\n]*)?$
This, along with the multi-line and global flags, will match all lines in the input string that contain 0 or 1 '%' symbols.
^ matches the start of a line
[^%\r\n]* matches 0 or more characters that are not '%' or a new line
(...)? matches 0 or 1 instance of the contents in parentheses
% matches '%' literally
$ matches the end of a line
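In Python, the multi-line flag corresponds to re.MULTILINE and the global flag to iterating with re.finditer. A minimal sketch under those assumptions follows; the input text is made up, and it deliberately has no trailing newline, which would otherwise produce an extra empty match at the end.

```python
import re

line_pattern = re.compile(r"^[^%\r\n]*(%[^%\r\n]*)?$", re.MULTILINE)

text = (
    "group items % together into%\n"
    "characters % that can match any single\n"
    "no percent sign here"
)

# finditer plus group(0) returns whole lines; findall would return only the group.
matching_lines = [m.group(0) for m in line_pattern.finditer(text)]
print(matching_lines)
```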

How do I use readLine() to find matches in a file using regex and print them out to the console

I am trying to have the user input a class number and name to pull up a list of information on that class from a file. I have figured out how to match the information using .toRegex. I can't figure out how to use the user's input to find only the match they need rather than every match in the file. I am very new to regex.
val pattern = """\d+\s+([A-Z]+).\s+(\d+)\s.+\s+\w.+""".toRegex()
val fileName = "src/main/kotlin/Enrollment.txt"
var lines = File(fileName).readLines()// reads every line on the file
do {
    print("please enter class name")
    var className = readLine()!!
    print("please enter class number ")
    var classNum = readLine()!!
    for (i in 0..(lines.size-1)) {
        var matchResult = pattern.find(lines[i])
        if (matchResult != null) {
            var (className,classNum) = matchResult.groupValues
            println("className: $className, class number: $classNum ")
        }
    }
} while (readLine()!! != "EXIT")
example line from file
Name Num
0669 HELP 134 AN CV THING ETC 4.0 4.0 Smith P 001 0173 MTWTh 9:30A 10:30A 23 15 8 4.0
See MatchResult#groupValues reference:
This list has size of groupCount + 1 where groupCount is the count
of groups in the regular expression. Groups are indexed from 1 to
groupCount and group with the index 0 corresponds to the entire
match.
If the group in the regular expression is optional and there were no
match captured by that group, corresponding item in groupValues
is an empty string.
You need
var (_, className,classNum) = matchResult.groupValues
See Kotlin demo:
val lines = "0669 HELP 134 AN CV THING ETC 4.0 4.0 Smith P 001 0173 MTWTh 9:30A 10:30A 23 15 8 4.0 "
val pattern = """^\d+\s+([A-Z]+)\s+(\d+)""".toRegex()
var matchResult = pattern.find(lines)
if(matchResult != null) {
var (_, className,classNum) = matchResult.groupValues
println("className: $className, class number: $classNum ")
}
// => className: HELP, class number: 134
Since find() does not require a full string match, I simplified the regex a bit to
^\d+\s+([A-Z]+)\s+(\d+)
Details:
^ - start of string
\d+ - one or more digits
\s+ - one or more whitespaces
([A-Z]+) - Group 1: one or more uppercase ASCII letters
\s+ - one or more whitespaces
(\d+) - Group 2: one or more digits
You need to build the pattern from the values the user enters via readLine().
Then loop over the lines and check whether each line contains the pattern with pattern.containsMatchIn():
val className = readLine()!!.toUpperCase()
print("please enter class number ")
val classNum = readLine()!!
// The user's input is interpolated into the raw-string pattern
val pattern = """\s+\d+\s+$className.\s+$classNum""".toRegex()
for (line in lines) {
    // containsMatchIn is sufficient; a separate find() check is redundant
    if (pattern.containsMatchIn(line)) {
        println(line)
    }
}
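The idea of building the pattern from user input can also be sketched in Python. This is a sketch under assumptions, not the method from the answers above: the function name find_class_lines and the second sample line are hypothetical, and re.escape is added so that any special characters the user types are matched literally.

```python
import re


def find_class_lines(lines, class_name, class_num):
    """Return the lines whose second and third columns equal the user's input."""
    pattern = re.compile(
        r"^\s*\d+\s+"                     # leading numeric id
        + re.escape(class_name.upper())   # class name, matched literally
        + r"\s+"
        + re.escape(class_num)            # class number, matched literally
        + r"\b"                           # so that "13" does not match "134"
    )
    return [line for line in lines if pattern.search(line)]


lines = [
    "0669 HELP 134 AN CV THING ETC 4.0 4.0 Smith P 001 0173 MTWTh 9:30A 10:30A 23 15 8 4.0",
    "0670 MATH 101 SOME OTHER CLASS 3.0 3.0 Jones",  # made-up line for contrast
]

print(find_class_lines(lines, "help", "134"))
```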

How to regex match everything but long words?

I would like to select all long words from a string: re.findall("[a-z]{3,}")
However, for some reason I can only use substitution. Hence I need to replace everything except words of 3 or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all "[a-z]{3,}" matches joined by spaces. However, only substitution may be used.
Or in Python: Find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re
solution_regex = "..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"
                 ]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str) != expected_str:
        print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.
\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b means that the substring you are looking for must start and end at a word boundary
(?: ) - non-capturing group
\w*\d+\w* Any word that contains at least one digit and consists of digits, '_' and letters
You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace the matches with an empty string. Here is Python code that does this:
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0)
if result:
    print(result)
In AutoIt this works for me:
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe
import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
str = "abc de1 fgh ij"
subst = " "
result = re.sub(regex, subst, str)
print (result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string
With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
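That last pattern can be checked against the test strings from the question with a short script. One caveat: each run of unwanted text becomes a single space, so input that starts or ends with unwanted text leaves a leading or trailing space. The sketch therefore trims the result with strip(), which goes beyond the substitution-only constraint and is an assumption made here for the comparison. The helper name long_words_via_sub is made up.

```python
import re

candidate = r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+"

tests = ["aaa aa aaa aa", "aaa aa11", "11aaa11 11aa11", "aa aa1aa aaaa"]


def long_words_via_sub(text):
    """Replace every run of non-letters and short words by one space, then trim."""
    return re.sub(candidate, " ", text).strip()


for test_str in tests:
    expected = " ".join(re.findall("[a-z]{3,}", test_str))
    actual = long_words_via_sub(test_str)
    status = "ok" if actual == expected else "ERROR"
    print(repr(test_str), "->", repr(actual), status)
```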