Extract words based on a pattern in Groovy [duplicate]


Is there a nicer/shorter/better way of performing the following:
filename = "AA_BB_CC_DD_EE_FF.xyz"
parts = filename.split("_")
packageName = "${parts[0]}_${parts[1]}_${parts[2]}_${parts[3]}"
//packageName == "AA_BB_CC_DD"
The format remains constant (6 parts, _ separator) but some of the values and lengths of AA,BB are variable.

You can do the same thing by just programming the "joining" part differently:
The following produces the same value as packageName:
filename.split('_')[0..3].join('_')
It just uses a range to slice the array, and .join to concatenate with a delimiter.

As the separator char between the "segments" in the source filename and in the
result is the same (_), you don't need to split the filename and join the parts again.
Your task can be done with a single regex:
def result = filename.find(/([A-Z0-9]+_){3}[A-Z0-9]+/)
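For comparison, here is a minimal Python sketch of the two approaches above (slice-and-join versus a single regex). It is only an illustration, not Groovy; the variable names by_slice and by_regex are made up for this example, and the pattern assumes the segments contain only uppercase letters and digits, as in the sample filename.

```python
import re

filename = "AA_BB_CC_DD_EE_FF.xyz"

# Approach 1: split on the separator, keep the first four parts, join them back.
by_slice = "_".join(filename.split("_")[:4])

# Approach 2: one regex that matches three "segment_" groups plus a fourth segment.
# Assumes segments consist of uppercase letters and digits only.
match = re.search(r"(?:[A-Z0-9]+_){3}[A-Z0-9]+", filename)
by_regex = match.group(0) if match else None

print(by_slice)
print(by_regex)
```

The slice version makes no assumption about which characters a segment may contain, which may make it the safer choice if the values vary.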

Related

How can I tell if there are three or more characters between matches in a regex?

I'm using Ruby 2.1. I have this logic that looks for consecutive pairs of strings in a bigger string
results = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
My question is, how do I iterate over the list of results and print out whether there are three or more characters between the two strings? For instance if my string were
"abc def"
The above would produce
[["abc def", "abc", "def"]]
and I'd like to know whether there are three or more characters between "abc" and "def."

Use a quantifier for the spaces in between: \b((\S+?)\b\s{3,}\b(\S+?))\b
Also, the inner boundaries are not really needed:
\b((\S+?)\s{3,}(\S+?))\b

A straightforward way to check this is by running a separate regex:
results.select!{|x|p x[/\S+?\b(.*?)\b\S+?/,1].size}
will print the size of the gap for each result.
Another way is to take the size of the captured groups and subtract them:
results = []
line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/) do |s, group1, group2|
  results << $~ if s.size - group1.size - group2.size >= 3
end
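The same idea (measure the gap between the two captured strings) can be sketched in Python for comparison. This is an illustration under assumptions, not a translation of the Ruby answer: the helper name pairs_with_gap and its min_gap parameter are invented here, and the gap is measured from match offsets rather than by subtracting sizes.

```python
import re

# Same pattern as in the question: group 2 and group 3 are the two strings.
PAIR = re.compile(r"\b((\S+?)\b.*?\b(\S+?))\b")

def pairs_with_gap(line, min_gap=3):
    """Return (first, second) pairs separated by at least min_gap characters."""
    found = []
    for m in PAIR.finditer(line):
        # Distance between the end of the first string and the start of the second.
        gap = m.start(3) - m.end(2)
        if gap >= min_gap:
            found.append((m.group(2), m.group(3)))
    return found

print(pairs_with_gap("abc def"))
print(pairs_with_gap("abc    def"))
```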

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata,
as they are both between the characters : and /.
I tried this syntax on an easier string where the pattern occurs only once,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?

Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
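To make the difference between the greedy pattern from the question and the lazy lookaround pattern concrete, here is a small Python sketch using the same sample string. It is not Matlab; it only assumes that Python's re module treats these two patterns the same way on this input, and the names greedy and lazy are chosen for the example.

```python
import re

s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"

# Greedy: .* runs to the last "/" in the string, so there is only one match.
greedy = re.findall(r":(\w.*)/", s)

# Lazy with lookarounds: .+? stops at the first "/" after each ":".
lazy = re.findall(r"(?<=:).+?(?=/)", s)

print(greedy)
print(lazy)
```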

In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains the text before the first ":"), loop over the rest of the array, and use regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
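The split-then-trim steps described above can be sketched in Python as follows. This is only an illustration of the approach, not Matlab code; the helper name between_colon_and_slash is hypothetical, and pieces without a "/" are skipped on the assumption that they are not complete ":" ... "/" groups.

```python
def between_colon_and_slash(text):
    """Split on ':', drop the leading piece, keep what precedes the first '/'."""
    pieces = text.split(":")[1:]  # the first piece is the text before any ':'
    # Keep only pieces that actually contain a closing '/'.
    return [piece.split("/", 1)[0] for piece in pieces if "/" in piece]

sample = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
extracted = between_colon_and_slash(sample)
print(extracted)
```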

As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now each cell of B holds one ':'-separated piece, itself split on '/'. The pieces that contained a '/' have more than one element. You can extract the result by selecting the cells of B with more than one element and taking the first element of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'

Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Reg exp in matlab

I'm analyzing a file in matlab and I want to find the number of occurrences of the letter I (capitalized). I'm confused on how to write the regular expression for this step. Would it be something like (lines,'.I.')? Any help would be greatly appreciated.

If you want to count the number of capital 'I's in a file, assuming you have read the file in as a string, you could just do this:
count = sum(file_string == 'I');
If, as in this case, the file is read into a cell-string, one possible way of doing this would be to use:
count = sum(strcat(file_cellstr{:}) == 'I');
strcat will concatenate all of the strings passed to it into a single string. Passing file_cellstr{:} to strcat is essentially concatenating each of the cells (i.e. each line in your case) into a single string, then searching through it for the letter 'I'. If you wanted to find a whole word, you could use
count = length(strfind(strcat(file_cellstr{:}),'word'));
If you wanted a regular expression match, you could do the following:
count = length(regexp(strcat(file_cellstr{:}),'[a-z]+'));

How can I parse a char array with octal values in Python?

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.
EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.
I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array
DNS = "\03www\06google\03com\0"
I want to get:
www.google.com
What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time I see a '\' char, I would ignore the '\' and the two chars after it. Is there a way to get the resulting www.google.com without using a new array?
My disgusting implementation (my answer is an array of chars, which is not what I want; I want just the string www.google.com):
DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
    if DNS[i] == '\\' and DNS[i+1] != 0:
        i += 3
    elif DNS[i] == '\\' and DNS[i+1] == 0:
        break
    else:
        answer.append(DNS[i])
        i += 1

Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.
You could, of course, do something similar, just replacing any control character with a dot.
But what you're really trying to do is not replace control characters with dots, but parse DNS packets.
DNS is defined by RFC 1035. The QNAME in a DNS packet is:
a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.
So, let's parse that. If you understand how labels consisting of "a length octet followed by that number of octets" relate to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:
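import struct

def parse_qname(packet):
    # Collect the labels of a QNAME that starts at the beginning of packet.
    components = []
    offset = 0
    while True:
        # Each label is prefixed by a single length octet.
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            # A zero-length label terminates the name.
            break
        # unpack_from returns a tuple, so unpack its single element.
        component, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset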
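As a usage sketch, the parser can be run against the sample name from the question and the labels joined with dots. The block below repeats a version of the function so it runs on its own; the extra offset parameter and the names sample, labels, end, and hostname are additions for this example. It assumes the QNAME is uncompressed (no RFC 1035 compression pointers) and that the labels are plain ASCII.

```python
import struct

def parse_qname(packet, offset=0):
    """Parse length-prefixed labels starting at offset; stop at the zero octet."""
    labels = []
    while True:
        (length,) = struct.unpack_from("B", packet, offset)
        offset += 1
        if length == 0:
            break
        (label,) = struct.unpack_from("{}s".format(length), packet, offset)
        offset += length
        labels.append(label)
    return labels, offset

# The question's example, written as real bytes rather than escaped text.
sample = b"\x03www\x06google\x03com\x00"
labels, end = parse_qname(sample)
hostname = b".".join(labels).decode("ascii")
print(hostname, end)
```

The returned offset points just past the terminating zero octet, which is where the next field of the packet would begin.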

import re
DNS = "\\03www\\06google\\03com\\0"
m = re.sub("\\\\([0-9,a-f]){2}", "", DNS)
print(m)

Maybe something like this?
#!/usr/bin/python3
import re

def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname)
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()

For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…
If you want to do this with a regular expression:
\\ matches a backslash.
[0-9A-Fa-f] matches any hex digit.
[0-9A-Fa-f]+ matches one or more hex digits.
\\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.
You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:
re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)
I suspect these may actually be octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f], but nothing else would change.
A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.
Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.
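A minimal sketch of the regex route for the escaped form of the string, assuming the digits are octal and that the backslashes are literal characters in the data. The name escaped_to_hostname is hypothetical. Note that the substitution leaves a dot at each end, which has to be stripped, and that a label beginning with a digit from 0 to 7 would be partly swallowed by the pattern.

```python
import re

def escaped_to_hostname(escaped):
    """Replace each backslash-octal run with a dot, then trim the outer dots."""
    dotted = re.sub(r"\\[0-7]+", ".", escaped)
    return dotted.strip(".")

# Raw string: the backslashes here are literal characters, not escapes.
escaped = r"\03www\06google\03com\0"
converted = escaped_to_hostname(escaped)
print(converted)
```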

How to print an integer with a thousands separator in Matlab?

I would like to turn a number into a string using a comma as a thousands separator. Something like:
x = 120501231.21;
str = sprintf('%0.0f', x);
but with the effect
str = '120,501,231.21'
If the built-in fprintf/sprintf can't do it, I imagine a neat solution could be made using regular expressions, perhaps by calling Java (which I assume has some locale-based formatter), or with a basic string-insertion operation. However, I'm not an expert in either Matlab regexps or calling Java from Matlab.
Related question: How can I print a float with thousands separators in Python?
Is there any established way to do this in Matlab?

One way to format numbers with thousands separators is to call the Java locale-aware formatter. The "formatting numbers" article at the "Undocumented Matlab" blog explains how to do this:
>> nf = java.text.DecimalFormat;
>> str = char(nf.format(1234567.890123))
str =
1,234,567.89
where the char(…) converts the Java string to a Matlab string.
voilà!

Here's the solution using regular expressions:
%# 1. create your formatted string
x = 12345678;
str = sprintf('%.4f',x)
str =
12345678.0000
%# 2. use regexprep to add commas
%# flip the string to start counting from the back
%# and make use of the fact that Matlab regexp matches don't overlap
%# The three parts of the regex are
%# (\d+\.)? - looks for any number of digits followed by a dot
%# before starting the match (or nothing at all)
%# (\d{3}) - a packet of three digits that we want to match
%# (?=\S+) - requires that there's at least one non-whitespace character
%# after the match to avoid results like ",123.00"
str = fliplr(regexprep(fliplr(str), '(\d+\.)?(\d{3})(?=\S+)', '$1$2,'))
str =
12,345,678.0000
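The flip, substitute, flip trick can be tried outside Matlab as well. Below is a small Python sketch of the same regex, assuming the input is already a formatted numeric string; the function name add_commas is made up for this example, and a replacement function is used so that the optional first group can be missing without causing an error.

```python
import re

# Same three parts as above: optional "digits." prefix, three digits, lookahead.
PATTERN = re.compile(r"(\d+\.)?(\d{3})(?=\S+)")

def add_commas(number_text):
    """Insert thousands separators by working on the reversed string."""
    flipped = number_text[::-1]
    with_commas = PATTERN.sub(
        lambda m: (m.group(1) or "") + m.group(2) + ",", flipped
    )
    return with_commas[::-1]

print(add_commas("12345678.0000"))
print(add_commas("123.00"))
```

Because the string is reversed, the optional first group swallows the fractional digits, so commas are only ever inserted into the integer part.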
