I'm trying to match fields in a Java class using regex. So far it's been working great overall, but with just one issue. Once I match a field, I then have a second regex for matching the name of the field. I use (\w+)\s*(?:\s*=\s*[^;,]+)? to match the name of fields, but if the value of the field is wrapped in brackets with spaces separating values for that field, it starts matching field values as field names. For example, the below matches all the names called value#, however, once it gets to value5, the 2 is matched because of the space inside the brackets separating the field's value. I need a way to not match spaces while inside brackets, if possible, while still accomplishing what is currently matched with values 1-4. Even a solution that can match value 5 properly but will mess up with 6 and 7 would be a welcomed improvement.
Sample code:
value1;
value2 = 3;
value3 = 4, value4 = 4;
value5 = {
1, 2
};
value6 = {
1, 2
}, value7 = {
1, 2
};
Try this:
([a-zA-Z_]\w*)(?:\s*=\s*(?:\{[^\}]+\}|[\{;,]+))?
This expression also takes advantage of the fact that field names cannot begin with a digit.
Related
I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("®EXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parenthesis to create capture groups, while also excluding the digit string at the end, then surround the whole string with parenthesis.
We then use regex extract to actuall pull the pieces out, the groups automatically push them to their own cells/columns
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 We use SPLIT(text, delimiter, [split_by_each]), the text in this case is formatted with regex =REGEXREPLACE(A1,"_+\d$","")* to remove 123421, witch will give you a column for each word delimited by ""
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked to be only in regex and for google sheets was because I need to use it in Google data studio (same regex functions than spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the numer inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!
I've text as:
field = a value = 1345 field = b value = 3453
I need regular expression so that I can extract data as:
field value
a 1345
b 3453
What can be the perfect regex for it?
The expression should be:
field\s=\s([^\s]+)\svalue\s=\s([^\s]+)
Regex demo.
There are two capturing groups: one for field and one for value. Also note that I have specified any character other than space as a value for field or value - you could make it stricter.
Try this:
field\s*=\s*([^\s]+)\s*value\s*=\s*([^\s]+)
demo
it fill up more than one spaces case like field = a value = 1345 field = b value = 3453
My CF backend has to read through a CFM file as if it was a TEXT file to extract the names and values of different parameters, the data looks like this :
request.config.MY_PARAM_1 = 'ABCDEFGHI';
request.config.MY_PARAM_2 = "BlaBlaBla";
request.config.MY_PARAM_3 = TRUE;
request.config.MY_PARAM_4 = 'true';
request.config.MY_PARAM_5 = "1337";
request.config.MY_PARAM_6 = 1337;
As you can see, I can have STRINGS which can be SINGLE or DOUBLE quoted.
I also have BOOLEANS and NUMBERS which usually are without quotes, but that can also have (single or double).
I am "parsing" the file and extracting the values, I want to find a pattern that would return the matches like this :
request.config.MY_PARAM_2 = "BlaBlaBla";
I am VERY close to succeeding, but unfortunately the following expression cannot get rid of the closing quote.
<cfset match = REFind("^request\.config\.(\S+) = ['|""]?(.*)['|""]?;$", str, 1, "Yes")>
<cfset paramVal = Mid( str, match.pos[3], match.len[3] ) >
<cfdump var=#paramVal# >
For example, it returns BlaBlaBla", it has successfully omitted the opening quote, but not the last one, what am I doing wrong?
From your comments, it sounds like you're saying that you want to parse two ARBITRARY lines. This will do it:
^(?:[^\n]*\n){1}request\.config.(\w+)\s*=\s*(['"]?)(\w+)\2;(?:[^\n]*\n){4}request\.config.(\w+)\s*=\s*(['"]?)(\w+)\5;
In your code, just change the two numbers in the quantifiers: {1} and {4} as they specify how many lines to skip at the top and in the middle. For line 1, for instance you would have {0} in the first quantifier.
The data you want is in Groups 1, 3, 4 and 5. Please see the capture groups in the lower right panel of this demo
I am sure you will have no trouble building the regex in code by concatenating the pieces:
method Parse(x,y)
Build the regex by concatenating
^(?:[^\n]*\n){
With
x-1
With
}request\.config.(\w+)\s*=\s*(['"]?)(\w+)\2;(?:[^\n]*\n){
With
y-x
With
}request\.config.(\w+)\s*=\s*(['"]?)(\w+)\5;
Then match and retrieve Groups 1, 3, 4 and 5
Also see this visualization which makes it quite clear.
Debuggex Demo
A Bloomberg futures ticker usually looks like:
MCDZ3 Curcny
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curcny.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letter listed below in expr and can have two digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do?
Assuming there are no leading or trailing whitespaces and only upcase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab nor whether it covers Perl Compatible Regex. If it fails, try e.g. with instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space1.
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more genereic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
1 This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and or trailing spaces at the edge, there is a very simple command for that:
monthyear = trim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curcny'
p = regexp(s,'\d')
s(min(p)
s(min(p)-1:max(p))
I have a '3 x 1' cell array the contents of which appear like the following:
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'
I want to be able to elegantly extract and create another '3x1' cell array with contents as:
'NEWYORK'
'WASHINGTON'
'BERLIN'
If you notice in above the NAME's are after the last underscore and before the first SPACE or '_ML'. How do I write such code in a concise manner.
Thanks
Edit:
Sorry guys I should have used a better example. I have it corrected now.
You can use lookbehind for _ and lookahead for space:
names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once');
Where A is the cell array containing the strings:
A = {...
'ASDF_LE_NEWYORK Fixedafdfgd_ML'
'Majo_LE_WASHINGTON FixedMonuts_ML'
'Array_LE_dfgrt_fdhyuj_BERLIN Potato Price'};
>> names = regexp(A, '(?<=_)[^\s_]*(?=\s)', 'match', 'once')
names =
'NEWYORK'
'WASHINGTON'
'BERLIN'
NOTE: The question was changed, so the answer is no longer complete, but hopefully the regexp example is still useful.
Try regexp like this:
names = regexp(fullNamesCell,'_(NAME\d?)\s','tokens');
names = cellfun(#(x)(x{1}),names)
In the pattern _(NAME\d?)\s, the parenthesis define a subexpression, which will be returned as a token (a portion of matched text). The \d? specifies zero or one digits, but you could use \d{1} for exactly one digit or \d{1,3} if you expect between 1 and 3 digits. The \s specified whitespace.
The reorganization of names is a little convoluted, but when you use regexp with a cell input and tokens you get a cell of cells that needs some reformatting for your purposes.