I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1, so this code is overkill, but I believe it is the correct way to generalize when it is longer. This code replaces every empty cell in out with the string 'Unknown'. But this leads to the second problem: I now have a 'messy' cell array of non-scalar values. I cannot, for example, call unique(out) on the result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking at the MATLAB help page for regexp, I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
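As a rough illustration of the empty-alternative idea, here is a small sketch in Python rather than MATLAB. It assumes the three sample lines from the question (with the doubled quotes unescaped), and the names always_match and thing3_or_unknown are made up for this example. One caveat it works around: an unanchored (x|) matches the empty string at the very first position, so the pattern below is anchored at the start and lets the first alternative scan forward lazily.

```python
import re

# The question's three sample lines, with MATLAB's doubled quotes unescaped.
lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]

# First alternative: skip lazily to thing3' and capture its value.
# Second alternative (after the |) is empty, so every line matches.
always_match = re.compile(r"^(?:.*?thing3':\s?'?-?([\w().]*?)'?,|)")

def thing3_or_unknown(line):
    m = always_match.match(line)  # never None, thanks to the empty alternative
    token = m.group(1)            # None when the empty alternative was taken
    return "Unknown" if token is None else token

out3 = [thing3_or_unknown(line) for line in lines]
print(out3)  # ['246', 'Unknown', '246']
```

The result is a flat list of strings, so the "no match" case still has to be translated to 'Unknown' in one place, but no second pass over the output is needed.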
As a workaround, I have found that using regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell,'UniformOutput',false);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, so there is no advantage to using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp and build your output array any way you like.
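The explicit-loop idea can be sketched as follows. This is Python rather than MATLAB, so treat it as an outline of the control flow only; the helper name extract and its default argument are assumptions of this example, and the pattern is the question's expr3 with the redundant \d dropped from the character class.

```python
import re

# The question's three sample lines, with MATLAB's doubled quotes unescaped.
lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]

def extract(name, line, default="Unknown"):
    """Return the value following name', or default when name is absent."""
    pattern = r"(?<=" + re.escape(name) + r"'):\s?'?-?([\w().]*?)'?,"
    m = re.search(pattern, line)
    return m.group(1) if m else default

out1, out3 = [], []
for line in lines:
    # The no-match case is handled right here, so no backfill pass is needed.
    out1.append(extract("thing1", line))
    out3.append(extract("thing3", line))

print(out1)  # ['583', '617', 'unexpected_string(with)parens5']
print(out3)  # ['246', 'Unknown', '246']
```

Because each iteration appends a plain string, the outputs stay flat and can be deduplicated or put into a table column directly. As in the original expressions, a leading dash is consumed outside the token, so '-583' comes back as '583'.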
I have data being exported from BigQuery into Google Data Studio. One field contains a username like the following.
xvth20-00-tt-wr
xvth27-00-pt-px
The first 4 characters (xvth) are always the same and the numbers that follow (xvth) correspond to a group. Multiple usernames will contain the same numbers after those characters but the rest of the string from 00- and on will be different.
What I'm trying to do is extract the numbers that follow the 4 characters and create a new field that looks like the following.
Group-20
Group-27
I've tried the following REPLACE(SUBSTR(Users,1, 6), 'xvth20', 'Group-20') and I will have to create one for every condition which seems like too much. Also the data will keep growing so I wouldn't want to keep going in and adding another function.
Is there an easier way to do this?
Either of the REGEXP_REPLACE Calculated Fields below will replace xvth with Group-, immediately followed by the captured numbers. Calculated Field #1 uses a Raw Literal, indicated by the letter r, which needs only a single \ to escape special RegEx characters. Calculated Field #2 does not use a Raw Literal, so it needs \\ to escape them:
1) With r (Raw Literal)
REGEXP_REPLACE(Users, r"^xvth(\d+).*", r"Group-\1")
2) Without r (Raw Literal)
REGEXP_REPLACE(Users, "^xvth(\\d+).*", "Group-\\1")
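To see what the pattern does outside Data Studio, here is a quick sketch using Python's re.sub as a stand-in for REGEXP_REPLACE. The two usernames are the ones from the question; the variable names are just for illustration.

```python
import re

users = ["xvth20-00-tt-wr", "xvth27-00-pt-px"]

# ^xvth anchors on the fixed prefix, (\d+) captures the group number,
# and .* swallows the rest so only "Group-<number>" remains.
groups = [re.sub(r"^xvth(\d+).*", r"Group-\1", u) for u in users]

print(groups)  # ['Group-20', 'Group-27']
```

Because the digits are captured rather than hard-coded, new group numbers need no extra conditions. In this Python version, a value that does not start with xvth followed by digits is returned unchanged.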
I am getting an error importing an XML file into a custom program. Other files import correctly. However, one file produces an error from a float field. I am using Notepad++ search function with Regular Expression to try and find the issue in the XML file.
When I use <milepost>([a-zA-Z0-9.]+)</milepost> I get around 30,000 results which is the correct number of records but the field is supposed to be DOUBLE. When I use <milepost>([0-9.]+)</milepost> I only get 29,994 records. This tells me that the import is most likely failing because there are letters in my number fields.
I have tried a number of variations like:
<milepost>([\S\D\d]+)</milepost>
<milepost>(.*?)</milepost>
<milepost>([\Sa-zA-Z]+)</milepost>
<milepost>([0-9.\w]+)</milepost>
etc.
Each of these returns the expected 30,000 records.
When I try to search for letters using:
<milepost>([a-zA-Z.]*)</milepost>
<milepost>([a-zA-Z]+)</milepost>
<milepost>(^[a-zA-Z]+$)</milepost>
<milepost>([a-zA-Z.a-zA-Z]+)</milepost>
I get 0 results (most likely because it excludes numbers).
I did manage to find one of the records I am looking for using this method:
<milepost>173.811818181818a</milepost>
But I do not feel like scrolling through 30,000+ lines to look for 5 more records with a letter in them.
Is there a regular expression that will return to me ONLY the values that have a letter/letters in them while allowing numbers? (Fields with only numbers and a period should be excluded)
The 6 problem records presumably contain a mixture of letters and numbers, but your searches for records containing letters will only match records consisting exclusively of letters.
Try
<milepost>.*[a-zA-Z].*</milepost>
which matches any record containing an ASCII letter in its value, as well as allowing other characters such as digits.
What you want is a negative look-ahead. Something like
<milepost>(?![0-9.]+</milepost>)
should be very close.
In plain English: <milepost> not followed by exclusively digits and dots and a closing </milepost>.
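Both suggested patterns can be tried on a few sample records. The sketch below uses Python's re module instead of Notepad++, and apart from 173.811818181818a (taken from the question) the sample values are invented for illustration.

```python
import re

# One record per line; only the second value comes from the question.
xml = """<milepost>12.5</milepost>
<milepost>173.811818181818a</milepost>
<milepost>abc</milepost>
<milepost>7</milepost>
<milepost>1e5</milepost>"""

# Pattern 1: the value contains at least one ASCII letter.
contains_letter = re.compile(r"<milepost>.*[a-zA-Z].*</milepost>")
# Pattern 2: the value is NOT made up solely of digits and dots.
not_numeric = re.compile(r"<milepost>(?![0-9.]+</milepost>)")

records = xml.splitlines()
hits_letter = [r for r in records if contains_letter.search(r)]
hits_lookahead = [r for r in records if not_numeric.search(r)]

print(hits_letter)     # the 173.811818181818a, abc and 1e5 records
print(hits_lookahead)  # the same three records
```

The two patterns differ at the edges: the look-ahead also flags an empty <milepost></milepost> or a value containing other symbols, while the first pattern can give false positives if several milepost elements share one line, because .* may run across the tag names in between.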
I'm working with a huge DB2 table (hundreds of millions of rows), trying to select only the rows that are matched by this regular expression:
\b\d([- \/\\]?\d){12,15}(\D|$)
(That is, a word boundary, followed by 13 to 16 digits separated by nothing or a single dash, space, slash, or backslash, followed by either a non-digit or the end of the line.)
After much Googling, I've managed to create the following SQL:
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery('fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")' PASSING comment AS "c") AS INTEGER)=1
Which works perfectly, as far as I can tell... unless it finds a row with an illegal character:
An illegal XML character "#x3" was found in an SQL/XML expression or function argument that begins with string [...]
The data contains many illegal XML characters, and changing the data is not an option (I have limited read-only access, and there are far too many rows that would need to be fixed). Is there a way to strip out or ignore illegal characters, without first modifying the database? Or, is there a different way for me to write my query that has the same effect?
You will have to identify what are all the illegal XML characters that occur in your data. Once you know them, you can use the TRANSLATE() function to eliminate them during the pattern matching.
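Say you determine that ASCII control characters (these lie in the range 0x00 through 0x1F, plus 0x7F) may be present in the COMMENT column. Your query might then look like the following, with the list of hex codes extended to cover whichever characters actually occur in your data: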
SELECT idx, comment FROM tblComment
WHERE xmlcast(xmlquery(
'fn:matches($c,"\b\d([- \/\\]?\d){12,15}(\D|$)")'
PASSING TRANSLATE(comment, ' ', x'01020304050607080B0C0F7F') AS "c")
AS INTEGER)=1
All legal XML characters are listed in the manual. 0x09, 0x0A and 0x0D are legal, so you don't need to TRANSLATE() them, for example.
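The replace-then-match idea can be tried out on a couple of strings before running it against the table. The sketch below is Python, not DB2: str.translate stands in for TRANSLATE(), the sample comments are invented, and the table covers every control character below 0x20 except the three legal ones, plus 0x7F to mirror the list in the query above.

```python
import re

# Control characters below 0x20 other than tab, LF and CR, plus 0x7F as in
# the query above. Each one is mapped to a single space.
to_blank = [c for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)] + [0x7F]
table = {c: " " for c in to_blank}

# The pattern from the question, unchanged.
pattern = re.compile(r"\b\d([- \/\\]?\d){12,15}(\D|$)")

comments = [
    "card \x034111 1111 1111 1111 ok",  # invented sample with an illegal 0x03
    "nothing to see \x03 here",          # invented sample without digits
]

cleaned = [c.translate(table) for c in comments]
matches = [bool(pattern.search(c)) for c in cleaned]

print(matches)  # [True, False]
```

Replacing each character with a space instead of deleting it keeps the string length and positions unchanged. A control character sitting between two digits would then behave like a space separator, which may or may not be what you want.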
I'm using Selenium IDE and can't figure out how to select a given element that has a certain attribute which contains some text (number) of a certain length after a specified character.
In order to better understand what exactly I would like to achieve please see below an example.
I have the following HTML element:
<div><h2 class="attribute" onclick="PropertyPopup.Show(63854, 4065)">test test</h2></div>
In my case both the numbers in the bracket (63854 and 4065) are changing dynamically and I'm mostly interested in the second number (4065). This can have a length of 4 or 7 so I would need an XPATH (combined with regexp?) that would extract only those elements where this number has a length of 4 for example (like in the above example).
So far I've used the following XPATH:
//div[h2[@onclick][string-length(@onclick)<=31]]
This is working fine at the moment (since in most cases when the second number has a length of 4, the whole attribute value is 31 characters or fewer), but if the first number has 6 digits (so the whole value has 32 characters), the element will not be selected. If I used "<=32" instead, it would in some cases select elements where the second number has a length of 7 (for example, when the first number has a length of 3 and the second 7).
I've tried to use something like the below:
//div[h2[@onclick][contains(@onclick,', \d{4}')]]
but this is not recognized as a regexp; it looks for an 'onclick' attribute that contains the literal text ", \d{4}".
Is there anything I could do in order to select the node only based on the second number (its length)?
thank you,
Szabi
You could try something like this:
//div[string-length(normalize-space(substring-before(substring-after(h2/@onclick,','),')')))=4]
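To make the nesting of that expression easier to follow, here is a sketch that mimics each XPath function with plain Python string operations. It does not run XPath itself; the function name second_arg_length is made up, and only the first onclick value comes from the question.

```python
def second_arg_length(onclick):
    """Mimic string-length(normalize-space(substring-before(
    substring-after(onclick, ','), ')')))."""
    # substring-after(onclick, ','): text after the first comma, "" if none
    after = onclick.partition(",")[2]
    # substring-before(after, ')'): text before the first ')', "" if none
    before = after.partition(")")[0] if ")" in after else ""
    # normalize-space: trim the ends and collapse runs of whitespace
    normalized = " ".join(before.split())
    return len(normalized)

print(second_arg_length("PropertyPopup.Show(63854, 4065)"))   # 4
print(second_arg_length("PropertyPopup.Show(638540, 4065)"))  # 4
print(second_arg_length("PropertyPopup.Show(638, 4065123)"))  # 7
```

The length of the first number no longer matters, because only the text between the first comma and the following closing parenthesis is measured.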