In my XSLT 2.0 implementation, I tried the statement below to remove all spaces and non-breaking spaces within a text node. It works for ordinary spaces but not for non-breaking spaces, whose ASCII codes are, etc. I am using the Saxon processor for execution.
Current XSL code:
translate(normalize-space($text-nodes[1]), ' ', '')
How can I have them removed? Please share your thoughts.
Those codes are Unicode, not ASCII (for the most part), so you should probably use the replace function with a regex containing the Unicode separator character class:
replace($text-nodes[1], '\p{Z}+', '')
In more detail:
The regex \p{Z}+ matches one or more characters that are in the "separator" category in Unicode. \p{} is the category escape sequence, which matches a single character in the category specified within the curly braces. Z specifies the "separator" category (which includes various kinds of whitespace). + means "match the preceding regex one or more times". The replace function returns a version of its first argument with all non-overlapping substrings matching its second argument replaced with its third argument. So this returns a version of $text-nodes[1] with all sequences of separator characters replaced with the empty string, i.e. removed.
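For comparison outside XSLT, here is a minimal sketch of the same idea in Python. Python's built-in re module does not support \p{Z}, so this sketch swaps in unicodedata.category to test for the separator categories (Zs, Zl, Zp) instead of using a regex. The function name remove_separators is made up for illustration.

```python
import unicodedata


def remove_separators(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'Z'.

    Hypothetical helper: it mirrors what replace($s, '\\p{Z}+', '') does,
    but checks each character's category rather than using a regex.
    """
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('Z')
    )


if __name__ == '__main__':
    # U+00A0 is a no-break space, U+2009 is a thin space.
    print(remove_separators('a b\u00a0c\u2009d'))
```

Note that tab and newline characters are in the control category (Cc), not in Z, so neither this sketch nor \p{Z} removes them.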
I'm trying to implement a regex pattern for username that allows English letters, Arabic letters, digits, dash and space.
The following pattern always returns no match if the input string has a space even though \s is included in the pattern
Pattern _usernamePattern = r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$';
I also tried replacing \s with " " and \\s but the regex always returns no matches for any input that has a space in it.
Edit: It turns out that Flutter adds a Unicode "Right-To-Left Mark" or "Left-To-Right Mark" character when a text field contains a mix of languages that go LTR or RTL. This additional mark is a Unicode character that gets added to the text. The regex above was failing because of this additional character. To resolve the issue, simply do a replaceAll for these characters. Read more here: https://github.com/flutter/flutter/issues/56514.
This is a fairly nasty problem and worth documenting in an answer here.
As documented in the source:
/// When LTR text is entered into an RTL field, or RTL text is entered into an
/// LTR field, [LRM](https://en.wikipedia.org/wiki/Left-to-right_mark) or
/// [RLM](https://en.wikipedia.org/wiki/Right-to-left_mark) characters will be
/// inserted alongside whitespace characters, respectively. This is to
/// eliminate ambiguous directionality in whitespace and ensure proper caret
/// placement. These characters will affect the length of the string and may
/// need to be parsed out when doing things like string comparison with other
/// text.
While this is well-intended, it can cause problems when you work with mixed LTR/RTL text patterns (as is the case here) and have to ensure exact field length, etc.
The suggested solution is to remove all left-to-right marks:
void main() {
  final String lrm = 'aaaa \u{200e}bbbb';
  print('lrm: "$lrm" with length ${lrm.length}');
  final String lrmFree = lrm.replaceAll(RegExp(r'\u{200e}', unicode: true), '');
  print('lrmFree: "$lrmFree" with length ${lrmFree.length}');
}
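To show the effect on the username check itself, here is a rough sketch in Python rather than Dart. It assumes the same character class as the pattern in the question and strips both U+200E and U+200F before validating (the Dart snippet above only removes U+200E). The helper name is_valid_username is hypothetical.

```python
import re

# Same character class as the Dart pattern in the question.
USERNAME = re.compile(r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$')

# U+200E (left-to-right mark) and U+200F (right-to-left mark).
DIRECTION_MARKS = re.compile('[\u200e\u200f]')


def is_valid_username(text: str) -> bool:
    """Strip directional marks, then validate (hypothetical helper)."""
    cleaned = DIRECTION_MARKS.sub('', text)
    return USERNAME.match(cleaned) is not None


if __name__ == '__main__':
    raw = 'ab \u200ecd'  # what a mixed-direction text field might hand back
    print(USERNAME.match(raw) is not None)  # check without stripping
    print(is_valid_username(raw))           # check after stripping
```

The directional marks are format characters, not whitespace, so \s does not cover them; that is why the unstripped input is rejected.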
Related: right-to-left (RTL) in flutter
I want to define a table name by regular expression defined here such that:
Always begin a name with a letter, an underscore character (_), or a
backslash (\). Use letters, numbers, periods, and underscore
characters for the rest of the name.
Exceptions: You can’t use "C", "c", "R", or "r" for the name, because
they’re already designated as a shortcut for selecting the column or
row for the active cell when you enter them in the Name or Go To box.
let lex_valid_characters_0 = ['a'-'z' 'A'-'Z' '_' '\x5C'] ['a'-'z' 'A'-'Z' '0'-'9' '.' '_']+
let haha = ['C' 'c' 'R' 'r']
let lex_table_name = lex_valid_characters_0 # haha
But it returns an error: character 0: character set expected. Could anyone help?
Here is the description of # from the manual:
regexp1 # regexp2
(difference of character sets) Regular expressions regexp1 and regexp2 must be character sets defined with [… ] (or a single character expression or underscore _). Match the difference of the two specified character sets.
The description says the two sets must be character sets defined with [ ... ] but your definition of lex_valid_characters_0 is far more complex than that.
The idea of # is that it defines a pattern that matches exactly one character from a set specified as the difference of two one-character patterns. So it doesn't make sense to apply it to lex_valid_characters_0, which matches strings of arbitrary length.
Update
Here is my thinking on the problem, for what it's worth. There are no extra restrictions on names that are 2 or more characters long (as I read the spec). So it shouldn't be too difficult to specify a regular expression for these names. And it also wouldn't be that hard to come up with a regular expression that defines all the valid 1-character names. The full set of names is the union of these two sets.
You could also use the fact that the longest, first match is the one that applies for ocamllex. I.e., you could have rules for the 4 special cases before the general rule.
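The union idea can be sketched as follows. This is in Python rather than ocamllex, purely as an illustration of the two sets; the names ONE_CHAR, LONGER and is_table_name are invented here, and the rules are my reading of the quoted spec.

```python
import re

# One-character names: a letter other than C, c, R, r, or an underscore
# or a backslash.
ONE_CHAR = r'[A-BD-QS-Zabd-qs-z_\\]'

# Names of 2 or more characters: a valid first character followed by one
# or more letters, digits, periods or underscores.
LONGER = r'[A-Za-z_\\][A-Za-z0-9._]+'

# The full set of names is the union of the two sets.
TABLE_NAME = re.compile(ONE_CHAR + '|' + LONGER)


def is_table_name(name: str) -> bool:
    """Return True if the whole string is a valid name (hypothetical helper)."""
    return TABLE_NAME.fullmatch(name) is not None


if __name__ == '__main__':
    for candidate in ['C', 'r', 'a', '_', 'Cx', 'Table_1.a', '1abc']:
        print(candidate, is_table_name(candidate))
```

In ocamllex the same two alternatives could presumably be written with | between two regular expressions, which avoids # altogether.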
I'm trying to put together an expression that will grab text between quotation marks (single or double), including text in nested quotes, but will ignore text in comments, that is, on lines that start with //.
Code example:
// this is a "comment" and should be ignored
//this is also a 'comment' and should be ignored
printf("This is text meant to be "captured", and can include any type of character");
printf("This is the same as above, only with 'different' nested quotes");
This can be quite useful to extract translatable content from a file.
So far, I have managed to use ^((?!\/\/).)* to exclude comment lines from being imported, and ["'](.+)["'] to extract text between quotes, but I haven't been able to combine it on a single expression.
Running them in a sequence also doesn't work, I think because of the greedy quantifier in the first expression.
The question says nothing about the type of input files, so I assume C source code files.
I suggest the following regular expression, tested with the text editor UltraEdit, which uses the Boost C++ Perl regular expression library.
^(?:(?!//|"|').)*(["'])(?!\1)\K(?:\\\1|.)+?(?=\1)
It matches the first single- or double-quoted string on a line. Other strings on the same line are ignored by this Perl-syntax search expression, which is not optimal.
It ignores single- or double-quoted strings in line comments starting with //, regardless of whether the line comment is at the start of a line (with or without leading spaces/tabs) or somewhere else on the line after code.
It also ignores empty strings like "" or ''. If a line first contains "" or '' and then a non-empty single- or double-quoted string, the non-empty string is ignored, too. This is not optimal.
The string delimiting character on either side of the matched string is not part of the match.
Inside a string, the delimiting character must be escaped with a backslash to be interpreted by this search expression as a literal character. The first printf in the question's example would definitely result in a syntax error when compiled by a C compiler.
Strings in block comments are not ignored by this expression, nor are strings in code that the compiler ignores because of a preprocessor macro.
Example:
// This is a "comment" and should be ignored.
//This is also a 'comment' which should be ignored.
printf("This is text meant to be \"captured\", and can include any type of character.\n"); // But " must be escaped with a backslash.
printf("This is the same as above, only with 'different' nested quotes.\n");
putchar('I');
putchar('\'');
printf(""); printf("m thinking."); // First printf is not very useful.
printf("\"OK!\"");
printf("Hello"); printf(" world!\n");
printf("%d file%s found.\n",iFileCount,((iFileCount != 1) ? "s" : ""));
printf("Result is: %s\n",sText); // sText is "success" or "failure".
return iReturnCode; // 0 ... "success", 1 ... "error"
The search expression matches for this example:
This is text meant to be \"captured\", and can include any type of character.\n
This is the same as above, only with 'different' nested quotes.\n
I
\'
\"OK!\"
Hello
%d file%s found.\n
Result is: %s\n
So it does not find all the non-empty strings that the example C code would output when run.
Explanation for search string:
^ ... start search at beginning of a line.
This is the main reason why this expression cannot match a second, third, ... string on a line, even when that string is not in a line comment.
(?:(?!//|"|').)* ... a non-marking group matching zero or more characters other than newline, each only if the text at that position does not begin with //, " or ', which is verified by a negative look-ahead containing an OR expression.
This expression is responsible for ignoring everything after // once found in a line, because the next part of the expression requires a quote character immediately after it.
(["']) ... a " or ' must be found next, and the character found is marked for back-referencing.
(?!\1) ... but the match only succeeds if the next character is not that same quote character, so that empty strings are ignored.
\K ... resets the start location of $0 to the current text position: in other words everything to the left of \K is "kept back" and does not form part of the regular expression match. So everything matched from beginning of line up to " or ' at beginning of non-empty string is not matched (selected) anymore.
(?:\\\1|.)+? ... a non-marking group that matches, non-greedily and one or more times, either a backslash followed by the quote character from the beginning of the string, or any character other than newline.
This expression matches the string of interest.
(?=\1) ... the non-greedy matching stops as soon as the next character is the same quote character as at the beginning of the string, not escaped with a backslash. The positive look-ahead checks this without matching the quote character itself.
To match the first non-empty string outside a line comment together with the delimiting character on both sides:
^(?:(?!//|"|').)*\K(["'])(?!\1)(?:\\\1|.)+?\1
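For readers without UltraEdit, here is a rough sketch of the first expression in Python. Python's re module does not support \K, so this sketch captures the string body in a second group instead; that substitution is mine, not part of the expression above. The names PATTERN, SAMPLE and first_strings are made up for the example.

```python
import re

# Same structure as the search expression above, except that \K is
# replaced by a second capturing group around the string body.
PATTERN = re.compile(
    r'''^(?:(?!//|"|').)*(["'])(?!\1)((?:\\\1|.)+?)(?=\1)''',
    re.MULTILINE,
)

# A shortened, hypothetical variant of the example file.
SAMPLE = r'''// This is a "comment" and should be ignored.
printf("Say \"hi\" to %s\n", name); // trailing "comment"
putchar('I');
printf(""); printf("skipped");
printf("Hello"); printf(" world!\n");
return code; // 0 ... "success"
'''


def first_strings(source: str) -> list:
    """Return the first non-empty quoted string of each line, if any."""
    return [m.group(2) for m in PATTERN.finditer(source)]


if __name__ == '__main__':
    for text in first_strings(SAMPLE):
        print(text)
```

As with the original expression, only the first string per line is found, and a line whose first string is empty yields nothing.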
How to really get all non-empty strings outside of comments?
Copy the content of the entire C source code file into a new file.
In the new file, remove all non-nested block comments and all line comments with the search expression:
^[\t ]*//.*[\r\n]+|[\t ]*/\*[\s\S]+?\*/[\r\n]*|[\t ]*//.*$
The replace string is an empty string.
Note: // inside a string is also interpreted by the third part of the OR expression as the beginning of a line comment, although this is not correct.
Then use (["'])(?!\1)(?:\\\1|.)+?\1 as the search string to find all non-empty strings, matching the string delimiting character on both sides of every string as well.
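The two steps above (strip comments, then search for strings) can be sketched in Python as follows. Both expressions are taken as they are from the steps; only the surrounding code and the names COMMENTS, STRINGS, SOURCE, strip_comments and all_strings are my own, and the sample input is invented.

```python
import re

# Step 1: full-line comments, non-nested block comments, trailing line comments.
COMMENTS = re.compile(
    r'^[\t ]*//.*[\r\n]+|[\t ]*/\*[\s\S]+?\*/[\r\n]*|[\t ]*//.*$',
    re.MULTILINE,
)

# Step 2: every non-empty string, including its delimiters.
STRINGS = re.compile(r'''(["'])(?!\1)(?:\\\1|.)+?\1''')

# Hypothetical input without empty strings.
SOURCE = r'''// header "comment"
printf("Hello"); /* block "x" */
printf(" world!\n"); // tail 'c'
'''


def strip_comments(source: str) -> str:
    """Replace every comment match with the empty string."""
    return COMMENTS.sub('', source)


def all_strings(source: str) -> list:
    """Return all non-empty quoted strings found after comment removal."""
    return [m.group(0) for m in STRINGS.finditer(strip_comments(source))]


if __name__ == '__main__':
    for text in all_strings(SOURCE):
        print(text)
```

One caveat I would expect from tracing the search string by hand: if the input contains an empty string such as "", its closing quote can be taken as the start of a string, so matches after it may be misaligned. The sample input avoids that case on purpose.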
The best approach would be to parse the C source code file for non-empty strings with a program written in C, because such a program could much better determine what is a line comment, what is a block comment (even in a source code file containing nested block comments) and what is a non-empty string. It would also be possible to let the C compiler just preprocess the C source code files of a project, generating output files with all line and block comments already removed, and then search those preprocessed files for non-empty strings.
I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern but, because the last argument is 1, extracts only the first capturing group (the part in parentheses) instead of the entire match, which is index 0.
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an uppercase letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
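To compare the two variants side by side, here is a small sketch in Python using the sample data from the question. It is only an illustration of the patterns, not Hive code: in a Python raw string a single backslash is enough, whereas the Hive string literal needs \\d. The function names are invented for the example.

```python
import re

ROWS = [
    'Sept Wk 5 Sunny Sailing_R7080075_12345',
    'Holiday_Wk2_Smiles_X1234567_ABC',
]


def extract_with_group(text: str):
    """Match the underscore too, but return only capturing group 1."""
    m = re.search(r'_([A-Z]\d{7})', text)
    return m.group(1) if m else None


def extract_with_lookbehind(text: str):
    """Require a preceding underscore without making it part of the match."""
    m = re.search(r'(?<=_)[A-Z]\d{7}', text)
    return m.group(0) if m else None


if __name__ == '__main__':
    for row in ROWS:
        print(extract_with_group(row), extract_with_lookbehind(row))
```

Both functions return the same IDs here; the difference is only whether the underscore is consumed by the match and then dropped via the group index, or checked by the lookbehind and never consumed.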
I am trying to extract words after the first space using
species<-gsub(".* ([A-Za-z]+)", "\\1", x=genus)
This works fine for the other rows that have two words. However, row [9], "Eulamprus tympanum marnieae", has 3 words and my code only returns the last word in the string, "marnieae". How can I extract the words after the first space so that I retrieve "tympanum marnieae" instead of "marnieae", with the answers stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match any number of words (other than 0) after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)" https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or just one or more spaces after the first word even if a second word doesn't exist. "Eulamprus      " (the word followed by six spaces), for example, will successfully match the pattern and return 5 spaces. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
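As a quick check of the more reliable pattern, here is a minimal sketch in Python instead of R. It assumes the whole string should match, which is why it uses fullmatch; the name species_part is made up for the example.

```python
import re

# First word, one space, then one or more space-separated words (captured).
AFTER_FIRST_WORD = re.compile(r'[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)')


def species_part(name: str):
    """Return everything after the first word, or None if nothing valid follows."""
    m = AFTER_FIRST_WORD.fullmatch(name)
    return m.group(1) if m else None


if __name__ == '__main__':
    for name in ['Eulamprus tympanum marnieae', 'Eulamprus tympanum', 'Eulamprus ']:
        print(repr(name), '->', repr(species_part(name)))
```

In R, the same pattern should work with gsub and "\\1" as the replacement, applied to the whole genus vector at once.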