I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to replace the date (28092017), which matches the regex "[0-9]{8}", but only if the first character of the line is "H".
I tried the following example to test my understanding, where I'm trying to substitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this gives the output
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H, then any 0+ chars other than line break chars, as few as possible, and then a |
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
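For comparison, here is a minimal Python sketch of the capturing-group variant. It assumes Python's built-in re module, which has no \K operator, so only the Group 1 form carries over. The replacement uses \g<1> because \1 written directly before digits would be read as a different group number. The function name mask_header_date is made up for this illustration.

```python
import re

# Same pattern as the first gsub above; Python's re has no \K, so Group 1 is kept.
HEADER_DATE = re.compile(r"\A(H.*?\|)[0-9]{8}(?=\|)")

def mask_header_date(line):
    """Replace the first pipe-delimited 8-digit field if the line starts with H."""
    # \g<1> re-inserts Group 1; the eight zeros are the replacement value.
    return HEADER_DATE.sub(r"\g<1>00000000", line, count=1)

header = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
masked = mask_header_date(header)
print(masked)
```

The 10-digit field 1010032000 is skipped because its first eight digits are not followed by a pipe, and a line that does not start with H is returned unchanged.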
I have three strings, listed below:
Levofloxacin 500mg/100mL
Levofloxacin 500mg
Procaterol Hydrochloride …………… 25μg
From the first line, I want to get just 'mg', without 'mL'.
From the second line, I want to get 'mg'.
From the third line, I want to get 'μg'.
I have tried a regexp pattern like:
(?!(.*[ ]{1}[0-9]+))[a-zA-Zμ]+
However, the first line always returns 'mg' along with 'mL'.
How can I get just 'mg' with a regexp?
Any suggestions will be appreciated.
As mentioned in the comment section, try this regex:
^\D*[\d.]+\K[a-zμ]+
Explanation:
^ - asserts the start of the string
\D* - matches 0+ occurrences of any character that is not a digit
[\d.]+ - matches 1+ occurrences of any character that is a digit or a .
\K - discards what has been matched so far, so it is not part of the reported match
[a-zμ]+ - this is what you want. This will contain the units like mg, ml appearing after the first number. If there are any other special characters like μ, you can add them too in this character list
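The same idea can be sketched in Python, with one swap: Python's re module does not support \K, so the unit is captured in Group 1 instead. This is only an illustration; the helper name first_unit is hypothetical, and both the micro sign (U+00B5) and the Greek mu (U+03BC) are listed in the class as an assumption, since either may appear in such data.

```python
import re

# \K is not available in Python's re, so the unit is captured in Group 1 instead.
# \u00b5 (micro sign) and \u03bc (Greek mu) are both listed, since either may appear.
UNIT = re.compile(r"^\D*[\d.]+([a-z\u00b5\u03bc]+)")

def first_unit(text):
    """Return the unit that follows the first number, or None if there is none."""
    match = UNIT.search(text)
    return match.group(1) if match else None

samples = [
    "Levofloxacin 500mg/100mL",
    "Levofloxacin 500mg",
    "Procaterol Hydrochloride \u2026\u2026 25\u03bcg",
]
units = [first_unit(s) for s in samples]
print(units)
```

Because the pattern is anchored with ^ and \D* stops at the first digit, only the unit after the first number is returned, so the mL in the first sample is never reached.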
Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.
There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
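As a rough cross-check outside Nintex, the Replace text variant can be tried in Python. This is a sketch under the assumption that Python's re module is used: it only allows fixed-width lookbehinds, so the Extract pattern with (?<=^.*-) does not carry over, and the $1 replacement is written as \1. The function name middle_number is made up for the example.

```python
import re

# Mirrors the Replace text pattern above; $1 becomes \1 in Python's syntax.
NUMBER_BEFORE_EXT = re.compile(r"^.*-([0-9]+)\.[^.]+$")

def middle_number(filename):
    """Keep only the digits between the last hyphen and the extension."""
    return NUMBER_BEFORE_EXT.sub(r"\1", filename)

result = middle_number("170905-CBM-238.pdf")
print(result)
```

A name that has no digits between the last hyphen and the extension does not match at all, so it is returned unchanged.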
I don't know Nintex regex, but here is a sed-style regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238
<.*>|\n.*\s.*\sid="(\w*)".*\n+|.*>\n|\n.+
and replace $1
This regex extracts every id from the file
<a href="java" class="total" id="maker" placeholder="getTheResult('local6')">master6<a>
Result is maker
How can I extract the getTheResult key name instead,
so that my result will be local6?
I tried <.*>|\n.*\s.*\sgetTheResult('(\w*)').*\n+|.*>\n|\n.+ but it didn't help
I assume that:
you have files with text like getTheResult('local6')
you may have several values like that on a line
you'd like to keep only those values, one value per line.
I suggest
getTheResult\('([^']*)'\)|(?:(?!getTheResult\(')[\s\S])*
and replace with $1\n. The \n will insert a newline between the values. You can then use ^\n regex (to replace with empty string) to remove empty lines.
Pattern details:
getTheResult\(' - matches getTheResult(' as a literal string (note the ( is escaped)
([^']*) - Group 1 capturing 0+ chars other than '
'\) - a literal ')
| - or
(?:(?!getTheResult\(')[\s\S])* - 0+ chars that are not starting chars of the getTheResult(' character sequence (this is a tempered greedy token).
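The two steps (replace with Group 1 plus a newline, then drop the empty lines) can be sketched in Python as follows. This assumes Python 3.7 or later, where an unmatched Group 1 in the replacement becomes an empty string; the function name extract_keys is hypothetical, and the second step is done by filtering lines rather than with a second regex.

```python
import re

# Either capture the key in Group 1, or consume text up to the next getTheResult('.
PATTERN = re.compile(
    r"getTheResult\('([^']*)'\)|(?:(?!getTheResult\(')[\s\S])*"
)

def extract_keys(text):
    # Step 1: keep only Group 1, one value per line (unmatched Group 1 becomes "").
    replaced = PATTERN.sub(r"\1\n", text)
    # Step 2: drop the empty lines left behind by the non-matching stretches.
    return [line for line in replaced.splitlines() if line]

html = '''<a href="java" class="total" id="maker" placeholder="getTheResult('local6')">master6<a>'''
keys = extract_keys(html)
print(keys)
```

Several calls on one line each end up on their own line after step 1, so they come back as separate list items.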
I am looking for a regex that can be fed to a "create external table" statement of Hive QL in the form of
"input.regex"="the regex goes here"
The condition is that the logs in the files that the RegexSerDe must be reading are of the following form:
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I am using the following create external table code:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
Here's the thing:
It is able to pull the FIRST LINE of each log entry, but not the remaining lines of entries that span more than one line. I tried everything from the links I found: replaced \z with \Z at the end, replaced \A with ^ and \Z or \z with $; nothing worked. Am I missing something in the output.format.string's %4$s, or am I not using the regex properly?
What the regex does:
It matches the timestamp first, followed by the log type (DEBUG or INFO or whatever), then the ID (a mix of lowercase letters, digits and hyphens), followed by ANYTHING until the next timestamp is found, or until the end of input is found, to match the last log entry. I also tried adding /m at the end, in which case the generated table has all NULL values.
There seem to be a number of issues with your regex.
First, remove your double square brackets.
Second, \A and \Z/\z are to match the beginning and end of the input, not just a line. Change \A to ^ to match start-of-line but don't change \z to $ as you do actually want to match end-of-input in this case.
Third, you want to match (.*?), not (.*)?. The first pattern is ungreedy, whereas the second pattern is greedy but optional. It should have matched your entire input to the end as you allowed it to be followed by end-of-input.
Fourth, . does not match newlines. You can use (\s|\S) instead, or ([x]|[^x]), etc., any pair of complementary matches.
Fifth, if it was giving you single-line matches with \A and \Z/\z, then the input itself was single lines too, since you were anchoring the whole string.
I would suggest trying to match just \n; if nothing matches, then newlines are not included in the input.
You can't add /m to the end as the regex does not include delimiters. It will try to match the literal characters /m instead, which is why you got no match. In Java regexes, flags are set inline instead, for example with a leading (?m).
If it were going to work, the regex you want would be:
"^([0-9: -]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)"
Breakdown:
^([0-9: -]{19},[0-9]{3})
Match start of line, then 19 characters that are digits, :, space or -, plus a comma, three digits and a space. Capture all but the final space (the timestamp). The - goes last in the character class so that it is a literal hyphen rather than a range operator.
(\\[[A-Z]*\\])
Match a literal [, any number of UPPERCASE letters, even none, a literal ] and a space. Capture all but the final space (the error level).
([0-9a-z-]*)
Match any number of digits, lowercase letters or - and a space. Capture all but the final space (the message id).
([\\s\\S]*?)(?=\\r?\\n[0-9: -]{19},[0-9]|\\r?\\Z)
Match any whitespace or non-whitespace character (that is, any character), but ungreedily (*?). Stop matching when a new record or the end of input (\Z) is immediately ahead. In this case you don't want to match end of line as, once again, you would only get one line in your output. Capture all but the final newline (the message text). The \r?\n is there to skip the final newline at the end of your message, as is the \r?\Z. You could also write \r?\n\z. Note: capital \Z also matches just before a final line terminator at the end of the input, if there is one. Lowercase \z matches only at the very end of input, not before a trailing newline. I have added \r? just in case you have to deal with Windows line endings; however, I don't believe this should be necessary.
However, I suspect that unless you can feed the whole file in at once instead of line-by-line that this will not work either.
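To check the pattern itself, independently of how Hive feeds input to the SerDe, it can be run against the whole text as a single string. The following is a Python sketch, not Hive behaviour, with these assumptions: re.MULTILINE is set so that ^ matches at line starts, the character class is written as [0-9: -] so that the hyphen is literal, and because Python's \Z matches only at the very end of the string, \r?\n?\Z stands in for Java's \Z. The sample log is shortened from the question.

```python
import re

# One log entry: timestamp, level, request id, then a lazy message that stops
# right before the next timestamp line or before the final newline of the input.
ENTRY = re.compile(
    r"^([0-9: -]{19},[0-9]{3}) (\[[A-Z]*\]) ([0-9a-z-]*) ([\s\S]*?)"
    r"(?=\r?\n[0-9: -]{19},[0-9]|\r?\n?\Z)",
    re.MULTILINE,
)

log = (
    "2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b First message.\n"
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "This is a special one.\n"
    "Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)

entries = [m.groups() for m in ENTRY.finditer(log)]
for timestamp, level, request_id, message in entries:
    print(timestamp, level, request_id, repr(message))
```

If the pattern behaves like this on the full text but the table still shows only first lines, that points to the input arriving one line at a time rather than to the regex.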
Another simple test you can try is:
"^([\\s\\S]+)^\\d"
If it works, it will match one or more full lines followed by a digit at the start of the next line (the first digit of your timestamp).
The following Java regex may help:
(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)
Breakdown:
1st Capturing group (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
2nd Capturing group (\[.+?\])
3rd Capturing group (.+?)
4th Capturing group ([\s\S]+?).
(?=\d{4}-\d{1,2}-\d{1,2}|\Z) Positive Lookahead - Assert that the regex below can be matched. 1st Alternative: \d{4}-\d{1,2}-\d{1,2}. 2nd Alternative: \Z asserts position at end of the string.
Reference http://regex101.com/
I don't know much about Hive, but the following regex, or a variation formatted for Java strings, might work:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)
This can be seen matching your sample data here:
http://rubular.com/r/tQp9iBp4JI
A breakdown:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) The date and time (capture group 1)
\[([a-zA-Z_-]+)\] The log level (capture group 2)
([\w-]+) The request id (capture group 3)
((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*) The potentially multi-line message (capture group 4)
The first three capture groups are pretty simple.
The last one is a little odd, but it's working on rubular. A breakdown:
( Capture it as one group
(?:[^\n\r]+) Match to the end of the line, don't capture
(?: Match line by line, after the first, but don't capture
[\n\r]{1,2} Match the new-line
\s Only lines starting with a space (this prevents new log-entries from matching)
[^\n\r]+ Match to the end of the line
)* Match zero or more of these extra lines
)
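A quick way to see this last group at work is a Python sketch like the one below. It relies on one assumption taken from the breakdown above: continuation lines start with whitespace, so the sample indents them with two spaces. The log text is shortened from the question and the variable names are only for illustration.

```python
import re

# Groups: 1 date and time, 2 log level, 3 request id, 4 message (possibly multi-line).
LOG_LINE = re.compile(
    r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) "
    r"((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)"
)

# Assumption: continuation lines are indented, which is what \s relies on.
log = (
    "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
    "  This is a special one.\n"
    "  Line 5.\n"
    "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n"
)

rows = LOG_LINE.findall(log)
for row in rows:
    print(row)
```

The second entry is not swallowed by the first because its line starts with a digit, not whitespace, so the repeated continuation group stops there.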
I used [^\n\r] instead of the . because it looks like RegexSerDe lets the . match newlines:
// Excerpt from https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java#L101
if (inputRegex != null) {
inputPattern = Pattern.compile(inputRegex, Pattern.DOTALL
+ (inputRegexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
} else {
inputPattern = null;
}
Hope this helps.