I need to combine multiple lines starting with the same ID - regex

I have multiple lines in a text file that I need to combine together. The file is about 200 million lines long, so opening it with Excel and using their built-in tools is out of the picture.
The first set of lines looks like this:
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
Second set which I want to add at the last line of the first set is:
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
If anyone has experience with this, I'd love some help

Code
Regex
^(\d+),(.*$)(?=[\s\S]*^\1,(.*))
Formatting output
$1,$2,$3
Results
Input
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Output
1,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Explanation
^ Assert position at the start of a line
(\d+) Capture one or more digits into capture group 1
, Match the comma character , literally
(.*$) Capture any number of any character (except newline characters) until the asserted position at the end of the line (asserting end of line position dramatically reduces steps) into capture group 2
(?=[\s\S]*^\1,(.*)) Positive lookahead asserting what follows matches
[\s\S]* Match any number of any character (\s: any whitespace character; \S: any non-whitespace character)
^ Assert position at the start of a line
\1 Matches the same text as most recently matched by the 1st capturing group
, Matches the comma character , literally
(.*) Capture any number of any character into capture group 3

Related

How to find regex for multiple conditions

I am trying to find regex which would find below matches. I would replace these with blank. I am able to create regex for few of these conditions individually, but I am not able to figure out how to create one regex for all of these
Strings:
song1 artist (SiteWithMp3Keyword.com).mp3
02.song2 | siteWithdownloadKeyword.in 320 Kbps
song3 [SitewithDjKeyword.in] 128kbps.mp3
Output
song1 artist.mp3
song2
song3.mp3
Criteria for match:
Case Insensitive
Find Strings with particular keyword and remove whole word, even if inside any braces
Find kpbs keyword and remove it along with any number before it (128/320)
if string ends in .mp3, keep it as it is.
Remove junk characters (like | ) and replace _ with space.
Remove number if present at start of string, like 001_ 02. etc.
Trim whitespaces before and after remaining string
Example Regex for 2.
\S+(mp3|dj|download)\S+
https://regex101.com/r/nxp4d3/1
Try this regex ....
Find:^[0-9. ]*(song\d+ (\w+ )?).*?(\.mp3 ?)?$
Replace with:$1$3
P.S , if this code doesn't solve your problem, please share a sample of your real data, so someone well better understand you,
Thanks...
For the example data, you might use:
^\h*(?:\d+\W*)?(\w+(?:\h+\w+)*).*?(\.mp3)?\h*$
The pattern matches:
^ Start of string
\h* Match optional leading spaces
(?:\d+\W*)? Match 1+ digits followed by optional non word characters
(\w+(?:\h+\w+)*) Capture group 1, match word characters optionally repeated with a space in between
.*? Match any character except a newline, as least as possible
(\.mp3)? Optionally capture .mp3 in group 2
\h* Match optional trailing spaces
$ End of string
Regex demo
Replace with capture group 1 and group 2
$1$2

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
see correct formatting with code here
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed, that within the dataset, that the Province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code, so that the expression only targets the second line, then matches the first string in-between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7' which I dont want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word in the second lines in capturing group 1.
Then match the rest of the second line and assert what follows is 3 times a line.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole lines and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1 Match 1+ times a non whitespace char (the second "word')
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
Regex demo
You could also turn the lookahead in to a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
Regex demo

Is there a regex for adding the first 4 characters to end of string and the last 4 characters to start of string?

I have some lines which I need to alter. They are protein sequences. How would I copy the first 4 characters of the line to the end of the line, and also copy the last 4 characters to the beginning of the line?
The strings are variable which complicates it, for example:
>X
LTGLGIGTGMAATIINAISVGLSAATILSLISGVASGGAWVLAGAKQALKEGGKKAGIAF
>Y
LVATGMAAGVAKTIVNAVSAGMDIATALSLFSGAFTAAGGIMALIKKYAQKKLWKQLIAA
Moreover, how could I exclude lines with a '>' at the beginning (these are names of the corresponding sequence)?
Does anyone know a regex which will allow this to work?
I've already tried some regex solutions but I'm not very experienced with this sort of thing and I can find the end string but can't get it to replace:
Find:
(...)$
Replace:
^$2$1"
An example of what I want to achieve is:
>1
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes:
>1
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCD
Thanks
Try doing a find, in regex mode, on the following pattern:
^([A-Z]{4}).*([A-Z]{4})$
Then replace with the first four and last four characters swapped:
$2$0$1
Demo
You can use the regex below.
^(([A-Z]{4})([A-Z]*)([A-Z]{4}))$
^ asserts the position at the start of the line, so nothing can come before it.
( is the start of a capture group, this is group 1.
( is the start of a capture group, this is group 2. This group is inside group 1.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 2.
( is the start of a capture group, this is group 3.
[A-Z]* matches capital characters from A to Z between zero and infinite times.
) is the end of capture group 3.
( is the start of a capture group, this is group 4.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 4.
$ asserts the position at the end of the line, so nothing can come after it.
See how it works with a replace here: https://regex101.com/r/W786uL/3.
$4$1$2
$4 means put capture group 4 here. Which is the last 4 characters.
$1 means put capture group 1 here. Which is everything in the entire string.
$2 means put capture group 2 here. Which is the first 4 characters.
You can use
^(.{4})(.*?)(.{4})$
^ - start of sting
(.{4}) - Match any for characters except new line
(.*?) - Match any character zero or more time (lazy mode)
$ - End of string
Demo

regex optional group matching different sections of optional data

I have the following type of data
Apple:Red
Kiwi:brown,Box:no
Grapes:"Black,Green",Box:yes,qty:55
I created this regex,
(.*):(.*)(?:(,)?),(Box):(.*),(qty):(.*)
but the issue is this matches only 3rd line.
first line has one set of data, second line has one comma and thrid line has two commas. in other words, i have 3 sets of data and i need all the key and value in capture groups. How can i make sections after each comma is optional, so that i can match all 3 lines?
See regex in use here
([^:\n]*):("[^"]*"|[^,\n]*)(?:,Box:([^,\n]+)(?:,qty:(\d+))?)?
([^:\n]*) Capture any character except : or \n into capture group 1
: Match this literally
("[^"]*"|[^,\n]*) Capture either of the following into capture group 2
"[^"]*" Match ", then any character except " any number of times, then "
[^,\n]* Match any character except , or \n any number of times
(?:,Box:([^,\n]+)(?:,qty:(\d+))?)? Optionally match the following
,Box: Match this literally
([^,\n]+) Capture any character except , or \n one or more times into capture group 3
(?:,qty:(\d+))? Optionally match the following
,qty: Match this literally
(\d+) Capture one or more digits into capture group 4
Results in the following:
Apple
Red
Kiwi
brown
no
Grapes
"Black,Green"
yes
55

Hive RegexSerDe Multiline Log matching

I am looking for a regex that can be fed to a "create external table" statement of Hive QL in the form of
"input.regex"="the regex goes here"
The condition is that the logs in the files that the RegexSerDe must be reading are of the following form:
2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
This is a special one.
This has a message that is multi-lined.
This is line number 4 of the same log.
Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.
I am using the following create external table code:
CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';
Here's the thing:
It is able to pull all the FIRST LINES of each log. But not the other lines of logs that have more than one lines. I tried all links, replaced \z with \Z at the end, replaced \A with ^ and \Z or \z with $, nothing worked. Am I missing something in the output.format.string's %4$s? or am I not using the regex properly?
What the regex does:
It matches the timestamp first, followed by the log type (DEBUG or INFO or whatever), then the ID (mix of lower case alphabets, numbers and hyphens) followed by ANYTHING, till the next timestamp is found, or till the end of input is found to match the last log entry. I also tried adding the /m at the end, in which case, the table generated has all NULL values.
There seem to be a number of issues with your regex.
First, remove your double square brackets.
Second, \A and \Z/\z are to match the beginning and end of the input, not just a line. Change \A to ^ to match start-of-line but don't change \z to $ as you do actually want to match end-of-input in this case.
Third, you want to match (.*?), not (.*)?. The first pattern is ungreedy, whereas the second pattern is greedy but optional. It should have matched your entire input to the end as you allowed it to be followed by end-of-input.
Fourth, . does not match newlines. You can use (\s|\S) instead, or ([x]|[^x]), etc., any pair of complimentary matches.
Fifth, if it was giving you single line matches with \A and \Z/\z then the input was single lines also as you were anchoring the whole string.
I would suggest trying to match just \n, if nothing matches then newlines are not included.
You can't add /m to the end as the regex does not include delimiters. It will try to match the literal characters /m instead which is why you got no match.
If it was going to work the regex you want would be:
"^([0-9:- ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) ([\\s\\S]*?)(?=\\r?\\n([0-9:-] ){19},[0-9]|\\r?\\z)"
Breakdown:
^([0-9:- ]{19},[0-9]{3})
Match start of newline, and 19 characters that are digits, :, - or plus a comma, three digits and a space. Capture all but the final space (the timestamp).
(\\[[A-Z]*\\])
Match a literal [, any number of UPPERCASE letters, even none, a literal ] and a space. Capture all but the final space (the error level).
([0-9a-z-]*)
Match any number of digits, lowercase letters or - and a space. Capture all but the final space (the message id).
([\\s\\S]*?)(?=\\r?\\n([0-9:-] ){19},[0-9]|\\r?\\Z)
Match any whitespace or non-whitespace character (any character) but match ungreedy *?. Stop matching when a new record or end of input (\Z) is immediately ahead. In this case you don't want to match end of line as once again, you will only get one line in your output. Capture all but the final (the message text). The \r?\n is to skip the final newline at the end of your message, as is the \r?\Z. You could also write \r?\n\z Note: capital \Z includes the final newline at the end of the input if there is one. Lowercase \z matches at end of input only, not newline before end of input. I have added \z? just in case you have to deal with Windows line endings, however, I don't believe this should be necessary.
However, I suspect that unless you can feed the whole file in at once instead of line-by-line that this will not work either.
Another simple test you can try is:
"^([\\s\\S]+)^\\d"
If it works it will match any full line followed by a line digit on the next line (the first digit of your timestamp).
Following Java regex may help:
(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S\s]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)
Breakdown:
1st Capturing group (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
2nd Capturing group (\[.+?\])
3rd Capturing group (.+?)
4th Capturing group ([\s\S]+?).
(?=\d{4}-\d{1,2}-\d{1,2}|\Z) Positive Lookahead - Assert that the regex below can be matched.1st Alternative: \d{4}-\d{1,2}-\d{1,2}.2nd Alternative: \Z assert position at end of the string.
Reference http://regex101.com/
I don't know much about Hive, but the following regex, or a variation formatted for Java strings, might work:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) \[([a-zA-Z_-]+)\] ([\w-]+) ((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*)
This can be seen matching your sample data here:
http://rubular.com/r/tQp9iBp4JI
A breakdown:
(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+) The date and time (capture group 1)
\[([a-zA-Z_-]+)\] The log level (capture group 2)
([\w-]+) The request id (capture group 3)
((?:[^\n\r]+)(?:[\n\r]{1,2}\s[^\n\r]+)*) The potentially multi-line message (capture group 4)
The first three capture groups are pretty simple.
The last one might is a little odd, but it's working on rubular. A breakdown:
( Capture it as one group
(?:[^\n\r]+) Match to the end of the line, dont capture
(?: Match line by line, after the first, but dont capture
[\n\r]{1,2} Match the new-line
\s Only lines starting with a space (this prevents new log-entries from matching)
[^\n\r]+ Match to the end of the line
)* Match zero or more of these extra lines
)
I used [^\n\r] instead of the . because it looks like RegexSerDe lets the . match new lines (link):
// Excerpt from https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java#L101
if (inputRegex != null) {
inputPattern = Pattern.compile(inputRegex, Pattern.DOTALL
+ (inputRegexIgnoreCase ? Pattern.CASE_INSENSITIVE : 0));
} else {
inputPattern = null;
}
Hope this helps.