Regex With Colons in Data - regex

I have a text file which I'm looking to remove some data from. The data is separated using a colon ':' as the delimiter. There are approx 9 separations. The data after the 7th column is most often null and thus useless but the additional colons are still there.
An example of the file would like this:
column1:column2:column3:column4:column5:column6:column7:column8:column9:column10
I hope to remove the info from after column8. So the data to be removed would be:
:column9:column10
Could someone advise me how to do so in Regex?
I've been reading and no where have I found a way to isolate a colon and text following after x number of colons.
Any help you could offer would be much appreciated.

$_ = join ":", ( split /:/, $_, -1 )[0..7];
or
s/(?::[^:]*){2}\z//;

The following regex will keep the first 8 columns and discard all others.
s/^[^:]*(?::[^:]*){7}\K.*//;
Assumes simple single line records.

Related

Parse a log file to fetch some values in a line

I am reading a log file where i am trying to fetch some values from lines which contains a substring "edited by:" and ending with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So i am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
I am assuming the fetched lines are in a list name line2.
like
set splitList[$line2 ' ']
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to tcl. And don't have much idea using regex in tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, on this extracted string I don't know how I can further drill to get my variables assigned values. Any suggestion on this approach or any other functions which are present in tcl library which can help me with this task please let me know.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier as we can assume that agent and role will not have spaces in. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary.)
regexp {\yedited by:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.

regular expression replace for SQL

I have to replace a string pattern in SQL with empty string, could anyone please suggest me?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002
There are the 4 digit codes with first 2 characters "alphabets" and last two are digits. This is always a 4 digit code. And I have to replace all codes except the codes starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a formatted string "separated by commas" for multiple "AE" codes as mentioned above.
Here is one option calling regex_replace multiple times, eliminating the "not required" strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
See Demo here
I would convert the list to an array, unnest that to rows then filter out those that should be kept and aggregate it back to a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',') as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example why storing comma separated values in a single column is not such a good idea to begin with.

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parenthesis to create capture groups, while also excluding the digit string at the end, then surround the whole string with parenthesis.
We then use regex extract to actuall pull the pieces out, the groups automatically push them to their own cells/columns
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 We use SPLIT(text, delimiter, [split_by_each]), the text in this case is formatted with regex =REGEXREPLACE(A1,"_+\d$","")* to remove 123421, witch will give you a column for each word delimited by ""
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked to be only in regex and for google sheets was because I need to use it in Google data studio (same regex functions than spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the numer inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract text string that includes lots of special characters in between, and is located between quotes ' ', You can do it in general this way:
\'[\S\s]*?\'
Demo
.
In your case, if you want to extract only the medial quote from this: caption': u'text I want to extract', u'likes': , Try this Regex:
(?<=u\')[\V]*?(?=\'\,)
Demo
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using drop down menu:
Edit Column
Split into several columns
by separator Separator '
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If you really have such a simple structure always, you can use splitting with "," (yes, with quotes) after discarding first number and comma
If no, you can use a very simple form of state machine parsing your input from left to right. You will have two states: insides quotes and outside. Regular expressions is a also a good (and simpler) way if you already know them (as they are basically an equivalent of state machine, just in another form)