How do you select a specific range in a string using RegEx

So I have a string. Let's say for the argument it is this one:
1234567891113SomeTextExample
I want to have two regular expressions:
Select from beginning to, say, 6th position;
Select from 8th position to 12th position.
I know how to select everything AFTER specific position, e.g.:
(?<=.{6})(.*)$
would select everything after the first 6 characters.
I am using the Sublime Text editor and need to clean up some logs; these two expressions would save a lot of time.

Use ^ to anchor your regex at the beginning of the string.
Beginning to 6th position : ^(.{6})
var str = 'xdcfvgbhdsds';
var regex = /^(.{6})/;
console.log(regex.exec(str)[1]);
8th to 12th position : ^.{7}(.{5})
var str = 'xdcfvgbhddsfsffsds';
var regex = /^.{7}(.{5})/;
console.log(regex.exec(str)[1]);

Beginning to 6th position (Demo):
^(.{6}).*$
Characters 8 to 12, inclusive on both ends (Demo):
^.{7}(.{5}).*$
I am assuming here that you want to capture these specific ranges for some sort of use.
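For anyone who wants to check these two patterns outside an editor, here is a minimal Python sketch run against the string from the question. The variable names are only illustrative.

```python
import re

text = "1234567891113SomeTextExample"

# Beginning to 6th position: capture the first six characters.
first_six = re.match(r"^(.{6})", text).group(1)

# 8th to 12th position: skip seven characters, then capture five.
eight_to_twelve = re.match(r"^.{7}(.{5})", text).group(1)

print(first_six)        # 123456
print(eight_to_twelve)  # 89111
```

In both cases the wanted range is in capture group 1, not in the overall match.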

Finally I found it out.
First one - Select from beginning to, say, 6th position:
^(.{6})
Thanks Zenoo for this.
And select from 8th position to 12th position:
^(.{8})|(?<=.{12})(.*)$
Well, at least this one works in Sublime Text. I am sure there are lots and lots of editors/applications which are fine with Zenoo's approach (^.{7}(.{5})).
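A possible alternative when the tool selects the whole match rather than a capture group: move the skipped prefix into a lookbehind, so that the match itself is characters 8 to 12. This is only a sketch checked with Python's re module. It assumes the engine accepts a fixed-width lookbehind containing ^, which I have not verified in Sublime Text.

```python
import re

text = "1234567891113SomeTextExample"

# The lookbehind requires exactly seven characters between the start of the
# string and the match, so the match itself is characters 8 to 12.
match = re.search(r"(?<=^.{7}).{5}", text)

print(match.group(0))  # 89111
print(match.span())    # (7, 12)
```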

Related

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once the pattern returns what you want in the first capture group (the right-hand panel of the online regex tester shows the groups), you need to select it in SQL.
In Postgres, the substring function can take a regex pattern after the from keyword and extract the matching text from the input.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are fewer than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
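The two patterns above can be tried out quickly in Python before putting them into SQL. This is a rough sketch with a hypothetical helper name; Python's re engine is not the Postgres engine, and it tries alternatives left to right, so the |.* variant in particular may behave differently in Postgres.

```python
import re

UP_TO_FOURTH = re.compile(r"^(([^>]*>){4})")
WITH_FALLBACK = re.compile(r"^(([^>]*>){4}|.*)")

def up_to_fourth(value, pattern=UP_TO_FOURTH):
    """Return capture group 1, or None when the pattern does not match."""
    match = pattern.search(value)
    return match.group(1) if match else None

print(up_to_fourth("a>b>c>d>e>f"))             # a>b>c>d>
print(up_to_fourth("a>b>c"))                   # None
print(up_to_fourth("a>b>c", WITH_FALLBACK))    # a>b>c
```

Note that the result keeps the fourth > at the end.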
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimiter
selects only the first 4 elements
transforms the array back to a string
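The same three steps (split, keep the first four elements, join) look like this in Python. It is only an illustration of the idea, not a replacement for the SQL; the function name is made up.

```python
def first_four_fields(value, delimiter=">"):
    # Split on the delimiter, keep at most four pieces, glue them back together.
    return delimiter.join(value.split(delimiter)[:4])

print(first_four_fields("a>b>c>d>e>f"))  # a>b>c>d
print(first_four_fields("a>b"))          # a>b
```

Unlike the regex version, the result does not end with the fourth >.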
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
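To make the 1-based position argument concrete, here is a small Python imitation of SPLIT_PART. It is a simplified sketch: it returns an empty string for any position outside the list and does not try to reproduce every edge case of the Postgres function.

```python
def split_part(string, delimiter, position):
    """Return the field at the given 1-based position, or '' if out of range."""
    parts = string.split(delimiter)
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return ""

print(split_part("aa,bb,cc", ",", 2))  # bb
```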
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
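For readers who want to generate similar test data outside Postgres, a rough Python counterpart of the function could look like this. The names are hypothetical; it simply draws 40 characters from the same seven-character alphabet.

```python
import random

ALPHABET = "abcdef>"

def random_test_string(length=40):
    # Each character is drawn uniformly from 'abcdef>', so '>' is frequent.
    return "".join(random.choice(ALPHABET) for _ in range(length))

print(random_test_string())
```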
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).
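As a quick sanity check outside the database, the regex ^(?:[^>]*>){0,3}[^>]* can be run over the hand-made test strings in Python. This is only a sketch: Python's re is a different engine from the one in Postgres, and the helper name is made up.

```python
import re

BEFORE_FOURTH_GT = re.compile(r"^(?:[^>]*>){0,3}[^>]*")

def before_fourth_gt(value):
    # Up to three 'text>' groups, then the text before the next '>'.
    return BEFORE_FOURTH_GT.match(value).group(0)

samples = [
    "afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>",
    "adfd>adafs>afadsf>",   # only 3 '>'s
    "e>>>>>",               # multiple terminal '>'s
    "aaaaaaa",              # no '>'s at all
]
for value in samples:
    print(before_fourth_gt(value))
# afsad>adfsaf>asfasf>afasdX
# adfd>adafs>afadsf>
# e>>>
# aaaaaaa
```

For these inputs the output is the same as splitting on '>' and joining the first four elements.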

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, then surround the whole string with parentheses.
We then use regex extract to actually pull the pieces out; the groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text in this case is first cleaned with the regex REGEXREPLACE(A1,"_+\d*$","") to remove 123421, which gives you a column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = =REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
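The same two-step idea (strip the trailing number, then split on "_") can be checked in Python. This is just an illustration with a made-up function name; it does not use the Sheets functions themselves.

```python
import re

def split_fields(value):
    # Step 1: remove the trailing underscore(s) and digits.
    # Step 2: split what is left on "_".
    return re.sub(r"_+\d*$", "", value).split("_")

print(split_fields("campaign1_attribute1_whatever_yes_123421"))
# ['campaign1', 'attribute1', 'whatever', 'yes']
```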
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets is that I need to use it in Google Data Studio (which has the same regex functions as Sheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
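To see how the number inside {} picks the column, here is the same pattern built in Python for a given column number. It is a sketch with a hypothetical helper; Data Studio's REGEXP_EXTRACT is not involved.

```python
import re

def nth_field(value, column):
    # Skip (column - 1) 'field_' groups, then capture the next field,
    # which must itself be followed by "_".
    pattern = r"^(?:[^_]*_){%d}([^_]*)_" % (column - 1)
    match = re.search(pattern, value)
    return match.group(1) if match else None

campaign = "campaign1_attribute1_whatever_yes_123421"
print(nth_field(campaign, 1))  # campaign1
print(nth_field(campaign, 4))  # yes
print(nth_field(campaign, 5))  # None
```

Column 5 gives no match because the final number is not followed by "_".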
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

Replace at every 5th semicolon

I was wondering if it's possible to use Hive's regexp_replace at every nth occurrence. In my case I would like to replace every 5th semicolon with a pipe.
example of column data:
test;vid;1;;1.45;id:3;manlyman;2;4;;
So there would be 2 replacements in this one. This can't be a static replace because the column data will sometimes have only 5 and sometimes 15.
Use String.replaceAll() with the pattern mentioned by @bobble bubble:
int nth = 5;
String input ="test;vid;1;;1.45;id:3;manlyman;2;4;;";
System.out.println(input.replaceAll("(([^;]*;){"+(nth-1)+"}[^;]*);", "$1|"));
Output:
test;vid;1;;1.45|id:3;manlyman;2;4;|
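For quick experiments, the same pattern can be tried in Python. This is a sketch with a made-up function name, meant only to show what the regex does; it is not Hive code.

```python
import re

def pipe_every_nth_semicolon(value, nth=5):
    # Group 1: (nth - 1) 'field;' chunks plus the nth field; the semicolon
    # that follows is the nth one and is replaced by a pipe.
    pattern = r"((?:[^;]*;){%d}[^;]*);" % (nth - 1)
    return re.sub(pattern, r"\1|", value)

print(pipe_every_nth_semicolon("test;vid;1;;1.45;id:3;manlyman;2;4;;"))
# test;vid;1;;1.45|id:3;manlyman;2;4;|
```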

Python re find index position of first search match

I have a series of strings, most of which contain 4 digits in a row. I want to slice the string at the end of that fourth digit, using Python. Sometimes the string contains more than one such pattern. What I want is the index position of the FIRST match of my regular expression. What I have been able to get is the LAST match.
myString = 'Today is June 14, 2019. I sometimes like to think back when I was a child in 1730.'
theYear = re.compile("\d{4}")
[(m.start(0), m.end(0)) for m in re.finditer(theYear, myString)]
print m.span(0)
The result is (77, 81), which is the index position for the second date, not the first one. I know the problem is my loop, which will iterate through all of the matches, leaving me with the last one. But I haven't been able to figure out how to access those index positions without looping.
Thanks for any help.
Use search() instead of finditer(); it stops at the first match:
print(theYear.search(myString).span())
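A slightly fuller sketch in Python 3, including the slice the question asks for. The names first_span and sliced are just for illustration, and the code assumes the string contains at least one four-digit run (search() returns None otherwise).

```python
import re

myString = 'Today is June 14, 2019. I sometimes like to think back when I was a child in 1730.'
theYear = re.compile(r"\d{4}")

# search() scans from the left and returns only the first match.
first = theYear.search(myString)
first_span = first.span()
sliced = myString[:first.end()]

print(first_span)  # (18, 22)
print(sliced)      # Today is June 14, 2019
```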

RichTextBox search'n'replace results are staggered

I am currently trying to generate colored results after a search containing keywords. My code displays a RichTextBox containing a text that was successfully hit by the search engine.
Now I want to highlight the keywords in the text, by making them bold and colored in red. I have my list of words in a nice string table, which I browse this way (rtb is my RichTextBox, plainText is the only Run from rtb, containing the entire text of it) :
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
    }
}
Now I thought this would do the trick. But somehow, only the first word gets highlighted correctly. The second highlight starts too early: the correct number of letters is highlighted, but a few characters before the actual word. For the third occurrence it starts even more characters early, and so on.
Have you got any idea what is causing this behavior?
EDIT (01/07/2013): I still can't figure out why these results are staggered. So far I noticed that if I create a variable set to zero right before the second foreach statement, add it to each TextPointer's position, and increment it by 4 (no idea why) at the end of each loop, the results are colored correctly. Nevertheless, if I search for two or more keywords (it doesn't matter whether they're the same size), each occurrence of the first keyword gets colored correctly, but only the first occurrence of each other keyword is colored correctly (the others are staggered again). Here's the edited code:
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    int i = 0;
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index + i, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length + i, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
        i += 4; // number found out from trials
    }
}
Alright! So I learned by reading this question that every time I modify the style, it adds 4 characters to the text, which is what was throwing off my offsets.
To fix this, since I may have multiple keywords and they do not necessarily appear in the text in the order they were typed in the search box, I first had to go through my text and locate every occurrence of every keyword without modifying the text. For each occurrence, I store the start position, end position and desired color in a custom list.
When this pass is done, I sort my occurrence list by the start attribute of each member. I can now be sure that each occurrence I visit in my foreach loop is the next one in the text, regardless of its content or length. And I know which color I want it to have, so I can distinguish different keywords.
Then, finally, I can go through each member of my sorted list and modify the style of my text, knowing that the next word will appear later in the text, so I must add 4 characters to my index at the end of each loop.
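The idea is independent of WPF, so here is a rough language-neutral sketch of it in Python: collect all matches first, sort them by start position, then apply the changes while keeping a running shift. Plain bracket markers stand in for the RichTextBox formatting, and the shift grows by the marker length instead of the fixed 4. All names are hypothetical, and the sketch assumes the keyword matches do not overlap.

```python
import re

def collect_spans(text, keyword_colors):
    """Find every occurrence of every keyword without touching the text."""
    spans = []
    for word, color in keyword_colors.items():
        for m in re.finditer(re.escape(word), text):
            spans.append((m.start(), m.end(), color))
    spans.sort()  # order by start position, whatever the keyword
    return spans

def highlight(text, keyword_colors):
    out = text
    shift = 0  # characters added so far by earlier markers
    for start, end, color in collect_spans(text, keyword_colors):
        open_tag, close_tag = "[%s:" % color, "]"
        s, e = start + shift, end + shift
        out = out[:s] + open_tag + out[s:e] + close_tag + out[e:]
        shift += len(open_tag) + len(close_tag)
    return out

print(highlight("red fox, blue fox", {"fox": "red", "blue": "blue"}))
# red [red:fox], [blue:blue] [red:fox]
```

Because the spans are sorted before anything is modified, every marker inserted so far lies before the next span, which is why a single running shift is enough.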