perform substring extraction on data frame column - regex

I have a dataframe with one column called 'full_url'. Each element of the column is just a URL. How do I write a function to remove the 'http://' from all of the elements at once? I need to use some kind of regex because some don't have it at all, some have https, etc. The closest I've gotten is gsub(".*//","",unlist(full_url))
but that also returns 'full_url1' 'full_url2' 'full_url3' ... as the row names for some reason

Without a reproducible example I'm not sure, but would something like this work?
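sapply(as.character(df$full_url), function(x) ifelse(substr(x, 1, 7) == "http://", substr(x, 8, nchar(x)), x), USE.NAMES = FALSE)
This uses sapply to go element by element (apply needs a matrix or data frame, not a single column) and substr to check whether the first 7 characters are "http://". If they are, it returns the string without the "http://"; if they're not, it returns x unchanged. Note that nchar(x), not length(x), gives the number of characters in a string, and USE.NAMES = FALSE stops the original values from coming back as names.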
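The same idea, stripping an optional scheme from the front of every URL, can also be written as one anchored regex that handles http, https, and no scheme at all. Below is a minimal sketch in Python rather than R, only to illustrate the pattern; the function name strip_scheme and the sample URLs are made up for this example.

```python
import re

# Matches "http://" or "https://" only at the very start of the string.
SCHEME = re.compile(r"^https?://")


def strip_scheme(urls):
    """Return a new list with any leading http:// or https:// removed."""
    return [SCHEME.sub("", u) for u in urls]


# Hypothetical sample data, not taken from the question.
print(strip_scheme(["http://a.com", "https://b.org/x", "c.net"]))
```

In R the same pattern should work as sub("^https?://", "", df$full_url), since sub is vectorised over its input.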

Related

Extract string after last match strings [duplicate]

I'm using BigQuery and I want to extract the string after the last occurrence of a specific match string; in my case, the match string is sc
I have a string like this :
www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)
My expected result is:
titis)
Is this possible?
In general, across all RDBMS finding the index of the last instance of a match in a string is easy to compute by first reversing the string. Then we are only looking for the first match.
Update: BigQuery
Follow the documentation for REGEXP_EXTRACT in the String Functions documentation for BigQuery
NOTE: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
However, this problem can be solved without RegEx.
BigQuery supports array processing and has a SPLIT function, so you could split by the lookup variable and capture only the last result:
SELECT ARRAY_REVERSE(SPLIT( !YOUR COLUMN HERE! , "sc"))[OFFSET(0)]
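To make the split approach concrete, here is a small sketch in Python (not BigQuery) that splits on the lookup string and keeps only the final piece. The helper name after_last is an assumption for illustration; the sample text is the string from the question.

```python
TEXT = "www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)"


def after_last(text, sep):
    """Return whatever follows the last occurrence of sep in text."""
    # split() yields every piece between separators; the final piece is
    # the text after the last separator (or the whole text if sep is absent).
    return text.split(sep)[-1]


print(after_last(TEXT, "sc"))
```

Reversing the array and taking its first element, as in the query above, is equivalent to taking the last element directly.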
The following adaptation from my original submission may still work:
SELECT REVERSE(SUBSTR(REVERSE(@text), 1, STRPOS(REVERSE(@text), "cs") -1))
For those who have a similar requirement in MS SQL Server the following syntax can be used.
Other RDBMS can use a similar query; you will have to use the appropriate platform functions to achieve the result.
DECLARE @text varchar(200) = 'www.xxss.com?psct=T-EST2%20.coms&.com/u[sc''sc(mascscin'', sc''.c(scscossccnfiscg.scjs'']-/ci=1(sctitis)'
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1))
Produces: titis)
You could achieve a similar result by obtaining the last index of 'sc' as above and passing that value to SUBSTRING, but for that to work you would need to recompute the length. This solution instead uses the LEFT function and then reverses the result, which saves one function call.
Step this through:
Reverse the value:
SELECT REVERSE(@text)
Results in:
)sititcs(1=ic/-]'sjcs.gcsifnccssocscs(c.'cs ,'nicscsam(cs'cs[u/moc.&smoc.02%2TSE-T=tcsp?moc.ssxx.www
Now we find the first Index of 'cs'
Note: we have to reverse the sequence of the lookup string as well!
SELECT CharIndex('cs', REVERSE(@text),1)
Result: 7
Select the characters before this index:
Note: CharIndex returns the 1-based position at which the match starts, so we subtract 1 to stop just before the match.
SELECT LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1)
Finally, we reverse the result:
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text),1) -1))
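The reverse, find, cut, reverse-again steps can be traced outside SQL as well. The following is a sketch in Python under the assumption that a missing lookup string should yield None; the function name after_last_reversed is hypothetical. Python's find is 0-based, so no -1 adjustment is needed where CharIndex requires one.

```python
TEXT = "www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)"


def after_last_reversed(text, needle):
    """Text after the last needle, found via the reverse-string trick."""
    rev = text[::-1]                 # step 1: reverse the value
    idx = rev.find(needle[::-1])     # step 2: first match of the reversed needle
    if idx == -1:
        return None                  # assumption: no match gives None
    head = rev[:idx]                 # step 3: characters before the match
    return head[::-1]                # step 4: reverse the result back


print(after_last_reversed(TEXT, "sc"))
```

In everyday Python the same result comes from text.rpartition(needle)[2]; the step-by-step form is shown only because it mirrors the SQL.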
I guess you could use 'sc' as the separator and, if the string length is constant, define the string length in your query (wildcard):
STRING_SPLIT ( string , separator )

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have the pattern that will return what you want in the first group element (which you can tell at the online regex on the right side panel), you need to select it back in the SQL.
In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
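To see what the finished pattern captures, here is a minimal sketch that applies the same regex with Python's re module. The helper name up_to_fourth and the sample values are made up for illustration; returning None when there is no match mirrors the NULL that Postgres returns in that case.

```python
import re

# Same pattern as above: four repetitions of "non-'>' characters, then '>'",
# anchored at the start of the string; group 1 holds the whole prefix.
PATTERN = re.compile(r"^(([^>]*>){4})")


def up_to_fourth(text):
    """Return the prefix through the fourth '>', or None if there are fewer."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


print(up_to_fourth("a>b>c>d>e"))
print(up_to_fourth("a>b>c"))
```

Note that the captured text includes the fourth > itself.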
If you want to match the entire string whenever there are fewer than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimiter
selects only the first 4 elements
transforms the array back to a string
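The three steps above (split, slice, join) can be sketched in Python as follows. The function name first_n_fields and the sample values are assumptions for this example only.

```python
def first_n_fields(text, delim=">", n=4):
    """Split on delim, keep the first n pieces, and join them back."""
    parts = text.split(delim)      # 1. split the string into a list
    kept = parts[:n]               # 2. select only the first n elements
    return delim.join(kept)        # 3. turn the list back into a string


print(first_n_fields("a>b>c>d>e"))
print(first_n_fields("a>b"))
```

Unlike the regex version, this drops the trailing delimiter and returns the whole string when there are fewer than n delimiters.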
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
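For readers who want to check the semantics outside the database, a rough Python equivalent is sketched below. It assumes a 1-based position and returns an empty string when the position is past the last field, which matches how SPLIT_PART behaves in Postgres; negative positions are not handled in this sketch.

```python
def split_part(text, delimiter, position):
    """Return the position-th field (1-based) of text split on delimiter."""
    parts = text.split(delimiter)
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return ""  # out-of-range positions give an empty string


print(split_part("aa,bb,cc", ",", 2))
```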
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
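The generator simply draws 40 characters at random from the seven-character alphabet 'abcdef>'. A comparable sketch in Python, with the hypothetical name random_test_string, would be:

```python
import random


def random_test_string(length=40, alphabet="abcdef>"):
    """Build a random string; one character in seven is '>' on average."""
    return "".join(random.choice(alphabet) for _ in range(length))


print(random_test_string())
```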
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
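To show what this pattern returns on the hand-written test rows, here is a small sketch using Python's re module instead of Postgres. The function name before_fourth is an assumption; the pattern and the sample strings are the ones used above.

```python
import re

# Up to three "field plus '>'" groups, then one more field without its '>'.
PATTERN = re.compile(r"^(?:[^>]*>){0,3}[^>]*")

SAMPLES = [
    "afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>",
    "23433>433453>4>4559>455>3433>",
    "adfd>adafs>afadsf>",
    "babedacfab>feaefbf>fedabbcbbcdcfefefcfcd",
    "e>>>>>",
    "aaaaaaa",
]


def before_fourth(text):
    """Text up to, but not including, the fourth '>' (or all of it)."""
    return PATTERN.match(text).group(0)


for s in SAMPLES:
    print(repr(before_fourth(s)))
```

Because every part of the pattern may match the empty string, it always matches, so strings with fewer than four > characters come back unchanged.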
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Split string and get last element

Let's say I have a column which has values like:
foo/bar
chunky/bacon/flavor
/baz/quz/qux/bax
I.e. a variable number of strings separated by /.
In another column I want to get the last element from each of these strings, after they have been split on /. So, that column would have:
bar
flavor
bax
I can't figure this out. I can split on / and get an array, and I can see the function INDEX to get a specific numbered indexed element from the array, but can't find a way to say "the last element" in this function.
Edit:
this one is simpler:
=REGEXEXTRACT(A1,"[^/]+$")
You could use this formula:
=REGEXEXTRACT(A1,"(?:.*/)(.*)$")
And also possible to use it as ArrayFormula:
=ARRAYFORMULA(REGEXEXTRACT(A1:A3,"(?:.*/)(.*)$"))
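Both patterns can be tried outside the spreadsheet. The sketch below uses Python's re module, which is a different engine from the one Sheets uses, although these two patterns behave the same way in both. The function names are made up for this example and the inputs are the three values from the question.

```python
import re


def last_segment(path):
    """Simpler pattern: the run of non-slash characters at the end."""
    m = re.search(r"[^/]+$", path)
    return m.group(0) if m else None


def last_segment_grouped(path):
    """Grouped pattern: skip everything through the last slash, capture the rest."""
    m = re.search(r"(?:.*/)(.*)$", path)
    return m.group(1) if m else None


for value in ["foo/bar", "chunky/bacon/flavor", "/baz/quz/qux/bax"]:
    print(last_segment(value), last_segment_grouped(value))
```

One difference worth knowing: the grouped pattern requires at least one slash, so a value with no slash at all matches only the simpler pattern.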
Here's some more info:
the RegExExtract function
Some good examples of syntax
my personal list of Regex Tricks
This formula will do the same:
=INDEX(SPLIT(A1,"/"),LEN(A1)-len(SUBSTITUTE(A1,"/","")))
But it references A1 three times, which is not preferable.
You could do this too
=index(SPLIT(A1, "/"), COLUMNS(SPLIT(A1, "/"))-1)
Also possible, perhaps best on a copy, with Find:
.+/
(Replace with blank) and Search using regular expressions ticked.
You can try use this!
Once you have the array of strings, you can access the last element by using its length:
String message = "chunky/bacon/flavor";
String[] outSplited = message.split("/");
System.out.println(outSplited[outSplited.length -1]);

Postgresql - How do I extract the first occurrence of a substring in a string using a regular expression pattern?

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'
You can use regexp_matches() instead:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match - but it returns an array so you need to pick the first (and only) element from the result (that's the [1] part)
It is probably a good idea to limit the update to only rows that would match the regex in the first place:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}'
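The "first match only" behaviour can be checked with a short sketch in Python, where re.search plays the role of a single-match lookup and re.findall returns every match. The function name first_match is hypothetical; the sample text is the row from the question.

```python
import re

PATTERN = re.compile(r"I [0-9]{1,3}")


def first_match(text):
    """Return the first substring matching the pattern, or None."""
    m = PATTERN.search(text)      # scans left to right, stops at the first hit
    return m.group(0) if m else None


row = "I 56, donkey, moon, I 92"
print(first_match(row))           # first occurrence only
print(PATTERN.findall(row))       # every occurrence, for comparison
```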
Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')
You can use regexp_match() in PostgreSQL 10+:
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...

Use REPLACE and LIKE together in postgres

I am trying to replace all the occurrences of '-' in a column of a table.
I also need to replace the string that follows the dash, which is a random number.
To be more specific this is one of my values:
"ANDRIU 5-9, CHAL 152 34, SOMETHING"
What I want is to replace this part:
-9
with an empty space.
The problem is that: 9 can be any number and not necessarily one digit.
So I need something like finding the position of the first comma in the whole string. And the position of the dash and then replacing this based on the index values.
Is this possible?
Postgres provides the function regexp_replace(), which does what you want directly:
select regexp_replace(col, '-[0-9]+', ' ')
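Here is a sketch of the same substitution in Python, using the value from the question. One difference to keep in mind: regexp_replace() in Postgres replaces only the first match unless the 'g' flag is passed as a fourth argument, while Python's re.sub replaces every match unless count is given. The function name drop_dash_number is made up for this example.

```python
import re


def drop_dash_number(text, first_only=False):
    """Replace a dash followed by one or more digits with a single space."""
    count = 1 if first_only else 0   # count=0 means "replace all" in re.sub
    return re.sub(r"-[0-9]+", " ", text, count=count)


print(drop_dash_number("ANDRIU 5-9, CHAL 152 34, SOMETHING"))
```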