Substitute all non-matching characters between certain columns - regex

I'm trying to substitute all non-matching characters on a single line between certain columns (after a search).
Example:
The search pattern can be anything.
In the example below, the search pattern is test.
The substitute for non-matching characters is a space.
I want to substitute all characters that are not part of "test" between columns 10 and 30.
Columns 10 and 30 are indicated with |
before: djd<aj.testjal.kjetestjaja testlala ratesttsuvtesta !<-a-
| |
after: djd<aj.test test testlala ratesttsuvtesta !<-a-
How can I achieve this?

Use the following substitution command on that line.
:s/\(test\)\zs\|\%>9v\%<31v./\=submatch(1)!=''?'':' '/g
If the range of columns is specified using visual selection, run
:'<,'>s/\(test\)\zs\|\%V./\=submatch(1)!=''?'':' '/g
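For comparison outside Vim, the same idea can be sketched in Python: mark every character that belongs to a match, then blank the unmarked characters inside the column range. This is only an illustrative sketch, not a translation of the Vim command. The function name blank_non_matching is made up for this example, and it counts plain character positions rather than Vim's virtual columns, so tabs and wide characters would behave differently.

```python
import re


def blank_non_matching(line: str, pattern: str, first_col: int, last_col: int) -> str:
    """Blank characters in columns first_col..last_col (1-based, inclusive)
    unless they are part of a match of `pattern`."""
    # Collect the 0-based index of every character covered by a match.
    keep = set()
    for m in re.finditer(pattern, line):
        keep.update(range(m.start(), m.end()))

    chars = list(line)
    # Walk the requested column range, clipped to the line length.
    for i in range(first_col - 1, min(last_col, len(chars))):
        if i not in keep:
            chars[i] = " "
    return "".join(chars)


before = "djd<aj.testjal.kjetestjaja testlala ratesttsuvtesta !<-a-"
after = blank_non_matching(before, r"test", 10, 30)
print(before)
print(after)
```

As in the Vim command, a match that only partly overlaps the column range is kept whole, and the line length does not change.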

One method is to select the appropriate column range in Visual block mode (Ctrl+V).
Once it is selected, the search and replace can be restricted to the selection with \%V (see this question):
%s/\%Vfoo/bar/g
A regular expression for "not test" can be found here: Regular expression to match a line that doesn't contain a word?

Related

Why regexp_matches returns incorrect number of matches

OS: Debian 9.6.18-1 (64 bits)
PostgreSQL: 9.6.18 (64 bits)
Inside a table I have a text column that, for the sake of this example, I call colval. I want to select rows where the colval column matches one of the following patterns:
1) variableName:*
2) variableName:partOfAValue*
3) variableName: *partOfAValue
4) variableName: *partOfAValue*
I have defined the following regular expression based on the four rules above. To ease reading, I'm writing it here on several lines to show which part of the regular expression matches which rule. As the text being searched is part of a RESTful API log file, the character : may also be encoded as %3A and the asterisk may be encoded as %2A (URL percent-encoding).
[a-zA-Z][a-zA-Z0-9]*([:]|%3A) ---> this for matching 'variableName:'
(
([*]|%2A) ----> * matches the rule n° 1
| [^=*]+([*]|%2A) ----> partOfAValue* matches the rule n° 2
|([*]|%2A)[^=*]+ ----> *partOfAValue matches the rule n° 3
|([*]|%2A)[^=*]+([*]|%2A) ----> *partOfAValue* matches the rule n° 4
)
I did a few tests and apparently this works and detects the matching rows.
Recently I was asked to provide, in addition to the matching rows, the number of matches in each row. For example, if I have the following row:
###var1:*########var2:enter*####
This should return two because there are two occurrences/matches: var1 matches rule 1 and var2 matches rule 2. I checked the online documentation (9.7. Pattern Matching) to see whether PostgreSQL has a function for counting how many times a string matches a given regular expression, and I found the regexp_matches function, which seems to be what I'm looking for. Yet when I tried it with an example to learn how it works, I was quite confused by the result. Here is my test case:
with tmptab as
(
select 'line_01##var1:*#####var2:*val' as colval union all
select 'line_02##var1:*val*' as colval union all
select 'line_03' as colval union all
select 'line_04' as colval union all
select 'line_05' as colval union all
select 'line_06#####var1:*###var2:*endval#####var3:*value*####var4:val*' as colval
)
select
colval,
regexp_matches(colval, '[a-zA-Z][a-zA-Z0-9]*([:]|%3A)(([*]|%2A)|[^=*]+([*]|%2A)|([*]|%2A)[^=*]+|([*]|%2A)[^=*]+([*]|%2A))', 'i'),
array_length(regexp_matches(colval, '[a-zA-Z][a-zA-Z0-9]*([:]|%3A)(([*]|%2A)|[^=*]+([*]|%2A)|([*]|%2A)[^=*]+|([*]|%2A)[^=*]+([*]|%2A))', 'i') , 1)
from tmptab;
And here is the result
 colval                                                          | regexp_matches                      | array_length
-----------------------------------------------------------------+-------------------------------------+--------------
 line_01##var1:*#####var2:*val                                   | {:,*#####var2:*,NULL,NULL,NULL,*,*} |            7
 line_02##var1:*val*                                             | {:,*val*,NULL,NULL,NULL,*,*}        |            7
 line_06#####var1:*###var2:*endval#####var3:*value*####var4:val* | {:,*###var2:*,NULL,NULL,NULL,*,*}   |            7
(3 rows)
The output is correct in the sense that only line_01, line_02 and line_06 match one of the patterns in the regular expression. But I don't understand why the array returned by regexp_matches has seven elements. I expected two matches for the first row, one for the second and four for line_06. I also don't understand the NULL values in the array.
Could you clarify? It seems that either my regular expression is wrong or I misunderstand how regexp_matches works (or possibly both).
Is regexp_matches the correct way of counting matches in PostgreSQL while using regular expressions?
You're creating a capture group, and getting an additional element in the array, for each pair of parentheses in your regex; notice how the last one is always *. You could use non-capturing parentheses (?:...). You may also need the global flag on lines with more than one match (I'm not familiar with PostgreSQL).
The regexp_matches function returns a text array of all of the
captured substrings resulting from matching a POSIX regular expression
pattern. It has the syntax regexp_matches(string, pattern [, flags ]).
The function can return no rows, one row, or multiple rows (see the g
flag below). If the pattern does not match, the function returns no
rows. If the pattern contains no parenthesized subexpressions, then
each row returned is a single-element text array containing the
substring matching the whole pattern. If the pattern contains
parenthesized subexpressions, the function returns a text array whose
n'th element is the substring matching the n'th parenthesized
subexpression of the pattern (not counting "non-capturing"
parentheses; see below for details). The flags parameter is an
optional text string containing zero or more single-letter flags that
change the function's behavior. Flag g causes the function to find
each match in the string, not only the first one, and return a row for
each such match.
https://www.postgresql.org/docs/9.3/functions-matching.html
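The difference between the number of capture groups and the number of matches can be seen in a small Python sketch. It is an analogy only: Python's re module is a different engine from PostgreSQL's, and for brevity the pattern below covers rule 1 only (a name, then : or %3A, then * or %2A). The names WITH_GROUPS, NO_GROUPS and count_matches are assumptions made for this example.

```python
import re

# Rule 1 only, once with capturing and once with non-capturing parentheses.
WITH_GROUPS = r"[a-zA-Z][a-zA-Z0-9]*([:]|%3A)([*]|%2A)"
NO_GROUPS = r"[a-zA-Z][a-zA-Z0-9]*(?:[:]|%3A)(?:[*]|%2A)"

rows = [
    "line_01##var1:*#####var2:*val",
    "line_02##var1:*val*",
    "line_03",
]


def count_matches(text: str) -> int:
    """Count non-overlapping matches of the whole pattern in `text`."""
    return sum(1 for _ in re.finditer(NO_GROUPS, text, re.IGNORECASE))


# One match yields one entry per capturing group, however many
# times the pattern occurs in the row.
first = re.search(WITH_GROUPS, rows[0], re.IGNORECASE)
group_count = len(first.groups())

counts = [count_matches(row) for row in rows]
print(group_count, counts)
```

In PostgreSQL itself, the documentation quoted above indicates that the g flag makes regexp_matches return one row per match, so counting those rows would be the corresponding approach.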

Remove leading 0 in String with letters and digits

I have a comma-separated file where I need to change the first column by removing the leading zeroes from the string. The text file is as below:
ABC-0001,ab,0001
ABC-0010,bc,0010
I need to get the following output:
ABC-1,ab,0001
ABC-10,bc,0010
I tried a command-line replacement as below:
sed 's/ABC-0*[1-9]/ABC-[1-9]/g' file
I ended up getting output:
ABC-[1-9],ab,0001
ABC-[1-9]0,bc,0010
Can you please tell me what I am missing here?
Alternatively, I also tried to apply formatting in the SQL that generates this file, as below:
select regexp_replace(key,'((0+)|1-9|0+)','(1-9|0+)') from file where key in ('ABC-0001','ABC-0010')
which gives output as
ABC-(1-9|0+)1
ABC-(1-9|0+)1(1-9|0+)
Help with either solution would be much appreciated!
Try this :
sed -E 's/ABC-0*([1-9])/ABC-\1/g' file
                -------     --
                   |        |
            capturing group |
                     captured group
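The same capture-and-reuse idea can be sketched in Python with re.sub, which also refers to the captured digit as \1 in the replacement. The function name strip_leading_zeros is an assumption for this example; the first call reproduces the original attempt to show why it printed a literal [1-9].

```python
import re


def strip_leading_zeros(line: str) -> str:
    # Group 1 captures the first non-zero digit so the replacement
    # can put it back with \1 instead of discarding it.
    return re.sub(r"ABC-0*([1-9])", r"ABC-\1", line)


# Without a capture group the replacement text is inserted literally.
broken = re.sub(r"ABC-0*[1-9]", "ABC-[1-9]", "ABC-0001,ab,0001")

fixed = [strip_leading_zeros(s) for s in ("ABC-0001,ab,0001", "ABC-0010,bc,0010")]
print(broken)
print(fixed)
```

A key made only of zeroes, such as ABC-0000, is left unchanged because the pattern requires a non-zero digit.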
To do it in the query using Oracle, where the value with the zeroes you want to remove is in a column called "key" in a table called "file", you could write:
select regexp_replace(key, '(-)(0+)(.*)', '\1\3')
from file;
You need to capture the dash because it is "consumed" by the regex when it is matched. It is followed by a second group of one or more 0's, then a third group with the rest of the field. Replacing the match with captured groups 1 and 3 leaves out the 0's between them.

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
Finally, add a start-of-string anchor for good measure so that it only matches from the beginning of the string (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have a pattern that returns what you want in the first group (which you can check in the right-hand panel of the online regex tester), you need to select it back in the SQL.
In Postgres, the substring function can take a regex pattern after the from keyword to extract text from the input.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are fewer than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
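The four-times-repeated group can be tried outside the database with a short Python sketch. This is an approximation under stated assumptions: Python's re engine is not PostgreSQL's, so alternation and greediness may resolve differently, and the helper name up_to_fourth_gt is made up here. re.match already anchors at the start of the string, so no ^ is needed.

```python
import re
from typing import Optional


def up_to_fourth_gt(text: str) -> Optional[str]:
    """Return the text up to and including the 4th '>', or None
    when the string has fewer than four '>' characters."""
    # The inner group is non-capturing; group 1 wraps all four repetitions.
    m = re.match(r"((?:[^>]*>){4})", text)
    return m.group(1) if m else None


print(up_to_fourth_gt("a>b>c>d>e>f"))
print(up_to_fourth_gt("a>b>c"))
```

Note that the result keeps the fourth > itself, which matters if the goal is the text before that character.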
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimiter
selects only the first 4 elements
transforms the array back to a string
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
I put in the records with fewer than 4 >s because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string in that case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works and is faster (about 9% on the fiddle, about 30% on the 40M-row table):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).
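The edge cases above can be checked quickly with a Python sketch that mirrors the two working approaches: split-and-rejoin, and the {0,3} regex. It assumes Python's re behaves like PostgreSQL's engine for this particular pattern, which holds for these samples but is not guaranteed in general; via_split and via_regex are names invented for the example.

```python
import re


def via_split(text: str) -> str:
    # Keep the first four '>'-separated pieces and join them again.
    return ">".join(text.split(">")[:4])


def via_regex(text: str) -> str:
    # Up to three "piece plus '>'" units, then one more piece without '>'.
    return re.match(r"(?:[^>]*>){0,3}[^>]*", text).group(0)


samples = [
    "afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>",
    "adfd>adafs>afadsf>",  # only 3 '>'s
    "e>>>>>",              # multiple terminal '>'s
    "aaaaaaa",             # no '>'s at all
]

for s in samples:
    print(repr(via_split(s)), repr(via_regex(s)))
```

Both functions stop before the fourth > and return the whole string when there are fewer than four.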

Search and replace in a range of line and column

I want to apply a search-and-replace regular expression that works only within a given range of lines and columns in a text file like this:
AAABBBFFFFBBBAAABBB
AAABBBFFFFBBBAAABBB
GGGBBBFFFFBHHAAABBB
For example, I want to replace BBB with YYY in lines 1 to 2 and columns 4 to 6, to obtain this output:
AAAYYYFFFFBBBAAABBB
AAAYYYFFFFBBBAAABBB
GGGBBBFFFFBHHAAABBB
Is there a way to do it with Vim?
:1,2s/\%4cBBB/YYY/
\%4c means the match must start at the fourth column, which is where the first BBB begins (see :help /\%c or more globally :help pattern)
If this is always the first one you want to replace, simply don't specify /g
:1,2s/BBB/YYY/
would work fine.
Alternatively, if you need to specify exactly which column you want replaced, you can use the \%Nv syntax, where N is the virtual column (the column as displayed, so a tab spans multiple columns; use c instead of v for byte columns)
Replacing the second set of B's on lines 1 and 2 could be done with:
:1,2s/\%11vBBB/YYY/
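For readers who want the same effect in a script, here is a minimal Python sketch of "replace only when the text starts at a given column, on a given range of lines". It treats the search text as a literal string rather than a regex, counts characters rather than Vim's virtual columns, and the function name replace_at_column is an assumption for this example.

```python
from typing import List


def replace_at_column(lines: List[str], first_line: int, last_line: int,
                      col: int, old: str, new: str) -> List[str]:
    """Replace `old` with `new` where it starts exactly at 1-based column
    `col`, on 1-based lines first_line..last_line (inclusive)."""
    start = col - 1
    result = []
    for number, line in enumerate(lines, start=1):
        in_range = first_line <= number <= last_line
        if in_range and line.startswith(old, start):
            line = line[:start] + new + line[start + len(old):]
        result.append(line)
    return result


text = [
    "AAABBBFFFFBBBAAABBB",
    "AAABBBFFFFBBBAAABBB",
    "GGGBBBFFFFBHHAAABBB",
]
print("\n".join(replace_at_column(text, 1, 2, 4, "BBB", "YYY")))
```

Passing column 11 instead of 4 targets the second run of B's on the same lines.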

Find lines matching regex and select a different part of the line

I have two lines like below:
/pace =builtin\administrators Type=0x0 Flags=0x13 AccessMask=0x1f01ff
/pace =domain\user Type=0x0 Flags=0x13 AccessMask=0x1f01ff
I need a regular expression that selects only 0x1f01ff on the line that contains domain\user.
This is what I have created, but it selects /pace =domain\user Type=0x0 Flags=0x13 AccessMask= instead:
^(.+domain(.*)accessmask=)
Try this:
^.+domain\\user.+AccessMask=([^\s]+)
It matches any line that contains domain\user and then captures the value of AccessMask (any run of non-whitespace characters) in the first group.
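A quick way to check the pattern is a Python sketch that runs it over the two sample lines and reads the first capture group. [^\s]+ is written as the equivalent \S+ here; the names PATTERN and masks are assumptions for the example, and the match is case-sensitive as written.

```python
import re

# Same pattern as above: line must contain domain\user, then group 1
# captures the non-whitespace value that follows "AccessMask=".
PATTERN = re.compile(r"^.+domain\\user.+AccessMask=(\S+)")

lines = [
    r"/pace =builtin\administrators Type=0x0 Flags=0x13 AccessMask=0x1f01ff",
    r"/pace =domain\user Type=0x0 Flags=0x13 AccessMask=0x1f01ff",
]

masks = []
for line in lines:
    m = PATTERN.search(line)
    if m:
        # Only the captured value is kept, not the whole matched line.
        masks.append(m.group(1))

print(masks)
```

The whole match still spans the line from its start; it is the capture group that isolates the value.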