How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching functions in Postgres.
First, figure out a pattern that captures everything up to the fourth > character.
To start your pattern, create a sub-group that captures a run of non-> characters followed by one > character:
([^>]*>)
Then repeat that group four times to get to the fourth instance of >:
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and add a start-of-string anchor for good measure, to make sure it only matches from the beginning of the string (not somewhere in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have a pattern that returns what you want in the first group (which you can check in the right-hand panel of the online regex tester), you need to select it back in SQL.
In Postgres, the substring function can take a regex pattern to extract text from the input, using the form substring(string from pattern).
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are less than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
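If you want to sanity-check the two patterns outside the database, here's the same pair of regexes in Python (an illustrative sketch, not part of the original answer; Python's re syntax matches Postgres's here):
import re

samples = ['a>b>c>d>rest', 'only>two>here']

main = re.compile(r'^(([^>]*>){4})')         # everything up to and including the 4th '>'
fallback = re.compile(r'^(([^>]*>){4}|.*)')  # whole string when there are fewer than four '>'s

for s in samples:
    m = main.match(s)
    print(m.group(1) if m else None)         # 'a>b>c>d>' then None
    print(fallback.match(s).group(1))        # 'a>b>c>d>' then 'only>two>here'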
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query does the following (a Python sketch of the same idea follows the list):
splits your string into an array, using '>' as delimiter
selects only the first 4 elements
transforms the array back to a string
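For illustration, here is the same split/slice/join idea in Python (a sketch, not part of the original answer; note that, like the SQL, it does not keep a trailing >):
s = 'afsad>adfsaf>asfasf>afasdX>asdffs>'
# split on '>', keep the first 4 elements, join them back with '>'
print('>'.join(s.split('>')[:4]))  # afsad>adfsaf>asfasf>afasdX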
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string and then choose the Nth element of the result. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
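For comparison, the same one-liner in Python (illustrative only; note that SPLIT_PART counts positions from 1 while Python indexes from 0):
print('aa,bb,cc'.split(',')[1])  # 'bb' - same as SPLIT_PART('aa,bb,cc', ',', 2)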
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500GB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).
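To see why the faster pattern handles all the edge cases, here is a quick Python check against the hand-inserted test strings from earlier (a sketch for verification only):
import re

p = re.compile(r'^(?:[^>]*>){0,3}[^>]*')
tests = [
    'afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>',  # 4 or more '>'s
    'adfd>adafs>afadsf>',                        # only 3 '>'s
    'e>>>>>',                                    # multiple terminal '>'s
    'aaaaaaa',                                   # no '>'s whatsoever
]
for t in tests:
    print(p.match(t).group(0))
# afsad>adfsaf>asfasf>afasdX
# adfd>adafs>afadsf>
# e>>>
# aaaaaaa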
I am hoping to receive some feedback on some code I have written in Python 3 - I am attempting to write a program that reads an input file which has page numbers in it. The page numbers are formatted as: "[13]" (this means you are on page 13). My code right now is:
pattern='\[\d\]'
for line in f:
if pattern in line:
re.sub('\[\d\]',' ')
re.compile(line)
output.write(line.replace('\[\d\]', ''))
I have also tried:
for line in f:
if pattern in line:
re.replace('\[\d\]','')
re.compile(line)
output_file.write(line)
When I run these programs, a blank file is created, rather than a file containing the original text minus the page numbers. Thank you in advance for any advice!
Your if statement won't work because it's not doing a regex match; it's looking for the literal string \[\d\] in line.
for line in f:
    # determine if the pattern is found anywhere in the line
    if re.search(r'\[\d\]', line):
        subbed_line = re.sub(r'\[\d\]', ' ', line)
        output_file.write(subbed_line)
Additionally, you're using re.compile() incorrectly. Its purpose is to pre-compile your pattern into a regex object. This improves performance if you use the pattern a lot, because the expression is compiled once rather than re-evaluated on each loop iteration.
pattern = re.compile(r'\[\d\]')
if pattern.search(line):
# ...
Lastly, note that file objects have no writeline() method - write() is the correct call for writing a single string (it appends at the current position rather than replacing the file). Your output file is blank because the literal in test never matches, so the write call never runs.
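Putting those pieces together, a minimal corrected version of the whole loop might look like this (the file names input.txt and output.txt are placeholders for your own):
import re

pattern = re.compile(r'\[\d\]')  # compile once, reuse on every line

with open('input.txt') as f, open('output.txt', 'w') as output_file:
    for line in f:
        # strip the page-number marker; lines without one pass through unchanged
        output_file.write(pattern.sub('', line))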
You don't write unmodified lines to your output.
Try something like this:
for line in f:
    if re.search(r'\[\d\]', line):
        line = re.sub(r'\[\d\]', '', line)  # remove page number stuff
    output_file.write(line)  # note that it's not part of the if block above
That's why your output file is empty.
I have a csv file where every line but the first starts with a number and looks like this:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
3,blah,blah,blah
2,blah,blah,blah
44,blah,blah,blah
12,blah,blah,blah
14,blah,blah,blah
11,blah,blah,blah
10,blah,blah,blah
11,blah,blah,blah
13,blah,blah,blah
3,blah,blah,blah
...
I would like to delete all lines except the first that start, say, with the numbers 1,6,12.
I was trying something like this:
:g!/^[1 6 12]\|^subject/d
But the 12 is interpreted as "1 or 2", so this also erases the lines that start with 2.
What am I missing, and what should be the most efficient way to do this?
Btw instead of 1, 6, 12, my list contains many multiple single and 2-digit numbers.
The character class [1 6 12] means "any single character that is in this class", i.e. any one of ' ', 1, 2, 6 (the repeated 1 is ignored).
You could use
:g!/^1,\|^6,\|^12,\|^subject/d
which is close to your original syntax - but it works (tested with vim on Mac OS X).
Note - it is important to include the comma, so that the line starting with 1 doesn't "protect" 11, 12345, etc.
You might want to do this differently though - using grep.
Put all the "white listed" numbers in a file, one per line, like so:
^subject
^1,
^2,
^6,
^12,
then do
grep -f whitelist csvFile
and the output will be your "edited" file (which you can pipe to a new file).
If you are even more interested in "efficiency", you could make your text file (let's continue to call it whitelist) just
subject
1
2
6
12
and use the following command:
cat whitelist | xargs -I {} grep "^"{}"," csvFile
This needs a bit of explaining.
xargs - take the input one line at a time
-I {} - and insert that line in the command that follows, at the {}
This means that the grep command will be run n times (once per line in the whitelist file), and each time the regular expression that is fed into grep will be the concatenation of
"^" - start of line
{} - contents of one line of the input file (whitelist)
"," - comma that follows the number
So this is a compact way of writing
grep "^subject," csvFile; grep "^1," csvFile; grep "^2," csvFile;
etc.
It has the advantage that you can now generate your whitelist any way you want: as long as it ends up in a file, one line at a time, you can use it. The disadvantage is that you are essentially running grep n times, so if your files get very large and your whitelist has many items, that may start to be a problem. In practice it is quite fast, since your OS will likely cache the file after the first read-through, and the ^ anchor makes each regular expression very efficient: as soon as it fails to match at the start of a line, grep moves on to the next one.
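If you'd rather script this than chain grep invocations, here is a rough Python equivalent (file names are placeholders; it keeps any line whose first field is on the whitelist, in a single pass over the file):
# keep the header plus the whitelisted subject numbers
whitelist = {'subject', '1', '2', '6', '12'}

with open('csvFile') as src, open('filtered.csv', 'w') as dst:
    for line in src:
        if line.split(',', 1)[0] in whitelist:
            dst.write(line)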
Use a global match:
:v/^\(subject\|1\|6\|12\),/ delete
For every line that does not match that regular expression, delete it.
It yields:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
12,blah,blah,blah
EDIT: Just now I realised that you were already using the global match. Your error was in the character class: it matches any single character inside it, regardless of repetition - in your case the numbers one, two, six, and a space. You must separate them into different branches, as I did above.
a "functional" alternative:
:2,$g/./if index([1,12,6],str2nr(split(getline("."),",")[0]))<0|exec 'normal! dd'|endif
(the 2,$ range skips the header line, which would otherwise be deleted since str2nr("subject") is 0)
I have a data file as follows.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
Using vim, I want to remove the 1's from each line and append them to the end. The resultant file would look like this:
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
I was looking for an elegant way to do this.
Actually I tried it like
:%s/$/,/g
And then
:%s/$/^./g
But I could not make it to work.
EDIT: Well, actually I made one mistake in my question. In the data file, the first character is not always 1; the lines are a mixture of 1, 2 and 3. So, from all the answers to this question, I came up with the solution --
:%s/^\([1-3]\),\(.*\)/\2,\1/g
and it is working now.
Here's a regular expression that doesn't care which number, how many digits, or which separator you've used. That is, it works both for lines that have 1 as their first number and for lines that start with, say, 114:
:%s/\([0-9]*\)\(.\)\(.*\)/\3\2\1/
Explanation:
:%s/.../.../ - substitute on every line (%)
\(<something>\) - capture and store into \1, \2, ...
[0-9]* - a digit, 0 or more times
. - any single char, in this case the comma
.* - any char, 0 or more times
\3\2\1 - replace with the captured groups, reordered
So: cut up "1", ",", <the rest> into \1, \2 and \3 respectively, and reorder them.
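The same capture-and-reorder trick, shown with Python's re.sub for comparison (a sketch, not vim):
import re

line = '1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065'
# \1 = the leading number, \2 = the comma, \3 = the rest, emitted in reverse order
print(re.sub(r'^([0-9]*)(.)(.*)', r'\3\2\1', line))
# 14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1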
This
:%s/^1,//
:%s/$/,1/
could be somewhat simpler to understand.
:%s/^1,\(.*\)/\1,1/
This will do the replacement on each line in the file. The \1 brings back everything captured by the \(.*\).
:%s/1,\(.*$\)/\1,1/gc
You could also solve this one using a macro. First, think about how to delete the 1, from the start of a line and append it to the end:
0 go to the start of the line
df, delete everything to and including the first ,
A,<ESC> append a comma to the end of the line
p paste the thing you deleted with df,
x delete the trailing comma
So, to sum it up, the following will convert a single line:
0df,A,<ESC>px
Now if you'd like to apply this set of modifications to all the lines, you will first need to record them:
qj start recording into the 'j' register
0df,A,<ESC>px convert a single line
j go to the next line
q stop recording
Finally, you can execute the macro anytime you want using #j, or convert your entire file with 99#j (using a higher number than 99 if you have more than 99 lines).
Here's the complete version:
qj0df,A,<ESC>pxjq99#j
This one might be easier to understand than the other solutions if you're not used to regular expressions!
I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract, from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
Returning:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get these to work while still capturing the matches. That is, I not only need to be able to match every delimiter, but also I need to extract part of that match (the timestamp).
Is this possible?
P.S. I realize I can solve this problem by writing a simple state machine or by matching on the delimiters and post-processing, if there's no way to get this to work.
Update: Thanks for the workaround ideas, everyone. I heard from the developer and there's currently no way to do this with the regular expression engine in MATLAB.
MATLAB seems unable to capture characters as a token without removing them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, by noting that the stop time for one block of text is equal to the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
names = regexp(c,p,'names');
[names.stop] = deal(names(2:end).start,[]);
names = names(1:end-1);
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'
If you are doing a lot of parsing and similar work, you might consider using Perl from within MATLAB. It gives you access to Perl's powerful regex engine and might also make many other problems easier to solve.
All you should have to do is to wrap a lookahead around the part of the regex that matches the second timestamp:
'%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?(?=%{4} (?<stop>.*?)\n)'
EDIT: Here it is without named groups:
'%{4} (.*?)\n% Starting (.*?)\n.*?(?=%{4} (.*?)\n)'
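For comparison, regex engines that allow capture groups inside a lookahead handle this directly. Here's a sketch of the same idea in Python, using the sample log from the question:
import re

log = ('%%%% 09-May-2009 04:10:29\n% Starting foo\nthis is stuff\nto ignore\n'
       '%%%% 09-May-2009 04:10:50\n% Starting bar\nmore stuff\nto ignore\n'
       '%%%% 09-May-2009 04:11:29\n')

# the lookahead matches the next delimiter without consuming it,
# so consecutive periods can share their boundary timestamp
p = re.compile(r'%{4} (?P<start>[^\n]*)\n% Starting (?P<name>[^\n]*)\n'
               r'[\s\S]*?(?=%{4} (?P<stop>[^\n]*))')

for m in p.finditer(log):
    print(m.group('start'), '|', m.group('name'), '|', m.group('stop'))
# 09-May-2009 04:10:29 | foo | 09-May-2009 04:10:50
# 09-May-2009 04:10:50 | bar | 09-May-2009 04:11:29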