Regex replace space between words in string

Say we have the strings on the left and we want to replace the space between words with <->:
" Power Lines" => " Power<->Lines"
Going further, can the same regex also strip the leading and trailing spaces, like a trim?
" Power Lines" => "Power<->Lines"
These questions pertain to PostgreSQL's regexp_replace function.

It is easier without a regex:
SELECT replace(trim(both ' ' from ' Power Lines'), ' ', '<->');
+---------------+
| replace       |
|---------------|
| Power<->Lines |
+---------------+
SELECT 1
Time: 0.003s
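If you want to do it with a regex, the syntax is regexp_replace(string text, pattern text, replacement text [, flags text]) (see https://www.postgresql.org/docs/current/static/functions-string.html). For example, regexp_replace(trim(' Power Lines'), '\s+', '<->', 'g') gives the same result; the 'g' flag replaces every match rather than only the first.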

Related

Redshift Translate command to replace characters

I need to translate commas in a column to pipes with spaces on each side in Redshift ('a,b,c' becomes 'a | b | c') using TRANSLATE. Something in this statement is not giving me the desired result and I can't figure out why.
select 'a,b,c' as comma_string, translate(comma_string, ',', ' | ' ) as pipe_string
is yielding 'a b c' with no pipes. I'm having trouble getting the space before and after the pipe, as
select 'a,b,c' as comma_string, translate(comma_string, ',', '|' ) as pipe_string
gives me 'a|b|c'
The REPLACE function works for this. Not sure why TRANSLATE doesn't.
select 'a,b,c' as comma_string, REPLACE(comma_string, ',' ,' | ') as pipe_string
yields the desired result 'a | b | c'
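You need to use REPLACE because TRANSLATE only maps single characters: each character in the second argument is replaced by the character at the same position in the third argument, so ',' maps to the leading space of ' | ' and the remaining ' |' is ignored. From the Redshift documentation: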
TRANSLATE is similar to the REPLACE function and the REGEXP_REPLACE function, except that REPLACE substitutes one entire string with another string and REGEXP_REPLACE lets you search a string for a regular expression pattern, while TRANSLATE makes multiple single-character substitutions.
https://docs.aws.amazon.com/redshift/latest/dg/r_TRANSLATE.html
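To see why the three-character substitute collapses to a single space, here is a minimal Python sketch of positional, per-character mapping. The function name sql_translate is hypothetical, and the sketch only models the pairing of characters by position, not every edge case of the real TRANSLATE function.

```python
def sql_translate(text: str, chars_to_replace: str, chars_to_substitute: str) -> str:
    """Hypothetical helper: map characters one-to-one by position."""
    # zip stops at the shorter argument, so surplus substitute
    # characters (here the trailing " |") are never used.
    mapping = dict(zip(chars_to_replace, chars_to_substitute))
    return "".join(mapping.get(ch, ch) for ch in text)


translated = sql_translate("a,b,c", ",", " | ")  # ',' pairs with the leading space only
replaced = "a,b,c".replace(",", " | ")           # whole-string substitution

print(repr(translated))
print(repr(replaced))
```

The first call reproduces the pipe-less result from the question, while the plain string replacement inserts the full ' | ' separator.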

Match a word in a list of words regex

I want the user to only be able to enter the values in the following regex:
^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$
My problem is that values like PP, AA, and QQ are accepted.
I am not sure how I can prevent that. Thank you.
The site I use to verify the expression: https://regex101.com/
In most regex flavors, square brackets [] denote character classes; that is, a set of individual characters, any one of which can match at a given position.
Because P is included in this character class (along with a quantifier of {2}), PP is matched.
Instead, you seem to want a group with alternatives; for that, you'd use parentheses () (while also eliminating the whitespace, which doesn't appear to have been intentional on your part):
^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT){2}$
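With the {2} quantifier kept, this matches two codes in a row, such as ABBC, ABAB, or NLBC. To accept exactly one code, drop the quantifier: ^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$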
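The difference between the character class and the alternation group can be checked with a short Python sketch. The names CODES, one_code, two_codes, and is_single_code are made up for illustration; the list of codes is the one from the question.

```python
import re

CODES = "AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT"

# Alternation group without a quantifier: exactly one code.
one_code = re.compile(rf"^(?:{CODES})$")
# The same group with {2}: two codes back to back.
two_codes = re.compile(rf"^(?:{CODES}){{2}}$")


def is_single_code(value: str) -> bool:
    """Return True if value is exactly one of the listed codes."""
    return one_code.match(value) is not None


for candidate in ("ON", "PP", "ABBC"):
    print(candidate, is_single_code(candidate), two_codes.match(candidate) is not None)
```

Here "ON" passes only the single-code pattern, "ABBC" passes only the {2} variant, and "PP" passes neither.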

Only output matching regex pattern

I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:
this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row: http://yetanotherdomain.net
this is a row: https://hereisadomain.org | some_text
I'm currently accessing the data in this column this way:
import re

for row in csv_reader:
    the_url = row[3]
    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)
Output from the print statement:
http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text
How do I obtain only the URLs?
http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org
Just change your pattern to:
\b(?:http|ftp)s?://\S+
Instead of matching anything with .*, match only non-whitespace characters with \S+. You might want to add a word boundary before your non-capturing group, too.
Instead of repeating any character at the end
'(?:http|ftp)s?://.*'
                  ^
repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:
'(?:http|ftp)s?://[^ ]*'
                  ^^^^
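As a quick check of the \S+ variant, here is a small self-contained sketch that runs it over the three sample rows from the question. The rows are inlined as a list instead of being read from the CSV file, and the names URL_PATTERN, rows, and urls are chosen only for this example.

```python
import re

URL_PATTERN = re.compile(r"\b(?:http|ftp)s?://\S+")

rows = [
    "this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row: http://yetanotherdomain.net",
    "this is a row: https://hereisadomain.org | some_text",
]

# findall returns every non-overlapping match, so a row with
# two URLs contributes two entries.
urls = [link for text in rows for link in URL_PATTERN.findall(text)]

for link in urls:
    print(link)
```

Because \S+ stops at the first whitespace character, each match ends where the URL ends rather than running to the end of the line.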

Regex to select NOT and operand

I am trying to break a string into an array using a regex in C#.
I have for example the string
{([Field] = '100' OR [LaneDescription] LIKE '%DENTINPALEUW%'
OR [LaneDescription] = 'asdf' OR ([ObjectID] = 1) AND [ITEM_HEIGHT] >=
10 AND [SENDER_COMPANY] NOT LIKE '%DHL%'}
(Generated from Telerik RadFilter)
and I need it broken up so I can pass it to a custom object with the types: open parenthesis, field, comparator, value, close parenthesis.
So far, with the help of http://regexr.com, I have arrived at
\[([^\[\]]*)\]+|[\w'%]+|[()=]
but I need to get '>=' and 'NOT LIKE' as single tokens (and similar operators such as <>, !=, etc.).
You can see my late-night attempts at http://regexr.com/39g6b
Any help would be much appreciated.
(PS: There are no newline characters in the string.)
Try
\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+
Explanation:
\(                 // match "(" literally
|                  // or
\)                 // ")"
|                  // or
\[[a-zA-Z0-9_]+\]  // a word inside square brackets []
|
'.*?'              // strings enclosed in single quotes '' (escape sequences can easily trip this up though)
|
\d+                // digits
|
NOT LIKE           // "NOT LIKE", because this is the only token that can contain whitespace
|
\w+                // words like "NOT", "AND", etc.
|
[=><!]+            // operators like ">", "!=", etc.
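The question is about C#, but the pattern uses only syntax common to most regex engines, so its behavior can be sketched in Python. The sample expression below is a shortened, hand-written variant of the one in the question, and the names TOKEN_PATTERN and tokenize are assumptions made for this example.

```python
import re

# Same alternation as in the answer; order matters, e.g. "NOT LIKE"
# must be tried before the generic \w+ alternative.
TOKEN_PATTERN = re.compile(
    r"\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+"
)


def tokenize(expression: str) -> list:
    """Split a filter expression into tokens, skipping whitespace."""
    return TOKEN_PATTERN.findall(expression)


sample = "([Field] = '100' OR [ITEM_HEIGHT] >= 10 AND [SENDER_COMPANY] NOT LIKE '%DHL%')"
tokens = tokenize(sample)

for token in tokens:
    print(token)
```

The pattern has no capturing groups, so findall returns the full text of each match, with '>=' and 'NOT LIKE' each coming out as a single token.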

Regex: how to determine odd/even number of occurrences of a char preceding a given char?

I would like to replace the | with OR only in unquoted terms, eg:
"this | that" | "the | other" -> "this | that" OR "the | other"
Yes, I could split on space or quote, get an array and iterate through it, and reconstruct the string, but that seems ... inelegant. So perhaps there's a regex way to do this by counting "s preceding | and obviously odd means the | is quoted and even means unquoted. (Note: Processing doesn't start until there is an even number of " if there is at least one ").
It's true that regexes can't count, but they can be used to determine whether there's an odd or even number of something. The trick in this case is to examine the quotation marks after the pipe, not before it.
str = str.replace(/\|(?=(?:(?:[^"]*"){2})*[^"]*$)/g, "OR");
Breaking that down, (?:[^"]*"){2} matches the next pair of quotes if there is one, along with the intervening non-quotes. After you've done that as many times as possible (which might be zero), [^"]*$ consumes any remaining non-quotes until the end of the string.
Of course, this assumes the text is well-formed. It doesn't address the problem of escaped quotes either, but it can if you need it to.
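The same lookahead works unchanged in Python's re module, which makes it easy to try on the examples from this thread. This is only a sketch for well-formed input without escaped quotes; the names UNQUOTED_PIPE and pipes_to_or are made up for the example.

```python
import re

# A pipe counts as unquoted when an even number of quotes
# (possibly zero) remains between it and the end of the string.
UNQUOTED_PIPE = re.compile(r'\|(?=(?:(?:[^"]*"){2})*[^"]*$)')


def pipes_to_or(text: str) -> str:
    """Replace every pipe that sits outside double quotes with OR."""
    return UNQUOTED_PIPE.sub("OR", text)


print(pipes_to_or('"this | that" | "the | other"'))
print(pipes_to_or('"this | that" | "the | other" | yet | another'))
```

A pipe inside a quoted term is followed by an odd number of quotes, so the lookahead fails there and the pipe is left alone.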
Regexes do not count. That's what parsers are for.
You might find the Perl FAQ on this issue relevant.
#!/usr/bin/perl
use strict;
use warnings;
my $x = qq{"this | that" | "the | other"};
print join('" OR "', split /" \| "/, $x), "\n";
You don't need to count, because you don't nest quotes. This will do:
#!/usr/bin/perl
my $str = '" this \" | that" | "the | other" | "still | something | else"';
print "$str\n";
while($str =~ /^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/) {
    $str =~ s/^((?:[^"|\\]*|\\.|"(?:[^\\"]|\\.)*")*)\|/$1OR/;
}
print "$str\n";
Now, let's explain that expression.
^          -- anchors the match at the beginning of the string; otherwise the match
              might start inside a quote and break everything
(...)\|    -- matches a certain pattern followed by a |, which appears escaped here;
              so when you replace it with $1OR, you keep everything but replace the |
(?:...)*   -- a non-capturing group, which can be repeated multiple times; we use a
              group here so we can repeat the alternative patterns multiple times
[^"|\\]*   -- the first pattern: anything that isn't a pipe, an escape character,
              or a quote
\\.        -- the second pattern: an escape character and anything that follows it
"(?:...)*" -- the third pattern: an opening quote, followed by another non-capturing
              group repeated multiple times, followed by a closing quote
[^\\"]     -- the first pattern in the second non-capturing group: anything except
              an escape character or a quote
\\.        -- the second pattern in the second non-capturing group: an escape
              character and whatever follows it
The result is as follows:
" this \" | that" | "the | other" | "still | something | else"
" this \" | that" OR "the | other" OR "still | something | else"
Another approach (similar to Alan M's working answer):
str = str.replace(/(".+?"|\w+)\s*\|\s*/g, '$1 OR ');
The part inside the first group (spaced for readability):
".+?" | \w+
... basically means, something quoted, or a word. The remainder means that it was followed by a "|" wrapped in optional whitespace. The replacement is that first part ("$1" means the first group) followed by " OR ".
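Here is a Python port of that replacement for experimenting, with \1 standing in for JavaScript's $1. The names TERM_THEN_PIPE and join_with_or are hypothetical. One observation from tracing the pattern, not something the answer states: when the last term is quoted and nothing follows it, the scan resumes inside that term and replaces the pipe there as well, so this sketch only behaves as intended when the input ends with an unquoted term.

```python
import re

# A quoted term or a bare word, followed by a pipe with optional whitespace.
TERM_THEN_PIPE = re.compile(r'(".+?"|\w+)\s*\|\s*')


def join_with_or(text: str) -> str:
    """Replace 'term |' with 'term OR ', keeping the term itself."""
    return TERM_THEN_PIPE.sub(r"\1 OR ", text)


ends_unquoted = join_with_or('"this | that" | "the | other" | yet | another')
ends_quoted = join_with_or('"this | that" | "the | other"')

print(ends_unquoted)
print(ends_quoted)  # the pipe inside the final quoted term is also replaced
```

If the last term may be quoted, the lookahead-based version that checks the quotes after each pipe avoids this problem.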
Perhaps you're looking for something like this (it relies on variable-length lookbehind, which .NET supports but many other regex engines do not):
(?<=^([^"]*"[^"]*")+[^"|]*)\|
Thanks, everyone. Apologies for neglecting to mention that this is in JavaScript, that terms don't have to be quoted, and that there can be any number of quoted/unquoted terms, e.g.:
"this | that" | "the | other" | yet | another -> "this | that" OR "the | other" OR yet OR another
Daniel, it seems that's in the ballpark, i.e. basically a matching/massaging loop. Thanks for the detailed explanation. In JS, it looks like a split, a forEach loop over the array of terms, pushing each term (after changing a | term to OR) back into an array, and a re-join.
@Alan M, works nicely; escaping is not necessary due to the sparseness of SQLite FTS capabilities.
@epost, accepted solution for brevity and elegance, thanks. It merely needed to be put in a more general form for Unicode etc.:
(".+?"|[^\"\s]+)\s*\|\s*
My solution in C#: count the quotes, then use a regex to get the matches:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Count the number of quotes.
var quotesOnly = Regex.Replace(searchText, @"[^""]", string.Empty);
var quoteCount = quotesOnly.Length;
if (quoteCount > 0)
{
    // If the quote count is an odd number there's a missing quote.
    // Assume a quote is missing from the end - executive decision.
    if (quoteCount % 2 == 1)
    {
        searchText += @"""";
    }
    // Get the runs of text between quotes. Exclude the quotes themselves.
    // e.g. The following line:
    // "this and that" or then and "this or other"
    // will result in the following strings after trimming:
    // 1. this and that
    // 2. or then and
    // 3. this or other
    var matches = Regex.Matches(searchText, @"([^\""]*)", RegexOptions.Singleline);
    var list = new List<string>();
    foreach (var match in matches.Cast<Match>())
    {
        var value = match.Groups[0].Value.Trim();
        if (!string.IsNullOrEmpty(value))
        {
            list.Add(value);
        }
    }
    // TODO: Do something with the list of strings.
}