Remove parenthesis if length > max length - regex

I have a number of models like
"Dell Inspiron 6 (i7-8550U/8GB/256GB/Radeon 520/FHD/W10)"
"Lenovo V130 (i5-7200U/4GB/128GB/FHD/W10) (2017)"
"Dell XPS 13 Touch (i7-8550U/16GB/512GB/UHD/W10)"
and I want to remove the parenthesized part that lists the specs.
The final output should be something like this
"Dell Inspiron 6"
"Lenovo V130 (2017)"
"Dell XPS 13 Touch"
I managed to get the parenthesis contents with this regex
\((.+?)\)
However, the regex matches both parenthesized groups.
Is there any way to only get the text that is greater than a number of characters?
Here is a regex example.

You may use this regex to match (...) where there are 6+ characters between ( and ):
\s*\(([^)]{6,})\)
Updated RegEx Demo
RegEx Details:
[^)]: matches any character except )
{6,}: Quantifier to match 6 or more characters
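A minimal Python sketch of the pattern above, applied to the model names from the question. The helper name strip_specs is made up for illustration, and the capturing group is dropped because the match is only being removed.

```python
import re

# Optional leading whitespace, then a parenthesised group whose contents
# are 6 or more characters long and contain no ")".
SPEC_GROUP = re.compile(r"\s*\([^)]{6,}\)")

def strip_specs(name: str) -> str:
    """Remove long parenthesised groups; short ones such as (2017) stay."""
    return SPEC_GROUP.sub("", name)

models = [
    "Dell Inspiron 6 (i7-8550U/8GB/256GB/Radeon 520/FHD/W10)",
    "Lenovo V130 (i5-7200U/4GB/128GB/FHD/W10) (2017)",
    "Dell XPS 13 Touch (i7-8550U/16GB/512GB/UHD/W10)",
]
cleaned = [strip_specs(m) for m in models]
for name in cleaned:
    print(name)
```

Because the character class excludes ")", a match can never run from one parenthesised group into the next, which is what keeps (2017) intact.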

Use a regex substitution, and sub in an empty string. For example, in Python, you could do the following:
import re
name = 'Dell Inspiron 6 (i7-8550U/8GB/256GB/Radeon 520/FHD/W10)'
re.sub(r"\s\([^)]{5,}\)", "", name)
Output
'Dell Inspiron 6'
Note that I used {5,}, which matches only parenthetical strings with 5 or more characters, so the years (e.g., (2017)) are left in place.

You can rely on something that always appears inside the first pair of parentheses but never inside the second, such as a slash /, and change your regex from
\((.+?)\)
to
\((.*\/[^)]*?)\)
Then replace the match with an empty string.
Live Demo

Related

Regex for text (and numbers and special characters) between multiple commas [duplicate]

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
Per your answer to my question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence does not consume, the delimiting commas.
This regexp would match " and hence does not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas was the reason why your attempted regexp only matched every other candidate.
Also, if the whole input is a single string, you will want to stop matches from spanning line breaks. In that case, use:
(?<=,)[^,\n]+(?=,)
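To make the "consumed commas" point concrete, here is a small Python sketch comparing a comma-consuming pattern with the lookaround version. The sample string is invented, and the consuming pattern is the question's first attempt with its group made non-capturing.

```python
import re

sample = ",a,b,c,d,"

# Consuming version: each match swallows its closing comma, so the next
# field has no opening comma left to start from.
consuming = [m.group() for m in re.finditer(r"[,-](?:\w|\s)+[,-]", sample)]

# Lookaround version: the commas are only checked, never consumed.
between = re.findall(r"(?<=,)[^,\n]+(?=,)", sample)

print(consuming)  # [',a,', ',c,']
print(between)    # ['a', 'b', 'c', 'd']
```

Note that a first field with no comma before it (or a last field with no comma after it) is not matched by the lookaround pattern.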
http://www.phpliveregex.com/p/1DJ
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that keywords will be on separate lines to the rest of the input and separated by commas you can match the lines with keywords in like:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8
Output
// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^; I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (Using a character class so that you can add to it if you need to....)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group 3
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines only containing key words you can count the number of commas and thus get the number of keywords
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
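The same idea can be sketched in Python. This is an illustration rather than a port of the PHP above: it uses the [\w.] character class so that a trailing "..." is accepted, and it counts keywords by splitting instead of counting commas, which avoids the off-by-one issue with trailing commas.

```python
import re

# A "keyword line" starts at the beginning of a line and consists only of
# word/dot runs, each followed by a comma (optionally a space) or end of line.
KEYWORD_LINE = re.compile(r"^(?:[\w.]+(?:, ?|$))+", re.MULTILINE)

text = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8..."
)

keyword_lines = KEYWORD_LINE.findall(text)
keywords = [
    part.strip()
    for line in keyword_lines
    for part in line.split(",")
    if part.strip()  # drop the empty piece after a trailing comma
]
print(len(keywords))  # 8
```

One caveat: a line of ordinary text that happens to be a single word would also count as a keyword line, so the result is only a signal for flagging, not proof of spam.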
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword, though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
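For comparison, the split-and-trim approach above looks roughly like this in Python. The sample string is invented for illustration.

```python
keyword_string = "keyword1, keyword2 ,keyword3,  Iron Man"

# Equivalent of array_map('trim', explode(',', ...)) in the PHP above.
keywords = [k.strip() for k in keyword_string.split(",")]

# Keywords containing whitespace, in case those should be reviewed.
with_spaces = [k for k in keywords if " " in k]

print(len(keywords))  # 4
print(with_spaces)    # ['Iron Man']
```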
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. It is not a duplicate, however, and none of the answers in this post answered my question of how to match on the strings between the commas, so see below for how to take this a step further.
How to Match on single digit values in a CSV String
For example if the task is to search the string within the commas for a single 7, 8 or a single 9 but not match on combinations such as 17 or 77 or 78 but only the single 7s, 8s, or 9s see below...
The answer is to use lookarounds and place your search pattern between them:
(?<=^|,)[789](?=,|$)
See live demo.
The above pattern is more concise; however, I've pasted below the two patterns provided as solutions to this question of matching on strings within the commas:
(?<=^|,)[789](?=,|$) Provided by #Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by #Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
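A short Python sketch of the negative-lookaround form. Python's re module only accepts fixed-width lookbehinds, so (?<=,|^) is rejected there and the (?<![^,]) variant is the one that works. The sample strings are made up.

```python
import re

# "Not preceded by a non-comma" is true at the start of the string and
# right after a comma; the lookahead mirrors this for the end.
FIELD = re.compile(r"(?<![^,])[^,]*(?![^,])")
SINGLE_789 = re.compile(r"(?<![^,])[789](?![^,])")

fields = FIELD.findall("a,b,,c")
singles = SINGLE_789.findall("7,17,8,77,9,78")

print(fields)   # ['a', 'b', '', 'c']
print(singles)  # ['7', '8', '9']
```

The empty string in the first result is the empty field between the two adjacent commas; switch [^,]* to [^,]+ if empty fields should be skipped.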
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
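<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
// Trim once, so the offset found below refers to the string we cut from
$string = trim($string);
$lastEOL = strrpos($string, PHP_EOL);
// Skip the line break itself; without one, the whole text is the keyword line
$keywordLine = ($lastEOL === false) ? $string : substr($string, $lastEOL + strlen(PHP_EOL));
$keywords = array_map('trim', explode(',', $keywordLine));
echo "Number of keywords: " . count($keywords);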
I know it is not a regex, but I hope it helps nevertheless.
The only way to find a solution is to find something that separates the random text from the keywords and that does not occur within the keywords. If a new line can occur within the keywords, you cannot use it as the separator. But what about 2 consecutive new lines? Or some other sequence of characters?
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$string = trim($string);
$separator = PHP_EOL . PHP_EOL; // 2 end of lines after random text
$lastEOL = strrpos($string, $separator);
$keywordLine = ($lastEOL === false) ? $string : substr($string, $lastEOL + strlen($separator));
$keywords = array_map('trim', explode(',', $keywordLine));
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)
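The "take everything after the last separator" idea can be sketched in Python as follows. The function name keywords_after_last is hypothetical, and the sketch assumes "\n" line endings.

```python
def keywords_after_last(text, separator="\n"):
    """Split off the part after the last separator and count its keywords."""
    # rsplit with maxsplit=1 cuts at the last occurrence only; if the
    # separator is absent, the whole text is treated as the keyword part.
    tail = text.strip().rsplit(separator, 1)[-1]
    return [k.strip() for k in tail.split(",") if k.strip()]

single = (
    " some gibberish, some more gibberish, and random text\n"
    "keyword1, keyword2, keyword3\n"
)
multi = (
    " some gibberish, some more gibberish, and random text\n"
    "\n"
    "keyword1, keyword2, keyword3,\n"
    "keyword4, keyword5, keyword6,\n"
    "keyword7, keyword8, keyword9\n"
)

print(len(keywords_after_last(single)))        # 3
print(len(keywords_after_last(multi, "\n\n")))  # 9
```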

Substitute every character after a certain position with a regex

I'm trying to use a regex to replace each character after a given position (say, 3) with a placeholder character, for an arbitrary-length string (the output length should be the same as that of the input). I think a lookahead (lookbehind?) can do it, but I can't get it to work.
What I have right now is:
regex: /.(?=.{0,2}$)/
input string: 'hello there'
replace string: '_'
current output: 'hello th___' (last 3 substituted)
The output I'm looking for would be 'hel________' (everything but the first 3 substituted).
I'm doing this in Typescript, to replace some old javascript that is using ugly split/concatenate logic. However, I know how to make the regex calls, so the answer should be pretty language agnostic.
If you know the string is longer than given position n, the start-part can be optionally captured
(^.{3})?.
and replaced with e.g. $1_ (capture of first group and _). Won't work if string length is <= n.
See this demo at regex101
Another option is to use a lookbehind, where supported, to check whether the character is preceded by n characters.
(?<=.{3}).
See other demo at regex101 (replace just with underscore) - String length does not matter here.
In PHP/PCRE, the start part could simply be skipped like this: ^.{1,3}(*SKIP)(*F)|.
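As a rough Python illustration of the lookbehind approach, next to a plain slicing version for comparison. The function names are invented for this sketch, and the question itself is about TypeScript, where the same lookbehind pattern needs an engine that supports lookbehind.

```python
import re

def mask_after(text, keep=3, placeholder="_"):
    """Replace every character after the first `keep` with `placeholder`."""
    # (?<=.{n}) succeeds only where at least n characters precede the match.
    pattern = rf"(?<=.{{{keep}}})."
    return re.sub(pattern, lambda _m: placeholder, text)

def mask_after_plain(text, keep=3, placeholder="_"):
    """Same result without a regex, for strings without line breaks."""
    return text[:keep] + placeholder * max(len(text) - keep, 0)

print(mask_after("hello there"))        # hel________
print(mask_after_plain("hello there"))  # hel________
```

Strings of length 3 or less come back unchanged from both versions. They differ only for multi-line input, because "." does not match a line break by default.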

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, the replacement text is not limited to an empty string;
substitutions such as $1 are possible too.
I'm using regex in 'Kustom' apps (including KLCK, KLWP, KWGT).
I don't know which engine they use because there is no information about it.
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex is Java based since all non-matching groups are replaced with null:
So, the only possible solution is to use a second pass to post-process the results, just replace null with some kind of a delimiter, a newline for example.
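The (\d{4,7})|. replacement can be tried out in Python as a sanity check. This says nothing about how the Kustom engine behaves. In Python an unmatched group is None, so the sketch maps it to an empty string explicitly; the function name is hypothetical.

```python
import re

KEEP_DIGIT_RUNS = re.compile(r"(\d{4,7})|.")

def keep_digit_runs(text):
    """Keep runs of 4 to 7 digits and drop every other character."""
    # Group 1 is None when the "." branch matched; emit "" in that case.
    return KEEP_DIGIT_RUNS.sub(lambda m: m.group(1) or "", text)

print(keep_digit_runs("G-A15239L"))                  # 15239
print(keep_digit_runs("now200316stillcovid19asdf"))  # 200316
```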
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Regex function to find all and only 6 digit numeric string ignoring spaces if any in between [duplicate]

I have an HTML source page as a text file.
I need to read the file and find only those numeric strings which have 6 continuous digits, possibly with a space in between those 6 digits.
E.g.
209 016 - should come up in the search results as 209016 (space removed)
209016 - should also come up in the search, unaltered, as 209016
Any numeric string longer or shorter than 6 digits should not come up in the search, e.g. 20901677, 209016#223, 29016.
I think this can be achieved by regex, but I was not able to do it.
A solution in regex is more desirable, but anything else is also welcome.
To match 6 digits with any number of spaces in between, you may use the following pattern:
\b(?:\d[ ]*?){6}\b
Or, if you want to reject it when it's followed by a #, you may use:
\b(?:\d[ ]*?){6}\b(?!#)
Regex demo.
Then, you can use the replace method to remove the space characters.
Python example:
import re
regex = r"\b(?:\d[ ]*?){6}\b(?!#)"
test_str = ("209016 \n"
            "209 016\n"
            "20901677','209016#223', '29016")
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group().replace(" ", ""))
Output:
209016
209016
Try it online.
You can try the following regex:
\b(?<!#)\d(?:\s*\d){5}\b(?!#)
demo: https://regex101.com/r/ZCcDmF/2/
But note that you might have to modify your boundaries if you need to exclude more than the #. It will become something like:
\b(?<!#|other char I need to exclude|another one|...)\d(?:\s*\d){5}\b(?!#|other char I need to exclude|another one|...)
where you have to replace other char I need to exclude, another one,... by the characters.
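One way to parameterise the excluded characters is a character class built from a string, sketched below in Python. The function name six_digit_codes and the excluded parameter are assumptions for illustration, and the sketch allows only plain spaces between digits (not \s) so that a match cannot span a line break.

```python
import re

def six_digit_codes(text, excluded="#"):
    """Find 6-digit strings (spaces allowed between digits), spaces removed.

    `excluded` lists characters that must not directly precede or follow
    a match; it is assumed to be non-empty.
    """
    cls = re.escape(excluded)
    pattern = rf"\b(?<![{cls}])\d(?:[ ]*\d){{5}}\b(?![{cls}])"
    return [m.group().replace(" ", "") for m in re.finditer(pattern, text)]

test_str = "209016 \n209 016\n20901677','209016#223', '29016"
print(six_digit_codes(test_str))  # ['209016', '209016']
```

A character class keeps the lookbehind a single character wide, which matters in engines such as Python's that require fixed-width lookbehinds.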

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match, because a regex match is always a single contiguous chunk.
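Both options from this answer, sketched in Python rather than PowerShell so they can be run anywhere. The regexes are the ones given above; only the variable names are invented.

```python
import re

name = "my_file_name_01012013_111546.xls"

# Option 1: capture the name and the extension, then join the two groups.
m = re.match(r"^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$", name)
joined = m.group(1) + m.group(2) if m else name

# Option 2: delete the date/time stamp and keep everything else.
stripped = re.sub(r"_[0-9]{8}_[0-9]{6}", "", name)

print(joined)    # my_file_name.xls
print(stripped)  # my_file_name.xls
```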
Try this (not tested), but it should work for any 'my_file_name' length, any length of digits and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by an underscore, 8 characters, another underscore and 6 more characters (16 characters in total, with the underscores in those exact positions).
Once you've changed it, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*). Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
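The zero-width behaviour is easy to reproduce. Here is a small Python sketch using the two patterns from the question; Python's lookahead behaves the same way as .NET's in this case.

```python
import re

name = "my_file_name_01012013_111546.xls"

# The lookahead only checks what follows; the match stops before it.
without_tail = re.match(r".*(?=_.{8}_.{6})", name).group()

# .{3} resumes exactly where .* stopped, so it picks up "_01".
with_tail = re.match(r".*(?=_.{8}_.{6}).{3}", name).group()

print(without_tail)  # my_file_name
print(with_tail)     # my_file_name_01
```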