I have a CSV that basically has rows that look like:
06444|WidgetAdapter 6444|Description:
Here is a description.
Maybe some more.
|0
The text in the third field is always different and varying, and I'm trying to replace all newlines within it only with <br>, so it ends up as
06444|WidgetAdapter 6444|Description: <br>Here is a description.<br>Maybe some more.<br>|0
edit:
I basically need to get rid of all linebreaks so each line is a proper VALUE|VALUE|VALUE|VALUE. Normalize/beautify/clean it.
None of my tools can import this properly, phpMyAdmin chokes, etc.
There are linebreaks within the field, there are doublequotes that are not escaped, etc.
Example other field:
08681|Book 08681|"Testimonial" - Person
You should buy this.|
Example of another field:
39338|Itemizer||
If you know you have 4 columns, you can easily parse your data. For example, here's a PHP line that results in an array with all data. Each line in the array is another array with all capturing groups: [0] has the whole match, and [1]-[4] with each column:
$pattern = '/^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$/m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);
The pattern is extremely simple: it takes 4 values (not pipe signs), separated by 3 pipes. Once you have the data, you can easily rebuild it the way you want, for example by using nl2br.
Note that you cannot reliably parse the data if the first and last columns can also containg new lines.
Working example: http://ideone.com/gG0K3
If needed, it is possible to target these newlines using a regular expression. The idea is to find only newlines that are followed by one extra value, and then only whole lines. We can check the number of values after the current newline is 1 modulo 4, so we know we're at the 3rd column:
(?:\r\n?|\n)(?=[^|]*\|[^\n\r|]*\s*(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)*\Z)
Or, with (some) explanations:
(?:\r\n?|\n) # Match a newline
(?= # that is before...
[^|]*\|[^\n\r|]*\s* # one more separator and value
(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)* # and some lines with 4 values.
\Z # until the end of the string.
)
I couldn't get it to work on Notepad++ (it didn't even match [\r\n]), but it seems to work well on other engines:
Rubular (Ruby): http://rubular.com/r/NsbTNg9vCT
RegExr (Action Script): http://regexr.com?2u1iu
Regex Hero (.Net): http://regexhero.net/tester/?id=215ac2bb-811b-48dd-8c00-6dcfadfae2f2
Related
I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
Pr your answer to my question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence do not consume, the delimiting commas.
This regexp would match " and hence do not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas was the reason why your attempted regexp only matched every other candidate.
Also if the whole input is a single string you will want to prevent linebreaks. In that case you will want to use;
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that keywords will be on separate lines to the rest of the input and separated by commas you can match the lines with keywords in like:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8
Output
// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8"
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
# => Starting delimiter
(?:^) => Matches start of line in a non-capturing group (you could just use ^ I was using |\n originally and didn't update)
( => Start a capturing group
(?: => Start a non-capturing group
(?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (Using a character class so that you can add to it if you need to....)
(?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
)+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
) => Close the capture group 3
# => Ending delimiter
m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines only containing key words you can count the number of commas and thus get the number of keywords
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in are spam, then you can iterate of the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO Post was marked as a duplicate to my posted question however since it is NOT a duplicate and there were no answers in THIS SO Post that answered my question on how to also match on the strings between the commas see below on how to take this a step further.
How to Match on single digit values in a CSV String
For example if the task is to search the string within the commas for a single 7, 8 or a single 9 but not match on combinations such as 17 or 77 or 78 but only the single 7s, 8s, or 9s see below...
The answer is to Use look arounds and place your search pattern within the look arounds:
(?<=^|,)[789](?=,|$)
See live demo.
The above Pattern is more concise however I've pasted below the Two Patterns provided as solutions to THIS this question of matching on Strings within the commas and they are:
(?<=^|,)[789](?=,|$) Provided by #Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by #Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
$lastEOL = strrpos(trim($string), PHP_EOL);
$keywordLine = substr($string, $lastEOL);
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
I know it is not a regex, but I hope it helps nevertheless.
The only way to find a solution, is to find something that separates the random text and the keywords that is not present in the keywords. If a new line is present in the keywords, you can not use it. But are 2 consecutive new lines? Or any other characters.
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$lastEOL = strrpos(trim($string), PHP_EOL . PHP_EOL); // 2 end of lines after random text
$keywordLine = substr($string, $lastEOL);
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)
I have some sample data (simplified extract below - the real file contains 52,000 lines, with pairs of lines, the 2nd line of each pair is always a date field, and there are always 2 blank lines between each data pair):
The colour of money 20170233434
10-DEC-2015
SOME TEST DATA 32423412123
19-OCT-2015
I want to join each line up, using a Regular Expression (I am using TextPad, but I think the RegEx syntax is generic).
I am doing a replace search, and want to end up with this:
The colour of money 20170233434 10-DEC-2015
SOME TEST DATA 32423412123 19-OCT-2015
I am using this in the "Find what" field:
\n^[0|1|2|3|4|5|6|7|8|9]
And replacing with NULL.
The end result I am getting is almost there:
The colour of money 20170233434 0-DEC-2015
SOME TEST DATA 32423412123 9-OCT-2015
But not quite, because the first digit of the date values are being stripped out.
How would I modify the RegEx to not delete the first number of the 2nd line? I tried to replace with [0|1|2|3|4|5|6|7|8|9] but that just put that entire string in front of each date field, and still stripped out the first number of the date.
Just search for this
\r?\n(\d{1,2}\-)
And replace it with $1. See the live example here.
If you want to replace it with null, you can also use a lookahead:
\r?\n(?=\d{1,2}\-)
And replace it with null. See the live example here.
Those regular expressions only match for a newline character (in UNIX \n or Windows \r\n) followed by 1 or 2 characters of a number and finally followed by a dash. If you want to be more specific, you could also use this regular expression:
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4})
Or with a lookahead respectively:
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4})
You could even check for the double linebreaks after the statement (live example):
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
Or with a lookahead respectively (live example):
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
I'm going to post a simplified version of my problem so please let me know if you would like more detail.
I have a flat text file containing text similar to a log file. It contains what should be 541 lines beginning with a 9-digit number and followed by various bits of data. Some of this data is XML, much of which contained extra new line characters that caused these lines to get split up and now has the file at ~30k lines. I want to concatenate this file back down to 541 lines, essentially consolidating any lines that don't start with the 9-digit number onto the previous line.
First off, the 9-digit numbers begin with '11' so I've run a match on 11\d{7} and gotten exactly my 541 matches (ie, there are no matching numbers in my file that might match incorrectly). I've also been able to match all lines that don't begin with this number with ^(?!11\d{7})(.|\n)*$. I would like to concatenate all those lines together, as well as tack that onto the line before (which is the line that begins with 11\d{7}). My searches online have only found solutions with finite and/or consistent numbers of lines to concatenate but this XML varies in length and structure. Finally, there is XML in this file that is not split across lines so matching and consolidating all XML indiscriminately isn't really an option either. Suggestions are greatly appreciated. Here's an example to illustrate what I'm trying to do:
Before:
117337909,some text,42930842,misc data,<xmlRoot>
<parent1>
<foo>data</foo>
<bar>123</bar>
</parent1>
</xmlRoot>
116425348,some more text,2df34as,blah,<xmlRoot>
<parent2>
<foo>data</foo>
<bar>123</bar>
</parent2>
</xmlRoot>
After:
117337909,some text,42930842,misc data,<xmlRoot><parent1><foo>data</foo><bar>123</bar></parent1></xmlRoot>
116425348,some more text,2df34as,blah,<xmlRoot><parent2><foo>data</foo><bar>123</bar></parent2></xmlRoot>
You can use this:
String result = yourstring.replaceAll("\\r?\\n(?!11\\d{7}(?!\\d))", "");
Pattern details:
\\r? # optional carriage return (for windows format)
\\n # line feed
(?! # open a negative lookahead (ie: not followed by)
11\\d{7} # 11, seven digits
(?!\\d) # not followed by another digit (to ensure that there isn't more
# digits after, "1123456789" will not match)
) # close the lookahead
Just search for this regex:
(?s)[\r\n]*(?!11\d{7})
and replace with an empty string, i.e., "".
In Notepad++, you can probably do this with the Extended Search. Type the text inside the quotation marks, not the quotation marks.
Find: "\r\n" Replace: " " [a space]
Find: " (11\d\d\d\d\d\d\d)" Replace: "\n\1"
This won't work if you have other 9 digit numbers starting with 11, but if you don't this might be easier than doing it in regex. Notepad++ is not very good at regexes spanning new lines from what I've read.
I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters. The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}' ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
try this ( not tested), but it should works for any 'my_file_name' lenght , any lenght of digit and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the maximum match that has 14 characters after it (you can change to (?=.{14}) who is the same).
Once you've changed it, it returns the maximum match that has 14 characters after it + the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages
get the initial part
get the extension
The reason you get my_filename_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is is followed by something matching _.{8}_.{6}, however once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which will take the text before : But i want the one after email till :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above and get hte database ID (in ex: 1234, 4567 2 separate ID's) and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
you can fudge the stuff you don't care about to make the expression easier, say just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
there's only two captured groups, the (1234), and the (1234, 4657), the second one which I can only assume from your pattern to mean: "a digit string, followed by zero or more comma separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return undef }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*(?([0-9]|,)+):.+/;
$numbers = $1;
Not tested but you get the idea.