I need to fix a text file that stores about 200,000 records.
Every record should consist of 8 columns delimited by commas.
However, some of the data is corrupted and contains an extra comma. I need to get rid of the extra comma, which is most likely to occur in the 3rd column.
5180,1103131373,Good Day,ABC,12,34,75484,7/1/2014 12:00:00 AM, <---Correct format
5180,1103131373,Good, Day,ABC,12,34,75484,7/1/2014 12:00:00 AM, <-- Incorrect
i.e. in this example Good Day should be stored in one column instead of two.
I can locate the bad records with the regular expression (.*,.*,.*),(.*,.*,.*,.*,.*,.*,)
but when I try to remove the extra comma by replacing with \1\2, some records go missing.
Any input is welcome. Thanks in advance.
You should replace .* with [^,\r\n]* and add a ^ at the pattern start.
Use
^([^,\r\n]*,[^,\r\n]*,[^,\r\n]*),([^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,)
and replace with \1\2.
The [^,\r\n] negated character class matches any character except a comma, CR, or LF. \1 is a backreference to the value in Group 1 (([^,\r\n]*,[^,\r\n]*,[^,\r\n]*)) and \2 is a backreference to the value in Group 2 (([^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,[^,\r\n]*,)).
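For anyone who wants to try this outside an editor, here is a minimal sketch of the same replacement in Python. It assumes the file is small enough to process as one string and that the stray comma is always the third one on the line; the names FIELD, PATTERN and drop_extra_comma are made up for the illustration.

```python
import re

# One field: any run of characters that are not a comma, CR, or LF.
FIELD = r"[^,\r\n]*"

# Group 1: the first two fields plus the first half of the third field.
# Then the stray comma, then group 2: six more comma-terminated fields.
PATTERN = re.compile(
    rf"^({FIELD},{FIELD},{FIELD}),({FIELD},{FIELD},{FIELD},{FIELD},{FIELD},{FIELD},)",
    re.MULTILINE,  # let ^ anchor at the start of every line
)

def drop_extra_comma(text: str) -> str:
    """Remove the third comma from any line that has nine commas."""
    return PATTERN.sub(r"\1\2", text)

good = "5180,1103131373,Good Day,ABC,12,34,75484,7/1/2014 12:00:00 AM,\n"
bad = "5180,1103131373,Good, Day,ABC,12,34,75484,7/1/2014 12:00:00 AM,\n"

fixed = drop_extra_comma(good + bad)
print(fixed)
```

Since a field can no longer contain a comma or a line break, a correct 8-column line cannot satisfy the nine-comma pattern and is left untouched.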
I have a LOB containing the rows of a CSV, in which each column is delimited by a semicolon.
Some of those columns are strings, delimited by pipes, which may contain a semicolon. I must replace that semicolon with a comma, but only inside the pipe-delimited string columns; otherwise the column order will be destroyed.
Example of a row:
1;4;|1.Simple response|;|once upon a time; I used to...|;|my favorite
character is ; I really love it.|
Response example:
1;4;|1.Simple response|;|once upon a time, I used to...|;|my favorite
character is , I really love it.|
This is the regex I wrote:
(\|)(.*?)(\|[\n\;])
What I need is to replace that .*? with [;]+, but when I try, nothing is matched.
I don't understand how to match with a regex inside an already captured group.
Any advice?
Thanks
It appears that the third part of the pattern (second pipe |) won't match because it must always be followed by a newline (\n) or semicolon (;), which is not the case with your input.
Did you mean something like this:
\|;([\|\n;]) //would allow newline OR semicolon OR second pipe
https://regex101.com/r/S9MrWe/1
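If the replacement can be done in code rather than with a single pattern, another option is to match each whole pipe-delimited string and rewrite the semicolons inside it with a callback. This is only a sketch in Python, and it assumes that a string column never contains a literal pipe. It uses a comma as the replacement, following the expected output in the question; QUOTED and semicolons_to_commas are names invented for the example.

```python
import re

# A pipe-delimited string: "|", any run of non-pipe characters, then "|".
# [^|] also matches a newline, so a string that wraps onto the next line
# is still handled as a single match.
QUOTED = re.compile(r"\|[^|]*\|")

def semicolons_to_commas(row: str) -> str:
    """Replace ';' with ',' only inside |...| string columns."""
    return QUOTED.sub(lambda m: m.group(0).replace(";", ","), row)

row = (
    "1;4;|1.Simple response|;|once upon a time; I used to...|;|my favorite\n"
    "character is ; I really love it.|"
)
result = semicolons_to_commas(row)
print(result)
```

The semicolons between columns fall outside every match, so they are never touched.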
A question was asked earlier for the given dataset.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
The input regex provided in that thread does not work at all, so I created two input regexes and tested the first one at http://www.regexplanet.com/advanced/java/index.html. The groups are perfect, but when I try it in Hive, it loads only NULL values.
The first input regex I provided is
([^ ]*)\t+([^ ]*)\t+([^ ]*)\t+([^ ]*)
My second input regex is
^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$
I thought it would work, but it also loads only NULL values.
Could you please let me know what's wrong with these two input regexes?
Your first pattern does not match the entire string, and field matching parts are [^ ]*, that is, any 0+ chars other than a space, so the last field cannot be matched (it contains spaces).
The second regex also contains \S+ patterns matching 1 or more chars other than whitespace, and the last one does not match the last field.
You may use
^(\S+)\t+(\S+)\t+(\S+)\t+(.+)
^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)
The [^\t]* matches any field in a tab-delimited text since it matches zero or more chars other than a tab.
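A quick way to check the difference outside Hive is to run both patterns against one sample line. The sketch below uses Python and assumes the fields in the sample really are separated by tab characters; it says nothing about how Hive itself escapes or applies the pattern.

```python
import re

# Suggested pattern: three tab-free fields, then everything that is left.
PATTERN = re.compile(r"^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)")

# The asker's second pattern, for comparison: every field must be
# whitespace-free, including the last one.
STRICT = re.compile(r"^(\S+)\t+(\S+)\t+(\S+)\t+(\S+)$")

# Assumption: tabs between the four fields, spaces only inside the last one.
line = "03-24-2014\tfm506\tTOTAL-PROCESS\tOK;HARD;1;PROCS OK: 717 processes"

fields = PATTERN.match(line).groups()
strict_result = STRICT.match(line)

print(fields)
print(strict_result)
```

The strict pattern finds no match because the last field contains spaces, which \S+ cannot consume before the $ anchor.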
I have this regular expression:
\..*?\.
But it only selects between two periods, not every punctuation mark, and it also selects across multiple lines.
Would modifying this expression to only take in one line at a time work somehow, if there's also a way to group punctuation into where we have a period?
Just to make things simpler, at this time I only need the expression to recognize periods, exclamation points, and question marks. I don't need it to register commas.
Thanks to Nathan and Agumander below, I know to substitute [.!?] in place of \. now, but I'm still having trouble with the other half of my question.
Just to make sure I'm being more clear, using [.!?].*?[.!?]\s will highlight text between punctuation marks, but across multiple lines. So I can't use it to bookmark only the lines that have multiple punctuation marks.
Placing characters inside a pair of square brackets will match to any of the enclosed characters. In your case you'd want [.?!]
If you want to match any sentence that has two of these, then you'll be looking for a pair of [.!?] separated by zero or more of any character.
The regex that matches strings with more than one of the set [.?!] would then be [.!?].*[.!?]
To make . match newlines, you'd add the s modifier to your regex.
...so the full regex would be /[.!?].*[.!?]/s
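To see what the s modifier changes, here is a small Python comparison, where the equivalent flag is re.DOTALL. The two-line sample text is made up for the illustration.

```python
import re

# Hypothetical sample: one punctuation mark on each of two lines.
text = "First line.\nSecond line!"

# Without DOTALL, "." stops at the newline, so both marks must share a line.
same_line_only = re.search(r"[.!?].*[.!?]", text)

# With DOTALL (the "s" modifier), "." also matches the newline.
across_lines = re.search(r"[.!?].*[.!?]", text, re.DOTALL)

print(same_line_only)
print(across_lines.group(0))
```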
Ok I figured it out. Thanks to Agumander and Nathan above I substituted [.!?] in for the two \. in my original regex:
\..*?\. became [.!?].*[.!?]
Putting \s at the end of the regex made it highlight the entire document in Notepad++.
The last issue I had was remembering to turn off "matches newline."
Agumander, I think you're asking for a regex that basically finds multiple punctuation marks on a single line. So here's one way to do it.
Here's the text I'm going to match. The regex will match the first line in its entirety, but will not match the second.
Here's a line with multiple punctuation. The entire line will match the regex!
This line does not have multiple punctuation.
Regex
^.*(?:[\.?!].*){2,}$
Explanation
^ -- Start matching at the beginning of a line
.* -- match any character 0 or more times
(?: -- start a new non-capturing group
[.?!] -- find a character matching a period, question mark, or exclamation point.
.* -- match any character 0 or more times
)
{2,} -- repeat the previous group 2 or more times. This is how we ensure there's at least two punctuation marks before considering it a match.
$ -- end of line anchor, basically stop matching at the end of a line
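The same pattern can be tried in Python as a rough check. This sketch assumes multiline mode, so that ^ and $ anchor at every line, and leaves "dot matches newline" off, which corresponds to the Notepad++ setting mentioned earlier.

```python
import re

# re.MULTILINE makes ^ and $ match at the start and end of each line.
# DOTALL is deliberately not set, so "." never crosses a line break.
PATTERN = re.compile(r"^.*(?:[.?!].*){2,}$", re.MULTILINE)

text = (
    "Here's a line with multiple punctuation. The entire line will match the regex!\n"
    "This line does not have multiple punctuation.\n"
)

# No capturing groups, so findall returns the full text of each match.
matches = PATTERN.findall(text)
print(matches)
```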
I have a csv file with all UK areas (43000 rows).
However, even though the fields are separated with commas, they are not enclosed with anything, hence if the field has commas within its contents, import to a database fails.
Fortunately, there is only one field that has commas within its content.
I need a regular expression that I could use to select this field on all rows.
Here is an example of data:
Aberaman,Rhondda, Cynon, Taf (Rhondda, Cynon, Taff),51.69N,03.43W,SO0101
Aberangell,Powys,52.67N,03.71W,SH8410
This should look like:
Aberaman,"Rhondda, Cynon, Taf (Rhondda, Cynon, Taff)",51.69N,03.43W,SO0101
Aberangell,"Powys",52.67N,03.71W,SH8410
So I need to basically select the second field, which is between the first comma and the comma just before the first number.
I will use Sublime Text 2 to perform this regex search.
Sublime Text 2 supports \K.
Regex:
^[^,]*,\K(.*?)(?=,\d)
Replacement string:
"\1"
Explanation:
^ Asserts that we are at the start of a line.
[^,]* Matches any character other than a comma, zero or more times.
, Literal comma.
\K Discards the characters matched so far from the final match.
(.*?)(?=,\d) Matches any character zero or more times, and must be followed by , and a digit. The ? after * makes the match reluctant.
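Python's built-in re module does not support \K, so a sketch of the same idea there has to capture the first field and put it back in the replacement. The version below assumes, as the question does, that the first ",<digit>" on a line marks the end of the field to quote; quote_second_field is a name invented for the example.

```python
import re

# Group 1: the first field and its comma (the part \K would discard).
# Group 2: the field to quote, ending just before the first ",<digit>".
PATTERN = re.compile(r"^([^,]*,)(.*?)(?=,\d)", re.MULTILINE)

def quote_second_field(text: str) -> str:
    """Wrap the second field of every matching line in double quotes."""
    return PATTERN.sub(r'\1"\2"', text)

data = (
    "Aberaman,Rhondda, Cynon, Taf (Rhondda, Cynon, Taff),51.69N,03.43W,SO0101\n"
    "Aberangell,Powys,52.67N,03.71W,SH8410\n"
)
quoted = quote_second_field(data)
print(quoted)
```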
You can try with capturing groups. Simply substitute it with $1"$2"$3 or \1"\2"\3
^(\w+,)([^\d]*)(,.*)$
You can do it in Notepad++ as well.
Find what: ^(\w+,)([^\d]*)(,.*)$
Replace with: $1"$2"$3
A regex which should be able to solve your problem is:
^.*?,(.*?),\d+
This matches
anything (non-greedy) up to first comma (which will not be included in result)
then anything up to second comma (which will be in a group)
and additional condition is that there has to be a number after second comma
So your group is in $1
I have a text file open in Notepad++ in which some lines go past 112 columns, which I'd like to avoid. By the time any of these lines gets to the 112th column, a comma has appeared in the string. Like this.
1,2,3,4...109,110,111,112,113
(Let's suspend disbelief and imagine three-digit numbers take up one column each)
In the end I'd like something like this:
1,2,3,4...109,110,111,112,
113
So far I've figured out the regular expression to find all the lines that are too long:
^.{113,}$
For the life of me I can't figure how to capture the comma I'm looking for in the string up to that column so I can sub in a newline after it.
Anyone know how this can be done?
This should do it:
^(?=.{112})(.{0,111},)
It matches the begin of a line with at least 112 characters (by lookahead), then matches as many characters as possible (up to 111) before a comma.
Replace this with the captured group followed by a linebreak (\1\n).
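Here is a rough Python version of the same replacement, in case the file is processed by a script instead. The sample line is generated for the illustration, and wrap_once is a made-up name.

```python
import re

# Lookahead: the line has at least 112 characters.
# Group 1: as many characters as possible (up to 111) followed by a comma,
# so the captured text is at most 112 characters long.
PATTERN = re.compile(r"^(?=.{112})(.{0,111},)", re.MULTILINE)

def wrap_once(text: str) -> str:
    """Insert a line break after the last comma within the first 112 columns."""
    return PATTERN.sub(r"\1\n", text)

# Hypothetical sample: "1,2,3,...,59", which is longer than 112 characters.
long_line = ",".join(str(n) for n in range(1, 60))
wrapped = wrap_once(long_line)
first, rest = wrapped.split("\n", 1)

print(first)
print(rest)
```

In this sketch each pass breaks a long line only once, so if the remainder is itself over the limit, the substitution has to be run again.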
Use parentheses to create a capture group:
^(.{112},).+
This will identify lines greater than 112 columns. Then take the capture group, append a newline, and replace the capture group in the line.