I need to import a large file with almost 50 columns and thousands of lines, with the structure |field|;|field|;|field|... Each field starts and ends with | (pipe), and ; (semicolon) separates the fields.
The problem is that some fields contain line breaks ("Enters") in the middle, which split the records across several lines:
|123|;|ABC|;|text
text text|
|124|;|ABB|;|Text
text |
|125|;|BDD|;|text text text|
|126|;|ABC|;|text text
text
text|
|127|;|ABC|;|text text text|
If a line does not start with | (pipe), I need to delete the preceding line break so that the record is not split.
The expected result would be:
|123|;|ABC|;|text text text|
|124|;|ABB|;|Text text |
|125|;|BDD|;|text text text|
|126|;|ABC|;|text text text text|
|127|;|ABC|;|text text text|
I have tried several suggestions from other questions, but with no success so far. I never used this
You could match 0+ horizontal whitespace chars, a newline and 0+ whitespace chars using \h*\R\s*.
Then capture in group 1 any char except a whitespace char or pipe using ([^\s|])
In the replacement, use a space and group 1.
Find what:
\h*\R\s*([^\s|])
Replace with (a single space followed by $1):
 $1
Regex demo
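For readers who want to apply the same replacement outside an editor, here is a minimal Python sketch. It assumes the file fits in memory. Python's `re` module does not support `\h` or `\R`, so both are spelled out explicitly; the function name `join_broken_lines` is only illustrative.

```python
import re

# \h* becomes [ \t]* and \R becomes an explicit newline alternation,
# because Python's re module has neither shorthand.
BROKEN_LINE = re.compile(r"[ \t]*(?:\r\n|\r|\n)\s*([^\s|])")


def join_broken_lines(text):
    """Join a line break to the previous line unless the next
    non-whitespace character is a pipe (the start of a new record)."""
    return BROKEN_LINE.sub(r" \1", text)


sample = (
    "|123|;|ABC|;|text\n"
    "text text|\n"
    "|125|;|BDD|;|text text text|\n"
    "|126|;|ABC|;|text text\n"
    "text\n"
    "text|\n"
)
print(join_broken_lines(sample))
```

Note that this sketch, like the original pattern, also swallows blank lines that sit inside a broken field.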
Using Autohotkey, I would like to copy a large text file to the clipboard, extract text between two repeated words, delete everything else, and paste the parsed text. I am trying to do this to a large text file with 80,000+ lines of text where the start and stop words repeat 100s of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text
De l e te this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
You can start the match at StartWord, and then match all following lines that do not start with either StartWord or StopWord:
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^ Start of a line (with the multiline option enabled; otherwise only the start of the string)
StartWord\s*\K Match StartWord and optional whitespace chars, then forget what is matched so far using \K
(?: Non-capturing group to repeat as a whole
\R Match a newline
(?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with StartWord or StopWord, then match the rest of the line
)+ Close the non-capturing group and repeat it 1 or more times to match at least a single line
See a regex demo.
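To show how this pattern handles repeated StartWord/StopWord pairs, here is a minimal Python sketch of the same idea. It is not AutoHotkey code. Python's `re` has no `\K` or `\R`, so a capture group and a literal `\n` stand in for them, and the input is assumed to use `\n` line endings. The name `extract_blocks` is hypothetical.

```python
import re

# Capture group 1 replaces \K: it holds the lines after StartWord
# that start with neither StartWord nor StopWord.
BLOCK = re.compile(
    r"^StartWord[ \t]*((?:\n(?!StartWord|StopWord).*)+)",
    re.MULTILINE,
)


def extract_blocks(text):
    """Return the text of every StartWord block, joined by newlines."""
    blocks = [m.group(1).lstrip("\n") for m in BLOCK.finditer(text)]
    return "\n".join(blocks)


sample = (
    "Delete this text\n"
    "StartWord\n"
    "Apples Oranges\n"
    "Pears Grapes\n"
    "StopWord\n"
    "Delete this text\n"
    "StartWord\n"
    "Peas Carrots\n"
    "StopWord\n"
)
print(extract_blocks(sample))
```

Iterating over all matches (here with `finditer`) is what handles the repeated start and stop words; a single match call only returns the first block.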
This is only slightly different than @Thefourthbird's solution.
You can match the following regular expression with general, multiline and dot-all flags set [1]:
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
Demo
The regular expression can be broken down as follows:
^StartWord # match 'StartWord' at the beginning of a line
\R+ # match >= 1 line terminators to avoid matching empty lines
# below
\K # reset start of match to current location and discard
# all previously-matched characters
.*? # match >= 0 characters lazily
\R # match a line terminator
(?= # begin a positive lookahead
\R* # match >= 0 line terminators to avoid matching empty lines
# above
^StopWord\R # Match 'StopWord' at the beginning of a line followed
# by a line terminator
) # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.
I am working on a PowerShell script to parse SWIFT messages (text based) into a database. I am using regex to find the appropriate strings in the file and extract them. I have now run into the issue that one of the data fields can contain CR/LF characters. In the example below I would need to extract the second line as well.
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
I tested the regex pattern (:61:.*[\r\n].*) in RegExr, and it treats the [\r\n] characters as required for a match. So my plan was to have two expressions, one with and one without CR/LF characters, to identify both kinds of message (with or without a line break). However, the code below returns all matches whether or not a line break is included. It seems that PS stops evaluating strings after CR/LF.
$transaction = $swift | select-string ':61:.*[\r\n].*' -AllMatches | % { $_.Matches } | % { $_.Value }
Can I use REGEX for this task or do I have to create a function to read the entire string and check for the next line tag to determine the end of this string?
Describe the first line more accurately, then whatever is left is necessarily the message:
$swift = @'
:61:2111261126D12000,00NTRF11000004217657P//03MT211124101166
JANE DOE 1232
'@
$swift |Select-String -Pattern '(?m):\d+:[^,]+,[^/]+//\d+MT\d+[\s\r\n]+.*$'
The regex pattern breaks down as follows:
(?m) # Multi-line mode, this will make `$` match end-of-line positions as well as end-of-string
:\d+: # 1 or more digits, surrounded by colons, matches `:61:`
[^,]+, # 1 or more non-commas followed by a comma, matches `2111261126D12000,`
[^/]+// # 1 or more non-slashes, followed by 2 slashes, matches `00NTRF11000004217657P//`
\d+MT\d+ # 1 or more digits followed by `MT` and more digits, matches `03MT211124101166`
[\s\r\n]+ # 1 or more white-space/CR/LF characters
.*$ # everything until the end of the current line, matches `JANE DOE 1232`
Since we're using [\s\r\n]+ to describe the potential line break, it'll still work when the linebreak is replaced with other whitespace characters.
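As an illustration of the breakdown above, here is a rough Python sketch of the same pattern applied to a whole message held in one string. It is not PowerShell. The two capture groups and the name `find_statement_lines` are additions for the example, and `[\s\r\n]+` is shortened to `\s+`, which already covers CR and LF.

```python
import re

# Group 1: the tagged first line; group 2: whatever follows the line break.
STATEMENT_LINE = re.compile(
    r"(:\d+:[^,]+,[^/]+//\d+MT\d+)\s+(.*)$",
    re.MULTILINE,
)


def find_statement_lines(message):
    """Return (first_line, continuation) pairs found in the message."""
    return [
        # strip() drops a trailing \r left over from CR/LF line endings
        (m.group(1), m.group(2).strip())
        for m in STATEMENT_LINE.finditer(message)
    ]


swift = (
    ":61:2111261126D12000,00NTRF11000004217657P//03MT211124101166\r\n"
    "JANE DOE 1232\r\n"
)
print(find_statement_lines(swift))
```

The key point is that the pattern is matched against the full multi-line string rather than against one line at a time.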
I have this text:
NBA:red this line has a tab and ends with a curly braces}
some random text qwertyuiop
NBA:green this line must match
NBA:red this line has a tab and must match
NBA:response this line has spaces and must match
NBA:blue this line has a tab and ends with a curly braces}
some random text qwertyuiop
NBA:blue this line has spaces at the begining and ends with curly braces}
random text qwertyuiop
this line must not match}
this line must not match }
I want to match the lines that contain 'NBA:' followed by the word 'red', 'green' or 'blue', and that don't end with a curly brace '}'. This command matches only 'NBA:' and one of the three words:
$ egrep 'NBA:(red|green|blue)' myfile.txt
NBA:red this line has a tab and ends with a curly braces}
NBA:green this line must match
NBA:red this line has a tab and must match
NBA:blue this line has a tab and ends with a curly braces}
NBA:blue this line has spaces at the begining and ends with curly braces}
But I don't know how to match only the lines that don't end with '}'.
I tried this, but it doesn't work:
egrep 'NBA:(red|green|blue)*[^}]$' myfile.txt
But this works:
egrep 'NBA:(red|green|blue)' lorem.txt | egrep '[^}]$'
NBA:green this line must match
NBA:red this line has a tab and must match
I want to do it in just one command.
You were just one character off. This should work fine:
egrep 'NBA:(red|green|blue).*[^}]$'
#                          ^
#                          Note this bit.
* doesn't mean the same thing in regex that it does in glob patterns. It means zero-or-more of the preceding item (a preceding item in this answer being ., any character).
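The difference between the two patterns can be checked with a short Python sketch. Python's `re` behaves like egrep for these constructs. The sample lines are taken from the question, and the helper name `matching_lines` is only illustrative.

```python
import re

# Fixed pattern: ".*" lets any text sit between the colour and the last character.
WANTED = re.compile(r"NBA:(red|green|blue).*[^}]$")
# Original attempt: "*" repeats the group itself, so exactly one
# non-"}" character may follow the colours before the end of the line.
BROKEN = re.compile(r"NBA:(red|green|blue)*[^}]$")

LINES = [
    "NBA:red this line has a tab and ends with a curly braces}",
    "some random text qwertyuiop",
    "NBA:green this line must match",
    "NBA:red this line has a tab and must match",
    "this line must not match}",
]


def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found."""
    return [line for line in lines if pattern.search(line)]


print(matching_lines(WANTED, LINES))
print(matching_lines(BROKEN, LINES))
```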
I have a lot of files which start with some tags I defined.
Example:
=Title
#context
!todo
#topic
#subject
#etc
And some text (notice the blank line just before this text).
Foo
Bar
I'd like to write a Vim search command (with vimgrep) that only matches text before the first empty line.
How do I grep only in the lines before the first blank line? Will it make the grep faster? Please, no need to mention :grep or external binaries like Ag (The Silver Searcher).
I know \_.* matches everything including EOL. I know the negation [^foo]. I succeeded in matching everything but empty lines with /[^^$]. But I didn't manage to compose my :vimgrep command. Thank you for your help!
If you want a general solution that works for any file content, then AFAIK you can't get one with text in that form. You may ask why.
Explanation:
vimgrep takes a pattern argument and searches line by line, exactly like the :global command.
For your case we need to get the first part preceding the first blank line. (It can be extended to: Get the first non blank text)
Let's call:
A :Every block of text not containing any single blank line inside
x :Blank lines
With only these 5 forms of file content you can get the first A block with vimgrep (cases 2, 4 and 5 are obvious):
1 | 2 | 3 | 4 | 5
x | x | A | x | A
A | A | x | A | x
x | x | A
A |
Looking at your file, it has this form:
A
x
A
x
A
The middle block causes a problem, which is why you cannot isolate the first A unless you delimit it with some known UNIQUE text.
So the only solution that I can come up with for those 5 cases is:
:vimgrep /\_.\{-}\(\(\n\s*\n\)\+\)\@=/ %
AFAIK the most you can do with :vimgrep is use the \%<XXl atom to search above a specific line number:
:vim /\%<20lfunction/ *.vim
That command will find all instances of function above line 20 in the given files.
See :help \%l.
[...] always matches a single character. [^^$] matches a character that is not ^ or $. This is not what you want.
One of the things you can do is:
/\%^\%(.\+\n\)\{-}.\{-}\zsfoo/
This matches
\%^ - the beginning of the file
\%( \) - a non-capturing group
\{-} - ... repeated 0 or more times (as few as possible)
.\+ - 1 or more non-newline characters
\n - a newline
.\{-} - 0 or more non-newline characters (as few as possible)
\zs - the official start of the match
This will find the first occurrence of foo, starting from the beginning of the file, searching only non-empty lines. But that's all it does: You can't use it to find multiple matches.
Alternatively:
/\%(^\n\_.*\)\@<!foo/
\%( \) - a non-capturing group
\@<! - not-preceded-by modifier
^ - beginning of line
\n - newline
\_.* - 0 or more of any character
This matches every occurrence of foo that is not preceded anywhere by an empty line (i.e. a beginning-of-line / newline combo).
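Outside Vim, the underlying idea of cutting the text at the first blank line and searching only that part can be sketched in a few lines of Python. This is an illustration under assumptions, not a replacement for :vimgrep: the function name `search_header` is made up, and the file is assumed not to begin with a blank line.

```python
import re


def search_header(text, needle):
    """Return (line_number, line) pairs for lines containing `needle`
    that appear before the first blank (or whitespace-only) line."""
    # Everything up to the first blank line is the tag header.
    header = re.split(r"\n[ \t]*\n", text, maxsplit=1)[0]
    return [
        (number, line)
        for number, line in enumerate(header.split("\n"), start=1)
        if needle in line
    ]


sample = "=Title\n#context\n!todo\n#topic\n\nAnd some text, todo later.\nFoo\nBar\n"
print(search_header(sample, "todo"))
```

Because the search stops at the first blank line, occurrences in the body text are never reported.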
I need the following:
input:
NAME-LIST:
name1
<any text>
name_to_be_changed;
NAME-LIST:
name3
<any text>
name_to_be_changed;
output: replace "name_to_be_changed" by first name in the block
NAME-LIST:
name1
<any text>
name1;
NAME-LIST:
name3
<any text>
name3;
result:
I would prefer a perl one-liner :-)
I suggest a search expression similar to what Sam already posted:
(NAME-LIST:[\t ]*[\r\n]+)([^\r\n]+)([\r\n]+[^\r\n]*[\r\n]+)name_to_be_changed;
The replace string is \1\2\3\2; or $1$2$3$2;
Each pair of opening and closing round brackets specify a marking group. There are three such marking groups in the search expression.
[\t ]* makes it possible that there are trailing spaces or tabs after fixed string NAME-LIST: at end of first line of a block.
[\r\n]+ matches 1 or more carriage returns or linefeeds. That is similar to \v as used by Sam but does not match other vertical whitespaces like formfeed.
[^\r\n]+ matches 1 or more characters which are neither a carriage return nor a linefeed. That is like . if the matching behavior for a dot is defined as matching all characters except line terminators.
[^\r\n]* matches 0 or more characters which are neither a carriage return nor a linefeed. So <any text> can also be no text at all, which means the third line can also be a blank line.
The 3 strings found by the expressions in the marking groups are backreferenced by \1, \2 and \3 or, respectively, $1, $2 and $3. Only the second one is backreferenced twice, to copy the string from line 2 to line 4 and keep the other 3 lines unchanged.
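The search and replace strings above can be tried out with a small Python sketch. The expression is copied unchanged and `\1\2\3\2;` is used as the replacement. The function name `fill_in_names` is only illustrative.

```python
import re

# Three marking groups: the NAME-LIST line, the first name, and the
# line break / <any text> / line break that follows it.
BLOCK = re.compile(
    r"(NAME-LIST:[\t ]*[\r\n]+)([^\r\n]+)([\r\n]+[^\r\n]*[\r\n]+)name_to_be_changed;"
)


def fill_in_names(text):
    """Replace name_to_be_changed with the first name of its block."""
    return BLOCK.sub(r"\1\2\3\2;", text)


sample = (
    "NAME-LIST:\n"
    "name1\n"
    "<any text>\n"
    "name_to_be_changed;\n"
    "NAME-LIST:\n"
    "name3\n"
    "<any text>\n"
    "name_to_be_changed;\n"
)
print(fill_in_names(sample))
```

`re.sub` replaces every non-overlapping match, so each block gets its own first name.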
Using a perl one-liner
perl -00 -pe 's/NAME-LIST:\n(.*)\n.*\n\K.*;/$1;/' file.txt
Explanation:
Switches:
-00: Paragraph mode
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
First of all, thanks for your input.
Unfortunately I could not make use of either of your suggested solutions, but I have found one of my own:
perl -00 -pe 's/(NAME-LIST:\s+)(\w+)(.*?)\w+;/$1$2$3$2;/gs'
\s+ = 1 or more whitespace characters (space, newline, tab, ...)
\w+ = 1 or more word characters (letters, digits or underscore)
The /gs flags are important:
g = global (do the replacement more than once, otherwise only the first name will be replaced)
s = treat the string as a single line (. also matches newlines)