How can I grep for every uneven match?

Markdown's fenced code blocks look like this:
```
Here is the code
in many lines
```
or like this:
```text
Here is the code
in many lines
```
The "text" specifies the language which should be used for highlighting.
I want to run over a flat directory and find all files which contain fenced code blocks without a specified language. How can I find fenced code blocks without a specified language?
What I tried
The following is a superset of what I want:
$ grep -rIE -m1 "\`\`\`[[:space:]]*$" *
The problem is the closing part. Essentially this finds all files which have a fenced code block at all. But how do I grep for every uneven triple backtick?
My guess is that I have to grep for the complete code block. It is guaranteed that there is either a newline after the triple backticks or a language.
So I tried the following two:
grep -rIzPo -m1 "\`\`\`\\n(.*?)\`\`\`" *
grep -rIzEo -m1 "\`\`\`\\n(.*?)\`\`\`" *
It found a couple of cases, but it missed at least one. I have no idea why.
Problem: Two codeblocks
I have many files with multiple code blocks, e.g:
```python
a = "Hello"
b = "Stackoverflow"
print(f"{a} {b}")
```
and
```python
print("foobar")
```
Please note that I don't want a file with this content to match! All regexes I tried so far match
```
and
```python
print("foobar")
```

I think that'd be easier with gawk.
awk 'BEGINFILE{f=0} /^```/{f=!f}
f&&/^```\s*$/{print FILENAME;nextfile}' *
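The flag f records whether the number of fence lines seen so far in the current file is odd (f is 1, inside a code block) or even (f is 0, outside one). It is reset at the beginning of each file and negated by every line that starts with three backticks. When f is 1 and the current line consists of three backticks followed only by optional whitespace, that line opens a block without a language, so the program prints the file name and moves on to the next file. BEGINFILE and the \s shorthand are gawk extensions, which is why this needs gawk.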
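If gawk is not available, the same toggle logic can probably be written without BEGINFILE, nextfile and \s. The following is a minimal sketch rather than a drop-in replacement: the function name find_unlabelled_fences and the variable names are invented for illustration, and it assumes fences always start at the beginning of a line. The fence string is built with sprintf so that the example itself contains no literal run of three backticks.

```shell
# Print the name of every file that opens a fenced code block without a language.
# Hypothetical helper; assumes fences start in column 1.
find_unlabelled_fences() {
    awk '
        BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }   # 96 is the backtick
        FNR == 1 { inside = 0; reported = 0 }             # new file: reset state
        reported { next }                                 # file already printed
        index($0, fence) == 1 {
            inside = !inside                              # toggle on every fence line
            if (inside && substr($0, 4) ~ /^[ \t]*$/) {   # opening fence, no language
                print FILENAME
                reported = 1
            }
        }
    ' "$@"
}
```

Called as find_unlabelled_fences * in the flat directory, it should print each offending file name once.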

Related

complex search/delete/move/replace operation using sed?

After several hours of searching and experimenting, I'm hoping someone can either help me or rub my nose in a post I've missed, which actually would be helpful as well, come to think of it...
Problem:
I've made a quick-and-dirty fix, adding security checks, to several dozen PHP scripts (which we use to enhance Smarty's capabilities).
Example of input(part1):
///// SMARTY AUTH /////
$auth['model'] = isset($params['model']) ? $params['model'] : null;
$auth['requiredLevel'] = isset($params['requiredlevel']) ? $params['requiredlevel'] : null;
$auth['baseAuthorizationLevel'] = isset($params['_authorizationlevel']) ? $params['_authorizationlevel'] : null;
$auth['defaultRequiredLevel'] = AuthorizationLevel::AULE_WRITE;
$auth['baseModel'] = $smarty->getTemplateVars('model');
///// SMARTY AUTH /////
...which I'd like to replace with a much cleaner solution we've come up with. Now here's the rub: in one section of the file there's a block of lines, luckily with very distinct delimiter lines. One of those lines contains a piece of code that needs to be merged into a replacement string. That string replaces a second pattern on a line that follows the block, optionally with a variable number of lines in between.
I'm having trouble figuring out how to piece this nested code together as the shorthand code of sed is quite confusing to me.
So far I've tried to assemble the code needed to capture the first block, but sed keeps giving me the same error each time: extra characters after command
Here are some of the attempts I've made:
sed -n 'p/^\/\/\/\/\/ SMARTY AUTH \/\/\/\/\/\\n.*\\n.*\\n.*\\n.*AULE_\([A-Z_]*\);$^.*$^^\/\/\/\/\/ SMARTY AUTH \/\/\/\/\/$/' function.xls_form.php
sed -n 'p/\(^.*SMARTY AUTH.*$^.*$^.*$^.*$^.*AULE_\([A-Z_]*\);$^.*$^.*SMARTY AUTH.*$/' function.xls_form.php
The second part is relatively easy compared to the first:
sed -ei'.orig' 's/RoleContextAuthorizations::smartyAuth(\$auth)/$smarty->hasAccess(\$params,AuthorizationLevel::AULE_\1)/' *.php
where \1 would be the matched snippet from the first part...
Edit:
The first codeblock is an example of input part 1 which needs to be removed; part 2 is RoleContextAuthorizations::smartyAuth($auth) which needs to be replaced with $smarty->hasAccess($params, AuthorizationLevel::AULE_<snippet from part1>)
/edit
Hoping somebody can point me in the right direction, Many thanks in advance!!!
The hold space is going to be key to solving this. You can copy material from the pattern space (where sed normally works) into the hold space, and do various operations with the hold space, etc.
You need to find the AuthorizationLevel::AULE_WRITE type text within the block markers, and copy that to the hold space, and then delete the text within the block markers. And then separately find the other pattern and replace it with information from the hold space.
Given that the markers use slashes, it is also time to use a custom regex delimiter, which is introduced by a backslash. The following could be in a file script.sed, to be used as:
sed -f script.sed function.xls_form.php
When you're sure it's working, you can play with -i options to overwrite the original.
\%///// SMARTY AUTH /////%,\%///// SMARTY AUTH /////% {
/.*\(AuthorizationLevel::AULE_[A-Z]\{1,\}\).*/{
s//$smarty->hasAccess($params,\1);/
x
}
d
}
/RoleContextAuthorizations::smartyAuth($auth)/x
The first line searches for the start and end marker, using \% to change the delimiter to %. There's then a group of actions in braces. The second line searches for the authorization level and starts a second group of actions. The substitute command replaces the line with the desired output line. The x swaps the pattern space and the hold space, copying the desired output line to the hold space (and copying the empty hold space to the pattern space — it's x for eXchange pattern and hold spaces). This has saved the AuthorizationLevel information. The inner block ends; the outer block deletes the line and continues the execution. Note that there's no need to escape the $ symbol most of the time — it would matter if it was at the end of a pattern (there's a difference between /a\$/ and /a$/, but no difference between /b$c/ and /b\$c/).
The last line then looks for the RoleContextAuthorizations line and swaps it with the hold space. Everything else is just let through.
Given a data file containing:
Gibberish
Rhubarb
///// SMARTY AUTH /////
$auth['model'] = isset($params['model']) ? $params['model'] : null;
$auth['requiredLevel'] = isset($params['requiredlevel']) ? $params['requiredlevel'] : null;
$auth['baseAuthorizationLevel'] = isset($params['_authorizationlevel']) ? $params['_authorizationlevel'] : null;
$auth['defaultRequiredLevel'] = AuthorizationLevel::AULE_WRITE;
$auth['baseModel'] = $smarty->getTemplateVars('model');
///// SMARTY AUTH /////
More gibberish
More rhubarb - it is good with strawberries, especially in yoghurt
RoleContextAuthorizations::smartyAuth($auth);
Trailing gibbets — ugh; worse are trailing giblets
Finish - EOF
The output from sed -f script.sed data is:
$ sed -f script.sed data
Gibberish
Rhubarb
More gibberish
More rhubarb - it is good with strawberries, especially in yoghurt
$smarty->hasAccess($params,AuthorizationLevel::AULE_WRITE);
Trailing gibbets — ugh; worse are trailing giblets
Finish - EOF
$
I think that's what was wanted.
You can convert the file of sed script into a single line of gibberish, but that's left as an exercise for the reader — it isn't very hard, but GNU sed and BSD (macOS) sed have different rules for when you need semicolons as part of a single line command; you were warned. There are also differences in the rules for the -i option between the GNU and BSD variants of sed.
If you have to preserve some portions of the RoleContextAuthorizations::smartyAuth line, you have to work harder, but it can probably be done. For example, you can add the hold space to the current pattern space with the G command, and then edit the information into the right places. It is simplest if every place the line occurs needs to look the same apart from the AULE_XYZ string — that's what I've assumed here.
Also, note that using x rather than h or g is lazy — but doesn't matter if there's only one RoleContextAuthorizations::smartyAuth line. Using the alternatives would mean that if a file has multiple RoleContextAuthorizations::smartyAuth lines, then you'd be able to make the same substitution in each, unless there's another ///// SMARTY AUTH ///// in the file.
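The h and g variant mentioned above could look like the sketch below. It is written as a shell function that passes the script as several -e options, one command per option, which is meant to sidestep the semicolon differences between GNU and BSD sed; the name rewrite_auth is invented here. h copies the pattern space to the hold space and g copies the hold space back over the pattern space, so the saved line stays available for every later RoleContextAuthorizations::smartyAuth line.

```shell
# Delete the SMARTY AUTH block and replace every smartyAuth call line with the
# hasAccess call built from the AULE_ level found inside the block.
# Hypothetical wrapper; reads the named files (or stdin) and writes to stdout.
rewrite_auth() {
    sed -e '\%///// SMARTY AUTH /////%,\%///// SMARTY AUTH /////%{' \
        -e '/.*\(AuthorizationLevel::AULE_[A-Z]\{1,\}\).*/{' \
        -e 's//$smarty->hasAccess($params,\1);/' \
        -e 'h' \
        -e '}' \
        -e 'd' \
        -e '}' \
        -e '/RoleContextAuthorizations::smartyAuth($auth)/g' "$@"
}
```

Usage would be rewrite_auth function.xls_form.php; as before, only add an -i option once the output looks right.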

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them; normally, to the screen, but the pipe | gets the output of the command left to it and pipes it through to the command on the right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split, which the switch -l1000000 tells to start a new file after every 1000000 lines. The - (with spaces around it) tells split to read its input not from a file but from "standard input", which in this case is the output of sort -u. The last word, outfile_, is the prefix of the output file names: written like this, the command produces files named outfile_aa, outfile_ab and so on, and you can change the prefix if you want.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match. Now, it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing else than 0 or more characters of whitespace (like spaces or tabs).
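Put together, the steps above can be wrapped in a small function. This is a sketch under a few assumptions: the name combine_unique and its argument order are invented here, [[:space:]] is used as the POSIX spelling of \s, and LC_ALL=C is added so that sort compares raw bytes and the result does not depend on the locale.

```shell
# combine_unique PREFIX LINES FILE...
# Merge the files, drop blank and duplicate lines, and write chunks of LINES
# lines each to PREFIXaa, PREFIXab, ... (hypothetical helper).
combine_unique() {
    prefix=$1
    chunk=$2
    shift 2
    cat "$@" |
        grep -v '^[[:space:]]*$' |      # drop empty and whitespace-only lines
        LC_ALL=C sort -u |              # sort and remove duplicates (byte order)
        split -l "$chunk" - "$prefix"   # read stdin, start a new file every $chunk lines
}
```

For the case in the question this would be called as combine_unique outfile_ 1000000 file1 file2 file3.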
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
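Save this as a file (for example combine.sh) in a different directory than the data files, for example the parent directory, because cat * would otherwise read the script itself as input. Then run it from the data directory with
sh ../combine.sh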

What is the difference between the two sed commands below?

Information about the environment I am working in:
$ uname -a
AIX prd231 1 6 00C6B1F74C00
$ oslevel -s
6100-03-10-1119
Code Block A
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \
/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleA.txt
Code Block B
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \n/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleB.txt
I have two code blocks (shown above) that use sed to trim the input down to 6 digits, but one of them behaves differently than I expected.
Sample of input for the two code blocks
Mar 25 14:06:16 prd231 ajbtux[33423660]: 20160325140616:~schd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT:~schdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341~ SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210~
I get the following output when the sample input above goes through the two code blocks.
cycleA.txt content
389210
cycleB.txt content
25140616231334236602016032514061610200705008341389210
I understand that my last piped sed command (sed 's/[^0-9]*//g') deletes all characters other than digits, so I omitted it from the code blocks and placed the output in two additional files. I get the following output.
cycleA1.txt content
SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210
cycleB1.txt content
Mar 25 15:27:58 prd231 ajbtux[33423660]: 20160325152758: nschd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT: nschdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341 n SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210 n
I can see that the first code block removes everything other than (SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210) by splitting at the tildes, but the second code block just replaces the tildes with the character n. I can also see that the first code block needs a line break after sed 's/[~]/ \, and that is why I thought \n would simulate a line break, but that is not the case. I think my different results come from the way regular expressions are being used. I have tried to look into regular expressions and searched about them on Stack Overflow but did not find what I was looking for. Could someone explain how I can achieve the same result from code block B as from code block A without having part of my code on a second line?
Thank you in advance
This is an example of the XY problem (http://xyproblem.info/). You're asking for help to implement something that is the wrong solution to your problem. Why are you changing ~s to newlines, etc when all you need given your posted sample input and expected output is:
$ sed -n 's/.*schdCycCleanup.* \([0-9]*\).*/\1/p' file
389210
or:
$ awk -F'[ ~]' '/schdCycCleanup/{print $(NF-1)}' file
389210
If that's not all you need then please edit your question to clarify your requirements for WHAT you are trying to do (as opposed to HOW you are trying to do it) as your current approach is just wrong.
Etan Reisner's helpful answer explains the problem and offers a single-line solution based on an ANSI C-quoted string ($'...'), which is appropriate, given that you originally tagged your question bash.
(Ed Morton's helpful answer shows you how to bypass your problem altogether with a different approach that is both simpler and more efficient.)
However, it sounds like your shell is actually something different, presumably ksh88, an older version of the Korn shell that is the default sh on AIX 6.1, in which such strings are not supported[1].
(ANSI C-quoted strings were introduced in ksh93, and are also supported not only in bash, but in zsh as well).
Thus, you have the following options:
With your current shell, you must stick with a two-line solution that contains an (\-escaped) actual newline, as in your code block A.
Note that $(printf '\n') to create a newline does not work, because command substitutions invariably trim all trailing newlines, resulting in the empty string in this case.
Use a more modern shell that supports ANSI C-quoted strings, and use Etan's answer. http://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.cmds3/ksh.htm tells me that ksh93 is available as an alternative shell on AIX 6.1, as /usr/bin/ksh93.
If feasible: install GNU sed, which natively understands escape sequences such as \n in replacement strings.
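If the newline has to come from a single line of code, one idiom that avoids the trimming problem described above is to let printf emit a sentinel character after the newline and strip the sentinel afterwards. A sketch, assuming the shell supports the ${var%pattern} expansion (POSIX shells do; this has not been verified here on AIX's ksh88). The names nl and tilde_to_newline are made up for the example.

```shell
# Replace every ~ on stdin with a newline without relying on \n support in sed.
tilde_to_newline() {
    # $(...) strips trailing newlines, so append a sentinel x and remove it again
    nl=$(printf '\nx')
    nl=${nl%x}
    # sed receives: s/~/\<newline>/g
    sed "s/~/\\${nl}/g"
}
```

For example, echo 'foo~bar~baz' | tilde_to_newline should print foo, bar and baz on separate lines.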
[1] As for what actually happens when you try echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g' in a POSIX-like shell that does not support $'...': the $ is left as-is, because what follows is not a valid variable name, and sed ends up seeing literal $s/[~]/\\\n/g, where the $ is interpreted as a context address applying to the last input line - which doesn't make a difference here, because there is only 1 line. \\ is interpreted as plain \, and \n as plain n, effectively replacing ~ instances with literal \n sequences.
GNU sed handles \n in the replacement the way you expect.
OS X (and presumably BSD) sed does not. It treats it as a normal escaped character and just unescapes it to n. (Though I don't see this in the manual anywhere at the moment.)
You can use $'' quoting to use \n as a literal newline if you want though.
echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g'

Bash: Regex for SVN Conflicts

So I'm trying to write a regex to use for a grep command on an SVN status command. I want only files with conflicts to be displayed, and if it's a tree conflict, the extra information SVN provides about it (which is on a line with a > character).
So, here's my description of how SVN outputs lines with conflicts, and then I'll show my regex:
[Single Char Code][Spaces][Letter "C"][Space]Filename
[Spaces][Letter "C"][Space]Filename
[Letter "C"][Space]Filename
Below is what I have so far. The second part, after the OR condition, works fine to get the extra line for tree conflicts. The problem is the first part, where I'm trying to match lines with the letter C under very specific conditions.
Anyway, I'm not exactly the greatest with Regex, so some help here (plus an explanation of what I'm doing wrong, so I can learn from this) would be great.
CONFLICTS=($(svn status | grep "^(.)*C\s\|>"))
Thanks.
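This regex should match your lines:
CONFLICTS=$(svn status | grep '^[ADMRCXI?!~ ]\? *C')
It breaks down as follows:
^[ADMRCXI?!~ ]\? matches zero or one status character at the start of the line (the \? quantifier is a GNU grep extension in basic regular expressions)
 * (a space followed by *) matches zero or more spaces
C matches the character C
I removed the extra parentheses surrounding the command substitution.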
You have to read the description of the svn st output more carefully and try to produce at least one tree conflict.
I'll start it for you:
> The first seven columns in the output are each one character wide:
>...
> Seventh column: Whether the item is the victim of a tree conflict
>...
> 'C' tree-Conflicted
and note: theoretically any of these 7 columns can be non-empty
status for tree-conflict
M       wc/bar.c
!     C wc/qaz.c
      >   local missing, incoming edit upon update
D       wc/qax.c
Dirty lazy draft of regexp
^[enumerate_all_chars_here]{6}C\s
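One way to turn that draft into something runnable is to test only the columns where, as far as I know, svn status can show a C (column 1 for text conflicts, column 2 for property conflicts, column 7 for tree conflicts) and to keep the indented > lines as well. This is a sketch, not a vetted filter: filter_conflicts is an invented name, and it reads the status output from stdin so it can be tried without a working copy.

```shell
# Keep only conflict lines (and tree-conflict detail lines) from svn status output.
filter_conflicts() {
    # C in column 1, 2 or 7, then the rest of the seven status columns and a space;
    # or a detail line: leading spaces followed by >
    grep -E '^(C.{6}|.C.{5}|.{6}C) |^ +>'
}
```

It would be used as CONFLICTS=$(svn status | filter_conflicts).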

On Cygwin (or windows 7), match a word, look backwards, skip a word and print x number of comma separated words

I have a headache from trying to understand squiggly awk and grep syntax, and I have not gotten far.
I have 100 thousand files from which I'm trying to extract a single line.
A sample set of lines of the file is:
Revenue,876.08,,9361.000,444.000,333.000,222.000,111.00,485.000,"\t\t",178.90,9008.98
EV to Revenue,6.170,0.65,3.600,2.60,1.520,1.7,"\t\t",190.9,9008.98,80.9,87
(there are two tabs between the double quotes. I'm representing them with \t here. They are actual whitespace tabs)
I'm trying to output just this line that starts with Revenue:
Revenue,444.000,333.000,222.000,111.00
This output line consists of the first word of the line and the comma (i.e. Revenue,). It then finds the two tabs ensconced in double quotes, looks backwards, skips the first comma-separated number it meets (assume that instead of a number there could be nothing, i.e. just a comma-separated blank) and then outputs the 4 comma-separated numbers before that.
Is this doable in a simple grep or awk or cut or tr command on cygwin that won't be a bear to run on 100K files ?
To clarify, there are 100K files that look very similar. Each file will contain lots of lines (separated by new line/carriage return). Some lines will contain the word Revenue at the start, some in the middle (as in the 2nd sample line I pasted above) etc. I'm only interested in those lines that start with Revenue followed by the comma and then the sequence above. Each file will contain that specific line.
As a completion to this kind of task (because working on 100K files would require this too), what would have to be added to sed to print out the current file name being operated on too?
ie: output like this:
FileName1: Revenue,444.000,333.000,222.000,111.00
[I'll post the answer here if I find it]
Thank you!
Thanks to Sputnick for editing my question so it looks neat and thanks to shellter for responding.
Ed, your solution looks really good. I'm testing it out and will reply back with info plus my understanding of how that regex works. Thank you very much for taking time to write this out!
Since this is just a simple substitution on a single line it's really most appropriate for sed:
$ sed -n -r 's/(^Revenue)(,[^,]*){3}(.*),[^,]*,"\t\t".*/\1\3/p' file
Revenue,444.000,333.000,222.000,111.00
but you can do the same in awk with gensub() (gawk) or match()/substr() or similar. It will run in the blink of an eye no matter what tool you use.
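For the file-name part of the question, awk has the built-in FILENAME variable, so one option is to split on commas and count fields back from the quoted tabs. A sketch, assuming the quoted tabs form a comma-separated field of their own and that at least five value fields precede them; extract_revenue is an invented name.

```shell
# Print "file: Revenue,<4 values>" for every line that starts with Revenue,
extract_revenue() {
    awk -F, '
        /^Revenue,/ {
            # find the field that holds the two quoted tabs
            for (i = 1; i <= NF; i++)
                if ($i == "\"\t\t\"") break
            if (i > NF || i < 7) next        # marker missing or too few fields
            # skip the field right before the marker, print the four before that
            printf "%s: %s,%s,%s,%s,%s\n", FILENAME, $1, $(i-5), $(i-4), $(i-3), $(i-2)
        }
    ' "$@"
}
```

Called with a list of file names, it prints one line per matching line in the FileName1: Revenue,... form asked for in the question.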