ignoring changes matching a string in git diff

I've made a single simple change to a large number of files that are version controlled in git and I'd like to be able to check that no other changes are slipping into this large commit.
The changes are all of the form
- "main()",
+ OOMPH_CURRENT_FUNCTION,
where "main()" could be the name of any function. I want to generate a diff of all changes that are not of this form.
The -G and -S options to git diff are tantalisingly close--they find changes that DO match a string or regexp.
Is there a good way to do this?
Attempts so far
Another question describes how regexes can be negated. Using this approach, I think the command should be
git diff -G '^((?!OOMPH_CURRENT_FUNCTION).)*$'
but this just returns the error message
fatal: invalid log-grep regex: Invalid preceding regular expression
so I guess git doesn't support this regex feature.
I also noticed that the standard unix diff has the -I option to "ignore changes whose lines all match RE". But I can't find the correct way to replace git's own diff with the unix diff tool.

Try the following:
$ git diff > full_diff.txt
$ git diff -G "your pattern" > matching_diff.txt
You can then compare the two like so:
$ diff matching_diff.txt full_diff.txt
If all changes match the pattern, full_diff.txt and matching_diff.txt will be identical, and the last diff command will not return anything.
If there are changes that do not match the pattern, the last diff will highlight those.
You can combine all of the above steps and avoid having to create two extra files like so:
diff <(git diff -G "your pattern") <(git diff) # works with other diff tools too

No more grep needed!
With Git 2.30 (Q1 2021), the "git diff" family of commands learned the "-I<regex>" option to ignore hunks whose changed lines all match the given pattern.
See commit 296d4a9, commit ec7967c (20 Oct 2020) by Michał Kępień (kempniu).
(Merged by Junio C Hamano -- gitster -- in commit 1ae0949, 02 Nov 2020)
diff: add -I<regex> that ignores matching changes
Signed-off-by: Michał Kępień
Add a new diff option that enables ignoring changes whose all lines (changed, removed, and added) match a given regular expression.
This is similar to the -I/--ignore-matching-lines option in standalone diff utilities and can be used e.g. to ignore changes which only affect code comments or to look for unrelated changes in commits containing a large number of automatically applied modifications (e.g. a tree-wide string replacement).
The difference between -G/-S and the new -I option is that the latter filters output on a per-change basis.
Use the 'ignore' field of xdchange_t for marking a change as ignored or not.
Since the same field is used by --ignore-blank-lines, identical hunk emitting rules apply for --ignore-blank-lines and -I.
These two options can also be used together in the same git invocation (they are complementary to each other).
Rename xdl_mark_ignorable() to xdl_mark_ignorable_lines(), to indicate that it is logically a "sibling" of xdl_mark_ignorable_regex() rather than its "parent".
diff-options now includes in its man page:
-I<regex>
--ignore-matching-lines=<regex>
Ignore changes whose all lines match <regex>.
This option may be specified more than once.
Examples:
git diff --ignore-blank-lines -I"ten.*e" -I"^[124-9]"
A small memleak in "diff -I<regexp>" has been corrected with Git 2.31 (Q1 2021).
See commit c45dc9c, commit e900d49 (11 Feb 2021) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit 45df6c4, 22 Feb 2021)
diff: plug memory leak from regcomp() on {log,diff} -I
Signed-off-by: Ævar Arnfjörð Bjarmason
Fix a memory leak in 296d4a9 ("diff: add -I that ignores matching changes", 2020-10-20, Git v2.30.0-rc0 -- merge listed in batch #3) by freeing the memory it allocates in the newly introduced diff_free().
This memory leak was intentionally introduced in 296d4a9, see the discussion on a previous iteration of it.
At that time freeing the memory was somewhat tedious, but since it isn't anymore with the newly introduced diff_free() let's use it.
Let's retain the pattern for diff_free_file() and add a diff_free_ignore_regex(), even though (unlike "diff_free_file") we don't need to call it elsewhere.
I think this will make for more readable code than gradually accumulating a giant diff_free() function, sharing "int i" across unrelated code etc.
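For the change described in the question, -I could be used roughly as in the sketch below. It assumes Git 2.30 or later, that every removed line holds only a quoted string followed by a comma, and that every added line holds only OOMPH_CURRENT_FUNCTION followed by a comma (leading whitespace allowed). The names old_line, new_line and diff_without_rename are made up for this example.

```shell
# Patterns for the two sides of the mechanical change (assumed shapes, adjust to your code).
old_line='^[[:space:]]*"[^"]*",$'                  # e.g.   "main()",
new_line='^[[:space:]]*OOMPH_CURRENT_FUNCTION,$'   # e.g.   OOMPH_CURRENT_FUNCTION,

# Ignore every change whose removed and added lines all match one of the patterns.
# Extra arguments (revisions, paths, --cached) are passed through to git diff.
diff_without_rename() {
    git diff -I"$old_line" -I"$new_line" "$@"
}
```

Running diff_without_rename --cached before committing should then show only changes that contain at least one line of a different form. A change that mixes a matching line with any other edit is not ignored, which is what this check needs.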

Use git difftool to run a real diff.
Example: https://github.com/cben/kubernetes-discovery-samples/commit/b1e946434e73d8d1650c887f7d49b46dcbd835a6
I've created a script running diff the way I want to (here I'm keeping curl --verbose outputs in the repo, resulting in boring changes each time I rerun the curl):
#!/bin/bash
diff --recursive --unified=1 --color \
--ignore-matching-lines=serverAddress \
--ignore-matching-lines='^\* subject:' \
--ignore-matching-lines='^\* start date:' \
--ignore-matching-lines='^\* expire date:' \
--ignore-matching-lines='^\* issuer:' \
--ignore-matching-lines='^< Date:' \
--ignore-matching-lines='^< Content-Length:' \
--ignore-matching-lines='--:--:--' \
--ignore-matching-lines='{ \[[0-9]* bytes data\]' \
"$@"
And now I can run git difftool --dir-diff --extcmd=path/to/above/script.sh and see only interesting changes.
An important caveat about GNU diff -I, aka --ignore-matching-lines: this merely prevents such lines from making a chunk "interesting", but when these changes appear in the same chunk as other non-ignored changes, they will still be shown. I used --unified=1 above to reduce this effect by making chunks smaller (only 1 context line above and below each change).

I think that I have a different solution using pipes and grep. I had two files that needed to be checked for differences that didn't include ## and g:, so I did this (borrowing from several other answers):
$ git diff -U0 --color-words --no-index file1.tex file2.tex | grep -v -e "##" -e "g:"
and that seemed to do the trick. Colors still were there.
So I assume you could take a simpler git diff command/output and do the same thing. What I like about this is that it doesn't require making new files or redirection (other than a pipe).

Related

Trying to get git diff to ignore comments using regex, doesn't seem to be working

My goal is to get git diff to ignore C comments. I've been using a basic regex, and printing the diff to another file (it doesn't print anything otherwise). I've also tried the reverse, to get the diff to only show the comments (I'll later see if I can reverse engineer it). However, it doesn't behave as it's supposed to. Here are a few examples of what I've tried:
Trying to get the diff to show only the lines that begin with /*:
git diff -w -G'(^(/\**)' master > text.diff
Getting the diff to show lines that start with either * or / or end with the same:
git diff -w -G'(^[/\*])|($[^/\*])' master > text.diff
Getting the diff to show only non-comment lines (see How to make 'git diff' ignore comments):
git diff -w -G'(^[^\*# /])|(^#\w)|(^\s+[^\*#/])' master > text.diff
I'm running it under WSL and using git version 2.17.1 for reference.

My goal is to get git diff to ignore C comments ...
This is difficult, because:
Git doesn't understand C code;
parsing C comments requires lexical analysis across multiple lines;
git diff breaks up the input into lines too early.
Your best bet is therefore not to do this directly with git diff at all. Instead:
1. Extract the file(s) to be compared from wherever they live (two commits, one commit and a regular file, one commit and an index copy of a file, etc).
2. Use a C-comment-stripper that does understand how to analyze C source and detect (and remove [1]) the comments.
3. Run the output of step 2 through some diff engine (regular diff, git diff, whatever you like).
If you wrap all of this up as a tool that git difftool can run, you'll get something serviceable and convenient. It will require generating lots of temporary files.
(Note that your attempt to use -G here is ultimately doomed. The -G expression will look for a comment within the changed lines, rather than whether the changed line is or is not in the middle of a long comment. Languages that have only comment-to-end-of-line, such as sh/bash, are more tractable than C. Backslash-newline sequences will still foil things though. See also Erik Aronesty's answer to the linked question.)
[1] Remember that in ANSI C, comments always separate tokens, so for ANSI C, replace comments with white-space, but in many traditional K&R compilers, comments simply vanish. This technique is used in place of the new-in-1989 token-pasting operator in some very old C code. You might want to support this mode by making step 2 have an option to leave out the white-space.
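The strip-then-diff part of this could be sketched as below. This is only an outline under assumptions: the function name diff_without_comments and the STRIP_CMD variable are hypothetical, and STRIP_CMD must be set to some command that reads C source on stdin and writes it without comments on stdout.

```shell
# Diff two already-extracted C files after removing comments from both.
# STRIP_CMD (hypothetical hook for this sketch) is run through sh -c as a filter.
diff_without_comments() {
    if [ -z "${STRIP_CMD:-}" ]; then
        echo "set STRIP_CMD to a comment-stripping filter" >&2
        return 2
    fi
    old_stripped=$(mktemp) && new_stripped=$(mktemp) || return 2
    sh -c "$STRIP_CMD" < "$1" > "$old_stripped"
    sh -c "$STRIP_CMD" < "$2" > "$new_stripped"
    diff -u "$old_stripped" "$new_stripped"
    status=$?
    # Clean up the temporary files mentioned in the answer.
    rm -f "$old_stripped" "$new_stripped"
    return "$status"
}
```

The two inputs might come from something like git show master:path/to/file.c > old.c and a working-tree copy. One candidate for STRIP_CMD is gcc -fpreprocessed -dD -E -P -x c -, which is an assumption here and worth checking against your compiler before relying on it.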

Git diff: ignore lines starting with a word

As I have learned here, we can tell git diff to ignore lines starting with a * using:
git diff -G '^[[:space:]]*[^[:space:]*]'
How do I tell git to ignore lines starting with a word, or more (for example: * Generated at), not just a character?
This file shall be ignored, it contains only trivial changes:
- * Generated at 2018-11-21
+ * Generated at 2018-11-23
This file shall NOT be ignored, it contains NOT only trivial changes:
- * Generated at 2018-11-21
+ * Generated at 2018-11-23
+ * This line is important! Although it starts with a *

Git uses POSIX regular expressions, which do not seem to support lookarounds. That is the reason why @Myys 3's approach does not work. A not so elegant workaround could be something like this:
git diff -G '^\s*([^\s*]|\*\s*[^\sG]|\*\sG[^e]|\*\sGe[^n]|\*\sGen[^e]|\*\sGene[^r]|\*\sGener[^a]|\*\sGenera[^t]|\*\sGenerat[^e]|\*\sGenerate[^d]).*'
This will filter out all changes starting with "* Generated".
Test: https://regex101.com/r/kdv4V0/3

Considering that you are ignoring changes that do NOT match your regex, you just have to put the words you want inside a lookahead group within the expression, like this:
git diff -G '^(?=.*Generated at)[[:space:]]*[^[:space:]*]'
Note that if you want to keep adding words to ignore, just keep adding these groups (don't forget the .*).
However, with this version a line that contains "Generated at" anywhere will be ignored. If you want to define exactly how the line should start, then replace the . with [^[:word:]]:
git diff -G '^(?=[^[:word:]]*Generated at)[[:space:]]*[^[:space:]*]'
You can have a look at its behaviour at:
Version 1: .*
https://regex101.com/r/kdv4V0/1
Version 2: [^[:word:]]*
https://regex101.com/r/kdv4V0/2

TL;DR: git diff -G cannot exclude changes; it can only include changes that match the regex.
Have a look at git diff: ignore deletion or insertion of certain regex
There torek explains how git log and git diff work and how the parameter -G works.

Drop commits by commit message in `git rebase`

I would like to do a git rebase and drop commits whose commit messages match a certain regular expression. For example, it might work like
git rebase --drop="deleteme|temporary" master
And this would do a rebase over master while dropping all commits containing the string deleteme or temporary.
Is it possible to do this with the standard Git tool? If not, is it possible with a third-party Git tool? In particular, I want it to be a single, noninteractive command.

This can be accomplished using the same method as I used in this answer.
First, we need to find the relevant commits. You can do that with something like:
git log --format=format:"%H %s" master..HEAD | grep -E "deleteme|temporary"
This will give you a list of commits with commit messages containing deleteme or temporary that are between master and your current branch. These are the commits that need to be dropped.
Save this bash script somewhere you can access it:
#!/bin/bash
for sha in $(git log --format=format:"%H %s" master..HEAD | grep -E "deleteme|temporary" | cut -d " " -f 1)
do
sha=${sha:0:7}
sed -i "s/pick $sha/drop $sha/" "$@"
done
Then run the rebase as:
GIT_SEQUENCE_EDITOR=/path/to/script.sh git rebase -i
This will automatically drop all commits that contain deleteme or temporary in their commit message.
As mentioned in my other answer:
[This script won't allow] you to customize what command is run to calculate which commits to use, but if this is an issue, you could probably pass in an environment variable to allow such customization.
Obligatory warning: Since a rebase rewrites history, this can be dangerous / disruptive for anyone else working on this branch. Be sure you clearly communicate what you have done with anyone you are collaborating with.
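The environment-variable customization mentioned in the quote could look like the sketch below. DROP_PATTERN and DROP_BASE are names invented for this example, not anything Git itself knows about, and the temporary-file rewrite is used instead of sed -i only to keep the sketch portable.

```shell
#!/bin/bash
# Sketch of a configurable sequence editor. DROP_PATTERN and DROP_BASE are
# environment variable names invented for this example.

# Print the full SHA of every commit in DROP_BASE..HEAD whose "%H %s" line matches DROP_PATTERN.
matching_commits() {
    git log --format=format:'%H %s' "${DROP_BASE:-master}..HEAD" |
        grep -E "${DROP_PATTERN:-deleteme|temporary}" |
        cut -d ' ' -f 1
}

# Read full SHAs on stdin and turn the matching "pick" lines of the todo file ($1) into "drop".
mark_drops() {
    todo=$1
    while read -r sha || [ -n "$sha" ]; do
        [ -n "$sha" ] || continue
        short=$(printf '%s' "$sha" | cut -c 1-7)
        sed "s/^pick $short/drop $short/" "$todo" > "$todo.tmp" && mv "$todo.tmp" "$todo"
    done
}

# git passes the path of the todo file as the first argument.
if [ -n "${1:-}" ] && [ -f "$1" ]; then
    matching_commits | mark_drops "$1"
fi
```

It would be called along the lines of DROP_PATTERN='wip|scratch' DROP_BASE=master GIT_SEQUENCE_EDITOR=/path/to/script.sh git rebase -i master, where the pattern shown is only an illustration.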

You could, for example, use an interactive rebase. Run git rebase -i <first commit that should not be touched>, and then in vim, where you have the list of commits, run :%s/^[^ ]* \([^ ]* issue\)/d \1/g to switch to the drop command for all commits whose commit message starts with issue. Be aware, though, that git rebase does not work optimally with merge commits. By default they are skipped and the history is flattened, but you can try to keep them with options such as --rebase-merges.

@Scott Weldon's answer works great for this use case. However, if the regex is anchored to the start of the message, for example (^(deleteme)|^(temporary)), it won't work, because each line that grep sees starts with the commit hash. In that case you can use this instead:
#!/bin/bash
for sha in $(git log --format=format:"%s %H" master..HEAD | grep -E "^(deleteme)|^(temporary)" | awk '{print $NF}')
do
sha=${sha:0:7}
sed -i "s/pick $sha/drop $sha/" "$@"
done
The core difference is that %s and %H swapped places, so the hash is now the last field of each line instead of the first, and we extract it by piping to awk '{print $NF}'.
Also worth noting that this is called the same way as in Scott Weldon's answer:
GIT_SEQUENCE_EDITOR=/path/to/script.sh git rebase -i master

git smart line and word diff

I'd like to git diff and combine the regular line-by-line diff with git diff --word-diff. The problem with line-by-line diffs is that they're unnecessary if I change one or two words and leave the line mostly intact--the chunking is too coarse. On the other hand if I change entire lines and use --word-diff, sometimes the diff algorithm will get confused and spit out incredibly confusing diffs, with lots of words inserted and deleted to "morph" one line into another.
Is there a way to specify that git should be smart about this and only --word-diff if it actually makes sense to do so (on a line-by-line basis, of course)?

The smartest thing I have found for git diff --word-diff or git diff --color-words are the predefined patterns that come with git (as used in --word-diff-regex or diff.wordregex). They might not be perfect, but give quite good results AFAICT.
A list of predefined diff drivers (they all have predefined word regexes too) is given in the docs for .gitattributes. It is further stated that
you still need to enable this with the attribute mechanism, via .gitattributes
So to activate the python pattern for all *.py files, you could issue the following command in your repo root:
echo "*.py diff=python" >> .gitattributes
If you are interested in what the different preset patterns actually look like, take a look at git's source code

How to substitute words in Git history & properly debug related problems?

I'm trying to remove sensitive data like passwords from my Git history. Instead of deleting whole files I just want to substitute the passwords with removedSensitiveInfo. This is what I came up with after browsing through numerous StackOverflow topics and other sites.
git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
When I run this command it seems to be rewriting the history (it shows the commits it's rewriting and takes a few minutes). However, when I check to see if all sensitive data has indeed been removed it turns out it's still there.
For reference this is how I do the check
git grep aSecretPassword1 $(git rev-list --all)
Which shows me all the hundreds of commits that match the search query. Nothing has been substituted.
Any idea what's going on here?
I double checked the regular expression I'm using, which seems to be correct. I'm not sure what else to check for or how to properly debug this, as my Git knowledge is quite rudimentary. For example, I don't know how to test whether 1) my regular expression isn't matching anything, 2) sed isn't being run on all files, 3) the file changes are not being saved, or 4) something else.
Any help is very much appreciated.
P.S.
I'm aware of several StackOverflow threads about this topic. However, I couldn't find one that is about substituting words (rather than deleting files) in all (ASCII) files (rather than specifying a specific file or file type). Not sure whether that should make a difference, but all suggested solutions haven't worked for me.

git-filter-branch is a powerful but difficult to use tool - there are several obscure things you need to know to use it correctly for your task, and each one is a possible cause for the problems you're seeing. So rather than immediately trying to debug them, let's take a step back and look at the original problem:
Substitute given strings (ie passwords) within all text files (without specifying a specific file/file-type)
Ensure that the updated Git history does not contain the old password text
Do the above as simply as possible
There is a tailor-made solution to this problem:
Use The BFG... not git-filter-branch
The BFG Repo-Cleaner is a simpler alternative to git-filter-branch specifically designed for removing passwords and other unwanted data from Git repository history.
Ways in which the BFG helps you in this situation:
The BFG is 10-720x faster
It automatically runs on all tags and references, unlike git-filter-branch - which only does that if you add the extraordinary --tag-name-filter cat -- --all command-line option (Note that the example command you gave in the Question DOES NOT have this, a possible cause of your problems)
The BFG doesn't generate any refs/original/ refs - so no need for you to perform an extra step to remove them
You can express your passwords as simple literal strings, without having to worry about getting regex-escaping right. The BFG can handle regex too, if you really need it.
Using the BFG
Carefully follow the usage steps - the core bit is just this command:
$ java -jar bfg.jar --replace-text replacements.txt my-repo.git
The replacements.txt file should contain all the substitutions you want to do, in a format like this (one entry per line - note the comments shouldn't be included):
PASSWORD1 # Replace literal string 'PASSWORD1' with '***REMOVED***' (default)
PASSWORD2==>examplePass # replace with 'examplePass' instead
PASSWORD3==> # replace with the empty string
regex:password=\w+==>password= # Replace, using a regex
Your entire repository history will be scanned, and all text files (under 1MB in size) will have the substitutions performed: any matching string (that isn't in your latest commit) will be replaced.
Full disclosure: I'm the author of the BFG Repo-Cleaner.

Looks OK. Remember that filter-branch retains the original commits under refs/original/, e.g.:
$ git commit -m 'add secret password, oops!'
[master edaf467] add secret password, oops!
1 file changed, 4 insertions(+)
create mode 100644 secret
$ git filter-branch --tree-filter "find . -type f -exec sed -Ei '' -e 's/(aSecretPassword1|aSecretPassword2|aSecretPassword3)/removedSensitiveInfo/g' {} \;"
Rewrite edaf467960ade97ea03162ec89f11cae7c256e3d (2/2)
Ref 'refs/heads/master' was rewritten
Then:
$ git grep aSecretPassword `git rev-list --all`
edaf467960ade97ea03162ec89f11cae7c256e3d:secret:aSecretPassword2
but:
$ git lola
* e530e69 (HEAD, master) add secret password, oops!
| * edaf467 (refs/original/refs/heads/master) add secret password, oops!
|/
* 7624023 Initial
(git lola is my alias for git log --graph --oneline --decorate --all). Yes, it's in there, but under the refs/original name space. Clear that out:
$ rm -rf .git/refs/original
$ git reflog expire --expire=now --all
$ git gc
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)
and then:
$ git grep aSecretPassword `git rev-list --all`
$
(as always, run filter-branch on a copy of the repo Just In Case; and then removing original refs, expiring the reflog "now", and gc'ing, means stuff is Really Gone).
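The cleanup and the check can also be wrapped in two small helpers, sketched below. The function names are made up for this example. Deleting the backup refs with git update-ref rather than rm -rf is a swapped-in variation: it also handles refs that have been packed into .git/packed-refs. The check loops over commits one at a time, which is slower but avoids passing hundreds of revisions on one command line.

```shell
# Delete the backup refs that filter-branch leaves under refs/original/,
# then expire the reflog and prune the now unreachable objects.
purge_original_refs() {
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    git reflog expire --expire=now --all
    git gc --prune=now
}

# Count the commits reachable from any ref whose tree still contains the literal string $1.
count_commits_with() {
    git rev-list --all | while read -r rev; do
        git grep -q -F -e "$1" "$rev" && echo "$rev"
    done | wc -l
}
```

After purge_original_refs, count_commits_with aSecretPassword1 should report 0 if the rewrite replaced every occurrence. As in the answer, try this on a copy of the repository first.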
