Grep is a powerful command-line tool that allows you to search for specific patterns within text files. It's a staple of many Linux and Unix-based systems, and is widely used by system administrators, developers, and others who need to search through large volumes of text data. One common use case for grep is the need to search through multiple files, including files within sub-directories. In this article you will learn how to grep files recursively. Another common task, covered here, is listing the lines of file1 that do not appear in file2.

As konsolebox suggested, the poster's grep solution

    grep -v -f file2 file1

actually works great (and faster) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000-line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when piping grep output to wc -l.

These tests also included the -x switch, which is a necessary part of the solution in order to ensure total accuracy in cases where file2 contains lines which match part of, but not all of, one or more lines in file1. So a solution that does not require the inputs to be sorted, is fast, and is flexible (case sensitivity, etc.) is:

    grep -F -x -v -f file2 file1

This doesn't work with all versions of grep. For example, it fails on macOS, where a line in file1 will be shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provide GNU diff and awk, though only a POSIX/BSD split rather than a GNU version. This matters for the divide-and-conquer approach with split and gawk: there, note the use and placement of -, meaning stdin, on the gawk command line. This is provided by split from file1 in chunks of 20000 lines per invocation.
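To see what each switch in grep -F -x -v -f file2 file1 contributes, here is a minimal sketch on made-up data. The sample lines and the temporary directory are assumptions for illustration only, and the results in the comments assume GNU grep.

```shell
#!/bin/sh
# Hypothetical sample data; the names file1 and file2 mirror the text.
tmp=$(mktemp -d)
printf '%s\n' 'apple' 'apple pie' 'banana' 'abc' > "$tmp/file1"
printf '%s\n' 'apple' 'a.c' > "$tmp/file2"

# -f: read patterns from file2, -F: fixed strings, -x: whole-line match,
# -v: print only the lines of file1 that do NOT match.
missing=$(grep -F -x -v -f "$tmp/file2" "$tmp/file1")
printf 'with -F -x:\n%s\n' "$missing"

# Without -x, "apple pie" is hidden because it contains the pattern "apple".
no_x=$(grep -F -v -f "$tmp/file2" "$tmp/file1")
printf 'without -x:\n%s\n' "$no_x"

# Without -F, "a.c" is a regular expression and also matches the line "abc".
no_F=$(grep -x -v -f "$tmp/file2" "$tmp/file1")
printf 'without -F:\n%s\n' "$no_F"

rm -r "$tmp"
```

Only the first form reports all three lines (apple pie, banana, abc) that are genuinely absent from file2.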
GNU diff can also list the lines of file1 that are missing from file2, but the diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line. With the new and unchanged formats set to empty, those lines are suppressed, so only changed lines (i.e. the lines of file1 that are missing from file2) are output.

You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.

If you are familiar with unified diff format, you can partly recreate it with:

    diff --old-line-format="-%L" --unchanged-line-format=" %L" \
         --new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things like number each line with %dn.

Here's a simple awk (nawk) script, linesnotin.awk (inspired by the scripts linked to in konsolebox's answer), which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1. It stores the entire contents of file1 in two arrays, one indexed by line number (ll1) and one indexed by line content (ss1); lines of file1 are recognized with the (NR==FNR) test. Then, as file2 is read, each matching line is deleted from ll1 and ss1. At the end the remaining lines from file1 are output, preserving the original order:

    for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]   # output lines in file1 that are not in file2

In this case, with the problem as stated, you can also divide and conquer using GNU split (--filter is a GNU extension), with repeated runs over chunks of file1 and reading file2 completely each time:

    split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
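The two-array awk approach described above can be sketched as follows. This is a reconstruction under stated assumptions, not the original linesnotin.awk: the array names ll1 and ss1 and the counter nl1 come from the text, while the space-separated list of line numbers in ss1 (so that duplicate lines in file1 are handled) and the sample data are my own additions.

```shell
#!/bin/sh
# Hypothetical sample data; note the duplicate "delta" in file1.
tmp=$(mktemp -d)
printf '%s\n' 'delta' 'alpha' 'delta' 'charlie' 'bravo' > "$tmp/file1"
printf '%s\n' 'bravo' 'delta' > "$tmp/file2"

# Sketch of linesnotin.awk (a reconstruction, assumes file1 is non-empty).
cat > "$tmp/linesnotin.awk" <<'EOF'
# output lines in file1 that are not in file2
# First input: remember each line by number (ll1) and by content (ss1).
# ss1 keeps every line number at which a given content occurs.
NR == FNR { ll1[FNR] = $0; ss1[$0] = ss1[$0] " " FNR; nl1 = FNR; next }
# Second input: drop every file1 line whose content matches this line.
($0 in ss1) {
    n = split(ss1[$0], nums, " ")
    for (i = 1; i <= n; i++) delete ll1[nums[i]]
    delete ss1[$0]
}
# Print whatever is left, in the original file1 order.
END { for (ll = 1; ll <= nl1; ll++) if (ll in ll1) print ll1[ll] }
EOF

missing=$(awk -f "$tmp/linesnotin.awk" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$missing"

rm -r "$tmp"
```

On this data the script prints alpha and charlie, in file1 order. Passing - in place of the first file name makes awk read that input from stdin, which is how the split --filter command feeds it chunks of file1.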
With GNU diff, you can list the lines of file1 that are missing from file2 by controlling the formatting of the old/new/unchanged lines in its output:

    diff --new-line-format="" --unchanged-line-format="" file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):

    diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
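As a small worked example of the sorted diff approach, the sketch below uses made-up data and assumes GNU diff. To stay runnable in a plain POSIX sh it sorts into temporary files instead of using process substitution; in bash or zsh the <(sort file1) <(sort file2) form does the same job without the intermediate files.

```shell
#!/bin/sh
# Hypothetical, deliberately unsorted sample data.
tmp=$(mktemp -d)
printf '%s\n' 'pear' 'apple' 'fig' 'kiwi' > "$tmp/file1"
printf '%s\n' 'kiwi' 'apple' 'plum' > "$tmp/file2"

# diff needs both inputs in the same order, so sort them first.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"

# Empty new/unchanged formats suppress those lines, leaving only old lines,
# i.e. lines present in file1 but not in file2.
# diff exits with status 1 when the inputs differ; only a higher status is an error.
removed=$(diff --new-line-format="" --unchanged-line-format="" \
    "$tmp/file1.sorted" "$tmp/file2.sorted") || [ $? -eq 1 ]
printf '%s\n' "$removed"

rm -r "$tmp"
```

Here the output is fig and pear, in sorted order rather than in the original order of file1, which is the main trade-off compared with the grep and awk approaches.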