Comparing two files using a Bash script

The beauty of scripting is the ability to automate mundane, repetitive tasks so that you can focus on higher-value work. Before writing this article, I was handed a folder of 15 documents, each between 20 and 30 pages long, and had to compare their contents and note the changes. That would be a daunting task for anyone. However, armed with Bash scripting knowledge and a bit of creativity, I wrote a script that takes in two files, compares their contents and prints the lines where they differ.

For the uninitiated, Bash is a command-line interpreter commonly used on GNU/Linux systems; the name is an acronym for the 'Bourne-Again SHell'. Scripting knowledge is a must-have skill for anyone who wishes to explore Linux system administration, and it helps greatly with automation.

Enough with the intros, let's get into scripting!

The full code can be found on my GitHub; the code snippets below describe how the script works:

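# Allow unset variables while the helpers are being defined.
set +o nounset
# Use the plain POSIX "C" locale so output is consistent across systems.
LC_ALL=C ; LANG=C ; export LC_ALL LANG
# pe: print all arguments with no separator, followed by a newline.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
# pl: print a blank line, a divider, then the arguments as a heading.
pl() { pe;pe "-----" ;pe "$*"; }
# db: print a debug message to stderr; the second definition turns it into a no-op.
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
# Run the helper named by C, but only if that path is a regular file.
C=$HOME/bin/ && [ -f "$C" ] && $C
# From here on, treat references to unset variables as errors.
set -o nounset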

FILE1=${1-data1.txt}
shift
FILE2=${1-data2.txt}

# Display samples of data files.
pl " Data files:"
head "$FILE1" "$FILE2"

# Set file descriptors.
exec 3<"$FILE1"
exec 4<"$FILE2"

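In this first part, nounset is a shell option (not a function) that makes Bash treat any reference to an unset variable as an error. The script switches it off while the helper functions are defined and back on before the main logic runs. Note that leaving out the file arguments does not trigger this error: the ${1-data1.txt} expansion falls back to data1.txt (and data2.txt for the second file) when no argument is passed. LC_ALL and LANG are locale environment variables. LANG sets the default locale, i.e. the one used when no specific settings are provided, and LC_ALL overrides every other locale setting. 'C' here is the name of the minimal POSIX locale, not a path, so exporting both variables as C makes the script and the tools it calls handle text the same way on any machine. The helper pe prints its arguments followed by a newline, pl prints a section heading, and db is a debug helper that is immediately redefined to do nothing. Finally, the two exec lines open the files on file descriptors 3 and 4 so that the loop in the next part can read from both files in step.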
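To see these two behaviours in isolation, here is a minimal sketch. The helper pick_file and the variable names are made up for this illustration and are not part of the script; it only shows how a default in the expansion keeps nounset quiet, and how the C locale affects byte-wise ordering.

```bash
#!/usr/bin/env bash
# Sketch only: pick_file is a hypothetical helper, not part of the original script.
set -o nounset

# Return the first argument, or data1.txt when no argument is given.
pick_file() {
  printf '%s\n' "${1-data1.txt}"
}

pick_file               # no argument, so the default is used
pick_file report.txt    # the argument wins over the default

# A plain reference to an unset variable is an error under nounset.
# The subshell keeps that error from stopping this script.
if ( : "$undefined_variable" ) 2>/dev/null; then
  nounset_result="no error"
else
  nounset_result="error"
fi
printf 'plain reference to an unset variable: %s\n' "$nounset_result"

# In the C locale, sort compares raw bytes, so uppercase sorts before lowercase.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | paste -s -d ' ' -)
printf 'C locale sort order: %s\n' "$sorted"
```

Running it should show the default file name, then the explicit one, then confirm that the bare reference fails while the defaulted one does not.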

# Section 2, solution.
pl " Results:"

eof1=0
eof2=0
count1=0
count2=0
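while [[ $eof1 -eq 0 || $eof2 -eq 0 ]]
do
  # IFS= and -r keep leading whitespace and backslashes intact;
  # the -n test accepts a final line that has no trailing newline.
  if IFS= read -r a <&3 || [ -n "$a" ]; then
    count1=$((count1 + 1))
    # printf "%s, line %d: %s\n" "$FILE1" "$count1" "$a"
  else
    eof1=1
  fi
  if IFS= read -r b <&4 || [ -n "$b" ]; then
    count2=$((count2 + 1))
    # printf "%s, line %d: %s\n" "$FILE2" "$count2" "$b"
  else
    eof2=1
  fi
  if [ "$a" != "$b" ]
  then
    echo " File $FILE1 and $FILE2 differ at lines $count1, $count2:"
    pe "$a"
    pe "$b"
    # exit 1
  fi
done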

exit 0

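After the script has shown a sample of each file, this section reads both files one line at a time through the two file descriptors, keeps a line count for each, and prints every pair of lines that differ along with their line numbers. The loop runs until both files are exhausted, so if one file is longer than the other, its extra lines are compared against an empty string. The output of the script looks something like this.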

[Screenshot: script output in a terminal on a Kali VM running in Oracle VM VirtualBox]
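If you want to try the idea without the full script, the following is a minimal sketch of the same lockstep comparison. It makes a few assumptions of its own: the function name compare_files, the sample files old.txt and new.txt, and the output format are all invented for this example, and it reports a single shared line number rather than one count per file.

```bash
#!/usr/bin/env bash
# Sketch only: compare_files and the sample files are illustrative names.

compare_files() {
  # Open each file on its own descriptor so both can be read in step.
  exec 3<"$1" 4<"$2"
  n=0
  while :; do
    more_a=1; more_b=1
    # A failed read with an empty variable means the file is exhausted.
    IFS= read -r a <&3 || [ -n "$a" ] || more_a=0
    IFS= read -r b <&4 || [ -n "$b" ] || more_b=0
    [ "$more_a" -eq 0 ] && [ "$more_b" -eq 0 ] && break
    n=$((n + 1))
    # Report a difference if only one file still has lines, or the lines differ.
    if [ "$more_a" -ne "$more_b" ] || [ "$a" != "$b" ]; then
      printf 'line %d differs:\n  %s\n  %s\n' "$n" "$a" "$b"
    fi
  done
  # Close both descriptors again.
  exec 3<&- 4<&-
}

# Build two small sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/old.txt"
printf 'alpha\nBETA\ngamma\ndelta\n' > "$tmpdir/new.txt"

result=$(compare_files "$tmpdir/old.txt" "$tmpdir/new.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With these sample files the sketch should flag line 2 (beta against BETA) and line 4 (the extra delta line in the longer file). Tracking whether each file still has lines, instead of only comparing the text, also catches the case where the longer file's extra line happens to be blank.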

This script came in handy and cut the time it would have taken to highlight the differences in the documents by more than half. I imagine this is a rudimentary version of how plagiarism detection tools work, and I would appreciate any contributions to the project. As always, thank you for reading! 😀

Feel free to leave a comment or suggestion.