Linux – Checking that two files contain the same data but in a different order – Sorted!

Trying to regression test a change to a feed-file generation program can be tricky. Whether the format is CSV or some fashionable markup language, the ordering of the result set tends to be unimportant in such circumstances.
When testing, we need to verify that the files produced by the new version of the program contain the same data as those produced by the old version, irrespective of the order in which the records are written.

Recently, I was rescued from my struggle with just such a problem by my colleague, Don Thomson, who imparted some (Linux) Jedi wisdom resulting in a simple yet effective solution, involving an inventive combination of Linux utilities.

What we’re going to look at here is :

  • comparing files with diff
  • using sort to give diff a hand
  • comparing sets of files in different directories using sum
  • ignoring trailer records in delimited files using grep or head

Some Data Files

The first file we’ll use in our example is called episodes.txt and contains the following :

one
two
three
four
five
six
seven
eight
nine

The file we’re comparing it to is called releases.txt :

four
five
six
one
two
three
seven
eight
nine

As you can see, the files contain the same data but the first is in numerical order and the second is the result of what may be considered a “Skywalker” sort.

Predictably, diff decides that they are not identical :

diff episodes.txt releases.txt

1,3d0
< one
< two
< three
6a4,6
> one
> two
> three

Before we go any further, let’s use some switches to minimize the diff output as, for the purposes of this exercise, we just want to tell whether or not the files are the same.

diff -qs episodes.txt releases.txt

Files episodes.txt and releases.txt differ

Fortunately, Don knows just the thing to help us perform a data – rather than line-by-line – comparison…

Sorting it out

Let’s see what happens when we use the sort command on episodes.txt :

sort episodes.txt

eight
five
four
nine
one
seven
six
three
two

Interesting. It seems to have sorted the file contents (“data”) into alphabetical order. Let’s see what it does with releases.txt :

sort releases.txt

eight
five
four
nine
one
seven
six
three
two

That’s useful. The output is identical. Now we just need to pass the sort output for each file to diff.
We can do this using process substitution, which runs each sort in its own sub-shell. As we’re running in Bash, the syntax for this is :

diff -qs <(sort episodes.txt) <(sort releases.txt)

Files /dev/fd/63 and /dev/fd/62 are identical

Note that the filenames in the output are not files on disk. They are the file descriptors through which Bash passes the output (stdout) of each sub-shell to diff.
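
The <( ) syntax (known as process substitution) is a Bash feature, so it isn’t available if the comparison has to run under plain sh. As a rough sketch of a portable alternative, the function below sorts one file into a temporary file and uses cmp to compare it with the sorted output of the other. The name same_data is one I’ve made up for illustration :

```shell
# same_data FILE1 FILE2
# Returns 0 if both files contain the same lines, in any order.
# same_data is a hypothetical helper name, not a standard utility.
same_data() {
    sd_tmp=$(mktemp) || return 2
    sort "$2" > "$sd_tmp"
    # "-" tells cmp to read its first input from stdin
    sort "$1" | cmp -s - "$sd_tmp"
    sd_status=$?
    rm -f "$sd_tmp"
    return "$sd_status"
}
```

It can then be used in a test such as : same_data episodes.txt releases.txt && echo 'Same data'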

Just to prove that this solution does detect when the rows of data are different in the files, let’s introduce a “rogue” one…

echo 'three-and-a-bit' >>episodes.txt

diff -qs <(sort episodes.txt) <(sort releases.txt)

Files /dev/fd/63 and /dev/fd/62 differ
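
The -q switch only tells us that the files differ, not how. If you want to see the offending rows, one option is comm, which compares two sorted files line by line. This is a sketch rather than part of Don’s solution, and data_delta is a hypothetical name :

```shell
# data_delta FILE1 FILE2
# Prints the lines that appear in only one of the two files.
# comm indents the lines that are unique to FILE2 with a tab.
data_delta() {
    dd_dir=$(mktemp -d) || return 2
    sort "$1" > "$dd_dir/left"
    sort "$2" > "$dd_dir/right"
    # -3 suppresses the column of lines common to both files
    comm -3 "$dd_dir/left" "$dd_dir/right"
    rm -rf "$dd_dir"
}
```

With the rogue row added to episodes.txt, running data_delta episodes.txt releases.txt should list only the three-and-a-bit line.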

Comparing two sets of files in different directories with a checksum

Whilst this approach works really well for comparing two files, you may find that it’s not that quick when you’re comparing a large number of large files. For example, we have a directory containing files that we generated before making any changes to our (fictitious) program :

mkdir baseline_text_files

ls -l baseline_text_files/*.txt

-rw-rw-r-- 1 mike mike 14 Sep 10 13:36 baseline_text_files/originals.txt
-rw-rw-r-- 1 mike mike 14 Sep 10 13:36 baseline_text_files/prequels.txt
-rw-rw-r-- 1 mike mike 17 Sep 10 13:36 baseline_text_files/sequels.txt

The file contents are :

cat originals.txt
four
five
six

cat prequels.txt
one
two
three

cat sequels.txt
seven
eight
nine

Files generated after modifying the program are in the new_text_files directory :

cat originals.txt
four
six
five

cat prequels.txt
three
two
one

cat sequels.txt
eight
nine
seven

Don’s rather neat alternative to diffing each pair of files is to create a checksum for each file and write the output to a temporary file. We then just diff the files with the output for each directory.

There are a number of utilities you can use to do this and the complexity of the checksum algorithm used may impact the runtime for a large number of files.
In light of this, we’ll be using sum, which seems to be the simplest and therefore (in theory) the fastest.

A quick test first :

sum baseline_text_files/originals.txt
45749     1

The first number is the checksum. The second is the file block count.
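
Because sum only produces a 16-bit checksum, two different files can occasionally end up with the same value. If that risk matters more than raw speed, a wider checksum can be swapped in. The sketch below uses cksum (a 32-bit CRC) in place of sum, and the function name sorted_checksum is just for illustration :

```shell
# sorted_checksum FILE
# Prints a CRC and byte count for the sorted contents of FILE,
# so two files holding the same rows in a different order match.
# sorted_checksum is a hypothetical helper name.
sorted_checksum() {
    sort "$1" | cksum
}
```

On systems with GNU coreutils, sha256sum could be substituted for cksum in the same way.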

Now we’ve identified the required utilities, this script should do the job. I’ve called it data_diff.sh and you can download it from Github should you feel inclined to do so. The link is here.

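#!/bin/sh
# Compare the data in the files of two directories, ignoring record order
orig_dir=$1
new_dir=$2

TMPFILE1=$(mktemp)
TMPFILE2=$(mktemp)

# delete the temporary files before exiting, even if we hit an error
# (the trap has to be set before the work starts, not after it)
trap 'rm -f "$TMPFILE1" "$TMPFILE2"' EXIT

# write one checksum per file, calculated on the sorted contents
for file in "$orig_dir"/*
do
    sort "$file" | sum >> "$TMPFILE1"
done

for file in "$new_dir"/*
do
    sort "$file" | sum >> "$TMPFILE2"
done

diff -qs "$TMPFILE1" "$TMPFILE2"

# diff returns 0 if the inputs match, 1 if they differ and 2 on error
is_same=$?

if [ "$is_same" -eq 0 ]
then
    echo 'Files are identical'
else
    echo 'Files do not match'
fi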

Run this and we get :

sh data_diff.sh baseline_text_files new_text_files

Files /tmp/tmp.OshLmwGL0J and /tmp/tmp.Uz2mUa0SSY are identical
Files are identical

If we introduce a difference in one of the existing files…

echo 'SOLO' >>new_text_files/sequels.txt

sh data_diff.sh baseline_text_files new_text_files

Files /tmp/tmp.4OGnjQls0S and /tmp/tmp.L7OUZyGUzl differ
Files do not match

Unsurprisingly, the script will also detect a difference if we’re missing a file…

touch baseline_text_files/tv_series.txt

sh data_diff.sh baseline_text_files new_text_files

Files /tmp/tmp.LsCHbhxK1D and /tmp/tmp.358UInXSJX differ
Files do not match
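
One limitation of comparing bare checksums is that the output doesn’t tell us which file is the odd one out. A possible refinement, sketched here on the assumption that corresponding files have the same name in both directories, is to write the file name next to each checksum. The name dir_checksums is hypothetical, and I’ve used cksum rather than sum so that the byte count is included on each line :

```shell
# dir_checksums DIR
# Prints one line per regular file in DIR : the file name followed
# by the CRC and byte count of its sorted contents.
# dir_checksums is a hypothetical helper name.
dir_checksums() {
    for dc_file in "$1"/*
    do
        # skip sub-directories and an unmatched glob
        [ -f "$dc_file" ] || continue
        printf '%s %s\n' "$(basename "$dc_file")" "$(sort "$dc_file" | cksum)"
    done
}
```

Redirect the output for each directory to a temporary file and run diff on the two files without the -q switch. Any line that diff reports then names the file that is different or missing.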

Ignoring Trailer Records in delimited files

With text delimited files, it’s common practice to include a trailer record at the end of the file to confirm it is complete.
This record will typically include the date (or timestamp) of when the file was created.

Such a file in the baseline_files directory might look like this :

cat baseline_files/episodes.csv

HEADER|episode|title|release_year
I|The Phantom Menace|1999
II|Attack of the Clones|2002
III|Revenge of the Sith|2005
IV|A New Hope|1977
V|The Empire Strikes Back|1980
VI|Return of the Jedi|1983
VII|The Force Awakens|2015
VIII|The Last Jedi|2017
IX|The Rise of Skywalker|2019
TRAILER|9|20220903

The trailer in the corresponding file in the new directory includes a different date :

cat new_files/episodes.csv

HEADER|episode|title|release_year
I|The Phantom Menace|1999
II|Attack of the Clones|2002
III|Revenge of the Sith|2005
IV|A New Hope|1977
V|The Empire Strikes Back|1980
VI|Return of the Jedi|1983
VII|The Force Awakens|2015
VIII|The Last Jedi|2017
IX|The Rise of Skywalker|2019
TRAILER|9|20220904

To accurately compare the data in these files, we’ll need to ignore the trailer record.

Once again, there are numerous ways to do this. We could use :

grep -iv trailer baseline_files/episodes.csv

HEADER|episode|title|release_year
I|The Phantom Menace|1999
II|Attack of the Clones|2002
III|Revenge of the Sith|2005
IV|A New Hope|1977
V|The Empire Strikes Back|1980
VI|Return of the Jedi|1983
VII|The Force Awakens|2015
VIII|The Last Jedi|2017
IX|The Rise of Skywalker|2019

…which would result in our diff looking like this :

diff -qs <(sort baseline_files/episodes.csv|grep -iv trailer) <(sort new_files/episodes.csv| grep -iv trailer)
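
Note that grep -iv trailer drops every line containing that word, wherever it appears in the line. If a data row could legitimately include it, anchoring the pattern to the start of the line is safer. This sketch assumes the trailer always begins with TRAILER followed by the pipe delimiter, as in the example files, and strip_trailer is a made-up name :

```shell
# strip_trailer FILE
# Prints FILE without any line that starts with "TRAILER|".
# In a basic regular expression the | is an ordinary character.
# strip_trailer is a hypothetical helper name.
strip_trailer() {
    grep -v '^TRAILER|' "$1"
}
```

The output can be piped into sort in the same way as the grep -iv version.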

Alternatively, if we know that the trailer record is always the last line of the file we can use head to output everything apart from the last line :

head -n -1 baseline_files/episodes.csv

HEADER|episode|title|release_year
I|The Phantom Menace|1999
II|Attack of the Clones|2002
III|Revenge of the Sith|2005
IV|A New Hope|1977
V|The Empire Strikes Back|1980
VI|Return of the Jedi|1983
VII|The Force Awakens|2015
VIII|The Last Jedi|2017
IX|The Rise of Skywalker|2019

…which would entail a diff command like this :

diff -qs <(head -n -1 baseline_files/episodes.csv| sort) <(head -n -1 new_files/episodes.csv| sort)
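
A negative line count is a GNU extension to head, so this may not work on other Unix-like systems. Where portability matters, sed can delete the last line instead. A minimal sketch, with drop_last_line being a hypothetical name :

```shell
# drop_last_line FILE
# Prints every line of FILE except the last one.
# drop_last_line is a hypothetical helper name.
drop_last_line() {
    # $ addresses the last line of the input and d deletes it
    sed '$d' "$1"
}
```

As with head, the trailer needs to be stripped before sorting, because once the file is sorted the trailer is no longer guaranteed to be the last line.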

This being Linux, there are probably several more options, but these should cover at least some of the more common circumstances where files need to be compared by the data they contain.
