"diff" does not quit ecut the biscuit, as it is to unstructured. So I want to take the first md5sum from the second file and remove it from the original. In that way the remaining entries are the ones different in the second file.
So in bash this spells out:
# The lines to check are read from standard input.
while read -r line
do
    # isolate md5sum from line:
    md5=$(echo "$line" | awk '{print $1}')
    # Is this md5 in the second file?
    if grep -q "$md5" RESTORE-sorted.txt
    then
        # If so, throw it out, we don't consider it anymore:
        grep -v "$md5" RESTORE-sorted.txt > RESTORE.mv
        mv RESTORE.mv RESTORE-sorted.txt
    fi
done
Pretty short and sweet. However, it runs forever on a 650M file. I guessed it was something to do with the kernel's handling of file descriptors, although rescanning and rewriting the whole file on every pass of the loop cannot help either. I started it 6 hours ago and it has not even done half of the task. In fact, while it was running I was able to pick up the Perl necessary to accomplish the same thing, using an associative array. (Perl is quite "intuitive"; you can sort of "baby-talk" your way into it.) The program is not quite as short and sweet, but that is probably due to my newbieness. However, it takes 10 seconds to run. Well, that does illustrate a point, does it not...
#!/usr/bin/perl
# Read original file and checksums into array:
$orig_file="ORIG-sorted.txt";
open(ORIG, $orig_file) || die("Could not open file!");
while (<ORIG>)
{
($key,$value) = split(/ +/,$_);
$orig_a{$key} = $value;
}
close(ORIG);
# Open the next file:
$restore_file="RESTORE-sorted.txt";
open(RESTORE, "<$restore_file") || die("Could not open file!");
while (<RESTORE>)
{
# Split the line in two parts...
my($line) = $_;
@record = split(/ +/,$line);
##...and delete line containing md5sum from original array:
##(the central task)
delete $orig_a{"$record[0]" };
}
close(RESTORE);
# print out formatted array:
foreach $key (keys %orig_a)
{
print $key, " ", $orig_a{$key};
}
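For comparison, the same one-pass idea (remember the checksums of one file, then filter the other) can also be sketched in shell with awk. This is only a sketch, not what I actually ran: the function name only_in_first is made up for illustration, and it assumes the md5sum is the first whitespace-separated field on every line, as in the files above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the scripts above): print every line of
# the first file whose checksum (field 1) does not occur in the second file.
only_in_first() {
    # awk is handed the second file first, so FILENAME == ARGV[1] is true
    # only while its lines are being read: remember each checksum there.
    # For the other file, print the lines whose checksum was never seen.
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)' "$2" "$1"
}

# Usage with the file names from above:
#   only_in_first ORIG-sorted.txt RESTORE-sorted.txt
```

Like the Perl version, this reads each file exactly once and keeps the checksums in memory, instead of rescanning and rewriting a file for every line.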