Module Diff::LCS
In: lib/diff/lcs.rb
lib/diff/lcs/callbacks.rb

Diff::LCS 1.1.2

Computes "intelligent" differences between two sequenced Enumerables. This is an implementation of the McIlroy-Hunt "diff" algorithm for Enumerable objects that include Diffable.

Based on Mario I. Wolczko‘s <mario@wolczko.com> Smalltalk version (1.2, 1993) and Ned Konz‘s <perl@bike-nomad.com> Perl version (Algorithm::Diff).

Synopsis

  require 'diff/lcs'

  seq1 = %w(a b c e h j l m n p)
  seq2 = %w(b c d e f j k l m r s t)

  lcs = Diff::LCS.LCS(seq1, seq2)
  diffs = Diff::LCS.diff(seq1, seq2)
  sdiff = Diff::LCS.sdiff(seq1, seq2)
  seq = Diff::LCS.traverse_sequences(seq1, seq2, callback_obj)
  bal = Diff::LCS.traverse_balanced(seq1, seq2, callback_obj)
  seq2 == Diff::LCS.patch(seq1, diffs)
  seq2 == Diff::LCS.patch!(seq1, diffs)
  seq1 == Diff::LCS.unpatch(seq2, diffs)
  seq1 == Diff::LCS.unpatch!(seq2, diffs)
  seq2 == Diff::LCS.patch(seq1, sdiff)
  seq2 == Diff::LCS.patch!(seq1, sdiff)
  seq1 == Diff::LCS.unpatch(seq2, sdiff)
  seq1 == Diff::LCS.unpatch!(seq2, sdiff)

Alternatively, objects can be extended with Diff::LCS:

  seq1.extend(Diff::LCS)
  lcs = seq1.lcs(seq2)
  diffs = seq1.diff(seq2)
  sdiff = seq1.sdiff(seq2)
  seq = seq1.traverse_sequences(seq2, callback_obj)
  bal = seq1.traverse_balanced(seq2, callback_obj)
  seq2 == seq1.patch(diffs)
  seq2 == seq1.patch!(diffs)
  seq1 == seq2.unpatch(diffs)
  seq1 == seq2.unpatch!(diffs)
  seq2 == seq1.patch(sdiff)
  seq2 == seq1.patch!(sdiff)
  seq1 == seq2.unpatch(sdiff)
  seq1 == seq2.unpatch!(sdiff)

Default extensions are provided for Array and String objects through the use of ‘diff/lcs/array’ and ‘diff/lcs/string’.

Introduction (by Mark-Jason Dominus)

The following text is from the Perl documentation. The only changes have been to make the text appear better in Rdoc.

I once read an article written by the authors of diff; they said that they hard worked very hard on the algorithm until they found the right one.

I think what they ended up using (and I hope someone will correct me, because I am not very confident about this) was the `longest common subsequence’ method. In the LCS problem, you have two sequences of items:

   a b c d f g h j q z
   a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence S which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want S to be as long as possible. In this case S is:

   a b c d f g j z

From there it‘s only a small step to get diff-like output:

   e   h i   k   q r x y
   +   - +   +   - + + +

This module solves the LCS problem. It also includes a canned function to generate diff-like output.

It might seem from the example above that the LCS of two sequences is always pretty obvious, but that‘s not always the case, especially when the two sequences have many repeated elements. For example, consider

   a x b y c z p d q
   a b c a x b y c z

A naive approach might start by matching up the a and b that appear at the beginning of each sequence, like this:

   a x b y c         z p d q
   a   b   c a b y c z

This finds the common subsequence +a b c z+. But actually, the LCS is +a x b y c z+:

         a x b y c z p d q
   a b c a x b y c z

Author

This version is by Austin Ziegler <diff-lcs@halostatue.ca>.

It is based on the Perl Algorithm::Diff by Ned Konz <perl@bike-nomad.com>, copyright &copy; 2000 - 2002 and the Smalltalk diff version by Mario I. Wolczko <mario@wolczko.com>, copyright &copy;

  1. Documentation includes work by Mark-Jason Dominus.

Licence

Copyright &copy; 2004 Austin Ziegler This program is free software; you can redistribute it and/or modify it under the same terms as Ruby, or alternatively under the Perl Artistic licence.

Credits

Much of the documentation is taken directly from the Perl Algorithm::Diff implementation and was written originally by Mark-Jason Dominus <mjd-perl-diff@plover.com> and later by Ned Konz. The basic Ruby implementation was re-ported from the Smalltalk implementation, available at st.cs.uiuc.edu/pub/Smalltalk/MANCHESTER/manchester/4.0/diff.st

sdiff and traverse_balanced were written for the Perl version by Mike Schilli <m@perlmeister.com>.

"The algorithm is described in A Fast Algorithm for Computing Longest Common Subsequences, CACM, vol.20, no.5, pp.350-353, May 1977, with a few minor improvements to improve the speed."

Methods

Classes and Modules

Module Diff::LCS::ChangeTypeTests
Module Diff::LCS::Ldiff
Class Diff::LCS::Block
Class Diff::LCS::Change
Class Diff::LCS::ContextChange
Class Diff::LCS::ContextDiffCallbacks
Class Diff::LCS::DefaultCallbacks
Class Diff::LCS::DiffCallbacks
Class Diff::LCS::Hunk
Class Diff::LCS::SDiffCallbacks

Constants

VERSION = '1.1.2'
PATCH_MAP = { #:nodoc: :patch => { '+' => '+', '-' => '-', '!' => '!', '=' => '=' }, :unpatch => { '+' => '-', '-' => '+', '!' => '!', '=' => '=' }
SequenceCallbacks = DefaultCallbacks   An alias for DefaultCallbacks that is used in Diff::LCS#traverse_sequences.
    Diff::LCS.LCS(seq1, seq2, Diff::LCS::SequenceCallbacks)
BalancedCallbacks = DefaultCallbacks   An alias for DefaultCallbacks that is used in Diff::LCS#traverse_balanced.
    Diff::LCS.LCS(seq1, seq2, Diff::LCS::BalancedCallbacks)

Public Class methods

Given two sequenced Enumerables, LCS returns an Array containing their longest common subsequences.

  lcs = Diff::LCS.LCS(seq1, seq2)

This array whose contents is such that:

  lcs.each_with_index do |ee, ii|
    assert(ee.nil? || (seq1[ii] == seq2[ee]))
  end

If a block is provided, the matching subsequences will be yielded from +seq1+ in turn and may be modified before they are placed into the returned Array of subsequences.

Examine the patchset and the source to see in which direction the patch should be applied.

WARNING: By default, this examines the whole patch, so this could take some time. This also works better with Diff::LCS::ContextChange or Diff::LCS::Change as its source, as an array will cause the creation of one of the above.

If vector maps the matching elements of another collection onto this Enumerable, compute the inverse vector that maps this Enumerable onto the collection. (Currently unused.)

private Compute the longest common subsequence between the sequenced Enumerables a and b. The result is an array whose contents is such that

    result = Diff::LCS.__lcs(a, b)
    result.each_with_index do |e, ii|
      assert_equal(a[ii], b[e]) unless e.nil?
    end

Normalize the patchset. A patchset is always a sequence of changes, but how those changes are represented may vary, depending on how they were generated. In all cases we support, we also support the array representation of the changes. The formats are:

  [ # patchset <- Diff::LCS.diff(a, b)
    [ # one or more hunks
      Diff::LCS::Change # one or more changes
    ] ]

  [ # patchset, equivalent to the above
    [ # one or more hunks
      [ action, line, value ] # one or more changes
    ] ]

  [ # patchset <- Diff::LCS.diff(a, b, Diff::LCS::ContextDiffCallbacks)
    #       OR <- Diff::LCS.sdiff(a, b, Diff::LCS::ContextDiffCallbacks)
    [ # one or more hunks
      Diff::LCS::ContextChange # one or more changes
    ] ]

  [ # patchset, equivalent to the above
    [ # one or more hunks
      [ action, [ old line, old value ], [ new line, new value ] ]
        # one or more changes
    ] ]

  [ # patchset <- Diff::LCS.sdiff(a, b)
    #       OR <- Diff::LCS.diff(a, b, Diff::LCS::SDiffCallbacks)
    Diff::LCS::ContextChange # one or more changes
  ]

  [ # patchset, equivalent to the above
    [ action, [ old line, old value ], [ new line, new value ] ]
      # one or more changes
  ]

The result of this will be either of the following.

  [ # patchset
    Diff::LCS::ContextChange # one or more changes
  ]

  [ # patchset
    Diff::LCS::Change # one or more changes
  ]

If either of the above is provided, it will be returned as such.

Returns a hash mapping each element of an Enumerable to the set of positions it occupies in the Enumerable, optionally restricted to the elements specified in the range of indexes specified by interval.

Find the place at which value would normally be inserted into the Enumerable. If that place is already occupied by value, do nothing and return nil. If the place does not exist (i.e., it is off the end of the Enumerable), add it to the end. Otherwise, replace the element at that point with value. It is assumed that the Enumerable‘s values are numeric.

This operation preserves the sort order.

Diff::LCS.diff computes the smallest set of additions and deletions necessary to turn the first sequence into the second, and returns a description of these changes.

See Diff::LCS::DiffCallbacks for the default behaviour. An alternate behaviour may be implemented with Diff::LCS::ContextDiffCallbacks. If a Class argument is provided for callbacks, diff will attempt to initialise it. If the callbacks object (possibly initialised) responds to finish, it will be called.

Given a patchset, convert the current version to the new version. If direction is not specified (must be :patch or :unpatch), then discovery of the direction of the patch will be attempted.

Given a set of patchset, convert the current version to the next version. Does no auto-discovery.

Diff::LCS.sdiff computes all necessary components to show two sequences and their minimized differences side by side, just like the Unix utility sdiff does:

    old        <     -
    same             same
    before     |     after
    -          >     new

See Diff::LCS::SDiffCallbacks for the default behaviour. An alternate behaviour may be implemented with Diff::LCS::ContextDiffCallbacks. If a Class argument is provided for callbacks, diff will attempt to initialise it. If the callbacks object (possibly initialised) responds to finish, it will be called.

traverse_balanced is an alternative to traverse_sequences. It uses a different algorithm to iterate through the entries in the computed longest common subsequence. Instead of viewing the changes as insertions or deletions from one of the sequences, traverse_balanced will report changes between the sequences. To represent a

The arguments to traverse_balanced are the two sequences to traverse and a callback object, like this:

  traverse_balanced(seq1, seq2, Diff::LCS::ContextDiffCallbacks.new)

sdiff is implemented with traverse_balanced.

Callback Methods

Optional callback methods are emphasized.

callbacks#match:Called when a and b are pointing to common elements in A and B.
callbacks#discard_a:Called when a is pointing to an element not in B.
callbacks#discard_b:Called when b is pointing to an element not in A.
callbacks#change:Called when a and b are pointing to the same relative position, but A[a] and B[b] are not the same; a change has occurred.

traverse_balanced might be a bit slower than traverse_sequences, noticable only while processing huge amounts of data.

The sdiff function of this module is implemented as call to traverse_balanced.

Algorithm

      a---+
          v
      A = a b c e h j l m n p
      B = b c d e f j k l m r s t
          ^
      b---+

Matches

If there are two arrows (a and b) pointing to elements of sequences A and B, the arrows will initially point to the first elements of their respective sequences. traverse_sequences will advance the arrows through the sequences one element at a time, calling a method on the user-specified callback object before each advance. It will advance the arrows in such a way that if there are elements A[ii] and B[jj] which are both equal and part of the longest common subsequence, there will be some moment during the execution of traverse_sequences when arrow a is pointing to A[ii] and arrow b is pointing to B[jj]. When this happens, traverse_sequences will call callbacks#match and then it will advance both arrows.

Discards

Otherwise, one of the arrows is pointing to an element of its sequence that is not part of the longest common subsequence. traverse_sequences will advance that arrow and will call callbacks#discard_a or callbacks#discard_b, depending on which arrow it advanced.

Changes

If both a and b point to elements that are not part of the longest common subsequence, then traverse_sequences will try to call callbacks#change and advance both arrows. If callbacks#change is not implemented, then callbacks#discard_a and callbacks#discard_b will be called in turn.

The methods for callbacks#match, callbacks#discard_a, callbacks#discard_b, and callbacks#change are invoked with an event comprising the action ("=", "+", "-", or "!", respectively), the indicies ii and jj, and the elements A[ii] and B[jj]. Return values are discarded by traverse_balanced.

Context

Note that ii and jj may not be the same index position, even if a and b are considered to be pointing to matching or changed elements.

Diff::LCS.traverse_sequences is the most general facility provided by this module; diff and LCS are implemented as calls to it.

The arguments to traverse_sequences are the two sequences to traverse, and a callback object, like this:

  traverse_sequences(seq1, seq2, Diff::LCS::ContextDiffCallbacks.new)

diff is implemented with traverse_sequences.

Callback Methods

Optional callback methods are emphasized.

callbacks#match:Called when a and b are pointing to common elements in A and B.
callbacks#discard_a:Called when a is pointing to an element not in B.
callbacks#discard_b:Called when b is pointing to an element not in A.
callbacks#finished_a:Called when a has reached the end of sequence A.
callbacks#finished_b:Called when b has reached the end of sequence B.

Algorithm

      a---+
          v
      A = a b c e h j l m n p
      B = b c d e f j k l m r s t
          ^
      b---+

If there are two arrows (a and b) pointing to elements of sequences A and B, the arrows will initially point to the first elements of their respective sequences. traverse_sequences will advance the arrows through the sequences one element at a time, calling a method on the user-specified callback object before each advance. It will advance the arrows in such a way that if there are elements A[ii] and B[jj] which are both equal and part of the longest common subsequence, there will be some moment during the execution of traverse_sequences when arrow a is pointing to A[ii] and arrow b is pointing to B[jj]. When this happens, traverse_sequences will call callbacks#match and then it will advance both arrows.

Otherwise, one of the arrows is pointing to an element of its sequence that is not part of the longest common subsequence. traverse_sequences will advance that arrow and will call callbacks#discard_a or callbacks#discard_b, depending on which arrow it advanced. If both arrows point to elements that are not part of the longest common subsequence, then traverse_sequences will advance one of them and call the appropriate callback, but it is not specified which it will call.

The methods for callbacks#match, callbacks#discard_a, and callbacks#discard_b are invoked with an event comprising the action ("=", "+", or "-", respectively), the indicies ii and jj, and the elements A[ii] and B[jj]. Return values are discarded by traverse_sequences.

End of Sequences

If arrow a reaches the end of its sequence before arrow b does, traverse_sequence try to call callbacks#finished_a with the last index and element of A (A[-1]) and the current index and element of B (B[jj]). If callbacks#finished_a does not exist, then callbacks#discard_b will be called on each element of B until the end of the sequence is reached (the call will be done with A[-1] and B[jj] for each element).

If b reaches the end of B before a reaches the end of A, callbacks#finished_b will be called with the current index and element of A (A[ii]) and the last index and element of B (A[-1]). Again, if callbacks#finished_b does not exist on the callback object, then callbacks#discard_a will be called on each element of A until the end of the sequence is reached (A[ii] and B[-1]).

There is a chance that one additional callbacks#discard_a or callbacks#discard_b will be called after the end of the sequence is reached, if a has not yet reached the end of A or b has not yet reached the end of B.

Given a set of patchset, convert the current version to the prior version. Does no auto-discovery.

Public Instance methods

Returns the difference set between self and other. See Diff::LCS#diff.

Returns an Array containing the longest common subsequence(s) between self and other. See Diff::LCS#LCS.

  lcs = seq1.lcs(seq2)

Attempts to patch a copy of self with the provided patchset. See Diff::LCS#patch.

Attempts to patch self with the provided patchset. See Diff::LCS#patch!. Does no autodiscovery.

Returns the balanced ("side-by-side") difference set between self and other. See Diff::LCS#sdiff.

Traverses the discovered longest common subsequences between self and other using the alternate, balanced algorithm. See Diff::LCS#traverse_balanced.

Traverses the discovered longest common subsequences between self and other. See Diff::LCS#traverse_sequences.

Attempts to unpatch a copy of self with the provided patchset. See Diff::LCS#patch.

Attempts to unpatch self with the provided patchset. See Diff::LCS#unpatch. Does no autodiscovery.

[Validate]