HW7: Needleman-Wunsch (20 Points)

Due Monday 5/4/2020

Overview / Logistics

The purpose of this assignment is to get you practice with recursion and dynamic programming. By the end of this assignment, you will have code that tells you an optimal set of steps it takes to match to strings with variable costs for character deletion and substitution.

What to submit: When you are finished, you should submit a file called Needleman.py to Canvas, along with answers to the following as a comment on Canvas:

  • Did you work with a buddy on this assignment? If so, who?
  • Are you using up any grace points to buy lateness days? If so, how many?
  • Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge if the assignments are too easy or hard)
  • Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  • Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)

The Problem

In class, we talked about how to compute the edit distance efficiently by using dynamic programming. In the edit distance, we seek to minimize a sequence of deletions and substitutions that turn one string into another, and the cost of substituting one letter for the other and deleting a character are both 1, regardless of the letters. To address this limitation, there is a variant of edit distance known as Needleman-Wunsch, in which the costs can change depending on what characters are involved. By convention, we actually switch from a "minimizing cost" mindset to a maximizing score mindset (for those curious, this is to make it easier to do subsequence matching with variable costs).

To help you explore examples of Needleman-Wunsch scoring, I have provided a live demo app below that a former independent study study Liz Dempsey made in the course of our research on sequence matching. In the default example, we have an alphabet of two letters a and b, and the scores are as follows

  • A -1 penalty for deleting a
  • A -2 penalty for deleting b
  • A -3 penalty for swapping an a for a b or a b for an a
  • A +2 score for matching an a to an a
  • A +3 score for matching a b to a b
All of this information is specified in a dictionary, which you can see in the "pairwise costs" menu. When you hit "match strings," this program will run the Needleman-Wunsch algorithm to fill out the dynamic programming table. It will also perform a backtracing to find one of the optimal-cost sequence of edit operations. Play around with different examples and explore the results until you feel comfortable.

Needleman-Wunsch Interactive Applet

String 1

String 2

Alphabet

Pairwise
Costs

Optimal Matching Score

Code To Write

Your job in this assignment is to create a Python program that performs the Needleman-Wunsch matching. You should create a method that takes in two strings, along with a dictionary that specifies the scores of different operations, as in the program above. You should only need to specify the cost of a mismatch once in a dictionary (for instance, "ab" should have the same value as "ba", so your program should work if only one of them is specified).

Your method should return both the optimal cost (15 points), as well as an example of an optimal sequence of edit operations obtained by backtracing (5 points). It will be +5 points of extra credit to return all possible sequences of optimal edit operations if there are ties.

Tips

  • You can stick pretty closely to the iterative code we wrote for edit distance in class. You'll just need to use different costs, and it's a maximization instead of a minimization.
  • Use the above applet to check your answers on different examples.
  • The first row and the first column will be different from what they were in the edit distance, so be careful with this. Look at what they are in the example.
  • When you are considering the left path, for example, that's like saying you've deleted the character in the second string, so you'll have to figure out what character that is and its cost.
  • Recall that string access works the same way as array access. For instance