Princeton University: Programming Assignment Checklist: DNA Sequence Alignment

website

This assignment allows optional partnering. If you choose to do this, you must follow the pair programming guidelines. Please click the link and review them before you begin. Your partner can be from another precept (but ISC students may only partner with other ISC students). Please note that writing code with a partner without following the pair programming instructions is a violation of the course collaboration policy. All writing of code, comments, analysis and uploading to the dropbox should be done together from start to finish. If you come to office hours alone, you can get advice, but you may not change any code until both partners are together.

Frequently Asked Questions

What are the main goals of this assignment? You will (i) solve a fundamental problem in computational biology, (ii) learn about the analysis of algorithms, and (iii) learn about a powerful programming paradigm known as dynamic programming.

How do I read in the two input strings from the file? Use StdIn.readString() and redirection as usual.

How do I access the length of a string s? The ith character? Use s.length() and s.charAt(i), respectively. As with arrays, indices start at 0. We'll learn about this notation for manipulating (String) objects in Section 3.1. For this assignment, this is all you'll need to know about objects.

Can I assume that the input characters will always be A, C, G or T? NO! Your program should work equally well for any letter, upper case or lower case.

What's a StringIndexOutOfBoundsException? It's just like an ArrayOutOfBoundsException. It results from invoking s.charAt(i) with an illegal value of i.

How could I get a NullPointerException? Did you forget to allocate memory for opt[][]?

How do I declare and initialize a two dimensional array in Java? Review the end of Section 1.4 in Intro to Programming.

It seems strange to define x[i..M] to be the substring consisting of x[i] through x[M-1] instead of x[i] through x[M]. Is this a typo? It's a little strange, but no, it's not a typo. It's consistent with Java's indexing notation where the left endpoint is inclusive and the right endpoint is exclusive.

Which alignment should I print out if there are two or more optimal ones? Output any one you like.

Should gaps be handled by penalty()? The solution we think is clearest does not call penalty() on gaps, since a gap is not a character. Of course, the alignment output should use the '-'symbol to denote a gap, and the inputs we give you will never contain '-' (only alphanumeric characters are used), so if you find it convenient for penalty() to recognize '-' as a gap, you are permitted to do so.

Where can I learn more about dynamic programming and backtracking? The LCS (longest common subsequence) problem described in booksite section 9.6 is an example of a dynamic programming problem on strings with backtracking. However, it is different from the current problem in many ways, so do not simply mimic the code without understanding what it does. The websheet exercises for this week include several dynamic programming exercises starting from a basic level, including KnapsackBacktrack which is an example of backtracking.

Memory, Timing, and Operating System Issues

What does OutOfMemoryError mean? When Java runs, it requests a certain amount of memory from the operating system. The exact amount depends on the version of Java and your computer, but can vary from 64MB to 1024MB (1GB). After Java has started, the total size of all variables in use cannot be larger than what it originally requested. Trying to do so causes anOutOfMemoryError.

For this assignment, the largest test cases use huge arrays, and Java needs to ask for enough memory from the operating system. To explicitly ask for for more (or less) memory, use the -Xmxflag. For example, to request 500 megabytes (500 MB) of memory for a run, use

java-introcs -Xmx500m EditDistance < input.txt

Here 500m means 500 MB. You should adjust this number depending on the amount of memory your computer has and the size of the arrays you will need for the data set you are running. The amount 500MB should get you through ecoli10000.txt. To run ecoli20000.txt you will need to request more memory. (How much? The readme asks you to estimate this.)

What does "Could not reserve enough space for object heap" mean? This occurs if you use -Xmx with a value that is larger than the amount of available physical memory. Additionally, due to address space limitations, some 32-bit versions of Windows also will give this error if you try to request more than approximately 1.5GB, no matter how much physical memory is installed.

How do I determine how much physical memory is installed on my computer? On Mac, select About this Mac from the Apple menubar. On Windows, press Windows-R (or Run on the Start menu), enter msinfo32 and look for total physical memory.

How can I measure how long my program takes on each file? To measure the running time of your program, there are a few techniques.

  • The simplest is to use

    java-introcs -Xmx500m EditDistance < input.txt > output.txt

    and use a stopwatch. We redirect the output to a file to prevent printing text from becoming a bottleneck.

  • A second technique, which we think probably best suits the needs of this application, is to use the -Xprof runtime switch, which asks Java to print out timing data about the run. To use this with output redirection, type

    java-introcs -Xprof -Xmx500m EditDistance < input.txt > output.txt

    The timing information will appear at the end of the file output.txt, and you want the "flat profile" for main. The line will look like

    Flat profile of 4.5 secs (15 total ticks): main

    but these numbers are made up and yours will be different. We don't care about the ticks.Piping can be useful here. You can skip the output file in the previous step by piping your output to another program that will look for "Flat profile". and it should print out the time (and throw away all the other program output). On a Mac, run

    java-introcs -Xprof -Xmx500m EditDistance < input.txt | grep "main"

    On Windows, use find instead of grep:

    java-introcs -Xprof -Xmx500m EditDistance < input.txt | find "main"

    This find/grep command searches through all of whatever text it is fed and only prints out the lines containing the text main. Type man grep (in Terminal) or find /? (in Command Prompt) for more information.

  • As a third technique, you can use Stopwatch objects, see Repeat.java for an example. We'll explain more about objects later in the course.
  • Finally, you may elect to directly use System.currentTimeMillis() as shown in lecture.

How do I use a cluster machine in Friend 016/017? See this page for general instructions. Please also read the first bullet point for the question immediately below.

My timing data do not fit a polynomial hypothesis. What could I be doing wrong?

  • If you are running your program and accessing the data files from the Windows H: drive (especially if via a wireless network) or ~ on a cluster Mac (Friend 016/017), the bottleneck for medium-sized test cases might be the network latency instead of the dynamic programming algorithm! Do one of the following:
    1. Use a Stopwatch or System.currentTimeMillis() to specifically isolate the time taken after the input is read (after all calls to StdIn) and before any output is printed. Remember to remove the time printing statements before submitting the final version of your code.
    2. Or, copy all files to a folder on a local hard drive. We don't recommend this on a cluster (Friend 016/017) machine.
    3. Or, report your problems and the data you obtained, and use the readme's sample data for analysis instead.
  • When you run out of physical memory, your operating system may start using your hard drive as another form of storage. Accessing information from the hard drive is substantially slower than main memory, and you may be observing this effect. Avoid running extraneous complicated programs (media players, file sharing clients, word processors, web browsers, etc) while doing the timing tests if this seems to be a problem.
  • Make sure you are using output redirection or piping (as in the examples above) to prevent printing text from becoming a bottleneck.
  • Very small test cases are hard to use since the Java virtual machine takes a nontrivial amount of time to start, and since the processor "cache" may make small test cases run an order of magnitude faster than expected. If in doubt, use the test cases that take between 0.1 and 10.0 seconds.
Testing and Debugging

Testing.   To help you check the part of your program that generates the alignment, there are many test files in the sequence directory.

  1. Many of the small files are designed so that it is easy for you to determine what the correct answer should be by hand. Test your program on these cases to see that it gets these easy cases right.
  2. Here are the optimal edit distances of several of the supplied files.
    ecoli2500.txt   118
    ecoli5000.txt   160
    fli8.txt          6
    fli9.txt          4
    fli10.txt         2
    ftsa1272.txt    758
    gene57.txt        8
    stx1230.txt     521
    stx19.txt        10
    stx26.txt        17
    stx27.txt        19
    
  3. The test case worked through as an example in the assignment description, which is the same as the example10.txt file, has a unique optimal alignment. (Some test inputs like "xx y" have more than one optimal alignment.) So your code should give the exact same output on example10.txt as in the assignment page.
  4. Here are two more test cases with unique optimal alignments:
    % java-introcs EditDistance < endgaps7.txt  % java-introcs EditDistance < fli10.txt
    Edit distance = 4                           Edit distance = 2
    a - 2                                       T T 0
    t t 0                                       G G 0
    a a 0                                       G G 0
    t t 0                                       C T 1
    t t 0                                       G G 0
    a a 0                                       G G 0
    t t 0                                       A T 1
    - a 2                                       A A 0
                                                C C 0
                                                T T 0
    
  5. In addition, we require that you generate one small input file of your own to be used for testing special cases. Create a new input file with some interesting property. Then test your code using your file and make sure your program behaves as expected. For example, when we tested your RGBtoCMYK program in an earlier assignment, our special case was when R=0, G=0, B=0. In NBody, one of our special cases was a system with only 1 body. Include a description of your special test case in your readme.txt file.
Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download sequence.zip to your computer and unzip it, as described on the assignment page. It contains test files and the readme templates.
  2. Write the following two short helper methods.

    // return the penalty for aligning character a with character b public static int penalty(char a, char b) // return the min of 3 integers public static int min(int a, int b, int c)

    You will call these from your main method to compute penalties and to determine which of the three cases yields the minimum edit distance.

  3. Write the main() method in EditDistance.java to read in the two strings from standard input, using the method StdIn.readString(). For debugging, print them to standard output.
  4. Declare and initialize the (M+1)-by-(N+1) array opt[][]. Include the base cases. Print out the 2D array to check your work.To print the matrix out in nicely formatted columns, use

    System.out.printf("%3d", opt[i][j]);

    with nested for loops. Remember to remove this debugging print section before submitting the final version of your program.

  5. Now, it's time to implement the crucial dynamic programming part. First read the dynamic programming portion of Section 9.6 and make sure you understand how dynamic programming works. Think carefully about the order in which you will perform the computation since this is critical. Hint: first fill in the base case of opt[i][j], e.g., when i = M or j = N. Now, fill in the remaining values using a nested loop. Test that it works by printing out the contents of opt.
  6. Now, figure out how to recover the optimal alignment by backtracing.This is an iterative process. At each step we look to see which path choice we should make. Using the example from the assignment we start at i = 0, j=0 where x[i] = 'A' and y[i] = 'T'. The choices are to print "A -" and move down with a gap cost of 2, "- T" and move right with a gap cost of 2, or "A T" and move diagonally with a mismatch cost of 1. We know to pick "A T" because 7 - 6 = 1. This is the only choice which matches the matrix. (It is possible to have more than one choice which matches the matrix. In that case, either choice will lead to the same optimal edit distance.)

    Test this part thoroughly. For example, one corner case to test is to make sure that you print out the ENTIRE alignment, even when one sequence finishes before the other. (Use lastXgaps9.txt and lastYgaps9.txt to test.)

  7. Measure the time that your program takes on the sample runs indicated in the readme. For help on performing timing tests, see above.
  8. Use the doubling method to estimate the running time as a polynomial function of the input size.
Reviewing Your Program
  • Does each static method have a comment indicating what it does?
  • Are there any hardwired constants? Do they appear in multiple methods? The cleanest solution to this is to use the static class constant
    private static int MISMATCH = 1;
    

    (and two other similar ones for gaps and matches). This makes your code easy to use if a different set of penalties is desired.

Enrichment
  • The idea of dynamic programming was first advanced by Bellman (1957). Levenshtein (1966) formalized the notion of edit distance. Needleman-Wunsch (1970) were the first to apply edit distance and dynamic programming for aligning biological sequences, and our algorithm is essentially the one proposed in their seminal paper. The widely-used Smith-Waterman(1981) algorithm is quite similar, but solves a slightly different problem (local sequence alignment instead of global sequence alignment).
  • The same technology is employed in spell checkers and to identify plagiarism in many courses (including this one).
  • The genetic data are taken from GenBank. The National Center for Biotechnology Information also contains many examples of such database and alignment software.
  • With a little work, you can compute the optimal cost in quadratic time but using only linear space (do we need the whole opt matrix?) With more work, you can also compute the optimal alignment in linear space (and quadratic time). This is known as Hirschberg's algorithm (1975).