In my spare time, I like to work a bit on Biojava. It is not as widely used as Bioperl, Biopython, or the other Bio* projects, and it is still missing many important features. Some time ago, it was completely rewritten as Biojava3, and the old biojava-legacy has to be kept around until all features have been migrated to the new implementation.
So I chimed in when a missing feature (SCF parsing, that is, reading the binary files that contain data from DNA sequencing) was mentioned on the mailing list, and said that I would port it to Biojava3. Unfortunately, I had no spare time at that point (changing jobs and having two small kids does that to you).
Soon after that, Biojava moved from an SVN repository to GitHub. By then I also had some time to spare, so I started looking at the old implementation and the documentation of the format. There was also a support request from a user who had trouble with some of his SCF files.
I asked him to send them to me and reproduced the issues. There were two different exceptions. One occurred when a file did not contain a comments section. I made a quick fix and started to look at the second issue.
That one made me scratch my head a lot. I looked at the SCF files in a hex editor but could not pinpoint the issue. All I found out was that the parser got some unexpected zero values while trying to read the contained DNA sequence. So I looked around for a better hex editor, one that would let me annotate regions of the file so that I could keep track of the different positions. After trying out several editors for OS X, I stumbled across Synalyze It!, a tool for reverse-engineering binary files by building a grammar for them. I spent some time refining the grammar until it worked for all the files that I had access to.
With the better understanding of the data that I gained from creating the grammar, I quickly found that the parser did not advance its offset while parsing comments, which caused all subsequent offsets to be… off by the length of the comment block. I fixed that too.
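To make this kind of bug concrete, here is a minimal sketch of a stream reader that keeps its own byte counter. It is not the Biojava code: the class name `OffsetTrackingReader` and its methods are hypothetical, and the layout in the usage note below is a toy one, not the real SCF layout. The point is only that every read has to advance the counter, because jumping to a section's absolute position is computed relative to it.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical helper, for illustration only (not part of Biojava).
class OffsetTrackingReader {
    private final DataInputStream in;
    private long offset = 0; // number of bytes consumed so far

    OffsetTrackingReader(byte[] data) {
        this.in = new DataInputStream(new ByteArrayInputStream(data));
    }

    // Reads exactly 'length' bytes and advances the offset.
    // Forgetting the 'offset += length' line for one section (for example
    // the comments) shifts every later skipTo() by that section's length.
    byte[] readBytes(int length) throws IOException {
        byte[] buffer = new byte[length];
        in.readFully(buffer);
        offset += length;
        return buffer;
    }

    // Skips forward to an absolute position in the stream.
    void skipTo(long target) throws IOException {
        if (target < offset) {
            throw new IOException("Cannot seek backwards: at " + offset
                    + ", target " + target);
        }
        long remaining = target - offset;
        while (remaining > 0) {
            int step = (int) Math.min(remaining, Integer.MAX_VALUE);
            int skipped = in.skipBytes(step);
            if (skipped <= 0) {
                throw new EOFException("Stream ended before offset " + target);
            }
            remaining -= skipped;
            offset += skipped;
        }
    }

    long offset() {
        return offset;
    }
}
```

With a toy input such as the ASCII bytes of `HDRcommentXXBASES`, reading 3 header bytes and 7 comment bytes leaves the counter at 10, so `skipTo(12)` skips the 2 padding bytes and the next read returns `BASES`. If the comment read had not advanced the counter, the same call would try to skip 9 bytes instead of 2 and end up 7 bytes (the length of the comment) too far into the stream.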
The next task is to refactor the parser (splitting the current nested-class implementation into smaller, more manageable, and better-documented classes) and to make it work in Biojava3.