Bio lib paper
From Bioperl.info
Abstract:
Bioinformatics and the various genome projects, which taken together are regarded as the largest scientific effort in human history, have benefited enormously from the general-purpose high-level programming language Perl. Nicknamed here the "Protein Engineering Research Language", Perl is used in computational biology labs all over the world. This is because of Perl's inherent ability to handle biological sequences as strings, together with the fast prototyping made possible by its hybrid of interpreter and compiler features. Another critical factor, not obvious to computer scientists, is that the design and character of Perl make it probably the easiest language for ordinary biologists, who rarely have serious contact with programming languages.
To maximize and extend the capability and ease of use of Perl in the spirit of recycling code, an effort to build a biological module for Perl was begun in 1995 by generalizing existing subroutines (http://cyrah.med.harvard.edu/bioperl.html). Known as Bioperl, the project is a combined effort of many scientists with various backgrounds, and a number of small modules are now publicly available. For example, there are Bio modules for basic sequence handling, sequence alignment, Blast searches, molecular structure (conceptual only), etc.
There have been two major paths in developing Perl subroutines. One is the object-oriented modules of Bioperl; the other is the conventional procedural style, in which simple subroutines are put together into a subroutine library. Various Perl utility programs also belong to the second approach in that they reuse algorithms. (Even though the difference between the two seems minor, for most non-computational biologists it can create a big mental barrier to participating in the communal effort.)
However, the original purpose of recycling subroutines by gathering many common and special-purpose subroutines from various sources has not yet been achieved: there is no generally accepted biological module or subroutine library that is commonly used, maintained, extended and developed further. Here we describe the reasons and propose a direction for the development of Bioperl subroutines.
The benefits of objects are tremendous when a project becomes large. Object orientation can provide high efficiency, especially when a task is done routinely and in a regular fashion. The organization into packages can itself provide a clean platform for continuous and more complex development. Also, when an extremely specific application is to be built, object orientation lets the developer provide an easy interface for common users who would normally not attempt to write code for lack of skill and time.
However, a lot of biological research involves continuous change: testing new algorithms and adapting to changes from various sources such as databases, formats and new standards. The programming problems scientists deal with are different from those of system administrators or webmasters. This means that we need an object module that is extremely flexible to extend and change. In practice it is very difficult for most biologists to spend time on both module development and prototyping. Therefore, any systematic, deep hierarchy developed by a third party may cause unnecessary complications for most biological problems, where quick solutions in minimum time are required (which is one reason why Perl became so popular among biologists).
Second, object technology is often a difficult concept for biologists who do not have much computational background. This problem becomes serious when the code is to be used by colleagues who have even less knowledge of programming.
Third, once such an object module is defined, the subroutines become more specific to the module and hidden inside it, and it becomes harder for common users to extract the subroutines they need and apply them quickly to more specific problems. Perl's built-in functions are often very useful for biology as they are, while the algorithms biologists develop can be so specific that they are difficult to generalize. Therefore, putting such subroutines in a hierarchy within a separate module file can discourage common users.
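To make this point concrete, here is a minimal sketch of how an algorithm can end up tied to a class. The package name My::Seq and its methods new, revcom and seq are hypothetical and are used only for illustration; they are not taken from Bioperl or from the library described in this paper.

```perl
use strict;
use warnings;

# Hypothetical class for illustration only; not a real Bioperl module.
package My::Seq;

# Construct an object that holds one sequence string.
sub new {
    my ($class, %args) = @_;
    my $self = { seq => $args{seq} // '' };
    return bless $self, $class;
}

# Accessor for the stored sequence.
sub seq {
    my ($self) = @_;
    return $self->{seq};
}

# Reverse complement. The algorithm is reachable only through an object,
# and it returns another object rather than a plain string.
sub revcom {
    my ($self) = @_;
    my $rc = reverse $self->{seq};      # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # complement each base
    return My::Seq->new(seq => $rc);
}

package main;

my $obj = My::Seq->new(seq => 'ATGCCG');
print $obj->revcom->seq, "\n";          # prints CGGCAT
```

The two lines that do the actual work in revcom cannot simply be copied into another script. A user must first understand the constructor, the accessor and the object that is returned.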
Fourth, however easy and systematic a specialized module like Bio may be, users must still look up the module and read how to use it. If the module is very simple and general, studying it is cost-effective, but for common biologists it can be an additional burden, like learning another language. They often choose Perl because it is relatively easier than other languages, so a bigger and more complex module defeats the purpose.
Compared to an object-oriented module with a hierarchy, a conventional and comprehensive subroutine library without any hierarchy can be much simpler in concept and practice. By subroutine library we mean a Perl module that is a collection of subroutines and uses neither packages nor object-oriented subroutines. First, there are no new concepts to learn: the user only takes subroutines and integrates them. Second, it is more portable, and portability is one of the most important strengths of Perl itself. Subroutines are usually copied into programs rather than called from a module, although both are possible (a module does not have to be object oriented).
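As a rough illustration of the subroutine-library style described above, the sketch below shows two self-contained subroutines that could be copied directly into a script. The names reverse_complement and gc_content are assumptions made for this example, not names from the actual library.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative library subroutines; the names are hypothetical.

# Return the reverse complement of a DNA sequence (A, C, G, T; case kept).
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;              # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # complement each base
    return $rc;
}

# Return the GC fraction of a sequence, between 0 and 1 (0 if empty).
sub gc_content {
    my ($seq) = @_;
    return 0 unless length($seq);
    my $gc = ($seq =~ tr/GCgc//);       # tr with empty replacement only counts
    return $gc / length($seq);
}

# Usage: plain function calls on plain strings, no objects involved.
print reverse_complement("ATGCCG"), "\n";    # prints CGGCAT
printf "%.2f\n", gc_content("ATGCCG");       # prints 0.67
```

Each subroutine depends only on core Perl, so it can be pasted into another program unchanged, or kept in a shared file and loaded with require.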
Third, it is much easier to maintain and extend a Bio subroutine library, as any subroutine can be listed with proper documentation. This means that, with minimum effort, the users of a subroutine library in any group can update it and export it to other groups. Once an agreeable standard is naturally accepted, it will be trivial to combine all such small libraries into bigger ones. This will result in more extensive collaboration between scientists who use Perl for biology.
In conclusion, a conventional subroutine-library type of module for biology will be very useful for writing easy, fast and reusable code for biological problems. The ideal situation is therefore one in which both object-oriented modules and an easily accessible subroutine library are offered, for different kinds of programming problems.
Here we present, as a primitive example, a biological subroutine library with a package of Perl programs from various sources. The library and the package have been developed over the years for general and specific problems in handling, searching and manipulating protein sequences. The purpose of the library is to provide a comprehensive collection of subroutines that can be used for fast production of biological Perl programs. It is intended for both naive and experienced computational biologists solving scientific problems. The programs in the package are mostly small and medium-sized programs that use subroutines from the library. As a collection of algorithms for biology, it can also serve as an algorithmic recipe book. All the subroutines will be indexed by category and documented with easy examples so that users can copy them and integrate them directly into their own programs. A web server has been set up so that new subroutines can be registered from all over the world.
