Round 2, SIU Soybean Genome


 
 

Methodology

This page documents Round 2 of converting soybean genomic data so that it can be displayed in GBrowse.

Distinguishing characteristics of Round 2:

The second FPC build, provided by J. Shultz at SIU, was put into fpc2.txt.

Here is a summary of the process described below:

  • Preparing Data
    • Original marker data was obtained
    • New data was extracted from FPC file
  • Dropping Anchors
    • Loci anchors were noted
    • Clone anchors were dropped
    • Contig anchors were dropped
  • Placing Features
    • Contigs were spread apart to avoid conflicts
    • Clones were placed based on contig locations
    • Loci were placed based on clone locations
  • Creating the GFF file
    • A GFF file was created based upon the above placements
A table at the bottom of this page shows the relationships between the data files and the Perl scripts which are discussed herein.

The marker data used was the same as before. It was taken from Round 1, Step 1, Phase 2 and put into soybean.gff. These were the original locations of the markers, before they were moved to avoid conflicts. For easier access, soybean.gff was split into mlg.gff, loci.gff, and qtl.gff.

The loci GFF data was rewritten by adjust_loci_file.pl into locus_anchors.txt and then sorted into sorted_locus_anchors.txt so that it would be easier to access. The numeric fields are start, end, and midpoint. Note that these were not necessarily the final locations of the loci. Just as a boat can float some distance away from its anchor, a locus can be placed some distance away from its anchor. This is necessary because loci cannot be stacked on top of each other, and because the locations of clones and contigs influence the locations of some loci.
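The anchor fields described above can be sketched as follows. This is a Python illustration, not adjust_loci_file.pl itself; it assumes a standard nine-column, tab-separated GFF line, and the sample line and its names are hypothetical.

```python
def locus_anchor(gff_line):
    """Return (name, start, end, midpoint) for one tab-separated GFF line.

    Assumes the usual GFF column order: columns 4 and 5 are start and end,
    and column 9 carries the feature name.
    """
    fields = gff_line.rstrip("\n").split("\t")
    start, end = int(fields[3]), int(fields[4])
    name = fields[8]  # kept verbatim
    return name, start, end, (start + end) / 2


# Hypothetical locus line, for illustration only (not taken from loci.gff)
sample = "A1\texample_source\tlocus\t1000\t1200\t.\t.\t.\tLocus example_locus"
anchor = locus_anchor(sample)
```

The midpoint is the single coordinate that later steps can use as the anchor, while start and end preserve the original extent of the locus.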

examine_fpc_file.pl was written to examine the FPC file and write files of relationships between loci, clones, and contigs. These output files were clone2locus_relations.txt and contig2clone_relations.txt. They were sorted into sorted_clone2locus_relations.txt and sorted_contig2clone_relations.txt.

Note that this was starting to look like a relational database. There were relations between loci and their locations, between loci and clones, and between clones and contigs. Each of these relations was in a separate file, and the Perl scripts acted as the database manager.

drop_clone_anchors.pl was written to anchor the clones. It read from sorted_clone2locus_relations.txt and sorted_locus_anchors.txt in order to determine the anchor locations. The output was written to clone_anchors.txt, which was sorted to sorted_clone_anchors.txt. Again, as with the loci, the anchor locations were not necessarily where the clones would actually be placed. That depended upon their relative locations within contigs. An error file, drop_clone_anchors_errors.txt, was also output in order to document which clones had errors during processing.
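The join between the two input files can be sketched as below. This is a Python illustration under stated assumptions, not the logic of drop_clone_anchors.pl: the page does not say how a clone's anchor was derived from several loci, so the sketch assumes the anchor is the mean of the midpoints of the loci the clone matches, and it ignores linkage groups for brevity.

```python
from collections import defaultdict


def drop_clone_anchors(clone2locus, locus_midpoints):
    """Anchor each clone at the mean midpoint of its loci (an assumption).

    clone2locus: iterable of (clone, locus) pairs.
    locus_midpoints: dict mapping locus name to its anchor midpoint.
    Returns (anchors, errors); errors lists pairs whose locus has no anchor.
    """
    hits = defaultdict(list)
    errors = []
    for clone, locus in clone2locus:
        if locus in locus_midpoints:
            hits[clone].append(locus_midpoints[locus])
        else:
            errors.append((clone, locus))
    anchors = {clone: sum(mids) / len(mids) for clone, mids in hits.items()}
    return anchors, errors


# Hypothetical relations and midpoints
relations = [("cloneA", "loc1"), ("cloneA", "loc2"), ("cloneB", "loc3")]
midpoints = {"loc1": 100.0, "loc2": 300.0}
anchors, errors = drop_clone_anchors(relations, midpoints)
```

The separate error list plays the role that drop_clone_anchors_errors.txt plays in the text: clones that could not be anchored are recorded rather than silently dropped.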

An average band size was needed before the contig anchors could be dropped. get_ave_clone_band_size.pl was written to compute it from sorted_contig2clone_relations.txt. (The answer was 3,881.89350119645 bases.) The average clone size of 145,000 bases was provided by Dr. Lightfoot at SIU.

drop_contig_anchors.pl was written to anchor the contigs. It read from sorted_contig2clone_relations.txt and sorted_clone_anchors.txt in order to determine the anchor locations. The output was written to contig_anchors.txt and drop_contig_anchors_errors.txt. Also, sorted_contig_anchors.txt was created. Once again, anchor locations only indicated pressure attracting corresponding contigs, not actual contig locations. The fields in the proper output file indicated start and end (as opposed to midpoint) because the contigs had variable lengths. The last field indicated how many clones anchored the corresponding contig.

spread_contigs.pl was written to spread out the contigs. The issue was that in many cases more than one contig was anchored to the same spot. In other cases, the anchors were so close that the contigs, if placed literally, would overlap, which was not possible. Imagine several boats anchored to the same spot. These boats would spread out, because they could not occupy the same spot. Then imagine that the ropes to the anchors were pulled in so that each boat was pulled closer to its anchor. For an illustration of this, see The Boats-on-a-Lake Algorithm. This is conceptually what spread_contigs.pl accomplished with the contigs: it spread them out, and then pulled them towards their anchors. To make this easier from a programming standpoint, the contigs were spread out to the right, and then the whole group of them was pulled back part way to the left. The input file was sorted_contig_anchors.txt and the output file was sorted_contig_placements.txt. The last two columns of the output file were used to determine the greatest distance that any contig was moved during this process.

While spreading the contigs, a constant, OFFSET, was used to make sure that the contigs were each at least 1000 bases from each other. This turned out to be not enough spacing because clones in neighboring contigs ended up being clustered too closely to each other. In the next run, this offset will be greater.
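The spread-right-then-pull-left idea can be sketched as follows. This is a Python illustration, not spread_contigs.pl: the grouping rule and the choice to pull each group back by half of its largest shift are assumptions, since the page only says the group was pulled back "part way". The contig names and coordinates are hypothetical.

```python
OFFSET = 1000  # minimum gap in bases between contigs, as in the first run


def spread_contigs(anchors, offset=OFFSET):
    """Spread overlapping contigs to the right, then pull groups back left.

    anchors: list of (name, start, end) sorted by start.
    Returns a list of (name, start, end, moved), where moved is the signed
    distance of the final placement from the anchor.
    """
    # Pass 1: push each contig right until it clears its left neighbour.
    placed = []
    prev_end = None
    for name, start, end in anchors:
        shift = 0
        if prev_end is not None and start < prev_end + offset:
            shift = prev_end + offset - start
        placed.append((name, start + shift, end + shift, shift))
        prev_end = end + shift

    # Pass 2: a group is an unpushed contig plus the pushed contigs after it.
    # Pull the whole group left by half its largest shift (an assumption),
    # without crowding the previous group or dropping below coordinate 1.
    result = []
    i = 0
    while i < len(placed):
        j = i + 1
        while j < len(placed) and placed[j][3] > 0:
            j += 1
        group = placed[i:j]
        pull = max(c[3] for c in group) // 2
        if result:
            pull = min(pull, group[0][1] - (result[-1][2] + offset))
        pull = max(min(pull, group[0][1] - 1), 0)
        for name, start, end, shift in group:
            result.append((name, start - pull, end - pull, shift - pull))
        i = j
    return result


# Three hypothetical contigs anchored to the same spot, plus one far away
contig_anchors = [
    ("ctg1", 100000, 150000),
    ("ctg2", 100000, 130000),
    ("ctg3", 100000, 120000),
    ("ctg9", 900000, 950000),
]
placements = spread_contigs(contig_anchors)
greatest_move = max(abs(p[3]) for p in placements)
```

Pulling the group back as a unit keeps the spacing established in the first pass, so no overlap check is needed inside the group. The moved column supports the kind of "greatest distance moved" check mentioned above.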

spread_clones.pl was written to spread out the clones based on their locations within the contigs. The input files were sorted_contig2clone_relations.txt and sorted_contig_placements.txt. The output file was clone_placements.txt, which was sorted to sorted_clone_placements.txt. An error file, spread_clones_errors.txt, was written for clones which matched contigs whose locations on MLGs were not known.
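A minimal sketch of this step in Python follows. It is not spread_clones.pl: it assumes each contig-to-clone relation carries the clone's left and right positions within the contig measured in bands, and that a base coordinate is obtained by scaling with the average band size reported earlier. The record layout and all names are hypothetical.

```python
AVE_BAND_SIZE = 3881.89350119645  # bases per band, from the text above


def spread_clones(contig2clone, contig_starts, band_size=AVE_BAND_SIZE):
    """Place clones relative to the placed start of their contig.

    contig2clone: iterable of (contig, clone, left_band, right_band).
    contig_starts: dict mapping contig name to its placed start coordinate.
    Returns (placements sorted by start, clones whose contig is unplaced).
    """
    placements, errors = [], []
    for contig, clone, left_band, right_band in contig2clone:
        if contig not in contig_starts:
            errors.append(clone)  # contig location unknown
            continue
        base = contig_starts[contig]
        start = base + round(left_band * band_size)
        end = base + round(right_band * band_size)
        placements.append((clone, start, end))
    return sorted(placements, key=lambda p: p[1]), errors


# Hypothetical relations; a round band size keeps the arithmetic easy to follow
clone_rows = [
    ("ctg1", "cloneB", 5, 20),
    ("ctg1", "cloneA", 0, 10),
    ("ctg7", "cloneC", 0, 8),
]
placed_clones, unplaced = spread_clones(clone_rows, {"ctg1": 59000}, band_size=4000)
```

Clones on a contig with no placement end up in the error list, mirroring the role of spread_clones_errors.txt.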

A sorted locus-to-clone relationship file was needed in order to spread the loci. locus2clone.pl was written to read clone2locus_relations.txt and rearrange the fields so that the information could be sorted by loci. The results were written to locus2clone_relations.txt which was sorted to sorted_locus2clone_relations.txt.
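The field rearrangement can be sketched in a few lines of Python. This is an illustration, not locus2clone.pl, and it assumes each line of clone2locus_relations.txt holds a clone name and a locus name separated by a tab; the actual file layout is not given on this page.

```python
def locus2clone(clone2locus_lines):
    """Swap 'clone<TAB>locus' lines to 'locus<TAB>clone' and sort by locus."""
    pairs = []
    for line in clone2locus_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        clone, locus = line.split("\t")
        pairs.append((locus, clone))
    return ["\t".join(pair) for pair in sorted(pairs)]


# Hypothetical input lines
swapped = locus2clone(["cloneB\tloc2\n", "cloneA\tloc2\n", "cloneA\tloc1\n"])
```

Sorting inside the function reflects the later adjustment in which sorting was folded into the scripts instead of being run as a separate Unix command.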

spread_loci.pl was written to spread out the loci based on their relationships, via the clones, with the contigs. It read sorted_clone_placements.txt and sorted_locus2clone_relations.txt and wrote to locus_placements.txt, which was sorted to sorted_locus_placements.txt.

create_gff.pl was written to take the available information and write a GFF file. It read mlg.gff, sorted_locus_placements.txt, sorted_clone_placements.txt, and sorted_contig_placements.txt and produced soybean.gff.025, which was loaded into GBrowse.
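Writing one placed feature as a GFF line can be sketched as follows. This Python illustration is not create_gff.pl. It assumes the nine-column, tab-separated GFF layout, with a dot for unused score, strand, and frame columns. The source label, feature type names, and the form of the ninth column are hypothetical choices made for the example.

```python
def gff_line(mlg, feature, name, start, end, source="example_source"):
    """Format one feature as a nine-column, tab-separated GFF line."""
    group = f'{feature} "{name}"'  # assumed form for the ninth column
    columns = [mlg, source, feature, str(start), str(end), ".", ".", ".", group]
    return "\t".join(columns)


# Hypothetical placements on a linkage group called A1
lines = [
    gff_line("A1", "contig", "ctg1", 59000, 109000),
    gff_line("A1", "clone", "cloneA", 59000, 99000),
]
```

Keeping the formatting in one function means contigs, clones, and loci all pass through the same column layout before being written out.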

The table below shows the relationships between the data files and the Perl scripts.

Perl script          Input files                          Output files
-----------          -----------                          ------------
Examine FPC          fpc2.txt (given)                     sorted_clone2locus_relations.txt
                                                          sorted_locus2clone_relations.txt
                                                          sorted_contig2clone_relations.txt

Drop Clone Anchors   sorted_locus_anchors.txt (given)     sorted_clone_anchors.txt
                     sorted_clone2locus_relations.txt

Get Ave. Band Size   sorted_contig2clone_relations.txt

Drop Contig Anchors  sorted_contig2clone_relations.txt    sorted_contig_anchors.txt
                     sorted_clone_anchors.txt

Spread Contigs       sorted_contig_anchors.txt            sorted_contig_placements.txt

Spread Clones        sorted_contig2clone_relations.txt    sorted_clone_placements.txt
                     sorted_contig_placements.txt

Spread Loci          sorted_locus2clone_relations.txt     sorted_locus_placements.txt
                     sorted_clone_placements.txt

Create GFF           mlg.gff (given)                      soybean.gff
                     sorted_contig_placements.txt
                     sorted_clone_placements.txt
                     sorted_locus_placements.txt

How to read the above table: The cells indicate which data files are input and output by each Perl script. For example, Spread Clones (spread_clones.pl) inputs sorted_contig2clone_relations.txt and sorted_contig_placements.txt. It outputs sorted_clone_placements.txt.
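The table can also be treated as data and checked for consistency. The Python sketch below records each script's inputs and outputs as listed in the table and verifies that every input is either a given file or the output of an earlier script. The check itself is an illustration and was not part of the Round 2 scripts.

```python
GIVEN = {"mlg.gff", "sorted_locus_anchors.txt", "fpc2.txt"}

# (script, inputs, outputs), in the order the table lists them
PIPELINE = [
    ("Examine FPC", ["fpc2.txt"],
     ["sorted_clone2locus_relations.txt",
      "sorted_locus2clone_relations.txt",
      "sorted_contig2clone_relations.txt"]),
    ("Drop Clone Anchors",
     ["sorted_locus_anchors.txt", "sorted_clone2locus_relations.txt"],
     ["sorted_clone_anchors.txt"]),
    ("Get Ave. Band Size", ["sorted_contig2clone_relations.txt"], []),
    ("Drop Contig Anchors",
     ["sorted_contig2clone_relations.txt", "sorted_clone_anchors.txt"],
     ["sorted_contig_anchors.txt"]),
    ("Spread Contigs", ["sorted_contig_anchors.txt"],
     ["sorted_contig_placements.txt"]),
    ("Spread Clones",
     ["sorted_contig2clone_relations.txt", "sorted_contig_placements.txt"],
     ["sorted_clone_placements.txt"]),
    ("Spread Loci",
     ["sorted_locus2clone_relations.txt", "sorted_clone_placements.txt"],
     ["sorted_locus_placements.txt"]),
    ("Create GFF",
     ["mlg.gff", "sorted_contig_placements.txt",
      "sorted_clone_placements.txt", "sorted_locus_placements.txt"],
     ["soybean.gff"]),
]


def missing_inputs(pipeline, given):
    """Return (script, file) pairs whose input is not yet available."""
    available = set(given)
    missing = []
    for script, inputs, outputs in pipeline:
        missing.extend((script, f) for f in inputs if f not in available)
        available.update(outputs)
    return missing


problems = missing_inputs(PIPELINE, GIVEN)
```

An empty result means the scripts can be run top to bottom in the order shown.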

Some minor adjustments were made to the above Perl scripts. The distance between contigs, the OFFSET, was increased to 103,812 bases so that clones would not overlap into neighboring contigs. Also, wherever sorting was needed, it was made part of the Perl script so that it would not have to be done manually via a Unix command.

What still needs to be done:

  • Features that are unrelated to contigs need to be placed and added to the GFF file.
  • Comments need to be added to identify the clones which are contig anchors.
  • Check for locus conflicts.

A Venn Diagram was made in order to assist in determining how to do the items on the above list. This Venn Diagram is here. The colors discussed here come from the Venn Diagram. One category of loci was processed in Round 2: Black Loci which matched clones which matched contigs. Two categories of clones were processed in Round 2: Black Clones that matched both loci and contigs; and, some Orange Clones, when they matched a contig that matched another clone which matched a locus. One category of contigs was processed in Round 2: Black ones which matched a clone which matched a locus.

What was not processed in Round 2: Two categories of loci were not processed: Blue Loci which did not match any clones; and, Purple Loci which matched clones which did not match contigs. Three categories of clones were not processed: Red Clones which did not match anything; Purple Clones which only matched loci; and, Orange Clones which only matched a contig (when that contig was not anchored by any other clones). One category of contig was not processed: Orange Contigs which matched clones which did not match loci.

 

Obviously, there is one situation which the Venn Diagram cannot legally handle: There are two kinds of Orange Clones. One kind of Orange Clone matches a contig which cannot be anchored. Another type of Orange Clone matches a contig which matches another clone which matches a locus, and thus can be anchored.
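The clone categories, including the two kinds of Orange Clones, can be sketched as a small classifier. This Python illustration follows the color definitions given above. The split labels "orange (anchored)" and "orange (unanchored)" and the input structures are hypothetical, introduced only to make the distinction explicit.

```python
def classify_clones(clones, clone_loci, clone_contig):
    """Assign each clone a Venn Diagram color.

    clones: list of clone names.
    clone_loci: dict mapping clone to the set of loci it matches.
    clone_contig: dict mapping clone to its contig (absent if none).
    """
    # Contigs that contain at least one clone which matches a locus.
    anchored = {clone_contig[c] for c in clones
                if clone_loci.get(c) and clone_contig.get(c)}
    colors = {}
    for c in clones:
        has_locus = bool(clone_loci.get(c))
        contig = clone_contig.get(c)
        if has_locus and contig:
            colors[c] = "black"
        elif has_locus:
            colors[c] = "purple"
        elif contig:
            colors[c] = ("orange (anchored)" if contig in anchored
                         else "orange (unanchored)")
        else:
            colors[c] = "red"
    return colors


# Hypothetical clones
names = ["c1", "c2", "c3", "c4", "c5"]
loci = {"c1": {"loc1"}, "c4": {"loc2"}}
contigs = {"c1": "ctg1", "c2": "ctg1", "c3": "ctg2"}
colors = classify_clones(names, loci, contigs)
```

Here c2 shares a contig with a Black Clone and so can be anchored through it, while c3 sits on a contig that no locus reaches.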

 

The Round 2 output needs to distinguish between the Black Clones and the Orange ones which are anchored by contigs, so that the Black ones can be used for further visual analysis. The Round 2 Perl scripts cannot be readily adapted to do this. Furthermore, the other categories of features need to be distinguished and processed. These will be issues for Round 3.

 

              Deepak
http://soybeangenome.siu.edu
Last update: July 31, 2005.