|
This page documents Round 3
of converting soybean genomic data so that it can be
displayed in GBrowse. Round 3 refers to the
programming method and not to the data.
Round 3 Update.
Distinguishing characteristics of Round 3:
- The programming consolidates and streamlines
Round 2.
- The data is divided into sets for
description and manipulation. See
ontology.
There are two types of data source: FPC
results and loci locations. The FPC results are used
to set up relationships between loci, clones, and
contigs. The loci locations are then used via these
relationships to place these features on a GBrowse
physical map.
Putting the Loci Locations into a Standard Format
Three sources of loci locations were available:
- input.txt was the original loci data used in
Round 1
- raw_loci.txt is from a file supplied by
Soybase at Iowa State University.
- usda.txt is a tab-delimited file created
from a March, 2003, USDA spreadsheet.
The above three files were in different formats.
Furthermore, future loci data files will also likely
be in different formats. For consistency, the loci
locations were changed to the format used by
sorted_locus_anchors.txt in Round 2. Any one of these
three files, or a combination of them, could be used
at a time: whatever was written to
sorted_locus_anchors.txt was the data used downstream.
input.txt was converted to
sorted_locus_anchors.txt as described in Round
2.
extract_loci.plx was used to convert the data
from raw_loci.txt, and extract_loci2.plx was used
to extract the data from usda.txt. Note the new use
of the .plx file extension, which stands for PerL
eXecutable and was adopted to conform with the
style preferred by the Perl community.
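The conversion step can be illustrated with a minimal sketch. It is
written in Python for brevity and is not the extract_loci.plx code. The
input column order (locus name, linkage group, start, end), the sort
order, and the output column order are all assumptions made for
illustration. The only detail taken from this page is that the anchor
file carries start, end, and midpoint values.

```python
import csv
import io


def normalize_loci(tsv_text):
    """Turn tab-delimited loci rows into (group, name, start, end, midpoint).

    Assumed (hypothetical) input columns: name, linkage group, start, end.
    """
    rows = []
    for rec in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not rec or rec[0].startswith("#"):
            continue  # skip blank lines and comment lines
        name, group = rec[0], rec[1]
        start, end = float(rec[2]), float(rec[3])
        if start > end:
            start, end = end, start  # tolerate reversed coordinates
        rows.append((group, name, start, end, (start + end) / 2))
    # Assumed sort order: linkage group, then start position, then name.
    rows.sort(key=lambda r: (r[0], r[2], r[1]))
    return rows


def format_anchors(rows):
    """Render rows as tab-delimited text: name, group, start, end, midpoint."""
    return "\n".join(
        f"{name}\t{group}\t{start:.2f}\t{end:.2f}\t{mid:.2f}"
        for group, name, start, end, mid in rows
    )
```

Each source file would need its own small reader that maps its columns
onto this common record, which is why two separate extraction scripts
were used.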
Each of the datasets was tried, as well as
combinations of them. The USDA dataset was then
chosen as the one to continue with.
Putting the FPC Results into Usable Formats
Three sources of FPC results were available:
- The original fpc.txt file used in Round 1
(Version 3 data).
- The fpc2.txt file used in Round 2 (Version 4
data).
- webdata.fpc, an alternative build of Version
3 data.
examine_fpc_file.plx was updated from
examine_fpc_file.pl to handle the different
naming conventions of webdata.fpc (such as
putting question marks after MLG names).
Whichever FPC file is given to
examine_fpc_file.plx, the following output files
are created: sorted_clone2locus_relations.txt,
sorted_locus2clone_relations.txt, and
sorted_contig2clone_relations.txt.
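The shape of these three outputs can be sketched as follows. This is a
Python illustration, not the logic of examine_fpc_file.plx. It assumes
the FPC file has already been parsed into one record per clone, and the
record keys (name, contig, start_band, end_band, loci) are hypothetical
names chosen for this example.

```python
def build_relations(clones):
    """Derive three sorted relation tables from parsed clone records.

    clones: list of dicts with hypothetical keys
    name, contig, start_band, end_band, loci.
    """
    clone2locus = set()
    contig2clone = []
    for c in clones:
        for locus in c["loci"]:
            clone2locus.add((c["name"], locus))
        if c["contig"] is not None:  # clones outside any contig are skipped
            contig2clone.append(
                (c["contig"], c["name"], c["start_band"], c["end_band"])
            )
    clone2locus = sorted(clone2locus)
    # Same pairs as clone2locus, keyed and sorted by locus instead of clone.
    locus2clone = sorted((locus, clone) for clone, locus in clone2locus)
    contig2clone.sort()
    return clone2locus, locus2clone, contig2clone
```

Keeping both clone-to-locus and locus-to-clone orderings trades a little
disk space for simple sequential lookups in either direction.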
Putting It All Together
After preparing the chosen loci location data and
the chosen FPC output data, these were the files
ready to be put together into a GFF file:
- mlg.gff contained the MLG data from Round 2.
It was put into the new GFF file without any
changes.
- qtl.gff contained the QTL data from Round 2.
It was also put into the new GFF file without
any changes.
- sorted_locus_anchors.txt contained the locus
anchor positions from above. This is equal to
the set of Σ Loci. These were the bases for the
locations of the clones and the contigs. The
numeric fields in this file are for start, end,
and midpoints of the anchor locations.
- sorted_clone2locus_relations.txt contained
the relationships between clones and loci
obtained from the chosen FPC file. This is equal
to the sets of Σβ Loci, γ Loci, Σβ1
Clones, and γ Clones.
- sorted_locus2clone_relations.txt contained
the relationships between clones and loci
obtained from the chosen FPC file. This is the
same data as the previous file, but it was
sorted by loci instead of by clones.
- sorted_contig2clone_relations.txt contained
the relationships between contigs and clones
obtained from the chosen FPC file. This is equal
to the sets of Σβ2 Clones, γ Clones,
and Σ Contigs. The numeric fields refer to
starting and ending bands.
place_features.plx took the above files as input
and created a new GFF file.
Contig conflicts were spread out using the
Boats-on-a-Lake Algorithm.
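How anchors and relations might combine into GFF lines is sketched below
in Python. This is not the place_features.plx implementation, and it
does not attempt the Boats-on-a-Lake conflict spreading. It assumes that
a clone spans the anchors of all loci it is related to, that positions
are integers, and that output follows the nine-column GFF2 layout with a
Clone "name" group field. The source tag round3 and the function names
are invented for this example.

```python
def place_clones(anchors, locus2clone):
    """Give each clone a span covering the anchors of its loci.

    anchors: {locus: (reference, start, end)}
    locus2clone: iterable of (locus, clone) pairs
    """
    spans = {}
    for locus, clone in locus2clone:
        if locus not in anchors:
            continue  # locus has no known map position
        ref, start, end = anchors[locus]
        key = (clone, ref)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


def to_gff(spans, source="round3", feature="clone"):
    """Format spans as tab-delimited, nine-column GFF2-style lines."""
    lines = []
    for (clone, ref), (start, end) in sorted(spans.items()):
        fields = [ref, source, feature, str(start), str(end),
                  ".", ".", ".", f'Clone "{clone}"']
        lines.append("\t".join(fields))
    return lines
```

Contigs could be placed the same way from their member clones, after
which overlapping contigs would still need a separate spreading step.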
Summary
Round 3 put the loci locations into a standard
format, put the relevant FPC data into a usable
format, and combined all of this into a GFF file |