This page documents Round 2 of
converting soybean genomic data so that it can be
displayed in GBrowse.
Distinguishing characteristics
of Round 2:
The second FPC build, provided by J. Shultz at SIU,
was put into
fpc2.txt.
Here is a summary of the process described below:
- Preparing Data
- Original marker data was obtained
- New data was extracted from FPC file
- Dropping Anchors
- Loci anchors were noted
- Clone anchors were dropped
- Contig anchors were dropped
- Placing Features
- Contigs were spread apart to avoid conflicts
- Clones were placed based on contig locations
- Loci were placed based on clone locations
- Creating the GFF file
- A GFF file was created based upon the above
placements
A table at the bottom of this page shows the
relationships between the data files and the Perl
scripts which are discussed herein.
The marker data used was the same as before. It was
taken from
Round 1, Step 1, Phase 2 and put into
soybean.gff.
These were the original locations of the markers, before
they were moved to avoid conflicts. For better
accessibility, soybean.gff was split into
mlg.gff,
loci.gff,
and qtl.gff.
The loci GFF data was rewritten by
adjust_loci_file.pl into
locus_anchors.txt and then sorted into
sorted_locus_anchors.txt so that it would be
easier to look up. The numeric fields are
for start, end, and midpoint.
Note that these were not necessarily the locations of
the loci. Just as a boat can float a distance away from
its anchor, a locus location can also be a distance away
from its anchor. This is necessary because loci cannot
be stacked on top of each other; and also because the
locations of clones and contigs influence the locations
of some loci.
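The rewrite of the loci data into anchors can be sketched roughly as
follows. This is a Python illustration, not the original Perl. The
column layouts of loci.gff and locus_anchors.txt are not given on this
page, so the nine-column tab-separated layout, the function name
locus_anchors, and the sample rows are all assumptions.

```python
def locus_anchors(gff_lines):
    """Return sorted (name, start, end, midpoint) tuples from GFF-like lines."""
    anchors = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        start, end = int(cols[3]), int(cols[4])
        name = cols[8]  # assumption: the last column holds the locus name
        anchors.append((name, start, end, (start + end) // 2))
    return sorted(anchors)  # sorted by locus name for later lookups

# Hypothetical sample rows, not real soybean data.
sample = [
    "A1\tmarker\tlocus\t1000\t2000\t.\t+\t.\tlocus_b",
    "A1\tmarker\tlocus\t500\t700\t.\t+\t.\tlocus_a",
]
anchors = locus_anchors(sample)
```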
examine_fpc_file.pl was written to examine the FPC
file and write files of relationships between loci,
clones, and contigs. These output files were
clone2locus_relations.txt and
contig2clone_relations.txt. They were sorted into
sorted_clone2locus_relations.txt and
sorted_contig2clone_relations.txt.
Note that this was starting to look like a relational
database. There were relations between loci and their
locations; between loci and clones; and, between clones
and contigs. Each of these relations was in a
separate file, and the Perl scripts served as the database
manager.
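To make the relational view concrete, here is a small Python sketch
that joins the three relations in memory. The dictionaries stand in
for the sorted relation files; their contents and the function name
loci_for_contig are made up for illustration and say nothing about the
real file layouts.

```python
# Hypothetical stand-ins for the three relation files.
locus_location = {"locus_a": 600, "locus_b": 1500}
clone_to_loci = {"clone_1": ["locus_a", "locus_b"], "clone_2": ["locus_b"]}
contig_to_clones = {"ctg_1": ["clone_1", "clone_2"]}

def loci_for_contig(contig):
    """Join contig -> clones -> loci -> locations, like a three-table join."""
    rows = []
    for clone in contig_to_clones.get(contig, []):
        for locus in clone_to_loci.get(clone, []):
            if locus in locus_location:  # skip loci with no known location
                rows.append((contig, clone, locus, locus_location[locus]))
    return rows

rows = loci_for_contig("ctg_1")
```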
drop_clone_anchors.pl was written to anchor the
clones. It read from
sorted_clone2locus_relations.txt and
sorted_locus_anchors.txt in order to determine the
anchor locations. The output was written to
clone_anchors.txt, which was sorted to
sorted_clone_anchors.txt. Again, as with the loci,
the anchor locations were not necessarily where the
clones would actually be placed. That depended upon
their relative locations within contigs. An error file,
drop_clone_anchors_errors.txt, was also output in
order to document which clones had errors during
processing.
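A minimal Python sketch of this step is shown here. The page does not
say how a clone's anchor was derived from its loci, so averaging the
locus midpoints is an assumption, as are the function name and the
sample data. The sketch also ignores which linkage group each locus is
on.

```python
def drop_clone_anchors(clone_to_loci, locus_midpoint):
    """Return ({clone: anchor}, [clones that could not be anchored])."""
    anchors, errors = {}, []
    for clone, loci in sorted(clone_to_loci.items()):
        known = [locus_midpoint[name] for name in loci if name in locus_midpoint]
        if not known:
            errors.append(clone)  # would go to the error file
            continue
        # Assumption: anchor the clone at the mean of its locus midpoints.
        anchors[clone] = sum(known) // len(known)
    return anchors, errors

anchors, errors = drop_clone_anchors(
    {"clone_1": ["locus_a", "locus_b"], "clone_2": ["locus_x"]},
    {"locus_a": 600, "locus_b": 1500},
)
```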
An average band size was needed before the contig
anchors could be dropped.
get_ave_clone_band_size.pl was written to compute
it from sorted_contig2clone_relations.txt.
(The answer was 3,881.89350119645
bases.) The average clone size of 145,000 bases was
provided by Dr. Lightfoot at SIU.
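The page does not state the formula that was used. One plausible
reading, sketched in Python here, is that the average clone size was
divided by the mean number of bands per clone. Both that formula and
the sample band counts are assumptions; only the 145,000 figure comes
from the text.

```python
AVERAGE_CLONE_SIZE = 145_000  # bases, as stated in the text

def average_band_size(bands_per_clone):
    """Assumed formula: average clone size / mean band count per clone."""
    mean_bands = sum(bands_per_clone) / len(bands_per_clone)
    return AVERAGE_CLONE_SIZE / mean_bands

# Hypothetical band counts for three clones.
size = average_band_size([30, 40, 50])
```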
drop_contig_anchors.pl was written to anchor the
contigs. It read from
sorted_contig2clone_relations.txt and
sorted_clone_anchors.txt in order to determine the
anchor locations. The output was written to
contig_anchors.txt and
drop_contig_anchors_errors.txt. Also,
sorted_contig_anchors.txt was created. Once again,
an anchor location only indicated where its contig was
being pulled toward, not where the contig was actually
placed. The fields in the main (non-error) output file
gave start and end (as opposed to midpoint)
because the contigs had variable lengths. The last field
indicated how many clones anchored the corresponding
contig.
spread_contigs.pl was written to spread out the
contigs. The issue was that in many cases more than one
contig was anchored to the same spot. In other cases,
the anchors were so close that the contigs, if placed
exactly at their anchors, would overlap, which was not possible.
Imagine several boats anchored to the same spot. These
boats would spread out, because they could not occupy
the same spot. Then imagine that the ropes to the
anchors were pulled in so that each boat was pulled
closer to its anchor. For an illustration of this, see
The
Boats-on-a-Lake Algorithm. This is conceptually what
spread_contigs.pl accomplished with the contigs:
it spread them out, and then pulled them towards their
anchors. To make this easier from a programming
standpoint, the contigs were spread out to the right,
and then the whole group of them pulled back part way to
the left. The input file was
sorted_contig_anchors.txt and the output file was
sorted_contig_placements.txt. The last two columns
of the output file were used to determine the greatest
distance that any contig was moved during this process.
While spreading the contigs, a constant, OFFSET, was
used to make sure that the contigs were each at least
1000 bases from each other. This turned out to be too
little spacing, because clones in neighboring contigs
ended up clustered too closely together. In
the next run, this offset will be greater.
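The spread-right, pull-back-left idea can be sketched in Python as
follows. This is not the original spread_contigs.pl. The tuple layout,
the rule that a "group" is a run of contigs that pushed each other,
and the choice to pull each group back by half of its largest shift
are all assumptions; only the 1000-base OFFSET comes from the text.

```python
OFFSET = 1000  # minimum gap in bases between neighboring contigs

def spread_contigs(anchors):
    """anchors: (name, start, end) tuples sorted by start, one linkage group.
    Returns (name, start, end, shift); shift is placement minus anchor."""
    # Pass 1: push each contig right until it clears its left neighbor.
    clusters, prev_end = [], None
    for name, start, end in anchors:
        new_start = start
        if prev_end is not None and start < prev_end + OFFSET:
            new_start = prev_end + OFFSET
        shift = new_start - start
        item = [name, new_start, new_start + (end - start), shift]
        if shift > 0:
            clusters[-1].append(item)   # pushed: joins the current group
        else:
            clusters.append([item])     # not pushed: starts a new group
        prev_end = item[2]
    # Pass 2: pull each group back left by half its largest shift,
    # without crossing coordinate 1 or crowding the previous group.
    placed, floor = [], 1
    for cluster in clusters:
        pull = max(item[3] for item in cluster) // 2
        pull = max(0, min(pull, cluster[0][1] - floor))
        for name, start, end, shift in cluster:
            placed.append((name, start - pull, end - pull, shift - pull))
        floor = placed[-1][2] + OFFSET
    return placed

# Hypothetical anchors: three contigs crowd one spot, one stands alone.
placed = spread_contigs([
    ("ctg_a", 5000, 9000), ("ctg_b", 5000, 8000),
    ("ctg_c", 6000, 7000), ("ctg_d", 50000, 52000),
])
greatest_move = max(abs(row[3]) for row in placed)
```

Pulling a whole group by one amount keeps the OFFSET gaps inside the
group intact, which is why the sketch never has to re-check for
overlaps after the second pass.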
spread_clones.pl was written to spread out the
clones based on their locations within the contigs. The
input files were sorted_contig2clone_relations.txt
and sorted_contig_placements.txt. The output
file was
clone_placements.txt which was sorted to
sorted_clone_placements.txt. An error file,
spread_clones_errors.txt, was written for clones
which matched contigs whose locations on MLGs were not
known.
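A rough Python sketch of this placement step is given here. It assumes
that each clone has a left and right position within its contig,
measured in bands, and that a band position is converted to bases by
multiplying by a band size. Those assumptions, the function name, and
the sample data are not taken from the original script.

```python
def place_clones(contig_to_clones, contig_start, band_size):
    """contig_to_clones: {contig: [(clone, left_band, right_band), ...]}.
    Returns (sorted placements, clones whose contig has no placement)."""
    placements, errors = [], []
    for contig, clones in contig_to_clones.items():
        if contig not in contig_start:
            # Contig location unknown: these clones go to the error file.
            errors.extend(name for name, _, _ in clones)
            continue
        base = contig_start[contig]
        for name, left_band, right_band in clones:
            placements.append((name,
                               base + round(left_band * band_size),
                               base + round(right_band * band_size)))
    return sorted(placements), errors

# Hypothetical data; 4000 is a round stand-in for the average band size.
placements, errors = place_clones(
    {"ctg_1": [("clone_1", 0, 10), ("clone_2", 5, 20)],
     "ctg_9": [("clone_3", 0, 8)]},
    {"ctg_1": 1000},
    band_size=4000,
)
```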
A sorted locus-to-clone relationship file was needed
in order to spread the loci.
locus2clone.pl was written to read
clone2locus_relations.txt and rearrange the fields
so that the information could be sorted by loci. The
results were written to
locus2clone_relations.txt which was sorted to
sorted_locus2clone_relations.txt.
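The field swap and sort amount to very little code. The Python sketch
here assumes two tab-separated fields per line, clone first and locus
second; the real layout of clone2locus_relations.txt is not shown on
this page, and the sample rows are made up.

```python
def locus2clone(lines):
    """Turn 'clone<TAB>locus' rows into sorted 'locus<TAB>clone' rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        clone, locus = line.split("\t")
        rows.append((locus, clone))
    return ["\t".join(row) for row in sorted(rows)]

swapped = locus2clone([
    "clone_2\tlocus_b",
    "clone_1\tlocus_b",
    "clone_1\tlocus_a",
])
```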
spread_loci.pl was written to spread out the loci
based on their relationships, via the clones, with the
contigs. It read sorted_clone_placements.txt
and sorted_locus2clone_relations.txt and wrote
to
locus_placements.txt, which was sorted to
sorted_locus_placements.txt.
create_gff.pl was written to take the available
information and write a GFF file. It read mlg.gff,
sorted_locus_placements.txt,
sorted_clone_placements.txt, and
sorted_contig_placements.txt and produced
soybean.gff.025, which was loaded into GBrowse.
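For reference, a GFF line has nine tab-separated columns: reference
sequence, source, method, start, end, score, strand, phase, and group.
The Python sketch here formats one such line. The source and method
values, the group text, and the helper name gff_line are illustrative
assumptions and are not taken from create_gff.pl.

```python
def gff_line(ref, source, method, start, end, group):
    """Format one nine-column GFF line; score, strand, and phase are '.'."""
    return "\t".join(
        [ref, source, method, str(start), str(end), ".", ".", ".", group]
    )

# Hypothetical contig placed at 1000..5000 on linkage group A1.
line = gff_line("A1", "fpc", "contig", 1000, 5000, "Contig ctg_a")
```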
The table below shows the relationships between the
data files and the Perl scripts.