Bio lab 3 Solution

Input:

·         Input files: one or more GFF files

·         Reference files: FASTA file(s) of the entire sequence of contigs referenced by the GFF file(s). (note: Number values in columns 4 and 5 of the GFF files indicate a nucleotide position in the respective reference file)

·         see attachment for how to interpret a GFF file

 

Processing Instructions:

For each input file calculate:

1.      Average gene span per contig/fosmid. Average number of nucleotides between the highest position indicated and the lowest position indicated in all features with the same gene_id in column 9

2.      Average length of CDS per contig/fosmid. Average coding DNA sequence size, sum of the CDS regions with the same gene_id (see Figure 1 below)

3.      Average size of exon per contig/fosmid. Average number of nucleotides between the first position and the last position indicated in features marked CDS in the GFF files.

4.      Average size of intron per contig. Average number of nucleotides between the last position of a CDS region and the first position of the subsequent CDS region with the same transcript_id (see Figure 1 below)

5.      Average intergenic region size per contig. Average number of nucleotides between the last nucleotide of the stop codon and the first nucleotide of the next gene (distance between subsequent gene spans with different gene_id’s in column 9 of the GFF file, see Figure 2 below).

6.      Average density of nucleotides in a CDS span per contig. Average number of nucleotides in each “Coding DNA Sequence,” multiplied by the number different gene_id’s, divided by the total number of nucleotides in the entire reference sequence.

7.      Average density of nucleotides in a CDS region per contig. Average number of nucleotides in each CDS region, multiplied by the number different gene_id’s, divided by the total number of nucleotides in the entire reference sequence

8.      Proportion of CDS nucleotides per contig. Total number of nucleotides in all CDS regions divided by total number of nucleotides in the reference sequence. If one or more CDS regions overlap, choose the longest CDS region for calculations (see Figure 3 below)

9.      Average Number of Genes per 10KB total number of gene spans with the same gene_id in column 9 divided by total number of nucleotides in the reference sequence multiplied by 10^4 (1 KB = 10^3 nucleotides)

10.  KB’s per Gene total number of nucleotides in the reference sequence in KB’s (1 KB = 10^3 nucleotides) divided by the number of gene spans with different gene_id’s in column 9

11.  Predicted protein sequence use features with identical labels in column 9 and marked CDS to obtain nucleotide sequence to be used, join these sequences in order of lowest position to highest position indicated by these features. If column 7 is marked + then proceed to translation, if column 7 is marked - then the reverse complement needs to be created before translation. Translate the appropriate sequence using chart in Figure 4 below



Figure 1. Gene Span vs. Coding DNA Sequence Size



Figure 2. Intergenic Region



Figure 3. In the Case of Overlapping Exons/CDS Regions



Figure 4. Codon Translation Chart

 

 

 

 

 

 

 

 

 

Output instructions:

·         data should be organized as follows:



 
Input 1
Input 2
Average gene length
 
 
Average length of CDS
 
 
Average size of exon
 
 
Average size of intron
 
 
Average intergenic region size
 
 
Average density of nucleotides in a CDS span
 
 
Average density of nucleotides in a CDS region
 
 
Proportion of CDS nucleotides
 
 
Genes per 10 KB
 
 
KB's per Gene
 
 
Predicted Protein Sequence
 
 

 

GFF Format:

Column 1: sequence name (example: fosmid 10, contig 18, etc.)

Column 2: name of the program that generated the sequence (database, research project name, etc)

Column 3: feature type (mRNA = entire coding DNA sequence of all exons with no stop codon, CDS = DNA sequence of an individual exon including any stop codons)

Column 4: number/position of the first nucleotide in the indicated feature if numbering starts at the beginning of the reference sequence and starts with 1

Column 5: number/position of the last nucleotide in the indicated feature if numbering starts at the beginning of the reference sequence and starts with 1

Column 6: score = floating point value... I don’t actually know what that means, and the example we were given shows them all as blank...

Column 7: which strand the indicated feature is on; + = forward strand, - = reverse strand

Column 8: frame number; 0 = first base of the indicated feature is the first base of a codon, 1 = second base of the indicated feature is the second base of a codon, 2 = third base of the indicated feature is the first base of a codon

Column 9: provides additional information about each feature, with information surrounded by quotation marks and each piece is separated by a semicolon and the field ends with a semicolon. gene_id = gene name, transcript_id = isoform of the gene (ex: gene_id “CG7970”; transcript_id “CG7970-RB”)

 

Additional information to keep in mind:

each row is called a feature
fields must be tab-separated
all fields within each feature line must contain a value
if a column needs to be “empty” then it is denoted by a “.”
file may contain additional features with feature names “cigar” and “score,” these are to be ignored
features may not be listed in any order
 

 

Example:

fosmid10    .    mRNA    736    4493    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RA”;

fosmid10    .    CDS    3548    4493    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RA”;

fosmid10    .    CDS    733    3491    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RA”;

fosmid10    .    mRNA    736    4493    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RB”;

fosmid10    .    CDS    3548    4493    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RB”;

fosmid10    .    CDS    733    3491    .    -    .    gene_id “alphaCop”; transcript_id “alphaCop-RB”;

fosmid10    .    mRNA    5056    5448    .    +    .    gene_id “CG13919”; transcript_id “CG13919-RA”;

fosmid10    .    CDS    5056    5451    .    +    .    gene_id “CG13939”; transcript_id “CG13919-RA”;

 IF IN NEED OF THE OTHER BIO LABS PLEASE INBOX  ME
Powered by