Write a Python program to calculate some statistics from this DNA sequence. You will need to initialize a string variable in your script, so your code can start with something like the template code provided here.
Calculate and print the GC content of the DNA
Find the locations of all the ATG codons in the sequence
Print the reverse complement of the sequence
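The three tasks above can be sketched as follows. This is only one possible approach: the sequence shown is a made-up placeholder (use the one from the assignment), the function names are illustrative, and positions are reported 0-based, so add 1 if you want 1-based coordinates.

```python
# Placeholder sequence for illustration only; replace with the assignment's DNA.
dna = "ATGCGCTTAATGGCCTAG"


def gc_content(seq):
    """Return the percentage of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def find_codon(seq, codon="ATG"):
    """Return the 0-based start index of every occurrence of codon."""
    seq = seq.upper()
    positions = []
    start = seq.find(codon)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping matches are not skipped.
        start = seq.find(codon, start + 1)
    return positions


def reverse_complement(seq):
    """Complement each base, then reverse the whole string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


print("GC content: %.2f%%" % gc_content(dna))
print("ATG positions (0-based):", find_codon(dna))
print("Reverse complement:", reverse_complement(dna))
```

This search reports every ATG in the string regardless of reading frame; if only in-frame codons matter, step through the sequence three bases at a time instead.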
Download the genome annotation for rice Chr6 again (or reuse the copy you already have): Rice Chr6 annotation.
Compute the number of gene and exon features in the file using Python
Compute the number of bases which are in genes and in CDS features. Report the % of the chromosome which is coding (i.e. covered by CDS exons). Assume the CDS exons in the GFF file are NON-overlapping for this problem.
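A minimal sketch of the counting step, assuming the annotation is a standard tab-delimited GFF file with nine columns (feature type in column 3, start and end in columns 4 and 5). The example lines and the chromosome length below are invented for illustration; the real chromosome length has to come from your own data.

```python
from collections import Counter


def summarize_gff(lines):
    """Count features by type and total their lengths in bases."""
    counts = Counter()
    bases = Counter()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        feature = cols[2]
        start, end = int(cols[3]), int(cols[4])
        counts[feature] += 1
        # GFF coordinates are 1-based and inclusive, hence the +1.
        bases[feature] += end - start + 1
    return counts, bases


# Invented example records; with a real file, pass the open file handle instead.
example = [
    "##gff-version 3",
    "Chr6\tsrc\tgene\t101\t600\t.\t+\t.\tID=g1",
    "Chr6\tsrc\texon\t101\t300\t.\t+\t.\tParent=g1",
    "Chr6\tsrc\tCDS\t151\t300\t.\t+\t0\tParent=g1",
    "Chr6\tsrc\texon\t401\t600\t.\t+\t.\tParent=g1",
    "Chr6\tsrc\tCDS\t401\t500\t.\t+\t0\tParent=g1",
]

counts, bases = summarize_gff(example)
chrom_length = 1000  # hypothetical value, not the real length of rice Chr6
percent_coding = 100.0 * bases["CDS"] / chrom_length

print("genes:", counts["gene"], "exons:", counts["exon"])
print("bases in genes:", bases["gene"], "bases in CDS:", bases["CDS"])
print("percent coding: %.2f%%" % percent_coding)
```

Summing lengths this way is only valid under the stated assumption that the CDS features do not overlap; overlapping features would be counted twice.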
Use the following basic code template
Use the following files: SNPs and annotation.
You will likely need to re-use your solutions from Homework 2.
Using the data files that list the SNPs and the genes, we would like to find the genes with the most changes relative to their length. This will require counting the number of SNPs in each gene and then dividing that count by the length of the gene in kilobases.
Generate a report that has four columns, as in the example below. You will likely need to use BEDTools to generate the information about which genes have SNPs, and you will need to distill that down into a count of SNPs per gene. You will also need to calculate the length of each gene feature. Finally, compute the 4th column by dividing the SNP count by the gene length, expressed in kilobases instead of bases. Here is an example of the expected output.
gene_name length SNP SNPs_per_kb
OS06G0510200 1391 50 35.9453630482
OS06G0487620 553 20 36.1663652803
OS06G0120200 2568 93 36.214953271
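The fourth column can be checked by hand: 50 SNPs in a 1391-base gene is 50 / 1.391, or about 35.95 SNPs per kb. A short sketch of that arithmetic, using the three example rows above as input (the function and variable names are illustrative, not part of the provided template):

```python
def snps_per_kb(snp_count, gene_length):
    """Divide the SNP count by the gene length expressed in kilobases."""
    return snp_count / (gene_length / 1000.0)


# (gene_name, length in bases, SNP count) taken from the example output.
rows = [
    ("OS06G0510200", 1391, 50),
    ("OS06G0487620", 553, 20),
    ("OS06G0120200", 2568, 93),
]

report = [(name, length, snps, snps_per_kb(snps, length))
          for name, length, snps in rows]
# Sort by density so the most SNP-dense genes end up at the bottom.
report.sort(key=lambda row: row[3])

print("gene_name length SNP SNPs_per_kb")
for name, length, snps, density in report:
    print(name, length, snps, density)
```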
There are several ways to solve this. You should try your best to do as much of it as you can in Python, but you can use BEDTools, grep, cut, sort, etc. In particular, you will need to use some of the string functions, like split, to strip out the parts you don't want from that last column.
You should reuse some of your code from the previous problem to read in the gene information and keep track of the length of each gene.
You will need one more Python type, the dictionary, which lets you store values and look them up using a string as the key. The basic code for this is already written; you just need to use it.
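As a rough illustration of the split approach, the sketch below pulls a gene name out of the last GFF column. It assumes that column uses the common `key=value;key=value` layout and that the gene name is stored under `ID`; both are assumptions, so check them against your own file.

```python
def parse_attributes(attr_column):
    """Turn a 'key=value;key=value' string into a dictionary."""
    attrs = {}
    for field in attr_column.strip().split(";"):
        if "=" in field:
            # Split on the first '=' only, in case the value contains one.
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


# Invented attribute string in the assumed layout.
example_column = "ID=OS06G0510200;Name=OS06G0510200;biotype=protein_coding"
gene_name = parse_attributes(example_column)["ID"]
print(gene_name)
```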
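The type described here is Python's dictionary (`dict`). A small sketch of using one to count SNPs per gene, assuming you have already reduced the overlap results to one gene name per SNP hit (the list below is invented, and the provided template may organize this differently):

```python
# Invented input: one gene name for every SNP that falls inside a gene.
snp_hits = ["OS06G0510200", "OS06G0487620", "OS06G0510200"]

snp_counts = {}
for gene in snp_hits:
    # get() returns 0 the first time a gene name is seen.
    snp_counts[gene] = snp_counts.get(gene, 0) + 1

for gene in sorted(snp_counts):
    print(gene, snp_counts[gene])
```

A second dictionary keyed by the same gene names can hold the gene lengths, so the two values can be looked up together when writing the report.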