Due Oct 30 before class starts.
Genome file: [MY FILE]
Number of EcoRI (GAATTC) cut sites: [Number of sites]
...
Here is a list of restriction enzyme (RE) sites. You will need to rewrite some of them as regular expressions, because they contain ambiguity codes such as N and W.
EcoRI = "GAATTC"
Bsu15I = "ATCGAT"
Bsu36I = "CCTNAGG"
BsuRI = "GGCC"
EcoRII = "CCWGG"
The abbreviation for DNA ambiguity patterns can be found on this page.
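As a starting point, the conversion from ambiguity codes to a regular expression can be sketched as below. This is a minimal sketch, not the required solution: the function names `site_to_regex` and `count_sites` are made up for illustration, and the table holds the standard IUPAC nucleotide codes.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def site_to_regex(site):
    """Translate a recognition site such as 'CCWGG' into a regex string."""
    return "".join(IUPAC[base] for base in site.upper())

def count_sites(site, sequence):
    """Count non-overlapping matches of the site in the sequence."""
    pattern = re.compile(site_to_regex(site))
    return len(pattern.findall(sequence.upper()))
```

Note that `findall` counts non-overlapping matches only. A site such as ATCGAT can overlap itself (ATCGATCGAT contains two copies), so counting those would need a lookahead pattern like `(?=ATCGAT)`.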
Use the B. subtilis genome here; there is also a direct link to the strain 168 FTP folder of the assembly. You want the ".fna" file.
You can also use the E. coli K-12 genome from NCBI or what is available here.
The goal is to write generic code, so you are welcome to run it on any genome. Many strains have been sequenced, and it would be interesting to compare whether the number of cut sites (or their size) varies among strains.
You can use this script as a starting point for reading a FASTA file.
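If the linked script is not at hand, a FASTA reader can be sketched roughly as follows. This is an illustrative stand-in, not the course script; the name `read_fasta` is hypothetical, and it assumes a plain (uncompressed) text file.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

Each record is returned separately, so sites are counted per contig and no false site is created where two records would otherwise be joined.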
For the advanced programmers: think about (or try) making your program read a folder of sequence files and produce a report.
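One possible shape for the folder version is sketched below. The names `fasta_sequences` and `report` and the `*.fna` glob pattern are assumptions made for illustration; the site patterns are the enzymes listed above with N and W already expanded.

```python
import re
from pathlib import Path

# Regex forms of the sites listed above (N -> [ACGT], W -> [AT]).
SITES = {
    "EcoRI": "GAATTC",
    "Bsu15I": "ATCGAT",
    "Bsu36I": "CCT[ACGT]AGG",
    "BsuRI": "GGCC",
    "EcoRII": "CC[AT]GG",
}

def fasta_sequences(path):
    """Return one uppercase sequence string per FASTA record in the file."""
    records = Path(path).read_text().split(">")[1:]
    # The first line of each record is the header; the rest is sequence.
    return ["".join(rec.splitlines()[1:]).upper() for rec in records]

def report(folder, pattern="*.fna"):
    """Return {filename: {enzyme: count}} for every matching file in folder."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        seqs = fasta_sequences(path)
        results[path.name] = {
            name: sum(len(re.findall(rx, seq)) for seq in seqs)
            for name, rx in SITES.items()
        }
    return results
```

Calling `report("genomes")` on a hypothetical folder named `genomes` would return one dictionary of counts per `.fna` file, which can then be printed as a table.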
Generate a distribution of polyA lengths (the distance between the end of the motif you found and the end of the contig). You will want to review how we capture a regular expression search match and then get the start or end of that match. For example:
5'<------------------------------------> 3'
AATAAAGAACAAAGTA
100 110
match ends at 100
sequence is 110 bp long
polyA tail is 10 bp long
The matching or sequence composition is not always perfect for polyA (deciding which motif match is the LAST one is the problem). But give this a try and see if you can generate some summary statistics from this data.
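The arithmetic in the example (sequence length minus the end of the last match) could be sketched like this. The motif AATAAA is taken from the example above and is an assumption about which motif you searched for; `tail_length` and `summarize` are hypothetical helper names.

```python
import re
import statistics

# Assumed motif (from the example above). The lookahead lets overlapping
# occurrences be found, so the true LAST occurrence is not skipped.
MOTIF = re.compile(r"(?=(AATAAA))")

def tail_length(sequence):
    """Bases between the end of the last motif match and the sequence end.

    Returns None when the motif does not occur at all.
    """
    matches = list(MOTIF.finditer(sequence.upper()))
    if not matches:
        return None
    # end(1) is the end position of the captured motif itself.
    return len(sequence) - matches[-1].end(1)

def summarize(lengths):
    """Basic summary statistics for a non-empty list of tail lengths."""
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }
```

Taking the last match is only one way to resolve the ambiguity; a stricter approach might also require the remaining tail to be mostly A.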
Plot a histogram of this distribution using R. Here is a simple R script to plot a histogram. Just make sure your file is called 'polyA_lengths.dat' (or change the file name in the R script), then run it like this:
$ R --no-save < histogram.R
You may want to revisit the regular expression lectures and see this page on Regular Expressions from Python.