Homework 1
Write a bash / shell script to accomplish the following tasks. You can break the tasks into separate sections in your code; use comments to indicate which part answers which question. You can use the echo command to print a message with the result if needed.
- Getting data
- Compressing and uncompressing
- Compress the file pg24923.txt you downloaded with gzip. How big is it in kilobytes?
- Uncompress it, then compress it with bzip2. How big is it in kilobytes?
- Uncompress it again
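One way to approach the compression steps is sketched below. The helper names size_kb and compress_report are made up for this illustration, and the sketch assumes 1 kilobyte = 1024 bytes, rounded up; tools such as ls -lh or du -k may report slightly different numbers.

```shell
#!/usr/bin/env bash
# Sketch only: compare gzip and bzip2 sizes for one file.

# size_kb FILE: print the size of FILE in kilobytes
# (assumption: 1 KB = 1024 bytes, rounded up). Hypothetical helper.
size_kb() {
    local bytes
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# compress_report FILE: gzip, report size, gunzip, bzip2, report size, bunzip2.
# Hypothetical helper; FILE is left uncompressed at the end.
compress_report() {
    local f=$1
    gzip "$f"                              # replaces FILE with FILE.gz
    echo "gzip: $(size_kb "$f.gz") KB"
    gunzip "$f.gz"                         # restores FILE
    bzip2 "$f"                             # replaces FILE with FILE.bz2
    echo "bzip2: $(size_kb "$f.bz2") KB"
    bunzip2 "$f.bz2"                       # restores FILE
}

# Run only if the downloaded file is in the current directory.
if [ -f pg24923.txt ]; then
    compress_report pg24923.txt
fi
```

Wrapping the steps in a function keeps the file name in one place, so the same report can be run on any other file.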
- Counting
- Sorting
- Sort the data file by the FPKM column (gene expression) and write the result to a new file called Nc20H.expr.sorted.tab.
- How many exon features are there in the rice chromosome 6 GFF file (this file is compressed)? You can do this in several different ways.
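The sorting and exon-counting tasks could be sketched as follows. Several things here are assumptions rather than part of the assignment: the input file names Nc20H.expr.tab and rice_chr6.gff.gz are placeholders, the expression file is assumed to be tab-delimited with one header line and FPKM in column 6, the sort order is taken to be highest FPKM first, and the GFF feature type is assumed to sit in column 3. The function names are made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch only: sort an expression table by FPKM and count exons in a gzipped GFF.

# sort_by_fpkm SRC DEST: keep the first line as a header, then sort the
# remaining lines by column 6, numerically, largest value first.
sort_by_fpkm() {
    local src=$1 dest=$2 tab
    tab=$(printf '\t')
    head -n 1 "$src" > "$dest"
    # -g also handles values written in scientific notation (e.g. 1.2e+03)
    tail -n +2 "$src" | LC_ALL=C sort -t "$tab" -k6,6gr >> "$dest"
}

# count_exons GFF_GZ: decompress to stdout and count lines whose
# third column is exactly "exon". The compressed file is left untouched.
count_exons() {
    gzip -dc "$1" | awk -F '\t' '$3 == "exon" { n++ } END { print n + 0 }'
}

# Placeholder input names; adjust to the files you actually downloaded.
if [ -f Nc20H.expr.tab ]; then
    sort_by_fpkm Nc20H.expr.tab Nc20H.expr.sorted.tab
fi
if [ -f rice_chr6.gff.gz ]; then
    echo "exon features: $(count_exons rice_chr6.gff.gz)"
fi
```

Matching on column 3 with awk avoids counting lines that merely mention the word exon in another column, which a plain grep for "exon" could do.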
- Finding and Counting
- Column combining
- Take these files, Nc20H and Nc3H, and combine columns 1-6 (gene_id, bundle_id, chr, left, right, FPKM) from Nc20H with column 6 from Nc3H to make a new file with 7 columns; the last 2 are the gene expression values for the two experiments. You should use the cut and paste UNIX commands to accomplish this, though use of other tools is permitted. You may want to validate the gene counts in both files using sort and uniq.
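A minimal sketch of the column-combining task with cut and paste is shown below. The function names, the input names Nc20H.expr.tab and Nc3H.expr.tab, and the output name Nc.expr.combined.tab are placeholders chosen for this illustration. The sketch also assumes both files are tab-delimited and list the genes in the same order, since paste joins lines purely by position.

```shell
#!/usr/bin/env bash
# Sketch only: build a 7-column table from two expression files.

# combine_expr A B DEST: columns 1-6 of A followed by column 6 of B.
# Assumes A and B have the same genes in the same row order.
combine_expr() {
    local a=$1 b=$2 dest=$3 col
    col=$(mktemp)
    cut -f6 "$b" > "$col"                        # expression column from B
    cut -f1-6 "$a" | paste - "$col" > "$dest"    # "-" reads the piped columns
    rm -f "$col"
}

# count_genes FILE: number of distinct values in column 1.
# A header line, if present, counts as one value.
count_genes() {
    cut -f1 "$1" | sort | uniq | wc -l | tr -d ' '
}

# Placeholder file names; adjust to the files you actually downloaded.
if [ -f Nc20H.expr.tab ] && [ -f Nc3H.expr.tab ]; then
    echo "Nc20H genes: $(count_genes Nc20H.expr.tab)"
    echo "Nc3H genes:  $(count_genes Nc3H.expr.tab)"
    combine_expr Nc20H.expr.tab Nc3H.expr.tab Nc.expr.combined.tab
fi
```

If the two gene counts differ, or the genes appear in a different order, sorting both files by gene_id before combining (or using join instead of paste) would be the safer route.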