Mollee Jain - SURI 2013

About Me:

My name is Mollee Jain and I am a rising sophomore at Wellesley College this fall. Though my undergraduate studies are on the East Coast, my roots are right here in San Diego. Due to the fantastic opportunities that the Scripps Research Institute provides, I get to combine my interests in biology and computer science to learn about the world of bioinformatics at STSI under Dr. Carland. I look forward to spending the summer working with professionals!

What I Have Been Working On:

Programming:

Because there are so many programming tutorials on the internet, I was able to find several that I could follow and understand readily to expand my knowledge of programming. Some of these tutorials include:

Python: Learn Python The Hard Way

PHP: PHP Tutorial

HTML and CSS: HTML Dog

Codecademy: Codecademy [This site provides free tutorials and practice for a multitude of programming languages!]

[Note: I used hilite.me to format my source code in an HTML format.]

Using Conky:

Conky is a system monitor for your desktop that can display information in a clear and concise way. There are many different guides for setting up Conky, and I used a combination of them to configure the Conky that I use on my desktop.

Note: Conky works on Linux, FreeBSD and OpenBSD platforms. I worked on a Linux Mint platform. Alternatives include GeekTool for Macs, and Rainmeter, Samurize and Fences for Windows.

Setup:

To set up Conky with the "conky-colors" configuration, I almost entirely followed this tutorial. It comprehensively combines many of the steps I took to create my configuration. If you want a different configuration, there are many, many different tutorials on the internet. DeviantArt and NoobsLab are two websites that have a plethora of conky styles.

Once you have your .conkyrc file open, try to understand the text. Check this conky manual page. It gives many definitions for the variables found in the .conkyrc file.

Configure:

To get the configuration I have, I changed some of the variables, shown below. Feel free to mess around with them!

own_window_transparent yes
#####[this made the window transparent]
#own_window_type override
#####[this allowed my window to stay on my desktop after I changed the .conkyrc file. Otherwise, the window disappears.]
gap_x 15
gap_y 0
######[this reduces the gap between the screen and the text]

qstat Command

Once I was satisfied with how conky looked, I worked on creating a new section in it that showed the output of the 'qstat' command on a remote host. The 'qstat' command allows one to see what jobs are currently running. To create this, I inserted the following code into my .conkyrc file [note: the names (ex. user@remotehost.edu) should be changed to reflect your specific remote hosts and usernames, etc].

Remote Host Name ${hr 2}
${execi 600 ssh user@remotehost.edu 'uptime'}
${execi 600 ssh user@remotehost.edu 'qstat | grep username | gawk "{ print \$1 \": \" \$3 \"\t\" \"State: \" \$5 }"'}

The number after 'execi' tells you after how many seconds the command repeats. In this case, the command would run every 10 minutes ('600'). By using 'grep username', you will only see jobs that are under your name (this is especially useful if there are many people running jobs on the remote host). The fields inside the print statement determine which columns of the 'qstat' output will be displayed on conky.
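
To make the 'grep' and 'gawk' steps more concrete, here is a rough Python sketch of the same filtering. It is only an illustration: the function name and the sample 'qstat' output are made up, and it assumes a column layout in which the first field is the job id, the third is the user and the fifth is the job state.

```python
def summarize_qstat(qstat_output, username):
    """Keep the lines that mention username and print fields 1, 3 and 5."""
    summary = []
    for line in qstat_output.splitlines():
        if username not in line:       # same idea as: grep username
            continue
        fields = line.split()          # gawk also splits on whitespace; $1 is fields[0]
        if len(fields) < 5:            # unlike gawk, this sketch skips short lines
            continue
        summary.append(f"{fields[0]}: {fields[2]}\tState: {fields[4]}")
    return summary


# Hypothetical qstat output, invented for this example.
sample = """Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1234.host                 align_run        username        01:02:03 R batch
1235.host                 other_job        someone         00:10:00 Q batch
"""

result = summarize_qstat(sample, "username")
print("\n".join(result))
```

Only the job that belongs to 'username' is kept, which is what conky would then display.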

[The alternative 'print' command below will show the status of the job (ex. 'R' for running), the running time of the job, and then the job name:]

${execi 10 ssh user@remotehost.edu 'qstat | grep username | gawk "{ print \$5 \": \" \$4 \"\t\" \$2 }"'}

My .conkyrc file

If you want a configuration to get you started, here is the whole text of my .conkyrc file (to get the qstat command to work, fill in the needed information under the "Remote Host" section of the configuration):

Note: the Linux Mint 15 logo should be downloaded and saved. Include the path to the logo under the "TEST" - "SYSTEM" tab "image /pathway/to/logo/...".
In the end, your conky configuration should look something like this:

What's next?

Further options include creating an RSA key that would allow you to bypass entering your password into the prompt conky gives when it logs into the remote host. You could also customize your conky to display things like Twitter feeds or your music library.

Running Novoalign:

Background:

Novoalign is software that aligns sequencing reads to a reference database (e.g. HG19) and calls variants to identify regions in the sequences that may be of significance. The software supports fragment and paired-end reads and is designed as an accurate and quick short read mapper. Novoalign can be used for small to large genomes.

Specifics: Novoalign 2.08.03 for Illumina paired-end reads.

How do we use it?

Novoalign's maker - Novocraft - provides some documentation on how to use Novoalign. The website provided a formula to follow under their Quick Start Tutorial:

novoalign -d reference_genome -f filename1 filename2 -i mean,stdev -o SAM

To run it, an index of the reference genome had to be created. The index was built using these commands:

module load novocraft
novoindex ref.nix ref.fa

An example of the command needed to actually run the aligner is:

novoalign -d ref.nix -f ../Sample/4_2_10_TAGCTT_L001_R1_001.fastq.gz ../Sample/4_2_10_TAGCTT_L001_R2_001.fastq.gz -i 250,90 -o SAM > L001_001.SAM
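
If you run many samples, it can help to fill in the Quick Start formula programmatically. The sketch below only builds the command line as a string and does not run Novoalign. The function name and the read file names are hypothetical, and it uses only the flags already shown above (-d, -f, -i and -o SAM).

```python
def build_novoalign_command(index, read1, read2, mean, stdev):
    """Fill in: novoalign -d index -f read1 read2 -i mean,stdev -o SAM"""
    return [
        "novoalign",
        "-d", index,               # index built with novoindex
        "-f", read1, read2,        # the two paired-end read files
        "-i", f"{mean},{stdev}",   # fragment length mean and standard deviation
        "-o", "SAM",               # report alignments in SAM format
    ]


# Hypothetical file names, used only for illustration.
cmd = build_novoalign_command("ref.nix", "reads_R1.fastq.gz", "reads_R2.fastq.gz", 250, 90)
command_line = " ".join(cmd)
print(command_line)
```

The printed line can be pasted into a .run file, with the output redirected to a .SAM file as in the example above.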

(To calculate the mean and standard deviation of the size distribution of the sequencing runs, a short Python program was created. However, you could also use this command:

novoalign -d genome.nix -f read1.fq read2.fq -i 500,100 -#2K >/some/output

This command will give the first 2000 alignments of two files, and the mean and standard deviation of the fragment lengths of the sequence will be displayed at the end of the file. As long as the -i parameters are robust enough for the read [ex. -i 500,100], the alignment will be fairly accurate.)
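
For reference, one way such a mean and standard deviation could be computed in Python is sketched below. This is not the program mentioned above, just a minimal illustration. It assumes the alignments are already available as SAM text lines and reads the fragment length from the ninth SAM column (TLEN); the helper make_record and the sample records are invented for the example.

```python
import statistics


def fragment_length_stats(sam_lines):
    """Return (mean, standard deviation) of the positive TLEN values."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):        # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:             # not a complete alignment record
            continue
        tlen = int(fields[8])           # ninth column of a SAM record is TLEN
        if tlen > 0:                    # count each pair once, via its leftmost mate
            lengths.append(tlen)
    # statistics.stdev needs at least two values and raises an error otherwise
    return statistics.mean(lengths), statistics.stdev(lengths)


def make_record(name, tlen):
    """Build a made-up SAM record with the given template length."""
    return "\t".join([name, "99", "chr1", "100", "60", "50M", "=", "300", str(tlen), "*", "*"])


sample_lines = ["@HD\tVN:1.4"]
for i, tlen in enumerate([240, 260, 250]):
    sample_lines.append(make_record(f"read{i}", tlen))
    sample_lines.append(make_record(f"read{i}", -tlen))   # the mate has a negative TLEN

mean, stdev = fragment_length_stats(sample_lines)
print(mean, stdev)
```

The two numbers are what you would pass to the -i option, for example -i 250,10 for this made-up data.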

The .run file I created to run this program is below.

What will Novoalign give you?

You will get two output files: an *.oe file and a *.SAM file. The *.oe file will give you:

The time the run started and stopped
Elapsed time in seconds, and CPU time in minutes
The number of paired reads and pairs aligned, among other statistics
Fragment length distribution (which is the length of the DNA fragment as mapped by the aligner)
Mean and standard deviation of the fragment length of the DNA strands that were sequenced

Overall:

Novoalign is one of the fastest alignment tools and has quite a high sensitivity for detection of variants. However, Novoalign requires more memory than other alignment software (our run took about 8GB of memory). It is also best used for an alignment of short reads to a long reference.

More information:

Novocraft's Website: Main Page

Novoalign - Quick Start Tutorial: Quick Start Tutorial

Novoalign - User Guide/Wiki: User Guide

Novoalign - Reference Manual: Manual

Novoalign - FAQ: FAQ

Running a Chi-Squared Test in Python:

Overview:

A Chi-Squared test is a statistical test that compares observed counts of categorical data with expected counts of the same categorical data. To calculate the expected count for each cell of a 2x2 contingency table, you follow this formula:

Expected = (Row Total x Column Total) / Total Count

To get the Chi-Squared value you follow this formula:

Chi-Squared = Σ [(observed count - expected count)^2/expected count]

Using the Chi-Squared value you can then measure statistical significance by calculating the p-value. The Chi-Squared test utilizes a null hypothesis, or the hypothesis that the two categorical variables are independent of each other. If the p-value is less than the assigned level of significance (in most cases, this value is 0.05), then you REJECT the null hypothesis and conclude that the two categorical variables are NOT independent.
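
As a worked example of the two formulas, here is a small Python sketch for a 2x2 table. It is only an illustration with made-up counts, not the script described later on this page. It assumes 1 degree of freedom, in which case the p-value can be computed with the standard library as erfc(sqrt(chi_squared / 2)), and it assumes no expected count is zero.

```python
import math


def chi_squared_2x2(table):
    """Return (chi-squared value, p-value) for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected = (Row Total x Column Total) / Total Count
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # Upper tail of the Chi-Squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up observed counts: rows are groups, columns are categories.
observed_table = [[30, 10], [20, 40]]
chi2, p_value = chi_squared_2x2(observed_table)
print(chi2, p_value)
```

For these counts the expected values are 20, 20, 30 and 30, so the Chi-Squared value is 5 + 5 + 10/3 + 10/3 = 50/3, and the p-value is far below 0.05, so the null hypothesis of independence would be rejected.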

This link provides a great and detailed explanation of the Chi-Squared test.

What is a GWAS?

A Genome Wide Association Study (or GWAS for short) compares the genomes of a 'control' group (individuals without the trait of interest) with the genomes of a 'case' group (individuals with the trait) and genotypes both groups for common single-nucleotide polymorphisms (or SNPs for short). A GWAS aims to determine if the allelic frequency of SNPs differs between the two groups.

To determine this, we calculate an odds ratio. An odds ratio of 1 means that an allele has equal odds of occurring in the cases and in the controls. A Chi-Squared test is then used to test the significance of the odds ratio.
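
As a rough illustration of the odds ratio, the sketch below computes it from the four allele counts of a 2x2 table. The function name and the counts are made up for the example, and it assumes none of the counts in the denominator is zero.

```python
def odds_ratio(case_a, case_b, control_a, control_b):
    """Odds of allele A versus allele B in cases, divided by the same odds in controls."""
    return (case_a * control_b) / (case_b * control_a)


# Made-up allele counts: allele A is more common in the cases than in the controls.
ratio_different = odds_ratio(case_a=30, case_b=10, control_a=20, control_b=40)

# Made-up allele counts: allele A is equally common in both groups.
ratio_equal = odds_ratio(case_a=25, case_b=25, control_a=25, control_b=25)

print(ratio_different, ratio_equal)
```

The first ratio is (30 x 40) / (10 x 20) = 6, meaning the odds of carrying allele A are six times higher in the cases, while the second ratio is 1, meaning there is no difference between the groups.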

Chi-Squared in Python:

To start creating this Python file, I first had to figure out how to do a Chi-Squared test and understand how it works. A link for a good summary of that can be found in the "Overview" section. I further had to find a sample GWAS dataset that I could use for my script. The PDF I found not only provided a sample GWAS dataset, but it also provided a great tutorial for understanding the GWAS dataset and using it in a program called PLINK. Follow this link to download the PDF. If it does not download, try googling "BioInfoSummer2010 GWAS tutorial - WEHI Bioinformatics".

My script uses a 2x2 contingency table for single variants (1 pair of SNPs at a time). It uses 1 degree of freedom and outputs a p-value only if it is below a significance threshold of 1.0e-8, but both of these values can be changed in the script. A future step would be to alter the script so that it can run Chi-Squared association tests for multiple variants as well as single variants.

My Code:

Useful Links:

The Scripps Translational Science Institute (STSI): www.stsiweb.org

The Scripps Research Institute (TSRI): www.scripps.edu

Scripps Health: www.scripps.org

Genomics and Current Projects - Dr. Carland (Main Page): genomics1.scripps.edu

Contact Me:

Email: mollee@scripps.edu

Secondary Email: mjain2@wellesley.edu