Bioinformatics tutorials

Thursday, November 6, 2014

What is principal component analysis?

A very interesting article written by Prof Lior Pachter.

What is principal component analysis?

Tuesday, July 22, 2014

You’re not allowed bioinformatics anymore

You’re not allowed bioinformatics anymore
is interesting post by opiniomics blog.

An eye-opener reading for those who believe that science can be done only in traditional way.

Monday, March 31, 2014

Perl Program to find Open Reading Frames (ORFs) in DNA sequence and then converting into protein

Exported from Notepad++

#!/usr/bin/perl

use strict;

#
# set to 1 for lots of debugging output
#
use constant DEBUG => 0;

#
#  minimum size of ORF to output
#
use constant MINSIZE => 20;

#
# hash to make complement sequence
#
my %compnuc = ( 'A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C',
        '-' => '-', ' ' => ' ', '*' => '*', '.' => '.');

#
# hash for translating DNA
#
my %DNAtoAA = (
        'GCT' => 'A', 'GCC' => 'A', 'GCA' => 'A', 'GCG' => 'A',
        'TGT' => 'C', 'TGC' => 'C', 'GAT' => 'D', 'GAC' => 'D',
        'GAA' => 'E', 'GAG' => 'E', 'TTT' => 'F', 'TTC' => 'F',
        'GGT' => 'G', 'GGC' => 'G', 'GGA' => 'G', 'GGG' => 'G',
        'CAT' => 'H', 'CAC' => 'H', 'ATT' => 'I', 'ATC' => 'I',
        'ATA' => 'I', 'AAA' => 'K', 'AAG' => 'K', 'TTG' => 'L',
        'TTA' => 'L', 'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L',
        'CTG' => 'L', 'ATG' => 'M', 'AAT' => 'N', 'AAC' => 'N',
        'CCT' => 'P', 'CCC' => 'P', 'CCA' => 'P', 'CCG' => 'P',
        'CAA' => 'Q', 'CAG' => 'Q', 'CGT' => 'R', 'CGC' => 'R',
        'CGA' => 'R', 'CGG' => 'R', 'AGA' => 'R', 'AGG' => 'R',
        'TCT' => 'S', 'TCC' => 'S', 'TCA' => 'S', 'TCG' => 'S',
        'AGT' => 'S', 'AGC' => 'S', 'ACT' => 'T', 'ACC' => 'T',
        'ACA' => 'T', 'ACG' => 'T', 'GTT' => 'V', 'GTC' => 'V',
        'GTA' => 'V', 'GTG' => 'V', 'TGG' => 'W', 'TAT' => 'Y',
        'TAC' => 'Y', 'TAA' => ':', 'TAG' => ':', 'TGA' => ':');

#
# program will open FASTA files end with *.fa
# 

print "Enter your file name with .fa extension\n";
my $file=<STDIN>;
chomp($file);

my $seq = "";
my $chr = $file;
$chr =~ s/\.fa//;
$chr =~ s/\.\///;

open (FILE, $file) || die "cannot open for reading $file: $!\n";

#
# reading FASTA formated files, read each line and append 
# it to the previous.
# 
# ignore the def line
#
# this assumes that each chromosome is in a file by itself
#   
    while (<FILE>) {
    chop;
    if ($_ !~ /^>/) {
        $seq .= $_;
    }
    }
    print STDERR "length of DNA is " . length($seq) . "\n";
    close (FILE);

#
# Make reverse complement
#

    my $revseq = "";
    my $revtmp = reverse $seq;

    for (my $j = 0; $j < length($revtmp); $j++) {
    $revseq .= $compnuc{ substr($revtmp, $j, 1) };
    }

#
# Make translations
#

    my $pept = "";
    my $start = 0;
    my $stop = 0;

    open (FILE, ">$chr.peptides.fa") || die "cannot open file $!\n";

#
# $i loop designates which strand, 0 = Watson and 1 = Crick
#
    for (my $i = 0; $i < 2; $i++) {
    if ($i == 1) {
        $seq = $revseq;
    }
#
# $j is for reading frame, 1, 2 or 3
#
    for (my $j = 0; $j < 3; $j++) {
        print STDERR "FRAME " . (($j + 1) + ($i * 3)) . " \n";
        if (DEBUG) { print STDERR "\n*"; }
#
# translation loop
# $x loop is for walking throught the framing frame looking for a peptide
#
        for (my $x = $j; $x < length($seq); $x += 3) {
        if (DEBUG) { print STDERR "= $x "; }
#
# Beginning of translation loop
# Only find peptides that start with ATG (methonine)
#
        if ( $DNAtoAA{ substr($seq, $x, 3) } eq "M" ) {
            $start = $x;
            $stop = $x + 3;
#
# Translate until a STOP is found (':') or the end of the sequence is
# reached $y loop is used after a start has been found, now we
# translate.  The loop starts with a stop is found ':' or $y is larger
# than the length of the sequence minus 3.  This is so we always have
# a full codon.
#

            for (my $y = $x; ($DNAtoAA{ substr($seq, $y, 3) } ne ":") 
                                && ($y < (length($seq) - 3)); $y += 3) {

            $pept .= $DNAtoAA{ substr($seq, $y, 3) };
            $stop = $y + 3;
            if (DEBUG) { print STDERR "\n+ " . (length($seq) - $start) . 
                        " -> " . (length($seq) - ($stop + 2)) . " = $pept"; }
            }
            $x = $stop;
#
# if the peptide is 20 amino acids or longer write it to a FASTA file with 80 char lines
#
            if (length($pept) > (MINSIZE - 1)) {
            my $peptcoord = $chr;
            $peptcoord =~ s/chr//;
            if ($i == 0) {
                $peptcoord .= ":" . ($start + 1) . ":" . ($stop + 3);
            } else {
                $peptcoord .= ":" . (length($seq) - $start) . ":" . (length($seq) - ($stop + 2));
            }
print FILE ">ORF|" . $peptcoord . "| frame " . (($j + 1) + ($i * 3)). "\n";
#
# output sequence in 80 char lines, while processes all sequences larger than 80 characters
#  in length
#
            while (length($pept) > 80) {
                print FILE substr($pept, 0, 80) . "\n";
                $pept = substr($pept, 80, length($pept));
                if (DEBUG) { print STDERR "-"; }
            }
#
# print last bit of $pept
#
            print FILE $pept . "\n";
            if (DEBUG) { print STDERR "."; }
            }
            $pept = "";
        }
        }
    }
    }

#
# Close chromosome sequence file
#
    close (FILE); 

print STDERR "\n\n";

exit;

Thursday, February 27, 2014

Confused About Distribution....

If you are confused when, how and which distribution one should use for one's analysis then you must go through below links... Very interactive and informative way they have explained... enjoy

Diagram of distribution relationships

Univariate Distribution Relationships

Tuesday, January 28, 2014

pbc wrap command to put protein in centre of the box in VMD

If protein moves to the corner or side of the box and you want to put it back to the centre use wrap command:

pbc wrap -centersel "protein" -center com -compound residue -all

Thursday, November 28, 2013

awk command examples

For a bioinformatician pattern searching is everyday routine... And I feel one should use the power of shell oneline commands which are easy and make life easier... So here I am giving some basic examples of awk's commands which are simple and can be used very frequently...

Print only next line after the pattern matched...

awk '/pattern/{getline; print}' filename

This command first match the pattern and then print next line... this command can also be extended to print specific line after pattern match... for example if one want to print third line after pattern match just write...

awk -v lines=3 '/pattern/ {for(i=lines;i;--i)getline; print}' filename

If one one want to extract only one column from that line then simply add column index after print for example to print second column of next line after pattern match... just write the command...

awk '/pattern/{getline; print $2}' filename

$0 to print complete line
$1 to print first column
$2 to print second column and so on

If one want to calculate sum or average of a particular column then one command can do all for you...
for example one want to calculate sum or average of second column then write...

awk '{sum+=$2} END { print "Sum = ",sum}' filename
awk '{sum+=$2} END { print "Average = ",sum/NR}' filename

Friday, June 28, 2013

Trajectory file conversion using wordom

wordom -conv -imol mymolecule.pdb -omol mymolecule.crd

wordom -conv -imol mymolecule.pdb -sele "/A/@(20-100)/*" -omol mymolecule.crd

wordom -conv -itrj mytrj.dcd -imol mymolecule.pdb -otrj mytrj.xtc

wordom -conv -itrj mytrj.xtc -otrj mytrj.dcd