LVB Manual

LVB Manual – LVB phylogeny program, version 3.0 Beta

COPYRIGHT

Part of this document is based on PHYLIP documentation (see ACKNOWLEDGEMENTS).

The PHYLIP component of this document:

© Copyright 1986-2000 by the University of Washington. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

The remainder of this document:

© Copyright 2003-2012 by Daniel Barker.

© Copyright 2013 by Daniel Barker and Maximilian Strobl.

Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.


DESCRIPTION

lvb seeks parsimonious trees from an aligned nucleotide data matrix. It uses heuristic searches consisting of simulated annealing followed by hill-climbing. In contrast to the more usual heuristic searches used to find parsimonious trees (e.g. stepwise addition followed by hill-climbing), simulated annealing can ‘jump out’ of local optima. Especially with large, complex data matrices, the simulated annealing heuristic may run faster and/or find a shorter tree.
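To illustrate how simulated annealing can ‘jump out’ of a local optimum, here is a minimal Python sketch of the generic Metropolis acceptance rule used in textbook simulated annealing. It is not taken from the LVB source code; the function name accept_move, its parameters and the exact probability formula are assumptions made for illustration only.

```python
import math
import random


def accept_move(delta_length, temperature, rng=random):
    """Decide whether to accept a rearranged tree (generic Metropolis rule).

    delta_length: new tree length minus current tree length.
    temperature:  current annealing temperature (hypothetical scale).
    rng:          any object with a random() method returning a float in [0, 1).
    """
    if delta_length <= 0:
        # A tree that is no longer than the current one is always accepted.
        return True
    if temperature <= 0:
        # At zero temperature the search behaves like pure hill-climbing.
        return False
    # A longer tree is sometimes accepted; this is what lets the search
    # leave a local optimum. Acceptance becomes rarer as temperature falls.
    return rng.random() < math.exp(-delta_length / temperature)
```

At a high temperature most longer trees are accepted; as the temperature approaches zero, the rule reduces to hill-climbing.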


CITING LVB

Please cite the following paper if you use LVB:

Barker, D. 2004. LVB: Parsimony and simulated annealing in the search for phylogenetic trees. Bioinformatics, 20, 274-275.

The following may also be relevant:

LVB. https://eggg.st-andrews.ac.uk/lvb.

Barker, D. 1999. Simulated annealing in the Search for Phylogenetic Trees. PhD Thesis, University of Edinburgh.

Barker, D. 1997. LVB 1.0: Reconstructing Evolution with Parsimony and Simulated Annealing (Edinburgh: Daniel Barker).


RUNNING LVB

lvb is a command-line program.

lvb reads the alignment file from the current directory (folder) and writes its main output to a file in the current directory. The user is prompted for the matrix format, the interpretation of gaps in the alignment, the type of simulated annealing heuristic searches to run (with a sensible default), the seed for the pseudorandom number generator (with a sensible default), and whether bootstrap replicates are required (by default, no). Answers are entered using the keyboard.

lvb logs progress information and errors to the screen.

Mac OS X

The Apple Mac OS X version of LVB runs under OS X 10.7 (Lion) on 64-bit Intel-based hardware. It is also expected to run on more recent versions of OS X.

After downloading, extract lvb from the file lvb_3_0_BETA_macos.tar.gz. Once this is done, you may launch it from the Terminal command-line. Terminal is usually found in the Applications/Utilities folder. If lvb is on your desktop, you may launch it by typing the following commands in Terminal:

cd Desktop
./lvb

If lvb is in a directory in your PATH environment variable, it should be accessible in Terminal from any location, as lvb.

Raspberry Pi

The Raspberry Pi version of LVB runs under Raspbian Linux for Raspberry Pi.

After downloading, extract lvb from the file lvb_3_0_BETA_raspi.tar.gz. Once this is done, you may launch it from a terminal command-line. A suitable terminal in Raspbian is LXTerminal, usually found in the Menu under Accessories. If lvb is on your desktop, you may launch it by typing the following commands in LXTerminal:

cd Desktop
./lvb

If lvb is in a directory in your PATH environment variable, it should be accessible in LXTerminal from any location, as lvb.

Other Linux and UNIX

After downloading, compile lvb from the source code (see COMPILING LVB). Once this is done, it may be launched as for Mac OS X or Raspberry Pi. The only difference may be in the mechanism to start a terminal window or remote connection.

Other Systems

It should be possible to compile and run lvb on Windows and many other operating systems, if you have a C compiler. The details will vary, but to help you get started see COMPILING LVB.


INPUT

Keyboard (standard input)

Keyboard input is case-independent. So, for example, where the instructions below suggest you type I, typing i will have the same effect.

Matrix format

lvb can read matrices in PHYLIP 3.6 interleaved or PHYLIP 3.6 sequential format. These are described in the section on infile.

When prompted for the data matrix format, type I or S followed by RETURN for ‘interleaved’ or ‘sequential’, respectively.

Treatment of gaps

See the table under Bases for a list of base codes allowed by lvb.

A gap represented by the letter ‘O’ in the data matrix is always treated as a character state in its own right (fifth state). lvb can treat gaps represented by ‘-’ in either of the following ways:

Fifth state

‘-’ is treated as equivalent to ‘O’.

Unknown

‘-’ is treated as equivalent to ‘?’, i.e., as an ambiguous site that may contain ‘A’ or ‘C’ or ‘G’ or ‘T’ or ‘O’.

When prompted for the treatment of ‘-’, type U or F followed by RETURN for ‘unknown’ or ‘fifth state’, respectively.

‘Fifth state’ may give excessive weight to multi-site gaps, since each affected base position will be counted as one event.

Cooling schedule

When prompted for the cooling schedule, press RETURN for the default or enter G or L for ‘geometric’ or ‘linear’, respectively.

‘Geometric’ causes lvb to run rapidly and usually gives results of good quality. (In the simulated annealing heuristic search, the relation between one level of the ‘temperature’ and the next is set to exponential decay.) This is the default.

‘Linear’ causes lvb to run more slowly and may give results of even better quality. (The relation between one level of the ‘temperature’ and the next is set to linear decrease.)
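The difference between the two schedules can be sketched in a few lines of Python. This is an illustration of exponential decay versus linear decrease in general, not LVB's actual code; the function names, the starting temperature, the ratio and the decrement are all hypothetical values chosen for the example.

```python
def geometric_schedule(t0, ratio, steps):
    """Exponential decay: each temperature is the previous one times ratio."""
    return [t0 * ratio ** k for k in range(steps)]


def linear_schedule(t0, decrement, steps):
    """Linear decrease: each temperature is the previous one minus a
    constant decrement, never going below zero."""
    return [max(t0 - decrement * k, 0.0) for k in range(steps)]


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to make the shapes visible.
    print(geometric_schedule(1.0, 0.5, 4))
    print(linear_schedule(1.0, 0.25, 6))
```

A geometric schedule drops quickly at first and then flattens out, whereas a linear schedule falls by the same amount at every step.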

Random number seed

When prompted for the random number seed, press RETURN for the default or enter an integer in the range 0 to 900000000 inclusive.

The default value is taken from the system clock and hence will vary from one analysis to the next, changing every second. The default is usually appropriate.

Bootstrapping

When prompted for the number of bootstrap replicates, enter the number of replicates required. If bootstrapping is not required, enter the number 0 or just press RETURN.

lvb allows any number of replicates from 1 to 1000000 inclusive. For each replicate, a bootstrap sample of sites in the alignment is generated and analyzed.

For an alignment matrix of m sites, each bootstrap replicate contains m sites, randomly sampled with replacement from the originals. Compared to the original alignment, it is likely that some sites are left out, some are present once, and others are present twice or more. In lvb the probability of including a site is equal for all sites, irrespective of whether the site varies or is constant.
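The sampling scheme described above can be sketched as follows. This is a minimal Python illustration of drawing m sites with replacement and equal probability, not LVB's own implementation; the function names and the use of Python's random module are assumptions for the example.

```python
import random


def bootstrap_columns(m, seed):
    """Draw m column indices from 0..m-1, with replacement and with
    equal probability for every column."""
    rng = random.Random(seed)
    return [rng.randrange(m) for _ in range(m)]


def resample(sequences, columns):
    """Build a bootstrap replicate by picking the given columns from
    every sequence. sequences maps a name to an aligned string."""
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in sequences.items()
    }


if __name__ == "__main__":
    alignment = {"Chimp": "AAACCCTTGC", "Gorilla": "AAACCCTTGA"}
    columns = bootstrap_columns(10, seed=1)
    print(columns)
    print(resample(alignment, columns))
```

Because the same column indices are applied to every sequence, the replicate remains a valid alignment of the same width as the original.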

The most parsimonious tree(s) for each replicate are output. There will be at least one tree for each replicate. If the search for any replicate found more than one equally parsimonious tree, all are output and the number of trees will exceed the number of replicates. Generation of a consensus from all trees will over-represent those replicates for which more trees were found. If each bootstrap replicate finds a single tree, this is not an issue.

infile

The data matrix must be in a file called infile. lvb expects this file to contain a single nucleotide matrix in PHYLIP 3.6 format.

Layout

The simplest type of data matrix file looks something like this:

 6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of sequences and the number of characters (sites). These are in free format, separated by blanks. The information for each sequence follows, starting with a ten-character sequence name (which can include blanks and some punctuation marks), and continuing with the characters for that sequence.

The name should come right at the start of the line, without any preceding blanks or tabs. It should be ten characters in length, filled out to the full ten characters by trailing blanks if shorter. Any printable ASCII/ISO character is allowed in the name, except for parentheses ‘(’ and ‘)’, square brackets ‘[’ and ‘]’, colon ‘:’, semicolon ‘;’ and comma ‘,’. If you forget to extend the names to ten characters in length by blanks, an error message will result.
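The naming rules above are easy to get wrong by hand, so a small helper can be used to generate the matrix text. The following Python sketch is an illustration only: the function names are hypothetical, and it writes each sequence on a single line, which is the simplest layout described in this section.

```python
FORBIDDEN = set("()[]:;,")  # characters not allowed in a sequence name


def phylip_name(name):
    """Return name padded with trailing blanks to exactly ten characters."""
    if len(name) > 10:
        raise ValueError("name longer than ten characters: %r" % name)
    if any(ch in FORBIDDEN for ch in name):
        raise ValueError("forbidden character in name: %r" % name)
    if name != name.lstrip():
        raise ValueError("name must not start with blanks or tabs: %r" % name)
    return name.ljust(10)


def format_matrix(sequences):
    """Format a dict of name -> aligned sequence as matrix text, one
    sequence per line, with the counts on the first line."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must all have the same length")
    lines = [" %d %d" % (len(sequences), lengths.pop())]
    for name, seq in sequences.items():
        lines.append(phylip_name(name) + seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_matrix({"Chimp": "AAACCCTTGC", "Gorilla": "AAACCCTTGA"}), end="")
```

The resulting text can be saved as infile in the directory from which lvb is run.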

The biological characters (bases or gaps) are each a single ASCII character, sometimes separated by blanks.

The sequences can continue over multiple lines. When this is done the sequences must be either in interleaved format or sequential format. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. The name must be on the same line as the first character of the data for that sequence. Here is a hypothetical example of interleaved format:

 5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

 5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

If each sequence only occupies one line in the matrix file, there is no difference between sequential and interleaved format and lvb can read the file in either way. Other than this special case, it is important not to read an interleaved matrix as sequential or a sequential matrix as interleaved. A BADBASE error message often indicates that the wrong format has been specified.

Note that a portion of a sequence like this:

300 AAGCGTGAAC GTTGTACTAA TRCAG

is perfectly legal, assuming that the sequence name has gone before and is filled out to full length by blanks. The above digits and blanks will be ignored, the sequence being taken as starting at the first base symbol (in this case an A). This should enable you to use output from many multiple-sequence alignment programs with only minimal editing.

lvb may have difficulties with spaces at the end of lines. The symptoms of this problem are that lvb complains about a BADBASE, and you can find no other cause for this complaint. The problem may be avoided by deleting any spaces at the end of lines.

In interleaved format the present version of lvb may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage-return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that lvb complains that the sequences are not properly aligned, and you can find no other cause for this complaint.
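Both problems (blanks at the end of lines, and ‘blank’ lines that actually contain blank characters) can be removed mechanically. The Python sketch below is one possible way to do this; the function names are hypothetical, and it assumes the matrix file is plain ASCII text. Note that it also converts Windows-style line endings to plain newlines, which is an assumption about what is wanted.

```python
from pathlib import Path


def strip_trailing_blanks(text):
    """Remove spaces, tabs and carriage returns from the end of every
    line. Lines holding only blanks become truly empty lines."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def clean_matrix_file(path):
    """Rewrite the file at path in place with trailing blanks removed."""
    file_path = Path(path)
    cleaned = strip_trailing_blanks(file_path.read_text(encoding="ascii"))
    file_path.write_text(cleaned, encoding="ascii")


if __name__ == "__main__":
    sample = " 2 4 \nChimp     ACGT  \n   \nGorilla   ACGA\n"
    print(repr(strip_trailing_blanks(sample)))
```

Leading blanks, such as the one before the sequence count on the first line, are left untouched. It is prudent to keep a copy of the original file before calling clean_matrix_file on it.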

Bases

The sequences may contain A’s, G’s, C’s and T’s (or U’s, which lvb treats as equivalent to T’s). Each ASCII character in the sequence must be one of A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period is not allowed, because it is used in different senses in different programs). Blanks will be ignored, and so will numerical digits.

These characters can be either upper or lower case, because the algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.

For further information on ‘-’, see Treatment of gaps.

Symbol: Meaning:
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil (treated as T by lvb)
Y pYrimidine (C or T)
R puRine (A or G)
W 'Weak' (A or T)
S 'Strong' (C or G)
K 'Keto' (T or G)
M 'aMino' (C or A)
B not A (C or G or T)
D not C (A or G or T)
H not G (A or C or T)
V not T (A or C or G)
N aNy base (A or C or G or T)
X any base (A or C or G or T)
? unknown (A or C or G or T or O)
O gap
- gap (O; alternatively, A or C or G or T or O)
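
The table above can be expressed as a lookup from each symbol to the set of states it may represent. The following Python sketch is an illustration of the table only, not LVB's internal representation; the names BASE_STATES, states and gap_as_fifth_state are hypothetical.

```python
# Each symbol maps to the states it may stand for; 'O' is the gap state.
BASE_STATES = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "T",  # uracil is treated as T
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT",
    "?": "ACGTO",
    "O": "O",
}


def states(symbol, gap_as_fifth_state=True):
    """Return the set of states a symbol may represent.

    '-' is read as 'O' (fifth state) or as '?' (unknown), depending on
    gap_as_fifth_state. Input is case-independent. Symbols outside the
    table, such as a period, raise KeyError.
    """
    symbol = symbol.upper()
    if symbol == "-":
        symbol = "O" if gap_as_fifth_state else "?"
    return set(BASE_STATES[symbol])


if __name__ == "__main__":
    for sym in "ARN?-":
        print(sym, sorted(states(sym)))
```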

OUTPUT

Screen (standard output)

lvb logs its version, details of the analysis, indication of progress and any errors encountered to the standard output, which is usually the screen.

Without bootstrapping, the rearrangement number (iteration) of the search and the current tree length are logged every 50000 trees. During simulated annealing, the tree length can go up as well as down. LVB keeps and outputs the shortest trees encountered during its search. The length of this tree or trees is logged to the screen near the end of the analysis.

With bootstrapping, the replicate number is logged, along with the number of rearrangements tried, the number of trees found and the length of trees found for that replicate.

outtree

Without bootstrapping, the file outtree contains the most parsimonious tree or trees found.

With bootstrapping, outtree contains the most parsimonious tree or trees found for each replicate. Results for the replicates are given in order so, for example, if 40 trees were found for the first replicate, these are the first 40 trees in outtree.

Trees use a subset of the ‘Newick standard’ tree format. This is accepted by many other programs.
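When outtree holds many trees, for example after bootstrapping, it can be useful to count or separate them. The Python sketch below assumes that each tree in the file ends with a semicolon, as in the Newick standard, and that no name contains a semicolon (which the naming rules for infile forbid). The function name is hypothetical and the exact layout of outtree is not assumed beyond this.

```python
def split_newick(text):
    """Split text holding one or more semicolon-terminated Newick trees
    into a list of tree strings, each ending in a semicolon."""
    return [part.strip() + ";" for part in text.split(";") if part.strip()]


if __name__ == "__main__":
    sample = "(Turkey,(Chimp,Gorilla));\n((Turkey,Chimp),Gorilla);\n"
    trees = split_newick(sample)
    print(len(trees), "trees")
    for tree in trees:
        print(tree)
```

In practice the text would be read from outtree, for example with open("outtree").read().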

Trees may be converted to graphics files using the drawtree program of the PHYLIP package. They may also be viewed and printed using Mesquite.

Without bootstrapping, if more than one equally parsimonious tree is found, these may be combined in various ways using consense in the PHYLIP package. With bootstrapping, consense is useful to generate the majority rule consensus tree.

Output trees are unrooted and branch lengths are not given. Trees may be rooted with the retree program of the PHYLIP package. Trees may also be rooted and branch lengths (under various models of character state change) may be obtained by importing the tree and data matrix into Mesquite.


COMPILING LVB

lvb is available at the LVB Web page as ready-to-run software for Apple Mac OS X and for Raspbian Linux on the Raspberry Pi.

For other platforms, or if you wish to modify the source code, you will have to compile lvb. It is written in ANSI C and is expected to compile and run on a variety of operating systems.

Assuming your system is UNIX-like, uses GNU make and has Perl installed, follow the instructions below. If using a non-UNIX-like system such as Windows, the instructions below will require adjustment.

Unpacking the source code

Assuming lvb_3_0_BETA_source.tar.gz is in the current directory, enter the following command:

 tar xzvf lvb_3_0_BETA_source.tar.gz

This gives you a main directory lvb_3_0_BETA with two subdirectories, LVB_MAIN and PHYLIP_FOR_LVB.

Compiler options

By default, LVB is built using compiler options which make sense for GNU C (gcc). To use other compiler options, edit the file LVB_MAIN/Makefile before compiling.

Compilation

Now, assuming you begin in the lvb_3_0_BETA directory, the following sequence of commands will build lvb and test it:

cd LVB_MAIN 
make 
make test

Results of the above commands are:

  • A report on the tests, which is sent to the screen. All tests should pass. Any failure may indicate that lvb won’t work properly on your system.
  • A stand-alone executable file, lvb. This is all that is required to run the program.
  • Internal documentation of the LVB program, consisting of HTML files in the directory docs_programmer (see below).

After changing the source code or Makefile, it is safer to always make again from scratch.

Documentation

The main documentation (i.e. this file) is lvb_manual.htm in the LVB_MAIN directory.

Internal documentation will be of interest to people who wish to modify or re-use the source code of LVB. During a successful build, documentation in docs_programmer/ is automatically extracted from POD-format comments within the LVB source code. The internal documentation is incomplete and out of date.

Documentation of PHYLIP code within LVB is given separately, in PHYLIP_FOR_LVB/README_phylip_code_in_lvb.rtf. This PHYLIP code should not be used to build PHYLIP itself, as it contains modifications specifically for LVB. PHYLIP proper may be built by downloading its source code from the PHYLIP Web page.


SUPPORT AND REGISTRATION

Please send questions and bug reports to:

db60@st-andrews.ac.uk

To be placed on an email list to receive information on new versions, please email db60@st-andrews.ac.uk with subject ‘Register as LVB user’.


ACKNOWLEDGEMENTS

lvb contains portions of PHYLIP 3.6a. This allows lvb to read PHYLIP-format matrix files. Also, most of the above documentation for infile is taken from the PHYLIP 3.6a manual. I wish to thank Joe Felsenstein for making PHYLIP freely available, and for advising on how to re-use it in lvb.


SEE ALSO

LVB Web page

https://eggg.st-andrews.ac.uk/lvb

PHYLIP

http://evolution.genetics.washington.edu/phylip.html

Mesquite

http://mesquiteproject.org/mesquite/mesquite.html