Tutorial for common steps in data analysis

Perseus Tutorial

What is Perseus?

Perseus is a powerful statistical software package developed in the group of Prof. Matthias Mann. It was designed to analyze, evaluate and visualize MaxQuant-derived proteomic data, but it can handle all kinds of numerical data with some prior adjustments.

The original citation can be found in the Nature Methods paper.

The Perseus software platform allows interpretation of protein quantification, interaction and post-translational modification data. Perseus contains a comprehensive portfolio of statistical tools for high-dimensional omics data analysis covering normalization, pattern recognition, time-series analysis, cross-omics comparisons and multiple-hypothesis testing. Perseus provides an interactive workflow environment with an easy-to-use graphical user interface.

More information and tutorials can be found on coxdocs.org.

1 Download and Installation Guide

1.1 Download

Downloading and using the software is free of charge. Simply download from maxquant.net/perseus/ and unpack the compressed file Perseus.zip. For the purpose of the training, we also prepared all software and data needed on the USB stick.

1.2 Running

Supported operating system versions (64-bit is required) are Windows Vista, 7, 8, 10, Server 2008, 2012, 2016.

  1. Install .NET Framework 4.7.2 or higher

  2. Run the Perseus GUI by double-clicking Perseus.exe in the Perseus folder.

2 Data pre-processing

2.1 Loading data

You can load any kind of table as long as it is in tab-separated .txt format. This means that you can take any Excel table, save it as tab-separated .txt and then use it in Perseus. This tutorial is based on the standard MaxQuant output and is suitable for label-free quantification (LFQ) analysis of proteomics data.

  1. Double-click the Perseus.exe file to start the program.

  2. In the main menu bar, open the Load menu and select Generic matrix upload.

  3. Click the Select button and browse to the folder where your MaxQuant results were saved. Go to the folder combined –> txt and select the data file proteinGroups.txt.

  4. Click OK.
    A new window should open containing five feature tabs (Expression, Categorical annotation, Textual annotation, Numerical annotation, Multi-numerical annotation), in which you choose the information to include in the data analysis.

  5. Load all LFQ intensity columns into the Main columns window. All other important columns should be recognized automatically by Perseus.
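For readers who want to check a table outside Perseus, the loading step can be sketched in Python. This is a minimal illustration using only the standard library, not what Perseus does internally. The miniature table is made up, and the assumption is that the quantitative columns start with the prefix "LFQ intensity ", as in standard MaxQuant output.

```python
import csv
import io

# Hypothetical miniature stand-in for a MaxQuant proteinGroups.txt file.
EXAMPLE_TABLE = (
    "Protein IDs\tPeptides\tLFQ intensity A_1\tLFQ intensity A_2\tLFQ intensity B_1\n"
    "P12345\t5\t1200000\t1350000\t0\n"
    "Q67890\t2\t0\t87000\t91000\n"
)


def read_table(handle):
    """Read a tab-separated table into a header list and a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def lfq_columns(header):
    """Return the columns whose name starts with 'LFQ intensity ' (assumed prefix)."""
    return [name for name in header if name.startswith("LFQ intensity ")]


header, rows = read_table(io.StringIO(EXAMPLE_TABLE))
main_columns = lfq_columns(header)
print(main_columns)
print(len(rows), "protein groups loaded")
```

To read a real file instead of the in-memory example, pass a handle such as `open("proteinGroups.txt", newline="", encoding="utf-8")` to `read_table`; the path here is only a placeholder.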

2.2 Filtering

In this step, the unnecessary or incorrect protein identifications can be removed from the main data frame.

  1. Go to Processing –> Filter rows –> Filter rows based on categorical column.

  2. In the Column parameter, select Only identified by site and verify that the Find parameter contains the symbol “+” and that the Mode parameter is set to Remove matching rows.

  3. Click OK.

  4. Repeat the previous steps for the Reverse and Contaminants categories.
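The logic of this filter can be sketched as follows. This is an illustration under assumptions, not Perseus code: the rows are made up, and the marker column names (in particular "Potential contaminant") are assumed and may differ between MaxQuant versions.

```python
# Marker columns to filter on. The names are assumptions; check your own table header.
MARKER_COLUMNS = ["Only identified by site", "Reverse", "Potential contaminant"]


def remove_marked_rows(rows, marker_columns=MARKER_COLUMNS):
    """Keep only rows that carry no '+' in any of the marker columns."""
    return [
        row
        for row in rows
        if not any(row.get(col, "").strip() == "+" for col in marker_columns)
    ]


# Made-up example rows, one per filter category plus one clean protein group.
rows = [
    {"Protein IDs": "P1", "Only identified by site": "", "Reverse": "", "Potential contaminant": ""},
    {"Protein IDs": "REV__P2", "Only identified by site": "", "Reverse": "+", "Potential contaminant": ""},
    {"Protein IDs": "CON__P3", "Only identified by site": "", "Reverse": "", "Potential contaminant": "+"},
    {"Protein IDs": "P4", "Only identified by site": "+", "Reverse": "", "Potential contaminant": ""},
]

kept = remove_marked_rows(rows)
print([row["Protein IDs"] for row in kept])
```

Applying all three filters in one pass gives the same result as running the Perseus filter three times in a row, because a row is removed as soon as any one marker matches.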

2.3 Transformation

As the intensity values can span several orders of magnitude, some visualizations do not work well with raw data. To bring the data onto a more “friendly” scale, we transform them to log2 values.

  1. Go to Analysis –> Visualization –> Histograms.

  2. Check visually the distribution of the samples.

  3. Go to Processing –> Basic –> Transform.

  4. In the Transformation parameter, select Log, and in the Base parameter, select 2. In the Columns box, select the columns to be transformed.

  5. Click OK.

  6. Go to Analysis –> Visualization –> Histograms.

  7. Check visually the distribution of the samples.
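The effect of the transformation can be illustrated with a few made-up intensities. This is a minimal sketch, assuming that zero intensities (proteins not detected) should become NaN, since the logarithm of zero is undefined.

```python
import math


def log2_transform(values):
    """Log2-transform intensities; zero or negative values become NaN."""
    return [math.log2(v) if v > 0 else float("nan") for v in values]


# Made-up raw intensities covering nine orders of magnitude.
raw = [0.0, 1024.0, 1048576.0, 1073741824.0]
logged = log2_transform(raw)
print(logged)
```

After the transformation the three detected values lie on a compact scale (10, 20 and 30), which is why histograms of log2 values are much easier to read than histograms of raw intensities.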

2.4 Categorical annotation of rows

To compare conditions in this quantitative analysis, you must first define which samples belong to which condition by annotating them with categories (groups). This is the crucial step that makes the statistical analysis in later steps possible.

  1. Go to Processing –> Annot. rows –> Categorical annotation rows.

  2. Select sample replicates belonging to the same condition

  3. Press the “tick” button

  4. Name the condition by removing the replicate-specific characters from the sample names, so that all replicates of one condition share the same name

  5. Press OK

  6. Name other conditions accordingly.
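The grouping idea behind these steps can be sketched in Python. The sample names and the naming scheme (a condition name followed by a replicate number) are assumptions made for illustration; real sample names may need a different rule.

```python
import re
from collections import defaultdict


def condition_name(sample):
    """Drop a trailing replicate number and its separator (assumed naming scheme)."""
    return re.sub(r"[_\-. ]?\d+$", "", sample)


# Hypothetical sample names: two conditions with three replicates each.
samples = ["control_1", "control_2", "control_3", "treated_1", "treated_2", "treated_3"]

groups = defaultdict(list)
for sample in samples:
    groups[condition_name(sample)].append(sample)

print(dict(groups))
```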

2.5 Filtering valid values

The log transformation of the expression values generates a pool of “NaN” (Not a Number) values, which correspond to expression values originally equal to zero, i.e. proteins that were not detected by the mass spectrometer. To set the stringency of your analysis, you must define the minimum number of valid values required for a protein to be kept. This may increase the confidence in your data.

  1. Go to Processing –> Filter rows –> Filter rows based on valid values.

  2. In the Number of valid values parameter type the number 2.

  3. In the Mode parameter select In at least one group.

  4. In the Filter mode select Reduce matrix.
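The rule "at least 2 valid values in at least one group" can be sketched as follows. The matrix, protein IDs and group names are made up; this illustrates the logic only and is not Perseus code.

```python
import math


def count_valid(values):
    """Count the values that are not NaN."""
    return sum(1 for v in values if not math.isnan(v))


def passes_filter(row, groups, min_valid=2):
    """True if at least one group has min_valid or more valid (non-NaN) values."""
    return any(
        count_valid([row[col] for col in cols]) >= min_valid
        for cols in groups.values()
    )


nan = float("nan")
groups = {"A": ["A_1", "A_2", "A_3"], "B": ["B_1", "B_2", "B_3"]}
matrix = {
    "P1": {"A_1": 20.1, "A_2": 20.4, "A_3": nan, "B_1": nan, "B_2": nan, "B_3": nan},
    "P2": {"A_1": 18.0, "A_2": nan, "A_3": nan, "B_1": 17.5, "B_2": nan, "B_3": nan},
    "P3": {"A_1": nan, "A_2": nan, "A_3": nan, "B_1": 22.0, "B_2": 21.8, "B_3": 22.3},
}

# "Reduce matrix" corresponds to dropping the rows that fail the filter.
reduced = {pid: row for pid, row in matrix.items() if passes_filter(row, groups)}
print(sorted(reduced))
```

Note that P2 is removed although it has two valid values in total: they are spread over two groups, so neither group reaches the minimum on its own.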

2.6 Annotation

Based on a column containing protein (or gene or transcript) identifiers, this activity adds columns with annotations. These are read from specifically formatted files contained in the folder ‘\conf\annotations’ in your Perseus installation. Species-specific annotation files generated from UniProt can be downloaded from the link specified in the menu at the blue box in the upper left corner.

  1. Go to Processing –> Annot. columns –> Add annotation

  2. In Source select one of the mainAnnot files

  3. In Annotations to be added select some GO databases and Taxonomy and click the > button

  4. In Additional sources select all files and transfer them to the right

2.7 Removing/changing/renaming columns

  1. At any moment you can rename columns, change their type, and/or remove them from the matrix

  2. Go to Rearrange and explore the options there

2.8 Adding filter columns

  1. Go to Processing –> Filter rows –> Filter rows based on numerical/main column

  2. Select the Peptides column as x, in order to flag proteins with a low number of identified peptides

  3. In Relation 1 type x>2

  4. In Filter mode select Add categorical column
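In contrast to reducing the matrix, this mode keeps all rows and only records the outcome of the test in a new column. A minimal sketch of that idea is shown below; the new column name and the labels "keep" and "discard" are illustrative assumptions and not necessarily the labels Perseus writes.

```python
def add_filter_column(rows, source="Peptides", threshold=2, new_column="Peptide filter"):
    """Add a categorical column marking rows whose source value exceeds the threshold."""
    for row in rows:
        row[new_column] = "keep" if float(row[source]) > threshold else "discard"
    return rows


# Made-up rows; values are strings, as they would be when read from a text file.
rows = [
    {"Protein IDs": "P1", "Peptides": "1"},
    {"Protein IDs": "P2", "Peptides": "2"},
    {"Protein IDs": "P3", "Peptides": "3"},
    {"Protein IDs": "P4", "Peptides": "7"},
]

add_filter_column(rows)
print([(row["Protein IDs"], row["Peptide filter"]) for row in rows])
```

Because no rows are removed, the new column can later be used to color points in a plot or to filter at a different stage of the analysis.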

3 Data evaluation

3.1 Sample distribution

  1. Select the Matrix before log2 transformation

  2. Go to Analysis –> Basic visualization –> Histogram

  3. Select columns for histogram

  4. Click OK

  5. Now go to the Matrix with log2 values and repeat the histogram

3.2 Correlation

Once you have transformed the LFQ intensity values to log2 values and filtered for valid values, you can estimate the degree of correlation between the different samples and see to what extent the replicates are similar or dissimilar to each other. The calculated correlation values may help you to identify, and if necessary eliminate, datasets that behave as outliers.

  1. Go to Analysis –> Basic visualization –> Multi scatter plot.

  2. Select columns for correlation matrix

  3. Click OK

  4. Try to select only Escherichia coli proteins
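The number reported for each pair of samples in such a plot is typically the Pearson correlation coefficient. A minimal sketch of how it can be computed for two columns of log2 intensities is shown below. The values are made up, and the choice to skip proteins with a missing value in either sample is an assumption of this example.

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both values are valid (non-NaN)."""
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


nan = float("nan")
replicate_1 = [20.0, 22.0, 24.0, nan, 26.0]
replicate_2 = [20.5, 22.5, 24.5, 25.0, 26.5]  # follows replicate_1 closely
outlier = [26.0, 24.0, 22.0, 21.0, 20.0]      # opposite trend

print(round(pearson(replicate_1, replicate_2), 3))
print(round(pearson(replicate_1, outlier), 3))
```

Good replicates give coefficients close to 1, while a sample with a clearly lower coefficient against all others is a candidate outlier.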

3.3 Principal component analysis

Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed. A PCA plot shows clusters of samples based on their similarity and can help to observe trends, jumps, groups and outliers. This overview may uncover the relationships between observations and variables.

  1. Go to Analysis –> Principal component analysis.

3.4 Numeric Venn diagram

The number of proteins identified in each group or condition, and in each combination of groups, can be calculated with the Numeric Venn diagram function.

  1. Go to Analysis –> Numeric Venn diagram.
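The counts behind a two-group Venn diagram can be sketched with Python sets. The matrix below is made up, and a protein is counted as identified in a group if it has at least one valid value there, which is an assumption of this example.

```python
import math


def identified(matrix, columns):
    """Set of protein IDs with at least one valid (non-NaN) value in the given columns."""
    return {
        pid
        for pid, row in matrix.items()
        if any(not math.isnan(row[col]) for col in columns)
    }


nan = float("nan")
matrix = {
    "P1": {"A_1": 20.0, "A_2": 20.3, "B_1": 19.8, "B_2": nan},
    "P2": {"A_1": 18.2, "A_2": nan, "B_1": nan, "B_2": nan},
    "P3": {"A_1": nan, "A_2": nan, "B_1": 22.1, "B_2": 22.4},
    "P4": {"A_1": 25.0, "A_2": 24.7, "B_1": 24.9, "B_2": 25.2},
}

in_a = identified(matrix, ["A_1", "A_2"])
in_b = identified(matrix, ["B_1", "B_2"])

venn = {
    "only A": len(in_a - in_b),
    "A and B": len(in_a & in_b),
    "only B": len(in_b - in_a),
}
print(venn)
```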

3.5 Profile plots

In these plots you can observe all identified proteins and their intensities across the different samples. This kind of visualization is good for trend evaluation and for QC purposes.

  1. Go to Visualize –> Profile plot

  2. From the list of proteins, try to select only Human proteins

  3. Observe the trend of protein intensities

  4. Now, try to select only Escherichia coli or only Saccharomyces cerevisiae proteins

4 Statistical analysis

In order to see if there are any differences between the given conditions, it is necessary not only to identify the proteins present in a sample, but also to perform statistical tests to determine if the changes observed experimentally are statistically significant. Perseus offers you at least four different statistical tests that can be performed for the proteome data analysis.

4.1 Two-sample T-test

The two-sample test, or differential expression analysis, quantifies whether subgroup (condition) differences are significant. The t-statistic expresses the difference between two subgroups in units of standard error (SE), i.e. the standard deviation normalized for sample size:

\[t = \frac{\textrm{difference}}{\frac{\textrm{sd}}{\sqrt n}}\]

When samples from two subgroups are many SE units away from each other, the \(t\) value will be large, and the difference likely reflects a true subgroup difference. When samples from two different subgroups are close to each other, on the other hand, the \(t\) value will be small, and the difference could easily have arisen through random sampling alone. The p value is the probability of observing a difference at least this large through random sampling alone, if there were no true difference between the subgroups. It is derived from the \(t\) value, which expresses a signal (difference) to noise (standard error) ratio, and is very useful for feature (protein) prioritization. A general convention is to call differences with \(p\) < 0.05 significant.

As a result of the Perseus analysis, two numerical columns are added to the data matrix, one containing the p-value and the other containing the difference between the means. In addition, a categorical column is added that contains the symbol ‘+’ when the change in protein abundance between the groups is statistically significant with respect to the specified threshold.
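To make the t-statistic concrete, here is a small worked sketch for a single protein with three replicates per condition. It uses the Welch form of the standard error, which is an assumption of this example and not necessarily the variant selected in Perseus; the log2 intensities are made up.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (difference of means, Welch t-statistic) for two groups of log2 intensities."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    # Standard error of the difference: per-group variance normalized for sample size.
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return diff, diff / se


condition_a = [24.0, 25.0, 26.0]
condition_b = [20.0, 21.0, 22.0]

difference, t_value = welch_t(condition_a, condition_b)
print(difference, round(t_value, 3))
```

Here the means differ by 4 log2 units (a 16-fold change) and the difference is almost 5 standard errors wide. Converting the t value into a p value requires the t-distribution, which is not part of the Python standard library; `scipy.stats.ttest_ind` with `equal_var=False` is one way to obtain it.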

  1. Go to Tests –> Two-sample tests

  2. After the new matrix is created, go to Analysis –> Scatter plot

  3. Try to create a volcano plot and label different organisms with different colors

4.2 Enrichment

  1. In the Scatter plot view, mark some proteins that are significant and export them as a new category in the new matrix

  2. Go to Annot. columns –> Fisher exact test and perform the test

  3. In the Column select our new Selection column
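The idea behind the enrichment test can be sketched with the hypergeometric distribution: given how many proteins in the whole matrix carry an annotation term, how surprising is the number found in the selection? The function below is a one-sided sketch with made-up counts. Whether Perseus uses exactly this variant, and how it corrects for multiple testing, is not covered here.

```python
from math import comb


def fisher_enrichment_p(selected_with_term, selected_total, term_total, population_total):
    """One-sided Fisher exact (hypergeometric) p-value for over-representation.

    Probability of finding at least `selected_with_term` annotated proteins when
    drawing `selected_total` proteins at random from the whole population.
    """
    upper = min(selected_total, term_total)
    favourable = sum(
        comb(term_total, k) * comb(population_total - term_total, selected_total - k)
        for k in range(selected_with_term, upper + 1)
    )
    return favourable / comb(population_total, selected_total)


# Made-up example: 20 proteins in the matrix, 5 of them carry the term,
# 4 proteins were selected in the scatter plot and 3 of those carry the term.
p_value = fisher_enrichment_p(
    selected_with_term=3, selected_total=4, term_total=5, population_total=20
)
print(round(p_value, 4))
```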

4.3 Heatmap

A heat map is a graphical representation of the data in which the individual values contained in a data matrix are represented as colors. Combined with clustering, it shows the relative expression levels of the proteins grouped by similar behavior. To visualize the differences in protein abundance, we perform a series of actions for data clustering and visualization.

  1. Go to Analysis –> Clustering –> Hierarchical clustering

  2. Click OK

  3. What are the grey spaces within the plot?

4.4 Imputation of missing values

Proteomic data will always contain some proteins that were identified in one sample but not in another. There are many reasons for this. First, the sample preparation is never exactly the same for all samples, and small differences can arise at this step. Secondly, MS instruments can only measure ions within a limited abundance range (lower and upper limits of detection), and this range can differ from sample to sample depending on the complexity of the sample. Thirdly, MS measurements involve some stochasticity in the selection of the ions that are fragmented (“sequenced”) and thus identified, and this, again, can vary from measurement to measurement.

Note: Imputation will not always make sense for your project, so if you have any doubts, contact us and we can discuss it.

  1. Go to Processing –> Imputation –> Replace missing values from normal distribution

  2. Go to Analysis –> Basic visualization –> Histogram

  3. Select columns for histogram

  4. Click OK

  5. Label imputed values in the histograms
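The principle of this imputation can be sketched as follows: missing values in a column are replaced by random draws from a normal distribution that is narrower than the measured distribution and shifted towards low intensities, mimicking proteins below the detection limit. The parameter values 0.3 (width) and 1.8 (down shift), both in units of the column's standard deviation, are assumed here and should be checked against the settings in your Perseus version; the data are made up.

```python
import math
import random
import statistics


def impute_from_normal(column, width=0.3, down_shift=1.8, seed=0):
    """Replace NaN values with draws from a down-shifted, narrowed normal distribution.

    Returns the imputed column and a list of flags marking the imputed positions.
    """
    valid = [v for v in column if not math.isnan(v)]
    mean = statistics.mean(valid)
    sd = statistics.stdev(valid)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    was_imputed = [math.isnan(v) for v in column]
    imputed = [
        rng.gauss(mean - down_shift * sd, width * sd) if missing else v
        for v, missing in zip(column, was_imputed)
    ]
    return imputed, was_imputed


nan = float("nan")
sample = [22.0, 24.5, nan, 26.0, 23.5, nan, 25.0]

imputed, was_imputed = impute_from_normal(sample)
for value, flag in zip(imputed, was_imputed):
    print(round(value, 2), "imputed" if flag else "measured")
```

Keeping the `was_imputed` flags corresponds to labeling imputed values in the histogram: it lets you see how much of the low-intensity tail comes from imputation rather than from measurement.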

5 Saving your analysis and results

Once your analysis is complete or you want to finish the analysis later on, you can save any matrix you have by exporting it to a file.

  1. Go to Export –> General export.

  2. The results are saved as .txt files and can be opened again either in Perseus or in programs such as Microsoft Excel.

  3. To save the whole analysis, save the Perseus session. This can be done at any point in the analysis.

  4. Go to File –> Save session

  5. To save the sequence of analysis steps you have performed so that you can apply it again, go to Workflow –> Save As...

We strongly recommend that you always save your Perseus session with the Perseus version in the file name. Experience shows that new versions of the software can be incompatible with sessions saved in older versions. We also recommend saving the session after each crucial step of the data analysis.

Your final data as .txt file can now be opened in other programs for further functional annotation of the candidate proteins.

6 Exercise

Things to reproduce on the imputed data:

  1. Evaluate the effect of imputation on the data with the tools you learned before.

  2. Try to create a new matrix containing only Escherichia coli proteins and once this is done, explore different normalization methods and their influence on the data.

  3. Now recreate everything in Excel ;)