High-Throughput Interaction Data Matrix Analysis

Documentation

The High Throughput Interaction Data Handler takes a data matrix as input, in the form of a file uploaded to the web site. This file must follow a particular File Format. In short, the file should be a delimited (tab, comma, semicolon...) text file; have two header lines, with at least the second containing the respective column identifiers (proteins, protein domains...); and have the numeric data values starting from the first column of the matrix. Right after the data columns, a column with line identifiers (for example peptide or protein sequences) should be present. For an example of a valid interaction dataset/data matrix see here. Please note that some of the following tools, such as Cytoscaper, require some additional features in the file to produce output under particular conditions.
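
As an illustration of the layout described above, the following is a minimal sketch of how such a file could be read. It is not the code used by the web application: the function name parse_matrix, the semicolon default and the sample names (ORF_A, DomA, AAAAPPLPPR...) are made up for the example, and the sketch assumes that the data columns are the leading numeric cells of the first data line.

```python
import csv
import io


def parse_matrix(text, delimiter=";"):
    """Split a delimited matrix into its two header lines, values and line identifiers.

    Hypothetical helper; assumes two header lines followed by data lines.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    header1, header2, body = rows[0], rows[1], rows[2:]

    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    # Assumption: the data columns are the leading numeric cells of the first data line.
    n_data = 0
    for cell in body[0]:
        if not is_number(cell):
            break
        n_data += 1

    values = [[float(c) for c in row[:n_data]] for row in body]
    # The line identifier sits in the column right after the data columns.
    line_ids = [row[n_data] for row in body]
    return header1[:n_data], header2[:n_data], values, line_ids


# Made-up sample: two data columns, then the peptide sequence column.
sample = (
    "ORF_A;ORF_B\n"
    "DomA;DomB\n"
    "120;3500;GQYRRTIVIPRRFFT\n"
    "0;800;AAAAPPLPPR\n"
)
names1, names2, values, line_ids = parse_matrix(sample)
```

The other sketches in this page take the parsed pieces (column names, values, line identifiers) as plain Python lists, so they do not depend on this particular parser.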

The High Throughput Interaction Data Handler can perform a number of operations/transformations on your input data matrix. Each operation is performed by a distinct script or combination of scripts with a dedicated name.
A detailed description of the available tools, by name, is provided here:

Cytoscaper

Output for motif/consensus analysis

Matrix transformation for selectivity/specificity analysis

Cut-and-Sort according to input peptide list

Thresholder

Collapser

Grapher

Cytoscaper

Cytoscaper extracts selected information from the matrix and outputs it as a new file suitable as input for interaction network analysis/drawing tools such as Cytoscape, Visant and others. Typically, these programs accept as input a file in which each line describes one interaction, in a form such as:

Name_of_interactor_1 some_funny_symbol Name_of_interactor_2 Value_related_to_the_affinity (optional)

For example, a typical line of an input file for Cytoscape, in a format without affinity values, could be:
Myo3 pp GQYRRTIVIPRRFFT
where Myo3 is the common name of a yeast protein, GQYRRTIVIPRRFFT is a peptide sequence, and pp is a symbol telling Cytoscape that this is a protein-protein interaction. The file will contain one such line for each interaction present in the input interaction data matrix.

Cytoscaper offers various output options.

The first thing to select is the application for which you wish to output the data. At the moment two choices are available: Cytoscape and Visant. More options to come. If you would like a particular output to be added to the options, please do contact us.

The second option is the selection of a threshold. An interaction is included in the output file, as a line, only if its value is equal to or above the selected threshold. This makes it possible to build different interaction networks from the same data using different thresholds, or simply to filter out the lower-strength (sometimes less reliable) interactions.

You can then select the header line (first or second line of the file) used to name the first interactor. This feature stems from the common situation of having two (or more) different names for the same object, both useful. At present two different column qualifiers (names for the first interactor) are allowed, in the form of the two header lines required by the present File Format specifications.
Two possibilities also exist for the name of the second interactor, but at present this requires complex file specifications.
We suggest keeping the preselected option (peptide sequence), which takes the name for each line from the column of line qualifiers that immediately follows the numerical values (data), as indicated in the File Format specifications.

In some cases the network analysis/drawing applications accept different forms of input data, for example with or without a numeric value for each interaction. In the last part of the Cytoscaper form you can choose between these forms. This choice is available only when the selected network analysis program supports it.
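
The options above (threshold, choice of names, optional value) can be pictured with a short sketch. This is only an illustration under stated assumptions, not the actual Cytoscaper script: the function name to_interaction_lines is hypothetical, fields are joined with a single space as in the example line above, and the second column name and second peptide are made up.

```python
def to_interaction_lines(col_names, values, line_ids, threshold,
                         with_values=False, symbol="pp"):
    """Return one text line per interaction whose value is >= threshold.

    Hypothetical helper: col_names are the names chosen for the first
    interactor, line_ids the names of the second interactor.
    """
    lines = []
    for row, line_id in zip(values, line_ids):
        for name, value in zip(col_names, row):
            if value >= threshold:
                fields = [name, symbol, line_id]
                if with_values:
                    # Assumption: the optional value is appended as a last field.
                    fields.append(str(value))
                lines.append(" ".join(fields))
    return lines


col_names = ["Myo3", "DomB"]          # "DomB" is a made-up name
values = [[3500.0, 120.0], [800.0, 0.0]]
line_ids = ["GQYRRTIVIPRRFFT", "AAAAPPLPPR"]

lines = to_interaction_lines(col_names, values, line_ids, threshold=800)
lines_with_values = to_interaction_lines(col_names, values, line_ids,
                                         threshold=800, with_values=True)
```

Raising or lowering the threshold argument and re-running the function gives the different networks mentioned above.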

Output for motif/consensus analysis

For each column qualifier (protein/protein domain), this tool returns the sequences (line qualifiers) of all the peptides that bind to the selected domain with an interaction value above a defined threshold, as a list with one item (peptide sequence) per line. This list can be used as input for applications such as MEME that are able to align the input sequences and derive a consensus from the alignment. As options you can:

Define an interaction threshold for a peptide to be included in the output list

Select the column qualifier for which you want an output, or simply state that an output for ALL column qualifiers (proteins/domains) is needed.
Please note: in this case the system uses the second of the two header lines as column qualifier.
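
A minimal sketch of this selection follows. The function name peptides_per_column and the sample data are hypothetical, and the sketch assumes that "above" means strictly greater than the threshold.

```python
def peptides_per_column(col_names, values, line_ids, threshold, selected=None):
    """Map each column qualifier to the peptides whose value exceeds threshold.

    Hypothetical helper: selected=None stands for "ALL column qualifiers".
    """
    result = {}
    for j, name in enumerate(col_names):
        if selected is not None and name != selected:
            continue
        # Assumption: "above" is read as strictly greater than the threshold.
        result[name] = [line_ids[i] for i, row in enumerate(values)
                        if row[j] > threshold]
    return result


col_names = ["DomA", "DomB"]   # taken from the second header line
values = [[3500.0, 120.0], [800.0, 0.0], [50.0, 900.0]]
line_ids = ["GQYRRTIVIPRRFFT", "AAAAPPLPPR", "PPPPLPPKK"]

all_lists = peptides_per_column(col_names, values, line_ids, threshold=500)
one_list = peptides_per_column(col_names, values, line_ids, threshold=500,
                               selected="DomB")

# One peptide sequence per line, ready to be saved as a text file.
text_for_doma = "\n".join(all_lists["DomA"])
```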

Cut and sort according to input peptide list

While selecting this tool, the user is requested to upload a file containing a list of peptide sequences, one for each line. This list is used to remove from the input file all the peptides (lines) not present in the list. The remaining lines are sorted according to the order of the peptides in the peptide sequences list.

It is not uncommon for the input matrix file to contain several lines corresponding to the same peptide sequence. The "Cut and Sort" tool offers the option to "Collapse" these repeated lines in its output into a single line that contains, for each column, the average of the corresponding values of the original lines. At the moment the average is returned as AN INTEGER VALUE.

During the Cut and Sort operation a log is printed to the screen indicating if and how many copies (lines of the input file) were found for each of the peptides of the input list. If you select the option of collapsing the file after the Cut and Sort operation, a log printed by the collapser script itself will appear after the Cut and Sort log. At the end of the page a link is provided to download a semicolon-delimited .csv output file.
Please note that the Collapser tool is also accessible as a standalone tool.
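
The cut, sort and copy-count steps can be sketched as follows. This is an illustration only: the function name cut_and_sort is hypothetical, and the sketch assumes that repeated copies of a peptide are kept next to each other in their original order.

```python
def cut_and_sort(values, line_ids, wanted):
    """Keep only the lines whose peptide is in wanted, in the order of wanted.

    Hypothetical helper; also returns how many copies of each wanted
    peptide were found, which is the information reported in the log.
    """
    # Group matrix lines by peptide sequence, preserving their original order.
    by_peptide = {}
    for row, line_id in zip(values, line_ids):
        by_peptide.setdefault(line_id, []).append(row)

    kept_rows, kept_ids, counts = [], [], {}
    for peptide in wanted:
        copies = by_peptide.get(peptide, [])
        counts[peptide] = len(copies)
        for row in copies:
            kept_rows.append(row)
            kept_ids.append(peptide)
    return kept_rows, kept_ids, counts


values = [[1, 2], [3, 4], [5, 6], [7, 8]]
line_ids = ["AAA", "BBB", "AAA", "CCC"]      # made-up peptide sequences
wanted = ["CCC", "AAA", "DDD"]               # the uploaded peptide list

kept_rows, kept_ids, counts = cut_and_sort(values, line_ids, wanted)
```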

Thresholder

This very simple tool analyses the numeric data values of each line in the input file. If none of the values in a line is equal to or above the selected threshold, the line is discarded.

During this operation, a log is printed to the screen with a report for each removed line and a final complete list of line positions and corresponding peptide sequences that were removed. At the end of the page a link is provided to download a semicolon-delimited .csv output file.
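
The rule above fits in a few lines. The sketch below is not the Thresholder script itself: the function name threshold_rows is hypothetical, and line positions are assumed to be counted from 1 over the data lines.

```python
def threshold_rows(values, line_ids, threshold):
    """Discard every line in which no value reaches the threshold.

    Hypothetical helper; returns the kept lines plus a list of
    (position, line identifier) pairs for the removed lines.
    """
    kept_rows, kept_ids, removed = [], [], []
    for position, (row, line_id) in enumerate(zip(values, line_ids), start=1):
        if any(v >= threshold for v in row):
            kept_rows.append(row)
            kept_ids.append(line_id)
        else:
            removed.append((position, line_id))
    return kept_rows, kept_ids, removed


values = [[100, 900], [10, 20], [500, 0]]
line_ids = ["AAA", "BBB", "CCC"]             # made-up peptide sequences

kept_rows, kept_ids, removed = threshold_rows(values, line_ids, threshold=500)
```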

Collapser

Collapser, as the name suggests, collapses multiple lines that share the same line identifier into a single line. In the input file, the line identifiers are always located in the column right after the numerical data column(s); see the File Format specifications.

Collapser starts by extracting the column corresponding to the line identifiers from the input matrix file. It then checks for repeated items and builds a collection of the repeated lines. For each group of repeated lines sharing the same identifier, Collapser averages the values for each column and replaces the multiple lines by a single averaged line.

The result of this operation is a semicolon-delimited .csv output file in which no lines are repeated.
While running, Collapser prints a log of the duplicated lines to the screen, so that the user can see which lines were repeated in the file and how many times.
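
The averaging step can be sketched as below. This is an illustration under assumptions, not the Collapser script: the function name collapse is hypothetical, averages are kept as floating-point numbers (rounding is left to the caller), and each collapsed line is placed where its identifier first appeared.

```python
def collapse(values, line_ids):
    """Replace lines sharing an identifier by one line of column averages.

    Hypothetical helper; also returns the number of copies found for
    each repeated identifier, as reported in the log.
    """
    # Group lines by identifier; dicts keep the order of first appearance.
    groups = {}
    for row, line_id in zip(values, line_ids):
        groups.setdefault(line_id, []).append(row)

    out_rows, out_ids, repeats = [], [], {}
    for line_id, rows in groups.items():
        # zip(*rows) walks the group column by column.
        out_rows.append([sum(col) / len(rows) for col in zip(*rows)])
        out_ids.append(line_id)
        if len(rows) > 1:
            repeats[line_id] = len(rows)
    return out_rows, out_ids, repeats


values = [[10, 20], [1, 1], [30, 40]]
line_ids = ["AAA", "BBB", "AAA"]             # made-up peptide sequences

out_rows, out_ids, repeats = collapse(values, line_ids)
```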

Grapher

Grapher transforms your matrix of numerical values into a colored matrix, assigning colors on the basis of the numerical values contained in the data section of your input matrix (see here). Color assignment is dynamic. Grapher first determines the lowest (excluding 0) and highest values contained in your matrix. The resulting interval (highest value minus lowest value) is subdivided according to the number of colors available, which depends on the color scheme selected in the Grapher options. The most popular scheme, a simple scale of reds, contains 32 colors. Each number of the input matrix is then assigned a color depending on its value. The correspondence between colors and value ranges is always displayed at the bottom of the Grapher results page as a dynamically generated color legend. It is important to understand that the color assignments vary from one file/matrix to the next, as they are calculated from the data contained in each specific matrix.
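
The dynamic color assignment can be pictured with the following sketch. It is only one possible reading of the description above, not the Grapher code: the function name assign_colors is hypothetical, the interval is assumed to be split into equal-width ranges, and zero values are assumed to receive no color.

```python
def assign_colors(values, n_colors=32):
    """Map each value to a color index in 0..n_colors-1, plus a legend.

    Hypothetical helper; assumes equal-width ranges between the lowest
    non-zero value and the highest value, and non-negative data.
    """
    nonzero = [v for row in values for v in row if v != 0]
    if not nonzero:
        return [[None for _ in row] for row in values], []
    low, high = min(nonzero), max(nonzero)
    step = (high - low) / n_colors

    def index_of(v):
        if v == 0:
            return None              # assumption: zeros are left uncolored
        if step == 0:
            return 0                 # all non-zero values are identical
        # The highest value would land on n_colors, so clamp it to the last range.
        return min(int((v - low) / step), n_colors - 1)

    indices = [[index_of(v) for v in row] for row in values]
    # Legend: the (start, end) value range covered by each color index.
    legend = [(low + i * step, low + (i + 1) * step) for i in range(n_colors)]
    return indices, legend


# Small made-up matrix with 4 colors, to keep the numbers easy to follow.
indices, legend = assign_colors([[0, 10], [50, 90]], n_colors=4)
```

Because low, high and step are recomputed for every matrix, the same value can receive different colors in two different files, which is the behaviour described above.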

Many aspects of the Grapher output are customizable:

Before drawing the picture, data can be transformed for a specificity/promiscuity analysis (see details here) by columns or by rows

Many aspects of the graphical output can be controlled. In particular, each numerical value is represented by a bar, the height and width of which can be set in the options. Three different color schemes are available at the moment: scale of reds (32 colors), multicolor (53 colors) or megacolor (73 colors). For large data matrices, keep in mind that schemes with more colors take more time to compute and draw to screen.

A sample grapher output with the "scale of reds" color scheme selection

A sample grapher output with the "multicolor" color scheme selection

A note on how the picture is drawn: the output of grapher is in fact an HTML table. Each cell of the table corresponds to a cell containing data values of the input matrix file. The cell of the HTML table is filled by a single pixel image extended to the width and height selected in the options by the user. On positioning the mouse over a color bar in the output of Grapher, the corresponding column and line identifiers are shown, together with the related numerical value.
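
A minimal sketch of this kind of table is given below. It is not the markup produced by Grapher: the function name render_table, the image file name and the use of the title attribute for the mouse-over text are all assumptions made for the example.

```python
from html import escape


def render_table(col_names, values, line_ids, images, width=10, height=10):
    """Build an HTML table with one stretched single-pixel image per value.

    Hypothetical helper: images is a matrix of image paths, one per cell,
    chosen beforehand according to the color of each value.
    """
    rows_html = []
    for row, line_id, image_row in zip(values, line_ids, images):
        cells = []
        for name, value, image in zip(col_names, row, image_row):
            # Assumption: the mouse-over text is carried by the title attribute.
            tip = escape(f"{name} / {line_id}: {value}", quote=True)
            cells.append(
                f'<td><img src="{escape(image, quote=True)}" '
                f'width="{width}" height="{height}" title="{tip}"></td>'
            )
        rows_html.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows_html) + "</table>"


# Made-up one-cell matrix; "red_05.gif" is a hypothetical single-pixel image.
html_table = render_table(["DomA"], [[120.0]], ["AAAAPPLPPR"],
                          [["red_05.gif"]], width=8, height=12)
```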

A web application written in Python by Andrea Cabibbo