Introduction

This document describes the output produced by the pipeline. All paths are relative to the top-level results directory (defined by the --outdir parameter).

The directories listed below will be created in the results directory after the pipeline has finished.

Main output files

MultiQC

This report is located at reporting/multiqc_report.html and can be opened in a browser.

MultiQC (http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. The pipeline also includes steps that report the software versions in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Dash Plotly app

reporting/dash_app/: folder containing the Dash Plotly app

To launch the app, you must first create and activate the appropriate conda environment:

conda env create -n dash_app -f reporting/dash_app/environment.yml
conda activate dash_app

then:

cd reporting/dash_app
python app.py

and open your browser at http://localhost:8080

Note

The app tries port 8080 by default. If that port is already in use, it tries 8081, 8082 and so on. Check the logs to see which port it is using.
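
The fallback behaviour described in this note can be illustrated with a small probe. This is a minimal sketch, not the app's actual code: the function name find_free_port and its parameters are hypothetical, and it only assumes that the app walks upwards from 8080 until a port can be bound.

```python
import socket


def find_free_port(start=8080, max_tries=20, host="127.0.0.1"):
    """Return the first port in [start, start + max_tries) that can be bound.

    Hypothetical helper: it mimics the "8080, then 8081, 8082..." search
    described above, but it is not taken from the Dash app itself.
    """
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port found in {start}-{start + max_tries - 1}")
```

Calling find_free_port() before launching the app gives a reasonable guess of the port to open in the browser, although the app logs remain the authoritative source.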

Gene statistics and scores

The pipeline also exports a summary of all genes, located at reporting/all_genes_summary.csv. It contains their statistics, scores, ranks and respective sections.

Merged data

Parquet files containing all normalised gene counts are also stored in the merged_data/ directory.

Merged data
  • merged_data/all_counts.imputed.parquet: parquet file containing all normalised + imputed gene counts
  • merged_data/all_counts.parquet: parquet file containing all normalised gene counts
  • merged_data/whole_design.csv: table containing designs for all datasets and all samples comprised in the analysis
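
To confirm that a run produced the main output files listed so far, a short check along the following lines may help. It is a sketch under the assumption that the file names match this document exactly; MAIN_OUTPUTS and missing_outputs are illustrative names, not part of the pipeline.

```python
from pathlib import Path

# Main output files named in this document, relative to --outdir.
MAIN_OUTPUTS = [
    "reporting/multiqc_report.html",
    "reporting/all_genes_summary.csv",
    "merged_data/all_counts.imputed.parquet",
    "merged_data/all_counts.parquet",
    "merged_data/whole_design.csv",
]


def missing_outputs(outdir):
    """Return the expected main output files that are absent from outdir."""
    root = Path(outdir)
    return [rel for rel in MAIN_OUTPUTS if not (root / rel).is_file()]
```

For example, missing_outputs("results") returns an empty list when every file is present in a results directory named results.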

Other output files of interest (useful for debugging)

Individual datasets

All individual datasets are also stored at each step of the pipeline, with the following pattern: datasets/<platform>/<normalisation status>/<dataset name>/

Sub sections
  • 0.downloaded/: raw datasets downloaded from public databases
  • 1.id_filtered_renamed/: datasets with filtered and renamed gene IDs
  • 2.samples_filtered/: datasets with filtered samples
  • 3.: TPM / CPM normalisation
    • 3.tpm_normalised/: TPM normalised datasets
    • 3.cpm_normalised/: CPM normalised datasets
  • 4.quantile_normalised/: quantile normalised datasets

The design of each dataset is also stored in its own directory.
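
When debugging, it can be handy to see how far each dataset progressed. The sketch below walks the datasets/ tree and lists the step directories found for each dataset. It assumes that the numbered step directories sit directly inside each <dataset name> directory, which this document suggests but does not state outright; steps_by_dataset is a hypothetical helper.

```python
from pathlib import Path


def steps_by_dataset(outdir):
    """Map 'platform/status/dataset' to the step directories found inside it."""
    root = Path(outdir) / "datasets"
    result = {}
    # Three levels: <platform>/<normalisation status>/<dataset name>
    for dataset_dir in sorted(root.glob("*/*/*")):
        if not dataset_dir.is_dir():
            continue
        # Step directories start with a digit, e.g. "0.downloaded".
        steps = sorted(
            child.name
            for child in dataset_dir.iterdir()
            if child.is_dir() and child.name[:1].isdigit()
        )
        result[dataset_dir.relative_to(root).as_posix()] = steps
    return result
```

A dataset whose list stops at 2.samples_filtered, for instance, did not reach the normalisation steps.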

Expression Atlas / GEO accessions

Accession files
  • accessions/expression_atlas/: accessions found when querying Expression Atlas
  • accessions/geo/: accessions found when querying GEO

ID Mapping

The pipeline also exports the ID mapping metadata used for gene ID conversion.

ID mapping metadata
  • idmapping/global_gene_metadata.csv: table containing the complete set of gene metadata, obtained either via gProfiler or via the custom file provided by the user
  • idmapping/global_gene_id_mapping.csv: table containing the complete set of gene ID mappings, obtained either via gProfiler or via the custom file provided by the user
  • idmapping/valid_gene_ids.txt: list of gene IDs retained as valid

Annotation / gene length

The annotation and gene lengths are also stored in the annotation/ directory.

Files
  • gene_transcript_lengths.csv: transcript length relative to each gene ID
  • <annotation name>.gff3.gz: GFF3 file

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
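
As a starting point for troubleshooting, the trace file can be scanned for tasks that did not finish successfully. This is a minimal sketch that assumes the default Nextflow trace layout (tab-separated, with a header row containing name, status and exit columns); unsuccessful_tasks is an illustrative name.

```python
import csv


def unsuccessful_tasks(trace_path):
    """Return (name, status, exit code) for tasks that neither completed nor were cached."""
    with open(trace_path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [
            (row["name"], row["status"], row.get("exit", ""))
            for row in rows
            if row["status"] not in ("COMPLETED", "CACHED")
        ]
```

Running unsuccessful_tasks("pipeline_info/execution_trace.txt") from the results directory lists the tasks worth inspecting first.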