.. R document

============================
Westlake Imputation Server
============================

---------------------
1. Introduction
---------------------

Westlake Imputation Server is developed by the Diseases & Population (DaP) Geninfo Lab at Westlake University for public use. Users can register and create imputation jobs free of charge under a strict data security policy. The server offers a choice of four reference panels for imputation: WBBC, 1000G Phase 3, WBBC combined with EAS, and WBBC combined with 1000G Phase 3. All panels are available in both GRCh37 and GRCh38 builds to meet different needs. A phasing service is also provided for users who cannot afford the heavy computational load that phasing requires.

Currently, Westlake Imputation Server mainly serves imputation for Asian populations, especially Han Chinese, because imputation depends on how closely the genetic background of the reference panel matches that of the array samples. Anyone can use the server freely; more information is available on the website of the WBBC Project (https://wbbc.westlake.edu.cn/).

.. image:: img/1.png

---------------------
2. Reference Panels
---------------------

Westlake Imputation Server provides four reference panels, each in both GRCh37 and GRCh38 versions: WBBC, 1000G Phase 3, WBBC combined with EAS, and WBBC combined with the whole 1000G Phase 3. Note that all singletons are excluded from the reference panels.

.. _WBBC:

WBBC Phase 1 (no singleton)
---------------------------

+----------+------------------------+---------------------+
| Build    | Number of Haplotypes   | Variants (chr1-22)  |
+==========+========================+=====================+
| GRCh37   | 8,978                  | 34,892,049          |
+----------+------------------------+---------------------+
| GRCh38   | 8,978                  | 35,616,674          |
+----------+------------------------+---------------------+

.. _1000G:

1000G Phase 3 (no singleton, version 5)
---------------------------------------

+----------+------------------------+---------------------+
| Build    | Number of Haplotypes   | Variants (chr1-22)  |
+==========+========================+=====================+
| GRCh37   | 5,008                  | 47,096,290          |
+----------+------------------------+---------------------+
| GRCh38   | 5,096                  | 45,242,581          |
+----------+------------------------+---------------------+

.. _WBBC_EAS:

WBBC Phase 1 + EAS (no singleton)
----------------------------------

+----------+------------------------+---------------------+
| Build    | Number of Haplotypes   | Variants (chr1-22)  |
+==========+========================+=====================+
| GRCh37   | 9,986                  | 40,249,755          |
+----------+------------------------+---------------------+
| GRCh38   | 9,986                  | 40,580,025          |
+----------+------------------------+---------------------+

.. _WBBC_1000G:

WBBC Phase 1 + 1000G Phase 3 (no singleton)
-------------------------------------------

+----------+------------------------+---------------------+
| Build    | Number of Haplotypes   | Variants (chr1-22)  |
+==========+========================+=====================+
| GRCh37   | 13,986                 | 68,932,526          |
+----------+------------------------+---------------------+
| GRCh38   | 14,074                 | 68,196,638          |
+----------+------------------------+---------------------+

.. _South and East Asia Database(no singleton):

South and East Asia Database (no singleton)
--------------------------------------------

+----------+------------------------+---------------------+
| Build    | Number of Haplotypes   | Variants (chr1-22)  |
+==========+========================+=====================+
| GRCh38   | 22,134                 | 80,367,720          |
+----------+------------------------+---------------------+

Cohorts included in this panel:

+------------------------+-----------------------------+------------+-------+
| Cohort                 | Unrelated samples in release| Country    | Depth |
+========================+=============================+============+=======+
| WBBC                   | 4480                        | China      | 13.9x |
+------------------------+-----------------------------+------------+-------+
| SG10K                  | 4563                        | Singapore  | 13.7x |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 30                          | China      | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 30                          | Japan      | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 149                         | Korea      | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 97                          | Mongolia   | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 464                         | India      | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 91                          | Pakistan   | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 25                          | India      | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 8                           | Sri Lanka  | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 64                          | Indonesia  | 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 48                          | Philippines| 36x   |
+------------------------+-----------------------------+------------+-------+
| GAsP                   | 25                          | Vietnam    | 36x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 301                         | China      | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 104                         | Japan      | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 99                          | Vietnamese | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 86                          | Bengali    | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 103                         | Gujarati   | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 102                         | Telugu     | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 96                          | Punjabi    | 30x   |
+------------------------+-----------------------------+------------+-------+
| 1000G Phase 3 (EAS+SAS)| 102                         | Tamil      | 30x   |
+------------------------+-----------------------------+------------+-------+
| Total                  | 11067                       |            |       |
+------------------------+-----------------------------+------------+-------+

---------------------------
3. Before Getting Started
---------------------------

.. _Registration:

Registration
-------------

Registration is required before using Westlake Imputation Server. Click the “Sign-In” button on the navigation bar and enter your email address to register. Once the email address has been verified, the service can be used free of charge.

.. image:: img/2.png

.. _Data_Preparation:

Data Preparation
-----------------

Before uploading, please make sure that:

1) Your array data is split by chromosome, one file per chromosome.
2) The data is in VCF format and compressed with bgzip (e.g. xxx.chr1.vcf.gz, xxx.chr2.vcf.gz …).
3) All variants in each file are sorted by genomic position.
4) The chromosomes are encoded with or without the prefix ‘chr’ according to the build of your data: for GRCh37/hg19, the CHROM column should be encoded as 1, 2…; for GRCh38/hg38, it should be encoded as chr1, chr2….

Several tools are helpful for data preparation. Use Plink v1.9 and BCFtools to perform quality control and convert data formats, and use bgzip to compress your data. Westlake Imputation Server provides a phasing service, but if you prefer to phase the data yourself, we suggest SHAPEIT v2.

Note: So far, Westlake Imputation Server only offers genotype imputation for chromosomes 1-22.

--------------------
4. Getting Started
--------------------

.. _Create_Imputation_Missions:

Create Imputation Missions
---------------------------

After logging in, click the “Create Mission” button on the home page to start your imputation. Type a name for the mission in the “Project Name” textbox. Then upload the array data by clicking the “Upload Files (vcf.gz)” button. A file dialog will open where you can select your VCF files; several files (up to 22) can be uploaded at once.

.. image:: img/3.png

.. image:: img/4.png

Next, choose a reference panel for the imputation; the server offers four alternatives. The server also provides a phasing service if the array data is not yet phased. Then select the genome build (GRCh37 or GRCh38) that matches your array data. Finally, once all array datasets have been uploaded successfully, click the “Submit” button to create the missions. A popup window will appear when the missions have been created.

.. image:: img/5.png

.. image:: img/6.png

Important Notes
-----------------

1. Please make sure the genome build version (GRCh37 or GRCh38) is correct.
2. If the “Phasing by all 1000G/EAS/EUR/AFR/SAS” parameter is set, the 1000G Phase 3 dataset is used as the reference panel for phasing. In this case, please note that target-only sites in the unphased data (sites missing from 1000G Phase 3) are not included in the phased output.
3. If your sample size is small, we strongly recommend NOT using “Phasing-by-Self”.
4. If your data has already been phased, the server skips the phasing step, whether or not a phasing parameter is selected.

.. _Check the Status of Missions:

Check the Status of Missions
-----------------------------

Once the upload is finished, your imputation mission will enter the queue. You can check the status of the mission at any time by clicking the “My Missions” button on the navigation bar:

.. image:: img/7.png

Each chromosome mission passes through several statuses, such as Mission Waiting, Quality Control, Mission Running and Mission Success. Each chromosome has a colored indicator that shows its status at a glance: green means the mission is waiting or running, blue means the mission has completed, and red means the mission failed. If you get a red indicator, please check your input and re-submit. You can abort a mission at any time by clicking the “Delete” button; the mission will be aborted and the corresponding data removed from the server.

.. _Download Imputation Results:

Download Imputation Results
-----------------------------

For each chromosome, a reminder email is sent to you automatically once the imputation mission has completed. You can view the imputation result files by clicking the “Download” button on the “My Mission” page:

.. image:: img/8.png

A new page will open, and you can download the results either by clicking the icon directly or by using the download link below.

.. image:: img/9.png

Note: The imputation results will be deleted 14 days after the mission has completed. Please finish downloading within that time.

----------------------
5. Pipeline Overview
----------------------

Before the actual imputation, Westlake Imputation Server performs quality control to ensure imputation quality and accuracy. The following variants are excluded:

(1) Mismatched SNPs (i.e. the alleles in the study array [ref/alt] do not match 1000G Project Phase 3 v5).
(2) SNPs with a call rate below 90%.
(3) Monomorphic sites.
(4) InDels.
(5) Duplicates (a variant is defined by [chr:pos:ref:alt]).

SHAPEIT2 and MINIMAC4 are used to perform phasing and imputation, respectively. Taking chromosome 20 as an example, the pipeline used in Westlake Imputation Server is shown below:

.. _Quality Control by BCFtools:

Quality Control by BCFtools
---------------------------

.. literalinclude:: bash/1.bash
   :language: bash
   :linenos:

.. _Phasing by SHAPEIT2:

Phasing by SHAPEIT2
---------------------

.. literalinclude:: bash/2.bash
   :language: bash
   :linenos:

.. _Imputation by MINIMAC4:

Imputation by MINIMAC4
-----------------------

.. literalinclude:: bash/3.bash
   :language: bash
   :linenos:

For each chromosome, two result files are generated: the imputed genotype data of the study array (chr20.impu.dose.vcf.gz) and an INFO file (chr20.impu.info) that contains per-variant statistics such as R-square. You can download them by following the instructions in “Download Imputation Results”.

------------------
6. Data Security
------------------

A range of security controls is in place to protect the data and servers:

1) HTTPS is used for communication with the servers.
2) Users must register with a unique e-mail address and a strong password that contains a mix of letters, numbers and symbols. Users can access only their own missions and data.
3) We do not provide long-term storage. Input data must be deleted from our server within a month, or it will be removed automatically.
4) The user receives an email with download links when the imputation mission is finished. The results should be downloaded within one week, or the data will be removed automatically.
5) We store only the number of samples and markers analyzed; we never "look" at your data in any way.
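
------------------------------------------------
Appendix: Checking Input Files Before Upload
------------------------------------------------

Some items of the data preparation checklist in section 3 (one chromosome per file, chromosomes 1-22 only, variants sorted by position, and the ‘chr’ prefix matching the genome build) can be checked locally before upload. The following is a minimal Python sketch of such a check, not the server's own quality control. The helper names ``check_vcf_lines`` and ``check_vcf_file`` are hypothetical and introduced here only for illustration. The sketch assumes well-formed, tab-separated VCF records and relies on bgzip output being readable with Python's standard ``gzip`` module.

.. code-block:: python

   import gzip
   from typing import Iterable, List

   # The server imputes chromosomes 1-22 only.
   AUTOSOMES = {str(i) for i in range(1, 23)}


   def check_vcf_lines(lines: Iterable[str], build: str) -> List[str]:
       """Return a list of problems found in the text lines of one VCF file."""
       if build not in ("GRCh37", "GRCh38"):
           raise ValueError("build must be 'GRCh37' or 'GRCh38'")
       want_prefix = build == "GRCh38"  # GRCh38: chr1, chr2...; GRCh37: 1, 2...
       problems: List[str] = []
       seen_chroms: List[str] = []
       last_chrom, last_pos = None, -1
       sorted_ok = True
       for line in lines:
           if not line.strip() or line.startswith("#"):
               continue  # skip blank lines and VCF header lines
           chrom, pos_text = line.split("\t", 2)[:2]
           pos = int(pos_text)
           if chrom not in seen_chroms:
               seen_chroms.append(chrom)
               has_prefix = chrom.startswith("chr")
               if (chrom[3:] if has_prefix else chrom) not in AUTOSOMES:
                   problems.append(f"{chrom}: not one of chromosomes 1-22")
               if has_prefix != want_prefix:
                   problems.append(f"{chrom}: 'chr' prefix does not match {build}")
           if chrom == last_chrom and pos < last_pos:
               sorted_ok = False
           last_chrom, last_pos = chrom, pos
       if not sorted_ok:
           problems.append("variants are not sorted by genomic position")
       if len(seen_chroms) > 1:
           problems.append("more than one chromosome in file: " + ", ".join(seen_chroms))
       return problems


   def check_vcf_file(path: str, build: str) -> List[str]:
       """Run the same checks on a bgzip-compressed VCF file."""
       with gzip.open(path, "rt") as handle:
           return check_vcf_lines(handle, build)

For example, ``check_vcf_file("xxx.chr20.vcf.gz", "GRCh38")`` returns an empty list when none of these checks fails. An empty list does not guarantee that the server's own quality control will accept the file, since allele mismatches, call rate and duplicates are not examined here.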