Why does this human bam file only have one copy of each chromosome?
In humans, one copy of each chromosome comes from the mother and the other comes from the father, so the genome contains two copies of each chromosome. If we sequence the exome from that DNA, the exome data should therefore also represent two copies of each chromosome (one maternal and one paternal). But in the fastq and bam files of my WES data, I only ever find one copy of each chromosome.
Q1: Where is the other gene copy in the sequence, or have I missed something?
Q2: How can I check the ploidy and find if there are two copies of each chromosome in WES data? How can I do this if my WES data are mapped to the reference and stored in a bam file?
bam sequencing fastq exome
edited 1 hour ago
conchoecia
asked 2 hours ago
Lot_to_learn
1 Answer
The maternal and paternal copies of a chromosome are called haplotypes. Many metazoans, not just humans, are diploid and receive a maternal and a paternal chromosome contribution during sexual reproduction.
Response to Q1
Your question, in other words, is: Why do bam files not differentiate between haplotypes?
Your question gets to a more fundamental kernel of how most people do "genomics" today. When most people assemble a genome or create a reference genome, they actually just produce a fasta file that contains a single sequence for each chromosome. This, of course, is not biologically accurate, as a diploid individual carries two distinct sequences per chromosome. Most genomes only report a chimeric consensus between the two and call this a reference. These reference genomes are haploid, or haplotype-collapsed.
This is where your question comes in: you have mapped reads that derive biologically from two different haplotypes to a reference genome that contains only one consensus sequence, which (inadequately) represents both haplotypes. As a result, at a single locus in the bam file there will be mapped reads from both haplotypes, so both copies are present in your data but stacked on the same coordinates. If your reference fasta file instead contained the exact sequence of both haplotypes, and you mapped the reads to it and looked only at primary alignments, the reads would mostly map to their haplotype of origin.
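To make the "both haplotypes land on the same locus" point concrete, here is a minimal toy sketch in Python. The two short haplotype strings, the read length, and the helper names `make_reads` and `pileup` are all made up for illustration; real reads would come from an aligner and a bam file, not from string slicing.

```python
from collections import Counter

# Hypothetical toy haplotypes: identical except at index 4 (A vs G).
MATERNAL = "ACGTACGTAC"
PATERNAL = "ACGTGCGTAC"

def make_reads(haplotype, read_len=6):
    """Return (start, sequence) for every window of read_len along a haplotype."""
    return [(i, haplotype[i:i + read_len])
            for i in range(len(haplotype) - read_len + 1)]

def pileup(reads, position):
    """Count the bases that reads place at one reference coordinate."""
    counts = Counter()
    for start, seq in reads:
        if start <= position < start + len(seq):
            counts[seq[position - start]] += 1
    return counts

# Both haplotypes share one coordinate system, as on a collapsed reference.
reads = make_reads(MATERNAL) + make_reads(PATERNAL)
het_site = pileup(reads, 4)  # the position where the haplotypes differ
hom_site = pileup(reads, 2)  # a position where they agree
```

At the position where the haplotypes differ, the pileup contains roughly equal numbers of both alleles; everywhere else the reads from the two haplotypes are indistinguishable. That mixture at heterozygous sites is the only trace of the second copy in a bam file aligned to a collapsed reference.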
This gets into another topic called phasing, wherein the order and orientation of most of the polymorphisms unique to each haplotype are determined using sequencing data. There are some problems with this, as it relies on properly detecting variant sites. Software like GATK and others can find single nucleotide polymorphisms (SNPs) if there is good sequencing coverage; however, detecting insertions and deletions is much more difficult. This gives a very SNP-biased view of the haplotype differences in any given genome. After finding variant sites, phasing software like HapCUT2 determines which variants fall on which haplotype and outputs blocks of variants that are thought to belong to the same haplotype.
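The idea of a phased block can be sketched with a toy greedy procedure. This is not how HapCUT2 works internally; it is only an illustration, assuming each read has already been reduced to the alleles (0 = reference, 1 = alternate) it carries at known heterozygous sites. The function name `phase_blocks` and the example positions are hypothetical.

```python
def phase_blocks(sites, reads):
    """Greedy toy phasing of heterozygous sites.

    sites: sorted positions of heterozygous variants.
    reads: list of dicts mapping position -> allele (0 = ref, 1 = alt)
           observed on that read.
    Returns a list of blocks. Each block maps position -> allele carried by
    one haplotype; the other haplotype carries the opposite allele.
    """
    if not sites:
        return []
    blocks = []
    current = {sites[0]: 0}
    for prev, cur in zip(sites, sites[1:]):
        # Reads spanning both sites vote on whether the alleles travel together.
        same = sum(1 for r in reads if prev in r and cur in r and r[prev] == r[cur])
        diff = sum(1 for r in reads if prev in r and cur in r and r[prev] != r[cur])
        if same == diff:
            # No linking reads (or a tie): the phase is unknown, so start a new block.
            blocks.append(current)
            current = {cur: 0}
        else:
            current[cur] = current[prev] if same > diff else 1 - current[prev]
    blocks.append(current)
    return blocks

example_sites = [100, 250, 400, 9000]
example_reads = [
    {100: 0, 250: 1},
    {100: 1, 250: 0},
    {250: 1, 400: 1},
    {250: 0, 400: 0},
    {9000: 1},  # no read links this site to the others
]
blocks = phase_blocks(example_sites, example_reads)
```

The first three sites end up in one block because reads connect them, while the distant site forms its own block. This is why short reads and sparse exome targets tend to produce many small phased blocks rather than chromosome-length haplotypes.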
Phasing alone is not enough to accurately reconstruct the exact sequence of both haplotypes, because read mapping cannot detect all variants. The gold standard of the future is diploid de novo genome assembly, in which each haplotype is assembled independently. This is an active area of research for people who develop genome assemblers and is tightly linked to the advances in PacBio and Oxford Nanopore long reads. This paper on TrioCanu is a good start for learning about one successful diploid assembly technique.
Response to Q2
If you want to check the ploidy of all of the chromosomes, you need at least 10X coverage of whole-genome shotgun data and tools like smudgeplot and GenomeScope.
If you are trying to check local duplications or whole-chromosome duplications for something like a cancer sample, you would also need whole-genome shotgun data. WES data alone does not give reliable information on ploidy, since the read count at a given locus depends on how efficiently that region was captured during exome enrichment, not just on the actual quantity of homologous chromosomal DNA for the region in question.
If you are trying to use a bam file to look for evidence of chromosomal duplication, or duplication of a locus, in cancer samples, you would look for an increase in whole-genome shotgun coverage at that locus compared to a known normal locus. For example, if all chromosomes but one have an average coverage of 22, and one chromosome has an average coverage of 33, this is evidence of a trisomy: 33/22 = 1.5 times the diploid coverage, which corresponds to three copies. This logic can be applied to smaller regions as well (barring repetitive regions, paralogs, et cetera).
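The coverage comparison above amounts to a simple ratio. Here is a minimal sketch in Python, assuming you already have an average whole-genome shotgun coverage per chromosome and that most chromosomes are at the normal diploid copy number. The function name `estimate_copy_number` and the chromosome labels are hypothetical; the numbers are the ones from the example.

```python
from statistics import median

def estimate_copy_number(mean_coverage, baseline_ploidy=2):
    """Scale each chromosome's coverage by the median coverage.

    Assumes the median chromosome is present at baseline_ploidy copies,
    which only holds if most of the genome is at normal copy number.
    """
    baseline = median(mean_coverage.values())
    return {chrom: baseline_ploidy * cov / baseline
            for chrom, cov in mean_coverage.items()}

# Hypothetical per-chromosome average coverage, matching the 22 vs 33 example.
coverage = {"chr1": 22.0, "chr2": 22.0, "chr3": 22.0, "chr21": 33.0}
copies = estimate_copy_number(coverage)
```

Here `copies["chr21"]` comes out at 3 copies and the others at 2. On real data the estimates are noisy, sex chromosomes need separate handling, and heavily rearranged tumour genomes can violate the assumption that the median chromosome is diploid.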
edited 56 mins ago
answered 1 hour ago
conchoecia