Part A
Where: The genetic data come from the remains of individuals who lived in Southeast Asia between about 4100 and 1700 years ago. The samples yielded usable ancient DNA from 18 individuals, specifically from the Man Bac site in Vietnam and from other excavation sites in Myanmar and across Southeast Asia and Japan.
Why: These data were collected because the historical population movements and the early people of Southeast Asia were previously underrepresented in ancient DNA studies. The researchers analyzed these ancient genomes to reconstruct the complex genetic history behind the region’s diversity. By comparing the ancient DNA to present-day populations, they aimed to find direct evidence of how different groups mixed over time, tracing the initial influx of early farmers from South China who met local hunter-gatherers, as well as later waves of Bronze Age migrations that shaped modern populations.
Table of variables with units
Unit: mg. Variable: Equivalent mg bone powder used for library preparation
Unit: base pairs. Variables: Shotgun raw sequences; mtDNA median read length
The dataset has a long format.
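One way to check the one-row-per-library layout is to count how many libraries each individual contributes. This is a minimal sketch on a toy tibble: the first three rows reuse IDs visible in the structure output, while the fourth row (I9999) and the object names are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the real table; the last row is hypothetical.
toy_libs <- tibble(
  `Library ID`    = c("S0626.E1.L1", "S7241.E1.L1", "S0626.E1.L2", "S9999.E1.L1"),
  `Lab Indiv. ID` = c("I0626", "I0626", "I0626", "I9999")
)

# One row per library, so individuals with several libraries appear several times.
libs_per_individual <- toy_libs %>%
  count(`Lab Indiv. ID`, name = "n_libraries", sort = TRUE)

print(libs_per_individual)
```

Running the same `count()` call on the real `data` object would show which individuals are represented by more than one library.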
Code
## tibble [173 × 26] (S3: tbl_df/tbl/data.frame)
## $ Library ID : chr [1:173] "S0626.E1.L1" "S7241.E1.L1" "S0626.E1.L2" "S0626.E1.L3" ...
## $ Lab Indiv. ID : chr [1:173] "I0626" "I0626" "I0626" "I0626" ...
## $ Skeletal codes : chr [1:173] "VN33, 0.7.MB.H1.M10" "VN33, 0.7.MB.H1.M10" "VN33, 0.7.MB.H1.M10" "VN33, 0.7.MB.H1.M10" ...
## $ UDG : chr [1:173] "partial" "partial" "minus" "minus" ...
## $ Skeletal element : chr [1:173] "petrous" "petrous" "petrous" "petrous" ...
## $ Date (BCE/CE) : chr [1:173] "1900-1600 BCE" "1900-1600 BCE" "1900-1600 BCE" "1900-1600 BCE" ...
## $ Culture : chr [1:173] "Vietnam_Neolithic" "Vietnam_Neolithic" "Vietnam_Neolithic" "Vietnam_Neolithic" ...
## $ Location : chr [1:173] "Man Bac" "Man Bac" "Man Bac" "Man Bac" ...
## $ Country : chr [1:173] "Vietnam" "Vietnam" "Vietnam" "Vietnam" ...
## $ Extraction protocol (see key below) : chr [1:173] "[42, 43]" "[42, 43]" "[42, 43]" "[42, 43]" ...
## $ Library preparation (see key below) : num [1:173] 1 3 1 1 1 1 1 3 3 3 ...
## $ Equivalent mg bone powder used for library preparation: num [1:173] 26.33 8.33 13.17 13.17 13.17 ...
## $ Shotgun raw sequences : chr [1:173] "783749" "397903" "3316700" "3316700" ...
## $ Shotgun fraction raw sequences mapping to hg19 : chr [1:173] "4.1999999999999997E-3" "1.12E-2" "1.4E-3" "1.1000000000000001E-3" ...
## $ mtDNA sequences in alignment : chr [1:173] "942188" "1000352" "662593" "584222" ...
## $ mtDNA sequences aligning to target : chr [1:173] "95311" "6605" "1943" "1648" ...
## $ Aligned mtDNA sequences remaining after deduping : chr [1:173] "3754" "2403" "1288" "1088" ...
## $ mtDNA coverage : chr [1:173] "7.96" "4.8" "2.87" "2.4700000000000002" ...
## $ mtDNA median read length : chr [1:173] "35" "33" "37" "38" ...
## $ mtDNA fraction damaged in last base : chr [1:173] "0.30299999999999999" "0.32400000000000001" "0.61399999999999999" "0.66" ...
## $ mtDNA haplogroup : chr [1:173] "R" "B5" "H2a2a" "H2a2a1" ...
## $ mtDNA match fraction : chr [1:173] ".." "0.94899999999999995" "0.80100000000000005" "0.502" ...
## $ mtDNA match fraction 95% CI : chr [1:173] ".." "[0.899,0.977]" "[0.554,0.930]" "[0.349,0.659]" ...
## $ Perform 1240k capture? : chr [1:173] "Yes" "Yes" "Yes" "Yes" ...
## $ 1240k captured in pool? (number of libraries in pool) : chr [1:173] "No" "No" "Yes (4)" "Yes (4)" ...
## $ Used in analyses? : chr [1:173] "Yes" "Yes" "No^" "No^" ...
## Library ID
## "character"
## Lab Indiv. ID
## "character"
## Skeletal codes
## "character"
## UDG
## "character"
## Skeletal element
## "character"
## Date (BCE/CE)
## "character"
## Culture
## "character"
## Location
## "character"
## Country
## "character"
## Extraction protocol (see key below)
## "character"
## Library preparation (see key below)
## "numeric"
## Equivalent mg bone powder used for library preparation
## "numeric"
## Shotgun raw sequences
## "character"
## Shotgun fraction raw sequences mapping to hg19
## "character"
## mtDNA sequences in alignment
## "character"
## mtDNA sequences aligning to target
## "character"
## Aligned mtDNA sequences remaining after deduping
## "character"
## mtDNA coverage
## "character"
## mtDNA median read length
## "character"
## mtDNA fraction damaged in last base
## "character"
## mtDNA haplogroup
## "character"
## mtDNA match fraction
## "character"
## mtDNA match fraction 95% CI
## "character"
## Perform 1240k capture?
## "character"
## 1240k captured in pool? (number of libraries in pool)
## "character"
## Used in analyses?
## "character"
Part B
Represent the distribution - variables to compare: mtDNA coverage by sample’s culture
library(ggplot2)
library(dplyr)
data_clean <- data %>%
  dplyr::select(
    mtdna_coverage = `mtDNA coverage`,
    culture = Culture
  ) %>%
  mutate(mtdna_coverage = as.numeric(mtdna_coverage)) %>%
  na.omit()
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `mtdna_coverage = as.numeric(mtdna_coverage)`.
## Caused by warning:
## ! NAs introduced by coercion
ggplot(data_clean, aes(x = reorder(culture, mtdna_coverage, median),
                       y = mtdna_coverage,
                       fill = culture)) +
  # Bars show the median per culture; points show the individual libraries
  geom_bar(stat = "summary", fun = "median", alpha = 0.8) +
  geom_point(color = "black", size = 1, alpha = 0.5,
             position = position_jitter(width = 0.2)) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Median mtDNA Coverage by Culture",
    x = "Culture",
    y = "Median mtDNA Coverage (x)",
    fill = "Culture"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
We can see that the median mtDNA coverage is higher in the Vietnam_BA group than in the others, with Thailand_IA second. However, there are a few important caveats. First, the dispersion for Vietnam_BA is large, so we cannot be sure that this reflects a true trend. Second, some groups (Thailand_IA and Thailand_LN) have only one data point, so their values may not represent the culture accurately. The group with the most data points and moderate dispersion, Vietnam_Neolithic, sits somewhere in the middle, which raises the question of whether there would be any significant difference between groups if we had more data points for each group and accounted for the high variation. With the current data, it is very hard to say whether there is any trend at all.
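The coercion warning printed with the code above is presumably triggered by non-numeric placeholders such as the ".." that appears in other columns of the structure output. A minimal sketch of handling this explicitly, assuming ".." is the only placeholder in the column; the toy tibble and the helper name `to_numeric_clean` are made up for illustration:

```r
library(dplyr)

# Toy stand-in for the real table; the values are illustrative only.
toy <- tibble(
  Culture = c("Vietnam_Neolithic", "Vietnam_Neolithic", "Vietnam_BA", "Thailand_IA"),
  `mtDNA coverage` = c("7.96", "..", "4.8", "2.87")
)

# Hypothetical helper: turn the placeholder into NA first, then convert.
# Because the placeholder is already NA, as.numeric() raises no warning.
to_numeric_clean <- function(x, placeholder = "..") {
  as.numeric(na_if(x, placeholder))
}

toy_clean <- toy %>%
  transmute(
    culture = Culture,
    mtdna_coverage = to_numeric_clean(`mtDNA coverage`)
  ) %>%
  filter(!is.na(mtdna_coverage))

print(toy_clean)
```

If the real column contains other non-numeric entries, the warning would still appear, which is a useful signal that something besides ".." needs attention.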
Summarize the data: mtDNA coverage by culture
library(dplyr)
library(ggplot2)
summary_data <- data %>%
  dplyr::select(
    culture = Culture,
    coverage = `mtDNA coverage`
  ) %>%
  mutate(
    culture = as.factor(culture),
    coverage = as.numeric(coverage)
  ) %>%
  na.omit()
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `coverage = as.numeric(coverage)`.
## Caused by warning:
## ! NAs introduced by coercion
ggplot(summary_data, aes(x = culture, y = coverage, fill = culture)) +
  geom_boxplot(alpha = 0.75, outlier.color = "red", outlier.alpha = 0.7) +
  geom_jitter(width = 0.15, alpha = 0.35, size = 1, color = "black") +
  labs(
    title = "Distribution of mtDNA Coverage Across Cultures",
    x = "Culture",
    y = "mtDNA Coverage",
    fill = "Culture"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
The boxplot shows that mtDNA coverage is not uniformly distributed across cultures. Some groups have a lower median coverage and a much narrower interquartile range, whereas others display greater variability and more extreme observations. These differences may reflect variation in sample preservation, excavation conditions, or laboratory success during DNA recovery.
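The same comparison can be put into numbers with a per-culture summary table. The sketch below uses a toy tibble with made-up cultures "A", "B" and "C"; with the real data, `summary_data` from the code above could be used in its place, assuming it has the `culture` and `coverage` columns defined there.

```r
library(dplyr)

# Toy input with the same column names as summary_data; values are made up.
toy_summary_input <- tibble(
  culture  = c("A", "A", "A", "B", "C", "C"),
  coverage = c(2, 4, 9, 5, 1, 3)
)

# Group size, median and interquartile range per culture,
# sorted from highest to lowest median.
coverage_by_culture <- toy_summary_input %>%
  group_by(culture) %>%
  summarise(
    n = n(),
    median_coverage = median(coverage),
    iqr_coverage = IQR(coverage),
    .groups = "drop"
  ) %>%
  arrange(desc(median_coverage))

print(coverage_by_culture)
```

Reporting `n` next to the median makes it easy to spot groups with a single library, whose interquartile range is zero by construction.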
Choose three graphics that better describe your data:
GRAPH 1:
data_graph1 <- data %>%
  # Keep only Man Bac libraries, matching the plot title and the text below
  filter(Location == "Man Bac") %>%
  select(haplogroup = `mtDNA haplogroup`,
         damage = `mtDNA fraction damaged in last base`,
         coverage = `mtDNA coverage`,
         individual = `Lab Indiv. ID`) %>%
  mutate(damage = as.numeric(damage), coverage = as.numeric(coverage)) %>%
  filter(!is.na(haplogroup), haplogroup != "..", !is.na(damage), coverage > 0)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `damage = as.numeric(damage)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
ggplot(data_graph1, aes(x = haplogroup, y = damage)) +
  geom_point(aes(size = coverage), alpha = 0.7) +
  labs(title = "Genetic mixture in Neolithic Vietnam (Man Bac site)", x = "Haplogroup", y = "DNA damage") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Graph 1: Genetic mixture in Neolithic Vietnam (Man Bac site)
What this graph shows:
- The X axis shows different mitochondrial haplogroups (maternal lineages)
- The Y axis shows DNA damage (a marker that helps confirm the DNA is truly ancient)
- The size of each dot represents DNA coverage (quality of the data)
What it means: This graph includes all samples from Man Bac, a Neolithic site in Vietnam (~4000 years old). We can see different haplogroups (like M7b1a1, M13b, B5a1a), which suggests a mixture of populations. All samples have high DNA damage (>0.3), which is consistent with authentic ancient DNA rather than modern contamination. Larger dots represent higher-coverage libraries, which were more likely to be used in the final analysis.
According to Lipson et al. 2018, the Man Bac individuals represent the first genetic mixture between incoming farmers from southern China and local hunter-gatherers. The diversity of haplogroups in this graph is consistent with that mixture.
GRAPH 2:
data_graph2 <- data %>%
  select(individual = `Lab Indiv. ID`,
         culture = Culture,
         coverage = `mtDNA coverage`,
         used = `Used in analyses?`) %>%
  mutate(coverage = as.numeric(coverage),
         culture_simple = case_when(
           grepl("Vietnam_Neolithic", culture) ~ "Vietnam Neolithic\n(Man Bac)",
           grepl("Vietnam_BA", culture) ~ "Vietnam Bronze Age\n(Nui Nap)",
           grepl("Myanmar_LNBA", culture) ~ "Myanmar LNBA\n(Oakaie)",
           grepl("Thailand_BA", culture) ~ "Thailand Bronze Age\n(Ban Chiang)",
           grepl("Thailand_IA", culture) ~ "Thailand Iron Age\n(Ban Chiang)",
           grepl("Cambodia_IA", culture) ~ "Cambodia Iron Age\n(Vat Komnou)",
           TRUE ~ culture)) %>%
  filter(!is.na(coverage), coverage > 0)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `coverage = as.numeric(coverage)`.
## Caused by warning:
## ! NAs introduced by coercion
ggplot(data_graph2, aes(x = culture_simple, y = coverage)) +
  geom_jitter(aes(color = used), width = 0.2, height = 0, size = 3, alpha = 0.8) +
  scale_color_manual(values = c("Yes" = "#CC0033", "No" = "gray70", "No^" = "gray40")) +
  # Explicit axis and legend labels instead of the raw column names
  labs(title = "DNA coverage across Southeast Asian archaeological sites",
       x = "Culture (site)", y = "mtDNA coverage (x)", color = "Used in analyses?") +
  theme_minimal() +
  # Dashed reference line at 5x coverage
  geom_hline(yintercept = 5, linetype = "dashed", color = "gray50")
Graph 2: DNA coverage across Southeast Asian archaeological sites
What this graph shows:
- The X axis shows different archaeological sites and time periods
- The Y axis shows mtDNA coverage (higher numbers = better quality)
- The color of each dot indicates whether the library was used in the final analysis
- The dashed line marks 5x coverage, which is treated here as a reference quality threshold
What it means: This graph compares DNA quality across different sites and periods, using the dashed 5x line as a rough minimum for good-quality data.
What’s interesting is that many red dots (samples used in the final analysis) are below this line. This tells us something important: in ancient DNA studies from tropical regions like Southeast Asia, preservation is poor, so researchers often have to work with lower coverage data. They can’t afford to be too strict with quality thresholds, or they would have no samples left to analyze.
We can see that:
- Vietnam Bronze Age (Nui Nap) has the highest-coverage samples, including some well above the threshold
- Other sites have more variable coverage, with many used samples falling below 5x
- This helps explain why the study ended up with only 18 individuals out of 146 screened: it is very difficult to get good-quality DNA from hot and humid environments
The authors mention in the paper: “Because of poor preservation conditions in tropical environments, we observed both a low rate of conversion of screened samples to working data and also limited depth of coverage per sample.” This graph visually confirms that challenge: even the samples they used in the final analysis often have low coverage.
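The observation that many used libraries fall below the dashed line can be quantified directly. This is a minimal sketch on made-up values; with the real data, `data_graph2` from the code above could replace the toy tibble, assuming its `used` and `coverage` columns. The 5x cutoff is the same reference value drawn in the plot.

```r
library(dplyr)

# Toy stand-in for data_graph2; the values are illustrative only.
toy_cov <- tibble(
  used     = c("Yes", "Yes", "Yes", "No", "No^", "Yes"),
  coverage = c(7.96, 4.8, 1.2, 2.87, 2.47, 12.0)
)

threshold <- 5  # same reference line as in the plot

# Among libraries used in the analyses, how many fall below the threshold?
below_summary <- toy_cov %>%
  filter(used == "Yes") %>%
  summarise(
    n_used     = n(),
    n_below    = sum(coverage < threshold),
    frac_below = mean(coverage < threshold)
  )

print(below_summary)
```

Grouping by `culture_simple` before `summarise()` would give the same breakdown per site.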
GRAPH 3:
data_graph3 <- data %>%
  select(culture = Culture,
         match_fraction = `mtDNA match fraction`,
         individual = `Lab Indiv. ID`) %>%
  mutate(match_fraction = as.numeric(match_fraction)) %>%
  filter(!is.na(match_fraction), match_fraction > 0)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `match_fraction = as.numeric(match_fraction)`.
## Caused by warning:
## ! NAs introduced by coercion
ggplot(data_graph3, aes(x = culture, y = match_fraction)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(title = "Confidence in haplogroup assignment across cultures", x = "Culture", y = "mtDNA match fraction") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Graph 3: Confidence in haplogroup assignment across cultures
What this graph shows:
- X axis: Archaeological cultures
- Y axis: mtDNA match fraction (0-1, higher = more confidence)
- Each dot: One ancient DNA library
What it means: Most samples have high match fractions (>0.9), meaning haplogroup assignments are reliable.
Vietnam_Neolithic shows more variation and some lower values (0.8-0.9). One possible reading is that this relates to these individuals representing the first genetic mixture between farmers and hunter-gatherers. However, each individual carries a single maternal lineage regardless of mixed ancestry, so lower match fractions can also reflect low coverage or modern contamination in a library.
Later cultures (Vietnam_BA, Thailand) show more consistent high values, which may reflect more homogeneous populations after the second migration wave or simply cleaner libraries.
The variation in Neolithic samples is consistent with the paper’s finding of initial genetic mixture but does not demonstrate it on its own. High values in later periods indicate reliable data for studying subsequent migrations.