Genomics of the Brazilian Biodiversity

Brazilian genomic bank makes biodiversity DNA more accessible

Named GenRefBR, the repository is strategic for the sovereignty of national science. Data can open paths for more effective conservation and bioeconomy development actions.

Why is it important to tell this story?

Transforming the “alphabet soup” that forms the DNA of species into data poses enormous challenges, from sample collection to processing on supercomputers. This story details the steps involved and how the Brazilian Species Genomic Reference Database (GenRefBR) works.

Partnerships and collaborations

The Genomics of the Brazilian Biodiversity (GBB) project is led by the Vale Technological Institute and the Chico Mendes Institute for Biodiversity Conservation. Over 300 researchers from national and international institutions, such as the University of São Paulo, the Federal University of Pará, and the University of Oxford in the UK, are taking part, as are hundreds of organizations such as Ibama and Fiocruz.

Mata N'Ativa Glossary

We work to promote access to scientific knowledge. Be sure to check out the glossary at the end of the story to get a clearer understanding of each of the concepts covered in this text!

BRANDED CONTENT

19/11/2025 | By: João Paulo Vicente | Illustrations: Mateus Zanon

Composed of carbon, hydrogen, oxygen, nitrogen and phosphorus atoms, DNA stores all the genetic information of living beings. It encodes the instructions for the development, functioning and reproduction of each individual. Its fundamental components are four nitrogenous bases: adenine (A), thymine (T), cytosine (C) and guanine (G). Whether we're talking about an açaí tree, a jararaca snake or a human being, the only difference between them is the order and quantity of these bases.

For researchers, transforming this “soup of organic letters,” or DNA nucleotides, into structured and accessible data is a major step toward finding answers that will inform new biodiversity conservation strategies, make food production more efficient and even enable the development of new medicines.

But until recently, Brazilian scientists had to rely on genomic databases in the United States, Japan and Europe, which are not very “friendly” to foreigners. It is not even possible to identify which species are Brazilian. “We started looking and couldn’t do it, because many of these databases have the data, they have the name of the species, but they don’t ask about the origin of that individual,” explains Gisele Nunes, a researcher in environmental genomics in the field of bioinformatics at the Vale Technological Institute (ITV).

Faced with so many difficulties, Nunes began to think of ways to facilitate access to genomic data on Brazilian species. The idea gave rise to the Genomics of the Brazilian Biodiversity (GBB) project, a public-private partnership between ITV and the Chico Mendes Institute for Biodiversity Conservation (ICMBio). With an estimated investment of R$ 110 million by 2027, the consortium is bringing together hundreds of researchers to study the genetics of over 600 species of fauna and flora from all of the country's biomes. The information is shared free of charge through the Brazilian Species Genomic Reference Database (GenRefBR).

The database already contains data on vertebrates produced within the scope of the GBB, as well as information from foreign platforms, but now with the accessibility that was previously lacking. The species included are selected from the Biodiversity Extinction Risk Assessment System (SALVE), a platform managed by the ICMBio. “If you want to know about birds, amphibians, reptiles, how many have genomes? In fact, we have six biomes. If I want to know how many bird species have been registered in the Amazon, I can find out. How many of these are threatened to some degree?” explains Nunes.
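The questions Nunes lists amount to filtering a species table by taxonomic group, biome and threat status. A minimal sketch in Python of that idea follows. The records, field names and the `count_species` helper are all invented for illustration; none of this reflects the actual GenRefBR schema or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeciesRecord:
    name: str            # placeholder label, not a real species
    group: str           # e.g. "bird", "amphibian", "reptile"
    biomes: frozenset    # biomes where the species is recorded
    threatened: bool     # True if listed in any threat category
    has_genome: bool     # True if a genome is available


# Illustrative records only: every value below is made up for the example.
CATALOG = [
    SpeciesRecord("species_a", "bird", frozenset({"Amazon"}), True, True),
    SpeciesRecord("species_b", "bird", frozenset({"Amazon", "Cerrado"}), False, False),
    SpeciesRecord("species_c", "reptile", frozenset({"Caatinga"}), True, False),
    SpeciesRecord("species_d", "bird", frozenset({"Caatinga"}), True, True),
]


def count_species(catalog, group, biome, threatened_only=False, with_genome_only=False):
    """Count records matching a group and biome, with optional extra filters."""
    total = 0
    for record in catalog:
        if record.group != group or biome not in record.biomes:
            continue
        if threatened_only and not record.threatened:
            continue
        if with_genome_only and not record.has_genome:
            continue
        total += 1
    return total
```

With these placeholder records, `count_species(CATALOG, "bird", "Amazon")` counts the birds recorded in the Amazon, and adding `threatened_only=True` narrows the count to those under some degree of threat.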

The data is not directly stored in GenRefBR. Even the genomes produced by the GBB are first published in GenBank, the public database of the US National Center for Biotechnology Information (NCBI), the largest genomic database in the world. GenRefBR only makes them accessible after they have been published by GenBank. In 2026, the Brazilian database will begin to receive data on plants and, later, invertebrates.

Genetic and genomic data help answer fundamental questions about biodiversity. The Xenarthra lineage, for example, emerged in South America during the Cretaceous-Paleogene transition, about 66 million years ago, a period marked by the mass extinction of dinosaurs and the adaptive explosion of mammals.

The branches of the phylogenetic tree show hypotheses about how species are related. They often indicate possible speciation events, when an ancestral lineage gave rise to two or more descendant lineages.

The first speciation event originated the orders Pilosa (sloths and anteaters) and Cingulata (armadillos), establishing the main lineages of xenarthrans.

The diversification of the main families within the Pilosa and Cingulata orders occurred between the Eocene and the Oligocene, during the Paleogene period, which marks the beginning of the Cenozoic Era.

The order Cingulata (armadillos) underwent a significant differentiation about 36 million years ago, during the late Eocene. The suborders Folivora (sloths) and Vermilingua (anteaters), within the order Pilosa, diverged between 28 and 44 million years ago, between the Eocene and the Oligocene.

Starting in the Miocene (between 23 and 5 million years ago), several xenarthran lineages showed a significant increase in size, resulting in gigantic forms.

Currently, 38 living species of xenarthrans are recognized, most of which were already present in the Pleistocene. However, habitat loss, roadkill, forest fires, and climate change put these species at risk, threatening the continuation of an evolutionary history that began millions of years ago.

From the field to the database, from molecules to data

The GBB brings together researchers from more than 100 Brazilian and foreign institutions, such as universities, non-governmental organizations and museums. Through the ICMBio's National Research and Conservation Centers, priority species were listed under each of three approaches: conservation; invasive alien species, which pose a threat to native biodiversity; and species of interest to the bioeconomy.

First, the team defines the strategic approach for each species. The project plans to produce at least 80 reference genomes, each a complete, high-quality map of a species' genetic code; 1,000 population genomes, which analyze the genetic diversity of several individuals in a population; and 1,600 DNA barcodes, short, standardized DNA sequences that function as unique species identifiers.
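To make the barcode idea concrete, here is a minimal Python sketch of identification by comparison with a reference set. It assumes the sequences are already aligned and of equal length, uses toy ten-letter sequences with placeholder species names, and picks an arbitrary 97% identity threshold. Real barcode identification relies on sequence alignment tools and curated reference libraries.

```python
def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def identify(query: str, reference: dict, min_identity: float = 0.97):
    """Return (best_name, best_score); best_name is None below the threshold."""
    best_name, best_score = None, 0.0
    for name, sequence in reference.items():
        score = identity(query, sequence)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= min_identity:
        return best_name, best_score
    return None, best_score


# Toy reference library: names and sequences are invented for the example.
REFERENCE = {
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTTCGTAA",
}
```

Calling `identify("ACGTACGTAC", REFERENCE)` returns `species_a` with a perfect score, while a query that differs from every reference by more than the threshold allows comes back unidentified.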

The search for samples can take place in the field or among materials that were stored in institutions, waiting for an opportunity to be sequenced. In both cases, the sample is sent to the ITV genomics laboratory in Belém (Pará), where all of the associated data is recorded, such as taxonomic identification, collection site and other metadata.

DNA is extracted through a chemical process. If it passes quality control, the sample moves on to the genomic library preparation stage. “Basically, we're going to divide this DNA into thousands of parts and then link it to what we call tags, which are molecular labels. Then it goes to an instrument called a sequencer, which reads the DNA, finds the tags, and for each one, tells us the sequence of those four bases [AGTC],” explains Alexandre Aleixo, GBB coordinator at the ITV.

After sequencing comes the bioinformatics stage. Starting from the enormous text file, it pieces together a puzzle of billions of letters representing the bases in order to identify the different genes. It is a long process that requires significant computational resources, as well as professionals with expertise in computer science, because this is when the molecules are transformed into data. “Bioinformaticians will take this text file and begin to separate it: look, this is gene X, this is part of chromosome 1, this is part of chromosome 2,” says Aleixo.

Finally, bioinformaticians work on annotation to answer the question: which protein does each gene encode? And here we should pause to remember our biology lessons: the gene stores the code, but it is the protein that actually “puts the recipe into practice.” Only then is the DNA assembled, curated, and annotated.

The process begins with the collection of biological material, such as plants or animals, and environmental samples, like water and soil. For animals, sedation may be necessary, following ethical protocols, to ensure the organism's safety and sample integrity.

Each cell in this material carries the organism's complete DNA.

DNA is the key to all subsequent steps.

In the laboratory, the cells from the collected material are broken down using reagents and enzymes to release the DNA.

At this stage, proteins and other contaminants are removed, leaving the DNA clean and pure.

Now, the DNA is ready for analysis.

The next step is the preparation of the DNA library.

The extracted DNA is fragmented into smaller pieces.

Each fragment receives synthetic sequences called adapters, which serve as anchor points for the sequencing machines.

This step is essential to allow millions of fragments to be read at the same time, maximizing speed and precision.

In the sequencing phase, the prepared fragments are inserted into machines that read the order of the nitrogenous bases (A, T, C, G) in each piece.

Short-read equipment performs rapid readings of small segments.

Long-read equipment can decipher much larger sequences.

Sequencing machines generate text files in FASTQ format, which can reach up to 1 terabyte of data.

The DNA fragments tagged with adapters have now been digitized. All the DNA is encoded in four letters, representing the four nitrogenous bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
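For readers curious about what such a file looks like, a FASTQ record takes four lines: an identifier starting with `@`, the base sequence, a separator line starting with `+`, and one quality character per base. The Python sketch below reads records from an in-memory example and tallies the bases. It assumes unwrapped four-line records, and the reads in `EXAMPLE` are invented.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Two made-up reads, just to exercise the parser.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
base_counts = Counter()
for _, sequence, _ in records:
    base_counts.update(sequence)
```

Because the generator yields one record at a time, the same function could be pointed at an open file without loading everything into memory, which matters when files approach the sizes mentioned above.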

But since the DNA was fragmented to be read by the sequencing machines, it needs to be reassembled.

Bioinformatics software comes into play to process all this data and align the overlapping fragments.

The genome assembly process is like putting together a gigantic jigsaw puzzle.

Repeated regions and similar sequences pose the biggest challenge.
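The puzzle analogy can be sketched in a few lines of Python. The toy below repeatedly merges the two fragments with the longest suffix-to-prefix overlap, a simple greedy strategy chosen here purely for illustration. Production assemblers use far more sophisticated graph-based methods and must cope with sequencing errors; the reads and the minimum overlap of three bases are assumptions of the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair until no overlaps remain; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three invented reads that tile a 14-base toy sequence.
READS = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
assembled = greedy_assemble(READS)
```

Repeats are hard for exactly the reason this sketch hints at: when the same stretch occurs in several places, more than one pair of fragments overlaps equally well, and a greedy choice can join pieces that do not belong together.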

Once reconstructed, the genome is analyzed for genes, mutations, and specific patterns.

Computational tools identify genetic variations, define species, establish relationships, and even reveal signs of evolutionary adaptation, including genes related to disease resistance.

The genomes sequenced under the GBB are made available on GenBank, the world's largest genomic database, and on GenRefBR, a database compiling information on Brazilian species, currently focusing on vertebrates. The data is available for free to everyone.

Storage and processing

All these steps require significant financial investment. The infrastructure for High Performance Computing (HPC), so-called supercomputers, and other equipment for genomic data processing costs approximately R$ 11.4 million.

There is also the investment in specialized professionals. The ITV currently has 24 bioinformaticians. Many are biologists who left the labs and began to devote themselves to bioinformatics. After all, it is not enough to produce data. You need to have the knowledge to analyze and interpret it.

The equipment used for DNA sequencing depends on the objective. PacBio and Nanopore sequencers are intended for long reads. Much more expensive, they capture the long stretches of DNA needed to assemble a reference genome. “If we were to sequence and analyze 80 amphibians, we would need 250 terabytes of disk space. And we're only talking about one axis of the GBB, which is reference genomes. We aren't even talking about population genomes and environmental DNA (metabarcoding),” says Renato Oliveira, a bioinformatics researcher at the ITV. The Illumina sequencer, on the other hand, is used for short reads, that is, for population genomes, DNA barcodes and environmental DNA markers, because it generates a large volume of data, albeit in short fragments.

And here's an interesting fact: just as data storage is measured in bits, bytes, megabytes etc., DNA bases also have their own unit of measurement: base pairs (bp). “When we obtain a 3-gigabase genome, that file will be around 3 gigabytes,” explains Oliveira.
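Oliveira's rule of thumb follows from storing each base as one text character, which takes one byte. The Python sketch below works through that arithmetic, using decimal units (1 gigabyte taken as 10^9 bytes) for simplicity. For contrast, it also shows a common compact alternative that is not mentioned in the interview: packing each of the four bases into 2 bits.

```python
GIGA = 10**9  # decimal giga, an assumption made to keep the arithmetic simple


def plain_text_size_bytes(n_bases: int) -> int:
    """One ASCII character, hence one byte, per base."""
    return n_bases


def two_bit_size_bytes(n_bases: int) -> int:
    """Four bases per byte when A, C, G and T are each packed into 2 bits."""
    return (n_bases + 3) // 4  # round up to a whole byte


genome_bases = 3 * GIGA  # a 3-gigabase genome, as in the quote
plain_gb = plain_text_size_bytes(genome_bases) / GIGA   # 3.0
packed_gb = two_bit_size_bytes(genome_bases) / GIGA     # 0.75
```

The figure covers only the finished sequence. Raw sequencing output is much larger, since each region is read many times and every base carries a quality score.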

It is estimated that 120 to 250 terabytes are used for reference genomes alone. The storage capacity was one petabyte for the entire ITV, but the GBB alone has already consumed 400 terabytes. A new petabyte storage system has been purchased. The price? R$ 3 million.

In late 2024, a process began to migrate genome assembly to cloud processing. To give you an idea, the estimated cost of assembling reference genomes is around R$ 30,000 per month.

The entire investment should represent a leap forward for Brazilian science. “By generating genetic information for this large amount of species, we are generating resources for different types of research. Imagine: I just assembled the genome, but that other group [of scientists] will work on the genetic improvement of that plant so that it produces better,” says Gisele Nunes.

In addition to the fact that the Brazilian Species Genomic Reference Database (GenRefBR) is public and free, the team has developed software and pipelines—sequences of connected processes—that are open to anyone who wants to use them (https://genrefbr.itv.org/#tools). “Instead of running software by software, the pipeline can be installed on any machine and, with just one command line, you can generate all the processes automatically,” Nunes explains.
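The idea of a pipeline, several processing steps chained so that one call runs them all, can be sketched generically in Python. The step names and logic below are hypothetical stand-ins chosen for illustration and do not represent the tools published by the GenRefBR team.

```python
def drop_ambiguous_reads(state: dict) -> dict:
    """Keep only reads made up entirely of the letters A, C, G and T."""
    kept = [read for read in state["reads"] if set(read) <= set("ACGT")]
    return {**state, "reads": kept}


def summarize(state: dict) -> dict:
    """Record how many reads remain and how many bases they contain."""
    total = sum(len(read) for read in state["reads"])
    return {**state, "n_reads": len(state["reads"]), "total_bases": total}


# The pipeline is just an ordered list of steps (hypothetical example).
PIPELINE = [drop_ambiguous_reads, summarize]


def run_pipeline(reads, steps=PIPELINE) -> dict:
    """Feed the output of each step into the next, in order."""
    state = {"reads": list(reads)}
    for step in steps:
        state = step(state)
    return state


result = run_pipeline(["ACGT", "ACNT", "GG"])
```

Because each step takes and returns the same kind of object, steps can be added, removed or reordered without changing the single entry point, which is what makes a one-command run possible.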

From drawers to the sequencer

The figures, equipment and storage required explain why hundreds of studies have been stalled. Many of the institutions participating in the GBB already had large collections of species samples, which are useful for population or barcode analysis. This is the case of the Araripe manakin (Antilophia bokermanni), a small bird the size of a human hand, scientifically described in 1997. Considered of conservation interest, it is “critically endangered” according to the National List (MMA Ordinance No. 444/2014).

Endemic to the Araripe Mountains in southern Ceará, the species is estimated to have fewer than 1,000 individuals in the wild. "Its distribution occurs in a single line on the mountainside. The bird does not occur down below, at the base, or at the top of the mountain range, only on the slope. And it is an area where there is already fragmentation of the forests," notes Péricles Sena, a researcher who saw an Araripe manakin for the first time in 1998, back when he was an undergraduate student at the Federal University of Ceará (UFC).

Sena was one of the participants in an expedition specifically organized to study the species up close. “When I took the Araripe manakin out of the net, it felt like I had a ghost in my hands.”

Today, the researcher is a specialist in genetics and molecular biology. The laboratory where he works, linked to the Federal University of Pará (UFPA), holds a collection of about 150 samples of the Araripe manakin, including the holotype used to describe the species, captured in 1996. As there are fewer than 1,000 estimated individuals in the wild, the collection corresponds to over 15% of the population.

Throughout his career, Sena has followed the evolution of technologies. To give you an idea, the first studies using classic genetic markers were not able to differentiate the Araripe manakin from its ‘sister species’ (Antilophia galeata), found in the Cerrado.

In an article published in the journal Ornithological Applications in 2022, Sena and other researchers analyzed the bird species' low diversity and declining population. One of their conclusions was categorical: genomic studies are needed to develop conservation strategies. “We didn't want to look at just one piece, but at the entire genome of the species,” says the scientist.

Now the future has arrived: through the Genomics of the Brazilian Biodiversity project, the Araripe manakin will finally have a reference genome and a population genomics study of its own. In addition, through GenRefBR, other researchers will also have easier access to genetic information on this and other species.

The first research with classic genetic markers was not even able to differentiate the Araripe manakin, in the photo, from its 'sister species' (Antilophia galeata), the manakin found in the Cerrado. Now, the genomic sequencing of the species will allow for more precise conservation strategies to be defined. PHOTO: Rick Elis Simpson

Challenges in the field

Not all samples can be used to create a reference genome. Advanced high-coverage sequencing requires well-preserved material—blood, tissue from different organs, leaves, among others—kept at temperatures as low as -80ºC and transported in a specialized logistics chain. “It's a process that many researchers are unfamiliar with,” says Antonita Santana, a GBB research fellow responsible for sample management and curation.

When a species can only be found in hard-to-reach places, the work becomes more complex and costly. Santana recalls a challenging expedition to Abrolhos, an archipelago 70 kilometers off the southern coast of Bahia. “It was quite a saga,” she says.

The goal was to collect samples of a giant anemone (Condylactis gigantea) threatened with extinction in order to generate the species' reference genome in a project of interest to the Northeast National Center for Research and Conservation of Marine Biodiversity (CEPENE). In theory, these samples should be kept in a liquid nitrogen tank, which is difficult to get authorized for air transport. In practice, it was not even possible to find liquid nitrogen in Salvador to take to Porto Seguro, where the collection teams would depart from. The alternative was to bring two coolers holding 40 kilograms of dry ice, a material that sublimates very quickly, from a factory in São Paulo. By the time the boat returned to the mainland, there was hardly any ice left, just enough to meet the deadline for delivering the material to the carrier specializing in cold chain logistics for biological material.

Emanuel Neuhaus, a postdoctoral fellow at the GBB who participated in the expedition, says that the team took the opportunity to test four different means of preserving samples that do not depend so much on the cold chain. The results are not yet ready. “It's a pilot project to improve logistics for upcoming field trips,” he says.

Brazilian three-banded armadillo. Genomic study can reveal gene flow between populations of the Brazilian three-banded armadillo and identify possible isolations, guiding actions such as creating ecological corridors, restocking, ex situ conservation, and new protected areas. CREDIT: Felipe Peters

Genetics as a conservation tool

Another species eagerly awaiting the answers and new conservation approaches from the GBB is the Brazilian three-banded armadillo, which served as the inspiration for Fuleco. Remember him? The mascot for the 2014 World Cup in Brazil started as a suggestion from the Instituto Tamanduá, which is part of the GBB network of collaborators. Over ten years later, the species remains threatened with extinction, with populations scattered throughout the Caatinga.

Genomic research could change this reality. With the results from samples collected in different locations, it will be possible to understand the gene flow between populations of the Brazilian three-banded armadillo and whether some of them are isolated, a major risk for the species. This information can inform policies such as the creation of ecological corridors, restocking, conservation in captivity (called ex situ) and the creation of new protected areas, as happened in 2017 in the Poti River Canyon in Piauí, an important reserve for the three-banded armadillo.

The relevance of these efforts takes on another dimension when the Brazilian three-banded armadillo is seen as an umbrella species of the Caatinga: protecting it protects the biodiversity of the biome and benefits other, less charismatic animals. “A cute mammal like the armadillo is a lure for society to become aware of the need for preservation,” explains Flávia Miranda, CEO of the Instituto Tamanduá. “It's much easier than preserving a snake or an earthworm from the Caatinga.”

The GBB's efforts are guaranteed until 2027. After that, there are still no answers about the continuity of the studies or of the Brazilian Species Genomic Reference Database. GenRefBR alone has an estimated annual maintenance cost of R$ 144,000. “The idea is that updates will be automatic, but there always has to be someone behind it. Either we will have to outsource it or maybe it will become something government-run. Let's hope it becomes something broader, something that goes a little beyond the ITV and perhaps becomes something national,” suggests Gisele Nunes.

That's how science works: when answers are revealed, new questions arise. That's great! The most important thing is that, as with the GBB, Brazil continues to play a leading role in this story—overcoming challenges ranging from sample collection to data processing and availability, from genomic studies to the implementation of effective public policies. Bring on the next chapters!

GLOSSARY

DNA

Molecule present in the nucleus of each cell in which the genetic information of an organism is stored; it is considered the essence of heredity

Nitrogenous bases

Organic compounds that contain nitrogen atoms and are part of the structure of DNA and RNA. They are essential for the storage and transmission of genetic information

Nucleotides

The basic units of the DNA molecule. Each nucleotide is made up of three parts: a nitrogenous base (A, T, C, G), a sugar and a phosphate group

Invasive alien species

Species found outside their natural range; when they proliferate uncontrollably, they threaten biodiversity and ecosystem services

Reference genome

DNA sequence that serves as a model or standard to represent a species' complete genome, exemplifying its genetic organization; it is used for comparisons and analyses of genetic variations between individuals, populations and species

Population genome

Set of genetic variations within a population of organisms; the study of the population genome examines how DNA varies among individuals of the same species in different geographic locations, time periods or environmental conditions, allowing for the calculation of important parameters that indicate a species' risk of extinction

DNA barcodes

Short, standardized DNA segments that function as unique species identifiers; just as a barcode on products quickly identifies an item in the supermarket, DNA barcodes quickly identify a species by comparing its DNA segment with a database

Taxonomic identification

The process of recognizing, naming and classifying an organism; it allows us to know which group (species, genus, family, etc.) the individual belongs to within the biological classification system

Genomic library

Standardized set of DNA fragments from an organism prepared for sequencing and containing tags (molecular labels) attached to the ends, which allow sequencing orientation and identification of genetic information

Bioinformatics

Interdisciplinary field that combines biology, computing and mathematics in order to analyze, interpret and understand complex biological data

Gene

Section of genetic code that, when read or activated, has some effect on the cell, such as the production of a specific protein

Protein

Molecule essential for life formed by a chain of amino acids linked in sequence. It is responsible for almost all of the body's functions

Long reads

The term refers to long fragments of DNA sequenced at once, usually more than 10,000 base pairs (10 kb) in length and sometimes exceeding 100 kb

Short reads

The term refers to small fragments of DNA sequenced at once, usually between 50 and 300 base pairs (bp) in length

Terabyte

Unit of measurement used in computing to express the amount of data or storage capacity. One terabyte (TB) corresponds to 1,024 gigabytes (GB)

Metabarcoding

Technique that allows the simultaneous identification of multiple species from the DNA sequencing of a single bulk sample containing whole organisms or a single environmental sample, such as soil or water

Gigabyte

Unit of measurement used in computing to express amounts of data or storage capacity. One gigabyte (GB) corresponds to 1,024 megabytes (MB)

Genetic improvement

Set of techniques used to select and develop plants or animals with desired characteristics, such as higher productivity, disease resistance, better quality or adaptation to the environment

Endemic species

Species that occur exclusively in a particular geographical region

Holotype

Specimen used to describe a new species; a scientific reference

Genetic sequencing

Technique that uses biochemical processes to “read” the DNA of a living being

Gene flow

Movement of genes between different populations of the same species; it occurs when individuals migrate, reproduce in another group and pass on their genes, mixing genetic characteristics

Restocking

Reintroduction of individuals of a species into an area where its population has been reduced or gone extinct in order to restore ecological balance or biodiversity