SeqHub

Documentation / FAQs

SeqHub is a platform for scientists to discover, annotate, manage and share biological data.

General FAQs

What is SeqHub?

SeqHub is a scientific data platform for discovering, interpreting, and sharing genomic data. We provide the central infrastructure for sequences and functional metadata so that scientists can conduct their research more efficiently and effectively.

Who can use SeqHub?

If you're a scientist working with biological sequence data, SeqHub can help with your initial discovery process and with tracking work throughout your research. Please keep in mind that our database is composed of microbial genomes, so the platform works best with prokaryotic sequences.

What do scientists use SeqHub for?

Scientists use SeqHub for a wide variety of use cases throughout their research process. These are a few that users have mentioned:

  • Find enzymes with high potential in bioeconomy applications
  • Mine Biosynthetic Gene Clusters (BGCs) for designing retro-biosynthetic pathways
  • Characterize soil and gut microbiome functions

How is SeqHub different from other annotation tools?

SeqHub annotations can be retrieved faster and for more hypothetical proteins as compared to most other tools. Our annotation is based on embedding distances, not sequence alignment-based similarity. This means we can retrieve sequences that share similar structures or sequence contexts but cannot be aligned with high identity using the BLAST algorithm.

Is my data private and secure?

Yes. We always build with security in mind. Your sequences and data are private to you, owned by you, and are encrypted in transit and at rest. You may choose to make your data public or share it with your team. We do not use your data to train models.

How do I get started with SeqHub?

  1. Sign up here
  2. Search a protein sequence and explore results
  3. Upload a FASTA file for full genome annotation
  4. Ask the SeqHub Agent your follow-up questions
  5. Publish the dataset you wrote your last paper on to improve discoverability
  6. Join the community on Discord to stay up-to-date on product announcements, get support, and share features you'd like to see added.

SeqHub Search & Visualizations

How does SeqHub search work?

SeqHub is an embedding-based protein search tool, leveraging embeddings from a fine-tuned genomic language model (specifically, the gLM2 model). Given an input protein sequence, SeqHub embeds the sequence, and searches against a database of 85M pre-computed embeddings to retrieve similar sequences and their genomic contexts. Detailed methods can be found in our manuscript and blog post.
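The retrieval step described above can be pictured with a minimal sketch. It assumes cosine similarity as the ranking metric and uses a brute-force scan over a tiny in-memory dictionary; the sequence IDs, the 3-dimensional vectors, and the function names are all hypothetical, and the index SeqHub actually uses to search 85M embeddings is not described on this page.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, database, k=2):
    """Return the k (score, sequence ID) pairs most similar to the query.

    `database` maps a hypothetical sequence ID to its pre-computed embedding.
    """
    scored = (
        (cosine_similarity(query_embedding, embedding), seq_id)
        for seq_id, embedding in database.items()
    )
    return heapq.nlargest(k, scored)


# Toy 3-dimensional embeddings; real embeddings have many more dimensions.
database = {
    "protein_A": [1.0, 0.0, 0.0],
    "protein_B": [0.9, 0.1, 0.0],
    "protein_C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]

hits = top_k(query, database, k=2)
for score, seq_id in hits:
    print(f"{seq_id}\t{score:.3f}")
```

Because the database embeddings are pre-computed, only the query needs to be embedded at search time; ranking then reduces to comparing vectors.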

What sequence database does SeqHub search?

SeqHub searches the Open Genome database comprising more than 85 million protein clusters (at 90% sequence identity) from 131,744 microbial genomes.

Are my searches and data secure?

Yes. We always build with security in mind, and your data is encrypted in transit and at rest.

What data is your model trained on?

Our model (gLM2) is pretrained on the Open MetaGenome (OMG) dataset consisting of metagenomic sequences (totalling 3.3B proteins) across diverse environments. For additional details on how the model was fine-tuned for search, please refer to our paper.

Are you using my data to train your models?

No. Your private data and search history are never used to train our models.

Can I run a search for multiple protein sequences at once?

Yes, we support simultaneous search and annotation of multiple sequences. Annotate your dataset by submitting your file in "Upload Dataset", and refer to the FAQs in the SeqHub Annotations & Datasets section below.

Can I search a eukaryotic sequence?

You can search a eukaryotic sequence; however, the Open Genome database that we search against currently consists of proteins from microbial genomes. We will add eukaryotic genomes and proteins in upcoming platform updates.

How is SeqHub Search different from BLAST?

SeqHub search is based on embedding distances, not sequence alignment-based similarity. This means we can retrieve sequences that share similar structures or sequence contexts but cannot be aligned with high identity using the BLAST algorithm. See our paper to learn more about how our search compares to BLAST and other retrieval methods.

How can I group my search results?

You can drag-select a region of interest in the UMAP to create a group and then view search results specific to just that group. You can also add sequences to a group directly from the search results.

Can I download SeqHub Search results?

Yes! You can choose which results to download and in what form. Downloading all retrieved sequences provides a single FASTA file containing all retrieved proteins matching the input sequence. Downloading all retrieved contexts provides all retrieved sequences, and their sequence IDs can be used to identify contig boundaries.
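Once downloaded, the FASTA file can be read with a few lines of standard-library Python. This is a minimal sketch: `parse_fasta` is a hypothetical helper, the demo records are made up, and the header format of SeqHub downloads is not described here, so headers are returned unparsed.

```python
import io


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records standing in for a downloaded file.
demo = io.StringIO(">seq1 example header\nMKTAYIAK\nQRQISFVK\n>seq2\nMGSSHHHH\n")
records = list(parse_fasta(demo))
for header, sequence in records:
    print(header, len(sequence))
```

To read a real download, pass an open file handle instead of the `io.StringIO` object.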

How does Literature Search work?

We implemented an embedding search version of PaperBLAST. Please cite the PaperBLAST paper alongside SeqHub if you found this functionality useful for your research.

How are the structures generated?

Predicted structures are generated using ESMFold. Please note that recycling is disabled to increase prediction speed. Please cite the ESMFold paper alongside SeqHub if you found this functionality useful for your research.

How can I compare structures?

In your search results, click the target sequence you would like to compare your query structure against and then toggle to "Alignment".

How do you calculate the structural alignment?

We use TM-align to calculate the structural alignment. Please cite the TM-align paper alongside SeqHub if you found this functionality useful for your research.

How do I view synteny (conserved linkage of genes across genomes)?

Click the "Highlight Synteny" button.

How are genomic neighborhood proteins colored?

Protein coding genes are assigned colors by clustering all embeddings found in retrieved contexts. To provide visual synteny, colors are assigned to the most frequently occurring protein clusters across all retrievals.
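The frequency-based color assignment described above could look roughly like the sketch below. It assumes the embeddings have already been clustered and that each gene carries a cluster ID; the palette, the gray fallback color, and the function name are hypothetical and not taken from SeqHub.

```python
from collections import Counter

# Hypothetical palette; the colors SeqHub actually uses are not documented here.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c"]
DEFAULT_COLOR = "#cccccc"  # assumed fallback for less frequent clusters


def assign_cluster_colors(contexts, palette=PALETTE, default=DEFAULT_COLOR):
    """Map each cluster ID to a color, giving palette colors to the most
    frequently occurring clusters across all retrieved contexts.

    `contexts` is a list of genomic neighborhoods, each a list of cluster IDs
    (one per protein-coding gene).
    """
    counts = Counter(cluster for context in contexts for cluster in context)
    colors = {cluster: default for cluster in counts}
    for color, (cluster, _) in zip(palette, counts.most_common(len(palette))):
        colors[cluster] = color
    return colors


contexts = [
    ["c1", "c2", "c3", "c4"],
    ["c1", "c2", "c5"],
    ["c1", "c2", "c3"],
    ["c1"],
]
colors = assign_cluster_colors(contexts)
for cluster, color in sorted(colors.items()):
    print(cluster, color)
```

Reserving distinct colors for the most common clusters means that genes recurring across many neighborhoods share a color, which is what makes conserved gene order easy to spot by eye.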

What is the interactive scatter plot at the top of the page?

The retrieved SeqHub search results are displayed in a UMAP scatter plot. Each point represents a retrieved sequence, and you can click individual points to see protein structure and genomic context. By default, the UMAP is colored by cosine similarity to the query embedding; use the settings icon (⚙️) at the top right to color by taxonomic level or by a sequence similarity metric instead.

How is sequence similarity calculated?

Similarity is calculated using the cosine similarity in sequence embedding space.
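For reference, cosine similarity between two embedding vectors has the standard form below. The symbols are notation chosen here, not taken from the SeqHub paper: q is the query embedding and x is the embedding of a retrieved sequence.

```latex
\[
\operatorname{sim}(q, x) = \frac{q \cdot x}{\lVert q \rVert \, \lVert x \rVert}
\]
```

The value ranges from -1 to 1 and depends only on the angle between the two vectors, not on their lengths.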

SeqHub Annotations & Datasets

How can I annotate multiple protein sequences at once? What file types are supported?

Upload a .csv, .tsv, .fasta, .fas, .faa, .fa, .gb, .gbff, or .gbk file to annotate multiple proteins. For .csv or .tsv files, a header row is required, with IDs in the first column and sequences in the second. Any additional user-defined columns will be displayed alongside our annotations.
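As an illustration of the .csv layout described above, the sketch below builds a small file with a header row, IDs in the first column, sequences in the second, and one extra user-defined column. The column names, IDs, sequences, and values are made up; only the column order and the header requirement come from this page.

```python
import csv
import io

# Hypothetical records: ID, protein sequence, and one user-defined column.
records = [
    ("prot_001", "MKTAYIAKQR", 0.82),
    ("prot_002", "MGSSHHHHHH", 0.15),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# A header row is required; these column names are illustrative only.
writer.writerow(["id", "sequence", "activity"])
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file for upload, write to `open("my_proteins.csv", "w", newline="")` (a hypothetical filename) instead of the in-memory buffer.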

Can I upload genomic DNA as input?

Yes! Simply upload a FASTA file containing your genomic DNA assembly. We will handle gene calling, annotation, and visualization.

Is there a file size limitation on file upload?

Yes. We currently support 8 MB uploads but plan to increase this limit in the future.

How are SeqHub annotations generated?

SeqHub provides two sources of protein-level annotations. For the input sequence and retrieved matches, domain-level Pfam annotations are generated. For all proteins in the genomic context, functional annotations are generated using a CLIP-like model, which provides the text annotation of the closest Swiss-Prot representative. The matching Swiss-Prot entry for each protein can be reached using the 'Annotation' button for that protein. Details about the implementation of the CLIP-like model and annotation method can be found in the 'Functional annotation' subsection of the Methods in the Gaia paper.

How is the annotation confidence score calculated?

Annotation confidence is summarized as "high", "mid", or "low", determined by embedding similarity and sequence alignment quality:

  • High confidence: Has both high embedding similarity (≥90%) and significant sequence alignment (>20% identity and >20% coverage)
  • Mid (medium) confidence: Has high embedding similarity (≥90%) but lacks significant sequence alignment
  • Low confidence: Has embedding similarity below 90%

Embedding and sequence similarity values can be shown using the "Column Visibility" settings.
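The confidence rules listed above can be written as a small decision function. This is a sketch of the stated thresholds only: the function and argument names are hypothetical, and it assumes all three values are expressed as percentages from 0 to 100.

```python
def annotation_confidence(embedding_similarity, identity, coverage):
    """Classify an annotation as 'high', 'mid', or 'low'.

    All arguments are assumed to be percentages in the range 0 to 100.
    """
    if embedding_similarity < 90:
        return "low"  # embedding similarity below 90%
    if identity > 20 and coverage > 20:
        return "high"  # high embedding similarity and significant alignment
    return "mid"  # high embedding similarity, no significant alignment


examples = [
    (95.0, 35.0, 80.0),
    (95.0, 10.0, 80.0),
    (85.0, 60.0, 90.0),
]
for emb, ident, cov in examples:
    print(emb, ident, cov, annotation_confidence(emb, ident, cov))
```

Note that embedding similarity is checked first: an annotation with strong sequence alignment but embedding similarity below 90% is still classified as low confidence.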

How can I find the closest UniProt/Swiss-Prot annotation?

The UniProt/Swiss-Prot IDs are hyperlinked to the corresponding entries on the UniProt website.

How can I download the annotated datasets?

You can export the full data table as a .csv file by clicking "Download Table", or export a subset of the table by first selecting a set of rows.

How can I add additional columns?

You can create columns (with any type of data) from within the dataset viewer.

Can I upload experimental and other additional data? What are the formatting requirements?

Yes. You can upload .csv or .tsv data with additional data columns, as long as the first and second columns of the data map to IDs and sequences, respectively, and the file contains a header. Your additional columns will be displayed alongside our annotations.

Does SeqHub have genome visualization capabilities?

Yes! Upload a raw genomic sequence as a FASTA (.fa, .fna) file, or the output of Prodigal, to annotate and visualize your genome in SeqHub.

I identified a misannotation. What should I do?

Flag the misannotation in the #misannotation channel in our Discord.

Is there an API available?

Not yet; however, command-line tools are on our roadmap.

Public Datasets

What are the sources of these datasets?

Public datasets are sourced from SeqHub users who elect to deposit their data on the platform.

Can I make my data public?

Yes! To make your data public, and thus more discoverable, switch the visibility of your dataset to "Public".

What types of data can I make public?

We encourage you to publish all the sequence data and metadata generated during your research.

SeqHub AI Agent

How does SeqHub Agent work?

SeqHub Agent is an LLM that can reason about functions of proteins given the genomic neighborhood. Check out our blog post to learn more and stay tuned as we expand our Agent capabilities.

What can the SeqHub Agent do?

SeqHub Agent can work alongside human scientists to help summarize findings, answer follow-up questions on functional annotations of individual sequences or full genomes, help identify patterns across a dataset, and much more.

Is my data secure when I use the Agent?

Yes. We do not train models on user inputs.

Community

Where does SeqHub's community gather?

We're working and learning together on Discord. Come join us!

Why should I join SeqHub's Discord?

  • Shape the product: Early feedback directly impacts SeqHub's roadmap.
  • Stay in the loop: Be the first to hear about updates and events.
  • Get support: Quick answers from our team and community.
  • Elevate knowledge together: Every contribution to annotations—whether adding new ones, updating existing ones, or flagging misannotations—strengthens the collective understanding we all build on.
  • Collaborate: Share datasets, insights, and ideas with fellow Seqers.

What will I find inside SeqHub's Discord?

What is a Seqer?

A Seqer (pronounced "seek-er") is a scientist who uses SeqHub to analyze biological sequences and understand their functions. Seqers contribute to the platform by sharing their findings and annotations, helping build a more comprehensive knowledge base.

Pricing

How much does SeqHub cost?

Academic users can use SeqHub for free by signing up with their academic email. Commercial users can subscribe to a monthly plan. Use will be limited to 1 million annotations (excluding any data made public). Please book time with us if you anticipate exceeding that limit.

Do you offer a free plan or free trial?

Yes. The platform is available for free for academic use. For commercial use, there is a free trial.

Do you offer custom pricing?

Yes, please book time with us to discuss what you have in mind.

Support, Feedback, & Citation

How can I get more help navigating this platform?

Join our Discord server to ask anything about SeqHub in our #bugs-and-support channel.

I have feature requests. Where can I submit them?

Join our Discord server to submit any feature requests in our #ideas-and-features channel.

How should I cite SeqHub?

Please cite "Gaia: An AI-enabled genomic context–aware platform for protein sequence annotation. Sci. Adv. 11, eadv5109 (2025). DOI: 10.1126/sciadv.adv5109" where our search method was first described.

I love SeqHub. How can I be more involved?

Help us spread the word: invite your labmates and share your findings on social media (please tag us)! We are building a community of researchers so that scientific outputs can have greater impact.

About SeqHub

Why did you build SeqHub?

There is no central location for biologists to collectively annotate and relate genomic information with functional data. Check out our blog post on the state of data infrastructure in our space and why we must innovate to unlock the next breakthroughs in biology.

How is SeqHub related to Tatta Bio?

SeqHub is built by Tatta Bio, a scientific non-profit developing AI systems to improve the functional interpretability of genomic data.

How can I get in touch?

If you're a user seeking support or looking to share suggestions, join our Discord server to ask anything about SeqHub.

For other inquiries, email us at team@tatta.bio

Is SeqHub stable?

Yes. Our platform is built for stability and reliability:

  • Daily backups and redundant infrastructure to prevent data loss.
  • Battle-tested by researchers across academia, biotech, and government.
  • Long-term support backed by funding from institutions like Schmidt Futures, Moore Foundation, Activate, and DARPA.