GigaBrain Search

AI search over hundreds of billions of Reddit comments and YouTube discussions: the core engine behind every GigaBrain product.

Users ask a question and get TL;DR summaries threaded from real community answers, with citations back to source threads and moments in videos.

  • Indexed 100B+ documents across Reddit and YouTube at scale on a limited budget
  • Semantic + keyword retrieval with LLM summarization and source threading
  • Served millions of searches per month to 300K+ registered users
  • Powers the web app, browser extension, Pro, Ultra, API, and Shopping experiences

How it works

Search has a few moving parts: continuous ingestion from the Reddit API, several indexes built offline at different sizes and freshness levels, a Go API that runs a query across those indexes, and a scoring and ranking stack that orders the candidates and summarizes them. Cost is a constant constraint. Everything has to run over billions of documents on a startup budget, so much of the design comes down to keeping the expensive work small.

Ingestion, storage & indexing

Ingest. We continuously poll the Reddit API in batches for new submissions and comments at several lag values. Polling at short lags gives us low time-to-first-observation; polling the same content again later lets us pick up the score and moderation signals that only settle once a post has been up for a while. Submissions and comments are published to Pub/Sub topics, and a Dataflow writer consumes them and persists them to GCS, BigQuery, Bigtable, and Spanner.

Storage. Each store backs a different access pattern:

  • Bigtable: performant document store for full submission and comment lookups at request time.
  • Spanner: relational tables keyed by indexed integer IDs, plus the real-time full-text and embedding search indexes over recently created content.
  • BigQuery: large-scale analysis for global statistics, preprocessing, and assembling training sets for the ML models.

Archive index. A historical approximate-nearest-neighbour index over the full back-catalogue of Reddit submissions, spread across tens of shards, each hosting around a billion documents. Submissions are embedded with a lower-resource dual encoder into compact 128-dimensional vectors. A GKE job assembles hundreds of smaller shards in parallel and then merges them into the larger shards served from SSD, which keeps the index cheap to build and query at scale. Query serving runs as a GKE StatefulSet, with each pod mounting and answering for its own shards.
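
Querying a sharded index like this usually means asking every shard for its own top k and merging the answers. The sketch below shows that scatter-gather step only. The Shard interface, Hit type, and the brute-force memShard stand-in are hypothetical; the real shards serve an ANN index from SSD.

```go
package main

import (
	"sort"
	"sync"
)

// Hit is one retrieved document. Higher Score means a closer match.
type Hit struct {
	ID    int64
	Score float32
}

// Shard is a hypothetical interface for one index shard.
type Shard interface {
	Search(query []float32, k int) []Hit
}

// topK sorts hits by descending score and keeps at most k of them.
func topK(hits []Hit, k int) []Hit {
	sort.Slice(hits, func(a, b int) bool { return hits[a].Score > hits[b].Score })
	if len(hits) > k {
		hits = hits[:k]
	}
	return hits
}

// SearchShards queries every shard concurrently and merges the per-shard
// top-k lists into one global top-k.
func SearchShards(shards []Shard, query []float32, k int) []Hit {
	results := make([][]Hit, len(shards))
	var wg sync.WaitGroup
	for i, s := range shards {
		wg.Add(1)
		go func(i int, s Shard) {
			defer wg.Done()
			results[i] = s.Search(query, k) // each goroutine writes its own slot
		}(i, s)
	}
	wg.Wait()

	var merged []Hit
	for _, r := range results {
		merged = append(merged, r...)
	}
	return topK(merged, k)
}

// memShard is an exact, in-memory stand-in for a real ANN shard. It scores
// every vector by dot product, which is only practical for a demo.
type memShard struct {
	ids  []int64
	vecs [][]float32
}

func (m memShard) Search(query []float32, k int) []Hit {
	hits := make([]Hit, len(m.ids))
	for i, v := range m.vecs {
		var dot float32
		for j := range query {
			dot += query[j] * v[j]
		}
		hits[i] = Hit{ID: m.ids[i], Score: dot}
	}
	return topK(hits, k)
}
```

Because every shard returns at most k hits, the merge cost grows with the number of shards rather than with the size of the index.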

Archive data curation. We aggressively prune what goes into the archive. Simple filters drop stubs, NSFW, and "never-include" content, and a learned ValueModel estimates how valuable each submission is to include at all. Pruning low-value documents lets us trade index size against quality. A smaller index of high-value documents can be searched with more accurate algorithms inside the same latency and cost budget.
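
As a rough illustration of the pruning step, assuming the simple filters run first and the ValueModel score is then compared with a threshold. The field names, the length-based definition of a stub, and the threshold parameters are all assumptions made for this sketch.

```go
package main

// Submission carries only the fields this sketch needs. Value stands in for
// the ValueModel's output and is assumed to lie in [0, 1].
type Submission struct {
	ID           int64
	Title, Body  string
	NSFW         bool
	NeverInclude bool
	Value        float64
}

// Keep reports whether a submission should enter the archive index.
// minChars is a byte-length cutoff used here as a crude stub detector.
func Keep(s Submission, minChars int, minValue float64) bool {
	if s.NSFW || s.NeverInclude {
		return false
	}
	if len(s.Title)+len(s.Body) < minChars {
		return false // too short to be worth indexing
	}
	return s.Value >= minValue
}

// Prune returns the submissions that pass Keep, preserving input order.
func Prune(subs []Submission, minChars int, minValue float64) []Submission {
	kept := make([]Submission, 0, len(subs))
	for _, s := range subs {
		if Keep(s, minChars, minValue) {
			kept = append(kept, s)
		}
	}
	return kept
}
```

Raising minValue shrinks the index, which is the size-versus-quality dial described above.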

ValueModel training. The submission ValueModel is trained to predict the likelihood that a source will be cited in an LLM-generated answer. This teaches it to keep niche but excellent threads and drop high-score but useless posts, like jokes and drama, that don't help answer a question. A separate comment ValueModel gives a query-independent ranking of comments within a submission, so contentification can pick the best replies without another model call at query time.

Search embedding & ranking training. The retrieval embedding model is a dual encoder trained with a cross-entropy loss. Queries and submissions pass through the same transformer trunk, with a marker indicating which of the two an input is. The training labels come from the search itself. At query time we hit multiple retrievers and have the strongest available LLM rank all of the candidates and write a cited summary. Three signals then train the next generation of embedding and ranking models: that ranking, the sources the LLM actually cites, and later user feedback. Submissions are represented by their title, selftext, date, score, reply count, best comments, and some additional computed features, and the models are small enough to embed and score billions of submissions within our budget.

A single search request

The system as a whole has five stages: preprocessing/pruning, indexing, retrieval, scoring/ranking, and summarization. Preprocessing/pruning and indexing happen ahead of the request. At request time the work breaks down into seven steps:

  1. Query reformulation: the user's query goes to a fast LLM that produces complementary alternate queries.
  2. Retrieve: those queries fan out across the archive ANN index and the real-time full-text and embedding indexes, gathering a candidate pool. Retrieval can optionally follow up by querying the nearest neighbours of high-scoring candidates to pull in competitive peers.
  3. Contentify: candidates are streamed through Bigtable lookups to attach the full submission document and its offline-ranked top comments, building the inputs for scoring.
  4. Filter: ineligible candidates are dropped.
  5. Score: each candidate is scored independently against the original query by a custom ranking model, producing a scalar score; top comments contribute an aggregated score too.
  6. Rank: a smaller subset of the top-scoring candidates is handed to a more expensive LLM for final ordering.
  7. Summarize: a language model synthesizes the answer from the top results with a proprietary prompt, citing sources. The request, candidate set, ranked results, and summary are all persisted to feed the next round of training.

The search service (gb-search)

gb-search is a Go service deployed on Kubernetes, exposed over GraphQL (gqlgen) with a streaming streamSearch subscription and a plain /search REST endpoint. It is built around a SearchService interface so multiple backends can be registered behind one API, each a different combination of retrievers, scoring model, and ranking model. That makes it easy to run configurations against each other and trade off quality, cost, and latency.

A single search touches several sources and models: the archive index, the real-time indexes, document and comment lookups, the scoring model, and an LLM for ranking and summarization. Running that in series would be too slow, so the service leans on Go's concurrency primitives to overlap the work. Stages are wired together as goroutines and channels so candidates stream through retrieval, contentification, filtering, and scoring at the same time rather than one batch after another. We also hedge requests across sources and model calls, firing them in parallel and taking results as they arrive, which hides the latency of any single slow source or backend.

Agentic searcher for Pro

For Pro users we run an agentic searcher on top of the same capabilities. Instead of a fixed pipeline, an LLM drives the search with the sources and models exposed to it as tools, deciding which indexes to query, reformulating and issuing follow-up searches, and judging when it has gathered enough to answer. That lets it break a hard question into sub-searches and do deeper research than a single pass could.

Continual improvement

A theme across all of this is using a strong LLM to supervise the cheaper models. Given the retrieved candidates, a frontier LLM cites the specific discussions and comments that best answer a question, and those citations make a useful relevance signal. The signal is close to the implicit feedback that larger systems collect from user sessions, but we can get it without that scale of traffic. We log the candidates and the answer on every search and feed them back into the retrieval embeddings, the scoring model, and the value models, so the larger model's judgement about what is actually useful ends up in the small models that have to run over billions of documents.
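
Turning one logged search into training examples can be as simple as labelling every candidate by whether the LLM cited it. The SearchLog and Example types below are hypothetical, and real training data would also carry the ranking and any user feedback.

```go
package main

// SearchLog is the slice of a logged request that this sketch needs.
type SearchLog struct {
	Query      string
	Candidates []int64 // every document shown to the LLM
	Cited      []int64 // the subset the LLM cited in its answer
}

// Example is one (query, document) pair with a binary relevance label.
type Example struct {
	Query string
	DocID int64
	Cited bool
}

// Examples emits one labelled example per candidate, in candidate order.
// Uncited candidates become negatives for the same query.
func Examples(l SearchLog) []Example {
	cited := make(map[int64]bool, len(l.Cited))
	for _, id := range l.Cited {
		cited[id] = true
	}
	out := make([]Example, 0, len(l.Candidates))
	for _, id := range l.Candidates {
		out = append(out, Example{Query: l.Query, DocID: id, Cited: cited[id]})
	}
	return out
}
```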