When building applications with large language models (LLMs), one of the biggest challenges is retrieving fresh and reliable information. Training data becomes outdated quickly, so developers turn to SERP APIs to keep their models relevant.
Developers often need to gather large amounts of data from multiple search engines. To pull results from Google, Bing, or Yahoo, they end up calling a separate API for each, merging the results, and then feeding them into their system. This extra aggregation step adds overhead and reduces efficiency.
That’s why we built the Universal Search API at Scrapingdog. One single request to this API pulls data from all major search engines. It’s the simplest way to add real-time, multi-engine search to your AI or data projects.
In this blog, we will build a small LLM-style model that collects real-time data from the web.
Why Use Universal Search API
- LLMs typically rely on static training data, but integrating the Universal Search API lets them pull in real-time information from live search results.
- You don’t have to integrate multiple APIs for scraping data from different search engines.
- Deduplication is critical when assembling datasets. The API removes repeated links for you, so duplicates don't clutter your results (the sketch below shows the manual cleanup you'd otherwise have to write).
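For context, here is the kind of link-level deduplication you would otherwise have to write yourself when merging results from several engines. This is only a sketch with hypothetical merged result dicts, not Scrapingdog's actual response format:

# Hypothetical merged results from several engines; each item has at least a "link"
merged_results = [
    {"link": "https://example.com/a", "title": "Result A"},
    {"link": "https://example.com/b", "title": "Result B"},
    {"link": "https://example.com/a", "title": "Result A (duplicate)"},
]

seen = set()
deduped = []
for item in merged_results:
    link = item.get("link")
    if link and link not in seen:
        seen.add(link)
        deduped.append(item)

print(f"{len(deduped)} unique results out of {len(merged_results)}")

With the Universal Search API, this pass is already done for you before the results reach your code.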
Prerequisites To Build an LLM Using Fresh Search Engine Data
- You must have Python installed on your machine. If it is not there, you can install it from here.
- Create a folder with any name you like. We will keep our Python file inside this folder.
- Create a Python file with any name you like. I am naming it llm.py.
- Install the requests library inside this folder. You can install it with the command pip install requests.
- Sign up for the free pack of Scrapingdog. You will get a generous 1,000 credits, which are enough for testing the service.
Building a Mini-LLM
We will build a Markov-chain-style mini model. We will fetch real-time data from Scrapingdog and feed it to the model so it can predict the next word from the current state.
Pulling data from Scrapingdog
import requests

API_KEY = "YOUR_API_KEY"
query = "russia ukraine war"
url = f"https://api.scrapingdog.com/search/?api_key={API_KEY}&query={query}"

response = requests.get(url)
data = response.json()

# Scrapingdog returns "organic_results" for this endpoint
results = data.get("organic_results", [])

# Extract snippets for the corpus
corpus = [item.get("snippet", "") for item in results if item.get("snippet")]

if not corpus:
    raise ValueError("No snippets found in API response.")

print(f"Got {len(corpus)} snippets from Scrapingdog API.")
Let me explain this code step-by-step.
- It uses the requests library to make HTTP calls.
- API_KEY stores your Scrapingdog API key, and query = "russia ukraine war" is the search term.
- The API endpoint is formatted with your key and query.
- requests.get(url) fetches the search results, and response.json() converts the reply into a Python dictionary.
- We then extract organic_results from the response (that's where the search results are stored).
- The list comprehension iterates over organic_results and collects non-empty "snippet" values into a list.
- Finally, we print the number of snippets retrieved from the Scrapingdog API.
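If the request fails or returns an unexpected payload, the snippet extraction above will silently produce an empty corpus. A slightly more defensive version of the same fetch (a sketch, assuming the same endpoint and key) might look like this:

import requests

API_KEY = "YOUR_API_KEY"
query = "russia ukraine war"
url = f"https://api.scrapingdog.com/search/?api_key={API_KEY}&query={query}"

response = requests.get(url, timeout=30)

# Fail fast on HTTP errors (bad key, rate limit, etc.)
response.raise_for_status()

data = response.json()
results = data.get("organic_results", [])
corpus = [item.get("snippet", "") for item in results if item.get("snippet")]

print(f"Got {len(corpus)} snippets.")
if corpus:
    # Preview the first snippet to confirm we got usable text
    print("First snippet:", corpus[0][:120], "...")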
Markov Chain
import random
from collections import defaultdict

def build_markov_model(corpus, n=2):
    """Builds a simple Markov chain model from a text corpus"""
    model = defaultdict(list)
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n):
            key = tuple(words[i:i+n])
            next_word = words[i+n]
            model[key].append(next_word)
    return model

def generate_text(model, length=60):
    """Generates text of the given length using the Markov chain model"""
    start = random.choice(list(model.keys()))
    output = list(start)
    for _ in range(length):
        state = tuple(output[-2:])
        next_words = model.get(state)
        if not next_words:
            break
        output.append(random.choice(next_words))
    return " ".join(output)

markov_model = build_markov_model(corpus)
This code builds a very simple text generator using a Markov chain. First, it takes your collected snippets (the corpus) and learns which words tend to follow which. It does this by looking at pairs of words (like “Russia Ukraine”) and recording what word usually comes next (for example, “war”). The result is a dictionary where each pair of words points to a list of possible continuations.
Then, when you want to generate new text, the program picks a random starting pair and keeps adding words by checking what words are likely to follow the last two. By repeating this process, it creates a new sequence of words that looks similar to the original snippets but isn’t copied directly. Essentially, it’s a toy model that mimics writing style and context based on word transitions from your data.
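To make that data structure concrete, here is a tiny illustration using a made-up one-sentence corpus (a sketch, not the live Scrapingdog data):

from collections import defaultdict

# Toy corpus: a single short sentence
toy_corpus = ["russia ukraine war enters a new phase as russia ukraine talks stall"]

model = defaultdict(list)
for text in toy_corpus:
    words = text.split()
    for i in range(len(words) - 2):
        model[tuple(words[i:i+2])].append(words[i + 2])

# The pair ("russia", "ukraine") appears twice, so it has two possible continuations
print(model[("russia", "ukraine")])   # ['war', 'talks']

When generating text, the model reaches this pair and picks one of those continuations at random, which is exactly how the longer outputs below are produced.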
Now let's check the output of our code. You can run it with the command python llm.py.
So, we successfully built a lightweight, LLM-style text generator without relying on multiple SERP APIs.
Complete Code
You can also use GPT in place of this Markov model to get more coherent output; a sketch of that swap appears after the complete code. For now, the full code looks like this.
import requests
import random
from collections import defaultdict

# ======================
# Step 1. Fetch Data
# ======================
API_KEY = "your-api-key"  # Replace with your Scrapingdog API key
query = "russia ukraine war"
url = f"https://api.scrapingdog.com/search/?api_key={API_KEY}&query={query}"

print(f"Fetching data for query: {query}...")
response = requests.get(url)
data = response.json()

# Scrapingdog returns "organic_results" for this endpoint
results = data.get("organic_results", [])

# Extract snippets for corpus
corpus = [item.get("snippet", "") for item in results if item.get("snippet")]

if not corpus:
    raise ValueError("No snippets found in API response.")

print(f"Got {len(corpus)} snippets from Scrapingdog API.")

# ======================
# Step 2. Build Mini Model
# ======================
def build_markov_model(corpus, n=2):
    """Builds a simple Markov chain model from text corpus"""
    model = defaultdict(list)
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n):
            key = tuple(words[i:i+n])
            next_word = words[i+n]
            model[key].append(next_word)
    return model

def generate_text(model, length=60):
    """Generates text of given length using the Markov chain model"""
    start = random.choice(list(model.keys()))
    output = list(start)
    for _ in range(length):
        state = tuple(output[-2:])
        next_words = model.get(state)
        if not next_words:
            break
        output.append(random.choice(next_words))
    return " ".join(output)

# ======================
# Step 3. Generate Output
# ======================
markov_model = build_markov_model(corpus)

print("\n--- Mini LLM Output ---\n")
print(generate_text(markov_model, 800))
print("\n-----------------------")
Conclusion
What we’ve built here is a fun, lightweight “mini-LLM” powered by live data from Scrapingdog’s Universal Search API. Instead of training on static datasets, this approach allows your model to generate fresh, topic-specific text using real search snippets from engines like Google, Bing, and Yahoo.
Of course, this isn’t a replacement for large-scale language models, but it shows how quickly developers can prototype search-aware applications without heavy infrastructure.
👉 Sign up for Scrapingdog and get free credits to start experimenting with your own mini-models today.