From Reports to Knowledge: Build a Queryable RDF Knowledge Graph

Written by Ondřej Macek

Turn a single PDF into an RDF knowledge graph you can query with SPARQL, using a pipeline that leaves a clear paper trail at every stage.

Most teams have plenty of documents (reports, policies, contracts, research papers) and very little time to keep re-reading them. PDFs are great for distribution, but they are not great for searching across concepts, linking facts, or answering questions like "Who worked with whom?" or "What organizations show up most often?"

This tutorial walks through a practical pipeline that takes one PDF and produces:

  • Clean text and sentence-level inputs for NLP
  • RDF/Turtle files for entities and relation triples
  • A Fuseki dataset you can query via SPARQL
  • An optional draft ontology scaffold you can refine in Protege

Everything is modular and inspectable. Each step writes concrete outputs (text files, TSV/CSV, Turtle graphs), so you can validate what the models produced and adjust as needed.

Pipeline overview

The core flow looks like this:

PDF -> Clean text -> Split into sentences -> Coreference resolution
    -> Entity extraction (NER) -> Relation extraction (REBEL)
    -> Clean and deduplicate triples -> Load into Fuseki -> Query with SPARQL

Optional (but useful): generate a first-pass ontology draft from the predicates you actually observed in your triples.
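
Every step below is invoked through pipeline/run_pipeline.py using flags like --only-step, --start-from, --max-sentences, and --device. The real runner lives in the repo; the dispatch idea behind it looks roughly like this sketch (the step functions here are stand-ins):

# Simplified, hypothetical sketch of the step dispatch in run_pipeline.py
import argparse

def make_step(n: int):
    # Stand-in for the real step modules (pipeline/01_prepare_text.py, ...)
    def step(args):
        print(f"running step {n} (max_sentences={args.max_sentences}, device={args.device})")
    return step

STEPS = {n: make_step(n) for n in range(1, 9)}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--only-step", type=int, default=None)
    parser.add_argument("--start-from", type=int, default=1)
    parser.add_argument("--max-sentences", type=int, default=None)
    parser.add_argument("--device", default="cpu")
    args = parser.parse_args()

    # --only-step runs a single step; --start-from resumes from step N onward
    selected = [args.only_step] if args.only_step else [n for n in STEPS if n >= args.start_from]
    for n in selected:
        STEPS[n](args)

if __name__ == "__main__":
    main()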

Prerequisites

System requirements

  • Python 3.10 or 3.11
  • uv 0.4+ (virtualenv and dependency management)
  • Docker 24+ (for Fuseki)
  • Make (optional, but convenient)

Dependencies live in pyproject.toml and uv.lock and are installed via uv.
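
The exact pins live in the repo's pyproject.toml and uv.lock; stripped down to what this tutorial actually imports (plus torch, which the models need), the dependency block looks roughly like this (project name and version bounds are illustrative):

# Illustrative excerpt of pyproject.toml
[project]
name = "pdf-to-knowledge-graph"
requires-python = ">=3.10,<3.12"
dependencies = [
    "pdfplumber",
    "nltk",
    "fastcoref",
    "transformers",
    "torch",
    "rdflib",
    "pandas",
    "requests",
]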

Installation

# Install uv (skip if already installed)
curl -Ls https://astral.sh/uv/install.sh | sh

# Install dependencies; uv creates and manages .venv/
uv sync

# Optional: install the project itself (and dev extras if you want linting/testing)
uv pip install -e .
# uv pip install -e ".[dev]"

# Download model weights once (FastCoref, Transformers, REBEL)
uv run python pipeline/download_models.py

If you have a Makefile, you can use:

make setup        # uv sync + model download
make install-dev  # install with developer tooling

Fuseki runs in Docker. You can start it now, or let your loader step handle it (depending on how your repo is set up):

make fuseki-start  # starts stain/jena-fuseki on :3030 (admin/admin)
make fuseki-stop
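
If you are not using the Makefile, the target is roughly equivalent to running the image directly (the ADMIN_PASSWORD value here mirrors the admin/admin default above):

docker run --rm -d --name fuseki -p 3030:3030 -e ADMIN_PASSWORD=admin stain/jena-fuseki

You still need a dataset to upload into; create one in the Fuseki UI at http://localhost:3030, or let the loader step or Make target handle it, depending on how your repo is set up.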

Step-by-step pipeline

Step 0: Add your input PDF

Place the PDF you want to process at data/input/source.pdf.

For a first run, short and clean PDFs work best. A simple biography exported to PDF (for example, Einstein or Curie) is a good test case.

Step 1: PDF to clean text

This step extracts text from the PDF and removes common junk that breaks NLP downstream:

  • Page numbers, headers, footers (as much as possible)
  • Hyphenated line breaks ("-\n" -> "")
  • Extra whitespace
  • Optional: Wikipedia-style reference sections, bracket citations like [12], and boilerplate

You can get better structure with tools like GROBID or Apache Tika, and you may need OCR (for example, Tesseract) for scanned PDFs.

# Script: pipeline/01_prepare_text.py
import re
import pdfplumber
from pathlib import Path

WIKIPEDIA_SECTIONS = [
    r"\bReferences\b",
    r"\bExternal\s+links\b",
    r"\bSee\s+also\b",
    r"\bFurther\s+reading\b",
]

def clean_wikipedia_text(text: str) -> str:
    # Trim trailing sections that mostly contain bibliographies and footers
    earliest = min(
        (
            match.start()
            for marker in WIKIPEDIA_SECTIONS
            if (match := re.search(marker, text, flags=re.IGNORECASE))
        ),
        default=len(text),
    )
    text = text[:earliest]

    # Remove citation brackets, URLs, and page artifacts
    text = re.sub(r"\[\d+\]", "", text)  # [12]
    text = re.sub(r"https?://[^\s)]+", "", text)
    text = text.replace("-\n", "").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()

def extract_pdf_text(pdf_path: Path) -> str:
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    return clean_wikipedia_text(text)
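
To run this step standalone (the pipeline runner normally does the wiring), the script just needs to read and write the paths used throughout this post:

if __name__ == "__main__":
    text = extract_pdf_text(Path("data/input/source.pdf"))
    out_path = Path("data/intermediate/source.txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")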

Run:

uv run python pipeline/run_pipeline.py --only-step 1

Output:

  • data/intermediate/source.txt

Step 2: Clean text to sentences

Most NLP components behave better when you feed them one sentence at a time. This step splits the cleaned text into one sentence per line using NLTK's Punkt tokenizer.

You can swap this for spaCy or Stanza if your document style is tricky (lots of abbreviations, tables, bullet fragments, and so on).

# Script: pipeline/02_split_sentences.py
import re
import nltk
from nltk.tokenize import sent_tokenize

def clean_sentence(sentence: str) -> str:
    sentence = re.sub(r"\s+\d+/\d+\s+", " ", sentence)
    words = []
    previous = None
    for word in sentence.split():
        if word.lower() != previous:
            words.append(word)
        previous = word.lower()
    return " ".join(words).strip()

def filter_sentence(sentence: str) -> bool:
    if len(sentence.split()) < 5:
        return False
    if any(k in sentence.lower() for k in ("retrieved", "doi", "external links")):
        return False
    return True

def tokenize_sentences(text: str) -> list[str]:
    nltk.download("punkt", quiet=True)
    sentences = sent_tokenize(text)
    cleaned = [clean_sentence(s) for s in sentences]
    return [s for s in cleaned if filter_sentence(s)]
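
Wired up the same way as Step 1 (standalone sketch; the runner handles this in the repo):

if __name__ == "__main__":
    from pathlib import Path

    text = Path("data/intermediate/source.txt").read_text(encoding="utf-8")
    sentences = tokenize_sentences(text)
    Path("data/intermediate/sentences.txt").write_text("\n".join(sentences) + "\n", encoding="utf-8")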

Run:

uv run python pipeline/run_pipeline.py --only-step 2

Output:

  • data/intermediate/sentences.txt (one sentence per line)

Step 3: Coreference resolution

Coreference resolution replaces pronouns and repeated mentions with their referents, so later steps attach facts to the right entity.

Example:

  • Before: "Marie Curie discovered polonium. She won two Nobel Prizes."
  • After: "Marie Curie discovered polonium. Marie Curie won two Nobel Prizes."

# Script: pipeline/03_coreference_resolution.py
import re
import nltk
from fastcoref import FCoref
from nltk.tokenize import sent_tokenize

PRONOUNS = {"he","she","it","they","his","her","its","their","him","them"}

def resolve_coreferences(source_text: str, device: str = "auto") -> list[str]:
    nltk.download("punkt", quiet=True)

    model = FCoref(device=device)
    # FastCoref expects a list of texts; take the first (and only) result
    result = model.predict(texts=[source_text], is_split_into_words=False)[0]

    resolved_text = source_text
    for cluster in result.get_clusters():
        mentions = [m for m in cluster if m.lower() not in PRONOUNS]
        if not mentions:
            continue

        main = max(mentions, key=len)
        for pronoun in set(cluster) - set(mentions):
            resolved_text = re.sub(r"\b" + re.escape(pronoun) + r"\b", main, resolved_text)

    return sent_tokenize(resolved_text)

Run:

uv run python pipeline/run_pipeline.py --only-step 3 --device cpu

Output:

  • data/intermediate/resolved_sentences.txt

Note: Coreference is never perfect. Treat it as a quality boost, then verify on a few examples before trusting it at scale.

Step 4: Sentences to entities (NER)

Now we extract named entities (people, places, organizations, dates, and so on) using a Hugging Face NER model.

One important detail: entity URIs should be stable across the pipeline. If NER creates entity:entity_42_1 while relation extraction creates entity:Albert_Einstein, you end up with two disconnected graphs. The snippet below uses a simple "slug" based on entity text so both steps can share identifiers.

# Script: pipeline/04_sentences_to_entities.py
import re
from transformers import pipeline
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

def slug(text: str) -> str:
    text = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
    text = re.sub(r"_+", "_", text).strip("_")
    return text or "Unknown"

def extract_entities(sentences, model_name, aggregation_strategy, namespaces):
    ner = pipeline(
        "ner",
        model=model_name,
        tokenizer=model_name,
        aggregation_strategy=aggregation_strategy,
    )

    rdf_graph = Graph()
    ENTITY = Namespace(namespaces["entity"])
    ONTO = Namespace(namespaces["onto"])
    DOC = Namespace(namespaces["doc"])
    rdf_graph.bind("entity", ENTITY)
    rdf_graph.bind("onto", ONTO)
    rdf_graph.bind("doc", DOC)

    entity_records = []

    for i, sentence in enumerate(sentences, start=1):
        ents = ner(sentence)

        sentence_uri = DOC[f"sentence_{i}"]
        rdf_graph.add((sentence_uri, RDF.type, ONTO.Sentence))
        rdf_graph.add((sentence_uri, ONTO.text, Literal(sentence)))
        rdf_graph.add((sentence_uri, ONTO.sentenceId, Literal(i, datatype=XSD.integer)))

        for e in ents:
            text = (e.get("word") or "").strip()
            conf = e.get("score")
            ent_type = e.get("entity_group")

            if len(text) <= 1 or conf is None:
                continue

            entity_uri = ENTITY[slug(text)]

            # Create the entity node once, then keep linking it to sentences
            rdf_graph.add((entity_uri, RDF.type, ONTO.Entity))
            rdf_graph.add((entity_uri, ONTO.text, Literal(text)))

            if (entity_uri, ONTO.entityType, None) not in rdf_graph:
                rdf_graph.add((entity_uri, ONTO.entityType, Literal(ent_type)))

            # Keep the best confidence seen for this entity label
            existing = list(rdf_graph.objects(entity_uri, ONTO.confidence))
            if existing:
                old = float(existing[0])
                if float(conf) > old:
                    rdf_graph.set((entity_uri, ONTO.confidence, Literal(float(conf), datatype=XSD.float)))
            else:
                rdf_graph.add((entity_uri, ONTO.confidence, Literal(float(conf), datatype=XSD.float)))

            rdf_graph.add((entity_uri, ONTO.foundInSentence, sentence_uri))

            entity_records.append({
                "sentence_id": i,
                "entity_text": text,
                "entity_uri": str(entity_uri),
                "entity_type": ent_type,
                "confidence": float(conf),
                "start_pos": e.get("start"),
                "end_pos": e.get("end"),
                "sentence": sentence,
            })

    return entity_records, rdf_graph
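
The namespaces argument is what keeps the URIs produced here aligned with the SPARQL prefixes used later in this post. A standalone sketch of the call (the NER model name is an example; your config may pin a different one):

if __name__ == "__main__":
    from pathlib import Path

    sentences = Path("data/intermediate/resolved_sentences.txt").read_text(encoding="utf-8").splitlines()
    namespaces = {
        "entity": "http://example.org/entity/",
        "onto": "http://example.org/ontology/",
        "doc": "http://example.org/document/",
    }
    records, graph = extract_entities(
        sentences,
        model_name="dslim/bert-base-NER",  # example Hugging Face NER model
        aggregation_strategy="simple",
        namespaces=namespaces,
    )
    graph.serialize("data/output/entities.ttl", format="turtle")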

Run:

uv run python pipeline/run_pipeline.py --only-step 4 --max-sentences 500

Outputs:

  • data/output/entities.ttl

Step 5: Extract relation triples (REBEL)

Next we extract subject-predicate-object triples with REBEL. The model emits a tagged format that you parse into triples.

As with NER, use the same URI normalization for subjects and objects so your relation edges connect to the entity nodes you already created.

# Script: pipeline/05_extract_triplets.py
import re
from transformers import pipeline

def slug(text: str) -> str:
    text = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
    text = re.sub(r"_+", "_", text).strip("_")
    return text or "Unknown"

def extract_triplets_from_text(generated_text: str):
    triplets = []
    text = (
        generated_text.replace("<s>", "")
        .replace("</s>", "")
        .replace("<pad>", "")
        .strip()
    )
    if "<triplet>" not in text:
        return triplets

    subject = relation = obj = ""
    current = None

    for token in text.split():
        if token == "<triplet>":
            if subject and relation and obj:
                triplets.append((subject.strip(), relation.strip(), obj.strip()))
            subject = relation = obj = ""
            current = "subj"
        elif token == "<subj>":
            current = "rel"
        elif token == "<obj>":
            current = "obj"
        else:
            if current == "subj":
                subject += (" " if subject else "") + token
            elif current == "rel":
                relation += (" " if relation else "") + token
            elif current == "obj":
                obj += (" " if obj else "") + token

    if subject and relation and obj:
        triplets.append((subject.strip(), relation.strip(), obj.strip()))

    return triplets

def extract_triplets(sentences, model_name="Babelscape/rebel-large", device=-1):
    gen = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device=device)

    results = []
    for i, sentence in enumerate(sentences, start=1):
        # Keep REBEL's <triplet>/<subj>/<obj> markers: the pipeline's default text
        # output strips special tokens, so decode the generated ids manually.
        token_ids = gen(sentence, max_length=256, num_beams=2, return_tensors=True)[0]["generated_token_ids"]
        output = gen.tokenizer.decode(token_ids, skip_special_tokens=False)
        for s, p, o in extract_triplets_from_text(output):
            if len(s) > 1 and len(p) > 2 and len(o) > 1:
                results.append({
                    "sentence_id": i,
                    "subject": slug(s),
                    "predicate": slug(p),
                    "object": slug(o),
                    "sentence": sentence,
                    "extraction_method": "rebel",
                })
    return results
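
extract_triplets returns plain dicts; to get triplets.ttl they still have to become RDF, reusing the entity: namespace from Step 4 for subjects and objects and a rel: namespace for predicates. A minimal sketch:

if __name__ == "__main__":
    from pathlib import Path
    from rdflib import Graph, Namespace

    ENTITY = Namespace("http://example.org/entity/")
    REL = Namespace("http://example.org/relation/")

    sentences = Path("data/intermediate/resolved_sentences.txt").read_text(encoding="utf-8").splitlines()
    graph = Graph()
    graph.bind("entity", ENTITY)
    graph.bind("rel", REL)
    for t in extract_triplets(sentences):
        # subject/object slugs match Step 4's entity URIs; predicates live under rel:
        graph.add((ENTITY[t["subject"]], REL[t["predicate"]], ENTITY[t["object"]]))
    graph.serialize("data/output/triplets.ttl", format="turtle")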

Run:

uv run python pipeline/run_pipeline.py --only-step 5 --max-sentences 300

Output:

  • data/output/triplets.ttl

Tip: REBEL can be slow on CPU. Iterate with a small --max-sentences, then scale up once you are happy with cleaning and normalization.

Step 6: Clean and deduplicate triples

Even with normalization, you usually want to drop duplicates and filter out junk predicates. This step reads the Turtle graph, converts it to a tabular form, applies cleanup rules, and writes a clean Turtle file.

# Script: pipeline/06_clean_triplets.py
import pandas as pd
from rdflib import Graph
from config.settings import get_pipeline_paths

def load_triplets(ttl_path):
    graph = Graph()
    graph.parse(str(ttl_path), format="turtle")

    rows = []
    for s, p, o in graph:
        rows.append({
            "subject": str(s).split("/")[-1].replace("_", " "),
            "predicate": str(p).split("/")[-1].replace("_", " "),
            "object": str(o).split("/")[-1].replace("_", " "),
        })
    return pd.DataFrame(rows)

paths = get_pipeline_paths()
df = load_triplets(paths["triplets_turtle"])

df = df[df["predicate"].notna() & (df["predicate"].str.len() > 1)]
df = df.drop_duplicates(subset=["subject", "predicate", "object"], keep="first")
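
The snippet stops at the cleaned DataFrame; writing data/output/triplets_clean.ttl means putting the rows back into a graph using the same slug convention as the earlier steps. A continuation sketch:

import re
from rdflib import Namespace

ENTITY = Namespace("http://example.org/entity/")
REL = Namespace("http://example.org/relation/")

def slug(text: str) -> str:
    text = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
    return re.sub(r"_+", "_", text).strip("_") or "Unknown"

clean_graph = Graph()
clean_graph.bind("entity", ENTITY)
clean_graph.bind("rel", REL)
for s, p, o in df[["subject", "predicate", "object"]].itertuples(index=False):
    clean_graph.add((ENTITY[slug(s)], REL[slug(p)], ENTITY[slug(o)]))

clean_graph.serialize("data/output/triplets_clean.ttl", format="turtle")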

Run:

uv run python pipeline/run_pipeline.py --only-step 6

Output:

  • data/output/triplets_clean.ttl

Step 7: Load to graph DB (Apache Jena Fuseki)

Fuseki gives you a SPARQL endpoint on top of your RDF data.

A practical note: you usually want both entity data (entities.ttl) and relation triples (triplets_clean.ttl) in the dataset. The simplest approach is to merge them into one Turtle file and upload that.

If you do not want to modify the loader, a quick merge often works:

cat data/output/entities.ttl data/output/triplets_clean.ttl > data/output/graph.ttl
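
If you would rather stay in Python (and avoid any prefix clashes between the two files), rdflib does the same merge:

# Equivalent merge with rdflib
from rdflib import Graph

merged = Graph()
merged.parse("data/output/entities.ttl", format="turtle")
merged.parse("data/output/triplets_clean.ttl", format="turtle")
merged.serialize("data/output/graph.ttl", format="turtle")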

Loader example:

# Script: pipeline/07_load_to_graphdb.py
import requests

def load_turtle_to_fuseki(ttl_path, endpoint, dataset, user=None, password=None, timeout=60):
    upload_url = f"{endpoint.rstrip('/')}/{dataset}/data"
    auth = (user, password) if user and password else None

    with open(ttl_path, "rb") as f:
        response = requests.put(
            upload_url,
            data=f,
            headers={"Content-Type": "text/turtle"},
            auth=auth,
            timeout=timeout,
        )
    response.raise_for_status()
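
Called against the local container, with the dataset name as a placeholder for whatever your repo configures. Note that PUT replaces the dataset's default graph; use POST on the same URL if you want to append instead:

if __name__ == "__main__":
    load_turtle_to_fuseki(
        ttl_path="data/output/graph.ttl",
        endpoint="http://localhost:3030",
        dataset="knowledge",  # placeholder dataset name
        user="admin",
        password="admin",
    )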

Run:

make fuseki-start
uv run python pipeline/run_pipeline.py --only-step 7

Verify in the UI:

  • http://localhost:3030/

Step 8 (optional): Auto-generate a draft ontology

At this point you have a graph, but your schema is still informal. A quick way to get started is to generate a draft ontology file that:

  • Defines a couple of base classes (Entity, Sentence)
  • Defines each observed predicate as an owl:ObjectProperty
  • Adds simple labels, plus default domain and range

This does not replace real ontology work, but it gives you something to refine in Protege.

# Script: pipeline/08_generate_ontology_draft.py
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

def build_ontology_draft(triples_ttl: str, out_ttl: str, namespaces: dict):
    g = Graph()
    g.parse(triples_ttl, format="turtle")

    ONTO = Namespace(namespaces["onto"])
    REL = Namespace(namespaces["rel"])

    onto = Graph()
    onto.bind("onto", ONTO)
    onto.bind("rel", REL)
    onto.bind("owl", OWL)
    onto.bind("rdfs", RDFS)

    onto.add((ONTO.Entity, RDF.type, OWL.Class))
    onto.add((ONTO.Sentence, RDF.type, OWL.Class))

    rel_preds = {p for _, p, _ in g if str(p).startswith(str(REL))}
    for p in sorted(rel_preds, key=str):
        label = str(p).split("/")[-1].replace("_", " ")
        onto.add((p, RDF.type, OWL.ObjectProperty))
        onto.add((p, RDFS.label, Literal(label)))
        onto.add((p, RDFS.domain, ONTO.Entity))
        onto.add((p, RDFS.range, ONTO.Entity))

    onto.serialize(out_ttl, format="turtle")

Run:

uv run python pipeline/run_pipeline.py --only-step 8

Output:

  • data/output/ontology_draft.ttl

Querying your graph with SPARQL

Use these prefixes in the Fuseki UI:

PREFIX entity: <http://example.org/entity/>
PREFIX rel:    <http://example.org/relation/>
PREFIX onto:   <http://example.org/ontology/>
PREFIX doc:    <http://example.org/document/>
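
A quick sanity check that the upload worked is to count everything in the dataset:

SELECT (COUNT(*) AS ?count)
WHERE { ?s ?p ?o }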

Top predicates by usage:

PREFIX rel: <http://example.org/relation/>
SELECT ?predicate (COUNT(*) AS ?count)
WHERE {
  ?s ?predicate ?o .
  FILTER(STRSTARTS(STR(?predicate), STR(rel:)))
}
GROUP BY ?predicate
ORDER BY DESC(?count)
LIMIT 10

Outgoing relations for a specific entity label:

PREFIX rel:  <http://example.org/relation/>
PREFIX onto: <http://example.org/ontology/>
SELECT ?relation ?objectLabel
WHERE {
  ?e onto:text "Albert Einstein" .
  ?e ?relation ?o .
  FILTER(STRSTARTS(STR(?relation), STR(rel:)))
  OPTIONAL { ?o onto:text ?objectLabel }
}
ORDER BY ?relation ?objectLabel

Two-hop paths:

PREFIX rel:  <http://example.org/relation/>
PREFIX onto: <http://example.org/ontology/>
SELECT ?midLabel ?targetLabel ?r1 ?r2
WHERE {
  ?e onto:text "Albert Einstein" .
  ?e ?r1 ?mid . FILTER(STRSTARTS(STR(?r1), STR(rel:)))
  ?mid ?r2 ?target . FILTER(STRSTARTS(STR(?r2), STR(rel:)))
  OPTIONAL { ?mid onto:text ?midLabel }
  OPTIONAL { ?target onto:text ?targetLabel }
}
LIMIT 25

Sentences mentioning an entity (with sentence order):

PREFIX onto: <http://example.org/ontology/>
SELECT ?sentenceId ?sentenceText
WHERE {
  ?e onto:text "Albert Einstein" ;
     onto:foundInSentence ?s .
  ?s onto:sentenceId ?sentenceId ;
     onto:text ?sentenceText .
}
ORDER BY ?sentenceId
LIMIT 20

List people extracted by NER:

PREFIX onto: <http://example.org/ontology/>
SELECT ?person ?text ?confidence
WHERE {
  ?person a onto:Entity ;
          onto:entityType "PER" ;
          onto:text ?text ;
          onto:confidence ?confidence .
}
ORDER BY DESC(?confidence)
LIMIT 20

Troubleshooting

  • NLTK tokenizer errors: run uv run python -c "import nltk; nltk.download('punkt')" and rerun Step 2 or Step 3.
  • Slow first run: models are downloaded once on first use and cached for later runs.
  • REBEL on CPU: reduce --max-sentences while iterating.
  • Fuseki issues: confirm http://localhost:3030 is reachable, check Docker logs, and verify your dataset name and credentials.
  • Resume after a failure: uv run python pipeline/run_pipeline.py --start-from N

Wrap-up and next steps

You now have a repeatable path from PDF to RDF and a live SPARQL endpoint. From here, the most valuable improvements usually come from:

  • Better normalization and entity linking (so "IBM" and "International Business Machines" merge correctly)
  • Predicate cleanup (mapping model output to a controlled vocabulary)
  • Adding more documents and comparing patterns across sources
  • Aligning your ontology with existing vocabularies (FOAF, schema.org, Dublin Core); see the example at the end of this post

If you generated data/output/ontology_draft.ttl, open it in Protege and treat it as a starting scaffold, not a final schema.
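
For the vocabulary-alignment point above, a small hypothetical example is to declare that an extracted predicate (say rel:spouse) specializes its schema.org counterpart, right after generating the draft:

# Hypothetical alignment of an extracted predicate with schema.org
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS, SDO

REL = Namespace("http://example.org/relation/")

onto = Graph()
onto.parse("data/output/ontology_draft.ttl", format="turtle")
onto.add((REL.spouse, RDFS.subPropertyOf, SDO.spouse))
onto.serialize("data/output/ontology_draft.ttl", format="turtle")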
