From Reports to Knowledge: Build a Queryable RDF Knowledge Graph
Turn a single PDF into an RDF knowledge graph you can query with SPARQL, using a pipeline that leaves a clear paper trail at every stage.
Most teams have plenty of documents (reports, policies, contracts, research papers) and very little time to keep re-reading them. PDFs are great for distribution, but they are not great for searching across concepts, linking facts, or answering questions like "Who worked with whom?" or "What organizations show up most often?"
This tutorial walks through a practical pipeline that takes one PDF and produces:
- Clean text and sentence-level inputs for NLP
- RDF/Turtle files for entities and relation triples
- A Fuseki dataset you can query via SPARQL
- An optional draft ontology scaffold you can refine in Protege
Everything is modular and inspectable. Each step writes concrete outputs (text files, TSV/CSV, Turtle graphs), so you can validate what the models produced and adjust as needed.
Pipeline overview
The core flow looks like this:
PDF -> Clean text -> Split into sentences -> Coreference resolution
-> Entity extraction (NER) -> Relation extraction (REBEL)
-> Clean and deduplicate triples -> Load into Fuseki -> Query with SPARQL
Optional (but useful): generate a first-pass ontology draft from the predicates you actually observed in your triples.
Prerequisites
System requirements
- Python 3.10 or 3.11
- uv 0.4+ (virtualenv and dependency management)
- Docker 24+ (for Fuseki)
- Make (optional, but convenient)
Dependencies live in pyproject.toml and uv.lock and are installed via uv.
Installation
# Install uv (skip if already installed)
curl -Ls https://astral.sh/uv/install.sh | sh
# Install dependencies; uv creates and manages .venv/
uv sync
# Optional: install the project itself (and dev extras if you want linting/testing)
uv pip install -e .
# uv pip install -e ".[dev]"
# Download model weights once (FastCoref, Transformers, REBEL)
uv run python pipeline/download_models.py
If you have a Makefile, you can use:
make setup # uv sync + model download
make install-dev # install with developer tooling
Fuseki runs in Docker. You can start it now, or let your loader step handle it (depending on how your repo is set up):
make fuseki-start # starts stain/jena-fuseki on :3030 (admin/admin)
make fuseki-stop
Step-by-step pipeline
Step 0: Add your input PDF
Place the PDF you want to process at data/input/source.pdf.
For a first run, short and clean PDFs work best. A simple biography exported to PDF (for example, Einstein or Curie) is a good test case.
Step 1: PDF to clean text
This step extracts text from the PDF and removes common junk that breaks NLP downstream:
- Page numbers, headers, footers (as much as possible)
- Hyphenated line breaks ("-\n" -> "")
- Extra whitespace
- Optional: Wikipedia-style reference sections, bracket citations like [12], and boilerplate
You can get better structure with tools like GROBID or Apache Tika, and you may need OCR (for example, Tesseract) for scanned PDFs.
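If your source is a scanned PDF, pdfplumber will return little or no text. A minimal OCR fallback could look like the sketch below; it assumes pdf2image, pytesseract, and the underlying Tesseract and Poppler binaries are installed, none of which are part of this project's dependencies.
# Optional OCR fallback for scanned PDFs (pdf2image + pytesseract assumed installed)
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf_text(pdf_path: str) -> str:
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)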
# Script: pipeline/01_prepare_text.py
import re
import pdfplumber
from pathlib import Path
WIKIPEDIA_SECTIONS = [
r"\bReferences\b",
r"\bExternal\s+links\b",
r"\bSee\s+also\b",
r"\bFurther\s+reading\b",
]
def clean_wikipedia_text(text: str) -> str:
# Trim trailing sections that mostly contain bibliographies and footers
earliest = min(
(
match.start()
for marker in WIKIPEDIA_SECTIONS
if (match := re.search(marker, text, flags=re.IGNORECASE))
),
default=len(text),
)
text = text[:earliest]
# Remove citation brackets, URLs, and page artifacts
text = re.sub(r"\[\d+\]", "", text) # [12]
text = re.sub(r"https?://[^\s)]+", "", text)
text = text.replace("-\n", "").replace("\n", " ")
return re.sub(r"\s+", " ", text).strip()
def extract_pdf_text(pdf_path: Path) -> str:
with pdfplumber.open(pdf_path) as pdf:
text = "\n".join(page.extract_text() or "" for page in pdf.pages)
return clean_wikipedia_text(text)
Run:
uv run python pipeline/run_pipeline.py --only-step 1
Output:
data/intermediate/source.txt
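Before moving on, take a quick look at what actually came out; a few lines like these (using the path above) catch empty extractions from scanned or image-only PDFs early:
# Quick sanity check on the cleaned text
from pathlib import Path

text = Path("data/intermediate/source.txt").read_text(encoding="utf-8")
print(f"{len(text):,} characters")
print(text[:300])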
Step 2: Clean text to sentences
Most NLP components behave better when you feed them one sentence at a time. This step splits the cleaned text into one sentence per line using NLTK's Punkt tokenizer.
You can swap this for spaCy or Stanza if your document style is tricky (lots of abbreviations, tables, bullet fragments, and so on).
# Script: pipeline/02_split_sentences.py
import re
import nltk
from nltk.tokenize import sent_tokenize
def clean_sentence(sentence: str) -> str:
sentence = re.sub(r"\s+\d+/\d+\s+", " ", sentence)
words = []
previous = None
for word in sentence.split():
if word.lower() != previous:
words.append(word)
previous = word.lower()
return " ".join(words).strip()
def filter_sentence(sentence: str) -> bool:
if len(sentence.split()) < 5:
return False
if any(k in sentence.lower() for k in ("retrieved", "doi", "external links")):
return False
return True
def tokenize_sentences(text: str) -> list[str]:
nltk.download("punkt", quiet=True)
sentences = sent_tokenize(text)
cleaned = [clean_sentence(s) for s in sentences]
return [s for s in cleaned if filter_sentence(s)]
Run:
uv run python pipeline/run_pipeline.py --only-step 2
Output:
data/intermediate/sentences.txt (one sentence per line)
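If Punkt struggles with your document style, the spaCy alternative mentioned above is close to a drop-in replacement. A minimal sketch, assuming spaCy and its small English model are installed (they are not part of this project's lockfile):
# Alternative sentence splitter using spaCy instead of NLTK's Punkt
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner", "lemmatizer"])

def spacy_sentences(text: str) -> list[str]:
    return [sent.text.strip() for sent in nlp(text).sents if sent.text.strip()]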
Step 3: Coreference resolution
Coreference resolution replaces pronouns and repeated mentions with their referents, so later steps attach facts to the right entity.
Example:
- Before: "Marie Curie discovered polonium. She won two Nobel Prizes."
- After: "Marie Curie discovered polonium. Marie Curie won two Nobel Prizes."
# Script: pipeline/03_coreference_resolution.py
import re
import nltk
from fastcoref import FCoref
from nltk.tokenize import sent_tokenize
PRONOUNS = {"he","she","it","they","his","her","its","their","him","them"}
def resolve_coreferences(source_text: str, device: str = "auto") -> list[str]:
nltk.download("punkt", quiet=True)
model = FCoref(device=device)
    result = model.predict(texts=[source_text])[0]  # fastcoref expects a list of texts and returns one CorefResult per text
resolved_text = source_text
for cluster in result.get_clusters():
mentions = [m for m in cluster if m.lower() not in PRONOUNS]
if not mentions:
continue
main = max(mentions, key=len)
for pronoun in set(cluster) - set(mentions):
resolved_text = re.sub(r"\b" + re.escape(pronoun) + r"\b", main, resolved_text)
return sent_tokenize(resolved_text)
Run:
uv run python pipeline/run_pipeline.py --only-step 3 --device cpu
Output:
data/intermediate/resolved_sentences.txt
Note: Coreference is never perfect. Treat it as a quality boost, then verify on a few examples before trusting it at scale.
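Following that advice, a spot check on the example from above is cheap and reveals a lot (this reuses resolve_coreferences from the snippet):
# Spot-check the resolver on a tiny example before processing the full document
sample = "Marie Curie discovered polonium. She won two Nobel Prizes."
print(resolve_coreferences(sample, device="cpu"))
# Ideally: ['Marie Curie discovered polonium.', 'Marie Curie won two Nobel Prizes.']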
Step 4: Sentences to entities (NER)
Now we extract named entities (people, places, organizations, dates, and so on) using a Hugging Face NER model.
One important detail: entity URIs should be stable across the pipeline. If NER creates entity:entity_42_1 while relation extraction creates entity:Albert_Einstein, you end up with two disconnected graphs. The snippet below uses a simple "slug" based on entity text so both steps can share identifiers.
# Script: pipeline/04_sentences_to_entities.py
import re
from transformers import pipeline
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD
def slug(text: str) -> str:
text = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
text = re.sub(r"_+", "_", text).strip("_")
return text or "Unknown"
def extract_entities(sentences, model_name, aggregation_strategy, namespaces):
ner = pipeline(
"ner",
model=model_name,
tokenizer=model_name,
aggregation_strategy=aggregation_strategy,
)
rdf_graph = Graph()
ENTITY = Namespace(namespaces["entity"])
ONTO = Namespace(namespaces["onto"])
DOC = Namespace(namespaces["doc"])
rdf_graph.bind("entity", ENTITY)
rdf_graph.bind("onto", ONTO)
rdf_graph.bind("doc", DOC)
entity_records = []
for i, sentence in enumerate(sentences, start=1):
ents = ner(sentence)
sentence_uri = DOC[f"sentence_{i}"]
rdf_graph.add((sentence_uri, RDF.type, ONTO.Sentence))
rdf_graph.add((sentence_uri, ONTO.text, Literal(sentence)))
rdf_graph.add((sentence_uri, ONTO.sentenceId, Literal(i, datatype=XSD.integer)))
for e in ents:
text = (e.get("word") or "").strip()
conf = e.get("score")
ent_type = e.get("entity_group")
if len(text) <= 1 or conf is None:
continue
entity_uri = ENTITY[slug(text)]
# Create the entity node once, then keep linking it to sentences
rdf_graph.add((entity_uri, RDF.type, ONTO.Entity))
rdf_graph.add((entity_uri, ONTO.text, Literal(text)))
if (entity_uri, ONTO.entityType, None) not in rdf_graph:
rdf_graph.add((entity_uri, ONTO.entityType, Literal(ent_type)))
# Keep the best confidence seen for this entity label
existing = list(rdf_graph.objects(entity_uri, ONTO.confidence))
if existing:
old = float(existing[0])
if float(conf) > old:
rdf_graph.set((entity_uri, ONTO.confidence, Literal(float(conf), datatype=XSD.float)))
else:
rdf_graph.add((entity_uri, ONTO.confidence, Literal(float(conf), datatype=XSD.float)))
rdf_graph.add((entity_uri, ONTO.foundInSentence, sentence_uri))
entity_records.append({
"sentence_id": i,
"entity_text": text,
"entity_uri": str(entity_uri),
"entity_type": ent_type,
"confidence": float(conf),
"start_pos": e.get("start"),
"end_pos": e.get("end"),
"sentence": sentence,
})
return entity_records, rdf_graph
Run:
uv run python pipeline/run_pipeline.py --only-step 4 --max-sentences 500
Output:
data/output/entities.ttl
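To see what NER actually produced, a few lines of rdflib against the output file give a quick breakdown by entity type (the namespace matches the SPARQL prefixes later in this post):
# Summarize entities.ttl: how many entity nodes, and of which types
from collections import Counter

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ONTO = Namespace("http://example.org/ontology/")

g = Graph()
g.parse("data/output/entities.ttl", format="turtle")

print(sum(1 for _ in g.subjects(RDF.type, ONTO.Entity)), "entity nodes")
print(Counter(str(o) for o in g.objects(None, ONTO.entityType)).most_common())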
Step 5: Extract relation triples (REBEL)
Next we extract subject-predicate-object triples with REBEL. The model emits a tagged format that you parse into triples.
As with NER, use the same URI normalization for subjects and objects so your relation edges connect to the entity nodes you already created.
# Script: pipeline/05_extract_triplets.py
import re
from transformers import pipeline
def slug(text: str) -> str:
text = re.sub(r"[^A-Za-z0-9]+", "_", text.strip())
text = re.sub(r"_+", "_", text).strip("_")
return text or "Unknown"
def extract_triplets_from_text(generated_text: str):
triplets = []
text = (
generated_text.replace("<s>", "")
.replace("</s>", "")
.replace("<pad>", "")
.strip()
)
if "<triplet>" not in text:
return triplets
subject = relation = obj = ""
current = None
for token in text.split():
        if token == "<triplet>":
            # Flush the previous triple (if complete) and start a new subject
            if subject and relation and obj:
                triplets.append((subject.strip(), relation.strip(), obj.strip()))
            subject = relation = obj = ""
            current = "subj"
        elif token == "<subj>":
            # REBEL linearizes triples as "<triplet> subject <subj> object <obj> relation",
            # and one subject can be followed by several <subj>/<obj> pairs
            if subject and relation and obj:
                triplets.append((subject.strip(), relation.strip(), obj.strip()))
            relation = obj = ""
            current = "obj"
        elif token == "<obj>":
            current = "rel"
else:
if current == "subj":
subject += (" " if subject else "") + token
elif current == "rel":
relation += (" " if relation else "") + token
elif current == "obj":
obj += (" " if obj else "") + token
if subject and relation and obj:
triplets.append((subject.strip(), relation.strip(), obj.strip()))
return triplets
def extract_triplets(sentences, model_name="Babelscape/rebel-large", device=-1):
gen = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device=device)
results = []
for i, sentence in enumerate(sentences, start=1):
        # Decode with special tokens intact; the default "generated_text" strips
        # <triplet>/<subj>/<obj>, leaving nothing for the parser to find
        raw = gen(sentence, return_tensors=True, return_text=False, max_length=256, num_beams=2)
        output = gen.tokenizer.batch_decode([raw[0]["generated_token_ids"]])[0]
for s, p, o in extract_triplets_from_text(output):
if len(s) > 1 and len(p) > 2 and len(o) > 1:
results.append({
"sentence_id": i,
"subject": slug(s),
"predicate": slug(p),
"object": slug(o),
"sentence": sentence,
"extraction_method": "rebel",
})
return results
Run:
uv run python pipeline/run_pipeline.py --only-step 5 --max-sentences 300
Output:
data/output/triplets.ttl
Tip: REBEL can be slow on CPU. Iterate with a small --max-sentences, then scale up once you are happy with cleaning and normalization.
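In the same spirit, run the extractor on a single sentence first and eyeball the result before committing to a long run:
# One-sentence smoke test for the REBEL extractor defined above
rows = extract_triplets(["Marie Curie discovered polonium and radium."])
for row in rows:
    print(row["subject"], "--", row["predicate"], "->", row["object"])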
Step 6: Clean and deduplicate triples
Even with normalization, you usually want to drop duplicates and filter out junk predicates. This step reads the Turtle graph, converts it to a tabular form, applies cleanup rules, and writes a clean Turtle file.
# Script: pipeline/06_clean_triplets.py
import pandas as pd
from rdflib import Graph
from config.settings import get_pipeline_paths
def load_triplets(ttl_path):
graph = Graph()
graph.parse(str(ttl_path), format="turtle")
rows = []
for s, p, o in graph:
rows.append({
"subject": str(s).split("/")[-1].replace("_", " "),
"predicate": str(p).split("/")[-1].replace("_", " "),
"object": str(o).split("/")[-1].replace("_", " "),
})
return pd.DataFrame(rows)
paths = get_pipeline_paths()
df = load_triplets(paths["triplets_turtle"])
df = df[df["predicate"].notna() & (df["predicate"].str.len() > 1)]
df = df.drop_duplicates(subset=["subject", "predicate", "object"], keep="first")
Run:
uv run python pipeline/run_pipeline.py --only-step 6
Output:
data/output/triplets_clean.ttl
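The snippet above stops at the cleaned DataFrame; writing it back out means rebuilding URIs from the human-readable labels. A minimal sketch, reusing df and the entity:/rel: namespaces from the SPARQL section below:
# Rebuild URIs from the cleaned rows and serialize back to Turtle
from rdflib import Graph, Namespace, URIRef

ENTITY = Namespace("http://example.org/entity/")
REL = Namespace("http://example.org/relation/")

def to_uri(ns: Namespace, label: str) -> URIRef:
    return ns[label.strip().replace(" ", "_") or "Unknown"]

clean_graph = Graph()
clean_graph.bind("entity", ENTITY)
clean_graph.bind("rel", REL)
for row in df.itertuples(index=False):
    clean_graph.add((
        to_uri(ENTITY, row.subject),
        to_uri(REL, row.predicate),
        to_uri(ENTITY, row.object),
    ))
clean_graph.serialize("data/output/triplets_clean.ttl", format="turtle")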
Step 7: Load to graph DB (Apache Jena Fuseki)
Fuseki gives you a SPARQL endpoint on top of your RDF data.
A practical note: you usually want both entity data (entities.ttl) and relation triples (triplets_clean.ttl) in the dataset. The simplest approach is to merge them into one Turtle file and upload that.
If you do not want to modify the loader, a quick merge often works:
cat data/output/entities.ttl data/output/triplets_clean.ttl > data/output/graph.ttl
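If you would rather stay in Python (and get a free parse check of both files), rdflib can do the same merge:
# Merge entity and relation graphs into one Turtle file (same result as the cat above)
from rdflib import Graph

merged = Graph()
merged.parse("data/output/entities.ttl", format="turtle")
merged.parse("data/output/triplets_clean.ttl", format="turtle")
merged.serialize("data/output/graph.ttl", format="turtle")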
Loader example:
# Script: pipeline/07_load_to_graphdb.py
import requests
def load_turtle_to_fuseki(ttl_path, endpoint, dataset, user=None, password=None, timeout=60):
upload_url = f"{endpoint.rstrip('/')}/{dataset}/data"
auth = (user, password) if user and password else None
with open(ttl_path, "rb") as f:
response = requests.put(
upload_url,
data=f,
headers={"Content-Type": "text/turtle"},
auth=auth,
timeout=timeout,
)
response.raise_for_status()
Run:
make fuseki-start
uv run python pipeline/run_pipeline.py --only-step 7
Verify in the UI:
http://localhost:3030/
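You can also confirm the load from code. A small check like this works against a stock Fuseki; the dataset name kg is a placeholder, so use whatever name your loader created:
# Count triples via the SPARQL endpoint ("kg" is a placeholder dataset name)
import requests

response = requests.post(
    "http://localhost:3030/kg/sparql",
    data={"query": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["n"]["value"], "triples in the dataset")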
Step 8 (optional): Auto-generate a draft ontology
At this point you have a graph, but your schema is still informal. A quick way to get started is to generate a draft ontology file that:
- Defines a couple of base classes (Entity, Sentence)
- Defines each observed predicate as an owl:ObjectProperty
- Adds simple labels, plus default domain and range
This does not replace real ontology work, but it gives you something to refine in Protege.
# Script: pipeline/08_generate_ontology_draft.py
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL
def build_ontology_draft(triples_ttl: str, out_ttl: str, namespaces: dict):
g = Graph()
g.parse(triples_ttl, format="turtle")
ONTO = Namespace(namespaces["onto"])
REL = Namespace(namespaces["rel"])
onto = Graph()
onto.bind("onto", ONTO)
onto.bind("rel", REL)
onto.bind("owl", OWL)
onto.bind("rdfs", RDFS)
onto.add((ONTO.Entity, RDF.type, OWL.Class))
onto.add((ONTO.Sentence, RDF.type, OWL.Class))
rel_preds = {p for _, p, _ in g if str(p).startswith(str(REL))}
for p in sorted(rel_preds, key=str):
label = str(p).split("/")[-1].replace("_", " ")
onto.add((p, RDF.type, OWL.ObjectProperty))
onto.add((p, RDFS.label, Literal(label)))
onto.add((p, RDFS.domain, ONTO.Entity))
onto.add((p, RDFS.range, ONTO.Entity))
onto.serialize(out_ttl, format="turtle")
Run:
uv run python pipeline/run_pipeline.py --only-step 8
Output:
data/output/ontology_draft.ttl
Querying your graph with SPARQL
Use these prefixes in the Fuseki UI:
PREFIX entity: <http://example.org/entity/>
PREFIX rel: <http://example.org/relation/>
PREFIX onto: <http://example.org/ontology/>
PREFIX doc: <http://example.org/document/>
Top predicates by usage:
PREFIX rel: <http://example.org/relation/>
SELECT ?predicate (COUNT(*) AS ?count)
WHERE {
?s ?predicate ?o .
FILTER(STRSTARTS(STR(?predicate), STR(rel:)))
}
GROUP BY ?predicate
ORDER BY DESC(?count)
LIMIT 10
Outgoing relations for a specific entity label:
PREFIX rel: <http://example.org/relation/>
PREFIX onto: <http://example.org/ontology/>
SELECT ?relation ?objectLabel
WHERE {
?e onto:text "Albert Einstein" .
?e ?relation ?o .
FILTER(STRSTARTS(STR(?relation), STR(rel:)))
OPTIONAL { ?o onto:text ?objectLabel }
}
ORDER BY ?relation ?objectLabel
Two-hop paths:
PREFIX rel: <http://example.org/relation/>
PREFIX onto: <http://example.org/ontology/>
SELECT ?midLabel ?targetLabel ?r1 ?r2
WHERE {
?e onto:text "Albert Einstein" .
?e ?r1 ?mid . FILTER(STRSTARTS(STR(?r1), STR(rel:)))
?mid ?r2 ?target . FILTER(STRSTARTS(STR(?r2), STR(rel:)))
OPTIONAL { ?mid onto:text ?midLabel }
OPTIONAL { ?target onto:text ?targetLabel }
}
LIMIT 25
Sentences mentioning an entity (with sentence order):
PREFIX onto: <http://example.org/ontology/>
SELECT ?sentenceId ?sentenceText
WHERE {
?e onto:text "Albert Einstein" ;
onto:foundInSentence ?s .
?s onto:sentenceId ?sentenceId ;
onto:text ?sentenceText .
}
ORDER BY ?sentenceId
LIMIT 20
List people extracted by NER:
PREFIX onto: <http://example.org/ontology/>
SELECT ?person ?text ?confidence
WHERE {
?person a onto:Entity ;
onto:entityType "PER" ;
onto:text ?text ;
onto:confidence ?confidence .
}
ORDER BY DESC(?confidence)
LIMIT 20
Troubleshooting
- NLTK tokenizer errors: run uv run python -c "import nltk; nltk.download('punkt')" and rerun Step 2 or Step 3.
- Slow first run: model weights are downloaded once and cached after that.
- REBEL on CPU: reduce --max-sentences while iterating.
- Fuseki issues: confirm http://localhost:3030 is reachable, check the Docker logs, and verify your dataset name and credentials.
- Resume after a failure: uv run python pipeline/run_pipeline.py --start-from N
Wrap-up and next steps
You now have a repeatable path from PDF to RDF and a live SPARQL endpoint. From here, the most valuable improvements usually come from:
- Better normalization and entity linking (so "IBM" and "International Business Machines" merge correctly)
- Predicate cleanup (mapping model output to a controlled vocabulary; a small sketch follows after this list)
- Adding more documents and comparing patterns across sources
- Aligning your ontology with existing vocabularies (FOAF, schema.org, Dublin Core)
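For the predicate cleanup item, even a hand-curated mapping table goes a long way. The REBEL labels below are the kind of thing you will see in triplets_clean.ttl (yours will differ); the schema.org targets are real properties, but the mapping itself is illustrative:
# Illustrative mapping from observed REBEL predicate slugs to schema.org terms
SCHEMA = "https://schema.org/"

PREDICATE_MAP = {
    "place_of_birth": SCHEMA + "birthPlace",
    "educated_at": SCHEMA + "alumniOf",
    "employer": SCHEMA + "worksFor",
    "spouse": SCHEMA + "spouse",
}

def map_predicate(predicate_slug: str) -> str:
    # Fall back to the local rel: namespace for anything not yet curated
    return PREDICATE_MAP.get(predicate_slug, "http://example.org/relation/" + predicate_slug)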
If you generated data/output/ontology_draft.ttl, open it in Protege and treat it as a starting scaffold, not a final schema.