DEGU is a J2EE-based distributed indexing and retrieval engine written
in 100% Java (License: LGPL).
- See a running instance of DEGU here (Search for "JBoss").
- Sourceforge page is here.
The philosophy behind DEGU is to index a fairly small collection of
documents, but to provide high-quality search capabilities. Unlike
other search engines, DEGU retrieves not only whole documents but
also document parts such as chapters and pages. For example, if you
search the IBM Redbooks, the resulting documents will likely have
around 1000 pages each. It would be smarter to get only the relevant
chapters. Each hit produced by DEGU is accompanied by the document's
Table of Contents (TOC), if one could be extracted. TOCs are
searchable as well. Since TOC entries represent document chunks such
as sections, subsections, and subsubsections, DEGU can append the
number of relevant pages to each TOC entry, and, of course, every
chunk can be downloaded separately.
Moreover, DEGU alters the hits: in PDF files, for example, it
underlines the keywords and adds bookmarks that point to the relevant
pages.
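To illustrate how each TOC entry could be annotated with its number of relevant pages, here is a minimal sketch. It is not taken from the DEGU sources: the class `TocPageCounts`, the `TocEntry` type and the method names are hypothetical, and it assumes that every TOC entry knows the inclusive page range it covers and that the search has already produced the set of pages matching the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the DEGU code base.
class TocPageCounts {

    /** One TOC entry covering the inclusive page range [firstPage, lastPage]. */
    static final class TocEntry {
        final String title;
        final int firstPage;
        final int lastPage;
        final List<TocEntry> children = new ArrayList<>();

        TocEntry(String title, int firstPage, int lastPage) {
            this.title = title;
            this.firstPage = firstPage;
            this.lastPage = lastPage;
        }
    }

    /** Counts the hit pages that fall inside the entry's page range. */
    static int relevantPages(TocEntry entry, TreeSet<Integer> hitPages) {
        return hitPages.subSet(entry.firstPage, true, entry.lastPage, true).size();
    }

    /** Renders the TOC as indented lines, each followed by its page count. */
    static List<String> annotate(TocEntry root, TreeSet<Integer> hitPages) {
        List<String> lines = new ArrayList<>();
        collect(root, hitPages, 0, lines);
        return lines;
    }

    private static void collect(TocEntry entry, TreeSet<Integer> hitPages,
                                int depth, List<String> lines) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        lines.add(indent + entry.title + " (" + relevantPages(entry, hitPages) + ")");
        for (TocEntry child : entry.children) {
            collect(child, hitPages, depth + 1, lines);
        }
    }

    public static void main(String[] args) {
        TocEntry chapter = new TocEntry("1 Clustering", 10, 40);
        chapter.children.add(new TocEntry("1.1 Setup", 10, 19));
        chapter.children.add(new TocEntry("1.2 Failover", 20, 40));

        // Pages that matched the query; page 90 lies outside this chapter.
        TreeSet<Integer> hits = new TreeSet<>(Arrays.asList(12, 25, 26, 90));
        for (String line : annotate(chapter, hits)) {
            System.out.println(line);
        }
    }
}
```

Keeping the hit pages in a sorted set makes the count for each entry a simple range query, so nested sections can be annotated independently of one another.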
- Supported document formats: PDF (*.pdf)
- Document search
- Search inside documents (page level, chunk level)
- TOC search / TOC extraction (bookmark based)
- Hit altering (highlighting, bookmarks)
- Simple web and command-line clients
- Document formats: MS Word documents (*.doc), MS PowerPoint
(*.ppt), LaTeX (*.tex), HTML (*.html)
- Lexical extraction of TOCs
- Language detection (n-gram based)
- Query expansion (via WordNet, GermaNet)
- Automatic text summarization (this was the topic of my senior
thesis at university :)
- New sophisticated web client
- Web client for the DEGU index (administration stuff)
- Eclipse plug-in client
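As an illustration of the n-gram based language detection listed above, the following sketch shows one simple way such a detector can work. It is an assumption about the general technique, not DEGU's implementation: the class `NGramLanguageGuesser`, its methods and the training sentences are all hypothetical. Each language is represented by the set of character trigrams seen in a sample text, and the guess is the language whose set contains the most trigrams of the input.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and sample texts are illustrative only.
class NGramLanguageGuesser {

    // Language code -> set of character trigrams seen in its sample text.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Splits text into its distinct lowercase character trigrams. */
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Builds (or replaces) the trigram profile of a language. */
    void train(String language, String sampleText) {
        profiles.put(language, trigrams(sampleText));
    }

    /**
     * Returns the language whose profile shares the most trigrams with
     * the text, or null if no profile has been trained. On a tie, the
     * language trained first wins.
     */
    String guess(String text) {
        Set<String> grams = trigrams(text);
        String best = null;
        int bestHits = -1;
        for (Map.Entry<String, Set<String>> profile : profiles.entrySet()) {
            int hits = 0;
            for (String gram : grams) {
                if (profile.getValue().contains(gram)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        guesser.train("en", "the engine searches the documents and pages");
        guesser.train("de", "die maschine durchsucht die dokumente und seiten");
        System.out.println(guesser.guess("the documents"));
        System.out.println(guesser.guess("die dokumente"));
    }
}
```

A real detector would train on much larger corpora and usually rank trigrams by frequency rather than use plain set overlap; the sketch only shows the basic idea.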
Many thanks to these projects
Michael Barth, 2006