GNU libextractor - Documentation

Further documentation

This documentation covers the major aspects of libextractor in brief. More details can be found in the GNU libextractor manual (html, pdf). The man pages for extract and libextractor are also available online.
An article describing libextractor was published in the Linux Journal and is available online. That article describes the API of versions 0.0.0 to 0.5.23, not the more recent 0.6.x API.

Copyright and Contributions

libextractor is released under the GNU General Public License (GPL). All contributions must therefore be placed under the GPL or a compatible license.

Mailing lists

libextractor has a mailing list for discussion of anything related to the project: <libextractor@gnu.org>.

To subscribe to this or any other GNU mailing list, send an empty mail with a Subject: header of just "subscribe" to the relevant -request address. For example, to subscribe to the GNU libextractor list, send mail to <libextractor-request@gnu.org>. Alternatively, you can use the mailing list web interface.

Getting involved

Development of libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you would like to get involved, it is a good idea to join the mailing list (see above).

Development

Development sources can be found in our Subversion repository at https://gnunet.org/svn/Extractor/. Our bugtracker is at https://gnunet.org/bugs/.

Translating libextractor

To translate libextractor's messages into other languages, please see the Translation Project page for libextractor. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into libextractor. For more information, see the Translation Project.

Installation

The simplest way to install libextractor is to use one of the binary packages that are available online for many distributions. Note that under Debian, the extract tool is in a separate package, extract, and the headers required to compile other applications against libextractor are in libextractor-dev. Thus, under Debian, you should use:

# apt-get install libextractor-dev extract

Compiling by hand follows the usual sequence:

$ tar xzvf libextractor.x.y.z.tar.gz
$ cd libextractor.x.y.z
$ ./configure
$ make
# make install

Note that you need various dependencies (read README.debian for an up-to-date list for Debian systems) in order to compile all of the plugins.

Using the extract tool

After installing libextractor, the extract tool can be used to obtain meta data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you use BibTeX, the -b option is likely to come in handy: it automatically creates BibTeX entries from documents that have been properly equipped with meta data:

$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
  title = "The Digital Millennium Copyright Act of 1998",
  author = "United States Copyright Office - jmf",
  note = "digital millennium copyright act circumvention...",
  year = "2001",
  month = "10",
  key = "Copyright Office Summary of the DMCA",
  pages = "18"
}

Further options are described in the extract manpage (man 1 extract).

Examples:

$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo

Using the libextractor library

The following listing shows the code of a minimalistic program that uses libextractor. Compiling it requires passing the option -lextractor to gcc. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor manpage (man 3 libextractor). Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available. Python programmers will find that libextractor (since 0.5.0) can also be used from Python: just import Extractor.

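#include <stdio.h>
#include <extractor.h>

int main(int argc, char * argv[])
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  /* load the default set of plugins */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* run the plugins on the file and print each meta data item to stdout */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  /* unload the plugins again */
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}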

Current Plugins

HTML, PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.

Writing new Plugins

The most complicated part of writing a new plugin for libextractor is writing the actual parser for a specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin, and must be placed in the plugin directory (typically $PREFIX/lib/libextractor/). The library must export a function EXTRACTOR_XXX_extract with the following signature:

int
EXTRACTOR_XXX_extract (const char *data,
                       size_t size,
                       EXTRACTOR_MetaDataProcessor proc,
                       void *proc_cls,
                       const char* options);

data is a pointer to the contents of the file, and size is the number of bytes available in data. Most plugins start by verifying that size is sufficiently large and that the header of data matches the specific format. The extract function is expected to call proc with each meta data item found. proc_cls must be passed as the first argument to proc; the other arguments describe the meta data found. Finally, options is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore options.
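
The size and header check mentioned above can be sketched as follows. This is only an illustration, not code taken from an existing plugin: the helper name looks_like_png is hypothetical, and PNG is used merely because its 8-byte file signature is well known.

```c
#include <stddef.h>
#include <string.h>

/* The 8-byte signature that starts every PNG file. */
static const unsigned char PNG_SIGNATURE[8] =
  { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };

/* Hypothetical helper (not part of libextractor):
   returns 1 if data starts with the PNG signature, 0 otherwise. */
static int
looks_like_png (const char *data, size_t size)
{
  if (size < sizeof (PNG_SIGNATURE))
    return 0; /* too short to even contain the signature */
  return 0 == memcmp (data, PNG_SIGNATURE, sizeof (PNG_SIGNATURE));
}
```

An extract function would typically call such a check first and return 0 right away if it fails, since the file is simply not in the plugin's format.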

If the meta data extracted is a string, the plugin is supposed to convert it into the UTF-8 character set. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often text/plain. Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the "proc" callback is:

typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);
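
As a rough illustration of the receiving side, the sketch below shows a callback with the same parameter order as EXTRACTOR_MetaDataProcessor. To keep it compilable without libextractor installed, it uses stand-in enums (DEMO_MetaType, DEMO_MetaFormat); these names, the DemoStats structure and the function demo_processor are assumptions made for this example, and a real program would use the enums from extractor.h instead.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-ins (assumptions) for the enums that extractor.h provides. */
enum DEMO_MetaType { DEMO_TYPE_MIMETYPE, DEMO_TYPE_TITLE, DEMO_TYPE_AUTHOR };
enum DEMO_MetaFormat { DEMO_FORMAT_UTF8, DEMO_FORMAT_C_STRING, DEMO_FORMAT_BINARY };

/* Hypothetical state handed to the callback through its cls argument. */
struct DemoStats
{
  FILE *out;           /* where items are printed */
  unsigned int text;   /* number of string items seen */
  unsigned int binary; /* number of binary items seen */
};

/* Same parameter order as EXTRACTOR_MetaDataProcessor, with stand-in enums. */
static int
demo_processor (void *cls,
                const char *plugin_name,
                enum DEMO_MetaType type,
                enum DEMO_MetaFormat format,
                const char *data_mime_type,
                const char *data,
                size_t data_len)
{
  struct DemoStats *stats = cls;
  const char *mime = (NULL != data_mime_type) ? data_mime_type : "unknown";

  if (DEMO_FORMAT_BINARY == format)
    {
      /* binary data must not be printed as text; report its size only */
      stats->binary++;
      fprintf (stats->out, "%s: type %d, %zu bytes of binary data (%s)\n",
               plugin_name, (int) type, data_len, mime);
      return 0;
    }
  stats->text++;
  /* print at most data_len bytes of the string */
  fprintf (stats->out, "%s: type %d - %.*s (%s)\n",
           plugin_name, (int) type, (int) data_len, data, mime);
  return 0; /* zero: keep extracting */
}
```

Passing a pointer to a DemoStats structure as the closure shows why the signature starts with a void pointer: the callback can keep its own state without global variables.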

If "proc" returns non-zero, the plugin should stop extracting immediately and return 1. In all other cases, the "extract" function should return zero.
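
The return convention can be sketched with a toy extract function. Everything prefixed DEMO_ or demo_ below is an assumption made for illustration: the stand-in enums and typedef replace the declarations from extractor.h so that the sketch compiles on its own, the "DEMO" file format is invented, and the way string lengths are passed is a simplification. A real plugin would also convert the title to UTF-8 before reporting it.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins (assumptions) for declarations that extractor.h provides. */
enum DEMO_MetaType { DEMO_TYPE_MIMETYPE, DEMO_TYPE_TITLE };
enum DEMO_MetaFormat { DEMO_FORMAT_UTF8 };
typedef int (*DEMO_MetaDataProcessor) (void *cls, const char *plugin_name,
                                       enum DEMO_MetaType type,
                                       enum DEMO_MetaFormat format,
                                       const char *data_mime_type,
                                       const char *data, size_t data_len);

/* Toy format invented for this sketch: the magic "DEMO" followed by a title. */
static int
DEMO_toy_extract (const char *data, size_t size,
                  DEMO_MetaDataProcessor proc, void *proc_cls,
                  const char *options)
{
  static const char mime[] = "application/x-demo";

  (void) options; /* ignored, as in most plugins */
  if ( (size < 4) || (0 != memcmp (data, "DEMO", 4)) )
    return 0; /* not our format: nothing to report, but not an abort */
  if (0 != proc (proc_cls, "toy", DEMO_TYPE_MIMETYPE, DEMO_FORMAT_UTF8,
                 "text/plain", mime, strlen (mime)))
    return 1; /* the callback asked us to stop */
  if ( (size > 4) &&
       (0 != proc (proc_cls, "toy", DEMO_TYPE_TITLE, DEMO_FORMAT_UTF8,
                   "text/plain", data + 4, size - 4)) )
    return 1; /* the callback asked us to stop */
  return 0; /* all items reported */
}

/* Hypothetical callback state: abort once limit items have been seen. */
struct DemoLimit { unsigned int seen; unsigned int limit; };

static int
demo_stop_after (void *cls, const char *plugin_name,
                 enum DEMO_MetaType type, enum DEMO_MetaFormat format,
                 const char *data_mime_type,
                 const char *data, size_t data_len)
{
  struct DemoLimit *l = cls;

  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  l->seen++;
  return (l->seen >= l->limit) ? 1 : 0; /* non-zero requests an abort */
}
```

With a limit of 1, the callback sees only the mime type and DEMO_toy_extract returns 1 without reporting the title; with a higher limit both items are delivered and the function returns 0.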

