GNU Libextractor

GNU Libextractor is a library used to extract meta data from files. The goal is to provide developers of file-sharing networks, browsers or WWW-indexing bots with a universal library to obtain simple keywords and meta data to match against queries and to show to users instead of only relying on filenames. libextractor contains the shell command extract that, similar to the well-known file command, can extract meta data from a file and print the results to stdout.

Currently, libextractor supports the following formats: HTML, MAN, PS, DVI, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), FLAC, MP3 (ID3v1 and ID3v2), OGG, WAV, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), NSF(E) (NES music), SID (C64 music), EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), LZH, LHA, RAR, ZIP, CAB, 7-ZIP, AR, MTREE, PAX, CPIO, ISO9660, SHAR, RAW, XAR, FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.

GNU libextractor uses helper libraries (plugins) to perform the actual extraction. As a result, GNU libextractor can be extended simply by installing additional plugins. Writing robust parsers can be difficult, so GNU libextractor protects the main application from hanging or crashing plugins by executing all plugins out-of-process.

Downloading Libextractor

Source Code
Libextractor is available from the main GNU FTP server via HTTP(S) and FTP. It can also be found on the GNU mirrors; please use a mirror if possible.
Debian .deb package
The Debian package can be downloaded from the official Debian archive. The extract package can be found under Utilities and the library under Libraries. The respective packages for libextractor are extract, libextractor, and for development, libextractor-dev.
Tar Package
The latest version can be found on GNU mirrors. If the mirror does not work, you should be able to find it on the main FTP server.
Latest release is libextractor-latest.tar.gz.
Latest Java-binding is libextractor-java-1.0.0.tar.gz.
Latest Mono-binding is libextractor-mono-0.5.23.tar.gz.
Latest Python-binding is libextractor-python-0.5.tar.gz.
RPM Packages
RPMs are available for Fedora, Mageia and several other distributions.
Windows
The latest Windows binary is libextractor-w32-1.0.0.zip.

Documentation

Documentation for Libextractor is available online, as is documentation for most GNU software. You may also find more information about Libextractor by running info libextractor or man libextractor, or man extract, or by looking at /usr/share/doc/libextractor/, /usr/local/doc/libextractor/, or similar directories on your system. A brief summary is available by running extract --help. You might also be interested in an API compatibility report comparing the various Libextractor versions.

Mailing lists

Libextractor has the following mailing lists:

  • bug-libextractor is used to discuss most aspects of Libextractor, including development and enhancement requests, as well as bug reports.
  • help-libextractor is for general user help and discussion.

Announcements about Libextractor and most other GNU software are made on info-gnu (archive).

Security reports that should not be made immediately public can be added as private reports on the bugtracker. If there is no response to an urgent issue, you can escalate to the general security mailing list for advice.

Getting involved

Development of Libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you'd like to get involved, it's a good idea to join the discussion mailing list (see above).

Development
Known bugs and open feature requests are tracked in our bugtracker.
Git access
  • You can access the current development version of libextractor using

    $ git clone https://git.gnunet.org/libextractor.git
  • A Java binding for libextractor is in

    $ git clone https://git.gnunet.org/libextractor-java
  • A Mono binding for libextractor is in

    $ git clone https://git.gnunet.org/libextractor-mono
  • A Python binding can be found under

    $ git clone https://git.gnunet.org/libextractor-python

    A source package is available on the GNU FTP server. This binding has been packaged as a Python egg.

    A second Python binding includes a binding for doodle.

  • A Perl binding is in CPAN. The latest version of the Perl binding is available using

    $ git clone git://git.perldition.org/File-Extractor.git/
  • Ruby bindings have been published on raa.ruby-lang.org (mirror) and rubyforge.org (mirror).

  • An initial draft of a PHP binding can be found under

    $ git clone https://git.gnunet.org/libextractor-php
Translating Libextractor
To translate Libextractor's messages into other languages, please see the Translation Project page for Libextractor. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into Libextractor. For more information, see the Translation Project.

Quick Introduction

Installation

The simplest way to install GNU libextractor is to use one of the binary packages which are available online for many distributions. Note that under Debian, the extract tool is in a separate package extract and headers required to compile other applications against libextractor are in libextractor-dev. Thus, under Debian, you should use:

# apt-get install libextractor-dev extract

Compiling by hand follows the usual sequence:

$ tar xzvf libextractor-x.y.z.tar.gz
$ cd libextractor-x.y.z
$ ./configure
$ make
# make install

Note that you need various dependencies (see the README for an up-to-date list) in order to compile all of the plugins.

Using the extract tool

After installing GNU libextractor, the extract tool can be used to obtain meta data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you use BibTeX, the option -b is likely to come in handy: it automatically creates BibTeX entries from documents that have been properly equipped with meta data.

Further options are described in the extract manpage (man 1 extract).

Example Output
$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo
Using the GNU libextractor library in your programs

The following listing shows a minimal program that uses GNU libextractor. Compiling it requires passing the option -lextractor to gcc. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor manpage (man 3 libextractor). Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available. Python programmers will find that libextractor (since 0.5.0) can also be used from Python; simply import Extractor.

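#include <stdio.h>
#include <extractor.h>

int
main (int argc, char *argv[])
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    {
      fprintf (stderr, "Usage: %s FILENAME\n", argv[0]);
      return 1;
    }
  /* load the default set of plugins */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* extract meta data from the file and print each item to stdout */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  /* unload the plugins again */
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}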
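
The callback passed to EXTRACTOR_extract does not have to be EXTRACTOR_meta_data_print; an application can supply its own function to decide what to do with each item. The following is a minimal sketch of such a callback that prints only textual items and skips binary ones such as thumbnails. To keep it compilable without the library installed, it declares abbreviated stand-ins for two enums from extractor.h (the subset and the numeric values are assumptions); print_text_items is a hypothetical name. A real program would include <extractor.h> instead of the stand-ins.

```c
#include <stdio.h>
#include <stddef.h>

/* Abbreviated stand-ins for enums from <extractor.h> (assumed subset). */
enum EXTRACTOR_MetaType
{
  EXTRACTOR_METATYPE_RESERVED,
  EXTRACTOR_METATYPE_MIMETYPE,
  EXTRACTOR_METATYPE_TITLE
};

enum EXTRACTOR_MetaFormat
{
  EXTRACTOR_METAFORMAT_UNKNOWN,
  EXTRACTOR_METAFORMAT_UTF8,
  EXTRACTOR_METAFORMAT_BINARY,
  EXTRACTOR_METAFORMAT_C_STRING
};

/* Hypothetical callback; cls is the FILE * to print to. */
static int
print_text_items (void *cls,
                  const char *plugin_name,
                  enum EXTRACTOR_MetaType type,
                  enum EXTRACTOR_MetaFormat format,
                  const char *data_mime_type,
                  const char *data,
                  size_t data_len)
{
  FILE *out = cls;

  (void) type;
  if ( (EXTRACTOR_METAFORMAT_UTF8 != format) &&
       (EXTRACTOR_METAFORMAT_C_STRING != format) )
    return 0; /* skip binary or unknown data, but keep extracting */
  /* data_len bounds the output in case data is not 0-terminated */
  fprintf (out, "%s: %.*s (%s)\n",
           plugin_name, (int) data_len, data, data_mime_type);
  return 0;
}
```

With the real header, the callback would take the place of EXTRACTOR_meta_data_print in the listing above, for example EXTRACTOR_extract (plugins, argv[1], NULL, 0, &print_text_items, stdout).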
Writing new Plugins for GNU libextractor

The hardest part of writing a new plugin for GNU libextractor is writing the actual parser for a specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin, and must be placed in the plugin directory (typically $PREFIX/lib/libextractor/). The library must export a method EXTRACTOR_XXX_extract_method with the following signature:

void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);

ec provides a callback to invoke with meta data as well as functions for reading data from the file that is being processed. Most plugins start by reading the first bytes of the file and checking that the header matches the specific format. The extract function is expected to call ec->proc with each meta data item found. ec->cls must be passed as the first argument to proc and to the other functions invoked from within ec. Finally, ec->config is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore config.
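
The pattern described above (read the first bytes, check the header, report items through ec->proc) can be sketched as follows. This is an illustration under stated assumptions, not a plugin shipped with libextractor: the XXX format, its magic bytes "XXX1" and its MIME type are made up, and the declarations at the top are abbreviated stand-ins for what extractor.h provides, written from memory of the libextractor 1.x header so that the sketch compiles on its own. A real plugin would include <extractor.h> instead. The demo_ functions are a hypothetical in-memory harness for trying the plugin function without the library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Abbreviated stand-ins for declarations from <extractor.h> (assumed). */
enum EXTRACTOR_MetaType { EXTRACTOR_METATYPE_RESERVED, EXTRACTOR_METATYPE_MIMETYPE };
enum EXTRACTOR_MetaFormat { EXTRACTOR_METAFORMAT_UNKNOWN, EXTRACTOR_METAFORMAT_UTF8 };
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

struct EXTRACTOR_ExtractContext
{
  void *cls;
  const char *config;
  ssize_t (*read) (void *cls, void **data, size_t size);
  int64_t (*seek) (void *cls, int64_t pos, int whence);
  uint64_t (*get_size) (void *cls);
  EXTRACTOR_MetaDataProcessor proc;
};

/* Hypothetical magic bytes and MIME type of the made-up XXX format. */
#define XXX_MAGIC "XXX1"
#define XXX_MIME "application/x-xxx"

void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  const size_t magic_len = strlen (XXX_MAGIC);
  void *data;
  ssize_t n;

  /* read the first bytes and check that the header matches */
  n = ec->read (ec->cls, &data, magic_len);
  if ( (n < 0) ||
       ((size_t) n < magic_len) ||
       (0 != memcmp (data, XXX_MAGIC, magic_len)) )
    return; /* not an XXX file, report nothing */
  /* report one item; ec->cls is always the first argument */
  ec->proc (ec->cls, "xxx",
            EXTRACTOR_METATYPE_MIMETYPE,
            EXTRACTOR_METAFORMAT_UTF8,
            "text/plain",
            XXX_MIME, strlen (XXX_MIME) + 1);
}

/* Hypothetical in-memory harness: one state object serves as ec->cls. */
struct demo_state { const char *buf; size_t len; size_t off; int items; };

static ssize_t
demo_read (void *cls, void **data, size_t size)
{
  struct demo_state *st = cls;
  size_t left = st->len - st->off;

  if (size > left)
    size = left;
  *data = (void *) (st->buf + st->off);
  st->off += size;
  return (ssize_t) size;
}

static int
demo_proc (void *cls, const char *plugin_name,
           enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format,
           const char *data_mime_type, const char *data, size_t data_len)
{
  struct demo_state *st = cls;

  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  st->items++; /* just count what the plugin reports */
  return 0;
}

/* Runs the plugin over buf and returns the number of items reported. */
static int
demo_count_items (const char *buf, size_t len)
{
  struct demo_state st = { buf, len, 0, 0 };
  struct EXTRACTOR_ExtractContext ec
    = { .cls = &st, .read = &demo_read, .proc = &demo_proc };

  EXTRACTOR_XXX_extract_method (&ec);
  return st.items;
}
```

The sketch returns silently when the header does not match, so that a file in a different format produces no items from this plugin.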

If the meta data extracted is a string, it is supposed to be converted into the UTF-8 character set by the plugin. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often "text/plain". Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the proc callback is:

typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

If proc returns non-zero, the plugin should abort processing the current file and return.
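
A plugin that reports several items therefore has to check the return value of every proc call. The sketch below shows one way to structure this, assuming the plugin has already parsed its items into an array. struct xxx_item, xxx_report_items and the demo_ callbacks are hypothetical names, and the enum declarations are abbreviated stand-ins for those in extractor.h so that the fragment compiles on its own. Passing strlen + 1 as the length of a string item is an assumption about the usual plugin convention.

```c
#include <stddef.h>
#include <string.h>

/* Abbreviated stand-ins for declarations from <extractor.h> (assumed). */
enum EXTRACTOR_MetaType
{
  EXTRACTOR_METATYPE_RESERVED,
  EXTRACTOR_METATYPE_MIMETYPE,
  EXTRACTOR_METATYPE_TITLE
};
enum EXTRACTOR_MetaFormat { EXTRACTOR_METAFORMAT_UNKNOWN, EXTRACTOR_METAFORMAT_UTF8 };
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

/* Hypothetical: one already-parsed UTF-8 meta data item. */
struct xxx_item
{
  enum EXTRACTOR_MetaType type;
  const char *value;
};

/* Reports items in order; returns 1 if proc asked to abort, else 0. */
static int
xxx_report_items (EXTRACTOR_MetaDataProcessor proc, void *cls,
                  const struct xxx_item *items, size_t n_items)
{
  size_t i;

  for (i = 0; i < n_items; i++)
    if (0 != proc (cls, "xxx",
                   items[i].type,
                   EXTRACTOR_METAFORMAT_UTF8,
                   "text/plain",
                   items[i].value,
                   strlen (items[i].value) + 1))
      return 1; /* stop processing this file */
  return 0;
}

/* Demo consumer that accepts everything; cls points to an int counter. */
static int
demo_accept_all (void *cls, const char *plugin_name,
                 enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format,
                 const char *data_mime_type, const char *data, size_t data_len)
{
  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  (*(int *) cls)++;
  return 0;
}

/* Demo consumer that requests an abort after the first item. */
static int
demo_stop_after_first (void *cls, const char *plugin_name,
                       enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format,
                       const char *data_mime_type, const char *data, size_t data_len)
{
  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  (*(int *) cls)++;
  return 1;
}
```

Inside an extract method, the call would be xxx_report_items (ec->proc, ec->cls, items, n_items), followed by an immediate return if the result is non-zero.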

Related projects and useful resources
Projects that use libextractor

Licensing

Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.