
GNU Libextractor

GNU Libextractor is a library used to extract meta data from files. The goal is to provide developers of file-sharing networks, browsers or WWW-indexing bots with a universal library to obtain simple keywords and meta data to match against queries and to show to users, instead of relying only on filenames. libextractor also includes the shell command extract which, similar to the well-known file command, extracts meta data from a file and prints the results to stdout.

Currently, libextractor supports the following formats: HTML, MAN, PS, DVI, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), FLAC, MP3 (ID3v1 and ID3v2), OGG, WAV, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), NSF(E) (NES music), SID (C64 music), EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), LZH, LHA, RAR, ZIP, CAB, 7-ZIP, AR, MTREE, PAX, CPIO, ISO9660, SHAR, RAW, XAR, FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.

GNU libextractor uses helper libraries (plugins) to perform the actual extraction. As a result, GNU libextractor can be extended simply by installing additional plugins. Writing robust parsers can be difficult, so GNU libextractor protects the main application from hanging or crashing plugins by executing all plugins out-of-process.
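The general idea behind out-of-process execution can be sketched in a few lines of POSIX C. This is only an illustration of the technique, not libextractor's actual implementation: run_isolated, parser_fn and the three sample parsers are hypothetical names, and the sketch assumes a POSIX system with fork, waitpid and alarm.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical parser entry point; stands in for a plugin's extraction code. */
typedef void (*parser_fn) (void);

/* Run `parse` in a child process so that a crash or hang cannot take
   down the caller.  `timeout_seconds` must be greater than zero.
   Returns 0 if the child exited cleanly, -1 if it crashed, timed out,
   or could not be started. */
int
run_isolated (parser_fn parse, unsigned int timeout_seconds)
{
  int status;
  pid_t pid = fork ();

  if (pid < 0)
    return -1;                  /* fork failed */
  if (pid == 0)
    {
      /* Child: the default action of SIGALRM terminates a hanging parser. */
      alarm (timeout_seconds);
      parse ();
      _exit (0);
    }
  /* Parent: wait for the child and inspect how it ended. */
  if (waitpid (pid, &status, 0) < 0)
    return -1;
  return (WIFEXITED (status) && 0 == WEXITSTATUS (status)) ? 0 : -1;
}

/* Sample parsers for trying out the sketch. */
void well_behaved_parser (void) { }
void crashing_parser (void) { abort (); }
void hanging_parser (void) { for (;;) pause (); }
```

Calling run_isolated (crashing_parser, 1) or run_isolated (hanging_parser, 1) reports a failure to the caller, which itself keeps running. A real implementation additionally has to pass the file contents to the child and the extracted meta data back to the parent, which this sketch leaves out.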

GNU libextractor is a GNU package. Our official GNU website can be found at http://www.gnu.org/software/libextractor/.

Downloading Libextractor

Source Code
Libextractor can be found on the main GNU ftp server: http://ftp.gnu.org/gnu/libextractor/ (via HTTP) and ftp://ftp.gnu.org/gnu/libextractor/ (via FTP). It can also be found on the GNU mirrors; please use a mirror if possible.
Debian .deb package
The Debian package can be downloaded from the official Debian archive. The extract package can be found under Utilities and the library under Libraries. The respective packages are extract and libextractor, plus libextractor-dev for development. Backports for Debian Stable are also available.
Tar Package
The latest version can be found on GNU mirrors. If the mirror does not work, you should be able to find it on the main FTP server at ftp://ftp.gnu.org/gnu/libextractor/.
Latest release is libextractor-1.1.tar.gz.
Latest Java-binding is libextractor-java-1.0.0.tar.gz.
Latest Mono-binding is libextractor-mono-0.5.23.tar.gz.
Latest Python-binding is libextractor-python-0.5.tar.gz.
RPM Package
RPMs for SuSE 9.3 can be found here (i386, x86_64, SRPM)
Windows
Latest Windows binary is libextractor-0.5.23-w32.zip.

Documentation

Documentation for Libextractor is available online, as is documentation for most GNU software. You may also find more information about Libextractor by running info libextractor or man libextractor, or man extract, or by looking at /usr/share/doc/libextractor/, /usr/local/doc/libextractor/, or similar directories on your system. A brief summary is available by running extract --help. You might also be interested in an API compatibility report comparing the various Libextractor versions.

Articles related to libextractor:

Mailing lists

Libextractor has the following mailing lists:

Announcements about Libextractor and most other GNU software are made on info-gnu (archive). If you only want to get notifications about Libextractor, we suggest you subscribe to the project at freshmeat.

Security reports that should not be made immediately public can be sent directly to the maintainer. If there is no response to an urgent issue, you can escalate to the general security mailing list for advice.

Getting involved

Development of Libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you'd like to get involved, it's a good idea to join the discussion mailing list (see above).

Development
Known bugs and open feature requests are tracked in our bugtracker.
Subversion access
You can access the current development version of libextractor using
$ svn checkout https://gnunet.org/svn/Extractor

A Java binding for libextractor is in
$ svn checkout https://gnunet.org/svn/Extractor-java

A Mono binding for libextractor is in
$ svn checkout https://gnunet.org/svn/Extractor-mono

A Python binding can be found under
$ svn checkout https://gnunet.org/svn/Extractor-python
A source package is here. This binding has been packaged as a python egg, available here. A second Python binding that includes a binding for doodle can be found here.
A Perl binding is in CPAN. The latest version of the Perl binding is available using git clone git://git.perldition.org/File-Extractor.git/
A Ruby binding has been published here (mirror). Another Ruby binding has been published here (mirror).
An initial draft of a PHP binding can be found under
$ svn checkout https://gnunet.org/svn/Extractor-php
Translating Libextractor
To translate Libextractor's messages into other languages, please see the Translation Project page for Libextractor. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into Libextractor. For more information, see the Translation Project.
Maintainer
Libextractor is currently being maintained by Christian Grothoff.

Quick Introduction

Installation
The simplest way to install GNU libextractor is to use one of the binary packages which are available online for many distributions. Note that under Debian, the extract tool is in a separate package extract and headers required to compile other applications against libextractor are in libextractor-dev. Thus, under Debian, you should use:
# apt-get install libextractor-dev extract
Compiling by hand follows the usual sequence:
$ tar xzvf libextractor-x.y.z.tar.gz
$ cd libextractor-x.y.z
$ ./configure
$ make
# make install
Note that you need various dependencies (read README for an up-to-date list) in order to compile all of the plugins.
Using the extract tool
After installing GNU libextractor, the extract tool can be used to obtain meta data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you are a BibTeX user, the option -b is likely to come in handy: it automatically creates BibTeX entries from documents that contain suitable meta data.
Further options are described in the extract manpage (man 1 extract).
Example Output
$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo
Using the GNU libextractor library in your programs
The following listing shows the code of a minimalistic program that uses GNU libextractor. Compiling the fragment requires passing the option -lextractor to gcc. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor manpage (man 3 libextractor). Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available. Python programmers will find that libextractor (since 0.5.0) can also be used from Python: just import Extractor.
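#include <stdio.h>
#include <extractor.h>

int
main (int argc, char *argv[])
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    {
      fprintf (stderr, "Usage: %s FILENAME\n", argv[0]);
      return 1;
    }
  /* Load the default set of plugins. */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* Extract from the named file (no in-memory buffer is passed)
     and print every meta data item to stdout. */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  /* Unload all plugins again. */
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}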
Writing new Plugins for GNU libextractor
The hardest part of writing a new plugin for GNU libextractor is the actual parser for the specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin, and must be placed in the plugin directory (typically $PREFIX/lib/libextractor/). The library must export a function EXTRACTOR_XXX_extract_method with the following signature:
void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);

ec provides a callback to invoke with meta data as well as functions for reading data from the file that is being processed. Most plugins start by reading the first bytes of the file and checking that the header of the data matches the specific format. The extract function is expected to call ec->proc with each meta data item found. ec->cls must be passed as the first argument to proc and to other functions invoked from within ec. Finally, ec->config is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore config.
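The header check that most plugins start with might look like the following sketch, which uses PNG as the example format. The function name has_png_signature is hypothetical, and the sketch deliberately takes a plain buffer: how the first bytes are obtained through ec is not shown here, so consult the libextractor manual for the actual read functions.

```c
#include <stddef.h>
#include <string.h>

/* The 8-byte signature that starts every PNG file. */
static const unsigned char PNG_SIGNATURE[8] =
  { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };

/* Returns 1 if `buf`, holding the first `len` bytes of a file,
   starts with the PNG signature, and 0 otherwise. */
int
has_png_signature (const unsigned char *buf, size_t len)
{
  if (NULL == buf || len < sizeof (PNG_SIGNATURE))
    return 0;                   /* too short to be a PNG file */
  return 0 == memcmp (buf, PNG_SIGNATURE, sizeof (PNG_SIGNATURE));
}
```

A plugin would typically return immediately, without calling proc, when such a check fails, so that files of other formats are skipped cheaply.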
If the meta data extracted is a string, it is supposed to be converted into the UTF-8 character set by the plugin. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often text/plain. Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the "proc" callback is:
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);
If "proc" returns non-zero, the plugin should abort processing the current file and return.
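The interaction between a plugin and "proc" can be illustrated with a small self-contained sketch. To keep it compilable without libextractor installed, it uses hypothetical stand-ins (the demo_ enums, demo_processor, find_title and demo_extract) that merely mirror the shape of the callback above; real code would include extractor.h and use the EXTRACTOR_ types instead. The data_len value passed here is simply the string length chosen for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the enums declared in extractor.h. */
enum demo_meta_type { DEMO_TYPE_MIMETYPE, DEMO_TYPE_TITLE, DEMO_TYPE_AUTHOR };
enum demo_meta_format { DEMO_FORMAT_UTF8, DEMO_FORMAT_C_STRING,
                        DEMO_FORMAT_BINARY };

/* Same shape as EXTRACTOR_MetaDataProcessor, using the stand-in enums. */
typedef int (*demo_processor) (void *cls, const char *plugin_name,
                               enum demo_meta_type type,
                               enum demo_meta_format format,
                               const char *data_mime_type,
                               const char *data, size_t data_len);

/* Closure for find_title: receives the first title that is seen. */
struct title_search { char title[128]; int found; };

/* Processor that stores the first UTF-8 title and then returns
   non-zero to ask the plugin to stop. */
int
find_title (void *cls, const char *plugin_name, enum demo_meta_type type,
            enum demo_meta_format format, const char *data_mime_type,
            const char *data, size_t data_len)
{
  struct title_search *ts = cls;
  size_t n;

  (void) plugin_name;
  (void) data_mime_type;
  if (DEMO_TYPE_TITLE != type || DEMO_FORMAT_UTF8 != format)
    return 0;                   /* not interesting, continue */
  n = data_len < sizeof (ts->title) - 1 ? data_len : sizeof (ts->title) - 1;
  memcpy (ts->title, data, n);
  ts->title[n] = '\0';
  ts->found = 1;
  return 1;                     /* abort: we have what we need */
}

/* Plugin side: hand each item to proc and stop as soon as proc
   returns non-zero.  Returns the number of items handed to proc. */
int
demo_extract (demo_processor proc, void *cls)
{
  static const struct { enum demo_meta_type type; const char *value; } items[] = {
    { DEMO_TYPE_MIMETYPE, "image/png" },
    { DEMO_TYPE_TITLE, "The libextractor logo" },
    { DEMO_TYPE_AUTHOR, "never delivered" },
  };
  int emitted = 0;
  size_t i;

  for (i = 0; i < sizeof (items) / sizeof (items[0]); i++)
    {
      emitted++;
      if (0 != proc (cls, "demo", items[i].type, DEMO_FORMAT_UTF8,
                     "text/plain", items[i].value, strlen (items[i].value)))
        break;
    }
  return emitted;
}
```

With find_title as the processor, demo_extract delivers the mime type and the title and then stops, so the author item is never passed on. The cls pointer plays the same role as ec->cls: the plugin passes it through untouched and only the processor interprets it.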
Related projects and useful resources
Projects that use libextractor

Licensing

Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
