GNU libextractor - Documentation
Further documentation
This documentation covers the major aspects of libextractor in brief. More details can be found in the GNU libextractor manual (html, pdf). The man pages for extract and libextractor are also on-line.
An article describing libextractor was published in the LinuxJournal and is available here. That article describes the API for versions 0.0.0 to 0.5.23 and not the more recent 0.6.x API.

Copyright and Contributions
libextractor is released under the GNU General Public License (GPL). All contributions must therefore be put under the GPL or a compatible license.

Mailing lists
libextractor has a mailing list for discussion of anything related to the project: <libextractor@gnu.org>. To subscribe to this or any other GNU mailing list, send an empty mail with a Subject: header of just subscribe to the relevant -request list. For example, to subscribe yourself to the GNU libextractor list, you would send mail to <libextractor-request@gnu.org>. Alternatively, you can use the mailing list web interface.

Getting involved
Development of libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you would like to get involved, it is a good idea to join the mailing list (see above).

Installation
The simplest way to install libextractor is to use one of the binary packages that are available online for many distributions. Note that under Debian, the extract tool is in a separate package, extract, and the headers required to compile other applications against libextractor are in libextractor-dev. Thus, under Debian, you should use:
# apt-get install libextractor-dev extract
Compiling by hand follows the usual sequence:
$ tar xzvf libextractor.x.y.z.tar.gz
$ cd libextractor.x.y.z
$ ./configure
$ make
# make install
Note that you need various dependencies (read README.debian for an up-to-date list for Debian systems) in order to compile all of the plugins.

Using the extract tool
After installing libextractor, the extract tool can be used to obtain meta data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you are a user of BibTeX, the option -b is likely to come in handy: it automatically creates BibTeX entries from documents that have been properly equipped with meta data:
$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b ~/dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
title = "The Digital Millennium Copyright Act of 1998",
author = "United States Copyright Office - jmf",
note = "digital millennium copyright act circumvention...",
year = "2001",
month = "10",
key = "Copyright Office Summary of the DMCA",
pages = "18"
}
Further options are described in the extract manpage (man 1 extract).
Examples:
$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo

Using the libextractor library
The following listing shows the code of a minimalistic program that
uses libextractor. Compiling the fragment requires passing the
option -lextractor to gcc. For details and additional
functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (man 3 libextractor).
Java programmers should note that a Java class that uses JNI to
communicate with libextractor is also available. Python programmers
will find that libextractor (since 0.5.0) can also be used from
Python: just import Extractor.
#include

Current Plugins
HTML, PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.

Writing new Plugins
The hardest part of writing a new plugin for libextractor is writing the actual parser for the specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin, and must be placed in the plugin directory (typically $PREFIX/lib/libextractor/). The library must export a function EXTRACTOR_XXX_extract with the following signature:
int
EXTRACTOR_XXX_extract (const char *data,
size_t size,
EXTRACTOR_MetaDataProcessor proc,
void *proc_cls,
const char* options);
data is a pointer to the contents of the file, and size is the number of bytes available in data. Most plugins start by verifying that size is sufficiently large and that the header of data matches the specific format. The extract function is expected to call proc with each meta data item found. proc_cls must be passed as the first argument to proc; the other arguments correspond to the meta data found. Finally, options is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore options.
If the extracted meta data is a string, the plugin is supposed to convert it into the UTF-8 character set. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often text/plain. Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the "proc" callback is:
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
const char *plugin_name,
enum EXTRACTOR_MetaType type,
enum EXTRACTOR_MetaFormat format,
const char *data_mime_type,
const char *data,
size_t data_len);
If "proc" returns non-zero, the plugin should abort and return non-zero itself. The "extract" function should always return zero unless a call to "proc" returned non-zero, in which case the plugin must return 1.
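The pattern described above (check the size and header, report each item through proc, stop as soon as proc returns non-zero) can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration and is not taken from libextractor: the "XXX1" file signature, the fixed-width title field, the mime type application/x-xxx, the DEMO_* enumerators, and the demo_proc callback are hypothetical. The two enums are local stand-ins so that the fragment compiles on its own; a real plugin would include extractor.h and use the meta data types and formats defined there.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins so this sketch compiles without libextractor installed.
   A real plugin would #include <extractor.h> instead; the enumerator
   names and values below are made up for this example. */
enum EXTRACTOR_MetaType { DEMO_METATYPE_MIMETYPE = 1, DEMO_METATYPE_TITLE = 2 };
enum EXTRACTOR_MetaFormat { DEMO_METAFORMAT_UTF8 = 1 };

typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);

#define XXX_MAGIC "XXX1"   /* hypothetical 4-byte file signature */
#define XXX_MAGIC_LEN 4
#define XXX_TITLE_LEN 16   /* hypothetical NUL-padded UTF-8 title field */
#define XXX_MIME "application/x-xxx"  /* hypothetical mime type */

int
EXTRACTOR_XXX_extract (const char *data,
                       size_t size,
                       EXTRACTOR_MetaDataProcessor proc,
                       void *proc_cls,
                       const char *options)
{
  char title[XXX_TITLE_LEN + 1];

  (void) options;  /* like most plugins, this one ignores its options */

  /* Step 1: verify the size and the header; if this is not our format,
     report nothing and return zero. */
  if (size < XXX_MAGIC_LEN + XXX_TITLE_LEN)
    return 0;
  if (0 != memcmp (data, XXX_MAGIC, XXX_MAGIC_LEN))
    return 0;

  /* Step 2: report each item; proc_cls is always the first argument.
     Passing the length including the terminating NUL is a choice of
     this sketch. */
  if (0 != proc (proc_cls, "xxx", DEMO_METATYPE_MIMETYPE, DEMO_METAFORMAT_UTF8,
                 "text/plain", XXX_MIME, strlen (XXX_MIME) + 1))
    return 1;  /* proc asked us to abort */

  /* Copy the title so that it is guaranteed to be NUL-terminated. */
  memcpy (title, data + XXX_MAGIC_LEN, XXX_TITLE_LEN);
  title[XXX_TITLE_LEN] = '\0';
  if (0 != proc (proc_cls, "xxx", DEMO_METATYPE_TITLE, DEMO_METAFORMAT_UTF8,
                 "text/plain", title, strlen (title) + 1))
    return 1;

  return 0;  /* no call to proc returned non-zero */
}

/* Hypothetical consumer: counts items and asks the plugin to stop once
   `limit` items have been seen. */
struct demo_state { int seen; int limit; };

int
demo_proc (void *cls, const char *plugin_name,
           enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format,
           const char *data_mime_type, const char *data, size_t data_len)
{
  struct demo_state *st = cls;

  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  st->seen++;
  return st->seen >= st->limit;  /* non-zero tells the plugin to abort */
}
```

To try the sketch, call EXTRACTOR_XXX_extract on a buffer that starts with "XXX1" followed by a 16-byte title, passing demo_proc and a pointer to a struct demo_state as proc and proc_cls. With a limit of 1 the function stops after the first item and returns 1; with a larger limit it reports both items and returns 0.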