4.4 Extracting

— Function Pointer: int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)

Type of a function that libextractor calls for each meta data item found.

cls
closure (user-defined)
plugin_name
name of the plugin that produced this value; special values can be used (e.g. '<zlib>' when zlib, used in the main libextractor library, yields meta data);
type
libextractor-type describing the meta data;
format
basic format information about data
data_mime_type
mime-type of data (not of the original file); can be NULL (if mime-type is not known);
data
actual meta-data found
data_len
number of bytes in data

Return 0 to continue extracting, 1 to abort.
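
As an illustration of the closure argument and the return value, the following is a minimal sketch of a processor that counts items and asks libextractor to stop once a limit is reached. The names ‘struct item_counter’ and ‘count_items’ are hypothetical, and ‘HAVE_EXTRACTOR_H’ is an assumed build flag: when it is not defined, two placeholder enums stand in for the real declarations from extractor.h so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdio.h>

#ifdef HAVE_EXTRACTOR_H  /* assumed build flag, not defined by libextractor */
#include <extractor.h>
#else
/* Placeholder declarations so the sketch compiles without the library.
   Real code would take these types from extractor.h instead. */
enum EXTRACTOR_MetaType { SKETCH_METATYPE_PLACEHOLDER = 0 };
enum EXTRACTOR_MetaFormat { SKETCH_METAFORMAT_PLACEHOLDER = 0 };
#endif

/* Hypothetical closure state handed to the processor through 'cls'. */
struct item_counter
{
  unsigned int seen;   /* number of items reported so far */
  unsigned int limit;  /* abort once this many items were seen */
};

/* Matches the EXTRACTOR_MetaDataProcessor signature described above. */
static int
count_items (void *cls,
             const char *plugin_name,
             enum EXTRACTOR_MetaType type,
             enum EXTRACTOR_MetaFormat format,
             const char *data_mime_type,
             const char *data,
             size_t data_len)
{
  struct item_counter *ctr = cls;

  (void) data;  /* the payload itself is not inspected in this sketch */
  ctr->seen++;
  printf ("item %u from %s: type %d, format %d, mime-type %s, %zu bytes\n",
          ctr->seen,
          plugin_name,
          (int) type,
          (int) format,
          (NULL != data_mime_type) ? data_mime_type : "(unknown)",
          data_len);
  /* 0 continues extracting, 1 aborts. */
  return (ctr->seen >= ctr->limit) ? 1 : 0;
}
```

A caller would pass ‘count_items’ as the processor and a pointer to an initialized ‘struct item_counter’ as the closure.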

— Function: void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)

This is the main function for extracting keywords with GNU libextractor. The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data. The ‘filename’ argument is optional and can be used to specify the name of a file to process. If ‘filename’ is NULL, then the ‘data’ argument must point to the in-memory data to extract meta data from. If ‘filename’ is non-NULL, ‘data’ can be NULL. If ‘data’ is non-NULL, then ‘size’ is the size of ‘data’ in bytes; otherwise ‘size’ should be zero. For each meta data item found, GNU libextractor will call the ‘proc’ function, passing ‘proc_cls’ as the first argument to ‘proc’. The other arguments to ‘proc’ depend on the specific meta data found.
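
The rules for ‘filename’, ‘data’ and ‘size’ can be summarized as a small check that an application might run before calling EXTRACTOR_extract. This is only a sketch of the contract described above: ‘extract_args_valid’ is a hypothetical helper and not part of libextractor.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libextractor.
   Returns 1 if the combination of arguments follows the rules
   described for EXTRACTOR_extract, 0 otherwise. */
static int
extract_args_valid (const char *filename,
                    const void *data,
                    size_t size)
{
  /* Without a file name, in-memory data is required. */
  if ( (NULL == filename) && (NULL == data) )
    return 0;
  /* 'size' describes 'data'; without data it should be zero. */
  if ( (NULL == data) && (0 != size) )
    return 0;
  return 1;
}
```

If the check succeeds, the same three values can be passed on unchanged, together with the plugin list, ‘proc’ and ‘proc_cls’.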

Meta data extraction should never really fail. At worst, GNU libextractor will not call ‘proc’ with any meta data. By design, GNU libextractor should never crash or leak memory, even given corrupt files as input. Note, however, that running GNU libextractor on a corrupt file system (or on incorrectly mmapped files) can result in the operating system sending a SIGBUS (bus error) to the process. Although GNU libextractor runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it, and a SIGBUS can occur during that decompression. GNU libextractor will not attempt to catch this signal, so your application is likely to crash. Again, this should only happen if the file system itself is corrupt, not if individual files are corrupt. If this is not acceptable, consider running GNU libextractor itself out-of-process as well (as done, for example, by doodle).
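
One way to isolate an application from such a signal is to perform the extraction in a child process. The following is a minimal POSIX sketch of that idea and not a description of how doodle does it: ‘run_isolated’, ‘child_work’, ‘work_ok’ and ‘work_sigbus’ are hypothetical names, and the sketch leaves out the pipe or other channel that would be needed to return the extracted meta data to the parent.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Work to perform in the child; a real application would call
   EXTRACTOR_extract here and send the results to the parent. */
typedef void (*child_work) (void *cls);

/* Runs 'work' in a forked child.  Returns 0 if the child exited
   normally with status 0, the signal number if the child was killed
   by a signal (such as SIGBUS), and -1 on any other failure. */
static int
run_isolated (child_work work, void *cls)
{
  int status;
  pid_t pid = fork ();

  if (-1 == pid)
    return -1;
  if (0 == pid)
  {
    work (cls);
    _exit (0);  /* skip atexit handlers and stdio flushing in the child */
  }
  while (-1 == waitpid (pid, &status, 0))
    if (EINTR != errno)
      return -1;
  if (WIFSIGNALED (status))
    return WTERMSIG (status);
  if (WIFEXITED (status) && (0 == WEXITSTATUS (status)))
    return 0;
  return -1;
}

/* Demonstration only: a child that finishes without trouble. */
static void
work_ok (void *cls)
{
  (void) cls;
}

/* Demonstration only: simulates the bus error discussed above. */
static void
work_sigbus (void *cls)
{
  (void) cls;
  raise (SIGBUS);
}
```

With this helper, ‘run_isolated (&work_sigbus, NULL)’ reports the signal to the parent instead of terminating it, so the application can skip the offending file and continue.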