AppSuite:DocumentConverter Installation Guide

Revision as of 14:58, 19 September 2014 by Kai.ahrens (talk | contribs) (Added Setup section (TBC))

Product Description

Open-Xchange Inc. (“Open-Xchange”) has created a proprietary software program called the Open-Xchange Document Converter (the “Software”), which converts Microsoft (OOXML) and OpenOffice (ODF) office documents to PDF or to HTML5 with embedded SVG (Scalable Vector Graphics) files. Additionally, the OX Document Converter converts PDF files to SVG files, and Microsoft files to OpenOffice files and vice versa. Users can therefore view and print documents within the existing infrastructure without any additional plugins.

Introduction

When document preview functionality is offered within Open-Xchange App Suite, users expect to be able to open as many different document formats as possible. Put differently, a user should not have to care about the format of a document she has just received. It should just work, without her knowing anything about document formats at all.

To offer such transparent behaviour to the user, OX App Suite needs to take care of converting a lot of document formats into the display formats needed by OX Files. OX Files is extended with document preview functionality by the module OX Document Viewer.

The conversion functionality is also available as the stand-alone product OX Document Converter. The OX Document Converter WebService allows customers to flexibly integrate document conversion into their offering.

The API reference describes the available actions with request parameters and results.

Requirements

OX Document Converter requires a 64-bit system; 32-bit systems are not supported.
See the Open-Xchange software requirements page for details.

Download and Installation

The OX Document Converter deployment consists of two functional modules that need to be installed separately: the readerengine component and the Document Converter Webservice component.

ReaderEngine

See Readerengine installation instructions

Webservice

See Document converter installation instructions

See Document converter API installation instructions

See Document converter Webservice instructions

Configuration

After deploying both components, the readerengine and the Web service, the administrator needs to make some adjustments to the configuration of the OX Document Converter installation.

The component readerengine works with the default configuration. The settings are in the file documentconverter.properties located in the directory "/opt/open-xchange/etc" as described below.
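As a rough illustration of how the items in this file are laid out, the following Python sketch parses simple key=value lines like those listed in this section. It is not part of the product. The helper name parse_properties and the sample excerpt are hypothetical, and the sketch only handles the plain key=value subset of the Java properties format (no line continuations or escape sequences).

```python
def parse_properties(text):
    """Parse simple 'key=value' lines; lines starting with '#' or '!' are comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line, ignored in this sketch
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical excerpt of documentconverter.properties
sample = """
# directory of the readerengine installation
com.openexchange.documentconverter.installDir=/opt/readerengine
com.openexchange.documentconverter.errorDir=
com.openexchange.documentconverter.urlLinkProxy =
com.openexchange.documentconverter.jobProcessorCount=3
"""

settings = parse_properties(sample)
print(settings["com.openexchange.documentconverter.installDir"])
```

Items without a value, such as errorDir in the excerpt, end up as empty strings, which corresponds to the "Default value: n/a" entries below.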

A summary of all configuration items, together with each default value, is given below. Although the defaults have been carefully chosen for a real-life deployment, the admin should take a closer look at each of them and adjust them if necessary.

com.openexchange.documentconverter.installDir=/opt/readerengine

This item contains the directory of the libreaderengine installation. The libreaderengine installation directory in general contains the ./program directory, which itself contains the engine executables.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/opt/readerengine"

com.openexchange.documentconverter.cacheDir=/var/spool/open-xchange/documentconverter/readerengine.cache

This item contains the directory that will make up the cache for persistent job data. The directory itself does not need to exist at startup, but the parent directory needs to exist and needs to have write permissions for the user running the servlet, in order for the servlet to create this cache directory at runtime.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/var/spool/open-xchange/documentconverter/readerengine.cache"

com.openexchange.documentconverter.scratchDir=/var/spool/open-xchange/documentconverter/readerengine.scratch

This item contains the directory that will make up the runtime environment for the readerengine. The directory itself does not need to exist at startup, but the parent directory needs to exist and needs to have write permissions for the user running the servlet, in order for the servlet to create this scratch directory at runtime.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/var/spool/open-xchange/documentconverter/readerengine.scratch"
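The precondition shared by cacheDir and scratchDir (the parent directory exists and is writable) can be checked before starting the service. The following is a minimal sketch, assuming it is run as the same user that runs the servlet; the function name can_create_at_runtime is hypothetical, and the example uses a throwaway temporary directory instead of the real /var/spool paths.

```python
import os
import tempfile


def can_create_at_runtime(directory):
    """Return True if the parent of 'directory' exists and is writable
    (and searchable) for the current user, so 'directory' could be created."""
    parent = os.path.dirname(os.path.abspath(directory))
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)


# Demonstration inside a temporary directory that stands in for the spool area.
with tempfile.TemporaryDirectory() as spool:
    ok = can_create_at_runtime(os.path.join(spool, "readerengine.scratch"))
    missing = can_create_at_runtime(
        os.path.join(spool, "no-such-parent", "readerengine.cache")
    )

print(ok, missing)
```

In a real deployment the function would be called with the configured cacheDir and scratchDir values.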

com.openexchange.documentconverter.errorDir=

This item specifies a directory for files that could not be loaded due to an error condition or due to a timeout.
Note: The used disk space will grow with retained files. Files have to be removed manually.
Default value: n/a

com.openexchange.documentconverter.blacklistFile=/opt/open-xchange/etc/readerengine.blacklist

The list of external document content URLs that are not allowed to be loaded by the readerengine after loading a document. The file itself contains a list of (newline separated) regular expressions. Each external URL is first checked against the list of blacklist URL regular expressions. If the external URL matches a blacklist entry, it is then checked against the list of whitelist URL regular expressions. The behavior in summary is as follows:
  • If the URL is not blacklisted and not whitelisted, it is resolved at runtime.
  • If the URL is blacklisted but not whitelisted, it is not resolved at runtime.
  • If the URL is not blacklisted but whitelisted, it is resolved at runtime.
  • If the URL is blacklisted and whitelisted, it is resolved at runtime.
In boolean notation: valid = (!blacklisted) || whitelisted
Please note that the regular expressions need to fully qualify the patterns that the URL should be checked against. Upper and lower case need to be handled by the regular expression as well. The file itself needs to be UTF-8 encoded to be read appropriately.
Default value: "/opt/open-xchange/etc/readerengine.blacklist"

com.openexchange.documentconverter.whitelistFile=/opt/open-xchange/etc/readerengine.whitelist

The list of external document content URLs that are allowed to be loaded by the readerengine after an external URL matched a blacklist pattern. The file itself contains a list of (newline separated) regular expressions. Each external URL is only checked against the list of whitelist URL regular expressions if it previously matched a pattern in the blacklist file. The behavior in summary is as follows:
  • If the URL is not blacklisted and not whitelisted, it is resolved at runtime.
  • If the URL is blacklisted but not whitelisted, it is not resolved at runtime.
  • If the URL is not blacklisted but whitelisted, it is resolved at runtime.
  • If the URL is blacklisted and whitelisted, it is resolved at runtime.
In boolean notation: valid = (!blacklisted) || whitelisted
Please note that the regular expressions need to fully qualify the patterns that the URL should be checked against. Upper and lower case need to be handled by the regular expression as well. The file itself needs to be UTF-8 encoded to be read appropriately.
Default value: "/opt/open-xchange/etc/readerengine.whitelist"
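The blacklist/whitelist decision described for the two files can be sketched in a few lines of Python. This is only an illustration of the rule valid = (!blacklisted) || whitelisted and not the converter's implementation. It assumes that "fully qualify" means the expression must match the whole URL, and it uses Python's re module, whose syntax differs in some details from the Java regular expressions the backend presumably uses. The patterns and host names are made up.

```python
import re


def load_patterns(text):
    """Compile one regular expression per non-empty line of a list file."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]


def is_resolved(url, blacklist, whitelist):
    """Apply valid = (!blacklisted) || whitelisted to a single external URL."""
    blacklisted = any(pattern.fullmatch(url) for pattern in blacklist)
    if not blacklisted:
        return True  # the whitelist is only consulted for blacklisted URLs
    return any(pattern.fullmatch(url) for pattern in whitelist)


# Hypothetical file contents: block everything below example.com,
# but allow its static host again.
blacklist = load_patterns(r"https?://.*\.example\.com/.*")
whitelist = load_patterns(r"https?://static\.example\.com/.*")

for url in (
    "http://other.org/a.png",
    "http://ads.example.com/a.png",
    "http://static.example.com/a.png",
):
    print(url, is_resolved(url, blacklist, whitelist))
```

Because case is not folded automatically, a pattern that should match upper and lower case alike has to say so itself, for example with character classes.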

com.openexchange.documentconverter.urlLinkLimit=200

The external URL link limit specifies the maximum number of valid external internet URLs (already filtered by blacklist and whitelist) that the engine tries to resolve when loading a document. When this limit is reached, no more external internet URLs are resolved for the current document.
Important: Please note that one externally linked object within the document does not automatically correspond to one external URL call. In general, at least two URL calls are necessary to display one externally linked object. Such additional calls are in most cases caused by a format detection that happens prior to resolving the object data itself.
Set to -1 for no upper limit or to 0 to disable the resolving of internet URLs completely.
Default value: 200
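The meaning of the special values -1 and 0 can be illustrated with a short Python sketch. The function name may_resolve_next_url is hypothetical and the engine's real bookkeeping is not documented here; the sketch only encodes the rules stated above.

```python
def may_resolve_next_url(resolved_so_far, url_link_limit):
    """Decide whether one more external URL may be resolved for a document."""
    if url_link_limit < 0:
        return True  # -1: no upper limit
    # 0 disables resolving completely; otherwise stop once the limit is reached.
    return resolved_so_far < url_link_limit


# With at least two URL calls per externally linked object, the default
# limit of 200 covers at most about 100 linked objects per document.
default_limit = 200
max_objects_at_two_calls_each = default_limit // 2
print(max_objects_at_two_calls_each)
```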

com.openexchange.documentconverter.urlLinkProxy =

The external URL link proxy entry specifies a proxy server that is used by the readerengine to resolve external links contained within a document. Such links are e.g. external http:// graphic links that are resolved during the filtering process of a readerengine instance.
Set this entry to the address of the proxy server: host:port
Recognized protocols are http://, https:// and ftp://
Leave empty if no proxy server should be used by the readerengine.
Default value: n/a

com.openexchange.documentconverter.RemoteBaseUrl =

Use a remote document conversion webservice to do the actual conversion.
Set this entry to the base URL of the remote host: http://host[:port]/documentconverterPath
Leave empty if conversion should happen on the local machine.
Default value: n/a

From 7.8.2 on: The com.openexchange.documentconverter.RemoteBaseUrl entry is no longer valid in the documentconverter.properties file. The corresponding documentconverter server needs to be set on the OX backend node where the documentconverter-client package has been installed. The name of the new entry is com.openexchange.documentconverter.client.remoteDocumentConverterUrl. The entry itself is located within the documentconverter-client.properties configuration file.

com.openexchange.documentconverter.RemoteCacheUrls =

Use one or more remote converter caches to speed up the conversion. The first entry, if set, is treated as the remote master cache, receiving cache updates from the local cache. Additional entries are treated as remote slave caches for read purposes only.
Set the (whitespace separated) entries to the base URL(s) of the appropriate remote host(s): http://host[:port]/documentconverterCachePath
Leave empty if only the local filesystem cache should be used.
Default value: n/a
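How a whitespace separated RemoteCacheUrls value splits into one master and several read-only slave caches can be sketched as follows. The function name split_cache_urls and the host names are hypothetical; the path segment is the placeholder used above, not a real path.

```python
def split_cache_urls(value):
    """Split a whitespace separated list into (master, slaves).

    The first entry is the master cache, all further entries are
    read-only slave caches. An empty value means local cache only.
    """
    urls = value.split()
    master = urls[0] if urls else None
    slaves = urls[1:]
    return master, slaves


master, slaves = split_cache_urls(
    "http://cache-a.example.com:8009/documentconverterCachePath "
    "http://cache-b.example.com:8009/documentconverterCachePath"
)
print(master)
print(slaves)
```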

com.openexchange.documentconverter.RemoteSharePointUrl =

Use a remote SharePoint service to do MSO to PDF conversions.
Set this entry to the URL of the SharePoint host: http://host[:port]/_vti_bin/oxconvert.svc/mex?wsdl
If left empty, the corresponding conversion job always returns false.
Default value: n/a

com.openexchange.documentconverter.RemoteSharePointUsername =

The login user name to be used for calls to the SharePoint service
Default value: n/a

com.openexchange.documentconverter.RemoteSharePointPassword =

The password to be used for calls to the SharePoint service
Default value: n/a

com.openexchange.documentconverter.jobProcessorCount=3

This item determines the number of engines working in parallel for job execution. The value needs to be greater than or equal to 1. Best performance is generally achieved with a value of about (n-1), where n is the number of available CPU cores of the machine the service is running on.
Default value: 3
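The (n-1) rule of thumb can be turned into a starting value with a small Python sketch. The function name suggested_job_processor_count is hypothetical, and the result is only a first guess that should be checked against the actual load of the machine.

```python
import os


def suggested_job_processor_count(cpu_cores=None):
    """Return max(1, n - 1), where n is the number of CPU cores."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


# The default of 3 corresponds to a machine with 4 cores.
print(suggested_job_processor_count(4))
# Without an argument, the core count of the current machine is used.
print(suggested_job_processor_count())
```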

com.openexchange.documentconverter.jobRestartCount=50

This item determines the maximum number of executed jobs after which a single engine is automatically restarted, in order to avoid memory fragmentation and possible memory leaks within one libreaderengine instance.
Default value: 50

com.openexchange.documentconverter.jobExecutionTimeoutMilliseconds=60000

This item determines the timeout in milliseconds, after which the execution of a single job is terminated.
Default value: 60000

com.openexchange.documentconverter.maxVMemMB=2048

This item determines the maximum size in megabytes (MB) of virtual memory that each started readerengine process is allowed to consume. If a job tries to consume more VMem than set via this config item, the processing of the current job for the appropriate readerengine process will be aborted and the underlying process is restarted to avoid memory corruption.
Set this value to -1 for no upper limit.
Default value: 2048

com.openexchange.documentconverter.maxCacheSizeMB=-1

This item determines the maximum size in megabytes (MB) of all persistently cached converter job entries at runtime. A larger value may drastically reduce the time for conversion jobs, e.g. in case of a repeated creation of document previews.
Set this value to -1 for no upper limit.
Default value: -1

com.openexchange.documentconverter.maxCacheEntries=-1

This item determines the maximum number of converter jobs cached at runtime. The value affects the amount of runtime job information to be cached as well as the number of file entries within the cache directory.
Set this value to -1 for no upper limit.
Default value: -1

com.openexchange.documentconverter.cacheEntryTimeoutSeconds=2592000

This item determines the timeout in seconds, after which a cached job result is automatically removed from the cache.
Set this value to 0 to disable the timeout based removal of cached job results.
Default value: 2592000
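The default of 2592000 seconds is easier to judge when expressed in days. The following Python snippet does the conversion in both directions; the helper name days_to_seconds is hypothetical.

```python
from datetime import timedelta

# The default cache entry timeout expressed as a duration.
default_timeout = timedelta(seconds=2592000)
print(default_timeout.days)


def days_to_seconds(days):
    """Convert a retention period in days to a value for this item."""
    return int(timedelta(days=days).total_seconds())


# Value to configure for a retention period of one week.
print(days_to_seconds(7))
```

The default therefore keeps cached job results for 30 days.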

com.openexchange.documentconverter.enableCacheLookup=false

Setting this flag to true enables the caller of the RemoteInternalPreviewService#getCachedPreviewFor implementation (OfficePreviewService) to retrieve only the cached result of a previous conversion call. If no cache entry exists, no new job is scheduled, since such a job might run for a long period of time, up to the given job timeout.
Set to false to disable the cache lookup within the RemoteInternalPreviewService#getCachedPreviewFor implementation.
Default value: false

com.openexchange.documentconverter.errorCacheTimeoutSeconds=600

This value determines how long an error, associated with a job hash value, is held within the error cache. As long as the timeout has not been reached, additional RemoteInternalPreviewService#getPreviewFor calls with the same job hash will instantly return the cached error code instead of processing the job again.
Set to 0 to disable the error cache handling.
Default value: 0

com.openexchange.documentconverter.errorCacheMaxCycleCount=5

This value determines the number of cycles for which a job, associated with a job hash value, is added to the error cache. One cycle starts when a job is added to the error cache and ends when the timeout set via errorCacheTimeoutSeconds has been reached. After reaching the given maximum cycle count, the job is not removed from the error cache anymore and is held within the error cache for the rest of the runtime of the current backend instance. Since the error cache is not persistent, the cycle counter for each job hash is reset after a restart of the backend instance.
Set to 0 to disable the error cache handling.
Default value: 5
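The interplay of errorCacheTimeoutSeconds and errorCacheMaxCycleCount can be sketched as a small in-memory structure. This is one possible reading of the description above, not the backend's implementation: the class name ErrorCache, its methods and the error code string are hypothetical, and it is an assumption that a cycle is counted each time a job hash is added.

```python
import time


class ErrorCache:
    """Illustrative error cache keyed by job hash (not the product's code)."""

    def __init__(self, timeout_seconds, max_cycle_count, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.max_cycle_count = max_cycle_count
        self._clock = clock
        self._entries = {}  # job hash -> (error code, time the entry was added)
        self._cycles = {}   # job hash -> number of cycles started so far

    def enabled(self):
        # Setting either value to 0 disables the error cache handling.
        return self.timeout_seconds > 0 and self.max_cycle_count > 0

    def add(self, job_hash, error_code):
        if not self.enabled():
            return
        self._cycles[job_hash] = self._cycles.get(job_hash, 0) + 1
        self._entries[job_hash] = (error_code, self._clock())

    def lookup(self, job_hash):
        """Return the cached error code, or None if the job should be processed."""
        entry = self._entries.get(job_hash)
        if entry is None:
            return None
        error_code, added_at = entry
        permanent = self._cycles[job_hash] >= self.max_cycle_count
        if permanent or self._clock() - added_at < self.timeout_seconds:
            return error_code
        del self._entries[job_hash]  # cycle ended; the job may run again
        return None


# Usage with a fake clock so the example does not have to wait.
now = [0.0]
cache = ErrorCache(timeout_seconds=600, max_cycle_count=2, clock=lambda: now[0])
cache.add("job-hash-1", "TIMEOUT")
first = cache.lookup("job-hash-1")   # within the timeout
now[0] = 601.0
second = cache.lookup("job-hash-1")  # first cycle has ended
cache.add("job-hash-1", "TIMEOUT")   # second cycle reaches the maximum
now[0] = 100000.0
third = cache.lookup("job-hash-1")   # entry is now kept permanently
print(first, second, third)
```

Since the structure lives in memory only, creating a new instance models the reset of all cycle counters after a backend restart.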

com.openexchange.documentconverter.servletLocalFileUrls=false

This item determines whether the documentconverter servlet is allowed to handle file URLs of the form file://... A file URL locates a file that is locally accessible on the machine the documentconverter backend is running on.
Default value: false

com.openexchange.capability.sharepointconversion=false

Capability to enable the usage of a SharePoint conversion server. The capability is only checked if a valid SharePoint remote converter has been configured.
Default value: false

Handling of temporary files

The DocumentConverter server needs to store files at runtime for different purposes at different volume locations:

  • Persistent files (Cache): Files that should last longer than the runtime of one converter instance are stored in the configurable com.openexchange.documentconverter.cacheDir directory. As the name of the property implies, such files are result cache entries used by multiple converter instances. This directory is monitored at runtime and all files are managed by the converter. Constraints for this directory are set via the converter properties com.openexchange.documentconverter.minFreeVolumeSizeMB, com.openexchange.documentconverter.maxCacheSizeMB, com.openexchange.documentconverter.maxCacheEntries and com.openexchange.documentconverter.cacheEntryTimeoutSeconds.
  • Medium lasting files: These files are only valid for the runtime of one converter instance (e.g. ReaderEngine related runtime config files for each ReaderEngine instance). They are stored within the configurable com.openexchange.documentconverter.scratchDir directory. This directory is not constantly monitored at runtime, but all files contained in the ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp sub directory are managed by the converter during the startup and shutdown phase of one converter server instance. The whole ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp directory gets cleaned up during converter server shutdown as well as during converter server startup. The initial cleanup during startup is necessary because the last converter instance might have aborted for unknown reasons, e.g. a power outage or a VM abort.
  • Short lasting files: These files are stored within the Java VM specific I/O temporary directory, whose location is configurable via the Java VM system property java.io.tmpdir. In most cases, this directory is used by the converter to temporarily store request attachments. The files stored within this directory have a lifetime equal to the duration of the request itself. When the request has finished, the appropriate files are cleaned up. For the converter, this means that e.g. source files that are attached to the request are extracted from the request and stored on disk in order to prevent excessive memory consumption by source file buffers. When the conversion request has finished, the stored temporary file gets deleted.

From 7.10.2 on: The directory specified by the java.io.tmpdir Java system property is not used by the converter anymore. Instead, even short living temporary files are stored at the ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp location. Since this directory is managed, a server shutdown or start cleans up these files automatically. This change affects all files created by the converter implementation itself. Temporary files from other baseline bundles might still be stored within the configured java.io.tmpdir.
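The startup and shutdown cleanup of the oxdc.tmp sub directory can be pictured with the following Python sketch. It only illustrates the idea of emptying a managed directory; the function name reset_managed_tmp_dir is hypothetical, it is an assumption that the directory is recreated empty, and the example runs against a throwaway temporary directory instead of the configured scratchDir.

```python
import os
import shutil
import tempfile


def reset_managed_tmp_dir(scratch_dir):
    """Remove everything below <scratch_dir>/oxdc.tmp and recreate it empty."""
    tmp_dir = os.path.join(scratch_dir, "oxdc.tmp")
    shutil.rmtree(tmp_dir, ignore_errors=True)  # also fine if it does not exist
    os.makedirs(tmp_dir, exist_ok=True)
    return tmp_dir


# Simulate a leftover file from an aborted converter instance.
with tempfile.TemporaryDirectory() as scratch_dir:
    stale_dir = os.path.join(scratch_dir, "oxdc.tmp")
    os.makedirs(stale_dir)
    with open(os.path.join(stale_dir, "stale-request.bin"), "wb") as handle:
        handle.write(b"leftover")

    managed_dir = reset_managed_tmp_dir(scratch_dir)
    leftovers = os.listdir(managed_dir)

print(leftovers)
```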


Setup And Best Practices

The whole documentconverter product consists of 3 different parts: the OX backend documentconverter bundles, written in Java as OSGi plugins, the native readerengine and a native tool called pdf2svg.

Readerengine

The readerengine is, as the name implies, the low level backend engine for the converter implementation. It acts as a general purpose document conversion solution that is able to read a vast number of commonly used document formats, like Microsoft binary and XML formats (MS Word, MS Excel, MS Powerpoint), the whole set of ODF (Open Document Format) document formats and many more, and to export the loaded documents into different output formats like PDF.

The readerengine itself is currently a stripped LibreOffice installation, containing some code enhancements and optimizations provided by Open-Xchange developers. The packaging is also done in-house to have control over the whole development cycle, versioning and other aspects of the product development.

PDF2SVG

The pdf2svg component is a tool that uses libpoppler for reading and processing PDF source documents and Cairo for rendering the processed PDF into SVG, JPEG and PNG formats, which are finally rendered within the OX appsuite client.

Documentconverter OSGi bundles

The OX appsuite backend bundles of the documentconverter are responsible for receiving and processing client conversion requests, scheduling conversion jobs in multi-threaded queues, managing the correct startup and shutdown of (parallel) readerengine and pdf2svg instances, managing the caching of conversion results, communicating with remote converters and remote converter caches etc.

After installation of the documentconverter and related bundles, only minor changes have to be made by the admin in order to get the full functionality available within the appsuite.

Documentconverter system packages

The Java OSGi bundles for the documentconverter functionality are distributed via four different packages in total (revision numbers are omitted here):

  • documentconverter-api: This bundle contains interfaces and other simple classes that are used by other bundles to get access to the documentconverter OSGi service.
  • documentconverter: This bundle contains the core documentconverter implementation as an OSGi service.
  • documentconverter-webservice: This bundle contains the functionality to let the documentconverter offer its functionality as a webservice. In a remote scenario, this bundle needs to be installed only on those machines that are configured to do the real conversions.
  • documentconverter-jolokia: This optional bundle can be installed to make internal conversion statistics available for monitoring via Munin et al. The contained scripts provide access to the statistic data either via the modern Jolokia JMX bridge or via the older showruntimestats functionality provided by the OX appsuite backend.