AppSuite:DocumentConverter Installation Guide
Product Description
Open-Xchange Inc. (“Open-Xchange”) has created a proprietary software program called the Open-Xchange Document Converter (the “Software”), which converts Microsoft (OOXML) and OpenOffice (ODF) office documents to PDF or to HTML5 with embedded SVG (Scalable Vector Graphics) files. In addition, the OX Document Converter converts PDF files to SVG files, and Microsoft files to OpenOffice files and vice versa. Users can therefore view and print documents in the existing infrastructure without any additional plugins.
Introduction
With document preview functionality available in Open-Xchange App Suite, users expect to be able to open as many different document formats as possible. Put differently, a user should not need to care about the format of a document she just received. It should just work, without her knowing anything about document formats at all.
To offer such transparent behaviour to the user, OX App Suite needs to take care of converting a lot of document formats into the display formats needed by OX Files. OX Files is extended with document preview functionality by the module OX Document Viewer.
The conversion functionality is also available as the stand-alone product OX Document Converter. The OX Document Converter WebService allows customers to flexibly integrate document conversion into their offering.
The API reference describes the available actions with request parameters and results.
Requirements
OX Document Converter requires a 64-bit system; 32-bit systems are not supported.
See the Open-Xchange software requirements page for details.
Download and Installation
The OX Document Converter deployment consists of two functional modules that need to be installed separately: the readerengine component and the Document Converter Webservice component.
ReaderEngine
See Readerengine installation instructions
Webservice
See Document converter installation instructions
See Document converter API installation instructions
See Document converter Webservice instructions
Configuration
After both components, readerengine and Web service, have been deployed, the administrator needs to make some adjustments to the configuration of the OX Document Converter installation.
The component readerengine works with the default configuration. The settings are in the file documentconverter.properties located in the directory "/opt/open-xchange/etc" as described below.
A summary of all configuration items, together with each default value, is given below. Although the defaults have been carefully chosen for a real life deployment, the admin should take a closer look at each of them and adjust them accordingly, if necessary.
- com.openexchange.documentconverter.installDir=/opt/readerengine
This item contains the directory of the libreaderengine installation. In general, the libreaderengine installation directory contains the ./program directory, which in turn contains the engine executables.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/opt/readerengine"
- com.openexchange.documentconverter.cacheDir=/var/spool/open-xchange/documentconverter/readerengine.cache
This item contains the directory that will make up the cache for persistent job data. The directory itself does not need to exist at startup, but the parent directory needs to exist and needs to have write permissions for the user running the servlet, in order for the servlet to create this cache directory at runtime.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/var/spool/open-xchange/documentconverter/readerengine.cache"
- com.openexchange.documentconverter.scratchDir=/var/spool/open-xchange/documentconverter/readerengine.scratch
This item contains the directory that will make up the runtime environment for the readerengine. The directory itself does not need to exist at startup, but the parent directory needs to exist and needs to have write permissions for the user running the servlet, in order for the servlet to create this scratch directory at runtime.
VERY IMPORTANT: If not set correctly, the complete web service will be nonfunctional.
Default value: "/var/spool/open-xchange/documentconverter/readerengine.scratch"
- com.openexchange.documentconverter.errorDir=
This item specifies a directory for files that could not be loaded due to an error condition or due to a timeout.
Note: The used disk space will grow with retained files. Files have to be removed manually.
Default value: n/a
- com.openexchange.documentconverter.blacklistFile=/opt/open-xchange/etc/readerengine.blacklist
The list of external document content URLs that are not allowed to be loaded
by the readerengine after loading a document.
The file itself contains a list of (newline separated) regular expressions.
Each external URL is first checked against the list of blacklist URL regular
expressions.
If the external URL matches one blacklist entry, the external URL is
then checked against the list of whitelist URL regular expressions.
The behavior in summary is as follows:
If the URL is not blacklisted and not whitelisted, it is resolved at runtime.
If the URL is blacklisted but not whitelisted, it is not resolved at runtime.
If the URL is not blacklisted but whitelisted, it is resolved at runtime.
If the URL is blacklisted and whitelisted, it is resolved at runtime.
In boolean notation: valid = (!blacklisted) || whitelisted
Please note that the regular expressions need to fully qualify the patterns that
the URL should be checked against.
Upper/Lower cases need to be handled by the regular expression as well.
The file itself needs to be UTF-8 encoded to be read appropriately.
Default value: "/opt/open-xchange/etc/readerengine.blacklist"
- com.openexchange.documentconverter.whitelistFile=/opt/open-xchange/etc/readerengine.whitelist
The list of external document content URLs that are allowed to be loaded
by the readerengine after an external URL matched a blacklist pattern.
The file itself contains a list of (newline separated) regular expressions.
Each external URL is only checked against the list of whitelist URL regular
expressions if it previously matched a pattern in the blacklist file.
If the external URL matches one blacklist entry, the external URL is
then checked against the list of whitelist URL regular expressions.
The behavior in summary is as follows:
If the URL is not blacklisted and not whitelisted, it is resolved at runtime.
If the URL is blacklisted but not whitelisted, it is not resolved at runtime.
If the URL is not blacklisted but whitelisted, it is resolved at runtime.
If the URL is blacklisted and whitelisted, it is resolved at runtime.
In boolean notation: valid = (!blacklisted) || whitelisted
Please note that the regular expressions need to fully qualify the patterns that
the URL should be checked against.
Upper/Lower cases need to be handled by the regular expression as well.
The file itself needs to be UTF-8 encoded to be read appropriately.
Default value: "/opt/open-xchange/etc/readerengine.whitelist"
- com.openexchange.documentconverter.urlLinkLimit=200
The external URL link limit specifies the maximum amount of
valid external internet URLs (filtered by blacklist and whitelist before),
that are tried to get resolved by the engine when loading a document.
When this limit is reached, no more external internet URLs are resolved
for the current document.
Important: Please note that one externally linked object within the document does not automatically correspond to one external URL call. In general, at least two URL calls are necessary to display one externally linked object. In most cases, such additional calls are caused by a format detection that happens before the object data itself is resolved.
Set to -1 for no upper limit or to 0 to disable the resolving of internet URLs completely
Default value: 200
- com.openexchange.documentconverter.urlLinkProxy =
The external URL link proxy entry specifies a proxy server that is used by the readerengine
to resolve external links, contained within a document. Such links are e.g. external http://
graphic links, that are going to be resolved during the filtering process of a readerengine
instance.
Set this entry to the address of the proxy server: host:port
Recognized protocols are http://, https:// and ftp://
Leave empty, if no proxy server should be used by the readerengine
Default value: n/a
- com.openexchange.documentconverter.RemoteBaseUrl =
Use a remote document conversion webservice to do the actual conversion;
Set this entry to the base URL of the remote host http://host[:port]/documentconverterPath;
leave empty if conversion should happen on the local machine
Default value: n/a
From 7.8.2 on: The com.openexchange.documentconverter.RemoteBaseUrl entry is no longer valid in the documentconverter.properties file. The corresponding documentconverter server needs to be set on the OX backend node where the documentconverter-client package has been installed. The name of the new entry is com.openexchange.documentconverter.client.remoteDocumentConverterUrl. The entry itself is located in the documentconverter-client.properties configuration file.
- com.openexchange.documentconverter.RemoteCacheUrls =
Use one or more remote converter cache(s) to speedup the conversion. The first entry, if set, is treated as the remote master cache, receiving cache updates from the local cache. Additional entries are treated as remote slave caches for read purposes only.
Set the (whitespace separated) entries to the base URL(s) of the appropriate remote host(s): http://host[:port]/documentconverterCachePath
Leave empty if only the local filesystem cache should be used
Default value: n/a
- com.openexchange.documentconverter.RemoteSharePointUrl =
Use a remote SharePoint service to do MSO to PDF conversions.
Set this entry to the URL of the SharePoint host: http://host[:port]/_vti_bin/oxconvert.svc/mex?wsdl
If left empty, the corresponding conversion job always returns false.
Default value: n/a
- com.openexchange.documentconverter.RemoteSharePointUsername =
The login user name to be used for calls to the SharePoint service
Default value: n/a
- com.openexchange.documentconverter.RemoteSharePointPassword =
The password to be used for calls to the SharePoint service
Default value: n/a
- com.openexchange.documentconverter.jobProcessorCount=3
This item determines the number of engines working in parallel for job execution. The value needs to be greater than or equal to 1. Best performance is usually achieved with about (n-1) engines, where n is the number of CPU cores available on the machine the service is running on.
Default value: 3
- com.openexchange.documentconverter.jobRestartCount=50
This item determines the maximum number of executed jobs after which a single engine is automatically restarted in order to avoid memory fragmentation and possible memory leaks within one libreaderengine instance.
Default value: 50
- com.openexchange.documentconverter.jobExecutionTimeoutMilliseconds=60000
This item determines the timeout in milliseconds, after which the execution of a single job is terminated.
Default value: 60000
- com.openexchange.documentconverter.maxVMemMB=2048
This item determines the maximum size in megabytes (MB) of virtual memory that each started readerengine process is allowed to consume. If a job tries to consume more VMem than set via this config item, the processing of the current job for the appropriate readerengine process will be aborted and the underlying process is restarted to avoid memory corruption.
Set this value to -1 for no upper limit.
Default value: 2048
- com.openexchange.documentconverter.maxCacheSizeMB=-1
This item determines the maximum size in megabytes (MB) of all persistently cached converter job entries at runtime. A larger value may drastically reduce the time for conversion jobs, e.g. in case of a repeated creation of document previews.
Set this value to -1 for no upper limit.
Default value: -1
- com.openexchange.documentconverter.maxCacheEntries=-1
This item determines the maximum number of converter jobs cached at runtime. The value affects the amount of runtime job information to be cached as well as the number of file entries within the cache directory.
Set this value to -1 for no upper limit.
Default value: -1
- com.openexchange.documentconverter.cacheEntryTimeoutSeconds=2592000
This item determines the timeout in seconds, after which a cached job result is automatically removed from the cache.
Set this value to 0 to disable the timeout based removal of cached job results.
Default value: 2592000
- com.openexchange.documentconverter.enableCacheLookup=false
Setting this flag to true enables the caller of the RemoteInternalPreviewService#getCachedPreviewFor implementation (OfficePreviewService) to retrieve only the cached result of a previous conversion call. If no cache entry exists, no new job is scheduled, since such a job might run for a long period of time, up to the given job timeout.
Set to false to disable the cache lookup within the RemoteInternalPreviewService#getCachedPreviewFor implementation.
Default value: false
- com.openexchange.documentconverter.errorCacheTimeoutSeconds=600
This value determines how long an error, associated with a job hash value, is held within the error cache. As long as the timeout has not been reached, additional RemoteInternalPreviewService#getPreviewFor calls with the same job hash will instantly return the cached error code instead of processing the job again.
Set to 0 to disable the error cache handling.
Default value: 0
- com.openexchange.documentconverter.errorCacheMaxCycleCount=5
This value determines the maximum number of cycles for which a job, associated with a job hash value, is added to the error cache.
One cycle starts after adding a job to the error cache and ends after the errorCacheTimeout has been reached.
After reaching the given maximum cycle count, the job is not removed from the error cache anymore and will be held within the error cache for the rest of the runtime of the current backend instance.
Since the error cache is not persistent, the cycle counter for each job hash is reset after a restart of the
backend instance.
Set to 0 to disable the error cache handling.
Default value: 5
- com.openexchange.documentconverter.servletLocalFileUrls=false
This item determines whether the documentconverter servlet is allowed to handle file URLs of the form file://... A file URL locates files that are locally accessible on the machine the documentconverter backend is running on.
Default value: false
- com.openexchange.capability.sharepointconversion=false
Capability to enable the usage of a SharePoint conversion server; capability is only
checked, if a valid SharePoint remote converter has been configured appropriately
Default value: false
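The blacklist and whitelist rule described for the blacklistFile and whitelistFile items can be illustrated with a short Python sketch. This is not the readerengine implementation: the function names and the example patterns are hypothetical, and the use of a full match is an assumption derived from the note that the regular expressions need to fully qualify the patterns.

```python
import re
from typing import Iterable, List, Pattern


def load_patterns(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regular expression per non-empty line (newline separated file content)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_url_allowed(url: str,
                   blacklist: List[Pattern[str]],
                   whitelist: List[Pattern[str]]) -> bool:
    """Apply the documented rule: valid = (!blacklisted) || whitelisted."""
    # Assumption: a pattern has to match the complete URL (full match).
    blacklisted = any(p.fullmatch(url) for p in blacklist)
    whitelisted = any(p.fullmatch(url) for p in whitelist)
    return (not blacklisted) or whitelisted


# Hypothetical example patterns, one per line as in the blacklist/whitelist files.
example_blacklist = load_patterns([r"https?://.*\.example\.com/.*"])
example_whitelist = load_patterns([r"https?://cdn\.example\.com/.*"])
```

With these example patterns, a URL on ads.example.com is rejected, a URL on cdn.example.com is accepted because the whitelist overrides the blacklist, and an upper-case variant of a blacklisted URL is accepted because the patterns in this sketch do not handle case.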
Handling of temporary files
The DocumentConverter server needs to store files at runtime for different purposes at different volume locations:
- Persistent files (cache): Files that should last longer than the runtime of one converter instance are stored in the configurable com.openexchange.documentconverter.cacheDir directory. As the name of the property implies, such files are result cache entries used by multiple converter instances. This directory is monitored at runtime and all files are managed by the converter. Constraints for this directory are set via the converter properties com.openexchange.documentconverter.minFreeVolumeSizeMB, com.openexchange.documentconverter.maxCacheSizeMB, com.openexchange.documentconverter.maxCacheEntries and com.openexchange.documentconverter.cacheEntryTimeoutSeconds.
- Medium-lived files: These files are only valid for the runtime of one converter instance (e.g. ReaderEngine related runtime config files for each ReaderEngine instance). They are stored in the configurable com.openexchange.documentconverter.scratchDir directory. This directory is not constantly monitored at runtime, but all files contained in the ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp subdirectory are managed by the converter during the startup and shutdown phases of a converter server instance: the whole ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp directory is cleaned up during converter server shutdown as well as during converter server startup. The initial cleanup during startup is necessary because the previous converter instance might have been aborted for unknown reasons, e.g. a power outage or a VM abort.
- Short-lived files: These files are stored in the Java VM specific I/O temporary directory, whose location is configurable via the Java VM system property java.io.tmpdir. In most cases, the converter uses this directory to temporarily store request attachments. The files stored in this directory have a lifetime equal to the duration of the request itself. When the request has finished, the corresponding files are cleaned up. For the converter, this means that e.g. source files that are attached to a request are extracted from the request and stored on disk in order to prevent excessive memory consumption by source file buffers. When the conversion request has finished, the stored temporary file is deleted.
From 7.10.2 on: The directory specified by the java.io.tmpdir Java system property is no longer used by the converter. Instead, even short-lived temporary files are stored at the ${com.openexchange.documentconverter.scratchDir}/oxdc.tmp location. Since these files now reside inside the managed directory, a server shutdown/start cleans them up automatically. This change affects all files created by the converter implementation itself. Temporary files from other baseline bundles might still be stored within the configured java.io.tmpdir.
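The startup and shutdown cleanup of the oxdc.tmp subdirectory can be pictured with the following Python sketch. It only illustrates the described behavior and is not the converter's actual code: the function name reset_tmp_dir is hypothetical, and recreating the empty directory after the cleanup is an assumption.

```python
import shutil
from pathlib import Path


def reset_tmp_dir(scratch_dir: str) -> Path:
    """Remove all leftovers below <scratch_dir>/oxdc.tmp and return the empty directory."""
    tmp_dir = Path(scratch_dir) / "oxdc.tmp"
    # Leftovers may exist if the previous instance was aborted (power outage, VM abort).
    shutil.rmtree(tmp_dir, ignore_errors=True)
    # Assumption: the directory is recreated so that new temporary files can be stored.
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir
```

Calling such a function once at startup and once at shutdown matches the two cleanup points described above. The parent scratch directory is expected to be writable by the user running the backend.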
Setup And Best Practices
The whole documentconverter product consists of 3 different parts: the OX backend documentconverter bundles, written in Java as OSGi plugins, the native readerengine and a native tool called pdf2svg.
Readerengine
The readerengine is, as the name implies, the low level backend engine for the converter implementation. It acts as a general purpose document conversion solution that is able to read a vast number of commonly used document formats, like Microsoft binary and XML formats (MS Word, MS Excel, MS Powerpoint), the whole set of ODF (Open Document Format) document formats and many more, and to export the loaded documents into different output formats like PDF.
The readerengine itself is currently a stripped LibreOffice installation, containing some code enhancements and optimizations provided by Open-Xchange developers. The packaging is also done in-house in order to have control of the whole development cycle, versioning and other aspects of the product development.
PDF2SVG
The pdf2svg component is a tool that uses libpoppler for reading and processing PDF source documents and Cairo for rendering the processed PDF into SVG, JPEG and PNG formats, which are finally displayed within the OX appsuite client.
Documentconverter OSGi bundles
The OX appsuite backend bundles of the documentconverter are responsible for receiving and processing client conversion requests, scheduling conversion jobs in multi-threaded queues, managing the correct startup and shutdown of (parallel) readerengine and pdf2svg instances, managing the caching of conversion results, communicating with remote converters and remote converter caches, etc.
After installation of the documentconverter and related bundles, only minor changes have to be made by the admin in order to get the full functionality available within the appsuite.
Documentconverter system packages
The Java OSGi bundles for the documentconverter functionality are distributed via four different packages in total (revision numbers are omitted here):
- documentconverter-api: This bundle contains interfaces and other simple classes that are used by different other bundles to get access to the documentconverter OSGi service.
- documentconverter: This bundle contains the core documentconverter implementation as an OSGi service.
- documentconverter-webservice: This bundle contains the functionality to let the documentconverter offer its functionality as a webservice. In a remote scenario, this bundle needs to be installed only on those machines that are configured to do the real conversions.
- documentconverter-jolokia: This optional bundle can be installed to make internal conversion statistics available for monitoring via Munin et al. The contained scripts provide access to the statistic data either via the modern Jolokia JMX bridge or the old fashioned showruntimestats functionality, provided by the OX appsuite backend.
General
As the overview of documentconverter system packages in the chapter above shows, the two essential packages to be installed are the documentconverter-api and the documentconverter packages.
If the current OX backend node is configured to only receive client requests and to make remote conversion calls to another node that is configured as a real converter node, no additional packages need to be installed.
In order to act as a real converter node, the readerengine and the pdf2svg packages need to be installed as well.
For an OX backend node to act as a conversion webservice, the documentconverter-webservice package needs to be installed in addition to the documentconverter-api, the documentconverter, the readerengine and the pdf2svg packages.
Installing the documentconverter-jolokia package only makes sense on an OX backend node that acts as a real converter node.
Single OX backend with documentconverter / Single converter node
The simplest possible scenario is definitely a single OX appsuite backend installation with the additional installation of the documentconverter-api, documentconverter, readerengine and pdf2svg packages.
Nevertheless, this configuration can be considered as the prototype configuration block for a real converter node with different setups. Therefore, the description for this building block will contain all details, that are necessary to setup even a highly complex scenario.
Important: Although it is possible to run a real documentconverter on a standard OX backend node together with all other provided services, this setup is not recommended for production use unless the hardware of each single node is sufficient to handle everything appropriately. This is especially true if this node is also configured to act as a calcengine server, which itself has a rather high resource consumption, in particular with regard to memory.
Document conversion is a process that might run in the background all of the time due to permanent client requests. In such cases, the standard OX backend might be slowed down to a degree that is not acceptable anymore. This is particularly important in case of inferior hardware resources.
So, please use this all-in-one scenario only for quick evaluation purposes, demo setups or if there is really no dedicated conversion hardware available.
Configuration and Hardware
After installation of the individual documentconverter related packages, the default values for the documentconverter properties can be found in the file /opt/open-xchange/etc/documentconverter.properties. Each entry is documented and has been assigned a default value that is in most cases sufficient for a first setup. Please make sure that all listed directory paths have the appropriate write permissions for the OX backend user.
Important: Each real converter node needs its own scratch and cache directories. Do not point these settings at directories that are shared between nodes (e.g. a common directory on an NFS share), since file writes to the same cache/scratch directories by different backend nodes are currently not synchronized and will most likely corrupt some files over time. It is possible to use a shared network drive for the cache and scratch directories, as long as the directories finally used are unique for each node. In this case, the documentconverter.properties file on each node has to be adjusted accordingly to guarantee such a unique, per-node directory structure.
In addition, you should check the following entries for their validity/sensibility:
- For the scratch directory, it is recommended to use a very fast volume (e.g. an SSD or even a RAM disk, if memory allows), since temporary files are permanently created and deleted within this directory. The whole documentconverter performance will benefit from a high speed volume in any case.
- The cache sizes configured in the properties file should be checked against the given hardware and volume sizes and adjusted accordingly, if wanted/needed. Cache entries are only read/written once during a conversion, so the medium used does not need to be a high speed one. For the cache, capacity is more important than speed, since cached files accelerate the whole conversion process the most.
- The number of parallel readerengine instances is configured via the property com.openexchange.documentconverter.jobProcessorCount. The default value of three should fit almost all modern hardware. Please adjust this value depending on the number of CPU cores actually available in your case. As long as the current node is configured as a conversion-only node, the number of parallel processes should be in the order of (CPU core count - 1).
- The com.openexchange.documentconverter.maxVMemMB config entry is the virtual memory assigned to each readerengine process. You don't need to have (jobProcessorCount * maxVMemMB) of physical memory, since large parts of this VMem are shared between all running readerengine instances. So, there's no simple formula to calculate the optimum total amount of memory. To begin with, a maxVMemMB value of 2048 is sufficient. The node the documentconverter is running on should have at least 8GB RAM, with more RAM and more processors being better, of course. A good recommendation would be a total amount of 16GB RAM for the node and a CPU core count of 4. A dual CPU machine with 4 cores each seems to be a better recommendation if hardware is not a crucial point in the whole setup.
- The upper limit for the Java VM memory allocation should not be below 2048MB (-Xmx2048m). Please adjust this value accordingly, but don't set it too high, since values greater than about 4GB will have a significant negative impact on the Java VM Garbage Collector. 2048MB seemed to be a very sensible value for the -Xmx limit during our tests.
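The sizing hints from the list above can be turned into a small Python helper for a first sanity check. This is only a sketch under stated assumptions: the function names are hypothetical, the (CPU core count - 1) rule is applied for a conversion-only node, and the "about 4GB" upper bound for -Xmx is modelled as exactly 4096MB.

```python
import os
from typing import Optional


def suggested_job_processor_count(cpu_cores: Optional[int] = None) -> int:
    """Return (CPU core count - 1) for a conversion-only node, but never less than 1."""
    if cpu_cores is None:
        # os.cpu_count() may return None, in which case a single core is assumed.
        cpu_cores = os.cpu_count() or 1
    return max(1, cpu_cores - 1)


def xmx_in_recommended_range(xmx_mb: int) -> bool:
    """Check the -Xmx value against the recommendation: not below 2048MB, not above about 4GB."""
    # Assumption: "about 4GB" is treated as a hard limit of 4096MB in this sketch.
    return 2048 <= xmx_mb <= 4096
```

For the recommended machine with 4 CPU cores, suggested_job_processor_count(4) yields 3, which corresponds to the default value of com.openexchange.documentconverter.jobProcessorCount.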
OX backend(s) with one remote documentconverter
A remote documentconverter setup uses (at least) one OX backend, configured as a real documentconverter as described in the chapter above. This documentconverter node should not handle standard OX backend requests, but only remote conversion requests from the standard OX backends.
The standard OX backend needs to have the documentconverter-api and the documentconverter packages installed. readerengine and pdf2svg packages are not needed on the standard OX backends.
For the remote converter node, the documentconverter-webservice package needs to be installed beside the documentconverter-api, documentconverter, readerengine and pdf2svg packages.
In order to request remote conversions, the only entry that needs to be set within the documentconverter.properties file on the standard OX backend(s) is the com.openexchange.documentconverter.RemoteBaseUrl entry. In general, this is an HTTP URL consisting of the remote conversion node's IP address followed by the /documentconverterws path:
com.openexchange.documentconverter.RemoteBaseUrl=http://host[:port]/documentconverterws
No other entry from the documentconverter.properties is used on the standard OX backend if the RemoteBaseUrl is set.
In order to check whether the remote documentconverter is set up correctly, you can enter the complete RemoteBaseUrl into a browser. You should then see a page stating that the documentconverter is running.
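The same check can be scripted instead of using a browser. The following Python sketch is not part of the product: the function name converter_is_reachable is hypothetical, and treating an HTTP status of 200 as "the documentconverter is running" is an assumption, since the exact content of the status page is not specified here.

```python
import http.client
import urllib.error
import urllib.request


def converter_is_reachable(remote_base_url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to the RemoteBaseUrl is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(remote_base_url, timeout=timeout) as response:
            # Assumption: the status page of a running converter is delivered with status 200.
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers unreachable hosts, HTTP error codes, timeouts and malformed URLs.
        return False
```

A call such as converter_is_reachable("http://host:port/documentconverterws"), with host and port replaced by the values of the remote conversion node, could be run from each standard OX backend node to verify network connectivity as well.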
OX backend cluster with one remote documentconverter
The OX backend cluster is the standard OX backend scenario. It is similar, if not identical, to the standard OX backend with one documentconverter scenario above.
The only thing that you need to take care of is to add the appropriate documentconverterws ProxyPass entry to the proxy_http.conf configuration file of e.g. the Apache web server:
<Proxy /documentconverterws>
ProxyPass balancer://oxcluster/documentconverterws
</Proxy>
Sample cluster installation
A typical, basic cluster setup consists of two or more standard OX backend nodes (e.g. samplehost-node-1, samplehost-node-2, ...., samplehost-node-n), with each of these standard backend nodes having their com.openexchange.documentconverter.RemoteBaseUrl config entry set to a dedicated documentconverter cluster node (e.g. samplehost-node-x).
OX backend cluster with two or more documentconverters
To reduce the load on one documentconverter within a cluster installation, you can add more documentconverter nodes to the given cluster. The general setup is the same as with the OX backend cluster with one remote documentconverter setup.
In addition, care needs to be taken to add the appropriate JSESSIONID cookie handling to the cluster setup. This is done for OX clusters in general, but mentioned here for completeness. Setting the JSESSIONID cookie is essential in a multi documentconverter setup, since the Viewer client application uses stateful calls into the OX backend to retrieve the single pages of a document on demand.
A typical Proxy configuration with proper stickysession handling looks as follows:
<Proxy balancer://oxcluster>
Order deny,allow
Allow from all
BalancerMember http://docs-develop-cluster-b1:8009 timeout=100 smax=0 ttl=60 retry=60 loadfactor=50 keepalive=On route=OX1
BalancerMember http://docs-develop-cluster-b2:8009 timeout=100 smax=0 ttl=60 retry=60 loadfactor=50 keepalive=On route=OX2
ProxySet stickysession=JSESSIONID
SetEnv proxy-initial-not-pooled
SetEnv proxy-sendchunked
</Proxy>
Remote caching with two or more documentconverters
As mentioned above, using a cache considerably improves performance for subsequent conversion requests. Having more than one real documentconverter node configured makes it possible to use the remote caching feature.
In general, one converter acts as a so called master cache in this scenario. It is set up as usual and as described in one of the chapters above.
The second (third, ...) converters are also set up as usual, but have the com.openexchange.documentconverter.RemoteCacheUrls entry in their documentconverter.properties set, so that the first URL (of possibly several URLs) points to the master cache converter.
The URL to be used is the standard RemoteBaseURL with an additional /cache path set:
com.openexchange.documentconverter.RemoteCacheUrls=http://mastercachehost[:port]/documentconverterws/cache
In this case, when a conversion request arrives and the local cache does not contain a matching conversion result, all configured remote cache hosts are queried for a cached result.
If one of the remote cache hosts returns a valid cache result, the cache entry is locally replicated and the cache result is returned.
If none of the remote cache hosts returns a valid cache result, a local conversion is performed, a local cache entry is created and the cache result is transferred to the first configured remote cache host (the master cache host).
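The lookup order described above can be summarized in a short Python sketch. It is a simplified model, not the converter's implementation: caches are represented by plain dictionaries keyed by a job hash, and the function name get_result as well as the convert callback are hypothetical.

```python
from typing import Callable, Dict, List

Cache = Dict[str, bytes]


def get_result(job_hash: str,
               local_cache: Cache,
               remote_caches: List[Cache],
               convert: Callable[[], bytes]) -> bytes:
    """Resolve a conversion result: local cache, then remote caches, then local conversion."""
    # 1. Search the local cache first.
    if job_hash in local_cache:
        return local_cache[job_hash]
    # 2. Query all remote caches in their configured order (the first one is the master).
    for remote in remote_caches:
        result = remote.get(job_hash)
        if result is not None:
            local_cache[job_hash] = result  # replicate the entry locally
            return result
    # 3. No cache hit: convert locally, cache locally, and update the master cache only.
    result = convert()
    local_cache[job_hash] = result
    if remote_caches:
        remote_caches[0][job_hash] = result
    return result
```

In this model, the additional (slave) caches are only read, which mirrors the description of com.openexchange.documentconverter.RemoteCacheUrls, where only the first entry receives cache updates.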