Context Preprovisioning
The standard createcontext call (whether via CLT, RMI, or SOAP) is designed to work in all possible corner cases, such as concurrent createcontext, createuser, and deletecontext calls. This includes the potential generation and removal of DB schemas and the correct counting of schemas on the targets (DBs, schemas, filestores), so that the context is allocated correctly according to the rules (weights, etc.).
To achieve this, a number of expensive DB queries and extensive locking need to take place. We optimized this as far as possible while still meeting the "general purpose" requirements outlined above, but it is still costly.
For other use cases, many of these queries and locks are not necessary. In particular, if an implementation project can set aside a "pre-provisioning" phase in which only createcontext takes place and all schemas can be pre-generated, many of these allocation queries and locks can be skipped.
This article describes the prerequisites of this "fast mode preprovisioning" and how to execute it.
Prerequisites
- All required DB schemas are pre-generated (see below)
- Only createcontext, no other provisioning calls, no HTTP API calls
- Only one provisioning host (which usually does not become the bottleneck; the bottleneck is usually the DB)
Execution
Schema pre-generation
To pre-generate schemas, you can temporarily set CONTEXTS_PER_SCHEMA in /opt/open-xchange/etc/plugin/hosting.properties to the value 1. Every subsequent createcontext will then trigger the generation of a schema. We therefore recommend generating the schemas with placeholder contexts that are not actually used for anything. As long as these placeholder contexts are not deleted, the schemas are not torn down either.
With CONTEXTS_PER_SCHEMA=1 in place, you create as many contexts as you need schemas for the final number of contexts to be provisioned. Then you change the setting back to its original value (as per the system high level design, usually something between 1000 and 7000).
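The temporary switch of the setting can be scripted. The following is a minimal sketch, assuming the key is stored as a plain CONTEXTS_PER_SCHEMA=<number> line without surrounding spaces. The helper name set_contexts_per_schema is hypothetical, and GNU sed is assumed for the -i option.

```shell
#!/bin/bash
# Sketch: change CONTEXTS_PER_SCHEMA in a properties file and print the
# previous value so that it can be restored afterwards.
# Assumption: the key is stored as a plain "CONTEXTS_PER_SCHEMA=<number>" line.

set_contexts_per_schema() {
    local file=$1 new_value=$2 old_value
    # Remember the current value.
    old_value=$(sed -n 's/^CONTEXTS_PER_SCHEMA=//p' "$file" | head -n 1)
    if [ -z "$old_value" ]; then
        echo "CONTEXTS_PER_SCHEMA not found in $file" >&2
        return 1
    fi
    # Keep a backup copy next to the original file, then rewrite the line.
    cp "$file" "$file.bak"
    sed -i "s/^CONTEXTS_PER_SCHEMA=.*/CONTEXTS_PER_SCHEMA=$new_value/" "$file"
    echo "$old_value"
}

# Example (commented out, as it edits the live configuration):
# props=/opt/open-xchange/etc/plugin/hosting.properties
# old=$(set_contexts_per_schema "$props" 1)
# ... create the placeholder contexts here ...
# set_contexts_per_schema "$props" "$old" >/dev/null
```

Whether the provisioning service has to be restarted or reloaded for the changed value to take effect is not covered here and should be verified for your version.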
Update: as of 7.8.3, there is also a tool called createschema to bootstrap schemas without the aforementioned trick. However, these schemas are only suitable for schema-name based fastmode, so the aforementioned trick is still required for bootstrapping schemas for use with in-memory or automatic mode.
Context Creation in Fast Mode
There are a number of options you need to set to run in "fast mode".
- To save OX from counting contexts per filestore, use the --destination-store-id <id> argument to specify a target filestore.
- The same reasoning applies to the target database: use --destination-database-id <destination-database-id>.
- To then use the "fast mode" (skipping locks and some checks), use one of the following two options. For a discussion, see below.
- --schema-name <schema-name>
- --schema-strategy=in-memory
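As an illustration of how these options combine, here is a small dry-run sketch that only assembles and prints the option string for one createcontext call. The helper build_fastmode_args is hypothetical, and the IDs and schema name are example values. The option names are the ones listed above.

```shell
#!/bin/bash
# Sketch: assemble the fast mode options for one createcontext call.
# build_fastmode_args is a hypothetical helper; nothing is executed against OX.

build_fastmode_args() {
    local store_id=$1 db_id=$2 schema=$3
    local args="--destination-store-id $store_id --destination-database-id $db_id"
    if [ -n "$schema" ]; then
        # The caller does the book-keeping and names the target schema.
        args="$args --schema-name $schema"
    else
        # No schema given: leave the book-keeping to OX (in-memory strategy).
        args="$args --schema-strategy=in-memory"
    fi
    echo "$args"
}

# Dry run: print the option strings instead of calling createcontext.
build_fastmode_args 2 3 schema_1
build_fastmode_args 2 6
```

The printed options would be appended to the usual createcontext arguments (credentials, context ID, admin user data and so on).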
Most importantly, context creation parallelizes reasonably well, at least up to the number of DB hosts in your setup and perhaps even beyond. You should therefore create contexts in parallel, but with fewer parallel streams than the number of schemas you pre-generated. The optimal number of parallel createcontext "streams" depends on your infrastructure and has to be determined experimentally. On single-node development machines, up to 10 concurrent streams are reasonable. On larger test platforms (and also on a production platform), we found that up to 100 streams make sense and increase throughput. If increasing the number of streams further increases only latency and not throughput, reduce the number of streams again.
To implement parallel streams of createcontext, you can do one of the following:
- The simplest (and easiest to understand) approach is to create one huge .csv file designed to be fed into the CLT createcontext --csv. You then split that file into as many chunks as you want parallel streams, and invoke the corresponding number of createcontext --csv tools in parallel (from a shell script, for example).
- A more sophisticated approach is to write a program which creates the required data (or reads it from some input) and uses a multithreaded worker model to create the required number of streams.
- You can use our gatling tools to load-test your system. You will probably need help setting up this stuff, so contact your OX consultant for assistance.
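The first approach can be sketched without GNU-specific tools. The following example splits one CSV file round-robin into a given number of chunk files and repeats the header line in each chunk. The function name and file names are hypothetical, and the CSV is assumed to have exactly one header line and no multi-line fields.

```shell
#!/bin/bash
# Sketch: split one CSV (with a single header line) round-robin into N chunk
# files named <prefix>1.csv ... <prefix>N.csv, repeating the header in each.

split_csv_round_robin() {
    local input=$1 streams=$2 prefix=$3
    awk -v n="$streams" -v prefix="$prefix" '
        NR == 1 { header = $0; next }      # remember the header line
        {
            file = prefix ((NR - 2) % n + 1) ".csv"
            if (!(file in seen)) {         # first row of this chunk: write header
                print header > file
                seen[file] = 1
            }
            print > file                   # awk keeps the file open and appends
        }
    ' "$input"
}

# Example (commented out): 10 chunks, then one createcontext per chunk.
# split_csv_round_robin contexts.all.csv 10 chunk-
# for f in chunk-*.csv; do
#     /opt/open-xchange/sbin/createcontext -A oxadminmaster -P secret --csv-import "$f" &
# done
# wait
```

Round-robin distribution keeps the chunks equally sized, so all streams finish at roughly the same time.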
Schema-Name vs Schema-Strategy
These are two different implementations with the same goal: running in "fast mode", skipping locks and some checks.
The approach of the --schema-name method is to require the caller to do book-keeping of how many contexts are allocated where. Usually some simple round-robin approach is sufficient here. But be aware that the schema names you supply must match the DB ID you supply. So usually you have a tool which creates the createcontext requests, and this tool needs to be coded such that the schema names it gives are valid for the given DB ID.
The approach of the --schema-strategy=in-memory method is to not require any logic from the caller, but rather to do the required book-keeping in memory inside OX.
Both approaches seem to work in our benchmarks with similar achieved throughput. Theoretically the in-memory approach is easier to integrate into existing workflows, but the schema-name approach is much simpler inside of OX and potentially more robust.
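For the --schema-name method, the caller-side book-keeping can be as small as a round-robin over a fixed list of targets. The following is a minimal sketch with example DB IDs and schema names. The helper pick_target and the list format are hypothetical. Each schema is stored together with the DB ID it belongs to, so that the two values always match.

```shell
#!/bin/bash
# Sketch: caller-side book-keeping for --schema-name with plain round-robin.
# Example values only. Each entry is "<db_id>:<schema>", so a schema name is
# always paired with the DB ID it belongs to.
TARGETS="3:schema_1 6:schema_2 9:schema_3 3:schema_4 6:schema_5 9:schema_6"

# Print the "<db_id>:<schema>" entry for the n-th context (n starts at 0).
pick_target() {
    local n=$1
    set -- $TARGETS              # load the entries as positional parameters
    shift $(( n % $# ))          # skip ahead to the entry for this context
    echo "$1"
}

# Dry run: show which options each context would get.
for n in 0 1 2 3 4 5 6; do
    target=$(pick_target "$n")
    db_id=${target%%:*}
    schema=${target#*:}
    echo "context #$n: --destination-database-id $db_id --schema-name $schema"
done
```

Ordering the list so that consecutive entries point to different DB IDs spreads consecutive contexts over all database hosts.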
Sample Script
Below we provide a sample script that can help you use fastmode for quick preprovisioning of a large number of contexts.
For production, we still recommend a full solution leveraging SOAP, including error checking, throttling, etc. However, to quickly fill a test or benchmark system with a lot of contexts, a script like the example below can be helpful.
Note: this is completely unsupported and meant to serve as starting point for your own experiments.
Stripped of the documentation and option parsing, the script is quite short. It builds the rows of a CSV file in a loop and then uses GNU split to distribute the full file round-robin over multiple part files. On top of that comes some handling for header lines, as well as support for a "schemas.csv" file which specifies the available database IDs, schema names and filestore IDs.
The script finishes after creating the CSV files, leaving the actual createcontext --csv invocation to the user. It generates a sample execute.sh file which can serve as a starting point for that task.
#!/bin/bash

# Defaults; most of them can be overridden by command line options.
start_cid=1001
end_cid=1100
parallel=10
output_prefix=contexts-
context_name_prefix=performance
access_combination_name=all
timezone=Europe/Berlin
language=en_US
quota=100
admin_user=oxadmin
admin_pass=secret
master_user=oxadminmaster
master_pass=secret
schemas_csv=schemas.csv

# usage has to be defined before option parsing, which may call it.
usage() {
cat >&2 <<EOF
Usage: $0 [-A master_user] [-P master_pass] [-s start_cid] [-e end_cid]
       [-p parallel] [-o output_prefix] [-c schemas_csv]

Use to create contexts-*.csv files suitable for provisioning using
createcontext --csv-import.

This script only generates files; it does not call createcontext --csv
itself.

This script expects to find a schemas.csv file with information about the
schema select strategy to employ. Available are schema-name fastmode,
in-memory fastmode, and automatic (non-fast) mode.

Fastmode provisioning using schema-name selection strategy can be leveraged
with a csv file with the columns

destination-store-id,destination-database-id,schema

In order to avoid deadlocks by parallel provisioning into the same schema
and for best performance by spreading out over all available database
instances, it is important to make sure to go round-robin over database-ids
and schemas.

For example, if you have one filestore ID (2), and the DBs with schemas...

DB ID 3, schemas schema_1, schema_4, schema_7, ...
DB ID 6, schemas schema_2, schema_5, schema_8, ...
DB ID 9, schemas schema_3, schema_6, schema_9, ...
...

then you should order them in the csv file as follows:

destination-store-id,destination-database-id,schema
2,3,schema_1
2,6,schema_2
2,9,schema_3
2,3,schema_4
2,6,schema_5
2,9,schema_6
2,3,schema_7
2,6,schema_8
2,9,schema_9
...

(Note: strictly speaking to avoid deadlocks it is required to ensure no
concurrent provisioning into the same schema can occur. This can only be
ensured in the framework of this tool if the number of schemas is an integer
multiple of the number of parallel provisioning hosts employed, so that each
schema is only provisioned from within the same CSV file. The other option
would be to use a more sophisticated provisioning framework like our gatling
framework or a custom solution.)

It is also possible to use this script with in-memory fastmode with a CSV
file like

destination-store-id,destination-database-id,schema-strategy
2,3,in-memory
2,6,in-memory
2,9,in-memory

and with automatic mode with a file like

destination-store-id,destination-database-id,schema-strategy
2,3,automatic
2,6,automatic
2,9,automatic

or even just:

schema-strategy
automatic

Note: Make sure you pre-generated schemas. For schema-name mode, you can use
something like

for (( i=1; i<=70; i++ ))
do
  for j in 3 6 9
  do
    /opt/open-xchange/sbin/createschema -A oxadminmaster -P secret -i \$j
  done
done

while for in-memory or automatic mode you need to pre-generate the schemas
using ordinary createcontext with temporary setting of CONTEXTS_PER_SCHEMA=1.
EOF
exit 1
}

while getopts ":A:P:s:e:p:o:c:" o; do
    case "${o}" in
        A) master_user=${OPTARG} ;;
        P) master_pass=${OPTARG} ;;
        s) start_cid=${OPTARG} ;;
        e) end_cid=${OPTARG} ;;
        p) parallel=${OPTARG} ;;
        o) output_prefix=${OPTARG} ;;
        c) schemas_csv=${OPTARG} ;;
        *) usage ;;
    esac
done
shift $((OPTIND-1))

# Sanity check on the context ID range.
if [ "$end_cid" -lt "$start_cid" ]
then
    usage
fi

# Open schemas.csv on file descriptor 3 and read its header line.
exec 3<"$schemas_csv"
read headers <&3

# Work directory for all generated files.
d=$(mktemp --tmpdir=. -d -t contexts-XXXXX)
cp /dev/null $d/contexts.all.csv

# One CSV row per context; the trailing columns come from schemas.csv,
# which is re-opened from the top whenever its end is reached.
for (( i=$start_cid; i<=$end_cid; i++ ))
do
    read stuff <&3 || {
        exec 3<"$schemas_csv"
        read _headers <&3
        read stuff <&3
    }
    echo "\"$i\",\"$admin_user\",\"$admin_pass\",\"Admin Context $i\",\"Admin\",\"Context $i\",\"$admin_user@$context_name_prefix$i\",\"$quota\",\"$access_combination_name\",\"$language\",\"$timezone\",$stuff" >> $d/contexts.all.csv
done

# Distribute the rows round-robin over $parallel part files.
cat $d/contexts.all.csv | split -u -n r/$parallel --numeric-suffixes=1 --additional-suffix=.csv - $d/$output_prefix

# Prepend the header line to every part file.
sed -i -e "1i contextid,username,password,displayname,givenname,surname,email,quota,access-combination-name,language,timezone,$headers" "$d/$output_prefix"*.csv

# Generate a sample runner which starts one createcontext per part file.
cat >$d/execute.sh <<EOF
#!/bin/bash
for file in *-*.csv
do
    /opt/open-xchange/sbin/createcontext -A $master_user -P $master_pass --csv-import \$file &
done
wait
# alternatively, using gnu parallel can be more convenient and powerful
#parallel '/opt/open-xchange/sbin/createcontext -A $master_user -P $master_pass --csv-import {}' ::: *-*.csv
EOF

echo $d
exit 0
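The note in the usage text about concurrent provisioning into the same schema can be turned into a quick pre-flight check for schema-name mode. The following is a minimal sketch, assuming a schemas.csv with one header line and one row per schema. The helper check_schema_multiple is hypothetical. Because rows are assigned to schemas and to part files round-robin, a schema stays within one part file exactly when the number of schema rows is a multiple of the number of part files.

```shell
#!/bin/bash
# Sketch: verify that the number of schema rows is an integer multiple of the
# number of parallel part files, so each schema ends up in exactly one file.

check_schema_multiple() {
    local schemas_csv=$1 parallel=$2 rows
    # Count non-empty data rows, skipping the header line.
    rows=$(awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$schemas_csv")
    if [ "$rows" -eq 0 ] || [ $(( rows % parallel )) -ne 0 ]; then
        echo "warning: $rows schema rows do not divide evenly into $parallel part files" >&2
        return 1
    fi
    echo "ok: $rows schema rows, $parallel part files"
}

# Example (commented out):
# check_schema_multiple schemas.csv 10 || exit 1
```

The check is only meaningful for schema-name mode, where every row of schemas.csv names exactly one schema.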