Context Preprovisioning

From Open-Xchange

The standard createcontext call (be it via CLT, RMI, or SOAP) is designed to work under all possible corner cases, such as concurrent createcontext, createuser, and deletecontext calls, including the potential generation and removal of DB schemas and the correct counting of entities on the targets (DBs, schemas, filestores) so that the context is allocated correctly according to the configured rules (weights, etc.).

In order to achieve this, a number of expensive DB queries and extensive locking need to take place. We have optimized this as far as possible while still meeting the "general purpose" requirements outlined above, but it is still costly.

For other use cases, many of these queries and locks are not necessary. In particular, if an implementation project allows for a dedicated "pre-provisioning" phase in which only createcontext takes place and all schemas can be pre-generated, many of these allocation queries and locks can be skipped.

This article describes the prerequisites of this "fast mode" preprovisioning and how to execute it.

Prerequisites

  • All required DB schemas are pre-generated (see below)
  • Only createcontext, no other provisioning calls, no HTTP API calls
  • Only one provisioning host (this usually does not become the bottleneck; the bottleneck is usually the DB)

Execution

Schema pre-generation

To pre-generate schemas, you can temporarily set CONTEXTS_PER_SCHEMA in /opt/open-xchange/etc/plugin/hosting.properties to the value 1. Every subsequent createcontext will then trigger the generation of a new schema. We therefore recommend generating the schemas with placeholder contexts which are not actually used for anything; by not deleting them, you also make sure the schemas are not torn down again.

So with CONTEXTS_PER_SCHEMA=1 in place, you create one placeholder context for each schema you will need to hold your final number of contexts. Then you change the setting back to its original value (as per the system high-level design, usually somewhere between 1000 and 7000).
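
As a sketch, the placeholder loop could look as follows. The context ID range, credentials, and quota are made-up examples, and the short createcontext flags shown here should be verified against your installation's createcontext --help before use; the dry-run mode only prints the commands it would execute.

```shell
#!/bin/bash
# Sketch: pre-generate schemas by creating one placeholder context per
# schema while CONTEXTS_PER_SCHEMA=1 is set. All IDs and credentials
# below are made-up examples; verify the flags against createcontext --help.

first_placeholder=900001   # assumed unused context ID range
num_schemas=5              # one schema is created per placeholder context
dry_run=true               # set to false to actually invoke the CLT
cmds=()

for (( i=0; i<num_schemas; i++ ))
do
    cid=$(( first_placeholder + i ))
    cmd="/opt/open-xchange/sbin/createcontext -A oxadminmaster -P secret -c $cid -u oxadmin -p secret -d \"placeholder $cid\" -q 10"
    cmds+=("$cmd")
    if $dry_run
    then
        echo "$cmd"        # dry run: print instead of executing
    else
        eval "$cmd"
    fi
done
```

Keeping the placeholder contexts in a dedicated, otherwise unused ID range makes it easy to recognize them later and avoids collisions with the contexts provisioned in fast mode.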

Update: as of 7.8.3, there is also a tool called createschema which bootstraps schemas without the trick described above. However, these schemas are only suitable for the schema-name based fast mode, so the trick is still required for bootstrapping schemas for use with the in-memory or automatic mode.
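
A minimal sketch of bootstrapping schemas with createschema, mirroring the invocation shown in the sample script's usage text further below; the database IDs and credentials are example values, and the dry-run mode only prints the commands:

```shell
#!/bin/bash
# Sketch: bootstrap 10 schemas on each of three databases with createschema
# (available as of 7.8.3). DB IDs 3, 6, 9 and the credentials are examples.

dry_run=true
count=0

for (( i=1; i<=10; i++ ))
do
    for j in 3 6 9    # example destination database IDs
    do
        cmd="/opt/open-xchange/sbin/createschema -A oxadminmaster -P secret -i $j"
        count=$(( count + 1 ))
        if $dry_run; then echo "$cmd"; else $cmd; fi
    done
done
```

Remember that schemas created this way are only usable with the schema-name based fast mode, not with the in-memory or automatic strategy.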

Context Creation in Fast Mode

There are a number of options you need to set to run in "fast mode".

  • In order to save OX from counting contexts per filestore, you should use the --destination-store-id <id> argument to specify a target filestore.
  • The same reasoning applies to the target database, --destination-database-id <destination-database-id>
  • Then, to use the "fast mode" (skipping locks and some checks), you can use one of the following two options. For a discussion, see below.
    • --schema-name <schema-name>
    • --schema-strategy=in-memory

Then, the most important thing is: context creation parallelizes reasonably well, at least up to the number of DB hosts in your setup, and perhaps even beyond. You should therefore implement some scheme of creating contexts in parallel, but with less parallelism than the number of schemas you pre-generated. The optimal number of parallel createcontext "streams" depends on your infrastructure and has to be determined heuristically. On single-node development machines, values of up to 10 concurrent streams are reasonable. On larger test platforms (and also on production platforms) we found that up to 100 streams make sense and increase throughput. However, if increasing the number of streams further only increases latency rather than throughput, you should reduce the number of streams again.

To implement parallel streams of createcontext, you can do one of the following:

  • The simplest (and easiest to understand) approach is to create one huge .csv file designed to be fed into the createcontext --csv CLT. You then split that file into a number of chunks matching the number of parallel streams you want to run, and invoke (from a shell script, for example) the corresponding number of createcontext --csv processes in parallel.
  • A more sophisticated approach is to write a program which creates the required data (or reads it from some input) and uses a multithreaded worker model to create the required number of streams.
  • You can use our gatling tools to load-test your system. You will probably need help setting this up, so contact your OX consultant for assistance.
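
The split-based approach from the first bullet can be sketched as follows. The file names and the toy CSV content are purely illustrative (a real file would come from your data generator), and the commented createcontext invocation at the end is the pattern, not something this sketch executes:

```shell
#!/bin/bash
# Sketch: distribute a context CSV round-robin over N chunk files, one
# per parallel createcontext stream. Uses GNU split's -n r/N mode.

streams=4
tmpdir=$(mktemp -d)

# build a toy CSV with 12 data rows (stand-in for the real generated file)
for i in $(seq 1001 1012)
do
    echo "\"$i\",\"oxadmin\",\"secret\"" >> "$tmpdir/contexts.all.csv"
done

# -n r/N deals lines out round-robin into N files: contexts-01.csv ...
split -n r/$streams --numeric-suffixes=1 --additional-suffix=.csv \
    "$tmpdir/contexts.all.csv" "$tmpdir/contexts-"

# each chunk would then be fed to a backgrounded createcontext, e.g.:
# for f in "$tmpdir"/contexts-*.csv; do
#     /opt/open-xchange/sbin/createcontext -A oxadminmaster -P secret --csv-import "$f" &
# done
# wait
```

Round-robin distribution (rather than contiguous chunks) matters when the CSV rows rotate over schemas: it keeps each chunk file rotating over all schemas as well, so no two concurrent streams hammer the same schema back to back.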

Schema-Name vs Schema-Strategy

These are two different implementations with the same goal to run in "fast mode", skipping locks and some checks.

The approach of the --schema-name method is to require the caller to do the book-keeping of how many contexts are allocated where. Usually a simple round-robin approach is sufficient here. Be aware, however, that the schema names you supply must match the DB ID you supply. So usually you have a tool which creates the createcontext requests, and this tool needs to be coded such that the schema names it emits are valid for the given DB ID.
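
The round-robin book-keeping can be sketched like this, using the example layout from the sample script's usage text (filestore ID 2, DB IDs 3, 6, 9, schemas named schema_N); in a real deployment the schema names must of course actually exist on the given DB:

```shell
#!/bin/bash
# Sketch: round-robin assignment of schema names to database IDs, as the
# schema-name strategy requires. Filestore ID 2 and DB IDs 3, 6, 9 are
# example values only.

filestore=2
dbs=(3 6 9)            # example destination database IDs
total_schemas=9
rows=()

for (( s=1; s<=total_schemas; s++ ))
do
    # schema_N lives on dbs[(N-1) mod #dbs], matching a round-robin creation order
    db=${dbs[$(( (s - 1) % ${#dbs[@]} ))]}
    rows+=("$filestore,$db,schema_$s")
    echo "$filestore,$db,schema_$s"
done
```

The emitted rows match the "destination-store-id,destination-database-id,schema" ordering recommended in the sample script's usage text, so they can be pasted into a schemas.csv directly.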

The approach of --schema-strategy=in-memory is to require no logic from the caller; instead, the required book-keeping is done in memory by OX.

Both approaches achieved similar throughput in our benchmarks. In theory, the in-memory approach is easier to integrate into existing workflows, but the schema-name approach is much simpler inside of OX and potentially more robust.

Sample Script

Below we provide a sample script that can help you use fast mode for quickly preprovisioning a large number of contexts.

For production we still recommend a full solution leveraging SOAP, including error checking, throttling, etc. However, to quickly fill a test or benchmark system with a large number of contexts, a script like the example given below can be helpful.

Note: this is completely unsupported and meant to serve as a starting point for your own experiments.

Stripped of the documentation and option parsing, the script is quite short. It builds the rows of a CSV file in a loop and later leverages GNU split to distribute the full file round-robin across multiple part files. On top of that come some handling for header lines as well as support for supplying a "schemas.csv" file, which allows specifying the available database IDs, schema names, and filestore IDs.

The script finishes after creating the CSV files, leaving the actual createcontext --csv invocation to the user. It provides a sample execute.sh file which can serve as a starting point for that task.

#!/bin/bash

start_cid=1001
end_cid=1100
parallel=10
output_prefix=contexts-
context_name_prefix=performance
access_combination_name=all
timezone=Europe/Berlin
language=en_US
quota=100
admin_user=oxadmin
admin_pass=secret
master_user=oxadminmaster
master_pass=secret
schemas_csv=schemas.csv

usage() {
cat >&2 <<EOF
Usage: $0 [-A masteruser] [-P masterpass] [-s startid] [-e endid] [-p parallel] [-o output prefix] [-c schemas.csv input file name]

    Use to create contexts-*.csv files suitable for provisioning
    using createcontext --csv-import.

    This script only generates files; it does not call createcontext
    itself.

    This script expects to find a schemas.csv file with information about the
    schema selection strategy to employ. Available are schema-name fastmode,
    in-memory fastmode, and automatic (non-fast) mode.

    Fastmode provisioning using the schema-name selection strategy can be
    leveraged with a csv file with the columns

        destination-store-id,destination-database-id,schema

    In order to avoid deadlocks by parallel provisioning into the same schema
    and for best performance by spreading out over all available database
    instances, it is important to make sure to go round-robin over database-ids and
    schemas. For example, if you have one filestore ID (2), and the DBs with
    schemas...

    DB ID 3, schemas schema_1, schema_4, schema_7, ...
    DB ID 6, schemas schema_2, schema_5, schema_8, ...
    DB ID 9, schemas schema_3, schema_6, schema_9, ...

    ... then you should order them in the csv file as follows:

        destination-store-id,destination-database-id,schema
        2,3,schema_1
        2,6,schema_2
        2,9,schema_3
        2,3,schema_4
        2,6,schema_5
        2,9,schema_6
        2,3,schema_7
        2,6,schema_8
        2,9,schema_9
        ...

    (Note: strictly speaking, to avoid deadlocks it is required to ensure that
    no concurrent provisioning into the same schema can occur. This can only be
    ensured in the framework of this tool if the number of schemas is an integer
    multiple of the number of parallel provisioning hosts employed, so that each
    schema is only provisioned from within the same CSV file. The other option
    would be to use a more sophisticated provisioning framework like our gatling
    framework or a custom solution.)

    It is also possible to use this script with in-memory fastmode with a CSV
    file like

        destination-store-id,destination-database-id,schema-strategy
        2,3,in-memory
        2,6,in-memory
        2,9,in-memory

    and with automatic mode with a file like

        destination-store-id,destination-database-id,schema-strategy
        2,3,automatic
        2,6,automatic
        2,9,automatic

    or even just:

        schema-strategy
        automatic

    Note: Make sure you pre-generated schemas. For schema-name mode, you can
    use something like

        for (( i=1; i<=70; i++ ))
        do
            for j in 3 6 9
            do
                /opt/open-xchange/sbin/createschema -A oxadminmaster -P secret -i \$j
            done
        done

    while for in-memory or automatic mode you need to pre-generate the schemas
    using ordinary createcontext with a temporary setting of CONTEXTS_PER_SCHEMA=1.

EOF
    exit 1
}

while getopts ":A:P:s:e:p:o:c:" o; do
    case "${o}" in
        A)
            master_user=${OPTARG}
            ;;
        P)
            master_pass=${OPTARG}
            ;;
        s)
            start_cid=${OPTARG}
            ;;
        e)
            end_cid=${OPTARG}
            ;;
        p)
            parallel=${OPTARG}
            ;;
        o)
            output_prefix=${OPTARG}
            ;;
        c)
            schemas_csv=${OPTARG}
            ;;
        *)
            usage
            ;;
    esac
done
shift $((OPTIND-1))

if [ $end_cid -lt $start_cid ]
then
    usage
fi

exec 3<$schemas_csv
read headers <&3

d=$(mktemp --tmpdir=. -d -t contexts-XXXXX)

cp /dev/null $d/contexts.all.csv

for(( i=$start_cid; i<=$end_cid; i++ ))
do
    read stuff <&3 || {
        exec 3<$schemas_csv
        read _headers <&3
        read stuff <&3
    }
    echo "\"$i\",\"$admin_user\",\"$admin_pass\",\"Admin Context $i\",\"Admin\",\"Context $i\",\"$admin_user@$context_name_prefix$i\",\"$quota\",\"$access_combination_name\",\"$language\",\"$timezone\",$stuff" >> $d/contexts.all.csv
done
cat $d/contexts.all.csv | split -u -n r/$parallel --numeric-suffixes=1 --additional-suffix=.csv - $d/$output_prefix

sed -i -e "1i contextid,username,password,displayname,givenname,surname,email,quota,access-combination-name,language,timezone,$headers" $d/${output_prefix}*.csv

cat >$d/execute.sh <<EOF
#!/bin/bash

for file in ${output_prefix}*.csv
do
    /opt/open-xchange/sbin/createcontext -A $master_user -P $master_pass --csv-import \$file &
done
wait

# alternatively, using gnu parallel can be more convenient and powerful
#parallel '/opt/open-xchange/sbin/createcontext -A $master_user -P $master_pass --csv-import {}' ::: ${output_prefix}*.csv
EOF

echo $d

exit 0