In this post, we introduce a new tool called S2C – the Apache Solr to Amazon CloudSearch Migration Tool. S2C is a Linux console-based utility that helps developers and engineers migrate a search index from Apache Solr to Amazon CloudSearch.
Customers often build search for their website or application on top of Solr, but later run into challenges like elastic scaling and managing the Solr servers. This is a typical scenario we have observed in our years of search implementation experience. For such use cases, Amazon CloudSearch is a good choice. Amazon CloudSearch is a fully managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. To learn more, please read the Amazon CloudSearch documentation.
We are seeing a growing trend every year: organizations of various sizes are migrating their workloads to Amazon CloudSearch and leveraging the benefits of a fully managed service. For example, Measured Search, an analytics and e-commerce platform vendor, found it easier to migrate to Amazon CloudSearch than to scale Solr themselves (see article for details).
Since Amazon CloudSearch is built on top of Solr, it exposes all the key features of Solr while providing the benefits of a fully managed service in the cloud, such as auto-scaling, self-healing clusters, high availability, data durability, security, and monitoring.
In this post, we provide step-by-step
instructions on how to use the Apache Solr to Amazon
CloudSearch Migration (S2C) tool to
migrate from Apache Solr to Amazon CloudSearch.
Before we get into the details, you can download the S2C tool from the link below.
Download Link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip
Prerequisites
Before starting the migration, the following prerequisites have to be met. The prerequisites include installations and configuration on the migration server. The migration server can be the Solr server itself or an independent server that sits between your Solr server and your Amazon CloudSearch instance.
Note: We recommend running the migration from the Solr server rather than an independent server, as it saves time and bandwidth. It is better still if the Solr server is hosted on EC2, as the latency between EC2 and CloudSearch is relatively low.
The following installations and configuration should be done on the migration server (i.e., your Solr server or any new independent server that connects between your Solr machine and Amazon CloudSearch).
1. The application is developed using Java. Download and install Java 8. Validate the JDK path and ensure that environment variables like JAVA_HOME, CLASSPATH, and PATH are set correctly.
2. We assume you have already set up an Amazon Web Services IAM account. Please ensure the IAM user has the right permissions to access AWS services like CloudSearch.
Note: If you do not have an AWS IAM account with the above-mentioned permissions, you cannot proceed further.
3. The IAM user should have an AWS access key and secret key. On the application hosting server, set up the Amazon environment variables for the access key and secret key. It is important that the application runs using the AWS environment variables.
To set up the AWS environment variables permanently, refer to the AWS documentation. Alternatively, you can set them by running the commands below from a Linux console.

export AWS_ACCESS_KEY=<Access Key>
export AWS_SECRET_KEY=<Secret Key>
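To confirm the variables are visible from the same shell, a quick sanity check is sketched below. The AWS CLI is not required by S2C itself, and note that the CLI reads the *_ID variable names rather than the ones above:

# show what the migration process will see
echo "$AWS_ACCESS_KEY"

# optional: verify the credentials resolve to a valid IAM identity via the AWS CLI,
# which expects AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
export AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="$AWS_SECRET_KEY"
aws sts get-caller-identity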
4. Note: This step is applicable only if the application is hosted on Amazon EC2.
If you do not have an AWS access key and secret key, you can opt for an IAM role attached to an EC2 instance. A new IAM role can be created and attached to EC2 during the instance launch. The IAM role should have access to AWS resources like S3, DynamoDB, and CloudSearch. For more information, refer to the AWS documentation on IAM roles for EC2.
5. Download the migration utility ‘S2C’, unzip the tool, and copy it to your working directory.
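For example, on a typical Linux machine (assuming wget and unzip are available; the target directory name here is arbitrary):

# download and unpack the S2C utility
wget https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip
unzip s2c-cli.zip -d s2c-cli
cd s2c-cli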
S2C Utility File
The downloaded ‘S2C’ migration utility contains the following sub-directories and files:

bin – Binaries of the migration tool.
lib – Libraries required for migration.
application.conf – Configuration file that allows end users to input parameters. Requires end-user input.
logback.xml – Log file configuration. Optional; does not require end-user / developer input.
s2c – Script file that executes the migration process.

Configure only application.conf and logback.xml. Do not modify any other file.
application.conf
The application.conf file holds the configuration for the new Amazon CloudSearch domain that will be created. The parameters configured in the application.conf file are explained below.
s2c {
  api {
    SchemaParser = "s2c.impl.solr.DefaultSchemaParser"
    SchemaConverter = "s2c.impl.cs.DefaultSchemaConverter"
    DataFetcher = "s2c.impl.solr.DefaultDataFetcher"
    DataPusher = "s2c.impl.cs.DefaultDataPusher"
  }

The list of APIs executed step by step during the migration. Do not change this.
solr {
  dir = "files"
  server-url = "http://localhost:8983/solr/collection1"
  fetch-limit = 100
}

dir – The base directory path of Solr. Ensure the directory is present and valid. E.g.: /opt/solr/example/solr/collection1/conf
server-url – Server host, port, and collection path; this is the endpoint used to fetch the data. If the utility is run from a different server, ensure the IP address and port have firewall access.
fetch-limit – The number of Solr documents fetched in each batch call. This number should be set carefully by the developer; the right fetch limit depends on the following factors:
1. The size of a Solr record (e.g., 1 KB or 2 KB)
2. The latency between the migration server and Amazon CloudSearch
3. The current request load on the Solr server
E.g.: If there are 100,000 Solr documents in total and the fetch limit is 100, it takes 100,000 / 100 = 1,000 batch calls to complete the fetch. If each Solr record is 2 KB in size, then 100,000 * 2 KB = 200 MB of data is migrated.
cs {
  domain = "collection1"
  region = "us-east-1"
  instance-type = "search.m3.xlarge"
  partition-count = 1
  replication-count = 1
}

domain – CloudSearch domain name. Ensure that the domain name does not already exist.
region – AWS region for the new CloudSearch domain.
instance-type – Desired instance type for the CloudSearch nodes. Choose the instance type based on the volume of data and the expected query volume.
partition-count – Number of partitions required for CloudSearch.
replication-count – Replication count for CloudSearch.

wd = "/tmp"

wd – Temporary file path to store intermediate data files and migration log files.
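Putting the fragments together, a complete application.conf based on the sample values above would look like the following. This is a sketch: the exact nesting of wd relative to the other blocks is inferred from the fragments, so compare against the file shipped with the tool.

s2c {
  api {
    SchemaParser = "s2c.impl.solr.DefaultSchemaParser"
    SchemaConverter = "s2c.impl.cs.DefaultSchemaConverter"
    DataFetcher = "s2c.impl.solr.DefaultDataFetcher"
    DataPusher = "s2c.impl.cs.DefaultDataPusher"
  }
  solr {
    # point dir at your Solr conf directory
    dir = "/opt/solr/example/solr/collection1/conf"
    server-url = "http://localhost:8983/solr/collection1"
    fetch-limit = 100
  }
  cs {
    domain = "collection1"
    region = "us-east-1"
    instance-type = "search.m3.xlarge"
    partition-count = 1
    replication-count = 1
  }
  wd = "/tmp"
}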
Running the migration
Before launching the S2C migration tool,
verify the following:
1. Solr directory path – Make sure the Solr directory path is valid and accessible. The tool cannot read the configuration if the path or directory is invalid.
2. Solr configuration contents – Validate that the Solr configuration contents are correctly set inside the directory, e.g., solrconfig.xml, schema.xml, stopwords.txt, etc.
3. Make sure the working directory is present in the file system and has write permissions for the current user. It can be an existing directory or a new directory. The working directory stores the fetched data from Solr and the migration logs.
4. Validate the disk size before starting the migration. If the available free disk space is less than the size of the Solr index, the fetch operation will fail. For example, if the Solr index size is 7 GB, make sure the disk has at least 10 to 20 GB of free space (a quick check is sketched after this list).
Note: The tool reads the data from Solr and stores it in a temporary directory (see the configuration wd = "/tmp" in the section above).
5. Verify that the AWS environment variables are set correctly. The AWS environment variables are described in the prerequisites section above.
6. Validate the firewall rules for IP addresses and ports if the migration tool is run from a different server or instance. For example, Solr's default port 8983 should be open to the EC2 instance executing this tool.
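As a rough sanity check for item 4 (both paths are placeholders for your environment):

# compare the Solr index size against the free space available to the working directory
du -sh /opt/solr/example/solr/collection1/data
df -h /tmp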
Run the following command from the ‘{S2C filepath}’ directory, for example /build/install/s2c-cli:

./s2c

Or, to run with an explicit heap size:

JVM_OPTS="-Xms2048m -Xmx2048m" ./s2c
This invokes the ‘s2c’ shell script, which starts the search migration process. The migration process is a series of steps that require user inputs, as shown in the screenshots below.
Step 1: Parse the Solr schema
The first step of the migration prompts for confirmation to parse the Solr schema and Solr configuration file. During this step, the application generates a ‘Run Id’ folder inside the working directory.
The Run Id is a unique identifier for each migration. Note down the Run Id, as you will need it to resume the migration in case of any failure.
Step 2: Schema conversion from Solr to CloudSearch
The second step prompts for confirmation to convert the Solr schema to a CloudSearch schema. Press any key to proceed.
This step also lists all the converted fields that are ready to be migrated from Solr to CloudSearch. If any fields are left out, this step allows you to correct the original schema: you can abort the migration, identify the ignored fields, rectify the schema, and re-run the migration.
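As an illustration, the conversion maps Solr field definitions to CloudSearch index fields. The CloudSearch types in the comment below are assumptions based on typical conversions; the actual mapping is decided by the tool's SchemaConverter:

<!-- Solr schema.xml (input) -->
<field name="description" type="text_general" indexed="true" stored="true" />
<field name="sku" type="string" indexed="true" stored="true" />
<!-- plausible CloudSearch equivalents: description -> text, sku -> literal -->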
Step 3: Data Fetch
The third step prompts for confirmation to fetch the search index data from the Solr server. Press any key to proceed. This step generates a temporary file in the working directory that holds all the fetched documents from the Solr index.
There is also an option to skip the fetch process if all the Solr data is already stored in the temporary file. If this is the case, the prompt will look like the screenshot below.
Step 4: Data push to CloudSearch
The last and final step prompts for confirmation to push the search data from the temporary file store to Amazon CloudSearch. This step also creates the CloudSearch domain with the configuration specified in application.conf, including the desired instance type, replication count, and Multi-AZ option.
If the domain has already been created, the utility will prompt you to use the existing domain. If you do not wish to use an existing domain, you can create a new CloudSearch domain from the same prompt.
Note: The console does not prompt for a ‘CloudSearch domain name’; it uses the domain name configured in the application.conf file.
Step 5: Resume (Optional)
If there is any failure during the fetch operation, the migration can be resumed from that point. This is illustrated in the screenshot below.
Step 6: Verification
Log in to the AWS CloudSearch management console to verify the domain and its index fields.
Amazon CloudSearch allows running test queries to validate the migration as well as the functionality of your application.
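The same verification can be done from the command line with the AWS CLI (a sketch; the endpoint URL below is a placeholder, take the real one from the describe-domains output):

# confirm the domain exists and inspect its status
aws cloudsearch describe-domains --domain-names collection1 --region us-east-1

# run a test query against the domain's search endpoint
aws cloudsearchdomain search \
  --endpoint-url https://search-collection1-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com \
  --search-query "test" \
  --query-parser simple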
Limitations
1. Support for non-Linux environments is not available for now.
2. Support for Solr shards is not available for now. Each Solr shard needs to be migrated separately.
3. The install commands may vary across Linux flavors. For example, the commands for installing software, editing files, and setting permissions can differ for each Linux flavor; it is left to the engineering team to choose the right commands during the installation and execution of this migration tool.
4. Only fields configured as ‘stored’ in the Solr schema.xml are supported. Non-stored fields are ignored during schema parsing.
5. The document ID (unique key) must have the following attributes (a simple pre-migration check is sketched after this list):
a. The document ID should be 128 characters or less in size.
b. The document ID can contain any letter, any number, and any of the following characters: _ - = # ; : / ? @ &
c. The following link will help you understand data preparation before migrating to CloudSearch: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
6. If these conditions are not met in a document, it is skipped during migration. Skipped records are shown in the log file.
7. If a field type (mapped to fields) is not stored, the stopwords mapped to that particular field type are ignored.
Example 1:
<field name="description" type="text_general" indexed="true" stored="true" />
The above field ‘description’ will be considered for stopwords.
Example 2:
<field name="fileName" type="string" />
The above field ‘fileName’ will not be migrated and will be ignored for stopwords.
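As a rough pre-migration check for the document ID rules in item 5 above (ids.txt is a hypothetical file containing one document ID per line):

# flag IDs longer than 128 characters or containing characters outside the allowed set
awk 'length($0) > 128 || $0 !~ "^[A-Za-z0-9_=#;:/?@&-]+$" { print NR ": " $0 }' ids.txt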