Building a new search engine for SCL.org using MS SQL and Sphinx

SCL.org is a 'go to' source of current legal know-how for many who work in and around IT law. The site has an archive of nearly 20 years' contributions, plotting the development of IT law and explaining some of its most complex concepts.

Unfortunately we have received a number of complaints about the poor quality of the site's search engine. In part that complaint is the inevitable consequence of Google having made the dark art of text indexing and retrieval look so simple. In part lawyers are used to the search engines offered by commercial providers such as LexisNexis, companies that have invested years of development in their platforms. SCL.org's search facility has lagged behind the state of the art, and it shows.

We have a developer who looks after the site, but I have access to the code base in order to oversee its development and for disaster recovery purposes. I have just spent two weeks pro bono with the team at our Bristol HQ helping to improve aspects of it. Our developer has been focused on building out the administrative functions (the 'back-end' of our site is our membership management and financial accounting system; the site that you see is the tip of the iceberg). In addition to some other less interesting tasks, I took on the challenge of improving the search engine.

Here's a log of the steps that I took.

Step 1. Define criteria

My criteria for the new search engine were as follows:

  1. Provides better, reasonably intuitive, search results (very much a subjective criterion).
  2. Does not require major architectural changes to our existing code base, which relies heavily on Microsoft SQL Server as its data storage layer.
  3. Implementable within a few days, without introducing bugs or side effects into our code-base.

Step 2. Choose a search engine

Search engines typically have two components: (i) an indexer, which takes the content to be made searchable and creates an index, and (ii) a query parser, which takes queries and searches against the index.

Microsoft's implementation, as part of SQL Server, is much improved from where it was when I last developed on top of it, many years ago. Nevertheless I discounted it early on. We are running a slightly older version of SQL Server on our platform, which lacks the benefit of those improvements, and upgrading to the latest version might have had unforeseen consequences for the rest of our platform, contrary to my third criterion.
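(For reference, SQL Server's built-in full-text search is queried along these lines, given a full-text index on the column; the table and column names here are illustrative:)

SELECT id, content_title
FROM dbo.content
WHERE CONTAINS(content_body, N'"privacy shield"');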

I assessed Elasticsearch, the heavyweight open source full text search engine. Elasticsearch has a great reputation, and fits modern development practices very well: it is document based, RESTful, uses JSON as its interface format, and can easily be distributed through sharding and replication. It sits very much as its own service, with new data to be indexed being pushed to it via its API. In principle I like this idea; it simplifies 'DevOps' by keeping each service distinct but interoperable. In this case, though, I wasn't keen to build a 'push' interface to add new content to the index, and to implement a JSON parser to 'pull' search results. This would run contrary to the SQL-driven approach of our current site and violate my second criterion.

I settled on Sphinx, an open source full text search engine famed for powering the search at Craigslist.org. Craigslist's developers have described it as "easy to understand", "programmer friendly", "fast" and "stable". Crucially for ease of implementation, Sphinx can pull data directly from our Microsoft SQL Server instance, and its query interface emulates the MySQL protocol (MySQL being another open source database, similar in kind to Microsoft SQL Server). Architecturally, all of this makes it easy to integrate into our existing application by adapting the existing search function, which is already built in SQL.
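To illustrate what that emulation buys us: once the daemon is running, any stock MySQL client can talk to it (e.g. mysql -h 127.0.0.1 -P 9306, 9306 being Sphinx's conventional SphinxQL port), and queries are written in SphinxQL, Sphinx's SQL dialect. A minimal sketch, with the index name matching the one used later:

SELECT id, weight() FROM mainindex WHERE MATCH('privacy') LIMIT 0, 10;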

Step 3. Installing and configuring Sphinx

Sphinx comes prepackaged for Windows. It does not come with an installer, so it has to be unzipped and set up by hand.

I loosely followed these instructions: http://sphinxsearch.com/forum/view.html?id=2972

There are two components to Sphinx:

  1. The search engine daemon (searchd) - this needed to be configured to listen on localhost using the MySQL protocol only, and then set up as a Windows service.

  2. The indexer - this needed to be configured as a scheduled task to connect to our Microsoft SQL Server database and pull down the correct data.

Configuring Sphinx itself was straightforward. I won't go into each and every configuration option, because the configuration file is well documented, and this is not a HOWTO.
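That said, to give a flavour of the moving parts, here is a heavily stripped-down sketch of the kind of sphinx.conf involved. Treat it as illustrative only: the paths, credentials and names are placeholders rather than our production settings.

# source: how the indexer pulls rows out of SQL Server
source scl_content
{
    type        = mssql
    sql_host    = localhost
    sql_user    = sphinx_reader
    sql_pass    = *********
    sql_db      = scl
    # first column must be the unique numeric document id;
    # the aliases match the field names weighted in the query later on
    sql_query   = SELECT id, content_title AS title, content_strapline AS strap, content_body AS body, pubdate FROM sphinx_content
    sql_attr_timestamp = pubdate
}

index mainindex
{
    source      = scl_content
    path        = C:\sphinx\data\mainindex
}

searchd
{
    # MySQL-protocol listener, localhost only
    listen      = 127.0.0.1:9306:mysql41
    log         = C:\sphinx\log\searchd.log
    pid_file    = C:\sphinx\searchd.pid
}

# install the daemon as a Windows service:
#   searchd --install --config C:\sphinx\sphinx.conf --servicename SphinxSearch
# schedule the indexer to run periodically:
#   indexer --config C:\sphinx\sphinx.conf --all --rotate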

I will note, for anyone tempted to try a similar exercise, that I had a number of niggles around the usual offenders of file paths and permissions, easily solved by using Google and StackExchange.

I also found a couple of particular oddities: (i) Sphinx requires datetime attributes as epoch timestamps; and (ii) Sphinx requires a unique numeric row id for all indexed data.

Working around these issues required a little SQL-foo:

(i) Here is how to calculate an epoch time in SQL Server:

SET @timestamp = DATEDIFF(second, {d '1970-01-01'}, @datetime);

(ii) Here are my thoughts on the best (most expedient?) way to combine tables under a single unique row id:

  • Most of the searchable content of the SCL site is stored in a single table, but there is an additional table for events, which I wanted to include in the same search index to maintain compatibility with our existing data model. Easily done with a UNION query, subject to the following wrinkle.

  • In our database, row ids are numeric but unique only within each table rather than globally. Sphinx relies on a single unique id per document in each search index, which would therefore need to be computed. We do have a 'guid' of sorts, which simply affixes the id to the content type as a string. I experimented with creating a numeric hash of this, but worried about collisions. We would have the same problem even if we hashed SQL Server's native GUID to an integer type: hash collisions are very unlikely but not impossible, and difficult to predict.

  • Instead, for now I simply offset the id of the second table by a suitably high number (I chose 1,000,000, which should outlive the current site with ease) to give each table its own number range. Hacky but effective. Future collisions are still possible, but only with explosive growth in content on our site, which we would know about, because we are in control of adding such content.
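As a cheap guard, a periodic sanity check along these lines (illustrative, using the same offset as the view below) would flag trouble long before a collision occurred:

/* ids in dbo.content must stay well below the 1,000,000 offset given to dbo.event */
SELECT MAX(id) AS max_content_id FROM dbo.content;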

Here's the SQL that defines the view (field names changed to protect the guilty):

CREATE VIEW dbo.sphinx_content AS
SELECT id, content_type, content_title, content_strapline, CAST(content_body AS nvarchar(max)) AS content_body, url AS guid,
       /* publish holds yyyymmddhhmmss; splice in the separators, then convert to epoch seconds */
       DATEDIFF(s, '1970-01-01 00:00:00', CONVERT(datetime, STUFF(STUFF(STUFF(publish, 9, 0, ' '), 12, 0, ':'), 15, 0, ':'))) AS pubdate
FROM dbo.content
WHERE content_type IS NOT NULL AND content_title IS NOT NULL
UNION ALL /* the two id ranges are disjoint, so UNION's duplicate removal is wasted work */
SELECT id + 1000000 AS id, 'e' AS content_type, event_title AS content_title, event_strap AS content_strapline, CAST(event_body AS nvarchar(max)) AS content_body, url AS guid,
       DATEDIFF(s, '1970-01-01 00:00:00', CONVERT(datetime, CONVERT(varchar(8), calendar), 112)) AS pubdate
FROM dbo.event
WHERE live = 1 AND ISNULL(event_title, '') != '';

Step 4. Building the search query

Here is our Sphinx search query:

SELECT id, weight() AS wght,
       (1 - (now() - pubdate) / 100000000) * weight() AS wght_time
FROM mainindex
WHERE MATCH('Privacy Shield')
ORDER BY wght_time DESC, wght DESC
LIMIT 0, 10000
OPTION max_matches=10000, field_weights=(title=3, strap=2, body=1)

You'll note the use of Sphinx's powerful OPTION clause, and in particular the weighting applied to each field. Each field's internal Sphinx score is multiplied by its weight, and the per-field totals are then summed to produce the final weighting.
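As a simplified illustration (the numbers are invented, and Sphinx's actual per-field scoring is more involved): if a document's per-field scores were title = 120, strap = 15 and body = 40, then with field_weights=(title=3, strap=2, body=1) its final weight would be 3×120 + 2×15 + 1×40 = 430.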

You'll also note that I've opted for a static search order, weighted against time (again in epoch format) so that newer results rank higher than older ones. There was an element of trial and error in perfecting this query. I am happy with the results and will be interested to take feedback when the new site is live. We can tweak the weightings if need be.
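To put that time factor in perspective: 100,000,000 seconds is roughly 3.2 years, so the multiplier decays from 1 for a brand-new article towards 0 at about the 3.2-year mark, and goes negative beyond it, pushing very old content below anything with a positive score. For example:

  • one year old: 1 - 31536000/100000000 ≈ 0.68
  • five years old: 1 - 157680000/100000000 ≈ -0.58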

Step 5. Wiring it all up

The search function of our site is wired up via a stored procedure in MS SQL, which makes a version of the above SQL query against the linked Sphinx instance.

MS SQL has a great, but somewhat eye-watering-from-a-cybersecurity-perspective, feature of being able to create linked servers: once you have access to one MS SQL server, you can run inline queries against other linked servers using the OPENQUERY command. (There's an even worse function from a security standpoint, OPENROWSET, which allows inline queries against other servers using only an SQL command containing the DSN string, but it has to be enabled manually. I suggest leaving it switched off.)

Servers can be linked via ODBC, an open API that allows different flavours of database to connect through a common interface. (Note, API in the widest sense, not the web API that many of us have come to think of when we hear that acronym.) Because Sphinx speaks native MySQL, you can use Oracle's MySQL Connector/ODBC to give it an ODBC DSN. Be sure to follow the instructions to set the MySQL version string and default character set, so that Connector/ODBC thinks it is talking to a bona fide version of MySQL rather than Sphinx's emulation.
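For completeness, here is roughly how the linked server can be registered, assuming a system DSN named 'SphinxDSN' (a hypothetical name) pointing at the Sphinx daemon; MSDASQL is the OLE DB provider that bridges SQL Server to ODBC:

EXEC master.dbo.sp_addlinkedserver
    @server = N'SPHINX',      /* the name used with OPENQUERY below */
    @srvproduct = N'',
    @provider = N'MSDASQL',   /* OLE DB provider for ODBC data sources */
    @datasrc = N'SphinxDSN';  /* the ODBC DSN pointing at Sphinx */

/* quick smoke test */
SELECT * FROM OPENQUERY(SPHINX, 'SELECT id FROM mainindex LIMIT 0, 1');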

For legacy reasons, we build the results into a temporary table, which is then paged as users click through the pages of results on our site. Our site is low traffic, so this is not of huge concern, although it could get quite memory intensive on a very popular site.

Here's our search function. I have altered it slightly for security purposes.

CREATE TABLE #tmp (id INT, [order] INT IDENTITY(1,1));
IF (@searchtext != '')
BEGIN
      DECLARE @TSQL varchar(max)
      /* n.b. @searchtext must be sanitised upstream before being spliced into dynamic SQL */
      SELECT  @TSQL = 'INSERT INTO #tmp (id) SELECT id FROM OPENQUERY(SPHINX,''SELECT id, weight() as wght, (1-(now()-pubdate)/100000000)*weight() as wght_time FROM mainindex WHERE MATCH(''''' + @searchtext + ''''') ORDER BY wght_time DESC, wght DESC LIMIT 0, 10000  OPTION max_matches=10000,field_weights=(title=3,strap=2,body=1)'')'
      EXEC (@TSQL)
END

/* Return the number of results */
SELECT COUNT(*) AS cnt FROM #tmp;
/* The Sphinx search just returns the ids of the indexed content. We need to join in the relevant columns to get the full results */

SELECT TOP 10 sphinx_content.id AS id, sphinx_content.type,
    ISNULL(editorial.url, events.url) AS url,
    ISNULL(editorial.title, events.title) AS title,
    ISNULL(editorial.strap, events.strap) AS strap,
    ISNULL(FLOOR(editorial.publish / 1000000), events.calendar) AS date
    FROM sphinx_content
    INNER JOIN #tmp ON #tmp.id = sphinx_content.id
    LEFT JOIN editorial ON sphinx_content.type = 'editorial' AND editorial.id = sphinx_content.id
    LEFT JOIN events ON sphinx_content.type != 'editorial' AND events.id = sphinx_content.id - 1000000 AND events.live = 1
    WHERE [order] > (@pag - 1) * 10
    ORDER BY [order];
DROP TABLE #tmp;
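The procedure is then called from the site's search page along these lines (the procedure name here is a placeholder; the parameter names are as above):

EXEC dbo.usp_search @searchtext = N'GDPR', @pag = 1;  /* usp_search is a hypothetical name; @pag is the page of results wanted */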

Results

Here are the results of a query for "GDPR" (hits 1-10, in descending rank order):

Old search engine:
[screenshot: top results for "GDPR"]

Sphinx:
[screenshot: top results for "GDPR"]

I consider this a significant improvement. Hopefully our members will agree.

The new search engine is not live yet but will be in the coming weeks. I look forward to receiving feedback on it.