Suppose you work for an organization that has close to a hundred repositories and multiple databases with countless stored procedures, and you need to design a system to search all these sources in order to answer two groups of questions:

  1. Impact analysis - where in my organization is a given class, method, or stored procedure used? What happens if I change it? Who will be impacted, and how far do the ripple effects reach?
  2. Code snippets - how can I find code snippets that are relevant to our code base? Not just generic usage of standard libraries, but a piece of code that works in our environment. Somebody has already figured out how to process an initial payment and what the arguments for that method are; I need to use it the same way in my code.

Given that you have all the source code, how would you design a system to achieve that? For many of us developing line-of-business applications, this is a new type of problem that requires very different thinking.

Let's reduce the code search problem to text file search: given a set of text files and a search phrase, find which files contain that phrase.

A little primer on this classical Information Retrieval problem:

  1. Tokenize the text. Basically, chop each document into a list of words, which are called tokens.
  2. Do linguistic pre-processing to normalize the tokens. Remove "a", "the" (stop words), deal with singular/plural forms, synonyms, and so on.
  3. Index the documents by building an inverted index. The inverted index is the key component here. It is a data structure that resembles a book index: for a given word, it tells you which documents (pages) that word appears in.
  4. Process queries. A lot happens here - scoring, weighting, synonym matching - but in the simplest form it is a matter of looking the query terms up in the inverted index.

In summary, this is a two-step process: build an index over every document, then search the index.
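The steps above can be sketched in a few lines of C#. This toy index - assuming naive lower-cased tokenization and no stop-word or synonym handling - shows the core idea behind what Lucene's inverted index does:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InvertedIndexSketch
{
    // Maps each token to the set of document ids that contain it.
    public static Dictionary<string, HashSet<int>> BuildIndex(string[] docs)
    {
        var index = new Dictionary<string, HashSet<int>>();
        for (int docId = 0; docId < docs.Length; docId++)
        {
            // Naive tokenization: lower-case and split on separators.
            var tokens = docs[docId].ToLowerInvariant()
                .Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (var token in tokens)
            {
                if (!index.TryGetValue(token, out var postings))
                    index[token] = postings = new HashSet<int>();
                postings.Add(docId);
            }
        }
        return index;
    }

    public static void Main()
    {
        var docs = new[] { "Lucene in Action", "Managing Gigabytes", "Lucene for Dummies" };
        var index = BuildIndex(docs);
        // Searching is now just a dictionary lookup: which documents contain "lucene"?
        Console.WriteLine(string.Join(",", index["lucene"].OrderBy(i => i)));  // prints 0,2
    }
}
```

Everything a real engine adds - normalization, scoring, phrase queries - builds on this one lookup structure, which is why search is fast even over huge document sets.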

Conveniently, there is already a library that does all this work - Lucene, a powerful Java search library that has been ported to .NET as Lucene.Net.

This is what it takes to add search to a console application (this example originally came from one of the sources listed at the bottom):

Install-Package Lucene.Net

This code is written against Lucene.Net 3.0.3 and may not work against a newer version.

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

    static void Main(string[] args)
    {
        // 0. Analyzer
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // 1. Create index
        var directory = new RAMDirectory();
        using (var w = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            AddDoc(w, "Lucene in Action", "193398817");
            AddDoc(w, "Lucene for Dummies", "55320055Z");
            AddDoc(w, "Managing Gigabytes", "55063554A");
            AddDoc(w, "The Art of Computer Science", "9900333X");
        }

        // 2. Query
        //const string querystr = "Lucene";
        const string querystr = "Science OR Action";
        var query = new QueryParser(Version.LUCENE_30, "title", analyzer).Parse(querystr);

        // 3. Search / display results
        using (var reader = IndexReader.Open(directory, true))
        {
            var searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.Search(query, 10);

            Console.WriteLine("Found {0} hits", hits.TotalHits);
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                int docId = hit.Doc;
                var d = searcher.Doc(docId);
                Console.WriteLine("{0}. {1}", d.Get("isbn"), d.Get("title"));
            }
        }
    }

    private static void AddDoc(IndexWriter w, string title, string isbn)
    {
        var doc = new Document();
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NO));
        w.AddDocument(doc);
    }

A few notes on the steps.

0. Analyzer. An analyzer is the component that performs tokenization and may deal with all sorts of interesting questions: compound words, spelling correction, synonyms, singular/plural forms, and so on. Lucene provides several analyzers; StandardAnalyzer is one of them.

1. Create index. In order to create an index we need an IndexWriter and a directory where the index should be created. This example builds the index in memory; normally the index would be built on disk:

var indexDirectory = new SimpleFSDirectory(new DirectoryInfo("Index"));
using (var writer = new IndexWriter(indexDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    IndexDirectory(writer, new DirectoryInfo(contentPath));
}

In order to build an index we need to create documents. Each document has multiple fields and, much like a document in a NoSQL database, can represent anything; documents in the same index do not need to have the same fields. The most important class in indexing is Field, which may or may not be stored in the index. Index.ANALYZED means the analyzer will break the field down into tokens and make each token searchable, providing the ability to search for any word in the document title. Another option is Index.NOT_ANALYZED - the field is searchable but not broken into tokens, which suits things like dates, file system paths, and phone numbers. The Index.NO option makes the field unavailable for searching (we can't search by ISBN in this example).
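For a code-search index, these options map naturally onto source files. The fragment below is a sketch against the Lucene.Net 3.0.3 API used above; the field names and the sourceText/filePath variables are illustrative, not part of the original example:

```csharp
// Hypothetical document for a code-search index: the file body is
// analyzed (every token searchable), while the path is kept as a
// single exact value so "src\Payments\Processor.cs" matches as a whole.
var doc = new Document();
doc.Add(new Field("content", sourceText, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("path", filePath, Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.AddDocument(doc);
```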

2. Query. QueryParser translates the user's request into a Query object.
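QueryParser understands Lucene's query syntax (AND, OR, field prefixes, wildcards). The same query can also be built programmatically; a sketch of what parsing "Science OR Action" against the "title" field roughly produces, assuming the 3.0.3 API:

```csharp
// Two optional (SHOULD) clauses: a document matches if either term matches.
// Terms are lower-cased because StandardAnalyzer lower-cases tokens at index time.
var q = new BooleanQuery();
q.Add(new TermQuery(new Term("title", "science")), Occur.SHOULD);
q.Add(new TermQuery(new Term("title", "action")), Occur.SHOULD);
```

Building queries programmatically is useful when the query comes from your own UI rather than from a user-typed search box, since it sidesteps query-syntax escaping.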

3. Search / display results. This is where we use IndexReader and IndexSearcher to search the index we just built and render the results. In this example we only print the stored title and ISBN, but Lucene can determine the position of a hit within a document as well.

Helpful resources