I built a simple yet functional code search solution based on Lucene. One problem that came up recently was duplicates in the search results. Since we index our entire code base, many projects are shared between solutions and included as source files, not as compiled DLLs or NuGet packages. For example, we have a Login class with a ResetPassword method. The project with that class is used in many solutions, so a search for ResetPassword returns a hit for each of those projects.

My first attempt to deal with it was to simply use the built-in Lucene DuplicateFilter, which is very straightforward:

var duplicateFilter = new DuplicateFilter(Fields.Key) {KeepMode = DuplicateFilter.KM_USE_FIRST_OCCURRENCE};  
//...
ScoreDoc[] scoreDocs = searcher.Search(query, duplicateFilter, MaxNumberOfHits).ScoreDocs;  

This is one of the many Lucene filters; it discards all but the first occurrence of each duplicate. The filter requires every document to have a unique key field. This wasn't a problem since I already run all C# files through Roslyn, so it was just a matter of grabbing the class name and namespace. For SQL and JavaScript files the key field can simply be set to the file path. Since I already had another filter in place, I pulled one more handy class from Lucene.Contrib, ChainedFilter, to cascade my filters. And so it worked - on my machine...
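
For completeness, cascading the filters looked roughly like this. This is only a sketch: it assumes the Lucene.Net Contrib port of ChainedFilter mirrors the Java API (a Filter[] plus a logic constant), and extensionFilter / Fields.Extension are made-up names standing in for whatever other filter you already have:

var extensionFilter = new QueryWrapperFilter(new TermQuery(new Term(Fields.Extension, "cs"))); // hypothetical second filter
var chainedFilter = new ChainedFilter(new Filter[] { duplicateFilter, extensionFilter }, ChainedFilter.AND); // documents must pass every filter
ScoreDoc[] scoreDocs = searcher.Search(query, chainedFilter, MaxNumberOfHits).ScoreDocs;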

It turned out that DuplicateFilter has a known bug which prevents it from working with an index containing multiple segments (a Lucene index is a folder with a bunch of files; a new segment is created every time the writer flushes, and I happened to have only one segment on my machine).

Resorting to plan B, I chose not to index duplicate files in the first place. Lucene has a pretty powerful near-real-time (NRT) search feature which enables efficient searching of an index while documents are still being added to it. It is super easy to use: just grab a reader via GetReader() on the current writer; this automatically flushes data to disk and lets you search the current state of the index:

string key;  
if (syntax.Classes.Any() && syntax.Usings.Any())  
{
    key = string.Format("{0}.{1}", syntax.Usings.First(), syntax.Classes.First().ClassName);
    using (var reader = writer.GetReader()) // We want to get a new reader once per document
    {
        var term = new Term(Fields.Key, key);
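        // Assumes Fields.Key is indexed untokenized (e.g. Field.Index.NOT_ANALYZED), so this exact-match term lookup can find it.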
        var docs = reader.TermDocs(term);
        if (docs.Next())
        {
            return false; // We have already indexed this file.
        }
    }
}
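
For context, here is roughly where that check sits in my indexing pass. Again just a sketch: ShouldIndex wraps the snippet above, while ParseWithRoslyn and BuildDocument are hypothetical helpers standing in for the real parsing and document-building code:

foreach (var file in Directory.EnumerateFiles(rootPath, "*.cs", SearchOption.AllDirectories))
{
    var syntax = ParseWithRoslyn(file);              // hypothetical: extracts Classes and Usings
    if (!ShouldIndex(writer, syntax))                // the NRT duplicate check shown above
    {
        continue;                                    // key is already in the index, skip the file
    }
    writer.AddDocument(BuildDocument(file, syntax)); // hypothetical: maps file content and syntax to Lucene fields
}
writer.Commit();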

And that basically does it - no duplicate files will be indexed. NRT does put a dent in performance because of the extra I/O. In my test of 32,000 files, half of which turned out to be duplicates, the performance hit was around 40%, and almost twice as many index segment files were created - 89 versus 14 without NRT search. But hey, it works on other machines too...