Mike Torrettinni

Performance - Find duplicates: Iteration vs Binary Search


I have a simple search for duplicated integers in an array, and I was doing a simple benchmark exercise that I thought would show that a simple one-pass iteration is faster than binary search (because binary search needs to sort the data first).

But it seems that for large arrays, sort + binary search is faster than simple iteration:

 

The first number is iteration, the second number is binary search. (The execution loop count is adjusted per array size so large arrays don't run for hours, so timings between different record counts are not comparable; only compare iteration vs. binary search within each record count.)

[Image: benchmark timings, iteration vs. binary search, per record count]

 

Is there an obvious mistake in my benchmark, or is sort + binary search really that much faster than simple one-pass iteration?

 

In most cases, the data is like this:

1. The array's initial state is the numbers 1..n

2. Then some changes occur and new numbers replace a few old ones at random places (in my test I add 10% of numbers above n)

3. At some point a few duplicates can occur

 

*To add: the binary search approach first has to copy the array, because I need to keep the original array in its original order. So that's another reason I thought binary search would be slower.
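
An alternative I may try later (just a sketch, not what the benchmark below measures; it reuses the TDataRec type and the units already in the program's uses clause): sort an array of indices instead of copying the records, so the original array order is never disturbed.

// Sketch: sort indices by IntId, then mark duplicates via adjacent indices.
// The dynamic array parameter shares its data with the caller, so the
// Duplicate flags set here are visible outside.
procedure FindDuplicates_ViaIndexSort(aData: TArray<TDataRec>);
var
  vIdx: TArray<integer>;
  i: integer;
begin
  SetLength(vIdx, Length(aData));
  for i := 0 to High(vIdx) do
    vIdx[i] := i;
  // sort the indices by the IntId they point at
  TArray.Sort<integer>(vIdx, TComparer<integer>.Construct(
    function(const L, R: integer): integer
    begin
      if aData[L].IntId < aData[R].IntId then
        Result := -1
      else if aData[L].IntId > aData[R].IntId then
        Result := 1
      else
        Result := 0;
    end));
  // in sorted index order, equal neighbouring IntIds are duplicates
  for i := 0 to High(vIdx) - 1 do
    if aData[vIdx[i]].IntId = aData[vIdx[i + 1]].IntId then
    begin
      aData[vIdx[i]].Duplicate := True;
      aData[vIdx[i + 1]].Duplicate := True;
    end;
end;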

 

program Project1;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, System.StrUtils, System.Diagnostics, System.Generics.Defaults, System.Generics.Collections;

type
  TDataRec = record
    IntId: integer;
    Duplicate: boolean;
  end;

  TCustomComparer = class(TInterfacedObject, IComparer<TDataRec>)
  public
    function Compare(const Left, Right: TDataRec): Integer;
  end;

const
  cCounts = 6;
  cLengths: array[1..cCounts] of integer = (10,100,1000,10000,100000,1000000);
  cLoops: array[1..cCounts] of integer = (100000,10000, 1000, 100, 1, 1);

var xComparer : TCustomComparer;

function TCustomComparer.Compare(const Left, Right: TDataRec): Integer;
begin
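  // note: plain subtraction works here only because every IntId in this test
  // stays positive; with arbitrary values the difference could overflow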
  Result := Left.IntId - Right.IntId;
end;

// Get 2 random positions: a - where the duplicate will be, b - a pos where the original number is
procedure Get2RandomPositions(const aData: TArray<TDataRec>; out a, b: integer);
begin
  // making sure they don't point to the same position
  repeat
    a := Random(Length(aData));
    b := Random(Length(aData));
  until a <> b;
end;

procedure FillData(var aData: TArray<TDataRec>; aIdx: integer);
var i, a, b : Integer;
begin
  aData := nil;
  SetLength(aData, cLengths[aIdx]);
  // fill sequential integers
  for i := Low(aData) to High(aData) do
    aData[i].IntId := i + 1;

  // now replace ~10% of the values with numbers above the last number, at random positions
  for i := 1 to cLengths[aIdx] div 10 do
    // capped at MaxInt - cLengths[aIdx] so the sum cannot overflow Integer
    aData[Random(Length(aData))].IntId := cLengths[aIdx] + Random(MaxInt - cLengths[aIdx]);

  // create a duplicate pair for ~10% of the values: copy the value at pos b to pos a
  for i := 1 to cLengths[aIdx] div 10 do
  begin
    Get2RandomPositions(aData, a, b);
    aData[a].IntId :=  aData[b].IntId;
  end;
end;

// brute-force scan: compare each element with every later one, O(n^2)
procedure FindDuplicates(const aData: TArray<TDataRec>);
var i, j: Integer;
begin
  for i := Low(aData) to High(aData) - 1 do
    for j := i + 1 to High(aData) do
      if aData[i].IntId = aData[j].IntId then
      begin
        aData[i].Duplicate := True;
        aData[j].Duplicate := True;
      end;
end;

function ResultsAreOK(const aData1, aData2: TArray<TDataRec>): boolean;
var i, j: Integer;
begin
  for i := Low(aData1) to High(aData1) do
  begin
    for j := Low(aData2) to High(aData2) do
      if (aData2[j].IntId = aData1[i].IntId) and (aData2[j].Duplicate <> aData1[i].Duplicate) then
          Exit(false);
  end;
  Result := True;
end;

// Copy the data so both approaches run on identical input (Duplicate flags start cleared)
procedure CopyData(const aDataIn:TArray<TDataRec>; out aDataOut: TArray<TDataRec>);
var i: integer;
begin
  aDataOut := nil;
  SetLength(aDataOut, Length(aDataIn));
  for i := Low(aDataIn) to High(aDataIn) do
    aDataOut[i].IntId := aDataIn[i].IntId;
end;

function Time1(aIdx: integer): Int64;
var vData: TArray<TDataRec>;
    vSW: TStopWatch;
    i: Integer;
begin
  FillData(vData, aIdx);
  vSW := TStopwatch.StartNew;
  for i := 1 to cLoops[aIdx] do
    FindDuplicates(vData);
  Result := vSW.ElapsedMilliseconds;
end;

procedure SortData(var aData:TArray<TDataRec>);
begin
  TArray.Sort<TDataRec>(aData, xComparer);
end;

procedure FindDuplicates_BinSearch(var aData:TArray<TDataRec>);
var
  i,j: Integer;
begin
  for i := Low(aData) to High(aData) do
  begin
    if TArray.BinarySearch<TDataRec>(aData, aData[i], j, xComparer) then
      if i <> j then
      begin
        aData[i].Duplicate := True;
        aData[j].Duplicate := True;
      end;
  end;
end;

procedure RunTests;
var vData1, vData2: TArray<TDataRec>;
    c, i, t1, t2: int64;
    vSW: TStopWatch;
begin
  xComparer := TCustomComparer.Create;
  for c := 1 to cCounts do
  begin
    FillData(vData1, c);

    Write(cLengths[c].ToString + ' records:  ');

    // loop array
    vSW := TStopwatch.StartNew;
    for i := 1 to cLoops[c] do
      FindDuplicates(vData1);
    t1 := vSW.ElapsedMilliseconds;

    // binary search
    vSW := TStopwatch.StartNew;
    for i := 1 to cLoops[c] do
    begin
      CopyData(vData1, vData2);
      SortData(vData2);
      FindDuplicates_BinSearch(vData2);
    end;
    t2 := vSW.ElapsedMilliseconds;

    if not ResultsAreOK(vData1, vData2) then
      raise Exception.Create('issue in data!');

    Writeln(Format('%10d vs %d', [t1, t2]));
  end;
end;

begin
  Randomize;
  RunTests;
  readln;
end.

 

 



Dave Novo

Why are you putting CopyData inside the stopwatch for the binary search? That is not fair for comparison.

You are comparing via an IComparer. That basically calls a function for every comparison, which is very slow compared to a native comparison, i.e. int1 = int2. So minimizing the number of comparisons with the binary search in the example above has an outsized effect; with a native comparison I would expect the linear search to hold its own for longer.
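
For illustration, a hand-rolled binary search with the comparison done natively might look like this (untested sketch, using the TDataRec from your program; the array must already be sorted by IntId):

function BinarySearchIntId(const aData: TArray<TDataRec>; aValue: integer;
  out aFoundIdx: integer): boolean;
var
  L, R, M: integer;
begin
  Result := False;
  aFoundIdx := -1;
  L := 0;
  R := High(aData);
  while L <= R do
  begin
    M := L + (R - L) div 2;
    if aData[M].IntId < aValue then      // native integer comparison,
      L := M + 1                         // no interface-dispatched Compare call
    else if aData[M].IntId > aValue then
      R := M - 1
    else
    begin
      aFoundIdx := M;
      Exit(True);
    end;
  end;
end;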

Your linear search for duplicates is essentially a sort anyhow, and that offsets the explicit sort you are doing for the binary search.

 

Try sorting and then doing the linear duplicate search on the sorted data; that is easy, you just compare adjacent elements.
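
Something like this (untested sketch, reusing the TDataRec type and xComparer from your program):

procedure FindDuplicates_SortedScan(var aData: TArray<TDataRec>);
var
  i: integer;
begin
  TArray.Sort<TDataRec>(aData, xComparer);
  // after sorting, every duplicate sits next to its twin,
  // so one linear pass over neighbours finds them all
  for i := Low(aData) to High(aData) - 1 do
    if aData[i].IntId = aData[i + 1].IntId then
    begin
      aData[i].Duplicate := True;
      aData[i + 1].Duplicate := True;
    end;
end;

One O(n log n) sort, then the scan itself is a single O(n) pass, and no binary search at all.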


I thought that my iteration through the array is similar to what a sort does: it has to iterate over and compare items across the full array.

I already improved the iteration so it doesn't scan the full array for each item, but only from the current item forward to the end.

3 minutes ago, Dave Novo said:

Why are you putting CopyData inside the stopwatch for the binary search? That is not fair for comparison.

That's exactly why I thought binary search would be much slower: it needs to copy the data (as it would in a real case, because I need to preserve the order of the original array) and then sort it. But the copy and sort are not the big performance hog for 100+ items.

 

 


You don't need a binary search here. With sorting, you sort and then a single pass is all you need; that has the same complexity as the sort, O(n log n). Without sorting you compare all against all, so O(n^2). For n = 1,000,000 that is roughly 2*10^7 steps versus about 5*10^11 comparisons.

 

But yeah, no binary search. 

