How might you solve this data visualization problem?

David Schwartz · April 2, 2020

I'm using an OCR library on some text and I'm getting a lot of "noise" characters. There's some demo code that looks like this:

  For C:=0 To OCRResultsHeader.NumItems-1 Do
  Begin
    S:=S+Chr(OCRResultItems[C].OCRChar);
    If OCRResultItems[C].OCRChar=13 Then
      S:=S+#10;
    AMsg('  OCRCha'#9    +IntToStr  (OCRResultItems[C].OCRChar   ), False);
    AMsg('  Confidence'#9+FloatToStr(OCRResultItems[C].Confidence), False);
  End;
  MemoResult.Lines.Add(S);

OCRChar is the character code at position C in the converted results (OCRResultItems), and Confidence is a float in the half-open range [0..1), i.e., it's always < 1.0.

I want to get a sense of what the Confidence is for individual characters, so I was thinking of taking n := Trunc(Confidence * 10) to get a number in the range [0..9].

Then I'd map that to colors, like: [light-red, med-red, dark-red, light-yellow, med-yellow, dark-yellow, light-green, med-green, dark-green, white]
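A minimal sketch of that bucket-to-color mapping, assuming the confidence arrives as a Double. The names ConfidenceBucket, ConfidenceColor and ConfidenceColors are made up for illustration, and the exact shades are placeholders. The values are plain integers in the $00BBGGRR layout, so they can be assigned to a TColor.

```pascal
const
  // Hypothetical palette: index 0 is the lowest confidence, 9 the highest.
  // Values use the $00BBGGRR layout that TColor uses for plain RGB colors.
  ConfidenceColors: array[0..9] of Integer = (
    $00C0C0FF,  // light red
    $008080FF,  // medium red
    $000000C0,  // dark red
    $00C0FFFF,  // light yellow
    $0000FFFF,  // medium yellow
    $0000C0C0,  // dark yellow
    $00C0FFC0,  // light green
    $0000FF00,  // medium green
    $00008000,  // dark green
    $00FFFFFF); // white

// Maps a confidence in [0..1) to a bucket 0..9.
// Out-of-range input is clamped rather than trusted.
function ConfidenceBucket(Confidence: Double): Integer;
begin
  Result := Trunc(Confidence * 10);
  if Result < 0 then
    Result := 0;
  if Result > 9 then
    Result := 9;
end;

// Looks up the palette entry for a confidence value.
function ConfidenceColor(Confidence: Double): Integer;
begin
  Result := ConfidenceColors[ConfidenceBucket(Confidence)];
end;
```

The clamp is only there in case the library ever returns exactly 1.0 or a negative value. Values that sit right on a bucket boundary may land on either side because of floating-point rounding.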

Then I'd display each letter in a RichEdit and set its background color ("highlight") to a red/yellow/green/white color based on 'n'.
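One way the highlighting could be done is to select the character and send EM_SETCHARFORMAT with a CHARFORMAT2 record whose background color is set. This is a sketch, not tested against this OCR output: HighlightChars is a made-up helper name, and it assumes a VCL TRichEdit backed by a RichEdit version that supports CHARFORMAT2 (2.0 or later).

```pascal
uses
  Winapi.Windows, Winapi.RichEdit, Vcl.Graphics, Vcl.ComCtrls;

// Hypothetical helper: sets the background color of Len characters
// starting at the 0-based offset Start in the rich edit control.
procedure HighlightChars(RE: TRichEdit; Start, Len: Integer; BackColor: TColor);
var
  Fmt: TCharFormat2;
begin
  // The format is applied to the current selection.
  RE.SelStart := Start;
  RE.SelLength := Len;

  FillChar(Fmt, SizeOf(Fmt), 0);
  Fmt.cbSize := SizeOf(Fmt);
  Fmt.dwMask := CFM_BACKCOLOR;              // only the background color is valid
  Fmt.crBackColor := ColorToRGB(BackColor); // resolve system colors to plain RGB
  SendMessage(RE.Handle, EM_SETCHARFORMAT, SCF_SELECTION, LPARAM(@Fmt));
end;
```

Calling it once per character, for example HighlightChars(RichEdit1, I, 1, SomeColor), would color each letter separately. The offsets are positions as the control counts them, which may differ from string offsets once line breaks are involved.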

Unfortunately, I can't tell this library that in certain places I'm only looking for numbers. So, for example, it often spits out things that look like a '1' (vertical lines) but aren't. Same for zeros ('0'). A lot of noise characters show up as vertical bars (|), periods, and underscores. I'm thinking that if they all show up with red backgrounds, then I could filter them out by their confidence values rather than by text matching against similarly shaped characters (homoglyphs?).
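If the noise really does cluster at low confidence, the filter itself could be as simple as the sketch below. TOcrItem is a hypothetical stand-in for one entry of OCRResultItems (the library's real record may use different field types), MakeItem and KeepConfident are made-up names, and the threshold would have to be found by experiment.

```pascal
type
  // Hypothetical stand-in for one entry of OCRResultItems.
  TOcrItem = record
    OCRChar: Integer;    // character code, as in the demo code
    Confidence: Double;  // assumed to be in [0..1)
  end;

// Convenience constructor for building test data.
function MakeItem(Ch: Char; Confidence: Double): TOcrItem;
begin
  Result.OCRChar := Ord(Ch);
  Result.Confidence := Confidence;
end;

// Returns the text with every character below MinConfidence dropped.
function KeepConfident(const Items: array of TOcrItem;
  MinConfidence: Double): string;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Items) to High(Items) do
    if Items[I].Confidence >= MinConfidence then
      Result := Result + Chr(Items[I].OCRChar);
end;
```

Note that this drops line breaks too if they come back with a low confidence, so the CR handling may need to be kept separate from the threshold test.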

Here's my code that sets various boundaries on the width and number of lines, plus it splits lines up in the stringlist. I'm not using the Confidence value here yet.

    S := '';
    for n := 0 To OCRResultsHeader.NumItems-1 do
    begin
      ch := OCRResultItems[n].OCRChar;
      conf := OCRResultItems[n].Confidence;
      if (ch = 13) then
      begin
        if (Length(S) > 4) then // nothing we want is <= 4 chars in length
          scanned_rslts_sl.Add( S );
        if (scanned_rslts_sl.Count >= 9) then // it's going to be in the first 9 lines...
          Break;
        S := '';
      end;
      if (Length(S) > 20) then  // we're looking for an 8-digit number, not a novella
        Continue;

      if (ch >= 32) then // ignore control chars (32 = $20 is the space char)
        S := S + Chr(ch);
    end;

One thing I was thinking was to allocate an array of bytes the same length as each line, with one byte per character, and attach it to the stringlist via AddObject. (I'm not exactly sure how to allocate an array of bytes from the heap for each line to attach as that line's Objects[] value, though. Does it need to be an actual "object"? Or can it just be a pointer to an array of bytes?)

Anyway, I'm curious how to add the array of byte values for 'n' for each character to the Objects array, and also if anybody has a more obvious or interesting approach.
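One possible shape for this, as a sketch only: Objects[] is typed as TObject, so a raw pointer would need a cast and manual freeing, while a tiny holder class keeps it type-safe. TLineConfidence, AddScannedLine and BucketAt are made-up names for illustration.

```pascal
uses
  SysUtils, Classes;

type
  // Hypothetical holder: one confidence bucket (0..9) per character of a line.
  TLineConfidence = class
  public
    Buckets: TBytes;
  end;

// Adds a line and attaches its per-character buckets as the Objects[] entry.
procedure AddScannedLine(SL: TStringList; const Line: string;
  const Buckets: TBytes);
var
  Holder: TLineConfidence;
begin
  Holder := TLineConfidence.Create;
  // Take a private copy so later changes by the caller don't leak in.
  Holder.Buckets := Copy(Buckets, 0, Length(Buckets));
  SL.AddObject(Line, Holder);
end;

// Returns the bucket for a character; CharIndex is 1-based like string indexing.
function BucketAt(SL: TStringList; LineIndex, CharIndex: Integer): Byte;
begin
  Result := TLineConfidence(SL.Objects[LineIndex]).Buckets[CharIndex - 1];
end;
```

If the list has OwnsObjects set to True, the holders are freed along with the list; otherwise each Objects[] entry has to be freed by hand before the list goes away.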
