David Schwartz
Posted April 2, 2020

I'm using an OCR library on some text and I'm getting a lot of "noise" characters. There's some demo code that looks like this:

    For C:=0 To OCRResultsHeader.NumItems-1 Do
    Begin
      S:=S+Chr(OCRResultItems[C].OCRChar);
      If OCRResultItems[C].OCRChar=13 Then S:=S+#10;
      AMsg(' OCRCha'#9 +IntToStr (OCRResultItems[C].OCRChar ), False);
      AMsg(' Confidence'#9+FloatToStr(OCRResultItems[C].Confidence), False);
    End;
    MemoResult.Lines.Add(S);

OCRChar is the character at position C in the converted results (OCRResultItems), and Confidence is a Float in the range [0..1), i.e., it's always < 1.0.

I want to get a sense of what the Confidence is for individual characters, so I was thinking of taking n := Trunc(Confidence * 10) to get a number in the range [0..9]. (A plain Integer(...) cast won't compile on a floating-point value, so it has to be Trunc.) Then I'd map that to colors, like:

    [light-red, med-red, dark-red, light-yellow, med-yellow, dark-yellow, light-green, med-green, dark-green, white]

Then I'd display each letter in a RichEdit and set its background color ("highlight") to a red/yellow/green/white color based on 'n'.

Unfortunately, I can't tell this library that in certain places I'm only looking for numbers. So, for example, it often spits out things that look like a '1' (vertical lines) but aren't. Same for zeros ('0'). A lot of noise characters show up as vertical bars ('|'), periods, and underscores. I'm thinking that if they all show up with red backgrounds, then I could filter them out by their confidence values rather than by text matching against similarly shaped characters (homoglyphs?).

Here's my code that sets various boundaries on the width and number of lines, plus it splits lines up in the stringlist. I'm not using the Confidence value here yet.
    S := '';
    for n := 0 To OCRResultsHeader.NumItems-1 do
    begin
      ch := OCRResultItems[n].OCRChar;
      conf := OCRResultItems[n].Confidence;   // read but not used yet
      if (ch = 13) then
      begin
        if (Length(S) > 4) then               // nothing we want is <= 4 chars in length
          scanned_rslts_sl.Add( S );          // AddObject needs a second (object) argument
        if (scanned_rslts_sl.Count > 9) then  // it's going to be in the first 9 lines...
          Break;
        S := '';
      end;
      if (Length(S) > 20) then                // we're looking for an 8-digit number, not a novella
        Continue;
      if (ch >= 32) then                      // ignore control chars (32, i.e. $20, is the space char)
        S := S + Chr(ch);
    end;

One thing I was thinking was to allocate an array of bytes the same length as each line and attach it to the stringlist via AddObject. (I'm not exactly sure how to allocate an array of bytes from the heap for each line and attach it as that line's Objects[] value, with one byte per character, tho. Does it need to be an actual "object"? Or can it just be a pointer to an array of bytes?)

Anyway, I'm curious how to store the byte value 'n' for each character of a line in the Objects array, and also whether anybody has a more obvious or interesting approach.
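
To make the Objects[] idea concrete, here is a minimal sketch of one way it might be done. It is not taken from the OCR library or its demo code. TLineConf and ConfToBucket are hypothetical names invented for this illustration, the sample line '123' and its confidence values are made up, and the sketch assumes a Delphi version (2009 or later) or Free Pascal where TStringList has the OwnsObjects property. Objects[] is declared as TObject, so a small wrapper class around a TBytes array is the tidy option. A raw pointer can be cast to TObject and stored there too, but then OwnsObjects has to stay False and the memory has to be freed by hand.

```pascal
program ConfidenceBuckets;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical holder: one confidence bucket (0..9) per character of a line.
  TLineConf = class
  public
    Buckets: TBytes;
  end;

// Map a confidence in [0..1) to a bucket 0..9, clamping anything out of range.
function ConfToBucket(Conf: Double): Byte;
var
  n: Integer;
begin
  n := Trunc(Conf * 10);
  if n < 0 then n := 0;
  if n > 9 then n := 9;
  Result := Byte(n);
end;

var
  sl: TStringList;
  lc: TLineConf;
  i: Integer;
begin
  sl := TStringList.Create;
  try
    sl.OwnsObjects := True;   // the list frees each attached TLineConf itself

    // Made-up sample data standing in for one scanned line.
    lc := TLineConf.Create;
    SetLength(lc.Buckets, 3);
    lc.Buckets[0] := ConfToBucket(0.95);
    lc.Buckets[1] := ConfToBucket(0.50);
    lc.Buckets[2] := ConfToBucket(0.05);
    sl.AddObject('123', lc);

    // Read the buckets back alongside the characters (strings are 1-based).
    lc := TLineConf(sl.Objects[0]);
    for i := 0 to High(lc.Buckets) do
      WriteLn(sl[0][i + 1], ' -> bucket ', lc.Buckets[i]);
  finally
    sl.Free;
  end;
end.
```

In the scanning loop above, the same pattern would presumably mean appending ConfToBucket(conf) to a per-line TBytes each time a character is appended to S, then calling AddObject(S, lc) instead of Add(S) when the carriage return arrives. The bucket value can then index directly into the ten-entry color list when painting the RichEdit.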