Jump to content
dummzeuch

Guessing the decimal separator

Recommended Posts

Sometimes I get files with text representations of numbers where it is not known what is used for the decimal separator. (I'm sure I am not the only one.) For this type of data, I have written the following function:

function GuessDecimalSeparator(const _s: string): Char;
var
  i: Integer;
  CommaCnt: Integer;
begin
  CommaCnt := 0;
  Result := '.';
  for i := 1 to Length(_s) do begin
    case _s[i] of
      '.': begin
          Result := '.';
        end;
      ',': begin
          Inc(CommaCnt);
          Result := ',';
        end;
    end;
  end;
  if (Result = ',') and (CommaCnt = 1) then
    Exit;
  Result := '.';
end;

It takes a string which is supposed to be a number in decimal representation, which can optionally contain thousands separators and a decimal separator. It then tries to guess, which one of '.' or ',' is the decimal separator.

The algorithm is:

Start at the beginning and count the occurrences of '.' and ','. The one that only occurs once is the decimal separator. If there is none, return '.' (because it does not matter).

 

I can see the following problems with this approach:

  1. There are other possible decimal / thousands separators than '.' and ',' (personally I don't know any, but my experience is limited to European and American number formats).
  2. For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)
  3. For malformed strings (e.g. '1.00.00' or '1,00,00') the result is most likely wrong, but that's simply garbage in -> garbage out. I guess this could be improved.

 

One possible improvement would be to not just process one number but many and check whether the results match.

Any additional thoughts on this?

Edited by dummzeuch

Share this post


Link to post
30 minutes ago, Attila Kovacs said:

the first one from the right

I started out with that approach, but:

 

In '1.000.000' the decimal separator would definitely not be '.'. (or '1,000,000' it wouldn't be ',').

Share this post


Link to post

I would add a little more safety anyway,

to ensure that you' in a number at all, e.g. like

 

_s[i-1].IsSpace or _s[i-1].IsDigit
_s[i+1].IsSpace or _s[i-1].IsDigit

and probably if you find the first dot in a new number, just analyse this number until the end.

 

I assume this would give a little more stable results, to avoid things like "a,b,c,d" counting wrong.

 

 

  • Thanks 1

Share this post


Link to post
function GuessDecimalSeparator(const Value: string): Char;
{ assumes that the decimal separator is the last one and does not appear more than once }
var
  idx: Integer;
begin
  idx := Value.LastIndexOfAny(['.', ',']); // possible decimal separators
  if (idx >= 0) then begin
    Result := Value.Chars[idx];
    if Value.CountChar(Result) = 1 then Exit;
  end;
  Result := FormatSettings.DecimalSeparator; // Default
end;

 

  • Like 1
  • Thanks 1

Share this post


Link to post
37 minutes ago, Uwe Raabe said:

function GuessDecimalSeparator(const Value: string): Char;
{ assumes that the decimal separator is the last one and does not appear more than once }
var
  idx: Integer;
begin
  idx := Value.LastIndexOfAny(['.', ',']); // possible decimal separators
  if (idx >= 0) then begin
    Result := Value.Chars[idx];
    if Value.CountChar(Result) = 1 then Exit;
  end;
  Result := FormatSettings.DecimalSeparator; // Default
end;

 

Will also fail on '1.000.000'.

(Yes, I have read the comment, but that case is not as rare that I would accept the function to fail on it.)

Share this post


Link to post

If making a mistake is not critical, just choose any of these algorithms; without knowing the context, and without detecting the language of the surrounding text, they seem about equally risky to me.

If you are transferring money, or worse, measuring dosage of medicines, then none of them are of any practical use.

Share this post


Link to post
35 minutes ago, dummzeuch said:

Will also fail on '1.000.000'.

Define "fail". For German language settings it returns a comma, which is probably correct in most cases.

 

If the choice is either comma or point, the algorithm can be adapted to that. Otherwise the task is just not fully defined for all inputs.

 

So what would you call a correct result when the decimal separator is not part of the string?

  • Like 1

Share this post


Link to post
3 hours ago, dummzeuch said:

In '1.000.000' the decimal separator would definitely not be '.'. (or '1,000,000' it wouldn't be ',').

Why would you search decimal separator on integers?

What formula do you think would be feasible to tell what 1.000 is?

Check the first X rows, if still not obvious check another X, if still not and EOF, panic.

Share this post


Link to post
On 2/4/2019 at 8:56 PM, Attila Kovacs said:

Why would you search decimal separator on integers?

Because I don't know that it is an integer. I get a string that is supposed to be a number. I assume that the decimal separator is either a comma or a dot. I try to guess which one it is.

One criterion: The decimal separator is only allowed once.

On 2/4/2019 at 8:56 PM, Attila Kovacs said:

What formula do you think would be feasible to tell what 1.000 is?

There is none, as I already wrote in my original post:

On 2/4/2019 at 4:47 PM, dummzeuch said:

For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)

 

 

 

Edited by dummzeuch

Share this post


Link to post
46 minutes ago, dummzeuch said:

I assume that the decimal separator is either a comma or a dot. I try to guess which one or is.

One criterion: The decimal separator is only allowed once.

function GuessDecimalSeparator(const Value: string): Char;
{ Assumes that the decimal separator is the last one and does not appeat more than once.
  Only dot and comma are treated as possible decimal separator. }
const
  cSeparators: array[Boolean] of Char = ('.', ','); // possible decimal separators
var
  idx: Integer;
begin
  Result := cSeparators[False]; // default if none of the separators is found - alternative "True"
  idx := Value.LastIndexOfAny(cSeparators);
  if (idx >= 0) then begin
    Result := Value.Chars[idx];
    if Value.CountChar(Result) > 1 then begin
      { if it appears more than once we use the other one }
      Result := cSeparators[Result = cSeparators[False]];
    end;
  end
end;

 

  • Thanks 1

Share this post


Link to post
14 hours ago, Uwe Raabe said:
15 hours ago, dummzeuch said:

Will also fail on '1.000.000'.

Define "fail". For German language settings it returns a comma, which is probably correct in most cases.

Fail means that it might improperly return '.' on English language settings. If I knew the correct language settings for the input, I wouldn't have to guess the decimal separator.

14 hours ago, Uwe Raabe said:

If the choice is either comma or point, the algorithm can be adapted to that. Otherwise the task is just not fully defined for all inputs.

Yes, I am aware of that (as I said in the original post).

 

14 hours ago, Uwe Raabe said:

So what would you call a correct result when the decimal separator is not part of the string?

In that case it would not matter. I could simply use '.' or ',' to convert it to a number and would get the correct result for either.

Share this post


Link to post
Quote

Define "fail".

I don't think a preliminary "guess" might help, when e.g. TryStrToFloat() fails.
It is maybe only a general switch on what is the preferred method to use.
Even if the guess is valid, and you have only ONE wrong decimal number, you have to handle the TryStrToFloat() failure anyway.
So what ?

Edited by Rollo62

Share this post


Link to post
1 hour ago, John Kouraklis said:

Why don't you use TFormatSettings? It should load the locale of the machine

The file might have been created on another machine.

Share this post


Link to post
Quote

For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)

@dummzeuch This is impossible to guess without any context. Maybe if you can sample a few numbers in the file you may be able to figure it out

Share this post


Link to post

It is impossible to decide for 100%. But possible to guess taking into account ThousandSeparator and DecimalSeparator. It will work in 95.38% of the cases. All other - up to the user.

  • Like 1
  • Haha 1

Share this post


Link to post
38 minutes ago, Dmitry Arefiev said:

It will work in 95.38% of the cases.

Did you know that 86.64% of all statistics are made up? The remaining 53.42% have been tampered with. 

 

Seriously, is there any sense behind this number?

  • Haha 1

Share this post


Link to post

This seems to be simple problem, but nothing but trivial. I think is very hard to get 100% right. Depending on use cases. Is it about user input or random set of Xml-files to read etc.

 

This is one of those things that if there would have been solved by any standard from dawn of times. Too much variations. Dates and/or times are even worse, at least the Date part.

 

If someone has tons of time and good test sets, would be good to have "Fuzzy string conversion library" which would at least try to make conversions like this.

 

-Tee-

Share this post


Link to post
function GuessDecimalSeparator(const Value: string): Char;
var i,commacount,dotcount,lastsep:integer; c:char;
Const DefaultDecimalSeparator='.';
begin  
 commacount:=0;
 dotcount:=0;
 lastsep:=0;
 
 for i:=1 to length(value) do
 begin   
   c:=value[i];
   if c = '.' then
   begin
     inc(dotcount);
     lastsep:=i;
   end;   
   if c = ',' then
   begin
     inc(commacount);
     lastsep:=i;
   end;        
 end; 
 
Case (dotcount + commacount) of 
    0: result:=DefaultDecimalSeparator;    //Default             
    1: Result:=Value[lastsep];  //accept the one separator
    ELSE 
        //If a separator occurs more than once, it must be the thousand separator        
        Begin 
           if (commacount > dotcount) then result:='.' //multiple commas 
         else 
           if (dotcount > commacount) then result:=',' //multiple dots 
         else //equal amounts, take last separator 
           result:= Value[lastsep];            
        End;
 end;
end;

 

This function uses the simple assumption that a decimal separator should occur at most one time in the string, but thousand separators may occur multiple times.

 

 

  • Thanks 1

Share this post


Link to post
36 minutes ago, A.M. Hoornweg said:

This function uses the simple assumption that a decimal separator should occur at most one time in the string, but thousand separators may occur multiple times.

It also assumes that neither of them may be anything but a period or a comma. In Europe alone this will lead to errors. Switzerland for example allows for this to be correct: 1'000. Other countries allow this 1 000 000 or even more ludicrous 1/000/000. As an academic exercise, this project is great. But not suitable for production. I strongly advise against guessing games.

  • Thanks 1

Share this post


Link to post

What about the "Quantum Computing" APIs out there?

But it think having a human* look the file over will be cheaper!

Excel does that, i seem to remember, it displays a sample of converted data and the human can select other import options.

 

* Not me.

  • Haha 1

Share this post


Link to post

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×