Guessing the decimal separator

Recommended Posts

Sometimes I get files with text representations of numbers where it is not known what is used for the decimal separator. (I'm sure I am not the only one.) For this type of data, I have written the following function:

```function GuessDecimalSeparator(const _s: string): Char;
var
i: Integer;
CommaCnt: Integer;
begin
CommaCnt := 0;
Result := '.';
for i := 1 to Length(_s) do begin
case _s[i] of
'.': begin
Result := '.';
end;
',': begin
Inc(CommaCnt);
Result := ',';
end;
end;
end;
if (Result = ',') and (CommaCnt = 1) then
Exit;
Result := '.';
end;```

It takes a string which is supposed to be a number in decimal representation, which can optionally contain thousands separators and a decimal separator. It then tries to guess, which one of '.' or ',' is the decimal separator.

The algorithm is:

Start at the beginning and count the occurrences of '.' and ','. The one that only occurs once is the decimal separator. If there is none, return '.' (because it does not matter).

I can see the following problems with this approach:

1. There are other possible decimal / thousands separators than '.' and ',' (personally I don't know any, but my experience is limited to European and American number formats).
2. For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)
3. For malformed strings (e.g. '1.00.00' or '1,00,00') the result is most likely wrong, but that's simply garbage in -> garbage out. I guess this could be improved.

One possible improvement would be to not just process one number but many and check whether the results match.

Edited by dummzeuch

the first one from the right

30 minutes ago, Attila Kovacs said:

the first one from the right

I started out with that approach, but:

In '1.000.000' the decimal separator would definitely not be '.'. (or '1,000,000' it wouldn't be ',').

I would add a little more safety anyway,

to ensure that you' in a number at all, e.g. like

```_s[i-1].IsSpace or _s[i-1].IsDigit
_s[i+1].IsSpace or _s[i-1].IsDigit```

and probably if you find the first dot in a new number, just analyse this number until the end.

I assume this would give a little more stable results, to avoid things like "a,b,c,d" counting wrong.

• 1

```function GuessDecimalSeparator(const Value: string): Char;
{ assumes that the decimal separator is the last one and does not appear more than once }
var
idx: Integer;
begin
idx := Value.LastIndexOfAny(['.', ',']); // possible decimal separators
if (idx >= 0) then begin
Result := Value.Chars[idx];
if Value.CountChar(Result) = 1 then Exit;
end;
Result := FormatSettings.DecimalSeparator; // Default
end;```

• 1
• 1

37 minutes ago, Uwe Raabe said:
```
function GuessDecimalSeparator(const Value: string): Char;
{ assumes that the decimal separator is the last one and does not appear more than once }
var
idx: Integer;
begin
idx := Value.LastIndexOfAny(['.', ',']); // possible decimal separators
if (idx >= 0) then begin
Result := Value.Chars[idx];
if Value.CountChar(Result) = 1 then Exit;
end;
Result := FormatSettings.DecimalSeparator; // Default
end;```

Will also fail on '1.000.000'.

(Yes, I have read the comment, but that case is not as rare that I would accept the function to fail on it.)

If making a mistake is not critical, just choose any of these algorithms; without knowing the context, and without detecting the language of the surrounding text, they seem about equally risky to me.

If you are transferring money, or worse, measuring dosage of medicines, then none of them are of any practical use.

35 minutes ago, dummzeuch said:

Will also fail on '1.000.000'.

Define "fail". For German language settings it returns a comma, which is probably correct in most cases.

If the choice is either comma or point, the algorithm can be adapted to that. Otherwise the task is just not fully defined for all inputs.

So what would you call a correct result when the decimal separator is not part of the string?

• 1

3 hours ago, dummzeuch said:

In '1.000.000' the decimal separator would definitely not be '.'. (or '1,000,000' it wouldn't be ',').

Why would you search decimal separator on integers?

What formula do you think would be feasible to tell what 1.000 is?

Check the first X rows, if still not obvious check another X, if still not and EOF, panic.

On 2/4/2019 at 8:56 PM, Attila Kovacs said:

Why would you search decimal separator on integers?

Because I don't know that it is an integer. I get a string that is supposed to be a number. I assume that the decimal separator is either a comma or a dot. I try to guess which one it is.

One criterion: The decimal separator is only allowed once.

On 2/4/2019 at 8:56 PM, Attila Kovacs said:

What formula do you think would be feasible to tell what 1.000 is?

There is none, as I already wrote in my original post:

On 2/4/2019 at 4:47 PM, dummzeuch said:

For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)

Edited by dummzeuch

46 minutes ago, dummzeuch said:

I assume that the decimal separator is either a comma or a dot. I try to guess which one or is.

One criterion: The decimal separator is only allowed once.

```function GuessDecimalSeparator(const Value: string): Char;
{ Assumes that the decimal separator is the last one and does not appeat more than once.
Only dot and comma are treated as possible decimal separator. }
const
cSeparators: array[Boolean] of Char = ('.', ','); // possible decimal separators
var
idx: Integer;
begin
Result := cSeparators[False]; // default if none of the separators is found - alternative "True"
idx := Value.LastIndexOfAny(cSeparators);
if (idx >= 0) then begin
Result := Value.Chars[idx];
if Value.CountChar(Result) > 1 then begin
{ if it appears more than once we use the other one }
Result := cSeparators[Result = cSeparators[False]];
end;
end
end;```

• 1

14 hours ago, Uwe Raabe said:
15 hours ago, dummzeuch said:

Will also fail on '1.000.000'.

Define "fail". For German language settings it returns a comma, which is probably correct in most cases.

Fail means that it might improperly return '.' on English language settings. If I knew the correct language settings for the input, I wouldn't have to guess the decimal separator.

14 hours ago, Uwe Raabe said:

If the choice is either comma or point, the algorithm can be adapted to that. Otherwise the task is just not fully defined for all inputs.

Yes, I am aware of that (as I said in the original post).

14 hours ago, Uwe Raabe said:

So what would you call a correct result when the decimal separator is not part of the string?

In that case it would not matter. I could simply use '.' or ',' to convert it to a number and would get the correct result for either.

I would only accept numbers in a well defined format known to me.

https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping gives me the creeps. 😱

tl;dr: digit grouping (thousand separators) may also occur to the right side of the decimal separator.

BTW: I've seen something like this 1'000'000 quite often.

• 1

Quote

Define "fail".

I don't think a preliminary "guess" might help, when e.g. TryStrToFloat() fails.
It is maybe only a general switch on what is the preferred method to use.
Even if the guess is valid, and you have only ONE wrong decimal number, you have to handle the TryStrToFloat() failure anyway.
So what ?

Edited by Rollo62

Why don't you use TFormatSettings? It should load the locale of the machine

1 hour ago, John Kouraklis said:

Why don't you use TFormatSettings? It should load the locale of the machine

The file might have been created on another machine.

@Uwe Raabe Ah yes...

Quote

For strings in the form '1.000' (or '1,000') it assumes that the separator is the decimal separator. This could be wrong (That's why it's called GuessDecimalSeparator and not DetectDecimalSeparator.)

@dummzeuch This is impossible to guess without any context. Maybe if you can sample a few numbers in the file you may be able to figure it out

It is impossible to decide for 100%. But possible to guess taking into account ThousandSeparator and DecimalSeparator. It will work in 95.38% of the cases. All other - up to the user.

• 1
• 1

38 minutes ago, Dmitry Arefiev said:

It will work in 95.38% of the cases.

Did you know that 86.64% of all statistics are made up? The remaining 53.42% have been tampered with.

Seriously, is there any sense behind this number?

• 1

No sense, as in the trial to find 100% working guessing

Edited by Dmitry Arefiev

This seems to be simple problem, but nothing but trivial. I think is very hard to get 100% right. Depending on use cases. Is it about user input or random set of Xml-files to read etc.

This is one of those things that if there would have been solved by any standard from dawn of times. Too much variations. Dates and/or times are even worse, at least the Date part.

If someone has tons of time and good test sets, would be good to have "Fuzzy string conversion library" which would at least try to make conversions like this.

-Tee-

```function GuessDecimalSeparator(const Value: string): Char;
var i,commacount,dotcount,lastsep:integer; c:char;
Const DefaultDecimalSeparator='.';
begin
commacount:=0;
dotcount:=0;
lastsep:=0;

for i:=1 to length(value) do
begin
c:=value[i];
if c = '.' then
begin
inc(dotcount);
lastsep:=i;
end;
if c = ',' then
begin
inc(commacount);
lastsep:=i;
end;
end;

Case (dotcount + commacount) of
0: result:=DefaultDecimalSeparator;    //Default
1: Result:=Value[lastsep];  //accept the one separator
ELSE
//If a separator occurs more than once, it must be the thousand separator
Begin
if (commacount > dotcount) then result:='.' //multiple commas
else
if (dotcount > commacount) then result:=',' //multiple dots
else //equal amounts, take last separator
result:= Value[lastsep];
End;
end;
end;```

This function uses the simple assumption that a decimal separator should occur at most one time in the string, but thousand separators may occur multiple times.

• 1

36 minutes ago, A.M. Hoornweg said:

This function uses the simple assumption that a decimal sep﻿arator should occur at most one time in the string, but thousand separators may occur multiple times.

It also assumes that neither of them may be anything but a period or a comma. In Europe alone this will lead to errors. Switzerland for example allows for this to be correct: 1'000. Other countries allow this 1 000 000 or even more ludicrous 1/000/000. As an academic exercise, this project is great. But not suitable for production. I strongly advise against guessing games.

• 1

What about the "Quantum Computing" APIs out there?

But it think having a human* look the file over will be cheaper!

Excel does that, i seem to remember, it displays a sample of converted data and the human can select other import options.

* Not me.

• 1