JeanCremers

filenames with unicode chars


I have this code, but it does not catch the filename 'Duo Canopée play Ständchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm'. Why not?

 

#define DIACRITIC_COUNT 10
static const WideChar diacritics[DIACRITIC_COUNT] = {
    L'á', L'à', L'â', L'ä', L'ã', L'å',
    L'é', L'è', L'ê', L'ë',
};

static const WideChar replacements[DIACRITIC_COUNT] = {
    L'a', L'a', L'a', L'a', L'a', L'a',
    L'e', L'e', L'e', L'e',
};

if (FindFirst(dir + L"\\*.*", faAnyFile, sr) == 0)
{
    do {
        if (!(sr.Attr & faDirectory))
        {
            String newName = sr.Name;
            bool changed = false;
            // String (UnicodeString) is 1-indexed
            for (int i = 1; i <= newName.Length(); i++)
            {
                WideChar ch = newName[i];
                for (int j = 0; j < DIACRITIC_COUNT; j++)
                {
                    if (ch == diacritics[j])
                    {
                        newName[i] = replacements[j];
                        changed = true;
                    }
                }
            }
            if (changed) {
                // list the original name next to the proposed new name
                TListItem* item = ListView1->Items->Add();
                item->Caption = sr.Name;
                item->SubItems->Add(newName);
            }
        }
    } while (FindNext(sr) == 0);
    FindClose(sr);
}

 

Edited by JeanCremers
forgot [i]


It should read newName[ i ] = replacements[j];

The board strips the [i] as markup and does not let me change that.

Edited by JeanCremers


The debugger displays the name like

Debug Output: Duo Canope´e play Sta¨ndchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm

How do I catch these so I can rename them?

Edit: I had to use the Windows FindFirstFileW to get it right.

PS: I have used C++Builder 12 in the past, and this one is real shitty. FindFirst/FindNext not working properly, properties not surviving OnCreate(), and I had a crash that trashed my source completely. I'm not going to port my main app to this version.

Edited by JeanCremers

2 hours ago, JeanCremers said:

Edit: I had to use the Windows FindFirstFileW to get it right.

The RTL's Find(First|Next)() functions use the Win32 Find(First|Next)FileW() APIs internally, and have done so since 2009.

2 hours ago, JeanCremers said:

Properties not surviving OnCreate()

That might be related to the removal of the Form's OldCreateOrder property in RAD Studio 11.  But, you should never have been using the OnCreate (and OnDestroy) event(s) in C++ anyway, as that has always had the potential of introducing Undefined Behavior in user C++ code due to the different creation models between Delphi vs C++.  Use the Form's constructor (and destructor) instead, that is always safe.  And streamed property values are available in the constructor.

3 minutes ago, Remy Lebeau said:

The RTL's Find(First|Next)() functions use the Win32 Find(First|Next)FileW() APIs internally, and have done so since 2009.

 

Hi Remy, the files I mention really did not show up using the RTL ones. Of course I would prefer those. And in the debugger I got strange names like Canope´e play Sta¨ndchen.

Edited by JeanCremers


Feel free to step into the RTL source code for yourself with the debugger (see the code below).  On Windows, the RTL's Find(First|Next)() functions simply call the Win32 API Find(First|Next)FileW() functions and then copy the WIN32_FIND_DATA fields into the TSearchRec fields:

function FindFirstFile; external kernelbase name 'FindFirstFileW';
function FindNextFile; external kernelbase name 'FindNextFileW';

...

function FindMatchingFile(var F: TSearchRec): Integer;
...
begin
  while F.FindData.dwFileAttributes and F.ExcludeAttr <> 0 do
    if not FindNextFile(F.FindHandle, F.FindData) then
    begin
      Result := GetLastError;
      Exit;
    end;
  ...
  F.Name := F.FindData.cFileName; // <-- HERE
  Result := 0;
end;

function FindFirst(const Path: string; Attr: Integer;
  var F: TSearchRec): Integer;
const
  faSpecial = faHidden or faSysFile or faDirectory;
begin
  F.ExcludeAttr := not Attr and faSpecial;
  F.FindHandle := FindFirstFile(PChar(Path), F.FindData);
  if F.FindHandle <> INVALID_HANDLE_VALUE then
  begin
    Result := FindMatchingFile(F); // <-- HERE
    if Result <> 0 then FindClose(F);
  end
  else
    Result := GetLastError;
end;

function FindNext(var F: TSearchRec): Integer;
begin
  if FindNextFile(F.FindHandle, F.FindData) then
    Result := FindMatchingFile(F) // <-- HERE
  else
    Result := GetLastError;
end;

As you can see, the API's WIN32_FIND_DATA::cFileName is assigned as-is to the RTL's TSearchRec::Name field, and since both fields are based on WideChar (WIN32_FIND_DATA::cFileName is a WideChar[] array and TSearchRec::Name is a UnicodeString), there is no manipulation of the reported characters in any way; they are copied as-is.  What you get back in your code SHOULD be exactly what Windows actually reported.

 

The TSearchRec::FindData field is the raw WIN32_FIND_DATA data that Find(First|Next)File() actually reported.

Edited by Remy Lebeau


You are right, it was my fault; I did not test it properly. The file is caught with the plain RTL functions, but it is never detected as having diacritics, so it never gets added to the listview.

 

const int DIACRITIC_COUNT = 28; // 28 entries in each table below
static const WideChar diacritics[DIACRITIC_COUNT] = {
    L'á', L'à', L'â', L'ä', L'ã', L'å',
    L'é', L'è', L'ê', L'ë',
    L'í', L'ì', L'î', L'ï',
    L'ó', L'ò', L'ô', L'ö', L'õ',
    L'ú', L'ù', L'û', L'ü',
    L'ý', L'ÿ',
    L'ñ',
    L'ç',
    L'Ñ'
};
static const WideChar replacements[DIACRITIC_COUNT] = {
    L'a', L'a', L'a', L'a', L'a', L'a',
    L'e', L'e', L'e', L'e',
    L'i', L'i', L'i', L'i',
    L'o', L'o', L'o', L'o', L'o',
    L'u', L'u', L'u', L'u',
    L'y', L'y',
    L'n',
    L'c',
    L'N'
};

if (FindFirst(dir + "\\*.*", faAnyFile, R) == 0)
{
    do {
        if (!(R.Attr & faDirectory))
        {
            String newName = R.Name;
            bool changed = false;
            // String (UnicodeString) is 1-indexed
            for (int i = 1; i <= newName.Length(); i++)
                for (int j = 0; j < DIACRITIC_COUNT; j++)
                {
                    if (newName[i] == diacritics[j])
                    {
                        newName[i] = replacements[j];
                        changed = true;
                    }
                }
            if (changed)
            {
                TListItem* item = ListView1->Items->Add();
                item->Caption = R.Name;
                item->SubItems->Add(newName);
            }
        }
    } while (FindNext(R) == 0);
    FindClose(R); // only close a search that FindFirst() opened successfully
}
 


The forum does not display the code correctly; I have to use spaces in the brackets, like newName[ i ].

Forgot to say, other files with diacritics are caught, just this particular one is not: Duo Canopée play Ständchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm

If I put this in the FindFirst/FindNext loop, R.Name is still the original but ch becomes a plain 'e':

if (R.Name.Pos("Duo Canop"))
{
    WideChar ch = R.Name[10];
}
 

Edited by JeanCremers

1 hour ago, JeanCremers said:

You are right, it was my fault; I did not test it properly. The file is caught with the plain RTL functions, but it is never detected as having diacritics, so it never gets added to the listview.

One issue I see is you are not taking into account either UTF-16 surrogates, or Unicode combining codepoints.  Not all Unicode characters take up 1 WideChar, sometimes they require 2+ WideChars, especially if they are not in a normalized form.  For example, the character 'á' may be 1 WideChar 0x00E1 (Latin Small Letter A with Acute), or it may be 2 WideChars 0x0061 (Latin Small Letter A) and 0x0301 (Combining Acute Accent) working together.

 

What are the actual numeric values of the WideChars that you are actually receiving for the filename you are having trouble with?
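
One way to answer that question is to dump every code unit of the name in hex. This is a minimal sketch in portable C++, assuming std::u16string as a stand-in for the RTL's UnicodeString (which is 1-indexed, unlike this 0-indexed string). codeUnitsHex is a hypothetical helper name, not an RTL function:

```cpp
#include <cstdio>
#include <string>

// Formats each UTF-16 code unit of `name` as "U+XXXX", separated by spaces.
// Hypothetical helper for inspecting what a filename really contains.
std::string codeUnitsHex(const std::u16string& name)
{
    std::string out;
    char buf[8]; // "U+XXXX" plus the terminating NUL fits in 7 bytes
    for (char16_t unit : name) {
        std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(unit));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

With this, a precomposed é shows up as a single U+00E9, while a decomposed one shows up as the pair U+0065 U+0301.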

1 hour ago, JeanCremers said:

The forum does not display the code correctly; I have to use spaces in the brackets, like newName[ i ].

That is because you are posting the code as plain text.  Put it inside of a code block instead (the '</>' button on the editor toolbar).  For example:

void sayHi() {
  cout << "This is in a code block!";
}
Edited by Remy Lebeau


But how can R.Name be 'Duo Canopée play Ständchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm' and WideChar W = R.Name[10] be a plain e?

 

Yes, I was using [ blocks ], thanks.

 

Edited by JeanCremers

34 minutes ago, JeanCremers said:

But how can R.Name be 'Duo Canopée play Ständchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm' and WideChar W = R.Name[10] be a plain e?

Just like in the example I gave you in my last reply, in this case I'm guessing that R.Name[10] is 0x0065 (Latin Small Letter E) and R.Name[11] is 0x0301 (Combining Acute Accent), whereas you are expecting R.Name[10] to be 0x00E9 (Latin Small Letter E with Acute) instead.

 

To accomplish what you want, you should normalize the Unicode characters, probably to form NFC, before you can then process and replace them.  Read the following for more details:

 

Unicode Standard Annex #15: Unicode Normalization Forms

Using Unicode Normalization to Represent Strings

NormalizeString function
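
To illustrate what composing to NFC does to such a name, here is a minimal portable sketch. It does not call NormalizeString(); it only merges a base letter and a directly following combining mark using a small hand-written table, so it is a stand-in for real normalization and covers just six pairs. The names ComposePair, kPairs and composePairs are made up for this example, and std::u16string stands in for UnicodeString:

```cpp
#include <cstddef>
#include <string>

// Hypothetical mini-table: (base letter, combining mark) -> precomposed letter.
// Real NFC composition covers far more than these six pairs.
struct ComposePair { char16_t base; char16_t mark; char16_t composed; };

static const ComposePair kPairs[] = {
    { u'a', 0x0301, 0x00E1 }, // a + combining acute     -> U+00E1
    { u'a', 0x0308, 0x00E4 }, // a + combining diaeresis -> U+00E4
    { u'e', 0x0301, 0x00E9 }, // e + combining acute     -> U+00E9
    { u'e', 0x0308, 0x00EB }, // e + combining diaeresis -> U+00EB
    { u'o', 0x0308, 0x00F6 }, // o + combining diaeresis -> U+00F6
    { u'u', 0x0308, 0x00FC }, // u + combining diaeresis -> U+00FC
};

// Replaces every (base, mark) pair found in kPairs by its precomposed form.
std::u16string composePairs(const std::u16string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char16_t unit = in[i];
        if (i + 1 < in.size()) {
            for (const ComposePair& p : kPairs) {
                if (p.base == unit && p.mark == in[i + 1]) {
                    unit = p.composed;
                    ++i; // the combining mark has been consumed
                    break;
                }
            }
        }
        out.push_back(unit);
    }
    return out;
}
```

After this step, a lookup table of precomposed characters such as L'é' and L'ä' can match again. In real code on Windows, NormalizeString() with NormalizationC is the general tool for this job.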
 

Quote

Yes, I was using [ blocks ], thanks.

I realize that.  But that is reserved for markup language in plain text.  You need to actually click on the '</>' button in the toolbar and put your code in the resulting popup dialog.


Edited by Remy Lebeau


Recurring theme here is that you think that everything else is at fault when you can't achieve things that others can. Perhaps you need the curiosity to ask why this is. 


I didn't see your question about the numeric values.

The é = 101 and the ä = 116.

The corresponding chars in my diacritics table are 233 and 228 though.

I tried changing them with

if (newName[i] == 101 || newName[i] == 116)
{
    newName[i] = (newName[i] == 101 ? 'e' : 'a');
    OutputDebugStringW(newName.c_str());
    changed = true;
}

But I get:

Debug Output: Duo Canope´e play Sta¨ndchen by Franz Schubert on a 1968 D. Friederich Guitar & Violoncello.webm Process diacritics.exe (2792)
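
A possible reading of those numbers, stated as an assumption: 101 is U+0065 (plain 'e') and 116 is U+0074 ('t', the letter just before the 'a' in "Sta¨ndchen"), so the accents are probably stored as separate combining code units right after the base letters. If that is the case, replacing the base letter changes nothing visible, and the combining mark has to be removed instead. A minimal portable sketch, assuming the marks fall in the Combining Diacritical Marks block U+0300..U+036F. isCombiningMark and stripCombiningMarks are hypothetical helper names, and std::u16string stands in for UnicodeString:

```cpp
#include <string>

// True for code units in the Combining Diacritical Marks block (U+0300..U+036F).
bool isCombiningMark(char16_t unit)
{
    return unit >= 0x0300 && unit <= 0x036F;
}

// Returns a copy of `in` with all combining marks from that block removed,
// leaving the plain base letters behind.
std::u16string stripCombiningMarks(const std::u16string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char16_t unit : in) {
        if (!isCombiningMark(unit))
            out.push_back(unit);
    }
    return out;
}
```

This only handles decomposed names. Precomposed characters such as U+00E9 pass through untouched and still need the lookup table from the earlier posts.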

 
