David Schwartz

CreateProcess[A] only returns Chinese


I've got this piece of code from years ago called TRedirectedConsole that's a class wrapping a call to CreateProcess. It uses pipes to grab StdOut and StdErr, and it's mainly to let you run an EXE and capture the output in an event handler that lets you write it to something like a TMemo.

 

I have not used it in ages, and the timestamp is dated 2004. I dug it out of my archives and it compiles just fine in Delphi 10.2.3 after a minor fix.

 

But when I try to use it, all I can get it to output is gibberish that looks mostly like Chinese characters (could be some other Asian language for all I know).

 

Does anybody have any experience making CreateProcess[A] work properly in Delphi?

RedCon.zip


The problem is not with CreateProcess[A] itself, but in how the output of the spawned process is [mis]interpreted by TRedirectedConsole when converted to a string format.

 

What you describe is commonly known as "mojibake". It happens when 7/8-bit ANSI character data is misinterpreted as 16-bit Unicode (UTF-16) data: each pair of ANSI bytes gets read as a single UTF-16 code unit, and for ordinary text those code units tend to land in the range used by Chinese characters, which is why the output looks Chinese.
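To make the effect concrete, here is a minimal sketch of what happens to the first two bytes of "dir" when they are read as one little-endian UTF-16 code unit. The program and the helper names `PairAsUtf16` and `IsCjkIdeograph` are made up for illustration; the range U+4E00..U+9FFF is the CJK Unified Ideographs block.

```pascal
program MojibakeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: combine two consecutive ANSI bytes the way a
// little-endian UTF-16 reader would (the first byte is the low byte).
function PairAsUtf16(A, B: AnsiChar): Word;
begin
  Result := Word(Ord(A)) or (Word(Ord(B)) shl 8);
end;

// Hypothetical helper: is this code unit in the CJK Unified Ideographs block?
function IsCjkIdeograph(CodeUnit: Word): Boolean;
begin
  Result := (CodeUnit >= $4E00) and (CodeUnit <= $9FFF);
end;

var
  W: Word;
begin
  W := PairAsUtf16('d', 'i');          // bytes $64 $69
  Writeln(IntToHex(Integer(W), 4));    // 6964
  if IsCjkIdeograph(W) then
    Writeln('lands in the CJK block')
  else
    Writeln('lands elsewhere');
end.
```

Most pairs of lowercase letters end up in that block the same way, so a whole line of "dir" output turns into a run of ideographs.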

 

If you look at TRedirectedConsole's internal logic more carefully (specifically, in its ReadHandle() method), you will see that it reads the spawned process's output using the Win32 API ReadFile() function (which has no concept of string characters), stores the raw bytes as-is into a Char[] array, and then copies that array into a native String.  That worked just fine in 2004 when Delphi's native String type was still AnsiString, and the native Char type was still AnsiChar.  But remember that in Delphi 2009, the native String type was changed to UnicodeString, and the native Char type was changed to WideChar.  As such, TRedirectedConsole's current reading logic is no longer valid under a Unicode environment.  It needs to be updated to account for the fact that the 8-bit ANSI encoding of the spawned process's output is now different than the 16-bit Unicode encoding of Delphi's native String type.

 

One way to handle this would be to tweak TRedirectedConsole to make it explicitly use AnsiString and AnsiChar everywhere, instead of using String and Char, respectively.  This is commonly known as "Ansifying" your code, which is generally frowned upon, but this is a common use case for it.

 

Also, note that AnsiString is now codepage-aware, so you would have to use the RTL's SetCodePage() function to assign a proper codepage identifier to any AnsiString you give to the user, so the ANSI data will be converted properly when assigned to other strings, such as a UnicodeString when adding to the TMemo (since the entire VCL uses UnicodeString now).  Unless the spawned process is manipulating its output to use a specific encoding (like UTF-8), you can generally use the Win32 API GetACP() function for the codepage identifier, or the RTL's global DefaultSystemCodePage.

 

For example:

 

//==============================================================
function TRedirectedConsole.ReadHandle(h: THandle; var s: AnsiString): integer;
//==============================================================
var
  BytesWaiting: DWORD;
  Buf: Array[1..BufSize] of AnsiChar;
  BytesRead: {$IFDEF VER100}Integer{$ELSE}DWORD{$ENDIF};
begin
  Result := 0;
  PeekNamedPipe(h, nil, 0, nil, @BytesWaiting, nil);
  if BytesWaiting > 0 then
  begin
    if BytesWaiting > BufSize then
      BytesWaiting := BufSize;
    ReadFile(h, Buf[1], BytesWaiting, BytesRead, nil);
    SetString(s, Buf, BytesRead);
    {$IFDEF CONDITIONALEXPRESSIONS}
      {$IF CompilerVersion >= 12} // D2009+
    SetCodePage(PRawByteString(@s)^, GetACP(), False);
      {$IFEND}
    {$ENDIF}
    Result := BytesRead;
  end;
end;

...

//==============================================================
procedure TRedirectedConsole.Run( working_dir : string );
//==============================================================
var
    s: AnsiString; // not String!
    ...
begin
    ...
end;

However, other areas of TRedirectedConsole would also have to be ansified, and I can see some issues with doing that, so I would suggest instead to have TRedirectedConsole continue to use the native String type everywhere, and just re-write the ReadHandle() method to explicitly convert the spawned process's output from ANSI to UTF-16, such as with the RTL's UnicodeFromLocaleChars() function (using the same codepage as above):

 

{$IFDEF CONDITIONALEXPRESSIONS}
  {$IF CompilerVersion >= 12}
    {$DEFINE STRING_IS_UNICODESTRING}
  {$IFEND}
{$ENDIF}

//==============================================================
function TRedirectedConsole.ReadHandle(h: THandle; var s: String): integer;
//==============================================================
var
  BytesWaiting: DWORD;
  Buf: Array[1..BufSize] of AnsiChar;
  BytesRead: {$IFDEF VER100}Integer{$ELSE}DWORD{$ENDIF};
begin
  Result := 0;
  if PeekNamedPipe(h, nil, 0, nil, @BytesWaiting, nil) then
  begin
    if BytesWaiting > 0 then
    begin
      if BytesWaiting > BufSize then
        BytesWaiting := BufSize;
      ReadFile(h, Buf[1], BytesWaiting, BytesRead, nil);
      {$IFDEF STRING_IS_UNICODESTRING}
      // Buf is 1-based, so pass the address of its first element rather than Buf itself
      SetLength(s, UnicodeFromLocaleChars(GetACP(), 0, @Buf[1], BytesRead, nil, 0));
      UnicodeFromLocaleChars(GetACP(), 0, @Buf[1], BytesRead, PChar(s), Length(s));
      {$ELSE}
      SetString(s, Buf, BytesRead);
      {$ENDIF}
      Result := BytesRead;
    end;
  end;
end;

//==============================================================
procedure TRedirectedConsole.Run( working_dir : string );
//==============================================================
var
    ...
begin
    ...

    {$IFDEF UNICODE}
    // this is important to prevent a crash in CreateProcessW()...
    UniqueString(fCmdLine);
    {$ENDIF}

    // this raises an exception if anything happens
    Win32Check(CreateProcess(nil, PChar(fCmdLine), nil, nil, True,
                NORMAL_PRIORITY_CLASS, nil, wd, fSI, fPI));
    ...
end;

 

Edited by Remy Lebeau

Remy, this is an awesome explanation. To summarize: there's an impedance mismatch of sorts between the command being run and the data it's spitting out vs. the rest of the processing pipeline.

 

The command I was testing is the simple "dir c:\" command. It's running on Windows 7, so I'm guessing that everything outside this process is all Unicode. Why is a mismatch arising?

 

Just out of curiosity, what's the dominating factor here? The app (command.exe); the program (dir c:\); the ReadFile() function; or the type of the storage array used? Or does everything have to be in alignment / agreement? (I did study the code before posting here, and it is a bit mind-numbing in this case.)

 

What's not quite fully distilled for me is this; it sounds like the "data pipeline" around the CreateProcess call (including all of the various handlers) may need to be tailored for EITHER Ansi OR Unicode. Is this true? Or is there a way to handle both?

 

I guess it's irrelevant tho, because I plan to use this to run exactly one program (Ora2Pg, which is built on top of a Perl platform). 

 

That said, what's the simplest way to deal with this issue?

Edited by David Schwartz

42 minutes ago, David Schwartz said:

To summarize: there's an impedance mismatch of sorts between the command being run and the data it's spitting out vs. the rest of the processing pipeline.

Essentially, yes. 

42 minutes ago, David Schwartz said:

The command I was testing is the simple "dir c:\" command. It's running on Windows 7, so I'm guessing that everything outside this process is all Unicode. Why is a mismatch arising?

Because the "dir" process is outputting its data in ANSI, but the TRedirectedConsole code is expecting the data to be in Unicode when compiled in D2009+.

42 minutes ago, David Schwartz said:

Just out of curiosity, what's the dominating factor here? The app (command.exe); the program (dir c:\); the ReadFile() function; or the type of the storage array used? Or does everything have to be in alignment / agreement? (I did study the code before posting here, and it is a bit mind-numbing in this case.)

CMD.EXE (not COMMAND.EXE) outputs the result of its internal commands (those built into CMD.EXE itself, like "dir") in OEM/ANSI unless you explicitly ask it to output in Unicode via the "/u" parameter. To apply that to "dir" in your example, you would have to run CMD.EXE directly, executing "dir" via the "/c" parameter, and add the "/u" parameter before "/c" (everything after "/c" is treated as the command to run), eg:

 

"cmd.exe /u /c dir c:\"

 

Note that this only works for internal commands. The "/u" parameter does not affect external processes run from CMD.EXE. They are free to output in whatever codepage they want, or in the calling console's current codepage.
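As a rough sketch of how the reading side might change when "/u" is in play: the bytes arriving through the pipe are then UTF-16LE rather than single-byte text, so they need a different decoder. The helper name `DecodePipeBytes` and its `IsUtf16` flag are hypothetical, and the non-Unicode branch assumes the system ANSI codepage (a console program may well be using the OEM codepage instead, so check what you actually receive).

```pascal
program DecodeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: turn raw pipe bytes into a native String.
//   IsUtf16 = True  -> bytes are UTF-16LE (e.g. internal commands under cmd.exe /u)
//   IsUtf16 = False -> bytes are assumed to be in the system ANSI codepage
function DecodePipeBytes(const Bytes: TBytes; IsUtf16: Boolean): string;
begin
  if IsUtf16 then
    Result := TEncoding.Unicode.GetString(Bytes)
  else
    Result := TEncoding.ANSI.GetString(Bytes);
end;

var
  Utf16Bytes, AnsiBytes: TBytes;
begin
  // "dir" as UTF-16LE: every ASCII byte is followed by a zero byte
  Utf16Bytes := TBytes.Create($64, $00, $69, $00, $72, $00);
  // "dir" as single-byte text
  AnsiBytes := TBytes.Create($64, $69, $72);

  Writeln(DecodePipeBytes(Utf16Bytes, True));
  Writeln(DecodePipeBytes(AnsiBytes, False));
end.
```

Both calls produce the same native String; only the decoder differs, and picking the wrong one is exactly what produces the mojibake.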

 

ReadFile() operates on raw bytes only.  Like I said, it has no concept of strings or character encodings. So it will read whatever raw data is sent to the output pipes.  In your situation, "dir" is outputting ANSI and you are reading it into a Unicode buffer, thus the mismatch leading to the mojibake text you see. 

 

42 minutes ago, David Schwartz said:

What's not quite fully distilled for me is this; it sounds like the "data pipeline" around the CreateProcess call (including all of the various handlers) may need to be tailored for EITHER Ansi OR Unicode. Is this true? Or is there a way to handle both?

Not easily, short of buffering the raw data as-is and then analyzing it to guess what encoding it may be using, and then convert it to the desired encoding if/when needed.
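One possible heuristic, sketched here under loose assumptions: check whether the buffered bytes are structurally valid UTF-8 and fall back to ANSI if they are not. `LooksLikeUtf8` and `GuessAndDecode` are hypothetical names, the check only looks at lead/continuation byte structure (it does not reject overlong forms or surrogates), and any such guess can be wrong for short inputs.

```pascal
program GuessEncodingDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: True if every lead byte is followed by the right
// number of continuation bytes (10xxxxxx). Pure ASCII also passes.
function LooksLikeUtf8(const Bytes: TBytes): Boolean;
var
  I, Need: Integer;
  B: Byte;
begin
  Result := False;
  I := 0;
  while I < Length(Bytes) do
  begin
    B := Bytes[I];
    if B < $80 then Need := 0
    else if (B and $E0) = $C0 then Need := 1
    else if (B and $F0) = $E0 then Need := 2
    else if (B and $F8) = $F0 then Need := 3
    else Exit; // stray continuation byte or invalid lead byte
    Inc(I);
    while Need > 0 do
    begin
      if (I >= Length(Bytes)) or ((Bytes[I] and $C0) <> $80) then
        Exit;
      Inc(I);
      Dec(Need);
    end;
  end;
  Result := True;
end;

// Hypothetical helper: decode as UTF-8 when the bytes look like UTF-8,
// otherwise (or if the RTL rejects them) decode as system ANSI.
function GuessAndDecode(const Bytes: TBytes): string;
begin
  if LooksLikeUtf8(Bytes) then
    try
      Result := TEncoding.UTF8.GetString(Bytes);
      Exit;
    except
      on EEncodingError do ; // fall through to the ANSI decoder
    end;
  Result := TEncoding.ANSI.GetString(Bytes);
end;

begin
  Writeln(LooksLikeUtf8(TBytes.Create($C3, $A9))); // "é" encoded as UTF-8
  Writeln(LooksLikeUtf8(TBytes.Create($E9, $64))); // "éd" encoded as Windows-1252
  Writeln(GuessAndDecode(TBytes.Create($64, $69, $72)));
end.
```

This only works on a complete buffer. If you decode chunk by chunk as the data arrives, a multi-byte character can be split across two reads, which is part of why handling both encodings is not easy.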

42 minutes ago, David Schwartz said:

I guess it's irrelevant tho, because I plan to use this to run exactly one program (Ora2Pg, which is built on top of a Perl platform). 

That would be an external process outside of CMD.EXE, so you would be subject to whatever character encoding it decides to send its output as.  You will have to test it and see what it sends, and then code for it accordingly.
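A simple way to run that test is to dump the raw bytes you get from ReadFile() as hex before doing any conversion. This is just a sketch; `BytesToHex` is a hypothetical helper and the sample bytes are hand-written, not real Ora2Pg output.

```pascal
program HexDumpDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: render raw bytes as space-separated hex pairs.
function BytesToHex(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Integer(Bytes[I]), 2);
  end;
end;

begin
  // "dir" as single-byte text
  Writeln(BytesToHex(TBytes.Create($64, $69, $72)));                // 64 69 72
  // "dir" as UTF-16LE: note the zero byte after each letter
  Writeln(BytesToHex(TBytes.Create($64, $00, $69, $00, $72, $00))); // 64 00 69 00 72 00
end.
```

Alternating zero bytes point to UTF-16, while bytes of $80 and above in otherwise plain text point to a codepage or UTF-8 that you then have to identify.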

42 minutes ago, David Schwartz said:

That said, what's the simplest way to deal with this issue?

I think I've already covered that.


Ok, one more question. What's the difference between using CreateProcess, CreateProcessA, and CreateProcessW if the internal process being spawned may or may not be producing data in the expected format? (Ansi vs. Unicode) Why would you choose one or the other if it depends more on the ReadFile() ?

Edited by David Schwartz


There are only two functions, the A and the W version. The A version converts the input text arguments to Unicode and calls the W version. That's always the case with Windows. The A function is just a time and memory consuming wrapper around the W function. 

 

Note that this implies it makes no difference to the output of the created process since the W version is always going to do the work in the end. 

 

Basic rule is always call the W version. It's more efficient, the system is Unicode natively, and the W version means your code can be used by non-English users. 
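For illustration, a minimal sketch of calling the W version by name, assuming a recent Delphi with the Winapi.Windows unit. `RunAndWait` is a hypothetical helper, there is no pipe redirection here, and the command line is just an example.

```pascal
program SpawnW;
{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: start a process with CreateProcessW, wait for it,
// and return its exit code. Raises an exception if any API call fails.
function RunAndWait(const Cmd: string): DWORD;
var
  CmdLine: string;
  SI: TStartupInfoW;
  PI: TProcessInformation;
begin
  CmdLine := Cmd;
  UniqueString(CmdLine); // CreateProcessW may modify the command line buffer
  ZeroMemory(@SI, SizeOf(SI));
  SI.cb := SizeOf(SI);
  Win32Check(CreateProcessW(nil, PWideChar(CmdLine), nil, nil, False,
    0, nil, nil, SI, PI));
  try
    WaitForSingleObject(PI.hProcess, INFINITE);
    Win32Check(GetExitCodeProcess(PI.hProcess, Result));
  finally
    CloseHandle(PI.hThread);
    CloseHandle(PI.hProcess);
  end;
end;

begin
  Writeln(RunAndWait('cmd.exe /c exit 0'));
end.
```

In a Unicode Delphi the plain CreateProcess name already maps to CreateProcessW, so spelling it out mainly documents the intent.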

7 minutes ago, David Heffernan said:

There are only two functions, the A and the W version. The A version converts the input text arguments to Unicode and calls the W version. That's always the case with Windows. The A function is just a time and memory consuming wrapper around the W function. 

 

Note that this implies it makes no difference to the output of the created process since the W version is always going to do the work in the end. 

 

Basic rule is always call the W version. It's more efficient, the system is Unicode natively, and the W version means your code can be used by non-English users. 

Thanks, that's great. But it's also why I was puzzled to see the mojibake, because I assumed the data would be converted automatically to Unicode.

 

I guess the trick is knowing exactly where in the pipeline the conversion needs to occur (if/when it does) and making it happen.

27 minutes ago, David Schwartz said:

Thanks, that's great. But it's also why I was puzzled to see the mojibake, because I assumed the data would be converted automatically to Unicode.

Only the string input parameters that are passed directly to CreateProcess() itself are converted, not the data that gets sent over pipes afterwards.

27 minutes ago, David Schwartz said:

I guess the trick is knowing exactly where in the pipeline the conversion needs to occur (if/when it does) and making it happen.

It needs to occur at the point where the data first enters your app, before you pass it up to other areas of your code.  In general, you should use a single encoding inside your app (ie, always use Delphi's native encoding), and then convert data at the last possible moment when it leaves your app, and convert it at the earliest possible moment when it enters your app.
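A small sketch of that rule, with hypothetical helper names (`FromChildProcess`, `ToLogFile`) and the assumption that the log file should be UTF-8: decode once on the way in, do all internal work on native strings, and encode once on the way out.

```pascal
program BoundaryDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Entry point (hypothetical): decode as soon as the bytes arrive.
function FromChildProcess(const Raw: TBytes; Source: TEncoding): string;
begin
  Result := Source.GetString(Raw);
end;

// Exit point (hypothetical): encode only when the text leaves the app.
function ToLogFile(const Text: string): TBytes;
begin
  Result := TEncoding.UTF8.GetBytes(Text);
end;

var
  Raw, Encoded: TBytes;
  Line: string;
begin
  Raw := TBytes.Create($64, $69, $72);           // "dir" as received from a pipe
  Line := FromChildProcess(Raw, TEncoding.ANSI); // convert at the earliest moment
  Line := UpperCase(Line);                       // internal work uses native strings
  Encoded := ToLogFile(Line);                    // convert at the last moment
  Writeln(Line, ' -> ', Length(Encoded), ' bytes');
end.
```

Everything between the two helpers never has to care which encoding the child process used.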

18 hours ago, Remy Lebeau said:

SetLength(s, UnicodeFromLocaleChars(GetACP(), 0, Buf, BytesRead, nil, 0));
UnicodeFromLocaleChars(GetACP(), 0, Buf, BytesRead, PChar(s), Length(s)));

I'm getting a compiler error from this:

 

[dcc32 Error] RedCon.pas(174): E2250 There is no overloaded version of 'UnicodeFromLocaleChars' that can be called with these arguments
[dcc32 Error] RedCon.pas(175): E2250 There is no overloaded version of 'UnicodeFromLocaleChars' that can be called with these arguments

 

The problem seems to be that the function wants PAnsiChar for Buf, which is an array of AnsiChar.

 

If I try casting it to PAnsiChar, I get another error saying it's an illegal cast.

 

EDIT: I changed it to @Buf[1] and that works.

 

AND ... it works great now. Thanks!

 

Edited by David Schwartz

39 minutes ago, David Schwartz said:

I'm getting a compiler error from this:

 

[dcc32 Error] RedCon.pas(174): E2250 There is no overloaded version of 'UnicodeFromLocaleChars' that can be called with these arguments
[dcc32 Error] RedCon.pas(175): E2250 There is no overloaded version of 'UnicodeFromLocaleChars' that can be called with these arguments

 

The problem seems to be that the function wants PAnsiChar for Buf, which is an array of AnsiChar.

Typically, a fixed AnsiChar[] array can be passed to a PAnsiChar pointer, but apparently that only works when the low index is 0, whereas yours is 1 instead.  From the documentation:

Quote

An array type of the form array[0..x] of Char is called a zero-based character array. Zero-based character arrays are used to store null-terminated strings and are compatible with PChar values. See "Working with null-terminated strings" in String Types (Delphi).

Quote

EDIT: I changed it to @Buf[1] and that works.

Yes, that will work, and is what you need to do, unless you change your Buf array to use 0-based indexing.
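To illustrate the difference, a minimal sketch with two made-up buffers, `Buf0` (0-based) and `Buf1` (1-based):

```pascal
program CharArrayDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Buf0: array[0..7] of AnsiChar; // zero-based character array
  Buf1: array[1..8] of AnsiChar; // one-based, like Buf in ReadHandle()
  P: PAnsiChar;
begin
  FillChar(Buf0, SizeOf(Buf0), 0);
  FillChar(Buf1, SizeOf(Buf1), 0);
  Buf0[0] := 'a';
  Buf1[1] := 'b';

  P := Buf0;      // compiles: a zero-based array is compatible with PAnsiChar
  Writeln(P^);

  P := @Buf1[1];  // one-based: take the address of the first element instead
  Writeln(P^);
end.
```

Writing `P := Buf1;` is rejected by the compiler, which is the same incompatibility behind the E2250 errors above.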

Edited by Remy Lebeau