Thread: Some invalid chars on TIdText.body.text if TIdText.Charset = 'ISO646-US'


This question is answered. Helpful answers available: 2. Correct answers available: 1.


Permlink Replies: 5 - Last Post: Oct 6, 2016 1:18 AM Last Post By: Julio Pião
Julio Pião

Posts: 4
Registered: 12/3/08
Some invalid chars on TIdText.body.text if TIdText.Charset = 'ISO646-US'  
  Posted: Oct 3, 2016 6:28 AM
Hi,

I am seeing invalid characters (apparently all accented characters, such as óòõéè ...) in TIdText.body.text when TIdText.Charset = 'ISO646-US'.

Apparently the string is encoded as a UCS4String and is not decoded correctly by TIdMessage.

I tried to fix it (Delphi 2009) with UCS4StringToWideString(), with no result.

Here is an extract from an *.eml file created with TIdMessage (you can see the problem in the word "electr??nica"):

-Cf7WL0kqbRYkLoCdE7vCgkHt2tiFJ0h=_G
Content-Type: text/html; charset="ISO646-US"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

<html>
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-=
8">
</head>
<body>
<basefont face=3D"calibri, arial">

Estimado(a) Cliente,

Enviamos em anexo a sua factura electr??nica, n.?? 1130191611...


Thanks for any help,
Mauricio

Edited by: Julio Pião on Oct 3, 2016 6:30 AM

Edited by: Julio Pião on Oct 3, 2016 6:41 AM

Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Some invalid chars on TIdText.body.text if TIdText.Charset ='ISO646-US'
  Posted: Oct 3, 2016 10:39 AM   in response to: Julio Pião
Julio wrote:

> I am seeing invalid characters (apparently all accented characters, such as óòõéè
> ...) in TIdText.body.text when TIdText.Charset = 'ISO646-US'.

Your extract clearly shows a conflict between the charset of a MIME piece
within the email and the charset of the HTML data inside of that piece.
The MIME portion of the email is claiming ISO646-US, but the HTML inside
of the MIME is claiming UTF-8 instead. That is an error on the sender's
end, and there is nothing Indy can do about it, as the MIME charset has priority
when processing text. You need to report the error to the sender (or the
author of the sender's app). If the HTML is actually encoded as UTF-8, the
sender needs to send the MIME Content-Type charset as UTF-8 instead of ISO646-US.
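
The doubled question marks in the extract fit this explanation: in UTF-8, "ó" is the two bytes C3 B3 (and "º" is C2 BA), and a 7-bit charset has no character for either byte. A minimal sketch of the effect, assuming a Unicode Delphi (2009 or later) with Delphi 2009-style unit names and using only the RTL. DecodeAs7Bit is a hypothetical helper that models a strict US-ASCII decoder; it is not Indy's actual code:

```delphi
program CharsetMismatchDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: models a strict 7-bit decoder by replacing every
// byte above $7F with '?'. Assumes 1-based strings (desktop compilers).
function DecodeAs7Bit(const Bytes: TBytes): string;
var
  I: Integer;
begin
  SetLength(Result, Length(Bytes));
  for I := 0 to High(Bytes) do
    if Bytes[I] <= $7F then
      Result[I + 1] := Char(Bytes[I])
    else
      Result[I + 1] := '?';
end;

var
  Raw: TBytes;
begin
  // 'electrónica' encoded as UTF-8; #$00F3 is 'ó'
  Raw := TEncoding.UTF8.GetBytes('electr'#$00F3'nica');

  Writeln(Length(Raw));                            // 12 bytes for 11 characters
  Writeln(DecodeAs7Bit(Raw));                      // electr??nica
  Writeln(Length(TEncoding.UTF8.GetString(Raw)));  // 11 when decoded as UTF-8
end.
```

Decoding the same bytes with the charset the HTML declares recovers the original 11 characters, which is why the MIME header and the data need to agree.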

Assuming that would take time to get fixed, the only way you can receive
this email correctly will be to receive the email as raw data and decode
it yourself manually. You can either:

1. receive the email as a TStream instead of a TIdMessage, and then decode
the TStream as needed.

2. set the TIdMessage.NoDecode property to true before receiving the email
as a TIdMessage, and then parse the raw octets in the TIdMessage.Body property
as needed.

3. if you want to be sneaky about it, you could try implementing and registering
a custom TIdMessageDecoderInfo class that is similar to TIdMessageDecoderInfoMIME,
but returns a custom TIdMessageDecoder class that is similar to TIdMessageDecoderMIME
but whose overridden ReadBody() method parses the raw HTML as it is being
received and overrides the charset stored in the decoder's Headers
property if a charset declaration is found in the HTML. Then Indy would
decode the raw HTML bytes using the HTML's charset instead of the MIME charset.
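
Whichever option is used, one step is the same: finding the charset that the HTML itself declares. A rough sketch of that step only, assuming the text has already been quoted-printable decoded and that a plain substring search is good enough. FindHtmlCharset is a hypothetical helper, not an Indy function, and it does not parse HTML properly:

```delphi
program HtmlCharsetDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the value that follows the first
// 'charset=' in the given HTML, or '' if there is none.
function FindHtmlCharset(const Html: string): string;
var
  Lower: string;
  P, Start: Integer;
begin
  Result := '';
  Lower := LowerCase(Html);        // ASCII-only lowercasing, same length as Html
  P := Pos('charset=', Lower);
  if P = 0 then
    Exit;
  Start := P + Length('charset=');
  // Skip an optional opening quote around the value
  if (Start <= Length(Html)) and CharInSet(Html[Start], ['"', '''']) then
    Inc(Start);
  // The value ends at a quote, space, semicolon or '>'
  P := Start;
  while (P <= Length(Html)) and
        not CharInSet(Html[P], ['"', '''', ' ', ';', '>']) do
    Inc(P);
  Result := Copy(Html, Start, P - Start);
end;

begin
  Writeln(FindHtmlCharset(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'));
  // prints: utf-8
end.
```

The returned name would then be used to choose the encoding for the raw body bytes in place of the MIME charset.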

> Apparently the string is encoded as a UCS4String and is not decoded correctly
> by TIdMessage.

That is not what is happening. The real issue is that the HTML is being
decoded using the wrong charset.

> I tried to fix it (Delphi 2009) with UCS4StringToWideString(),
> with no result.

Of course not, because that is not the root of the problem.

--
Remy Lebeau (TeamB)
Julio Pião

Posts: 4
Registered: 12/3/08
Re: Some invalid chars on TIdText.body.text if TIdText.Charset ='ISO646-US'
  Posted: Oct 4, 2016 3:44 AM   in response to: Remy Lebeau (Te...
Hi Remy,

I was waiting for your response, since you are the one who helps in the Indy threads :)

Some years ago I solved an encoding problem with the subject, sender name, sender address, etc. thanks to your explanations. You can find the result here:
cyIndy.pas from my library https://sourceforge.net/projects/tcycomponents/

Here is my function:

// Decode idMsg.Subject, IdMsg.From.Address etc ...
function ForceDecodeHeader(aHeader: String): String;
var
  VEndPos: Integer;
 
    procedure Terminate_aHeader(AStartPos: Integer; var VEndPos: Integer);
    var
      LCharSet, LEncoding, LData, LDataEnd: Integer;
    begin
      LCharSet := PosIdx('=?', AHeader, AStartPos);  {Do not Localize}
      if (LCharSet = 0) or (LCharSet > VEndPos) then begin
        Exit;
      end;
      Inc(LCharSet, 2);
 
      LEncoding := PosIdx('?', AHeader, LCharSet);  {Do not Localize}
      if (LEncoding = 0) or (LEncoding > VEndPos) then begin
        Exit;
      end;
      Inc(LEncoding);
 
      LData := PosIdx('?', AHeader, LEncoding);  {Do not Localize}
      if (LData = 0) or (LData > VEndPos) then begin
        Exit;
      end;
      Inc(LData);
 
      LDataEnd := PosIdx('?=', AHeader, LData);  {Do not Localize}
      if (LDataEnd = 0) or (LDataEnd > VEndPos) then begin
        // My code :
        aHeader := aHeader + '?=';
      end;
    end;
 
begin
  (*
    Why?: "Subject" is not returned correctly by the DecodeHeader function.
    Subject example:
    =?iso-8859-1?Q?Est=E1 na hora de avan=E7ar para o RAD Studio XE Enterprise!?=
 
    idMessage.pas:
    Subject := DecodeHeader(Headers.Values['Subject']);
 
     idCoderHeader.pas line 228:
        LDataEnd := PosIdx('?=', AHeader, LData);
        if (LDataEnd = 0) // RHR or (LDataEnd > VEndPos)
        then begin
          Exit;
        end;
 
  // After talking with Remy Lebeau: per the ISO standard, the subject should not contain spaces, which is why it is not decoded ... *)
 
 
  // 2015-02-12 UPDATE! It seems that, as explained above, line 228 waits for a '?=' that is never found when decoding IdMsg.From.Address
  VEndPos := Length(aHeader);
  Terminate_aHeader(1, VEndPos);
 
 
  aHeader := StringReplace(aHeader, ' ', ' ', [rfReplaceAll]);
  aHeader := idCoderHeader.DecodeHeader(aHeader);
  Result := StringReplace(aHeader, ' ', ' ', [rfReplaceAll]);
end;


Example of use:
cyIndy.ForceDecodeHeader(idMsg.Subject);


Returning to this thread:

"If the HTML is actually encoded as UTF-8, the sender needs to send the MIME Content-Type charset as UTF-8 instead of ISO646-US..."
I understand what you are saying, and I note that Outlook 2007 decodes the message the same way Indy does.

"...as the MIME charset has priority when processing text"
Because the message part is declared with charset="ISO646-US", it must be decoded as such, and I agree with you.

But I would like to understand what happened on the sender's side, if you can help me:
why did the sending application encode the HTML as UTF-8 if the MIME charset has priority?
To me, the charset in the HTML is just informational text, since the actual charset encoding is done by MIME.

"<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-=8">"
Can you also explain what 3Dutf-=8 means?

Regards,
Mauricio

Edited by: Julio Pião on Oct 4, 2016 3:45 AM

Remy Lebeau (Te...


Posts: 9,447
Registered: 12/23/01
Re: Some invalid chars on TIdText.body.text if TIdText.Charset='ISO646-US'
  Posted: Oct 4, 2016 2:11 PM   in response to: Julio Pião
Julio wrote:

> Why?: "Subject" is not returned correctly by the DecodeHeader function.
> Subject example:
> =?iso-8859-1?Q?Est=E1 na hora de avan=E7ar para o RAD Studio XE Enterprise!?=

Right, because that data is malformed because unencoded spaces are not allowed.
In the 'Q' encoding, they must be encoded as '_' instead, eg:

Subject: =?iso-8859-1?Q?Est=E1_na_hora_de_avan=E7ar_para_o_RAD_Studio_XE_Enterprise!?=
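
For illustration, the 'Q' rule described above can be sketched in a few lines. This assumes a Unicode Delphi (2009 or later) and input restricted to ISO-8859-1 characters, and it ignores the 75-character limit that RFC 2047 places on each encoded word. QEncodeLatin1 is a hypothetical helper, not the encoder Indy uses:

```delphi
program QEncodeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: 'Q'-encodes text whose characters all fit in
// ISO-8859-1, so each code point equals its byte value.
function QEncodeLatin1(const S: string): string;
var
  I: Integer;
  C: Char;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    C := S[I];
    if C = ' ' then
      Result := Result + '_'                          // space becomes '_'
    else if (C > ' ') and (C < #127) and
            not CharInSet(C, ['=', '?', '_']) then
      Result := Result + C                            // safe printable ASCII
    else if Ord(C) <= $FF then
      Result := Result + '=' + IntToHex(Ord(C), 2)    // everything else as =XX
    else
      raise EConvertError.Create('Character outside ISO-8859-1');
  end;
end;

begin
  // #$00E1 is 'á'
  Writeln('=?iso-8859-1?Q?' + QEncodeLatin1('Est'#$00E1' na hora') + '?=');
  // prints: =?iso-8859-1?Q?Est=E1_na_hora?=
end.
```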

> But I would like to understand what happened on the sender's side, if you can
> help me: why did the sending application encode the HTML as UTF-8
> if the MIME charset has priority?

I can't answer that, since I don't know what app the sender is using, or
how the HTML email was set up prior to sending.

> Can you also explain what 3Dutf-=8 means?

It is part of MIME's "quoted-printable" encoding scheme. QP limits encoded lines
to 76 characters (the one in your extract was wrapped after 70), so long lines
get soft-wrapped by inserting a "=" followed
by a line break. Reserved and unsafe characters, like "=", in the actual
data get encoded in "=XX" hex format, so "=" characters would be encoded
as "=3D".
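
Both rules can be seen by decoding the fragment from the extract. A minimal sketch, assuming a Unicode Delphi (2009 or later), well-formed input and CRLF line endings. DecodeQP is a hypothetical helper, not Indy's quoted-printable decoder:

```delphi
program QPDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: decodes quoted-printable text into raw bytes.
// Malformed "=XX" sequences raise EConvertError from StrToInt.
function DecodeQP(const S: string): TBytes;
var
  I, N: Integer;
begin
  SetLength(Result, Length(S));   // decoded data is never longer than the input
  N := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] = '=') and (Copy(S, I + 1, 2) = #13#10) then
      Inc(I, 3)                   // soft line break: drop "=" plus CRLF
    else if (S[I] = '=') and (I + 2 <= Length(S)) then
    begin
      Result[N] := StrToInt('$' + Copy(S, I + 1, 2));  // "=XX" is one byte
      Inc(N);
      Inc(I, 3);
    end
    else
    begin
      Result[N] := Byte(Ord(S[I]));                    // literal character
      Inc(N);
      Inc(I);
    end;
  end;
  SetLength(Result, N);
end;

begin
  // 'charset=3Dutf-=' + CRLF + '8' is how the extract carries 'charset=utf-8'
  Writeln(TEncoding.ASCII.GetString(DecodeQP('charset=3Dutf-='#13#10'8')));
  // prints: charset=utf-8
end.
```

So "3Dutf-=8" is simply "=utf-8" with the "=" escaped as "=3D" and a soft line break falling between "utf-" and "8".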

--
Remy Lebeau (TeamB)
Julio Pião

Posts: 4
Registered: 12/3/08
Re: Some invalid chars on TIdText.body.text if TIdText.Charset = 'ISO646-US'
  Posted: Oct 6, 2016 1:18 AM   in response to: Julio Pião
Hi Remy,

Thanks for your explanations and your time.

Regards,
Mauricio

https://sourceforge.net/projects/tcycomponents/