Discussion:
UTF-8 woes on z/OS, a solution - comments invited
Robert Prins
2017-09-04 17:30:34 UTC
OK, I solved the problem, but maybe someone here can come up with something a
bit more efficient...

There is a file in the non-z/OS world that used to be pure ASCII (actually
CP437/850), but that has now been converted to UTF-8 due to further
internationalisation requirements. Said file was uploaded to z/OS, processed
into a set of datasets containing various reports, and those reports were later
downloaded to the non-z/OS world using the same process that was used to upload
them, which could be one of two: IND$FILE or FTP.

Both FTP and IND$FILE uploads had (and still have) no problems with
CP437/850/UTF-8 data, and although a ü might not have displayed as such on
z/OS, it would have transferred back as the same ü. However, a ü in UTF-8 now
consists of two characters, and that means that, replacing spaces with '='
characters, the original

|=Süd====|
|=Nord===|

report lines now come out as

|=Süd===|
|=Nord===|

when opened in the non-z/OS world with a UTF-8-aware application.

Given that (and in this case I was lucky) the PC file had the option to add
comment-type lines, I solved the problem (the z/OS dataset is processed with
PL/I) by adding an extra line to the input file: the required comment
delimiter, followed by "ASCII ", followed by the 224 ASCII characters from
'20'x to 'ff'x. The PL/I program uses this "special meta-data comment" to
transform the input data, which has been translated by IND$FILE/FTP to EBCDIC,
back into a format where all UTF-8 initial bytes are translated to '1' and all
UTF-8 follow-on bytes to '0', i.e.

dcl ascii char (224); /* the 224 characters from '20'x to 'ff'x, read in
                         via an additional comment record in the original
                         non-z/OS file */
dcl utf8  char (224) init ('11111111111111111111111111111111' ||
                           '11111111111111111111111111111111' ||
                           '11111111111111111111111111111111' ||
                           '00000000000000000000000000000000' ||
                           '00000000000000000000000000000000' ||
                           '00111111111111111111111111111111' ||
                           '11111111111111111111100000000000');

and to get the number of UTF-8 displayable characters of, e.g. myvar, a char(47)
variable, I use the following

dcl a47(47) pic '9';
dcl more char (20) var;

string(a47) = translate(myvar, utf8, ascii);
more = copy(' ', 47 - sum(a47));

where "more" holds the extra blanks that need to be added to the
report column to ensure that the columns line up again in the non-z/OS UTF-8
world. The (relative) beauty of this approach lies in the fact that the
technique is completely code-page independent, and could even be used with the
PL/I compiler on Windows.
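For readers outside PL/I, the translate-and-sum trick can be sketched in Python (a hypothetical re-implementation for illustration, not the poster's code; the names are made up):

```python
# One mask slot per byte value from '20'x to 'ff'x (224 slots): '1' for a byte
# that starts a displayed character, '0' for a UTF-8 follow-on byte.
UTF8_MASK = (
    '1' * 96           # '20'x-'7f'x: single-byte (ASCII) characters
    + '0' * 64         # '80'x-'bf'x: UTF-8 continuation bytes
    + '00' + '1' * 30  # 'c0'x-'df'x: 'c0'x/'c1'x invalid, rest 2-byte lead bytes
    + '1' * 21         # 'e0'x-'f4'x: 3- and 4-byte lead bytes
    + '0' * 11         # 'f5'x-'ff'x: invalid in UTF-8
)

def displayable_length(raw: bytes) -> int:
    """Count displayed characters (code points) in a UTF-8 byte string."""
    return sum(int(UTF8_MASK[b - 0x20]) for b in raw if b >= 0x20)

field = 'Süd'.encode('utf-8')            # 4 bytes, 3 displayed characters
extra = 47 - displayable_length(field)   # blanks needed to pad a 47-wide column
```

The mask rows correspond one-for-one to the rows of the init string in the PL/I above.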

The above works like a charm; however, both translate() and sum(), especially
of pic '9' data, are not exactly the most efficient functions. So the question
is: can anyone think of a more efficient way, other than the quick(?) and dirty
solution of using a macro on the non-z/OS side, to set "more" to the required
number of blanks? I'm open to a PL/I-callable assembler routine, but the
process must be, like the one above, completely code-page independent!

Robert
--
Robert AH Prins
robert.ah.prins(a)gmail.com

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to ***@listserv.ua.edu with the message: INFO IBM-MAIN
Charles Mills
2017-09-04 17:55:48 UTC
I don't understand the problem.

Yes, ü is two bytes (not characters as you wrote!) in UTF-8. But if the translation is working correctly and the code page is specified correctly it should become one byte in EBCDIC, and assuming the report program treats it as a literal of some sort -- does not expect to deduce meaning from each byte -- it should be perfectly happy with S?d (pretending ? is an EBCDIC ü) as a district or whatever name. The report columns should be correct, and it should come back to UTF-8 land as ü, with the proper number of padding blanks.

It sounds like you are incorrectly translating ü to *two* EBCDIC characters, and that is the root of your problem. See if you can't translate to an EBCDIC code page that includes ü.

Charles



Robert Prins
2017-09-04 18:55:51 UTC
Post by Charles Mills
I don't understand the problem.
That's correct.
Post by Charles Mills
Yes, ü is two bytes (not characters as you wrote!) in UTF-8.
You're correct again.
Post by Charles Mills
But if the translation is working correctly and the code page is specified
correctly it should become one byte in EBCDIC, and assuming the report
program treats it as a literal of some sort -- does not expect to deduce
meaning from each byte -- it should be perfectly happy with S?d (pretending
? is an EBCDIC ü) as a district or whatever name. The report columns should
be correct, and it should come back to UTF-8 land as ü, with the proper
number of padding blanks.
It sounds like you are incorrectly translating ü to *two* EBCDIC characters,
and that is the root of your problem. See if you can't translate to an
EBCDIC code page that includes ü.
I can probably find a set of code-pages that correctly translate the two-byte
UTF-8 "ü" character to a one-byte EBCDIC "ü" character, but how would those
same code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and
the Greek "Θ", which appear in the same PC-side file, to one single
character... and back to the correct UTF-8 character?

Maybe that makes the problem more understandable?

Robert
--
Robert AH Prins
robert(a)prino(d)org

Paul Gilmartin
2017-09-04 19:24:54 UTC
Post by Robert Prins
I can probably find a set of code-pages that correctly translate the two byte
UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would those same
two code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and the
Greek "Θ", which appear in the same PC-side file to one single character... And
back to the correct UTF-8 character...
That makes the problem maybe more understandable?
If SBCS is a requirement, then if there is an EBCDIC SBCS code page that
contains "ü", "ł", "ø", "ė", and "Θ", iconv can probably translate UTF-8 to that
code page. Otherwise, you're SOL.

See: https://en.wikipedia.org/wiki/Pigeonhole_principle
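Gil's pigeonhole point can be illustrated with Python's bundled EBCDIC codecs (an illustrative sketch, not iconv itself; cp1140, EBCDIC Latin-1 plus euro, is my assumed candidate page):

```python
# cp1140 has slots for ü and ø (Latin-1 repertoire), but no slot at all for
# the Polish ł, Baltic ė or Greek Θ -- one SBCS page cannot hold them all.
for ch in 'üłøėΘ':
    try:
        ebcdic = ch.encode('cp1140')
        print(f"{ch!r} -> x'{ebcdic.hex()}' in CCSID 1140")
    except UnicodeEncodeError:
        print(f"{ch!r} has no code point in CCSID 1140")
```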

-- gil

Robert Prins
2017-09-04 19:37:42 UTC
Post by Paul Gilmartin
Post by Robert Prins
I can probably find a set of code-pages that correctly translate the two byte
UTF-8 "ü" character to a one byte EBCDIC "ü" character, but how would those same
two code-pages translate the Polish "ł", the Danish "ø", the Baltic "ė", and the
Greek "Θ", which appear in the same PC-side file to one single character... And
back to the correct UTF-8 character...
That makes the problem maybe more understandable?
If SBCS is a requirement, then if there is an EBCDIC SBCS code page that
contains "ü", "ł", "ø", "ė", and "Θ", iconv can probably translate UTF-8 to that
code page. Otherwise, you're SOL.
That's why I'm now using the code that I posted. It works, assuming the UTF-8
data is correct. If that isn't the case, then I'm SOL, and the users get what
they deserve, GIGO! ;)

Robert
--
Robert AH Prins
robert(a)prino(d)org

Charles Mills
2017-09-04 20:16:49 UTC
After I read @Robert's reply to my note I was mentally composing more or less what @Gil writes below.

Paraphrasing the vulgar cliché, you can't put 20 bits of data in an 8-bit byte. Ultimately, EBCDIC is what it is, and it ain't UTF-8.

I suppose you might be able to create a custom EBCDIC code page that included "your" European characters -- assuming no more than thirty or so, and then configure z Unicode Services to handle it. Otherwise, to invoke another vulgar cliché, you are indeed SOL (without your homegrown, um, solution).

Charles



Paul Gilmartin
2017-09-04 20:34:00 UTC
... another vulgar cliché, ... indeed SOL ...
???
Simply Outa Luck?

-- gil

Charles Mills
2017-09-04 20:49:41 UTC
Not the way I heard it.

Charles



Linda
2017-09-05 01:49:39 UTC
Ummm and I heard (and used it as) it as Seriously Outa Luck!

Linda

Sent from my iPhone
Walt Farrell
2017-09-04 22:58:35 UTC
Have you considered transferring it to z/OS in binary, rather than converting to EBCDIC? Then just process it in its Unicode format, which either Java or Enterprise COBOL should be able to handle (Java by default, COBOL with appropriate Unicode specifications).
--
Walt

Charles Mills
2017-09-05 00:06:08 UTC
COBOL or Java, but what about the OP's PL/I?

Charles


David W Noon
2017-09-05 01:00:55 UTC
On Mon, 4 Sep 2017 17:07:08 -0700, Charles Mills (***@MCN.ORG)
wrote about "Re: UTF-8 woes on z/OS, a solution - comments invited" (in
Post by Charles Mills
COBOL or Java, but what about the OP's PL/I?
IBM Enterprise PL/I has WIDECHAR(*), which supports UTF-16. It also has
the UTF8(), UTF8TOCHAR() and UTF8TOWCHAR() built-in functions that
translate host code page to UTF-8, UTF-8 to host code page, and UTF-8 to
UTF-16, respectively. These will probably handle UTF-8 translations more
reliably than IND$FILE does.

The problem is the complexity that was previously hidden is now visibly
the province of the programmer.
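As a rough Python analogue (not Enterprise PL/I; the function name here is invented for illustration) of what a UTF-8-to-WIDECHAR conversion like UTF8TOWCHAR does:

```python
# Decode the UTF-8 byte stream and re-encode it as 16-bit code units,
# the form a WIDECHAR variable would hold.
def utf8_to_widechar(raw: bytes) -> bytes:
    """UTF-8 bytes in, big-endian UTF-16 code units out."""
    return raw.decode('utf-8').encode('utf-16-be')

wide = utf8_to_widechar('Süd'.encode('utf-8'))  # 3 BMP code points -> 6 bytes
```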
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

Charles Mills
2017-09-05 01:36:42 UTC
Well there you go, then.

FTP or IND$FILE in binary.

Read in UTF-8 and translate to UTF-16.

Process in UTF-16.

Translate report UTF-16 to UTF-8.

Download in binary.

QED

Charles


Paul Gilmartin
2017-09-05 02:01:17 UTC
Post by David W Noon
wrote about "Re: UTF-8 woes on z/OS, a solution - comments invited" (in
Post by Charles Mills
COBOL or Java, but what about the OP's PL/I?
IBM Enterprise PL/I has WIDECHAR(*), which supports UTF-16. It also has
the UTF8(), UTF8TOCHAR() and UTF8TOWCHAR() built-in functions that
translate host code page to UTF-8, UTF-8 to host code page, and UTF-8 to
UTF-16, respectively. These will probably handle UTF-8 translations more
reliably than IND$FILE does.
The problem is the complexity that was previously hidden is now visibly
the province of the programmer.
Why is there UTF-16?

o It's a variable-length encoding, involving the same complexities as UTF-8.

o It lacks the compactness of UTF-8 in the case of Latin text.

Is it because it's (sort of) an extension of UCS-2?

(What does Java use internally?)

-- gil

Pew, Curtis G
2017-09-05 12:35:31 UTC
Post by Paul Gilmartin
Why is there UTF-16?
o It's a variable-length encoding, involving the same complexities as UTF-8.
o It lacks the compactness of UTF-8 in the case of Latin text.
Is it because it's (sort of) an extension of UCS-2?
(What does Java use internally?)
Unicode was originally supposed to be a fixed-width, 16-bit encoding. Fixed-width was actually a design criterion for the original developers. It was only after it became clear that there was no possible way to fit all the needed characters into 16 bits that the “astral planes”[1] were (reluctantly) added to Unicode and the various UTF encodings defined. In this light, UTF-16 is the closest thing to the original version of Unicode. Also, if your text includes few or no Latin characters, UTF-16 may be just as compact as UTF-8, or even more compact, and can probably be processed more easily.

Since Java was developed when Unicode was still supposed to be a 16-bit encoding the early versions at least used what we would now call UTF-16. As I recall, there was a significant period of time after Unicode abandoned a fixed-width 16-bit representation before Java implementations really supported characters from the “astral planes”.


[1] Unicode is still organized into 64K ranges called “planes”. The original 0x0000–0xFFFF range is called the “Basic Multilingual Plane” (BMP) and “astral planes” is a convenient nickname for the other ranges.
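The BMP/astral-plane split shows up directly when counting 16-bit code units; a small Python sketch (illustrative only):

```python
# BMP characters take one 16-bit code unit in UTF-16; characters beyond
# U+FFFF ("astral planes") take a surrogate pair of two code units.
def utf16_units(ch: str) -> int:
    return len(ch.encode('utf-16-be')) // 2

bmp = '\u0398'         # GREEK CAPITAL LETTER THETA, inside the BMP
astral = '\U0001F600'  # an emoji, outside the BMP
print(utf16_units(bmp), utf16_units(astral))  # 1 vs. 2 code units
```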
--
Pew, Curtis G
***@austin.utexas.edu
ITS Systems/Core/Administrative Services


Paul Gilmartin
2017-09-05 13:53:03 UTC
Post by Pew, Curtis G
Unicode was originally supposed to be a fixed-width, 16-bit encoding. Fixed-width was actually a design criteria for the original developers. It was only after it became clear that there was no possible way to fit all the needed characters into 16 bits that the “astral planes”[1] were (reluctantly) added to Unicode and the various UTF encodings defined. In this light, UTF-16 is the closest thing to the original version of Unicode. Also, if your text includes few or no Latin characters UTF-16 may be just as compact, or even more compact, than UTF-8, and can probably be processed more easily.
Are you confusing UTF-16 and UCS-2?
https://en.wikipedia.org/wiki/UTF-16

UTF-16 (16-bit Unicode Transformation Format) is a character encoding
capable of encoding all 1,112,064 valid code points of Unicode. The
encoding is variable-length, as code points are encoded with one or two
16-bit code units. (also see Comparison of Unicode encodings for a
comparison of UTF-8, -16 & -32)

UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2
(for 2-byte Universal Character Set) once it became clear that 16 bits were
not sufficient for Unicode's user community.[1]
-- gil

Pew, Curtis G
2017-09-05 15:08:23 UTC
Post by Paul Gilmartin
Are you confusing UTF-16 and UCS-2?
https://en.wikipedia.org/wiki/UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding
capable of encoding all 1,112,064 valid code points of Unicode. The
encoding is variable-length, as code points are encoded with one or two
16-bit code units. (also see Comparison of Unicode encodings for a
comparison of UTF-8, -16 & -32)
UTF-16 developed from an earlier fixed-width 16-bit encoding known as UCS-2
(for 2-byte Universal Character Set) once it became clear that 16 bits were
not sufficient for Unicode's user community.[1]
I was trying to say what the second paragraph you quoted says, without explicitly mentioning UCS-2. At least part of the answer to “Why is there UTF-16?” is “Because once there was UCS-2.”
--
Pew, Curtis G
***@austin.utexas.edu
ITS Systems/Core/Administrative Services


Elardus Engelbrecht
2017-09-05 09:45:52 UTC
Post by Linda
Ummm and I heard (and used it as) it as Seriously Outa Luck!
That is one of the polite versions of "Sh*t Outa Luck"... ;-D


SOL also has over 200 other meanings, according to http://www.acronymfinder.com :

System Off Line

Smile Out Loud
Sadly Outta Luck (polite form)
Stuff Outta Luck (polite form)
Sorta Outta Luck
Sobbing Out Loud ... whaaaaaaaaaaaa... sniffff... sniffff... ;-)
Swear Out Loud ... %$#@#@%^&**&#@ ...
Short on Luck
Sort of Laughing
Smiling Out Loud
Scream Out Loud
Shoot Out of Luck
Still Out of Luck
Smilling Out Loud
Sorry Out of Luck
Sure/Sore Out of Luck
...etc...


The same goes for RTFM - Polite version - Read The Fine/Freaking Manual.

For a crude version replace Fine/Freaking with your favourite word... ;-D

Is it Friday already?

Groete / Greetings
Elardus Engelbrecht

Scott Chapman
2017-09-05 11:26:53 UTC
Post by Paul Gilmartin
(What does Java use internally?)
-- gil
Currently Java does use UTF-16, but Java 9 will get a little smarter about that, storing strings as 1 byte/character ISO8859-1/Latin-1 where it can.
http://openjdk.java.net/jeps/254

The G1 garbage collector (which I believe will be the new default) will also get string deduplication:
http://openjdk.java.net/jeps/192

Since those are internal JVM things, whether they will make it into the IBM JVM I of course don't know.

Scott Chapman

Timothy Sipples
2017-09-05 14:30:00 UTC
Post by Paul Gilmartin
Why is there UTF-16?
[....]
o It lacks the compactness of UTF-8 in the case of Latin text.
Japanese Kanji, Traditional Chinese, Simplified Chinese, and emoji (!), as
examples, are not Latin text. More than 1.5 billion people is a lot of
people, and that's not counting all the billions of emoji users. :-)

And who cares about this compactness, really? Bytes are no longer *that*
precious, especially when they're compressed anyway.
Post by Paul Gilmartin
(What does Java use internally?)
UTF-16, as it happens.

FYI, if DB2 for z/OS is in the loop then DB2 will convert UTF-8 to UTF-16
for your PL/I application(s). Just store the UTF-8 data in DB2, use the
WIDECHAR datatype, and it all happens automagically, effortlessly, with no
UTF-8 to UTF-16 programming required. See here for more information:

https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/char/src/tpc/db2z_processunidatapli.html

If for some odd reason you absolutely insist on an EBCDIC-ish approach then
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.

--------------------------------------------------------------------------------------------------------
Timothy Sipples
IT Architect Executive, Industry Solutions, IBM z Systems, AP/GCG/MEA
E-Mail: ***@sg.ibm.com

Tony Harminc
2017-09-05 16:06:24 UTC
Post by Timothy Sipples
If for some odd reason you absolutely insist on an EBCDIC-ish approach then
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
Another EBCDIC-ish approach would be UTF-EBCDIC. This is fully supported by
z/OS Unicode conversion services; perhaps PL/I (and other things) should
make it Just Work under the covers.

Tony H.

Paul Gilmartin
2017-09-05 15:18:33 UTC
Post by Timothy Sipples
FYI, if DB2 for z/OS is in the loop then DB2 will convert UTF-8 to UTF-16
for your PL/I application(s). Just store the UTF-8 data in DB2, use the
WIDECHAR datatype, and it all happens automagically, effortlessly, with no
https://www.ibm.com/support/knowledgecenter/en/SSEPEK_12.0.0/char/src/tpc/db2z_processunidatapli.html
What language(s) cleanly handle vertical alignment of formatted text output when
the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?
Here's an example of /bin/printf's failure for similar input with UTF-8 on MacOS:

The script:
printf "%-22s+++\n" "Hello World."
printf "%-22s+++\n" "Привет мир."
printf "%-22s+++\n" "Bonjour le monde."

writes:
Hello World. +++
Привет мир. +++
Bonjour le monde. +++

I wish the "+++" would line up (at least in a monospaced font).
What sort of PICTURE would work for such, not restricting to BMP?
Post by Timothy Sipples
If for some odd reason you absolutely insist on an EBCDIC-ish approach then
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
The worst of both worlds.

-- gil

David W Noon
2017-09-05 16:55:33 UTC
On Tue, 5 Sep 2017 10:19:45 -0500, Paul Gilmartin
(0000000433f07816-dmarc-***@LISTSERV.UA.EDU) wrote about "Re: UTF-8
woes on z/OS, a solution - comments invited" (in
<***@listserv.ua.edu>):

[snip]
Post by Paul Gilmartin
What language(s) cleanly handle vertical alignment of formatted text output when
the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?
Python and Java, at least.
Post by Paul Gilmartin
printf "%-22s+++\n" "Hello World."
printf "%-22s+++\n" "Привет мир."
printf "%-22s+++\n" "Bonjour le monde."
Hello World. +++
Привет мир. +++
Bonjour le monde. +++
I wish the "+++" would line up (at least in a monospaced font).
This is a bug in your printf UNIX command. It is counting bytes to
determine print position, rather than counting glyphs. It probably isn't
Unicode-aware.
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

Peter Hunkeler
2017-09-06 05:41:08 UTC
Post by Paul Gilmartin
Post by Timothy Sipples
If for some odd reason you absolutely insist on an EBCDIC-ish approach then
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd probably
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
The worst of both worlds.
It's repeating history. The origin of all that code page mess was companies (not countries at that time) starting to build their own custom code page for any character in need that was not in the (single) EBCDIC code page. Later, some standardization was done and country code pages evolved.


While it was justifiable at that time, it is not today. Do not start this mess again by doing your own code page thing in your programs. Go Unicode, UTF-8 or UTF-16, whatever suits best.
--
Peter Hunkeler

John McKown
2017-09-06 13:16:29 UTC
Post by Robert Prins
Post by Paul Gilmartin
Post by Timothy Sipples
If for some odd reason you absolutely insist on an EBCDIC-ish approach
then
Post by Paul Gilmartin
Post by Timothy Sipples
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd
probably
Post by Paul Gilmartin
Post by Timothy Sipples
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
The worst of both worlds.
It's repeating history. The origin of all that code page mess was
companies (not countries at that time) starting to build their own custom
code page for any character in need that was not in the (single) EBCDIC
code page. Later, some standardization was done and country code pages
evolved.
While it was justifiable at that time, it is not today. Do not start this
mess again by doing your own code page thing in your programs. Go Unicode,
UTF-8 or UTF-16, whatever suits best.
I agree with the sentiment. On Linux/Intel, I set my locale to en_US.utf8.
The "Go" and "Python3" language definitions _require_ their source to be in
UTF-8. But I wonder how well UTF-8 is really supported by z/OS
_applications_. I'm still stuck on z/OS 1.13 and COBOL 4.2, so I will ask.
Can I directly (and correctly) process UTF-8 coded characters in a COBOL 6
program? Even the multibyte characters? What about DFSORT? From the manual
at
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.icea100/ice2ca_DFSORT_data_formats.htm
it appears to support UTF8, UTF16, and UTF32. But I'd love to see an
example of how that works. In particular, how do you say "this file is in
UTF8. Sort on the 3rd through the 10th characters."? The problem, to me, is
how do I say "the 3rd through the 10th characters"? If the data is all in
UTF8, then the 3rd character need not start in the 3rd byte. And the number
of bytes is not necessarily 8, but could be from 8 to 32 bytes depending.
Also, according to the same manual (different page), a "character string"
is always in EBCDIC. So I guess if you want to include based on a UTF8
string, you need to use hex encoding.
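No claim here about DFSORT's own control-statement syntax, but the byte arithmetic behind "the 3rd through the 10th characters" is mechanical, because UTF-8 continuation bytes are exactly those matching 10xxxxxx. A Python sketch:

```python
def char_slice(data: bytes, first: int, last: int) -> bytes:
    """Bytes holding characters first..last (1-based, inclusive) of
    UTF-8 data, found by skipping continuation bytes (0x80-0xBF)."""
    # Byte offset of each character's lead byte
    leads = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    leads.append(len(data))  # sentinel: one past the final character
    return data[leads[first - 1]:leads[last]]

rec = "abSüd und Nord".encode("utf-8")
print(char_slice(rec, 3, 10).decode("utf-8"))  # "Süd und ": 8 chars, 9 bytes
```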
Post by Robert Prins
--
Peter Hunkeler
--
UNIX was not designed to stop you from doing stupid things, because that
would also stop you from doing clever things. -- Doug Gwyn

Maranatha! <><
John McKown

Don Poitras
2017-09-06 13:33:27 UTC
For collating, I think most people use the ICU libraries. I know the C++
version has been used on z/OS by lots of folks and some searching found
a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.

http://userguide.icu-project.org/usefrom/cobol
Post by Robert Prins
Post by Paul Gilmartin
Post by Timothy Sipples
If for some odd reason you absolutely insist on an EBCDIC-ish approach
then
Post by Paul Gilmartin
Post by Timothy Sipples
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI). Refer to CCSID 930 and CCSID 1390 for inspiration. You'd
probably
Post by Paul Gilmartin
Post by Timothy Sipples
use one of the EBCDIC Latin 1+euro codepages as a starting point, such as
1140, then SO/SI from there to pick up the exceptional characters.
The worst of both worlds.
It's repeating history. The origin of all that code page mess was
companies (not countries at that time) starting to build their own custom
code page for any character in need that was not in the (single) EBCDIC
code page. Later, some standardization was done and country code pages
evolved.
While it was justifiable at that time, it is not today. Do not start this
mess again by doing your own code page thing in your programs. Go Unicode,
UTF-8 or UTF-16, whatever suits best.
I agree with the sentiment. On Linux/Intel, I set my locale to en_US.utf8.
The "Go" and "Python3" language definitions _require_ their source to be in
UTF-8. But I wonder how well UTF-8 is really supported by z/OS
_applications_. I'm still stuck on z/OS 1.13 and COBOL 4.2, so I will ask.
Can I directly (and correctly) process UTF-8 coded characters in a COBOL 6
program? Even the multibyte characters? What about DFSORT? From the manual
at
https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.icea100/ice2ca_DFSORT_data_formats.htm
it appears to support UTF8, UTF16, and UTF32. But I'd love to see an
example of how that works. In particular, how do you say "this file is in
UTF8. Sort on the 3rd through the 10th characters."? The problem, to me, is
how do I say "the 3rd through the 10th characters"? If the data is all in
UTF8, then the 3rd character need not start in the 3rd byte. And the number
of bytes is not necessarily 8, but could be from 8 to 32 bytes depending.
Also, according to the same manual (different page), a "character string"
is always in EBCDIC. So I guess if you want to include based on a UTF8
string, you need to use hex encoding.
Post by Robert Prins
--
Peter Hunkeler
Maranatha! <><
John McKown
--
Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive
***@sas.com (919) 531-5637 Cary, NC 27513

John McKown
2017-09-06 13:39:59 UTC
Post by Don Poitras
For collating, I think most people use the ICU libraries. I know the C++
version has been used on z/OS by lots of folks and some searching found
a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.
http://userguide.icu-project.org/usefrom/cobol
Thanks! I'll read that over. We don't use anything other than CP-037 and
IBM-1047 EBCDIC (mainly the former) for our character data. But I like to
keep up with what is happening in the real world.
--
UNIX was not designed to stop you from doing stupid things, because that
would also stop you from doing clever things. -- Doug Gwyn

Maranatha! <><
John McKown

Don Poitras
2017-09-06 14:12:11 UTC
Post by John McKown
Post by Don Poitras
For collating, I think most people use the ICU libraries. I know the C++
version has been used on z/OS by lots of folks and some searching found
a COBOL page. I have no idea if z/OS COBOL 4.2 can use it.
http://userguide.icu-project.org/usefrom/cobol
Thanks! I'll read that over. We don't use anything other than CP-037 and
IBM-1047 EBCDIC (mainly the former) for our character data. But I like to
keep up with what is happening in the real world.
Maranatha! <><
John McKown
You're welcome. I noticed that that page says it has sample programs,
but they got truncated somehow. I found the full samples on github:

https://github.com/morecobol/icu4c-cobol-samples/tree/master/src
--
Don Poitras - SAS Development - SAS Institute Inc. - SAS Campus Drive
***@sas.com (919) 531-5637 Cary, NC 27513

Walt Farrell
2017-09-05 15:41:04 UTC
Post by Paul Gilmartin
What language(s) cleanly handle vertical alignment of formatted text output when
the text contains UTF-16 supplemental/surrogate (not in the BMP) characters?
printf "%-22s+++\n" "Hello World."
printf "%-22s+++\n" "Привет мир."
printf "%-22s+++\n" "Bonjour le monde."
Hello World. +++
Привет мир. +++
Bonjour le monde. +++
I wish the "+++" would line up (at least in a monospaced font).
What sort of PICTURE would work for such, not restricting to BMP?
It would take more than a simple script like that, but with programming it can be done. I have a Python program that does it, for example. The key is understanding that some characters don't take up any space when printed (combining characters, for example), and therefore don't contribute to the length of the output string. When those characters are present you need to pad the end with blanks if you want a fixed width output string.

Python has Unicode functions that let you examine the characteristics of the characters within a string so you can figure out the proper length when printed, but I'm not aware of anything built-in like a print function that does that automatically. It would be handy.

Presumably one could do that in other languages, too. And presumably one could implement a print function that did that automatically. Perhaps someone has, or perhaps some language can do it automatically.
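A sketch of that approach in Python, using unicodedata to skip combining marks when computing the printed width (a simplification: it ignores double-width East Asian characters, which unicodedata.east_asian_width could also account for):

```python
import unicodedata

def display_len(s: str) -> int:
    # Combining marks occupy no column of their own in a monospaced
    # display, so count only the non-combining code points.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

def pad_display(s: str, width: int) -> str:
    # Pad with blanks based on printed width, not code-point count.
    return s + " " * max(0, width - display_len(s))

decomposed = "re\u0301sume\u0301"  # "résumé" with combining acute accents
print(len(decomposed), display_len(decomposed))  # 8 code points, 6 columns
print(pad_display(decomposed, 22) + "+++")
```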
--
Walt

Pew, Curtis G
2017-09-05 16:32:35 UTC
Post by Walt Farrell
Python has Unicode functions that let you examine the characteristics of the characters within a string so you can figure out the proper length when printed, but I'm not aware of anything built-in like a print function that does that automatically. It would be handy.
In Python 3, at least, the built-in substitution facility can handle it as-is:

Python 3.5.4 (default, Aug 12 2017, 14:31:52)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def fmtprt(hw):
...     print("%-22s+++\n" % hw)
...
>>> fmtprt("Hello, world!")
Hello, world!         +++
>>> fmtprt("Привет мир.")
Привет мир.           +++
>>> fmtprt("Bonjour le monde.")
Bonjour le monde.     +++
--
Pew, Curtis G
***@austin.utexas.edu
ITS Systems/Core/Administrative Services


David W Noon
2017-09-05 16:51:46 UTC
On Tue, 5 Sep 2017 16:33:43 +0000, Pew, Curtis G
(***@AUSTIN.UTEXAS.EDU) wrote about "Re: UTF-8 woes on z/OS, a
solution - comments invited" (in
Python 3 uses UTF-32 for all its default character strings. This
relieves the problem of counting bytes or counting glyphs.
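Whatever CPython stores internally (since 3.3 it picks a per-string representation under PEP 393), indexing and len() work on code points, so even non-BMP characters count as one:

```python
s = "\U0001F600"  # GRINNING FACE, a supplementary (non-BMP) character
print(len(s))                      # 1 code point
print(len(s.encode("utf-8")))      # 4 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 4 bytes: a UTF-16 surrogate pair
```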
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

Robert Prins
2017-09-05 18:10:18 UTC
Post by Walt Farrell
Post by Paul Gilmartin
What language(s) cleanly handle vertical alignment of formatted text output
when the text contains UTF-16 supplemental/surrogate (not in the BMP)
characters? Here's an example of /bin/printf's failure for similar input
The script: printf "%-22s+++\n" "Hello World." printf "%-22s+++\n" "Привет
мир." printf "%-22s+++\n" "Bonjour le monde."
writes: Hello World. +++ Привет мир. +++ Bonjour le monde.
+++
I wish the "+++" would line up (at least in a monospaced font). What sort
of PICTURE would work for such, not restricting to BMP?
It would take more than a simple script like that, but with programming it
can be done. I have a Python program that does it, for example. The key is
understanding that some characters don't take up any space when printed
(combining characters, for example), and therefore don't contribute to the
length of the output string. When those characters are present you need to
pad the end with blanks if you want a fixed width output string.
And that is exactly what I'm doing with my translate/sum method. I know that any
character that starts with the orange bytes in
<https://en.wikipedia.org/wiki/UTF-8#Codepage_layout> is a non-printing one (and
yes a few exceptions that I do not cater for, assuming the non-z/OS file to
contain correct UTF-8) and the translate just sets them to zero.

As I wrote, it works like a charm, but may not be the most efficient way of
doing things, although, given the (still) limited amount of UTF-8 text that has
to undergo this kind of processing, it's probably way faster than converting the
entire file into a multi-byte format, and using PL/I WCHAR's and the ULENGTH()
builtin, which must, in its implementation, do something pretty similar anyway.
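For illustration only, the translate/sum idea rendered in Python (the original does this in PL/I with TRANSLATE, against bytes that IND$FILE/FTP has already mapped to EBCDIC): lead bytes map to 1, continuation bytes (0x80-0xBF) to 0, and the sum is the character count.

```python
# 256-entry table: 1 for a UTF-8 lead byte, 0 for a continuation byte
LEAD = bytes(0 if 0x80 <= b <= 0xBF else 1 for b in range(256))

def char_count(data: bytes) -> int:
    # Translate-and-sum: the character count of well-formed UTF-8
    return sum(LEAD[b] for b in data)

field = "Süd".encode("utf-8")         # 4 bytes but only 3 characters
pad = len(field) - char_count(field)  # extra blanks the report column needs
print(char_count(field), pad)
```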

Robert
--
Robert AH Prins
robert(a)prino(d)org

Andy Wood
2017-09-05 22:18:00 UTC
On Tue, 5 Sep 2017 22:30:59 +0800, Timothy Sipples <***@SG.IBM.COM> wrote:

...
Post by Timothy Sipples
you can do what the Japanese have done for decades: Shift Out (SO), Shift
In (SI).
ZCZC
DECADESQUERY
NNNN
