Discussion:
More musings about Unicode, UTF-8, etc.
David W Noon
2017-09-10 23:02:17 UTC
Hi folks,

I have been doing some experiments on rendering Unicode and determining
the length of rendered text compared to its storage in bytes. I have
used Paul Gilmartin's 3 lines of text as sample data.

I have 4 programs/scripts, of which 3 work and 1 can never work. The
working programs are in C and C++, plus a script for zsh (a UNIX shell).
The script that will never work is for bash (another UNIX shell).

If anybody is interested I will post the code here in a zip archive. Any
takers?
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to ***@listserv.ua.edu with the message: INFO IBM-MAIN
Paul Gilmartin
2017-09-11 01:59:34 UTC
Post by David W Noon
> I have been doing some experiments on rendering Unicode and determining
> the length of rendered text compared to its storage in bytes. I have
> used Paul Gilmartin's 3 lines of text as sample data.
>
> I have 4 programs/scripts, of which 3 work and 1 can never work. The
> working programs are in C and C++, plus a script for zsh (a UNIX shell).
> The script that will never work is for bash (another UNIX shell).
>
> If anybody is interested I will post the code here in a zip archive. Any
> takers?
I doubt that LISTSERV will tolerate a zip archive. But if you post,
Cc: my address above.

(zsh long ago was the default shell for MacOS. It tried to be a
hybrid of sh and csh. Are you using csh constructs?)

Thanks,
gil

----------------------------------------------------------------------
Jack J. Woehr
2017-09-11 02:44:54 UTC
Post by David W Noon
> I have been doing some experiments on rendering Unicode
Go Language.
--
Jack J. Woehr # Science is more than a body of knowledge. It's a way of
www.well.com/~jax # thinking, a way of skeptically interrogating the universe
www.softwoehr.com # with a fine understanding of human fallibility. - Carl Sagan

----------------------------------------------------------------------
Paul Gilmartin
2017-09-11 03:37:24 UTC
Post by David W Noon
> I have been doing some experiments on rendering Unicode and determining
> the length of rendered text compared to its storage in bytes. I have
> used Paul Gilmartin's 3 lines of text as sample data.
>
> I have 4 programs/scripts, of which 3 work and 1 can never work. The
> working programs are in C and C++, plus a script for zsh (a UNIX shell).
> The script that will never work is for bash (another UNIX shell).
Trying the following:
#! /bin/sh
doit() {
echo; echo ===== $I
"$I" -c "printf \"%-22s+++\n\" \"Hello World.\""
"$I" -c "printf \"%-22s+++\n\" \"Привет мир.\""
"$I" -c "printf \"%-22s+++\n\" \"Bonjour le monde.\""
}
uname -a
for I in ash ksh dash csh tcsh zsh bash sh; do doit "$I"; done

on Linux RaspbPi-3-2700 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l GNU/Linux

... only zsh gives a desirable result.
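
One reading that is consistent with these results, offered as a sketch and not a diagnosis: a printf that pads by bytes sees the Cyrillic line as wider than it looks. The helper names utf8_bytes and utf8_chars are hypothetical. The character count assumes valid UTF-8 input and works by deleting continuation bytes (0x80-0xBF) in the C locale.

```sh
#! /bin/sh
# Hypothetical helpers: compare storage length with character count.

# Number of bytes in the string, independent of the current locale.
utf8_bytes() {
    echo $(( $(printf '%s' "$1" | LC_ALL=C wc -c) ))
}

# Number of code points: in valid UTF-8, every byte that is not a
# continuation byte (octal 200-277) starts a new character.
utf8_chars() {
    echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c) ))
}

for S in "Hello World." "Привет мир." "Bonjour le monde."; do
    echo "$S: $(utf8_bytes "$S") bytes, $(utf8_chars "$S") characters"
done
```

The Cyrillic line comes to 22 bytes for 11 characters, so a printf that counts bytes treats the %-22s field as already full and adds no padding.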

-- gil

----------------------------------------------------------------------
Paul Gilmartin
2017-09-11 03:56:06 UTC
Post by David W Noon
> I have been doing some experiments on rendering Unicode and determining
> the length of rendered text compared to its storage in bytes. I have
> used Paul Gilmartin's 3 lines of text as sample data.
>
> I have 4 programs/scripts, of which 3 work and 1 can never work. The
> working programs are in C and C++, plus a script for zsh (a UNIX shell).
> The script that will never work is for bash (another UNIX shell).
And more. This:

#! /bin/sh
printit() {
# One string per call: "$@" inside a longer quoted word splits on extra arguments.
"$I" -c "printf \"%-22s+++\n\" \"$1\""
}

doit() {
echo; echo ===== $I
printit "Привет мир."
printit "Emmanuel Macron"
printit "문재인"
printit "Enrique Peña Nieto"
printit "Владимир Путин"
printit "Donald Trump"
printit "习近平"
}
uname -a
for I in ash ksh dash csh tcsh zsh bash sh; do doit "$I"; done

... shows:
...
===== zsh
Привет мир. +++
Emmanuel Macron +++
문재인 +++
Enrique Peña Nieto +++
Владимир Путин +++
Donald Trump +++
习近平 +++

===== bash
Привет мир. +++
Emmanuel Macron +++
문재인 +++
Enrique Peña Nieto +++
Владимир Путин+++
Donald Trump +++
习近平 +++
...
zsh is not ideal, but still the best.

-- gil

----------------------------------------------------------------------
David W Noon
2017-09-11 15:40:13 UTC
On Sun, 10 Sep 2017 22:57:18 -0500, Paul Gilmartin
(0000000433f07816-dmarc-***@LISTSERV.UA.EDU) wrote about "Re: More
musings about Unicode, UTF-8, etc." (in
<***@listserv.ua.edu>):

[snip]
Post by Paul Gilmartin
> ===== zsh
> Привет мир. +++
> Emmanuel Macron +++
> 문재인 +++
> Enrique Peña Nieto +++
> Владимир Путин +++
> Donald Trump +++
> 习近平 +++
> ===== bash
> Привет мир. +++
> Emmanuel Macron +++
> 문재인 +++
> Enrique Peña Nieto +++
> Владимир Путин+++
> Donald Trump +++
> 习近平 +++
> ...
> zsh is not ideal, but still the best.
I have added these strings to my code and the results are the same as
yours. I suspect the rendering software does not handle CJK characters
very well in Indo-European locales.
I am attaching the zip archive to this message as a fake text file.
Rename it from Unicode_test.zip.txt to Unicode_test.zip and it should
unzip in the usual manner. It contains a directory to hold all its
files, so you can unzip it safely, without polluting another directory.

There is a Makefile included that can build the source code using either
GCC or CLANG using gmake. Those who use other C/C++ compilers will have
to work out their own build sequence.
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*


----------------------------------------------------------------------
Timothy Sipples
2017-09-11 04:15:36 UTC
Post by David W Noon
> The script that will never work is for bash (another UNIX shell).
I don't understand this sentence. Nor does Rocket Software, I assume:

http://www.rocketsoftware.com/zos-open-source/tools

--------------------------------------------------------------------------------------------------------
Timothy Sipples
IT Architect Executive, Industry Solutions, IBM z Systems, AP/GCG/MEA
E-Mail: ***@sg.ibm.com

----------------------------------------------------------------------
David W Noon
2017-09-11 14:53:57 UTC
On Mon, 11 Sep 2017 12:16:41 +0800, Timothy Sipples (***@SG.IBM.COM)
wrote about "Re: More musings about Unicode, UTF-8, etc." (in
Post by Timothy Sipples
> Post by David W Noon
> > The script that will never work is for bash (another UNIX shell).
> http://www.rocketsoftware.com/zos-open-source/tools
My statement was about the script, not the shell.
The issue is the printf command.

bash uses an external command /usr/bin/printf. This, on my system, is
part of Linux's coreutils package. AFAIAA, there is no impetus to make
coreutils Unicode-aware.

In contrast, zsh uses a shell intrinsic for printf.
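
Which printf a given shell actually runs can be checked directly. The following is a minimal sketch, assuming the shells named are installed; printf_kind is a hypothetical helper. It relies on "command -v", which in bash, dash and zsh prints a bare name for a builtin and an absolute path for an external utility.

```sh
#! /bin/sh
# Hypothetical helper: report how a given shell resolves the name printf.
printf_kind() {
    # A bare name means a builtin, an absolute path means an external
    # utility; anything else (including a missing shell) is "unknown".
    case "$("$1" -c 'command -v printf' 2>/dev/null)" in
        printf) echo builtin ;;
        /*)     echo external ;;
        *)      echo unknown ;;
    esac
}

for I in zsh bash dash sh; do
    echo "$I: $(printf_kind "$I")"
done
```

The answer depends on the system, so no expected output is shown here.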
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

 

----------------------------------------------------------------------
Paul Gilmartin
2017-09-11 05:27:01 UTC
Post by Timothy Sipples
> Post by David W Noon
> > The script that will never work is for bash (another UNIX shell).
> http://www.rocketsoftware.com/zos-open-source/tools
In most of the Linux shells that David and I tried, printf is a shell
builtin. Of these, only zsh seems to understand the variable-length
UTF-8 encoding. /usr/bin/printf is UTF-8 ignorant also.

In my CJK examples, most of the glyphs are displayed double-width,
so it's as much a matter of terminal emulator behavior as of printf's
formatting.
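
To illustrate the difference between characters and terminal columns, here is a rough sketch that pads by estimated columns. The helper names (utf8_chars, utf8_columns, pad_columns) are hypothetical, and the width rule is a loud assumption: any character whose UTF-8 lead byte is 0xE3-0xED (U+3000-U+D7FF, mostly CJK and Hangul) is counted as two columns. A real implementation would consult wcwidth() or the Unicode East Asian Width data.

```sh
#! /bin/sh
# Hypothetical helpers: pad to estimated terminal columns, not bytes.

# Count code points by deleting UTF-8 continuation bytes (octal 200-277).
utf8_chars() {
    echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c) ))
}

# ASSUMPTION: lead bytes 0xE3-0xED (octal 343-355) mark double-width
# characters; everything else is taken to be one column wide.
utf8_columns() {
    wide=$(( $(printf '%s' "$1" | LC_ALL=C tr -cd '\343-\355' | LC_ALL=C wc -c) ))
    echo $(( $(utf8_chars "$1") + wide ))
}

# Print $2 left-justified in a field of $1 columns, followed by +++.
pad_columns() {
    fill=$(( $1 - $(utf8_columns "$2") ))
    printf '%s' "$2"
    while [ "$fill" -gt 0 ]; do
        printf ' '
        fill=$(( fill - 1 ))
    done
    printf '+++\n'
}

pad_columns 22 "Привет мир."
pad_columns 22 "문재인"
pad_columns 22 "习近平"
pad_columns 22 "Donald Trump"
```

Under this assumption the two three-character CJK names count as 6 columns each, so they receive 16 spaces of padding. Whether the +++ markers then line up still depends on the terminal and font drawing those glyphs double-width.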

-- gil

----------------------------------------------------------------------
Paul Gilmartin
2017-09-11 16:48:18 UTC
Post by David W Noon
> I have added these strings to my code and the results are the same as
> yours. I suspect the rendering software does not handle CJK characters
> very well in Indo-European locales.
I'm calling it a font problem: The CJK characters display double-width,
Post by David W Noon
> I am attaching the zip archive to this message as a fake text file.
> Rename it from Unicode_test.zip.txt to Unicode_test.zip and it should
> unzip in the usual manner. It contains a directory to hold all its
> files, so you can unzip it safely, without polluting another directory.
>
> There is a Makefile included that can build the source code using either
> GCC or CLANG using gmake. Those who use other C/C++ compilers will have
> to work out their own build sequence.
Fails for me with:

525 $ make
make: gcc-config: Command not found
make: gcc-config: Command not found
make: Warning: File 'Unicode_test.cpp' has modification time 19042 s in the future
/g++ -o Unicode_test -pipe -std=gnu++14 -Wall -Wextra -O2 -fomit-frame-pointer Unicode_test.cpp -Wl,--as-needed,--strip-all
make: /g++: Command not found
Makefile:13: recipe for target 'Unicode_test' failed
make: *** [Unicode_test] Error 127

o I'm surprised that the fake text file survived network newline conversions.

o .zip is timezone-ignorant.

-- gil

----------------------------------------------------------------------
David W Noon
2017-09-11 18:23:48 UTC
On Mon, 11 Sep 2017 11:49:29 -0500, Paul Gilmartin
(0000000433f07816-dmarc-***@LISTSERV.UA.EDU) wrote about "Re: More
musings about Unicode, UTF-8, etc." (in
Post by Paul Gilmartin
> Post by David W Noon
> > I have added these strings to my code and the results are the same as
> > yours. I suspect the rendering software does not handle CJK characters
> > very well in Indo-European locales.
> I'm calling it a font problem: The CJK characters display double-width,
You are correct. I am using a fixed pitch font, but it uses 2 character
cells for the CJK characters.

[snip]
Post by Paul Gilmartin
> Post by David W Noon
> > There is a Makefile included that can build the source code using either
> > GCC or CLANG using gmake. Those who use other C/C++ compilers will have
> > to work out their own build sequence.
> 525 $ make
> make: gcc-config: Command not found
> make: gcc-config: Command not found
Which operating system are you using?

You should have received the gcc-config command as part of your GCC
toolchain(s). This command allows you to select from multiple versions
of GCC installed.

I developed the code on Gentoo Linux. Such a system can have 5 or 6 GCC
toolchains installed concurrently, so gcc-config is a must-have.
Post by Paul Gilmartin
> make: Warning: File 'Unicode_test.cpp' has modification time 19042 s in the future
I'm in the BST timezone, so I'm 5 hours ahead of NYC and 8 hours ahead
of LA/SF (and Redmond, WA, for that matter).
Post by Paul Gilmartin
> /g++ -o Unicode_test -pipe -std=gnu++14 -Wall -Wextra -O2 -fomit-frame-pointer Unicode_test.cpp -Wl,--as-needed,--strip-all
> make: /g++: Command not found
> Makefile:13: recipe for target 'Unicode_test' failed
> make: *** [Unicode_test] Error 127
If you edit the Makefile to remove the shell subcommands that invoke
gcc-config and remove the slash separator, you should then access gcc
and g++ through your PATH environment variable.
Post by Paul Gilmartin
> o I'm surprised that the fake text file survived network newline conversions.
I concluded that Listserv was pretty dumb, so I felt that an attachment
with a filename ending in .txt would survive.
Post by Paul Gilmartin
> o .zip is timezone-ignorant.
Yes, it's derived from an old MS-DOS/PC-DOS command and those systems
did not know about time zones when PKZIP was written. The archive file
format does not permit timezone data.
--
Regards,

Dave [RLU #314465]
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
***@googlemail.com (David W Noon)
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

----------------------------------------------------------------------
Paul Gilmartin
2017-09-11 22:36:26 UTC
Post by David W Noon
> Post by Paul Gilmartin
> > 525 $ make
> > make: gcc-config: Command not found
> > make: gcc-config: Command not found
> Which operating system are you using?
BunsenLabs. I'm giving up; not strongly motivated.
Post by David W Noon
> Post by Paul Gilmartin
> > o I'm surprised that the fake text file survived network newline conversions.
> I concluded that Listserv was pretty dumb, so I felt that an attachment
> with a filename ending in .txt would survive.
It arrived with: Content-Transfer-Encoding: base64 which protects it
pretty well. Don't know if you or your MUA elected that.
Post by David W Noon
> Post by Paul Gilmartin
> > o .zip is timezone-ignorant.
> Yes, it's derived from an old MS-DOS/PC-DOS command and those systems
> did not know about time zones when PKZIP was written. The archive file
> format does not permit timezone data.
Pax might have done better. Should be supported by any UNIX-like OS and
most Windows archive extractors.

-- gil
