Discussion:
Drowning in service units on z/os 1.13 after migrating from v1.11
Jim Mooney
2012-07-26 16:15:34 UTC
We upgraded from z/OS v1.11 to v1.13 on our z10 about 2 weeks ago. Ops reported that all batch jobs are taking longer 'wall' time. Researching the cause, we looked at job step termination stats. We discovered that, on z/OS v1.13, all steps are reporting about 50 times as many service units as were reported on z/OS v1.11. While TCB & SRB CPU times seem equivalent (no dramatic increase on v1.13), the elapsed time has increased roughly by a factor of 2 and the "SERV" value is dramatically higher for each step. A job that took 30 mins now takes 75 mins. We first thought it might be a problem unique to SORT (SyncSort), but even the Control-R step (as well as non-sort steps in other jobs) is 30x higher.

We have opened a PMR with IBM, but I thought I would see if anyone else has observed anything like this when migrating from z/os v1.11 to v1.13, or may have some idea where to look for an explanation. Our service coefficients have not changed. Any comments would be welcome.

From WLM (same on v1.11 and v1.13):
***** begin WLM screen paste *****
Enter or change the Service Coefficients:

CPU . . . . . . . . . . . . . 1.0 (0.0-99.9)
IOC . . . . . . . . . . . . . 0.5 (0.0-99.9)
MSO . . . . . . . . . . . . . 0.0001 (0.0000-99.9999)
SRB . . . . . . . . . . . . . 1.0 (0.0-99.9)

Enter or change the service definition options:

I/O priority management . . . . . . . . YES (Yes or No)
Dynamic alias tuning management . . . . YES (Yes or No)
***** end of WLM screen paste ********

Job run on z/os v1.13:
01.37.16 JOB05818 ---- THURSDAY, 26 JUL 2012 ----
01.37.16 JOB05818 ICH70001I CTMPUSR LAST ACCESS AT 01:37:14 ON THURSDAY, JULY 26, 2012
01.37.16 JOB05818 $HASP373 PR#2500 STARTED - INIT 4 - CLASS T - SYS SYSP
01.37.16 JOB05818 IEF403I PR#2500 - STARTED - TIME=01.37.16
01.37.18 JOB05818 -                            -----TIMINGS (MINS.)------                  -----PAGING COUNTS----
01.37.18 JOB05818 -STEPNAME PROCSTEP RC   EXCP   CONN   TCB   SRB  CLOCK     SERV WORKLOAD  PAGE  SWAP   VIO SWAPS
01.37.18 JOB05818 -NONCAT2  CONTROLR 00    597    617   .00   .00     .0     3306 BATPRD       0     0     0     0
01.37.19 JOB05818 -PR#2500  STEP00   00     37     32   .00   .00     .0       82 BATPRD       0     0     0     0
01.37.21 JOB05818 -PR#2500  SORT01   00   1436    501   .00   .00     .0    24350 BATPRD       0     0     0     0
01.37.24 JOB05818 -PR#2500  SORT02   00   2656    947   .00   .00     .0    44752 BATPRD       0     0     0     0
01.37.25 JOB05818 -PR#2500  SORT03   00    462    163   .00   .00     .0    27606 BATPRD       0     0     0     0
01.37.28 JOB05818 -PR#2500  SORT04   00   1400    512   .02   .00     .0   130652 BATPRD       0     0     0     0
01.38.03 JOB05818 -PR#2500  SORT05   00  17075   6394   .26   .01     .5  7699795 BATPRD       0     0     0     0
02.53.03 JOB05818 -PR#2500  STEP05   00   961K   313K   .67   .09   74.9  2173651 BATPRD       0     0     0     0
02.53.03 JOB05818 IEF404I PR#2500 - ENDED - TIME=02.53.03
02.53.03 JOB05818 -PR#2500 ENDED. NAME- TOTAL TCB CPU TIME= .98 TOTAL ELAPSED TIME= 75.7
*********** end of job step stats **************
Same Job run on z/os v1.11:
01.45.41 JOB02442 ---- FRIDAY, 13 JUL 2012 ----
01.45.41 JOB02442 ICH70001I CTMPUSR LAST ACCESS AT 01:45:41 ON FRIDAY, JULY 13, 2012
01.45.41 JOB02442 $HASP373 PR#2500 STARTED - INIT 31 - CLASS T - SYS SYSP
01.45.41 JOB02442 IEF403I PR#2500 - STARTED - TIME=01.45.41
01.45.45 JOB02442 -                            -----TIMINGS (MINS.)------                  -----PAGING COUNTS----
01.45.45 JOB02442 -STEPNAME PROCSTEP RC   EXCP   CONN   TCB   SRB  CLOCK     SERV WORKLOAD  PAGE  SWAP   VIO SWAPS
01.45.45 JOB02442 -NONCAT2  CONTROLR 00    599    470   .00   .00     .0      127 BATPRD       0     0     0     0
01.45.47 JOB02442 -PR#2500  STEP00   00     47     38   .00   .00     .0          BATPRD       0     0     0     0
01.45.55 JOB02442 -PR#2500  SORT01   00   1558    619   .00   .00     .1     6845 BATPRD       0     0     0     0
01.46.01 JOB02442 -PR#2500  SORT02   00   2885   1058   .00   .00     .1    12080 BATPRD       0     0     0     0
01.46.09 JOB02442 -PR#2500  SORT03   00    496    186   .00   .00     .1      218 BATPRD       0     0     0     0
01.46.29 JOB02442 -PR#2500  SORT04   00   1518    614   .02   .00     .3      621 BATPRD       0     0     0     0
01.50.18 JOB02442 -PR#2500  SORT05   00  18893   7856   .30   .00    3.8     8114 BATPRD       0     0     0     0
02.18.29 JOB02442 -PR#2500  STEP05   00  1066K   370K   .68   .08   28.1   150650 BATPRD       0     0     0     0
02.18.29 JOB02442 IEF404I PR#2500 - ENDED - TIME=02.18.29
02.18.29 JOB02442 -PR#2500 ENDED. NAME- TOTAL TCB CPU TIME= 1.03 TOTAL ELAPSED TIME= 32.7
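
To put numbers on the comparison, the SERV and elapsed-time figures from the two job logs above can be divided step by step. The following is a minimal Python sketch: the values are copied from the logs, the variable and function names are illustrative only, and STEP00 is skipped because the v1.11 log shows no SERV value for it.

```python
# SERV values copied from the two job logs above, keyed by step name.
# STEP00 is omitted: the v1.11 log shows no SERV value for that step.
serv_v113 = {
    "CONTROLR": 3306, "SORT01": 24350, "SORT02": 44752, "SORT03": 27606,
    "SORT04": 130652, "SORT05": 7699795, "STEP05": 2173651,
}
serv_v111 = {
    "CONTROLR": 127, "SORT01": 6845, "SORT02": 12080, "SORT03": 218,
    "SORT04": 621, "SORT05": 8114, "STEP05": 150650,
}


def serv_ratios(new, old):
    """Return the new/old SERV ratio for every step present in both logs."""
    return {step: new[step] / old[step] for step in new if step in old}


ratios = serv_ratios(serv_v113, serv_v111)
for step, ratio in ratios.items():
    print(f"{step:<9}{ratio:8.1f}x")

# Total elapsed minutes from the job-end lines: 75.7 on v1.13, 32.7 on v1.11.
elapsed_ratio = 75.7 / 32.7
print(f"{'ELAPSED':<9}{elapsed_ratio:8.1f}x")
```

The per-step ratios range from under 4x (SORT01, SORT02) to more than 900x (SORT05), so there is no single multiplier across steps, while total elapsed time grows by roughly 2.3x.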


Sorry for the lack of formatting here.
TIA,
-Jim

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to ***@listserv.ua.edu with the message: INFO IBM-MAIN
Bob Shannon
2012-07-26 16:28:39 UTC
Have you processed the SMF data to ensure you don't have an error in your IEFACTRT exit?

Bob Shannon
Rocket Software

Joe du Plumber
2012-07-26 17:02:26 UTC
Bob grumbled:
Have you processed the SMF data to insure you don't have an error in your IEFACTRT exit?

Bob Shannon
Rocket Software

-

Bob,

Let's pretend that the OP knows how to tell time. Since the elapsed times are clearly posted, can't we please presume that the numbers (which are proportionate) tell the story as it happened?

I am wondering if the 1.11 sorts were mostly in-storage ("in-core") sorts while the 1.13 sorts were mostly not (i.e., requiring sortwk I/O which consumed service as well as additional elapsed time).

OP, could you please post the sort sysouts from the larger sorts, both before and after for comparison?

*sigh*
Joe


Blaicher, Christopher Y.
2012-07-26 17:43:50 UTC
Looking at the data in hand, the sorts don't seem to be the prime culprits. CPU times, connect times, and EXCP counts appear to be comparable between the 1.11 and 1.13 data. STEP05, which the poster says was not a sort, is consistent with the other steps.

Something changed to affect the number of service units in all the steps, but the elapsed time for STEP05 really ballooned. The elapsed time for SORT05, the only substantial sort, went down from about 230 seconds to about 30 seconds. Actually, the elapsed time for all the sort steps went down.

I see two things to chase from this. #1, What caused the service units to blow up like they did? #2, What changed in STEP05? Did the application make a change to how it works? Or were there other configuration changes made at the same time?

One shop I worked in reconfigured the DASD and by accident put all the high activity volumes on the same group. Can you say massive IOS Queue times? Anyway, check what RMF can tell you. It may give a hint.

Chris Blaicher
Senior Software Engineer, Software Services
Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8260 | M: 512-627-3803
E: ***@syncsort.com


Thomas Conley
2012-07-28 14:00:55 UTC
Hey, Joe du Plumber (if that's your real name),

How about you come correct with your real name and apologize to Bob for
such a condescending reply? Especially since he was 100% CORRECT!

*sigh*
Tom Conley (and yes, that's my real name)

Jim Mooney
2012-07-26 17:27:50 UTC
Thanks, Bob. We are using the vanilla 'SYS1.SAMPLIB(IEEACTRT)' supplied by IBM.

Staller, Allan
2012-07-26 17:30:37 UTC
Was it re-assembled since going to 1.13?

<snip>
Subject: Re: Drowning in service units on z/os 1.13 after migrating from v1.11

Thanks, Bob. We are using the vanilla 'SYS1.SAMPLIB(IEEACTRT)' supplied by IBM.
</snip>


John Eells
2012-07-26 17:33:06 UTC
A shot in the dark: Do you have the same data sets in the parmlib
concatenation for both R11 and R13?
--
John Eells
z/OS Technical Marketing
IBM Poughkeepsie
***@us.ibm.com

Lizette Koehler
2012-07-26 18:16:31 UTC
Have you compared reports from SMF with what you see in the JOBLOG? Have you verified the numbers with SAS+MXG, SAS+MICS, or some other SMF process that can generate reports? If you have scheduling software (like CA Workload Automation, aka ESP), it can also show your run times. Compare those against the JOBLOG.

Lizette
Jim Mooney
2012-07-26 18:41:27 UTC
Thanks to all for all the suggestions... very helpful. We just noticed the same thing Christopher pointed out (the step with the longest elapsed time is a COBOL program), so we are now focusing on LE and COBOL. There were no changes to the COBOL source. IEFACTRT was assembled for v1.13. The sort sysouts indicate only that we have 1 MB less virtual storage than on v1.11.

'SYS1.PARMLIB(CEEPRM00)' CEEDOPT section contains:
STACK(128K,128K,BELOW,KEEP,512K,128K),
STORAGE(00,NONE,00,8K),
TERMTHDACT(MSG,,96),

While 'CEE.SCEESAMP(CEEDOPT)' contains:
STACK=((128K,128K,ANYWHERE,KEEP,512K,128K),OVR),

Is it possible BELOW vs. ANYWHERE would cause this?

-Jim

Cheryl Walker
2012-07-26 20:08:22 UTC
Jim,

If total service units have increased, but TCB, SRB, and EXCPs haven't increased, that leaves the MSO component of the service units. You REALLY need to change MSO to 0. Even using MSO=0.0001 can result in enormous values for total service units. An RMF report can easily show you the relationship.

A change in z/OS release seldom changes the amount of storage you have. So during the change, did you either change the MSO setting or reconfigure the LPAR to have more storage? If you did either of these, the service units would increase.

A common phenomenon when the service units increase dramatically can be seen in sites that use multi-period batch. The jobs drop into the next period(s) faster, run at a lower dispatch priority, and have a longer elapsed time.

If this is the problem, the solution is to set MSO=0.
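
The multi-period effect described above can be illustrated with a small Python sketch. The period durations and service-unit totals below are hypothetical (they are not taken from the poster's WLM policy), and `period_for` is an illustrative helper, not a WLM interface. It only shows that, for the same work, a larger accumulated service-unit count places a job in a later period.

```python
from itertools import accumulate

# Hypothetical service class with three periods. Each duration is the number
# of service units a job may consume in that period before moving to the
# next one; the last period is open-ended. All numbers are made up.
PERIOD_DURATIONS = [5_000, 50_000]  # periods 1 and 2; period 3 has no limit


def period_for(accumulated_su, durations=PERIOD_DURATIONS):
    """Return the 1-based period a job is in after consuming accumulated_su."""
    # Cumulative thresholds: 5_000 ends period 1, 55_000 ends period 2.
    for period, limit in enumerate(accumulate(durations), start=1):
        if accumulated_su < limit:
            return period
    return len(durations) + 1


# The same hypothetical job, counted without and with a large MSO component.
print(period_for(4_000))    # still in period 1
print(period_for(200_000))  # already in period 3
```

A job in a later period typically runs with a less aggressive goal, which is how inflated service units can turn into longer elapsed time.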

Best regards,
Cheryl

======================
Cheryl Watson
Watson & Walker, Inc.
www.watsonwalker.com
======================
Gerhard Adam
2012-07-26 20:55:40 UTC
Service units cannot go up unless one of the resources being measured also
goes up. So, unless you have higher CPU usage, or I/O, the service units
cannot go up. Even MSO is only counted when you are using CPU.

The longer elapsed times with no increase in resource consumption indicates
a queuing problem. So, that's one area that should be investigated to
ensure that jobs have access [dispatching priority].

The second thing is that the reported number of service units simply looks
like an error. Values like 7699795 simply look bogus unless the
coefficients were changed to significantly larger values. As I
said, MSO is still calculated by multiplying CPU service, so there has to be
an increase there before it would be affected.

Unless you have these same numbers reported from some other process, I would
assume you have a bug in the IEFACTRT code.

Adam

Gerhard Adam
2012-07-26 23:03:32 UTC
While I can't say anything specific about your system, my own observations on two independent systems [one at 1.11 and the other at 1.12] indicate an error in IEEACTRT in SYS1.SAMPLIB.

In the 1.11 version the following code occurs (NOTE SMF30SRB_L is used).

LG R01,SMF30SRB_L GET SERVICE UNITS USED @P6C
BAL R14,PCOUNT2 CALL CONVERT ROUTINE @P6C

In the 1.12 version the code is (NOTE SMF30SRV_L is used):

LG R01,SMF30SRV_L GET SERVICE UNITS USED @P7C
BAL R14,PCOUNT2 CALL CONVERT ROUTINE @P6C

The latter version is correct and indicates TOTAL service used, while the previous only indicates SRB service used. That would account for the discrepancy in the reporting between the two releases.

As I said, I can't say that this is your specific problem, but it would be worth confirming that your IEEACTRT routine is correct.
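
To see why swapping one field for the other changes the reported number so much, here is a minimal Python sketch. It assumes total service is the sum of the CPU, SRB, I/O, and storage (MSO) components, each already weighted by its coefficient. The component values are invented for illustration, and the class and function names are hypothetical; they are not part of IEEACTRT or SMF.

```python
from dataclasses import dataclass


@dataclass
class StepService:
    """Weighted service-unit components for one job step (illustrative)."""
    cpu: int  # TCB (CPU) service
    srb: int  # SRB service
    ioc: int  # I/O service
    mso: int  # main-storage service

    def total(self):
        """Total service: the sum of all four components."""
        return self.cpu + self.srb + self.ioc + self.mso


def reported_serv(step, use_total):
    """Mimic the two sample-exit variants discussed above.

    use_total=False reports only SRB service (the SMF30SRB_L variant);
    use_total=True reports the sum of all components (the SMF30SRV_L variant).
    """
    return step.total() if use_total else step.srb


# Made-up component values for a single step.
step = StepService(cpu=40_000, srb=600, ioc=9_000, mso=120_000)
print(reported_serv(step, use_total=False))  # 600
print(reported_serv(step, use_total=True))   # 169600
```

Nothing about the step's resource use differs between the two calls; only the field chosen for the SERV column does.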

Adam

Gerhard Adam
2012-07-26 23:07:31 UTC
Actually I just confirmed that was a fix in 1.12.

Adam

Gerhard Adam
2012-07-26 20:56:55 UTC
I don't see anything that indicates resource consumption has increased, therefore the number of service units can't increase [except through calculation error]. Elapsed time also has to be accounted for by either increased resource consumption or increased wait time. Since there is no increase in resource consumption, we have to conclude it is simply increased waiting. Again, service units will not increase.

MSO is calculated using CPU service, so unless that goes up the numbers will not change.

In the absence of a huge increase in the coefficients, it seems that it's simply a reporting bug. The increased elapsed time is a different problem. I would want to confirm that the reported service units match a comparable report from RMF.

Adam

Joel C. Ewing
2012-07-27 14:08:59 UTC
Not true. Service units contribution from MSO involves a PRODUCT of
real memory usage and CPU usage. If memory usage goes up dramatically
and CPU usage stays constant or even declines slightly, with MSO > 0
this can significantly increase the service units. This can definitely
happen if you go from a real-memory constrained environment where
address spaces get all real pages that are not active trimmed from their
working set to one with an abundance of real memory where many pages
once referenced are retained in real memory even though no longer
referenced. MSO > 0 no longer makes sense when real memory is cheap and
plentiful relative to CP seconds and you want to encourage rather than
penalize practices and algorithms that use more real memory to reduce CP
usage.

But, less contention for real memory should, if anything, speed up job
execution, and the perception is that jobs are taking longer elapsed
time to run, not just using more SUs; so this seems an unlikely
explanation for what the site is observing.
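
The product relationship described above can be sketched in a few lines of Python. The thread only states that the MSO contribution is a product of real memory usage and CPU usage; the 1/50 scaling factor used here is an assumption based on the formula commonly quoted from IBM's tuning documentation, and the page-frame counts and CPU service value are invented for illustration.

```python
def mso_service_units(mso_coeff, page_frames, cpu_service_units, scale=1 / 50):
    """Storage service as coefficient * frames * CPU service * scale.

    The scale of 1/50 is an assumption; the thread only says the MSO
    contribution is a product of real-memory usage and CPU usage.
    """
    return mso_coeff * page_frames * cpu_service_units * scale


CPU_SU = 10_000  # hypothetical CPU service, identical in every case below
MSO = 0.0001     # MSO coefficient from the WLM panel in the first post

# Same CPU service, different numbers of retained real page frames.
constrained = mso_service_units(MSO, page_frames=500, cpu_service_units=CPU_SU)
plentiful = mso_service_units(MSO, page_frames=50_000, cpu_service_units=CPU_SU)
zeroed = mso_service_units(0.0, page_frames=50_000, cpu_service_units=CPU_SU)

print(f"constrained memory: {constrained:.1f}")
print(f"plentiful memory:   {plentiful:.1f}")
print(f"MSO coefficient 0:  {zeroed:.1f}")
```

With CPU service held constant, a 100-fold increase in retained page frames produces a 100-fold increase in the MSO component, and a coefficient of 0 removes it entirely.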
--
Joel C. Ewing, Bentonville, AR ***@acm.org

Gerhard Adam
2012-07-27 15:22:45 UTC
Post by Joel C. Ewing
If memory usage goes up dramatically and CPU usage stays constant or even declines slightly, with MSO > 0 this can significantly increase the service units.

A 20x increase in memory is a bit difficult not to notice, so that's why I said it would be a factor.

Post by Joel C. Ewing
MSO > 0 no longer makes sense when real memory is cheap and plentiful relative to CP seconds and you want to encourage rather than penalize practices and algorithms that use more real memory to reduce CP usage.

MSO no longer makes sense for any reason. That's why the coefficient can be set to zero. You aren't penalizing anyone since there are no physical swaps, so it's simply nonsense to "charge" for memory use only while the CPU is being used and then "charge" zero service when the user is logically swapped. It is a completely erroneous view of memory usage.

Adam

Jim Mooney
2012-07-26 23:03:15 UTC
Thanks again for all the valuable input. We made no memory changes to the LPAR. Moving forward, we are concentrating more on the elapsed time problem and less on the service units reporting.

It does seem to be looking more and more like we are CPU constrained (a queuing problem). At 2 a.m., when our subject batch job runs, the LPAR is at 95% CPU, and our monitor shows our batch job is spending 88 to 92% of its time waiting for CPU.

I am out for the next 3 days. I will post any findings when I return. Thx so much. -Jim

Brian Westerman
2012-07-29 04:05:29 UTC
I noticed that the original poster mentioned that the amount of VS went down by 1 MB. That seems trivial, except that in the many conversions to 1.13 I have performed, it should have gotten larger. I do agree that MSO should be set to zero, but I think the decrease in available VS could also cause more paging (a lot more), especially when it happens at a time when he reports that the system is heavily used.

I would investigate what you added or changed that decreased the VS. Possibly you upped something that you should not have. In most cases people tend to set things a lot higher than necessary.

If you want to talk about this in more detail so that we can get you more resources, feel free to contact me offline.

Brian

j***@msiinet.com
2013-05-01 20:36:44 UTC
Jim, I was wondering if you ever got this issue resolved? I am seeing the same problem and I am looking for an answer. Please let me know as soon as you can. Thanks!