Discussion:
DFSORT - a quick way to compare two huge files
R.S.
2018-07-06 11:03:59 UTC
We have two output files, rather huge ones (gigabytes rather than megabytes),
with LRECL~1500, FB.
The goal is to compare them and find the records that differ.
While I have some ideas how to do it (even ISPF Compare or ICETOOL), I
don't think they would be effective.

Any clues?
--
Radoslaw Skorupka
Lodz, Poland





This e-mail may contain legally privileged information of the Bank and is intended solely for business use of the addressee. This e-mail may only be received by the addressee and may not be disclosed to any third parties. If you are not the intended addressee of this e-mail or the employee authorized to forward it to the addressee, be advised that any dissemination, copying, distribution or any other similar activity is legally prohibited and may be punishable. If you received this e-mail by mistake please advise the sender immediately by using the reply facility in your e-mail software and delete permanently this e-mail including any copies of it either printed or saved to hard drive.

mBank S.A., with its registered office in Warsaw, ul. Senatorska 18, 00-950 Warszawa, www.mBank.pl, e-mail: ***@mBank.pl. District Court for the Capital City of Warsaw, 12th Commercial Division of the National Court Register, entrepreneurs register no. KRS 0000025237, NIP: 526-021-50-88. As of 01.01.2018, the share capital of mBank S.A. (fully paid up) amounts to PLN 169,248,488.


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to ***@listserv.ua.edu with the message: INFO IBM-MAIN
Elardus Engelbrecht
2018-07-06 11:41:11 UTC
Since you don't say what should happen when a record appears in file 1, in file 2, or in both files, look at

https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.icea100/ice2ca_Example_3_-_Create_files_with_matching_and_non-matching_records.htm

I quote from the above URL:

This example shows how you can match records in input data sets 1 and 2 to produce three output data sets with:

ON fields that appear in both input data set 1 and input data set 2
ON fields that appear only in input data set 1
ON fields that appear only in input data set 2

HTH!

Groete / Greetings
Elardus Engelbrecht
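
The three-way split that the quoted IBM example describes can be pictured with a rough Python sketch. This only models the logic, not DFSORT itself: the function name `split_by_key` and the sample records are made up for illustration, the whole record is assumed to be the key, and which file's copy lands in the "both" output is also an assumption.

```python
def split_by_key(file1, file2, key=lambda rec: rec):
    """Partition records into (both, only1, only2) by comparing keys.

    Assumption: 'both' holds the file-1 copy of each matched record.
    """
    keys1 = {key(rec) for rec in file1}
    keys2 = {key(rec) for rec in file2}
    both = [rec for rec in file1 if key(rec) in keys2]
    only1 = [rec for rec in file1 if key(rec) not in keys2]
    only2 = [rec for rec in file2 if key(rec) not in keys1]
    return both, only1, only2


# Hypothetical sample data, not taken from the thread.
file1 = ["AAA", "BBB", "CCC"]
file2 = ["BBB", "CCC", "DDD"]
both, only1, only2 = split_by_key(file1, file2)
print(both, only1, only2)
```

Note that this kind of match is by key value, so it says nothing about whether the records sit at the same position in both files.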

Elardus Engelbrecht
2018-07-06 11:49:45 UTC
Radoslaw,

I forgot to ask these questions:


Do you want a line-by-line comparison (same output as with ISPF Compare)?

Could you be so kind as to post a sample input and output? Also show how you want the differences to be handled.

This is to enable IBM-MAIN members to select the best solution for you.

The example I posted may help you, but please note that the contents are first sorted; records which are the same in both files are then copied into one file, and records which appear in only one of the files are copied to the corresponding output.

Groete / Greetings
Elardus Engelbrecht
Sri h Kolusu
2018-07-06 13:34:37 UTC
RS,

As Elardus kindly pointed out, you can use JOINKEYS to compare 2 different
files. You have the option of comparing keys up to 4088 bytes in Binary.
In your case the input file LRECL is only 1500, so even if you wanted to
compare each and every byte, you can do that. However, remember that if
you do happen to have duplicates in both files, then you might end up with
a Cartesian product, which is a join of every duplicate record of file-1 to
every duplicate record of file-2. For example, if file-1 has 100
duplicate records and file-2 has 100 duplicate records, then a Cartesian
join will return 10,000 records. You can eliminate the dups before the
match by coding JNF1CNTL and JNF2CNTL control statements.

So if you let us know the exact requirement, then we can show you how to get
the desired results.

Thanks,
Sri Hari Kolusu
DFSORT Development
IBM Corporation
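
The Cartesian-product caveat above is easy to check with a small Python model. This is only a sketch of the join arithmetic, not of DFSORT: `join_on_key` and `drop_duplicates` are hypothetical helper names, and the 10-byte key and sample records are assumptions.

```python
from collections import defaultdict


def join_on_key(file1, file2, key_len):
    """Pair every file-1 record with every file-2 record sharing its key."""
    by_key = defaultdict(list)
    for rec in file2:
        by_key[rec[:key_len]].append(rec)
    return [(r1, r2) for r1 in file1 for r2 in by_key.get(r1[:key_len], [])]


def drop_duplicates(records):
    """Keep the first copy of each record, preserving order."""
    return list(dict.fromkeys(records))


# Hypothetical data: 100 identical records in each file.
file1 = ["KEY0000001"] * 100
file2 = ["KEY0000001"] * 100

pairs = join_on_key(file1, file2, 10)
deduped_pairs = join_on_key(drop_duplicates(file1), drop_duplicates(file2), 10)
print(len(pairs), len(deduped_pairs))  # 10000 1
```

Removing duplicates on each side before the match collapses the 100 x 100 pairing to a single pair, which is the effect the pre-join dedup is meant to achieve.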
Paul Gilmartin
2018-07-06 13:34:56 UTC
Post by Elardus Engelbrecht
The example I posted may help you, but please note, the contents are first sorted and then records which are the same in both files are copied into one file and records which appear once in either will be copied to the right output.
Does it report if the two files contain the same records but in different order?
This may or may not matter to the OP.
-- gil

R.S.
2018-07-06 13:47:13 UTC
Post by Paul Gilmartin
Does it report if the two files contain the same records but in different order?
This may or may not matter to the OP.
No, the records are intended to be in the same order.
In other words, the first record in FILEA has to be the same as the first record
in FILEB, the second record in FILEA has to be the same as the second record
in FILEB, etc.
The intended result is RC=00 for identical files and RC>00 for
mismatches (with the non-matching records written to the output).

We plan to perform comparisons on several files, all of them are FB, but
LRECL is different for every file. Of course files being compared are
the same in terms of DCB.
--
Radoslaw Skorupka
Lodz, Poland
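
To pin down the requirement stated above (same order, whole-record compare, RC=00 or RC>00), here is a minimal Python sketch of the logic. It is not a DFSORT solution: the function name `compare_fixed_records`, the choice of 4 as the non-zero return code, and the tiny LRECL in the demo are all assumptions for illustration.

```python
import io


def compare_fixed_records(file_a, file_b, lrecl, report):
    """Compare two binary streams record by record.

    Returns 0 if every record pair is identical, 4 otherwise (the value 4
    is an arbitrary choice). Mismatching pairs are appended to `report`
    as (record_number, record_a, record_b).
    """
    rc = 0
    recno = 0
    while True:
        rec_a = file_a.read(lrecl)
        rec_b = file_b.read(lrecl)
        if not rec_a and not rec_b:
            break  # both files exhausted
        recno += 1
        if rec_a != rec_b:
            # Also covers the case where one file is shorter than the other.
            rc = 4
            report.append((recno, rec_a, rec_b))
    return rc


# Demo with a made-up LRECL of 4 instead of ~1500.
lrecl = 4
file_a = io.BytesIO(b"AAAABBBBCCCC")
file_b = io.BytesIO(b"AAAABXBBCCCC")
mismatches = []
rc = compare_fixed_records(file_a, file_b, lrecl, mismatches)
print(rc, mismatches)  # 4 [(2, b'BBBB', b'BXBB')]
```

Because LRECL is a parameter, the same routine would serve files with different record lengths, as long as both files in a pair share the same DCB attributes.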




Sri h Kolusu
2018-07-06 14:08:25 UTC
Post by Paul Gilmartin
Does it report if the two files contain the same records but in different order?
Gil,

Yes it can. Here is an example (assuming the compare key is 10 bytes)

//STEP0100 EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//INA      DD *
GIL
MARTIN
PAUL
ELARDUS
KOLUSU
//INB      DD *
MARTIN
PAUL
ELARDUS
GIL
KOLUSU
//SORTOUT  DD SYSOUT=*
//SYSIN    DD *
* Match the two files on the 10-byte key. Columns 11-15 carry the
* record number added by JNF1CNTL/JNF2CNTL below.
  OPTION COPY
  JOINKEYS F1=INA,FIELDS=(1,10,A)
  JOINKEYS F2=INB,FIELDS=(1,10,A)
  REFORMAT FIELDS=(F1:1,15,F2:11,5)
  INREC BUILD=(01,10,
               C'IS RECORD # ',
               11,05,
               C' IN FILE 1 AND IN FILE 2, IT IS RECORD # ',
               16,05)
/*
//JNF1CNTL DD *
* Tag each file-1 record with its original position.
  INREC OVERLAY=(11:SEQNUM,5,FS)
/*
//JNF2CNTL DD *
* Tag each file-2 record with its original position.
  INREC OVERLAY=(11:SEQNUM,5,FS)
/*

The output from the above job is

ELARDUS IS RECORD # 4 IN FILE 1 AND IN FILE 2, IT IS RECORD # 3
GIL IS RECORD # 1 IN FILE 1 AND IN FILE 2, IT IS RECORD # 4
KOLUSU IS RECORD # 5 IN FILE 1 AND IN FILE 2, IT IS RECORD # 5
MARTIN IS RECORD # 2 IN FILE 1 AND IN FILE 2, IT IS RECORD # 1
PAUL IS RECORD # 3 IN FILE 1 AND IN FILE 2, IT IS RECORD # 2

If you have any further questions, please let me know.

Thanks,
Kolusu
DFSORT Development
IBM Corporation
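
For readers less familiar with JOINKEYS, the job above can be modelled roughly in Python. This is only a sketch of the idea (tag each record with its position, join on the key, report both positions): the function names are hypothetical, keys are assumed to be unique within each file, and the output spacing follows the listing as shown in this thread rather than the exact column layout DFSORT would produce.

```python
def tag_with_seqnum(records):
    """Pair each key with its 1-based position, like the SEQNUM overlay."""
    return [(key, seq) for seq, key in enumerate(records, start=1)]


def join_positions(file1, file2):
    """Report, for every key found in both files, its position in each."""
    # Assumption: keys are unique within each file.
    pos2 = {key: seq for key, seq in tag_with_seqnum(file2)}
    lines = []
    # The join delivers records in key order, hence the sort.
    for key, seq1 in sorted(tag_with_seqnum(file1)):
        if key in pos2:
            lines.append(
                f"{key} IS RECORD # {seq1} IN FILE 1 "
                f"AND IN FILE 2, IT IS RECORD # {pos2[key]}"
            )
    return lines


ina = ["GIL", "MARTIN", "PAUL", "ELARDUS", "KOLUSU"]
inb = ["MARTIN", "PAUL", "ELARDUS", "GIL", "KOLUSU"]
report = join_positions(ina, inb)
for line in report:
    print(line)
```

A file pair is in the same order exactly when every reported line shows the same record number on both sides.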
Elardus Engelbrecht
2018-07-06 13:46:42 UTC
Post by Paul Gilmartin
Does it report if the two files contain the same records but in different order?
This may or may not matter to the OP.
Thanks Paul. You are 101% correct. I have mentioned this caveat.

Another question: does the OP want to compare WHOLE lines, or only selected columns?

Just like Sri, we all need more details before we can help out where we can.

Groete / Greetings
Elardus Engelbrecht

R.S.
2018-07-06 13:50:05 UTC
Post by Elardus Engelbrecht
Another question: does the OP want to compare WHOLE lines, or only selected columns?
Whole lines.
Records in both files are in the same order, so the comparison can be
done record by record.
--
Radoslaw Skorupka
Lodz, Poland




Elardus Engelbrecht
2018-07-06 14:34:32 UTC
Post by Sri h Kolusu
Yes it can. Here is an example (assuming the compare key is 10 bytes)
Thanks. Your example is excellent. I like the amazing combination of JOINKEYS, REFORMAT, BUILD and OVERLAY!
Post by Sri h Kolusu
The output from the above job is
ELARDUS IS RECORD # 4 IN FILE 1 AND IN FILE 2, IT IS RECORD # 3
GIL IS RECORD # 1 IN FILE 1 AND IN FILE 2, IT IS RECORD # 4
KOLUSU IS RECORD # 5 IN FILE 1 AND IN FILE 2, IT IS RECORD # 5
MARTIN IS RECORD # 2 IN FILE 1 AND IN FILE 2, IT IS RECORD # 1
PAUL IS RECORD # 3 IN FILE 1 AND IN FILE 2, IT IS RECORD # 2
Uh-oh, are you SORTing me out? ;-D

I like that example where you build up a single record from two sources. (Similar to Frank Yaeger's example where you list from IRRDBU00 the 0200 and 0220 record types, thus combining the basic RACF details of ids with the TSO Segment of the same id into one output line).

Groete / Greetings
Elardus Engelbrecht

retired mainframer
2018-07-06 14:58:38 UTC
Is there some reason SUPERC is not appropriate for this task?
Farley, Peter x23353
2018-07-07 00:52:31 UTC
Last time I checked SUPERC (even the "advanced" version delivered in the HLASM Toolkit) is pretty useless beyond LRECL=200. It wouldn't even show the differences in bytes beyond 176, IIRC. It seems to me that SUPERC's design point is source-code file comparisons, and even there not so much for really-long-line-length-allowed languages.

The commercial compare products (Comparex, INSYNC, FileMaster, etc.) do a pretty good job even on really large data files, if your employer is already paying for them. Comparex was always my favorite for the ability to compare files with random-order (or at least not fully sorted) keys.

But Sri's DFSORT example is impressive, as usual.

Peter


This message and any attachments are intended only for the use of the addressee and may contain information that is privileged and confidential. If the reader of the message is not the intended recipient or an authorized representative of the intended recipient, you are hereby notified that any dissemination of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail and delete the message and any attachments from your system.

Hobart Spitz
2018-07-08 12:53:19 UTC
How about something like:

/* REXX */
...
"pipe (end ?) ? < file1.data | g: gather | join | pick 1.1500 /== 1501-*",
  "| c: count lines | literal Mismatched record pairs: | cons",
  "? < fileb.data | g:",
  "? c: | var BadCount"

exit BadCount

It works like this:

- < reads a file.
- GATHER reads from the two input streams, alternately, one record at a
time.
- JOIN concatenates each pair of records, one from each file, into a
single record.
- PICK passes mismatches, by comparing the first 1500 bytes to the
second 1500.
- COUNT records the number of records it sees, and, at EOF, writes it to
its secondary output, which is connected to VAR.
- VAR stores its record into a REXX variable.
- LITERAL labels the output.
- CONS writes the output to the terminal; substitute > if you prefer.
- ? starts a new stream.
- G: and C: make connections to secondary streams.

If you want just 0 or 1 in the return code, replace COUNT with FANOUT,
and use EXIT SYMBOL("BadCount") == "LIT". FANOUT will remove itself from
the pipe as soon as the first mismatch is found.

OREXXMan
JCL is the buggy whip of 21st century computing. Stabilize it.
Put Pipelines in the z/OS base. Would you rather process data one
character at a time (Unix/C style), or one record at a time?
IBM has been looking for an HLL for program products; REXX is that language.
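
The stage-by-stage description above maps fairly naturally onto Python generators. The sketch below is a loose analogy rather than a translation of the pipeline: the function names are hypothetical, and treating a missing record (files of unequal length) as a mismatch is an assumption, since the post does not say how the pipeline behaves in that case.

```python
from itertools import zip_longest


def mismatched_pairs(stream_a, stream_b):
    """Pair records in order and pass only the pairs that differ.

    Roughly the GATHER + JOIN + PICK part of the pipeline.
    Assumption: a record with no partner counts as a mismatch.
    """
    for rec_a, rec_b in zip_longest(stream_a, stream_b):
        if rec_a != rec_b:
            yield rec_a, rec_b


def count_mismatches(stream_a, stream_b):
    """Count every mismatching pair (the COUNT variant)."""
    return sum(1 for _ in mismatched_pairs(stream_a, stream_b))


def any_mismatch(stream_a, stream_b):
    """Return 1 at the first mismatch, else 0 (the early-exit variant)."""
    return int(any(True for _ in mismatched_pairs(stream_a, stream_b)))


# Hypothetical sample records.
a = ["rec1", "rec2", "rec3"]
b = ["rec1", "recX", "rec3", "rec4"]
bad_count = count_mismatches(a, b)
flag = any_mismatch(a, b)
print(bad_count, flag)  # 2 1
```

Because generators are lazy, `any_mismatch` stops reading as soon as one differing pair turns up, which is the same motivation given for swapping COUNT for FANOUT.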
