Trailing-Edge - PDP-10 Archives - decuslib10-02 - 43,50227/kwic.doc
PROGRAM DESCRIPTION:

    THIS ROUTINE TAKES TWO FILES: A USER DEFINED STOP LIST
AND A FILE TO BE KEY-WORD-IN-CONTEXT INDEXED.
THE USER SUPPLIES THE LOCATION OF THE INPUT FILES AND
A PLACE TO WRITE THE INDEX FILE AND A TITLE FOR THE LISTING.
THIS ROUTINE READS THE ENTIRE MASTER FILE (DATA TO BE
INDEXED) INTO CORE, SO THE WHOLE FILE MUST FIT
INTO CORE AT ONCE.  THE PROGRAM ALSO MAKES A FREQUENCY FILE
WHICH CONSISTS OF THE NUMBER OF TIMES EACH INDEX TERM
WAS USED.
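
    AS A ROUGH ILLUSTRATION OF THE IDEA (NOT THE PROGRAM'S ACTUAL
CODE OR LISTING LAYOUT), THE PYTHON SKETCH BELOW BUILDS A SMALL
KEY-WORD-IN-CONTEXT INDEX AND A FREQUENCY COUNT.  THE NAME
'kwic_index', THE 'width' PARAMETER AND THE CENTERED-KEYWORD LAYOUT
ARE ASSUMPTIONS MADE FOR THE EXAMPLE.

```python
import re
from collections import Counter

def kwic_index(items, stop_words, width=60):
    """Return (index_lines, frequencies) for a list of (text, ident) items.

    Hypothetical layout: the keyword starts at column width // 2, with
    as much left and right context as fits, followed by the item's I.D.
    """
    stops = set(stop_words)
    half = width // 2
    entries = []
    freq = Counter()
    for text, ident in items:
        # Only spaces and tabs separate words, so split on whitespace runs.
        for match in re.finditer(r"\S+", text):
            word = match.group()
            if word in stops:
                continue  # stop words never become index terms
            freq[word] += 1
            left = text[:match.start()][-half:].rjust(half)
            right = text[match.start():][:half].ljust(half)
            entries.append((word, left + right + "  " + ident))
    # Sort by keyword; Python's sort is stable, so input order breaks ties.
    entries.sort(key=lambda entry: entry[0])
    return [line for _, line in entries], dict(freq)
```

CALLED WITH A LIST OF (TITLE, I.D.) PAIRS AND A STOP LIST, IT RETURNS
THE INDEX LINES SORTED BY KEYWORD AND A DICTIONARY GIVING THE NUMBER
OF TIMES EACH INDEX TERM WAS USED.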

    THIS PROGRAM WAS WRITTEN BY G.B. MOERSDORF
AT THE OHIO STATE UNIVERSITY.  THE SYSTEM WAS DEVELOPED
ON OSU'S NON DISK NON SWAPPING 32K PDP-10.  THE SYSTEM RUNS UNDER
A 4NN72 OR BETTER MONITOR.  THE CODE WAS WRITTEN
TO BE COMPLETELY DEVICE INDEPENDENT.  THE ONLY RESTRICTION ON THE
INPUT DEVICES IS THAT THEY CAN DO IMAGE BINARY MODE (10) INPUT.
THE RESTRICTION ON THE LISTING DEVICE IS THAT IT CAN DO
ASCII LINE MODE (1) OUTPUT.  THE LISTING
WIDTH CAN BE ADJUSTED TO ANY SIZE LINE PRINTER
OR TELETYPE WHICH HAS MORE THAN 60 PRINT POSITIONS.

QUICK INSTRUCTIONS TO RUN KWIC:


   THE BEST WAY TO DESCRIBE THE INDEX IS BY MAKING ONE.  TO USE THE
DEMO DATA SUPPLIED DO THE FOLLOWING:

	(1) MOUNT THE DISTRIBUTION DECTAPE AND EITHER ASSIGN IT AS 'DSK'
		OR 'PIP' THE TAPE TO YOUR DISK AREA.
	(2) TYPE RUN DTA#:KWIC (FOR DISK RU KWIC)
	(3) WHEN ASKED FOR 'MASTER FILE' TYPE CR TO USE DEFAULT
		OR THE FILE NAME SPECIFICATION IN THE
		FORM 'DEV:FILE.EXT'. I.E. DEVICE NAME: FILE NAME.
		EXTENSION (CR).  DEFAULTS: DEV=DSK,
		FILE=KWIC, EXT=MAS.
		DEFAULTS SPECIFY NAME OF TEST DATA SET.
	(4) WHEN ASKED FOR 'STOP FILE' TYPE CR
		OR FILE SPECIFICATION AS ABOVE.
		DEFAULTS: DEV=DSK, FILE=KWIC, EXT=STP.
	(5) WHEN PROMPTED FOR 'INDEX FILE' TYPE CR
		OR FILE SPECIFICATION AS ABOVE.
		DEFAULTS: DEV=DSK, FILE=KWIC, EXT=NDX.
		(THIS WILL WRITE LISTING ON DECTAPE OR DISK IF
		YOU HAVE ONE, UNDER THE NAME 'KWIC.NDX' PPN 0,0).
	(6) WHEN PROMPTED WITH 'FREQUENCY FILE' TYPE
		A CARRIAGE RETURN TO DEFAULT TO 'DSK:KWIC.FRQ'.
		THIS IS WHERE THE PROGRAM WILL WRITE THE
		FREQUENCY FILE.
	(7) WHEN PROMPTED WITH 'LISTING TITLE' TYPE YOUR
		NIFTY COMPANY NAME OR SLOGAN. (MAX 80 CHARACTERS)
	(8) WHEN IT PRINTS 'EXIT' THE INDEX HAS BEEN WRITTEN
		ON THE FILE DESCRIBED IN STEP 5 AND THE
		FREQUENCY LIST ON THE FILE SPECIFIED IN STEP 6.
	(9) PRINT THE INDEX AND FREQUENCY FILES WITH 'PIP'.  AREN'T
		THEY BEAUTIFUL?
	(10) IF IT IS NOT BEAUTIFUL GO TO 'IMPLEMENTATION
		ON YOUR 10'.
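
    THE DEFAULTING RULES IN STEPS 3 THROUGH 6 CAN BE PICTURED WITH
THE SHORT PYTHON SKETCH BELOW.  IT IS ONLY AN ILLUSTRATION: THE NAME
'parse_filespec' IS HYPOTHETICAL, AND THE FILLING IN OF EACH MISSING
PART SEPARATELY (FOR EXAMPLE A FILE NAME TYPED WITH NO DEVICE) IS AN
ASSUMPTION, SINCE THE TEXT ABOVE ONLY DESCRIBES A BARE CARRIAGE RETURN.

```python
def parse_filespec(reply, default_ext, default_dev="DSK", default_file="KWIC"):
    """Split a 'DEV:FILE.EXT' reply into (dev, file, ext), filling defaults.

    ASSUMPTION: any part left out of the reply takes its default value.
    """
    spec = reply.strip()
    dev, ext = default_dev, default_ext
    if ":" in spec:
        head, spec = spec.split(":", 1)
        dev = head or dev
    if "." in spec:
        spec, tail = spec.split(".", 1)
        ext = tail or ext
    file = spec or default_file
    return dev, file, ext
```

FOR EXAMPLE, AN EMPTY REPLY TO THE 'MASTER FILE' PROMPT WITH
default_ext="MAS" GIVES ("DSK", "KWIC", "MAS"), THE NAME OF THE TEST
DATA SET.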

FORMAT OF 'STOP LIST' FILE:

    THE USER CREATES A 'STOP LIST' OF WORDS WHICH THE USER FEELS
HAVE NO USE AS INDEX TERMS FOR HIS PARTICULAR APPLICATION.
ONE SUCH 'STOP LIST' IS SUPPLIED WITH THE PACKAGE; IT IS CALLED
'KWIC.STP'.  THE SUPPLIED LIST IS A GENERALIZED
STOP LIST WHICH CONTAINS 'LOW VALUE' KEYWORDS
SUCH AS A, AN, IN, THE.  THIS FILE MUST BE IN ALPHABETICAL ORDER.
THE FILE MAY HAVE STANDARD D.E.C. SEQUENCE NUMBERS.
EACH WORD TO BE STOPPED MUST BE DELIMITED BY A CARRIAGE RETURN
LINE FEED.  SPACES AND TABS ARE IGNORED.
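
    THE RULES ABOVE CAN BE SKETCHED IN PYTHON AS FOLLOWS.  THIS IS
NOT THE PROGRAM'S OWN CODE.  THE NAME 'load_stop_list' IS HYPOTHETICAL,
A SEQUENCE NUMBER IS ASSUMED TO APPEAR AS FIVE DIGITS FOLLOWED BY A
TAB, AND RAISING AN ERROR FOR AN UNORDERED LIST IS A CHOICE MADE FOR
THE EXAMPLE (THE TEXT DOES NOT SAY WHAT KWIC ITSELF DOES).

```python
import re

def load_stop_list(text):
    """Parse stop-list text into a list of words and check its order."""
    words = []
    for line in text.split("\r\n"):
        # ASSUMPTION: a sequence number shows up as five digits and a tab.
        line = re.sub(r"^\d{5}\t", "", line)
        # Spaces and tabs are ignored, so remove them all.
        word = re.sub(r"[ \t]", "", line)
        if word:
            words.append(word)
    # The file must be in alphabetical order; reject it otherwise.
    for earlier, later in zip(words, words[1:]):
        if earlier > later:
            raise ValueError(f"stop list out of order: {earlier} before {later}")
    return words
```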

FORMAT OF 'MASTER' FILE:

    THE MASTER FILE CONSISTS OF THE DATA TO BE INDEXED BY THE
KWIC PROGRAM.  THIS MAY BE ANY TYPE OF ALPHANUMERICAL DATA.
THE USUAL DATA WOULD BE IN THE FORM OF MANY BOOK TITLES IN
A SPECIFIC AREA OF STUDY, OR POSSIBLY A WHOLE LIBRARY'S CATALOGUE.
BUT THE PROGRAM IS FLEXIBLE ENOUGH TO ALLOW KWIC INDEXING OF
A THESIS PAPER OR SIMILAR DOCUMENT (FOR WHAT IT'S WORTH).
THE DELIMITERS FOR EACH FIELD OF DATA (ALL 3 DELIMITERS)
ARE DECIDED UPON BY THE USER AT ASSEMBLE TIME.
THIS FILE SHOULD HAVE SEQUENCE NUMBERS AS THEY ARE USED IN THE
IDENTIFICATION OF SYNTAX ERROR LOCATIONS.
AFTER DEBUGGING THE DATA YOU MAY REMOVE
THE SEQUENCE NUMBERS TO SAVE DISK SPACE AND THE PROGRAM
WILL OPERATE NORMALLY.
    THE GENERAL FORMAT OF AN ITEM IN THE MASTER FILE IS AS FOLLOWS:

	1) STANDARD D.E.C. SEQUENCE NUMBER
	2) FIELD OF DATA TO BE INDEXED (MAY BE CONTINUED
	   ON ANY NUMBER OF LINES. I.E. A CARRIAGE
	   RETURN LINE FEED IS IGNORED COMPLETELY)
	3) THE DELIMITER FOR SORT FIELD ('=' ON THE DISTRIBUTED
	   VERSION).
	4) NEXT ANY DATA TO BE TOTALLY IGNORED BY THE SYSTEM
	   SUCH AS COPYRIGHT DATE AND PUBLISHER. THIS WAS
	   DONE SO THAT THE SAME DATA BASE CAN BE USED FOR
	   THIS PROGRAM AS FOR OTHERS. I.E. ONE THAT
	   USES DATA NOT NORMALLY KWIC INDEXED.
	5) THE I.D. DELIMITER CHARACTER (IN DISTRIBUTION IT IS A '[')
	6) THE IDENTIFICATION NUMBER TO BE ASSOCIATED WITH
	   THE ITEM.  THE MAXIMUM LENGTH OF THIS FIELD IS ALSO
	   ADJUSTABLE AT ASSEMBLE TIME.  IN THE DISTRIBUTION IT
	   IS 10 DIGITS. WARNING! USE NO SPACES OR TABS IN THIS FIELD.
	7) THE END OF ITEM DELIMITER.  IN THE DISTRIBUTION IT IS A ']'
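
    TO MAKE THE ITEM LAYOUT CONCRETE, HERE IS A MINIMAL PYTHON SKETCH
THAT SPLITS ONE ITEM (WITH ITS SEQUENCE NUMBER AND LINE BREAKS ALREADY
REMOVED) USING THE DISTRIBUTED DELIMITERS.  THE NAME 'parse_item' IS
HYPOTHETICAL, THE SAMPLE ITEM IN THE COMMENT IS MADE UP, AND THE
PAIRING OF EACH CHECK WITH AN ERROR TEXT IS A GUESS, NOT TAKEN FROM
THE PROGRAM SOURCE.

```python
def parse_item(item, delsrt="=", delkey="[", deleol="]", idlen=10):
    """Split one master-file item into (sort_field, ident).

    Example (made-up data):
        INTRODUCTION TO INDEXING (SMITH)=1970 SOME PUBLISHER[0000012345]
    """
    if not item.endswith(deleol):
        raise ValueError("SYNTAX ERROR")
    body = item[:-len(deleol)]
    if delsrt not in body:
        raise ValueError("NO SORT DELIM FOUND")
    # Everything before the first sort delimiter is the indexed field.
    sort_field, rest = body.split(delsrt, 1)
    if delkey not in rest:
        raise ValueError("NO I.D. NUMBER FOUND")
    # Text between the sort delimiter and the I.D. delimiter is ignored.
    _ignored, ident = rest.rsplit(delkey, 1)
    if not ident:
        raise ValueError("NO I.D. NUMBER FOUND")
    if len(ident) > idlen:
        raise ValueError("I.D. NUMBER TOO LONG")
    if " " in ident or "\t" in ident:
        raise ValueError("SYNTAX ERROR")  # no spaces or tabs in the I.D.
    return sort_field, ident
```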


NOTE 1:	SINCE A CARRIAGE RETURN LINE FEED IS IGNORED,
	TO CONTINUE A WORD ON ANOTHER LINE THE USER MERELY TYPES
	THE REST OF THE WORD WITH NO SPACES. BUT IF HE WISHES TO
	DELIMIT THE WORD WITH A SPACE HE MUST TYPE IT EITHER AT
	THE END OF THE LINE OR THE BEGINNING OF THE CONTINUATION
	LINE. EX:

		THE OHIO STATE UNIVE
		RSITY

		(CONTINUATION OF SAME WORD)


		THE OHIO STATE
		 UNIVERSITY

		(TWO SEPARATE WORDS)

NOTE 2:	A SPACE AND A TAB ARE THE ONLY CHARACTERS WHICH DELIMIT
	A WORD FROM ITS NEIGHBOR.  SEQUENTIAL SPACES OR
	TABS ARE REDUCED TO ONE SPACE ON THE LISTING.
NOTE 3: TWO CONVENTIONS HAVE BEEN USED IN THE TEST DATA
	WHICH YOU MAY WANT TO USE.  THE FIRST IS TO PLACE ALL
	THE AUTHORS' NAMES IN PARENS.  THIS WILL MAKE ALL THE
	AUTHORS' NAMES APPEAR IN ONE SPOT IN THE
	INDEX.  THE SECOND CONVENTION IS TO USE A '/' IN FRONT OF
	ANY WORD WHICH IS NOT IN THE TITLE, BUT YOU FEEL
	HAS VALUE AS AN INDEX TERM FOR THIS ITEM.
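
    NOTES 1 AND 2 CAN BE ILLUSTRATED WITH THE SMALL PYTHON SKETCH
BELOW.  THE NAMES 'join_lines' AND 'words_of' ARE HYPOTHETICAL, AND
SEQUENCE NUMBERS ARE ASSUMED TO HAVE BEEN REMOVED ALREADY.

```python
import re

def join_lines(raw):
    """Drop CR/LF pairs, then reduce runs of spaces and tabs to one space."""
    joined = raw.replace("\r\n", "")       # a line break is ignored completely
    return re.sub(r"[ \t]+", " ", joined)  # sequential spaces/tabs become one

def words_of(raw):
    """Return the words of a field; only space and tab delimit words."""
    return [word for word in join_lines(raw).split(" ") if word]
```

WITH THIS SKETCH "THE OHIO STATE UNIVE", A LINE BREAK AND "RSITY" YIELD
FOUR WORDS ENDING IN UNIVERSITY, WHILE A LINE BREAK WITH NO SPACE ON
EITHER SIDE RUNS TWO WORDS TOGETHER.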

IMPLEMENTATION ON YOUR 10:

    THERE ARE MANY ASSEMBLY PARAMETERS
AND THE ONES WHICH DIRECTLY AFFECT YOUR INSTALLATION ARE
AS FOLLOWS:


SWITCH OR	DEFAULT		DESCRIPTION OR
VARIABLE	VALUE		ACTION TAKEN
____________________________________________________________

LPTSIZ		132		THE WIDTH OR DESIRED
				WIDTH OF INDEX LINE. YOU MAY
				WANT TO RESTRICT THIS FOR DUPLICATION
				PURPOSES.  IT MAY BE ANY EVEN NUMBER FROM
				60 TO THE WIDTH OF THE OUTPUT DEVICE
				LINE.

DELSRT		"="		DELIMITER FOR THE SORTED DATA FIELD.
				(THE FIRST FIELD DELIMITER)  THIS MAY
				BE ANY CHARACTER GREATER THAN A SPACE (40)
				PUT THE CHARACTER IN DOUBLE QUOTES.

DELKEY		"["		DELIMITER FOR IDENTIFICATION
				FIELD.  SAME RESTRICTIONS AS FOR
				DELSRT.  DON'T THINK YOU'RE SMART
				AND USE THE SAME CHARACTERS FOR ALL
				OR SOME DELIMITERS.

DELEOL		"]"		DELIMITER FOR THE END OF THE ITEM
				(FOLLOWS THE IDENTIFICATION FIELD)
				SAME RESTRICTIONS AS FOR DELKEY

MAXLIN		^D50		NUMBER OF LINES PUT ON A PAGE (NOT
				INCLUDING THE HEADER)

SIZWRD		^D50		MAXIMUM NUMBER OF CHARACTERS
				IN ANY ONE WORD. I.E. BEFORE
				A SPACE OR TAB.  (THIS ALLOWS ALL
				THOSE ALL TIME FAVORITES LIKE
				ANTIDISESTABLISHMENTARIANISM)

MAXSAM		^D300		MAXIMUM NUMBER OF WORDS WHICH ARE NOT
				STOP WORDS AND ARE IDENTICAL.
				THIS IS THE SIZE OF THE
				HASH TABLE.

DEBUG		0		IF 1 WILL MAKE A NON REENTRANT
				DEBUGGING VERSION.  (USED ONLY WHEN
				FIXING PROGRAM)

REENT		1		IF 1 GIVES REENTRANT CODE
				AND IF 0 GIVES NON REENTRANT CODE.

IDLEN		^D10		MAXIMUM SIZE OF THE I.D. FIELD

FREQSW		1		IF 1 ASSEMBLES THE
				FREQUENCY LIST CODE.  IF 0, NO
				FREQUENCY LIST IS GENERATED.
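
    THE RESTRICTIONS STATED IN THE TABLE CAN BE CHECKED MECHANICALLY.
THE PYTHON SKETCH BELOW RECORDS THE DISTRIBUTED DEFAULTS AND TESTS THE
LPTSIZ AND DELIMITER RULES.  IT IS AN ILLUSTRATION ONLY: THE CLASS
'KwicParams', THE METHOD 'problems' AND THE 'device_width' ARGUMENT
ARE HYPOTHETICAL, AND THE REAL PARAMETERS ARE SET IN THE ASSEMBLY
SOURCE, NOT IN PYTHON.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KwicParams:
    """Distributed default values (^D50 means decimal 50)."""
    lptsiz: int = 132
    delsrt: str = "="
    delkey: str = "["
    deleol: str = "]"
    maxlin: int = 50
    sizwrd: int = 50
    maxsam: int = 300
    debug: int = 0
    reent: int = 1
    idlen: int = 10
    freqsw: int = 1

    def problems(self, device_width=132):
        """Return a list of rule violations (empty when all is well)."""
        found = []
        if self.lptsiz % 2 or not 60 <= self.lptsiz <= device_width:
            found.append("LPTSIZ MUST BE EVEN, FROM 60 TO THE DEVICE WIDTH")
        delims = (self.delsrt, self.delkey, self.deleol)
        # Each delimiter is one character greater than a space (octal 40).
        if any(len(d) != 1 or ord(d) <= 0o40 for d in delims):
            found.append("EACH DELIMITER MUST BE ONE CHARACTER ABOVE SPACE")
        if len(set(delims)) != 3:
            found.append("ALL THREE DELIMITERS MUST DIFFER")
        return found
```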

NOTES AND RANDOM INFO:

1) DO NOT (NOT!) USE A STRING OF 5 OR MORE "_" CHARACTERS
   IN SEQUENCE ON ANY DATA FILE.
2) THE USER CANNOT SPECIFY A PPN IN A FILE SPECIFICATION.
3) DO NOT END THE STOP LIST WITHOUT A
   CARRIAGE RETURN LINE FEED. EX:

		NOT THIS^Z
		^
		WRONG WAY

		BUT THIS
		^Z
		^
		CORRECT WAY

4) IF NO SEQUENCE NUMBERS ARE ON THE MASTER FILE THE
   ERROR MESSAGES WILL NOT LOCATE THE LINES IN ERROR ON THE FILE
   BUT MERELY PRINT THE FACT THAT THEY EXIST.
5) THE SYSTEM RUNS UNDER 4NN72 OR BETTER MONITORS (THERE SHOULD BE
   NO MONITOR RESTRICTIONS IF IT RUNS UNDER OUR MONITOR)
6) ON OUR SYSTEM USING ALL OF USER CORE (23K) WE CAN HOLD AND
   KWIC INDEX A LIBRARY CATALOGUE OF 4000 ITEMS.
7) USING THE SAME DATA (A SMALL AMOUNT) THIS PROGRAM HAS
   RUN FASTER ON THE 10 THAN ON OUR 370/165.
8) IF A WORD IN THE STOP LIST IS LONGER THAN 12 CHARACTERS
   IT WILL BE TRUNCATED IN THE LISTING BUT ITS VALUE WILL
   BE UNCHANGED.

ERROR MESSAGES:

THE FOLLOWING IS A LIST OF ERROR MESSAGES AND THEIR MEANING.

1)

CANNOT INIT XXXXX DEVICE

	DEVICE SPECIFIED IN AN INPUT PARM OR A DEFAULT SPECIFICATION
	WAS NOT CORRECT OR AVAILABLE TO THE USER.


2)

CANNOT FIND XXXXX FILE

	THE FILE SPECIFIED (TYPE 'XXXXX') COULD NOT BE FOUND.


3)

CANNOT ENTER XXXXX FILE

	THE DIRECTORY ON THE DEVICE SPECIFIED TO
WRITE THE 'XXXXX' LISTING ON WAS FULL.


4)

?READ ERROR ON 'XXXXX' FILE

	A DEVICE ERROR OCCURRED ON THE 'XXXXX' FILE WHILE READING.


5)

?WRITE ERROR ON 'XXXXX' FILE

	A DEVICE ERROR OCCURRED ON THE 'XXXXX' FILE WHILE WRITING.


6)

?MASTER FILE NO LONGER AVAILABLE

	THE PROGRAM RELEASES THE MASTER FILE FOR A SHORT PERIOD
WHILE IT READS IN THE STOP LIST FILE.  THIS IS SO THAT ON A
DECTAPE SYSTEM THESE TWO FILES MAY BE ON THE SAME DRIVE.  THIS ERROR
OCCURS WHEN IT LOOKS FOR THE FILE THE SECOND TIME (AFTER THE
STOP LIST IS READ IN) AND CANNOT FIND IT.  THIS SHOULD NEVER
HAPPEN; IF IT DOES THE JOB BOMBS OFF.

7)

?FATAL UUO FAILURE -BADFAL-

	A CORE UUO FAILED WHILE DE-ALLOCATING CORE.
THIS SHOULD BE AN IMPOSSIBLE CONDITION.
THE JOB BOMBS OFF.

8)

?MAXIMUM SIZE WORD EXCEEDED
WORD=CCCCCCCCCC

	A WORD LONGER THAN THE LENGTH SPECIFIED BY THE
'SIZWRD' ASSEMBLY CONSTANT WAS FOUND. JOB
BOMBS OFF.  THE 'CCCCCCCCCC' WILL BE THE WORD IN
ERROR.

9)

?TOO MANY MATCHES FOR ARRAY
WORD=CCCCCCCCC

	MORE THAN THE NUMBER OF IDENTICAL INDEX ITEMS SPECIFIED
BY THE 'MAXSAM' ASSEMBLY CONSTANT WERE FOUND.
JOB BOMBS OFF.  THE 'CCCCCCCCC' WILL BE THE WORD WHICH
OCCURRED TOO MANY TIMES.


10)

?CORE UUO FAILED--TRYING AGAIN

	IF THE CORE UUO FAILS WHILE TRYING TO READ IN DATA
THIS MESSAGE IS PRINTED.  30 SECONDS LATER THE PROGRAM
WILL TRY TO ALLOCATE THE CORE AGAIN.  IT CONTINUES
LOOPING TILL IT GETS THE CORE.  THIS IS USEFUL ON
NON SWAPPING SYSTEMS WHERE A USER CAN WAIT FOR THE CORE TO
BECOME FREE.


11)

?ERROR IN LINE NNNNN---

	LINE NNNNN OR ONE OF THE NEARBY LINES IS BAD.
	THE SPECIFIC ERROR FOLLOWS.   IF ANY OF THESE
	ERRORS (THE ONES IN SECTION 11) OCCUR THE KWIC INDEX
	AND FREQUENCY LIST ARE NOT GENERATED, ONLY THE STOP LIST.

		---I.D. NUMBER TOO LONG

			MEANS SIZE OF IDENTIFICATION
			NUMBER GREATER THAN 'IDLEN' ASSEMBLY
			CONSTANT.

		---NO I.D. NUMBER FOUND

			MEANS JUST WHAT IT SAYS.

		---NO SORT DELIM FOUND

			MEANS JUST WHAT IT SAYS.

		---SYNTAX ERROR

			UNDIAGNOSABLE ERROR. (YOUR GUESS)
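
    THE WAIT-AND-RETRY BEHAVIOR DESCRIBED UNDER MESSAGE 10 ABOVE CAN
BE PICTURED WITH THE PYTHON SKETCH BELOW.  IT IS AN ILLUSTRATION, NOT
THE PROGRAM'S CODE: 'allocate_with_retry' AND ITS ARGUMENTS ARE
HYPOTHETICAL, AND 'request_core' STANDS IN FOR THE CORE UUO, RETURNING
TRUE WHEN THE CORE WAS GRANTED.

```python
import time

def allocate_with_retry(request_core, wait_seconds=30,
                        sleep=time.sleep, report=print):
    """Call request_core() until it succeeds; return the failure count.

    ASSUMPTION: request_core is a stand-in for the CORE UUO and returns
    True on success. sleep and report are injectable for testing.
    """
    failures = 0
    while not request_core():
        failures += 1
        report("?CORE UUO FAILED--TRYING AGAIN")
        sleep(wait_seconds)  # wait, then loop and ask again
    return failures
```

THE LOOP HAS NO UPPER LIMIT ON TRIES, WHICH MATCHES THE DESCRIPTION
THAT THE PROGRAM CONTINUES LOOPING TILL IT GETS THE CORE.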

HAVING PROBLEMS:

  IF YOU FIND ANY BUGS OR HAVE ANY
SUGGESTIONS PLEASE USE THE ADDRESS BELOW.


		G.B. MOERSDORF
		PDP-10 ROOM
		CALDWELL LAB.
		OHIO STATE UNIVERSITY
		COLUMBUS, OHIO 43210
		614-422-8039