03/21/16 Monday 4:26pm EST adore2 locked up, rebooted adore
01/29/16 Friday 6:51pm EST locked up hard, had to power cycle
12/09/15 Wednesday 04:20am EST Locked up, had to reboot
02/13/12 Monday 10:39pm EST sd6 died, replaced. Somehow I nuked the FATS on the other drives, had to replace with mirrors.
11/03/11 Thursday 02:33am EST load high, locked up, rebooted
07/31/11 Sunday 7:09pm EST load high or locked up probably from incoming, didn't check, rebooted.
06/26/11 Sunday 02:23am EST load high due to hits from 18.104.22.168 fw.block rp1
05/30/11 Monday 06:44am EST www Locked up, out of mbufs.
05/25/11 Wednesday 09:57am EST backbone locked up, cl -> clrp telebooted.
03/31/11 Thursday 5:35pm EST Replaced root drive
12/08/10 Wednesday 10:54am EST Clarity backbone locked up.
06/16/10 Wednesday 11:26pm EST Power outage, offline for about 30 minutes
04/26/10 Monday 4:14pm EST Upstream is down
04/23/10 Monday 4:14pm EST Core router taken off line by accident. Down for about 15 minutes.
04/19/10 Monday 1:57pm EST Crashed from bad swap drive sd13e
03/26/10 Friday 1:17pm EST Locked up on /w10 freezing I think
11/20/09 Friday 10:52pm EST Upstream down for 6 hours from 3pm to 9pm
10/26/09 Monday 12:44am EST Upstream down for 45 minutes or so.
10/07/09 Wednesday 6:25pm EST Web server taken down to clear up confusion on disk mounting.
10/05/09 Monday 03:12am EST Adore2 is down and will be down for a while, we will try to get it back up again as soon as possible.
09/21/09 Monday 07:18am EST Network outage from 2:30am until 7am due to bad ethernet card in core router. Bypassed eth2 as eth1, and will replace.
07/31/09 Friday 12:03pm EST Power outages in Ithaca this morning interrupted service a number of times from 10 to 12pm, in particular South Hill Business campus and those connected to West Village.
07/16/09 Thursday 9:02pm EST Upstream is down.
05/06/09 Wednesday 10:49am EST Core router had to be rebooted, did not go smoothly. All is well now.
07/04/08 Friday 4:26pm EST System renumbering continues. Light is now www.lightlink.com. Two main DNS servers are still light and majesty but are at 22.214.171.124 and 3.
02/18/08 Monday 11:29pm EST Pleasant Grove is presently offline, Verizon is working on the T1.
02/18/08 Monday 1:14pm EST bond0 widened, pulling 19 megs
01/27/08 Sunday 06:52am EST locked up, rebooted.
10/14/07 Sunday 05:31am EST light locked up, first time in forever. fan dead, perhaps that's why.
10/03/07 Wednesday 07:06am EST UPS on fvr died, taking CMC, watkins glen, WV and Axiohm off line. Replaced with a new one, a big one this time. The old one lasted for 10 years. RIP
08/28/07 Tuesday 2:55pm EST Mail pop server locked up this morning about 10am. Rebooted, all is well.
08/08/07 Thursday 7:06pm EST Mail servers under tremendous load from bot storm, no available sockets for real e-mail to come in. Google for 'bot storm'. Mail delayed but not lost.
06/08/07 Friday 3:47pm EST Wvbr link is down
06/02/07 Saturday 3:26pm EST T3 down again. Ticket 0602128. Back 17:23:03
05/31/07 Thursday 08:33am EST T3 down. Calling Paetec. Ticket 53140. Outage in NY area. 05/31/07 Thursday 09:42am EST
05/24/07 Thursday 7:36pm EST Numerous outages this morning on Pleasant Grove T1, kionix and pfinc links were caused by faulty router installed last night during DDOS attack.
05/23/07 Thursday 11:50pm EST T3 backbone went off line, called into PAETEC. Turns out it was a DDOS attack from one of our own colo machines.
03/10/07 Saturday 5:50pm EST Power outage at 4pm took all RP customers off line.
03/10/07 Saturday 07:20am EST dialup down from midnight on, for second time. This time bug fixed.
01/11/07 Thursday 9:48pm EST Memory error on adore, crashed. Reseated all memory, rebooted. Next time will start to replace chips that report bad.
01/08/07 Monday 2:12pm EST Memory error on J0303 on adore. Will replace.
01/05/07 Friday 11:43pm EST Major power outage on Aurora Street took all of Roy Parks, Downtown Ithaca and Collegetown off line.
01/04/07 Thursday 01:12am EST Upstream off line at 1:00am for scheduled maintenance.
12/18/06 Monday 12:05pm EST Adore locked up.
12/08/06 Friday 3:30pm EST All of lightlink off line over night for about an hour due to upstream problems.
12/05/06 Tuesday 2:15pm EST adore locked up
12/04/06 Monday 6:22pm EST Light, the legacy web server, was crashed out for about an hour. This has been fixed with netprotect.
11/21/06 Tuesday 12:00am EST Conn Hill radio tower has been down for most of today, we tried to fix it but failed. Will try again tomorrow.
10/26/06 Thursday 1:56pm EST adore crashed from asynch fault errors. rebooted
09/24/06 Sunday 05:13am EST le0 went down. ifconfig le0 down, ifconfig le0 up brought it back. That's a new one.
07/31/06 Monday 04:58am EST light crashed. async memory errors. rebooted, j0203 removed.
07/24/06 Monday 11:55am EST T3 locked up. Rebooted router, came back. That's a first.
07/04/06 Tuesday 03:01am EST light locked up, rebooted.
06/22/06 Thursday 06:09am EST Outgoing smtp server locked up. Rebooted.
06/13/06 Tuesday 01:09am EST Adore locked up
05/17/06 Wednesday 6:57pm EST Had to reboot light to gain control over spammers using formmails to send spam.
04/25/06 Tuesday 11:55pm EST Multiple short term lockups, hard to diagnose. Swap drive failing, I believe, from errors I saw during a reboot. Removed the drive.
04/11/06 Tuesday 3:58pm EST Short outage this morning due to an old ethernet termination turned bad connecting DNS servers to world.
04/07/06 Friday 1:10pm EST Upstream service died for about an hour at 12 noon. All customers off line. USLEC says it was hardware failure at their end.
03/04/06 Saturday 7:48pm EST Scheduled downs at 6pm and 7pm to upgrade routers at both ends of Fairview to Roy Parks link.
02/26/06 Sunday 11:14am EST vpn router locked up from disk errors, took all radio vpn customers off line.
02/25/06 Saturday 3:26pm EST web server light crashed, taking out web and DNS. tried to format a dying drive, didn't like it I guess.
02/13/06 Monday 01:55am EST eth0 locked up on light. ./net restarted it
01/24/06 Tuesday 8:50pm EST fvr router locked up, didn't reboot smoothly. Watkins glen, hospital, therm, and most radio users connected to chair, thair and vhair were off line for the count. Took router off line, cleaned up configuration, rebooted smoothly.
01/18/06 Wednesday 5:50pm EST Major outage caused when our upstream US LEC lost 4 IP ranges that feed our ISP. Those not on those ranges were not affected. All is well again.
11/09/05 Wednesday 12:28pm EST Last night core router at Roy Parks gave up the ghost, took a bit to get it back on line. All Roy Parks customers were down including hotspots fed by Roy Parks. This morning the mail server gave up the ghost, lost /dev/hda, replaced.
10/28/05 Friday 12:04am EST Adore locking up slowly, rebooted.
09/29/05 Thursday 11:14am EST T3 down for unknown reasons. Rebooted core router and everything came back.
09/22/05 Thursday 3:06pm EST Barracuda not working right, delaying mail, taken off line. Lots of spam will get through.
09/03/05 Saturday 3:59pm EST Light load high, many 'D's in ps aus on syslog and httpd. All drives check out fine. Rebooted
06/18/05 Saturday 2:04pm EST Various things off line for 5 minutes while replacing burnt UPS batteries.
06/14/05 Wednesday 10:43pm EST Blown transformers down the street caused Nyseg to pull all power from Lightlink at about 1pm until 3:30pm. Working on getting a generator, really we are... :)
06/13/05 Monday 07:52am EST adore crashed, rebooted
06/11/05 Saturday 2:19pm EST Changed UPS on T3 router, everyone down for 2 minutes.
06/11/05 Saturday 1:25pm EST Lightning storms damaged the watkins glen link over the past few days. It is mostly restored.
Conn Hill radio customers are still down, should be up by today.
05/17/05 Tuesday 1:19pm EST Yesterday's network outage was caused by a cable cut at Fairview. Hospital, Therm and Watkins Glen were taken off line as was everyone connected to them.
05/16/05 Monday 12:14pm EST adore crashed, rebooted
04/20/05 Wednesday 2:13pm EST web server taken off line accidentally by setting root perms to rw-r--r--
04/18/05 Monday 12:50am EST light crashed from data exception, possibly from trying to format the bad drive sd12.
04/17/05 Sunday 10:24pm EST web pages moved from sd12 to sd6
04/13/05 Wednesday 11:37pm EST Primary web pages on light /dev/sd6 moved to /dev/sd12. Original drive failing. /dev/sd6 replaced by /dev/sd12. Some NFS links to adore did not come back properly, preventing uploads of new web page material. Rebooted adore this morning. 4/14/2005
03/28/05 Monday 11:20am EST pm1 rebooted, bouncing some dialup customers.
03/27/05 Sunday 11:18am EST vpn0 rebooted to clear errors on hda. All radio vpn connections reset. fw3 rebooted to add hdc. Babcock, casaroma, fairview and all DHCP out to Watkins Glen offline for a short while. bond0 rebooted to add hdc. All of Roy Parks off line for a short while. wvbrair misconfigured at 15:08, taking various customers off line until about 9pm.
03/26/05 Saturday 3:14pm EST vpn0 taken down to replace hard drive and ide cable. radio vpn connections down for 10 minutes or so. radius dialup authentication also down for same period.
03/12/05 Saturday 7:13pm EST Roy Parks line taken off line to upgrade. Did not go smoothly, multiple problems. It's working presently (in general :) )
03/10/05 Thursday 11:21am EST Light, main web server, off line for 20 minutes due to homer's stupidity.
03/05/05 Saturday 09:56am EST Dialup server died on malfunctioning radius daemon at midnight. Restarted daemon at 8:52am.
02/21/05 Monday 11:25am EST fvradio/cmc1 upgraded to 3.88 on Saturday. wg0 upgraded to 3.88 on Saturday.
rpradio upgraded to 3.88 on Sunday.
02/02/05 Wednesday 2:09pm EST Incoming mail server crashed, lost /dev/sdc1 /var/spool/mqueue
01/23/05 Sunday 3:58pm EST Amp taken off of rpair.
01/19/05 Wednesday 10:37pm EST Mail delays today and yesterday from multiple problems with barracuda and our own mail server. Mail server rebooted tonight, and barracuda set to 10 incoming connections max.
12/09/04 Thursday 05:55am EST Mail down from failed root partition. Replaced.
11/24/04 Wednesday 11:36pm EST Major power outage took all of lightlink off line this afternoon for about 2 hours. After power came back, the Fairview to Roy Parks backbone did not come back properly due to unexpected problems with high speed modems forming the bonded line.
09/13/04 Monday 10:47am EST Watkins link at Conn Hill went down again at 11pm last night. Blown amp replaced this morning.
09/12/04 Sunday 09:43am EST Watkins Glen was down from Friday morning to Saturday afternoon. First problem was corruption of a radio cable that destroyed signal quality, and once that was fixed, a problem with usage of channels 1, 2 and 3, something we still don't have a cause for. System at Conn Hill was completely rebuilt with LMR 600 wire, new amps, new antenna and new radio. But the channel problem continued. Presently running on 4.
09/04/04 Saturday 2:34pm EST Adore lost its root drive sd0a. Rebuilt.
08/21/04 Saturday 07:41am EST fvair locked up. Rebooted.
07/24/04 Saturday 2:41pm EST hssi card replaced, T3 network down for 1 minute.
07/05/04 Monday 04:00am EST cmc radio locked up at cmc end. telebooted.
06/17/04 Thursday 8:43pm EST HSSI card went off line again, replaced.
06/17/04 Thursday 12:44pm EST Adore crashed, swtch
06/16/04 Wednesday 10:40pm HSSI card went down again, rebooted. Should be replaced or put in new slot
06/13/04 Sunday 8:01pm EST T3 card HSSI 5/0 went down, all lightlink off line. Rebooted, did not come back. Reseated, rebooted, came back.
05/21/04 Friday 8:39pm EST forward and reverse DNS brought into sync, all IP's have reverse DNS now but may have lost some specific names that had been set by hand.
05/15/04 Saturday 7:17pm EST Home directories separated between light and adore. This should not cause any problems, elob and moeorg were copied from adore over to light.
05/13/04 Thursday 4:39pm EST Again lost the primary drive on mail, /var/spool/mail and /var/spool/spam. This is annoying, don't trust drives any more.
04/28/04 Wednesday 3:37pm EST Problems receiving e-mail from nyroc.rr.com mail servers proved to be faulty code in new spam trap code. This is fixed and queued e-mail should be delivered shortly.
04/24/04 Saturday 2:55pm EST Modems are down due to failures in our upstream's upstream, Level 3. No ETA
04/18/04 Sunday 6:39pm EST News down, fan locked up, oiled and rebooted
04/18/04 Sunday 5:33pm EST Incoming mail server pop.lightlink.com taken down to oil power supply fan. Mail server power supply failed totally a short while later, had to replace it.
02/27/04 Friday 11:41am EST Watkins Glen link down.
02/26/04 Friday 11:40am EST Watkins Glen link went down again. Nick and John G went to CH and replaced everything, nothing worked, put back in the original and it started working again. Suspect bad wire/amp/antenna.
02/23/04 Monday 1:23pm EST Watkins Glen link is down, CH <-> TH link finally went into total failure. Replaced radio card at CH. Link came back.
02/21/04 Saturday 5:53pm EST Choice One modem lines are giving busy signals. Portmaster 0 rebooted, seemed to clear the problem.
01/30/04 Friday 12:30pm EST modem banks at 277 1228 were upgraded, total non functionality last night. Working fine now...
01/18/04 Sunday 3:47pm EST primary mail spool drive lost, replaced with new drive. backups from last night are damaged, so taking from Saturday night. users sent logs of mail received so they can ask for important stuff to be resent. Need to reconsider raid.
01/17/04 Saturday 12:25pm EST outgoing e-mail and webmail interface down while changing UPS that went out.
01/10/04 Saturday 1:28pm EST Net is damaged out in the middle of the US somewhere, google, netscape, yahoo, have only intermittent access.
01/07/04 Wednesday 4:23pm EST wvbr1 locked up, telebooted.
12/31/03 Wednesday 12:14pm EST cmc radio link locked up, rebooted fvradio.
12/24/03 Wednesday 11:50am EST rp3 link feeding DSL modems at Roy Parks locked up. Router tweaked, and rebooted. If it does it again, we will replace it.
12/13/03 Saturday 6:00pm EST fvradio -> CMC1 and all related backbones are down while we fix the cabling at fairview.
12/06/03 Saturday 1:58pm EST SSID on fvair changed to fvair from lightlink. SSID on wvbrair changed to wvbrair from lightlink. Users who have not registered their MAC address with us may be blocked from using these two radios.
11/28/03 Friday 8:10pm EST Adore crashed, panic on cpu 0 swtch
11/25/03 Tuesday 10:48am EST Light locked up at around 3 am causing failures on web and dialup. Beeper did not go off, apparently Arch Wireless is down also. Serious.
11/23/03 Sunday 3:19pm EST Major outage at a level3.net router in New York City is making connectivity between RR customers and lightlink flaky at best. Been bad since Friday night. No ETA
11/10/03 Monday 10:23am EST POP mail server crashed this morning. VFS No free inodes, ask Linus
09/16/03 Tuesday 10:03am EST Adore locked up for unknown reasons. Rebooted
09/04/03 Thursday 2:18pm EST rp2 router feeding paradyne and pairgain DSL out of Roy Parks was rebooted a few times before we found a routing error causing network confusion. We also want to get rid of the private IP's being given out to people because infected machines fill up the masquerading tables, making it impossible for anyone on a private IP to get out. I goofed that change up but good, so things are back where they were for the moment.
08/28/03 Thursday 10:52pm EST Upstream down for about an hour.
08/23/03 Saturday 2:48pm EST mail server taken down at 2pm per scheduled down to replace failing hard drive. Drive was not replaced, but root partition was moved to a different place on the drive. Went well.
08/15/03 Friday 11:41am EST Radius authentication failed for dialup modems this morning because the authentication server (light) ran out of swap space because the web server was not being hit upon, probably due to outages across the state. We were also not using our full battery of swap space, have added 2gig, so we should be fine :)
07/17/03 Thursday 1:07pm EST adore crashed on panic on cpu 0: swtch
07/14/03 Monday 10:47am EST adore crashed from panic on cpu 0: swtch
06/17/03 Tuesday 10:18am EST Mail server 'mx' locked up for unknown reasons. Rebooted. This is a rare event, hope this is not a premonition of things to come.
05/30/03 Friday 3:24pm EST Apparently gem was failing to do web logs properly starting from May 19th, which may or may not have had anything to do with its failure on the 24th.
05/24/03 Saturday 11:27pm EST Gem lost its primary hard drive tonight while replacing a CDROM drive. Turned machine off, put in DVD drive, turned it on, and drive was totally dead. Web e-mail and USER AREA were off line. Some web hits will be permanently lost for Saturday.
05/06/03 Tuesday 2:20pm EST Mail crashed from errors on /dev/sdc. /dev/sdd wouldn't boot when rebooted. Cold booted everything and reseated all the drives, and it came back fine with minimal damage. I hate this job.
04/30/03 Wednesday 9:37pm EST Alpha test modems will be taken off line for a short time every 12 noon, bouncing those on it. These are the numbers starting with 216 0008 for Ithaca and 387 7110 for Elmira. Those using our normal modems will not be affected.
04/20/03 Sunday 10:14pm EST adore crashed this morning at 8am from panic on cpu 2 swtch. Seems to be doing this about once a month. rebooted the new modem banks tonight bouncing everyone, testing the second set of modems in the bank.
04/19/03 Saturday 12:22pm EST Massive DDOS attack started 1pm Thursday afternoon taking our webserver and primary DNS off line. Intermittent attacks followed until late Friday afternoon. IP numbers of non IP domains have been moved to get out from under attack. Some pages may not work properly again until client machines are rebooted.
04/10/03 Thursday 5:40pm EST isdn2 modem bank in ithaca locked up, rebooted.
03/20/03 Thursday 2:47pm EST adore crashed from swtch condition.
03/17/03 Monday 7:58pm EST isdn4 bad modem line 1/2
03/16/03 Sunday 7:08pm EST isdn2 modem bank rebooted bouncing 16 users. permanent bad modem at isdn2-8
03/14/03 Friday 12:21pm EST Bad ISDN lines on modem bank 2 caused dropped connections.
03/08/03 Saturday 1:56pm EST pop server taken off line at 12 pm to upgrade motherboard to 900MHz. Went smoothly.
03/01/03 Saturday 9:00pm EST Lost the power strip that powers the core router. Replaced. Lost connectivity for about 5 minutes.
02/21/03 Friday 9:30pm EST Power outage around 5pm took down entire system. All systems running again by 9pm.
02/18/03 Tuesday 08:42am EST Our firewall fw1 machine locked up this morning for unknown reasons, blocking traffic to all legacy services, mail, web, modem authentication and news. Reboot fixed it. Don't you just hate that? Or maybe you gotta love it because it wasn't anything worse. That's its first lockup ever in about half a year of service.
02/10/03 Monday 11:50pm EST Adore crashed, panic on cpuX swtch
02/03/03 Monday 11:54am EST Recent problems with our upstream resulted in periodic outages to various parts of the net, google.com and others. Fastnet claims to have fixed the situation. It has been recurring on and off for the last week or so.
01/25/03 Saturday 11:01pm EST Fastnet, our upstream, is doing upgrades to their core routers to protect against the latest SQL worm. Outage should last about 30 minutes.
01/21/03 Tuesday 04:08am EST Lightlink was down starting at 12 midnight until 4am due to a power failure.
01/04/03 Saturday 09:04am EST Power out at 5am or so. Came back around 8am. Gave it some time to prove stable, everything back up around 9:30. Watkins Glen link down until Sunday due to power outage at Conn Hill.
12/29/02 Sunday 3:39pm EST Gem, webmail and mailing list machine, out of control, sendmail's taking 2 cpu seconds to verify. Rebooted to clear out.
12/27/02 Friday 04:06am EST Light again crashed, mbuf map full. Did a netstat -na vmunix.6 vmcore.6 from /var/crash and found multiple sendmail connects. Have firewalled off the spammers. This may break something else, too tired to think straight at the moment. I hate spammers.
12/27/02 Friday 02:49am EST Light crashed, mbuf map full
12/18/02 Wednesday 9:59pm EST Mail system died this morning from a spam attack. 7000 pieces of bounced mail accumulated in the mail queue and the mail server ran out of process space. We have taken measures to prevent this in the future. Also later in the afternoon the webmail interface web server died for unknown reasons, so webmail was down until late afternoon.
12/12/02 Thursday 7:07pm EST Wireless link to Watkins Glen went down from ice at about 2pm. Terry Hill Yagis were ice coated.
12/08/02 Sunday 12:14pm EST Modem 14 on 272 2284 not answering.
12/06/02 Friday 11:12pm EST Modem 8 on 272 2284 not answering.
11/26/02 Tuesday 11:39pm EST Another modem on the 5026 banks giving ring no answers.
11/26/02 Tuesday 1:16pm EST Modems on bank one of the V90 group locked up, had to reset the modem bank a few times. Something or someone is causing ping times to isdn.lightlink.com to periodically get huge.
11/19/02 Tuesday 12:37pm EST Web server locked up from scsi errors at 11:30 or so. Lost /dev/sd9a on web server housing /usr/local and /w5 and /var. Have temporarily replaced with mirror. Will rebuild the master from the tape backups.
11/09/02 Saturday 10:49am EST Outgoing mail is down cleaning up after a spam attack that took out the mail server.
10/25/02 Friday 12:14pm EST All three mail servers were hosed this morning from a spam attack. User Area was off line, as was webmail and outgoing smtp server. All should be fine now. Really have to figure out some way to stop this particular kind of attack, they are really deadly.
10/22/02 Tuesday 6:48pm EST UPS replaced, knocking off all Ithaca dialup users.
10/21/02 Monday 7:18pm EST UPS died on modem banks, everyone booted.
10/17/02 Thursday 11:08am EST News server died on own determinism, no idea why. Restarted.
10/16/02 Wednesday 3:46pm EST Jammed modem line today refusing to let people connect, got it reset and it's working fine now.
10/06/02 Sunday 1:23pm EST Light web server taken down to add more disk space.
10/04/02 Friday 10:46am EST 277 0356 rebooted to clear out lan failures.
09/29/02 Sunday 2:47pm EST Light, web server, and adore, shell server, taken off line for 2 hours to upgrade tables and power that hold them. Dialup was unavailable during this period due to no DNS nor radius authentication.
09/21/02 Saturday 2:10pm EST Majesty, light and adore taken down for 10 minutes to upgrade resolver libc's.
09/16/02 Monday 7:15pm EST Mail taken down to increase mail and spam partitions to 4.5 gigs each.
09/14/02 Saturday 9:03pm EST Incoming pop mail and news offline at 12pm to move computer tables. Outgoing smtp server offline this evening to replace dying ethernet card.
09/13/02 Friday 12:50am EST Adore shell server crashed. Panic on CPU 2: swtch. Becoming more frequent...
09/04/02 Wednesday 8:13pm EST Mail server mx taken down to increase memory to 512Megs.
08/31/02 Saturday 5:07pm EST Adore, the shell machine, crashed on panic on cpu2, swtch. Seems to enjoy doing this every couple of weeks.
08/26/02 Monday 11:47am EST There was a failure this morning in the program that authenticates remote users to our smtp server.
Some users were not able to send mail, got 'Lightlink Relaying Denied' error messages. New and improved program has bugs... :)
08/10/02 Saturday 6:59pm EST Ithaca dialup modem banks taken off line to change IP range from 205.232.34.x to 216.7.30.x. All static IP's changed also for dialup users only.
08/04/02 Sunday 12:34pm EST Name service was messed up last night during upgrades. Intermittent failures to get out to the net would have been caused by this.
07/16/02 Tuesday 12:06pm EST web mail interface down over night due to a failed upgrade. mail itself not affected. During the periods when gem was actually down, logins to adore were not available due to nfs mounts timing out.
06/23/02 Sunday 11:59am EST newsfeeds down for 12 hours due to running out of disk space
06/06/02 Thursday 1:42pm EST Adore rebooted on panic on cpu 0 swtch
05/25/02 Saturday 12:10pm EST Light taken down at 12:00pm to add second CPU. No problems.
05/18/02 Saturday 6:39pm EST Adore taken down at 6pm to install second CPU on its motherboard. So far so good...
05/13/02 Monday 09:20am EST Adore/shell crashed for unknown reasons, cpu panic swtch. Seems to be doing it every month or so.
04/27/02 Saturday 6:21pm EST All Ithaca V90 modems were successfully rebooted using automated script which will run at 3:30 every Sunday morning.
04/23/02 Tuesday 11:10pm EST isdn rebooted due to instability. Users were being disconnected on lines 1 and 3 repeatedly.
04/13/02 Saturday 3:23pm EST 12pm started scheduled down. Light taken down, cpu upgraded to 150 MHz, second ethercard added, power supply vacuumed. Adore taken down, added 4 64mb memory sticks, power supply vacuumed. Light's UPS had its battery replaced. Modems taken off line to replace their UPS battery too. Wheels placed on table holding light and adore. Total down 3.5 hours.
04/06/02 Saturday 1:18pm EST smtp outgoing mail server off line for 10 minutes for emergency upgrades. It's back now.
03/23/02 Saturday 11:29am EST web hits are off line for a while, while we practice an upgrade to light on majesty.
03/19/02 Tuesday 3:13pm EST Elmira was down for a few minutes due to an EMI frame outage.
02/03/02 Sunday 7:47pm EST Mail server taken down at 1pm to upgrade memory. Didn't go well, machine turned to glue with 1024Megs, had to remove memory and go back to old configuration. After much playing around mail was back up by 3:30pm. Gonna have to do this again next weekend.
01/28/02 Monday 4:16pm EST Elmira modems offline again from about 12 noon until 4pm due to a mistake made by Verizon while doing upgrades on the circuits. Virtual circuit DLCI's were lost :)
01/26/02 Saturday 3:27pm EST Elmira modems were off line from about 2am until morning due to repeated power outages. Public radio static IP addresses should be working properly now, subnet range was 'shared' by another machine offering bogus mac address for default route 126.96.36.199
01/24/02 Thursday 7:07pm EST Harmony rebooted itself at 4pm for unknown reasons. This affected dialup users on 277 5026 only. In May of 2001 the config file was corrupted or copied over and was no good, so users were able to dial up but not go anywhere beyond local IP numbers. We rebooted again at 6:10pm and again at 7:10pm. All should be well now.
12/09/01 Sunday 5:38pm EST Light rebooted last night to get rid of /usr/local/main. Adore rebooted tonight for same reason.
12/04/01 Tuesday 2:09pm EST Last night ftp daemon was upgraded to 2.6.2 due to a root exploit, unfortunately the new version did not work right and prevented the uploading of pages. This has been fixed.
10/29/01 Monday 12:24am EST Adore crashed, panic on cpu 0 swtch. Been up long time!
09/19/01 Wednesday 6:37pm EST Major web outages caused by Nimda worm attack on our webserver. Things are mostly under control at this time. Outgoing mail server rebooted to replace UPS.
08/02/01 Thursday 08:38am EST Mail server mx and pop locked up at 8:38 from process table full, not sure why. Rebooted. Could have been a lock file problem with the spam trap.
07/26/01 Thursday 12:45pm EST Adore, shell machine, locked up at 11:58am from memory chip going bad. Removed, not yet replaced.
07/19/01 Thursday 2:54pm EST DDOS against iron and seal from 188.8.131.52 using known problems with web interface. Turned off interface and blocked IP range at router.
07/11/01 Wednesday 1:49pm EST Adore lost its root drive swap partition to errors. In process of installing new drive.
06/30/01 Saturday 6:53pm EST Mail taken down to increase disk space. Gives us some time to decide what to do about abandoned but growing mailboxes. Gem rebuilt with new motherboard.
06/30/01 Saturday 05:03am EST Gem locked up again, multiple ethernet cards did not work, probably pci bus gone. Will replace motherboard today. smtp server moved over from gem to emerald.
06/29/01 Friday 09:39am EST Gem (outgoing smtp server) locked up ether port. Reseated card in different slot, if continues will replace card.
06/24/01 Sunday 1:34pm EST Rain storms over the past few days have taken their toll on the Fairview to CMC link. It is presently down for repairs.
06/09/01 Saturday 09:16am EST Shell machine adore locked up due to failure of spamtrap to clear lock. Basically I think bad locking finally caught up with us, ran out of processes. Cleared them out and cleared the spamtrap lock and all was well.
05/25/01 Friday 8:03pm EST All web pages restored. End of Event
05/25/01 Friday 10:50am EST Main web drive locked up on light, needs to be replaced. Presently running on mirror drive, do not upload web pages.
05/12/01 Saturday 1:31pm EST All majordomo mailing lists hosted by lightlink have been down since about a day ago. An upgrade to the sendmail.cf on gem broke the majordomo.aliases. Mail sent to the mailing lists would have been returned with User Unknown. My apologies.
05/07/01 Monday 2:46pm EST two modems on isdn2 were giving busy signals, reset seems to have cleared them out.
05/04/01 Friday 8:58pm EST Power strip blew out tonight from old age apparently, knocking all modems off line, along with gem, emerald and majesty. emerald is the smtp server, so outgoing mail was interrupted. Things should be working again.
05/01/01 Tuesday 11:35am EST Between Friday and Monday there was a significant outage on the net between lightlink and earthlink. This was caused by a malfunctioning router in our upstream's network. Due to this network outage, much mail may have been delayed getting to Earthlink, and some mail from earthlink may have been returned to sender as undeliverable.
03/31/01 Saturday 6:48pm EST I crashed the outgoing smtp server accidentally by hitting the wrong button on the UPS, trying to test it, I turned it off. Sigh.
03/18/01 Sunday 6:09pm EST Outgoing smtp server locked up its ethernet port for about 20 minutes this afternoon for unknown reasons. Majesty failed to properly beep me when smtp went down due to incorrect settings. This has been fixed terminatedly.
03/07/01 Wednesday 7:52pm EST mail (mx) was taken down at 6pm for scheduled upgrades. Back up at 7pm. Scsi wiring was cleaned up on /dev/sda. spam filter was installed, with no filtering.
03/01/01 Thursday 3:56pm EST gem was rebuilt. web mail was down for a short while.
02/27/01 Tuesday 8:15pm EST Adore load 120.0, large amounts of mail waiting in queue for 5 days was suddenly dumped on me as postmaster as undeliverable. Procmail running 5000 lines of spamtrap drove load high.
02/23/01 Friday 3:12pm EST Got mail bombed again by same guy from sympatico. This time we put in the DUL data base, and it is keeping it out. Still lots of sendmail's opening up. Not ideal.
02/20/01 Tuesday 7:29pm EST Two large mail bombs came in this afternoon bringing all mail servers to their knees. About 3:00 to 5:00pm. No mail lost, just some temper.
While putting in filters for the spam I managed to block out the smtp server from everyone for about 10 minutes at 18:14.
02/11/01 Sunday 4:27pm EST Mail was down for 20 minutes to make emergency changes to disk arrangements. /var/log given own partition so it doesn't 'wear out' the root drive.
02/05/01 Monday 10:03am EST Well after 340 days of uptime, adore finally locked up. Had to reboot.
10/20/00 Friday 6:52pm EST Light rebooted. Error in rc.local required multiple reboots.
10/06/00 Friday 11:28am EST ftp www.lightlink.com locked up for unknown reasons, killed off inetd and restarted. now it's fine. That's a first.
10/05/00 Thursday 9:22pm EST DSL modems back on line at about 10:30 am this morning. Another customer's traffic through same switch was triggering a bug in the DSLAM firmware. Moved customer to another switch and DSLAM started to behave again. Will upgrade firmware shortly. Sheesh, only 20 hours of down.
10/04/00 Wednesday 7:06pm EST At 12:39pm this afternoon our Paradyne DSL modem rack went into continuous reboot mode. We have spent 7 hours so far trying to debug it to no avail. Problem is even new chassis, cards and control cards are doing the same thing! This is probably going to be a long down.
09/29/00 Friday 10:34am EST light load went high from SSL attack.
09/23/00 Saturday 8:43pm EST 6:00pm Scheduled down, ftp server rebuilt with new disk drives raising space from 2 gig to 8 gig and adding a 40 gig backup mirror drive.
09/20/00 Wednesday 11:29am EST web server locked up from onslaught of ssl requests from rogue site, blocked at border router.
09/12/00 Tuesday 11:41pm EST Lightning took out the Roy Parks network for 3 hours tonight ending around 10:30pm. Power came and went many times before stabilizing. The Roy Parks router is running at 96 percent cpu, and showing signs of smoking. Will replace the student backbone shortly with dedicated line back to Fairview.
08/23/00 Wednesday 11:19am EST Out of control hits on the secure server locked up light with load 131. Modem dialups were blocked during this time. Gonna have to do something about this, but not sure what. 08/17/00 Thursday 11:33am EST Romance had a hard drive partition crash last night, /dev/hda11. When rebooted, it didn't mount majesty's web data logs, so they have been unavailable. In trying to get them to mount, majesty locked up and had to be rebooted. Majesty had been up for 300 days. Machines get old and rickety if left up too long. Sort of like me. 08/15/00 Tuesday 10:23pm EST Light rebooted at 10:15pm, been up for 137 days, but was beginning to show signs of irreparable corruption in core. tty's that could not be erased, and virtual domains that went to our home page rather than their own, twice in as many mornings. Also some confusion on the UMG web page, we will see if all this clears up. 07/24/00 Monday 4:37pm EST 5026 modem bank: modems 67-69 reburned with firmware and reseated in card cage. They were rejecting connections for at least 2 months. 07/18/00 Tuesday 1:15pm EST Load on light went to 110 this morning at around 11am from a flood of secure server hits, probably an errant client. Failures of various warning systems, buddy in particular, caused a delay in finding out about the condition. While load was high, dialup authentication and some DNS services also failed. 07/09/00 Sunday 9:45pm EST news was down for most of today while we copied over the entire news spool to the new machine. It seems to be working fine, little or no news should have been lost. aurora went from 200MHz 12 gig to 500MHz 120Gig. 07/07/00 Friday 10:13am EST news was down for about 2 hours yesterday afternoon preparing for a new server. 06/25/00 Sunday 11:10pm EST ftpd upgraded to 2.6.0 plus patch to handle root exploit. 06/07/00 Wednesday 9:48pm EST A major network snafu on the morning of June 6th, caused the net to slow down and become unreachable in many cases.
Mail backlogs caused waves of incoming mail when the net opened up again, causing all three of our mail servers to crash from lack of process space. Some mail was lost. 06/02/00 Friday 10:16pm EST Power outages took out Pleasant Grove pop T1 at about 5pm, lasted for about 1 hour. ftp server taken down for emergency repairs, disk upgrades not completed. Power supply replaced due to stuck fan. Motherboard needs to be upgraded to handle 40 gig drives. 04/05/00 Wednesday 7:05pm EST Pleasant Grove T1 suffered an event, becoming sticky and not allowing pings to go through. Went out there but found nothing wrong, rebooted router, but it cleared itself up before I did that. 04/02/00 Sunday 6:30pm EST Light taken down to reinstall /dev/sd2, the log drive. Went without incident. 03/31/00 Friday 01:53am EST At about 1:30am, light lost its primary log drive /dev/sd2, panic on cpu 0 brought the machine down. Replaced drive with existing empty partitions, will put in new drive later. I do not love this job. It's making me old before my time. 03/30/00 Thursday 4:34pm EST 277 5026 modem bank was rebooted a few times to stop corruption caused by new monitoring program. 03/20/00 Monday 9:27pm EST news server hosed, lost news spool, had to rebuild from scratch. All extant news lost. 03/08/00 Wednesday 11:50am EST pop server mx rebooted to install linux 2.0.38 03/01/00 Wednesday 8:55pm EST At about 19:17 this evening, lightlink suffered a distributed denial of service attack that saturated our bandwidth for about 30 minutes. It was directed at a colo machine on our network. 02/29/00 Tuesday 9:18pm EST Earlier today at about 5pm we suffered an attack on our secure server, address 184.108.40.206 was opening repeated connections to port 443 on light. Blocked at T3. Load on light went to 150, took about 30 minutes to find out what was going on, penetrate the machine and block it. Tonight at about 9:15pm, harmony stopped authenticating properly, people calling up with weird errors.
Rebooted harmony and the erpcd's running on light. No idea what happened. 02/28/00 Monday 11:48am EST mx, mail server, spawned multiple crond's this morning at about 10am. This caused multiple buddy's to fire up, flooding our system with buddy messages, and finally filling the process table on mx. Although there are no logs indicating mail bounced, it is possible that some did. Mail was also queued on the back up servers and delivered later, but it is not clear that all mail was caught properly. A monitor has been placed on crond on mx, if it goes over 5, I will get beeped immediately. 01/23/00 Sunday 10:54am EST Due to an incorrect /etc/syslog.conf setting, news logs were being dumped on wrong hard drive, resulting in failure to rotate them and the hard drive filling up. News has not been propagating for maybe a day. News was coming in, but not going out. 01/20/00 Thursday 6:06pm EST 17:15pm pop server suffered more scsi errors this afternoon. Took her down to replace entire computer, case, power supply motherboard, cpu and memory. Running on pentium II now. No mail lost. 01/18/00 Tuesday 6:26pm EST 17:10pm pop server crashed completely, main mail drive showing corrupted FATS. Attempts to save it were in vain. Mail Spool saved to temp directory first, should be little damage to most mailboxes. Swapped out entire bay of scsi drives with new one. 01/17/00 Monday 8:53pm EST pop server taken down to replace scsi cable 8:29pm 01/15/00 Saturday 1:16pm EST pop server taken down to swap out SCSI terminator Also taken down last night at about 5pm to Replace scsi jumpers Replace CPU fan Remove Tape drive Get rid of all real time mirrors Rebooted again at 6pm to install new hard drive partitions for nightly backups. We should be able to get about 24 days of full mail spool backups with the present system. 01/03/00 Monday 6:22pm EST Light web server preemptively rebooted to avoid slow down. 
Every 30 days or so light becomes sticky, there is some evidence this is caused by repeated ssh's into light. 01/03/00 Monday 2:39pm EST Mail server, mx, locked up again, this time we think we caught the hard drive causing it, sdc1 running /var/spool/mqueue. Damage was minimal, all mail should be intact. 12/27/99 Monday 10:04am EST 9:00am Harmony our primary terminal server lost its power supply, replaced with backup. 12/24/99 Friday 12:21pm EST fvradio locked up, needed rebooting. 12/20/99 Monday 6:14pm EST Web mail interface demo version expired today, have installed a new demo, and ordered the real thing. Apparently *LOTS* of people like it! :) 12/19/99 Sunday 6:30pm EST Mail taken down to replace scsi card, and create new disk partitions on /dev/hda and /dev/hdc Popper also changed to server mode. Really cuts down on the file copying. 12/18/99 Saturday 2:49pm EST Mail suffered a severe hard drive crash early this morning and had to be taken down at about 12:30pm to rebuild the mail spool drives. Apparently one of the mirror drives went bad and the mirror software did not disable it as might have been expected. This caused corruption in the file allocation table, which resulted in corrupted mailboxes, and some bounced mail for some people. During the rebuild, some mailboxes were probably totally lost. We will go through the logs and inform those that had bounced mail or lost mail boxes privately. 12/08/99 Wednesday 1:56pm EST External net looks down, can't ping rahul.net 12/08/99 Wednesday 07:20am EST Light load very high 100 or so. Rebooted, got trace back error during the core dump. Rebooted again. Hundreds of radius and httpsd's running. Httpsd seem to be coming from 220.127.116.11, filtered at T3 Rebooted all NetServers. Radius chilled out. Only bad taste is the traceback we got during the first core dump. Could be coincidental, could be something more serious. New swap partition is in place if that matters. 
12/05/99 Sunday 6:58pm EST Light and adore upgraded to Y2K patches and libc. Down time 6pm to 7pm 12/05/99 Sunday 2:07pm EST Testing backup dialup server in preparation for tonight's down of light and adore, found that password files were not being updated, so some people could not sign on. That has been fixed. Elmira customers are having problems with a burnt out modem which is being fixed shortly. 12/04/99 Saturday 6:10pm EST Mail was offline for 10 minutes tonight to upgrade its primary spool drives from mirrored IDE's to mirrored fast wide SCSI differentials. 12/04/99 Saturday 3:09pm EST Outgoing mail service interrupted for a few minutes, needed to reboot the smtp server due to jammed processes. 12/02/99 Thursday 1:59pm EST Light rebooted. Getting out of memory errors, cpu %0, top sticking, other anomalies. halt locked up machine, had to cold boot. Adore was locked up while light was being rebooted. 11/25/99 Thursday 12:17pm EST Elmira modems were down for a number of hours this morning due to a power cable cut in the neighborhood. Mail server, mx, was rebooted twice around 12:15 to install new hard drives to hold mail spool. It may be rebooted again throughout the day. 11/21/99 Sunday 7:07pm EST Starting about 5:30pm this afternoon news was down for a short while as we moved the machines to a new room. At about 6:30pm, mail was down for the same reason. No mail was lost. The changes put both news and mail on a 100 megabit full duplex switched backbone, to help improve performance. News will be down again in a day or two when we move everything over to scsi drives from IDE which really can't take the I/O. 11/20/99 Saturday 12:00pm EST mx ethernet locked up. Did an eth0 down and up, and it started up again. no idea why 11/19/99 Friday 9:48pm EST At approximately 17:28 this afternoon our network had an 'event', data slowed down, people couldn't sign on, things came to a standstill.
Although it is still unclear what happened, at the same time we were being massively overrun by spam coming into our system from one of our own users, which caused our mail server to run out of process room and crash. Bringing this under control took about 2 hours, during which time incoming mail may have been queued, and pop accounts would have been slow to respond. Some mail may have been lost during the crash. 11/05/99 Friday 10:05pm EST secure shell keys changed across all machines. mx lost its trust to smtp (emerald) so pophash.db was not being transferred, causing people to be unable to relay mail through lightlink from remote places to remote places. Started at about 17:00 until 22:00. 11/05/99 Friday 08:17am EST mx mail rebooted to reinstall 2.0.36 OS Addition of mqueue partition seems to have helped in the overload. 11/04/99 Thursday 7:38pm EST Light rebooted to clear out dying OS. top locking up rather than running smoothly, happens every 30 days or so. 11/02/99 Tuesday 4:36pm EST New kernel on mail has made the situation worse. Have added a new partition for /var/spool/mqueue. 10/30/99 Saturday 12:12pm EST Mail was taken down to install new kernel, to see if we can improve the performance issues. However the new kernel refused to boot properly, being unwilling to use one of the mirror drives (hdc) in DMA mode. Booted back to 2.0.36. 10/29/99 Friday 5:50pm EST Our T1 has been decommissioned. At 5pm the T1 was disconnected interrupting service to modems, shell, web and incoming mail. A few minutes later connectivity was restored through our T3. 10/28/99 Thursday 09:49am EST Ethernet card on pop server mx locked up, machine unresponsive to network requests. Rebooted. If it happens again, will replace the ethernet card. 10/24/99 Sunday 2:22pm EST Romance lost its root hard drive. This caused emerald (smtp mail), gem (majordomo) and mx (pop mail) to lock up certain functions and they needed to be rebooted.
Nothing was lost, and romance's hard drive was replaced by its mirror. 10/23/99 Saturday 1:40pm EST Majesty patched for Y2K and Libc.so/sa Rebooting may have caused lockups on light and adore. 10/15/99 Friday 8:14pm EST Adore locked up, out of swap space. Don't know which process. Brought under control without rebooting. 10/10/99 Sunday 8:44pm EST V90 modems still giving busy signals every other call. Dial once, get a modem. Dial again, it rings once, then turns busy. Put in another trouble report. 10/07/99 Thursday 8:33pm EST V90 phone lines are returning random busy signals, looks like a problem with Ma Bell. Just keep trying to get in. Trouble ticket opened. 0141031 10/05/99 Tuesday 11:25pm EST mx, pop mail server rebooted. High load, sticky controls, multiple processes running, in particular crond spawning timely jobs, but many of them. Will have to watch for this. Also yesterday a dialup V90 NetServer 16-I was swapped out of the Elmira pop due to failing modems. This caused some problems when customers were moved to another modem bank with incorrect settings. The hunt group remains down in Elmira for one of the v90 banks. 10/05/99 Tuesday 04:43am EST Denial of service attack from dannyboy.easynet.co.uk against port 443 httpsd caused load on light to go 80, jamming it completely. Took me a while to figure it out, and get it blocked. Radius server was also jammed from USR.lightlink.com, creating hundreds of radius daemons. People were not able to sign on from about 2am to 5am. Things should be stable now. 10/03/99 Sunday 2:19pm EST Adore crashed and rebooted itself for unknown reasons. panic cpu0 swtch 09/28/99 Tuesday 8:08pm EST Adore rebooted by me, for system upgrades. /usr/local split into /usr/local and /usr/local/main /dev/sd2g released 09/25/99 Saturday 2:23pm EST Never rains but pours. Romance lost its root drive last night, has been replaced with the mirror drive. Majesty just lost its root swap partition. 
Swap is presently on the mirror, will probably move the mirror to root and toss the present root drive. /etc/mtab and /etc/dhcpd.leases were lost. 09/07/99 Tuesday 10:18pm EST Lost a modem card in the Multitech racks, caused no answers on 3 modems and interrupted the hunt group. Will busy out and get card replaced. Nysernet had troubles installing a new OC3 fiber ring that hosed some of their network over the past few days. This has caused intermittent connectivity problems to the net. 09/06/99 Monday 8:18pm EST A modem was giving busy signals earlier this afternoon on isdn2. I rebooted the bank bouncing people who were on isdn2. The net seems to have a major outage going on, some sites are not accessible, others are very slow. Some other ISP's are not affected, so it may be a problem with Sprint. 09/01/99 Wednesday 2:45pm EST ftp daemon was upgraded to 2.5.0 today on adore, and promptly broke uploads. Reverted to old version until fixed. 08/21/99 Saturday 6:23pm EST Scheduled down time at 6pm. Adore upgraded to Sparc 20, 256Meg. 08/20/99 Friday 6:19pm EST Adore locked up around 10am this morning from uninterruptible wait states on the mail partition. I had to crash it and reboot. During the process the password file was corrupted and some could not sign on. Later in the afternoon, I hot swapped /dev/sd1a back into adore so we could get a good root mirror going for tomorrow. This also crashed adore again because the second root drive steals the SCSI target from the first root drive and bye bye system. 08/18/99 Wednesday 6:27pm EST Adore crashed at 17:43 from swtch panic. Its Sparc 5 will be replaced by a 20 on Saturday. 07/31/99 Saturday 09:24am EST Adore crashed at 3:33am from asynch memory faults, pretty much proving it's not a memory problem as all chips are new. So it's time to replace the motherboard.
At 9am this morning I pulled sd1a root mirror drive in order to bring up a new Sparc 20 which will replace adore for a while, and forgot that swap was on the drive, so adore crashed again. 07/18/99 Sunday 12:55pm EST On Saturday night at about 5pm Fairview Square was hit directly by lightning on the power pole that leads into the main power plant of Fairview. The fuses were burnt out on the power pole, and the main circuit breaker inside of the Fairview Complex was blown to smithereens. Power was restored by about 10pm at which point we brought up all systems only to find that the T3 router would not boot. It was replaced by a backup router which took about 1.5 hours and everything was up and running by about 11:30pm. The router seems to have a flaky flashrom card, perhaps a result of the strike, but also perhaps simply a mechanical problem. 07/06/99 Tuesday 11:43pm EST Adore went load high at about 11:15pm. I tried to play with it extensively to see what might be causing it, no luck. pff -a locks up. ps didn't show procmail this time as the first in line but login. 07/05/99 Monday 7:58pm EST Adore went load high again. Changed procmail to latest version 3.13.1 07/05/99 Monday 6:46pm EST Adore went load high with locked mail partition. Took a core dump and rebooted, will send to Sun. 07/04/99 Sunday 6:29pm EST 6:00pm Tried to switch the bad UPS with the new one without bringing the system down, needless to say it didn't work. Adore knocked off line, along with many modems. 07/01/99 Thursday 12:54pm EST Adore went load high at about 12:30pm this afternoon from a problem we thought was only with the popper. But the popper was not running. This time we got a core dump, and will be sending it to Sun for analysis. 06/26/99 Saturday 10:32am EST Power outage at 10:15am or so. UPS decided to go bad at just that moment, adore, and modems taken off line.
06/24/99 Thursday 11:30pm EST Shell users found fetchmail broken this morning after an upgrade to imapd on mx last night around 12 midnight. In general people should not be using fetchmail, except once after they first enable their shell mail with enableshellmail. MX has been stable, and adore has been stable with the popper off. 06/21/99 Monday 4:05pm EST About 2pm we started a test version of the popper on adore, one that ran as a daemon rather than through inetd. About two hours later adore's load started to climb. Every drive partition was listable except /var/spool/mail. This has happened on two different drives in two different partitions, in two different drive trays in two different drive bays. This indicates it is not a drive problem. Also I was able to write to another partition on sd9 which holds /var/spool/mail, so the drive was working fine. Since this follows the popper it seems to be a kernel/popper problem. One has to ask why this never occurred on light. MX Earlier today mx started to refuse popper connections because the inetd daemon that listens on port 110 was hard coded to turn off if too many connections came in at once. Max Parke hacked the code to get rid of that check, and things have been fine ever since. 06/20/99 Sunday 8:12pm EST Adore locked up from disk drive failure. Always seems like it is the new mail drive. This time in sd9e. 10:00pm Adore locked up again, this time no sign of what was wrong, totally jammed, no cursor no nothing. When I went to reboot, the monitor wouldn't come on. I took out 4 pieces of the new memory and it would boot, I put number 5 back in, and it gave asynch memory fault. So we are running on 4 for the moment. 06/20/99 Sunday 6:21pm EST Adore taken down for emergency service. All memory replaced and upgraded from 145M to 256Meg with Sun bar code memory. 06/20/99 Sunday 12:35pm EST Saturday 6/19/1999 mail was moved from adore to mx starting at 2pm.
Most systems were online again at 4pm, but some remote users were not able to send through our smtp server until about 9pm due to failure to copy the pophash.db file over to smtp.lightlink.com. Adore locked up at about 1am 6/20/99, and didn't notify us of the failure until 4am when it was rebooted. Crash was asynch memory fault again. Memory will be replaced tonight. 06/15/99 Tuesday 3:40pm EST Major net outage for most of 6/14/99 Adore locked up at 8:45am this morning, then freed itself, popper unavailable. 06/13/99 Sunday 9:03pm EST Adore locked up again at about 8:30pm. Load 128.0 Seems to be the mail drive getting stuck in a wait state and everything piles up on top of it. Everything else was running fine, but anything accessing that drive failed, like ls or df etc. Mostly poppers were building up. Couldn't kill them off with killer. So... sendmail killed, inetd killed, everyone bounced. Mail partition moved from sd8a to sd9e and popdrop from sd8g to sd9g. Sendmail restarted, inetd restarted, and logins allowed. Hopefully that will be the end of this. 06/11/99 Friday 4:15pm EST A mirror drive on adore locked up this morning at about 3am, causing the load on adore to go to 630.0, preventing people from getting their mail and jamming out shell users. This also locked up light, which shares adore's home directories, which then prevented others from signing on. This is in part why light and adore must be separated. The drive is undergoing testing, it was the mail mirror drive, and locked up during the nightly copy of the mail directories to the mirror. A few users had their pop mail stuck in a half way state when adore came back up, and could not get their mail. These have all been cleaned up. 06/05/99 Saturday 9:51pm EST Scheduled down at 6pm. Mail was moved from light to adore with little problem. Both MX and popper functions were moved. Spool directories were also moved. Light no longer supports mail. 06/04/99 Friday 6:31pm EST Scheduled down at 6pm.
/var was moved from /dev/sd8a to /dev/sd9f /var/spool/mail was moved from /dev/sd8a to itself as an outer directory Everything went flawlessly. 05/23/99 Sunday 10:05pm EST Scheduled down at 6:00pm, news2, ftp, majordomo were off line until about 7:30pm to put rollers on the tables that hold them. Then troubles with a hub resulted in news not drawing new news, so hub was replaced. 04/29/99 Thursday 12:18pm EST SMTP server (emerald) went out of control at about 11:15am, no known reason at the moment. Load was at 15, process table full, sendmail requests were being denied. 04/21/99 Wednesday 10:42pm EST Adore crashed hard. New memory is on order. 04/11/99 Sunday 11:18am EST Adore rebooted itself this morning at 9:36am. 04/05/99 Monday 9:40pm EST Modem 3 causing ring no answers all day long. Also modem 54. Sigh 04/01/99 Thursday 12:14am EST admiral (news) was down since 22:00 3/30/99 due to a smoked root drive. pingers did not catch it because admiral had been removed from the ping list for reasons lost to antiquity. No one reported that news was down probably because it did not affect our main news reading machine, however no new news was coming in for over 24 hours. Main drive has been replaced by the mirror, and more stringent monitoring programs will be written to make sure this and other machines do not go down without my finding out about it. 03/28/99 Sunday 10:37pm EST ftp was taken down to fix a recalcitrant ethernet card on the private backbone. 03/28/99 Saturday 9:23am EST Remote relaying through our smtp server was intermittently broken again for short whiles. This was caused by two separate machines trying to update the authorized IP database, thus wiping out each other's work. 03/26/99 Friday 11:00pm EST FTP and MAILING list machine were taken down for upgrades. CPU fan was replaced on emerald (ftp) and power supply was replaced on gem (mailing list).
03/25/99 Thursday 1:33pm EST In preparation for upgrades to gem and emerald, smtp.lightlink.com was moved to mx. The anti spam database that updates legal IP numbers for remote users to use our smtp server was not updating properly on mx, so from about 8am until now, remote users have not been able to send e-mail. 03/23/99 Tuesday 11:00pm EST Adore rebooted. Telnet was not working, OS beginning to die? Light was still jammed, so rebooted also. 03/08/99 Monday 4:57pm EST Nysernet lost the routes to the T3, so although the routers were up, our T3 customers were unable to get anywhere. 18.104.22.168/24 22.214.171.124/24 126.96.36.199/26 03/08/99 Monday 10:02am EST External net was down from 4:55am til 10am. Nysernet replaced a major router downtown. 03/06/99 Saturday 7:28pm EST Momentary bug in new account creation programs caused /etc/passwd permissions to be not world readable causing the id command on adore to fail, causing shell users errors in the prompt script. 03/03/99 Wednesday 6:17pm EST adore crashed from asynch memory errors. Probably needs to be replaced. 02/28/99 Sunday 3:20pm EST web server was locked up, possibly due to not being restarted properly. 02/28/99 Sunday 12:28pm EST Adore suffered a major down at 11:45am this morning causing light to lock up. It was a partial crash which did not set off the alarms, so we didn't find out about it until 12 noon. Normally adore reboots itself after such events, but this time it just crashed and stayed crashed. Had to cold reboot both adore and light causing minor disk damage, which took about 20 minutes to rebuild. It's just this kind of thing that demands that we separate adore and light and all the machines from each other. 02/19/99 Friday 02:23am EST Web logs were lost for Feb 16 and 17 due to a program bug. 02/17/99 Wednesday 4:07pm EST Light locked up for unknown reasons, millions of popper and sendmail processes all waiting to run, couldn't do top or anything. Rebooted. 
02/16/99 Tuesday 11:00pm EST Single modem giving ring no answer on 5026. It's been cleared. 02/15/99 Monday 11:40pm EST Light rebooted to rearrange ftp. Mysterious failures in ftp have occurred since the newest version was installed. The newest version did not work at all, and then the old version stopped doing lists properly. We have light working properly now, although we still have no idea why the original software started to fail as it is still failing. Been running dynamically loaded ls's forever, now they don't work and have to use a static ls. Don't ask me. 02/02/99 Tuesday 7:28pm EST Light rebooted, getting sluggish perhaps from long term memory leaks. 71 days is too long to go without rebooting. Apparently radius authentication server did not start properly causing people to not be able to sign on for a while after the boot, until 8:01pm to be exact. 01/31/99 Sunday 6:00pm EST Scheduled down. Adore rebooted with new tape drive in place. Added 3 9.1 gig drives to adore and 1 to light. 01/29/99 Friday 5:26pm EST Bad routing at AOL has caused ICQ and Instant Messenger to fail across the Nysernet backbone. 01/23/99 Saturday 8:29pm EST ISDN2 modem bank rebooted to clear out bad modem, everyone on that bank was bounced. 12/14/98 Monday 12:03pm EST Adore crashed this morning at 9:42am for unknown reasons. 12/13/98 Sunday 5:39pm EST We had momentary but major network outages today due to a failing ethernet link and a hub. Changes to the network settings made things worse, and it took about 2 hours to undo the damage that we did trying to fix the original problem. Momentary outages on adore and light would have been experienced. 12/06/98 Sunday 12:23am EST All modem banks have been changed to allow ascii users to sign on, they are rlogined to adore at the password prompt. This means all ppp users *MUST* use non scripted PAP. 11/17/98 Tuesday 6:45pm EST We were spammed last night, and I set the mail limit to 12 connections at a time, which was too low.
Today a number of people were not able to send mail because of this, it has been raised to 24. 11/15/98 Sunday 4:15pm EST gopher/help system on adore was left down after last night's system down time. It is working now. 11/14/98 Saturday 6:12pm EST Scheduled down lasted 6:00pm - 6:15pm. Light upgraded to kernel jesw which supports vifc. Adore also down during upgrade, no changes made. 11/13/98 Friday 11:07pm EST Web hit log files are being moved to majesty, there will be some interruption in availability but we are trying to not lose hits. They should be functional again in a day or two. 11/08/98 Sunday 8:15pm EST We got isdn and isdn2 flashed for V90, took much longer than anticipated. All X2's are now X2/V90 capable. 10/27/98 Tuesday 3:32pm EST 16 new modems in, hunt group seems ok, 2 modems were not set properly, gave busy signals, all should be working properly now. 10/05/98 Monday 7:34pm EST Jammed modem on isdn3 cleared. Netserver rebooted, bouncing everyone on isdn3. 09/30/98 Wednesday 12:17pm EST Nysernet is having significant problems at various routing points to the main backbone. They are aware of the problem, and are working on it. Network slows have been happening for a few days, and may continue to happen for a while. 09/29/98 Tuesday 1:04pm EST There appears to be a net outage of some kind, stopping communications from going out or coming in. 09/20/98 Sunday 9:17pm EST System taken down for scheduled maintenance at 6pm until 7pm. Light's root drive was replaced by its mirror and a new mirror put in place. Adore's mirror drive was replaced by a new mirror as big as the root drive. Tape drive would not work with adore, tried multiple different tape drives and cables and terminators. It seems to be a problem with the one drive that is on the scsi bus, it seems to screw up the termination when the tape drive is on there. Put the tape drive before and after the disk drive, it made no difference. Adore is still without its own tape drive.
One UPS powering top Multitech modem bank is definitely gone, battery will be replaced next time around. 09/20/98 Sunday 1:15pm EST We are getting fatal disk errors on light's root drive. This is a very bad sign and may involve replacing the root drive with its mirror. 09/19/98 Saturday 08:02am EST At 3:07am light started to get fatal errors on the swap partition of its root drive. This apparently caused the failure of named and primary name service. At the same time majesty seems to be unpingable from our elmira routers, so secondary name service for those people failed also, preventing Elmira customers from getting on line. For some reason lost to antiquity the code that monitors named was set to NOT try and restart it automatically, I have turned this back on. I was not pinged because the pinger modem was off due to system work on majesty last night. Sheesh. The fatal disk errors on the root swap partition are not a good sign, and may result in a crash in the near future. We have root mirrors in place and fully updated in case we need to do a swap. 09/09/98 Wednesday 6:51pm EST Looks like all system jobs ran twice last night, causing double entries in webstats and other areas. Not sure why. 08/31/98 Monday 11:03pm EST Upgraded our news reading server to dnews 46r. The upgrade went smoothly, but an incorrect directory entry resulted in the server using default files for expiration. Thus all articles were expired. The spool will rebuild as time goes by. 08/29/98 Saturday 4:36pm EST News has been suffering a massive planet wide denial of service attack involving an overload of newgroup and sendsys messages. This has caused the load on news servers to skyrocket, including ours, and caused the flow of normal news to come to a crawl. Patches to the news server have helped to alleviate the problem, but the attack continues. 08/20/98 Thursday 7:04pm EST Downtime 1 hour. Scheduled down for light and adore did not go well. 64 meg was put in adore. 
Two new root mirror drives would not boot properly, on either machine, possibly due to a jumper error which I hadn't accounted for. New Tape drive on adore seemed to cause scsi time outs on sd0 which is real strange. Presently both systems are as they were except adore has 64 meg more memory. 08/07/98 Friday 8:12pm EST Scheduled down for light and adore starting at 6:00pm. Home drive on light was moved to adore, and mirror home drive on adore was moved to light. This broke cgi's which reside in the home directory stupidly, which can not execute setuid from adore on light. This also broke listproc for the same reason. These are now fixed. /home/www -> /homewww Listproc moved to adore. 08/07/98 Friday 11:39am EST Link to Elmira was down from 8:35am until now. Problem with the Frame Relay line was called into EMI and fixed at about 11:30am. 07/19/98 Sunday 2:39pm EST X2 modems rebooted, kicking everyone off. Apparently routing started to fail for some users at about 2am Sunday morning. Tech support got about 20 calls on it. It may not have affected everyone. One user reported trying all 3 X2 modem banks, failing on all of them, which is real weird as they are independent machines. No idea what happened. 07/19/98 Sunday 2:14pm EST News was effectively down from about 11pm last night until now due to a routing error during a reboot that went unseen. Not related to the X2 problem above. (I think). News was coming into news1, but not being transferred to news2, readers were able to read from news2, but weren't getting any new news during this time. Expiry on news1 has probably removed some articles which are now lost. 07/05/98 Sunday 01:14am EST gopher and the help system did not restart properly during last reboot. Should be working now. 07/05/98 Sunday 12:27am EST isdn3 was off line due to a network lock up, perhaps for a few days. Everyone was bounced off of isdn2 accidentally in an effort to clear this out.
07/04/98 Saturday 7:26pm EST Raw web access logs were lost last night due to a long standing but subtle programming bug. Logs for 7/1 7/3 and 7/4 have been recovered, the rest are lost. Only raw hit logs were affected. 07/04/98 Saturday 6:13pm EST Light taken down at 6pm to replace failing backup tape drive. 07/01/98 Wednesday 6:23pm EST adore crashed for unknown reasons, rebooted itself. 06/30/98 Tuesday 10:00pm EST world read perms turned back on ftp due to failure of cd command to print directory name properly. What a pain. 06/28/98 Sunday 4:24pm EST shell help and gopher were broken by last night's ftp move. These should be working now. It is not immediately clear that running files across NFS between two different platforms (SunOS and Linux) is going to work properly. At present the SunOS executables for ftp are on the Linux box, which probably was not intended, but seems to work anyhow as they are being read across the network before being executed on the Sun. Probably what we want is for the /ftp/pub directory to be exported and leave the rest as it is. 06/27/98 Saturday 5:56pm EST ftp on light disabled to move it to emerald. Both light and adore needed to be rebooted to let go of the original ftp directory. 7:30pm 06/26/98 Friday 4:50pm EST Harmony locked up, had to be cold booted, bouncing all users. 06/22/98 Monday 9:14pm EST Light was put through an emergency reboot to see if it would clear out a persisting problem with excessive syslogs. 06/17/98 Wednesday 2:17pm EST Sendmail stopped allowing remote mail through our system to remote users at about 10:11am this morning due to a system screwup. It should be working fine now. 06/16/98 Tuesday 11:57am EST NetServer I isdn2 had a jammed modem. Rebooted both isdn1 and isdn2, bouncing everyone off. 06/13/98 Saturday 12:19pm EST Harmony bank 1 and usr and isdn modem banks were rebooted, due to a jammed UPS which had to be rebooted. 06/11/98 Thursday 12:50pm EST 2 modems on harmony were found to not be picking up. 
They were causing ring no answers for a few days. 06/06/98 Saturday 9:42pm EST We suffered a slow but catastrophic failure of majesty over the past week. It began crashing routinely, every few hours. This has caused the news server to be spotty; although news2, which is the reader machine, did not go down, it was not able to have all the articles available that it would otherwise have. Presently majesty is in sick bay, and news.lightlink.com has been replaced with a pentium 333 machine. Things may be rough for a while, but it probably won't crash. 06/03/98 Wednesday 7:30pm EST Router to Elmira rebooted, interruption of service for 3 minutes. 06/02/98 Tuesday 10:37pm EST Sendmail was rebooted improperly and started to fail to let remote users post through lightlink to remote spots. 06/02/98 Tuesday 9:26pm EST Upgraded news software today on news.lightlink.com to make things faster. In process reduced size of active file getting rid of empty groups. This cut the size from almost 2 meg to less than 1 meg which should make news reading faster. 05/30/98 Saturday 07:47am EST Harmony seems to have rebooted itself for unknown reasons. This kicked everyone off of the multitechs. 05/27/98 Wednesday 2:47pm EST news1 crashed at 2:19pm. Probably from too many news feeds, or a failing drive. Haven't figured out which yet. In any case a new machine is being built which can handle the load. news2 was rebooted in the process. 05/25/98 Monday 12:58pm EST News1 crashed again at about 12:30pm. No warnings went off because it seems to have half crashed leaving its ethernet port operational. All drives have been reseated including power sockets. If it crashes again, we will reseat the memory, and if it crashes again, we will replace it with a new super news machine that we are considering in the wings. News2 was not affected except some news may have been lost. 05/24/98 Sunday 4:35pm EST News1 suffered a major crash at 3am, wiping out the history file. 
The entire news system was rebuilt this afternoon. News2 was not affected, except that incoming news will have been sparse. News is running properly again. It is possible that news1 is losing a hard drive, so this may happen again. 05/18/98 Monday 11:18pm EST News2 crashed this afternoon and needed to be rebuilt. The history file was lost and rebuilt, and all binary files were lost. Most other news was not lost. Posting was not working properly until now due to an incorrect permission on one of the news drives. 05/14/98 Thursday 02:30am EST Adore's ethernet locked up tonight for unknown reasons. This tied up light because of the nfs mounted drives. Adore was rebooted but still would not telnet to itself, indicating hardware failure in the ether port. Turned it off and on, and it started working again. If worse comes to worst, adore can have another ethercard put in it, or people can use light for shell if adore dies off completely. 05/13/98 Wednesday 6:57pm EST Shell users were locked out of mail today, possibly for 2 hours. The permissions on the password file were set to non world readable for reasons that we suspect but don't know for sure, nothing serious, more like a bug in the account creation process which has to write to the file to create the new account. The problem occurred we believe starting at 2:41pm and lasted until about 5:00pm. Monitors have been put in place to beep us within one minute should this happen again. 05/11/98 Monday 01:35am EST isdn1 277 0356 rebooted to clear out jammed lines. 04/25/98 Saturday 8:46pm EST news1 and news2 were taken down this afternoon to install second ether cards in them. This will allow news2 to take news from news1 over a private internal network relieving some of the load on our main network. news1 is being moved to the T3 shortly. 04/20/98 Monday 3:12pm EST Modem 3 locked up on isdn1 2770356, picks up but doesn't answer. Reset bouncing modems 3 and 4. 04/16/98 Thursday 6:17pm EST Scheduled Down. 
Harmony modems and power supplies were reseated a number of times bouncing everyone at least twice. Reseating needs to be done periodically to assure that the contacts are clean to avoid spurious drops or modem failures. 04/16/98 Thursday 02:53am EST Light crashed during nightly mirroring due to out of mbuf errors. This will probably go away with scheduled changes in which machine has which drives. Presently the mirroring takes place over the net and stresses SunOS's ability to handle the traffic. 04/15/98 Wednesday 08:54am EST Mail loop in the ccounsel mailing list caused a crash of gem at about 8am which was not caught until about 2pm. Have installed monitoring to restart mail and beep if it dies. 04/04/98 Saturday 3:07pm EST isdn1 lost its ethernet address, causing people to not be able to go anywhere on the net. Rebooted, bouncing everyone. 04/04/98 Saturday 12:21am EST eggbots and irc servers killed on adore to track down source of DNS hits coming in on light. DNS installed on adore, so it can handle its own name service now. 04/01/98 Wednesday 4:06pm EST Web server taken off line for 5 minutes to track down what's driving light out of control. 03/23/98 Monday 3:06pm EST Modems 20 and 26 on Harmony have been giving ring no answers for a few days. Have busied them out. The entire box needs to have all its cards reseated and rebooted. This will happen next down time. 03/22/98 Sunday 6:22pm EST Light and adore taken down from 6:00pm to 6:25pm to add memory Memory increased from 128M to 320M 03/14/98 Saturday 5:56pm EST News2 was unscheduled down for about 15 minutes to allow rewiring of the machine room. 03/11/98 Wednesday 12:36am EST cisco router rebooted, internet connection lost for 2 minutes. Router had a bad arp entry for 188.8.131.52 which is modem 9 on the USR 277 1076 modem banks. People signing onto that modem could not get out into the internet. Bad arp entry came from a typo I made over a week ago, so this has been going on for a long time. 
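The 04/15/98 entry above mentions monitoring that restarts mail and beeps if it dies. The original monitor code is not shown in this log, so what follows is only a minimal Python sketch of that idea under stated assumptions: the service is started as a child process from a command line, and `alert` is a hypothetical callable standing in for the beeper. The names `ensure_running` and `watch` are made up for this illustration.

```python
import subprocess
import time


def ensure_running(proc, command, alert):
    """Return a live process: the same one if still running, else a fresh one.

    proc    -- a subprocess.Popen object, or None if nothing was started yet
    command -- argument list used to (re)start the service
    alert   -- callable taking a message string (stand-in for the beeper)
    """
    if proc is not None and proc.poll() is None:
        return proc  # still alive, nothing to do
    if proc is not None:
        # The old process has exited; report it before restarting.
        alert(f"service exited with status {proc.returncode}, restarting")
    return subprocess.Popen(command)


def watch(command, alert, interval=60.0, rounds=None):
    """Check the service every `interval` seconds.

    Loops forever when rounds is None; otherwise performs `rounds` checks.
    """
    proc = None
    done = 0
    while rounds is None or done < rounds:
        proc = ensure_running(proc, command, alert)
        time.sleep(interval)
        done += 1
    return proc
```

A watcher like this only notices a process that has actually exited. A daemon that is alive but wedged, as in several entries in this log, needs a check that talks to the service itself.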
03/08/98 Sunday 1:03pm EST 277 1076 modem bank rebooted, bouncing everyone. 03/05/98 Thursday 12:29am EST News2 was down from 9pm to 12am due to a faulty ether connection. The ether connections were changed to add in a new hub. I have no idea why the system didn't beep me. It looks like the monitor program 'pinger' was locked up, not a good sign. What shall monitor the monitor program? 02/23/98 Monday 5:00pm EST Nynex installed 4 new modems which are now on line. Part of modem bank 1 was knocked off line twice due to a loose cable, bouncing everyone on that section. T1 was physically moved to a better location causing a 4 minute network outage. 02/22/98 Sunday 8:30pm EST Cisco router upgraded to 11.1(17). Network down from about 6pm to 6:45pm. 02/21/98 Saturday 2:28pm EST Momentary net outage for unknown reasons. Rebooted cisco router and kentrox CSU/DSU. That didn't clear it up, but then it cleared itself up shortly after. 02/20/98 Friday 09:00am EST X2 NetServer was rebooted by accident, every one bounced. 02/18/98 Wednesday 5:17pm EST Adore was under attack today from someone at 184.108.40.206. Rebooting the system did not end the attack. Used tcpdump to sniff his packets and then blocked him at the border router. 02/17/98 Tuesday 9:27pm EST The X2 NetServer was producing busy signals even though there were modems free. I rebooted the first X2 NetServer and the USR V34 Server bouncing everyone off. It seems to have cleared up. 02/14/98 Saturday 6:31pm EST 277 0356 X2 lines locked up for unknown reasons this afternoon. It may have had to do with a test Cisco 2501 we put on the network to test RIP2, a routing protocol. I presently do not know why that would have caused the problem, particularly since the new router was on line since last night with no reported troubles. Rebooting the X2 banks did not fix the problem, but taking the Cisco off line did. 02/11/98 Wednesday 12:19am EST Light rebooted itself last night at around 3am. 
This caused authentication services to switch over to majesty. When light came back up, I forgot to reset authentication back to light. For unknown reasons, majesty is refusing to authenticate some people, so for much of the morning many people were not able to sign on. Mystery solved. A newer version of the authentication code was running on majesty, that was never supposed to be put in service, and it was dying on config files from the older version that are copied over from light to majesty every night to keep the password files in sync. 02/07/98 Saturday 10:52pm EST Adore crashed again. Took out another memory chip. Gonna keep doing this until we find out which one is bad. 02/05/98 Thursday 11:36pm EST Modems 49-72 rebooted, bouncing everyone off. Bad modem 68 is working again. 02/04/98 Wednesday 12:46pm EST Adore locked up but didn't crash. This caused the modem authentication software to fail on light so some people were not able to get on. Sun tells us it's possibly a bad memory chip. I have replaced the chip in slot 0. 01/05/98 Monday 9:25pm EST Majesty was down from 2:41pm this afternoon. News2 was not affected, but new news coming in was stopped until now. Various cross checking pingers failed to go off, new code with bugs etc. My fault. 01/01/98 Thursday 3:05pm EST A number of people have not been able to get signed on to 277 5026. When light crashed this morning, authentication shifted to the back up server, which for reasons still unknown is rejecting certain people's passwords. 01/01/98 Thursday 04:10am EST Light crashed due to mbuf map full error. 12/08/97 Monday 10:51pm EST Light was crashed by a user playing with an exploit. The exploit has been closed on all three machines. The anti relaying code was not restarted when light rebooted so some remote users were not able to send mail. My apologies. 12/08/97 Monday 3:25pm EST Bad modem was causing ring no answers, taken off line. 12/05/97 Friday 5:37pm EST X2 rebooted itself. 
12/05/97 Friday 02:20am EST X2 modems rebooted, bouncing everyone about 12 midnight. 11/28/97 Friday 2:41pm EST pm2-elmira is down. Frame connection to elmira rebooted, no change. pm-elmira is fine. 11/24/97 Monday 01:26am EST Aurora (news2) rebooted a number of times at around 11 to 12am, in order to test various things. Apparently it hasn't been posting properly to usenet for about 3 days since the 20th. This was a result of starting it from the startup profile /etc/rc.d/rc.local. For unknown reasons it runs but won't post. 11/22/97 Saturday 7:39pm EST Aurora (news2) taken down for 3 minutes to replace ethernet card. If this doesn't stop the crashing, we will replace the mother board. 11/21/97 Friday 7:15pm EST Tested an attack against the Cisco border router, and brought it down for about 2 minutes. 11/21/97 Friday 12:38pm EST Aurora locked up at 8am. Because of the nature of lock up, no warnings went off. I believe it is caused by bad ethernet code for the card we are using. The card will be replaced shortly. 11/20/97 Thursday 10:38am EST Aurora (news2) locked up its ether port again. Various beeping mechanisms in place to warn me and reboot the system automatically failed. Will need to do more testing on this. Don't know why the etherport is locking up. Have replaced the ethernet card, the mother board may be going bad. 11/17/97 Monday 2:01pm EST Overview database on news2 has been erased. It will rebuild slowly as people download articles. Downloads may be slower for a while. Hopefully this will fix the missing news article problem. The Overview database had headers in it for articles that were expired so they show as available in netscape but return ARTICLE EXPIRED when customers hit on them. 11/16/97 Sunday 10:14pm EST News2 rebooted. Was running with 64meg instead of 128meg due to improper boot last time around. This is fixed. 11/16/97 Sunday 9:27pm EST News2 suffered a few crashes over the past few days from ethernet lockup. 
Apparently some of its history, index and overview files got corrupted. This may have resulted in headers being downloaded without bodies. We have rebuilt the data base and run an expire, this hopefully will clean it up. If not, it should clean up by itself as time progresses and things get naturally expired. 11/14/97 Friday 6:25pm EST Modems and news stopped for 5 minutes to reset a warning light on a UPS. Everyone bounced. The UPS seems to have cleared out its bad battery indicator light, but it may come back. 11/14/97 Friday 4:16pm EST News2 down for about 15 minutes to replace a bad ethernet card. 11/06/97 Thursday 11:57pm EST X2 rebooted itself. 11/04/97 Tuesday 4:48pm EST I had to take light down to clear out a module that was misbehaving badly. I was trying to trace a hacker who was trying out passwords on our system, and in loading the tracing software, it took over the console and wouldn't give me control back. 10/31/97 Friday 12:24pm EST Majesty locked up from about 5:45am from scsi disk failure, eventually dying from process table full. 10/22/97 Wednesday 1:27pm EST Aurora locked up at 5:30am this morning and did not beep me because the modem was off. News restored at 1:30pm 10/20/97 Monday 11:29pm EST Aurora (news2) locked up for unknown reasons, rebooted. 10/19/97 Sunday 7:02pm EST Majesty (news) crashed from mbuf map full. 10/17/97 Friday 5:47pm EST X2 rebooted again to clear polluted arp cache. 10/17/97 Friday 3:41pm EST USR modems down, box not responding. X2 modems rebooted by accident. 10/12/97 Sunday 10:23am EST Looks like there was an external network outage at around 8:30am. 10/12/97 Sunday 10:19am EST Romance's network locked up for unknown reasons, for about 3 minutes. Rebooting cleared it out. 10/12/97 Sunday 05:23am EST Internal network was locked up by unknown causes from 3:15am to now. Incoming traffic went to 100 percent and outgoing went to 0. 
Shutting down all machines did not clear the condition although during the final reboot of all of them it managed to clear itself. Nysernet and Sprint are looking into it. 10/08/97 Wednesday 10:39pm EST X2/ISDN server totally locked up, rebooted bouncing everyone Talked to USR today, they have escalated the service call. 10/07/97 Tuesday 9:48pm EST External network seems to be partially down. 10/07/97 Tuesday 7:41pm EST Modem 12 on NetServer locked up. 10/06/97 Monday 8:11pm EST X2/ISDN server locking up routinely. Have asked USR to escalate the complaint. 10/04/97 Saturday 12:52pm EST News crashed at 6am from disk full errors. 10/03/97 Friday 11:41am EST X2 modem server rebooted to clear out jammed lines. 10/02/97 Thursday 11:58pm EST Main router locked up a few times tonight for unknown reasons. All access to the net was blocked out. 10/01/97 Wednesday 4:30pm EST News2 was down for about 30 minutes at 2pm. Loose cable on a hard drive caused it to crash, cable is damaged but working. 09/29/97 Monday 9:13pm EST News2 rebooted by accident. 09/29/97 Monday 6:28pm EST Modem 12 on the X2 server is locked up. Evidence indicates that the NetServer rebooted itself at 16:05 this afternoon, bouncing everyone. Called USR, they said modem is picking up on digital calls but not analogue calls, that this is a Nynex problem. USR said the ISDN line is presenting an analogue call as a digital call which is incorrect. Called Nynex, reported the line that is not picking up. Gotta leave the modem locked up so Nynex can see it fail. 09/26/97 Friday 3:46pm EST Modem 2 on the *NEW* Netserver locked up. 09/21/97 Sunday 12:19am EST Yesterday when light crashed and was rebooted, an older version of the apache web server was started which caused some anomalies including failures of freebie domains to go to their proper pages. This is fixed. X2's seem stable. X2's rebooted to install newest code. Everyone bounced 12:24am 09/20/97 Saturday 9:16pm EST News has been jammed most of the day. 
It was running but throttled due to a binary fill up this morning. When I restarted it this morning, it didn't restart properly, but kept running anyhow, so I never got beeped. 09/19/97 Friday 9:47pm EST Called Nynex about the 8 ISDN lines going up and down all night. 09/19/97 Friday 8:48pm EST Light crashed. I quit out of Xwindows and got a cpu panic. Been up way too long. 09/19/97 Friday 6:50pm EST Nynex ISDN seems to have gone down. All X2 channels got dropped and gave busy signals. NetServer won't resynch up. Nynex ISDN seems to be going down every 10 minutes for about 5 minutes. 09/18/97 Thursday 9:38pm EST X2/ISDN server replaced. Down for about 3 hours. Took it back out because it wasn't working, but it was my problem, so the new one is back in again. Looks like it's working. 09/16/97 Tuesday 2:33pm EST X2 server locked up. Worked with USR on it for a number of hours. Doing one last test before we replace it. Please continue to report modem failures. 09/13/97 Saturday 4:17pm EST NetServer Modem 2 locked up, call into USR to study it. 09/05/97 Friday 7:39pm EST X2/ISDN NetServer locked up on modem 1, giving no answers. Rebooted. 09/04/97 Thursday 1:32pm EST External net is down. 08/31/97 Sunday 2:58pm EST Nynex Frame Relay line died at 11:50am. They are working on it. 08/25/97 Monday 5:02pm EST Harmony password file was lost at 3:50pm for unknown reasons. Signons failed until about 4:10pm when we noticed it. 08/22/97 Friday 12:14am EST X2 modems rebooted twice for system work. Everyone was bounced. 08/19/97 Tuesday 02:13am EST Password file for X2 modems was destroyed again at 12 midnight. This time I found out what did it. 08/18/97 Monday 11:23am EST Password file for X2 modem banks was corrupted last night. Down since 12 midnight. 08/16/97 Saturday 4:57pm EST X2 port 4 is apparently hung, I did a hard reset on it, which booted 3 too. 08/11/97 Monday 4:18pm EST Majesty crashed from out of memory errors. 
This caused light to stick during shell logins from NFS mounts. 08/09/97 Saturday 4:48pm EST Majesty crashed, out of mbufs. 08/07/97 Thursday 12:09am EST I tried to regather articles from the last day to make up for the 0 day expire of this afternoon. news2 promptly went and filled all 4 disks all the way and bombed, something I don't quite understand because there wasn't 12 gig of news to steal from news1. So I had to expire that too, but by this time all those articles were in the history file, even though they were now expired, so I couldn't go back and reget them again without starting from scratch which is what annoyed everyone last time. So, right now everything looks stable and is working properly. Most of this was my unfamiliarity with the new software, and its lousy manual. The articles will fill in again as new ones come in. 08/06/97 Wednesday 8:06pm EST news2 was put through a 0 day expire this afternoon after it filled up one drive leaving the rest empty. This bug is fixed but to restart it properly we had to empty the one drive. All four drives are filling up now. 08/04/97 Monday 6:29pm EST news2 rebuilt from scratch using latest, greatest 4.2g. Articles will refill in. 08/02/97 Saturday 10:54pm EST news seems to be stabilized. news2 however got hopelessly corrupted. It has been wiped clean, and groups will have to reload themselves. 07/26/97 Saturday 8:08pm EST inn1.6b2 has destroyed the news history file. news will be down for a number of hours. news2 is fine. news will probably be up again around 9am plus or minus. 07/25/97 Friday 10:43pm EST Getting ring no answer on modem 43. Taken off line. 07/24/97 Thursday 4:42pm EST Apparently we suffered some kind of SYN attack causing the web server to lock up and name service to freeze. 07/23/97 Wednesday 8:56pm EST News has been down for a few hours while we install inn 1.6b1. Not going smoothly, tin doesn't seem to want to read. rn works fine. 07/23/97 Wednesday 5:32pm EST NetServer rebooted itself. 
07/22/97 Tuesday 2:38pm EST Significant net outage. 07/22/97 Tuesday 12:44am EST ISDN/X2 rebooted multiple times. Uploaded newest code, Courier I works worse than before. 07/20/97 Sunday 4:51pm EST The system password file was corrupted at 5:10am this morning due to a bug in our own security software. Logins to light and e-mail retrieval were prevented until about 6:00am when we finally figured out what had happened. 07/19/97 Saturday 1:24pm EST NetServer ISDN cold booted to clear out two jammed channels, s11 s12. 07/13/97 Sunday 2:25pm EST News2 will be slow today, we are rebuilding the history database. While the rebuild is taking place, many articles will show up as there while in fact not being there, this may cause problems with your newsreaders. Lots of spam in alt.sex is rendering the groups totally useless. Expiration on alt.sex* has been set to 0 days, expiration on all other groups has been set to 7. Binaries are still 4. 0 days doesn't mean we aren't carrying them, it means they get expired fully every night. Filtering of postings crossposted to more than 10 groups has been turned off on news2. News hasn't been filtering. 07/09/97 Wednesday 01:03am EST ISDN/X2 server rebooted at about 11pm or so. Then again a while later. It first locked up with endless radius daemons sending majesty to load 40, then after it was rebooted, it wouldn't do ISDN any more, X2 worked fine, and *V2=0 worked fine, but *V2=5 wouldn't work at all. Turning it off and on cleared it out. 07/08/97 Tuesday 12:20am EST ISDN/X2 server rebooted. It was in some terrible loop with the radius server on majesty, causing MILLIONS of radiusd's to be spawned driving the load on majesty up to 50. 06/30/97 Monday 9:01pm EST Net Server ISDN rebooted to set speeds to 115200. Testing slow speeds. 06/28/97 Saturday 2:14pm EST Nynex stopped by to check errors on the T1, trouble report originated by Nynex monitoring. T1 was down for about 5 minutes causing interruption in connection to the net. 
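The 07/13/97 entry above describes a per-group expiration policy: 0 days for alt.sex*, 4 days for binaries, 7 days for everything else. As a rough illustration of how such a first-match policy resolves, here is a small Python sketch. It is not the news server's own configuration format. Only the alt.sex* pattern comes from the log; the pattern used for binaries and the function name `expire_days` are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# (pattern, days) pairs, checked in order; the first match wins.
# Only "alt.sex*" appears in the log. "alt.binaries.*" as the binaries
# pattern is an assumption made for this sketch.
EXPIRE_RULES = [
    ("alt.sex*", 0),        # expired fully every night
    ("alt.binaries.*", 4),  # binaries
    ("*", 7),               # all other groups
]


def expire_days(group, rules=EXPIRE_RULES):
    """Return the expiration in days for a newsgroup name."""
    for pattern, days in rules:
        if fnmatchcase(group, pattern):
            return days
    raise ValueError(f"no expiration rule matches {group!r}")
```

Ordering matters here: the catch-all `*` rule has to come last, or it would shadow the more specific patterns.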
06/27/97 Friday 12:40am EST An experimental version of sendmail died silently at around 11:17pm for unknown reasons. Outgoing mail was down for about 30 minutes until I was informed of the down. I will place sendmail in the system monitor so this won't happen in the future. 06/21/97 Saturday 10:27pm EST Light taken down due to network instability. Getting le0 out of mbuf errors, tremendous activity coming in on le0 or going out, not sure. ciscoin and ciscoout don't show anything special. Getting mbuf errors even while in single user mode, that indicates tremendous hits on light coming from outside. Installed new kernel while we are at it, jesv. Light booted 3 times without incident, pray hard. Looks like it was a ping attack, we will get the joker the next time it happens. 06/15/97 Sunday 10:27pm EST Light crashed again at 6:57pm and would not reboot at all. After multiple failures to reboot, I turned it off, pulled the memory and reseated it, and did the same with all cables and internal disks. Then it rebooted, but it may simply have changed its mind. It is possible the crashes are from a dying root drive, there is some indication that there were problems with the swap space, and boot sectors. There is another mirror drive ready to go if it crashes again. If that's not it, it's going to be rough finding out what it is. We have a second server being prepared which can take over the functions of light if necessary, but it is far from ready to do this on short notice at this time. 06/15/97 Sunday 3:41pm EST Light crashed for unknown reasons. Watchdog reset. Came back automatically and tried to reboot, said bad boot block or something, I stupidly didn't write it down. Then I rebooted it by hand and it booted and ran a *LOT* of fsck's on all drives. When it went to reboot, it failed with MEMORY OUT OF ALIGNMENT, no tracebacks this time. Then it booted by hand fine. 
We may have a bad boot block on the main root disk, but this doesn't explain why light crashed in the first place. 06/12/97 Thursday 6:35pm EST Light and related systems taken down to replace UPS. Light did not boot smoothly, it gave MEMORY OUT OF ALIGNMENT followed by a traceback. This is new and not good. 06/11/97 Wednesday 7:55pm EST Network interference from romance being down caused aurora to lock up due to NFS mounts. News2 down for about 20 minutes. 06/06/97 Friday 11:31am EST Elmira link was down for about 45 minutes for unknown reasons. Rebooting the frame relay router at our end seems to have reestablished the link. 06/02/97 Monday 3:15pm EST Light crashed from running out of mbufs, a bug in the apache web server. It crashed at 15367/15744 mbufs. finwait was 2474. 05/24/97 Saturday 11:50pm EST Majesty (news.lightlink.com) was down for about 6 hours from a subtle news crash. I did not notice and no one called to inform me. Aurora (news2) was not affected. Some articles may have been lost. 05/20/97 Tuesday 6:14pm EST Light taken down to install new kernel and make security patches. Running Kernel jess 6:00pm -> 6:15pm 05/18/97 Sunday 6:40pm EST Light taken down at 6:00pm to rewire part of machine room. 05/18/97 Sunday 3:00pm EST News (majesty) was down for about an hour to patch the OS. Majesty is theoretically fully patched at this point. 05/07/97 Wednesday 01:01am EST Sendmail and web taken off line at about 12:30am for a few minutes to deal with an incoming relay spam. 05/05/97 Monday 02:09am EST Harmony modem bank one reset, bouncing everyone. 04/30/97 Wednesday 11:32am EST Light crashed. 04/29/97 Tuesday 6:20pm EST Nynex has taken down the huntgroup to fix an earlier problem. Everyone is getting busy signals on 5026. 04/28/97 Monday 12:54pm EST Harmony reset, bouncing everyone. 04/25/97 Friday 12:24am EST Nynex has destroyed the hunt group again. Line is giving busies. Dial in on 277 4940. 
04/22/97 Tuesday 6:10pm EST Load driven to 30 by errant cgi on the part of local user. 04/21/97 Monday 3:51pm EST Elmira POP was offline for about 3 minutes, for unknown reasons. 04/20/97 Sunday 12:29am EST I locked up light by accident by bringing up adore with same IP number! Sorry. 04/19/97 Saturday 7:02pm EST Light taken down to replace failing boot drive. Harmony rebooted with new operational image 11.2.3 6:00pm -> 6:30pm 04/19/97 Saturday 3:20pm EST Netserver 16-I rebooted at 3:10pm. 04/17/97 Thursday 01:47am EST Cisco router got locked up around 12:30am, process table full. Rebooted at 01:45am. Apparently we suffered a syn flood attack from a disgruntled irc user. 04/16/97 Wednesday 6:02pm EST Majesty (news) is down for upgrades. 6:00pm -> 7:00pm 04/15/97 Tuesday 3:03pm EST Spam filter is causing more damage than it's worth. Have removed it. sendmail was down for 10 minutes due to an error restarting the non spam filter version. 04/08/97 Tuesday 2:36pm EST Light is losing its main root drive. There is a backup mirror drive if the main drive fails totally. A new drive is on order. Light rebooted at 2:30pm to determine which drive was failing. 04/08/97 Tuesday 11:03am EST Elmira off line for about 30 minutes due to network lockup at elmira end. 04/05/97 Saturday 11:57pm EST Elmira Frame Relay to Nynex is down, lights are dead. Called into Nynex Repair. Nynex T1 Box was burnt out, replaced and line came back up. Downtime 24 hours. 04/05/97 Saturday 12:09pm EST Rebooted light by accident. Sorry. 04/05/97 Saturday 01:00am EST Binary partitions on aurora (news2) rearranged to provide more space. All articles lost in alt.binaries.pictures.* and a number of warez and games groups. 04/04/97 Friday 8:45pm EST Light rebooted at 6:00pm to install new 9 Gig drive. Rebooted a second time to clear out jammed format session on drive. 6:00pm -> 6:07pm 8:45pm -> 8:50pm 04/01/97 Tuesday 5:40pm EST News (majesty) rebooted with new kernel with statd security patch. 
04/01/97 Tuesday 5:30pm EST Light rebooted with new kernel with statd security patch. 04/01/97 Tuesday 3:29pm EST News2 taken down to install 128Meg of memory. 15:22 -> 15:30 03/31/97 Monday 5:47pm EST Light went out of control, load went to 20. I halted the system and rebooted. Earlier had 'out of mbuf' errors. Probably caused by Apache 1.2b4, we are now back on 1.1.1. 03/28/97 Friday 1:40pm EST Network rewiring may have caused intermittent freezes between 11am and 1:00pm. 03/28/97 Friday 11:05am EST News died silently at 4:42am probably from an errant control message with too long a header. 11:44am news rebooted 03/27/97 Thursday 2:15pm EST News and news2 off line for 15 minutes at 2:00pm to rewire. Failed, have to do it again. 03/27/97 Thursday 09:15am EST Network locked up this morning at 8:12am necessitating a reboot. It then locked up again at 8:26am needing another reboot. There was very strange activity on one of the network hubs (BNC connector to downstairs lan) even though there should have been no traffic at all going through it. Unfortunately when the network is locked up, network sniffers to view traffic don't show much! We will be separating the 4 ethernet hubs into their own areas, making it easier to tell which line may be in trouble. Presently they are all stacked in one central site making it impossible to quickly trace the wires from blinking lights to machines they are connected to. 03/26/97 Wednesday 9:32pm EST Majesty (news) locked up the ethernet at 9:11pm causing all machines to stop responding. Light, majesty and aurora rebooted. 03/26/97 Wednesday 4:01pm EST Nynex had a major event last night, causing our roll over to go off line. All modems were working, but were not rolling over on busy. People calling in on 277 5026 were getting busy signals, because that modem was busy, although many others were not. This was fixed around 1am, but then broke again this afternoon around 2pm. 
All modems work, and most roll over properly, but the roll over on 5026 has gone off line again. Dial in on 277 4940. 03/19/97 Wednesday 5:50pm EST Aurora down: 4:00 -> 5:50pm. News2 (aurora) down for memory upgrades. 512K Cache installed, 128Meg of memory still not being recognized by Linux. 03/19/97 Wednesday 12:59am EST USR Couriers rebooted with new kernel image, bouncing everyone. 03/11/97 Tuesday 1:54pm EST Aurora (news2) rebooted. 03/11/97 Tuesday 12:14pm EST Aurora (news2) taken down to install 128Meg. Rebooted fine but wouldn't recognize the memory. Still running on an effective 64Meg. 03/10/97 Monday 10:45am EST Light semi jammed at about 8:46am for unknown reasons, causing various network malfunctions. Rebooted at 10:45am. News was down for the period of the reboot. 03/08/97 Saturday 12:17am EST News outsend queue was jammed, some articles were not being posted to the 15 swap sites. They were going out the main feeds. The queue has been freed, if articles are still on our system, they will be sent, if not please resend. 03/06/97 Thursday 12:20am EST News taken down because of full partitions. Expire set to 1 day to clear out space so that we can clean up the mess. News2 running fine, expire is 14 days. 02/26/97 Wednesday 08:40am EST Light became crippled at 5:19am with a Process Table Full error. Beepers did not go off because it was still responding to pings. Rebooted at 8:35am. 02/24/97 Monday 8:21pm EST Majesty (news) rebooted at 8:06pm to install new kernel. 02/23/97 Sunday 10:15pm EST Light was taken down this afternoon for about 20 minutes due to a jammed process that drove the load to 8. News was rebooted twice at 10:00pm, to install new kernels. 02/23/97 Sunday 2:47pm EST News crashed from filled up warez partition. 02/22/97 Saturday 9:53pm EST Amphenol connector pulled out of the modem banks taking modems 33 through 48 off line for a while. Don't know how long it was going on. 02/20/97 Thursday 9:13pm EST Majesty (news) rebooted with new kernel. 
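The 02/26/97 entry above notes that the beepers stayed quiet because light was still answering pings while its process table was full. A ping only proves the kernel's network stack is alive, and a bare TCP connect can also succeed while the daemon behind the port is stuck. One way around this is to require the service to actually say something. The Python sketch below does that by waiting for a greeting banner. It is an illustration only, not the monitoring code used here; the function name `banner_ok` and the example prefix `b"220"` (the usual SMTP greeting code) are choices made for this sketch.

```python
import socket


def banner_ok(host, port, expected_prefix, timeout=5.0):
    """Return True only if the service at host:port sends expected_prefix.

    A host that answers pings, or even accepts the connection, but never
    sends its greeting within `timeout` seconds counts as down.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = b""
            # recv() may return fewer bytes than asked for, so keep reading
            # until the prefix length is reached or the peer closes.
            while len(data) < len(expected_prefix):
                chunk = conn.recv(len(expected_prefix) - len(data))
                if not chunk:
                    break
                data += chunk
    except OSError:  # refused, unreachable, or timed out
        return False
    return data == expected_prefix
```

For a mail server the call might look like `banner_ok("mailhost.example", 25, b"220")`, where the host name is a placeholder.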
02/20/97 Thursday 05:12am EST Majesty's etherport crashed for the second time. Took about an hour to get majesty and light disentangled, had to reboot both. I hate NFS. Have no idea what is wrong with Majesty.
02/17/97 Monday 9:29pm EST Light was taken down at 6:00pm to install new lib.c and security patches. It did not go smoothly, and we are still getting bad signal stack errors every once in a while. Came back on line at 7:00pm and had to reboot a few times after.
02/15/97 Saturday 4:31pm EST News2 was down for an hour to add new disk space. Entire news spool was wiped clean. It will refresh as you hit on groups again.
02/14/97 Friday 12:03am EST Majesty (news) rebooted to install new security upgrades.
02/13/97 Thursday 1:53pm EST Majesty (news) taken down for an hour for maintenance of news spool.
02/10/97 Monday 11:59pm EST Majesty (news) locked up and wouldn't communicate over the ether port. Routes were all right, but it wouldn't telnet in or out or even to itself. I had a hard time getting control back. Finally able to do a clean reboot. It's working now.
02/08/97 Saturday 5:26pm EST News2 restarted with clean spool, replicate true is on. Expect it to bomb.
02/07/97 Friday 01:19am EST News2 taken down for 1 hour to add new drive. Have 12 gig on board.
02/06/97 Thursday 6:42pm EST News2 was taken down for about 15 minutes to install another hard drive. One more to go in later. No news was lost.
02/05/97 Wednesday 9:59pm EST News2 got completely munged with the replicate true setting. I have turned it off and had to rebuild the news spool.
02/04/97 Tuesday 4:28pm EST News2 taken down to move machine to machine room. News spool was wiped clean to install "replicate true", which will force news2 to follow article numbers on news.
02/02/97 Sunday 6:29pm EST News was down for one hour due to a mistake.
01/30/97 Thursday 2:14pm EST Cisco router locked up. Had to reboot it. Down for 10 minutes. Turns out Jane had set a wrong IP in her work machine!
Not a fault of the Cisco, easy to replicate.
01/26/97 Sunday 10:37pm EST Light rebooted at 10:00pm with corrected version of new kernel. SOMAXCONN set to 127 in header.h files and binary edit of uipc_socket.o
01/25/97 Saturday 5:33pm EST Light rebooted with new kernel. jesi contains SOMAXCONN 127 which may help web server jams.
01/21/97 Tuesday 03:34am EST Couriers rebooted to clear out jammed modem.
01/14/97 Tuesday 2:03pm EST Light taken down to install new virtual domain code. I screwed up on taking light down (forgot to disengage majesty) and what should have been a 5 minute cycle turned into a 20 minute cycle. Installation went smoothly and it seems to work at first glance. Light may crash.
01/09/97 Thursday 8:32pm EST Modem rack 1 has a bad power supply. It is presently running on its spare. I found this out by accident when I went to reseat both power supplies alternately. The modems died when set to power supply 1. Everyone on rack one got bounced. Will replace.
01/06/97 Monday 10:26pm EST Sprint has taken over management of our primary router to the internet. I don't know if this is a good thing or not. I get to not have the write password, but apparently they also changed the read password, so the monitoring software is registering throughput as zero. Not.
01/03/97 Friday 2:51pm EST Courier modem number 4 was producing ring no answer. The rack has been reset.
12/30/96 Monday 7:22pm EST Light crashed. I have reinstalled the original kernel using the older virtual interface code, kernel jesf. We started crashing with jesg.
12/28/96 Saturday 4:51pm EST Light crashed. Time to get rid of the new vif code. Don't know if the old code supports more than 128 virtual domains.
12/28/96 Saturday 11:12am EST Connection to external network was down from 4:00 to about 10:30. This is usually caused by problems external to lightlink, but this time for some reason the cisco router had locked up. Sprint is "looking into it".
I have placed a pinger on the cisco from majesty so that if it happens again I will be beeped immediately.
12/27/96 Friday 6:11pm EST Light crashed.
12/26/96 Thursday 3:40pm EST News taken down for upgrades.
12/26/96 Thursday 1:41pm EST I crashed light installing new software.
12/23/96 Monday 11:40am EST Light crashed. Probably caused by new virtual domain code.
12/20/96 Friday 5:00pm EST Modems 49-72 upgraded to flash prom 28MR113D.HEX. Modems 04-12 upgraded to flash prom 28MR113D.HEX. All modems upgraded.
12/20/96 Friday 07:41am EST Light crashed, cpu panic on mget.
12/17/96 Tuesday 6:17pm EST Vif 1.1 installed, virtual domain code. Running kernel jesg. Light taken down at 6:00pm to install code. USR's taken off line to switch from coax to twisted pair, then rebooted.
12/12/96 Thursday 7:16pm EST Virtual domains moved to 205.232.88.xx. Web was down from about 6:00pm until now.
12/05/96 Thursday 5:49pm EST Majesty was accidentally halted by one of our staff. News was down from 4:15pm until about 5:00pm. This caused anomalies in the mail server and shell accounts due to NFS mounted drives between light and majesty.
12/02/96 Monday 12:42pm EST Web server was down for 10 minutes due to a bad domain name.
12/01/96 Sunday 6:23pm EST Light was down for scheduled maintenance starting at about 6:10pm. /tmp was cleaned out, and various things checked to make sure they work properly (rootenv etc.)
11/26/96 Tuesday 2:42pm EST I crashed light by accident with an incorrect kill command.
11/26/96 Tuesday 1:20pm EST One line fixed, one to go.
11/26/96 Tuesday 12:34pm EST 2 lines are still bad, one is producing ring no answers at around modem 45. Just keep trying, you will eventually hop over someone getting the ring no answer and get in.
11/25/96 Monday 7:51pm EST Well there were 6 dead lines. Now there are 8. Making progress. Nynex just showed up, we are down to 2 bad lines, one can be busied out, the other is still ring no answer.
11/25/96 Monday 3:14pm EST Presently the rotary seems to be working properly except we have found 6 dead lines in the middle of it that were working fine a few days ago. Nynex has been informed.
11/25/96 Monday 1:12pm EST Hunt group is in process of being repaired. In the meantime 18 modems are off line. Ftp filled up overnight due to failure of maintenance program over the past few days. This has been fixed to beep me. Web server did not restart smoothly this morning after web stat runs. No reason determined. Results are cgi failures, and other anomalies.
11/22/96 Friday 6:40pm EST Nynex installed the remaining 9 lines for new modems and managed to stick the new numbers in the middle of the hunt group. Because the numbers were not working and had no modems on them if they did, people were unable to dial up after the first 30 modems were filled up. This produced long term ring no answers all afternoon.
11/18/96 Monday 9:06pm EST News crashed due to disk full errors, and then later again due to my own errors.
11/17/96 Sunday 11:47pm EST News taken down for security enhancements.
11/16/96 Saturday 7:01pm EST News was down for about 45 minutes. A simple reboot to install new security software turned into a nightmare when I couldn't sign on at all. A bit too much security I would say. Actually the cause was a bug in some commented lines in rc.local that prevented it from running fully during boot up, and the password authentication daemon was not getting started. It's a very eerie feeling not being able to sign on to your own machine.
11/03/96 Sunday 2:44pm EST All shell access has been closed off to accounts that have not used shell in the past month.
10/31/96 Thursday 8:59pm EST I erased the password file accidentally while cleaning up for security sweeps. E-mail and shell were unavailable for about 10 minutes.
10/25/96 Friday 5:54pm EST Light and majesty both rebooted to add security features.
Running jesf and jojof.
10/24/96 Thursday 12:24pm EST Light rebooted at 11pm last night to clear out remains of intruder. Forgot to install new named, so at 12 midnight when named was reset it bombed out. Name service was defective from about 12 midnight to 11am this morning.
10/24/96 Thursday 12:24pm EST A trojan login program was found on light recording shell passwords. Trojan seems to have been installed Aug 16th. Not doing too well.
10/18/96 Friday 11:15pm EST We are under the influence of a *VERY* bad mail loop between Cornell listproc, the department of mathematics and ourselves, namely my own mail. A listserve I run at cornell is sending error messages to my account at the math department which was just closed. The messages are bouncing back to the listproc which is then bouncing them back again to math. Each time the files get bigger. They started off about 20K, now they are 600K. I am going to have to disable my mail lest the mail spool fill up. Cornell has been notified, but no response yet.
10/17/96 Thursday 10:18pm EST Problems with named have been fixed.
10/17/96 Thursday 10:17pm EST Cornell has asked me to remove all cornell.* groups from our newsfeeds. They are meant for cornell only.
10/17/96 Thursday 7:04pm EST At 2:07pm the web server suffered an unclean restart which resulted in it rejecting all cgi execution for a few hours, before I was notified. Don't know what caused the problem, except that the server was restarted by hand and then the error logs show cgi's failing left and right.
10/16/96 Wednesday 12:48pm EST External net is down.
10/12/96 Saturday 11:46pm EST External net is down from about 11pm to 2am. They are working on fixing the router problems that have plagued Ithaca for a long time.
10/10/96 Thursday 11:19pm EST Light rebooted to install corrected named files.
10/09/96 Wednesday 11:59pm EST I screwed up and killed a whole mess of processes that should have been left running. Rebooted light to restabilize system.
10/09/96 Wednesday 2:07pm EST External net is going up and down while they work on a bad primary Ithaca router.
10/07/96 Monday 11:08pm EST Light rebooted to clear out named. We will be installing a new version of named and the vif code shortly. Please bear with us, presently every time I add a new virtual domain I have to reboot the system.
10/05/96 Saturday 9:08pm EST Light rebooted to clear out named. Named is due for replacement soon.
10/02/96 Wednesday 10:18pm EST The root partition filled up this morning causing the loss of web hit stats from about 4:20 in the morning to about 11:00am.
10/02/96 Wednesday 1:03pm EST Light was rebooted to clear out a full root partition. The root system filled up during the mirror run. This occurred because the mirror drives are not mounted during normal reboots such as the one last night, for reasons as yet undetermined. Since the mirroring is to directories attached to the root partition, when the mirror partitions are not attached, it merely fills up the very small root partition.
10/02/96 Wednesday 12:49am EST Light was down for two hours starting at about 11pm. We ran into a problem with named (NAME-D) and virtual domains. If named is loaded first and then /etc/rc.vif is run, it will work properly. Vifs are virtual interfaces; we need one for each virtual domain. They assign the virtual IP number to the ethernet port so it can respond to more than one IP number. But after the vifs are loaded, if named is killed outright and restarted, it runs out of file descriptors and complains about bad file numbers and refuses to load any name service records. There seems to be a limit of about 40 to 50 virtual domains before named starts to go bad. We have over 100. Most of the down time was spent trying to corner the problem and determine the max number of vifs. I was playing with sendmail on majesty and had killed off named and restarted it a few times on both machines. This is what started the problem.
When I restarted it on light, it died and it took me a good half hour to figure out what was going on. It's also really hard to think straight with fear running through your stomach.
09/29/96 Sunday 6:53pm EST Light was down for scheduled maintenance from 6:00pm to 6:50pm. Disk drive bay filters were cleaned. A broken SCSI cable feeding the tape backup drive was replaced. Harmony was extensively tested for name service and signon authentication while light was down. The RX11.1 code works much better than the old R9.0. Trumpet winsock 2.0B seems to ignore the second DNS number in ethernet mode, but works fine during dialup. Timing differences however seem to trip up the script. This will need more study. Win95 dialup worked fine. Next scheduled down time will be next Sunday at 6:00pm to move the web and log directories to their own partition.
09/27/96 Friday 7:21pm EST News is down to repair a disk drive bay.
09/24/96 Tuesday 1:59pm EST External net is not fully up. 13:45 external net seems back up.
09/23/96 Monday 2:26pm EST Web stats for 19960922 and 19960923 were munged and had to be redone. They should be correct now.
09/23/96 Monday 10:52am EST Light's /home directories were filled up today by a user who had a 600 meg file in his home directory. This was going on since 5:45am until now.
09/21/96 Saturday 10:18pm EST News was restarted due to system weirdnesses with the w command.
09/20/96 Friday 7:54pm EST News rebooted to find which fan is going bad. It is going to have to be replaced soon.
09/19/96 Thursday 12:48am EST News was rebooted a few times around 11:00pm to clear out jammed serial ports from uucp experiments.
09/15/96 Sunday 9:57pm EST News rebooted to clear out jammed processes on serial ports.
09/15/96 Sunday 11:38am EST External net has been down since 06:43am -> 3:00pm. Ithaca 1 Router was down at Nysernet.
09/12/96 Thursday 11:25am EST External net is down.
09/11/96 Wednesday 10:32pm EST The web server went out of control at 8:20pm.
We spent about 30 minutes trying to do an autopsy on the dying processes, only to find that some could not be killed off. Since one remaining process was tying up port 80 in a (D) uninterruptible I/O wait, we had to reboot light to clear it out.
09/09/96 Monday 12:59pm EST News was down for about 30 minutes to remove an air conditioner with a blown compressor.
09/08/96 Sunday 8:48pm EST Ithaca-1 router is off line (external network).
09/06/96 Friday 11:05am EST Web server has been stabilized to a large degree. Source of load spikes has been found, and hung children and root servers are being worked on. Web stats are being moved to 6:30am once a day so that the server is not killed every hour, interrupting ftp downloads etc.
09/03/96 Tuesday 4:31pm EST Power outage at about 4:10pm -> 4:42pm
09/03/96 Tuesday 01:10am EST Web hit logs program is causing the server to die. I will be running web hits by hand periodically until I figure out why.
09/02/96 Monday 3:43pm EST Web server logs from 7:00pm or so last evening until now have been lost due to a programming error on my part.
09/02/96 Monday 12:08am EST Modems were reset, bouncing everyone.
08/31/96 Saturday 1:01pm EST Working on the web server, it will be up and down repeatedly. 13:01 -> 15:00
08/30/96 Friday 9:47pm EST Light crashed due to a mistake I made.
08/23/96 Friday 12:29am EST Perl5 was down until now from earlier upgrades today. Links in /usr/local/bin were nuked, and permissions were set wrong.
08/22/96 Thursday 9:55pm EST We are presently running vanilla apache 1.1.1. The web stats are still being accumulated but they are not being distributed to users. This will start happening again when things stabilize. The Secure server is working again (the documentation was wrong) but is presently down.
08/22/96 Thursday 6:54pm EST ApacheSSL 1.3 replaced with vanilla apache 1.1.1. 18:55 -> Keep Alive turned off to see if it affects spiking.
08/22/96 Thursday 10:08am EST The Apache web server is not working perfectly.
It is causing load spikes on the system which make light stick momentarily every once in a while. It may be related to cgi's. Apache is aware of the problem and is 'working on it'. 1.1.1, however, so far seems to have fixed the main problem with 1.05, which is chronic freezing. The Secure server is not responding yet, probably due to a configuration error from the upgrade.
08/21/96 Wednesday 2:22pm EST Apache web server is down for upgrades.
    14:22 -> Running 1.1.1 -> 16:52
    16:52 -> Running 1.05 -> 18:00
    18:00 -> Running 1.1.1 -> *****
08/19/96 Monday 09:24am EST News crashed at 6:07am from a news posting collision between two groups alt.binaires and alt.binaries.
08/15/96 Thursday 6:11pm EST The following newsgroups have been moved to a new partition. All articles in these groups have been temporarily lost.
    /alt/binaries/multimedia
    /alt/binaries/games
    /alt/binaries/mac
    /alt/binaries/sounds
    /alt/binaries/misc
    /alt/fan
    /alt/mac
08/13/96 Tuesday 5:34pm EST News is down for upgrades 17:33 -> 19:07. Big 7 news groups now belong on their own partition, which will allow longer expire times for them and the world wide groups that remain on the original partition.
    soc sci comp rec talk misc news
08/13/96 Tuesday 11:23am EST Last night at about 1:00am, there was a strange anomaly in the harmony password authentication process. A number of people were blocked from getting on. Causes are as yet unknown and the anomaly has not been repeated.
08/10/96 Saturday 12:37pm EST Majesty/News are going to be up and down all afternoon for upgrades.
    Down: 12:37 -> 12:43
          16:18 -> 20:16
I tried to move the rec group from the root news partition to its own partition pending moving all of the Big 7 to their own space. The rec partition has 400 meg in it, and it took hours and hours and hours to copy it over. I finally cut it short and proceeded to erase the original files on the root directory, and THAT took hours and hours and hours. So, this is not going to work.
I took this course of action because I wanted to save the present spool of news in rec (and the big 7 when I go to move them) but it just takes too long. So probably what I am going to do is just nuke the news spool, reformat and rearrange it the way it ought to be, and then let it rebuild itself with new articles over the ensuing days. We either suffer long down times and preserve articles (although losing new articles coming in during the down times), or we nuke the whole spool and get it running again in about 1 hour, losing everything on it, but having little down time and losing few new articles.
08/10/96 Saturday 01:33am EST News reset twice to clear out expiration confusion resulting from earlier upgrade.
08/09/96 Friday 1:49pm EST News is down for about 2 hours for upgrades. 13:49 -> 15:41. The overview database was moved from /sd4a to /sd5h to even out the load on our news drives.
08/02/96 Friday 10:42pm EST The external net is down.
07/26/96 Friday 2:52pm EST The external net is down.
07/25/96 Thursday 4:44pm EST Jason, Light suffered an event a few moments ago, the load went to 50 and people started calling in thinking we were down. During such high load periods, cron kicks in and starts taking snapshots of the system, resulting in the following top display. You will notice mogrify is running at 99 percent CPU, using 781 Meg of swap space, 41meg of which was resident. You must put limits on this puppy as it is highly destructive in the hands of your users.
Homer

Thu Jul 25 16:26:39 EDT 1996
last pid: 29724;  load averages: 47.44, 36.88, 20.52    16:27:09
186 processes: 160 sleeping, 24 running, 1 zombie, 1 stopped
Memory: 111M available, 53M in use, 58M free, 8984K locked

  PID USERNAME PRI NICE   SIZE    RES STATE   TIME   WCPU    CPU COMMAND
29114 nobody   102   10   781M    41M run     1:09 99.64% 18.75% mogrify
29709 sysdiag  -12  -20  1704K  1428K run     0:00  0.00%  0.00% top
29168 sysdiag   25    0   652K   212K run     0:00  0.00%  0.00% <swapped>
28724 nobody    25    0   448K   328K run     0:00  0.00%  0.00% httpd
   63 sysdiag   25    0    88K   104K run    43:39  0.00%  0.00% portmap
17674 docdoo    -5    2  5780K  1412K run    17:07  0.00%  0.00% parse
12119 sysdiag   -5    0   352K   200K run     7:01  0.00%  0.00% erpcd
28828 nobody    -5    0   448K   176K run     0:00  0.00%  0.00% httpd
28778 gila      -5    0  2532K   164K run     0:00  0.00%  0.00% pine3.95
22690 maxho     -5    4  1296K   128K run     3:36  0.00%  0.00% bawt
  455 root      25    0    11M   136K run    63:01  0.00%  0.00% Xsun
29153 nobody    -5    0   436K   312K run     0:00  0.00%  0.00% httpd
29079 nobody    -5    0   436K   236K run     0:00  0.00%  0.00% httpd
29109 nobody    -5    0   440K   220K run     0:00  0.00%  0.00% httpd
 7979 slam      -5    0  1080K   216K run     3:30  0.00%  0.00% eggdrop

Homer
07/24/96 Wednesday 4:41pm EST htaccess passwords for web pages are presently not working. We are looking into it. This is fixed. There is a new module called unix_auth_module that is supposed to use /etc/passwd passwords. It conflicts with the normal dbm passwords. The server was recompiled without the unix_auth_module and dbm passwords started working again.
07/22/96 Monday 6:27pm EST Web stats were down for the past day. No web logs were lost, but they weren't being collated every hour as usual. Things should be back to normal and all past web logs should be in today's files.
07/22/96 Monday 6:25pm EST News was down for about an hour to do routine maintenance on disk drives and to install a new mirror drive.
07/19/96 Friday 12:53pm EST News and Majesty were down from 11:12am to 12:52pm for maintenance and upgrades.
07/15/96 Monday 1:44pm EST Web server was down for about 15 minutes due to DNS failure on a virtual domain. My fault.
07/04/96 Thursday 10:47pm EST News server crashed from full partition on alt.binaries.
06/27/96 Thursday 7:44pm EST Last night at 12:00am or so the web server crashed and was down all night. Since there were no new hits, the web stat accumulation program simply kept adding the last batch of stats to the log files every hour on the half hour, producing multiple entries for the time between 11:30pm and 12:00am when the server crashed. I have cleaned up all the log files for 19960627, so there should be no duplicates in them. It is unknown why the server crashed, but it may have been related to a domain name change I made, failing to inform the virtual domain server of the change. The apache server is very sensitive to dns failures; it will crash the whole web server if it fails to get dns on any of its virtual domains. I have fixed the bug in the stats program that wrote duplicate stats to people's log files, and the apache server is due for an upgrade shortly.
06/27/96 Thursday 12:05pm EST Light was rebooted to install jese kernel, which includes patch 102264-02 ufs_lockf.o to fix the crash we suffered yesterday.
06/27/96 Thursday 11:20am EST Harmony was rebooted to install fix for failure to reprompt.
06/26/96 Wednesday 11:46am EST Light crashed and rebooted itself for unknown reasons, probably not related to scsibus errors. Lots of changes have been made to the kernel recently, so some instability may have been introduced into the system.
06/25/96 Tuesday 4:18pm EST Harmony reset to set auto_detect_timeout to 10 from 30. This will prevent some ring no answers.
06/25/96 Tuesday 3:47pm EST Harmony was rebooted at around 12:30pm to install a new setting to help fix some of the scripting failures the new code is causing. This fixed a lot of scripts and promptly broke a few more.
06/24/96 Monday 10:08pm EST News/Majesty was down for 15 minutes at 10:00pm or so to add 64 meg of memory, bringing total to 160meg.
06/23/96 Sunday 4:52pm EST News/majesty was down for about 20 minutes at 4:10pm, for upgrades. /usr/local software drive on light was copied over to majesty /sd7. Majesty is now running on identical software to light.
06/22/96 Saturday 11:32pm EST Alt.binaries filled up the news partition and bombed out the news server. I have nuked the alt.binaries group, starting with a clean slate. They will fill back up again over the next few days. Someone is flooding the net with binaries, possibly in a.b.misc, and this is filling up partitions and using up extreme amounts of T1 bandwidth.
06/22/96 Saturday 7:13pm EST Harmony was taken down at 6:00pm until 6:30pm to install RX11.1.3. First indications are that this has NOT fixed the sign on problem. Further testing will be necessary to see if it helps the upload problem. Homer
06/19/96 Wednesday 10:14pm EST News was down for 10 minutes. Rebuilt kernel on majesty for 127 max users instead of 200. Told that 200 wraps around at 128! Rebooted.
06/19/96 Wednesday 12:30pm EST A number of user perl scripts went out of control at 10:00pm last night, producing high loads and sticky response. The situation lasted for about an hour before I caught it and killed them. The alt.binaries groups have been flooded recently, tying up both our incoming bandwidth and disk space. I cleaned out the alt.binaries groups to one day (although expire is still 3) and I have upped our T1 to 768K, effective in about a week.
06/16/96 Sunday 11:47pm EST Light was quarantined for 2 hours from 6:00pm to 8:00pm or so to move the ftp/web directories to a new 4 Gig drive. The web server was down for this time, and shell usage was locked out. E-mail, news, dialup and web surfing should all have been working fine. The upgrade went without incident.
06/11/96 Tuesday 12:59pm EST Incremental backups as of 8am are in /backup.
06/11/96 Tuesday 12:19pm EST /etc/motd was corrupted. I replaced the kernel and motd and rebooted.
06/11/96 Tuesday 12:06pm EST We lost the home drive sd6 at about 8:30am this morning. Light was down until now restoring the home directories to a new drive. sd6 is the drive that has been having so many errors over the past year, I think it finally decided to die for good. Home directories are current as of 12 midnight or so. Incrementals will be made available shortly if the system proves stable.
06/10/96 Monday 5:52pm EST Web server pages were messed up for about 30 minutes due to a typo I made.
06/09/96 Sunday 8:24pm EST ftp brought back on line. Everything looks like it is working.
06/09/96 Sunday 7:13pm EST Light was taken offline for scheduled maintenance at 6:45pm. /home/ftp was moved to /ftp. A link was placed in /home/ftp -> /ftp, so nothing should break from this. During this time I was able to sign on to harmony using winsock, and surf the net, although www.lightlink.com responded with a socket error. I also was not able to read or send mail, Eudora responded with "Connection refused by lightlink.com". This means that majesty was working as both nameserver and security server for those using Harmony to sign on. By changing the definition of smtp.lightlink.com and mail.lightlink.com from light to majesty, it might be possible to keep outgoing mail functional during these down times. But this is possibly more dangerous than it is worth.
06/09/96 Sunday 11:17am EST News throttled itself from a full news partition overnight, at least that is what it says it did. The expire run has apparently cleaned out the overfull articles but did not restart news. Recent spams and cross postings between news hierarchies have caused strain on the various news partitions. I am cutting back all news to 4 days expire time to clear things out, and will open them back up again according to our disk space availability.
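Full partitions come up repeatedly in this log, so here is a minimal sketch, in Python, of the kind of check that could warn before a spool fills. It is not the monitoring actually used on these machines. The 90 percent threshold, the function names and the "/" path are all assumptions made for illustration.

```python
import shutil


def fullness_warning(label, used, total, warn_fraction=0.90):
    """Return a warning string when used/total reaches warn_fraction, else None.

    warn_fraction=0.90 is an assumed threshold, not a value from the log.
    """
    if total <= 0:
        return None  # nothing sensible to report for an unknown or empty size
    fraction = used / total
    if fraction >= warn_fraction:
        return f"{label} is {fraction:.0%} full"
    return None


def check_partition(path, warn_fraction=0.90):
    """Look up real usage for the filesystem holding `path` and apply the rule above."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return fullness_warning(path, usage.used, usage.total, warn_fraction)


if __name__ == "__main__":
    # "/" is a placeholder; a news host would list its spool partitions here.
    for mount_point in ["/"]:
        message = check_partition(mount_point)
        if message:
            print("WARNING:", message)
```

Run from cron, a script like this could feed whatever pager or beeper hook is already in place. Keeping the threshold rule separate from the disk lookup makes it easy to test without a real full disk.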
06/07/96 Friday 10:37am EST ftp daemon jammed from too many failed uploads. I have increased the permissible jobs to 100 and installed a monitor that will warn me when it gets above 20.
06/05/96 Wednesday 10:40am EST Light crashed from cpu panic after /sd6a went out of control. First look shows no damage.
06/04/96 Tuesday 11:35pm EST Web server killed and restarted. It was beginning to stick again.
06/03/96 Monday 12:42pm EST Ciscoout shows that external net was down for a short while at 12 noon.
05/29/96 Wednesday 12:55am EST Web server is acting up. Keeps jamming for unknown reasons. This started tonight, but has been reported sporadically in the past.
05/23/96 Thursday 01:13am EST Unscheduled reset of Harmony, bouncing everyone. There is a setting called forwarding_timer that is supposed to be set to 5 (25ms) but is off instead. This is supposed to be the cause of much upset, perhaps even ftp upload problems. However I was unable to turn the setting on.
05/16/96 Thursday 4:37pm EST beta6 ftpd jammed with failed sessions. Rebooted with beta11.
05/16/96 Thursday 12:00am EST Unscheduled downtime. All modem ports were reset at 12 midnight, to reset the subnet mask of each port from 255.255.255.0 to 0.0.0.0. Xylogics says the prior setting would cause trouble.
05/14/96 Tuesday 11:52pm EST News server was shut off from Clarity tonight for a few hours while we tried to determine where the bandwidth is going. (It's not Clarity. 6/4/96)
05/14/96 Tuesday 11:49pm EST The web server shut itself down tonight for a few minutes after I deleted an account that had moved on. The account had a virtual domain with us, and when the directories were deleted, the httpd server decided it didn't want to run any more. Not sure if that is a feature or a bug.
05/14/96 Tuesday 1:34pm EST News should be fully back on line.
05/14/96 Tuesday 12:08pm EST News is down for testing. Back at 12:45pm. Remote domain reading is still down.
05/13/96 Monday 3:06pm EST News is down for testing.
You will notice from the sharp fall off on the ciscoout numbers that news is somehow using up all our bandwidth. News is presently back up, but various feeds are locked out until we find which one is causing the problem. Problem seems to have been in part all the new micro feeds we have added. We were sending too much to all of them at once. Things are more in control now.
05/10/96 Friday 11:44am EST Harmony rebooted itself at 8:36am under the new operational image. So much for that.
05/10/96 Friday 01:19am EST Harmony rebooted itself. It is now running the new operational image which Xylogics says won't reboot itself any more.
05/07/96 Tuesday 3:42pm EST Harmony rebooted itself at 7:25am this morning, bouncing everyone off. Xylogics says that if we install the new R9.2.21 operational image, it will stop doing this. I have installed this in the proper directory, but it involves taking harmony down in order to load it. Probably I will simply let harmony fail again and reboot itself with the new image.
05/06/96 Monday 7:08pm EST News was down for about 45 minutes due to a bug in an upgrade that I did not spot. I have reverted the code to the original until a fix is forthcoming. Xylogics has acknowledged that the recent reboots of harmony are the result of a bug in the R9.2.7 code we are running, and has offered a fix in R9.2.21, which will not compile. They are aware of that too.
05/04/96 Saturday 12:04am EST Harmony rebooted itself at 2:46pm on Friday, probably bouncing everyone off.
05/03/96 Friday 8:23pm EST News will be up and down for short periods as we try to work the bugs out of new news software. We are trying to install software that will allow us to feed remote sites at 10 times our present rate, thus making a fair exchange for our 5 redundant newsfeeds (that are presently not enabled because we couldn't feed them as fast as they were feeding us by a factor of 10.)
04/29/96 Monday 4:37pm EST The Sprint network was apparently down from 4:30am to about 7:30am this morning.
04/29/96 Monday 11:43am EST Light/web/mail/ftp was taken down for scheduled down time from 11:00am to 11:32am. Installations went without incident.
04/28/96 Sunday 11:23pm EST Majesty/news taken off line for about 45 minutes at 10:23pm for hardware upgrades. 16 gig of mirror drives were installed on the fast wide bus, and 4 gig of news space was reinstalled. Installation went without incident.
04/25/96 Thursday 7:26pm EST Harmony rebooted itself at 7:02pm for unknown reasons, booting everyone off.
04/23/96 Tuesday 2:59pm EST Scheduled Down Time. Light was taken down at noon for 15 minutes to install a new mirror root drive. Majesty was taken down at 12:15pm for 45 minutes to install a new mirror root drive, and a fast wide scsi card. Both installations went without incident.
04/22/96 Monday 12:36pm EST System taken down for scheduled down time at 11am. Could not get light to fail to boot no matter how I reset the kernel to the way it was during the crash. This is not a good sign. Harmony will boot properly from Majesty, but will not authenticate. Probably a simple configuration error. Root mirror drives were not installed, not enough time. System will be down on Tuesday during Scheduled down time from 11am to 1pm.
04/20/96 Saturday 8:58pm EST Presently playing with mirroring name service on majesty/light for both forward and reverse lookups. Name server may be sticky or momentarily non-existent during these tests.
04/20/96 Saturday 4:43pm EST Earlier today while Light was down, out of the corner of my eye I saw Harmony reset the modem banks. Apparently during this time it also lost its default route, preventing dialup users from going out onto the net. The mean time to failure around here is getting ridiculous. Update: Apparently for whatever as yet undetermined reason, Harmony rebooted itself while light was down.
Since the new experimental Harmony code WAS resident on Majesty, Harmony rebooted itself from Majesty using a vanilla config.annex file which did not have the default route to the cisco in it. Thus we lost our default route. Amazing. These machines are talking to each other behind my back. Anyhow the upshot of all this is Harmony is now running X9.2.21 rather than X9.2.7. 04/20/96 Saturday 2:47pm EST Crashed again. This time out of the blue for no good reason. Panic on CPU 0, almost definitely on sd5. Some damage to web hit files. Perhaps the drive is bad. I do have another one, but it would take significant down time to swap them over. I see it's going to be a hard life. However I also see that people were signing on while light was down, so the harmony software is working on majesty. Also placed the kernel back into asynch mode which is usually more stable. 04/20/96 Saturday 2:03pm EST At about 1:50pm I walked in on light to find it generating repeated errors on sd5. These were not sufficient to crash it, nor for anyone to report it, but the tape backup had failed and it was clearly going down. I halted light gracefully and rebooted. I took the opportunity to remove the second drive bay, which is now destined for Majesty as mirror, and I installed a new $100 gold plated radium cored forced perfect terminator on the fast wide bus. There was no damage and light rebooted flawlessly twice. We may have fixed the 'refusing to boot more than 3 drives' problem, but we have not fixed the scsi bus error problems at all. The procedure I used to bring light down gracefully was: 1.) halt command which would not complete because so many errors were happening on sd5 it couldn't sync the drives. 2.) Stop-A which forcibly halts the processor in mid stride. 3.) Turned the drive bay off and then on again and let the drives come up to speed, which apparently cleans out the jammed bus. 4.) Go, which starts the processor again allowing it to complete the syncing and halting process. 
04/19/96 Friday 10:57pm EST Light crashed while formatting a recalcitrant new disk. "CPU panic on 0" Not sure why. Possibly the disk was not properly registered with the kernel, it had been complaining about read errors prior to trying to format it. Normally I would have rebooted the system at this point, but we have a moratorium on rebooting in place. The kernel complaint was possibly caused by having pulled the disk to get its serial number and replugging it. They are supposed to be hot pluggable though. Light rebooted just fine by the way, no damage. Moment of terror though. Most people probably can't conceive of being in fear when their computer crashes. But I go through it every time Light burps. Sometimes I just have to walk around until the adrenaline ceases. I can't continue living this way. Panic on cpu 0 indeed. Panic on operator 0 is more like it. 04/19/96 Friday 7:51pm EST Getting intermittent ring no answers. On Monday I am going to have call forwarding on ring no answer installed. It's going to cost $3/line per month or about $210/month for our 72 modems. ************************************************************************ ****************** CRASH OF 4/17/96 *********************************** RECENT EVENTS of 4/17/96 During the scheduled down time of 6pm to 7pm, after I had swapped in the new scsi bus, the tape drive and sd8 failed to come on line during the first boot attempt. There was some other impropriety in how I had configured the system which I do not remember now, and I thought the booting failure was possibly related to it (NOT). I fixed the minor error, and rebooted and it booted fine and the system came up at 7pm. Then I started a tape back up, which promptly jammed. So I rebooted Light, clearing out the jammed tape process, and started another backup. This one bombed out with a serious scsi error. Then I reset the drives to asynch mode in the kernel and rebooted again. Again the tape backup got a scsi error. 
Then I figured well maybe the cable has gotten ruffled, so I took the system down, and replaced the $100 gold plated radium cored cable with a $10 cheapie that had been working fine on Majesty. When I went to reboot, the tape drive wouldn't come on line at all, and neither would sd8d. Now this is exactly what happened two weekends ago during the crash, only that time I was not smart enough to realize what was going on, and whatever I did in my hurry, I managed to wipe out the FATS on all of the drives. This time I was very careful not to do anything stupid, and the drives so far have remained whole throughout all this. OK, so at about 11pm the system would not boot at all, the tape drive and sd8 wouldn't come on line no matter what I did. I then started thinking it might be various things and set about to test them. First thing I did was put the old scsi card back in. No change. The next thing I did was swap the drive bay which holds the 4 drives. We just happened to have a new one lying around which is destined to hold the web log stats and more drives. That was convenient. No change. Then I took all drives out of their drawers (these are hot pluggable drives) and checked and reseated their scsi jumpers and internal cables, all to no avail. Then I swapped in the CPU board from Majesty to see if it was the CPU card. No change. Then I noticed that if I booted with one of the drives off, the missing drive would come on line. Thus we have sd5 sd6 sd7 sd8, and sd8 was not coming on line, but if I turned off either 5 6 or 7, it would come on line just fine. (sd = scsi drive). Then I stuck both drive bays on the system and put in a fifth drive that was destined for web stats. Again, only three drives would come on line, I could turn off any two and the other three would come on line. Something has to be working overtime to do this. 
I saw that probably we could bring the system back up, if I moved the contents of sd8 to the single wide bus, and left only 5 6 and 7 on the fast wide bus. At this time it was about 2am, I was getting very tired and beginning to make mistakes (deadly mistakes), so I called it quits and went to bed for 4 hours, leaving the system off. When I got up at 6:45am, I went in to start putting together the new configuration, and it just booted all the way just fine. Somehow my getting some sleep made the system work. Now this in fact has been the story of this ISP, and I am beginning to think that we have something REALLY FLAKEY going on here that has not been spotted. Time and time again, over the past 9 months, I mean repeatedly, when the system was rebooted it was hell to bring back up, then it would suddenly start working again for no reason, particularly after having been left off for a while. Then it would run fine for a while, even days, then suddenly start getting scsi errors again, causing numerous crashes etc. OK, so in summary this is what has been swapped. 1.) The Sparc 20 mother board (months ago). No change. 2.) The scsi card. No change. 3.) The CPU card. No change. 4.) The drive bay. No change. 5.) The drives themselves, one at a time. No change. 6.) ALL cables, many times. No change. 7.) All terminators. No change. Presently, light is running, but I am afraid to take it down lest it not come up again. A tape job is fully jammed so backups can't be done without rebooting. Of course if I reboot, the tape drive won't come on line at all. Majesty is down because for some reason it won't boot with Light's CPU card in it. So things are a mess. 04/18/96 Thursday 10:43am EST I called the makers of the scsi card PTISP, and told them how the bus was only finding 3 drives out of 4. The guy told me to twiddle a number in the driver and recompile the kernel, which would allow the driver to recognize more than 3 drives (which is apparently all it was set to recognize.) 
I asked him, well how come we have been running on 4 drives all these months (albeit flakily) and he said he didn't know. I made the change and rebooted. All 4 drives were already on line from when I woke up, and they stayed on line during the reboot, so not much is proven there yet. However the full system tape backup completed which is a major improvement over 5 failures in a row. I had asked the guy, how could your fast wide bus software affect the single wide bus that the tape drive was on, and he said he didn't know. However I must point out that the tape drive is backing up drives on the fast wide bus and thus there might be some interaction between them. We did however get one minor scsi error during the backup, but not enough to bomb it out. So things are not working perfectly. 04/18/96 Thursday 11:49am EST Light is back up with its own CPU card, and Majesty and News are back up. Tape backups are working. Gonna skip tonight's down time. Been down enough. Homer ************************************************************************ ************************************************************************ 04/18/96 Thursday 08:46am EST Made change to scsi driver as suggested by tech support. scsi_ncmds_per_dev from 3 to 32. It is supposed to allow recognition of more than 3 drives. This doesn't explain why we have been running 4 drives for months. 04/18/96 Thursday 07:34am EST Light was down all night, and may be up and down all day. It is presently running in a crippled mode, tape backups are NOT working. Majesty and news are also down as it won't boot with Light's CPU board. I can't take Light down to swap the boards just yet because it probably won't reboot. The SCSI errors have finally reared their ugly head and are no longer playing patsy with me. More to come but I have to get on with debugging the system. 04/17/96 Wednesday 9:19pm EST Got a scsi bus error on single wide bus within 5 minutes of starting tape job. 
Taking light down to reset kernel to asynchronous mode. 04/17/96 Wednesday 8:52pm EST Light rebooted to clear out a jammed tape process. 04/17/96 Wednesday 7:41pm EST There is some evidence that there was an 'event' of some sort around 5:30pm this afternoon. We were out and when we came back, the number of people on the modems was real low, like everyone had been kicked off or Harmony had rebooted itself for no reason. The modem stats chart shows when this happened. I have no idea what it was. Some reports are coming in that lines are disconnecting and then people are getting ring no answers, then it clears up. Until we track this down I can't say much about it except keep on trying and reporting the problems. 04/17/96 Wednesday 7:10pm EST Lightlink was taken down at 6pm for scheduled hardware upgrades. There will be further down times this week at the same time period. 1.) New fast wide scsi card was swapped in for the original card to see if this affects the prevalence of scsi bus errors that have plagued us from the beginning. They are mostly occurring on sd5 now, which is the first of the fast wide drives. When the single wide drives for news were on light, most of the scsi errors were happening on the single wide bus, so we never tested a new double wide bus card. But as soon as news was moved to Majesty, the errors moved over to the double wide bus. If swapping the card fixes the scsi bus errors, then the old card will be sent back for repair or burial. If scsi bus errors continue to happen, adjustments will be made to the scsi programming interface, to see if that changes anything. 1a.) New kernel (jesb) was installed which allows for additions of more fast wide drives, and increases number of supportable virtual domains from 64 to 128. Scsi drives are presently in synchronous mode, which generally exacerbates scsi errors, and precipitates crashes from those that occur. 
Full tape backups are being done at least once and sometimes twice a day, and incremental backups are being done hourly. There are presently no hard drive backups which will start when we get the next 4 Gig drive on line. 2.) Drive sd4 was taken off the single wide scsi bus and returned to Majesty for use as another 4 gig News drive. Majesty now has its full 16 gig news complement back in order. sd4 was recruited from Majesty during the crash of two weekends ago to hold crash remains. The crash remains are now on tape and will remain so for a long time. 3.) Harmony was reburned with new configuration settings, one of which will hopefully allow Majesty to act as a password server for Harmony when Light is down. This will allow users to signon to Harmony from whence they can surf the web and read news. If Light is down they will not have access to their home, ftp or web directories, nor their e-mail. 4.) A new $100 gold plated radium cored SCSI cable was installed on Majesty for the news drives. Majesty by the way has never had a scsi error, but it doesn't have any fast wides either. 5.) Secondary name service was not working properly on Majesty while Light was down, this will need to be fixed. 04/16/96 Tuesday 12:57pm EST Light rebooted to clear out jammed /dev/ttyp1's 04/11/96 Thursday 5:19pm EST Scsi bus went out of control at about 1:30pm. Light was halted and rebooted. No damage. Scsi card is scheduled for replacement. 04/10/96 Wednesday 10:00pm EST ftpd daemon reset 04/10/96 Wednesday 2:47pm EST Modem banks reset themselves for unknown reasons bouncing everyone. I was playing with modem 40 (14A) 277 9530 trying to see why it was giving a ring no answer when suddenly the whole modem rack went off line. I was merely dialing in. When I first stuck the phone in 14A it was dead. Then when I tried it again, I had a dial tone. I am going to repunch it just to make sure the wires are in tight. 
There is also an odd occurrence in the modems.stat home page where every once in a while the stats page registers zero people on line for one 15 minute cycle. Is it possible that Harmony is down during these times and rebooting itself? 04/10/96 Wednesday 12:40am EST Got a ring no answer today, and a few reports of ring no answer or pickup after 3 or 4 rings which is really strange. 04/09/96 Tuesday 8:04pm EST News server reset with rebuilt install. 04/09/96 Tuesday 12:40am EST ftp daemon reset 04/08/96 Monday 5:03pm EST Majesty and News were taken down for 30 minutes to oil a squealing fan cooling our news disk drive bay. 04/07/96 Sunday 11:42pm EST ftpd daemon reset. 04/06/96 Saturday 12:03pm EST ftpd daemon reset news server reset 04/05/96 Friday 3:31pm EST I crashed the modem bank (I think) looking for the bad modem. Anyhow the whole bank went down, and harmony jammed. Harmony was rebooted, and modem banks 1 through 8 have been reburned with correct settings. To my chagrin the MultiTech management software does not load the modems with config files dependably. It leaves settings out randomly as it does all the modems. So each one has to be checked carefully by hand and set again if it didn't take hold on the first pass. 04/05/96 Friday 2:10pm EST Getting reports of a bad modem not letting people on. Will try to hunt down. 04/04/96 Thursday 6:03pm EST Light was rebooted to clear out a jammed tape process. I have been playing with the tape backup machine extensively to get a good feeling for it while we program a system to automate the handling and recording of tape backups. I did a tape erase just to see what it would do. It started to erase the tape (very slowly) so I decided to kill it. Turns out it is an unkillable process. Rather than wait 5 hours for the job to finish, I rebooted the system. As light was halting it said, "Some processes wouldn't die." No sh*t Sherlock. 04/03/96 Wednesday 11:09am EST News should be fully functional at this time. 
Going to swap out the fast wide scsi board with a new one. If that don't work, we are going to swap out the CPU card. If that don't work, we are going to get rid of the fast wide bus and see if a Sun can work with its own hardware. 04/03/96 Wednesday 02:56am EST Crashed again. 04/03/96 Wednesday 02:34am EST Light taken down to clear out jammed scsi bus. /dev/sd5a home directories caught in endless error loop. 04/02/96 Tuesday 9:01pm EST ftp daemon reset. 04/02/96 Tuesday 1:02pm EST Deer Park router is still down, some parts of the world may not be accessible. 04/02/96 Tuesday 12:38pm EST Nysernet Ithaca-1 router was down from 10:57 to 11:23am this morning. 04/01/96 Monday 12:28pm EST Sprint is accepting mail again. 04/01/96 Monday 12:15pm EST News is still flakey. Sporadic reports of people not being able to access server. Also Sprint is refusing to accept postings, they are looking into it. None are being lost, but none are being sent. Postings are being sent to uunet so we are not totally locked out. News is receiving fine. Same for secondary news feed, receiving fine, refusing to accept outgoing. They have been informed. New backup drives for light are on order. Hourly tape backups are being done. 03/31/96 Sunday 5:20pm EST Light rebooted to bring drives back up in asynch mode. 03/31/96 Sunday 3:06pm EST cron on light was down, web stats weren't being compiled, although none were lost (except when we were down). They are now up to date running every hour. uucp was also down, it should be working again. 03/31/96 Sunday 11:37am EST News is presently up and down as we try to fix various things with the history file. 03/31/96 Sunday 11:37am EST News was down over night due to incorrect permissions on alt.sex 03/31/96 Sunday 11:36am EST Lightlink was down from 7:30am to about 10pm on Saturday due to loss of File Allocation Tables across 4 main drives. 
03/29/96 Friday 2:44pm EST The external network connection to nysernet was down due to their hardware failure for a few minutes between 2:00pm and 2:30pm. They called and said it was fixed and apologized for the outage. 03/29/96 Friday 12:10pm EST Mail was down for a few hours due to /var filling up from an incoming spam of a user. 03/28/96 Thursday 12:25pm EST Light crashed during news expire. News will be moved off of light today some time. 03/28/96 Thursday 11:32am EST ftp jammed, filled up with interrupted processes, preventing people from logging on to ftp. This has to be fixed! 03/22/96 Friday 1:49pm EST /var ran out of disk space. news throttled, was down until now. 03/20/96 Wednesday 6:51pm EST Modem 3 producing ring no answer, now off line. 03/20/96 Wednesday 10:19am EST ftp server reset. 03/19/96 Tuesday 12:44pm EST Light crashed at 12:25pm during news expire. 03/19/96 Tuesday 09:00pm EST ftpd daemon jammed with broken jobs. Reset 03/14/96 Thursday 09:39am EST Light crashed and rebooted itself at 1:10am last night. News was down for the night. 03/14/96 Thursday 12:48am EST ftp daemon reset. 03/12/96 Tuesday 2:25pm EST News was throttled from alt/binaries filling up from 12:00pm this afternoon. 03/11/96 Monday 7:47pm EST /etc/nologin removed. 03/11/96 Monday 7:27pm EST Modem bank was down from 6:00pm til now. Preparing for installing of new modem bank. 03/09/96 Saturday 7:22pm EST Light reboot to clear out jammed tape job. Nynex busied out the bad modem, so the rack should be fine for the moment. He said there was a definite break at the Central Office somewhere. 03/09/96 Saturday 2:47pm EST It's a bad phone line, not a bad modem rack! 03/09/96 Saturday 11:14am EST The modem rack is divided into two sections by a bad slot at modem 26. Modems 1-25 start at 277-5026 Modems 28-48 start at 277-3567 Modem 27 is dedicated and not available for public use. 03/09/96 Saturday 01:15am EST The modem rack was reset at about 12:30am bouncing everyone off. 
Slot 26 is bad, it's not the modem. The modem tries to answer, but the slot doesn't convey the information. This produces a ring no answer when you run into that modem. This divides the rack into two sections of about 25 modems each, one starting at 277-5026 and one at 273-3567. If you get ring-no-answer on the first number try the second. If you still get ring no answer, then all modems are busy. Modems 9 and 17 were bad a few days ago, and resetting the rack fixed that. This indicates that there was nothing wrong with the modems which had been individually reset to no avail, and that the rack mother board is going bad. Today's travails are just more of the same. 03/08/96 Friday 5:43pm EST Modem 26 was the cause of the ring no answer. It would light up and pretend to answer, but it wouldn't. It is presently off line. 03/08/96 Friday 4:28pm EST We are having definite problems with ring no answer. That means you call up the modem banks, there are modems free, but the phone just rings. If you call up on one phone it will just ring, but a second phone will get a modem. This is often caused by a modem refusing to answer in the middle of a rotary, but every time I try to narrow it down to a modem, it picks up and someone gets on! The ring no answer also happens when all modems are busy, because there are 12 new lines at the end of the rotary that (sniff sniff) don't have any modems on them yet. 03/08/96 Friday 1:14pm EST Light crashed from news expire at 12:36pm. 03/05/96 Tuesday 5:11pm EST Light rebooted to clear out jammed tape backup process. 03/04/96 Monday 11:08am EST Light taken down to remove 64meg of memory to move over to Majesty for news server. Modems cold booted to help clear out 9 and 17. Seems to have worked. Harmony rebooted. 03/03/96 Sunday 1:03pm EST We have two modems that have gone sour, 9 and 17. They are presently off line. Apparently modem 17 has been bad since 3/1/96. 03/03/96 Sunday 11:53am EST Apache web server shifted over to new logging. 
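The ring no answer behavior described in the 03/08/96 and 03/09/96 entries above can be sketched as a small model. This is an illustration only, assuming the rotary hands each incoming call to the first line that is not already in use, and that a line whose modem (or slot) fails to answer simply keeps ringing while it holds that line. The function name and the line numbering from 0 are made up for the example.

```python
def place_call(n_lines, busy, dead):
    """Hunt through the rotary for the first line not in use.

    n_lines: number of lines in the rotary (assumed numbered from 0)
    busy:    set of line numbers already in use; updated in place
    dead:    set of line numbers whose modem or slot never answers
    Returns (line, outcome).
    """
    for line in range(n_lines):
        if line in busy:
            continue
        busy.add(line)  # a ringing call still occupies the line
        if line in dead:
            return line, "ring no answer"
        return line, "connected"
    return None, "all lines busy"


busy = {0, 1}   # two callers already on line
dead = {2}      # the bad modem sits next in the rotary
print(place_call(5, busy, dead))  # first caller lands on the dead line
print(place_call(5, busy, dead))  # second caller hunts past it
```

In this model the first caller rings forever on line 2, while a second call placed during that time skips past it and connects on line 3, which matches "a second phone will get a modem".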
03/03/96 Sunday 11:36am EST We suffered a bum modem this morning that prevented people from getting on beyond that modem. The phone would just ring and not pickup. That modem is now off line. 02/29/96 Thursday 09:26am EST Light crashed. A number of monitoring procedures we had in place to warn us that light was down failed. This probably resulted in a number of hours of down time without awareness that we were down. Just before light crashed, the load had gone very high for unknown reasons, possibly massive spamming through our remailer, which is now off line until further notice. Unfortunately two web hit log files were lost during the crash, and we are not presently keeping backups of them because they are so massive. With the new web hit software in place this will change. In the meanwhile my apologies to those who lost web hit files. 02/27/96 Tuesday 3:50pm EST Light was rebooted to clear out misbehavior of the server that wouldn't fix on its own. 02/27/96 Tuesday 3:08pm EST All referer and agent logs have been temporarily turned off as the server is crashing from them apparently. 02/27/96 Tuesday 2:01pm EST The apache server has run out of file handles. This will take some serious revamping to fix. This caused the server to not respond. 02/27/96 Tuesday 1:11pm EST apache web server reset Added new mime.types for x-director 02/26/96 Monday 5:00pm EST ftp server reset 02/23/96 Friday 9:22pm EST Apache server reset. 02/23/96 Friday 12:31pm EST Light crashed at 12:07pm during news expire. 02/22/96 Thursday 1:29pm EST FTP server reset, everyone bounced. 02/22/96 Thursday 01:58am EST News was killed momentarily and rebooted. 02/20/96 Tuesday 12:58pm EST I had to kill news a few times to get the expire to start working. Something has to change around here. 02/20/96 Tuesday 12:21pm EST Man was that a mess. The process which crashed light last night managed to start itself up from cron again, even though I had disabled the cron file. 
I guess I failed to restart cron so it had it in memory. This process took the load to 6.0 at around 2pm. This slowed everything down to a point where the various hard drive backups and tape backups didn't complete on time. Thus the tape backup started while the hard drive back up was still going, causing things to slow down even more. Then the tape back up failed to complete by the time other jobs started at 6am which failed to complete by the time news expire started at 11am, and by 11:30 the load was 12.0 and EVERYTHING was still running and nothing was getting done. I rebooted light to clear it all out, after trying to tear it apart process by process. There is only so much you can do with an autopsy. 02/19/96 Monday 11:27pm EST Light went out of control again for reasons that are not clear. (This was later determined to be caused by a user program scanning the news spool for data.) 02/19/96 Monday 10:57pm EST Light went out of control. I tried to bring it down gracefully and was barely able to halt it in time before it crashed for real. Reasons unknown. 02/15/96 Thursday 12:14pm EST Light was down for 10 minutes to oil a noisy fan in the news drive bay. It may still need to be replaced. 02/14/96 Wednesday 9:56pm EST Light crashed at 9:34pm probably caused by my moving a disk drive bay around. SCSI cables suck even when they cost $99 each. 02/13/96 Tuesday 01:24am EST Web server was down for 10 minutes for a recompile, to add referer and agent logs. 02/10/96 Saturday 11:43pm EST News was taken down for 5 minutes to check load averages used by news. 02/10/96 Saturday 01:58am EST The outbound mail queue was lost last night due to a spam. The mail queue holds outgoing mail that can not be delivered for various reasons at the receiving end. The sendmail program tries every hour to send the mail for many days and then quits. 02/08/96 Thursday 11:51pm EST Apparently news throttled itself when /var filled up. I did not see this and so news was down all afternoon. 
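Several entries here come down to /var filling up without anyone noticing in time. A minimal sketch of a check that could run from cron and warn before that happens; the 90 percent threshold and the function names are assumptions for the example, not part of the monitor program mentioned elsewhere in this log.

```python
import shutil


def percent_used(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(percent, threshold=90.0):
    """True when usage has reached the (assumed) warning threshold."""
    return percent >= threshold


def check(path="/var", threshold=90.0):
    """One monitor pass: print a warning when the partition is nearly full."""
    pct = percent_used(path)
    if over_threshold(pct, threshold):
        print(f"WARNING: {path} is {pct:.0f}% full")
    return pct
```

Calling `check("/var")` once per cron cycle would give a warning while there is still room left, rather than after news has already throttled itself.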
02/08/96 Thursday 4:03pm EST /var ran out of spool space today for a few minutes. This has been temporarily fixed. I also moved the www log directories to a new partition, so the web server was down for about 5 minutes. 02/06/96 Tuesday 12:09pm EST News was down for about an hour as we made preparations for moving it to Majesty. 02/05/96 Monday 12:09pm EST Light crashed at 11:45am during news expire. There was a huge load spike just prior indicating that the apache server had gone out of control, however all log records from 11:03am on were eradicated when the disks were fsck'd. The apache server is supposed to kill off excess servers and not allow them to go over 50 in any case, usually such a crash indicates 300 or more. I am going to install a little script that will reset the apache server any time the servers go over 50 as part of the monitor program. Monitor data can be found in /var/log/monitor and is open to public view through our http://www.lightlink.com/stats.html home page. 02/04/96 Sunday 1:08pm EST Kernel reinstalled yet again, light rebooted. Now we are running in Asynch mode again. OK hopefully this will handle the crashes. 02/04/96 Sunday 1:00pm EST jes kernel reinstalled scsi set to 0x58 and light rebooted. 02/04/96 Sunday 12:19pm EST Light crashed 11:46am. Tried to reset scsi bus to asynch mode, but it doesn't seem to be taking. It got reset from asynch back to synch during the last install of the new kernel. 02/03/96 Saturday 4:55pm EST Light crashed 4:31pm 02/03/96 Saturday 1:41pm EST Top 18 modems were disconnected and reseated in the correct order. 02/01/96 Thursday 4:05pm EST Light crashed 3:51pm 02/01/96 Thursday 11:26am EST Harmony is failing to do name service properly for unknown reasons. OK, this is fixed. 02/01/96 Thursday 12:34am EST Web server was up and down a few times as we played with new installation. No permanent changes yet. 
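The 02/05/96 entry above mentions a little script that resets the apache server any time the servers go over 50. A rough sketch of what one pass of such a check might look like, not the script actually installed on light: the process name `httpd`, the `ps -e -o comm=` invocation, and the restart command passed in by the caller are all assumptions for illustration.

```python
import subprocess

MAX_SERVERS = 50  # limit quoted in the 02/05/96 entry


def count_servers(ps_output, name="httpd"):
    """Count lines of ps output whose command name is exactly `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def needs_reset(server_count, limit=MAX_SERVERS):
    """True when the number of servers has gone over the limit."""
    return server_count > limit


def check_and_reset(restart_cmd):
    """One monitor pass: count servers, restart when over the limit.

    restart_cmd is a hypothetical command list supplied by the caller.
    """
    # Assumes a ps that accepts -e -o comm= and prints one bare
    # command name per process.
    out = subprocess.run(["ps", "-e", "-o", "comm="],
                         capture_output=True, text=True, check=True).stdout
    count = count_servers(out)
    if needs_reset(count):
        subprocess.run(restart_cmd, check=True)
    return count
```

Keeping the counting and the threshold test in their own functions means they can be checked against canned ps output without touching a live server.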
01/30/96 Wednesday 10:29am EST Light was rebooted to install new kernel to fix bug in tape drive software, and install NFS to bring Majesty on line. 01/30/96 Tuesday 12:32pm EST ftp jammed for a while due to too many zombie processes left over from something or other. I killed all ftp processes, perhaps bumping a few real ones in present time. 01/26/96 Friday 11:26am EST 11:00am Modems were rebooted and reburned with correct init strings. All modems visually checked for correct init speed and speed. Please report garbage on the screen during login incidents immediately. Lower 39 modems have retraining off, top 9 modems have retrain on. Harmony rebooted to clear out defunct routes in the cache. Light rebooted. 01/25/96 Thursday 4:17pm EST Light rebooted. We were getting load spikes, signifying impending crash. Have moved swap space back off of /sd3b to /sd0b. If this clears up the spikes then that means sd3 is bad. 01/25/96 Thursday 11:38am EST Light crashed at 11:25am from news expire during expireover. We are getting a second Sparc 20 which will carry news, so hopefully this will move the crashes off of light to that machine, until we can find out what is going on. 01/24/96 Wednesday 7:04pm EST CSU/DSU was brought off line for 2 minutes by accident via a loose power plug while moving it. 01/21/96 Sunday 10:50pm EST Light went unstable with spike loads of 101 or so. I was barely able to halt it gracefully, nothing would 'vfork', and we were getting stack errors on the swap partition. This was AFTER the move to /sd4. 01/21/96 Sunday 7:36pm EST After the crash this morning during news overview expire, we started to get repeated load spikes indicating near crashes. Since the only change that was made was to move the overview data base to /sd3a, this indicates that /sd3a is causing troubles. So now we are moving it to /sd4a. At the same time the erotica groups will be moved from /sd3d to /sd4h, and /sd3 will be taken fully off line to see if the system stabilizes. 
No news will be lost, but news will be fully down while the transfer takes place. 01/21/96 Sunday 11:56am EST Light crashed during news expire at 11:56am. It crashed during expiry of the overview data base, which is presently on /sd3a. When I first moved it to /sd3a a few days ago I noticed a number of high load spikes every few minutes which is usually a sign of instability and 'near' crashes. It is possible that /sd3a is causing some of our problems. I may consider moving the overview data base once again to /sd4a, taking /sd3a out of the loop. 01/19/96 Friday 6:19pm EST News overview data base is being rebuilt, to allow threaded news readers like tin to access old news. 01/19/96 Friday 3:39pm EST Light taken down for 10 minutes to move news overview data base to /sd3a. This caused the destruction of the old data base. No news was lost, however old news is not accessible until overview data base is rebuilt. 01/18/96 Thursday 9:50pm EST News was down for posting for 7 hours while the history data base was rebuilt. Obnoxious. 01/13/96 Saturday 3:45pm EST Light went out of control at 3:33pm from SCSI Bus errors on sd5a. I tried to take it down gracefully, but to no avail. 01/12/96 Friday 5:30pm EST News has been unreadable from modems 37 through 48 for a while. We have been running an open nntp server until recently, which means everyone in the world can read news from our server. In trying to locate where our outgoing bandwidth is going, I turned this off temporarily. By mistake modems 37 through 48 had never been explicitly enabled to read news, and were only enabled by default of everyone being able to read. When I closed off the open access, those modems were locked out also. They are now properly enabled as themselves. I may reopen the news port in the future, our bandwidth is NOT being consumed by news reading by external sites. 01/12/96 Friday 4:02pm EST Light was taken down at 2:40pm to install new Forced Perfect Terminator on SCSI bus. 
Rebooting failed a few times due to typos in rc.local. Later permissions were incorrectly changed on root directory locking a few people out for a few minutes. 01/11/96 Thursday 6:16pm EST News expire is presently acting strange. Expireover went for 7 hours and did not complete. I killed it. 01/06/96 Saturday 12:29pm EST Light crashed at 12:06pm during news expire. Installed new kernel with vif=64. adb -w /vmunix scsi_options?W 5* $q Presumably that sets the scsi bus to asynch mode. 12/31/95 Sunday 11:12am EST Web server was down from 11:00pm last night due to a failure to restart it properly. 12/27/95 Wednesday 07:35am EST Light taken down for 10 minutes to replace scsi cable. We are now running with two Gold Cables. Tape drive is first in the chain. 12/26/95 Tuesday 12:09pm EST Light crashed at 11:55am during news expire. 12/24/95 Sunday 12:07pm EST Light crashed at 11:58am during news expire presumably from scsi bus error. 12/20/95 Wednesday 09:25am EST Light taken down at 8:00am for system work on scsi bus. 12/19/95 Tuesday 10:17am EST Light crashed from scsi bus errors during news expire at 10:00am. 12/14/95 Thursday 09:28am EST Light crashed from Scsi bus errors during news expire at 9:07am. I took the opportunity to install two new gold plated cables ($99 each), so that sun techies can stop asking us if we are using lousy cables. 12/12/95 Tuesday 3:59pm EST Light taken down to bring news back on line. Before putting the disks on line, I put the main scsi cable in light and terminated it with the active terminator. Light would not boot, but gave nasty errors resulting in needing to manually fsck /dev/rsd0g. I replaced the cable with an earlier 6 foot cable, and did the same thing and light booted fine. Then I attached the news drives, and booted again. The tape drive is still off line. 12/12/95 Tuesday 12:17pm EST I took light down to take off the news drives which are suspected of causing the scsi bus errors. When I went to reboot, it wouldn't reboot. 
It wouldn't load vmunix, when it did load vmunix it said it couldn't find /sd8d which is on another bus entirely. I finally had to take the tape machine off the external scsi bus, and then it booted fine. Something is very strange in Denmark. 12/12/95 Tuesday 09:17am EST Harmony taken down to install 12 new lines. 12/10/95 Sunday 08:45am EST Light taken down to reinstall sd0. Going to go back to original factory configuration and take external drives off line to see if errors still happen. This will require that news be down. 12/09/95 Saturday 12:04pm EST After replacing the internal cable and terminating board, light became more unstable than before, producing load peaks of 20 (near crashes) every 5 minutes or so during news expire. I took light down and put back in the old terminator and cable, and also took the drive tower apart to verify that all termination settings were correct, and reseated all scsi cables. 12/09/95 Saturday 09:38am EST Light crashed, and then crashed again on reboot. Anyone want an ISP cheap? 12/09/95 Saturday 09:08am EST Light taken down to replace internal scsi terminator back plane. Got a scsi error 10 minutes after coming up again. Took light down again to replace internal scsi cable. That's the last of it, there is nothing more to replace except for the entire external disk drive set. Homer 12/08/95 Friday 2:37pm EST Sendmail was down for about an hour from a uucp typo. Sorry. 12/07/95 Thursday 12:51am EST Light crashed at 11:36pm. 12/05/95 Tuesday 08:13am EST Light taken down to unplug and reseat internal scsi cable from motherboard and pc hard drive termination board. System running with cable in place but no floppy nor CDROM (and no internal hard drive). Found permissions set wrong on web counter directory, causing web counters to fail. Web counters should be working correctly now. 
12/05/95 Tuesday 02:29am EST Coffee break 1:53am 12/04/95 Monday 12:55am EST Light took coffee break at 12:14am 12/03/95 Sunday 1:55pm EST Light was rebooted with new kernel. At the same time, just after compiling new kernel, tape backup became jammed, couldn't kill the process at all. Then shutdown jammed, until I turned off tape drive. Then boot -s jammed with "can't find swap on sd8b, no such etc." Turned everything off, and on, and boot worked fine. I changed the MAXUPROC from 25 to 256 in sys/param.h in kernel. Apache .8 has an apparent limit of 26 virtual hosts, but the change to 256 did not fix the problem. I also changed the max process id from 30000 to 300000. I wonder what happens when the machine runs into a max process id? 12/03/95 Sunday 11:44am EST Tape backup failed again with sd3 out of the loop. Now we take sd1 off line, which is the main news drive, so news will be down for a while. Homer 12/03/95 Sunday 09:27am EST Light took a coffee break at 5:37am, rebooting itself. The tape drive was off line, so it is not the tape drive electronics. I took light down at 7:00am to move /usr/local/ off of /sd3a and over to /sd8d. sd3 has been taken off line, as it is suspected as being the culprit for the crashes. News is a mess because light keeps crashing during expires, so we may have to clean the slate and start over. I may institute a policy whereby newsgroups are kept for a shorter period of time, except for those explicitly being read or requested. This will greatly aid in cleaning up the mess after such crashes. cgi's are still not working. 12/02/95 Saturday 8:21pm EST Light was rebooted to try and solve cgi failures by increasing kernel allocation tables (maxusers = 200) Cisco was brought off line to stop incoming hits on web site. Neither worked. At this time we have no idea why cgi's are not working. Homer 12/02/95 Saturday 7:57pm EST Cisco was rebooted again.
12/02/95 Saturday 3:26pm EST Cisco was rebooted and Harmony was rebooted, kicking everyone off, in order to clear out the old arp tables and install the new ethernet address of the motherboard. Prior to this the virtual domains were not working at all. 12/02/95 Saturday 2:22pm EST Light was taken down for 10 minutes to swap /sd3d and /sd7a. sd3d was the backup partition and sd7a was the erotica partition. Now it's the other way around. The purpose of this is to decrease the backup time as /sd3d is single wide and /sd7a is double wide. In the process all .pictures.erotica were lost. News is still down and will continue to be down for a while. 12/02/95 Saturday 1:36pm EST Light was down to replace the mother board for a few hours this morning. Unfortunately, this did not repair the problem. 11/28/95 Tuesday 07:59am EST System down for lab time from 7:30 till now. Main boot drive sd4 was moved from scsi id 5 to 3. The system boots by default from 3. This increases the chances the system will reboot itself after a severe crash. All swap space has been removed from the single wide scsi bus, to take the pressure off the bus and to decrease the likelihood that a scsi error will happen on the swap partition, an event most likely to crash the system. Sun has offered to send us a new motherboard which will be installed when it arrives. 11/27/95 Monday 2:07pm EST Due to a typo, sendmail was not functioning correctly for about 3 hours until now. 11/27/95 Monday 12:22pm EST Light crashed at 2:00am from scsi bus errors. We were down until 7:00am. Home drive is out again, however it is unlikely it is the source of the problem. Sun is sending a new mother board. 11/26/95 Sunday 08:37am EST System down for lab time from 8 until now. We put the home drive back in to see if it would fail to boot because it was cold. It booted fine. We then took light down for 30 minutes and turned it off to see if it would fail to boot because it was cold. It booted fine.
The cold idea came from the fact that light would not boot after being down the night of the blackout during the storm after being off for about 30 minutes. At this time I have no idea what is going on, we are still getting errors and are considering how to best approach the problem with minimum down time. 11/25/95 Saturday 08:37am EST System was down for lab time from 7:30 until now. We successfully moved all of the home drive partitions to the sd4, and got the machine to boot from sd4. Not obvious. If the system proves stable in this configuration, the home drive will be reinstalled, formatted to erase all its data, and sent off for replacement. News is still down for another 30 minutes or so. Harmony was stable over night. 11/24/95 Friday 08:53am EST The system was down for lab time from 7:30am until now. Harmony was stable over night. We moved the /usr and /var (mail) partitions to sd4 to get them off the failing home drive sd0. We are still booting from sd0 but all important partitions have been moved in anticipation of its replacement. Harmony was rebooted to set its configuration file correctly, for the new subnets. However apparently a bug in the harmony code prevents it from reading in subnetted masks properly, so they all have to be added by hand after the booting takes place. This will require more work and a few questions to Tech Support at Xylogics. 11/23/95 Friday 08:55am EST System was down for lab time. Harmony again lost its default route to the internet twice over night. We think we know what is causing it, but not why. The route loss seems to be connected with hand entering the route after deleting it for test purposes. When the route is added through the normal configuration procedure during boot up it is stable, if it is then deleted and reentered by hand it seems to have a time to live. We will query Xylogics tech support about this on Monday.
11/22/95 Wednesday 10:05am EST For reasons that are still unclear, Harmony lost its default route out to the internet this morning at about 8:30am. It should be working again at the moment. We may have to reboot harmony if things continue to be unstable. Thanks to all who brought this to my attention. 11/20/95 Monday 08:55am EST System was down from 7am until now for lab time. 11/17/95 Friday 12:19pm EST CSU/DSU was brought off line resulting in a 20 second disconnection from the internet. Since the storm it has been getting errors, Sprint will be testing it this morning between 6am and 7am. During that time there will be interruption of service to the internet. 11/16/95 Thursday 10:22am EST System was down from 8am to 9am for system work to rebuild new home drive. Expect continued down times during early morning hours. News was down until now. 11/15/95 Wednesday 04:58am EST System was down for 2 hours due to a power outage in upper college town. We then had a LOT of trouble bringing it back up again, possibly due to the same problem causing intermittent scsi bus errors, which are going to be dealt with terminatedly in short order. 11/12/95 Sunday 07:38am EST Cisco router was rebooted intentionally to clear out a problem with config file transfers. It didn't fix it. Connection to the internet was down for about 20 seconds. 11/08/95 Wednesday 07:18am EST System was down for an hour until now to add the new disk drive. All four 4gigs are now in one tower, this got rid of a SCSI cable bringing the total to two on scsi bus one. The system has been stable for a while, but it is not yet clear why, I believe the scsi bus errors have merely gone dormant. We have taken a load off the home drive by moving the swap partition to another drive, this may have contributed to the quieting of the errors. Please continue to expect down times during early morning hours between 3am and 7am.
11/06/95 Monday 09:17am EST System was down from about 6:00am to 7:00am to begin move of home drive over to new drive. It is suspected that the home drive is going bad which is causing all the SCSI bus errors. The system will be down every morning for some time until this is resolved, between 3am and about 8am. 11/05/95 Sunday 10:09am EST An error has been found in the Virtual Domain code, rebooting to repair. 11/05/95 Sunday 07:32am EST Going down again. Swap partition moved to /dev/sd7b in preparation for moving root directory to /dev/sd7. My guess presently is that sd0 main drive is bad. 11/05/95 Sunday 06:52am EST Going down to replace swap partition. 11/04/95 Saturday 03:06am EST System down for 10 minutes to change SCSI cables. All cables are brand new. Next check will be to change the swap partition off the home drive. 11/03/95 Friday 05:12am EST System was down for about 1 hour while we worked on scsi cables. 11/03/95 Thursday 12:02pm EST The system crashed for unknown reasons, and rebooted itself, probably due to sporadic I/O errors we have been getting on the primary SCSI bus. The system will be down tonight to work on the problem. 10/30/95 Monday 04:58am EST News was down for 2 hours due to disk full errors. Sprint has been down again, and maybe when they came back up, we got flooded with backlog that would have otherwise expired. 10/25/95 Wednesday 02:14am EST System was down for 2 hours due to system crash. Apparent cause was badly seated cables, complicated by resulting corrupted super blocks on 2 drives. 10/21/95 Saturday 04:51am EST Light rebooted to increase virtual domains to 32. Down for 20 minutes while I forgot to bring it back up again, having too much fun with new color scanner. 10/04/95 Wednesday 12:45am EST I crashed light playing with security holes. 10/02/95 Monday 04:48am EST Harmony was rebooted to increase time out to 40 minutes. 09/27/95 Wednesday 01:52am EST Harmony was down for about 15 minutes while we tried a security upgrade.
The upgrade failed. We are running as normal. Light was rebooted during the process. 09/26/95 Tuesday 03:35am EST We had a runaway process last night send the system load to 4.0 and then to 7.0. This made things very slow. 09/16/95 Saturday 01:38am EST System was down for 2 hours for upgrades. See news for details. 09/13/95 Wednesday 01:04am EST Routes out of light to the internet were down for about 30 minutes by accident. 09/08/95 Friday 9:15pm EST web server was down for about 2 hours by accident. 09/02/95 Saturday 4:59pm EST An error in programming caused all password files to be renamed improperly resulting in momentary loss of an ability to sign on to the system, and/or execute certain programs. 08/31/95 Thursday 3:23pm EST System rebooted to physically rearrange drives back into place after yesterday's fiasco. 08/30/95 Wednesday 8:46pm EST We suffered a total power outage this afternoon around 3pm. A transformer caught fire down the block towards college town. The system stayed up for 20 minutes into the power blackout, and then we had to shut down as the backup batteries had run out. During this time I chose to do some back up maintenance on two micro fans cooling disk drives that were beginning to whine and make noise, a sign that they would soon jam risking overheating and damage to the power supplies. When I went to hook the whole system back together about 30 minutes after the power came on, I found that it would not boot. This was apparently caused by SCSI connections gone bad. It took me three hours or so to get the system working again, after trying many different cables. The final answer was to cable the various components in a different order. A very dissatisfactory solution. 08/28/95 Monday 1:30pm EST System rebooted at 2am this morning to check out new virtual domain kernel. It didn't work, so system rebooted again with normal kernel. 08/24/95 Thursday 1:56pm EST News was down until 8am this morning, to rebuild the history data base.
08/23/95 Wednesday 8:18pm EST News has been down for 2 hours and will continue to be down for another 2 hours or so. 08/23/95 Wednesday 02:34am EST We managed to get the modems converted to 115200 without bringing down the system, however during the transitions some of the modems were set wrong, and some people may have gotten rings without picking up, or gotten garbage on the screen from mis-set modems. All should be working now. 08/21/95 Monday 6:45pm EST System was down for 2 hours for installation of 16 more modem lines. New modems have not yet been configured. 08/20/95 Sunday 5:43pm EST System was down for 3 hours for upgrades. Annex was rebooted at end of upgrade. See 'news' for details. 08/13/95 Sunday 7:50pm EST News was not responding to slip requests since last night due to an error in a configuration file. 08/12/95 Saturday 2:26pm EST System rebooted to clear out some things. 08/11/95 Friday 4:35pm EST FTP was broken for about 2 hours today while we tried to install the new version. We are presently running the old version. 08/10/95 Thursday 1:47pm EST System was taken down for about 30 minutes for upgrades. The annex was also rebooted to install new settings. 08/09/95 Wednesday 2:42pm EST News was down from about 12:00am last night till 2:30am when I spotted it. Reading was not affected, but incoming articles were stopped. The cause was an incorrect permission on the news directory junk which caused a failure to write to it. 08/06/95 Sunday 8:24pm EST System down for reboot to bring two new fast wide scsi drives on line. 08/05/95 Saturday 8:04pm EST System down for 30 minutes to install fast wide scsi bus card. 08/05/95 Saturday 7:32pm EST I created my first mail loop today. Took the system to load 31 or so, for a few minutes. 08/04/95 Friday 7:08pm EST T1 was down for 5 minutes while we rebooted the T1 CSU/DSU modem. We have a defective T1 modem, nothing serious, but I can't log on to it.
We are also getting data errors higher than normal, which is a completely separate matter. The modem will be replaced shortly, and the data errors are being looked into by Sprint. 08/04/95 Friday 3:34pm EST System down for 30 minutes. I had to bring the system down to clean up an error I made. 08/03/95 Thursday 7:16pm EST System rebooted to clean things up. 08/03/95 Thursday 7:05pm EST I crashed the system. 07/30/95 Sunday 7:07pm EST News was down for about an hour. 07/29/95 Saturday 8:25pm EST System rebooted.