marc.aronson
Posted: Fri Feb 20, 2009 8:17 pm
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
I have two machines on the network, both with r5.5 installed. Here are two scenarios:
1. NFS-mount the remote file system. Copy a 450MB file from the local machine to the NFS server. Throughput is 59Mbits/second. When I run top, between 50% and 80% of the client's two cores goes to "wa" (I/O wait).
2. CIFS-mount the remote file system. Copy a 450MB file from the local machine to the Samba server. Throughput is 95Mbits/second. When I run top, virtually no client time goes to "wa".
It's surprising that CIFS/Samba is delivering better performance than NFS, but the massive amount of CPU going to I/O wait under NFS is the fatal part: if I have a recording going at the same time on the client machine, I get buffer overruns in the backend. Has anyone else seen this, and any ideas how to correct it? Thanks. Specs for both machines are below.
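For anyone wanting to reproduce these numbers, here is a minimal sketch (the mount point and file names are placeholders, not from the post; iostat comes from the sysstat package):

```shell
# Create a 450MB dummy file, then time the copy onto the mounted share.
dd if=/dev/zero of=/tmp/testfile bs=1M count=450
time cp /tmp/testfile /mnt/nfs/      # throughput = 450MB / elapsed seconds
# In a second shell, watch the %iowait column while the copy runs:
iostat -c 2 5
```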
Marc
Client: Dual-core E8400, 3.0GHz, 2GB RAM, gigabit network interface, KM R5.5.
Server: P4 2.8GHz, 2GB RAM, gigabit network interface, KM R5.5.
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
graysky
Posted: Sat Feb 21, 2009 3:37 am
Joined: Wed Dec 10, 2003 8:31 pm
Posts: 1996
Location: /dev/null
Can't answer your specific question about NFS/Samba, but I too have observed Samba giving superior xfer speeds on my old hardware (Athlon XP-based systems). As a side note, you might be able to boost Samba xfer speeds further. See this post for a suggested mod to /etc/samba/smb.conf that worked for me.
Are you behind a switch? If so, can it and your NICs handle jumbo frames? If the answer is yes, enabling jumbo frames may further accelerate your LAN xfers.
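In case it helps, enabling jumbo frames is a one-liner per host, assuming the interface is eth0 and every device in the path (switch included) is configured for 9000-byte frames:

```shell
ifconfig eth0 mtu 9000                   # 'ip link set eth0 mtu 9000' also works
ifconfig eth0 | grep -o "MTU:[0-9]*"     # confirm the new MTU took effect
```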
P.S. Your CPU usage seems excessive to me given your hardware specs. Are both of those NICs on-board, or running in a PCI slot?
_________________ Retired KM user (R4 - R6.04); friend to LH users.
marc.aronson
Posted: Sat Feb 21, 2009 11:55 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
graysky, thanks for responding. Your post triggered a thought, and I've found out what's going on with the CPU utilization -- it turns out this is "normal". I tried a large file copy from one local disk to another and noticed that I/O wait times spike in that scenario too. It turns out that high CPU I/O wait times are normal during file I/O, and that this high utilization will not prevent another process from using the CPU, so I've been chasing a ghost. I also found this thread, which explains a bit more.
I ran an experiment and verified that with two CPU-intensive processes running concurrently with either the local copy or the NFS copy, I/O wait times drop to almost 0 as the other processes' CPU utilization goes to 100%.
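The experiment can be sketched like this (the copy target is a placeholder; start one busy loop per core):

```shell
# Saturate both cores with shell busy loops, then start the copy under test.
while :; do :; done & hog1=$!
while :; do :; done & hog2=$!
cp /tmp/testfile /mnt/nfs/ &          # hypothetical NFS mount point
top -b -n 3 | grep -i "cpu"           # %wa should fall toward 0 while %us climbs
kill $hog1 $hog2
```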
I've tried jumbo frames in the past and they did not provide a significant improvement. I seem to recall an article claiming that jumbo frames were mostly useful on older, slower equipment and not as much of an advantage on newer, faster gear, so perhaps that's why they didn't help me. Both NICs are on the motherboard.
I am still left with the questions of why NFS is slower than Samba and why the backend suffers buffer overruns, and I am now facing another problem:
When my network interface is under intense load, I suddenly lose network connectivity and find the error messages shown below in kern.log. In this case the intense I/O was happening over Samba, and the failure is happening on the server machine. In this scenario, the hardware is as follows:
Server: Intel Core 2 Duo E8400, MSI P6NG Neo-Digital motherboard with integrated Realtek 8201CL NIC, 2GB RAM, KnoppMyth R5.5.
Client: Windows Vista on a Dell desktop, Intel Core 2 Quad Q6600, 4GB RAM, integrated NIC.
I've seen this happen periodically with the server and have seen it happen with different clients. It only happens under intense load, and I am wondering if it is a driver issue.
Any thoughts on this one?
Marc
Quote:
Feb 20 19:27:50 mythhd kernel: NETDEV WATCHDOG: eth0: transmit timed out
Feb 20 19:27:50 mythhd kernel: eth0: Got tx_timeout. irq: 00000032
Feb 20 19:27:50 mythhd kernel: eth0: Ring at 2f814000
Feb 20 19:27:50 mythhd kernel: eth0: Dumping tx registers
Feb 20 19:27:50 mythhd kernel: 0: 00000032 000000ff 00000003 024803ca 00000000 00000000 00000000 00000000
Feb 20 19:27:50 mythhd kernel: 20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
tjc
Posted: Sat Feb 21, 2009 1:12 pm
Joined: Thu Mar 25, 2004 11:00 am
Posts: 9551
Location: Arlington, MA
marc.aronson wrote: I seem to recall finding an article that claimed that jumbo frames were mostly useful on older, slower equipment and not as much of an advantage on new, fast equipment, so perhaps that's why it didn't help me.
Well, it's actually more complicated than that: what you're really dealing with is overhead and latencies (turnaround time, etc.) versus the volume of data. Generally, the faster the raw bit rate on a communications channel of any kind, the bigger the chunks you want to send to maximize throughput over it.
There are analogies in other parts of CS. For example, if you have an 8-core box running a Java application with multiple threads (roughly speaking, 1.5x to 2.5x the number of cores is usually optimal) and you have single-threaded garbage collection, the throughput of the system drops by about ((Ncores-1)/Ncores) * (GCTime/TotalTime). This is almost exactly the same mechanism observed with high-speed comms when the transmit side of that HUGE pipe is waiting for a tiny little ACK message from the other side. (In case it's not clear, the GC time maps to the time spent waiting for an ACK so that you can "close the books" on a transmitted packet.) The throughput graph looks like this, where all the white space on the chart represents wasted bandwidth:
Code: [ASCII throughput chart: tall columns of # (bursts of data) separated by long flat stretches (idle gaps spent waiting for ACKs)]
(Vertical axis is data/work volume, horizontal axis is time)
We're also not even counting other limiting factors here, like disk and system bus speeds, simplex versus duplex transmission, ... but their main effect is to lower the vertical peaks on the graph and make the gaps on the horizontal axis longer or shorter. Real-world examples would also look slightly more chaotic, with broader and more irregular peaks and gaps; however, I can only do so much with ASCII graphics, and the basic shape remains the same.
I'll have to check with my math and operations research friends to be sure, but I think that the fundamental result here comes from queueing theory.
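To put rough numbers on the formula above (the figures below are invented for illustration, not measured): with 8 cores and single-threaded GC active 10% of wall-clock time,

```shell
# Throughput loss ~= ((Ncores-1)/Ncores) * (GCTime/TotalTime)
awk 'BEGIN { ncores=8; gc=0.10; printf "%.4f\n", ((ncores-1)/ncores)*gc }'
# prints 0.0875 -- roughly a 9% drop in aggregate throughput
```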
marc.aronson
Posted: Tue Feb 24, 2009 6:07 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
I'm still trying to chase down the "NETDEV WATCHDOG: eth0: transmit timed out" problem as I just got nailed again. My MSI P6NG Neo-Digital motherboard has a Realtek 8201CL controller. When I do an lspci I see
Quote:
00:0f.0 Ethernet controller: nVidia Corporation MCP73 Ethernet (rev a2)
        Subsystem: Micro-Star International Co., Ltd. Unknown device 7505
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 16
        Memory at fe977000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at c880 [size=8]
        Memory at fe97e800 (32-bit, non-prefetchable) [size=256]
        Memory at fe97e400 (32-bit, non-prefetchable) [size=16]
        Capabilities: [44] Power Management version 2
        Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Queue=0/3 Enable-
Does it make sense that the ethernet controller is being identified as being from nVidia, given that it's a Realtek controller?
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
marc.aronson
Posted: Tue Feb 24, 2009 11:42 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
OK, I'm starting to understand a bit more, but I need to ask a question: how can I determine which driver was loaded for my integrated ethernet device? The presence of the following messages leads me to believe it's the "forcedeth" driver from nVidia:
Quote:
Feb 24 07:13:31 mythhd kernel: forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
Feb 24 07:13:31 mythhd kernel: forcedeth: using HIGHDMA
Feb 24 07:13:31 mythhd kernel: eth0: forcedeth.c: subsystem: 01462:7505 bound to 0000:00:0f.0

Having said this, when I do an "lsmod" I don't see forcedeth listed -- below is what I see. I have a suspicion that this is an age-old issue with instability in the nVidia forcedeth driver, and that the work-around involves reloading the driver. Thanks for any help you can provide!
Marc
Quote:
Module Size Used by
nvidia 7100068 36
autofs4 22148 1
nfsd 219380 13
exportfs 8448 1 nfsd
lirc_pvr150 19512 3
lirc_dev 16132 1 lirc_pvr150
fintek71882 9988 0
hwmon 6404 1 fintek71882
ipv6 257956 33
af_packet 24584 0
agpgart 30808 1 nvidia
fuse 42900 0
raw1394 27388 2
dv1394 20572 0
pcmcia 36524 0
yenta_socket 26764 0
rsrc_nonstatic 14720 1 yenta_socket
pcmcia_core 36504 3 pcmcia,yenta_socket,rsrc_nonstatic
video 19472 0
output 6912 1 video
sbs 19848 0
fan 7684 0
dock 11668 0
container 7552 0
joydev 13248 0
battery 13832 0
ac 7940 0
aufs 126084 0
ftdi_sio 36360 0
usbhid 42496 0
ff_memless 8840 1 usbhid
usb_storage 79680 0
usbserial 34024 1 ftdi_sio
uhci_hcd 26000 0
nvram 11144 0
lgdt330x 12164 1
mt352 9988 0
dvb_pll 15652 2
stv0299 13576 0
nxt200x 16900 1
saa7134_dvb 18444 1
wm8775 9644 0
cx25840 29772 0
videobuf_dvb 8580 1 saa7134_dvb
tda1004x 19076 1 saa7134_dvb
ivtv 139200 3 lirc_pvr150
saa7115 19404 0
msp3400 33612 0
snd_hda_intel 347544 0
tuner 37712 0
tea5767 9860 1 tuner
tda8290 16132 1 tuner
tda18271 15620 1 tda8290
tda827x 13700 1 tda8290
tuner_xc2028 22800 1 tuner
tda9887 13188 1 tuner
tuner_simple 12424 1 tuner
mt20xx 15624 1 tuner
tea5761 8324 1 tuner
snd_pcm_oss 40608 0
snd_mixer_oss 18304 1 snd_pcm_oss
b2c2_flexcop_pci 11288 1
b2c2_flexcop 28428 1 b2c2_flexcop_pci
i2c_algo_bit 9604 1 ivtv
saa7134 125008 1 saa7134_dvb
snd_pcm 70916 2 snd_hda_intel,snd_pcm_oss
cx2341x 15236 1 ivtv
compat_ioctl32 5120 1 saa7134
videobuf_dma_sg 14724 3 saa7134_dvb,videobuf_dvb,saa7134
videobuf_core 18564 3 videobuf_dvb,saa7134,videobuf_dma_sg
ir_kbd_i2c 11664 1 saa7134
dvb_core 74656 4 lgdt330x,stv0299,videobuf_dvb,b2c2_flexcop
snd_timer 23300 1 snd_pcm
snd_page_alloc 11912 2 snd_hda_intel,snd_pcm
snd_hwdep 11012 1 snd_hda_intel
videodev 30336 2 ivtv,saa7134
v4l2_common 19712 9 wm8775,cx25840,ivtv,saa7115,msp3400,tuner,saa7134,cx2341x,videodev
v4l1_compat 17668 2 ivtv,videodev
ir_common 34180 2 saa7134,ir_kbd_i2c
firmware_class 11392 9 lirc_pvr150,pcmcia,nxt200x,saa7134_dvb,cx25840,tda1004x,ivtv,tuner_xc2028,b2c2_flexcop
snd 52644 6 snd_hda_intel,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer,snd_hwdep
tveeprom 18320 2 ivtv,saa7134
thermal 16540 0
ohci_hcd 23556 0
ehci_hcd 34572 0
i2c_core 23680 30 nvidia,lirc_pvr150,lgdt330x,mt352,dvb_pll,stv0299,nxt200x,saa7134_dvb,wm8775,cx25840,tda1004x,ivtv,saa7115,msp3400,tuner,tea5767,tda8290,tda18271,tda827x,tuner_xc2028,tda9887,tuner_simple,mt20xx,tea5761,b2c2_flexcop,i2c_algo_bit,saa7134,ir_kbd_i2c,v4l2_common,tveeprom
pcspkr 6528 0
serio_raw 9348 0
button 10128 0
processor 32296 1 thermal
soundcore 10080 1 snd
rtc_cmos 11168 0
rtc_core 18568 1 rtc_cmos
rtc_lib 6656 1 rtc_core
evdev 12928 0
tsdev 11456 0
usbcore 125448 8 ftdi_sio,usbhid,usb_storage,usbserial,uhci_hcd,ohci_hcd,ehci_hcd
sbp2 23048 0
ohci1394 32432 2 dv1394
ieee1394 83896 4 raw1394,dv1394,sbp2,ohci1394
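One way to answer the "which driver" question directly is via sysfs, which works even when the driver is built into the kernel rather than loaded as a module -- which would also explain why forcedeth doesn't show up in lsmod. (Interface name assumed to be eth0; paths per 2.6-era kernels.)

```shell
# The sysfs driver symlink names whatever driver is bound to the device:
readlink /sys/class/net/eth0/device/driver
# e.g. ../../../bus/pci/drivers/forcedeth
basename "$(readlink /sys/class/net/eth0/device/driver)"
```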
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
abigailsweetashoney
Posted: Tue Feb 24, 2009 12:33 pm
Joined: Tue Nov 14, 2006 2:55 pm
Posts: 245
Location: South Jersey
Marc,
Are you sure you're at full duplex? Have you looked at the netstat -ni output for errors? Have you tried a Cat 6 cable and/or removing the hubs/switches?
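A quick way to check both points, assuming the interface is eth0 and ethtool/net-tools are installed:

```shell
ethtool eth0 | grep -E "Speed|Duplex"   # expect 1000Mb/s and Full on gigabit
netstat -ni                             # RX-ERR/TX-ERR columns should stay at 0
```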
_________________ R6.04, dual core 3ghz, 3 gig memory, Zotac 8400 passive heat sink dvi/hdmi out video, 500 gig sata, dual tuner hdhomerun, streamzap remote
Abby
marc.aronson
Posted: Wed Feb 25, 2009 2:29 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
abigailsweetashoney, good suggestion, but it's not the problem. I started to have problems with network throughput dropping way down, so I put a spare gigabit PCI card into the system, and I am now seeing transfer rates faster than I've ever seen before -- 400Mbits/second of data on a Samba transfer from the myth box to a Windows Vista box. At this point I am very suspicious that it's either a hardware issue with the on-board NIC or a driver issue.
It looks like nVidia provides the integrated NIC driver -- if I update to the latest nVidia driver, will it also update me to the latest nVidia NIC driver?
Marc
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
marc.aronson
Posted: Tue Sep 01, 2009 8:56 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
marc.aronson wrote:
I'm still trying to chase down the "NETDEV WATCHDOG: eth0: transmit timed out" problem as I just got nailed again. My MSI P6NG Neo-Digital motherboard has a Realtek 8201CL controller. When I do an lspci I see
Quote:
00:0f.0 Ethernet controller: nVidia Corporation MCP73 Ethernet (rev a2)
        Subsystem: Micro-Star International Co., Ltd. Unknown device 7505
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 16
        Memory at fe977000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at c880 [size=8]
        Memory at fe97e800 (32-bit, non-prefetchable) [size=256]
        Memory at fe97e400 (32-bit, non-prefetchable) [size=16]
        Capabilities: [44] Power Management version 2
        Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Queue=0/3 Enable-
Does it make sense that the ethernet controller is being identified as being from nVidia, given that it's a Realtek controller?
After six months of chasing this problem on an on-again/off-again basis, I've finally nailed it down to two discrete issues:
1. A bad coupler between two cables was causing the problem to occur with increasing frequency -- as often as every few minutes under load. It was located in a vulnerable spot, and I suspect it got "bongo'ed" during a cleaning of that room. I replaced it with a new coupler, and the problem frequency dropped to once every 2-3 hours under intense load.
2. I then reduced the "max xmit" and "buffer size" Samba parameters in /etc/samba/smb.conf from 65535 to 8192. I have been running for 5 hours under load without a problem.
I'm not sure I understand why step # 2 made a difference, but all is well that ends well...
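For anyone who wants to try the same change, the edit would look roughly like this in the [global] section of /etc/samba/smb.conf (parameter names as given in the post above; whether these values make sense will vary by network and Samba version):

```ini
[global]
    ; cap SMB packet and buffer sizes at 8K instead of 64K
    max xmit = 8192
    buffer size = 8192
```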
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
mihanson
Posted: Tue Sep 01, 2009 2:13 pm
Joined: Sun Sep 25, 2005 3:50 pm
Posts: 1013
Location: Los Angeles
marc.aronson wrote: How can I determine which driver was loaded for my integrated ethernet device?
Ok, I know I'm late to the party and it may not matter to you anymore, but if you want to know what ethernet driver is being used for a particular interface:
Code:
[mythtv@mythbox-mbe ~]$ sudo ethtool -i eth0
driver: e1000e
version: 0.3.3.3-k6
firmware-version: 0.15-4
bus-info: 0000:0d:00.0
FWIW, I'm having issues on my main workstation when I transfer large files (25GB-ish) to my backend machine over NFS. The driver for the onboard NIC in my workstation is forcedeth. I was seeing some RX errors on the workstation and noticed that auto-negotiation was selecting 100Mbit/s full duplex when it should have been choosing 1000Mbit/s full duplex. I swapped out the NIC yesterday (installed a Realtek-based card using the r8169 driver), but it did not help matters. Speed and duplex with the Realtek NIC were correctly selected (1000/full). iperf showed stellar throughput (900+ Mbit/s), but real-world file transfers (via cp, mv, and rsync) were stalling and taking forever.
Given the info I've stumbled upon in this thread, I may give Samba a try. In the meantime, I found that pulling the large files over NFS (ssh into the server, mount the workstation over NFS, and initiate the transfer from the server side) was much more stable and faster than pushing them (workstation to NFS-mounted server), but transfer speeds were nowhere near the 900+ Mbit/s I saw with iperf. More like a steady 200 Mbit/s.
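One knob that sometimes helps a slow NFS push is larger read/write block sizes on the client mount. A hypothetical /etc/fstab line (server name, export, and sizes are placeholders, not taken from this thread):

```
server:/export  /mnt/backend  nfs  rsize=32768,wsize=32768,hard,intr  0  0
```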
_________________ Mike
My Hardware Profile
mihanson
Posted: Tue Sep 01, 2009 4:27 pm
Joined: Sun Sep 25, 2005 3:50 pm
Posts: 1013
Location: Los Angeles
mihanson wrote: Given the info I've stumbled upon in this thread, I may give Samba a try. In the meantime, I found that pulling the large files over NFS (ssh into the server, mount the workstation over NFS, and initiate the transfer from the server side) was much more stable and faster than pushing them (workstation to NFS-mounted server), but transfer speeds were nowhere near the 900+ Mbit/s I saw with iperf. More like a steady 200 Mbit/s.
Had a chance to get Samba installed and running. Results are slightly better than my "pull" scenario above. With Samba I can push the large file from workstation to server at about 23-27MB/s, or about 175-200Mbit/s. Way better than pushing over NFS, but about the same as pulling.
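For anyone following the unit juggling, the MB/s-to-Mbit/s conversion spelled out (1 MB/s = 8 Mbit/s, so 23-27MB/s actually lands slightly above the quoted range):

```shell
awk 'BEGIN { printf "%d-%d Mbit/s\n", 23*8, 27*8 }'
# prints 184-216 Mbit/s
```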
_________________ Mike
My Hardware Profile
marc.aronson
Posted: Sat Oct 03, 2009 11:15 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
marc.aronson wrote:
marc.aronson wrote:
I'm still trying to chase down the "NETDEV WATCHDOG: eth0: transmit timed out" problem as I just got nailed again. My MSI P6NG Neo-Digital motherboard has a Realtek 8201CL controller. When I do an lspci I see
Quote:
00:0f.0 Ethernet controller: nVidia Corporation MCP73 Ethernet (rev a2)
        Subsystem: Micro-Star International Co., Ltd. Unknown device 7505
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 16
        Memory at fe977000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at c880 [size=8]
        Memory at fe97e800 (32-bit, non-prefetchable) [size=256]
        Memory at fe97e400 (32-bit, non-prefetchable) [size=16]
        Capabilities: [44] Power Management version 2
        Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Queue=0/3 Enable-
Does it make sense that the ethernet controller is being identified as being from nVidia, given that it's a Realtek controller?
After six months of chasing this problem on an on-again/off-again basis, I've finally nailed it down to two discrete issues:
1. A bad coupler between two cables was causing the problem to occur with increasing frequency -- as often as every few minutes under load. It was located in a vulnerable spot, and I suspect it got "bongo'ed" during a cleaning of that room. I replaced it with a new coupler, and the problem frequency dropped to once every 2-3 hours under intense load.
2. I then reduced the "max xmit" and "buffer size" Samba parameters in /etc/samba/smb.conf from 65535 to 8192. I have been running for 5 hours under load without a problem.
I'm not sure I understand why step #2 made a difference, but all is well that ends well...
And the saga continues. While I did achieve stability when doing Samba-based copies from a Windows box to my mythtv box, I subsequently found the problem occurred frequently when watching recordings stored on the mythtv box and played on my Networked Media Tank. The NMT uses NFS to mount the mythtv file system.
Many experiments later, I tried adding the "noapic" option to the LILO boot options. Things were stable enough that I have tried putting "max xmit" and "buffer size" back to 65535. I am still running tests, but so far things look good. Of course, I've felt that way before and then had the problem come back.
Does anyone understand why the noapic option might resolve the problem I am seeing?
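For reference, a sketch of how the option is made permanent in /etc/lilo.conf (the image stanza below is made up, not copied from my system; remember to rerun lilo afterwards so the change takes effect):

```
image=/boot/vmlinuz
    label=KnoppMyth
    root=/dev/hda1
    append="noapic"
```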
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.
marc.aronson
Posted: Mon Oct 05, 2009 12:47 am
Joined: Tue Jan 18, 2005 2:07 am
Posts: 1532
Location: California
I ran two 5-hour stress tests, and the network remained stable at all times. I restored all buffers to the larger sizes for these tests. During the tests I had the following three concurrent jobs:
1. Samba-based copies between myth box and Windows Vista box at gigabit speeds.
2. NFS-based playback from the myth box to a Linux-based NMT (Networked Media Tank) box. The NMT has a 100Mbps NIC.
3. FTP-based copies from the myth box to a Linux-based NAS (brand: Airlink101). The Airlink NAS has a 100Mbps NIC.
So it looks like the "noapic" option was the key.
_________________ Marc
The views expressed are my own and do not necessarily reflect the views of my employer.