server generates Delayed Write errors copying Very large Files

Discussion in 'Windows 64bit' started by Phil Lewis, May 24, 2008.

  1. Phil Lewis

    Phil Lewis Guest

    Apologies as this turned out to be rather lengthy....

    I run a development engineering lab for a financial services company and we
    are running into a rather peculiar but very troubling problem on some of our
    performance servers when copying, moving or backing up very large files,
    i.e. files greater than about 52 Gigabytes in size (yes GB not MB).
    Primarily these are SQL Server database files but we also see the same
    problems copying or relocating large Virtual Server Hard Drive files of that
    size or larger.
    The problem is actually an old one that I think has not been dealt with, the
    end result for us on Windows Server 2000 and Windows Server 2003 R2 is that
    after copying about 52 Gbytes of the file, Windows starts reporting "Windows
    delayed write" errors and at that point the file copy collapses and stops.
    Although the system reports the copy is still running, no further data is
    being successfully copied. All the file IO and other windows processes
    slowdown considerably (more on this in a little bit). In each case where a
    delayed write error is generated, Event Viewer shows the first error as
    being event ID 50 and or Event 26. The problem is seen when using Drive
    letters and UNC paths and we asked about hotfixes for Server 2003 SP2 but
    were told by support there were none as the fix described in KB Article
    890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into SP2 and
    did not apply to our issue.

    We have only recently begun to see these problems because until recently
    most of our performance testing model used fairly small working sets for
    data (typically under 100GB total) and thus each file-group in our databases
    was less than 40GB so we never really saw a problem. I've seen this problem
    on ALL versions of Windows including Server 2008. In the server 2008 case,
    the O/S collapses completely and cannot be shutdown.In most cases we have to
    power-off the server to get the problem system to recover, no delayed write
    error is reported on server 2008.

    I've tested this on a variety of servers (listed next) and in as many cases
    as possible I tested on multiple servers with the same config and with
    different O/S editions. I also tested one of the servers that was
    experiencing the problems first, using small files (files from 1byte up to
    40 GigaBytes in size). The test set I use is approx 870GB in total size and
    has been continuously copying for about 40 days now continuously on this
    server, I think at last check it had copied around 2,973 Terabytes of data
    on this server, all without error.

    Primary Test Servers and configurations
    HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB SAN
    Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage 4.2TB on an
    EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2
    SP2 x86 and x64, Win Svr 2008 x64.
    DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB on an
    EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2
    SP2 x86 and WinSvr2008 x86
    Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB of local
    storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box via iSCSI WinSvr
    2003R2 SP2 x64
    With the exception of the ML570 I've tested on multiple servers of the same
    type. All servers have succeeded in copying large volumes of files 40Gb and
    smaller.

    On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We don't mix
    spindle speeds in raid groups. On our Dells, all local drives are SAS, on
    the EMC AX150 all drives are are SATA and on the SATABoy all drives are
    SATA. On all systems file copies using small files up to 0 - 40GB are all
    successful.
    All systems are running the latest BIOS and we have seen the same behavior
    on prior BIOS versions. All disk controller firware is updated to the latest
    version and like the BIOSes the same behavior existed on earlier versions.
    We have checked and updated hard disk firmware, where new versions are
    available. Same issues as for controller firmware. On local Hard drives we
    run Raid 1 or Raid 5 to get best performance or max capacity. Both Raid
    modes exhibit the same behavior. I have tested on the HP's with no Raid at
    all and the same results occurred.

    I try not to do specialty O/S builds for our lab environment. I build a
    straight default Windows O/S configurations, fully patch it with microsoft
    Patches and burn in test the system, then go test for this problem. I do not
    tweak system settings or apply registry hacks until I get baseline test
    data. In all cases here for the file copy tests I have not tweaked the
    system settings or registry at all. Our servers are set for background
    performance for the system cache, although we tested with 'foreground' set
    without success too. We've tried large and small pagefiles and have moved
    pagefiles to seperate disk spindles to see if it made a difference.
    I've tried all manner of file copy and file sync tools, but what it comes
    down to that if the file being copied is written on the systems' storage
    system (local or SAN), the system will collapse and file copying will fail
    somewhere around 52 - 58GB being copied. Windows Server 2008 has given me my
    best look into the problem and what appears to be happening is that the
    system cache keeps expanding until all physical memory is used and the paged
    pool keeps growing until it hits around 380MB and non paged hits about 82MB
    (I think the latter is right). What I then see is the CPU goes flat line as
    does the Total Disk byes written in Perfmon but the Physical memory usage
    history in task manager suddenly starts ramping until it gets out around
    70GB and then everything is done and either the system hangs (server2008) or
    delayed write errors occur.

    One place where I do not seem to see the problem is SAN Drive to SAN Drive
    Copies on the EMC SAN. I always see this problem on the SATABOY SAN with
    large files when copying large files to the SAN volumes regardless of Cache
    Settings. I can backup the files to tape but due to their size its an
    expensive option both in media cost and time to backup and restore the data,
    my preferred option is to backup to removeable Hard Disk (External USB -
    SATA), sure its slower but it offers operating efficiencies right now I
    cannot get with Tape (if it worked). Some testing has centered on using
    external drives, but most of the testing on my systems has been to copy or
    move the files from one volume to another on the server. It doesn't matter
    whether I turn caching off or on for external USB drives or local drives. I
    have had very occasional success on the EMC SAN ensuring that windows O/S
    disk caching is off. Success using this method has been spotty and limited
    to servers with 32GB or more of RAM. One very telling test setup was to
    populate the Server with 128GB RAM and run windows Server 2008 x64. In that
    case almost all file copies were successful, although they became painfully
    slow after about 60GB was copied and it took over 5 hours to copy the last
    ~35GB of a 95.8 GB file.

    I tried using Backup Exec and MsBackup to backup the files to a hard disk
    but it failed everytime, when I run the same backup to tape it is
    successful.
    This is leading me to think the problem is generated in the lower level file
    system filter drivers. I've tested with and without AntiVirus software in
    the mix and have likewise tested systems that are built raw with no patches
    at all and see the same problems. I also tried splitting the file into
    chunks and copying the pieces and while I can split the file, I cannot join
    it again as the processes all seem to rely on creating a temp file and the
    process of copying the large temp file always results in the delayed write
    errors being generated. I've also tried zipping up the file to reduce its
    size (database files compress really well) but that process likewise
    requires a file copy of a large file, and at some point that fails. It
    should go without saying I've tried using SQL Db Backup writing to disk
    storage and it fails everytime, tape is successful. In fact it was this very
    act that caused me to begin investigating the problem in the first place.

    In days gone by there were loads of users seeing this problem on Win XP
    copying much smaller files and there are some other people seeing this
    problem today on Windows Server, Microsoft are very quiet on the subject for
    windows server, I think in no small part because very few people are seeing
    the problem and there is no demand to identify the problem or to fix it. I
    cannot believe though I'm one of the first people to see the problem.
    I'll quite happily accept its a configuration issue if someone can tell me
    how to fix the problem! All attempt to tweak a system has not yielded any
    success. What also sucks is that you typically have to wait ~30 mins to find
    out that the problem will manifest.

    I'm about ready to escalate this issue to Microsoft, I think I now have
    enough test data to do so, but thought I would bounce this off others to see
    if anyone else has a solution or guidance first.

    Phil
    Checkfree:
    Phil Lewis, May 24, 2008
    #1
    1. Advertising

  2. The data section of the event ID 50 contains a lot of useful information for
    diagnosing the exact cause of the issue. Could you check the "words" radio
    button and post the dump data? Also, what is the source of the ID 50? is
    it mrxsmb or disk?




    "Phil Lewis" <> wrote in message
    news:...
    > Apologies as this turned out to be rather lengthy....
    >
    > I run a development engineering lab for a financial services company and
    > we are running into a rather peculiar but very troubling problem on some
    > of our performance servers when copying, moving or backing up very large
    > files, i.e. files greater than about 52 Gigabytes in size (yes GB not MB).
    > Primarily these are SQL Server database files but we also see the same
    > problems copying or relocating large Virtual Server Hard Drive files of
    > that size or larger.
    > The problem is actually an old one that I think has not been dealt with,
    > the end result for us on Windows Server 2000 and Windows Server 2003 R2 is
    > that after copying about 52 Gbytes of the file, Windows starts reporting
    > "Windows delayed write" errors and at that point the file copy collapses
    > and stops. Although the system reports the copy is still running, no
    > further data is being successfully copied. All the file IO and other
    > windows processes slowdown considerably (more on this in a little bit).
    > In each case where a delayed write error is generated, Event Viewer shows
    > the first error as being event ID 50 and or Event 26. The problem is seen
    > when using Drive letters and UNC paths and we asked about hotfixes for
    > Server 2003 SP2 but were told by support there were none as the fix
    > described in KB Article 890352 [
    > http://support.microsoft.com/kb/890352/ ] was rolled into SP2 and did not
    > apply to our issue.
    >
    > We have only recently begun to see these problems because until recently
    > most of our performance testing model used fairly small working sets for
    > data (typically under 100GB total) and thus each file-group in our
    > databases was less than 40GB so we never really saw a problem. I've seen
    > this problem on ALL versions of Windows including Server 2008. In the
    > server 2008 case, the O/S collapses completely and cannot be shutdown.In
    > most cases we have to power-off the server to get the problem system to
    > recover, no delayed write error is reported on server 2008.
    >
    > I've tested this on a variety of servers (listed next) and in as many
    > cases as possible I tested on multiple servers with the same config and
    > with different O/S editions. I also tested one of the servers that was
    > experiencing the problems first, using small files (files from 1byte up to
    > 40 GigaBytes in size). The test set I use is approx 870GB in total size
    > and has been continuously copying for about 40 days now continuously on
    > this server, I think at last check it had copied around 2,973 Terabytes
    > of data on this server, all without error.
    >
    > Primary Test Servers and configurations
    > HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    > storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB SAN
    > Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    > Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    > DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage 4.2TB on
    > an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr
    > 2003R2 SP2 x86 and x64, Win Svr 2008 x64.
    > DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB on an
    > EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr
    > 2003R2 SP2 x86 and WinSvr2008 x86
    > Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB of
    > local storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box via iSCSI
    > WinSvr 2003R2 SP2 x64
    > With the exception of the ML570 I've tested on multiple servers of the
    > same type. All servers have succeeded in copying large volumes of files
    > 40Gb and smaller.
    >
    > On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We don't
    > mix spindle speeds in raid groups. On our Dells, all local drives are SAS,
    > on the EMC AX150 all drives are are SATA and on the SATABoy all drives are
    > SATA. On all systems file copies using small files up to 0 - 40GB are all
    > successful.
    > All systems are running the latest BIOS and we have seen the same behavior
    > on prior BIOS versions. All disk controller firware is updated to the
    > latest version and like the BIOSes the same behavior existed on earlier
    > versions. We have checked and updated hard disk firmware, where new
    > versions are available. Same issues as for controller firmware. On local
    > Hard drives we run Raid 1 or Raid 5 to get best performance or max
    > capacity. Both Raid modes exhibit the same behavior. I have tested on the
    > HP's with no Raid at all and the same results occurred.
    >
    > I try not to do specialty O/S builds for our lab environment. I build a
    > straight default Windows O/S configurations, fully patch it with microsoft
    > Patches and burn in test the system, then go test for this problem. I do
    > not tweak system settings or apply registry hacks until I get baseline
    > test data. In all cases here for the file copy tests I have not tweaked
    > the system settings or registry at all. Our servers are set for background
    > performance for the system cache, although we tested with 'foreground' set
    > without success too. We've tried large and small pagefiles and have moved
    > pagefiles to seperate disk spindles to see if it made a difference.
    > I've tried all manner of file copy and file sync tools, but what it comes
    > down to that if the file being copied is written on the systems' storage
    > system (local or SAN), the system will collapse and file copying will fail
    > somewhere around 52 - 58GB being copied. Windows Server 2008 has given me
    > my best look into the problem and what appears to be happening is that the
    > system cache keeps expanding until all physical memory is used and the
    > paged pool keeps growing until it hits around 380MB and non paged hits
    > about 82MB (I think the latter is right). What I then see is the CPU goes
    > flat line as does the Total Disk byes written in Perfmon but the Physical
    > memory usage history in task manager suddenly starts ramping until it gets
    > out around 70GB and then everything is done and either the system hangs
    > (server2008) or delayed write errors occur.
    >
    > One place where I do not seem to see the problem is SAN Drive to SAN Drive
    > Copies on the EMC SAN. I always see this problem on the SATABOY SAN with
    > large files when copying large files to the SAN volumes regardless of
    > Cache Settings. I can backup the files to tape but due to their size its
    > an expensive option both in media cost and time to backup and restore the
    > data, my preferred option is to backup to removeable Hard Disk (External
    > USB - SATA), sure its slower but it offers operating efficiencies right
    > now I cannot get with Tape (if it worked). Some testing has centered on
    > using external drives, but most of the testing on my systems has been to
    > copy or move the files from one volume to another on the server. It
    > doesn't matter whether I turn caching off or on for external USB drives or
    > local drives. I have had very occasional success on the EMC SAN ensuring
    > that windows O/S disk caching is off. Success using this method has been
    > spotty and limited to servers with 32GB or more of RAM. One very telling
    > test setup was to populate the Server with 128GB RAM and run windows
    > Server 2008 x64. In that case almost all file copies were successful,
    > although they became painfully slow after about 60GB was copied and it
    > took over 5 hours to copy the last ~35GB of a 95.8 GB file.
    >
    > I tried using Backup Exec and MsBackup to backup the files to a hard disk
    > but it failed everytime, when I run the same backup to tape it is
    > successful.
    > This is leading me to think the problem is generated in the lower level
    > file system filter drivers. I've tested with and without AntiVirus
    > software in the mix and have likewise tested systems that are built raw
    > with no patches at all and see the same problems. I also tried splitting
    > the file into chunks and copying the pieces and while I can split the
    > file, I cannot join it again as the processes all seem to rely on creating
    > a temp file and the process of copying the large temp file always results
    > in the delayed write errors being generated. I've also tried zipping up
    > the file to reduce its size (database files compress really well) but that
    > process likewise requires a file copy of a large file, and at some point
    > that fails. It should go without saying I've tried using SQL Db Backup
    > writing to disk storage and it fails everytime, tape is successful. In
    > fact it was this very act that caused me to begin investigating the
    > problem in the first place.
    >
    > In days gone by there were loads of users seeing this problem on Win XP
    > copying much smaller files and there are some other people seeing this
    > problem today on Windows Server, Microsoft are very quiet on the subject
    > for windows server, I think in no small part because very few people are
    > seeing the problem and there is no demand to identify the problem or to
    > fix it. I cannot believe though I'm one of the first people to see the
    > problem.
    > I'll quite happily accept its a configuration issue if someone can tell me
    > how to fix the problem! All attempt to tweak a system has not yielded any
    > success. What also sucks is that you typically have to wait ~30 mins to
    > find out that the problem will manifest.
    >
    > I'm about ready to escalate this issue to Microsoft, I think I now have
    > enough test data to do so, but thought I would bounce this off others to
    > see if anyone else has a solution or guidance first.
    >
    > Phil
    > Checkfree:
    >
    John Fullbright, May 24, 2008
    #2
    1. Advertising

  3. In news:,
    Phil Lewis <> typed:
    > Apologies as this turned out to be rather lengthy....
    >
    > I run a development engineering lab for a financial services company
    > and we are running into a rather peculiar but very troubling problem
    > on some of our performance servers when copying, moving or backing up
    > very large files, i.e. files greater than about 52 Gigabytes in size
    > (yes GB not MB). Primarily these are SQL Server database files but we
    > also see the same problems copying or relocating large Virtual Server
    > Hard Drive files of that size or larger.
    > The problem is actually an old one that I think has not been dealt
    > with, the end result for us on Windows Server 2000 and Windows Server
    > 2003 R2 is that after copying about 52 Gbytes of the file, Windows
    > starts reporting "Windows delayed write" errors and at that point the
    > file copy collapses and stops. Although the system reports the copy
    > is still running, no further data is being successfully copied. All
    > the file IO and other windows processes slowdown considerably (more
    > on this in a little bit). In each case where a delayed write error
    > is generated, Event Viewer shows the first error as being event ID 50
    > and or Event 26. The problem is seen when using Drive letters and
    > UNC paths and we asked about hotfixes for Server 2003 SP2 but were
    > told by support there were none as the fix described in KB Article
    > 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into
    > SP2 and did not apply to our issue.
    > We have only recently begun to see these problems because until
    > recently most of our performance testing model used fairly small
    > working sets for data (typically under 100GB total) and thus each
    > file-group in our databases was less than 40GB so we never really saw
    > a problem. I've seen this problem on ALL versions of Windows
    > including Server 2008. In the server 2008 case, the O/S collapses
    > completely and cannot be shutdown.In most cases we have to power-off
    > the server to get the problem system to recover, no delayed write
    > error is reported on server 2008.
    > I've tested this on a variety of servers (listed next) and in as many
    > cases as possible I tested on multiple servers with the same config
    > and with different O/S editions. I also tested one of the servers
    > that was experiencing the problems first, using small files (files
    > from 1byte up to 40 GigaBytes in size). The test set I use is approx
    > 870GB in total size and has been continuously copying for about 40
    > days now continuously on this server, I think at last check it had
    > copied around 2,973 Terabytes of data on this server, all without
    > error.
    > Primary Test Servers and configurations
    > HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    > storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB
    > SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    > Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    > DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage
    > 4.2TB on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy
    > Box. WinSvr 2003R2 SP2 x86 and x64, Win Svr 2008 x64.
    > DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB
    > on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box.
    > WinSvr 2003R2 SP2 x86 and WinSvr2008 x86
    > Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB
    > of local storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box
    > via iSCSI WinSvr 2003R2 SP2 x64
    > With the exception of the ML570 I've tested on multiple servers of
    > the same type. All servers have succeeded in copying large volumes of
    > files 40Gb and smaller.
    >
    > On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We
    > don't mix spindle speeds in raid groups. On our Dells, all local
    > drives are SAS, on the EMC AX150 all drives are are SATA and on the
    > SATABoy all drives are SATA. On all systems file copies using small
    > files up to 0 - 40GB are all successful.
    > All systems are running the latest BIOS and we have seen the same
    > behavior on prior BIOS versions. All disk controller firware is
    > updated to the latest version and like the BIOSes the same behavior
    > existed on earlier versions. We have checked and updated hard disk
    > firmware, where new versions are available. Same issues as for
    > controller firmware. On local Hard drives we run Raid 1 or Raid 5 to
    > get best performance or max capacity. Both Raid modes exhibit the
    > same behavior. I have tested on the HP's with no Raid at all and the
    > same results occurred.
    > I try not to do specialty O/S builds for our lab environment. I build
    > a straight default Windows O/S configurations, fully patch it with
    > microsoft Patches and burn in test the system, then go test for this
    > problem. I do not tweak system settings or apply registry hacks until
    > I get baseline test data. In all cases here for the file copy tests I
    > have not tweaked the system settings or registry at all. Our servers
    > are set for background performance for the system cache, although we
    > tested with 'foreground' set without success too. We've tried large
    > and small pagefiles and have moved pagefiles to seperate disk
    > spindles to see if it made a difference. I've tried all manner of file
    > copy and file sync tools, but what it
    > comes down to that if the file being copied is written on the
    > systems' storage system (local or SAN), the system will collapse and
    > file copying will fail somewhere around 52 - 58GB being copied.
    > Windows Server 2008 has given me my best look into the problem and
    > what appears to be happening is that the system cache keeps expanding
    > until all physical memory is used and the paged pool keeps growing
    > until it hits around 380MB and non paged hits about 82MB (I think the
    > latter is right). What I then see is the CPU goes flat line as does
    > the Total Disk byes written in Perfmon but the Physical memory usage
    > history in task manager suddenly starts ramping until it gets out
    > around 70GB and then everything is done and either the system hangs
    > (server2008) or delayed write errors occur.
    > One place where I do not seem to see the problem is SAN Drive to SAN
    > Drive Copies on the EMC SAN. I always see this problem on the SATABOY
    > SAN with large files when copying large files to the SAN volumes
    > regardless of Cache Settings. I can backup the files to tape but due
    > to their size its an expensive option both in media cost and time to
    > backup and restore the data, my preferred option is to backup to
    > removeable Hard Disk (External USB - SATA), sure its slower but it
    > offers operating efficiencies right now I cannot get with Tape (if
    > it worked). Some testing has centered on using external drives, but
    > most of the testing on my systems has been to copy or move the files
    > from one volume to another on the server. It doesn't matter whether I
    > turn caching off or on for external USB drives or local drives. I
    > have had very occasional success on the EMC SAN ensuring that windows
    > O/S disk caching is off. Success using this method has been spotty
    > and limited to servers with 32GB or more of RAM. One very telling
    > test setup was to populate the Server with 128GB RAM and run windows
    > Server 2008 x64. In that case almost all file copies were successful,
    > although they became painfully slow after about 60GB was copied and
    > it took over 5 hours to copy the last ~35GB of a 95.8 GB file.
    > I tried using Backup Exec and MsBackup to backup the files to a hard
    > disk but it failed everytime, when I run the same backup to tape it is
    > successful.
    > This is leading me to think the problem is generated in the lower
    > level file system filter drivers. I've tested with and without
    > AntiVirus software in the mix and have likewise tested systems that
    > are built raw with no patches at all and see the same problems. I
    > also tried splitting the file into chunks and copying the pieces and
    > while I can split the file, I cannot join it again as the processes
    > all seem to rely on creating a temp file and the process of copying
    > the large temp file always results in the delayed write errors being
    > generated. I've also tried zipping up the file to reduce its size
    > (database files compress really well) but that process likewise
    > requires a file copy of a large file, and at some point that fails.
    > It should go without saying I've tried using SQL Db Backup writing to
    > disk storage and it fails everytime, tape is successful. In fact it
    > was this very act that caused me to begin investigating the problem
    > in the first place.
    > In days gone by there were loads of users seeing this problem on Win
    > XP copying much smaller files and there are some other people seeing
    > this problem today on Windows Server, Microsoft are very quiet on the
    > subject for windows server, I think in no small part because very few
    > people are seeing the problem and there is no demand to identify the
    > problem or to fix it. I cannot believe though I'm one of the first
    > people to see the problem. I'll quite happily accept its a configuration
    > issue if someone can
    > tell me how to fix the problem! All attempt to tweak a system has not
    > yielded any success. What also sucks is that you typically have to
    > wait ~30 mins to find out that the problem will manifest.
    >
    > I'm about ready to escalate this issue to Microsoft, I think I now
    > have enough test data to do so, but thought I would bounce this off
    > others to see if anyone else has a solution or guidance first.
    >
    > Phil
    > Checkfree:


    Interesting dillema. I have not seen it before.

    What are the Source names for the event IDs you posted?

    Out of curiousity, I looked up EventID 50, but there are numerous Source
    names, none of which their descriptions did I find to indicate a "Windows
    Delayed Write" message, as you've described. Looking through different
    possible error messages of "{Delayed Write Failed}," I found numerous
    possiblities, but you would have to sort through them. One indicated a
    problem wtih the MFT, another indicated issues with fragmentation causing
    it, and one concerining clustering drivers conflicts and one more about
    compatibility issues with the HBA the SAN is connected to.

    See if you can weed through the following EventID links. Look up based on
    the Source name. If you don't have a subscription and it indicates an MSKB,
    prefix the KB number so it the URL reads
    "htttp://support.microsoft.com/kb/######" (###### being the KB number).

    http://eventid.net/display.asp?eventid=50&source=
    http://eventid.net/display.asp?eventid=26&source=


    --
    Regards,
    Ace

    This posting is provided "AS-IS" with no warranties or guarantees and
    confers no rights.

    Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT,
    MVP Microsoft MVP - Directory Services
    Microsoft Certified Trainer

    For urgent issues, you may want to contact Microsoft PSS directly. Please
    check http://support.microsoft.com for regional support phone numbers.

    Infinite Diversities in Infinite Combinations
    Ace Fekay [MVP], May 24, 2008
    #3
  4. Being that they're at Windows 2003 SP2 or Windows 2008, my guess would be
    CcDirtyPageThreshold is set too high. Let's see the details of the event ID
    50. The event ID 26 is really immaterial, it's just the application popup
    message (you got the dialog that says lost delayed write)., The evend ID 50
    will contain the error code passed up the driver stack, as well as any SCSI
    sense data. Between that and possibly a short perfmon run during a copy, we
    should be able to sort this out. My guess is that we'll see half of RAM
    cached before writing starts, then you'll either see c000009a (out of
    resources) or a timeout due to slow disk once dirty pages are flushed. In
    perfom, you'll see heavy paging once dirty pages start flushing, and this is
    really the cause of the lack of server responsiveness.

    The narrative

    "Windows Server 2008 has given me my best look into the problem and
    >> what appears to be happening is that the system cache keeps expanding
    >> until all physical memory is used and the paged pool keeps growing
    >> until it hits around 380MB and non paged hits about 82MB (I think the
    >> latter is right). What I then see is the CPU goes flat line as does
    >> the Total Disk byes written in Perfmon but the Physical memory usage
    >> history in task manager suddenly starts ramping until it gets out
    >> around 70GB and then everything is done and either the system hangs"


    would be supportive of this guess.

    John



    "Ace Fekay [MVP]" <> wrote in message
    news:%...
    > In news:,
    > Phil Lewis <> typed:
    >> Apologies as this turned out to be rather lengthy....
    >>
    >> I run a development engineering lab for a financial services company
    >> and we are running into a rather peculiar but very troubling problem
    >> on some of our performance servers when copying, moving or backing up
    >> very large files, i.e. files greater than about 52 Gigabytes in size
    >> (yes GB not MB). Primarily these are SQL Server database files but we
    >> also see the same problems copying or relocating large Virtual Server
    >> Hard Drive files of that size or larger.
    >> The problem is actually an old one that I think has not been dealt
    >> with, the end result for us on Windows Server 2000 and Windows Server
    >> 2003 R2 is that after copying about 52 Gbytes of the file, Windows
    >> starts reporting "Windows delayed write" errors and at that point the
    >> file copy collapses and stops. Although the system reports the copy
    >> is still running, no further data is being successfully copied. All
    >> the file IO and other windows processes slowdown considerably (more
    >> on this in a little bit). In each case where a delayed write error
    >> is generated, Event Viewer shows the first error as being event ID 50
    >> and or Event 26. The problem is seen when using Drive letters and
    >> UNC paths and we asked about hotfixes for Server 2003 SP2 but were
    >> told by support there were none as the fix described in KB Article
    >> 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into
    >> SP2 and did not apply to our issue.
    >> We have only recently begun to see these problems because until
    >> recently most of our performance testing model used fairly small
    >> working sets for data (typically under 100GB total) and thus each
    >> file-group in our databases was less than 40GB so we never really saw
    >> a problem. I've seen this problem on ALL versions of Windows
    >> including Server 2008. In the server 2008 case, the O/S collapses
    >> completely and cannot be shutdown.In most cases we have to power-off
    >> the server to get the problem system to recover, no delayed write
    >> error is reported on server 2008.
    >> I've tested this on a variety of servers (listed next) and in as many
    >> cases as possible I tested on multiple servers with the same config
    >> and with different O/S editions. I also tested one of the servers
    >> that was experiencing the problems first, using small files (files
    >> from 1byte up to 40 GigaBytes in size). The test set I use is approx
    >> 870GB in total size and has been continuously copying for about 40
    >> days now continuously on this server, I think at last check it had
    >> copied around 2,973 Terabytes of data on this server, all without
    >> error.
    >> Primary Test Servers and configurations
    >> HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    >> storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >> HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB
    >> SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >> HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    >> Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    >> DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage
    >> 4.2TB on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy
    >> Box. WinSvr 2003R2 SP2 x86 and x64, Win Svr 2008 x64.
    >> DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB
    >> on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box.
    >> WinSvr 2003R2 SP2 x86 and WinSvr2008 x86
    >> Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB
    >> of local storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box
    >> via iSCSI WinSvr 2003R2 SP2 x64
    >> With the exception of the ML570 I've tested on multiple servers of
    >> the same type. All servers have succeeded in copying large volumes of
    >> files 40Gb and smaller.
    >>
    >> On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We
    >> don't mix spindle speeds in raid groups. On our Dells, all local
    >> drives are SAS, on the EMC AX150 all drives are are SATA and on the
    >> SATABoy all drives are SATA. On all systems file copies using small
    >> files up to 0 - 40GB are all successful.
    >> All systems are running the latest BIOS and we have seen the same
    >> behavior on prior BIOS versions. All disk controller firware is
    >> updated to the latest version and like the BIOSes the same behavior
    >> existed on earlier versions. We have checked and updated hard disk
    >> firmware, where new versions are available. Same issues as for
    >> controller firmware. On local Hard drives we run Raid 1 or Raid 5 to
    >> get best performance or max capacity. Both Raid modes exhibit the
    >> same behavior. I have tested on the HP's with no Raid at all and the
    >> same results occurred.
    >> I try not to do specialty O/S builds for our lab environment. I build
    >> a straight default Windows O/S configurations, fully patch it with
    >> microsoft Patches and burn in test the system, then go test for this
    >> problem. I do not tweak system settings or apply registry hacks until
    >> I get baseline test data. In all cases here for the file copy tests I
    >> have not tweaked the system settings or registry at all. Our servers
    >> are set for background performance for the system cache, although we
    >> tested with 'foreground' set without success too. We've tried large
    >> and small pagefiles and have moved pagefiles to seperate disk
    >> spindles to see if it made a difference. I've tried all manner of file
    >> copy and file sync tools, but what it
    >> comes down to that if the file being copied is written on the
    >> systems' storage system (local or SAN), the system will collapse and
    >> file copying will fail somewhere around 52 - 58GB being copied.
    >> Windows Server 2008 has given me my best look into the problem and
    >> what appears to be happening is that the system cache keeps expanding
    >> until all physical memory is used and the paged pool keeps growing
    >> until it hits around 380MB and non paged hits about 82MB (I think the
    >> latter is right). What I then see is the CPU goes flat line as does
    >> the Total Disk byes written in Perfmon but the Physical memory usage
    >> history in task manager suddenly starts ramping until it gets out
    >> around 70GB and then everything is done and either the system hangs
    >> (server2008) or delayed write errors occur.
    >> One place where I do not seem to see the problem is SAN Drive to SAN
    >> Drive Copies on the EMC SAN. I always see this problem on the SATABOY
    >> SAN with large files when copying large files to the SAN volumes
    >> regardless of Cache Settings. I can backup the files to tape but due
    >> to their size its an expensive option both in media cost and time to
    >> backup and restore the data, my preferred option is to backup to
    >> removeable Hard Disk (External USB - SATA), sure its slower but it
    >> offers operating efficiencies right now I cannot get with Tape (if
    >> it worked). Some testing has centered on using external drives, but
    >> most of the testing on my systems has been to copy or move the files
    >> from one volume to another on the server. It doesn't matter whether I
    >> turn caching off or on for external USB drives or local drives. I
    >> have had very occasional success on the EMC SAN ensuring that windows
    >> O/S disk caching is off. Success using this method has been spotty
    >> and limited to servers with 32GB or more of RAM. One very telling
    >> test setup was to populate the Server with 128GB RAM and run windows
    >> Server 2008 x64. In that case almost all file copies were successful,
    >> although they became painfully slow after about 60GB was copied and
    >> it took over 5 hours to copy the last ~35GB of a 95.8 GB file.
    >> I tried using Backup Exec and MsBackup to backup the files to a hard
    >> disk but it failed everytime, when I run the same backup to tape it is
    >> successful.
    >> This is leading me to think the problem is generated in the lower
    >> level file system filter drivers. I've tested with and without
    >> AntiVirus software in the mix and have likewise tested systems that
    >> are built raw with no patches at all and see the same problems. I
    >> also tried splitting the file into chunks and copying the pieces and
    >> while I can split the file, I cannot join it again as the processes
    >> all seem to rely on creating a temp file and the process of copying
    >> the large temp file always results in the delayed write errors being
    >> generated. I've also tried zipping up the file to reduce its size
    >> (database files compress really well) but that process likewise
    >> requires a file copy of a large file, and at some point that fails.
    >> It should go without saying I've tried using SQL Db Backup writing to
    >> disk storage and it fails everytime, tape is successful. In fact it
    >> was this very act that caused me to begin investigating the problem
    >> in the first place.
    >> In days gone by there were loads of users seeing this problem on Win
    >> XP copying much smaller files and there are some other people seeing
    >> this problem today on Windows Server, Microsoft are very quiet on the
    >> subject for windows server, I think in no small part because very few
    >> people are seeing the problem and there is no demand to identify the
    >> problem or to fix it. I cannot believe though I'm one of the first
    >> people to see the problem. I'll quite happily accept its a configuration
    >> issue if someone can
    >> tell me how to fix the problem! All attempt to tweak a system has not
    >> yielded any success. What also sucks is that you typically have to
    >> wait ~30 mins to find out that the problem will manifest.
    >>
    >> I'm about ready to escalate this issue to Microsoft, I think I now
    >> have enough test data to do so, but thought I would bounce this off
    >> others to see if anyone else has a solution or guidance first.
    >>
    >> Phil
    >> Checkfree:

    >
    > Interesting dillema. I have not seen it before.
    >
    > What are the Source names for the event IDs you posted?
    >
    > Out of curiousity, I looked up EventID 50, but there are numerous Source
    > names, none of which their descriptions did I find to indicate a "Windows
    > Delayed Write" message, as you've described. Looking through different
    > possible error messages of "{Delayed Write Failed}," I found numerous
    > possiblities, but you would have to sort through them. One indicated a
    > problem wtih the MFT, another indicated issues with fragmentation causing
    > it, and one concerining clustering drivers conflicts and one more about
    > compatibility issues with the HBA the SAN is connected to.
    >
    > See if you can weed through the following EventID links. Look up based on
    > the Source name. If you don't have a subscription and it indicates an
    > MSKB, prefix the KB number so it the URL reads
    > "htttp://support.microsoft.com/kb/######" (###### being the KB number).
    >
    > http://eventid.net/display.asp?eventid=50&source=
    > http://eventid.net/display.asp?eventid=26&source=
    >
    >
    > --
    > Regards,
    > Ace
    >
    > This posting is provided "AS-IS" with no warranties or guarantees and
    > confers no rights.
    >
    > Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT,
    > MVP Microsoft MVP - Directory Services
    > Microsoft Certified Trainer
    >
    > For urgent issues, you may want to contact Microsoft PSS directly. Please
    > check http://support.microsoft.com for regional support phone numbers.
    >
    > Infinite Diversities in Infinite Combinations
    >
    >
    John Fullbright, May 24, 2008
    #4
  5. Phil Lewis

    Phil Lewis Guest

    Hi John
    Event 50 is being generated by Disk. I think that is being seen on all
    servers, though I've seen one or two cases where mrxsmb was the source when
    doing network copies. I can also confirm that at least 50% of RAM was indeed
    cached, in most cases it was much higher. While we have seen c000009a (out
    of resources), in almost all cases, disk timeouts occurred as a failure
    point and yes indeed we experience very heavy paging when or just before
    the system tips over. On Server 2008 we were seeing the dedicated pagefile
    disk active solidly for over an hour before we finally gave up waiting for
    the copy to quit and the system to shutdown.
    I'll get the event viewer dumps later tonight and I can also get a Perfmon
    Dump as well. Let me know what counters you would like to see included so I
    get you exactly the necessary data. tonight, I can just load a Perfmon
    template on the server and dump the output if you want to specify a bunch of
    counters.

    I have not played with CcDirtyPageThreshold yet, I was extremely tempted
    last week, but have thus far resisted. We run SP2 on all our WinSvr2003
    servers. and one article I read had a hotfix for the problem that applied
    only to SP1. Microsoft Support confirmed the fix in the article was included
    in SP2 and I've not yet added the DWord value to modify it.

    Phil

    ..
    "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    news:...
    > Being that they're at Windows 2003 SP2 or Windows 2008, my guess would be
    > CcDirtyPageThreshold is set too high. Let's see the details of the event
    > ID 50. The event ID 26 is really immaterial, it's just the application
    > popup message (you got the dialog that says lost delayed write)., The
    > evend ID 50 will contain the error code passed up the driver stack, as
    > well as any SCSI sense data. Between that and possibly a short perfmon
    > run during a copy, we should be able to sort this out. My guess is that
    > we'll see half of RAM cached before writing starts, then you'll either see
    > c000009a (out of resources) or a timeout due to slow disk once dirty
    > pages are flushed. In perfom, you'll see heavy paging once dirty pages
    > start flushing, and this is really the cause of the lack of server
    > responsiveness.
    >
    > The narrative
    >
    > "Windows Server 2008 has given me my best look into the problem and
    >>> what appears to be happening is that the system cache keeps expanding
    >>> until all physical memory is used and the paged pool keeps growing
    >>> until it hits around 380MB and non paged hits about 82MB (I think the
    >>> latter is right). What I then see is the CPU goes flat line as does
    >>> the Total Disk byes written in Perfmon but the Physical memory usage
    >>> history in task manager suddenly starts ramping until it gets out
    >>> around 70GB and then everything is done and either the system hangs"

    >
    > would be supportive of this guess.
    >
    > John
    >
    >
    >
    > "Ace Fekay [MVP]" <> wrote in message
    > news:%...
    >> In news:,
    >> Phil Lewis <> typed:
    >>> Apologies as this turned out to be rather lengthy....
    >>>
    >>> I run a development engineering lab for a financial services company
    >>> and we are running into a rather peculiar but very troubling problem
    >>> on some of our performance servers when copying, moving or backing up
    >>> very large files, i.e. files greater than about 52 Gigabytes in size
    >>> (yes GB not MB). Primarily these are SQL Server database files but we
    >>> also see the same problems copying or relocating large Virtual Server
    >>> Hard Drive files of that size or larger.
    >>> The problem is actually an old one that I think has not been dealt
    >>> with, the end result for us on Windows Server 2000 and Windows Server
    >>> 2003 R2 is that after copying about 52 Gbytes of the file, Windows
    >>> starts reporting "Windows delayed write" errors and at that point the
    >>> file copy collapses and stops. Although the system reports the copy
    >>> is still running, no further data is being successfully copied. All
    >>> the file IO and other windows processes slowdown considerably (more
    >>> on this in a little bit). In each case where a delayed write error
    >>> is generated, Event Viewer shows the first error as being event ID 50
    >>> and or Event 26. The problem is seen when using Drive letters and
    >>> UNC paths and we asked about hotfixes for Server 2003 SP2 but were
    >>> told by support there were none as the fix described in KB Article
    >>> 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into
    >>> SP2 and did not apply to our issue.
    >>> We have only recently begun to see these problems because until
    >>> recently most of our performance testing model used fairly small
    >>> working sets for data (typically under 100GB total) and thus each
    >>> file-group in our databases was less than 40GB so we never really saw
    >>> a problem. I've seen this problem on ALL versions of Windows
    >>> including Server 2008. In the server 2008 case, the O/S collapses
    >>> completely and cannot be shutdown.In most cases we have to power-off
    >>> the server to get the problem system to recover, no delayed write
    >>> error is reported on server 2008.
    >>> I've tested this on a variety of servers (listed next) and in as many
    >>> cases as possible I tested on multiple servers with the same config
    >>> and with different O/S editions. I also tested one of the servers
    >>> that was experiencing the problems first, using small files (files
    >>> from 1byte up to 40 GigaBytes in size). The test set I use is approx
    >>> 870GB in total size and has been continuously copying for about 40
    >>> days now continuously on this server, I think at last check it had
    >>> copied around 2,973 Terabytes of data on this server, all without
    >>> error.
    >>> Primary Test Servers and configurations
    >>> HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    >>> storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >>> HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB
    >>> SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >>> HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    >>> Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    >>> DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage
    >>> 4.2TB on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy
    >>> Box. WinSvr 2003R2 SP2 x86 and x64, Win Svr 2008 x64.
    >>> DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB
    >>> on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box.
    >>> WinSvr 2003R2 SP2 x86 and WinSvr2008 x86
    >>> Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB
    >>> of local storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box
    >>> via iSCSI WinSvr 2003R2 SP2 x64
    >>> With the exception of the ML570 I've tested on multiple servers of
    >>> the same type. All servers have succeeded in copying large volumes of
    >>> files 40Gb and smaller.
    >>>
    >>> On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We
    >>> don't mix spindle speeds in raid groups. On our Dells, all local
    >>> drives are SAS, on the EMC AX150 all drives are are SATA and on the
    >>> SATABoy all drives are SATA. On all systems file copies using small
    >>> files up to 0 - 40GB are all successful.
    >>> All systems are running the latest BIOS and we have seen the same
    >>> behavior on prior BIOS versions. All disk controller firware is
    >>> updated to the latest version and like the BIOSes the same behavior
    >>> existed on earlier versions. We have checked and updated hard disk
    >>> firmware, where new versions are available. Same issues as for
    >>> controller firmware. On local Hard drives we run Raid 1 or Raid 5 to
    >>> get best performance or max capacity. Both Raid modes exhibit the
    >>> same behavior. I have tested on the HP's with no Raid at all and the
    >>> same results occurred.
    >>> I try not to do specialty O/S builds for our lab environment. I build
    >>> a straight default Windows O/S configurations, fully patch it with
    >>> microsoft Patches and burn in test the system, then go test for this
    >>> problem. I do not tweak system settings or apply registry hacks until
    >>> I get baseline test data. In all cases here for the file copy tests I
    >>> have not tweaked the system settings or registry at all. Our servers
    >>> are set for background performance for the system cache, although we
    >>> tested with 'foreground' set without success too. We've tried large
    >>> and small pagefiles and have moved pagefiles to seperate disk
    >>> spindles to see if it made a difference. I've tried all manner of file
    >>> copy and file sync tools, but what it
    >>> comes down to that if the file being copied is written on the
    >>> systems' storage system (local or SAN), the system will collapse and
    >>> file copying will fail somewhere around 52 - 58GB being copied.
    >>> Windows Server 2008 has given me my best look into the problem and
    >>> what appears to be happening is that the system cache keeps expanding
    >>> until all physical memory is used and the paged pool keeps growing
    >>> until it hits around 380MB and non paged hits about 82MB (I think the
    >>> latter is right). What I then see is the CPU goes flat line as does
    >>> the Total Disk byes written in Perfmon but the Physical memory usage
    >>> history in task manager suddenly starts ramping until it gets out
    >>> around 70GB and then everything is done and either the system hangs
    >>> (server2008) or delayed write errors occur.
    >>> One place where I do not seem to see the problem is SAN Drive to SAN
    >>> Drive Copies on the EMC SAN. I always see this problem on the SATABOY
    >>> SAN with large files when copying large files to the SAN volumes
    >>> regardless of Cache Settings. I can backup the files to tape but due
    >>> to their size its an expensive option both in media cost and time to
    >>> backup and restore the data, my preferred option is to backup to
    >>> removeable Hard Disk (External USB - SATA), sure its slower but it
    >>> offers operating efficiencies right now I cannot get with Tape (if
    >>> it worked). Some testing has centered on using external drives, but
    >>> most of the testing on my systems has been to copy or move the files
    >>> from one volume to another on the server. It doesn't matter whether I
    >>> turn caching off or on for external USB drives or local drives. I
    >>> have had very occasional success on the EMC SAN ensuring that windows
    >>> O/S disk caching is off. Success using this method has been spotty
    >>> and limited to servers with 32GB or more of RAM. One very telling
    >>> test setup was to populate the Server with 128GB RAM and run windows
    >>> Server 2008 x64. In that case almost all file copies were successful,
    >>> although they became painfully slow after about 60GB was copied and
    >>> it took over 5 hours to copy the last ~35GB of a 95.8 GB file.
    >>> I tried using Backup Exec and MsBackup to backup the files to a hard
    >>> disk but it failed everytime, when I run the same backup to tape it is
    >>> successful.
    >>> This is leading me to think the problem is generated in the lower
    >>> level file system filter drivers. I've tested with and without
    >>> AntiVirus software in the mix and have likewise tested systems that
    >>> are built raw with no patches at all and see the same problems. I
    >>> also tried splitting the file into chunks and copying the pieces and
    >>> while I can split the file, I cannot join it again as the processes
    >>> all seem to rely on creating a temp file and the process of copying
    >>> the large temp file always results in the delayed write errors being
    >>> generated. I've also tried zipping up the file to reduce its size
    >>> (database files compress really well) but that process likewise
    >>> requires a file copy of a large file, and at some point that fails.
    >>> It should go without saying I've tried using SQL Db Backup writing to
    >>> disk storage and it fails everytime, tape is successful. In fact it
    >>> was this very act that caused me to begin investigating the problem
    >>> in the first place.
    >>> In days gone by there were loads of users seeing this problem on Win
    >>> XP copying much smaller files and there are some other people seeing
    >>> this problem today on Windows Server, Microsoft are very quiet on the
    >>> subject for windows server, I think in no small part because very few
    >>> people are seeing the problem and there is no demand to identify the
    >>> problem or to fix it. I cannot believe though I'm one of the first
    >>> people to see the problem. I'll quite happily accept its a configuration
    >>> issue if someone can
    >>> tell me how to fix the problem! All attempt to tweak a system has not
    >>> yielded any success. What also sucks is that you typically have to
    >>> wait ~30 mins to find out that the problem will manifest.
    >>>
    >>> I'm about ready to escalate this issue to Microsoft, I think I now
    >>> have enough test data to do so, but thought I would bounce this off
    >>> others to see if anyone else has a solution or guidance first.
    >>>
    >>> Phil
    >>> Checkfree:

    >>
    >> Interesting dillema. I have not seen it before.
    >>
    >> What are the Source names for the event IDs you posted?
    >>
    >> Out of curiousity, I looked up EventID 50, but there are numerous Source
    >> names, none of which their descriptions did I find to indicate a "Windows
    >> Delayed Write" message, as you've described. Looking through different
    >> possible error messages of "{Delayed Write Failed}," I found numerous
    >> possiblities, but you would have to sort through them. One indicated a
    >> problem wtih the MFT, another indicated issues with fragmentation causing
    >> it, and one concerining clustering drivers conflicts and one more about
    >> compatibility issues with the HBA the SAN is connected to.
    >>
    >> See if you can weed through the following EventID links. Look up based on
    >> the Source name. If you don't have a subscription and it indicates an
    >> MSKB, prefix the KB number so it the URL reads
    >> "htttp://support.microsoft.com/kb/######" (###### being the KB number).
    >>
    >> http://eventid.net/display.asp?eventid=50&source=
    >> http://eventid.net/display.asp?eventid=26&source=
    >>
    >>
    >> --
    >> Regards,
    >> Ace
    >>
    >> This posting is provided "AS-IS" with no warranties or guarantees and
    >> confers no rights.
    >>
    >> Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT,
    >> MVP Microsoft MVP - Directory Services
    >> Microsoft Certified Trainer
    >>
    >> For urgent issues, you may want to contact Microsoft PSS directly. Please
    >> check http://support.microsoft.com for regional support phone numbers.
    >>
    >> Infinite Diversities in Infinite Combinations
    >>
    >>

    >
    >
    Phil Lewis, May 24, 2008
    #5
  6. The fix is there with SP2, but you still need to set the registry key.

    cache
    copy reads/se
    data flush pages/sec

    memory
    pages/sec
    free system page table entries

    physical disk
    sec/read
    sec/write
    reads/sec
    writes/sec



    John Fullbright [Exchange MVP]




    "Phil Lewis" <> wrote in message
    news:...
    > Hi John
    > Event 50 is being generated by Disk. I think that is being seen on all
    > servers, though I've seen one or two cases where mrxsmb was the source
    > when doing network copies. I can also confirm that at least 50% of RAM was
    > indeed cached, in most cases it was much higher. While we have seen
    > c000009a (out of resources), in almost all cases, disk timeouts occurred
    > as a failure point and yes indeed we experience very heavy paging when or
    > just before the system tips over. On Server 2008 we were seeing the
    > dedicated pagefile disk active solidly for over an hour before we finally
    > gave up waiting for the copy to quit and the system to shutdown.
    > I'll get the event viewer dumps later tonight and I can also get a Perfmon
    > Dump as well. Let me know what counters you would like to see included so
    > I get you exactly the necessary data. tonight, I can just load a Perfmon
    > template on the server and dump the output if you want to specify a bunch
    > of counters.
    >
    > I have not played with CcDirtyPageThreshold yet, I was extremely tempted
    > last week, but have thus far resisted. We run SP2 on all our WinSvr2003
    > servers. and one article I read had a hotfix for the problem that applied
    > only to SP1. Microsoft Support confirmed the fix in the article was
    > included in SP2 and I've not yet added the DWord value to modify it.
    >
    > Phil
    >
    > .
    > "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    > news:...
    >> Being that they're at Windows 2003 SP2 or Windows 2008, my guess would be
    >> CcDirtyPageThreshold is set too high. Let's see the details of the event
    >> ID 50. The event ID 26 is really immaterial, it's just the application
    >> popup message (you got the dialog that says lost delayed write)., The
    >> evend ID 50 will contain the error code passed up the driver stack, as
    >> well as any SCSI sense data. Between that and possibly a short perfmon
    >> run during a copy, we should be able to sort this out. My guess is that
    >> we'll see half of RAM cached before writing starts, then you'll either
    >> see c000009a (out of resources) or a timeout due to slow disk once dirty
    >> pages are flushed. In perfom, you'll see heavy paging once dirty pages
    >> start flushing, and this is really the cause of the lack of server
    >> responsiveness.
    >>
    >> The narrative
    >>
    >> "Windows Server 2008 has given me my best look into the problem and
    >>>> what appears to be happening is that the system cache keeps expanding
    >>>> until all physical memory is used and the paged pool keeps growing
    >>>> until it hits around 380MB and non paged hits about 82MB (I think the
    >>>> latter is right). What I then see is the CPU goes flat line as does
    >>>> the Total Disk byes written in Perfmon but the Physical memory usage
    >>>> history in task manager suddenly starts ramping until it gets out
    >>>> around 70GB and then everything is done and either the system hangs"

    >>
    >> would be supportive of this guess.
    >>
    >> John
    >>
    >>
    >>
    >> "Ace Fekay [MVP]" <> wrote in message
    >> news:%...
    >>> In news:,
    >>> Phil Lewis <> typed:
    >>>> Apologies as this turned out to be rather lengthy....
    >>>>
    >>>> I run a development engineering lab for a financial services company
    >>>> and we are running into a rather peculiar but very troubling problem
    >>>> on some of our performance servers when copying, moving or backing up
    >>>> very large files, i.e. files greater than about 52 Gigabytes in size
    >>>> (yes GB not MB). Primarily these are SQL Server database files but we
    >>>> also see the same problems copying or relocating large Virtual Server
    >>>> Hard Drive files of that size or larger.
    >>>> The problem is actually an old one that I think has not been dealt
    >>>> with, the end result for us on Windows Server 2000 and Windows Server
    >>>> 2003 R2 is that after copying about 52 Gbytes of the file, Windows
    >>>> starts reporting "Windows delayed write" errors and at that point the
    >>>> file copy collapses and stops. Although the system reports the copy
    >>>> is still running, no further data is being successfully copied. All
    >>>> the file IO and other windows processes slowdown considerably (more
    >>>> on this in a little bit). In each case where a delayed write error
    >>>> is generated, Event Viewer shows the first error as being event ID 50
    >>>> and or Event 26. The problem is seen when using Drive letters and
    >>>> UNC paths and we asked about hotfixes for Server 2003 SP2 but were
    >>>> told by support there were none as the fix described in KB Article
    >>>> 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into
    >>>> SP2 and did not apply to our issue.
    >>>> We have only recently begun to see these problems because until
    >>>> recently most of our performance testing model used fairly small
    >>>> working sets for data (typically under 100GB total) and thus each
    >>>> file-group in our databases was less than 40GB so we never really saw
    >>>> a problem. I've seen this problem on ALL versions of Windows
    >>>> including Server 2008. In the server 2008 case, the O/S collapses
    >>>> completely and cannot be shutdown.In most cases we have to power-off
    >>>> the server to get the problem system to recover, no delayed write
    >>>> error is reported on server 2008.
    >>>> I've tested this on a variety of servers (listed next) and in as many
    >>>> cases as possible I tested on multiple servers with the same config
    >>>> and with different O/S editions. I also tested one of the servers
    >>>> that was experiencing the problems first, using small files (files
    >>>> from 1byte up to 40 GigaBytes in size). The test set I use is approx
    >>>> 870GB in total size and has been continuously copying for about 40
    >>>> days now continuously on this server, I think at last check it had
    >>>> copied around 2,973 Terabytes of data on this server, all without
    >>>> error.
    >>>> Primary Test Servers and configurations
    >>>> HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    >>>> storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >>>> HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB
    >>>> SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    >>>> HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    >>>> Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    >>>> DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage
    >>>> 4.2TB on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy
    >>>> Box. WinSvr 2003R2 SP2 x86 and x64, Win Svr 2008 x64.
    >>>> DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB
    >>>> on an EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box.
    >>>> WinSvr 2003R2 SP2 x86 and WinSvr2008 x86
    >>>> Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB
    >>>> of local storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box
    >>>> via iSCSI WinSvr 2003R2 SP2 x64
    >>>> With the exception of the ML570 I've tested on multiple servers of
    >>>> the same type. All servers have succeeded in copying large volumes of
    >>>> files 40Gb and smaller.
    >>>>
    >>>> On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We
    >>>> don't mix spindle speeds in raid groups. On our Dells, all local
    >>>> drives are SAS, on the EMC AX150 all drives are are SATA and on the
    >>>> SATABoy all drives are SATA. On all systems file copies using small
    >>>> files up to 0 - 40GB are all successful.
    >>>> All systems are running the latest BIOS and we have seen the same
    >>>> behavior on prior BIOS versions. All disk controller firware is
    >>>> updated to the latest version and like the BIOSes the same behavior
    >>>> existed on earlier versions. We have checked and updated hard disk
    >>>> firmware, where new versions are available. Same issues as for
    >>>> controller firmware. On local Hard drives we run Raid 1 or Raid 5 to
    >>>> get best performance or max capacity. Both Raid modes exhibit the
    >>>> same behavior. I have tested on the HP's with no Raid at all and the
    >>>> same results occurred.
    >>>> I try not to do specialty O/S builds for our lab environment. I build
    >>>> a straight default Windows O/S configurations, fully patch it with
    >>>> microsoft Patches and burn in test the system, then go test for this
    >>>> problem. I do not tweak system settings or apply registry hacks until
    >>>> I get baseline test data. In all cases here for the file copy tests I
    >>>> have not tweaked the system settings or registry at all. Our servers
    >>>> are set for background performance for the system cache, although we
    >>>> tested with 'foreground' set without success too. We've tried large
    >>>> and small pagefiles and have moved pagefiles to seperate disk
    >>>> spindles to see if it made a difference. I've tried all manner of file
    >>>> copy and file sync tools, but what it
    >>>> comes down to that if the file being copied is written on the
    >>>> systems' storage system (local or SAN), the system will collapse and
    >>>> file copying will fail somewhere around 52 - 58GB being copied.
    >>>> Windows Server 2008 has given me my best look into the problem and
    >>>> what appears to be happening is that the system cache keeps expanding
    >>>> until all physical memory is used and the paged pool keeps growing
    >>>> until it hits around 380MB and non paged hits about 82MB (I think the
    >>>> latter is right). What I then see is the CPU goes flat line as does
    >>>> the Total Disk byes written in Perfmon but the Physical memory usage
    >>>> history in task manager suddenly starts ramping until it gets out
    >>>> around 70GB and then everything is done and either the system hangs
    >>>> (server2008) or delayed write errors occur.
    >>>> One place where I do not seem to see the problem is SAN Drive to SAN
    >>>> Drive Copies on the EMC SAN. I always see this problem on the SATABOY
    >>>> SAN with large files when copying large files to the SAN volumes
    >>>> regardless of Cache Settings. I can backup the files to tape but due
    >>>> to their size its an expensive option both in media cost and time to
    >>>> backup and restore the data, my preferred option is to backup to
    >>>> removeable Hard Disk (External USB - SATA), sure its slower but it
    >>>> offers operating efficiencies right now I cannot get with Tape (if
    >>>> it worked). Some testing has centered on using external drives, but
    >>>> most of the testing on my systems has been to copy or move the files
    >>>> from one volume to another on the server. It doesn't matter whether I
    >>>> turn caching off or on for external USB drives or local drives. I
    >>>> have had very occasional success on the EMC SAN ensuring that windows
    >>>> O/S disk caching is off. Success using this method has been spotty
    >>>> and limited to servers with 32GB or more of RAM. One very telling
    >>>> test setup was to populate the Server with 128GB RAM and run windows
    >>>> Server 2008 x64. In that case almost all file copies were successful,
    >>>> although they became painfully slow after about 60GB was copied and
    >>>> it took over 5 hours to copy the last ~35GB of a 95.8 GB file.
    >>>> I tried using Backup Exec and MsBackup to backup the files to a hard
    >>>> disk but it failed everytime, when I run the same backup to tape it is
    >>>> successful.
    >>>> This is leading me to think the problem is generated in the lower
    >>>> level file system filter drivers. I've tested with and without
    >>>> AntiVirus software in the mix and have likewise tested systems that
    >>>> are built raw with no patches at all and see the same problems. I
    >>>> also tried splitting the file into chunks and copying the pieces and
    >>>> while I can split the file, I cannot join it again as the processes
    >>>> all seem to rely on creating a temp file and the process of copying
    >>>> the large temp file always results in the delayed write errors being
    >>>> generated. I've also tried zipping up the file to reduce its size
    >>>> (database files compress really well) but that process likewise
    >>>> requires a file copy of a large file, and at some point that fails.
    >>>> It should go without saying I've tried using SQL Db Backup writing to
    >>>> disk storage and it fails everytime, tape is successful. In fact it
    >>>> was this very act that caused me to begin investigating the problem
    >>>> in the first place.
    >>>> In days gone by there were loads of users seeing this problem on Win
    >>>> XP copying much smaller files and there are some other people seeing
    >>>> this problem today on Windows Server, Microsoft are very quiet on the
    >>>> subject for windows server, I think in no small part because very few
    >>>> people are seeing the problem and there is no demand to identify the
    >>>> problem or to fix it. I cannot believe though I'm one of the first
    >>>> people to see the problem. I'll quite happily accept its a
    >>>> configuration issue if someone can
    >>>> tell me how to fix the problem! All attempt to tweak a system has not
    >>>> yielded any success. What also sucks is that you typically have to
    >>>> wait ~30 mins to find out that the problem will manifest.
    >>>>
    >>>> I'm about ready to escalate this issue to Microsoft, I think I now
    >>>> have enough test data to do so, but thought I would bounce this off
    >>>> others to see if anyone else has a solution or guidance first.
    >>>>
    >>>> Phil
    >>>> Checkfree:
    >>>
    >>> Interesting dillema. I have not seen it before.
    >>>
    >>> What are the Source names for the event IDs you posted?
    >>>
    >>> Out of curiousity, I looked up EventID 50, but there are numerous Source
    >>> names, none of which their descriptions did I find to indicate a
    >>> "Windows Delayed Write" message, as you've described. Looking through
    >>> different possible error messages of "{Delayed Write Failed}," I found
    >>> numerous possiblities, but you would have to sort through them. One
    >>> indicated a problem wtih the MFT, another indicated issues with
    >>> fragmentation causing it, and one concerining clustering drivers
    >>> conflicts and one more about compatibility issues with the HBA the SAN
    >>> is connected to.
    >>>
    >>> See if you can weed through the following EventID links. Look up based
    >>> on the Source name. If you don't have a subscription and it indicates an
    >>> MSKB, prefix the KB number so it the URL reads
    >>> "htttp://support.microsoft.com/kb/######" (###### being the KB number).
    >>>
    >>> http://eventid.net/display.asp?eventid=50&source=
    >>> http://eventid.net/display.asp?eventid=26&source=
    >>>
    >>>
    >>> --
    >>> Regards,
    >>> Ace
    >>>
    >>> This posting is provided "AS-IS" with no warranties or guarantees and
    >>> confers no rights.
    >>>
    >>> Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT,
    >>> MVP Microsoft MVP - Directory Services
    >>> Microsoft Certified Trainer
    >>>
    >>> For urgent issues, you may want to contact Microsoft PSS directly.
    >>> Please
    >>> check http://support.microsoft.com for regional support phone numbers.
    >>>
    >>> Infinite Diversities in Infinite Combinations
    >>>
    >>>

    >>
    >>

    >
    >
    John Fullbright, May 26, 2008
    #6
  7. Phil Lewis

    Phil Lewis Guest

    Hi John I'm about to grab the counters to a CSV File or do you prefer
    Binary?

    What do you recommend in terms of sample interval? I typically use 5 Second
    intervals for this kind of thing but will adjust to whatever you prefer.

    I'm grabbing Physical disk counters for both _Total and the Target disk
    [D:\] for the Copy.

    The first set of counters will be without the reg key being added to see the
    system in the raw state. I'll then add the key, what do you recommend in
    terms of a value? the Server is a HP ML570 Quad Xeon Dual Core 3.33GHz with
    16GB of Ram. Reading suggests starting at half of installed Ram which would
    be 8192.

    Phil


    "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    news:...
    > The fix is there with SP2, but you still need to set the registry key.
    >
    > cache
    > copy reads/se
    > data flush pages/sec
    >
    > memory
    > pages/sec
    > free system page table entries
    >
    > physical disk
    > sec/read
    > sec/write
    > reads/sec
    > writes/sec
    >
    >
    >
    > John Fullbright [Exchange MVP]
    >
    Phil Lewis, May 28, 2008
    #7
  8. Phil Lewis

    Mat Young Guest

    Avg Disk Sec Read and write would be handy. if you are getting Lost write
    because your array is being overun beyond the spindle count, then this may
    show up in a latency response spike beforehand.

    It would also be good to match IO versus spindle count. To do this properly
    you will need cache hit ratio from the arrays.

    rgds

    mat

    "Phil Lewis" wrote:

    > Hi John I'm about to grab the counters to a CSV File or do you prefer
    > Binary?
    >
    > What do you recommend in terms of sample interval? I typically use 5 Second
    > intervals for this kind of thing but will adjust to whatever you prefer.
    >
    > I'm grabbing Physical disk counters for both _Total and the Target disk
    > [D:\] for the Copy.
    >
    > The first set of counters will be without the reg key being added to see the
    > system in the raw state. I'll then add the key, what do you recommend in
    > terms of a value? the Server is a HP ML570 Quad Xeon Dual Core 3.33GHz with
    > 16GB of Ram. Reading suggests starting at half of installed Ram which would
    > be 8192.
    >
    > Phil
    >
    >
    > "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    > news:...
    > > The fix is there with SP2, but you still need to set the registry key.
    > >
    > > cache
    > > copy reads/se
    > > data flush pages/sec
    > >
    > > memory
    > > pages/sec
    > > free system page table entries
    > >
    > > physical disk
    > > sec/read
    > > sec/write
    > > reads/sec
    > > writes/sec
    > >
    > >
    > >
    > > John Fullbright [Exchange MVP]
    > >

    >
    >
    >
    Mat Young, May 28, 2008
    #8
  9. Phil Lewis

    Phil Lewis Guest

    Well I'm getting some very interesting findiong that I don't yet know what
    to make of.

    First off It looks like the whole train of events ultimately leading up to
    the delayed write failure begins about 2-4 mins into the copy process. The
    File copy in Windows Explorer begins normally and the 88.6GByte copy shows
    it is expecting to take around 24-26 mins. Watching a whole slew of counters
    in Perfmon I can see the Cache Bytes Growing steadily and Disk Total Reads
    per second running avg around 190. Somewhere around 2-4 mins into the copy
    the time to complete estimate in windows Explorer suddenly climbs
    dramatically topping out around 100Mins. At the exact same time in perfmon I
    see a sudden drop in the Total Disk Reads / Sec. The level drops from about
    190 AVG to about 32 and the cache bytes growth goes dead flat. Avg Disk
    queue stays the same and Writes continue to occur at the same levels as
    before. After several minutes the Disk reads go back up to their previous
    levels for a minute or two then they drop again and shortly after that, the
    Disk write counters tank again and within 10-15 seconds The Delayed Write
    error occurs.

    So far I cannot see what is causing the Disk Reads to suddenly tank to about
    25% of the initial level. When the system, picks back up midstream there is
    a sudden spike in the number of Lazy Write Flushes and lazy write pages
    lasting about 1 or 2 seconds then they drop off to almost zero.

    Throughout copy the Free System Page Table Entries is flat at 40209. When
    the system picks up again mid-Stream, the Cache Bytes and System Cache
    Resident Bytes suddenly ramp up and seems to flatten out at approx
    360,000,000 Bytes. Once the system Tanks the Bytes stay up until the file
    copy is aborted and after a few mins it drops back to about 125,000,000
    Bytes.

    John, I'll Email you the Perfmon trace from the first run earlier later
    tonight.



    "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    news:...
    > The fix is there with SP2, but you still need to set the registry key.
    >
    > cache
    > copy reads/se
    > data flush pages/sec
    >
    > memory
    > pages/sec
    > free system page table entries
    >
    > physical disk
    > sec/read
    > sec/write
    > reads/sec
    > writes/sec
    >
    >
    >
    > John Fullbright [Exchange MVP]
    >
    >
    Phil Lewis, May 28, 2008
    #9
  10. Phil Lewis

    Kenny Speer Guest

    Please take these comments with a grain of salt. It's typically pretty
    difficult to troubleshoot these issues via newsgroup.

    Phil Lewis wrote:
    > Well I'm getting some very interesting findiong that I don't yet know what
    > to make of.
    >
    > First off It looks like the whole train of events ultimately leading up to
    > the delayed write failure begins about 2-4 mins into the copy process. The
    > File copy in Windows Explorer begins normally and the 88.6GByte copy shows
    > it is expecting to take around 24-26 mins. Watching a whole slew of counters
    > in Perfmon I can see the Cache Bytes Growing steadily and Disk Total Reads
    > per second running avg around 190.


    [Kenny] This is most likely being caused by two things. First the NTFS
    Cache writes is where your low time comes from. The AVG 190 IOPs to
    your array are most likely from a write-back cache in the array.

    Somewhere around 2-4 mins into the copy
    > the time to complete estimate in windows Explorer suddenly climbs
    > dramatically topping out around 100Mins.


    [Kenny] You just filled your cache for the filesystem.

    At the exact same time in perfmon I
    > see a sudden drop in the Total Disk Reads / Sec. The level drops from about
    > 190 AVG to about 32 and the cache bytes growth goes dead flat.


    [Kenny] Array cache is most likely filled at this point, so your
    throughput to the array just dropped since at this point you are running
    as fast as the array can write to disk instead of how fast it can write
    to cache.

    Avg Disk
    > queue stays the same and Writes continue to occur at the same levels as
    > before. After several minutes the Disk reads go back up to their previous
    > levels for a minute or two then they drop again


    [Kenny] This is the cache filling and flushing then filling and flushing.

    and shortly after that, the
    > Disk write counters tank again and within 10-15 seconds The Delayed Write
    > error occurs.


    [Kenny] Hmm, not sure about this. It is quite possible that the array
    simply dropped the IO on the floor. What do the event logs show up to
    this point? Are there any logs you can enable on the target?

    >
    > So far I cannot see what is causing the Disk Reads to suddenly tank to about
    > 25% of the initial level. When the system, picks back up midstream there is
    > a sudden spike in the number of Lazy Write Flushes and lazy write pages
    > lasting about 1 or 2 seconds then they drop off to almost zero.
    >
    > Throughout copy the Free System Page Table Entries is flat at 40209. When
    > the system picks up again mid-Stream, the Cache Bytes and System Cache
    > Resident Bytes suddenly ramp up and seems to flatten out at approx
    > 360,000,000 Bytes. Once the system Tanks the Bytes stay up until the file
    > copy is aborted and after a few mins it drops back to about 125,000,000
    > Bytes.
    >
    > John, I'll Email you the Perfmon trace from the first run earlier later
    > tonight.
    >
    >
    >
    > "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    > news:...
    >> The fix is there with SP2, but you still need to set the registry key.
    >>
    >> cache
    >> copy reads/se
    >> data flush pages/sec
    >>
    >> memory
    >> pages/sec
    >> free system page table entries
    >>
    >> physical disk
    >> sec/read
    >> sec/write
    >> reads/sec
    >> writes/sec
    >>
    >>
    >>
    >> John Fullbright [Exchange MVP]
    >>
    >>

    >
    Kenny Speer, May 28, 2008
    #10
  11. Note that all the target storage listed is SATA disk. RAID 1 has a write
    penalty of 2 and RAID 5 has a write penalty of 4, compounding the issue.
    Eveng copying from one RAID array to another, your read performance will be
    greater than your write performance. This is totally consistent with the
    CcDirtyPageThreshold set too high scenario. Once host cache fills to the
    point where it starts flushing, you quickly swamp any cache on the storage
    and the high page faults begin. If you lower the threshold, the idea is that
    you wil stay within the fill/flush range and not swamp the disk.


    "Kenny Speer" <> wrote in message
    news:usoA%...
    > Please take these comments with a grain of salt. It's typically pretty
    > difficult to troubleshoot these issues via newsgroup.
    >
    > Phil Lewis wrote:
    >> Well I'm getting some very interesting findiong that I don't yet know
    >> what to make of.
    >>
    >> First off It looks like the whole train of events ultimately leading up
    >> to the delayed write failure begins about 2-4 mins into the copy process.
    >> The File copy in Windows Explorer begins normally and the 88.6GByte copy
    >> shows it is expecting to take around 24-26 mins. Watching a whole slew of
    >> counters in Perfmon I can see the Cache Bytes Growing steadily and Disk
    >> Total Reads per second running avg around 190.

    >
    > [Kenny] This is most likely being caused by two things. First the NTFS
    > Cache writes is where your low time comes from. The AVG 190 IOPs to your
    > array are most likely from a write-back cache in the array.
    >
    > Somewhere around 2-4 mins into the copy
    >> the time to complete estimate in windows Explorer suddenly climbs
    >> dramatically topping out around 100Mins.

    >
    > [Kenny] You just filled your cache for the filesystem.
    >
    > At the exact same time in perfmon I
    >> see a sudden drop in the Total Disk Reads / Sec. The level drops from
    >> about 190 AVG to about 32 and the cache bytes growth goes dead flat.

    >
    > [Kenny] Array cache is most likely filled at this point, so your
    > throughput to the array just dropped since at this point you are running
    > as fast as the array can write to disk instead of how fast it can write to
    > cache.
    >
    > Avg Disk
    >> queue stays the same and Writes continue to occur at the same levels as
    >> before. After several minutes the Disk reads go back up to their previous
    >> levels for a minute or two then they drop again

    >
    > [Kenny] This is the cache filling and flushing then filling and flushing.
    >
    > and shortly after that, the
    >> Disk write counters tank again and within 10-15 seconds The Delayed Write
    >> error occurs.

    >
    > [Kenny] Hmm, not sure about this. It is quite possible that the array
    > simply dropped the IO on the floor. What do the event logs show up to
    > this point? Are there any logs you can enable on the target?
    >
    >>
    >> So far I cannot see what is causing the Disk Reads to suddenly tank to
    >> about 25% of the initial level. When the system, picks back up midstream
    >> there is a sudden spike in the number of Lazy Write Flushes and lazy
    >> write pages lasting about 1 or 2 seconds then they drop off to almost
    >> zero.
    >>
    >> Throughout copy the Free System Page Table Entries is flat at 40209. When
    >> the system picks up again mid-Stream, the Cache Bytes and System Cache
    >> Resident Bytes suddenly ramp up and seems to flatten out at approx
    >> 360,000,000 Bytes. Once the system Tanks the Bytes stay up until the file
    >> copy is aborted and after a few mins it drops back to about 125,000,000
    >> Bytes.
    >>
    >> John, I'll Email you the Perfmon trace from the first run earlier later
    >> tonight.
    >>
    >>
    >>
    >> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >> news:...
    >>> The fix is there with SP2, but you still need to set the registry key.
    >>>
    >>> cache
    >>> copy reads/se
    >>> data flush pages/sec
    >>>
    >>> memory
    >>> pages/sec
    >>> free system page table entries
    >>>
    >>> physical disk
    >>> sec/read
    >>> sec/write
    >>> reads/sec
    >>> writes/sec
    >>>
    >>>
    >>>
    >>> John Fullbright [Exchange MVP]
    >>>
    >>>

    >>
    John Fullbright, May 29, 2008
    #11
  12. Phil Lewis

    Phil Lewis Guest

    Yep I believe this is what I'm looking at right now But I'm in need of
    guidance on what value to set for the registry entry:
    SystemCacheDirtyPageThreshold at least to start with. I've tried several
    very large values and two smaller ones 896 and 256 and neither corrected the
    issue but did produce subtlely differing behaviors in the counters. Very
    Small values get the dropoff in disk reads / Sec to occur in just a few secs
    of starting which is fine and About a minute later I See a spike in the lazy
    writes. This is preceeded by a jump in the Cache Bytes and System Cache
    Resident Bytes. What else is interesting and here is where the problems come
    from is that the cache grows each time just before lazy write occurs UNTIL
    System Cache Resident Bytes hits right around 522,000,000 Bytes then the
    frequency inteval of lazy write flushes drop from about 30 - 40 secs to
    about 3 or more mins and cache growth goes flat line. The writes are
    continuing but clearly the system seems to be getting overwhelmed. it should
    also be noted that lazy writes and flushes are occurring still but the
    frequent flushes stop once the sytem cache get filled.
    Something else interesting is that the Cache Bytes is continuing to grow
    slowly while System Cache Resident Bytes is essentially flat it look like
    system cache include the full system working set in its size and the working
    set appears to grow over time. I'm betting that once cache bytes get closer
    to 960-1024 MB the system will tip over.

    I take it there is no way to adjust the forced writes interval of the lazy
    writes, is that true?

    I should also point out that it does not matter what disk type or raid mode
    is used. I'm easily able to produce this error on any Windows Server 2000
    2003 2008 running any combination of disks SCSI, SAS, SATA PATA,
    ExtUSB-SATA.All I need is a large file >52GB in size, which I have a good
    number of.

    Phil





    "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    news:%...
    > Note that all the target storage listed is SATA disk. RAID 1 has a write
    > penalty of 2 and RAID 5 has a write penalty of 4, compounding the issue.
    > Eveng copying from one RAID array to another, your read performance will
    > be greater than your write performance. This is totally consistent with
    > the CcDirtyPageThreshold set too high scenario. Once host cache fills to
    > the point where it starts flushing, you quickly swamp any cache on the
    > storage and the high page faults begin. If you lower the threshold, the
    > idea is that you wil stay within the fill/flush range and not swamp the
    > disk.
    >
    >
    > "Kenny Speer" <> wrote in message
    > news:usoA%...
    >> Please take these comments with a grain of salt. It's typically pretty
    >> difficult to troubleshoot these issues via newsgroup.
    >>
    >> Phil Lewis wrote:
    >>> Well I'm getting some very interesting findiong that I don't yet know
    >>> what to make of.
    >>>
    >>> First off It looks like the whole train of events ultimately leading up
    >>> to the delayed write failure begins about 2-4 mins into the copy
    >>> process. The File copy in Windows Explorer begins normally and the
    >>> 88.6GByte copy shows it is expecting to take around 24-26 mins. Watching
    >>> a whole slew of counters in Perfmon I can see the Cache Bytes Growing
    >>> steadily and Disk Total Reads per second running avg around 190.

    >>
    >> [Kenny] This is most likely being caused by two things. First the NTFS
    >> Cache writes is where your low time comes from. The AVG 190 IOPs to your
    >> array are most likely from a write-back cache in the array.
    >>
    >> Somewhere around 2-4 mins into the copy
    >>> the time to complete estimate in windows Explorer suddenly climbs
    >>> dramatically topping out around 100Mins.

    >>
    >> [Kenny] You just filled your cache for the filesystem.
    >>
    >> At the exact same time in perfmon I
    >>> see a sudden drop in the Total Disk Reads / Sec. The level drops from
    >>> about 190 AVG to about 32 and the cache bytes growth goes dead flat.

    >>
    >> [Kenny] Array cache is most likely filled at this point, so your
    >> throughput to the array just dropped since at this point you are running
    >> as fast as the array can write to disk instead of how fast it can write
    >> to cache.
    >>
    >> Avg Disk
    >>> queue stays the same and Writes continue to occur at the same levels as
    >>> before. After several minutes the Disk reads go back up to their
    >>> previous levels for a minute or two then they drop again

    >>
    >> [Kenny] This is the cache filling and flushing then filling and flushing.
    >>
    >> and shortly after that, the
    >>> Disk write counters tank again and within 10-15 seconds The Delayed
    >>> Write error occurs.

    >>
    >> [Kenny] Hmm, not sure about this. It is quite possible that the array
    >> simply dropped the IO on the floor. What do the event logs show up to
    >> this point? Are there any logs you can enable on the target?
    Phil Lewis, May 29, 2008
    #12
  13. Phil Lewis

    Dilip Naik Guest

    Phil, can you provide an update to this problem? Have you found a solution?

    I am highly interested in the solution for personal reasons and am willing
    to spend some cycles tracking one down if not already found

    Dilip
    www.msftmvp.com

    "Phil Lewis" <> wrote in message
    news:e$...
    > Yep I believe this is what I'm looking at right now But I'm in need of
    > guidance on what value to set for the registry entry:
    > SystemCacheDirtyPageThreshold at least to start with. I've tried several
    > very large values and two smaller ones 896 and 256 and neither corrected
    > the issue but did produce subtlely differing behaviors in the counters.
    > Very Small values get the dropoff in disk reads / Sec to occur in just a
    > few secs of starting which is fine and About a minute later I See a spike
    > in the lazy writes. This is preceeded by a jump in the Cache Bytes and
    > System Cache Resident Bytes. What else is interesting and here is where
    > the problems come from is that the cache grows each time just before lazy
    > write occurs UNTIL System Cache Resident Bytes hits right around
    > 522,000,000 Bytes then the frequency inteval of lazy write flushes drop
    > from about 30 - 40 secs to about 3 or more mins and cache growth goes flat
    > line. The writes are continuing but clearly the system seems to be getting
    > overwhelmed. it should also be noted that lazy writes and flushes are
    > occurring still but the frequent flushes stop once the sytem cache get
    > filled.
    > Something else interesting is that the Cache Bytes is continuing to grow
    > slowly while System Cache Resident Bytes is essentially flat it look like
    > system cache include the full system working set in its size and the
    > working set appears to grow over time. I'm betting that once cache bytes
    > get closer to 960-1024 MB the system will tip over.
    >
    > I take it there is no way to adjust the forced writes interval of the lazy
    > writes, is that true?
    >
    > I should also point out that it does not matter what disk type or raid
    > mode is used. I'm easily able to produce this error on any Windows Server
    > 2000 2003 2008 running any combination of disks SCSI, SAS, SATA PATA,
    > ExtUSB-SATA.All I need is a large file >52GB in size, which I have a good
    > number of.
    >
    > Phil
    >
    >
    >
    >
    >
    > "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    > news:%...
    >> Note that all the target storage listed is SATA disk. RAID 1 has a write
    >> penalty of 2 and RAID 5 has a write penalty of 4, compounding the issue.
    >> Eveng copying from one RAID array to another, your read performance will
    >> be greater than your write performance. This is totally consistent with
    >> the CcDirtyPageThreshold set too high scenario. Once host cache fills to
    >> the point where it starts flushing, you quickly swamp any cache on the
    >> storage and the high page faults begin. If you lower the threshold, the
    >> idea is that you wil stay within the fill/flush range and not swamp the
    >> disk.
    >>
    >>
    >> "Kenny Speer" <> wrote in message
    >> news:usoA%...
    >>> Please take these comments with a grain of salt. It's typically pretty
    >>> difficult to troubleshoot these issues via newsgroup.
    >>>
    >>> Phil Lewis wrote:
    >>>> Well I'm getting some very interesting findiong that I don't yet know
    >>>> what to make of.
    >>>>
    >>>> First off It looks like the whole train of events ultimately leading up
    >>>> to the delayed write failure begins about 2-4 mins into the copy
    >>>> process. The File copy in Windows Explorer begins normally and the
    >>>> 88.6GByte copy shows it is expecting to take around 24-26 mins.
    >>>> Watching a whole slew of counters in Perfmon I can see the Cache Bytes
    >>>> Growing steadily and Disk Total Reads per second running avg around
    >>>> 190.
    >>>
    >>> [Kenny] This is most likely being caused by two things. First the NTFS
    >>> Cache writes is where your low time comes from. The AVG 190 IOPs to
    >>> your array are most likely from a write-back cache in the array.
    >>>
    >>> Somewhere around 2-4 mins into the copy
    >>>> the time to complete estimate in windows Explorer suddenly climbs
    >>>> dramatically topping out around 100Mins.
    >>>
    >>> [Kenny] You just filled your cache for the filesystem.
    >>>
    >>> At the exact same time in perfmon I
    >>>> see a sudden drop in the Total Disk Reads / Sec. The level drops from
    >>>> about 190 AVG to about 32 and the cache bytes growth goes dead flat.
    >>>
    >>> [Kenny] Array cache is most likely filled at this point, so your
    >>> throughput to the array just dropped since at this point you are running
    >>> as fast as the array can write to disk instead of how fast it can write
    >>> to cache.
    >>>
    >>> Avg Disk
    >>>> queue stays the same and Writes continue to occur at the same levels as
    >>>> before. After several minutes the Disk reads go back up to their
    >>>> previous levels for a minute or two then they drop again
    >>>
    >>> [Kenny] This is the cache filling and flushing then filling and
    >>> flushing.
    >>>
    >>> and shortly after that, the
    >>>> Disk write counters tank again and within 10-15 seconds The Delayed
    >>>> Write error occurs.
    >>>
    >>> [Kenny] Hmm, not sure about this. It is quite possible that the array
    >>> simply dropped the IO on the floor. What do the event logs show up to
    >>> this point? Are there any logs you can enable on the target?

    >
    >
    Dilip Naik, Jul 1, 2008
    #13
  14. Phil Lewis

    Dilip Naik Guest

    Phil,

    One more suggestion

    Reading this mail thread clearly indicates that both read and write caching
    are suspected. Why not try getting rid of Large Cache and rerun the
    experiment?

    Contact me off line - I have a filter driver that I would be willing to
    quickly modify to suit your needs - the driver would ensure certain files
    bypass the System Cache. I am not looking for any $$ to do this.

    Dilip
    www.msftmvp.com

    "Dilip Naik" <> wrote in message
    news:...
    > Phil, can you provide an update to this problem? Have you found a
    > solution?
    >
    > I am highly interested in the solution for personal reasons and am willing
    > to spend some cycles tracking one down if not already found
    >
    > Dilip
    > www.msftmvp.com
    >
    > "Phil Lewis" <> wrote in message
    > news:e$...
    >> Yep I believe this is what I'm looking at right now But I'm in need of
    >> guidance on what value to set for the registry entry:
    >> SystemCacheDirtyPageThreshold at least to start with. I've tried
    >> several very large values and two smaller ones 896 and 256 and neither
    >> corrected the issue but did produce subtlely differing behaviors in the
    >> counters. Very Small values get the dropoff in disk reads / Sec to occur
    >> in just a few secs of starting which is fine and About a minute later I
    >> See a spike in the lazy writes. This is preceeded by a jump in the Cache
    >> Bytes and System Cache Resident Bytes. What else is interesting and here
    >> is where the problems come from is that the cache grows each time just
    >> before lazy write occurs UNTIL System Cache Resident Bytes hits right
    >> around 522,000,000 Bytes then the frequency inteval of lazy write flushes
    >> drop from about 30 - 40 secs to about 3 or more mins and cache growth
    >> goes flat line. The writes are continuing but clearly the system seems to
    >> be getting overwhelmed. it should also be noted that lazy writes and
    >> flushes are occurring still but the frequent flushes stop once the sytem
    >> cache get filled.
    >> Something else interesting is that the Cache Bytes is continuing to grow
    >> slowly while System Cache Resident Bytes is essentially flat it look like
    >> system cache include the full system working set in its size and the
    >> working set appears to grow over time. I'm betting that once cache bytes
    >> get closer to 960-1024 MB the system will tip over.
    >>
    >> I take it there is no way to adjust the forced writes interval of the
    >> lazy writes, is that true?
    >>
    >> I should also point out that it does not matter what disk type or raid
    >> mode is used. I'm easily able to produce this error on any Windows Server
    >> 2000 2003 2008 running any combination of disks SCSI, SAS, SATA PATA,
    >> ExtUSB-SATA.All I need is a large file >52GB in size, which I have a good
    >> number of.
    >>
    >> Phil
    >>
    >>
    >>
    >>
    >>
    >> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >> news:%...
    >>> Note that all the target storage listed is SATA disk. RAID 1 has a
    >>> write penalty of 2 and RAID 5 has a write penalty of 4, compounding the
    >>> issue. Eveng copying from one RAID array to another, your read
    >>> performance will be greater than your write performance. This is
    >>> totally consistent with the CcDirtyPageThreshold set too high scenario.
    >>> Once host cache fills to the point where it starts flushing, you quickly
    >>> swamp any cache on the storage and the high page faults begin. If you
    >>> lower the threshold, the idea is that you wil stay within the fill/flush
    >>> range and not swamp the disk.
    >>>
    >>>
    >>> "Kenny Speer" <> wrote in message
    >>> news:usoA%...
    >>>> Please take these comments with a grain of salt. It's typically pretty
    >>>> difficult to troubleshoot these issues via newsgroup.
    >>>>
    >>>> Phil Lewis wrote:
    >>>>> Well I'm getting some very interesting findiong that I don't yet know
    >>>>> what to make of.
    >>>>>
    >>>>> First off It looks like the whole train of events ultimately leading
    >>>>> up to the delayed write failure begins about 2-4 mins into the copy
    >>>>> process. The File copy in Windows Explorer begins normally and the
    >>>>> 88.6GByte copy shows it is expecting to take around 24-26 mins.
    >>>>> Watching a whole slew of counters in Perfmon I can see the Cache Bytes
    >>>>> Growing steadily and Disk Total Reads per second running avg around
    >>>>> 190.
    >>>>
    >>>> [Kenny] This is most likely being caused by two things. First the NTFS
    >>>> Cache writes is where your low time comes from. The AVG 190 IOPs to
    >>>> your array are most likely from a write-back cache in the array.
    >>>>
    >>>> Somewhere around 2-4 mins into the copy
    >>>>> the time to complete estimate in windows Explorer suddenly climbs
    >>>>> dramatically topping out around 100Mins.
    >>>>
    >>>> [Kenny] You just filled your cache for the filesystem.
    >>>>
    >>>> At the exact same time in perfmon I
    >>>>> see a sudden drop in the Total Disk Reads / Sec. The level drops from
    >>>>> about 190 AVG to about 32 and the cache bytes growth goes dead flat.
    >>>>
    >>>> [Kenny] Array cache is most likely filled at this point, so your
    >>>> throughput to the array just dropped since at this point you are
    >>>> running as fast as the array can write to disk instead of how fast it
    >>>> can write to cache.
    >>>>
    >>>> Avg Disk
    >>>>> queue stays the same and Writes continue to occur at the same levels
    >>>>> as before. After several minutes the Disk reads go back up to their
    >>>>> previous levels for a minute or two then they drop again
    >>>>
    >>>> [Kenny] This is the cache filling and flushing then filling and
    >>>> flushing.
    >>>>
    >>>> and shortly after that, the
    >>>>> Disk write counters tank again and within 10-15 seconds The Delayed
    >>>>> Write error occurs.
    >>>>
    >>>> [Kenny] Hmm, not sure about this. It is quite possible that the array
    >>>> simply dropped the IO on the floor. What do the event logs show up to
    >>>> this point? Are there any logs you can enable on the target?

    >>
    >>

    >
    Dilip Naik, Jul 1, 2008
    #14
  15. Same problem with x64 XP Pro SP2

    I'm having the same issue running XP Pro x64 SP2. When using Vice Versa Pro to copy from one drive to another, it eventually starts generating event ID 50 followed by event ID 26. I'm copying from my external USB 2.0 Drobo (which worked fine with Vista-32 bit or XP 32-bit) to an internal hard drive. I'm eager to hear the solution to this issue.
    Robert Rohlin, Jul 22, 2008
    #15
  16. Re: Same problem with x64 XP Pro SP2

    "Robert Rohlin" wrote in message news:...
    > I'm having the same issue running XP Pro x64 SP2. When using Vice Versa
    > Pro to copy from one drive to another, it eventually starts generating
    > event ID 50 followed by event ID 26. I'm copying from my external USB 2.0
    > Drobo (which worked fine with Vista-32 bit or XP 32-bit) to an internal
    > hard drive. I'm eager to hear the solution to this issue.



    Same issue as what? I am using WinMail and I see no thread to which you
    have replied. How about some details about your computer and its
    configuration.
    Colin Barnhorst, Jul 22, 2008
    #16
  17. Phil Lewis

    Brian Drab Guest

    Encountering the exact same issue

    Did you ever find a solution to this? We are using an MSA1000 SAN and have the EXACT symptoms you describe execept that our limit is about 36 GB. Under is fine...Over is exactly as you describe.

    > On Saturday, May 24, 2008 12:45 PM Phil Lewis wrote:


    > Apologies as this turned out to be rather lengthy....
    >
    > I run a development engineering lab for a financial services company and we
    > are running into a rather peculiar but very troubling problem on some of our
    > performance servers when copying, moving or backing up very large files,
    > i.e. files greater than about 52 Gigabytes in size (yes GB not MB).
    > Primarily these are SQL Server database files but we also see the same
    > problems copying or relocating large Virtual Server Hard Drive files of that
    > size or larger.
    > The problem is actually an old one that I think has not been dealt with, the
    > end result for us on Windows Server 2000 and Windows Server 2003 R2 is that
    > after copying about 52 Gbytes of the file, Windows starts reporting "Windows
    > delayed write" errors and at that point the file copy collapses and stops.
    > Although the system reports the copy is still running, no further data is
    > being successfully copied. All the file IO and other windows processes
    > slowdown considerably (more on this in a little bit). In each case where a
    > delayed write error is generated, Event Viewer shows the first error as
    > being event ID 50 and or Event 26. The problem is seen when using Drive
    > letters and UNC paths and we asked about hotfixes for Server 2003 SP2 but
    > were told by support there were none as the fix described in KB Article
    > 890352 [ http://support.microsoft.com/kb/890352/ ] was rolled into SP2 and
    > did not apply to our issue.
    >
    > We have only recently begun to see these problems because until recently
    > most of our performance testing model used fairly small working sets for
    > data (typically under 100GB total) and thus each file-group in our databases
    > was less than 40GB so we never really saw a problem. I've seen this problem
    > on ALL versions of Windows including Server 2008. In the server 2008 case,
    > the O/S collapses completely and cannot be shutdown.In most cases we have to
    > power-off the server to get the problem system to recover, no delayed write
    > error is reported on server 2008.
    >
    > I've tested this on a variety of servers (listed next) and in as many cases
    > as possible I tested on multiple servers with the same config and with
    > different O/S editions. I also tested one of the servers that was
    > experiencing the problems first, using small files (files from 1byte up to
    > 40 GigaBytes in size). The test set I use is approx 870GB in total size and
    > has been continuously copying for about 40 days now continuously on this
    > server, I think at last check it had copied around 2,973 Terabytes of data
    > on this server, all without error.
    >
    > Primary Test Servers and configurations
    > HP ML570 Quad Xeon w/ 16GB Ram and 1.2TBytes local storage + 3.4TB SAN
    > storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL580 Dual Xeon w/8Gb or 16Gb Ram and 1.2TB Local Storage + 3.4 TB SAN
    > Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86 and x64.
    > HP DL380 G4p Dual Xeon w/4Gb Ram and 300GB Local Storage + 3.4 TB SAN
    > Storage on a Nexsan Sataboy Box. WinSvr 2003R2 SP2 x86
    > DELL 2950 dual quad Core 2.83Ghz w/32 GB Ram 320GB Local Storage 4.2TB on an
    > EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2
    > SP2 x86 and x64, Win Svr 2008 x64.
    > DELL 2950 dual quad Core 2.5Ghz w/4 GB Ram 140GB Local Storage 4.2TB on an
    > EMC AX150 SAN and 3.4 TB SAN Storage on a Nexsan Sataboy Box. WinSvr 2003R2
    > SP2 x86 and WinSvr2008 x86
    > Dell Precision 460T wkstation Dual Xeon 2.4Ghz with 4GB Ram and 1TB of local
    > storage (SATA) 3.4 TB SAN Storage on a Nexsan Sataboy Box via iSCSI WinSvr
    > 2003R2 SP2 x64
    > With the exception of the ML570 I've tested on multiple servers of the same
    > type. All servers have succeeded in copying large volumes of files 40Gb and
    > smaller.
    >
    > On our HPs all drives are SCSI hot swappable. Either 10 or 15K. We don't mix
    > spindle speeds in raid groups. On our Dells, all local drives are SAS, on
    > the EMC AX150 all drives are are SATA and on the SATABoy all drives are
    > SATA. On all systems file copies using small files up to 0 - 40GB are all
    > successful.
    > All systems are running the latest BIOS and we have seen the same behavior
    > on prior BIOS versions. All disk controller firware is updated to the latest
    > version and like the BIOSes the same behavior existed on earlier versions.
    > We have checked and updated hard disk firmware, where new versions are
    > available. Same issues as for controller firmware. On local Hard drives we
    > run Raid 1 or Raid 5 to get best performance or max capacity. Both Raid
    > modes exhibit the same behavior. I have tested on the HP's with no Raid at
    > all and the same results occurred.
    >
    > I try not to do specialty O/S builds for our lab environment. I build a
    > straight default Windows O/S configurations, fully patch it with microsoft
    > Patches and burn in test the system, then go test for this problem. I do not
    > tweak system settings or apply registry hacks until I get baseline test
    > data. In all cases here for the file copy tests I have not tweaked the
    > system settings or registry at all. Our servers are set for background
    > performance for the system cache, although we tested with 'foreground' set
    > without success too. We've tried large and small pagefiles and have moved
    > pagefiles to seperate disk spindles to see if it made a difference.
    > I've tried all manner of file copy and file sync tools, but what it comes
    > down to that if the file being copied is written on the systems' storage
    > system (local or SAN), the system will collapse and file copying will fail
    > somewhere around 52 - 58GB being copied. Windows Server 2008 has given me my
    > best look into the problem and what appears to be happening is that the
    > system cache keeps expanding until all physical memory is used and the paged
    > pool keeps growing until it hits around 380MB and non paged hits about 82MB
    > (I think the latter is right). What I then see is the CPU goes flat line as
    > does the Total Disk byes written in Perfmon but the Physical memory usage
    > history in task manager suddenly starts ramping until it gets out around
    > 70GB and then everything is done and either the system hangs (server2008) or
    > delayed write errors occur.
    >
    > One place where I do not seem to see the problem is SAN Drive to SAN Drive
    > Copies on the EMC SAN. I always see this problem on the SATABOY SAN with
    > large files when copying large files to the SAN volumes regardless of Cache
    > Settings. I can backup the files to tape but due to their size its an
    > expensive option both in media cost and time to backup and restore the data,
    > my preferred option is to backup to removeable Hard Disk (External USB -
    > SATA), sure its slower but it offers operating efficiencies right now I
    > cannot get with Tape (if it worked). Some testing has centered on using
    > external drives, but most of the testing on my systems has been to copy or
    > move the files from one volume to another on the server. It doesn't matter
    > whether I turn caching off or on for external USB drives or local drives. I
    > have had very occasional success on the EMC SAN ensuring that windows O/S
    > disk caching is off. Success using this method has been spotty and limited
    > to servers with 32GB or more of RAM. One very telling test setup was to
    > populate the Server with 128GB RAM and run windows Server 2008 x64. In that
    > case almost all file copies were successful, although they became painfully
    > slow after about 60GB was copied and it took over 5 hours to copy the last
    > ~35GB of a 95.8 GB file.
    >
    > I tried using Backup Exec and MsBackup to backup the files to a hard disk
    > but it failed everytime, when I run the same backup to tape it is
    > successful.
    > This is leading me to think the problem is generated in the lower level file
    > system filter drivers. I've tested with and without AntiVirus software in
    > the mix and have likewise tested systems that are built raw with no patches
    > at all and see the same problems. I also tried splitting the file into
    > chunks and copying the pieces and while I can split the file, I cannot join
    > it again as the processes all seem to rely on creating a temp file and the
    > process of copying the large temp file always results in the delayed write
    > errors being generated. I've also tried zipping up the file to reduce its
    > size (database files compress really well) but that process likewise
    > requires a file copy of a large file, and at some point that fails. It
    > should go without saying I've tried using SQL Db Backup writing to disk
    > storage and it fails everytime, tape is successful. In fact it was this very
    > act that caused me to begin investigating the problem in the first place.
    >
    > In days gone by there were loads of users seeing this problem on Win XP
    > copying much smaller files and there are some other people seeing this
    > problem today on Windows Server, Microsoft are very quiet on the subject for
    > windows server, I think in no small part because very few people are seeing
    > the problem and there is no demand to identify the problem or to fix it. I
    > cannot believe though I'm one of the first people to see the problem.
    > I'll quite happily accept its a configuration issue if someone can tell me
    > how to fix the problem! All attempt to tweak a system has not yielded any
    > success. What also sucks is that you typically have to wait ~30 mins to find
    > out that the problem will manifest.
    >
    > I'm about ready to escalate this issue to Microsoft, I think I now have
    > enough test data to do so, but thought I would bounce this off others to see
    > if anyone else has a solution or guidance first.
    >
    > Phil
    > Checkfree:



    >> On Saturday, May 24, 2008 1:16 PM John Fullbright wrote:


    >> The data section of the event ID 50 contains a lot of useful information for
    >> diagnosing the exact cause of the issue. Could you check the "words" radio
    >> button and post the dump data? Also, what is the source of the ID 50? is
    >> it mrxsmb or disk?
    >>
    >>
    >>
    >>
    >> "Phil Lewis" <> wrote in message
    >> news:...



    >>> On Saturday, May 24, 2008 1:22 PM Ace Fekay [MVP] wrote:


    >>> In news:,
    >>> Phil Lewis <> typed:
    >>>
    >>> Interesting dillema. I have not seen it before.
    >>>
    >>> What are the Source names for the event IDs you posted?
    >>>
    >>> Out of curiousity, I looked up EventID 50, but there are numerous Source
    >>> names, none of which their descriptions did I find to indicate a "Windows
    >>> Delayed Write" message, as you've described. Looking through different
    >>> possible error messages of "{Delayed Write Failed}," I found numerous
    >>> possiblities, but you would have to sort through them. One indicated a
    >>> problem wtih the MFT, another indicated issues with fragmentation causing
    >>> it, and one concerining clustering drivers conflicts and one more about
    >>> compatibility issues with the HBA the SAN is connected to.
    >>>
    >>> See if you can weed through the following EventID links. Look up based on
    >>> the Source name. If you don't have a subscription and it indicates an MSKB,
    >>> prefix the KB number so it the URL reads
    >>> "htttp://support.microsoft.com/kb/######" (###### being the KB number).
    >>>
    >>> http://eventid.net/display.asp?eventid=50&source=
    >>> http://eventid.net/display.asp?eventid=26&source=
    >>>
    >>>
    >>> --
    >>> Regards,
    >>> Ace
    >>>
    >>> This posting is provided "AS-IS" with no warranties or guarantees and
    >>> confers no rights.
    >>>
    >>> Ace Fekay, MCSE 2003 & 2000, MCSA 2003 & 2000, MCSE+I, MCT,
    >>> MVP Microsoft MVP - Directory Services
    >>> Microsoft Certified Trainer
    >>>
    >>> For urgent issues, you may want to contact Microsoft PSS directly. Please
    >>> check http://support.microsoft.com for regional support phone numbers.
    >>>
    >>> Infinite Diversities in Infinite Combinations



    >>>> On Saturday, May 24, 2008 2:03 PM John Fullbright wrote:


    >>>> Being that they're at Windows 2003 SP2 or Windows 2008, my guess would be
    >>>> CcDirtyPageThreshold is set too high. Let's see the details of the event ID
    >>>> 50. The event ID 26 is really immaterial, it's just the application popup
    >>>> message (you got the dialog that says lost delayed write)., The evend ID 50
    >>>> will contain the error code passed up the driver stack, as well as any SCSI
    >>>> sense data. Between that and possibly a short perfmon run during a copy, we
    >>>> should be able to sort this out. My guess is that we'll see half of RAM
    >>>> cached before writing starts, then you'll either see c000009a (out of
    >>>> resources) or a timeout due to slow disk once dirty pages are flushed. In
    >>>> perfom, you'll see heavy paging once dirty pages start flushing, and this is
    >>>> really the cause of the lack of server responsiveness.
    >>>>
    >>>> The narrative
    >>>>
    >>>> "Windows Server 2008 has given me my best look into the problem and
    >>>>
    >>>> would be supportive of this guess.
    >>>>
    >>>> John
    >>>>
    >>>>
    >>>>
    >>>> "Ace Fekay [MVP]" <> wrote in message
    >>>> news:%...



    >>>>> On Saturday, May 24, 2008 3:47 PM Phil Lewis wrote:


    >>>>> Hi John
    >>>>> Event 50 is being generated by Disk. I think that is being seen on all
    >>>>> servers, though I've seen one or two cases where mrxsmb was the source when
    >>>>> doing network copies. I can also confirm that at least 50% of RAM was indeed
    >>>>> cached, in most cases it was much higher. While we have seen c000009a (out
    >>>>> of resources), in almost all cases, disk timeouts occurred as a failure
    >>>>> point and yes indeed we experience very heavy paging when or just before
    >>>>> the system tips over. On Server 2008 we were seeing the dedicated pagefile
    >>>>> disk active solidly for over an hour before we finally gave up waiting for
    >>>>> the copy to quit and the system to shutdown.
    >>>>> I'll get the event viewer dumps later tonight and I can also get a Perfmon
    >>>>> Dump as well. Let me know what counters you would like to see included so I
    >>>>> get you exactly the necessary data. tonight, I can just load a Perfmon
    >>>>> template on the server and dump the output if you want to specify a bunch of
    >>>>> counters.
    >>>>>
    >>>>> I have not played with CcDirtyPageThreshold yet, I was extremely tempted
    >>>>> last week, but have thus far resisted. We run SP2 on all our WinSvr2003
    >>>>> servers. and one article I read had a hotfix for the problem that applied
    >>>>> only to SP1. Microsoft Support confirmed the fix in the article was included
    >>>>> in SP2 and I've not yet added the DWord value to modify it.
    >>>>>
    >>>>> Phil
    >>>>>
    >>>>> .
    >>>>> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >>>>> news:...



    >>>>>> On Monday, May 26, 2008 2:39 AM John Fullbright wrote:


    >>>>>> The fix is there with SP2, but you still need to set the registry key.
    >>>>>>
    >>>>>> cache
    >>>>>> copy reads/se
    >>>>>> data flush pages/sec
    >>>>>>
    >>>>>> memory
    >>>>>> pages/sec
    >>>>>> free system page table entries
    >>>>>>
    >>>>>> physical disk
    >>>>>> sec/read
    >>>>>> sec/write
    >>>>>> reads/sec
    >>>>>> writes/sec
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> John Fullbright [Exchange MVP]
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> "Phil Lewis" <> wrote in message
    >>>>>> news:...



    >>>>>>> On Wednesday, May 28, 2008 9:50 AM Phil Lewis wrote:


    >>>>>>> Hi John I'm about to grab the counters to a CSV File or do you prefer
    >>>>>>> Binary?
    >>>>>>>
    >>>>>>> What do you recommend in terms of sample interval? I typically use 5 Second
    >>>>>>> intervals for this kind of thing but will adjust to whatever you prefer.
    >>>>>>>
    >>>>>>> I'm grabbing Physical disk counters for both _Total and the Target disk
    >>>>>>> [D:\] for the Copy.
    >>>>>>>
    >>>>>>> The first set of counters will be without the reg key being added to see the
    >>>>>>> system in the raw state. I'll then add the key, what do you recommend in
    >>>>>>> terms of a value? the Server is a HP ML570 Quad Xeon Dual Core 3.33GHz with
    >>>>>>> 16GB of Ram. Reading suggests starting at half of installed Ram which would
    >>>>>>> be 8192.
    >>>>>>>
    >>>>>>> Phil
    >>>>>>>
    >>>>>>>
    >>>>>>> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >>>>>>> news:...



    >>>>>>>> On Wednesday, May 28, 2008 5:53 PM MatYoun wrote:


    >>>>>>>> Avg Disk Sec Read and write would be handy. if you are getting Lost write
    >>>>>>>> because your array is being overun beyond the spindle count, then this may
    >>>>>>>> show up in a latency response spike beforehand.
    >>>>>>>>
    >>>>>>>> It would also be good to match IO versus spindle count. To do this properly
    >>>>>>>> you will need cache hit ratio from the arrays.
    >>>>>>>>
    >>>>>>>> rgds
    >>>>>>>>
    >>>>>>>> mat
    >>>>>>>>
    >>>>>>>> "Phil Lewis" wrote:



    >>>>>>>>> On Wednesday, May 28, 2008 6:03 PM Phil Lewis wrote:


    >>>>>>>>> Well I'm getting some very interesting findiong that I don't yet know what
    >>>>>>>>> to make of.
    >>>>>>>>>
    >>>>>>>>> First off It looks like the whole train of events ultimately leading up to
    >>>>>>>>> the delayed write failure begins about 2-4 mins into the copy process. The
    >>>>>>>>> File copy in Windows Explorer begins normally and the 88.6GByte copy shows
    >>>>>>>>> it is expecting to take around 24-26 mins. Watching a whole slew of counters
    >>>>>>>>> in Perfmon I can see the Cache Bytes Growing steadily and Disk Total Reads
    >>>>>>>>> per second running avg around 190. Somewhere around 2-4 mins into the copy
    >>>>>>>>> the time to complete estimate in windows Explorer suddenly climbs
    >>>>>>>>> dramatically topping out around 100Mins. At the exact same time in perfmon I
    >>>>>>>>> see a sudden drop in the Total Disk Reads / Sec. The level drops from about
    >>>>>>>>> 190 AVG to about 32 and the cache bytes growth goes dead flat. Avg Disk
    >>>>>>>>> queue stays the same and Writes continue to occur at the same levels as
    >>>>>>>>> before. After several minutes the Disk reads go back up to their previous
    >>>>>>>>> levels for a minute or two then they drop again and shortly after that, the
    >>>>>>>>> Disk write counters tank again and within 10-15 seconds The Delayed Write
    >>>>>>>>> error occurs.
    >>>>>>>>>
    >>>>>>>>> So far I cannot see what is causing the Disk Reads to suddenly tank to about
    >>>>>>>>> 25% of the initial level. When the system, picks back up midstream there is
    >>>>>>>>> a sudden spike in the number of Lazy Write Flushes and lazy write pages
    >>>>>>>>> lasting about 1 or 2 seconds then they drop off to almost zero.
    >>>>>>>>>
    >>>>>>>>> Throughout copy the Free System Page Table Entries is flat at 40209. When
    >>>>>>>>> the system picks up again mid-Stream, the Cache Bytes and System Cache
    >>>>>>>>> Resident Bytes suddenly ramp up and seems to flatten out at approx
    >>>>>>>>> 360,000,000 Bytes. Once the system Tanks the Bytes stay up until the file
    >>>>>>>>> copy is aborted and after a few mins it drops back to about 125,000,000
    >>>>>>>>> Bytes.
    >>>>>>>>>
    >>>>>>>>> John, I'll Email you the Perfmon trace from the first run earlier later
    >>>>>>>>> tonight.
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >>>>>>>>> news:...



    >>>>>>>>>> On Wednesday, May 28, 2008 6:50 PM Kenny Speer wrote:


    >>>>>>>>>> Please take these comments with a grain of salt. It's typically pretty
    >>>>>>>>>> difficult to troubleshoot these issues via newsgroup.
    >>>>>>>>>>
    >>>>>>>>>> Phil Lewis wrote:
    >>>>>>>>>>
    >>>>>>>>>> [Kenny] This is most likely being caused by two things. First the NTFS
    >>>>>>>>>> Cache writes is where your low time comes from. The AVG 190 IOPs to
    >>>>>>>>>> your array are most likely from a write-back cache in the array.
    >>>>>>>>>>
    >>>>>>>>>> Somewhere around 2-4 mins into the copy
    >>>>>>>>>>
    >>>>>>>>>> [Kenny] You just filled your cache for the filesystem.
    >>>>>>>>>>
    >>>>>>>>>> At the exact same time in perfmon I
    >>>>>>>>>>
    >>>>>>>>>> [Kenny] Array cache is most likely filled at this point, so your
    >>>>>>>>>> throughput to the array just dropped since at this point you are running
    >>>>>>>>>> as fast as the array can write to disk instead of how fast it can write
    >>>>>>>>>> to cache.
    >>>>>>>>>>
    >>>>>>>>>> Avg Disk
    >>>>>>>>>>
    >>>>>>>>>> [Kenny] This is the cache filling and flushing then filling and flushing.
    >>>>>>>>>>
    >>>>>>>>>> and shortly after that, the
    >>>>>>>>>>
    >>>>>>>>>> [Kenny] Hmm, not sure about this. It is quite possible that the array
    >>>>>>>>>> simply dropped the IO on the floor. What do the event logs show up to
    >>>>>>>>>> this point? Are there any logs you can enable on the target?



    >>>>>>>>>>> On Wednesday, May 28, 2008 9:55 PM John Fullbright wrote:


    >>>>>>>>>>> Note that all the target storage listed is SATA disk. RAID 1 has a write
    >>>>>>>>>>> penalty of 2 and RAID 5 has a write penalty of 4, compounding the issue.
    >>>>>>>>>>> Eveng copying from one RAID array to another, your read performance will be
    >>>>>>>>>>> greater than your write performance. This is totally consistent with the
    >>>>>>>>>>> CcDirtyPageThreshold set too high scenario. Once host cache fills to the
    >>>>>>>>>>> point where it starts flushing, you quickly swamp any cache on the storage
    >>>>>>>>>>> and the high page faults begin. If you lower the threshold, the idea is that
    >>>>>>>>>>> you wil stay within the fill/flush range and not swamp the disk.
    >>>>>>>>>>>
    >>>>>>>>>>>
    >>>>>>>>>>> "Kenny Speer" <> wrote in message
    >>>>>>>>>>> news:usoA%...



    >>>>>>>>>>>> On Thursday, May 29, 2008 9:38 AM Phil Lewis wrote:


    >>>>>>>>>>>> Yep I believe this is what I'm looking at right now But I'm in need of
    >>>>>>>>>>>> guidance on what value to set for the registry entry:
    >>>>>>>>>>>> SystemCacheDirtyPageThreshold at least to start with. I've tried several
    >>>>>>>>>>>> very large values and two smaller ones 896 and 256 and neither corrected the
    >>>>>>>>>>>> issue but did produce subtlely differing behaviors in the counters. Very
    >>>>>>>>>>>> Small values get the dropoff in disk reads / Sec to occur in just a few secs
    >>>>>>>>>>>> of starting which is fine and About a minute later I See a spike in the lazy
    >>>>>>>>>>>> writes. This is preceeded by a jump in the Cache Bytes and System Cache
    >>>>>>>>>>>> Resident Bytes. What else is interesting and here is where the problems come
    >>>>>>>>>>>> from is that the cache grows each time just before lazy write occurs UNTIL
    >>>>>>>>>>>> System Cache Resident Bytes hits right around 522,000,000 Bytes then the
    >>>>>>>>>>>> frequency inteval of lazy write flushes drop from about 30 - 40 secs to
    >>>>>>>>>>>> about 3 or more mins and cache growth goes flat line. The writes are
    >>>>>>>>>>>> continuing but clearly the system seems to be getting overwhelmed. it should
    >>>>>>>>>>>> also be noted that lazy writes and flushes are occurring still but the
    >>>>>>>>>>>> frequent flushes stop once the sytem cache get filled.
    >>>>>>>>>>>> Something else interesting is that the Cache Bytes is continuing to grow
    >>>>>>>>>>>> slowly while System Cache Resident Bytes is essentially flat it look like
    >>>>>>>>>>>> system cache include the full system working set in its size and the working
    >>>>>>>>>>>> set appears to grow over time. I'm betting that once cache bytes get closer
    >>>>>>>>>>>> to 960-1024 MB the system will tip over.
    >>>>>>>>>>>>
    >>>>>>>>>>>> I take it there is no way to adjust the forced writes interval of the lazy
    >>>>>>>>>>>> writes, is that true?
    >>>>>>>>>>>>
    >>>>>>>>>>>> I should also point out that it does not matter what disk type or raid mode
    >>>>>>>>>>>> is used. I'm easily able to produce this error on any Windows Server 2000
    >>>>>>>>>>>> 2003 2008 running any combination of disks SCSI, SAS, SATA PATA,
    >>>>>>>>>>>> ExtUSB-SATA.All I need is a large file >52GB in size, which I have a good
    >>>>>>>>>>>> number of.
    >>>>>>>>>>>>
    >>>>>>>>>>>> Phil
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>> "John Fullbright" <fjohn@donotspamnetappdotcom> wrote in message
    >>>>>>>>>>>> news:%...



    >>>>>>>>>>>>> On Tuesday, July 01, 2008 1:03 AM Dilip Naik wrote:


    >>>>>>>>>>>>> Phil, can you provide an update to this problem? Have you found a solution?
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> I am highly interested in the solution for personal reasons and am willing
    >>>>>>>>>>>>> to spend some cycles tracking one down if not already found
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> Dilip
    >>>>>>>>>>>>> www.msftmvp.com
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> "Phil Lewis" <> wrote in message
    >>>>>>>>>>>>> news:e$...



    >>>>>>>>>>>>>> On Tuesday, July 01, 2008 10:44 AM Dilip Naik wrote:


    >>>>>>>>>>>>>> Phil,
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> One more suggestion
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> Reading this mail thread clearly indicates that both read and write caching
    >>>>>>>>>>>>>> are suspected. Why not try getting rid of Large Cache and rerun the
    >>>>>>>>>>>>>> experiment?
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> Contact me off line - I have a filter driver that I would be willing to
    >>>>>>>>>>>>>> quickly modify to suit your needs - the driver would ensure certain files
    >>>>>>>>>>>>>> bypass the System Cache. I am not looking for any $$ to do this.
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> Dilip
    >>>>>>>>>>>>>> www.msftmvp.com
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> "Dilip Naik" <> wrote in message
    >>>>>>>>>>>>>> news:...



    >>>>>>>>>>>>>>> On Tuesday, July 22, 2008 11:18 AM Robert Rohlin wrote:


    >>>>>>>>>>>>>>> I'm having the same issue running XP Pro x64 SP2. When using Vice Versa Pro to copy from one drive to another, it eventually starts generating event ID 50 followed by event ID 26. I'm copying from my external USB 2.0 Drobo (which worked fine with Vista-32 bit or XP 32-bit) to an internal hard drive. I'm eager to hear the solution to this issue.



    >>>>>>>>>>>>>>>> On Tuesday, January 27, 2009 12:48 PM Jeremy R wrote:


    >>>>>>>>>>>>>>>> I didn't read all of the replies but it looks like one of our rackspace techs replied :). We've had this issue plaguing us on our 2003 R2 SP2 x64 cluster. When our managed backup solution runs, it backs up a 107gb sql compressed file and basically steals all memory resources from the OS as you've seen.
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> If you want a quick and dirty workaround, don't initiate file copies from remote systems, initiate from the system that has the large file, and use Teracopy. Teracopy so far is the only program I have seen that bypasses the windows cache and uses direct I/O transfer. I can transfer the 107gb file now across the network and it doesn't steal 1kb from any process. We tried the CcDirtyPageThreshold and windows server ignores it half the time. Sometimes it works, sometimes it doesn't obey....(through windows explorer or robocopy). Teracopy seems to work every time.
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> Another workaround to buy time is to use Redgate backup tools, we can knock a 240gb sql database down to 100gb or so using their max compression. Redgate also allows you to split your backup files into as many as you like, so you can still push your 100gb file in 8gb chunks across the network.
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>>
    >>>>>>>>>>>>>>>> Sorry if this was already recommended...didn't have time to read every reply.



    >>>>>>>>>>>>>>>> Submitted via EggHeadCafe - Software Developer Portal of Choice
    >>>>>>>>>>>>>>>> MongoDb vs SQL Server Basic Speed Tests
    >>>>>>>>>>>>>>>> http://www.eggheadcafe.com/tutorial.../mongodb-vs-sql-server-basic-speed-tests.aspx
    Brian Drab, Sep 27, 2010
    #17
  18. Re: Encountering the exact same issue

    On Sep 26, 9:52 pm, Brian Drab <> wrote:
    > Did you ever find a solution to this? We are using an MSA1000 SAN and have the EXACT symptoms you describe execept that our limit is about 36 GB. Under is fine...Over is exactly as you describe.
    >


    A solution to what? The original post is over 2 years old, so what's
    was the problem? If there were an answer I think it would have been
    posted sometime in the intervening years!
    Bobby Johnson, Sep 29, 2010
    #18
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Scott D. Weber For Unuathorized Thoughts Inc.

    Backing Up Large Files..Or A Large Amount Of Files

    Scott D. Weber For Unuathorized Thoughts Inc., Sep 17, 2003, in forum: Computer Support
    Replies:
    1
    Views:
    952
    Joseph
    Sep 19, 2003
  2. A.F. Berlijn

    Delayed Write Failed

    A.F. Berlijn, Jan 26, 2004, in forum: Computer Support
    Replies:
    1
    Views:
    524
  3. John Leeth

    Delayed write failed message

    John Leeth, Mar 2, 2004, in forum: Computer Support
    Replies:
    4
    Views:
    4,281
    °Mike°
    Mar 2, 2004
  4. EMB
    Replies:
    6
    Views:
    459
  5. Joe
    Replies:
    4
    Views:
    1,077
Loading...

Share This Page