Removing Duplicate Messages

Discussion in 'Firefox' started by Jowah, Jan 12, 2005.

  1. Jowah

    Jowah Guest

    I did some folder maintenance in TB earlier, which resulted in a few
    thousand duplicate emails. (Believe me, I had no choice.) Does anyone
    know of a method or extension for deleting duplicates?

    thanks in advance,

    jowah
     
    Jowah, Jan 12, 2005
    #1

  2. Ralph Fox

    Ralph Fox Guest

    On Wed, 12 Jan 2005 05:07:40 GMT, in message

    How to remove duplicates from a Tbird mail folder.

    (The following instructions are for Windows users, but could be
    adapted for Linux too.)


    1. Copy the script below between the ~~~~~ wavy lines, and
    save it to a file named "dedup_mbox.awk" in an empty directory
    somewhere on your hard drive.

    2. The script is an AWK script. To run this script you will need
    a copy of gawk.exe or awk.exe.

    Go to http://unxutils.sourceforge.net/
    download the file UnxUtils.zip
    extract the file gawk.exe from the zip file
    and put gawk.exe in the same directory.

    3. Each Tbird e-mail folder xxxxx corresponds to two files on
    your hard drive:
    xxxxx (no file extension)
    xxxxx.msf (a .msf file extension)

    Find these files. They will be in the directory shown here
    in Tbird's settings:

    Tools -> Account Settings -> (select account) -> Server Settings
    Local directory: [_________________________]


    4. Before you do anything more below, first compact the folder xxxxx
    which you want to remove duplicates from. This is _VERY_ important.
    • Either: File -> Compact Folders
    • or: right-click on folder xxxxx -> Compact This Folder


    5. Copy the file xxxxx to the same directory as "dedup_mbox.awk"
    and "gawk.exe", and rename the copy to xxxxx_old.

    6. Open a DOS window, go to this directory, and
    enter the DOS command

    gawk.exe -f dedup_mbox.awk xxxxx_old >xxxxx_new

    where xxxxx is the name of the local folder file
    from step #3 above.

    You will now have a new file xxxxx_new which is the
    same as xxxxx_old but with the duplicates removed.

    7. Close Tbird and replace the original xxxxx
    with a copy of xxxxx_new renamed to just xxxxx.

    When you re-open Tbird, the e-mail folder xxxxx
    will now have all the duplicates removed.

    If the folder still lists the old messages, close Tbird again
    and delete xxxxx.msf. This is only an index file, and Tbird
    rebuilds it the next time the folder is opened.

    If you ever need to do this again, you only need to repeat from
    step #4 on.
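
As a sanity check between steps 6 and 7, it can help to compare how many messages are in xxxxx_old and xxxxx_new before replacing anything. The following is a minimal sketch in Python, which is not part of the original instructions and assumes Python 3 is installed. The function names are hypothetical. It simply counts lines that start with "From ", the same separator the script keys on.

```python
def count_separators(lines):
    """Count lines that start with b'From ', the mbox message separator."""
    return sum(1 for line in lines if line.startswith(b"From "))


def count_messages(path):
    """Count the messages in the mbox file at `path` (read as raw bytes)."""
    with open(path, "rb") as mbox:
        return count_separators(mbox)


def report(old_path, new_path):
    """Return (messages before, messages after, messages removed)."""
    before = count_messages(old_path)
    after = count_messages(new_path)
    return before, after, before - after
```

For example, `print(report("xxxxx_old", "xxxxx_new"))` run in the same directory prints the three counts. If the "removed" figure is far from the number of duplicates you expected, keep the original file and investigate before doing step 7.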


    Here is the script itself


    dedup_mbox.awk
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #!/usr/bin/awk -f
    #
    # ---------------------------------------------------------------
    #
    # Removes duplicate messages from an mbox file.
    # Duplicates are identified by message-id.
    #
    # ---------------------------------------------------------------


    BEGIN {

        # this is a state-driven program
        # states are "INIT", "SCAN", "SKIP" and "COPY"

        state = "INIT" ;

        backbuffer = "" ;

    }

    /^From / {

        # separator line indicating the start of a message in the file

        if ( state == "SCAN" )
        {
            # We only get here when we encounter a new message in the file,
            # and the previous message in the file had no body and no
            # message-id header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
        }

        # set state to scanning for a message-id header,
        # saving text to print if this is found to be a new message.

        state = "SCAN" ;
        backbuffer = "" ;
    }

    tolower( $0 ) ~ /^message-id: / {

        # Header names are case-insensitive, so the test above also
        # matches "Message-Id:" and "message-id:".

        # if we are in SCAN state, then this is the message-id header
        # we are looking for.

        if ( state == "SCAN" )
        {
            # in header and not yet seen a message-id header

            message_id = $2 ;

            if ( have_seen[ message_id ] == "YES" )
            {
                # this is a duplicate message
                #printf "Duplicate: %s\n", message_id | "cat 1>&2;" ;

                backbuffer = "" ;
                state = "SKIP" ;
            }
            else
            {
                # not a duplicate.
                #printf "Original: %s\n", message_id | "cat 1>&2;" ;

                have_seen[ message_id ] = "YES" ;

                # print from separator and all header lines before this.

                if ( backbuffer != "" )
                    printf( "%s", backbuffer ) ;
                backbuffer = "" ;
                state = "COPY" ;
            }
        }
    }

    /^$/ {

        # empty line, possibly the boundary between headers and body

        if ( state == "SCAN" )
        {
            # We only get here when we reach the end of the header
            # and there was no message-id in header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
            state = "COPY" ;
        }
    }

    {
        # any line: buffer it while scanning, drop it while skipping,
        # otherwise copy it straight to the output

        if ( state == "SCAN" )
        {
            backbuffer = backbuffer $0 "\n" ;
        }
        else if ( state == "SKIP" )
        {
            backbuffer = "" ;
        }
        else
        {
            printf( "%s%s\n", backbuffer, $0 ) ;
            backbuffer = "" ;
        }
    }

    END {

        if ( state == "SCAN" )
        {
            # We only get here when we reach the end of the file,
            # and the last message in the file had no body and no
            # message-id header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
        }

        state = "INIT" ;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
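
For readers who find the state machine easier to follow in another language, here is a rough Python sketch of the same idea: buffer each message's headers until a Message-ID is seen, then either copy or skip the message. It is an illustration under assumptions, not a replacement for the script above. The names `dedup_mbox` and `dedup_file` are hypothetical, the header name is matched case-insensitively, and a Message-ID line with no value is treated like any other header. Both of those are choices made for this sketch.

```python
def dedup_mbox(lines):
    """Yield mbox lines, dropping messages whose Message-ID was already seen."""
    seen = set()      # Message-ID values encountered so far
    state = "INIT"    # INIT, SCAN (in headers), SKIP (duplicate), COPY
    pending = []      # header lines held back until the ID is known
    for line in lines:
        if line.startswith("From "):
            if state == "SCAN":
                # previous message had no body and no Message-ID: keep it
                yield from pending
            state, pending = "SCAN", []
        elif state == "SCAN":
            fields = line.split()
            if line.lower().startswith("message-id: ") and len(fields) > 1:
                if fields[1] in seen:
                    state, pending = "SKIP", []    # duplicate: drop it
                else:
                    seen.add(fields[1])
                    yield from pending             # first copy: keep it
                    state, pending = "COPY", []
            elif line.strip("\r\n") == "":
                # end of headers with no Message-ID: treat as not a duplicate
                yield from pending
                state, pending = "COPY", []
        if state == "SCAN":
            pending.append(line)
        elif state != "SKIP":
            yield line
    if state == "SCAN":
        yield from pending


def dedup_file(src, dst):
    """Write a de-duplicated copy of mbox file `src` to `dst`."""
    # latin-1 maps every byte to one character and newline="" disables
    # newline translation, so kept lines are written back unchanged.
    with open(src, encoding="latin-1", newline="") as fin, \
         open(dst, "w", encoding="latin-1", newline="") as fout:
        fout.writelines(dedup_mbox(fin))
```

Calling `dedup_file("xxxxx_old", "xxxxx_new")` would play the role of step 6. As with the script, messages without a Message-ID header are always kept.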


    --
    Cheers,
    Ralph

    "There is only one boss, the customer. And he can fire everybody in
    the company from the chairman on down, simply by spending his money
    somewhere else." -- Sam Walton
     
    Ralph Fox, Jan 13, 2005
    #2