understand Mozilla Thunderbird files...

Discussion in 'Firefox' started by Joh, Nov 27, 2004.

  1. Joh

    Joh Guest

    hello,

    i would like to write a python script to remove duplicate emails from
    mozilla thunderbird inbox files... or at least dump these email
    headers to a text file.

    please, where can i find some urls to understand how these files are
    coded ?

    thanks
    Joh, Nov 27, 2004
    #1

  2. Joh wrote:

    > i would like to write a python script to remove duplicate emails from
    > mozilla thunderbird inbox files... or at least dump these email
    > headers to a text file.
    >
    > please, where can i find some urls to understand how these files are
    > coded ?


    Search for "mbox format"!

    J
    Jürgen Harter, Nov 27, 2004
    #2

  3. Joh

    Ralph Fox Guest

    On 27 Nov 2004 09:21:05 -0800, in message
    <>, Joh wrote:

    > i would like to write a python script to remove duplicate emails from
    > mozilla thunderbird inbox files...


    The AWK script below will do this.


    > or at least dump these email
    > headers to a text file.
    >
    > please, where can i find some urls to understand how these files are
    > coded ?



    1. Each email folder in Mozilla or Mozilla Thunderbird corresponds to
    two files. For example, for the "Inbox" folder there are two
    files

    Inbox
    Inbox.msf

    1.1 The file "Inbox" (no extension) is in mbox file format.

    A web search will turn up many descriptions of the mbox file
    format.

    Also see note 2 below

    1.2 The file "Inbox.msf" is a summary file. It will be
    re-created if it does not exist.
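Since the original question also asked about dumping headers to a text file, here is a minimal Python sketch of that, assuming the folder file really is plain mbox as described in 1.1. It relies on the standard-library `mailbox` module; the function name `dump_headers`, the default header list and the file names in the usage note are illustrative choices, not anything Thunderbird defines.

```python
import mailbox


def dump_headers(mbox_path, out_path,
                 names=("From", "Date", "Subject", "Message-ID")):
    """Write the selected headers of every message in an mbox file to a text file.

    Returns the number of messages seen. The header list is an arbitrary default.
    """
    # create=False: do not silently create an empty mailbox if the path is wrong
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            for msg in box:
                for name in names:
                    # header lookup is case-insensitive; a missing header gives ''
                    out.write(f"{name}: {msg.get(name, '')}\n")
                out.write("\n")
                count += 1
    finally:
        box.close()
    return count
```

A call such as `dump_headers("Inbox", "Inbox_headers.txt")` only reads the folder file; it is still safest to run it on a copy, with Thunderbird closed.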


    2. Mozilla adds two proprietary headers to received email messages;
    they hold message status flags (e.g. read/unread, flagged, marked
    as deleted, etc.). If you look at the messages in Mozilla
    Thunderbird's mbox files, you will see these two extra headers
    (with possibly different values).

    | X-Mozilla-Status: 8001
    | X-Mozilla-Status2: 00000000
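To make the status header a little less opaque, the sketch below decodes the hexadecimal value of `X-Mozilla-Status` into flag names. The bit values in `FLAGS` are an assumption taken from Mozilla's message-flag definitions as I remember them (read, replied, flagged, marked as deleted); please check them against the Thunderbird source before relying on them. `decode_status` is a made-up helper name, and `X-Mozilla-Status2` is not handled.

```python
# Assumed bit meanings for X-Mozilla-Status (verify against Mozilla's sources).
FLAGS = {
    0x0001: "read",
    0x0002: "replied",
    0x0004: "flagged",
    0x0008: "deleted",
}


def decode_status(value):
    """Return the names of the known flag bits set in an X-Mozilla-Status value.

    `value` is the hexadecimal text after the colon, e.g. "8001".
    Bits that are not listed in FLAGS are ignored.
    """
    bits = int(value.strip(), 16)
    return [name for mask, name in FLAGS.items() if bits & mask]
```

Under these assumptions, `decode_status("8001")` reports only `read` (the high bit is not in the table), and `decode_status("0009")` reports `read` and `deleted`.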


    3. Below is an AWK script to remove duplicate emails from an mbox format
    file. Duplicates are detected by comparing the message-IDs.

    You can use this script on Mozilla Thunderbird's mbox format
    mail files. For example, with the "Inbox" email folder...

    3.1 First compact the "Inbox" email folder in Mozilla.
    3.2 Run the file "Inbox" through this AWK script to generate
    an output file (say) "Inbox_dedup.tmp", and then replace the
    file "Inbox" with "Inbox_dedup.tmp".
    3.3 Delete the file "Inbox.msf".
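For anyone who prefers Python, as the original question did, step 3.2 could be sketched roughly as follows. This is not a port of the AWK script, only a sketch of the same idea: split the file at "From " separator lines, look for a Message-ID in each header block, and keep only the first message for each ID. The names `split_messages` and `dedup_mbox` are illustrative. The header name is matched case-insensitively, and messages without a Message-ID are always kept.

```python
import re

# Header names are case-insensitive, so match Message-ID / Message-Id alike.
MSGID = re.compile(rb"^Message-ID:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def split_messages(lines):
    """Group lines into messages; each message starts at a 'From ' separator."""
    current = []
    for line in lines:
        if line.startswith(b"From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def dedup_mbox(src_path, dst_path):
    """Copy src_path to dst_path, dropping messages whose Message-ID was seen.

    Returns (kept, dropped). Messages with no Message-ID are never dropped.
    """
    seen = set()
    kept = dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for message in split_messages(src):
            raw = b"".join(message)
            # the header block ends at the first empty line
            header = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
            match = MSGID.search(header)
            if match and match.group(1) in seen:
                dropped += 1
                continue
            if match:
                seen.add(match.group(1))
            dst.write(raw)
            kept += 1
    return kept, dropped
```

Usage would follow steps 3.1 to 3.3: compact the folder, close Thunderbird, run `dedup_mbox("Inbox", "Inbox_dedup.tmp")`, replace "Inbox" with the output, and delete "Inbox.msf". Keep a backup of the original file.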


    4. When you delete a message from a Mozilla Thunderbird email folder,
    Mozilla Thunderbird does not immediately remove the message from
    the mbox file. Instead, Mozilla Thunderbird sets a status flag
    to indicate that the message is marked as deleted. The deleted
    message is removed when you "Compact" the folder in Mozilla
    Thunderbird.

    The AWK script below will not notice that a message has been
    marked as deleted (it does not check the proprietary Mozilla flags).

    So if you don't compact the email folder first, then you run
    the risk of the following situation happening.

    Risk
    • There are two copies of the same message (a duplicate).
    • The first copy in the file is marked for delete; the second is not.
    • The AWK script keeps the first copy, sees the second is a duplicate
    and removes it.
    • The only remaining copy is the one marked for delete, so the next
    "Compact" removes the message entirely.
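One way to sidestep this risk in a script of your own is to treat messages marked as deleted the way compacting would, and skip them before looking for duplicates. The Python sketch below shows only that selection logic. It assumes that bit 0x0008 of `X-Mozilla-Status` means "marked as deleted", which should be verified against Mozilla's sources; `is_marked_deleted` and `pick_kept` are hypothetical helper names.

```python
import re

# Matches X-Mozilla-Status but not X-Mozilla-Status2 (the colon must follow directly).
STATUS = re.compile(rb"^X-Mozilla-Status:[ \t]*([0-9A-Fa-f]+)", re.MULTILINE)
EXPUNGED = 0x0008  # assumed bit for "marked as deleted"


def is_marked_deleted(header):
    """True if the header block has X-Mozilla-Status with the deleted bit set."""
    match = STATUS.search(header)
    return bool(match) and bool(int(match.group(1), 16) & EXPUNGED)


def pick_kept(messages):
    """Choose which messages survive.

    `messages` is a list of (message_id, header_bytes) pairs in file order.
    Returns the indices to keep: deleted messages are skipped first, then
    the first remaining copy of each message_id wins.
    """
    seen = set()
    kept = []
    for index, (message_id, header) in enumerate(messages):
        if is_marked_deleted(header):
            continue  # drop it, as compacting the folder would
        if message_id in seen:
            continue  # a live copy of this message was already kept
        seen.add(message_id)
        kept.append(index)
    return kept
```

With the two copies from the risk scenario, the first (marked for delete) is skipped and the second is kept, so the message survives a later compact.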


    5. The script dedup_mbox.awk

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #!/usr/bin/awk -f
    #
    # ---------------------------------------------------------------
    #
    # Removes duplicate messages from a mbox file.
    # Duplicates are identified by message-id.
    #
    # ---------------------------------------------------------------


    BEGIN {

        # this is a state-driven program
        # states are "INIT", "SCAN", "SKIP" and "COPY"

        state = "INIT" ;

        backbuffer = "" ;

    }

    /^From / {

        # separator line indicating the start of a message in the file

        if ( state == "SCAN" )
        {
            # We only get here when we encounter a new message in the file,
            # and the previous message in the file had no body and no
            # message-id header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
        }

        # set state to scanning for a message-id header,
        # saving text to print if this is found to be a new message.

        state = "SCAN" ;
        backbuffer = "" ;
    }

    /^Message-ID: / {

        # if we are in SCAN state, then this is the message-id header we are looking for.

        if ( state == "SCAN" )
        {
            # in header and not yet seen a message-id header

            message_id = $2 ;

            if ( have_seen[ message_id ] == "YES" )
            {
                # this is a duplicate message
                #printf "Duplicate: %s\n", message_id | "cat 1>&2;" ;

                backbuffer = "" ;
                state = "SKIP" ;
            }
            else
            {
                # not a duplicate.
                #printf "Original: %s\n", message_id | "cat 1>&2;" ;

                have_seen[ message_id ] = "YES" ;

                # print from separator and all header lines before this.

                if ( backbuffer != "" )
                    printf( "%s", backbuffer ) ;
                backbuffer = "" ;
                state = "COPY" ;
            }
        }
    }

    /^$/ {

        # empty line, possibly the boundary between headers and body

        if ( state == "SCAN" )
        {
            # We only get here when we reach the end of the header
            # and there was no message-id in header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
            state = "COPY" ;
        }
    }

    {
        # any line

        if ( state == "SCAN" )
        {
            backbuffer = backbuffer $0 "\n" ;
        }
        else
        if ( state == "SKIP" )
        {
            backbuffer = "" ;
        }
        else
        {
            printf( "%s%s\n", backbuffer, $0 ) ;
            backbuffer = "" ;
        }
    }

    END {

        if ( state == "SCAN" )
        {
            # We only get here when we reach the end of the file,
            # and the last message in the file had no body and no
            # message-id header.

            # The code below makes a policy decision to treat any message
            # with no message-id header as not a duplicate.

            if ( backbuffer != "" )
                printf( "%s", backbuffer ) ;
            backbuffer = "" ;
        }

        state="INIT" ;
    }
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


    --
    Cheers,
    Ralph

    "Curiosity skilled the cat."
    Ralph Fox, Nov 28, 2004
    #3
  4. Joh

    smalolepszy Guest

    I have 2 questions:

    1. Do you have any specification of the msf file format? I saw that
    this file contains subject, from, etc.

    2. Do you know where I can find the Thunderbird folders? Is there
    any information about their location in the Windows registry? And if a
    user has many profiles, which one is the main profile?

    Thanks.
    smalolepszy, Dec 3, 2004
    #4
  5. Joh

    Joh Guest

    exactly what i need :)

    thanks to you both :) i now understand the format and have a ready-to-use tool :)
    Joh, Dec 3, 2004
    #5
  6. Joh

    Moz Champion Guest

    smalolepszy wrote:

    > I have 2 questions:
    >
    > 1. Do you have any specification of the msf file format? I saw that
    > this file contains subject, from, etc.
    >
    > 2. Do you know where I can find the Thunderbird folders? Is there
    > any information about their location in the Windows registry? And if a
    > user has many profiles, which one is the main profile?
    >
    > Thanks.


    A .msf file is a Mail Summary file.
    Mozilla creates one when it accesses a folder/file that doesn't
    already have one. Since Moz does this automatically, the general
    response to any problem with these files is simply to delete them with
    Mozilla closed; the product will then recreate them when the
    user next accesses the file/folder. They can be identified in the Mail
    folder by the suffix .msf, of course, i.e. for the Inbox there will be a
    file called Inbox, a file called Inbox.msf (which is the summary/index)
    and perhaps an Inbox.sbd (if there are subfolders in the Inbox itself).

    The precise location of your profile folder depends on the version you
    are running as well as your operating system.
    In the current version of Thunderbird, use the HELP menu and view the
    release notes; the profile location for specific systems is listed there.

    If you have multiple profiles, the user should know which one he/she is
    using, because they chose it on startup.
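On the "which profile is the main one" question, a script can usually work it out from the `profiles.ini` file that sits in the Thunderbird application-data directory (on Windows that is typically under `%APPDATA%\Thunderbird`, but treat that location as an assumption and check the release notes for your version). The sketch below assumes the usual layout of that file: `[Profile0]`, `[Profile1]`, ... sections with `Path`, `IsRelative` and an optional `Default=1`. The function name `default_profile_dir` is illustrative.

```python
import configparser
import os


def default_profile_dir(ini_path):
    """Return the directory of the profile marked Default=1, or None.

    Assumes [ProfileN] sections with Path / IsRelative / Default keys.
    Relative paths are resolved against the folder holding profiles.ini.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(ini_path, encoding="utf-8")
    base = os.path.dirname(os.path.abspath(ini_path))
    for section in parser.sections():
        if not section.startswith("Profile"):
            continue  # skip [General] and any other sections
        if parser.get(section, "Default", fallback="0") != "1":
            continue
        path = parser.get(section, "Path", fallback="")
        if parser.get(section, "IsRelative", fallback="1") == "1":
            return os.path.normpath(os.path.join(base, path))
        return path
    return None
```

If no profile carries `Default=1`, the function returns `None`, and the script would have to ask the user which profile to use.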

    --
    Mozilla Champion
    UFAQ - http://www.UFAQ.org
    Mozilla Champions - http://mozillachampions.mozdev.org
    Mozilla Manual - http://mozmanual.mozdev.org/
    Moz Champion, Dec 7, 2004
    #6
  7. Joh

    Guest

    Thanks for that script, it came in very handy.
    I suggest changing the line

    /^Message-ID: / {

    to

    /^Message-I[Dd]: / {

    because it seems sometimes it's 'Id', not 'ID'.

    Cheers,
    Nick



    , Dec 11, 2004
    #7