Velocity Reviews


TYR 11-23-2007 06:46 PM

Extracting data from dump file
 
I have a large dump file that originated in a MySQL db; I need to get
it into an SQLite file.

Various options are suggested around the web; none of them seem to
work (most failing to import the thing in the first place). So I
removed the assorted taggery from each end, leaving just a big text
file taking the following format:

('value', 'value', 'value', 'value'),
('value','value','value','value')...

I planned to find some way of splitting the thing at the commas
outside the bracketed groups, thus giving me a list of tuples; then I
could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then
iterate through the list INSERTing INTO.

Then my problems began; I tried using the Python csv module, replacing
the string ),( with \t and then using \t as the delimiter. First
problem; there's a size limit coded into the module. No problem, use
csv.field_size_limit() to alter it. Problem; it doesn't actually parse
at all, just sends the whole thing as a string and the SQL INSERT
fails with a "not enough args" error.

Tried using string.split() and re.split(data, r'\t'); first gave the
same error, second failed with a "too many named groups" error. Tried
using ; as a delimiter and going back to csv; this fails to match
the ; for some reason. Any ideas?

John Machin 11-23-2007 09:28 PM

Re: Extracting data from dump file
 
On Nov 24, 5:46 am, TYR <a.harrow...@gmail.com> wrote:
> I have a large dump file that originated in a MySQL db; I need to get
> it into an SQLite file.
>
> Various options are suggested around the web; none of them seem to
> work (most failing to import the thing in the first place). So I
> removed the assorted taggery from each end, leaving just a big text
> file taking the following format:
>
> ('value', 'value', 'value', 'value'),
> ('value','value','value','value')...


Consider the possibility that individual data rows are separated by "),
\n(" or maybe "), \n(" or something else, not "),(".

What does
print repr(open("my_dump.txt", "rb").read(1000))
tell you? Are there any "\n" in there at all? BTW, what platform are
you running on?

>
> I planned to find some way of splitting the thing at the commas
> outside the bracketed groups, thus giving me a list of tuples; then I
> could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then


CREATE TABLE foo VALUES '1', .... are you sure?


> iterate through the list INSERTing INTO.
>
> Then my problems began; I tried using the Python csv module, replacing
> the string ),( with \t and then using \t as the delimiter. First
> problem; there's a size limit coded into the module. No problem, use
> csv.field_size_limit() to alter it.


>>> import csv
>>> csv.field_size_limit()
131072
>>>


And do you really expect that any one of your data rows will take more
than 128KB in the dump file?? How large did you have to set the limit
to make it "work"???

> Problem; it doesn't actually parse
> at all,


Sounds like it did parse, with the result being one csv row containing
one very long field.

> just sends the whole thing as a string and the SQL INSERT
> fails with a "not enough args" error.
>
> Tried using string.split() and re.split(data, r'\t'); first gave the
> same error, second failed with a "too many named groups" error.


All of these are indications that you don't have the correct
delimiter.

> Tried
> using ; as a delimiter and going back to csv; this fails to match
> the ; for some reason. Any ideas?


Examine your dump carefully. Are there any newlines other than at the
logical end of data rows (like e.g. it's folded at column 78)? If so,
you'll have to replace those by spaces or delete them. Work out
exactly what is separating data rows. Let's call it "sep". If you can
use file_contents.replace(sep, "\t"), then you can use
file_contents.split(sep) directly; there's no need for a two-step
approach. If sep is variable, then use re.split.

Try this on a small subset of your dump file. Before you go anywhere
near the database, check the results of your split: did you get the
expected number of rows [currently you are getting 1 row], does each
row look OK? Unsubtle hint: the first row may start with "junk(" and
the last may end with ")junk" if you haven't already allowed for that.
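
In code, that split-and-check step might look something like this (the
"),\n(" separator and the wrapper junk here are assumptions for
illustration -- substitute whatever repr() actually shows you):

```python
import re

# Hypothetical dump fragment: rows separated by "),\n(" with stray
# wrapper text left at each end.
data = "junk(1,'a','b'),\n(2,'c','d'),\n(3,'e','f')junk"

# Split on "),(" allowing optional whitespace/newlines in between.
rows = re.split(r"\),\s*\(", data)

# Allow for the leading "junk(" on the first row and ")junk" on the last.
rows[0] = rows[0].split("(", 1)[1]
rows[-1] = rows[-1].rsplit(")", 1)[0]

print(len(rows))  # expect 3 here, not 1
for row in rows:
    print(row)
```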

Then try inserting into the database. Print a row number and row
contents before each insert (or wrap try/except around each insert) so
that you know which row any error relates to.
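
A sketch of that insert loop using the sqlite3 module (the table name,
column layout and the naive comma split are all made up for
illustration -- real values with embedded commas need a smarter split):

```python
import sqlite3

rows = ["1,'a','b'", "2,'c','d'", "3,'e'"]  # last row deliberately short

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (id INTEGER, x TEXT, y TEXT)")

for rownum, row in enumerate(rows, 1):
    # Naive split for illustration only.
    fields = [f.strip().strip("'") for f in row.split(",")]
    try:
        con.execute("INSERT INTO foo VALUES (?, ?, ?)", fields)
    except sqlite3.ProgrammingError as e:
        print("row %d failed: %r (%s)" % (rownum, row, e))
con.commit()
```

The try/except tells you exactly which row is malformed instead of the
whole run dying.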

If you still have a problem, ask again, but do show exactly what your
dump file looks like and what your code is -- my crystal ball needs a
long recharge after all of the above!

HTH,
John

Dennis Lee Bieber 11-23-2007 09:42 PM

Re: Extracting data from dump file
 
On Fri, 23 Nov 2007 10:46:12 -0800 (PST), TYR <a.harrowell@gmail.com>
declaimed the following in comp.lang.python:

> I have a large dump file that originated in a MySQL db; I need to get
> it into an SQLite file.
>
> Various options are suggested around the web; none of them seem to
> work (most failing to import the thing in the first place). So I
> removed the assorted taggery from each end, leaving just a big text
> file taking the following format:
>

I'd think the simplest would be to tweak the contents of the dump by
removing the MySQL specific terms, or replacing some with SQLite syntax,
then batching the whole file into the command line utility.
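
The same idea works from within Python via sqlite3's executescript, once
the MySQL-only lines are gone (the one-line cleanup below is a deliberate
oversimplification -- a real dump also needs the comments, /*!...*/
directives and ENGINE= clauses dealt with):

```python
import sqlite3

# Hypothetical, already-simplified fragment of a dump.
dump = """
CREATE TABLE `artlinks` (`ID` INTEGER PRIMARY KEY, `name` TEXT);
INSERT INTO `artlinks` (`ID`, `name`) VALUES (1, 'Epilogue'), (2, 'Elfwood');
"""

# Crude MySQL -> SQLite tweak: swap backtick quoting for double quotes.
script = dump.replace("`", '"')

con = sqlite3.connect(":memory:")
con.executescript(script)
print(con.execute("SELECT COUNT(*) FROM artlinks").fetchone()[0])  # 2
```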

I presume your file is similar to:

-=-=-=-=-=-=-=-
-- MySQL Administrator dump 1.4
--
-- ------------------------------------------------------
-- Server version 5.0.24a-community-max-nt-log


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS,
FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO'
*/;


--
-- Create schema bestiaria
--

CREATE DATABASE /*!32312 IF NOT EXISTS*/ bestiaria;
USE bestiaria;

--
-- Table structure for table `bestiaria`.`artlinks`
--

DROP TABLE IF EXISTS `artlinks`;
CREATE TABLE `artlinks` (
`ID` int(11) NOT NULL auto_increment,
`category` enum('archive','artist','zine','reference','news') NOT NULL
default 'archive',
`name` varchar(75) NOT NULL default '',
`URL` varchar(75) NOT NULL default '',
`banner` varchar(75) default NULL,
`width` int(11) default NULL,
`height` int(11) default NULL,
`description` varchar(250) default NULL,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=49 DEFAULT CHARSET=latin1;

--
-- Dumping data for table `bestiaria`.`artlinks`
--

/*!40000 ALTER TABLE `artlinks` DISABLE KEYS */;
INSERT INTO `artlinks`
(`ID`,`category`,`name`,`URL`,`banner`,`width`,`height`,`description`)
VALUES

(1,'archive','Epilogue','http://www.epilogue.net/',NULL,NULL,NULL,NULL),

(2,'archive','FurNation','http://www.furnation.com/','images/furnation.gif',404,44,'&quot;The
Artists\' Image Archive&quot;'),

(3,'archive','Elfwood','http://www.elfwood.com/',NULL,NULL,NULL,'Reorganized,
subsuming the former <i>Lothlorien</i> and <i>Zone47</i> subsites'),
(5,'archive','The Orlando Furry
Archives','http://fanarchive.net/',NULL,NULL,NULL,NULL),
(6,'archive','Side 7','http://www.side7.com/',NULL,NULL,NULL,'&quot;the
Artist\'s Archive&quot;'),

(7,'archive','VCL','http://us.vclart.net/vcl/Artists/',NULL,NULL,NULL,'Furry
Image Archives'),
(37,'archive','Deviant
Art','http://www.deviantart.com/',NULL,NULL,NULL,'An online artist
collective and marketplace. Various genres.'),
(8,'archive','YERF','http://www.yerf.com/',NULL,NULL,NULL,'Formerly
known as <i>The Squeeky Clean Furry Archives</I>'),
(9,'zine','Herd
Mentality','http://www.furry.org.au/aatheus/herdmentality/',NULL,NULL,NULL,'&quot;The
Centaur Fanzine!&quot; (although the definition of &quot;centaur&quot;
is rather loose)'),

(10,'zine','Huzzah','http://www.huzzah.com/',NULL,NULL,NULL,'&quot;Anthropomorphics
for Artists&quot;');
INSERT INTO `artlinks`
(`ID`,`category`,`name`,`URL`,`banner`,`width`,`height`,`description`)
VALUES
....
-=-=-=-=-=-=-=-

> ('value', 'value', 'value', 'value'),
> ('value','value','value','value')...
>
> I planned to find some way of splitting the thing at the commas
> outside the bracketed groups, thus giving me a list of tuples; then I
> could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then


That sure doesn't look like any "create table" syntax I've ever
seen...

> iterate through the list INSERTing INTO.
>
> Then my problems began; I tried using the Python csv module, replacing
> the string ),( with \t and then using \t as the delimiter. First


Don't know about your file, but mine does NOT have
"""
....),(...
"""
instead I have
"""
....),
(...
"""

or in one line, encoded: "...),\n(..."

I wouldn't expect the CSV module to have much luck with most of the
dump -- one would have to remove all the comments, CREATE stuff, and the
INSERT lines, leaving just the "(...)" rows. THEN remove any "(" that
starts a line, and any ")," that ends a line, while avoiding any
embedded... Then, maybe, "unescape" any MySQL escapes in text columns
before using the SQLite parameter substitution to escape them for /it/.
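
That unescape step might be sketched like this (the escape list is
deliberately incomplete -- MySQL also emits \n, \t, \0 and others):

```python
import sqlite3

def unescape_mysql(s):
    # Undo the common MySQL string escapes; incomplete on purpose.
    return (s.replace("\\'", "'")
             .replace('\\"', '"')
             .replace("\\\\", "\\"))

raw = "The Artists\\' Image Archive"   # as it appears in the dump
value = unescape_mysql(raw)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (descr TEXT)")
# Parameter substitution re-escapes the value for SQLite.
con.execute("INSERT INTO t VALUES (?)", (value,))
print(con.execute("SELECT descr FROM t").fetchone()[0])
# The Artists' Image Archive
```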

> problem; there's a size limit coded into the module. No problem, use
> csv.field_size_limit() to alter it. Problem; it doesn't actually parse
> at all, just sends the whole thing as a string and the SQL INSERT
> fails with a "not enough args" error.
>
> Tried using string.split() and re.split(data, r'\t'); first gave the
> same error, second failed with a "too many named groups" error. Tried
> using ; as a delimiter and going back to csv; this fails to match
> the ; for some reason. Any ideas?


Write an SQL parser of sorts (I have no experience with the python
parsing stuff, so would brute force one...)... Something like:

-=-=-=-=-=-=-=-
"""
Brute Force template for MySQL Dump Parser to SQLite
"""

import sys

fout = sys.stdout #for info only at this point

DUMP_FILE = "test.sql"

def parse_insert(fin, ln):
    insert = ln.replace("`", "'")
    for ln in fin:  # another nesting
        ln = ln.strip()
        fout.write("%s\n\t%s;\n\n" % (insert, ln[:-1]))
        if ln.endswith(";"):
            break


def parse_table(fin):
    uniq = []
    fields = []
    pkey = ""
    for ln in fin:  # yes, nested read functionality
        ln = ln.strip()
        if ln.startswith(")") and ln.endswith(";"):
            break
        if ln.startswith("PRIMARY KEY"):
            pkey = ("\tprimary key %s"
                    % ln[ln.index("("):ln.index(")") + 1].replace("`", "'"))
            continue
        if ln.startswith("UNIQUE"):
            uniq.append("\tUNIQUE %s"
                        % ln[ln.index("("):ln.index(")") + 1].replace("`", "'"))
            continue
        flds = ln.split()
        flds[-1] = flds[-1].rstrip(",")
        field_name = flds[0].replace("`", "'")
        ftype = flds[1].lower()
        if ("int" in ftype
                or "bit" in ftype
                or "bool" in ftype):
            field_type = "integer"
        elif ("dec" in ftype
                or "float" in ftype
                or "double" in ftype
                or "real" in ftype):
            field_type = "real"
        elif ("text" in ftype
                or "char" in ftype
                or "enum" in ftype
                or "set" in ftype):
            field_type = "text"
        elif "blob" in ftype:
            field_type = "blob"
        elif ("date" in ftype
                or "time" in ftype
                or "year" in ftype):
            field_type = "text"
        else:
            fout.write("Ignoring MySQL field type: %s\n" % ftype)
            field_type = " "
        clause = ""
        if "auto_increment" in ln:
            clause += " primary key autoincrement "
        if "UNIQUE" in ln:
            clause += " unique "
        if "NOT NULL" in ln:
            clause += " not null "
        if "default" in ln:
            clause += ln[ln.index("default"):]
        if clause.endswith(","):
            clause = clause[:-1]
        fields.append("\t%s %s %s" % (field_name, field_type, clause))
    if pkey:
        fields.append(pkey)
    if uniq:
        fields.extend(uniq)
    fout.write("%s\n" % ",\n".join(fields))
    fout.write(");\n")


def parse_create(fin, ln):
    if ln.startswith("CREATE DATABASE "):
        while not ln.endswith(";"):
            ln = fin.readline()
            ln = ln.strip()
        dbname = ln.split()[-1][:-1]
        fout.write("SQLite: con = .connect('%s.db')\n" % dbname)
    elif ln.startswith("CREATE TABLE "):
        table_name = ln.split()[2].replace("`", "'")
        fout.write("SQLite: create table %s\n(\n" % table_name)
        parse_table(fin)
    else:
        fout.write("Ignoring MySQL statement: %s\n" % ln)


def parse_file(fin):
    for ln in fin:
        ln = ln.strip()
        if (not ln
                or ln.startswith("--")
                or ln.startswith("/*")):
            continue  # ignore comments and blank lines
        elif ln.startswith("CREATE "):
            parse_create(fin, ln)
        elif ln.startswith("INSERT "):
            parse_insert(fin, ln)
        else:
            fout.write("Ignoring MySQL statement: %s\n" % ln)


if __name__ == "__main__":
    fin = open(DUMP_FILE, "r")
    parse_file(fin)
    fin.close()
    # fout.close()
-=-=-=-=-=-=-=-

Which produces something like the following when fed the start of my
dump file... I just created diagnostic output, but the output could
either be formatted to be an SQLite command line batch file, or
converted into DB-API compatible calls to SQLite from within the program
(as implied by that connect line)

I did not attempt any tests for escaped fields in the inserts... That
section of code should really be expanded to strip the ( ... ), do a CSV
logic split, each field then unescaped, reescaped as needed, and
rejoined...
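
That CSV-logic split could lean on the csv module itself, pointed at one
already-stripped row (the quotechar/escapechar settings are guesses at
what mysqldump emits):

```python
import csv
import io

# One data row with the surrounding "(" and ")," already stripped off.
row = ("2,'archive','FurNation','images/furnation.gif',404,44,"
       "'The Artists\\' Image Archive'")

reader = csv.reader(io.StringIO(row), delimiter=",",
                    quotechar="'", escapechar="\\")
fields = next(reader)
print(fields)
```

Note that every field comes back as a string -- the integer columns
would still need converting before the INSERT.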


-=-=-=-=-=-=-=-
SQLite: con = .connect('bestiaria.db')
Ignoring MySQL statement: USE bestiaria;
Ignoring MySQL statement: DROP TABLE IF EXISTS `artlinks`;
SQLite: create table 'artlinks'
(
'ID' integer primary key autoincrement not null ,
'category' text not null default 'archive',
'name' text not null default '',
'URL' text not null default '',
'banner' text default NULL,
'width' integer default NULL,
'height' integer default NULL,
'description' text default NULL,
primary key ('ID')
);
INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES

(1,'archive','Epilogue','http://www.epilogue.net/',NULL,NULL,NULL,NULL);

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES

(2,'archive','FurNation','http://www.furnation.com/','images/furnation.gif',404,44,'&quot;The
Artists\' Image Archive&quot;');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES

(3,'archive','Elfwood','http://www.elfwood.com/',NULL,NULL,NULL,'Reorganized,
subsuming the former <i>Lothlorien</i> and <i>Zone47</i> subsites');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(5,'archive','The Orlando Furry
Archives','http://fanarchive.net/',NULL,NULL,NULL,NULL);

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(6,'archive','Side
7','http://www.side7.com/',NULL,NULL,NULL,'&quot;the Artist\'s
Archive&quot;');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES

(7,'archive','VCL','http://us.vclart.net/vcl/Artists/',NULL,NULL,NULL,'Furry
Image Archives');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(37,'archive','Deviant
Art','http://www.deviantart.com/',NULL,NULL,NULL,'An online artist
collective and marketplace. Various genres.');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(8,'archive','YERF','http://www.yerf.com/',NULL,NULL,NULL,'Formerly
known as <i>The Squeeky Clean Furry Archives</I>');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(9,'zine','Herd
Mentality','http://www.furry.org.au/aatheus/herdmentality/',NULL,NULL,NULL,'&quot;The
Centaur Fanzine!&quot; (although the definition of &quot;centaur&quot;
is rather loose)');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES

(10,'zine','Huzzah','http://www.huzzah.com/',NULL,NULL,NULL,'&quot;Anthropomorphics
for Artists&quot;');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(11,'zine','Sofawolf
Press','http://www.sofawolf.com/','images/swplogo.gif',158,62,'Sofawolf
Press showcases quality furry fiction in book and magazine format.');

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(12,'artist','Theresa Larsson: Aspects of
Art','http://www.theresalarsson.se/','images/aspects_of_art.jpg',468,60,NULL);

INSERT INTO 'artlinks'
('ID','category','name','URL','banner','width','height','description')
VALUES
(13,'artist','Megan
Giles','http://www.furnation.com/Megan_giles/','images/mgiles.jpg',468,80,NULL);
--
Wulfraed Dennis Lee Bieber KD6MOG
wlfraed@ix.netcom.com wulfraed@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-asst@bestiaria.com)
HTTP://www.bestiaria.com/

