[ANN] Metadata 1.1
tarball: http://dark.fhtr.org/repos/metadata/metadata-1.1.tar.gz
gem: http://dark.fhtr.org/repos/metadata/metadata-1.1.gem
git: http://dark.fhtr.org/repos/metadata


Changes
-------

* more README documentation
  - all output fields in appendix
  - grouped tested formats
* more extensive testing
* fixed a bug with document text extraction
* took out empty Document.PageSizeNames

* use more fields from extract
  (keywords, language, revision history among others)

* use more dcraw metadata, ignore failed exif for raws
* renamed Image.Frames to Image.FrameCount
* added Image.LayerCount for layered images
* use more fields from exif: colorspace, colormode
* fixed exif output to use numbers instead of strings where
  appropriate (focal length, exposure time, ISO speed, Fnumber)

* optional md5sum and/or sha1sum in the metadata:
    mdh [-m] [-s]
  and
    Metadata.sha1sum|md5sum = true|false


Thanks
------

Konrad Meyer for his patient testing and bug reports.
Darren Kirby for the heads-up on wmainfo's ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)


Description
-----------

This package `Metadata' comes with a library called `metadata' and
a small program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf text and word count) and returns the
metadata as a Hash. All strings in the metadata are converted to UTF-8.

The `mdh'-program can print out file metadata as YAML and package the
metadata with the file.

The metadata hash follows the shared file metadata spec naming, with some
additional fields, see list at the end of this file (Appendix A.)

For details on the MDH file format, see the end of this file (Appendix B.)


Usage
-----

  # print out metadata for myfile.jpg
  mdh myfile.jpg

  # create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg

  # print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh

  # strip out the metadata header from an MDH file and save it to myfile.jpg
  mdh -e myfile.jpg.mdh

  # print out the list of options
  mdh -h

  irb> require 'metadata'
  irb> Metadata.extract('myfile.jpg')
  irb> Metadata.extract_text('myfile.pdf')
  irb> Pathname.new("myfile.jpg").metadata


List of supported formats
-------------------------

Audio:
  Whatever you manage to make mplayer play.
  Plus special handlers for FLAC, m4a, ape, musepack, wavepack and wma.

  Successfully tested with:
    mp3, flac, ogg, wav, ra, m4a, wma

  Should also work:
    wv, mpc, ape

Video:
  Whatever you manage to make mplayer play.

  Successfully tested with:
    wmv, mov, divx, xvid, flv, ogm, mpg, mkv

Images:
  Should handle pretty much anything.
  I.e. anything handled by ExifTool, ImageMagick, Imlib2 or dcraw.

  Successfully tested with:
    Web formats:
      jpeg, png, gif, svg
    Camera raws:
      nef, dng, crw, pef, orf
    Image editor state dumps:
      psd, xcf
    The rest:
      tga, tif, bmp, xpm, ppm

Documents:
  Successfully tested with:
    Web formats:
      html, txt
    Print formats:
      pdf, ps, ps.gz
    OO formats:
      sxi, odp
    MS formats:
      doc, ppt, xls

  - I'm using unoconv to convert OO & MS docs to temp PDFs for the text &
    dimensions extraction, so those bits of data are missing. MSOffice docs
    are missing dimensions for the same reason. Here's a way to get them:
    ( first, get Thumbnailer: http://dark.fhtr.org/repos/thumbnailer/ )
      $ thumbnailer -s 1 -k foo.odp /tmp/foo.jpg
      $ mdh foo.odp
      $ rm foo.odp-temp.pdf /tmp/foo.jpg

Others:
  - BitTorrent .torrent files
  - Archive contents
  - Whatever `extract' outputs and I am handling


Requirements
------------

  * Ruby 1.8

  * Tons of metadata extraction programs and libs.
    This package has many dependencies since there is no single universal
    metadata header format that all files use.
    Blame resource forks, filename extensions, bags of bytes and mimetypes.

    List of gems:
      flacinfo-rb
      wmainfo-rb
      MP4Info
      id3lib-ruby
      apetag

    List of Debian packages:
      dcraw
      libimlib2-ruby
      extract
      libimage-exiftool-perl
      poppler-utils
      mplayer
      html2text
      imagemagick
      unhtml
      pstotext
      antiword
      catdoc
      shared-mime-info

  * You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
      http://cybercom.net/~dcoffin/dcraw/
      http://freedesktop.org/wiki/Software/shared-mime-info

  * Python + chardet library
      http://chardet.feedparser.org/


Install
-------

De-compress archive and enter its top directory.
Then type:

   ($ su)
    # ruby setup.rb

These simple steps install this program under the default
location of Ruby libraries. You can also install files into
your favorite directory by supplying setup.rb some options.
Try "ruby setup.rb --help".


Appendix A: Metadata fields
---------------------------

This list contains the metadata fields output by Metadata and mdh.
The list follows the shared file metadata spec for the most part.
http://wiki.freedesktop.org/wiki/Spe...emetadata-spec

field name                | field type
----------------------------------------------------------------------
Archive.Contents            array of pathnames

Audio.Band                  string
Audio.Composer              string
Audio.Conductor             string
Audio.Copyright             string (copyright message)
Audio.Grouping              string
Audio.Image                 binary string (embedded image data)
Audio.InterpretedBy         string
Audio.Lyricist              string
Audio.Publisher             string
Audio.RemixedBy             string
Audio.Subtitle              string
Audio.Tempo                 integer
Audio.VariableBitrate       boolean
Audio.Writer                string
Audio.Publicationright      string
Audio.File                  string
Audio.EAN/UPC               string
Audio.ISBN                  string
Audio.Catalog               string
Audio.LC                    string
Audio.Media                 string
Audio.Index                 string
Audio.Related               string
Audio.ISRC                  string
Audio.Abstract              string
Audio.Language              string
Audio.Bibliography          string
Audio.Introplay             string
Audio.Dummy                 string
Audio.DebutAlbum            string
Audio.RecordDate            string
Audio.RecordLocation        string
  v-- ORIGINAL FIELDS USED --v
Audio.Title                 string
Audio.Artist                string
Audio.Album                 string
Audio.AlbumArtist           string
Audio.AlbumTrackCount       integer
Audio.TrackNo               integer
Audio.DiscNo                integer
Audio.Performer             string
Audio.Duration              float
Audio.ReleaseDate           datetime
Audio.Comment               string
Audio.Genre                 string
Audio.Codec                 string
Audio.Samplerate            integer
Audio.Bitrate               float
Audio.Channels              integer
Audio.Lyrics                string

Doc.Album                   string
Doc.Artist                  string
Doc.Charset                 string
Doc.Description             string
Doc.Genre                   string
Doc.Language                string
Doc.ModifyDate              date
Doc.PageSizeName            string (A4, A5, letter, ...)
Doc.RevisionHistory         array of strings
Doc.ParagraphCount          integer
Doc.LineCount               integer
Doc.CharacterCount          integer
Doc.LastSavedBy             string
Doc.Keywords                array of strings
Doc.Template                string
  v-- ORIGINAL FIELDS USED --v
Doc.Title                   string
Doc.Subject                 string
Doc.Author                  string
Doc.PageCount               integer
Doc.WordCount               integer
Doc.Created                 datetime

File.Software               string (software used to create the file)
File.MD5Sum                 string (md5sum of file's contents)
File.SHA1Sum                string (sha1sum of file's contents)
  v-- ORIGINAL FIELDS USED --v
File.Format                 string (mime type, inode/directory for dirs)
File.Size                   integer
File.Content                string
File.Modified               string

Image.DateCreated           date
Image.DateTimeCreated       date
Image.DateTimeOriginal      date
Image.DimensionUnit         string (px, mm, pt, ...)
Image.Editor                string
Image.EXIF                  string (exiftool output)
Image.FrameCount            integer
Image.LayerCount            integer
Image.Modified              date
Image.OriginatingProgram    string
Image.ComponentCount        integer
Image.ColorMode             string (e.g. RGB)
Image.ColorSpace            string (e.g. sRGB)
  v-- ORIGINAL FIELDS USED --v
Image.Height                float
Image.Width                 float
Image.Title                 string
Image.Date                  datetime
Image.Creator               string
Image.Description           string
Image.Software              string
Image.CameraMake            string
Image.CameraModel           string
Image.ExposureProgram       string
Image.ExposureTime          float
Image.Fnumber               float
Image.Flash                 boolean
Image.FocalLength           float
Image.ISOSpeed              float
Image.MeteringMode          string
Image.WhiteBalance          string
Image.Copyright             string

Location.Latitude           float
Location.Longitude          float

Video.Album                 string
Video.Artist                string
Video.Bitrate               integer
Video.Codec                 string
Video.Comment               string
Video.Duration              float
Video.Framerate             float (frames per second)
Video.Genre                 string
Video.ReleaseDate           date
Video.Title                 string
Video.TrackNo               integer
Video.Demuxer               string

BitTorrent.Name             string
BitTorrent.Files            array of { 'path' => string,
                                       'length' => integer,
                                       'md5sum' => string }
BitTorrent.Length           integer (size of single-file torrents)
BitTorrent.MD5Sum           string (md5sum for single-file torrents)
BitTorrent.PieceCount       integer
BitTorrent.PieceLength      integer (length of a single piece)
BitTorrent.Comment          string
BitTorrent.Announce         string (announce url)
BitTorrent.AnnounceList     array of arrays of strings
BitTorrent.Nodes            array of [hostname, port] arrays


Appendix B: The MDH file format
-------------------------------

MDH files are built as follows:

bytes | content
---------------
    3 | "MDH" - MDH file format identifier
    1 | "\x01" - MDH file format version number
    4 | Long, network byte order - the size of the metadata struct in bytes
  var | YAML - The MDH metadata struct
  var | The actual file contents

All string fields in the metadata are UTF-8.


License
-------

Ruby's


-- 
Ilmari Heikkinen <ilmari.heikkinen gmail com>
http://fhtr.blogspot.com
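
For anyone who wants to poke at MDH files by hand, the Appendix B layout can be sketched in plain Ruby roughly like this. This is not mdh's own code: the helper names `mdh_pack` and `mdh_unpack` are made up for illustration, the sample hash uses invented values, and the sketch assumes the 4-byte size field counts exactly the bytes of the YAML block that follows it.

```ruby
require 'yaml'

MDH_MAGIC   = "MDH".b    # 3-byte file format identifier
MDH_VERSION = "\x01".b   # 1-byte file format version number

# Hypothetical helper: build an MDH byte string from a metadata Hash
# and the original file contents.
def mdh_pack(metadata, payload)
  yaml = metadata.to_yaml.b
  # "N" packs a 32-bit unsigned integer in network (big-endian) byte order.
  MDH_MAGIC + MDH_VERSION + [yaml.bytesize].pack("N") + yaml + payload.b
end

# Hypothetical helper: split an MDH byte string into [metadata, payload].
def mdh_unpack(data)
  data = data.b
  raise ArgumentError, "truncated MDH header" if data.bytesize < 8
  raise ArgumentError, "not an MDH file" unless data[0, 3] == MDH_MAGIC
  raise ArgumentError, "unknown MDH version" unless data[3, 1] == MDH_VERSION

  size = data[4, 4].unpack1("N")
  raise ArgumentError, "truncated MDH metadata" if data.bytesize < 8 + size

  # The metadata struct is YAML with UTF-8 strings.
  yaml = data[8, size].force_encoding("UTF-8")
  [YAML.load(yaml), data[(8 + size)..-1]]
end

# Invented sample values; field names are taken from Appendix A.
meta = { "File.Format" => "image/jpeg", "Image.Width" => 640.0, "Image.Height" => 480.0 }
blob = mdh_pack(meta, "pretend these are jpeg bytes")
restored, body = mdh_unpack(blob)
```

Keeping the header a fixed 8 bytes means a reader can skip straight to the file contents at offset 8 + size without parsing the YAML at all, which is presumably what makes `mdh -e` cheap.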
Re: [ANN] Metadata 1.1
Quoth Ilmari Heikkinen:
> gem: http://dark.fhtr.org/repos/metadata/metadata-1.1.gem

Is the gem working now? If so, very cool.

Thanks,
-- 
Konrad Meyer <konrad@tylerc.org>
http://konrad.sobertillnoon.com/
Re: [ANN] Metadata 1.1
On 9/25/07, Konrad Meyer <konrad@tylerc.org> wrote:
> Quoth Ilmari Heikkinen:
> > gem: http://dark.fhtr.org/repos/metadata/metadata-1.1.gem
>
> Is the gem working now? If so, very cool.

It's working, but it's not on rubyforge. And I'm sort of queasy on
putting it there, due to the dephell of external programs.

Justification for dephell: those projects live or die based on whether
they handle everything in their specialty area. And I'm too busy for NIH :-/