Hi Peter,
Thanks very much for your reply. It has been hard to get any feedback,
and your answer has been very useful. I'll try to explain myself a
little better; I'm sorry if my grammar is confusing.
>> "User-defined"? Is there a standard for corpus linguistics? Like TEI?
You are right, most linguists annotate corpora with subsets of TEI,
but I mean that the output labels can be defined freely by the user:
the output is not locked to a specific XML vocabulary such as DocBook
or TEI, and the mapping can be defined in a very flexible way.
>> Some of them have money, some don't. But IMHE they are well used to
>> using Open Source software, and there is plenty available.
Yes, I've found plenty of tools to translate tags, almost all of them
doing one-to-one mapping and just a few mapping from "styles" to
structural labels; but all of them were far from the main feature: the
capacity to be *trained*, just from sample cases, to tag specific
blocks, going from typographical/lexical clues to structural labels.
> function-oriented XML documents). May be some area on publishing, but
> I think that they will not be interested in "small" desktop
> applications.
>> Who is "they"?
Sorry, I meant the "publishing guys". What I've heard is that
publishing companies that work with XML use big applications (full
document-handling systems) and that these companies won't be attracted
to, and won't trust, a small desktop application.
>> Lots of us work in or close to this field. There certainly is a
>> demand for this, but it's very small, especially in small businesses.
OK, I've been thinking, and I accept that small businesses see no
value in their old documents; it makes sense. Maybe I could focus on
businesses related to text/documentation handling. I've learned that
there are some small companies doing business by offering services to
big publishing companies; specifically, one of these services seems to
be the migration of texts (such as books or dictionaries) to XML. Do
you know anything about them?
>> currently faster and cheaper to send the whole corpus to a company in
>> the Indian subcontinent or on the Pacific Rim and have it rekeyed or
>> scanned into XML there. In general, companies are not interested in
I've heard something about it, but I understood that this was used as
a replacement for the OCR phase, starting from paper documents.
>> If there was any interest in preserving them, they wouldn't have
>> used WordPerfect, Lotus, or Word formats (or whatever) to store
>> them in in the first place, would they?
Not in a rational world, but I suspect that there are/were too many
people trusting in Word.
>> library projects; and some publishing-oriented preservation projects
>> are more likely to have a demand for this software -- but they don't
>> have large sums of money to spend on it, and it is arguable that if
You are right, that was my fear, although the software is not of the
"large sums of money" kind.
>> You seem to be confused about your objective: you say "...in
>> small business" but in the preceding paragraph you say that
>> "...they will not be interested in 'small' desktop applications."
Not so much in objective as in grammar; I hope that the previous
paragraph clarifies the referent of the anaphora.
I know that I'm building a powerful but "small desktop application",
and I know that I'm focusing on "small business" due to the size of my
company, which is a uISV (one or two people). But, yes, I'm very
confused about which businesses are actually tagging text documents
with XML tags.
>> Those are three very unlikely candidates as there is already
>> software to handle them in many ways.
Are you sure? Software that handles "typical user-produced documents"?
Without styles, and even with spaces used as tabulation and breaks at
the end of each line ...
>> Legacy obsolete binary wordprocessing and DTP formats are the hardest
>> to deal with, especially when they reside on obsolete media.
Yes, but you are pointing to another business, that of "document
format converters", and that is a different thing, isn't it?
>> I just posted about this the other day: see the thread "looking for a
>> mentor" in c.t.x (Message-ID <40jon3F1b4uk...@individual.net> et seq.)
I've read it, but I was not lucky enough to find the program. Maybe
this business niche expired some years ago and I'm too late? Everybody
has already translated their valuable documents to XML...
>> The IR people have been trying to do this for decades.
>> I may be biased in favour of markup, but I really don't see any progress
Are you talking about Information Retrieval? I'm also biased in favour
of markup, but most documents in the world are not marked up; this was
the reason I thought (wrongly?) that it would be nice to develop an
automatic structural tagger...
>> Very, very hard to do in the first pass, because the sequence
>> and structure may simply not match. Much easier if you use an
>> interim markup structure, made for the job, and do a final
>> conversion to the target vocabulary afterwards.
Interesting, I'll think about it.
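If I understand the idea correctly, a minimal sketch in Python could
look like the one below. The interim labels ("heading", "item",
"para"), the hand-written clue rules and the DOCBOOK_LIKE mapping are
only assumptions for illustration; this is not how my application or
any existing tool works, and a trained tagger would replace the fixed
rules in the first pass.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern for list markers such as "1.", "2)", "-" or "*".
ITEM_MARK = re.compile(r"^(\d+[.)]|[-*])\s+")


def first_pass(lines):
    """Give every non-empty line an interim label from typographic clues."""
    blocks = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines only separate blocks
        if ITEM_MARK.match(text):
            label = "item"
            text = ITEM_MARK.sub("", text)  # drop the list marker
        elif text.isupper():
            label = "heading"  # assumption: all-caps lines are headings
        else:
            label = "para"
        blocks.append((label, text))
    return blocks


def second_pass(blocks, mapping):
    """Convert interim labels to the target vocabulary chosen by the user."""
    return "\n".join(
        f"<{mapping[label]}>{escape(text)}</{mapping[label]}>"
        for label, text in blocks
    )


# Hypothetical user-defined mapping; another vocabulary only needs another dict.
DOCBOOK_LIKE = {"heading": "title", "item": "listitem", "para": "para"}

sample = ["INTRODUCTION", "", "1. first point", "Plain text & more"]
print(second_pass(first_pass(sample), DOCBOOK_LIKE))
```

The point of the split is that the first pass never needs to know the
target vocabulary, so changing the output from one XML vocabulary to
another only touches the mapping used in the second pass.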
> - A lot of cleaning and normalization small tasks: removing headers,
>> Yes, very useful, and something that a lot of conversion software
>> is very bad at.
I agree.
>> Let us know if you find any businesses who are interested.
>> With the obvious caveats already mentioned, legacy documents
>> simply are not interesting for businesses.
Your message was as useful as it was sad... Now I'm thinking about the
two ways left to recycle the application:
1) The business of XML tagging services for publishing houses: I don't
know what tools they are using, but surely not trainable, automatic
ones that can quickly tag hundreds/thousands of pages.
2) The business of HTML-to-XML mapping: a lot of content has been
published in HTML during the last 10 years, and surely there are
people trying to recover those articles and that information...
Well, as you can see I'm a bit worried. If you could elaborate a
little more on my reply and spend a bit more of your time, I would
again be very grateful to you. By the way, Happy Christmas!
Francesc