Velocity Reviews > Newsgroups > Programming > Python > Discussion on some Code Issues

Discussion on some Code Issues

subhabangalore@gmail.com
      07-04-2012
Dear Group,

I am Sri Subhabrata Banerjee, writing from Gurgaon, India, to discuss some coding issues. If anyone in this learned room can shed some light, I would be grateful.

I have to code a bunch of documents which are combined together.
Like,

1)A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-airand forced the pilot to make an emergency landing.
2) The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.
3) A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

The task is to separate the documents on the fly and to parse each of the documents with a definite set of rules.

Now, the way I am processing is:
I am clubbing all the documents together, as,

A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing. The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection. A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

But they are separated by a tag set, like,
A Mumbai-bound aircraft with 99 passengers on board was struck by lightning on Tuesday evening that led to complete communication failure in mid-air and forced the pilot to make an emergency landing.$
The discovery of a new sub-atomic particle that is key to understanding how the universe is built has an intrinsic Indian connection.$
A bomb explosion outside a shopping mall here on Tuesday left no one injured, but Nigerian authorities put security agencies on high alert fearing more such attacks in the city.

To detect the document boundaries, I am splitting them into a bag of words and using a simple for loop as,

for i in range(len(bag_words)):
    if bag_words[i]=="$":
        print (bag_words[i],i)

There is no issue here; I am segmenting it nicely. I am using an annotated corpus, so I am applying parse rules.

The confusion comes next,

As per my problem statement, the size of the file (of the documents combined together) won't increase on the fly. So, just to support all kinds of combinations, I am appending the "i" values to a list, taking its length, and using slices. It works perfectly. The question is: is there a smarter way to achieve this? And a curious question: if the documents arrive on the fly, with no preprocessed tag set like "$", how may I do it? From a bunch without an EOF, isn't it a classification problem?

There is no question about the parsing; it seems I am achieving it independent of the length of the document.

Can anyone in the group suggest how I am dealing with the problem, which portions should be improved, and how?

Thanking You in Advance,

Best Regards,
Subhabrata Banerjee.
 
Steven D'Aprano
      07-05-2012
On Wed, 04 Jul 2012 16:21:46 -0700, subhabangalore wrote:

[...]
> I got to code a bunch of documents which are combined together.

[...]
> The task is to separate the documents on the fly and to parse each of
> the documents with a definite set of rules.
>
> Now, the way I am processing is:
> I am clubbing all the documents together, as,

[...]
> But they are separated by a tag set

[...]
> To detect the document boundaries,


Let me see if I understand your problem.

You have a bunch of documents. You stick them all together into one
enormous lump. And then you try to detect the boundaries between one file
and the next within the enormous lump.

Why not just process each file separately? A simple for loop over the
list of files, before consolidating them into one giant file, will avoid
all the difficulty of trying to detect boundaries within files.

Instead of:

merge(output_filename, list_of_files)
for word in parse(output_filename):
    if boundary_detected: do_something()
    process(word)

Do this instead:

for filename in list_of_files:
    do_something()
    for word in parse(filename):
        process(word)
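A minimal runnable sketch of this per-file approach, assuming each document sits in its own file; the sample texts and the trivial `parse` helper below are illustrative stand-ins, not code from the thread:

```python
# Process each document separately -- no boundary markers needed.
# The parse() helper and the sample texts are hypothetical stand-ins.

def parse(text):
    # Trivial tokenizer stand-in: split on whitespace.
    return text.split()

def process_documents(texts):
    # texts maps a document name to its contents (in real use, you
    # would read each filename from a list of files instead).
    results = {}
    for name, text in texts.items():
        results[name] = parse(text)
    return results

docs = {
    "doc1.txt": "A Mumbai-bound aircraft was struck by lightning.",
    "doc2.txt": "The discovery of a new sub-atomic particle.",
}
print(process_documents(docs)["doc2.txt"])
```

Because each document is handled on its own, no separator ever needs to be detected afterwards.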


> I am splitting them into a bag of
> words and using a simple for loop as,
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)



What happens if a file already has a $ in it?


> There is no issue. I am segmenting it nicely. I am using annotated
> corpus so applying parse rules.
>
> The confusion comes next,
>
> As per my problem statement the size of the file (of documents combined
> together) won’t increase on the fly. So, just to support all kinds of
> combinations I am appending in a list the “I” values, taking its length,
> and using slice. Works perfect.


I don't understand this. What sort of combinations do you think you need
to support? What are "I" values, and why are they important?



--
Steven
 
Rick Johnson
      07-05-2012
On Jul 4, 6:21 pm, (E-Mail Removed) wrote:
> [...]
> To detect the document boundaries, I am splitting them into a bag
> of words and using a simple for loop as,
>
> for i in range(len(bag_words)):
>     if bag_words[i]=="$":
>         print (bag_words[i],i)


Ignoring that you are attacking the problem incorrectly: that is a very
poor method of splitting a string, especially since the Python gods have
given you *power* over string objects. But you are going to have an even
greater problem if the string contains a "$" char that you DID NOT
insert :-O. You'd be wise to use a separator that is not likely to be in
the file data, for example "<SEP>" or "<SPLIT-HERE>". But even that
approach is naive! Why not streamline the entire process and pass a list
of file paths to a custom parser object instead?
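To illustrate the point concretely, a distinctive separator lets str.split recover the documents in one call; "<SEP>" is the token suggested above, and the sample texts are abbreviated from the thread:

```python
# Splitting combined documents on a distinctive separator.
# "<SEP>" is unlikely to occur in ordinary prose, unlike "$".
combined = (
    "A Mumbai-bound aircraft was struck by lightning.<SEP>"
    "The discovery of a new sub-atomic particle.<SEP>"
    "A bomb explosion outside a shopping mall left no one injured."
)

documents = combined.split("<SEP>")
for i, doc in enumerate(documents, 1):
    print(i, doc)
```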

 
subhabangalore@gmail.com
      07-05-2012
On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> [full quote of original post snipped]



Hi Steven, it is nice to see your post. Your posts are nice and I have learnt so many things from you. "I" is the index of the loop.
Now my clarification: I thought to do "import os" and process the files in a loop, but that is not my problem statement. I have to make a big lump of text and detect one chunk. I am not looping over the line numbers of the file, because then I may not be able to take the slices, which I need. I thought to give re.findall a try, but that is not giving me the slices; the slice spreads here. The power issue of strings I would definitely give a try. Happy day ahead.
Regards, Subhabrata Banerjee.
 
Peter Otten
      07-05-2012
(E-Mail Removed) wrote:

> On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
>> [original post snipped]

>
>
> Hi Steven, It is nice to see your post. They are nice and I learnt so many
> things from you. "I" is for index of the loop. Now my clarification I
> thought to do "import os" and process files in a loop but that is not my
> problem statement. I have to make a big lump of text and detect one chunk.
> Looping over the line number of file I am not using because I may not be
> able to take the slices-this I need. I thought to give re.findall a try
> but that is not giving me the slices. Slice spreads here. The power issue
> of string! I would definitely give it a try. Happy Day Ahead Regards,
> Subhabrata Banerjee.


Then use re.finditer():

start = 0
for match in re.finditer(r"\$", data):
    end = match.start()
    print(start, end)
    print(data[start:end])
    start = match.end()

This will omit the last text. The simplest fix is to put another "$"
separator at the end of your data.
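The snippet above, made self-contained for illustration: the sample data is invented, and a trailing "$" is appended as suggested so the last document is not dropped:

```python
import re

# Illustrative sample: three documents joined with "$" markers.
data = "first document.$second document.$third document."
data += "$"  # append a final separator so the last text is not omitted

documents = []
start = 0
for match in re.finditer(r"\$", data):
    end = match.start()
    documents.append(data[start:end])  # slice between separators
    start = match.end()

print(documents)
```

Each match gives both the start and end offset of the separator, so the slices come for free, unlike with re.findall.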

 
subhabangalore@gmail.com
      07-05-2012
Dear Peter,
That is a nice one. I am thinking whether I can write "for lines in f" sort of code, which is easy, but then how do I find out the slices? By the way, do you know whether I may convert an index position in the file to the list position, provided I am writing the list from the same file we are reading?

Best Regards,
Subhabrata.

On Thursday, July 5, 2012 1:00:12 PM UTC+5:30, Peter Otten wrote:
> [quoted text snipped]
>
> Then use re.finditer():
>
> start = 0
> for match in re.finditer(r"\$", data):
>     end = match.start()
>     print(start, end)
>     print(data[start:end])
>     start = match.end()
>
> This will omit the last text. The simplest fix is to put another "$"
> separator at the end of your data.


 
Peter Otten
      07-06-2012
(E-Mail Removed) wrote:

[Please don't top-post]

>> start = 0
>> for match in re.finditer(r"\$", data):
>>     end = match.start()
>>     print(start, end)
>>     print(data[start:end])
>>     start = match.end()


> That is a nice one. I am thinking if I can write "for lines in f" sort of
> code that is easy but then how to find out the slices then,


You have to keep track both of the offset of the line and the offset within
the line:

def offsets(lines, pos=0):
    for line in lines:
        yield pos, line
        pos += len(line)

start = 0
for line_start, line in offsets(lines):
    for pos, part in offsets(re.split(r"(\$)", line), line_start):
        if part == "$":
            print(start, pos)
            start = pos + 1

(untested code, I'm assuming that the file ends with a $)
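For illustration, the (untested) sketch above can be run over an in-memory list of lines; the sample lines below are invented, and the data ends with a "$" as assumed:

```python
import re

def offsets(lines, pos=0):
    # Yield (offset, item) pairs, where offset is the running
    # character position of each item within the whole text.
    for line in lines:
        yield pos, line
        pos += len(line)

# Two documents spread over three lines, terminated by "$".
lines = ["first doc part one ", "part two.$second ", "doc.$"]

spans = []
start = 0
for line_start, line in offsets(lines):
    # re.split with a capturing group keeps the "$" tokens.
    for pos, part in offsets(re.split(r"(\$)", line), line_start):
        if part == "$":
            spans.append((start, pos))  # (start, end) of one document
            start = pos + 1

print(spans)
```

The resulting (start, end) pairs are file offsets, so slicing the full text with them recovers each document even when a document crosses line boundaries.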

> btw do you
> know in any case may I convert the index position of file to the list
> position provided I am writing the list for the same file we are reading.


Use a lookup list with the end positions of the texts and then find the
relevant text with bisect:

>>> import bisect
>>> ends = [10, 20, 50]
>>> filepos = 15
>>> bisect.bisect(ends, filepos)
1

(position 15 belongs to the second text)


 
subhabangalore@gmail.com
      07-07-2012
On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> [full quote of original post snipped]


Thanks, Peter, but I feel your earlier one was better. I got an interesting one:
[i for i in range(len(f1)) if f1.startswith('$', i)]
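For comparison, the same separator positions can be collected with enumerate; the sample string here is illustrative:

```python
# Index of every "$" separator, via enumerate rather than
# range/startswith. f1 is a made-up sample string.
f1 = "first.$second.$third."

positions = [i for i, ch in enumerate(f1) if ch == "$"]
print(positions)
```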

But I am bit intrigued with another question,

suppose I say:
file_open=open("/python32/doc1.txt","r")
file=a1.read().lower()
for line in file:
line_word=line.split()

This works fine. But if I print it would be printed continuously.
I would like to store it in some variable, so that I may print lines of my choice and manipulate them at my choice.
Is there any way out of this problem?


Regards,
Subhabrata Banerjee
 
Dennis Lee Bieber
      07-07-2012
On Sat, 7 Jul 2012 12:54:16 -0700 (PDT), (E-Mail Removed)
declaimed the following in gmane.comp.python.general:

> But I am bit intrigued with another question,
>
> suppose I say:
> file_open=open("/python32/doc1.txt","r")
> file=a1.read().lower()
> for line in file:
> line_word=line.split()
>
> This works fine. But if I print it would be printed continuously.


"This works fine" -- Really?

1) Why are you storing data files in the install directory of your
Python interpreter?

2) "a1" is undefined -- you should get an exception on that line which
makes the following irrelevant; replacing "a1" with "file_open" leads
to...

3) "file" is a) a predefined function in Python, which you have just
shadowed and b) a poor name for a string containing the contents of a
file

4) "for line in file", since "file" is a string, will iterate over EACH
CHARACTER, meaning (since there is nothing to split) that "line_word" is
also just a single character.

for line in file.split("\n"):

will split the STRING into logical lines (assuming a new-line character
splits the lines) and permit the subsequent split to pull out wordS
("line_word" is misleading, as it will contain a LIST of words from the
line).
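Putting those corrections together, a minimal sketch; io.StringIO stands in for the opened file so the example is self-contained, and the sample contents are invented:

```python
import io

# Stand-in for open("doc1.txt") -- the sample contents are invented.
file_open = io.StringIO("First Line Here\nSecond Line\n")

text = file_open.read().lower()   # don't shadow the name "file"
lines = text.split("\n")          # split the STRING into logical lines
words_per_line = [line.split() for line in lines if line]

print(words_per_line)
```

Storing the per-line word lists in a variable like this is exactly what allows printing or manipulating any line of one's choice later.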

> I would like to store it in some variable, so that I may print lines of my choice and manipulate them at my choice.
> Is there any way out of this problem?
>
>
> Regards,
> Subhabrata Banerjee

--
Wulfraed Dennis Lee Bieber AF6VN
(E-Mail Removed) HTTP://wlfraed.home.netcom.com/

 