Velocity Reviews

Muhammad Adeel 08-06-2010 09:07 AM

Byte Offsets of Tokens, Ngrams and Sentences?
 
Hi,

Does anyone know how to tokenize a string in Python so that it returns
the tokens together with their byte offsets? I also need a sentence
splitter that returns the sentences with their byte offsets and,
finally, n-grams returned with byte offsets.

Input:
This is a string.

Output:
This 0
is 5
a 8
string. 10


thanks

Gabriel Genellina 08-06-2010 09:49 AM

Re: Byte Offsets of Tokens, Ngrams and Sentences?
 
On Fri, 06 Aug 2010 06:07:32 -0300, Muhammad Adeel <nawabadeel@gmail.com>
wrote:

> Does any one know how to tokenize a string in python that returns the
> byte offsets and tokens? Moreover, the sentence splitter that returns
> the sentences and byte offsets? Finally n-grams returned with byte
> offsets.
>
> Input:
> This is a string.
>
> Output:
> This 0
> is 5
> a 8
> string. 10


Like this?

py> import re
py> s = "This is a string."
py> for g in re.finditer(r"\S+", s):  # each match is a run of non-whitespace
...     print(g.group(), g.start())   # token and its starting offset
...
This 0
is 5
a 8
string. 10
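
Note that g.start() is a character offset. For plain ASCII input that is the same as the byte offset, but the two diverge as soon as the text contains multi-byte characters. A minimal sketch of one way to get true byte offsets, assuming the text is encoded as UTF-8 (the helper name tokens_with_byte_offsets is made up for this example): encode first, then match on the bytes.

```python
import re

def tokens_with_byte_offsets(text):
    """Yield (token, byte_offset) pairs for whitespace-separated tokens.

    Assumes UTF-8. Matching on the encoded bytes makes m.start() a byte
    offset rather than a character offset.
    """
    data = text.encode("utf-8")
    # In a bytes pattern \S only treats ASCII whitespace as a separator,
    # so multi-byte UTF-8 sequences stay inside their token.
    for m in re.finditer(rb"\S+", data):
        yield m.group().decode("utf-8"), m.start()

for token, offset in tokens_with_byte_offsets("Thís is a string."):
    print(token, offset)
```

Here "í" takes two bytes in UTF-8, so the second token starts at byte 6 rather than 5. One caveat of this sketch: non-ASCII whitespace such as a no-break space is not treated as a separator.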

--
Gabriel Genellina


Muhammad Adeel 08-06-2010 10:06 AM

Re: Byte Offsets of Tokens, Ngrams and Sentences?
 
Hi,

Thanks. Can you please tell me how to do this for n-grams and sentences
as well?
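
One possible way to extend the same re.finditer approach to n-grams and sentences is sketched below. The function names (tokens, ngrams, sentences) are made up for this example, and the sentence pattern is a deliberately naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace or the end of the text, so abbreviations such as "Dr." will be split incorrectly. The offsets are character offsets; for byte offsets on non-ASCII text, the same idea could be applied to the encoded bytes instead.

```python
import re

def tokens(text):
    """Return (token, start_offset) pairs for whitespace-separated tokens."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

def ngrams(text, n):
    """Return (ngram, start_offset) pairs.

    Each n-gram is sliced from the original text, from the start of its
    first token to the end of its last token, so spacing is preserved.
    """
    toks = tokens(text)
    result = []
    for i in range(len(toks) - n + 1):
        start = toks[i][1]
        last_tok, last_start = toks[i + n - 1]
        end = last_start + len(last_tok)
        result.append((text[start:end], start))
    return result

def sentences(text):
    """Return (sentence, start_offset) pairs using a naive end-of-sentence rule."""
    pattern = r"\S.*?(?:[.!?]+(?=\s|$)|$)"
    return [(m.group().rstrip(), m.start())
            for m in re.finditer(pattern, text, flags=re.DOTALL)]

print(ngrams("This is a string.", 2))
print(sentences("This is a string. And another one! Last"))
```

For real text, a dedicated sentence tokenizer would likely be more robust than this regular expression.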

