Regular Expressions

 
 
Markos Charatzas
Guest
Posts: n/a
 
      02-05-2004
Hi all,

I'm trying to parse the following expression, but I'm having difficulty
understanding the whole "parse a String" theory.

The string starts like this
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

and then continues 'n' times

either in the same form as before
XXX XXX 00-00-00000000000:0000:00XXXX X000 X

or
00-00-00000000000:0000:00XXXX X000 X


Where 0 is a digit and X is a character.

I have this idea of checking the first 10 bytes of each string to see
whether or not they represent a character.
If yes then I link the current 'XXX XXX ' with the remaining string,
If not then I link the last 'XXX XXX ' with the remaining string.


but I'm having trouble implementing it.

Thanks in advance for your responses.
 
 
 
 
 
nos
Guest
Posts: n/a
 
      02-05-2004

Perhaps I am incorrect, but are not Strings comprised
of characters?
Can you provide a concrete example?


 
 
 
 
 
Chris Smith
Guest
Posts: n/a
 
      02-06-2004
Markos Charatzas wrote:
> I'm trying to parse the following expression, but I'm having difficulty
> understanding the whole "parse a String" theory.


Okay. Part of your difficulty may come from a misunderstanding of the
nature of character strings in Java. Let's clear that one up first:

> I have this idea of checking the first 10 bytes of each string to see
> whether or not they represent a character.


This makes absolutely no sense. I don't know what you mean by "byte"
and "character", but here is the general take on those:

1. A "character" is a single component of a string. There are many
possible characters; in Java, a char is a 16-bit value, so it can be any
of 65,536 (64K) different Unicode values. These include letters and digits
and punctuation from a variety of worldwide languages, plus some control
characters, math symbols, and a lot of other stuff.

2. A "byte" is an eight-bit binary value.

3. A "string" is a sequence of characters. Strings have no particular
connection to bytes, though, and it makes no sense at all to talk about
the first ten bytes of a string. Strings simply don't contain bytes;
they contain characters.

4. Characters and bytes are related by something called a character
encoding. There are many different character encodings (easily hundreds
of them), and a very common mistake is to assume the one you're familiar
with -- often Windows CP1252 or ISO 8859-1 -- is the *only* possible
encoding. Strings don't have an encoding, but whenever you write them
to a binary form (such as a file or network stream), you are writing
them using some specific encoding.
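
A minimal sketch of points 3 and 4, assuming Java 7 or later for `StandardCharsets` (the class name `EncodingDemo` and its helper methods are made up for illustration). It shows that the same four-character string turns into a different number of bytes depending on which encoding is used to write it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Number of bytes produced when s is encoded with the given charset. */
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Hypothetical convenience wrappers, one per encoding.
    static int utf8Length(String s)   { return byteLength(s, StandardCharsets.UTF_8); }
    static int latin1Length(String s) { return byteLength(s, StandardCharsets.ISO_8859_1); }
    static int utf16Length(String s)  { return byteLength(s, StandardCharsets.UTF_16BE); }

    public static void main(String[] args) {
        String s = "Ab1\u00e9"; // three ASCII characters plus e-acute
        System.out.println("chars:      " + s.length());
        System.out.println("UTF-8:      " + utf8Length(s) + " bytes");
        System.out.println("ISO-8859-1: " + latin1Length(s) + " bytes");
        System.out.println("UTF-16BE:   " + utf16Length(s) + " bytes");
    }
}
```

The string itself has a length in chars only; a byte count exists only once an encoding has been chosen.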

Now, on to your problem:

> The string starts like this
> XXX XXX 00-00-00000000000:0000:00XXXX X000 X
>
> and then continues 'n' times
>
> either in the same form as before
> XXX XXX 00-00-00000000000:0000:00XXXX X000 X
>
> or
> 00-00-00000000000:0000:00XXXX X000 X
>
>
> Where 0 is a digit and X is a character.
>
> I have this idea of checking the first 10 bytes of each string to see
> whether or not they represent a character.
> If yes then I link the current 'XXX XXX ' with the remaining string,
> If not then I link the last 'XXX XXX ' with the remaining string.
>
>
> but I'm having trouble implementing it.


Have you got anything at all to show us? Since the title of your post
is "Regular Expressions", should I assume that you want to use regular
expressions to implement this? What do you mean by "X [is] a
character"? That it's a letter (and if so, in what language -- English
only, or is it okay if it's a letter in the current locale, whatever
that may be)? Or could it be a digit or punctuation mark or even a
control character?

One thing I'll say is that this looks a lot more like a lexing problem
than a true parsing problem. Regular expressions are, therefore, an
appropriate tool for solving it.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
 
Markos Charatzas
Guest
Posts: n/a
 
      02-06-2004
Yep,

Sorry about the confusion!

I was in a bit over my head when I wrote it, having spent more than 2 hours
trying to figure it out.

When I mentioned 'X as character' and '0 as digit', I really meant X being
[a-zA-Z] and 0 being [0-9].

Also, by saying '10 bytes of a String' I meant the first 5 characters,
since 1 char is 2 bytes in Java.

I do have regular expressions in mind, because I believe they are the
solution to my problem.

I thought about it again, and I'm wondering whether it makes sense to
look for the complete 'XXX XXX ' expression and match it to the
trailing characters until another 'XXX XXX ' comes along.

Thanks for your time reading this.
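
The idea of carrying the last 'XXX XXX ' forward could be sketched roughly as below. This is only one possible reading of the format: it assumes the layout exactly as it is displayed in this thread (single spaces, X = [a-zA-Z], 0 = [0-9]), and the class name `PrefixedRecords`, the method `link`, and the sample data are all made up. Note that `find()` silently skips any text that does not match, so this sketch does not validate the input as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixedRecords {
    // Group 1: optional "XXX XXX" prefix. Group 2: one record.
    // Widths are an assumption taken from the sample shown in the thread.
    private static final Pattern ENTRY = Pattern.compile(
        "(?:([a-zA-Z]{3} [a-zA-Z]{3}) )?"
        + "(\\d{2}-\\d{2}-\\d{11}:\\d{4}:\\d{2}[a-zA-Z]{4} [a-zA-Z]\\d{3} [a-zA-Z])");

    /** Pairs every record with its own prefix or, if it has none, with the last prefix seen. */
    static List<String> link(String input) {
        List<String> result = new ArrayList<>();
        String lastPrefix = null;
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                lastPrefix = m.group(1);   // a new prefix replaces the remembered one
            }
            if (lastPrefix == null) {
                throw new IllegalArgumentException("first record has no prefix");
            }
            result.add(lastPrefix + " " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String rec1 = "12-34-12345678901:1234:56ABCD E123 F";
        String rec2 = "98-76-10987654321:4321:65WXYZ Q987 R";
        String rec3 = "11-22-33333333333:4444:55LMNO P678 S";
        String input = "ABC DEF " + rec1 + rec2 + "GHI JKL " + rec3;
        for (String linked : link(input)) {
            System.out.println(linked);
        }
    }
}
```

Here the second record has no prefix of its own, so it is linked with "ABC DEF"; the third one brings a new prefix, "GHI JKL".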




 
 
Markos Charatzas
Guest
Posts: n/a
 
      02-06-2004
OK, I managed to find this regex, which does the trick:

[A-Z\s]{10}(\d{1}.{37}){1,}

Thanks all of you for trying to help!
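
For anyone wanting to try the posted regex from Java code, here is a hedged sketch (the class name `ReportedRegex` and the test data are made up). Inside a Java string literal each backslash has to be doubled. The pattern is used exactly as posted; `\d{1}` is equivalent to `\d` and `{1,}` to `+`. It checks only the widths: 10 uppercase-or-whitespace characters, then one or more 38-character blocks that start with a digit. As displayed in this thread the sample prefix 'XXX XXX ' is only 8 characters wide and the record 36, so the input below is synthetic and padded to fit, on the unverified assumption that runs of spaces were collapsed somewhere along the way.

```java
import java.util.regex.Pattern;

class ReportedRegex {
    // The regex as posted, with backslashes doubled for the Java string literal.
    static final Pattern POSTED = Pattern.compile("[A-Z\\s]{10}(\\d{1}.{37}){1,}");

    /** True if the whole string (not just a part of it) matches the posted regex. */
    static boolean matchesWhole(String s) {
        return POSTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Synthetic data: 10-character header and 38-character record (padding is an assumption).
        String header = "ABC" + "   " + "DEF" + " ";
        String record = "12-34-12345678901:1234:56ABCD" + "  " + "E123" + "  " + "F";

        System.out.println(matchesWhole(header + record));
        System.out.println(matchesWhole(header + record + record));
        // The 8-character prefix as displayed in the thread does not fit the {10}.
        System.out.println(matchesWhole("ABC DEF " + "12-34-12345678901:1234:56ABCD E123 F"));
    }
}
```

Because `.{37}` accepts any characters, this regex does not verify the digit and letter positions inside a record.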



 
 
Dale King
Guest
Posts: n/a
 
      02-06-2004
"Chris Smith" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed)4.net...
> 1. A "character" is a single component of a string. There are many
> possible characters; in Java, a char is a 16-bit value, so it can be any
> of 65,536 (64K) different Unicode values. These include letters and digits
> and punctuation from a variety of worldwide languages, plus some control
> characters, math symbols, and a lot of other stuff.



And in JDK1.5 it has gotten slightly more complex, since it now supports
Unicode 4.0 and surrogates.
--
Dale King


 
 
skeptic
Guest
Posts: n/a
 
      02-07-2004
"Dale King" <kingd[at]tmicha[dot]net> wrote in message news:<(E-Mail Removed)>...
> > 1. A "character" is a single component of a string. There are many
> > possible characters; in Java, a char is a 16-bit value, so it can be any
> > of 65,536 (64K) different Unicode values. These include letters and digits
> > and punctuation from a variety of worldwide languages, plus some control
> > characters, math symbols, and a lot of other stuff.

>
>
> And in JDK1.5 it has gotten slightly more complex, since it now supports
> Unicode 4.0 and surrogates.


Hello Dale!

Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
If not, how do they implement the charAt(i)?

Regards
 
 
Thomas Schodt
Guest
Posts: n/a
 
      02-08-2004
skeptic wrote:

> Just curious, has the 'char' type been widened (e.g. to 4 bytes)?


http://makeashorterlink.com/?P37821657

> If not, how do they implement the charAt(i)?


Try it.
 
 
Dale King
Guest
Posts: n/a
 
      02-09-2004
"skeptic" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) om...
> > And in JDK1.5 it has gotten slightly more complex, since it now supports
> > Unicode 4.0 and surrogates.

>
> Just curious, has the 'char' type been widened (e.g. to 4 bytes)?
> If not, how do they implement the charAt(i)?



No, it is still 16 bits. Basically, String and char arrays are now
treated as UTF-16 rather than UCS-2. Handling characters outside the BMP
requires surrogate pairs. The API now distinguishes between code points
(the Unicode value) and code units (a Java char, which is either a symbol
from the BMP or a surrogate).

The best way to see what changed is to view the docs for Character (which
Thomas provided a link to) and also for String, and search for "1.5" to see
the methods and values added in 1.5.
--
Dale King
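
The code point versus code unit distinction can be tried out with a small sketch like the one below, assuming Java 5 or later (the class name `CodePointDemo` and the helper `codePointNumber` are made up). `charAt(i)` still returns the i-th 16-bit code unit, which may be half of a surrogate pair, while `codePointAt` and `offsetByCodePoints` work in terms of whole code points:

```java
class CodePointDemo {
    // 'a', then U+1D538 (outside the BMP, so stored as a surrogate pair), then 'b'.
    static final String SAMPLE = "a\uD835\uDD38b";

    /** Returns the index-th code point (not code unit) of s, counting from 0. */
    static int codePointNumber(String s, int index) {
        int unitOffset = s.offsetByCodePoints(0, index); // steps over surrogate pairs
        return s.codePointAt(unitOffset);
    }

    public static void main(String[] args) {
        String s = SAMPLE;
        System.out.println(s.length());                             // number of 16-bit code units
        System.out.println(s.codePointCount(0, s.length()));        // number of code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // charAt yields one code unit
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // the whole code point
        System.out.println((char) codePointNumber(s, 2));           // third code point
    }
}
```

So indexing by code unit stays a plain array access, and only indexing by code point needs a scan.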


 
 
skeptic
Guest
Posts: n/a
 
      02-10-2004
Hi Dale!
I'm familiar with the basics of Unicode. Let me emphasize the point of
the question.
If the data inside a String is kept as a UTF-16-encoded char array, then
getting the i-th character is not as simple as return _data[i], and is
therefore slow.
Using an int[] would solve that, but at the cost of more memory.
What was their choice?

Regards
 
 
 
 