Uniquely identifying each & every html template

 
 
alex23
      01-21-2013
On Jan 22, 1:03 am, Ferrous Cranus <(E-Mail Removed)> wrote:
> ALL I am asking for is a way to make this work.


No, ALL you are asking is for us to take an _impossible_ situation and
make it magically work for you, without your having to improve your
understanding of the problem or modifying your requirements in any
way. You don't see *your ignorance* as the problem, preferring instead
to blame others and Python itself for your failings. None of the
solutions proposed satisfy you because they seem like too much work,
and you're convinced that this can just happen.

It can't, and you desperately need to educate yourself on some vital
aspects of _how the web works_ (and Python, and file systems, and *NIX
environments etc etc).

 
alex23
      01-21-2013
On Jan 22, 1:07 am, Ferrous Cranus <(E-Mail Removed)> wrote:
> Perhaps we should look into how the OS handles the file to get an idea of how it's done?


Who is this "we" you speak of? You mean "you", right?

You do that and get back to us when you believe you've found something
that helps.

 
Oscar Benjamin
      01-21-2013
On 21 January 2013 23:01, Tom P <(E-Mail Removed)> wrote:
> On 01/21/2013 01:39 PM, Oscar Benjamin wrote:
>>
>> On 21 January 2013 12:06, Ferrous Cranus <(E-Mail Removed)> wrote:
>>>
>>> On 21 January 2013 11:31:24 UTC+2, Chris Angelico wrote:
>>>>
>>>>
>>>> Seriously, you're asking for something that's beyond the power of
>>>> humans or computers. You want to identify that something's the same
>>>> file, without tracking the change or having any identifiable tag.
>>>>
>>>> That's a fundamentally impossible task.
>>>
>>>
>>> No, it is difficult but not impossible.
>>> It just cannot be done by tagging the file by:
>>>
>>> 1. filename
>>> 2. filepath
>>> 3. hash (math algorithm producing a string based on the file's contents)
>>>
>>> We need another way to identify the file WITHOUT using the above
>>> attributes.

>>
>>
>> This is a very old problem (still unsolved I believe):
>> http://en.wikipedia.org/wiki/Ship_of_Theseus
>>

> That wiki article gives a hint to a possible solution: use a timestamp to
> determine which key is valid when.


In the Ship of Theseus, it is only argued that it is the same ship
because people were aware of the incremental changes that took place
along the way. The same applies here: if you don't track the
incremental changes and the two files have nothing concrete in common,
what does it mean to say that a file is "the same file" as some older
file?

That being said, I've always been impressed with the way that git can
understand when I think that a file is the same as some older file
(though it does sometimes go wrong):

~/tmp$ git init
Initialized empty Git repository in /home/oscar/tmp/.git/
~/tmp$ vim old.py
~/tmp$ cat old.py
#!/usr/bin/env python

print('This is a fairly useless script.')
print("Maybe I'll improve it later...")
~/tmp$ git add old.py
~/tmp$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
# (use "git rm --cached <file>..." to unstage)
#
# new file: old.py
#
~/tmp$ git commit
[master (root-commit) 8e91665] First commit
1 file changed, 4 insertions(+)
create mode 100644 old.py
~/tmp$ ls
old.py
~/tmp$ cat old.py > new.py
~/tmp$ rm old.py
~/tmp$ vim new.py
~/tmp$ cat new.py
#!/usr/bin/env python

print('This is a fairly useless script.')
print("Maybe I'll improve it later...")

print("Although, I've edited it somewhat, it's still useless")
~/tmp$ git status
# On branch master
# Changes not staged for commit:
# (use "git add/rm <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# deleted: old.py
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# new.py
no changes added to commit (use "git add" and/or "git commit -a")
~/tmp$ git add -A .
~/tmp$ git status
# On branch master
# Changes to be committed:
# (use "git reset HEAD <file>..." to unstage)
#
# renamed: old.py -> new.py
#

So it *is* Theseus' ship!


Oscar
 
Chris Angelico
      01-22-2013
On Tue, Jan 22, 2013 at 10:43 AM, Oscar Benjamin
<(E-Mail Removed)> wrote:
> On 21 January 2013 23:01, Tom P <(E-Mail Removed)> wrote:
>> On 01/21/2013 01:39 PM, Oscar Benjamin wrote:
>>> This is a very old problem (still unsolved I believe):
>>> http://en.wikipedia.org/wiki/Ship_of_Theseus
>>>

>> That wiki article gives a hint to a possible solution: use a timestamp to
>> determine which key is valid when.

>
> In the Ship of Theseus, it is only argued that it is the same ship
> because people were aware of the incremental changes that took place
> along the way. The same applies here: if you don't track the
> incremental changes and the two files have nothing concrete in common,
> what does it mean to say that a file is "the same file" as some older
> file?
>
> That being said, I've always been impressed with the way that git can
> understand when I think that a file is the same as some older file
> (though it does sometimes go wrong):


Yeah, git's awesome like that. It looks at file similarity, though,
so if you completely rewrite a file and simultaneously rename/move it,
git will lose track of it. And as you say, sometimes it gets things
wrong - if you merge a large file into a small one, git will report it
as a deletion and rename. (Of course, it doesn't make any difference.
It's just a matter of reporting.) Mercurial, if I understand
correctly, actually _tracks_ moves (and copies), but git just records
a deletion and a creation.
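
To make the "file similarity" idea concrete, here is a rough sketch of pairing deleted files with added files by content similarity. It is not git's actual algorithm: the scoring comes from Python's `difflib`, the function names are invented for illustration, and the 0.5 threshold is only chosen to mirror git's documented default rename threshold of 50%.

```python
from difflib import SequenceMatcher

# Threshold chosen to mirror git's default rename similarity of 50%;
# the scoring itself is difflib's, not git's.
SIMILARITY_THRESHOLD = 0.5


def similarity(old_text: str, new_text: str) -> float:
    """Return a 0..1 line-based similarity ratio between two file contents."""
    return SequenceMatcher(None, old_text.splitlines(), new_text.splitlines()).ratio()


def guess_renames(deleted: dict, added: dict, threshold: float = SIMILARITY_THRESHOLD) -> dict:
    """Pair each deleted path with its most similar added path, if similar enough."""
    renames = {}
    for old_path, old_text in deleted.items():
        best_path, best_score = None, 0.0
        for new_path, new_text in added.items():
            score = similarity(old_text, new_text)
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames[old_path] = best_path
    return renames


# The two scripts from Oscar's session, plus an unrelated rewrite.
old_py = (
    "#!/usr/bin/env python\n\n"
    "print('This is a fairly useless script.')\n"
    "print(\"Maybe I'll improve it later...\")\n"
)
new_py = old_py + "\nprint(\"Although, I've edited it somewhat, it's still useless\")\n"
rewritten = "<html><body>Completely different</body></html>\n"

# A lightly edited copy is paired with its original.
print(guess_renames({"old.py": old_py}, {"new.py": new_py}))
# A complete rewrite under a new name is not recognised.
print(guess_renames({"old.py": old_py}, {"other.html": rewritten}))
```

The second call shows the failure mode described above: once the content is rewritten and the name changes at the same time, nothing is left to compare.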

My family in fact has a literal "grandfather's axe" (except that I
don't think either of my grandfathers actually owned it, but it's my
Dad's old axe) that has had many new handles and a couple of new
heads. Bringing it back to computers, we have on our network two
computers "Stanley" and "Ollie" that have been there ever since we
first set up that network. Back then, it was coax cable, 10base2, no
routers/switches/etc, and the computers were I think early Pentiums.
We installed the database on one of them, and set the other in Dad's
office. Today, we have a modern Ethernet setup with modern hardware
and cat-5 cable; we still have Stanley with the database and Ollie in
the office. The name/identity of the computer is mostly associated
with its roles; but those roles can shift too (there was a time when
Ollie was the internet gateway, but that's no longer the case).
Identity is its own attribute.

The problem isn't that identity can't exist. It's that it can't be
discovered. That takes external knowledge. Dave's analogy is accurate.

ChrisA
 
rusi
      01-22-2013
On Jan 21, 5:55 pm, alex23 <(E-Mail Removed)> wrote:
> On Jan 21, 10:39 pm, Oscar Benjamin <(E-Mail Removed)>
> wrote:
>
> > This is a very old problem (still unsolved I believe): http://en.wikipedia.org/wiki/Ship_of_Theseus
>
> +1 internets for referencing my most favourite thought experiment
> ever


+2 Oscar for giving me this name.

A more apposite (to computers) experience:

I've a computer whose OS I wanted to upgrade without disturbing the
existing setup. Decided to fit a new hard disk with a new OS.
Installed the OS on a new hard disk, fitted the new hard disk into the
old computer and rebooted.

The messages that started coming were: New Hardware detected: monitor,
mouse, network card etc etc. but not new disk!

Strange! The only one thing new is not seen as new but all the old
things are seen as new.


So
Ask a layman what's a computer and he'll point to the box and call it
'CPU'.
Ask a more computer literate person and he'll point to the chip inside
the box and say 'CPU'
Ask the computer itself and it says 'Disk'.

Moral:
Object identity is at best hard -- usually unsolvable
 
rusi
      01-22-2013
On Jan 21, 8:07 pm, Ferrous Cranus <(E-Mail Removed)> wrote:
> On Monday, 21 January 2013 9:20:15 AM UTC+2, Chris Angelico wrote:
> > On Mon, Jan 21, 2013 at 6:08 PM, Ferrous Cranus <(E-Mail Removed)> wrote:
> > > An .html page must retain its database counter value even if it is
> > > (renamed && moved && contents altered)
> >
> > Then you either need to tag them in some external way, or have some
> > kind of tracking operation - for instance, if you require that all
> > renames/moves be done through a script, that script can update its
> > pointer. Otherwise, you need magic, and lots of it.
> >
> > ChrisA
>
> Perhaps we should look into how the OS handles the file to get an idea of how it's done?


Yes…
Perhaps the most useful suggestion for you that I've seen in this
thread is to look at git.
If you do, you will find that:
a. git has to do a great deal more work than you expect to factor
out content-tracking from file-tracking
b. yet it can still get it wrong

Look at
snapshotting file systems http://en.wikipedia.org/wiki/Snapsho...9#File_systems
like WinFS (cancelled) and Btrfs
Slightly more practical may be timevault http://www.dedoimedo.com/computers/timevault.html
 
Chris Angelico
      01-22-2013
On Tue, Jan 22, 2013 at 2:24 PM, rusi <(E-Mail Removed)> wrote:
> I've a computer whose OS I wanted to upgrade without disturbing the
> existing setup. Decided to fit a new hard disk with a new OS.
> Installed the OS on a new hard disk, fitted the new hard disk into the
> old computer and rebooted.
>
> The messages that started coming were: New Hardware detected: monitor,
> mouse, network card etc etc. but not new disk!
>
> Strange! The only one thing new is not seen as new but all the old
> things are seen as new.


That's because you asked the OS to look at the computer, and the OS
was on the disk. So in that sense, you did give it a whole lot of new
hardware but not a new disk. However, Windows Product Activation would
probably have called that a new computer, meaning that Microsoft deems
it to be new. (I've no idea about other non-free systems. Free systems
don't care about new computer vs same computer, of course.)

ChrisA
 
Ferrous Cranus
      01-22-2013
On Monday, 21 January 2013 10:48:11 PM UTC+2, Piet van Oostrum wrote:
> Ferrous Cranus <(E-Mail Removed)> writes:
>
> > This python script acts upon websites other people use and every html
> > template has been written by different methods (notepad++,
> > dreamweaver, joomla).
> >
> > Renames and moves are performed either by shell access or by
> > cPanel access by website owners.
> >
> > That being said I have no control on HOW and WHEN users alter their html pages.
>
> Under these circumstances the only way to solve it is to put an
> identification *inside* the file and make sure it will not be changed.
> It could for example be some invisible piece of HTML, or an attribute to
> some tag. If that can't be done the problem cannot be solved and it
> makes no sense keeping asking the same question over and over again.


The solution you propose is what I already use for my own website.
Since it's my website, I can edit all the .html files I want, embedding a unique number in each and every one of them, as I showed in my initial post.

The problem is that I'm not allowed to do the same with the other websites I host.
And apart from that, even if I were allowed to, an html page could be rewritten and the identifier would be lost.
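
For reference, the embedded-identifier approach discussed here could look roughly like the sketch below. The `page-id` meta name and the helper names are hypothetical, picked only for illustration; the initial post's actual tagging scheme is not shown in this part of the thread. The parser is Python's standard `html.parser`.

```python
from html.parser import HTMLParser
from typing import Optional

# "page-id" is a hypothetical meta name chosen for this sketch.
ID_META_NAME = "page-id"


class PageIdParser(HTMLParser):
    """Remember the content of the first <meta name="page-id"> tag seen."""

    def __init__(self):
        super().__init__()
        self.page_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.page_id is None:
            attributes = dict(attrs)
            if attributes.get("name") == ID_META_NAME:
                self.page_id = attributes.get("content")


def extract_page_id(html_text: str) -> Optional[str]:
    """Return the embedded identifier, or None if the page carries none."""
    parser = PageIdParser()
    parser.feed(html_text)
    return parser.page_id


tagged = (
    '<html><head><meta name="page-id" content="1042">'
    "<title>Home</title></head><body>Hi</body></html>"
)
rewritten = "<html><head><title>Home</title></head><body>Rewritten from scratch</body></html>"

print(extract_page_id(tagged))     # the identifier survives renames and moves
print(extract_page_id(rewritten))  # None: a full rewrite drops the tag
```

The second call illustrates the objection raised above: the identifier lasts only as long as whoever edits the page leaves the tag alone.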
 
Ferrous Cranus
      01-22-2013
On Tuesday, 22 January 2013 6:04:09 AM UTC+2, Tim Roberts wrote:
> Ferrous Cranus <(E-Mail Removed)> wrote:
>
> >Renames and moves are performed either by shell access or by cPanel access by website owners.
> >
> >That being said I have no control on HOW and WHEN users alter their html pages.
>
> Right, and that makes it impossible to solve this problem.
>
> Think about some scenarios. Let's say I have a web site with two pages:
>
> ~/web/page1.html
> ~/web/page2.html
>
> Now let's say I use some editor to make a copy of page1 called page1a.html.
>
> ~/web/page1.html
> ~/web/page1a.html
> ~/web/page2.html
>
> Should page1a.html be considered the same page as page1.html? What if I
> subsequently delete page1.html? What if I don't? How long will you wait
> before deciding they are the same?
> --
> Tim Roberts, (E-Mail Removed)
> Providenza & Boekelheide, Inc.

You are right, it cannot be done.

So I have 2 options:

Identify an .html file either by its "filepath" or by its "hash".

Which method do you advise me to use?
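
A small sketch of what each of those two keys survives may help when choosing. The helper names are made up for illustration, and SHA-256 is just one reasonable choice of hash; the thread does not name one.

```python
import hashlib
import os
import tempfile


def path_key(path: str) -> str:
    """Key a page by where it lives."""
    return os.path.abspath(path)


def content_key(path: str) -> str:
    """Key a page by what it contains (SHA-256 of the bytes)."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


with tempfile.TemporaryDirectory() as web_root:
    original = os.path.join(web_root, "page1.html")
    with open(original, "w") as handle:
        handle.write("<html><body>Hello</body></html>")
    path_before = path_key(original)
    hash_before = content_key(original)

    # Rename only: the content key survives, the path key does not.
    renamed = os.path.join(web_root, "page1a.html")
    os.rename(original, renamed)
    rename_keeps_path_key = path_key(renamed) == path_before
    rename_keeps_content_key = content_key(renamed) == hash_before

    # Edit only: the path key survives, the content key does not.
    path_before_edit = path_key(renamed)
    with open(renamed, "a") as handle:
        handle.write("<!-- edited -->")
    edit_keeps_path_key = path_key(renamed) == path_before_edit
    edit_keeps_content_key = content_key(renamed) == hash_before

print("rename:", rename_keeps_path_key, rename_keeps_content_key)
print("edit:  ", edit_keeps_path_key, edit_keeps_content_key)
```

Neither key survives a page that is renamed and edited between two checks, which is the combination the thread has been circling around.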
 
John Gordon
      01-22-2013
In <(E-Mail Removed)> Ferrous Cranus <(E-Mail Removed)> writes:

> > If that's the case, then I figure you have about 3 choices:
> > 1) use the file path as your key, instead of requiring a number


> No, I cannot, because it would mess things up at a later time when I, for
> example:


> 1. mv name.html othername.html (document's filename altered)
> 2. mv name.html /subfolder/name.html (document's filepath altered)


Will the file always reside on the same device? If so, perhaps you could
use the file inode number as the key.

(That seems fairly brittle though. For example if the disk crashes and is
restored from a backup, the inodes could easily be different.)
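
A minimal sketch of the inode idea, assuming a POSIX filesystem and that the file never leaves its device. `inode_key` is a made-up helper; `os.stat`, `st_dev` and `st_ino` are standard. The last step shows one more way the key can break that is worth checking for: a tool that saves by writing a new file and renaming it over the old one ends up with a different inode.

```python
import os
import tempfile


def inode_key(path: str) -> tuple:
    """Return (device, inode); only meaningful on one machine and one filesystem."""
    info = os.stat(path)
    return (info.st_dev, info.st_ino)


with tempfile.TemporaryDirectory() as web_root:
    original = os.path.join(web_root, "name.html")
    with open(original, "w") as handle:
        handle.write("<html></html>")
    key_before = inode_key(original)

    # Rename and move within the same filesystem, then edit in place.
    os.mkdir(os.path.join(web_root, "subfolder"))
    moved = os.path.join(web_root, "subfolder", "othername.html")
    os.rename(original, moved)
    with open(moved, "a") as handle:
        handle.write("<!-- edited -->")
    survives_move_and_edit = inode_key(moved) == key_before

    # Save-by-replace: write a fresh file, then rename it over the old one.
    replacement = moved + ".tmp"
    with open(replacement, "w") as handle:
        handle.write("<html>new version</html>")
    os.replace(replacement, moved)
    survives_replace = inode_key(moved) == key_before

print("move + in-place edit keeps key:", survives_move_and_edit)
print("save-by-replace keeps key:     ", survives_replace)
```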

--
John Gordon                   A is for Amy, who fell down the stairs
(E-Mail Removed)              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"

 