Encoding-decoding

 
M. Gumblert
Ranch Hand
Posts: 34
Hello, I am in the process of learning Python 3 for the purposes of NLP.
I am trying to work with a .txt file that contains non-ASCII characters. In the exercise I have to demonstrate the differences in the length of documents. My code looks like this:



I understand what the lines do. I checked the solved exercise and it is the same as this, but for some reason it won't compile.
I get the following error message:
Traceback (most recent call last):
 File "C:...my folders...", line 56, in <module>
   print (len(open('hr.txt').read()))
 File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1250.py", line 23, in decode
   return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 88422: character maps to <undefined>

Can someone help me out?
 
Tim Holloway
Saloon Keeper
Posts: 20641
Byte 0x90 is not a printable character. In the cp1250 code page named in the traceback it has no character assigned at all, and in ISO 8859-1 it is a control character.

Some control characters, such as NL, CR and TAB, I would expect to decode properly even though they are not printable, since they are print control characters. But 0x90 is at best a generic device control with no standard meaning, and cp1250 leaves it undefined, so the code converter rejected it.
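
To see the difference for yourself, here is a minimal sketch that uses only Python's built-in codecs and no file at all. The variable names are just for illustration. It decodes the three print control bytes and the byte 0x90 with cp1250, and also looks at what ISO 8859-1 (latin-1) makes of 0x90.

```python
import unicodedata

# Tab, newline and carriage return are defined in cp1250 and decode fine.
controls = b"\t\n\r".decode("cp1250")

# Byte 0x90 has no entry in the cp1250 table, so decoding raises.
try:
    b"\x90".decode("cp1250")
    rejected = False
    message = ""
except UnicodeDecodeError as exc:
    rejected = True
    message = str(exc)

# latin-1 maps every byte, so 0x90 decodes there, to a control character.
latin1_char = b"\x90".decode("latin-1")
category = unicodedata.category(latin1_char)  # "Cc" means control

print(repr(controls), rejected, message, category)
```

The message printed for the rejected byte should look like the last line of the traceback above, only with a different position.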
 
M. Gumblert
Ranch Hand
Posts: 34
  • Mark post as helpful
  • send pies
  • Quote
  • Report post to moderator
Okay, I am not sure what you are saying. Does this mean that the code is right? There shouldn't be any problems with the .txt file either. What can I do?
 
Tim Holloway
Saloon Keeper
Posts: 20641
That's the way I read it. Code is OK, data isn't. It happens to me a lot.
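
One thing worth ruling out before blaming the file: the traceback shows that open() picked cp1250, the Windows default on that machine, because no encoding was given. If the file was actually saved in another encoding, the same bytes may decode cleanly once you name it. Below is a minimal sketch under the assumption that the file is UTF-8, which I cannot confirm from here. The sample text and the temporary file are made up and only stand in for hr.txt; the letter Đ is used because its UTF-8 form contains the byte 0x90.

```python
import os
import tempfile

# Hypothetical sample text; "Đ" is stored as the bytes C4 90 in UTF-8.
sample = "Đakovo"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # stand-in for hr.txt
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading the UTF-8 bytes as cp1250 trips over 0x90, as in the traceback.
    try:
        with open(path, encoding="cp1250") as f:
            f.read()
        cp1250_failed = False
    except UnicodeDecodeError:
        cp1250_failed = True

    # Naming the encoding the file was really written in works.
    with open(path, encoding="utf-8") as f:
        char_count = len(f.read())

    # Binary mode never decodes, so it counts bytes rather than characters.
    with open(path, "rb") as f:
        byte_count = len(f.read())

print(cp1250_failed, char_count, byte_count)
```

For the original line, that would mean trying print(len(open('hr.txt', encoding='utf-8').read())). If that also raises an error, the file is in some other encoding or really is damaged.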
 